Each group in a Dgraph cluster uses Raft to elect leaders.
This section explains the Raft consensus algorithm in simple terms. The idea is
to give you just enough to understand the basic concepts, without going into
explanations of why it works correctly. For a detailed explanation of Raft,
please read the original thesis by Diego Ongaro.
Term
Each election cycle is considered a term, during which there is a single leader
(just like in a democracy). When a new election starts, the term number is
increased. This is straightforward and obvious, but it is a critical factor in
the correctness of the algorithm. In rare cases, if no leader can be elected
within an ElectionTimeout, that term can end without a leader.
Server States
Each server in the cluster can be in one of the following three states:
- Leader
- Follower
- Candidate
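The three states can be modeled as a simple enumeration. This is a minimal sketch for illustration, not Dgraph's actual implementation:

```go
package main

import "fmt"

// State models the three Raft server states.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

func (s State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

func main() {
	// Every server starts out as a follower.
	s := Follower
	fmt.Println(s) // Follower
}
```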
Communication
RPC communication is unidirectional, from the leader to all/any followers; the
followers never ping the leader. The leader sends AppendEntries messages to the
followers with logs containing state updates. When the leader sends
AppendEntries with zero logs (updates), that's considered a Heartbeat. The
leader sends Heartbeats to all followers at regular intervals.
If a follower doesn't receive a Heartbeat for an ElectionTimeout duration
(generally between 150 ms and 300 ms), it assumes the leader may be down and
converts its state to candidate (as mentioned in Server States). It then
requests votes by sending a RequestVote call to the other servers. If it gets
votes from a majority, the candidate becomes the leader. On becoming leader, it
sends Heartbeats to all other servers to establish its authority.
Every communication request contains a term number. If a server receives a
request with a stale term number, it rejects the request.
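The stale-term check can be sketched as follows. In full Raft, a server that sees a newer term also adopts it and steps down to follower; this minimal illustration captures just the accept/reject decision:

```go
package main

import "fmt"

// respond processes an incoming RPC's term against the server's own term.
// It returns whether to accept the request, and the (possibly advanced)
// current term.
func respond(currentTerm, reqTerm uint64) (accept bool, newTerm uint64) {
	switch {
	case reqTerm < currentTerm:
		// Stale term: reject the request outright.
		return false, currentTerm
	case reqTerm > currentTerm:
		// Newer term: adopt it (a leader or candidate would also
		// step down to follower here).
		return true, reqTerm
	default:
		return true, currentTerm
	}
}

func main() {
	ok, term := respond(5, 4)
	fmt.Println(ok, term) // false 5
	ok, term = respond(5, 7)
	fmt.Println(ok, term) // true 7
}
```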
Log Entries
Dgraph uses LSM trees, so we call commits or updates "Log Entries." Log Entries
are numbered sequentially and contain a term number. An entry is considered
committed if it has been replicated (and stored) by a majority of the servers.
On being notified of the result of a client request (which is often processed
on other servers), the leader does four things to coordinate Raft consensus
(this is also called Log Replication):
- Appends and persists the entry to its log.
- Issues AppendEntries in parallel to the other servers.
- Monitors for a majority to report that the entry is replicated, after which
  it considers the entry committed and applies it to the leader's state machine.
- Notifies followers that the entry is committed, so that they can apply it to
  their state machines.
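The commit rule in the steps above boils down to a majority count. A minimal sketch (plain counters in place of real replication):

```go
package main

import "fmt"

// Entry is a single Raft log entry.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// committed reports whether an entry replicated on `acks` servers
// (including the leader itself) is committed in a cluster of n servers.
func committed(acks, n int) bool {
	return acks > n/2
}

func main() {
	// The leader appends and persists locally first, then replicates
	// in parallel via AppendEntries.
	e := Entry{Term: 2, Index: 7, Data: []byte("set x=1")}
	acks := 1 // the leader's own persisted copy
	for i := 0; i < 2; i++ {
		acks++ // a follower reported the entry as replicated
	}
	fmt.Printf("entry %d committed in 5-server cluster: %v\n",
		e.Index, committed(acks, 5))
}
```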
Voting
Each server persists its current term and vote, so it doesn't end up voting
twice in the same term. On receiving a RequestVote RPC, the server denies its
vote if its own log is more up-to-date than the candidate's. It also denies a
vote if a minimum ElectionTimeout hasn't passed since the last Heartbeat from
the leader. Otherwise, it grants the vote and resets its ElectionTimeout timer.
The up-to-date property of logs is determined as follows:
- Term number comparison
- Index number (log length) comparison
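This two-step comparison can be written out directly. A sketch of the rule a voter applies to a candidate's last log entry:

```go
package main

import "fmt"

// atLeastAsUpToDate reports whether log A (with last entry term/index
// lastTermA/lastIndexA) is at least as up-to-date as log B, per Raft's
// two-step comparison: the higher last term wins; on a tie, the longer
// log (higher last index) wins.
func atLeastAsUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB uint64) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB
	}
	return lastIndexA >= lastIndexB
}

func main() {
	// A voter denies its vote if its own log is more up-to-date
	// than the candidate's.
	fmt.Println(atLeastAsUpToDate(3, 10, 2, 99)) // true: higher term wins
	fmt.Println(atLeastAsUpToDate(3, 5, 3, 8))   // false: same term, shorter log
}
```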
To understand the above sections better, you can see this interactive
visualization.
Cluster membership
Raft only allows single-server changes, that is, only one server can be added
or removed at a time. This is achieved through cluster configuration changes.
Cluster configurations are communicated using special entries in AppendEntries.
The significant difference between how cluster configuration changes are
applied and how typical Log Entries are applied is that followers don't wait
for a commit confirmation from the leader before applying a configuration
change.
A server can respond to both AppendEntries and RequestVote without checking the
current configuration. This mechanism allows new servers to participate without
officially being part of the cluster; without it, configuration changes could
not make progress.
When a new server joins, it won't have any logs, and they need to be streamed
to it. To ensure cluster availability, Raft allows this server to join the
cluster as a non-voting member. Once it's caught up, voting can be enabled.
This also allows the cluster to remove the server if it's too slow to catch up,
before granting it voting rights (sort of like getting a green card to allow
assimilation before citizenship is awarded, providing voting rights).
If you want to add a few servers and remove a few servers, do the addition
before the removal. To bootstrap a cluster, start with one server to allow it
to become the leader, and then add servers to the cluster one-by-one.
Snapshots
The Raft log grows indefinitely, so it needs to be compacted; one way to do
this is snapshotting. As soon as the state machine is synced to disk, the logs
can be discarded. By default, snapshots are taken after 10000 Raft entries,
with a frequency of 30 minutes; the frequency indicates the time between two
subsequent snapshots. These numbers can be adjusted using the --raft
superflag's snapshot-after-entries and snapshot-after-duration options,
respectively. A snapshot is created only when the conditions set by both of
these options have been met.
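As a sketch, the defaults above would correspond to flag settings like the following (the exact superflag syntax may vary between Dgraph versions, so treat this as illustrative):

```shell
# Take a snapshot once both thresholds are met: at least 10000 new
# Raft entries AND 30 minutes elapsed since the last snapshot.
dgraph alpha --raft "snapshot-after-entries=10000; snapshot-after-duration=30m"
```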
Clients
Clients must locate the cluster to interact with it, and various approaches can
be used for discovery. A client can randomly pick any server in the cluster; if
that server isn't the leader, the request should be rejected and the leader
information passed along, so the client can re-route its query to the leader.
Alternatively, the server can proxy the client's request to the leader. When a
client first starts up, it can register itself with the cluster using a
RegisterClient RPC. This creates a new client id, which is used for all
subsequent RPCs.
Linearizable Semantics
Servers must filter out duplicate requests. They can do this via session
tracking, where they use the client id and another request UID set by the
client to avoid reprocessing duplicate requests. Raft also suggests storing
responses along with the request UIDs, to reply back in case a duplicate
request is received. Linearizability requires the results of a read to reflect
the latest committed write; serializability, on the other hand, allows stale
reads.
Read-only queries
To ensure linearizability of read-only queries run via the leader, the leader
must take these steps:
- The leader must have at least one committed entry in its term. This allows
  for up-to-dateness. (C'mon! Now that you're in power, do something at least!)
- The leader stores its latest commit index (the readIndex).
- The leader sends Heartbeats to the cluster and waits for ACKs from the
  majority. Now it knows that it's still the leader. (No successful coup. Yup,
  still the democratically elected dictator I was before!)
- The leader waits for its state machine to advance to the readIndex.
- The leader can now run the queries against its state machine and reply to
  clients.
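The steps above can be sketched as follows, with plain counters standing in for the Heartbeat round and the state machine; this is an illustration of the idea, not Dgraph's implementation:

```go
package main

import "fmt"

// leader holds the two indexes that matter for a linearizable read.
type leader struct {
	commitIndex uint64 // latest committed log index
	applied     uint64 // how far the state machine has advanced
}

// linearizableRead returns the readIndex a query must wait for, after
// the leader confirms its authority via a round of Heartbeats.
func (l *leader) linearizableRead(heartbeatAcks, clusterSize int) (uint64, error) {
	readIndex := l.commitIndex // store the latest commit index
	// Wait for ACKs from a majority: no successful coup.
	if heartbeatAcks <= clusterSize/2 {
		return 0, fmt.Errorf("lost leadership: only %d/%d acks",
			heartbeatAcks, clusterSize)
	}
	// The caller must now wait until applied >= readIndex before
	// serving the query from the state machine.
	return readIndex, nil
}

func main() {
	l := &leader{commitIndex: 42, applied: 40}
	idx, err := l.linearizableRead(3, 5) // a majority ACKed the Heartbeat
	fmt.Println(idx, err)                // 42 <nil>
	fmt.Println(l.applied >= idx)        // false: must wait before reading
}
```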