What etcd, a distributed KVS built on the Raft consensus algorithm, chooses in the CAP/PACELC theorem

2024-03-11 kubernetes algorithm

etcd is a distributed KVS that is also used in Kubernetes.

Build a home Kubernetes cluster with Raspberry Pi - sambaiz-net

The Kubernetes docs describe etcd as a consistent and highly-available key value store, so I wondered whether it gives up partition tolerance (P) in the CAP theorem. However, it seems hard to maintain both C and A once the network of a distributed system is partitioned.

Daniel Abadi argued that a CP system, which gives up availability when the network is partitioned, and a CA system, which lacks partition tolerance, are essentially the same, so there are really only two types: CP/CA and AP. In 2010, he introduced the PACELC theorem, which extends the CAP theorem and makes its trade-off between C and A explicit. It asks which of Availability (A) and Consistency (C) a system chooses when a Partition (P) occurs, and Else (E), in normal operation, which of Latency (L) and Consistency (C) it chooses.
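As a small illustration of how the PACELC labels are composed, here is a sketch that enumerates the four possible combinations. The class and function names (`OnPartition`, `Otherwise`, `Pacelc`, `all_labels`) are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class OnPartition(Enum):
    """What the system keeps when a partition (P) occurs."""
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class Otherwise(Enum):
    """What the system favors else (E), in normal operation."""
    LATENCY = "L"
    CONSISTENCY = "C"


@dataclass(frozen=True)
class Pacelc:
    on_partition: OnPartition
    otherwise: Otherwise

    def label(self) -> str:
        # Compose the conventional "P?/E?" notation.
        return f"P{self.on_partition.value}/E{self.otherwise.value}"


def all_labels() -> list[str]:
    """Every combination of the two independent choices."""
    return [Pacelc(p, e).label() for p, e in product(OnPartition, Otherwise)]


if __name__ == "__main__":
    print(all_labels())  # ['PA/EL', 'PA/EC', 'PC/EL', 'PC/EC']
```

The two choices are independent, which is why PACELC distinguishes four kinds of systems where CAP alone distinguishes two.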

etcd uses the Raft consensus algorithm. One node becomes the leader and replicates written data to the followers; an entry is considered committed once a majority of the nodes, counting the leader itself, have stored it. The leader sends heartbeats at regular intervals. If they stop arriving, a follower concludes that the leader has failed, becomes a candidate, and requests votes from the other nodes. If it obtains votes from a majority of the nodes, including its own vote, it becomes the leader for the next term.
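The two majority rules in Raft, one for committing an entry and one for winning an election, can be sketched as follows. This is a simplified illustration and not etcd's implementation; the function names are hypothetical, and details such as terms and log comparison are left out.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_committed(follower_acks: int, cluster_size: int) -> bool:
    """An entry is committed once a majority of nodes have stored it.

    The leader already holds the entry, so it counts as one copy.
    """
    return follower_acks + 1 >= quorum(cluster_size)


def wins_election(votes_from_others: int, cluster_size: int) -> bool:
    """A candidate votes for itself and needs a majority in total."""
    return votes_from_others + 1 >= quorum(cluster_size)


if __name__ == "__main__":
    # In a 3-node cluster, the leader plus one follower is a majority.
    print(is_committed(follower_acks=1, cluster_size=3))       # True
    print(is_committed(follower_acks=0, cluster_size=3))       # False
    # A candidate with its own vote and one other vote wins.
    print(wins_election(votes_from_others=1, cluster_size=3))  # True
    print(wins_election(votes_from_others=0, cluster_size=3))  # False
```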

Let’s run etcd with Kubernetes and see.

apiVersion: v1
kind: Service
metadata:
  name: etcd
  labels:
    app: etcd
spec:
  clusterIP: None
  selector:
    app: etcd
  ports:
  - port: 2379
    name: client
  - port: 2380
    name: peer

---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: "etcd"
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.0
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        volumeMounts:
        - name: etcd-data
          mountPath: /etcd-data
        command:
        - /bin/sh
        - -c
        - |
          /usr/local/bin/etcd \
          --name ${HOSTNAME} \
          --initial-advertise-peer-urls http://${HOSTNAME}.etcd:2380 \
          --listen-peer-urls http://0.0.0.0:2380 \
          --advertise-client-urls http://${HOSTNAME}.etcd:2379 \
          --listen-client-urls http://0.0.0.0:2379 \
          --initial-cluster etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380 \
          --initial-cluster-state new \
          --data-dir /etcd-data
  volumeClaimTemplates:
  - metadata:
      name: etcd-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

etcd-0 became the leader.

$ export ETCDCTL_ENDPOINTS="http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379"
$ etcdctl endpoint status --write-out=table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd-0.etcd:2379 | 69691ed70da97612 |   3.5.0 |   25 kB |      true |      false |        30 |         35 |                 35 |        |
| http://etcd-1.etcd:2379 | 7e44220b60d58d6a |   3.5.0 |   25 kB |     false |      false |        30 |         35 |                 35 |        |
| http://etcd-2.etcd:2379 | 63cdb3774b98cc2e |   3.5.0 |   20 kB |     false |      false |        30 |         35 |                 35 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

$ etcdctl put aaa bbb
OK
$ etcdctl get aaa
aaa
bbb

When the leader was temporarily taken down, an election started and etcd-1 became the new leader.

7e44220b60d58d6a is starting a new election at term 30
7e44220b60d58d6a became pre-candidate at term 30
7e44220b60d58d6a received MsgPreVoteResp from 7e44220b60d58d6a at term 30
7e44220b60d58d6a [logterm: 30, index: 36] sent MsgPreVote request to 63cdb3774b98cc2e at term 30
7e44220b60d58d6a [logterm: 30, index: 36] sent MsgPreVote request to 69691ed70da97612 at term 30
raft.node: 7e44220b60d58d6a lost leader 69691ed70da97612 at term 30
7e44220b60d58d6a received MsgPreVoteResp from 63cdb3774b98cc2e at term 30
7e44220b60d58d6a has received 2 MsgPreVoteResp votes and 0 vote rejections
7e44220b60d58d6a became candidate at term 31
7e44220b60d58d6a received MsgVoteResp from 7e44220b60d58d6a at term 31
7e44220b60d58d6a [logterm: 30, index: 36] sent MsgVote request to 63cdb3774b98cc2e at term 31
7e44220b60d58d6a [logterm: 30, index: 36] sent MsgVote request to 69691ed70da97612 at term 31
7e44220b60d58d6a received MsgVoteResp from 63cdb3774b98cc2e at term 31
7e44220b60d58d6a has received 2 MsgVoteResp votes and 0 vote rejections
7e44220b60d58d6a became leader at term 31

$ etcdctl endpoint status --write-out=table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd-0.etcd:2379 | 69691ed70da97612 |   3.5.0 |   25 kB |     false |      false |        31 |         38 |                 38 |        |
| http://etcd-1.etcd:2379 | 7e44220b60d58d6a |   3.5.0 |   25 kB |      true |      false |        31 |         38 |                 38 |        |
| http://etcd-2.etcd:2379 | 63cdb3774b98cc2e |   3.5.0 |   20 kB |     false |      false |        31 |         38 |                 38 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

$ etcdctl get aaa
aaa
bbb

When I reduced the StatefulSet to 1 replica, an election began in the same way, but because a majority of votes could not be gathered, no leader was elected and the data could not be read. After I increased it back to 2, a leader was elected and the data could be read again. This shows that etcd stays highly available as long as a majority of nodes remain up, but prioritizes Consistency over Availability once that majority is lost, so it is a PC(/EC) system. In contrast, Cassandra is an example of a PA/EL system, which prioritizes Availability during network partitions and Latency in normal operation.

69691ed70da97612 is starting a new election at term 35
69691ed70da97612 became pre-candidate at term 35
69691ed70da97612 received MsgPreVoteResp from 69691ed70da97612 at term 35
69691ed70da97612 [logterm: 35, index: 45] sent MsgPreVote request to 63cdb3774b98cc2e at term 35
69691ed70da97612 [logterm: 35, index: 45] sent MsgPreVote request to 7e44220b60d58d6a at term 35

$ etcdctl endpoint status --write-out=table
...
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |        ERRORS         |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| http://etcd-0.etcd:2379 | 69691ed70da97612 |   3.5.0 |   25 kB |     false |      false |        35 |         45 |                 45 | etcdserver: no leader |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+

$ etcdctl get aaa
...
Error: rpc error: code = Unknown desc = context deadline exceeded
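The arithmetic behind this behavior can be sketched as below. It assumes that scaling the StatefulSet down does not remove members from the etcd cluster, so the configured membership stays at 3 and only the number of running members changes. The function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Number of members that must be up to form a majority."""
    return members // 2 + 1


def has_quorum(alive: int, members: int) -> bool:
    """True if enough members are up to elect a leader and serve requests."""
    return alive >= quorum(members)


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster keeps its quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    members = 3  # configured cluster size from --initial-cluster
    for alive in (3, 2, 1):
        state = "leader can be elected" if has_quorum(alive, members) else "no leader"
        print(f"{alive}/{members} members up: {state}")
    print(f"failures tolerated: {fault_tolerance(members)}")
```

With 3 members the quorum is 2, so the cluster tolerates one failure. A single surviving member cannot elect itself, which matches the `no leader` error above.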

References

On ensuring high availability of systems, and on etcd and Raft (システムの高可用性の担保と etcd, Raft について)

How is ETCD a highly available system, even though it uses Raft which is a CP algorithm?

Understanding the CAP theorem through PACELC (2) (PACELCで理解するCAPの定理(2))