Token assignment
To achieve fault tolerance, you can place Cassandra replicas in different racks so the cluster can survive a rack outage. For even data distribution, you may also need to alternate the nodes in the cluster across the racks. To understand how Cassandra decides where to place replicas, see NetworkTopologyStrategy.
Priam is a co-process that does not hold any state itself. To store ring membership information, Priam uses an external system such as SimpleDB, or even another Cassandra instance. When Priam launches for the first time on an instance, it will either try to replace a dead node or enter the ring as a new node. Once it derives a token for its node, Priam launches Cassandra and passes along the token, allowing Cassandra to start bootstrapping.
How it works: When a node goes down and a replacing node comes online, Priam compares the list of registered nodes from the external storage against the currently live nodes (obtained by calling the EC2 AutoScaleGroup API). If a registered node is not in the list of active instances, Priam assumes it has been terminated and takes its token. Priam instructs Cassandra to replace the dead node by setting a Cassandra JVM system property, cassandra.replace_token=<token>.
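A minimal sketch of that comparison, assuming the registered membership is available as an instance-id-to-token map and the live instance ids come from the ASG API (class and method names here are illustrative, not Priam's actual code):

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class DeadTokenFinder {
    /**
     * Compare tokens registered in external storage (e.g. SimpleDB) against
     * the instances currently alive in the AutoScaleGroup. A registered
     * instance that is no longer alive is assumed terminated, and its token
     * is free to be claimed by the replacing node.
     */
    public static Optional<String> findDeadToken(Map<String, String> registered, // instanceId -> token
                                                 Set<String> liveInstanceIds) {
        for (Map.Entry<String, String> e : registered.entrySet()) {
            if (!liveInstanceIds.contains(e.getKey())) {
                return Optional.of(e.getValue()); // claim the dead node's token
            }
        }
        return Optional.empty(); // every registered node is still alive
    }

    public static void main(String[] args) {
        Map<String, String> registered = Map.of("i-a", "0", "i-b", "42", "i-c", "84");
        Set<String> live = Set.of("i-a", "i-c"); // from the EC2 ASG API
        findDeadToken(registered, live).ifPresent(token ->
            // Priam hands the dead token to Cassandra via a JVM property:
            System.out.println("-Dcassandra.replace_token=" + token));
    }
}
```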
When a new node joins the cluster, Priam calculates the number of possible nodes in the cluster and splits the ring into equal-sized ranges. Priam then assigns a 'slot number' to the node, using this scheme (assuming 3 racks and 4 nodes per rack):
+--------+--------+--------+
| zone A | zone B | zone C |
+--------+--------+--------+
|   0    |   1    |   2    |
+--------+--------+--------+
|   3    |   4    |   5    |
+--------+--------+--------+
|   6    |   7    |   8    |
+--------+--------+--------+
|   9    |   10   |   11   |
+--------------------------+
As you can see, Priam avoids placing sequentially numbered slots next to each other in the same rack. After the slot is determined, a 'region offset' is applied to avoid having the same tokens in multiple Cassandra datacenters (duplicate tokens are not allowed within a Cassandra cluster). Thus, a token can be calculated like this (see the sketch after the example):
  100 (maximum token value)
/  10 (total number of nodes in this datacenter, i.e. this AWS region)
*   3 (slot for this node)
+  12 (region offset, calculated by taking the hash code of the datacenter name)
= 42
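Putting the two steps together, here is a hedged sketch of the slot and token arithmetic. The slot formula is inferred from the table above, and the real Priam code may normalize the region hash differently (e.g. to keep it non-negative); treat this as an illustration, not Priam's implementation:

```java
import java.math.BigInteger;

public class TokenAssigner {
    // Slot numbering inferred from the table above: walking the ring,
    // consecutive slots land in different racks.
    static int slot(int rackIndex, int positionInRack, int numRacks) {
        return positionInRack * numRacks + rackIndex;
    }

    // token = maxToken / numNodesInRegion * slot + regionOffset
    static BigInteger token(BigInteger maxToken, int numNodes, int slot, long regionOffset) {
        return maxToken.divide(BigInteger.valueOf(numNodes))
                       .multiply(BigInteger.valueOf(slot))
                       .add(BigInteger.valueOf(regionOffset));
    }

    public static void main(String[] args) {
        // The toy numbers from the text: slot 3 in a 10-node region.
        System.out.println(token(BigInteger.valueOf(100), 10, 3, 12)); // prints 42

        // With RandomPartitioner's real token space of 2^127 and a region
        // offset derived from the datacenter name's hash code (taken as
        // non-negative here for illustration):
        int slot = slot(0, 1, 3); // second node in zone A of a 3-rack region -> slot 3
        long offset = Math.abs((long) "us-east-1".hashCode());
        System.out.println(token(BigInteger.ONE.shiftLeft(127), 12, slot, offset));
    }
}
```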
Auto Scaling Group advice: Create one autoscaler per zone. In AWS, an autoscaler that spans zones will try to spin up an instance in one zone, and if it cannot, it will create one in another zone. Once the original zone becomes available again, it will try to rebalance the zones by randomly terminating instances. Creating one autoscaler per zone avoids this rebalancing. Note: Priam requires at least 2 zones in the current implementation to create a list of seeds.
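One way to set this up, sketched with the AWS SDK for Java v1 (group and launch-configuration names here are illustrative; a real deployment would also configure subnets, tags, health checks, etc.):

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;

public class PerZoneAsg {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
        // One fixed-size ASG per zone (min = max = desired), so AWS never
        // rebalances instances across zones behind Priam's back.
        for (String zone : new String[] {"us-east-1a", "us-east-1b", "us-east-1c"}) {
            autoScaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                    .withAutoScalingGroupName("my_cluster-" + zone)
                    .withLaunchConfigurationName("cassandra-lc") // assumed to exist
                    .withAvailabilityZones(zone)
                    .withMinSize(4)
                    .withMaxSize(4)
                    .withDesiredCapacity(4));
        }
    }
}
```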