In this blog, I am going to show how to replace a failed node, or a node that is due for maintenance, in an Apache Cassandra cluster with a new healthy node using Pure Storage FlashArray//X snapshots. Apache Cassandra administrators normally either run the node repair process or copy the data from the failed node to a healthy new node. Both are slow and painful, and they can slow down your Apache Cassandra cluster considerably, which in turn affects your business transactions. FlashArray snapshots simplify this process. Node replacement is not the only advantage of snapshots for Apache Cassandra: in the previous blog I showed how snapshots can be used to build dev/test clusters or to copy a cluster. In my next blog, I am going to show how to back up and recover an Apache Cassandra cluster using snapshots.
Why use FlashArray//X snapshots for Apache Cassandra node replacement?
There are three ways we can do the Apache Cassandra node replacement:
The time comparison below covers these three methodologies. I tested each of them with 1 TB of data.
Clearly, the FlashArray//X snapshot process is instantaneous even as the data grows or multiplies, whereas the other methodologies (scp or repair) get even slower as the amount of data per node increases.
Apache Cassandra node replacement process:
To test rapid replacement of a failed node, I created a three-node Apache Cassandra cluster.
As seen in the nodetool status output, the cluster has three nodes: .67, .69, and .70. I am going to fail one of them, in this case node .67, then bring up a healthy node, .68, and attach it to the cluster. The cluster has a keyspace with replication factor 3, populated with data.
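To confirm which node the cluster considers down before replacing it, the status column of nodetool status can be filtered. This is a minimal sketch, assuming the usual layout in which each node line starts with a two-letter status/state code (for example UN or DN) followed by the node address. The function name down_nodes is hypothetical.

```shell
# Print the address of every node whose status letter is D (down).
# Reads `nodetool status` output on stdin.
# Assumption: node lines start with a two-letter code such as UN or DN,
# followed by the node address in the second column.
down_nodes() {
  awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}
```

It would typically be used as `nodetool status | down_nodes` on any live node. For the cluster in this post, the expectation is that it prints only the address of node .67 once that node has been failed.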
Here are the steps for node replacement:
1. Run nodetool flush or nodetool drain.
2. Freeze the filesystem of the data volume on the Cassandra node that is going to be replaced. The command is:
xfs_freeze -f /var/lib/cassandra/data
3. The next step is to take a FlashArray//X snapshot. The recommendation is to create a protection group for the Cassandra data volumes, as shown below.
Schedule a FlashArray snapshot of all the commit log volumes in the protection group every 5 minutes. Snapshots that are more than 1 hour old are removed automatically through the retention setting.
4. Unfreeze the filesystem of the data volume on the Cassandra node that is going to be replaced. The command is:
xfs_freeze -u /var/lib/cassandra/data
5. Install Cassandra on the new healthy node that is going to be added to the cluster. Change the configuration in cassandra.yaml so that it can join the cluster. Copy the latest snapshot of the failed node, or of the node that is going to be replaced, to the new node.
Make sure the node being replaced is down, as shown below.
Run the following command on this new healthy node:
service cassandra start -Dcassandra.replace_address_first_boot=10.21.238.67
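On installations where the service wrapper does not forward -D options to the JVM, the same property can usually be set through JVM_OPTS in cassandra-env.sh before starting the service. The helper below is only a sketch of that alternative: the function name add_replace_option is hypothetical, and the location of cassandra-env.sh depends on how Cassandra was installed, so it is passed in as an argument.

```shell
# Hypothetical helper: append the replace-address JVM option to cassandra-env.sh.
#   $1 = path to cassandra-env.sh (varies by installation, so it is an argument)
#   $2 = IP address of the node being replaced
add_replace_option() {
  env_file="$1"
  dead_ip="$2"
  opt="-Dcassandra.replace_address_first_boot=$dead_ip"
  # Skip if the option is already present, so repeated runs do not duplicate it.
  if grep -qF -e "$opt" "$env_file"; then
    return 0
  fi
  # Single quotes keep $JVM_OPTS literal; it is expanded when Cassandra sources the file.
  printf 'JVM_OPTS="$JVM_OPTS %s"\n' "$opt" >> "$env_file"
}
```

For the node in this post, the call would look like `add_replace_option /path/to/cassandra-env.sh 10.21.238.67`, followed by a plain `service cassandra start`.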
You will see the new node join the cluster immediately, and the following appears in system.log:
This shows how easy it is to use Pure Storage FlashArray//X snapshots to replace Cassandra nodes.
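Steps 1 to 4 above (flush, freeze, snapshot, unfreeze) can be wrapped in one small script so that the filesystem is thawed even when the snapshot command fails. This is a sketch under stated assumptions, not the exact procedure used for the tests in this post: the function names are hypothetical, and SNAPSHOT_CMD is a placeholder for whatever command triggers the FlashArray protection group snapshot in your environment. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: flush, freeze, snapshot, and unfreeze one Cassandra data volume.
# DATA_DIR is the mount point used in this post.
# SNAPSHOT_CMD is a placeholder (assumption): the command that takes the
# FlashArray snapshot of the protection group in your environment.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"
SNAPSHOT_CMD="${SNAPSHOT_CMD:-}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

snapshot_node() {
  if [ -z "${SNAPSHOT_CMD:-}" ]; then
    echo "SNAPSHOT_CMD is not set" >&2
    return 1
  fi
  run nodetool flush || return 1              # step 1: flush memtables to disk
  run xfs_freeze -f "$DATA_DIR" || return 1   # step 2: freeze the filesystem
  run sh -c "$SNAPSHOT_CMD"                   # step 3: take the array snapshot
  rc=$?
  # Step 4: always thaw, even if the snapshot command failed.
  run xfs_freeze -u "$DATA_DIR" || rc=1
  return "$rc"
}
```

A dry run such as `DRY_RUN=1 SNAPSHOT_CMD='echo snap' snapshot_node` lists the four commands in order without touching the node. In a real run, SNAPSHOT_CMD might be an ssh call to the array that runs the Purity `purepgroup snap` command against your protection group; treat that as an assumption and check the exact syntax against the CLI documentation for your Purity version. Note that this sketch does not handle interruption by a signal, which would leave the filesystem frozen until xfs_freeze -u is run by hand.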