Corruption. It happens. And when it happens to Cassandra’s data files, one form it can take is a corrupt SSTable file. This is exactly what happened to us last week, and I wanted to share the steps we took to fix the corrupted data safely, without losing any of it.
Before we start, there are a few important things to note:
Firstly, we’re running Cassandra 1.2.8, so the output, commands and steps we took were performed using that version. If you’re running a different version of Cassandra, particularly <= v1.1 or >= v2.0, then it’s very possible that things will be different for you. In fact, hopefully the problem that caused the corruption in the first place has been fixed in versions after v1.2.8 and you won’t encounter any corruption at all!
Secondly, we’re running Cassandra with a Replication Factor (RF) of 3, which ensures there are at least 3 separate nodes in the cluster with a copy of every piece of data. This is a recommended RF for Cassandra clusters and ensures that if you lose one node, you’ll still have a copy of all your data available from the remaining nodes. This is how we are able to recover the corrupted data gracefully. If your RF is less than 3, or you don’t have data redundancy available in some other way, then you may still lose data in the event of corruption. Additionally, you may still lose data if more than one of your nodes has corruption affecting the same data. In that case you’d probably need to restore from a snapshot, which is a very different subject from what we cover in this post. If you’re running Cassandra but you aren’t sure about the implications of the Replication Factor, read up on it.
Thirdly, actual keyspace and column family names have been replaced with the generic placeholders keyspace and cf throughout.
With that said, let’s begin.
Cassandra regularly performs housekeeping on its data files, taking care of compaction, compression, writing new data to disk, and recording various database activities. If something’s awry with one of these files and it doesn’t work as normal, Cassandra will shout out about it in its log files:
==> /var/log/cassandra/system.log <==
ERROR [CompactionExecutor:7] 2014-03-20 12:00:07,454 CassandraDaemon.java (line 192) Exception in thread Thread[CompactionExecutor:7,1,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException: (/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db): corruption detected, chunk at 41674041 of length 47596.
	at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:89)
	at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:45)
	at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:355)
	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:397)
	...
While investigating high load on this node, I spotted this scary looking exception in Cassandra’s main log file. It announces that in this particular case, Cassandra had trouble reading the keyspace-cf-ic-4698-Data.db file due to a corruption error. This file belongs to an SSTable, which stores the data for a column family. Cassandra isn’t recovering from this problem itself, so what can we do?
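Before going further, it’s worth checking how many SSTables the errors implicate. One quick way (a sketch; the log path is the default one and may differ on your installation) is to pull the distinct -Data.db paths out of the exceptions:

```shell
LOG=${LOG:-/var/log/cassandra/system.log}  # adjust for your installation
# Extract every distinct SSTable path mentioned in an exception, so
# you can see whether one file or several are affected.
grep -o '/[^ )]*-Data\.db' "$LOG" 2>/dev/null | sort -u
```

In our case this kept pointing at the same single file, so only one SSTable needed fixing.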
Take the node offline
At this point we’ve identified a problem with the node so it’d probably be a good idea to deactivate it from the live cluster as a precautionary measure but also to give us a bit more leeway for our repair work. Do this ONLY if you have sufficient redundancy measures in place in your cluster (see important context above). It’s also a good idea to check the status of your apps (connection pools, reconnection handlers, etc.) and other nodes in the Cassandra cluster (load, logs, etc.) to make sure they’re able to handle this node going offline.
Gracefully shut down Cassandra on the affected server:
service cassandra stop
Check that Cassandra has fully shut down cleanly.
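A minimal way to confirm the JVM really has exited is to look for the daemon process (a sketch using pgrep; your init scripts and service names may vary):

```shell
# Look for the Cassandra daemon process. The [D] bracket trick stops
# pgrep -f from matching this command line itself; pgrep exits
# non-zero when nothing matches, i.e. when the shutdown completed.
if pgrep -f 'Cassandra[D]aemon' > /dev/null; then
    echo "Cassandra is still running"
else
    echo "Cassandra is stopped"
fi
```

You could also tail system.log and wait for the shutdown messages before moving on.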
Scrub the SSTable
Cassandra ships with a tool called sstablescrub. Its description states you should “Use this tool to fix (throw away) corrupted tables” and that before using it you should “try rebuild[ing] the tables using nodetool scrub”. I had already tried a nodetool scrub, but it failed with an SSTable corruption error. The offline sstablescrub wasn’t much different, also giving me a table corruption error, but you could try running it to see if it deletes the corrupted files for you:
Note: Be careful which system user you run this command as. It rewrites the SSTables with permissions for that user, so you may have to fix the ownership of the new files (e.g. chown them back to the cassandra user) afterwards.
sstablescrub keyspace cf
Pre-scrub sstables snapshotted into snapshot pre-scrub-1395327387317
Scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-5273-Data.db')
...
Scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db')
WARNING: Non-fatal error reading row (stacktrace follows)
WARNING: Row at 85207395 is unreadable; skipping to next
WARNING: Non-fatal error reading row (stacktrace follows)
WARNING: Row at 106721044 is unreadable; skipping to next
Error scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db'): org.apache.cassandra.io.compress.CorruptBlockException: (/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db): corruption detected, chunk at 52001433 of length 23873.
...
This command rewrote all of the other valid SSTables to new files, leaving only the corrupted one untouched in its original state and making it stick out like a sore thumb in an ls -alh of the column family’s directory (all the other SSTable files got new consecutive -ic-xxxx generation numbers).
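One way to make the odd one out obvious is to list the generation numbers present in the directory (a sketch; CF_DIR is a placeholder for your column family’s data directory):

```shell
CF_DIR=${CF_DIR:-/raid0/cassandra/data/keyspace/cf}  # placeholder path
# Pull the -ic-xxxx generation number out of each SSTable file name;
# after the scrub, the corrupted table keeps its old (lower) number
# while every rewritten table carries a new, higher one.
ls "$CF_DIR" 2>/dev/null | sed -n 's/.*-ic-\([0-9][0-9]*\)-.*/\1/p' | sort -nu
```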
Since the command didn’t delete the corrupted SSTable files, we’ve not got much choice but to clean them up ourselves.
Remove the corrupted SSTable
Grab the prefix of your SSTable files, in this case /raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-, and move the files to a backup folder somewhere, just in case we need them later (we should already have a snapshot created by sstablescrub as well, as seen in its output above):
mkdir -p /raid0/backups/corrupt-sstables
mv /raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-* /raid0/backups/corrupt-sstables/
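Before restarting Cassandra, it’s worth double-checking the move: the live directory should have no files from the corrupted generation left, and the backup directory should hold all of them. A sketch, using the same placeholder paths as above:

```shell
CF_DIR=${CF_DIR:-/raid0/cassandra/data/keyspace/cf}
BACKUP_DIR=${BACKUP_DIR:-/raid0/backups/corrupt-sstables}
# Count files from the corrupted generation in each location.
# grep -c prints the number of matching lines but exits non-zero
# when that count is zero, hence the || true on the first check.
ls "$CF_DIR" 2>/dev/null | grep -c 'ic-4698' || true   # expect 0
ls "$BACKUP_DIR" 2>/dev/null | grep -c 'ic-4698'       # expect one per component (Data, Index, ...)
```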
Now that we’ve effectively deleted a portion of the column family’s data on this node by removing the SSTable, we must start Cassandra back up on this server and run a repair on the column family. The repair should check the integrity of the data on the node and recover missing data from replicas stored by other nodes. The repair process takes a while (depending on the size of your data etc.), so you may want to run it in a terminal multiplexer like tmux or screen, in case you need to close your connection to the server while it runs:
service cassandra start
nodetool repair keyspace cf
Once the repair process completes, verify that all of your logs are clear of corruption exceptions and that things are looking normal.
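A simple way to check is to count corruption-related errors in the log (a sketch; again assumes the default log path, and note that entries from before the fix will still be present, so check the timestamps):

```shell
LOG=${LOG:-/var/log/cassandra/system.log}  # adjust for your installation
# Print the number of log lines mentioning a corruption exception;
# grep -c exits non-zero when that number is zero, hence || true.
grep -c 'CorruptSSTableException\|CorruptBlockException' "$LOG" 2>/dev/null || true
```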
To clean up, you might want to remove the snapshot created by sstablescrub using nodetool clearsnapshot, and remove your backup files:
rm -r /raid0/backups/corrupt-sstables