Bulk Loading Options for Cassandra
Cassandra's bulk loading interfaces are most useful in two use cases: initial migration or restore from another datastore/cluster and regular ETL. Bulk loading assumes that it's important to you to avoid loading via the thrift interface, for example because you haven't integrated a client library yet or because throughput is critical.
There are two alternative techniques used for bulk loading into Cassandra: "copy-the-sstables" and sstableloader. Copying the sstables is a filesystem level operation, while sstableloader utilizes Cassandra's internal streaming system. Neither is without disadvantages; the best choice depends on your specific use case. If you are using Counter columnfamilies, neither method has been extensively tested and you are safer writing via thrift.
The key to understanding bulk-loading throughput is that potential throughput depends significantly on the nature of the operation as well as the configuration of source and target clusters and things like number of sstables, sstable size and tolerance to potentially duplicate data. Notably but not significantly, sstableloader in 1.1 is slightly improved over the (freshly re-written) version in 1.0. 
Below are good cases for and notable aspects of each strategy.
Copy-the-sstables/"nodetool refresh" can be useful if:
- Your target cluster is not running, or if it is running, is not sensitive to latency from bulk loading at "top speed" and associated operations.
- You are willing to manually, or have a tool to, de-duplicate sstable names and are willing to figure out where to copy them to in any non copy-all-to-all case. You are willing to run cleanup and/or major compaction understand that some disk space is wasted until you do. 
- You don't want to deal with the potential failure modes of streaming, which are especially bad in non-LAN deploys including EC2.
- You are restoring in a case where RF=N, because you can just copy one node's data to all nodes in the new RF=N cluster and start the cluster without bootstrapping (auto_bootstrap: false in cassandra.yaml).
- The sstables you want to import are a different version than the target cluster currently creates. Example : trying to sstableload -hc- (1.0) sstables into a -hd- (1.1) cluster is reported to not work. 
- You have your source sstables in something like s3 which can easily parallelize copies to all target nodes. s3<>ec2 is fast and free, close to best case for the inefficiency during copy stage.
- You want to increase RF on a running cluster, and are ok with running cleanup and/or major compaction after you do.
- You want to restore from a cluster with RF=[x] to a cluster whose RF is the same or smaller and whose size is a multiple of [x]. Example: restoring a 9 node RF=3 cluster to a 3 node RF=3 cluster, you copy 3 source nodes worth of sstables to each target node.
sstableloader/JMX "bulkload" can be useful if:
- You have a running target cluster, and want the bulk loading to respect for example streaming throttle limits.
- You don't have access to the data directory on your target cluster, and/or JMX to call "refresh" on it.
- Your replica placement strategy on the target cluster is so different from the source that the overhead of understanding where to copy sstables to is unacceptable, and/or you don't want to call cleanup on a superset of sstables.
- You have limited network bandwidth between the source of sstables and the target(s). In this case, copying a superset of sstables around is especially ineffecient.
- Your infrastructure makes it easy to temporarily copy sstables to a set of sstableloader nodes or nodes on which you call "bulkLoad" via JMX. These nodes are either non-cluster-member hosts which are otherwise able to participate in the cluster as a pseudo-member from an access perspective or cluster members with sufficient headroom to bulkload.
- You can tolerate the potential data duplication and/or operational complexity which results from the fragility of streaming. LAN is best case here. A notable difference between "bulkLoad" and sstableloader is that "bulkLoad" does not have sstableloader's "--ignores" option, which means you can't tell it to ignore replica targets on failure. 
- You understand that, because it uses streaming, streams on a per-sstable basis, and streaming respects a throughput cap, your performance is bounded in terms of ability to parallelize or burst, despite "bulk" loading.
- May 2013
- March 2013
- February 2013
- January 2013
- December 2012
- November 2012
- September 2012
- August 2012
- July 2012
- June 2012
- May 2012
- April 2012
- March 2012
- February 2012
- January 2012
- December 2011
- November 2011
- October 2011
- September 2011
- August 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- January 2011
- December 2010
- November 2010
- October 2010
- September 2010
- November 2009
- March 2008
- November 2007
- October 2007