Apache Cassandra

Apache Cassandra was created at Facebook around 2008, originally to power features such as inbox search across very large data, and was released as open source. In their paper “Cassandra - A Decentralized Structured Storage System,” Avinash Lakshman and Prashant Malik describe it as a distributed storage system for managing very large amounts of structured data spread across many commodity servers while providing highly available service with no single point of failure. The paper notes that at Facebook the system had to handle a very high write throughput of billions of writes per day and replicate data across geographically distributed data centers.

Cassandra is often summarized as combining the ideas of two influential systems. From Amazon’s Dynamo it borrows the techniques for availability and partitioning, including a decentralized, masterless ring where data is distributed by consistent hashing and updates are accepted even when some nodes are unreachable. From Google’s BigTable it borrows the data model, a wide-column structure in which rows are grouped into column families and a row can hold many sparse columns.

The paper explains that Cassandra does not support a full relational data model; instead it gives clients a simpler model that supports dynamic control over data layout and format. This deliberate trade, giving up relational features such as joins, is what lets Cassandra scale writes across large clusters.

Cassandra later moved to the Apache Software Foundation. Its own project site today describes it as an open source NoSQL distributed database trusted for scalability and high availability, emphasizing that there are no single points of failure, that every node in the cluster is identical, and that read and write throughput increase linearly as new machines are added. Its cross-data-center replication is described as best-in-class, letting deployments survive regional outages.

Sources

Related