The design of Apache Cassandra

Cassandra is a distributed database initially developed by Facebook and then open-sourced as an Apache project.

Goals of Cassandra

Cassandra primarily targets write-intensive workloads and aims to achieve the following objectives:

  • Achieving very high availability.

  • Enabling linear scalability with commodity machines.

  • Delivering outstanding performance characterized by high throughput and low latency.

  • Providing a flexible data model.


Note: We have to compromise on strong consistency in order to achieve very high availability and low latency. The PACELC theorem describes two tradeoffs: between availability and consistency, and between latency and consistency. Cassandra provides tunable consistency levels, which allow its clients to strike the required balance between consistency and availability.
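
To make tunable consistency more concrete, here is a minimal Python sketch. It assumes the commonly cited rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas contacted for a read plus the number contacted for a write exceeds the replication factor. The function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(reads_see_latest_write(q, q, rf))   # True: quorum reads and writes overlap
print(reads_see_latest_write(1, 1, rf))   # False: faster, but a read may be stale
```

Contacting fewer replicas lowers latency and tolerates more failed nodes, at the cost of possibly reading stale data. This is the balance the note above refers to.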

Design – data model

To understand the design of Cassandra, we will take the example of an online bookstore built on top of Cassandra.

Cassandra stores data in tables. Each row in a table follows a schema that defines the row's structure (the columns and their types). There are as many tables as there are schemas defined by the Cassandra user. The rows that belong to a schema form a column family, which we call the table. At the highest level, Cassandra has keyspaces. A user can create a keyspace, define the schemas, and provide the replication factor and strategy, as shown in the following illustration.

Data model of Cassandra
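
The hierarchy described above (keyspace, then tables, then a schema of columns and types) can be sketched with plain Python dataclasses. This is only an illustration of the structure, not how Cassandra is implemented. The class names, the column names, and the `title` column are assumptions made for the bookstore example.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A table (column family): a name plus a schema of column -> type."""
    name: str
    schema: dict


@dataclass
class Keyspace:
    """Top-level container holding tables and replication settings."""
    name: str
    replication_factor: int
    replication_strategy: str
    tables: dict = field(default_factory=dict)

    def create_table(self, name, schema):
        # One table per schema defined by the user.
        table = Table(name, dict(schema))
        self.tables[name] = table
        return table


# Hypothetical bookstore keyspace with a single "books" table.
bookstore = Keyspace("bookstore", replication_factor=3,
                     replication_strategy="SimpleStrategy")
bookstore.create_table("books", {
    "book_id": "int",
    "language": "text",
    "title": "text",
    "quantity": "int",
})
print(sorted(bookstore.tables))  # ['books']
```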

In a table, each row is uniquely identified by a primary key which may comprise a single column or more than one column. The primary key has two components: a partition key column component and a clustering column/columns component, as shown in the following illustration.

Uniquely identifying each row in the table with a primary key
  • The partition key is used to split the table into different sets of rows, where each set is called a partition. Partitioning helps achieve scalability.

  • The clustering key is used to sort the data within a partition, which helps in fast retrieval of the data, as illustrated below.
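
The two roles of the primary key can be sketched in a few lines of Python: rows are grouped by the partition key, and each group is kept sorted by the clustering key. The sample rows and the function name `build_partitions` are assumptions for illustration. Book ID is the partition key and Language is the clustering key, as in the bookstore example.

```python
from collections import defaultdict

# Hypothetical rows: (book_id, language, title)
rows = [
    (273, "Spanish", "Book A"),
    (273, "English", "Book A"),
    (118, "French", "Book B"),
    (273, "German", "Book A"),
]


def build_partitions(rows):
    """Group rows by partition key, then sort each group by clustering key."""
    partitions = defaultdict(list)
    for book_id, language, title in rows:
        partitions[book_id].append((language, title))
    for partition in partitions.values():
        partition.sort(key=lambda row: row[0])  # order by Language
    return dict(partitions)


partitions = build_partitions(rows)
print(partitions[273])
# [('English', 'Book A'), ('German', 'Book A'), ('Spanish', 'Book A')]
```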

Partitioning

Cassandra uses consistent hashing for horizontal data partitioning. We illustrate it with an example of how Cassandra distributes the bookstore's data across the nodes in the cluster.

The number of data partitions depends on the number of nodes in the ring, including virtual nodes. In the illustration below, the bookstore's data is split into four partitions because there are four nodes in the cluster.

As in consistent hashing, nodes are arranged in a ring of length L, where L is an integer. Each node is assigned a key within the ring's range. Let's assume Node 1 is assigned key m, Node 2 key n, Node 3 key o, and Node 4 key p. All keys are integers. The ring's range does not have to start from zero, which is why we define the range as [start, end].

Design of Cassandra: Books entries by the Admin of the bookstore are written to different nodes based on the partition key, the Book ID. A single book insertion by the Admin is shown in the image.

When the administrator of the bookstore inserts a book, a hash is computed from the partition key (Book ID), which identifies the node responsible for storing that book. In the above image, hash(273) mod Length_of_the_ring is assumed to be greater than p and less than or equal to m (a range that wraps around the end of the ring), which is why the book with ID 273 is stored on Node 1.

  • Node 1 stores all of the entries whose hash lies between p and m. Here, m is inclusive.

  • Node 2 stores all of the entries whose hash lies between m and n. Here, n is inclusive.

  • Node 3 stores all of the entries whose hash lies between n and o. Here, o is inclusive.

  • Node 4 stores all of the entries whose hash lies between o and p. Here, p is inclusive.
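
The range rules above can be sketched as a small ring lookup in Python. This is a simplified illustration under stated assumptions: the ring starts at zero, `RING_LENGTH` and the node keys (100, 200, 300, 400 standing in for m, n, o, p) are made-up values, and MD5 is used only as a convenient stand-in hash, not as the hash function Cassandra itself uses.

```python
import bisect
import hashlib

RING_LENGTH = 1000  # assumed ring length L


def ring_position(partition_key):
    """hash(partition_key) mod Length_of_the_ring (MD5 as a stand-in hash)."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % RING_LENGTH


class Ring:
    def __init__(self, node_keys):
        # node_keys maps a node name to its key on the ring.
        self._entries = sorted((key, name) for name, key in node_keys.items())
        self._keys = [key for key, _ in self._entries]

    def owner_of_position(self, position):
        # A node owns (previous key, own key]: find the first key >= position.
        i = bisect.bisect_left(self._keys, position)
        # Positions beyond the largest key wrap around to the first node.
        return self._entries[i % len(self._entries)][1]

    def owner(self, partition_key):
        return self.owner_of_position(ring_position(partition_key))


ring = Ring({"Node 1": 100, "Node 2": 200, "Node 3": 300, "Node 4": 400})
print(ring.owner_of_position(100))  # Node 1 (m is inclusive)
print(ring.owner_of_position(101))  # Node 2
print(ring.owner_of_position(401))  # Node 1 (wraps around past p)
print(ring.owner(273))              # node chosen for Book ID 273 in this sketch
```

Keeping the keys sorted lets the lookup use binary search, so finding the owner takes logarithmic time in the number of nodes (or virtual nodes).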

The benefit of using a clustering key

In the above example, the Language column, as the clustering key, sorts the rows of a partition in alphabetical order of book language, which in turn helps find a book in a specific language efficiently. Alternatively, if we use the Quantity column as the clustering key, the rows of a partition are sorted in ascending or descending order of book quantity, which in turn helps find books that are low in stock efficiently.
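
A short Python sketch shows why sorted rows help: once a partition is ordered by the clustering key, a lookup can use binary search instead of scanning every row. The rows, quantities, and function names below are hypothetical, and the in-memory lists merely stand in for a partition's sorted storage.

```python
import bisect

# One partition's rows as (language, quantity), sorted by Language.
by_language = [
    ("English", 12),
    ("French", 3),
    ("German", 7),
    ("Spanish", 5),
]
languages = [language for language, _ in by_language]


def find_by_language(language):
    """Binary search on the clustering key instead of a full scan."""
    i = bisect.bisect_left(languages, language)
    if i < len(by_language) and by_language[i][0] == language:
        return by_language[i]
    return None


# The same rows if Quantity were the clustering key (ascending order).
by_quantity = sorted(by_language, key=lambda row: row[1])
quantities = [quantity for _, quantity in by_quantity]


def low_stock(threshold):
    """All rows with quantity <= threshold form a prefix of the sorted rows."""
    return by_quantity[:bisect.bisect_right(quantities, threshold)]


print(find_by_language("German"))  # ('German', 7)
print(low_stock(5))                # [('French', 3), ('Spanish', 5)]
```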

Replication

The replication factor (RF) of a keyspace determines how many copies of each partition are stored, and the replication strategy determines which nodes hold them: in addition to the node that owns the partition, the strategy selects RF-1 of the other nodes in the cluster to store replicas. Replication helps achieve availability.
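
As a rough illustration, the Python sketch below places replicas by walking clockwise around the ring from the owning node, which resembles a simple ring-based placement strategy. The actual choice of nodes depends on the configured replication strategy, so treat the function `replicas_for` and the node order as assumptions made for this example.

```python
def replicas_for(owner, ring_order, replication_factor):
    """Return the owner plus the next RF-1 nodes clockwise on the ring."""
    if replication_factor > len(ring_order):
        raise ValueError("replication factor exceeds the number of nodes")
    start = ring_order.index(owner)
    return [ring_order[(start + step) % len(ring_order)]
            for step in range(replication_factor)]


ring_order = ["Node 1", "Node 2", "Node 3", "Node 4"]
print(replicas_for("Node 1", ring_order, 3))  # ['Node 1', 'Node 2', 'Node 3']
print(replicas_for("Node 4", ring_order, 2))  # ['Node 4', 'Node 1']
```

With RF = 3, the partition stays readable even if two of its three replica nodes are down, provided the chosen consistency level allows a single replica to answer.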

In conclusion, Apache Cassandra’s design aligns with its goals of high availability, scalability, and performance. It enables linear scalability by partitioning data horizontally across cluster nodes, allowing for easy addition or removal of nodes as needed. Additionally, it offers configurable replication factors and strategies to meet specific availability requirements. By utilizing efficient data partitioning and replication strategies, Cassandra delivers outstanding performance characterized by high throughput and low latency. Its flexible data model, including primary keys with partition and clustering components, supports efficient data organization and retrieval, catering to various application needs.

Test yourself

Test what you have learned so far.

1. What is one of the primary goals of Apache Cassandra?

  A) Achieving low throughput

  B) Delivering high availability

  C) Minimizing data partitioning

  D) Ensuring strong consistency
