Understanding the architecture of Cassandra


In this post, I will introduce Apache Cassandra, based on the introduction provided by the DataStax documentation and videos.
We must first understand what NoSQL databases are, and then look at the Cassandra architecture.

We are used to working with traditional relational database management systems (RDBMS) such as MySQL or PostgreSQL.
An RDBMS is fine for managing medium-sized data:
  • It guarantees ACID (Atomicity, Consistency, Isolation, Durability).
  • It scales vertically: the data resides on a single node, and scaling is done by spreading the load across the CPU cores and RAM of that machine.

But we must ask: "Can an RDBMS work with Big Data?"
  • The data is replicated asynchronously, so replicas can lag behind.
  • It is difficult to run full reads and writes in parallel while keeping the data consistent.
  • Sharding is a nightmare!
  • Single point of failure: high availability is hard to achieve, so how can we recover the data if there is a problem in the system?
  • Scaling up is expensive!
  • What if the application is distributed and the data arrives from many locations?

NoSQL is the solution 

NoSQL databases are called "Not Only SQL" or "non-relational" databases. They have many advantages compared to traditional relational databases.
  • They can manage Big Data in real-time web applications.
  • They are distributed, and many of them are open source.
  • They partition the data to minimize the impact of failure and to distribute the load of read/write operations. If one node fails, only the data belonging to that node is impacted, not the entire data store.
  • They keep multiple copies (replicas) of the same data to ensure high availability.
  • They scale horizontally (scale out) easily. With horizontal scaling it is often easier to scale dynamically by adding more machines to the existing pool, unlike vertical scaling, which is often limited to the capacity of a single machine.
  • Most NoSQL databases provide a way of growing your data cluster without bringing the cluster down or forcing a complete re-partitioning. One well-known algorithm used to deal with this is called consistent hashing.
There are more than 225 NoSQL databases, which are listed here.
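To make the consistent hashing idea more concrete, here is a minimal Python sketch. It is not how any particular database implements it: the node names, the `HashRing` class, the use of MD5 and the 100 points per node are all assumptions chosen for illustration. It compares how many keys change owner when a fourth node joins, with a hash ring and with naive "hash modulo number of nodes" placement.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Hash a string to a large integer (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A toy consistent-hash ring with several points per node."""

    def __init__(self, nodes, points_per_node=100):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place the node at several pseudo-random positions on the ring.
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(10_000)]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved_ring = sum(before[k] != after[k] for k in keys)

# Naive placement for comparison: hash modulo the number of nodes.
moved_modulo = sum(stable_hash(k) % 3 != stable_hash(k) % 4 for k in keys)

print(f"consistent hashing moved {moved_ring / len(keys):.0%} of keys")
print(f"modulo hashing moved     {moved_modulo / len(keys):.0%} of keys")
```

With the ring, the only keys that move are the ones taken over by the new node, which is why the cluster can grow without a complete re-partitioning.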

Let's start!



Apache Cassandra is an open source distributed NoSQL database. It is designed to manage big data and provides high availability with no single point of failure.


Have you ever wondered why they called it Cassandra?
It was the name of a princess of Troy in Greek mythology. She was very beautiful and she had the ability to see the future, but nobody believed her. You can find the whole story of Cassandra here.
We must not repeat the mistakes of history, so Cassandra knows our data and we must believe and trust her 😃.

Architecture 

If you want to think about Cassandra conceptually, you can think of a hash (token) ring: all nodes in the cluster are equal, and the data is distributed across the cluster in the form of a ring.

This ring starts at position 0 and goes up to position 2^127.

Cluster: a group of nodes. It contains one or more datacenters and can span physical locations.

Node: a virtual or physical machine. A node is the place where you store your data. It is the basic infrastructure component of Cassandra.



This is an example, from the DataStax documentation, of a cluster where each node is responsible for 25% of the token ring.

  • Each token determines the node's position in the ring and its portion of data according to its hash value.
  • Each node owns a range of hashes, like a bucket of hashes.
  • When you define a model in Cassandra and create a table, you must specify the primary key; a part of the primary key is called the "partition key".
  • Each node stores the data determined by mapping the partition key to a token value within the range from the previous node's token to its own assigned token. Each node also contains copies of rows from other nodes in the cluster.
  • The data is replicated to multiple servers.
  • All nodes hold data and answer queries (reads and writes).
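The mapping from partition key to token to node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four node names are hypothetical, the tokens are evenly spaced so that each node owns 25% of a ring with 2^127 positions, and MD5 stands in for whatever partitioner the cluster is configured with.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2**127  # number of ring positions, following the text

# Hypothetical node names. Each node's token is the upper end of its range,
# so four evenly spaced tokens give every node 25% of the ring.
NODES = ["node-1", "node-2", "node-3", "node-4"]
TOKENS = [(i + 1) * RING_SIZE // 4 - 1 for i in range(len(NODES))]


def token_for(partition_key: str) -> int:
    """Map a partition key to a ring position (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def owner_of_token(token: int) -> str:
    """Return the node whose range (previous token, own token] holds `token`."""
    return NODES[bisect.bisect_left(TOKENS, token)]


def owner_of_key(partition_key: str) -> str:
    return owner_of_token(token_for(partition_key))


# Spread 10,000 made-up partition keys over the ring and count per node.
counts = Counter(owner_of_key(f"user-{i}") for i in range(10_000))
for node in NODES:
    print(node, counts[node])
```

Because the hash spreads keys roughly uniformly over the ring, each of the four nodes ends up with about a quarter of the keys.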

Replication

In Cassandra, data is replicated automatically.

Replication factor: how many copies of each piece of data should you have in your cluster?
If a machine is down, the writes it missed are kept as hints and replayed to it via hinted handoff when the node comes back up and rejoins the cluster.
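Here is a toy Python model of the replication factor and hinted handoff. Everything in it is an assumption made for illustration (the `ToyCluster` class, the node names, and placing the extra copies on the next nodes clockwise around the ring, which is the idea behind Cassandra's SimpleStrategy); it is not Cassandra's actual implementation.

```python
from collections import defaultdict

NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical ring order


def replicas_for(owner_index, replication_factor):
    """The owner plus the next nodes clockwise around the ring."""
    return [NODES[(owner_index + i) % len(NODES)]
            for i in range(replication_factor)]


class ToyCluster:
    def __init__(self):
        self.data = {node: {} for node in NODES}  # what each node has stored
        self.down = set()                         # nodes currently unreachable
        self.hints = defaultdict(list)            # missed writes per down node

    def write(self, owner_index, key, value, replication_factor=3):
        for node in replicas_for(owner_index, replication_factor):
            if node in self.down:
                self.hints[node].append((key, value))  # keep a hint for later
            else:
                self.data[node][key] = value

    def bring_up(self, node):
        """Mark the node as reachable again and replay the writes it missed."""
        self.down.discard(node)
        for key, value in self.hints.pop(node, []):
            self.data[node][key] = value


cluster = ToyCluster()
cluster.down.add("node-2")
cluster.write(owner_index=0, key="user-42", value="Alice")
print("while down:", "user-42" in cluster.data["node-2"])
cluster.bring_up("node-2")
print("after replay:", cluster.data["node-2"].get("user-42"))
```

With a replication factor of 3, the row owned by node-1 is also stored on node-2 and node-3; node-4 never receives it.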

Consistency level

How many replicas need to respond to a read or write before Cassandra answers the client?
The consistency level determines the number of replica nodes that must respond before the result of a read/write request can be sent back to the client.


  • ONE: provides a low level of consistency and satisfies the needs of most users.
  • QUORUM: provides strong consistency if you can tolerate some level of failure. A quorum is calculated as (sum_of_replication_factors / 2) + 1, rounded down to a whole number. For example, in a single-datacenter cluster using a replication factor of 3, a quorum is 2 nodes.
  • ALL: provides the highest consistency and the lowest availability of any level.
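The quorum formula is easy to check with a small Python sketch. The function names are my own; only the three levels listed above are handled.

```python
def quorum(sum_of_replication_factors: int) -> int:
    """(sum_of_replication_factors / 2) + 1, rounded down."""
    return sum_of_replication_factors // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """Replicas that must respond for the three levels discussed above."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


for level in ("ONE", "QUORUM", "ALL"):
    needed = required_replicas(level, 3)
    print(f"{level}: {needed} replica(s) must respond, {3 - needed} may be down")
```

With a replication factor of 3, QUORUM needs 2 replicas and still works with one replica down, while ALL stops working as soon as a single replica is unavailable.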

Distribution

Any node can service a write request. The node that handles the request is called "the coordinator".
The coordinator acts as a proxy between the client application and the nodes that own the data being requested.
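As a rough sketch of the coordinator's role, the Python function below forwards a write to the replicas and only succeeds if enough of them are reachable. The function, the `Unavailable` exception and the dictionaries standing in for nodes are hypothetical names for this example, not the driver's or the server's API.

```python
class Unavailable(Exception):
    """Raised when too few replicas are reachable for the request."""


def coordinate_write(replicas, alive, key, value, required_acks):
    """Forward a write to the live replicas and return the number of acks.

    `replicas` maps a node name to the dict used as that node's storage;
    `alive` is the set of reachable node names. Both are toy stand-ins.
    """
    reachable = [name for name in replicas if name in alive]
    if len(reachable) < required_acks:
        raise Unavailable(
            f"{len(reachable)} replica(s) reachable, {required_acks} required"
        )
    for name in reachable:
        replicas[name][key] = value  # the replica stores the row and acks
    return len(reachable)


replicas = {"node-1": {}, "node-2": {}, "node-3": {}}
alive = {"node-1", "node-2"}  # node-3 is down

# Two acknowledgements required (a quorum for replication factor 3).
print(coordinate_write(replicas, alive, "user-42", "Alice", required_acks=2))

try:
    # All three replicas required: fails while node-3 is down.
    coordinate_write(replicas, alive, "user-42", "Bob", required_acks=3)
except Unavailable as exc:
    print("write rejected:", exc)
```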

Write Path


  1. Writes are written to the commit log to ensure data durability.
  2. The data is then indexed and written to an in-memory structure called a memtable, which resembles a write-back cache.
  3. Once the memtable is full, the data is written to disk in an SSTable data file.
  4. A new memtable is created in memory.
  5. All writes are automatically partitioned and replicated throughout the cluster.
A delete is a special case of a write: it writes a marker called a tombstone.
Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones.
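The steps above can be sketched on a single node with a toy Python class. It is a deliberate simplification: `ToyNode`, the tiny memtable limit and the integer timestamps are assumptions for illustration, everything lives in memory, and tombstones are dropped at the first compaction (a real cluster keeps them for a grace period first).

```python
import itertools

TOMBSTONE = "<tombstone>"  # marker written by a delete


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> (timestamp, value), in memory
        self.sstables = []     # flushed tables, oldest first
        self._clock = itertools.count(1)  # stand-in for write timestamps

    def write(self, key, value):
        ts = next(self._clock)
        self.commit_log.append((ts, key, value))  # step 1: commit log
        self.memtable[key] = (ts, value)          # step 2: memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                          # step 3: memtable is full

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is a write of a tombstone

    def flush(self):
        self.sstables.append(dict(self.memtable))  # written as an SSTable
        self.memtable = {}                         # step 4: new memtable

    def compact(self):
        """Merge all SSTables, keeping the newest entry for each key."""
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # Obsolete versions are gone; now discard the tombstones as well.
        self.sstables = [
            {k: v for k, v in merged.items() if v[1] != TOMBSTONE}
        ]


node = ToyNode(memtable_limit=2)
node.write("a", "v1")
node.write("b", "v1")   # memtable full: first SSTable is flushed
node.write("a", "v2")
node.delete("b")        # memtable full again: second SSTable is flushed
print(len(node.sstables), "SSTables before compaction")
node.compact()
print(node.sstables)
```

After compaction only the latest value of "a" is left: the old version of "a" and the deleted "b" are discarded.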

Read Path

Cassandra looks for the nodes that hold the requested key. On each node, the data is pulled from the memtable and the SSTables and merged using the write timestamps (the last write wins), and then the response is sent to the client.

Compaction runs in the background: it takes small SSTables and merges them into a bigger one (again using the timestamps, with the last write winning), so that reads have fewer SSTables to check.
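The "last write wins" merge on a single node can be sketched as follows. This is a simplified Python illustration: the dictionaries standing in for a memtable and SSTables, the integer timestamps and the `read` function are assumptions for the example, not Cassandra's storage format.

```python
TOMBSTONE = "<tombstone>"  # marker left behind by a delete


def read(key, memtable, sstables):
    """Return the newest value for `key`, or None if absent or deleted."""
    # Collect every (timestamp, value) version of the key on this node.
    versions = [src[key] for src in [memtable, *sstables] if key in src]
    if not versions:
        return None
    ts, value = max(versions, key=lambda version: version[0])  # newest wins
    return None if value == TOMBSTONE else value


# Each structure maps a key to (timestamp, value).
sstables = [
    {"a": (1, "old")},
    {"a": (3, "new"), "b": (2, "x")},
]
memtable = {"b": (4, TOMBSTONE)}

print(read("a", memtable, sstables))  # newest version of "a"
print(read("b", memtable, sstables))  # deleted after it was written
print(read("c", memtable, sstables))  # never written
```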

The read path is more complicated than the write path; you can see more details here.

Conclusion

Cassandra is a row-oriented database. If you know SQL syntax, you will be able to create your tables and manage your data using CQL, because CQL uses a syntax similar to SQL.
