01 Apache Cassandra – Introduction Notes

1. What is Cassandra?

  • Apache Cassandra is a distributed NoSQL database designed for managing large amounts of structured, semi-structured, and unstructured data.
  • Originally developed at Facebook, it became an Apache Software Foundation project.
  • Built for high availability, fault tolerance, scalability, and performance.

2. Key Features

  • Distributed & Decentralized: No single point of failure. Every node is equal (peer-to-peer architecture).
  • Highly Scalable: Scales horizontally by adding more nodes without downtime.
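  • Fault Tolerant: Data is replicated across multiple nodes, so the cluster keeps serving requests when individual nodes fail.
  • High Write Throughput: Writes are appended to a commit log and an in-memory table, which avoids random disk I/O on the write path.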
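  • Flexible Schema: Early versions exposed schema-optional column families. With CQL, tables have a declared schema, but columns can be added or dropped online, and a row stores only the columns that have values.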
  • Tunable Consistency: Lets you configure consistency levels per operation.
  • Multi-Data Center Replication: Replicas can be placed in more than one data center.
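
Tunable consistency comes down to simple arithmetic over the number of replicas. The sketch below is illustrative only: the function names are hypothetical and not part of any Cassandra driver. It assumes a replication factor RF and the usual rule that a quorum is a strict majority of the replicas.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: a strict majority of all replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """A read overlaps the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)                                # 2 of 3 replicas
print(q)
print(reads_see_latest_write(q, q, rf))       # QUORUM reads + QUORUM writes
print(reads_see_latest_write(1, 1, rf))       # ONE reads + ONE writes
```

With RF = 3, QUORUM on both reads and writes gives overlapping replica sets, while ONE on both trades that guarantee for lower latency.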

3. Data Model Basics

  • Inspired by Google Bigtable and Amazon Dynamo.
  • Data is organized as:

    • Keyspace: Top-level namespace (like a database).
    • Table (Column Family): Similar to an RDBMS table.
    • Row: Identified by a Primary Key, which consists of a partition key and optional clustering columns.
    • Column: A named, typed value within a row. A row does not need a value for every column.
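
The nesting above can be pictured with plain Python dictionaries. This is only a mental model, not how Cassandra stores data; the table name, keys, and the helper `get_column` are made up for illustration.

```python
# keyspace -> table -> primary key -> {column: value}
keyspace = {
    "users": {
        "u1": {"name": "John Doe", "email": "john@example.com"},
        "u2": {"name": "Jane Roe"},  # no email value stored for this row
    }
}


def get_column(table: str, key: str, column: str):
    """Return a column value, or None if the row or column has no value."""
    return keyspace[table].get(key, {}).get(column)


print(get_column("users", "u1", "email"))
print(get_column("users", "u2", "email"))
```

The second lookup returns None: the row exists, but it simply has no value for that column.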

4. Core Concepts

  • Partitioner: Hashes a row's partition key to a token, which determines which node stores the row.
  • Replica: One copy of a row. The replication factor sets how many nodes hold a copy.
  • Consistency Level: Defines how many replicas must respond before considering a write/read successful (e.g., ONE, QUORUM, ALL).
  • Hinted Handoff: When a replica is down, the coordinator stores a hint for the missed write and replays it once the replica is back.
  • Gossip Protocol: Nodes exchange state information with each other periodically.
  • SSTable (Sorted String Table): Immutable files stored on disk after memtables are flushed.
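
How a partitioner and replication fit together can be sketched as a small token ring. This is a simplified model under stated assumptions: MD5 stands in for the hash (Cassandra's default partitioner is Murmur3Partitioner), each node owns a single token derived from its name, and replicas are the next nodes clockwise on the ring. The class `Ring` and the node names are hypothetical.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 used here as a stand-in hash)."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor):
        # Assumption: one token per node, derived from the node name.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.rf = min(replication_factor, len(nodes))

    def replicas(self, partition_key):
        """First node at or after the key's token, then the next rf-1 clockwise."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))  # three distinct nodes, stable for this key
```

The same key always maps to the same replica set, so any node can work out where a row lives without asking a central coordinator.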

5. Write & Read Path

  • Write Path:

    • Write → Commit Log → MemTable → SSTable.
    • The commit log provides durability while data is still in memory. When a MemTable fills up, it is flushed to disk as an immutable SSTable.
  • Read Path:

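    • Query → Check MemTable → Check Bloom Filter → Check SSTables.
    • Each SSTable has a Bloom filter that indicates whether the SSTable might contain the requested partition, so SSTables that cannot contain it are skipped.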
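
The two paths above can be modeled with a toy single-node store. This is a sketch under loud assumptions: the class `TinyStore` and its `memtable_limit` parameter are hypothetical, an exact set of keys stands in for the Bloom filter (a real Bloom filter can report false positives), and reads stop at the newest SSTable that has the key instead of merging results by timestamp.

```python
class TinyStore:
    """Toy model: commit log, memtable, and immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory writes not yet flushed
        self.sstables = []        # flushed segments, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write for durability
        self.memtable[key] = value             # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the memtable is full

    def flush(self):
        # Sort by key and freeze; the key set stands in for a Bloom filter.
        data = dict(sorted(self.memtable.items()))
        self.sstables.append({"keys": set(data), "data": data})
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                   # newest data lives in memory
            return self.memtable[key]
        for sstable in reversed(self.sstables):    # then newest SSTable first
            if key in sstable["keys"]:             # filter check skips other files
                return sstable["data"][key]
        return None


store = TinyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # memtable is full, so it is flushed to an SSTable
store.write("a", 3)   # newer value for "a" stays in the memtable
print(store.read("a"), store.read("b"), store.read("z"))
```

Note that a write never modifies an existing SSTable: the newer value for "a" shadows the older one at read time.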

6. Use Cases

  • Real-time big data applications.
  • Logging and time-series data.
  • IoT sensor data storage.
  • Messaging systems, recommendation engines.

7. Comparison to RDBMS

Feature          Cassandra                        RDBMS
Schema           Flexible                         Fixed (strict schema)
Joins            Not supported                    Supported
Scalability      Horizontal                       Vertical (mostly)
ACID compliance  Row-level only (BASE otherwise)  Yes (ACID compliant)
Query Language   CQL (Cassandra Query Language)   SQL

8. Cassandra Query Language (CQL)

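  • Similar to SQL in syntax, but restricted to queries the data model can serve efficiently.
  • No joins or subqueries. Aggregates such as COUNT, MIN, MAX, SUM, and AVG exist, but they are efficient only within a single partition.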
  • Sample:
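-- user_id is the whole primary key, so it is also the partition key.
CREATE TABLE users (
   user_id UUID PRIMARY KEY,
   name TEXT,
   email TEXT
);

-- uuid() generates a random (type 4) UUID for the new row.
INSERT INTO users (user_id, name, email)
VALUES (uuid(), 'John Doe', 'john@example.com');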
