Apache Cassandra is a high-performance, highly scalable, fault-tolerant (i.e. no single point of failure), distributed post-relational database. Cassandra combines the column-family data model of Google Bigtable with the distributed, peer-to-peer design of Amazon Dynamo to handle workloads that are difficult to serve with a traditional RDBMS. On the commercial side, DataStax is the leading worldwide provider of Cassandra products, services, support, and training.
There are many technical benefits that come from using Cassandra.
Apache Cassandra is a standout among the NoSQL/post-relational database solutions on the market for many reasons. Today, major companies, educational institutions, and government agencies are using Cassandra to power key aspects of their business because of the benefits they derive from the following core features:
Massively scalable peer-to-peer architecture – Based on the best of Amazon Dynamo and Google Bigtable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability. Cassandra is widely regarded as a leading NoSQL choice for scaling comfortably to terabytes or petabytes of data.
Linear scale performance – Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
No single point of failure – Data is replicated to multiple nodes to protect from loss during node failure, and new machines can be added incrementally while online to increase the capacity and data protection of your Cassandra cluster.
Transparent fault detection and recovery – Cassandra clusters can grow into the hundreds or thousands of nodes. Because Cassandra was designed for commodity servers, machine failure is expected. Cassandra utilizes gossip protocols to detect machine failure and recover when a machine is brought back into the cluster – all without your application noticing.
Flexible, dynamic schema data modeling – Cassandra offers the organization of a traditional RDBMS table layout combined with the flexibility and power of no stringent structure requirements. This allows you to store your data as you need to without performance penalty for changes as your needs evolve. Plus, Cassandra can store structured, semi-structured, and unstructured data.
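Guaranteed data safety – Cassandra delivers very high write performance while still ensuring durability, thanks to its append-only commit log. Users no longer have to trade off durability to keep up with immense write streams. Every write is recorded in the commit log and replicated to multiple nodes, which makes data loss very unlikely. Note that with the default periodic commit log sync, a node that crashes can lose its last few seconds of writes locally, so durability in practice also depends on replication and on the commit log sync settings.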
Distributed, read/write anywhere design – Cassandra’s peer-to-peer architecture avoids the hotspots and read/write issues found in master-slave designs. This means you can have a highly distributed database (multi-geography, data center, etc.) and read or write to any node in a cluster without concern over what node is being accessed.
Tunable data consistency – Cassandra is a distributed system that can span multiple machines, multiple racks, and multiple data centers. Because you know your latency requirements across those boundaries better than anyone, Cassandra lets you choose strong consistency or varying degrees of more relaxed consistency (incorporating anti-entropy protocols). The full 'CAP' spectrum between consistency and availability is yours. Data consistency can be controlled on a per-operation basis (e.g. per INSERT or per UPDATE).
Multi-datacenter replication – Whether it’s keeping your data in multiple locations for disaster recovery scenarios or for blazing performance to keep it near your end user, Cassandra offers support for multiple data centers. Simply configure how many copies of your data you want in each data center, and Cassandra handles the rest – replicating your data for you. Cassandra is also rack-aware and can keep replicas of data stored on different physical racks, which helps ensure uptime in the case of single rack failures.
Cloud enabled – Cassandra’s architecture maximizes the benefits of running in the Cloud. Plus, Cassandra allows for hybrid data distribution where some data can be kept on premise and some in the Cloud.
Data compression – Cassandra supplies built-in data compression, with some use cases showing up to an 80% reduction in raw data footprint. Compression typically carries little or no performance penalty, and some use cases show actual read/write speedups because less physical I/O has to be managed.
CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL that mirrors SQL’s DDL, DML, and SELECT syntax. CQL greatly lessens the learning curve for those coming from RDBMS systems because they can use familiar syntax for all object creation and data access operations.
No caching layer required – Cassandra offers caching on each of its nodes. Coupled with Cassandra’s scalability characteristics, this means you can incrementally add nodes to the cluster to keep as much of your data in memory as you need. The result? There’s no need for a separate caching layer. Caching + disk persistence in one layer – ease of development, ease of operations.
No special hardware needed – Cassandra runs on commodity machines and requires no expensive or special hardware.
Incremental and elastic expansion – The Cassandra ring allows you to add nodes easily, without manually migrating data from one node to another. As a result, your Cassandra cluster can grow as you need it to, and your costs increase incrementally as your data needs demand. Simply add new nodes to the Cassandra cluster as needed.
Simple install and setup – Cassandra can be downloaded and installed in minutes, even for multi-node cluster installs.
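To make the tunable consistency feature above more concrete, here is a minimal sketch of the arithmetic usually used to reason about it. It assumes the common rule that a read is guaranteed to see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF), and that a quorum is floor(RF / 2) + 1. The function names quorum and is_strong are made up for this illustration and are not part of Cassandra.

```shell
#!/bin/sh
# quorum RF
# Prints how many replicas form a quorum: floor(RF / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# is_strong RF R W
# Succeeds when reads and writes must overlap on at least one replica,
# i.e. when R + W > RF.
is_strong() {
  [ $(( $2 + $3 )) -gt "$1" ]
}

rf=3                      # assumed replication factor for the example
q=$(quorum "$rf")

if is_strong "$rf" "$q" "$q"; then
  echo "RF=$rf: QUORUM reads + QUORUM writes ($q + $q) always overlap"
fi

if ! is_strong "$rf" 1 1; then
  echo "RF=$rf: ONE reads + ONE writes (1 + 1) may return stale data"
fi
```

The same check shows why writing at ALL and reading at ONE is also strongly consistent (1 + 3 > 3), at the cost of write availability when a replica is down.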
Debian installation instructions
Upgrade your software
sudo apt-get upgrade
sudo vi /etc/apt/sources.list
Add the following lines to your sources.list:
deb http://www.apache.org/dist/cassandra/debian 11x main
deb-src http://www.apache.org/dist/cassandra/debian 11x main
sudo apt-get update
Now you will see an error similar to this:
GPG error: http://www.apache.org unstable Release: The following signatures couldn't be
verified because the public key is not available: NO_PUBKEY F758CE318D77295D
This simply means you need to add the PUBLIC_KEY. You do that like this:
gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -
Starting with the 0.7.5 Debian package, you will also need to add public key 2B5C1B00 using the same commands as above:
gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
gpg --export --armor 2B5C1B00 | sudo apt-key add -
Run update again and install Cassandra
sudo apt-get update
sudo apt-get install cassandra
sudo service cassandra start
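If you would rather script the repository step than edit sources.list by hand, the following is a minimal sketch. The function name add_cassandra_repo is made up for this example; it appends the two repository lines for a given series (11x in the instructions above) only when they are not already present, so it is safe to run more than once. The demo writes to a scratch file instead of the real /etc/apt/sources.list.

```shell
#!/bin/sh
# add_cassandra_repo FILE SERIES
# Appends the deb and deb-src lines for SERIES (for example 11x) to FILE,
# skipping any line that is already there.
add_cassandra_repo() {
  file=$1
  series=$2
  url="http://www.apache.org/dist/cassandra/debian"
  for kind in deb deb-src; do
    line="$kind $url $series main"
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
      printf '%s\n' "$line" >> "$file"
    fi
  done
}

# Demo against a scratch file rather than the system file
demo=$(mktemp)
add_cassandra_repo "$demo" 11x
add_cassandra_repo "$demo" 11x   # second call adds nothing
cat "$demo"
rm -f "$demo"
```

To apply it to the real file, the script would need to run with root privileges and be pointed at /etc/apt/sources.list, followed by the key import and apt-get update steps described above.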
Start Cassandra from package without installation
Download the latest Cassandra version from the following URL
Start Cassandra by using the following command
Starting Cassandra involves connecting to the machine where it is installed with the proper security credentials, and invoking the cassandra executable from the installation’s binary directory. An example of starting Cassandra on Mac could be:
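As a rough sketch of what such a start command can look like, the snippet below wraps the cassandra executable in a small helper. It assumes the downloaded archive was unpacked into the directory named by CASSANDRA_HOME; both that default path and the helper name start_cassandra are placeholders chosen for this example, so adjust them to your own layout. The -f flag keeps Cassandra in the foreground, so log output stays in the terminal and Ctrl-C stops the node.

```shell
#!/bin/sh
# CASSANDRA_HOME is an assumption: wherever the archive was unpacked.
CASSANDRA_HOME="${CASSANDRA_HOME:-$HOME/apache-cassandra}"

# start_cassandra [extra args]
# Runs bin/cassandra in the foreground (-f), passing along any extra
# arguments, and fails with a message if the executable is missing.
start_cassandra() {
  bin="$CASSANDRA_HOME/bin/cassandra"
  if [ ! -x "$bin" ]; then
    echo "cassandra executable not found at $bin" >&2
    return 1
  fi
  "$bin" -f "$@"
}

# Usage (uncomment to start the node):
# start_cassandra
```

Checking for the executable first gives a clearer error than the shell's generic "not found" message when CASSANDRA_HOME points at the wrong directory.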
The basic command line interface (CLI) for logging into and executing commands against Cassandra is the cassandra-cli utility, which is found in the software installation’s bin directory.
An example of logging into a local machine’s Cassandra installation using the CLI and the default Cassandra port might be:
Welcome to the Cassandra CLI.
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] connect localhost/9160;
Connected to: "Test Cluster" on localhost/9160
[default@unknown]