
Cassandra Interview Questions


What are some common use cases for Cassandra?

Cassandra can be used in many different data management situations. Some of the most common use cases include:

  • Serving as the operational/real-time/system-of-record datastore for Web or other online applications needing around-the-clock transactional input capabilities
  • Applications needing “network independence” – systems that should not have to worry about where data physically lives. This often equates to widely dispersed applications that must serve numerous geographies with the same fast response times
  • Applications needing extreme degrees of uptime and no single point of failure
  • Retail or similar systems needing easy data elasticity, so that capacity can be added to handle peak workloads and then scaled back when user traffic subsides – all done in an online fashion
  • Write intensive applications that have to take in continuous large volumes of data (e.g. credit card systems, music download purchases, device/sensor data, Web clickstream data, archiving systems, event logging, etc.)
  • Real-time analysis of social media or similar data that requires tracking user activity, preferences, etc.
  • Systems that need to quickly analyze data, and then use the results of that analysis as input back into the real-time system. For example, a travel or retail site may need to analyze patterns on the fly to customize offers to customers in real-time.
  • Management of large data volumes (terabytes-petabytes) that must be kept online for query access and business intelligence processing
  • Caching functionality that delivers caching tier performance response times without resorting to separate caching (e.g. memcached) and database tiers
  • SaaS applications that utilize web services to connect into a distributed, yet centrally managed database, and then display results to SaaS customers
  • Cloud applications that require elastic data scale, easy deployment, and a need to grow through a data-centric scale-out architecture
  • Systems that need to store and directly deal with a combination of structured, unstructured, and semi-structured data, with a requirement for a flexible schema/data storage paradigm that allows for easy and online structure modifications

Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Cassandra does provide atomicity at the row level – all columns written in a single row insert/update succeed or fail together – but with no rollback capabilities.
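The semantics above can be illustrated with a minimal sketch in plain Python (this is a toy model, not the Cassandra API): updates to a single row apply all-or-nothing, but a multi-row batch that fails midway has no rollback for the rows already written.

```python
# Toy key-value store illustrating Cassandra-style atomicity semantics:
# updates to a single row apply all-or-nothing, but there is no rollback
# across rows. Plain Python sketch, not the Cassandra API.

class ToyStore:
    def __init__(self):
        self.rows = {}  # row key -> dict of columns

    def write_row(self, key, columns):
        """Apply all column updates to one row atomically."""
        new_row = dict(self.rows.get(key, {}))
        new_row.update(columns)
        # Single reference swap: readers see the old row or the new row,
        # never a partially updated one.
        self.rows[key] = new_row

    def write_many(self, updates):
        """Multi-row write: each row is atomic on its own, but if a later
        write fails, earlier rows stay written (no cross-row rollback)."""
        for key, columns in updates:
            self.write_row(key, columns)

store = ToyStore()
store.write_row("user:1", {"name": "Ada", "city": "London"})
store.write_row("user:1", {"city": "Paris"})  # row updated atomically
print(store.rows["user:1"])                   # {'name': 'Ada', 'city': 'Paris'}
```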

What is the difference between Cassandra and Hadoop?

The primary difference between Cassandra and Hadoop is that Cassandra targets real-time/operational data, while Hadoop is designed for batch-based analytic work.

There are many technical differences between Cassandra and Hadoop, including Cassandra’s underlying data model (based on Google’s Bigtable), its fault-tolerant peer-to-peer architecture, multi-data-center capabilities, tunable data consistency, the fact that all nodes are identical (no concept of a namenode, etc.), and much more.
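Tunable consistency is easy to reason about with the standard replica-counting rule: with replication factor N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. A minimal sketch of that arithmetic (illustrative only, not driver code):

```python
# Sketch of Cassandra's tunable-consistency arithmetic (not driver code).
# With replication factor n, a read of r replicas and a write acknowledged
# by w replicas overlap on an up-to-date replica whenever r + w > n.

def quorum(n):
    """Replicas needed for a QUORUM read or write: a majority of n."""
    return n // 2 + 1

def is_strongly_consistent(n, r, w):
    """True when reads are guaranteed to see the latest acknowledged write."""
    return r + w > n

n = 3
print(quorum(n))                                         # 2
print(is_strongly_consistent(n, quorum(n), quorum(n)))   # True: QUORUM/QUORUM
print(is_strongly_consistent(n, 1, 1))                   # False: ONE/ONE is eventual
```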

What is the difference between Cassandra and HBase?

HBase is an open-source, column-oriented data store modeled after Google Bigtable, designed to offer Bigtable-like capabilities on top of data stored in Hadoop. However, while HBase shares the Bigtable design with Cassandra, its foundational architecture is quite different.

A Cassandra cluster is much easier to set up and configure than a comparable HBase cluster. HBase’s reliance on the Hadoop namenode means there is a single point of failure in HBase, whereas in Cassandra, because all nodes are the same, no such issue exists.

In internal performance tests conducted at DataStax using the Yahoo! Cloud Serving Benchmark (YCSB), Cassandra delivered roughly 5x better write performance and 4x better read performance than HBase.

What is the difference between Cassandra and MongoDB?

MongoDB is a document-oriented database built on a master-slave/sharding architecture, designed to store and manage collections of JSON-style documents.

By contrast, Cassandra uses a peer-to-peer, write/read-anywhere architecture based on a combination of Google Bigtable and Amazon Dynamo. This allows Cassandra to avoid the complications and pitfalls of master/slave and sharding architectures. Moreover, Cassandra offers near-linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.
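The Dynamo-style, peer-to-peer design can be sketched as a consistent-hash ring (illustrative only, not Cassandra's actual partitioner): every node owns a range of the hash ring, any node can route any key, and adding a node takes over only part of one neighbour's range, which is why capacity can grow without a master or a resharding step.

```python
import bisect
import hashlib

# Sketch of the consistent-hashing ring behind a Dynamo-style peer-to-peer
# cluster (illustrative, not Cassandra's actual partitioner). Each node owns
# the ring segment ending at its token; a key belongs to the first node whose
# token follows the key's hash, wrapping around at the end of the ring.

class Ring:
    def __init__(self, nodes):
        # One token per node: (token, node name), sorted around the ring.
        self._entries = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Any node can answer this; there is no coordinating master."""
        tokens = [t for t, _ in self._entries]
        i = bisect.bisect(tokens, self._hash(key)) % len(self._entries)
        return self._entries[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic: same key -> same node
```

With one token per node, adding a node moves only the keys in the slice of the ring it takes over; all other keys keep their owner, so the cluster rebalances incrementally.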
