In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column.
In Cassandra, you can specify a data type for both the column name (called a comparator) as well as for row key and column values (called a validator).
Column and row key data in Cassandra is always stored internally as hex byte arrays, but the compartor/validators are used to verify data on insert and translate data on retrieval. In the case of comparators (column names), the comparator also determines the sort order in which columns are stored.
Cassandra comes with the following comparators and validators:
|BytesType||Bytes (no validation)|
|UTF8Type||UTF-8 encoded strings|
|LexicalUUIDType||128-bit UUID by byte value|
|TimeUUIDType||Version 1 128-bit UUID by timestamp|
|CounterColumnType*||64-bit signed integer|
* can only be used as a column validator, not valid as a row key validator or column name comparator
A simple example might be:
CREATE COLUMNFAMILY Standard1 WITH comparator_type = "UTF8Type";
In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.
When comparing Cassandra to a relational database, the column family is similar to a table in that it is a container for columns and rows. However, a column family requires a major shift in thinking for those coming from the relational world.
In a relational database, you define tables, which have defined columns. The table defines the column names and their data types, and the client application then supplies rows conforming to that schema: each row contains the same fixed set of columns.
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.
A Cassandra column family can contain regular columns (key/value pairs) or super columns. Super columns add another level of nesting to the regular column family column structure. Super columns are comprised of a (super) column name and an ordered map of sub-columns. A super column is a way to group multiple columns based on a common lookup value.
The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized view data retrieval.
Super columns should not be used when the number of sub-columns is expected to be a large number. During reads, all sub-columns of a super column must be deserialized to read a single sub-column, so performance of super columns is not optimal if there are a large number of sub-columns. Also, you cannot create a secondary index on a sub-column of a super column.