Cassandra Gift

Capture your complex relational model in Cassandra

Source Code: https://bitbucket.org/johnmpage/gift/src/main/

Alignment of Concerns

The Cassandra database achieves scale by distributing data across multiple nodes, which can be located on physically distant servers. This distributed architecture comes at a cost, however: fast, efficient queries are best achieved by limiting the number of servers consulted when responding to a data request. Naturally, a query that needs to consult only a single node will be the most efficient. Every table in Cassandra supports multiple keys. The key designated as the “partition” key on a Cassandra table determines which node a row of data is stored on.

If we can organize our data model around the partition key, we can ensure data is returned quickly and efficiently. Gift seeks to simplify this process, making a complex data model possible through a simple set of data annotations.
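To make the role of the partition key concrete, here is a minimal Java sketch of a key deterministically selecting a node. It is a deliberate simplification: Cassandra hashes the key to a token and places it on a token ring, whereas this sketch just takes a hash modulo an assumed list of nodes. The names PartitionSketch and nodeFor are hypothetical and not part of Cassandra or Gift.

```java
import java.util.List;

// Simplified model: NOT Cassandra's actual token-ring placement.
final class PartitionSketch {
    private final List<String> nodes;

    PartitionSketch(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    // Every row sharing a partition key resolves to the same node.
    String nodeFor(String partitionKey) {
        int index = Math.floorMod(partitionKey.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        PartitionSketch cluster =
            new PartitionSketch(List.of("node-a", "node-b", "node-c"));
        // Both lookups for customer "cus1" print the same node name.
        System.out.println(cluster.nodeFor("cus1"));
        System.out.println(cluster.nodeFor("cus1"));
    }
}
```

Because the mapping depends only on the key, a query restricted to one partition key never needs to consult a second node.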

One constraint is placed upon the model to achieve efficient retrieval of your relational data from Cassandra. A “root” data object must be carefully chosen to organize the data model. Beneath this root object, a data model of surprising complexity can be supported and queried efficiently.

Gift makes this simple. Using several Hibernate-like annotations, it automatically maps a relational schema onto your Cassandra database. The developer has only to write a set of data classes that implement the data model. Once the relationships between the classes have been annotated in the class definitions, Gift can automatically assemble the object tree, including nested many-to-one relationships, from the results of a Cassandra Query Language (CQL) query.

Single Table Schema

How does Gift accomplish this? Cassandra does NOT support JOINs between tables, the typical means of mapping one-to-many relationships in a relational database. Cassandra does, however, support multiple indexed keys. Gift leverages these multiple keys, together with the efficient storage of null values.

By storing multiple data types in a single table, we leverage Cassandra's capability to support a large number of row indexes. One dedicated column identifies the type of each row, while the other indexed columns act as foreign keys that associate individual rows with one another. These keys can be empty if there is no relationship. The only required column is the key associating the row with the root entity.

How does the data in the table look? Below we see an example of a hypothetical dataset for a pet store. Because this dataset supports a Customer-based interface, every row includes a key that associates the row with a particular Customer. In this simplified example, we can see that Customer Smith has one address in Boston. Order 123 is associated with Customer Smith and includes one Cat.

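+-------------+------------+----------+-----------+-----------+-----------+--------+----------+------+
| Customer Id | Address Id | Order Id | Animal Id | Data Type | Last Name | City   | Order No | Type |
+-------------+------------+----------+-----------+-----------+-----------+--------+----------+------+
| cus1        |            |          |           | Customer  | Smith     |        |          |      |
| cus1        | add1       |          |           | Address   |           | Boston |          |      |
| cus1        |            | ord1     |           | Order     |           |        | 123      |      |
| cus1        |            | ord1     | ani1      | Animal    |           |        |          | Cat  |
+-------------+------------+----------+-----------+-----------+-----------+--------+----------+------+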

The secondary indexes act both as unique identifiers for data types and as foreign keys, mapping relationships between entities.

The only restriction is that every row must be a child of the root entity.
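As a rough illustration of how flat rows like these can be turned back into an object tree, the Java sketch below dispatches on the data type column and uses the key columns as foreign keys. It is a hand-rolled approximation written for this example, not Gift's implementation. The class names, the map-based row format and the column names (customerId, dataType and so on) are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-written version of the grouping that Gift automates.
final class SingleTableSketch {

    static final class Order {
        final String orderNo;
        final List<String> animals = new ArrayList<>();
        Order(String orderNo) { this.orderNo = orderNo; }
    }

    static final class Customer {
        String lastName;
        final List<String> cities = new ArrayList<>();
        final Map<String, Order> orders = new LinkedHashMap<>();
    }

    // Builds one row from alternating column-name/value pairs.
    static Map<String, String> row(String... pairs) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            r.put(pairs[i], pairs[i + 1]);
        }
        return r;
    }

    // The four pet store rows; empty columns are simply absent from the map.
    static List<Map<String, String>> sampleRows() {
        return List.of(
            row("customerId", "cus1", "dataType", "Customer", "lastName", "Smith"),
            row("customerId", "cus1", "addressId", "add1", "dataType", "Address", "city", "Boston"),
            row("customerId", "cus1", "orderId", "ord1", "dataType", "Order", "orderNo", "123"),
            row("customerId", "cus1", "orderId", "ord1", "animalId", "ani1",
                "dataType", "Animal", "type", "Cat"));
    }

    static Customer assemble(List<Map<String, String>> rows) {
        Customer customer = new Customer();
        // First pass: the root entity and its direct children.
        for (Map<String, String> r : rows) {
            switch (r.get("dataType")) {
                case "Customer": customer.lastName = r.get("lastName"); break;
                case "Address": customer.cities.add(r.get("city")); break;
                case "Order":
                    customer.orders.put(r.get("orderId"), new Order(r.get("orderNo")));
                    break;
                default: break;
            }
        }
        // Second pass: grandchildren, attached to their parent through the order key.
        for (Map<String, String> r : rows) {
            if ("Animal".equals(r.get("dataType"))) {
                Order parent = customer.orders.get(r.get("orderId"));
                if (parent != null) {
                    parent.animals.add(r.get("type"));
                }
            }
        }
        return customer;
    }
}
```

Calling SingleTableSketch.assemble(SingleTableSketch.sampleRows()) yields a Customer named Smith with one city, Boston, and one Order, 123, holding one Cat.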

Usage

Gift provides a simple way to build a relational data model. To generate the pet store model described above, create four data classes, one for each data type in the table.

The Customer class can be defined as follows:

import net.johnpage.cassandra.gift.annotations.*;

@Root
public class Customer {
    public Customer(){}
    @Id
    @ClusterKey
    @Column(value = "cstId")
    public String customerId;
    @Column(value = "cstlastname")
    public String lastname="";
}

The @Root annotation establishes this class as the root class of the data model. The key for this data type is designated by the @Id annotation. The Cassandra column name is specified with the @Column annotation. In this example we chose to put a three-letter prefix, “cst”, in front of all the columns associated with the Customer. The last name column is defined as “cstlastname”.
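For readers curious how an annotation like @Column can drive a mapping, the Java sketch below declares a stand-in annotation and reads it back through reflection. This is an assumption about the general technique, not a description of Gift's internals. ColumnName, CustomerRecord and columnsOf are hypothetical names invented for this example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

final class ColumnMappingSketch {

    // Hypothetical stand-in for an annotation like Gift's @Column.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ColumnName {
        String value();
    }

    // A data class annotated the same way as the Customer example.
    static final class CustomerRecord {
        @ColumnName("cstId")
        public String customerId;

        @ColumnName("cstlastname")
        public String lastname = "";
    }

    // Maps each annotated field name to the column name declared on it.
    static Map<String, String> columnsOf(Class<?> type) {
        Map<String, String> columns = new TreeMap<>();
        for (Field field : type.getDeclaredFields()) {
            ColumnName column = field.getAnnotation(ColumnName.class);
            if (column != null) {
                columns.put(field.getName(), column.value());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(columnsOf(CustomerRecord.class));
    }
}
```

The runtime retention policy is what makes the annotation visible to reflection; without it, getAnnotation would return null.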

We define the first child of the Customer entity by first adding the child as a property of Customer.

import java.util.LinkedList;
import java.util.List;

@Root
public class Customer {
    public Customer(){}
    @Id
    @ClusterKey
    @Column(value = "cstId")
    public String customerId;
    @Column(value = "cstlastname")
    public String lastname="";
    @ChildCollection(childClass = Order.class)
    public List<Order> orderList = new LinkedList<>();
}

The Order class is as follows:

public class Order {
    public Order(){}
    @Id
    @ClusterKey
    @Column(value = "ordId")
    public String orderId;
    @Column(value = "cstid")
    @ParentKey(parentClass = Customer.class)
    public String customerId;
}

Here the unique id for Orders is annotated with @Id. Every class needs to identify the root class it is associated with. In Order, we add "customerId" to maintain the relationship to the root class, while "orderId" provides a unique identifier for the class.

This child of Customer can in turn have children of its own. For example, here we add a list of Animals to Order:

@ChildCollection(childClass = Animal.class)
public List<Animal> animalList = new LinkedList<>();

The Animal class is defined as follows:

public class Animal {

    public Animal(){}

    @Id
    @ClusterKey
    @Column(value="anmid")
    public String animalId;

    @ParentKey(parentClass = Order.class)
    @Column(value = "ordid")
    public String orderId;
}

Once your dataset has been assembled in your business code, it can be saved and/or updated with a single line of code:

CassandraClient.insert(customer);

To query the database, a single call will suffice:

String query = "SELECT * FROM customer WHERE clsid='c1'";
Customer customer = CassandraClient.query(query, Customer.class);

Currently, some knowledge of the schema is required to query the database, but a simple query like the one above quickly returns the complete Customer record for one customer, including their Address and all of their Orders.

Because the root Id also serves as the partition key, the query visits only one node and is fast and efficient. The framework handles all the keys seamlessly under the hood. The developer can begin using the Customer immediately after running the query.

Process Threads in Apache Tomcat vs AWS Lambda

Comparing Apache Tomcat threading to AWS Lambda, two differences stand out:

  • Apache Tomcat handles concurrent requests internally with a multi-threaded Java Virtual Machine (JVM). An AWS Lambda instance does NOT process requests concurrently: each instance handles one request at a time, so concurrent requests are handled by multiple Lambda instances.
  • Scaling with Apache Tomcat is achieved with multi-threading and by load-balancing additional servers. AWS Lambda scales by provisioning additional instances as needed.
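The two models can be sketched in plain Java. This is a toy model under stated assumptions: no Tomcat or AWS APIs are involved, and the names ConcurrencySketch, handleWithThreadPool and LambdaFleet are hypothetical. The first method shares a bounded pool of worker threads inside one JVM. The nested class counts how many single-request instances a platform would need to provision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model of the two concurrency styles; not real Tomcat or Lambda code.
final class ConcurrencySketch {

    // Tomcat style: one process, a bounded pool of threads shared by all requests.
    static List<String> handleWithThreadPool(List<String> requests, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String request : requests) {
                tasks.add(() -> "handled " + request);
            }
            List<String> responses = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                responses.add(future.get());
            }
            return responses;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Lambda style: each instance serves one request at a time.
    static final class LambdaFleet {
        private int idle = 0;
        private int total = 0;

        // Reuses an idle instance if one exists, otherwise provisions a new one.
        void startRequest() {
            if (idle > 0) {
                idle--;
            } else {
                total++;
            }
        }

        void finishRequest() {
            idle++;
        }

        int instanceCount() {
            return total;
        }
    }
}
```

With three overlapping requests, the thread pool still runs inside a single process, while the fleet grows to three instances and reuses them once requests finish.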

SOLR indexes tend to be larger than the documents they index.

Examining the relative size of a data store and the size of the SOLR index of that data, one finds the size of the index is usually larger than the data indexed. This may seem counter-intuitive at first, but it actually makes perfect sense.

In order to understand why, it’s helpful to create a simplified version of a document store and an index. Consider the following collection of 7 documents. Each row consists of a reference number and a document. This data store is 91 bytes in size.

[0,"ABC DEF"]
[1,"AB DEF"]
[2,"AC DEF"]
[3,"BC DEF"]
[4,"ABC DE"]
[5,"ABC EF"]
[6,"ABC DF"]

The index is intended to provide a quick lookup by letter combination. Instead of having to scan all the documents to identify those containing a given text fragment, we simply look up the fragment and receive a list of the documents containing that character sequence.

"A"=[0,1,2,4,5,6]
"AB"=[0,1,4,5,6]
"ABC"=[0,4,5,6]
"AC"=[2]
"B"=[0,1,3,4,5,6]
"BC"=[0,3,4,5,6]
"C"=[0,2,3,4,5,6]
"D"=[0,1,2,3,4,6]
"DE"=[0,1,2,3,4]
"DEF"=[0,1,2,3]
"DF"=[6]
"E"=[0,1,2,3,4,5]
"EF"=[0,1,2,3,5]
"F"=[0,1,2,3,5,6]

The index is 225 bytes. That’s more than twice the size of our document store.
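The two sizes can be reproduced with a short Java sketch. It assumes one byte per character, one entry per line with newline separators, and an index keyed on every substring of every word. It is a toy model of the listing above and says nothing about how SOLR actually stores its index. The names IndexSizeSketch, buildIndex, renderStore and renderIndex are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Toy substring index; real search indexes are built and stored very differently.
final class IndexSizeSketch {

    static final List<String> DOCS = List.of(
        "ABC DEF", "AB DEF", "AC DEF", "BC DEF", "ABC DE", "ABC EF", "ABC DF");

    // Maps every substring of every word to the ids of the documents containing it.
    static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).split(" ")) {
                for (int start = 0; start < word.length(); start++) {
                    for (int end = start + 1; end <= word.length(); end++) {
                        List<Integer> ids = index.computeIfAbsent(
                            word.substring(start, end), k -> new ArrayList<>());
                        if (!ids.contains(id)) {
                            ids.add(id);
                        }
                    }
                }
            }
        }
        return index;
    }

    // Renders the store one document per line, e.g. [0,"ABC DEF"].
    static String renderStore(List<String> docs) {
        StringJoiner lines = new StringJoiner("\n");
        for (int id = 0; id < docs.size(); id++) {
            lines.add("[" + id + ",\"" + docs.get(id) + "\"]");
        }
        return lines.toString();
    }

    // Renders the index one entry per line, e.g. "AC"=[2].
    static String renderIndex(Map<String, List<Integer>> index) {
        StringJoiner lines = new StringJoiner("\n");
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            StringJoiner ids = new StringJoiner(",", "[", "]");
            for (int id : e.getValue()) {
                ids.add(Integer.toString(id));
            }
            lines.add("\"" + e.getKey() + "\"=" + ids);
        }
        return lines.toString();
    }

    public static void main(String[] args) {
        System.out.println("store bytes: " + renderStore(DOCS).length()); // 91
        System.out.println("index bytes: " + renderIndex(buildIndex(DOCS)).length()); // 225
    }
}
```

Each document contributes a single line to the store, but each of its substrings contributes a document id to the index, which is why the index grows faster than the data it describes.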

Weirdness when every function returns a Column: Chained when (Spark)

When calls to when are chained, the chain stops at the first test that returns true.

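import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

val isTrue = lit(true)

// Every condition is true, so each branch is a candidate.
def getWithChainedWhen(): Column = {
  when(isTrue, "1st")
    .when(isTrue, "2nd")
    .when(isTrue, "3rd")
}

// Assumes a spark-shell session, where sc and the toDF implicits are in scope.
sc.parallelize(List("A"))
  .toDF("a")
  .withColumn("chained", getWithChainedWhen())
  .show(false)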

The result of running the above code is as follows:

+---+-------+
|a  |chained|
+---+-------+
|A  |1st    |
+---+-------+

Only the first when takes effect. This, it could be argued, is logically inconsistent. Each chained when is a method called on a particular Column: the Column returned by the previous when. One could therefore expect the output of the 1st when to be evaluated by the 2nd when, returning the 2nd value, and so on until we reach the 3rd value.

The following code explicitly describes these logical forks and returns the same result.

def getWithChainedWhen():Column = {
  when(isTrue,"1st").otherwise(
    when(isTrue,"2nd").otherwise(
      when(isTrue,"3rd")))
}

This is logically consistent and makes it easy to anticipate the outcome of the function, but creates a deeply nested code block.
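The behaviour of both versions boils down to "the first true branch wins". The plain Java model below captures that rule without depending on Spark. It is only an analogy for the semantics described here: WhenChain, andWhen and evaluate are hypothetical names, and the conditions are ordinary booleans rather than Columns.

```java
import java.util.ArrayList;
import java.util.List;

// Plain Java model of first-match-wins branching; it does not use Spark.
final class WhenChain {
    private final List<Boolean> conditions = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    private String fallback = null;

    static WhenChain when(boolean condition, String value) {
        return new WhenChain().andWhen(condition, value);
    }

    // Appends another branch; earlier branches keep priority.
    WhenChain andWhen(boolean condition, String value) {
        conditions.add(condition);
        values.add(value);
        return this;
    }

    WhenChain otherwise(String value) {
        fallback = value;
        return this;
    }

    // Returns the value of the first true branch, else the fallback (null if unset).
    String evaluate() {
        for (int i = 0; i < conditions.size(); i++) {
            if (conditions.get(i)) {
                return values.get(i);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // All three conditions are true, yet only the first value is returned.
        System.out.println(
            when(true, "1st").andWhen(true, "2nd").andWhen(true, "3rd").evaluate());
    }
}
```

Seen this way, each additional branch only matters when every branch before it was false, which is exactly what the nested otherwise version spells out.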

The first strategy, chaining whens, is syntactically simpler and generates more compact source code.

Moral: For explicit logic, use when with otherwise. For succinct code, use chained whens.

This post was edited in 2025 to acknowledge the merits of chaining whens.