Cassandra Gift

Capture your complex relational model in Cassandra

Source Code: https://bitbucket.org/johnmpage/gift/src/main/

Alignment of Concerns

The Cassandra database achieves scale by distributing data across multiple nodes. These nodes can be located on physically distant servers. This distributed architecture comes at a cost, however: fast, efficient queries are best achieved by limiting the number of servers consulted when responding to a data request. Naturally, a query that needs to consult only a single server or node will be the most efficient. Every table in Cassandra supports multiple keys, and the key designated as the “partition” key on a Cassandra table determines which node a row of data is stored on.

If we can organize our data model around the partition key, we can ensure data is returned quickly and efficiently. Gift seeks to simplify this process and make a complex data model possible through a simple set of data annotations.

One constraint is placed upon the model to achieve efficient retrieval of your relational data from Cassandra. A “root” data object must be carefully chosen to organize the data model. Beneath this root object, a data model of surprising complexity can be supported and queried efficiently.

Gift makes this simple. It automatically maps a relational schema into your Cassandra database using several Hibernate-like annotations. The developer has only to write a set of data classes that implement the data model. Once the relationships between the classes have been marked with the appropriate annotations, Gift can automatically assemble the object tree, including nested one-to-many relationships, from the results of a Cassandra Query Language (CQL) query.

Single Table Schema

How does Gift accomplish this? Cassandra does NOT support JOINs between tables, the typical means of mapping out one-to-many relationships in a relational database. Cassandra does, however, support multiple indexed keys. Gift leverages these multiple keys and the efficient storage of null values.

By storing multiple data types in a single table, we leverage Cassandra's capability to support a large number of row indexes. One dedicated column uniquely identifies each row's type, while the other indexed columns act as foreign keys, providing associations between individual rows. These keys can be empty if there is no relationship. The only required key is the one associating the row with the root entity.

How does the data in the table look? Below we see an example of a hypothetical dataset for a pet store. Because this dataset supports a Customer-based interface, every row includes a key that associates the row with a particular Customer. In this simplified example, we can see that Customer Smith has one address in Boston. Order 123 is associated with Customer Smith and includes one Cat.

Customer Id | Address Id | Order Id | Animal Id | Data Type | Last Name | City   | Order No | Type
cus1        |            |          |           | Customer  | Smith     |        |          |
cus1        | add1       |          |           | Address   |           | Boston |          |
cus1        |            | ord1     |           | Order     |           |        | 123      |
cus1        |            | ord1     | ani1      | Animal    |           |        |          | Cat

The secondary indexes act both as unique identifiers for the data types and as foreign keys, mapping the relationships between entities.

The only restriction is that every row must be a child of the root entity.
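In CQL, a single-table schema of this kind might be sketched as follows. This is a hypothetical illustration, not Gift's actual DDL: the table and column names are taken from the pet store example above, and the customer id serves as the partition key so that every row belonging to one customer lives on the same node.

```sql
-- Hypothetical single-table schema for the pet store example.
-- cstid is the partition key: all rows for one customer share a node.
CREATE TABLE customer (
    cstid       text,   -- key of the root (Customer) entity, required on every row
    addid       text,   -- Address key, empty on non-Address rows
    ordid       text,   -- Order key, also the foreign key on Animal rows
    anmid       text,   -- Animal key
    datatype    text,   -- identifies the row type: Customer, Address, Order, Animal
    cstlastname text,
    addcity     text,
    ordno       text,
    anmtype     text,
    PRIMARY KEY ((cstid), datatype, addid, ordid, anmid)
);
```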

Usage

Gift provides a simple way to build a relational data model. To generate the pet store model described above, we create four data classes: Customer, Address, Order, and Animal.

The Customer class can be defined as follows:

import net.johnpage.cassandra.gift.annotations.*;

@Root
public class Customer {
    public Customer(){}
    @Id
    @ClusterKey
    @Column(value = "cstId")
    public String customerId;
    @Column(value = "cstlastname")
    public String lastname="";
}

The @Root annotation establishes this class as the root class of data. The key for this data type is designated by the @Id annotation. The Cassandra column name is specified with the @Column annotation. In this example we chose to use a three letter prefix “cst” in front of all the columns associated with the Customer. The last name column is defined as “cstlastname“.

We define the first child of the Customer entity by first adding the child as a property of Customer.

@Root
public class Customer {
    public Customer(){}
    @Id
    @ClusterKey
    @Column(value = "cstId")
    public String customerId;
    @Column(value = "cstlastname")
    public String lastname="";
    @ChildCollection(childClass = Order.class)
    public List<Order> orderList = new LinkedList<>();
}

The Order class is as follows:

public class Order {
    public Order(){}
    @Id
    @ClusterKey
    @Column(value = "ordId")
    public String orderId;
    @Column(value = "cstid")
    @ParentKey(parentClass = Customer.class)
    public String customerId;
}

Here the unique id for Orders is annotated with @Id. Every child class needs to identify the parent class it is associated with. In Order, we add the customerId field, annotated with @ParentKey, to maintain the relationship to the root class; orderId provides a unique identifier for the Order.

This child of Customer can in turn have children of its own. Here, for example, we define a list of Animals within Order:

@ChildCollection(childClass = Animal.class)
public List<Animal> animalList = new LinkedList<>();

The Animal class is defined as follows:

public class Animal {

    public Animal(){}

    @Id
    @ClusterKey
    @Column(value="anmid")
    public String animalId;

    @ParentKey(parentClass = Order.class)
    @Column(value = "ordid")
    public String orderId;
}

Once your dataset has been assembled in your business code, it can be saved and/or updated with a single line of code:

CassandraClient.insert(customer);
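For illustration, here is one way the pet store object tree from the sections above might be assembled before that call. This is a sketch using plain POJOs: the Gift annotations are omitted so the example stands alone, and the CassandraClient.insert call is shown commented out since it requires a configured Cassandra connection.

```java
import java.util.LinkedList;
import java.util.List;

// Plain versions of the article's data classes (Gift annotations omitted here).
class Customer {
    public String customerId;
    public String lastname = "";
    public List<Order> orderList = new LinkedList<>();
}

class Order {
    public String orderId;
    public String customerId;
    public List<Animal> animalList = new LinkedList<>();
}

class Animal {
    public String animalId;
    public String orderId;
}

public class AssembleTree {
    public static Customer buildSmith() {
        Customer customer = new Customer();
        customer.customerId = "cus1";
        customer.lastname = "Smith";

        Order order = new Order();
        order.orderId = "ord1";
        order.customerId = customer.customerId;   // the @ParentKey link back to the root

        Animal cat = new Animal();
        cat.animalId = "ani1";
        cat.orderId = order.orderId;              // the @ParentKey link back to the Order

        order.animalList.add(cat);
        customer.orderList.add(order);
        return customer;
    }

    public static void main(String[] args) {
        Customer customer = buildSmith();
        // With the annotated classes in place, the whole tree persists in one call:
        // CassandraClient.insert(customer);
        System.out.println(customer.orderList.get(0).animalList.get(0).animalId); // ani1
    }
}
```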

To query the database, a single line will suffice:

String query = "SELECT * FROM customer WHERE cstid='cus1'";
Customer customer = CassandraClient.query(query, Customer.class);

Currently some knowledge of the schema is required to query the database, but a simple query like the one above quickly returns the complete Customer record for one customer, including their Address and all of their Orders.

Because the root id also serves as the partition key, the query visits only one node, making it fast and efficient. The framework handles all the keys seamlessly under the hood. The developer can begin using the Customer object immediately after running the query.

Process Threads in Apache Tomcat vs AWS Lambda

Comparing Apache Tomcat's threading model to AWS Lambda's, several points stand out:

  • Tomcat handles concurrent requests internally with a multi-threaded Java Virtual Machine (JVM). An AWS Lambda instance, by contrast, serves one request at a time; concurrent requests are handled by running multiple Lambda instances.
  • Scaling with Apache Tomcat is achieved with multi-threading and by load-balancing additional servers. AWS Lambda scales by provisioning additional instances as needed.

SOLR indexes tend to be larger than the documents they index.

Comparing the size of a data store with the size of the SOLR index of that data, one finds that the index is usually larger than the data it indexes. This may seem counter-intuitive at first, but it actually makes perfect sense.

In order to understand why, it’s helpful to create a simplified version of a document store and an index. Consider the following collection of 7 documents. Each row consists of a reference number and a document. This data store is 91 bytes in size.

[0,"ABC DEF"]
[1,"AB DEF"]
[2,"AC DEF"]
[3,"BC DEF"]
[4,"ABC DE"]
[5,"ABC EF"]
[6,"ABC DF"]

The index is intended to provide a quick lookup by letter combination. Instead of having to scan all the documents to identify those containing a given text fragment, we simply look up the fragment and receive a list of the documents that contain the character sequence.

"A"=[0,1,2,4,5,6]
"AB"=[0,1,4,5,6]
"ABC"=[0,4,5,6]
"AC"=[2]
"B"=[0,1,3,4,5,6]
"BC"=[0,3,4,5,6]
"C"=[0,2,3,4,5,6]
"D"=[0,1,2,3,4,6]
"DE"=[0,1,2,3,4]
"DEF"=[0,1,2,3]
"DF"=[6]
"E"=[0,1,2,3,4,5]
"EF"=[0,1,2,3,5]
"F"=[0,1,2,3,5,6]

The index is 225 bytes. That’s more than twice the size of our document store.
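The figures above can be reproduced with a short sketch. The index here is simply every contiguous substring of every word, mapped to the ids of the documents containing it; serializing both structures one entry per line, exactly as in the listings above, gives the 91- and 225-byte counts. The class and method names are mine, for illustration only.

```java
import java.util.*;
import java.util.stream.*;

public class IndexSize {
    static final String[] DOCS = {
        "ABC DEF", "AB DEF", "AC DEF", "BC DEF", "ABC DE", "ABC EF", "ABC DF"
    };

    // Map every contiguous substring of every word to the ids of the
    // documents containing it, in sorted order.
    static SortedMap<String, List<Integer>> buildIndex(String[] docs) {
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            Set<String> fragments = new TreeSet<>();
            for (String word : docs[id].split(" ")) {
                for (int i = 0; i < word.length(); i++) {
                    for (int j = i + 1; j <= word.length(); j++) {
                        fragments.add(word.substring(i, j));
                    }
                }
            }
            for (String fragment : fragments) {
                index.computeIfAbsent(fragment, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    // Serialize one [id,"text"] entry per line, as listed above, and count bytes.
    static int storeSize(String[] docs) {
        return IntStream.range(0, docs.length)
            .mapToObj(i -> "[" + i + ",\"" + docs[i] + "\"]")
            .collect(Collectors.joining("\n"))
            .length();
    }

    // Serialize one "fragment"=[ids] entry per line and count bytes.
    static int indexSize(SortedMap<String, List<Integer>> index) {
        return index.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\"=" + e.getValue().toString().replace(" ", ""))
            .collect(Collectors.joining("\n"))
            .length();
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> index = buildIndex(DOCS);
        System.out.println("store: " + storeSize(DOCS) + " bytes");  // store: 91 bytes
        System.out.println("index: " + indexSize(index) + " bytes"); // index: 225 bytes
    }
}
```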

Weirdness when every function returns a Column: Chained when (Spark)

When when is chained, evaluation stops at the first test that returns true.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

val isTrue = lit(true)

def getWithChainedWhen(): Column =
  when(isTrue, "1st")
    .when(isTrue, "2nd")
    .when(isTrue, "3rd")

sc.parallelize(List[String]("A"))
  .toDF("a")
  .withColumn("chained", getWithChainedWhen())
  .show(false)

The results of running the above code is as follows:

+---+-------+
|a  |chained|
+---+-------+
|A  |1st    |
+---+-------+

Only the first when is evaluated. This, it could be argued, is logically inconsistent. Each subsequent when is called on the Column returned by the previous when, so one might expect the output of the first when to be fed into the second, the second's value to be returned, and so on until we reach the third value.

The following code explicitly describes these logical forks and returns the same result.

def getWithChainedWhen(): Column =
  when(isTrue, "1st").otherwise(
    when(isTrue, "2nd").otherwise(
      when(isTrue, "3rd")))

This is logically consistent and makes it easy to anticipate the outcome of the function, but creates a deeply nested code block.

With the first strategy, chained “when”s, the function is syntactically simpler and the source code more compact.

Moral: For explicit logic, use when with otherwise. For succinct code, use chained whens.

This post was edited in 2025 to acknowledge the merits of chaining whens.

Stream IIS logs to Kafka

Introducing KafkaTailer

Kafka is a game-changer. As a powerful, centralized messaging tool, it performs extraordinarily well compared to other messaging applications. Long popular in the JVM and *nix realms, it is now possible to add your favorite Microsoft IIS application to your streaming pipeline. Using the best open-source libraries available, KafkaTailer can stream your IIS logs to any Kafka topic.

The flexibility of this tool comes from the simplicity of its approach: it simply tails standard log files. Combining the Apache Commons IO Tailer, the latest Kafka Producer, and the Apache Commons Daemon, KafkaTailer watches your IIS log directory and sends the log messages up to a Kafka server. Within minutes, it’s possible to start sending your IIS logs out to Kafka.
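The core tail-and-forward loop can be sketched with nothing but the JDK. This is a simplified stand-in for what the Commons IO Tailer does, with the Kafka producer's send abstracted into a plain Consumer so the sketch stands alone; in KafkaTailer itself, each line is handed to a real Kafka Producer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TailSketch {
    // Read any lines appended since `offset`, hand each to the sink, and
    // return the new offset. In KafkaTailer the sink would wrap a Kafka
    // producer send; here it is abstracted so the sketch is self-contained.
    static long forwardNewLines(Path log, long offset, Consumer<String> sink) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null) {
                sink.accept(line);
            }
            return raf.getFilePointer();
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("iis", ".log");
        Files.writeString(log, "GET /index.html 200\n");

        List<String> sent = new ArrayList<>();
        long offset = forwardNewLines(log, 0, sent::add);   // picks up the first line

        Files.writeString(log, "GET /cart 404\n", StandardOpenOption.APPEND);
        forwardNewLines(log, offset, sent::add);            // picks up only the new line

        System.out.println(sent); // [GET /index.html 200, GET /cart 404]
    }
}
```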

Quick Start

  1. Open up the administration console of your IIS instance.
  2. Configure the logs that interest you to log to a dedicated directory.
  3. If you don’t have a Java Virtual Machine on your Windows machine, you will need to install it.
  4. Setup a Kafka instance to publish your logs to, if you don’t have one already.
  5. Go to https://github.com/johnmpage/KafkaTailer. Read the summary.
  6. Download the latest release of KafkaTailer (currently v2.1).
  7. Configure the Kafka Producer with a kafka-producer.properties file. The minimal set of values would be as follows:
    bootstrap.servers=127.0.0.1:9092
    value.serializer=org.apache.kafka.common.serialization.StringSerializer
    key.serializer=org.apache.kafka.common.serialization.StringSerializer
    
  8. Open up a command prompt and type the following command, substituting values that reflect your environment as needed:
    java -classpath kafka-tailer-2.1-jar-with-dependencies.jar net.johnpage.kafka.KafkaTailer directoryPath=C:\\iis-logs\\W3SVC1\\ producerPropertiesPath=C:\\iis-logs\\kafka-producer.properties kafkaTopic=a-topic
  9. Open your browser and navigate to your IIS website.
  10. KafkaTailer reports its operations in the command prompt. Confirm that it has started up successfully and no exceptions are being thrown.
  11. Monitor your Kafka topic and verify that log messages are being added to it.

Once the basic setup is working, you will probably want to configure your Kafka Producer to use SSL, refine which fields the IIS logs make use of, and run KafkaTailer as a Windows Service.
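For reference, enabling SSL on the producer is a matter of a few additional entries in kafka-producer.properties. The keys below are standard Kafka producer settings; the paths and passwords are placeholders you would replace with your own.

```properties
security.protocol=SSL
ssl.truststore.location=C:\\certs\\kafka.client.truststore.jks
ssl.truststore.password=<your-truststore-password>
# Only needed if the broker requires client authentication:
ssl.keystore.location=C:\\certs\\kafka.client.keystore.jks
ssl.keystore.password=<your-keystore-password>
```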

Setting up a Windows Service

Included in the KafkaTailer project is the skeleton of a Microsoft Windows Service.  If you’ve run Tomcat on Windows, the GUI will be familiar to you.  Apache Tomcat uses the same Daemon project as KafkaTailer does.

The Windows service is included in the winsrvc directory. The install.bat script and the kafka-producer.properties file will require customization to reflect your environment.

Note

Please report any bugs or issues!


Why We Tag


Alternate Title: The Linchpin of Safe Patch Releases

Some would argue that Patch Releases to Production are inherently risky. In fact, with the right approach, the risk involved in a Patch Release can be small. The key to managing this risk is Continuous Integration and a disciplined release process.

When a bug is discovered in code that was released 2 or 3 weeks ago, developers may have already begun work on the next big release. Ambitious new features that touch sensitive business logic may be partially implemented. Developers sometimes struggle to navigate this moment when there seems to be no firm ground to stand on. If they release the code in the state it is in, they have untested changes and stand a good chance of introducing new bugs into the Production system.

What is required is a “snapshot” of the code exactly as it appeared the last time it was thoroughly tested and reviewed… that is… the last time it was released. If the bug fix can be applied to well-tested code, the risks of introducing new bugs are greatly reduced.
Why We Tag Releases
In the diagram, we can see how tagged code makes patch releases fast and low-risk. In the first section, the normal loop of development occurs. The source code is in a state of flux. Developers are making changes and releasing to the shared server, where the work is visible to business stakeholders. Product owners, testers, and stakeholders review the work, provide their thoughts, and identify issues.

When the new features and improvements are completed to the developers’ satisfaction, a release is prepared. The release is tagged. Tagging a release takes minutes, but it is crucial to managing risk: it marks the code in time. The release that goes into production is built from this snapshot.
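In git, tagging at release time is a two-command affair. The tag name and remote below are illustrative; the annotated tag is the snapshot a future patch branch starts from.

```shell
# Mark the exact commit that was built, tested, and released:
git tag -a v1.4.0 -m "Release 1.4.0"
git push origin v1.4.0

# Weeks later, start the patch from that snapshot, not from trunk:
git checkout -b patch/v1.4.1 v1.4.0
```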

This tested, reviewed, and approved code is released. Once again, new features begin to take shape. The code base is in flux. Two weeks into this new effort, a bug report comes in. The schema changes have had an unexpected effect on another feature. An important client is unhappy. Business raises the issue to the highest level. This bug must be patched.

The developers can proceed calmly and with confidence. In seconds they can pull down a snapshot of the code exactly as it looked when it was released. The bug was missed, but the fix is simple. Testing is completed in a short time; only the places that might be influenced by the one new line of code need to be re-checked. The team knows they are working with a body of source code that was tested thoroughly during the last major release.

The patch can go out quickly. The developers are relaxed. Business is ecstatic.