As a distributed NoSQL database management system, Cassandra is designed to manage large volumes of data across multiple servers, with high availability and no single point of failure. By comparison, in Greek mythology, Princess Cassandra was fated to make accurate prophecies that nobody believed. But fear not: by applying the following design best practices, you can establish your Cassandra database as the trusted go-to resource for applications as diverse and demanding as Internet of Things data analytics, investment management, and fraud detection.
Start Your Design with Your Queries
If there is one place to start in designing a Cassandra database, it is the output you want from it and therefore the queries you will make to it. This is a significant difference in approach compared to other databases, for which design starts with data relationships or objects. These items have their place in Cassandra design too, but they come afterwards. First, define your queries, then build tables that pre-construct the answers to those queries. This is a critical element in achieving high performance with Cassandra. The result is typically one table for each query pattern, and therefore multiple tables for multiple query patterns.
De-normalize and Duplicate Data as You Desire
Answers that have been pre-built to queries should be available in the blink of an eye, without searching several tables or partitions. Consequently, the same data should be written to all rows that need it to rapidly satisfy specific query patterns. Disk space is still the cheapest resource in the IT ecosystem, and denormalization of data is encouraged: a few more writes for better read performance is a small price to pay. Cassandra is also architected to make such writes very fast. On the other hand, avoid excessive updating and deleting of data. Superseded values and tombstones (the markers Cassandra writes for deleted data) accumulate until compaction removes them, and in the meantime reads have to sift through them, which lowers read performance.
You can therefore code your applications to write data into all the places where it might be needed later, to serve a query fast from just one row. Bear in mind that Cassandra does not offer table joins. This means that (sometimes ugly) hacks available in other DBMSs to serve queries will not work in Cassandra. It also underlines the importance of understanding from the start what you want to get out of your database (your queries) and using this to drive your design.
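As a rough illustration of writing the same data to every place a query will need it, the sketch below models two tables as plain Python dictionaries keyed by partition key. It is not Cassandra client code, and the names (readings_by_sensor, readings_by_day, record_reading) are invented for this example.

```python
from collections import defaultdict

# Two hypothetical "tables", one per query pattern. Each maps a partition key
# to the list of rows stored in that partition.
readings_by_sensor = defaultdict(list)  # query 1: all readings for one sensor
readings_by_day = defaultdict(list)     # query 2: all readings for one day


def record_reading(sensor_id, day, value):
    """Write the same reading into every table whose query will need it."""
    row = {"sensor_id": sensor_id, "day": day, "value": value}
    readings_by_sensor[sensor_id].append(row)
    readings_by_day[day].append(row)


def readings_for_sensor(sensor_id):
    """Answer query 1 from a single partition, with no join."""
    return readings_by_sensor.get(sensor_id, [])


def readings_for_day(day):
    """Answer query 2 from a single partition, with no join."""
    return readings_by_day.get(day, [])


# Three writes, each duplicated into both tables.
record_reading("sensor-1", "2024-01-01", 20.5)
record_reading("sensor-2", "2024-01-01", 19.0)
record_reading("sensor-1", "2024-01-02", 21.0)
```

Each write costs two appends instead of one, but each query is a single key lookup, which is the trade the section describes.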
Tradeoffs Between Higher Consistency and Lower Latency
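By default, Cassandra favors availability and partition tolerance, keeping several replicas of each piece of data on different nodes. Under the CAP theorem, consistency then takes a back seat: the default is eventual consistency, with periods, albeit short, during which two different versions of the same data may exist on two different nodes. However, consistency is tunable. Each read or write can specify a consistency level (for example ONE, QUORUM, or ALL) that sets how many replicas must respond before the operation is considered successful. The tradeoff is latency: the more replicas an operation waits for, the slower it becomes.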
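The arithmetic behind this tradeoff can be sketched in a few lines. The usual rule of thumb is that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names below are invented for illustration and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor for this example
LEVELS = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

# Compare a few read/write level pairings for RF = 3.
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    overlap = reads_see_latest_write(LEVELS[read_level], LEVELS[write_level], RF)
    print(f"write={write_level:<6} read={read_level:<6} overlap guaranteed: {overlap}")
```

Writing and reading at ONE gives the lowest latency but no overlap guarantee. QUORUM on both sides gives the guarantee at the cost of waiting for a majority of replicas on every operation.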
Balancing Minimum Partitions and Evenly Spread Data
Ideally, you want each query to touch the smallest possible number of partitions for the fastest response, but also the most even spread of data to avoid overloading any one node. Unfortunately, this is like wanting to have your cake and eat it: you cannot optimize both at the same time. The best tradeoff between the two will depend on how your application uses Cassandra. For example, a requirement for many reads on data organized into moderately sized groups may suggest giving priority to reducing the number of partitions. Conversely, a need to potentially expand any group of data to become very large, but with only a few reads required, suggests favoring evenly spread data at the expense of a greater number of partitions, with groups of data spread over several partitions.
You can reduce the number of partitions in different ways, including by using tools available in Cassandra like static columns, user-defined data types, and collection data types that store multiple values in a single column. In the reverse direction, you can split partitions by adding a column to the partition key, or by bucketing to organize data into smaller blocks. Cassandra also helps even out the amount of data per node by distributing rows around the cluster according to a consistent hash of the row's partition key (the first part of the row's primary key).
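The following sketch shows, under simplifying assumptions, how bucketing and hash-based placement interact. It uses MD5 from the Python standard library as a stand-in hash (Cassandra's default partitioner uses Murmur3), and the Ring class, the node names, and the bucketed_key helper are all invented for this example.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Hash a partition key to a position on the ring (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy consistent-hash ring with several virtual tokens per node."""

    def __init__(self, nodes, vnodes=64):
        pairs = sorted((token(f"{node}#{i}"), node)
                       for node in nodes for i in range(vnodes))
        self._tokens = [t for t, _ in pairs]
        self._owners = [n for _, n in pairs]

    def node_for(self, partition_key: str) -> str:
        """Return the node owning the first ring token >= the key's token."""
        idx = bisect.bisect_left(self._tokens, token(partition_key))
        return self._owners[idx % len(self._owners)]  # wrap around the ring


def bucketed_key(sensor_id: str, day: str) -> str:
    """Add a day bucket to the partition key to cap partition size."""
    return f"{sensor_id}:{day}"


ring = Ring(["node-a", "node-b", "node-c"])

# Without bucketing, all of a sensor's data lives in one partition on one node.
single_owner = ring.node_for("sensor-1")

# With daily buckets, the same sensor's data is split across many partitions.
days = [f"2024-01-{d:02d}" for d in range(1, 31)]
bucket_owners = {ring.node_for(bucketed_key("sensor-1", day)) for day in days}
```

Bucketing keeps each partition small and spreads load, but a query spanning a month must now read 30 partitions instead of one, which is the tradeoff described above.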
RDBMS-Think? Leave It at the Door
You are more likely to get the most out of a Cassandra database by working with it on its own terms, rather than trying to apply dos and don'ts from the standpoint of an RDBMS. Nevertheless, for those coming from a relational database background, it may help to summarize differences. A relational database emphasizes relations between tables, foreign keys, normalization and joins, and may be easier to expand vertically (bigger system) than horizontally (several systems). A Cassandra database relates tables to queries, uses partition keys to indicate data locality over several distributed nodes, encourages de-normalization and has no joins.
And finally...
There is more to good Cassandra database design than the space in this article lets us describe, but the basic indications here should already help you to start off in the right direction.