Introduction to the series
The "What even is" series is my attempt to write introductory posts about "big data" tools.
This is intentionally not a detailed guide - there are books and official documentation for that. My goal is to provide you with enough information so you can consider those tools when faced with the problems that arise when dealing with large amounts of data.
In this first post we cover Apache Kafka. In future posts I intend to write about Spark, Hive, Snowflake, Redshift, Presto, AWS Athena and other similar tools.
In very simple terms, Apache Kafka is like a message broker such as RabbitMQ with durability properties but with less routing abilities.
It was initially built by LinkedIn to be a centralized way of integrating heterogeneous systems. To fulfill their demands, LinkedIn built a platform that can:
- Decouple message producers and consumers.
- Persist messages in a fault-tolerant way so multiple consumers can process them at their own time.
- Handle a very large number of messages and scale horizontally as necessary to handle more data.
Kafka was designed to transport any kind of message. Thus it can be used to integrate fairly different kinds of data like application monitoring metrics, user-behavior tracking events, application events, database commit logs, etc.
Kafka allows you to decouple applications (or parts of just one application) by allowing them to communicate asynchronously through messages. Applications (producers) can publish messages into topics that can be consumed by other applications (consumers).
Each topic is a stream of messages being written by producers and read by consumers. Topics are defined by one or more partitions that can be written to and be read in parallel. Partitioning is one of the ways Kafka allows you to scale horizontally, since each broker (Kafka process) in a cluster can be responsible for a different partition.
Those messages are serialized into a binary format by Kafka, so they can be a simple piece of text or a complex object (as long as it can be serialized). Apache Avro is a very efficient serialization format that is commonly used on Kafka, but simpler formats like JSON are also supported. Each message can include an optional key that will then be used to define which partition of the topic the message should go. The message key is also available for consumers.
Publish-subscribe messaging is a fairly common integration pattern so I won't go into any more detail, but you can read more about it here.
Differently from RabbitMQ and other message brokers, Kafka persists published messages in an append-only ordered sequence on disk in a per-partition (and per-topic) basis. Each message is then assigned with an offset integer that represents its position in the sequence.
This approach gives full control to consumers of how to read those sequences of messages. Consumers can choose to read only new messages or replay every single message available. They can also choose to work in realtime or with periodic batches without risking losing data. New consumers can also be added and consume messages published even before the consumer was created.
This feature can be very handy in some scenarios. For instance, imagine that a bug was introduced in a consumer and you only noticed that after some messages were incorrectly consumed. With Kafka you can consider just replying those messages after fixing the bug.
As you may imagine those sequences of messages can grow very quickly. Because of that each topic can be configured with a cleaning policy, like deleting every message older than an amount of time or deleting messages when the partition reaches a certain size.
An interesting cleaning policy that is available is called compaction. With it you can configure Kafka to keep only the most recent version of each key on the topic. This cleaning strategy has many uses, like storing the latest preference chosen by users or keeping the current state of an entire application so consumers can rebuild their state if necessary.
Kafka Streams is a very convenient JVM client side library for writing applications to process and analyze data on Kafka in realtime.
It allows you to implement a lot of features that require data processing with very little code. That is accomplished through high level operations on streams like
reduce and the generic
aggregate. There are several use cases for this library, like anomaly detection, data enrichment on ETL pipelines, rate monitoring, etc.
Kafka Streams and stream processing in general is a very interesting topic but not very short, so if you want to read more, the official docs is a good place to start.
Kafka Connect is a framework that simplifies pulling/pushing data from/into third party applications like databases, search engines, other messaging brokers, etc. You can write custom connectors yourself, but you can also use the connectors made available by Confluent and other vendors.
As an example of what connectors can do, if you're using the Confluent CLI, then starting up a connector that pulls data in "realtime" from any database accessible via JDBC can done with just a config file and a few commands.
Now we cover a short example of how Kafka can be used. In order to make it clear how Kafka allows very heterogeneous applications to talk to each other, our example consists of a Ruby producer publishing JSON-encoded messages and a Java consumer. So we have:
- A Ruby producer using the ruby-kafka gem that publishes bids into a Kafka topic.
- A Java consumer using the official Consumer API that consumes bids individually and write them to the console (it could be writing it to a database but the console is good enough for this example).
Here is the Ruby producer:
Now the Java consumer just uses the Consumer API to fetch batches of published messages and process each of them individually in a synchronous way:
And here are the logs from executing the producer/consumer pair above:
Kafka Streams examples
I also considered adding a Kafka Streams example, but my attempts to write anything that was somewhat interesting and simple weren't very successful, so instead I'm linking to a few of the examples from Confluent that I found interesting:
- WordCountLambdaExample: The classic word count example that splits pieces of text published to a topic and count the occurrence of each word as they are encountered.
- SessionWindowsExample: This example uses window and aggregation functions to count the number of events that happened within a user session. The user sessions are determined by the window function and an inactivity gap:
.windowedBy(SessionWindows.with(INACTIVITY_GAP))(at this line at the time of writing).
- KafkaMusic: This is a fairly complicated but very interesting example of how Kafka Streams can be used to produce a key-value store accessible via HTTP based on events published to topics and some stream processing.
Keep in mind that even though some of those examples aren't trivial (at least for someone like me that doesn't work with Java daily), they still consist of a fairly small amount of code that produces useful aggregations on data that can be used to build very useful features. Also don't forget that each of those examples have a very strong foundation: they are fault-tolerant, horizontally scalable, etc.
Apache Kafka is extremely versatile and powerful. It can be used as a simple pub-sub broker, as the message bus for a microservices ecosystem, as a log delivery platform and even as an ETL pipeline.
Its flexibility is one of the things I find more interesting on it. You can, for example, start using it just to integrate a few systems in a very robust way. Then suddenly you'll realize you can easily start publishing all your data streams sources to it and consume them via Kafka Streams or even integrate it with other tools like Spark.
Kafka is also in constant evolution. For instance, in a few weeks KSQL will be released for production and will make creating Kafka Streams applications even easier.
Here are some notes if you're considering using Kafka but don't know where to start:
- The Confluent distribution of Kafka is very handy. Their CLI simplifies a lot dealing with Kafka and its dependencies.
- If you're going to run and manage a Kafka cluster, then you should understand its internals before making any critical part of your system depend on it. You're also going to need to understand how to run and manage Apache Zookeeper. You also going to need several instances in your cluster to make it reliable.
- You can use Heroku Kafka, Confluent Cloud or other providers if you don't want to manage your own cluster.
- The official docs are quite good and Kafka: The Definitive Guide is also not bad if you like books.
- Make sure you pick the right replication factor for your streams. You can quite easily lose data if you don't.
- Make sure you pick the right processing semantics for your needs (exactly-once, at-most-once, at-least-once).
- If you're not used to Java or Scala, I recommend you to use an IDE like IntelliJ IDEA (the CE edition is free!) to write your consumers/producers. With it you can just type class names and ask your IDE to import the required classes for you. Even the smallest of consumers may need to import dozens of packages and that's very hard to keep track without help. By the way, IntelliJ IDEA has a great VIM emulation ;-)