Kitopi Operational Metrics

By Adrian Gonciarz, QA / SRE Director


Life in the kitchen business can be hard. You deal with a variety of things, from food quality issues, delivery problems, and a slow internet connection to opening up the fridge and noticing the chicken is not there.


As we approach new challenges that require smarter solutions, for Kitopi’s system observability in 2021, we decided to integrate with Dynatrace as our main solution. Our initial goal was to not only be able to use it for internal monitoring but also use the same idea of metrics, and alerting, to be able to track operational problems. Quite a brave notion, right? Well, we went beyond that - knowing the power the Davis engine provides - we wanted to use it to give us the reactive and proactive solutions for detecting operational issues as well. Issues such as problems with the internet in the kitchens, integration with ordering portals being down, or order rejections and cancellations that might suggest some of our ingredients are out of stock.



Internal Data on Kafka

Kitopi is a very dynamic system that has to handle a lot of orders coming in from diverse sources. In order to handle high-volume asynchronous traffic, many of our internal operations are event-based, ie. services communicating via Kafka topics. Producers are responsible for notifying downstream apps (the consumers) about orders’ life cycle changes. This means nearly all information related to our orders is reported live, as messages, on Kafka topics.


In case we need information about a single order, we can, of course, use APIs or web applications to search for the related data. However, since we are interested in aggregated, statistical data about our system and its performance, both on the application side (handled mostly by the development team) and the operational side (handled by support teams and business departments), we use Kafka instead. A good practice we follow is having such aggregated information (accepted orders per restaurant, number of failures of delivery, etc.) as metrics in Prometheus with Grafana as the tool for visualization.


Dynatrace Metrics and Dimensions

Since we began our journey with Dynatrace, we learned a lot about its capability to quickly analyze data through metrics. Aside from the very useful and out-of-the-box built-in metrics that it provides, we were able to use our own custom application metrics. Here we stumbled upon a problem with our current implementation of metrics - because of their amount, it would be quite costly to enable them. Also, most of the existing application metrics became less important since Dynatrace is able to measure a lot of the data on its own and provides really good support for early problem detection and Root Cause Analysis in case of any issues.


The Idea

In 2021, our main goal became creating a Command Center unit, a team specialized in dealing with operational issues. Our idea was to equip them with simple and reliable data visualization and alerts that could cut down reaction time for issues and improve their daily work. In order to achieve that, we needed operational metrics carrying business-level information: the number of incoming orders per country, rejected orders per restaurant, etc. We decided to apply the Kafka Streaming approach, aggregating the internal data and exposing it as meaningful business metrics. We decided on the implementation in Python with the use of the Faust Streaming library (a community driven version of the Faust library but actively maintained). It had all the Kafka Streams concepts we needed to start our project and was lightweight.


Our architecture can be deconstructed into:

  • A processor component responsible for streaming and aggregating the data from input Kafka topic(s).

  • Aggregated data is produced to an intermediate Kafka topic, called dynatrace-metrics, with a very strict schema.

  • Another component is responsible for picking up dynatrace-metrics messages and sending them over to the Dynatrace via its metric ingestion API.


The processors had to be standalone Kubernetes applications due to the (semi-)persistent nature of the data held in Stream Tables. In order to facilitate the process of following the schema of the dynatrace-metrics topic, we created a small Python library which provided an API for serializing those messages as models.


The Ingestion part, however, was a great case for a serverless function (AWS Lambda), fetching messages from the dynatrace-metrics topic and sending them to the Dynatrace API.


Example: Order Creation and Rejection Metrics

Kitopi integrates with third-party order aggregators sending orders to our system. Upon receiving them, the orders can be accepted and created or rejected for multiple reasons.


Information about orders successfully processed was published on the topics. Working with the team responsible for handling incoming orders, we decided that each rejection event will be published on a separate Kafka topic. This way we have two data sources with events corresponding to order creation or rejection. However similar, the cases are very different, primarily due to the volume of input data - we have tens of thousands of orders daily while maintaining the number of rejections at a very low rate (under 0.5%). We realized that for created orders, we do not need minute precision, instead, we will be good to go with 5-minute buckets calculated for each aggregator. It will still provide us with good insight while keeping the costs of DDUs lower. On the other hand, live reporting is important for rejections.

With this in mind, an order-processor app was created. The app contained two separate Faust workers, one for handling orders creation and the other for orders rejection.

The workers have a bit of a different flow:

  • A worker responsible for order creation calculations processes Kafka messages. Each message increases the counter of orders per aggregator in a so-called tumbling table. This kind of table lets us store data in 5-minute windows and aggregate the number of orders per aggregator on window close. Then another window is opened to store the next time bucket data. The window closing function produces messages to dynatrace-metrics topics following the established schema.

  • A worker responsible for order creation and rejection, processes Kafka messages and immediately produces a metric message to dynatrace-metrics topic.


Then the part responsible for transferring the metrics from dynatrace-metrics to Dynatrace API does its job and we can visualize creations and rejections in the graphs



Once we have that visualized and we can analyze a few days of data, we are able to introduce alerting as well. We learned about the “rollup” function, which defined reactive and proactive alerts for order rejections. Reactive would trigger an alert whenever we get 3 or more rejections for a given aggregator within 10 minutes. Proactive alerts have the same conditions but for a 20-minute period. An example alert can be seen below.


Summary

From our extended R&D phase, it is possible to successfully introduce meaningful operational metrics incorporating data from internal sources, such as Kafka topics. Streaming and aggregating this data via Kafka Stream compatible framework is relatively easy, although we learned that some features of Faust library, such as windowed tables, are not obvious at first sight. Architectural strategy with multiple lightweight processors and intermediate dynatrace-metrics topics, consumed by serverless apps, responsible for batching and sending metrics via Dynatrace API, has proven to be reliable, scalable, and stable. We believe it’s an inspiring pattern that others might use for their purposes in order to bring some more business observability to Dynatrace.

.