Navigating the Data Stream: Building an Event Processing System from Scratch

September 10, 2024, 11:25 pm
In the age of Big Data, the ability to process and analyze vast streams of information is crucial. Companies are inundated with data from various sources, and the challenge lies in transforming this data into actionable insights. This article explores the intricacies of building an event processing system, focusing on the lessons learned and the strategies employed to tackle common challenges.

Imagine a river flowing with countless streams of data. Each stream represents a different source: logs from devices, network routers, and user interactions. The goal is to harness this flow, filter out the noise, and extract valuable information. This is the essence of event processing.

The task at hand was daunting. We needed to process 80,000 messages per second, each carrying logs from devices running on different operating systems. The sheer volume was overwhelming, yet the stakes were high: our product, a User and Entity Behavior Analytics (UEBA) platform, relied on this data to detect anomalies and enrich events with context.

Before diving into the architecture, it’s essential to understand the context. We faced a dual challenge: ensuring a correct timeline of events and managing the size of the messages. If the timeline was off, our enrichment services would yield incorrect results. Large message sizes slowed down processing, creating bottlenecks.

Our first step was to define the types of events we would process. We identified 45 distinct event types based on the content of the messages. Filtering these events was crucial. Only relevant events would proceed to the next stage of processing.

Next came the enrichment phase. This involved determining the maturity of users and hosts. Maturity refers to the state where a user or host has accumulated a certain number of events for each type. We also enriched data with historical context, geolocation based on IP addresses, and device group identification.
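
To make this concrete, here is a minimal Python sketch of what such an enrichment step might look like. The MATURITY_THRESHOLD value, the geo_lookup callable, and the device_groups mapping are illustrative placeholders; the article does not specify the real thresholds or lookup services, and in production this state would live in persistent storage rather than in process memory.

```python
from collections import defaultdict

# Illustrative threshold: a user/host is "mature" once it has accumulated
# this many events of a given type (the real value is product-specific).
MATURITY_THRESHOLD = 100

# In-memory counters for the sketch only; the real pipeline keeps this
# state in ClickHouse or another persistent store.
event_counts = defaultdict(int)

def enrich(event: dict, geo_lookup, device_groups: dict) -> dict:
    """Attach maturity, geolocation and device-group info to a raw event."""
    key = (event["user"], event["host"], event["event_type"])
    event_counts[key] += 1

    enriched = dict(event)
    enriched["is_mature"] = event_counts[key] >= MATURITY_THRESHOLD
    # geo_lookup is any callable that maps an IP address to a location.
    enriched["geo"] = geo_lookup(event.get("src_ip"))
    enriched["device_group"] = device_groups.get(event["host"], "unknown")
    return enriched
```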

Identifying anomalies was another critical aspect. We needed to track the first activity of users and detect inactivity. A user was deemed inactive if no events were received for a significant duration. This information was vital for our analytics.
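
A simplified sketch of that tracking logic, assuming a seven-day inactivity window (the article only says “a significant duration”) and in-memory dictionaries in place of the real storage:

```python
from datetime import datetime, timedelta

# Assumed inactivity window; the real duration is not stated in the article.
INACTIVITY_WINDOW = timedelta(days=7)

first_seen: dict[str, datetime] = {}
last_seen: dict[str, datetime] = {}

def track_activity(user: str, event_time: datetime) -> None:
    """Record the first and most recent activity timestamps for a user."""
    first_seen.setdefault(user, event_time)
    last_seen[user] = max(last_seen.get(user, event_time), event_time)

def inactive_users(now: datetime) -> list[str]:
    """Users whose last event is older than the inactivity window."""
    return [user for user, seen in last_seen.items()
            if now - seen > INACTIVITY_WINDOW]
```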

Now, let’s discuss the architecture. Initially, we considered using established solutions like Kafka Streams or Apache Flink. However, the learning curve and resource requirements were prohibitive. We opted for a custom solution, allowing us flexibility in design and control over the processes.

Our first iteration employed a monolithic architecture. This approach enabled rapid development and the release of a Minimum Viable Product (MVP). However, we soon encountered challenges. The workload was unevenly distributed, leading to resource wastage. Additionally, monitoring and restarting processes became cumbersome.

Recognizing these issues, we transitioned to a microservices architecture. This shift allowed us to isolate services, improving scalability and fault tolerance. Each service could now handle specific tasks, making the system more efficient.

One of the primary challenges was event type identification and filtering. Our initial implementation in Python was slow, hampered by the volume of messages. We migrated this logic to ClickHouse, leveraging materialized views for efficient processing. This change significantly improved performance without increasing resource consumption.
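
As a rough illustration of this pattern, the materialized view below classifies raw messages on insert and passes only recognized event types to the target table. It is issued here through the clickhouse-driver Python package; the host, table names, and the two multiIf rules are placeholders (the real system distinguishes 45 event types).

```python
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical host

# Classify and filter messages as they are inserted, so only relevant
# event types ever reach the processing table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_filtered_mv
    TO events_filtered AS
    SELECT
        event_time,
        host,
        user,
        multiIf(
            message LIKE '%logon%',  'logon',
            message LIKE '%logoff%', 'logoff',
            'other') AS event_type,
        message
    FROM raw_messages
    WHERE event_type != 'other'
""")
```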

The timeline of events was another critical factor. We needed to ensure that incoming data formed a correct chronological sequence. Initially, we used a priority queue for sorting, but this approach consumed excessive memory. Instead, we opted for a simpler solution: querying ClickHouse for already sorted data. This adjustment streamlined our process and reduced memory usage.
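
A minimal version of that change: rather than re-sorting events in a priority queue inside the service, ask ClickHouse for the next batch already ordered by event time. Table and column names follow the earlier sketch and are assumptions.

```python
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical host

def fetch_sorted_batch(since, batch_size=10_000):
    """Return the next batch of events already in chronological order,
    delegating the sort to ClickHouse instead of an in-memory queue."""
    return client.execute(
        """
        SELECT event_time, host, user, event_type, message
        FROM events_filtered
        WHERE event_time > %(since)s
        ORDER BY event_time
        LIMIT %(limit)s
        """,
        {"since": since, "limit": batch_size},
    )
```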

Message size posed a significant hurdle. Large messages slowed processing and increased network load. Our first attempt, compressing the data with Kafka’s standard tools, fell short. We pivoted to an event-driven architecture (EDA) built around differential data transmission: instead of the full event, we sent only the essential payload, cutting message sizes from 6 kilobytes to 200-300 bytes.
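
Conceptually, the differential approach boils down to comparing the new event with the last state sent for the same entity and transmitting only the changed fields. The sketch below is illustrative; the topic name and the kafka-python style producer call are assumptions, not the production code.

```python
import json

def make_diff(previous: dict, current: dict) -> dict:
    """Keep only the fields that changed since the previous event for the
    same entity, shrinking a multi-kilobyte payload to a few hundred bytes."""
    return {k: v for k, v in current.items() if previous.get(k) != v}

# Hypothetical producer call (kafka-python style). Keying by host keeps all
# diffs for one entity in the same partition so they can be merged in order.
# producer.send(
#     "events.diff",
#     key=current["host"].encode(),
#     value=json.dumps(make_diff(previous_state, current)).encode(),
# )
```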

However, this solution introduced a new challenge: data merging. We needed to find a balance between service performance and message processing delays. Our approach involved directing data streams to specific instances based on key event fields, enhancing resource management and minimizing synchronization issues.
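
A sketch of that routing idea: hashing the key fields of an event deterministically maps it to one instance, so every update for a given user/host pair is merged in a single place. The number of instances and the choice of key fields are assumptions.

```python
import hashlib

NUM_INSTANCES = 4  # illustrative number of merge-service instances

def route(event: dict) -> int:
    """Pick the instance that owns this event based on its key fields."""
    key = f"{event['user']}:{event['host']}".encode()
    # A stable hash keeps the same entity on the same instance across events.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_INSTANCES
```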

For fault tolerance, we implemented a dual-instance system. One instance handled the primary tasks, while the second monitored its status. If the first instance failed, the second would take over, ensuring continuity.
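
A bare-bones version of the standby side of that scheme, assuming the primary periodically writes a heartbeat timestamp somewhere both instances can read (the article does not specify the mechanism, so read_heartbeat and become_primary are illustrative hooks):

```python
import time

HEARTBEAT_TIMEOUT = 10  # assumed seconds without a heartbeat before takeover

def standby_loop(read_heartbeat, become_primary):
    """Watch the primary's heartbeat and take over if it stops updating."""
    while True:
        last_beat = read_heartbeat()  # Unix timestamp of the last heartbeat
        if last_beat is None or time.time() - last_beat > HEARTBEAT_TIMEOUT:
            become_primary()
            break
        time.sleep(1)
```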

The final step involved merging the original event with the differential data. Here we leaned on ClickHouse itself, creating a table with the Kafka engine to consume and process the diff stream. This approach let us handle large datasets efficiently without exhausting the services’ memory.
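
Roughly, the ClickHouse side of that merge could look like the sketch below, again issued from Python. The broker address, topic, table names, and join key are assumptions: a Kafka engine table consumes the diff topic inside ClickHouse, and a materialized view joins each consumed block with the stored original events and writes the merged rows out.

```python
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical host

# A Kafka engine table lets ClickHouse consume the diff topic directly.
client.execute("""
    CREATE TABLE IF NOT EXISTS diffs_queue
    (
        event_id String,
        diff     String  -- JSON containing only the changed fields
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list = 'events.diff',
             kafka_group_name = 'diff_merger',
             kafka_format = 'JSONEachRow'
""")

# A materialized view merges each consumed diff with the stored original
# event, so the merge never loads full datasets into service memory.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS merged_events_mv
    TO merged_events AS
    SELECT e.event_id, e.payload, d.diff
    FROM diffs_queue AS d
    INNER JOIN original_events AS e ON e.event_id = d.event_id
""")
```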

In conclusion, building an event processing system is akin to navigating a complex river. Each twist and turn presents challenges, but with the right strategies, it’s possible to harness the flow of data. Our journey taught us the importance of flexibility, efficient resource management, and the power of custom solutions. As data continues to surge, mastering the art of event processing will be paramount for organizations seeking to thrive in a data-driven world.

Stay tuned for more insights and updates from Crosstech Solutions Group. The journey of data processing is just beginning, and there’s much more to explore.