Engineering a SIEM part 1: Why did we need to build our own SIEM?

Published

Feb 15, 2024

This blog is the first in a series of posts covering how we built an engineering-first security information and event management (SIEM) system at Rippling.

Indeed, you read that right—Rippling consciously chose to develop its own SIEM rather than opting for an off-the-shelf solution. In this series, we’ll explore the rationale behind this decision and the strategic choices that led us down this unusual path, one where we prioritize engineering practices, principles, and methodologies in the design and development process.

You might be wondering what exactly “engineering-first” means for us at Rippling. It's about a commitment to integrating the best engineering practices in our work. This encompasses robust data engineering for effective information management, the use of reliable software development methodologies, and embedding security right from the start. We believe this approach helps us lay a solid foundation for our SIEM system while acknowledging that we’re constantly learning and evolving.

We’ll guide you through our team's engineering-driven process at Rippling, tackling challenges and celebrating victories as we create an internal tool adept at ingesting logs, creating detections, enriching logs, managing alerts, and automating response actions. Utilizing a security data lakehouse is central to our approach, elevating the system's capabilities. Our initial post will focus on the foundational features, delving into the “what” and “why” behind our decision to build.

The importance of SIEM and efficient log management

Understanding SIEM: The foundation of cybersecurity

To begin, let's explore the concept of Security Information and Event Management (SIEM). What exactly is SIEM, and why does it hold such significance in the realm of cybersecurity? Essentially, SIEM is a sophisticated system that aggregates and analyzes logs from various sources within an IT infrastructure. Its primary role is to detect and respond to security incidents, acting as a vigilant sentinel guarding the cyber boundaries of a business. SIEM enables us to swiftly identify and react to threats, ensuring the safety of digital assets.

The importance of efficient log management

Efficient log management goes beyond the mere collection and storage of log data. It encapsulates the idea of organizing, analyzing, and effectively utilizing the data to its maximum potential. But why is such efficiency critical? Firstly, logs are the lifeblood of the SIEM system, providing a detailed chronicle of events within your IT environment. These range from routine notifications to alerts about potential security breaches. Without a robust log management system, SIEM would be deprived of the vital data it requires to operate effectively.

Furthermore, efficient log management ensures that significant security threats are accurately identified amidst the countless events occurring every second in the digital landscape. It plays a crucial role in regulatory compliance as well, with many industries bound by rules that mandate the collection, storage, and analysis of log data for specified periods. So, effective log management isn't just beneficial—it's a regulatory necessity.

Additionally, it’s instrumental in troubleshooting and forensic investigations, providing an audit trail to trace back cyberattacks or system failures. This not only helps organizations rectify immediate issues but also learn and improve from past incidents.

Our journey to a custom SIEM solution

In the dynamic landscape of cloud technology and massive data influx, we found ourselves at a crossroads. The challenge wasn't just about finding a SIEM system; it was about finding one that truly fit our needs. Here, we’ll dive into what it took to build our SIEM, focusing first on our functional requirements before delving into the challenges we aimed to overcome.

As we discuss our SIEM solution, it's important to define how we use certain terms:

  • Logs: The raw records from systems, applications, or devices
  • Events: Specific incidents identified within logs, indicating notable activities
  • Data: A general term encompassing both logs and events, representing all information processed in our SIEM system

Functional requirements - building blocks of our SIEM

Scalability

As a company grows, so does its data. This growth demands scalability from SIEM systems—the ability to efficiently adapt to expanding data volumes without sacrificing performance. Ideal scalability should be automatic, requiring minimal human intervention to adjust the system's capacity in response to the current data influx.

However, scalability extends beyond data handling. It also applies to the capacity to build and manage an increasing number of detections without hindering system performance. A scalable SIEM should support the development of hundreds of detections, maintaining functionality regardless of the detection volume.

Change review and automatic deployment

To mitigate the risk of manual errors in our SIEM system, it’s essential to adopt an Everything as Code approach. This requirement entails treating all configurations as code—from log ingestion and detection development to automated alert handling. Such an approach necessitates version-controlled changes and comprehensive reviews prior to deployment. Additionally, integrating this strategy with CI/CD pipelines is mandatory for automatic testing and deployment, ensuring the system operates efficiently while significantly reducing the risk of human error.
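
To make this concrete, here’s a minimal sketch of what a version-controlled detection might look like. The rule format below is purely illustrative, loosely modeled on Python-based detection engines; the field names follow CloudTrail’s ConsoleLogin events, but the file layout and function convention are hypothetical, not our actual schema.

```python
# detections/aws_root_console_login.py
# Illustrative detection-as-code rule. Because it is just a file in a
# repository, every change goes through code review and CI before deploy.

SEVERITY = "HIGH"

def rule(event: dict) -> bool:
    """Fire when the AWS root account signs in to the console."""
    return (
        event.get("eventName") == "ConsoleLogin"
        and event.get("userIdentity", {}).get("type") == "Root"
    )

def title(event: dict) -> str:
    return f"Root console login from {event.get('sourceIPAddress', 'unknown')}"
```

The payoff is that a pull request plus automated tests gate every change, exactly as they would for any other piece of production code.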

Harnessing semi-structured data

In a world where log files are often semi-structured data, the ability to handle the JSON format effectively is crucial. Have you considered the limitations of a system that applies static schemas to logs, which often requires reindexing the entire dataset when updates are needed?

In contrast, imagine the flexibility of making your logs searchable on fields that weren't initially extracted during parsing. This would allow you to easily extract any subfield from within the JSON blob, applying filters as necessary and proving especially beneficial when building detections. For us, this dynamic and adaptable approach to handling semi-structured log data is mandatory.
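
Here’s a small, hypothetical Python sketch of the idea: because the raw JSON blob is retained, any nested field can be reached at query time, even one nobody thought to parse out up front. The helper and sample log are invented for illustration.

```python
import json
from functools import reduce

def get_field(raw_log: str, path: str):
    """Extract an arbitrary subfield from a raw JSON log by dotted path.

    No static schema: fields that were never extracted at ingestion
    time remain reachable, which is handy when writing new detections.
    """
    record = json.loads(raw_log)
    return reduce(
        lambda obj, key: obj.get(key) if isinstance(obj, dict) else None,
        path.split("."),
        record,
    )

raw = '{"userIdentity": {"type": "Root", "sessionContext": {"mfaAuthenticated": "false"}}}'
print(get_field(raw, "userIdentity.sessionContext.mfaAuthenticated"))  # -> false
```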

Near-real-time detections

A successful SIEM solution hinges on the real-time—or near-real-time—ingestion of all log sources. Given that these logs are used to build detections and rapidly respond to potential threats, it's crucial to minimize the latency in processing them. Consider the gap between the log entry generation time and the log parsing time on the platform. To maintain an efficient threat detection and response system, we mandate that this latency should be less than 10 minutes.
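
In practice, that requirement reduces to watching a single number per log source. A minimal sketch of the check, with the 10-minute budget taken from the text and everything else assumed:

```python
from datetime import datetime, timedelta, timezone

MAX_LATENCY = timedelta(minutes=10)  # the SLO described above

def within_slo(generated_at: datetime, parsed_at: datetime) -> bool:
    """True if the gap between log generation and platform parsing
    stays under the 10-minute budget."""
    return parsed_at - generated_at <= MAX_LATENCY

generated = datetime(2024, 2, 15, 12, 0, tzinfo=timezone.utc)
parsed = datetime(2024, 2, 15, 12, 4, tzinfo=timezone.utc)
print(within_slo(generated, parsed))  # -> True (4 minutes of latency)
```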

Long-term log retention

While the typical retention period for data in SIEMs ranges from 30 to 180 days, an extended retention period has its benefits. For us, retaining logs in the platform for extended periods, without compromising the ability to search through them, is a requirement. Additionally, we insist that the solution doesn’t dramatically impact the budget.

Extended retention periods are beneficial for incident responders dealing with security events. Often, they need to delve into past data to fully remediate incidents. The process of rehydrating data from archives can be time-consuming and, in our view, an unacceptable delay.

Seamless integration with AWS

In an era of cloud computing, how well does your SIEM solution integrate with your chosen cloud service? Seamless integration with AWS—our relied-upon cloud service—is paramount. This integration ensures smooth operation and optimal utilization of AWS resources. It's not just about convenience; this compatibility forms the foundation of our data management strategy, enabling us to fully leverage the cloud's scalability and flexibility. Beyond basic integration with AWS, we require that the SIEM utilizes native cloud components and is fundamentally designed to operate in sync with cloud architecture.
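
As one concrete (and deliberately simplified) example of what native cloud integration buys: CloudTrail delivers gzipped JSON objects to an S3 bucket, so pulling records is a few lines with boto3. This is a sketch of the pattern, not our pipeline; bucket and key are placeholders, and credentials handling is omitted.

```python
import gzip
import json

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def read_cloudtrail_object(bucket: str, key: str) -> list:
    """Fetch one CloudTrail delivery file from S3 and return its records.

    CloudTrail writes gzipped JSON objects with a top-level "Records" array.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(gzip.decompress(body))["Records"]
```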

Efficient query language

At Rippling, a robust and widely used query language for querying and managing our log data is non-negotiable. The reasoning behind this is twofold. First, it allows us to manipulate and extract value from our log data more efficiently. Second, it circumvents the need for additional hours spent on training for a platform-specific language. Our preference leans towards a universally accepted query language like SQL.
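
To illustrate the benefit, here’s the kind of detection logic plain SQL expresses naturally. The snippet runs the query through DuckDB purely to keep the example self-contained; any SQL engine would do, and the table and its fields are made up.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE logins (user_name TEXT, source_ip TEXT, success BOOLEAN,
                         event_time TIMESTAMP)
""")
con.execute("""
    INSERT INTO logins VALUES
        ('alice', '203.0.113.7', false, '2024-02-15 12:00:00'),
        ('alice', '203.0.113.7', false, '2024-02-15 12:01:00'),
        ('alice', '203.0.113.7', true,  '2024-02-15 12:02:00')
""")

# Plain SQL, no vendor-specific query language: find IPs with repeated
# failed logins followed by a success.
rows = con.execute("""
    SELECT source_ip,
           COUNT(*) FILTER (WHERE NOT success) AS failures
    FROM logins
    GROUP BY source_ip
    HAVING COUNT(*) FILTER (WHERE NOT success) >= 2
       AND BOOL_OR(success)
""").fetchall()
print(rows)  # -> [('203.0.113.7', 2)]
```

No analyst retraining required: anyone who knows SQL can read, review, and extend a query like this on day one.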

Standardization and modularity

In the context of SIEM solutions, the process of integrating new log sources can be complex and time-consuming, especially when dealing with custom sources that require interaction with API interfaces. Standardization and modularity provide a clear, repeatable path for onboarding new log sources and ensure a consistent, efficient approach.

With this in mind, we've prioritized standardization and modularity in our architecture requirements. Regardless of its nature, any new log source—even a custom one—should be onboarded following the same streamlined, modular approach within a maximum of three days. This expectation also extends to log sources necessitating custom code for API interactions.

As part of our standardized and modular approach, we emphasize the need for comprehensive documentation for all ingested log sources. Documentation serves as a valuable resource not only for understanding the intricacies of each log source but also for debugging any potential ingestion issues that may arise.
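
A sketch of what such a modular contract could look like (the interface and names here are invented for illustration): every source, whether a SaaS API or an AWS service, implements the same small surface, so onboarding becomes filling in two methods plus a documentation page.

```python
from abc import ABC, abstractmethod
from typing import Iterator

class LogSource(ABC):
    """Hypothetical connector contract for onboarding a new log source."""

    name: str      # stable identifier, used in routing and docs
    doc_url: str   # link to the source's onboarding and debugging notes

    @abstractmethod
    def fetch(self, checkpoint: str) -> Iterator[dict]:
        """Yield raw records newer than the checkpoint (API pagination,
        auth, and retries live here)."""

    @abstractmethod
    def normalize(self, record: dict) -> dict:
        """Map a raw record onto the shared log schema."""
```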

Role-based access control

SIEM solutions handle a wealth of data, some of which can be sensitive. With varying teams and roles within a company, ensuring appropriate access is a critical concern. Here, role-based access control (RBAC) emerges as a vital tool, enhancing privilege control and reducing maintenance efforts.

For our ideal SIEM solution, we envision robust RBAC capabilities to granularly manage access to logs. We anticipate assigning access to different teams within the company based on roles. This isn’t merely about restricting or granting access, but about doing so with a degree of granularity that meets our specific needs—from subsets of logs generated by a single source to logs from multiple sources.
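
As a toy model of that granularity, consider mapping roles to log-table patterns. The roles, table names, and patterns below are invented purely to show the idea of scoping access from a single source’s subset up to whole families of sources.

```python
from fnmatch import fnmatch

# Hypothetical role-to-scope mapping: a scope can be a subset of one
# source ("gsuite.login") or a whole family of sources ("aws.*").
ROLE_SCOPES = {
    "incident-responder": ["aws.*", "gsuite.*", "slack.*"],
    "it-helpdesk": ["gsuite.login", "slack.access"],
}

def can_read(role: str, table: str) -> bool:
    """True if any of the role's scope patterns matches the log table."""
    return any(fnmatch(table, pattern) for pattern in ROLE_SCOPES.get(role, []))

print(can_read("it-helpdesk", "gsuite.login"))    # -> True
print(can_read("it-helpdesk", "aws.cloudtrail"))  # -> False
```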

Tamper-proof logs

Ensuring the security of our SIEM environment is not just a technical necessity; it's fundamental to maintaining the trust and reliability of the entire system. Data integrity within a SIEM is paramount, as any tampering or unauthorized changes could lead to incorrect insights, potentially making us blind to threats and compromising incident response. Our aim is to make logs tamper-proof or, if that's not entirely possible, ensure that any alterations trigger immediate alerts to the appropriate team.
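
One well-known way to make tampering evident, offered here as an illustration of the property rather than as our implementation, is to hash-chain records so that altering or deleting any entry invalidates everything after it:

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Each record's hash covers the previous hash, so editing or
    deleting any earlier record breaks every subsequent hash."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

records = [
    {"event": "login", "user": "alice"},
    {"event": "role_change", "user": "alice", "role": "admin"},
]
digest = "0" * 64  # genesis value
for record in records:
    digest = chain_hash(digest, record)
print(digest)  # persist this; a periodic re-check that disagrees => alert
```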

Broad log ingestion capabilities

As businesses today operate in a technologically diverse landscape and use a range of corporate tools and cloud platforms, a SIEM solution must be ready to ingest logs from this variety of sources. The scope of our requirement includes corporate tools such as Zoom, G Suite, and Slack, as well as AWS cloud services like CloudTrail, CloudWatch, GuardDuty, DNS query logs, S3 access logs, and many more.

In addition, the SIEM solution must be capable of ingesting logs from within Kubernetes clusters, databases, and even other non-standard sources—like in-house developed applications. By accommodating all these sources, we ensure that no potential threat vector remains unnoticed, enhancing our capability to detect and respond to security incidents effectively.

Automated notifications

In the modern, ever-evolving cybersecurity landscape, a SIEM solution must offer automated alerts for platform issues. This includes, but is not limited to, disruptions in log flow or failed detections. By alerting us in real-time, we can swiftly identify and rectify these issues, reducing potential security risks.
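
A simplified sketch of one such check, with invented source names and thresholds: alert when a source has gone quiet for longer than its expected cadence, since silence usually means the pipeline is broken.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-source silence thresholds.
SILENCE_THRESHOLDS = {
    "aws.cloudtrail": timedelta(minutes=15),
    "gsuite.admin": timedelta(hours=1),
}

def silent_sources(last_seen: dict, now: datetime) -> list:
    """Return sources whose newest log is older than their allowed silence."""
    epoch = datetime.min.replace(tzinfo=timezone.utc)
    return [
        source
        for source, threshold in SILENCE_THRESHOLDS.items()
        if now - last_seen.get(source, epoch) > threshold
    ]

now = datetime.now(timezone.utc)
last_seen = {"aws.cloudtrail": now - timedelta(minutes=30)}
print(silent_sources(last_seen, now))  # -> ['aws.cloudtrail', 'gsuite.admin']
```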

Navigating the challenges

When dealing with large-scale data sources, such as cloud-based environments, applications, network devices, and more, traditional SIEM solutions may struggle to handle the sheer volume of logs generated. Ingesting and retaining all logs from numerous sources can lead to excessive license usage and data storage costs. Consequently, security teams are often forced to make tough choices regarding which data sources to prioritize for ingestion and which ones to exclude.

The consequences of this decision can be far-reaching. Opting to ingest only a subset of logs might save on costs, but it also means potentially missing crucial insights and valuable security context from the excluded data sources. Threat actors can exploit these blind spots, evading detection and putting the organization at risk.

Traditional setups that limit data ingestion may inadvertently hinder the development of high-fidelity detections. When critical log sources are excluded, security analysts lack the necessary context to accurately correlate events and identify advanced threats. False positives may increase, leading to alert fatigue, where real threats get buried amid a sea of false alarms. This reduces the effectiveness of the entire security monitoring process.

Last but not least, the absence of essential log data complicates incident investigation and response efforts. Incident responders may face difficulties piecing together the attack story, leading to prolonged containment and eradication times. Moreover, the lack of complete logs can result in an incomplete understanding of the incident, potentially leaving the organization vulnerable to recurring threats.

Flexible and cost-efficient log ingestion

Our SIEM must avoid situations where we need to exclude a significant portion of logs due to licensing costs. We reject fixed license models in favor of a flexible spending approach, where costs scale reasonably with increased data ingestion. Our expenditure on log ingestion should proportionally reflect the volume of data processed, ensuring that we can maintain comprehensive monitoring without consuming a disproportionate share of our security budget. Also, the system must be capable of ingesting and processing two to three terabytes of logs daily without incurring unreasonable expenses.
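
For a sense of scale, a quick back-of-envelope on those numbers (the midpoint is our own rounding):

```python
# 2-3 TB/day of logs compounds quickly, which is why fixed per-GB
# license models were a non-starter for us.
daily_tb = 2.5              # midpoint of the 2-3 TB/day range above
yearly_tb = daily_tb * 365  # ingested volume per year
print(f"~{yearly_tb:,.0f} TB/year, ~{yearly_tb / 1024:.1f} PB/year")
```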

Cost-effective expansion of detections

As we develop and implement new detections, the associated costs should remain manageable. It's crucial that adding new detections to enhance our security posture doesn’t lead to a steep increase in expenses. This requirement ensures that our security measures evolve without adversely impacting our overall budget.

Low operational workload

Effective log management is a key component of any SIEM solution, but when improperly executed, it can create a substantial operational burden. This burden manifests in countless hours spent debugging issues, identifying root causes, and managing potential data loss, leaving security teams ill-equipped to tackle emerging threats.

Therefore, our goal is to minimize the operational load as much as possible. Rather than exhaust our resources and cycles dealing with repetitive issues stemming from log ingestion or detection creation, we need to focus on streamlining these processes. The acceptable amount of time we can allocate to operational issues every week should ideally be a few hours, say between three and five. Any time beyond that quickly becomes unjustified, detracting from our primary goal of ensuring robust, comprehensive security.

Bridging SIEM with the companywide data lakehouse

In our pursuit of an effective SIEM solution, we discovered a significant opportunity in the existing data lakehouse managed by our data engineering team. This data, particularly from our product and application logs, holds immense potential for advancing our security investigations, attributing potential attackers, and developing new detections. 

The challenge, however, was to harness this wealth of information without redundantly re-ingesting it into our SIEM system—a process that would be resource-intensive and time-consuming in terms of integration development. Therefore, we set out to find a way to seamlessly integrate with the existing data platform. Our goal was clear: leverage this existing data lakehouse effectively, thereby enriching our security insights and capabilities, all while aligning with our objective of minimizing operational load and avoiding unnecessary efforts.

Conclusion: Crafting our own path

Considering all the challenges and requirements mentioned above, our decision to build our own SIEM solution was driven by a comprehensive understanding of our security landscape. Traditional SIEM solutions often fell short in meeting the specific demands of our dynamic environment, especially concerning cost efficiency, data ingestion flexibility, and integration capabilities with our current systems.

Evaluating existing solutions

Before settling on building our own SIEM, we reviewed existing tools offered by companies like Panther and Sumo Logic. We even conducted a Proof of Concept with Panther in our environment. After several weeks of assessing its capabilities, we realized that we would still need to develop numerous integrations for log ingestion and create many custom detections. This is primarily because Rippling offers a wide array of tools developed in-house and directly integrated with our product, necessitating a more tailored approach.

Embracing innovation and creativity

Rippling prides itself on fostering a spirit of innovation and creative thinking. We believe in exploring the possibility of in-house development first, challenging ourselves to build something that precisely fits our needs rather than defaulting to an off-the-shelf solution. By giving it a try, we can assess whether the outcome justifies the effort.

We’re developing our SIEM to create a system that is not only finely tuned to our infrastructure but also embodies our commitment to innovation. The goal is to achieve seamless integration with the companywide data lakehouse and adeptly manage the vast and varied data generated by our network and applications, all while aligning with our ethos of creative problem-solving and self-sufficiency.

Coming up in part 2: Our solution—security data lakehouse and modular design

Join us in the next post as we explore how we leveraged the data lakehouse concept and adapted it to create a scalable security data lakehouse solution. We’ll address all the challenges and requirements identified in the first part!

Are you intrigued by our progress and challenges? Do you see yourself thriving in a team that takes on exciting projects? If yes, then you're in luck! Rippling is hiring. We're seeking passionate individuals eager to explore the frontiers of technology. Stay tuned for updates, and join us in shaping the future of secure cloud infrastructure.

Last edited: April 11, 2024

Author

Piotr Szwajkowski

Staff Security Engineer

Piotr serves as the Staff Security Engineer on Rippling's Security Operations Team, where he specializes in developing detection strategies and responding to security incidents.