Better Monitoring and Observability at Procore
Procore is on a journey to improve our architecture, better deliver our vision to customers, and improve our internal developer experience. A key pillar of this journey is improving our Observability.
Observability, as defined by Honeycomb co-founder and CTO Charity Majors, is “being able to ask arbitrary questions about your environment without having to know ahead of time what you wanted to ask.” Observability at Procore currently relies on a patchwork of log capture, focused metrics, and some tracing, and the siloed nature of these tools makes it difficult to correlate their results and ask those arbitrary questions. Our engineers often have to jump between tools to correlate data, which means it takes longer to detect incidents and identify their root causes, and spotting trends often requires guesswork.
A modern Observability stack empowers engineering teams to explore and understand changes by asking arbitrary questions across all relevant data, providing a holistic view into what’s happening with a service at any given time. This not only results in a better overall understanding of the system’s health (and, by extension, the company’s), but also supports better planning through data-driven decision making and a more efficient engineering organization overall.
Traditional monitoring and alerting systems revolve around answering known questions that are predicted and instrumented ahead of time. A modern Observability stack enables teams to answer unknown questions, allowing for near real-time debugging and more complete investigation of unanticipated problems without requiring new deploys that may prolong investigations. It’s crucial that the stack be flexible and strongly decoupled, support high-cardinality data, and connect disparate data sources. It also requires a cultural shift that establishes observability as a standard, valued part of software development.
High-level overview of our proposed Observability solution
Strong Observability systems collect and correlate telemetry from many sources, including application code, libraries, infrastructure, tests, observability tooling, and third-party systems such as VCS and customer service systems. The pipeline automatically tags this telemetry to enable high cardinality and faceting. The data stream is fed by metrics, traces, and structured logging instrumentation in the code, as well as by integrations and system monitoring tools outside it. Instrumenting in this fashion facilitates the use of tracer bullets, feature flags, synthetic monitoring, dynamic sampling, and other techniques to collect telemetry and debug more efficiently. Combined with good SLI/SLO practices, error budgets, and operational review disciplines, this provides better visibility into systems’ health and drives better monitoring practices.
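To make the idea of structured, automatically tagged telemetry concrete, here is a minimal Python sketch. It is illustrative only and not a description of Procore's pipeline: the function names (make_event, tag_event), the tag keys, and all values are hypothetical. It shows an application emitting a structured event with high-cardinality fields, and a pipeline stage merging in shared tags before the event is serialized as one JSON log line.

```python
import json
import time
import uuid

# Hypothetical tags a pipeline might attach to every event it handles.
PIPELINE_TAGS = {
    "service": "example-service",
    "environment": "staging",
    "git_sha": "abc1234",
}


def make_event(name, trace_id, **fields):
    """Build a structured event; arbitrary high-cardinality fields are allowed."""
    return {"event": name, "trace_id": trace_id, "timestamp": time.time(), **fields}


def tag_event(event, pipeline_tags):
    """Return a copy of the event with pipeline tags merged in.

    Fields already set by the application take precedence over pipeline tags.
    """
    return {**pipeline_tags, **event}


event = make_event(
    "request.completed",
    trace_id=uuid.uuid4().hex,
    user_id="u-123",      # high-cardinality field, useful for faceting
    duration_ms=87,
)
tagged = tag_event(event, PIPELINE_TAGS)

# One JSON object per line is a common structured-logging convention.
line = json.dumps(tagged, sort_keys=True)
print(line)
```

Because every event carries the same trace_id and tag keys, a query tool can group or filter by any of them (service, user, deploy) without the question having been anticipated at instrumentation time.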
Decoupling collection pipelines from the larger Observability toolchain allows for greater flexibility and avoids vendor lock-in. It also enables “log everything” patterns: transformations and filtering reduce what we ingest into Observability tooling, while all telemetry flowing through the pipeline can still be archived on cheap nearline storage services like AWS Glacier. Done well, this approach also lets us instrument the Observability toolchain itself and transform data to improve consistency between integrations and code. That in turn allows for “self-service infrastructure as code” patterns for adding new data sources and sinks, and enables features such as dynamic and tail-based sampling.
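The “archive everything, forward selectively” pattern with tail-based sampling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our implementation: the span fields (trace_id, duration_ms, error), the thresholds, and the in-memory lists standing in for the archive and tooling sinks are all hypothetical. The key property of tail-based sampling is that the keep-or-drop decision is made after a whole trace has been seen, so interesting traces (errors, slow requests) are always kept.

```python
import hashlib
from collections import defaultdict


def keep_trace(spans, slow_ms=500, baseline_rate=0.1):
    """Decide whether to forward a complete trace to Observability tooling."""
    if any(span.get("error") for span in spans):
        return True  # always keep traces containing an error
    if any(span.get("duration_ms", 0) >= slow_ms for span in spans):
        return True  # always keep slow traces
    # Keep a deterministic fraction of the remaining healthy traces,
    # chosen by hashing the trace ID so every stage makes the same decision.
    trace_id = spans[0]["trace_id"]
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline_rate * 10_000


def run_pipeline(spans, archive, tooling, baseline_rate=0.1):
    """Archive every span, then forward only the sampled traces."""
    by_trace = defaultdict(list)
    for span in spans:
        archive.append(span)                      # cheap storage sink gets everything
        by_trace[span["trace_id"]].append(span)   # buffer until the trace is complete
    for trace_spans in by_trace.values():
        if keep_trace(trace_spans, baseline_rate=baseline_rate):
            tooling.extend(trace_spans)           # tooling sink gets the sample


spans = [
    {"trace_id": "a", "duration_ms": 20},
    {"trace_id": "a", "duration_ms": 30, "error": True},
    {"trace_id": "b", "duration_ms": 10},
    {"trace_id": "c", "duration_ms": 900},
]
archive, tooling = [], []
run_pipeline(spans, archive, tooling, baseline_rate=0.0)
print(len(archive), sorted({s["trace_id"] for s in tooling}))
```

A real pipeline would need to handle traces whose spans arrive late or across multiple collectors; this sketch assumes each batch contains complete traces.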
Most developers rely on some form of visualization tooling to surface contextually relevant system information under a unified interface, ideally eliminating the need to juggle multiple tools. To be most effective, these interfaces should be easy to query and modify, and should connect workflows across multiple systems, environments, services, users, requests, and disparate data types. Queries should be easy to share and understand to facilitate collaboration. Anomalies should be highly visible, and the tooling should be able to show how data flows across multiple Procore services. These tools should also support shared dashboards and alerting for operational review meetings and incident response.
Now that our Observability strategy is laid out, the next step is to implement it. That’s the primary purpose of our newly formed Observability team, and something we’re excited to undertake. Our first step is building a new pipeline to empower this collection vision, and we’re looking forward to sharing more as we do.
If this kind of challenge excites you, then maybe you should come work for us!