Observability Powered by SQL: Understand Your Systems Like Never Before With OpenTelemetry Traces and PostgreSQL

⚠️ While part of this content may still be up to date, we regret to inform you that we decided to sunset Promscale in February 2023. Read our reasons why and other FAQs.

Troubleshooting problems in distributed systems built on microservices and cloud-native technologies is a challenge, to say the least. With so many different components, all their interactions, and frequent changes introduced with autoscaling and multiple deployments, we can’t possibly predict everything that could go wrong.

So far, we know that the first step to solving this problem is to start collecting detailed telemetry data from all our cloud-native systems in the form of traces, metrics, and logs. Unlike troubleshooting monoliths, troubleshooting distributed systems is a “trace first, metrics and logs second” process.

By themselves, metrics and logs don’t provide the level of understanding required to identify anomalous behaviors in those systems. Traces, however, capture a connected view of how the entire system works to process requests, allowing software developers to figure out what is actually happening when a problem occurs. Metrics and logs expand the information that traces provide—especially if directly correlated to traces.

OpenTelemetry, a Cloud Native Computing Foundation (CNCF) standard for instrumentation, is driving a revolution in this space by unifying the collection of traces, metrics, and logs and making them universally available. Since OpenTelemetry focuses on helping solve the challenges of operating distributed systems, it is no surprise that the observability framework first targeted tracing support, which is currently more mature than the support for the other two signals.



Having access to all that telemetry data—especially traces—solves only half of the problem. The other half requires tools to analyze, correlate and make sense of that data. That is what observability is about. It’s essentially an analytics problem: you want to be able to interrogate telemetry data with arbitrary questions and get results back in real time.

Given that observability is an analytics problem, it is surprising that the current state of the art in observability tools has turned its back on the most common standard for data analysis broadly used across organizations: SQL.

Good old SQL could bring some key advantages: it’s surprisingly powerful, with the ability to perform complex data analysis and support joins; it’s widely known, which reduces the barrier to adoption since almost every developer has used relational databases at some point in their career; it is well-structured and can support metrics, traces, logs, and other types of data (like business data) to remove silos and support correlation; and finally, visualization tools widely support it.

However, implementing full SQL support from scratch on top of an observability store is a major undertaking that would likely take years of development.

But what if we didn’t have to? What if we could leverage a well-known and reliable open-source SQL database to solve this problem?

That’s exactly what we’ve done :)


PostgreSQL is the second most used and most loved database among developers, backed by more than 25 years of development and a healthy and thriving community with hundreds of integrations available.


You’re probably thinking that observability data is time-series data, which relational databases struggle with at a certain scale. Luckily, PostgreSQL is highly flexible and allows you to extend and improve its capabilities for specific use cases. TimescaleDB builds on that flexibility to add time-series superpowers to the database and scale to millions of data points per second and petabytes of data.

With PostgreSQL and TimescaleDB, we have a robust and scalable database for time-series data. Now, how do I add my OpenTelemetry data to start using SQL for observability?



Promscale bridges that gap.

Today, we are announcing the general availability of OpenTelemetry tracing support in Promscale, the observability backend for metrics and traces built on the rock-solid foundations of PostgreSQL and TimescaleDB. We’re now closer to becoming the unified data store for observability powered by SQL, bringing the PostgreSQL and observability worlds together.

Traces are the first step toward truly understanding cloud-native architectures. They capture a connected view of how requests travel through an application, helping developers understand (in real time) how the different components in their system behave and interact with one another.


With today’s announcement of tracing support, Promscale equips software engineers worldwide with observability powered by SQL for distributed systems, allowing them to unlock unprecedented insights about their systems.

OpenTelemetry Tracing: Observability Powered by SQL

With tracing support, engineers can interrogate their observability data with arbitrary questions in SQL and get results back in real time to identify production problems faster and reduce mean time to repair (MTTR).

OpenTelemetry tracing support in Promscale includes all these features:

  • A native ingest endpoint for OpenTelemetry traces that understands the OpenTelemetry Protocol (OTLP), making it easy to ingest OpenTelemetry data
  • Trace data compression and configurable retention to manage disk usage
  • Full SQL superpowers to interrogate your trace data with arbitrary questions
  • A fully customizable, out-of-the-box, and modern application performance management (APM) experience in Grafana using SQL queries on OpenTelemetry traces, to get new insights immediately after deployment
  • Out-of-the-box integrations to visualize distributed traces stored in Promscale using Jaeger and Grafana, so you don’t have to learn new tools or change existing workflows
  • Support for ingesting traces in Jaeger and Zipkin formats via the OpenTelemetry Collector, so you can also benefit from all these new capabilities even if you’re not yet using OpenTelemetry
  • Correlation of Prometheus metrics and OpenTelemetry traces stored in Promscale through exemplars as well as SQL queries
  • Integration in tobs, the observability stack for Kubernetes, so you can deploy all those new capabilities with a single command in your Kubernetes cluster

Keep reading this blog post for more details on these capabilities and an introduction to powerful, pre-built Grafana dashboards that you can integrate into your Grafana instance. These dashboards will give you instant information on your distributed systems’ dependencies, helping you quickly identify (and correct) performance bottlenecks.


You can deploy Promscale in your environment today—it’s 100% free—or experiment with it by running our lightweight OpenTelemetry demo. To learn how to deploy a microservices application and a complete OpenTelemetry observability stack running on Promscale in only a few minutes, check our OpenTelemetry demo blog post for a step-by-step guide.

Sending and Storing Trace Data in Promscale

A key advantage of OpenTelemetry is that it is vendor agnostic. This allows us to instrument our code with OpenTelemetry libraries and send the generated telemetry to one or multiple observability backends via exporters. OpenTelemetry also defines its own wire protocol, the OpenTelemetry Protocol (OTLP). All the OpenTelemetry language SDKs include an exporter that sends telemetry data to OTLP-compliant backends, and the community and vendors have built a number of other exporters for non-OTLP-compliant backends.

Promscale has an endpoint that listens for OTLP over gRPC (on port 9202 by default), so it can ingest traces using the standard OpenTelemetry tools without needing a proprietary exporter.

OpenTelemetry also includes the OpenTelemetry Collector, a standalone component that receives telemetry in multiple formats (not just OTLP), processes it, and exports it via OTLP with the OpenTelemetry exporter or in other formats through vendor-specific exporters.

Architecture diagram illustrating the core components of the OpenTelemetry Collector, the inputs it accepts, and possible outputs


To send trace data from your application instrumented with OpenTelemetry to Promscale, you have two options:

  • You can configure the OTLP exporter of the OpenTelemetry SDK you used to instrument your application to send traces directly to Promscale.
  • You can configure the OTLP exporter of the OpenTelemetry SDK to send data to the OpenTelemetry Collector and then from the OpenTelemetry Collector to Promscale.

We advise you to use the OpenTelemetry Collector for better performance because it can send data to Promscale in larger batches, which speeds up ingestion. Also, if you want to send data to another backend, you can change the configuration in one place. See our documentation for more details.

You can also easily send Jaeger and Zipkin traces to Promscale through the OpenTelemetry Collector by configuring it to receive traces in those formats and convert them to OpenTelemetry.
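If you go through the Collector, a single configuration file can cover both cases. The sketch below is an illustration rather than a drop-in config: the promscale hostname is hypothetical, port 9202 is Promscale’s default OTLP gRPC port, the insecure TLS setting assumes a trusted local network, and the exact set of receivers available depends on your Collector version.

receivers:
  otlp:
    protocols:
      grpc:
  jaeger:
    protocols:
      thrift_http:
  zipkin:

processors:
  batch:        # batches spans before export, which helps ingest performance

exporters:
  otlp:
    endpoint: "promscale:9202"   # hypothetical hostname; Promscale's OTLP gRPC endpoint
    tls:
      insecure: true             # assumption: no TLS between the Collector and Promscale

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [batch]
      exporters: [otlp]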

Promscale supports the full OpenTelemetry trace schema. It stores and queries all the data, including resources, spans, events, and links with their corresponding attributes. Promscale stores OpenTelemetry traces in TimescaleDB using an optimized database schema and leverages TimescaleDB’s compression capabilities to reduce disk space requirements by up to 10-20x. The retention period is configurable and is set to 30 days by default.
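As a sketch of what adjusting retention looks like in practice (this assumes the ps_trace retention helpers described in the Promscale documentation; verify the function names for your version), you can inspect and change the trace retention period directly from SQL:

-- Check the current trace retention period (30 days by default)
SELECT ps_trace.get_trace_retention_period();

-- Keep trace data for 14 days instead
SELECT ps_trace.set_trace_retention_period(INTERVAL '14 days');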


Querying Traces With SQL

To make querying trace data with SQL easier, Promscale exposes database views that make querying spans, events, and links (with all their corresponding attributes) as easy as querying single relational tables.

For example, the following query returns the average response time (i.e., latency) in milliseconds for requests to each service in the last five minutes:

SELECT
    service_name AS "service",
    AVG(duration_ms) AS "latency (ms)"
FROM ps_trace.span s
WHERE
    start_time > NOW() - INTERVAL '5m'
    AND (parent_span_id IS NULL OR span_kind = 'SPAN_KIND_SERVER')
GROUP BY 1
ORDER BY 1 ASC;

ps_trace.span is the span view. The condition parent_span_id IS NULL OR span_kind = 'SPAN_KIND_SERVER' ensures that we only select root spans (i.e., those that have no parent span) or spans that represent a service receiving a request from another service (which is what the server span kind indicates). Finally, we compute the average duration of those spans over the last five minutes for each service.

This is the result of visualizing the query above in Grafana with a Table panel:

Latency in milliseconds for requests to each service in the last five minutes
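Because this is plain SQL over a regular view, variations are cheap. As a minimal sketch, the same query can report the 95th-percentile latency instead of the average, using PostgreSQL’s standard percentile_cont ordered-set aggregate:

SELECT
    service_name AS "service",
    -- p95 of span durations per service over the last five minutes
    percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS "p95 latency (ms)"
FROM ps_trace.span
WHERE
    start_time > NOW() - INTERVAL '5m'
    AND (parent_span_id IS NULL OR span_kind = 'SPAN_KIND_SERVER')
GROUP BY 1
ORDER BY 2 DESC;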

Because traces are represented as a hierarchy of spans, the ability to use SQL joins, aggregates, and recursive queries to perform sophisticated analysis on trace data is extremely valuable. We’ll say it again: observability is essentially an analytics problem, and SQL is the most popular tool for analytics.

Imagine that we notice our customers are experiencing prolonged response times from a specific API endpoint we offer. When we receive a request to that API endpoint, there are many operations in multiple microservices that are called to fulfill that request. Identifying where the bottleneck is in a distributed environment is not trivial. How could we quickly identify where most of the time is effectively being spent?

Note that there are a couple of important considerations to take into account.

First, the microservices and operations involved in requests to that API endpoint could also be involved in many other requests. We only want to measure the performance of those service operations when they are called as part of a request to that API endpoint and not when they are called because of requests to other endpoints.

Secondly, we want to look at the effective execution time, that is, the time when a service operation is actually doing some processing and not just waiting for the response from another service operation (within the same service or another service). The graphic below illustrates the problem.

In that particular trace, which operation is consuming the most time? If we just look at the duration of each span, the initial span (i.e., the root span) will always be the longest since it contains all other spans, which tells us nothing about where the bottleneck is. To compute the effective execution time of an operation, we need to subtract the duration of all its child spans from its own duration. In our example, the effective execution time of Operation A is t1 + t2 + t6, which is shorter than t4, the effective execution time of Operation C. Therefore, Operation C is the main bottleneck in this request.

The root span can be broken down into child spans—the duration of each child span will give us clues on which operation may be causing a bottleneck

Note that if we used parallelization or asynchronous operations, we would need to do this differently, but for the sake of simplicity, we’ll assume operations are sequential.

To address those two problems, we need to take several steps:

  1. Compute the duration of each downstream service and operation only when they are being called as part of a request to the slow API endpoint
  2. Subtract from each of those durations the duration of its child spans
  3. Aggregate the results by service and operation

To do this in SQL, we use a recursive query with the following syntax:

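As a minimal, self-contained illustration of the structure (it has nothing to do with tracing; it just counts to three, with each part of the syntax labeled):

WITH RECURSIVE counter AS (
    SELECT 1 AS n        -- initial query: seeds the result set
    UNION ALL            -- concatenates the results of every pass
    SELECT n + 1         -- recursive query: runs against the rows
    FROM counter         -- produced by the previous pass, again and again,
    WHERE n < 3          -- until it returns no rows
)
SELECT n FROM counter;   -- final query: runs over all the concatenated rows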

This is how it works.

It first runs the initial query, which returns an initial set of results, and then runs the recursive query against that initial set of results to get another set of results. After that, it reruns the recursive query on that set of results and continues doing so until the recursive query returns no results.

Then, it takes all the different results from all the individual queries and concatenates them. That’s what the UNION does.



And finally, it runs the final query against all those concatenated results—that is what the query returns.


The following query computes the effective execution time of each operation over the last 30 minutes. In this case, the API endpoint we are analyzing is the generator.generate operation (span_name is the operation’s name) of the generator service.

WITH RECURSIVE x AS
(
    -- initial query
    SELECT
        s.trace_id,
        s.span_id,
        s.parent_span_id,
        s.service_name,
        s.span_name,
        s.duration_ms - coalesce(
        (
            SELECT sum(z.duration_ms)
            FROM ps_trace.span z
            WHERE s.trace_id = z.trace_id
            AND s.span_id = z.parent_span_id
            AND z.start_time > NOW() - INTERVAL '30 minutes'
        ), 0.0) as effective_duration_ms
    FROM ps_trace.span s 
    WHERE s.start_time > NOW() - INTERVAL '30 minutes'
    AND s.service_name = 'generator'
    AND s.span_name = 'generator.generate'

    UNION ALL
    -- recursive query
    SELECT
        s.trace_id,
        s.span_id,
        s.parent_span_id,
        s.service_name,
        s.span_name,
        s.duration_ms - coalesce(
        (
            SELECT sum(z.duration_ms)
            FROM ps_trace.span z
            WHERE s.trace_id = z.trace_id
            AND s.span_id = z.parent_span_id
            AND z.start_time > NOW() - INTERVAL '30 minutes'
        ), 0.0) as effective_duration_ms
    FROM x
    INNER JOIN ps_trace.span s
    ON (x.trace_id = s.trace_id
    AND x.span_id = s.parent_span_id
    AND s.start_time > NOW() - INTERVAL '30 minutes')
    
)
-- final query
SELECT
    service_name,
    span_name,
    sum(effective_duration_ms) as total_exec_time
FROM x
GROUP BY 1, 2
ORDER BY 3 DESC

These are the key things to note in this query:

  • The recursive query syntax
  • The initial and the recursive queries compute the effective duration of a span by using a subquery to sum the duration of all child spans. The latter is then subtracted from the duration of the span (coalesce is used to return 0 in case the span has no child spans and the subquery returns NULL)
s.duration_ms - coalesce(
(
	SELECT sum(z.duration_ms)
	FROM ps_trace.span z
	WHERE s.trace_id = z.trace_id
	AND s.span_id = z.parent_span_id
	AND z.start_time > NOW() - INTERVAL '30 minutes'
), 0.0) as effective_duration_ms
  • The recursive query uses a join to traverse the downstream spans by selecting the child spans of the previous results. Because it runs repeatedly on each new set of results, it ends up processing all downstream spans across all traces that originate in our API endpoint
INNER JOIN ps_trace.span s
ON (x.trace_id = s.trace_id
AND x.span_id = s.parent_span_id
AND s.start_time > NOW() - INTERVAL '30 minutes')
  • The final query aggregates all the individual execution times by service and operation

This is the result of the query in Grafana using the Pie chart panel. It quickly points out the random_digit operation of the digit service as the main bottleneck.

Pie chart in Grafana showing the total execution time of each operation. With such a view, we can clearly identify that the random_digit operation of the digit service is the main bottleneck

A Modern APM Experience Integrated Into Grafana

An overview of the Application Performance Monitoring dashboards for Promscale in Grafana

Since this APM experience is directly integrated into your Grafana instance, you don’t need to set up a new tool or learn a new user interface. Additionally, the Grafana dashboards are editable, so you can customize and extend them to meet your specific needs.

The new APM experience within Grafana consists of a set of interlinked dashboards:

[1] Overview: provides a general overview of the performance across all services to surface the most common problems that could require your attention.

Promscale APM dashboards: Overview (screenshot).

[2] Service Details: provides a detailed view of the performance of a service to quickly identify specific problems in that service related to throughput, latency, and errors.

Promscale APM dashboards: Service Details (screenshot).

[3] Service Map: a real-time automatic map of how your services communicate to identify dependencies and quickly validate their implementation.

Promscale APM dashboards: Service Map (screenshot).

[4] Downstream Dependencies: a real-time, detailed node graph of all the downstream services and operations involved in traces that include the selected service and operation. This helps you troubleshoot in detail how downstream services and operations impact the performance of an upstream service.

Promscale APM dashboards: Downstream Dependencies (screenshot).

[5] Upstream Dependencies: a real-time, detailed node graph of all the upstream services and operations involved in traces that include the selected service and operation. This helps you quickly assess the potential blast radius of an issue in a specific service and operation, and determine which upstream service and operation is causing problems (like a sudden increase in requests impacting performance) on a downstream service.

Promscale APM dashboards: Upstream Dependencies (screenshot).

Note: Some of these dashboards use the Node graph panel, which was introduced in recent versions of Grafana. It’s a beta version, so it may break. It worked for us in Grafana 8.5.1.

These dashboards are available on GitHub (filename starts with apm-). Check our documentation for details on how to import the dashboards into Grafana.

Observability Is an Analytics Problem

OpenTelemetry, a CNCF standard for instrumentation, makes telemetry data easier to collect and universally available. Traces, in particular, hold a treasure trove of valuable information about how distributed systems behave. To get the most value from tracing, we need powerful tools to analyze that data. Observability is essentially an analytics problem.

SQL is the lingua franca of analytics. When applied to trace data, it helps you unlock new insights to troubleshoot production issues faster and proactively improve your applications. Get started now with Promscale on Timescale (free 30-day trial, no credit card required) or self-host for free.

Promscale is an observability backend built on the rock-solid foundation of PostgreSQL and TimescaleDB. With the new support for OpenTelemetry traces and integrations with the visualization tools you use and love, you have full SQL superpowers to solve even the most complex issues in your distributed systems. You can install Promscale for free here.
