Traces vs. Logs vs. Metrics


Yvone Samiento

Jul 25, 2024, 8:47:32 PM
to OpenEVSE

Metrics focus on one area of the system at a time, which makes it hard to track issues across a distributed system using metrics alone. Best practice is to collect metrics at regular intervals and store them as numerical values.
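
To make that concrete, here is a minimal sketch using the OpenTelemetry Python SDK to record a counter that is exported at a regular interval. The meter name, attribute keys, and 10-second interval are illustrative choices, not anything the SDK prescribes:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Collect and export on a regular interval (every 10 seconds here).
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=10_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")  # hypothetical service name
requests_counter = meter.create_counter(
    "http.requests", unit="1", description="HTTP requests handled"
)

# Metrics are numerical values, recorded with descriptive attributes.
requests_counter.add(1, {"http.route": "/orders", "http.status_code": 200})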

Logs are records. They differ from metrics in that they record discrete events, but, as with metrics, those events are whatever the business or software deems important: anything from general information, to events that surpass a certain threshold, to warnings and errors. The historical record created by logs provides insight into issues within a software environment. When an error occurs, the logs show when it happened and which events correlate with it.

Traces track the end-to-end behavior of a request as it moves through a distributed or microservice system. The data collected in distributed tracing brings higher visibility to requests that use multiple internal microservices. Traces provide insight into how a request behaves at specific points in an application, such as proxies, middleware and caches, to identify any forks in the execution flow or network hops across system boundaries.
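
Here's a minimal tracing sketch with the OpenTelemetry Python SDK, where nested spans stand in for one request crossing internal components; the span and service names are hypothetical:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")  # hypothetical service name

# Nested spans model one request crossing internal components.
with tracer.start_as_current_span("handle_order"):
    with tracer.start_as_current_span("check_cache"):
        pass  # cache lookup would happen here
    with tracer.start_as_current_span("call_payment_service"):
        pass  # downstream call would happen here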

Breaking free of the restraints of vendor lock-in opens software systems up to numerous database options for storing telemetry data. Among those options, a purpose-built time series database is a strong choice. Observability measures a software system over time, and time series databases are designed to store high volumes of data written and queried across ranges of time. Analyzing time series data, such as metrics, requires querying across time ranges; those queries are easy to execute in a time series database and difficult for other database types to handle efficiently.
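
As a sketch of that access pattern, here is a time-range query against InfluxDB 3.0, assuming the influxdb3-python client package; the host, token, database, and measurement names are placeholders:

from influxdb_client_3 import InfluxDBClient3  # pip install influxdb3-python

client = InfluxDBClient3(
    host="https://cloud2.influxdata.com",  # placeholder host
    token="MY_TOKEN",                      # placeholder token
    database="telemetry",                  # placeholder database
)

# Range queries over time are the natural access pattern here.
table = client.query(
    "SELECT time, host, usage_percent FROM cpu "
    "WHERE time >= now() - INTERVAL '1 hour' ORDER BY time"
)
print(table)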

InfluxDB 3.0, launched in April 2023, makes working with OTEL more accessible than ever. The new database engine was built on top of Apache Arrow and brought many performance improvements over previous versions of InfluxDB. The database now supports unlimited cardinality data without affecting performance, which translates into 100 times faster queries against high-cardinality data. InfluxDB 3.0 ingests and queries data in real time, making it ideal for applications that require real-time analytics, such as observability. Real-time analytics also makes identifying anomalies much faster.

InfluxDB is purpose-built to work with time series data, and telemetry data is time series data. The creation of OpenTelemetry opened many doors when it came to how developers handled observability practices. Working with OTEL and InfluxDB brings the power of real-time analytics, unlimited cardinality and fast querying to your telemetry data.

These data types play such a key role in cloud-native observability workflows that they're known as the three pillars of observability. Each pillar provides a different perspective of an organization's resources. When these data sources are combined and analyzed, the organization gains a holistic understanding of what's happening within its complex application environments.

Logs are files that record events, warnings and errors as they occur within a software environment. Most logs include contextual information, such as the time an event occurred and which user or endpoint was associated with it.

For example, a log file for a web server might include when the server started, requests from clients and how the server responded to those requests. It records information about each successful transaction as well as errors such as failed connections to clients.
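
For illustration, here's a small Python sketch that parses one line in the common (NCSA) access log format; the sample line and its values are made up:

import re

# One made-up line in the common (NCSA) access log format.
line = '203.0.113.7 - alice [25/Jul/2024:20:47:32 +0000] "GET /orders HTTP/1.1" 200 512'

pattern = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(line)
if match:
    record = match.groupdict()
    # Contextual fields: when the event occurred and who was associated with it.
    print(record["time"], record["user"], record["request"], record["status"])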

Errors and warnings are sometimes recorded in separate log files, but all types of logging data can be recorded in a single file. For observability purposes, it doesn't matter how logs are organized, as most observability tools aggregate data from multiple log files and analyze it collectively.

Logs are a pillar of observability because they provide a comprehensive record of all events and errors that take place during the lifecycle of software resources. If you want to know when a problem occurred, or which events or trends correlate with it, logs are an excellent source of visibility.

However, logs can have important limitations. One of the biggest is that they record only the events, warnings and errors the logging software has been configured to record. Unless your logging tools and settings are configured to register certain information, it won't appear in your log files.
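
Python's standard logging module shows this directly: events below the configured level never reach the log at all.

import logging

# Only WARNING and above are recorded; INFO events never reach the file.
logging.basicConfig(level=logging.WARNING, filename="app.log")

logging.info("cache miss for key user:42")    # silently dropped
logging.warning("retrying connection to db")  # recorded in app.log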

Another challenge with logs from an observability perspective is that log data isn't always persistent. For instance, in most cases logs created by containerized applications will disappear permanently when the container shuts down. Engineers can address this issue by moving the log data somewhere else while the container is still running, but there is still a risk that some log files will be overlooked or lost.
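
One common workaround is a small sidecar or agent that tails the log file and forwards each line to an external collector while the container is running. Below is a minimal sketch; the collector URL and log path are hypothetical:

import time
import urllib.request

COLLECTOR_URL = "http://log-collector.internal:8080/ingest"  # hypothetical
LOG_PATH = "/var/log/app/app.log"                            # hypothetical

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

# Forward each new log line to the collector while the container runs.
for line in follow(LOG_PATH):
    req = urllib.request.Request(COLLECTOR_URL, data=line.encode(), method="POST")
    urllib.request.urlopen(req)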

Metrics are quantifiable measurements that reflect the health and performance of applications or infrastructure. For example, application metrics might track how many transactions the application handles per second, while infrastructure metrics measure how many CPU or memory resources are consumed on a server.
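
As an example of an infrastructure metric, here's a sketch of a CPU gauge using the OpenTelemetry Python SDK and psutil; it assumes a MeterProvider is configured as in the earlier counter sketch, and the host label is made up:

import psutil  # pip install psutil
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("host-monitor")  # hypothetical meter name

def read_cpu(options: CallbackOptions):
    # Sampled each time the configured reader collects metrics.
    yield Observation(psutil.cpu_percent(), {"host": "web-01"})  # made-up label

meter.create_observable_gauge(
    "system.cpu.utilization",
    callbacks=[read_cpu],
    unit="%",
    description="CPU usage sampled from the host",
)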

There are many possible types of metrics that can be tracked. Two popular methods of defining metrics are Weaveworks' RED Method, which focuses on rates, errors and request duration; and Google's Golden Signals method, which measures latency, traffic, errors and saturation.
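
As a toy illustration of the RED method (not either method's official tooling), the sketch below accumulates request outcomes and derives rate, error ratio, and average duration:

from dataclasses import dataclass, field

@dataclass
class RedTracker:
    window_seconds: float = 60.0
    count: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)

    def record(self, duration_s: float, ok: bool) -> None:
        self.count += 1
        self.errors += 0 if ok else 1
        self.durations.append(duration_s)

    def snapshot(self) -> dict:
        return {
            "rate": self.count / self.window_seconds,  # requests per second
            "errors": self.errors / self.count if self.count else 0.0,
            "duration": sum(self.durations) / len(self.durations)
            if self.durations else 0.0,  # average seconds
        }

red = RedTracker()
red.record(0.12, ok=True)
red.record(0.48, ok=False)
print(red.snapshot())  # {'rate': ..., 'errors': 0.5, 'duration': 0.3}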

The main benefit of metrics is that they provide real-time insight into the state of resources. If you want to know how responsive your application is or identify anomalies that could be early signs of a performance issue, metrics are a key source of visibility.

By correlating metrics with data from logs and traces, organizations gain the fullest possible context on system performance or potential availability issues. This is why metrics are particularly important for observability.

However, like logs, metrics only keep track of the application and infrastructure data they were designed to record. In addition, metrics aren't typically useful for pinpointing the source of a problem, especially in a complex distributed system. For example, while metrics data might indicate that your application is experiencing a high rate of errors, metrics aren't granular or detailed enough to identify exactly which service within a microservices architecture is triggering the errors. Metrics only show that the application is experiencing errors.

A distributed trace is data that tracks an application request as it flows through the various parts of an application. The trace records how long it takes each application component to process the request and pass the result to the next component. Traces can also identify which parts of the application trigger an error.
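
A toy model makes this concrete: if each span records when a component started and finished, subtracting child time from each span shows where the request actually spent its time. The service names and timings below are invented:

from dataclasses import dataclass

@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    children: list

    def self_time_ms(self) -> float:
        # Time spent in this component itself, excluding child spans.
        return (self.end_ms - self.start_ms) - sum(
            c.end_ms - c.start_ms for c in self.children
        )

trace = Span("api_gateway", 0, 180, [
    Span("auth_service", 5, 25, []),
    Span("order_service", 30, 170, [
        Span("postgres_query", 40, 150, []),
    ]),
])

def walk(span: Span, depth: int = 0) -> None:
    print("  " * depth + f"{span.name}: {span.self_time_ms():.0f} ms self time")
    for child in span.children:
        walk(child, depth + 1)

walk(trace)  # postgres_query dominates with 110 ms of self time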

If you need to research the root cause of a problem, distributed traces are the most effective way to accomplish this. Although logs and metrics might help you know a problem exists, it's difficult to pinpoint the source of the problem in microservices environments without running traces.

The major limitation of distributed traces is that only a fraction of all application requests are traced in most cases. Running traces takes too much time and consumes too many resources to trace every request an application receives. This means you might not always have tracing data available when an error occurs.
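
This is why tracing backends sample. With OpenTelemetry's Python SDK, for example, a ratio-based sampler keeps only a fraction of traces:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Record roughly 10% of traces; the other 90% are never collected, so a
# failing request may have no corresponding trace.
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
trace.set_tracer_provider(provider)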

In addition, because every application request can be unique, the data in one distributed trace doesn't necessarily enable you to troubleshoot problems related to other requests. The data associated with the requests, endpoints and client-side configurations is likely to vary between requests, so the extent to which you can extrapolate on the basis of one trace to draw conclusions about the application as a whole is limited.

As noted earlier, logs, metrics and traces each provide a valuable, but limited, level of visibility into software environments. However, when you combine these sources, you get a relatively complete picture of what's happening in an environment.

For instance, you might notice from continuous metrics tracking that the application response rate is slowing down, which could indicate a performance issue. But before assuming there's a problem, you'd want to look at the application's logs to check whether the slower responses can be explained by a benign change, such as the app handling more complex transactions than it normally does. If you determine that the application performance degradation reflects a problem, you could then use distributed trace data to identify which specific microservice is triggering it.

Although analyzing logs, metrics and traces simultaneously enables engineers to gain a broad understanding of the state of an environment, teams should not limit themselves to these three data sources alone. The more data you have to inform observability workflows, the better.

It can be useful, for example, to contextualize logs, metrics and traces with data from a CI/CD pipeline to help you determine which application update or redeployment correlates with a performance degradation. Likewise, business metrics, such as customer retention rates, could be correlated with technical observability metrics to help gauge the effects of technical problems on business performance.
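
A toy version of that correlation is simple: keep a timeline of deployments and find the most recent one before a detected degradation. The versions and timestamps below are invented:

import bisect
from datetime import datetime

# Invented deployment timeline and degradation time.
deploys = [
    (datetime(2024, 7, 25, 9, 0), "v1.4.0"),
    (datetime(2024, 7, 25, 14, 30), "v1.4.1"),
]
degradation_at = datetime(2024, 7, 25, 14, 42)

deploy_times = [t for t, _ in deploys]
i = bisect.bisect_right(deploy_times, degradation_at) - 1
if i >= 0:
    print(f"Degradation at {degradation_at} follows deploy {deploys[i][1]}")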

If you want to observe cloud-native environments, start by collecting and analyzing logs, metrics and traces. These aren't the only potential sources of observability, but they are the most important ones, which is what makes them the three pillars of observability.

Log files are the historical records of your systems. They are timestamped and typically come in binary, plain-text, or structured formats. Structured logs can combine text with metadata to facilitate faster querying.
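
As a sketch of structured logging with only the Python standard library, a custom formatter can emit each record as JSON, combining the message text with queryable metadata:

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Message text plus queryable metadata in one structured record.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")  # emits one JSON object per line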

Metrics are numerical values that show how well a system is performing over a period of time. A metric's default structure is a set of attributes, including name, value, labels, and timestamp. This structure makes metrics faster and easier to query and optimizes storage, allowing you to retain them longer.
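
That default shape is easy to sketch as a small Python data structure; the field names follow the description above, and the sample values are invented:

import time
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    name: str
    value: float
    labels: dict
    timestamp: float

point = MetricPoint(
    name="http_requests_per_second",  # invented metric
    value=42.0,
    labels={"service": "checkout", "region": "us-east-1"},
    timestamp=time.time(),
)
print(point)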
