At work, I am involved with observability, and to improve my knowledge I went looking for books about observability, telemetry, and everything related to them. During this search I found the book Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging Microservices.

This is the first Learnings post about a more technical book, so it may be a bit different from the other posts (check the Learnings from Books Series): when I read a technical book, I focus on the parts that I am looking to learn and just skim the parts that I am not interested in.

The book is good and has very useful information. If you are new to tracing or at the beginning of your journey to understand it better, it may be worth it for you.

If you are new to the whole distributed tracing world, it may be good to read my other post, Observability Basics, first.

Learnings from Distributed Tracing in Practice

The industry moved to microservices to gain some independence during development and deployment (CI/CD). But when the services run in production, they are interdependent: when one slows down, the others that depend on it slow down as well.

“Distributed architectures give clear benefits, especially with scalability, reliability, and maintainability. The biggest drawback, however, is that distributed architectures break traditional methods of profiling, debugging, and monitoring.”

As everything is connected and interdependent we cannot rely only on logs and metrics about a specific microservice, and distributed tracing is part of the solution. “Without tracing data, we are reduced to guess-and-check across seas of disorganized logging data and metrics dashboards”.

The book defines distributed tracing as follows: “Distributed tracing […] is a type of correlated logging that helps you gain visibility into the operation of a distributed software system for use cases such as performance profiling, debugging in production, and root cause analysis of failures or other incidents. It gives you the ability to understand exactly what a particular individual service is doing as part of the whole, enabling you to ask and answer questions about the performance of your services and your distributed system as a whole”.

The keyword here is understanding. Distributed tracing enables us to understand the whole request. As the software grows deeper, with more services involved in each request, logs and metrics alone are not enough to quickly identify the problems that happen in production.

It does that by providing “context that spans the life of a request and can be used to understand the interactions and shape of your architecture”.

You get the most value from these spans when you annotate them with metadata (known as attributes or tags) and events (also referred to as logs) that help you understand the inner state of the application at that moment.
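To make the idea of an annotated span more concrete, here is a minimal sketch in plain Python. It is a toy model I wrote for illustration, not the API of OpenTelemetry or of any tracing library, and the names (Span, set_attribute, add_event, charge_card, payment.provider) are all my own assumptions.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span: a named, timed operation that belongs to a trace."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)  # key/value metadata (tags)
    events: list = field(default_factory=list)      # timestamped log-like entries

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, **fields):
        self.events.append({"name": name, "time": time.time(), **fields})

    def finish(self):
        self.end = time.time()


# Hypothetical usage: one span for a payment operation.
span = Span(name="charge_card", trace_id=secrets.token_hex(16))
span.set_attribute("payment.provider", "example-pay")
span.set_attribute("cart.items", 3)
span.add_event("retrying", attempt=2)
span.finish()
```

The attributes describe the span as a whole, while each event records something that happened at a specific moment during the span.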

The book also covers the different ways to add instrumentation: agent-based and library-based. It also stresses that you should add manual instrumentation for the important things.

The book also talks about the importance of using an open-source and common standard, like OpenTelemetry, to avoid vendor lock-in with the telemetry provider. The book also cites another example, but I believe OpenTelemetry is the most important.

One important point made by the authors is that while we have to add information and trace the majority of the operations of our system, we also have to pay attention to the trade-off. Adding spans, logs, and other telemetry has a cost in memory, processing, network, data, etc. So be mindful that instrumentation is not free: the ability to understand the system comes at a cost. Adding spans can add latency, reduce throughput, and increase your infrastructure costs.

When you reach a high volume, you can consider sampling your spans, because not all spans have equal value. That way you still get the value of tracing and a return on your investment.
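As an illustration, the simplest form of sampling I can think of is to keep a fixed fraction of traces and decide from the trace ID alone, so that every service makes the same decision for the same trace. The sketch below assumes trace IDs are uniformly random hex strings; the function name should_sample and the rate are my own choices, not something taken from the book. Note that this treats all traces equally, so it does not capture the idea that some spans are worth more than others.

```python
import secrets


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, deciding from the trace ID alone."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Read the last 16 hex digits (64 bits) of the ID as a number in [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


# Hypothetical usage: keep about 10% of 10,000 randomly generated traces.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids traces with missing spans.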

Tips on how to better instrument your application

Below are some practical tips that you can use when instrumenting your application.

The book indicates that for serverless and microservices, a good option is to start with white-box instrumentation. White-box instrumentation is possible when you have access to the source code.

One good practice is to create attributes (or tags) for important things: everything that you think can help you understand the behavior of the system in production. Tags are the main way to enrich the information on a span. As the book puts it, “tags will allow you to slice and dice that information to better understand the why of your query”.

In general, don’t try to trace extremely long operations. Distributed tracing works best when the entire traced operation takes a few seconds or a few minutes.

When defining the span names, they should describe actions, not resources. Also, think about aggregation when you name the span. Which name will help you aggregate many calls to discover a pattern?
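Here is a small sketch of why the name matters for aggregation. The spans and their durations are made up for the example. The span name is the action (get_user), and the specific resource goes into an attribute (user.id), so all calls fall into the same group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical finished spans: (name, attributes, duration in milliseconds).
spans = [
    ("get_user", {"user.id": 17}, 12.0),
    ("get_user", {"user.id": 42}, 18.0),
    ("get_user", {"user.id": 99}, 240.0),
    ("charge_card", {"order.id": "A-1"}, 95.0),
]


def average_duration_by_name(spans):
    """Group spans by name and return the mean duration of each group."""
    durations = defaultdict(list)
    for name, _attributes, duration_ms in spans:
        durations[name].append(duration_ms)
    return {name: mean(values) for name, values in durations.items()}


averages = average_duration_by_name(spans)
print(averages)  # {'get_user': 90.0, 'charge_card': 95.0}
```

If the spans had been named after the resource instead (for example one name per user ID), each call would end up in its own group and the slow call for user 99 would not stand out against the other get_user calls.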

For longer or loosely related operations, you can add links that declare the relationship between the traces.
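A link can be pictured as a reference from a span in one trace to a span in another trace. The sketch below is a toy model under my own assumptions (the SpanContext and Span classes, the new_span helper, and the report example are all hypothetical): a request enqueues a job, and the worker that processes it later starts its own trace that links back to the original one.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    """The part of a span that identifies it: which trace, which span."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    context: SpanContext
    links: list = field(default_factory=list)  # contexts of related spans


def new_span(name, trace_id=None, links=()):
    """Start a span in the given trace, or in a brand-new trace if none is given."""
    context = SpanContext(trace_id or secrets.token_hex(16), secrets.token_hex(8))
    return Span(name, context, list(links))


# The request that enqueues the job lives in one trace...
enqueue = new_span("enqueue_report")
# ...and the worker that runs it later gets its own trace, linked to the first.
process = new_span("generate_report", links=[enqueue.context])
```

The two traces stay short and separate, but the link keeps the relationship between them so you can jump from one to the other.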

Spans have a status. Use it to classify spans, for example by marking failures with the error status, so that you can better understand the system.
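A minimal sketch of how a status could be set automatically, again as a toy model and not the API of a real tracing library: a context manager marks the span as an error when the wrapped code raises. The names (Status, traced, finished, error.type) are my own assumptions, and marking every successful span as OK is a simplification of this sketch.

```python
from contextlib import contextmanager
from enum import Enum


class Status(Enum):
    UNSET = "unset"
    OK = "ok"
    ERROR = "error"


class Span:
    def __init__(self, name):
        self.name = name
        self.status = Status.UNSET
        self.attributes = {}


finished = []  # stands in for whatever would export the spans


@contextmanager
def traced(name):
    """Run a block inside a span and record whether it failed."""
    span = Span(name)
    try:
        yield span
    except Exception as exc:
        span.status = Status.ERROR
        span.attributes["error.type"] = type(exc).__name__
        raise  # the span records the error but does not swallow it
    else:
        if span.status is Status.UNSET:
            span.status = Status.OK
    finally:
        finished.append(span)


# Hypothetical usage: one failing operation and one successful operation.
try:
    with traced("charge_card"):
        raise TimeoutError("payment provider did not answer")
except TimeoutError:
    pass

with traced("get_user"):
    pass

errors = [s.name for s in finished if s.status is Status.ERROR]
print(errors)  # ['charge_card']
```

With the status on the span, finding the failing operations is a simple filter instead of a search through log messages.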

Also, “named and tagged spans can significantly reduce the amount of logging statements required by your code. When in doubt, make a new span rather than a new logging statement”.

And lastly, a good rule is that your service should emit as many spans as logical operations it performs.


The book also talks about metrics, shows some code examples, and covers the history of tracing, context propagation, Zipkin, and many other things that I cannot cover here. If you are interested in digging deeper, you can pick up a copy of the book Distributed Tracing in Practice.

Happy coding!


Liked this post? Check out the other posts in the series Learnings from books, where my goal is to share what I learned from the books that I read. It is a mixture of review and summary with a bit of my opinion and point of view.