The book Mastering Distributed Tracing: Analyzing Performance in Microservices and Complex Systems is the second book about distributed tracing that I have read, and it is by far the better of the two.

I first read Distributed Tracing in Practice, but this one is much more practical. The author shows a bunch of practical use cases and code examples for various situations (usually in three languages: Go, Java, and Python). So, if you feel like reading a book about distributed tracing, this is the one I would recommend.

This post is about a technical book, so it may be a bit different from the other posts in the Learnings from books collection, as when I read technical books I focus more on the parts that I am looking to learn and that apply to my current situation.

This book is very hands-on and practical, so if you are new and want to understand how to add tracing, propagate context, or anything else related to distributed tracing, it will have the information you need.

The book uses Jaeger as the tracing backend and has a lot of OpenTracing examples. They are good for understanding the concepts, but today I would prefer OpenTelemetry, as it is the merger of OpenTracing and OpenCensus.

If you are new to the whole distributed tracing world, it may be good to read my other post Observability Basics first.

Writing a post about a second book on the same subject is a bit tricky, so I will try not to repeat much of what I wrote about Distributed Tracing in Practice and focus on what is new and good.

Learnings from Mastering Distributed Tracing

The book was written by the technical lead of the tracing team at Uber Technologies, where Jaeger was born, so it makes sense to use Jaeger as the tracing backend.

One of the things the author mentions that I completely agree with is that “building a tracing system is ‘the easy part’; getting it widely adopted in a large organization is a completely different challenge altogether”. I see that in many places. People have to change the way they think, moving from the local-logs mentality to reasoning about a more distributed and complex system.

That is because building a distributed system is hard, but debugging one is even harder, as there are many parts and many more places for errors.

Observability

The book talks about a definition of observability that I like: “observability for a software system is its “capability to allow a human to ask and answer questions”. The more questions we can ask and answer about the system, the more observable it is.”

It is not that an observable system will tell you what is going on, but it allows you to dig deeper and ask many questions. So every time you add a span or a log, ask yourself: will this help me answer any question or not?
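To make that concrete, here is a minimal sketch, assuming the OpenTelemetry Python SDK (the book itself uses OpenTracing), of a span that carries enough data to answer a question like “which customers see slow checkouts?”. The checkout function and the attribute names are hypothetical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.checkout")

def checkout(customer_id: str, cart_total: float) -> None:
    with tracer.start_as_current_span("checkout") as span:
        # Attributes turn "something was slow" into an answerable question:
        # which customer was it, and how big was the cart?
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("cart.total", cart_total)
        # ... business logic ...
```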

The book also talks about the “three pillars of observability” (metrics, logs, and traces) and how traditional monitoring tools were designed for monolithic systems, lacking the context needed to understand distributed systems.

Distributed Tracing basics

The basic concept of distributed tracing is simple: we add instrumentation at chosen points of the program’s code (tracepoints), and when executed it produces profiling data that helps us understand how the system works (makes it observable, as we said above).

This profiling data is collected in a central location, correlated to the specific execution context, ordered by causality, and combined into a trace that can be visualized or further analyzed.

In other words, for the same request, we collect many data points, group them, order them, and create a single view that will help us understand all the interactions for that given request.

As the author said, “the ability to collect and correlate profiling data for a given execution or request initiator, and identify causally-related activities, is arguably the most distinctive feature of distributed tracing, setting it apart from all other profiling and observability tools.”
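As a rough illustration of that correlation, here is a minimal sketch, again assuming the OpenTelemetry Python SDK rather than the book’s OpenTracing examples: two tracepoints in the same request produce spans that share one trace ID, which is what lets the backend group and order them. The service and operation names are made up:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console just to see the shared trace ID.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("frontend")

def handle_request() -> None:
    with tracer.start_as_current_span("handle-request"):      # root span
        with tracer.start_as_current_span("call-inventory"):  # child span, same trace
            pass

handle_request()  # both spans are printed with the same trace ID
```

Across process boundaries the same trace ID travels in request headers (context propagation), so the backend can stitch spans from different services into that single view.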

For more on how it works, check the post Observability Basics.

Running it in production

No tracing solution is complete until it runs in production, and that is the most challenging part. One of the things I had not thought about before reading the book is the importance of considering the cost of tracing in production, mainly the performance cost, as I was already aware of the storage costs.

When we gather monitoring data in production, we are always making a trade-off between how granular we want the monitoring data to be and the costs.

The costs are both in data storage and in the performance overhead of collecting the monitoring data. “The more data we collect, the better we hope to be able to diagnose the situation, should something go wrong, yet we don’t want to slow down the applications or pay exorbitant bills for storage.”

When adding distributed tracing, in addition to the traditional logging levels, you have to pay attention to the verbosity of the instrumentation, as tracing data can easily exceed the volume of the actual business traffic sustained by an application.

Regarding performance, “Collecting all that data in memory, and sending it to the tracing backend, can have a real impact on the application’s latency and throughput.”

Storage is also important: if you use an external service, you will probably pay by usage (number of events sent); if you run an in-house solution, you will need to worry about the storage and the performance of the analysis tooling.
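As a sketch of how that collection overhead is usually kept in check (assuming the OpenTelemetry Python SDK and a hypothetical local OTLP endpoint, neither of which comes from the book), spans can be buffered and exported in batches off the request path, with a bounded queue capping how much memory tracing can consume:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # Hypothetical collector address; adjust for your environment.
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True),
        max_queue_size=2048,         # drop spans instead of growing memory unboundedly
        schedule_delay_millis=5000,  # export on a background thread every few seconds
        max_export_batch_size=512,   # send spans in batches, not one by one
    )
)
trace.set_tracer_provider(provider)
```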

Sampling is important

Sampling helps reduce these costs. We basically have two ways to sample: head-based and tail-based.

Head-based consistent sampling, also known as upfront sampling, makes the sampling decision at the beginning of the trace. As the decision is made upfront, it is not aware of the whole trace. But since the decision propagates along with the trace across the applications it touches, the instrumentation knows it can discard (does not need to collect) a trace that was not sampled, reducing memory as well as storage.

Tail-based sampling makes the sampling decision at the end of the trace, so you can analyze the whole trace before deciding. This helps us make a more intelligent decision on what to keep and what to discard. The trade-off is that you need to collect the whole trace, so you do not reduce the application overhead (you only save storage).
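Here is a minimal sketch of head-based sampling, assuming the OpenTelemetry Python SDK (the 10% ratio and the names are arbitrary examples): the decision is made once at the root of the trace, and child spans follow the parent’s decision.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; downstream spans honor the parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("handle-request") as span:
    # Unsampled spans are still created so the context can propagate,
    # but they are not recorded or exported.
    print(span.is_recording())
```

Tail-based sampling, by contrast, is usually configured in the collection pipeline (for example, the OpenTelemetry Collector has a tail sampling processor) rather than in the application itself.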

Organizational challenges

When starting to implement distributed tracing in your company, start with the workflows that are most important to your business, as that is where you will probably see the most value and be able to showcase it to other engineers and management.

Someone needs to be in charge of actually deploying and maintaining the tracing infrastructure so that the users can reap the benefits. If your company is large enough, maybe you can have a dedicated tracing team; otherwise, an existing team (platform, infrastructure) will have to be responsible for it.

The hardest part will be convincing people and spreading the adoption of the libraries and frameworks, or just changing the mentality from classic logging to distributed tracing. End-to-end tracing has a clear value proposition, but people just don’t know about it, so you will need to get the word out.

If possible, standardize on a small set of technologies: it will be easier to transfer skill sets between teams, more efficient, and also easier to create shared libraries if you need them.

Also, try to integrate the tracing tools into engineers’ daily routines. This will help create a virtuous feedback loop where they start suggesting new features or even building their own analysis tools based on the tracing data.

And lastly, look for an open-source solution (like OpenTelemetry). Instrumenting a code base is expensive, so you want to avoid vendor lock-in with vendor-specific libraries.

Favorite quotes

These are my 5 favorite quotes from the book.

“In the microservices-ruled world, end-to-end tracing is no longer a “nice-to-have” feature, but “table stakes” for understanding modern cloud-native applications.”

“As soon as we start building a distributed system, traditional monitoring tools begin struggling with providing observability for the whole system, because they were designed to observe a single component, such as a program, a server, or a network switch”

“Collecting all that data in memory, and sending it to the tracing backend, can have a real impact on the application’s latency and throughput”

“Sometimes, when organizations rush to adopt microservices and break up the existing monoliths, they end up with a “distributed monolith” characterized by a high number of microservices that are tightly interconnected and dependent on each other”

“As you get serious about tracing your distributed architecture, you may start producing so much tracing data that the cost of bandwidth for sending all this data to a hosted solution may become a problem, especially if your company’s infrastructure operates in multiple data centers”


These are my learnings from the book Mastering Distributed Tracing: Analyzing performance in microservices and complex systems. If you want more context related to observability, check the posts with the tag Observability.

Happy coding!