It was Friday afternoon when one of the product managers told me they could not see alerts for specific API calls to our system in a service we had upgraded from Spring Boot 2.7.x to 3.2.x. We checked the triggers to see if they were configured correctly, and they were. We also could not see the distributed traces of those calls in our observability tool (Honeycomb).
One of the possible scenarios was that our sampling at the collector level was sampling (discarding) these API requests. Since the service generates a good volume of traces, we had introduced a tail-based sampling tool called Refinery, which is provided by our vendor, Honeycomb. What it does is analyze the whole trace (or part of it, if it is too long) and check it against the configured rules to decide whether to sample it or not.
There was a small possibility that Refinery was the problem, but according to our sampling rules, we would only be sampling requests that happen very frequently in a short timeframe, not specific API calls that are relatively rare.
But anyway, we decided to exclude that specific API call from the sampling rules to check whether it was being discarded by our sampling or not. And guess what: it was not. (A sketch of that exclusion rule follows below.)
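To give an idea of what that exclusion looked like, the rule was conceptually something like the sketch below. This is a simplified illustration based on the general shape of Refinery's rules-based sampler config, not our actual file, and the route value is a placeholder, so double-check the field names against the Refinery docs for the version you run:

```yaml
RulesVersion: 2
Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        # Always keep traces for the endpoint we were investigating
        # ("/api/the-endpoint" is a placeholder, not our real route).
        - Name: keep the suspect API call
          SampleRate: 1
          Conditions:
            - Field: http.route
              Operator: =
              Value: /api/the-endpoint
        # Everything else falls through to the normal sampling behaviour.
        - Name: default
          SampleRate: 10
```

With a rule like this in place, if the trace still did not show up in Honeycomb, the collector-level sampling could be ruled out, which is exactly what happened to us.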
The thing is, there was no trace or span for the API call after our NGINX reverse proxy. In the downstream services we could see some log lines containing the trace ID, but we could not find its spans or the trace itself. So the trace was being generated, but for some reason it was not arriving at the observability tool.
After a lot of debugging and checking where the trace should be starting in our API gateway, I noticed that some of the spans started by Micrometer were no-op spans instead of real ones.
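If you want to check this in your own service, a tiny debug endpoint like the one below makes it visible. The controller and path are mine, purely for illustration; it just asks Micrometer's `Tracer` for the current span and prints whether it is actually sampled (exported) or only carries a trace ID for log correlation:

```java
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical debug endpoint: dumps the current trace context so you can see
// whether the span Micrometer created for this request will be exported or not.
@RestController
class TraceDebugController {

    private final Tracer tracer;

    TraceDebugController(Tracer tracer) {
        this.tracer = tracer;
    }

    @GetMapping("/debug/trace")
    String currentTrace() {
        Span span = tracer.currentSpan();
        if (span == null) {
            return "no current span (the tracer is likely a no-op)";
        }
        // sampled() == false means the span still has a trace ID (so it shows up
        // in the logs) but will never be exported to the backend, which matches
        // exactly what we were seeing.
        return "traceId=" + span.context().traceId()
                + " sampled=" + span.context().sampled();
    }
}
```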
Digging a little deeper, and after a lot more debugging, we found out that Spring Boot 3 introduced a property that sets a default sampling probability under which 90% of traces are discarded and only 10% are kept. Here again, by sampling I mean discarding. Spring uses the inverse of my terminology: for them, sampling means keeping a trace, whereas I use sampling to mean discarding one.
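For reference, the property is `management.tracing.sampling.probability`, which Spring Boot 3 defaults to 0.1 (keep 10% of traces). Setting it explicitly in `application.properties` restores the "export everything" behaviour we had before the upgrade:

```properties
# Spring Boot 3 defaults this to 0.1, i.e. only 10% of traces are exported.
# Set it to 1.0 to export every trace (we already do tail-based sampling in Refinery).
management.tracing.sampling.probability=1.0
```

This made sense for us because the sampling decision belongs in Refinery, where we can look at the whole trace, not in each service at span-creation time.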
I was surprised and worried at the same time, because I realized we had been missing 90% of our traces since the upgrade. I was also pretty pissed that it was a hidden property enabled by default. I know the people who work on the Spring projects are a lot smarter than I am, but I believe this property should be opt-in.
We should always send 100% of the traces by default; if you want to, you can lower it or disable tracing, but the default should not be to send only a small subset of the data. I understand they did it to avoid overloading services, but that should be a choice for the service developer, not a default.
I was glad that during the time the traces were missing we had no big problems, so we did not need to dig deep; but if we had had a severity-one issue during that period, it would have been really hard to find the calls, because most of the traces were gone.
After that, I updated my post Migrating Spring Sleuth to Micrometer Observability in Spring Boot 3 to include this change.
Despite the bad effect on the system, it was also a very good learning opportunity, because I had to go a bit deeper into how traces are started in the services and how that differs in the new version compared to Spring Boot 2.
Cheers.