Designing Data-Intensive Applications is one of the best (if not the best) software engineering books that I have read. It explains very well the many characteristics and challenges that we can face in distributed systems.

This book covers a wide range of topics—from databases to messaging systems, transactions, and even clock skew. It’s definitely worth a second read, and I’ve been recommending it to all my coworkers.

This post is about a technical book, so it may be a bit different from the other posts in the Learnings from books collection: when I read technical books, I focus more on the parts that I want to learn and that are relevant to my current situation.

Learnings from the book Designing Data-Intensive Applications

As I said above, it is a very complete book, and one that everyone who works with microservices or any other distributed system should read.

The book is full of lessons, but to keep this post from getting too long, I will share my 5 most important learnings from it.

time is also relative in systems

I knew that I could not fully trust timestamps across machines, because their clocks can differ. What had not crossed my mind is that a single machine can see its clock jump too. In other words, if you ask the same machine for the time twice, the second reading can be far ahead of the first, or even earlier than it.

Basically, we have two types of clocks, time-of-day clocks and monotonic clocks.

A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar (also known as wall-clock time). A good example is Java’s System.currentTimeMillis(), which returns the number of milliseconds since the epoch: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds.

Time-of-day clocks are usually synchronized with NTP, which means that if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time.

A better choice for measuring time on a single machine is a monotonic clock. It is guaranteed to always move forward (whereas a time-of-day clock may jump back in time), which makes it great for measuring the time elapsed between two operations. The absolute value of a monotonic clock is meaningless; only the difference between two readings matters. In Java, you can read the monotonic clock by calling System.nanoTime().
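To make the difference concrete, here is a minimal sketch of measuring elapsed time with the monotonic clock. The class name ElapsedTimeExample and the helper timeIt are names I made up for illustration; only System.nanoTime(), System.currentTimeMillis() and java.time.Duration come from the JDK.

```java
import java.time.Duration;

final class ElapsedTimeExample {

    // Measures how long a task takes using the monotonic clock.
    // Only the difference between two nanoTime() readings is meaningful.
    static Duration timeIt(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        return Duration.ofNanos(elapsedNanos);
    }

    public static void main(String[] args) {
        Duration elapsed = timeIt(() -> {
            try {
                Thread.sleep(50); // stand-in for real work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Elapsed (monotonic): " + elapsed.toMillis() + " ms");

        // Wall-clock time is fine for timestamps, but risky for durations:
        // an NTP adjustment between two readings could make the
        // difference too large or even negative.
        long wallClockMillis = System.currentTimeMillis();
        System.out.println("Wall-clock millis since epoch: " + wallClockMillis);
    }
}
```

The point of the sketch is simply that durations are computed from two nanoTime() readings, while currentTimeMillis() is kept for "what time is it" questions.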

concurrency is not only things that happen at the same time

Usually, when developing software, I thought of concurrent operations as things that happen “at the same time”, overlapping in time. While that is true, the other important concept is that the operations are unaware of each other. If two operations don’t know about each other and they touch the same object, they can be considered concurrent.

As the book puts it, “For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.”
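One way to picture "unaware of each other" is with version vectors, a technique the book covers for detecting concurrent writes. The sketch below is my own simplified illustration, not code from the book: the class VersionVector, the Order enum and the replica names "A" and "B" are all made up. Each replica keeps a counter, and two versions are concurrent when neither has seen everything the other has.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VersionVector {

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    // One counter per replica id; a missing entry counts as 0.
    private final Map<String, Long> counters = new HashMap<>();

    // Returns a new vector that records one more write at the given replica.
    VersionVector increment(String replicaId) {
        VersionVector copy = new VersionVector();
        copy.counters.putAll(counters);
        copy.counters.merge(replicaId, 1L, Long::sum);
        return copy;
    }

    // Compares this vector with another, counter by counter.
    Order compare(VersionVector other) {
        boolean someSmaller = false;
        boolean someLarger = false;
        Set<String> replicas = new HashSet<>(counters.keySet());
        replicas.addAll(other.counters.keySet());
        for (String replica : replicas) {
            long mine = counters.getOrDefault(replica, 0L);
            long theirs = other.counters.getOrDefault(replica, 0L);
            if (mine < theirs) someSmaller = true;
            if (mine > theirs) someLarger = true;
        }
        // Each side has seen something the other has not: unaware of each other.
        if (someSmaller && someLarger) return Order.CONCURRENT;
        if (someSmaller) return Order.BEFORE;
        if (someLarger) return Order.AFTER;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        VersionVector base = new VersionVector().increment("A");
        VersionVector writeOnA = base.increment("A"); // replica A writes
        VersionVector writeOnB = base.increment("B"); // replica B writes without seeing A's write
        System.out.println(base.compare(writeOnA));     // BEFORE
        System.out.println(writeOnA.compare(writeOnB)); // CONCURRENT
    }
}
```

Notice that no timestamp appears anywhere: the two writes are concurrent because neither one builds on the other, no matter when they physically happened.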

not all ACIDs are the same

The ACID acronym refers to the 4 key properties that define a transaction: Atomicity, Consistency, Isolation, and Durability. It seems to be a very popular interview subject, and it is a good concept to know and to keep in mind when developing software.

But one important thing is that not all databases implement these properties the same way. As the book explains, “there is a lot of ambiguity around the meaning of isolation. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.”

advantages of log-based message broker

I have been using RabbitMQ as a message broker for a few years now and haven’t actually used others like Kafka. In RabbitMQ, when we consume a message from the queue and acknowledge it, it is removed from the queue and deleted. In log-based message brokers (like Kafka), reading a message does not delete it from the log, so we can replay messages if we want. I found that a really good feature: with a traditional queue we can sometimes “lose” messages, and since they are not persisted after being acknowledged (unless the application stores them itself), it is really hard to replay them.
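The idea of replaying can be sketched with a toy in-memory log. This is not Kafka's API or anything close to a real broker; InMemoryLog, append and readFrom are hypothetical names I chose for illustration. The only point is that reading is non-destructive and the consumer decides which offset to start from.

```java
import java.util.ArrayList;
import java.util.List;

final class InMemoryLog {

    // Append-only list of messages; the index of a message is its offset.
    private final List<String> records = new ArrayList<>();

    // Appends a message and returns the offset it was stored at.
    int append(String message) {
        records.add(message);
        return records.size() - 1;
    }

    // Returns a copy of all messages from the given offset onwards.
    // Reading never removes anything from the log.
    List<String> readFrom(int offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return new ArrayList<>(records.subList(offset, records.size()));
    }

    public static void main(String[] args) {
        InMemoryLog log = new InMemoryLog();
        log.append("order-created");
        log.append("order-paid");
        log.append("order-shipped");

        // A consumer that already processed offset 0 continues from offset 1.
        System.out.println(log.readFrom(1));

        // After fixing a bug, the same consumer can rewind and replay everything.
        System.out.println(log.readFrom(0));
    }
}
```

With an acknowledge-and-delete queue, the second read would have nothing left to return. Here the consumer just moves its offset back.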

a correct algorithm does not guarantee correct behavior in distributed systems

When we design software, we try to make sure that our code works correctly and that the logic has no bugs. But even perfectly working code does not guarantee that the system will always behave correctly. One of the many possible reasons is that a “process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused.”
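One remedy the book discusses for this situation is a fencing token: a number that increases every time a lock or lease is granted, and that the storage service checks on every write. Below is a minimal sketch of the storage side only, under my own assumptions; FencedStorage and its methods are hypothetical names, and a real system would also need the lock service that hands out the tokens.

```java
final class FencedStorage {

    private long highestTokenSeen = -1; // no token seen yet
    private String value;

    // Accepts the write only if the token is not older than one already seen.
    // Returns true if the write was applied, false if it was rejected as stale.
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestTokenSeen) {
            return false; // the writer was paused and its lease has been superseded
        }
        highestTokenSeen = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }

    public static void main(String[] args) {
        FencedStorage storage = new FencedStorage();

        // Client 1 gets the lease with token 33, then pauses (say, a long GC).
        // Meanwhile the lease expires and client 2 gets token 34 and writes.
        System.out.println(storage.write(34, "from client 2")); // true

        // Client 1 wakes up, still believing it holds the lease.
        System.out.println(storage.write(33, "from client 1")); // false

        System.out.println(storage.read()); // from client 2
    }
}
```

The paused client's code is still "correct" on its own. It is the storage service rejecting the stale token that keeps the overall behavior correct.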


These are my 5 most important learnings from the book Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems written by Martin Kleppmann.

Happy coding!


Liked this post? Check out the other posts in the series Learnings from books, where my goal is to share what I learned from the books that I read. It is a mixture of review and summary with a bit of my opinion and point of view.