RabbitMQ is one of the most popular and widely deployed open-source message brokers. When taking it to production, one of the main concerns is how to keep queues and exchanges highly available. In this post, I will outline some strategies for making your RabbitMQ highly available.

RabbitMQ approaches high availability by replicating data, just as storage solutions such as databases and other infrastructure do when data integrity and service continuity are of primary importance.

The main objective is not only to avoid data loss but also to reduce downtime caused by both scheduled maintenance and system malfunctions.

It is also important to think about HA in cloud environments such as Amazon Web Services (AWS), where it is highly recommended to distribute your deployment across availability zones to get a more reliable setup.

Let’s talk about some approaches that can help your RabbitMQ to have higher availability.

Clustering

A RabbitMQ cluster creates a seamless view of RabbitMQ across two or more servers. The clustering built into RabbitMQ was designed with two goals in mind: allowing consumers and producers to keep running in the event one RabbitMQ node dies, and linearly scaling messaging throughput by adding more nodes.

The clustering mechanism replicates all the data and states across all the nodes for reliability and scalability. In a RabbitMQ cluster, a runtime state containing exchanges, queues, bindings, users, virtual hosts, and policies is available to every node.

Because of this shared runtime state, every node in a cluster can bind to, publish to, or delete an exchange that was created while connected to another node. The structure of the cluster changes dynamically as nodes are added or removed, and RabbitMQ tolerates the failure of individual nodes.

Despite the advantages of using RabbitMQ’s built-in clustering, it has limitations and downsides. One is that clusters are designed for low-latency environments. You should never create RabbitMQ clusters across a WAN or internet connection. State synchronization and cross-node message delivery demand low-latency communication that can only be achieved on a LAN.

Another important issue to consider with RabbitMQ clusters is cluster size. The work and overhead of maintaining the shared state of a cluster is directly proportional to the number of nodes in the cluster, so each new node makes the overall cluster more complex and adds management overhead.

Node types

When you add a node to your cluster, you must choose one of two node types, which determine where the node stores its state: disk nodes or RAM nodes. A RAM node keeps its state only in memory, whereas a disk node keeps its state both in memory and on disk.

Classic queues mirroring

RabbitMQ clusters don’t mirror queues by default. Each queue is stored on the broker node the client that created it was connected to. Whenever that node fails, all the queues and the messages stored on it become unavailable.

If you have defined the queues as durable and the messages as persistent, it’s possible to restore the node without losing data, but that is not sufficient if you want to handle the death of one component without interrupting the service.

The ha-* policies help solve this problem by mirroring a queue across the nodes of the cluster. Replicating messages to more than two nodes improves availability only minimally, and if the cluster grows because of a higher load, mirroring everywhere will negatively impact the application’s performance.

A few policies that you can take a look at:

ha-mode

ha-mode defines which members of the cluster the queue will be mirrored to; I usually use all.
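As an illustration, here is a minimal sketch of how such a policy could be created through the RabbitMQ management HTTP API using Python (assuming the management plugin is enabled on localhost:15672 and the default guest credentials; the policy name and queue pattern are hypothetical). The same result can be achieved with rabbitmqctl set_policy or from the Management UI.

    import requests

    # Mirror every queue whose name starts with "ha." to all nodes in the cluster.
    # "%2F" is the URL-encoded default virtual host "/".
    policy = {
        "pattern": "^ha\\.",
        "definition": {"ha-mode": "all"},
        "apply-to": "queues",
    }
    response = requests.put(
        "http://localhost:15672/api/policies/%2F/ha-all",
        auth=("guest", "guest"),
        json=policy,
    )
    response.raise_for_status()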

ha-sync-mode

ha-sync-mode defines how nodes synchronize the queue when they join the cluster.

The default option is manual, meaning that there will be no synchronization until you trigger it yourself (by pressing the button in the Management UI, for example). Because of that, the new mirror will not receive the existing messages, only new ones. At some point, when the old messages have been consumed, the new node will be fully synchronized with the leader.

And automatic means that the node will synchronize everything as soon as it joins the cluster.

The most important thing to consider here is that synchronization is a blocking operation, meaning that while the node is syncing with a queue, that queue will be blocked in all nodes and will not receive or allow the consumption of messages. Or in other words, “while a queue is being synchronized, all other queue operations will be blocked.”

Depending on many factors, like the size of the queue, the syncing operation can take minutes, hours, or days, and while it is syncing, you basically cannot use the queue. You may even lose messages if your publishers are not using publisher confirms.

Leaving the value at manual means that the queue will not enter the syncing state, so it will not be blocked, but you run the risk of losing data in the unfortunate event that all the other nodes holding the old messages die.

You will have to weigh, for each queue, which is the less bad scenario: running the risk of having the queue blocked for a long time, or running the risk of the queue being out of sync for a while, until the consumers drain the old messages that are not present on the new node.
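If you stay with manual, you can trigger synchronization explicitly when it is least disruptive. Below is a rough sketch using the queue actions endpoint of the management HTTP API (queue name and credentials are placeholders); rabbitmqctl sync_queue does the same from the command line.

    import requests

    # Ask the broker to start synchronizing the mirrors of one specific queue.
    response = requests.post(
        "http://localhost:15672/api/queues/%2F/ha.orders/actions",
        auth=("guest", "guest"),
        json={"action": "sync"},
    )
    response.raise_for_status()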

Check here for more info on synchronization.

ha-sync-batch-size

You can use the ha-sync-batch-size policy to define how many messages are synchronized in each batch. It should not be too small (like 1), because that can make the synchronization take a long time, nor too large, because that can cause network problems. The RabbitMQ docs advise taking the average message size, the network throughput between RabbitMQ nodes, and the net_ticktime value into consideration when defining the batch size. Read more here.
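Continuing the policy sketch from above, the batch size is just another key in the policy definition (the value below is an arbitrary example, not a recommendation):

    # Hypothetical policy definition combining mirroring, sync mode, and batch size.
    definition = {
        "ha-mode": "all",
        "ha-sync-mode": "automatic",
        "ha-sync-batch-size": 4096,  # number of messages synchronized per batch
    }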

queue-mode

As the docs state, “By default, queues keep an in-memory cache of messages that is filled up as messages are published into RabbitMQ. The idea of this cache is to be able to deliver messages to consumers as fast as possible. Note that persistent messages can be written to disk as they enter the broker and kept in RAM at the same time.”

Because of that, they introduce the lazy queues, “Lazy queues attempt to move messages to disk as early as practically possible. This means significantly fewer messages are kept in RAM in the majority of cases under normal operation. This comes at a cost of increased disk I/O.”

So lazy moves messages to disk as early as possible, while default keeps them in memory according to internal RabbitMQ rules that take memory and other resources into account. The risk here, which I have witnessed, is that when synchronizing a large number of messages with the default value, we hit the high memory watermark many times and the sync failed a few times (tested while syncing a queue whose size in bytes was larger than the available memory). To be able to sync, we had to set the queue to lazy.

lazy seems to be the safer alternative: since it stores messages on disk, if all nodes fail at the same time, we will not lose the messages when the cluster comes back, as they were saved to disk. But you should take care, as it can increase your disk usage during normal operations.
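Besides setting it through a policy, the queue mode can also be chosen per queue at declaration time. A minimal sketch with the pika client (the queue name is hypothetical):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Declare a durable classic queue that moves messages to disk as early as possible.
    channel.queue_declare(
        queue="events",
        durable=True,
        arguments={"x-queue-mode": "lazy"},
    )
    connection.close()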

To read more about lazy queues, check this page of the docs.

Quorum queues

Introduced in version 3.8, the quorum queue is a modern queue type for RabbitMQ implementing a durable, replicated FIFO queue. It is preferred when we need queues replicated across nodes.

The quorum queue type is an alternative to durable mirrored queues, purpose-built for a set of use cases where data safety is a top priority. I have not explored this queue type yet, but it is meant to be a better option for highly available data.
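Declaring one is mostly a matter of setting the queue type; a minimal sketch with pika (the queue name is hypothetical):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Quorum queues must be durable; the type is selected via the x-queue-type argument.
    channel.queue_declare(
        queue="orders",
        durable=True,
        arguments={"x-queue-type": "quorum"},
    )
    connection.close()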

Federation

Federation, put simply, means transmitting messages between brokers. It is one of the powerful ways of handling lots of messages while using multiple RabbitMQ servers. The main goal of federation is to transmit messages between brokers without the need for clustering. It is a better fit when you want a loosely coupled, WAN-friendly, and scalable solution.

The idea is that when you have higher latency and geographic distance, you should consider federation instead of clustering. Federation works in an asynchronous fashion. If you are using AWS, for example, and connecting brokers across different Availability Zones, you should prefer federation over clustering.

The federation plugin is not enabled by default, but you can enable it.
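Once the plugin is enabled (for example with rabbitmq-plugins enable rabbitmq_federation), you define an upstream and a policy that points to it. A rough sketch via the management HTTP API; the upstream URI, names, and pattern below are placeholders:

    import requests

    auth = ("guest", "guest")
    base = "http://localhost:15672/api"

    # Define where messages should be federated from.
    requests.put(
        f"{base}/parameters/federation-upstream/%2F/origin-broker",
        auth=auth,
        json={"value": {"uri": "amqp://other-broker.example.com"}},
    ).raise_for_status()

    # Apply federation to the exchanges matching the pattern.
    requests.put(
        f"{base}/policies/%2F/federate-events",
        auth=auth,
        json={
            "pattern": "^events\\.",
            "definition": {"federation-upstream-set": "all"},
            "apply-to": "exchanges",
        },
    ).raise_for_status()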

Read more about federation here.

Publish Confirms and consumer acknowledgments

Having high availability for the messages that already exist in RabbitMQ is awesome, but you cannot reach HA if a message never reaches the broker in the first place.

For that, you should also check whether you are using publisher confirms. Publisher confirms help the sender make sure that the message was received and is safe on the broker. Without confirmation, the sender can believe a message is in the broker when, for some reason, it never actually arrived.
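A minimal sketch of publisher confirms with pika (the queue name is hypothetical); with a blocking connection, basic_publish raises an exception when the broker does not confirm the message:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="task_queue", durable=True)

    # Ask the broker to confirm every message published on this channel.
    channel.confirm_delivery()

    try:
        channel.basic_publish(
            exchange="",
            routing_key="task_queue",
            body=b"hello",
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
            mandatory=True,
        )
    except (pika.exceptions.UnroutableError, pika.exceptions.NackError):
        # The broker returned or rejected the message; retry, log, or alert here.
        print("Message was not confirmed by the broker")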

Also, due to some error in the consumer, messages can be lost: the broker may discard a message thinking everything went well on the consumer side. For this, you should understand how your code acknowledges messages. Take a look at consumer acknowledgments.
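And a minimal sketch of manual consumer acknowledgements with pika, where a message is only acknowledged after the work has actually succeeded (queue name and processing logic are hypothetical):

    import pika

    def process(body):
        # Hypothetical business logic; replace with real message handling.
        print("processing", body)

    def handle(channel, method, properties, body):
        try:
            process(body)
            channel.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # Put the message back on the queue so it can be retried.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=10)  # limit unacknowledged messages per consumer
    channel.basic_consume(queue="task_queue", on_message_callback=handle, auto_ack=False)
    channel.start_consuming()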


I hope this piece helps you make your RabbitMQ more available.

Happy coding!


References

  • RabbitMQ Docs
  • RabbitMQ Cookbook by Gabriele Santomaggio and Sigismondo Boschi
  • RabbitMQ in Depth by Gavin M. Roy
  • Mastering RabbitMQ by Emrah Ayanoglu, Yusuf Aytaş, and Dotan Nahum