Messaging middleware at Fortigent. Challenges and perspectives.

1 Oct

Introduction

A large enterprise such as LPL has multiple applications that are being built independently. These applications often need to work together and exchange information.

One approach to integrating multiple applications is data integration. This is when the application workflows proceed independently of each other, and the applications exchange information implicitly, by reading and writing their data to and from the same database. An example of this is a Customer Relation Management (CRM) application having access to the sales orders data created by the web store application.

Another major kind of application integration is process integration (Ross, 2006). This is when an event occurring in one application triggers a reaction in another application. An example of this would be a creation of a sales order by the web store triggering a shipment process in the delivery subsystem.

Evolution of application integration

In practice, the process integration often begins its life as an aspect of data integration, and only later, as architecture complexity grows, it gradually acquires its distinct message-centric character. To build up on the previous example, a sales order may initially be a simple insert into an Orders table in a Relational Database Management System (RDBMS). The CRM app simply reads from the Orders table. When the Shipment Subsystem appears on stage, it starts treating the Orders table as its work queue in which an individual order represents a work request, or message. This marks the advent of process integration.

Initially, the process integration may be pollingbased — the Shipment System would simply query the table at regular intervals. As data volumes grow, the scalability of the system becomes important. To avoid bottlenecks, some of the subsystems may have to exist in multiple instances in order to allow for increased throughput. For process integration this poses a problem of concurrency as a work item is handed off from one subsystem to another. In our order processing example, if second instance of the Shipping System is added, the polling-based queue solution needs to be robust enough to guarantee that the order will not be shipped twice. In RDBMS world, this can be solved by wrapping the access to the queue table in a database transaction, locking the record for the duration of the status update.

As the number of downstream subsystems grows (adding e.g. payment, service activation etc.) the database-driven solution gets progressively more complex. An order in the Orders table may need one or more status columns added, to indicate order’s position in each of the downstream workflows. At some point, instead of having the multiple workflows feed off of the same Orders table, their state and transition history data gets spun off to a separate set of tables, representing multiple queues and event logs. Furthermore, as business matures and the Service Level guarantees get tighter (Smith, 2006), the polling-based solution may hit its scalability limits. This warrants an eventbased process integration. At this point a mechanism (based e.g. on database triggers) needs to be devised to signal the Shipping Subsystem that a new order has been created.

Message Brokers

The above was meant to illustrate why even though in principle both data- and process- integration solutions can be built on top of a generic DBMS, a more strategic approach is to base the process integration on something specifically built for the purpose.

This is because the message-centric approach requires operating at a higher level of abstraction than what a generic DBMS provides out of the box. Implementing messaging solution on top of a generic DBMS requires lots of custom coding to translate lower level database concepts (e.g. tables, queries, status columns) into higher level messaging concepts (e.g. queues, subscriptions, message delivery).

From the above context, a message broker emerges as a specialized kind of database. Unlike generic databases, designed first and foremost for storage and retrieval of data, message brokers are designed from the ground up to allow applications to exchange packets of data “frequently, immediately, reliably, and asynchronously, using customizable formats” (Hohpe, 2003).

Before we get to choose a specific message broker implementation, we need to know how to compare them. Let us zoom in on the messaging problem and see what features make a difference.

Messaging concepts

While data-centric systems usually manipulate data in bulk (indeed, in SQL systems even a single-record update operation is but a special case of a multi-record update), message-centric systems work with one individual message at a time. One application sends messages to the broker, one at a time. Another application receives messages from the broker, again, one at a time.

How does the broker know which application should receive which message? This depends on the capabilities of a message broker. In simple brokers, each application has its own inbox, or queue. The sender specifies an exact destination address. If multiple copies of the same message need to be delivered to certain recipients, it is sender’s responsibility to identify and target each recipient.

In more advanced brokers the senders do not target the messages at specific receivers. Instead, the new message is evaluated against a set of routing rules configured by a system administrator. Based on the rules, the message may be delivered to one or more applications. In even more advanced brokers, applications themselves express their interest in receiving certain messages. This capability serves to enable a popular messaging pattern known as the Publish-Subscribe, or pub/sub for short. In a pub/sub architecture subscribers receive messages without knowledge of what specific publishers there are (Gorton, 2006).

The most widespread kind of message routing is topic-based routing. In this approach, a message carries a special meta-data field, called a topic, which serves as the main routing factor. Alternatively, content-based routing allows arbitrary elements of the message examined by the routing rules, allowing more flexibility at the cost of reduced runtime performance.

What if a message cannot be delivered, for example, due to the recipient application being offline? Different message brokers provide different guarantees. Most brokers will try to redeliver a message a configurable number of times before discarding it, or rerouting to a dead-letter queue. Some brokers store their messages in memory and lose them when the broker itself is restarted. Some will persist the message to disc and recover them upon broker restart. Some message brokers allow recipient to examine and conditionally reject a message. Most message brokers allow their messages to have a limited life span, such that an unpicked message will eventually expire instead of sitting in queue indefinitely.

Finally, depending on concrete implementation, message brokers differ in their performance characteristics (lag, throughput), ease of administration, and the level of integration with the underlying operating system.

Fortigent Messaging Story

Similar to the web store example above, our journey started with simple data-centric integration solution. Our applications shared the same database, with consumers doing most of the work to aggregate our highly normalized data into the presentable form, at query time.

When this simple approach could no longer scale, there were several worker processes introduced, such as ABP (Account Based Positions) and CBP (Composite Based Positions) that would pre-aggregate the data in bulk. As we continued to optimize away any redundant computations, the worker processes evolved from recalculating all data every time, to polling for changes, and finally, to SQL-trigger-based activation.

Since some of our computations were easier to express in C# than in SQL, with passage of time certain operations (such as exploding our bucket inheritance hierarchy) were moved outside the database into managed code. A stored procedure, fed by a SQL trigger, would insert the work item into the queue table polled by a .NET application. Such hybrid architecture, known as the “Master Queue”, still relied on SQL server for persistence and concurrency management.

Our first attempts at employing a dedicated message broker used MSMQ. This choice seemed obvious given MSMQ status as a native Windows component, requiring virtually no installation and providing full integration with Active Directory, native Windows transactions (including DTC), and having its API built-in the standard .NET Base Class Library. Before long we discovered some of the more painful limitations of MSMQ.

This may be a good point to pause and review all the issues we had with MSMQ to prepare the stage for an alternative message broker.

Problems with MSMQ

First, in MSMQ there is no concept of a central broker. Each application is supposed to have its own queue defined locally on the same machine. The sender is required to fully specify the destination address, including the remote machine(!) and queue name. This forces the applications to depend on stable network infrastructure, and is made worse by MSMQ’s lack of support for dynamic DNS. The impact of this limitation can be somewhat reduced by going against Microsoft recommendation and having all applications have their queues reside on a central server — which introduces a single point of failure, reduces the performance and nullifies the delivery guarantee. Also, because MSMQ disallows creating queues on remote machines, the central server approach taxes the efficiency of the Continuous Integration process. Because the queues cannot be created at application installation time, a separate manual step is required to define queues on the central server.

In addition not only does MSMQ  use 3 TCP and 2 UDP ports(!), it dynamically allocates ports at OS startup! This significantly complicates firewall configuration to let MSMQ traffic through.

Second, MSMQ lacks any routing capability whatsoever. This makes implementing a Publish-Subscribe pattern rather difficult and leaves applications painfully aware of each other’s existence.

In short, MSMQ dates back to 1997 and that shows. Its distributed design combined with the lack of routing support makes it inadequate for a role of enterprise-grade middleware unless an extra development effort is made to build the lacking facilities on its top.

Options Considered

Before abandoning MSMQ altogether, we tried to squeeze a maximum out of it, by implementing our own routing mechanism on top of MSMQ’s barebone queues. Our solution was inspired by one of the approaches described in (Hohpe, 2003) and was a hybrid of their Routing Slip and Process Manager patterns (see the book for more information about these patterns).

Eventually convinced about inadequacy of MSMQ for our messaging needs, we decided to try a third-party message broker. Among the alternative message brokers we considered were ActiveMQ, 0MQ, and RabbitMQ. All three are relatively new products, incorporating the lessons learned by the industry in the 15 years since the inception  of MSMQ.

  • ActiveMQ is a central server type broker, rich with features and very popular among the enterprise Java crowd. It is a mature solution with full support of flexible rule-based routing. It is Open Source software distributed under the Apache license and requires a Java VM to run.
  • 0MQ is broker-less distributed solution (this aspect makes it similar to MSMQ) optimized for faster-than-TCP low-latency high-throughput local network operations. It is a native C++ app distributed under the LGPL license.
  • RabbitMQ is a central server type broker, commercially supported by VMWare and freely distributed under the Mozilla Public License. One of the most attractive features of RabbitMQ is its robust routing support with fully programmable subscription declaration.

After careful consideration and consultations with colleagues in other companies we have settled on RabbitMQ.

RabbitMQ advantages

Here are some of the advantages RabbitMQ offers over MSMQ:

  • RabbitMQ is a central server type solution, with full support for DNS and failover clustering. This simplifies configuration management while avoiding a single point of failure.
  • RabbitMQ fully supports topic-based routing. This makes implementing a full-blown Publish-Subscribe architecture very simple.
  • RabbitMQ traffic goes through single TCP port — making it easier to manage firewall configuration.
  • AMQP, the standard RabbitMQ is based on, is a programmable protocol. This means the admin does not have to define queues and subscriptions manually — they are created as needed by the client application.
  • According to some tests, RabbitMQ has about 7-8 times higher message throughput than MSMQ.

Even with the above advantages, adoption of RabbitMQ was not completely painless. The biggest challenge that has surfaced so far is RabbitMQ’s lack of integration with Windows Authentication (aka trusted connection). This requires passwords to be stored in application configuration files, which triggered a knee-jerk reaction from our IT infrastructure team. In one case we were able to work around the issue by encrypting the application’s configuration file. In another case, we had to build a wrapper WCF service to avoid exposure of RabbitMQ credentials to application users.

Despite the difficulties, our experience with RabbitMQ was very pleasant. Most importantly, its support of the pub/sub model allows us to achieve a high degree of decoupling between the individual applications which will help to make our architecture more flexible and future-proof.

Conclusion

The need to sustain system responsiveness given the growing volume of data at Fortigent has led to proliferation of worker processes whose job it is to aggregate the data through intermediate stages to its presentable form. As the final result depends on a multitude of configurable factors, the data needs to be recomputed in response to the configuration changes. This puts pressure on integrating the computation across multiple processing agents, and led to the advent of an event-driven messaging architecture at Fortigent.

The recent adoption of RabbitMQ is a step in an ongoing process of sophistication of software architecture in response to challenges posed by the continuing business development.

Bibliography

Gorton, I. (2006). Essential Software Architecture. Springer.

Hohpe, G. (2003). Enterprise Integration Patterns. Addison-Wesley.

Ross, J. W. (2006). Enterprise Architecture As Strategy. Harvard Business Review Press.

Smith, G. (2006). Straight to the Top. Wiley.

Videla, A. (2012). RabbitMQ in Action. Manning Publication.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: