I have worked over the last decade with many customers who were consolidating their MQ footprint. It's a familiar pattern: there are many queue managers, they tend to be lightly loaded, so why not consolidate to a central hub? Now that many of the projects of which I have firsthand knowledge have been in Production for a few years, some common patterns are emerging, and they aren't good.
There are several specific issues but the thing they tend to have in common is that rather than reduce overall cost as intended, the cost has merely been shifted from accountable line items (licenses, servers) to unaccountable, unmeasured costs (admin overhead, organizational friction, increased turnover).
In many cases the infrastructure costs are actually greater after the consolidation, but since they are unmeasured they are recognized only in terms of unmet expectations:
- The infrastructure is inflexible
- Time to market has increased, not decreased
- Internal customer satisfaction has decreased
- It’s impossible to apply patches and fixes
- Defects and outages have increased
Patch management and concurrent versioning
One example of this is the need for greater authority within Ops to dictate patch management on the messaging hub. What I usually advise is to implement a central group or cluster of queue managers instead of a single huge queue manager. When a patch comes out, it is applied to one of the nodes and applications have [X] days to test and migrate to the new node. If there are several nodes in the cluster, the upgrades continue as applications test, migrate, and free up capacity on the nodes with the old version. As of a specific date [X], any remaining nodes are patched and the process continues up the SDLC levels from Dev eventually through Production.
In a large enough environment a clustered hub would ideally support three concurrent versions of MQ. Some number of apps are on the -1 version migrating to the +0 version and some are on the +0 version migrating to v.next. The bulk of nodes would be at the +0 version and as nodes in -1 free up, they are upgraded to v.next.
In some cases, applications unable to meet the upgrade schedule are provided an enclave to run on the old version until such time as they can run on the new one. Such exceptions may require executive approval, and may require the application team to bear the full cost of the hardware and licenses.
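The rotation described above can be sketched in a few lines. This is an illustration only: the queue manager names, version labels, and app assignments are hypothetical, and a real migration would involve MQ administration and application cutover rather than a Python dictionary.

```python
# Sketch of the rolling-upgrade rotation: a node is patched only once
# the apps on it have migrated to a node already at the new version.
# All names and versions below are made up for the example.

NODES = {  # queue manager -> installed MQ version
    "HUB01": "9.2", "HUB02": "9.2", "HUB03": "9.3", "HUB04": "9.3",
}
APPS = {  # app -> queue manager it currently connects to
    "payments": "HUB01", "billing": "HUB02", "audit": "HUB03",
}

def drained(node):
    """A node is drained when no app still connects to it."""
    return all(qmgr != node for qmgr in APPS.values())

def upgrade_next(target_version):
    """Upgrade the first drained node still below target_version."""
    for node, version in NODES.items():
        # String comparison suffices for these sample version labels.
        if version < target_version and drained(node):
            NODES[node] = target_version
            return node
    return None  # nothing is drained yet: apps must migrate first

# HUB01 and HUB02 still host apps, so nothing can be upgraded...
assert upgrade_next("9.3") is None
# ...until those apps migrate to nodes already at the new version.
APPS["payments"] = "HUB03"
APPS["billing"] = "HUB04"
print(upgrade_next("9.3"))  # HUB01 is now drained and gets patched
```

The [X]-day deadline in the process above corresponds to forcing the migration step rather than waiting for apps to volunteer.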
If it seems as though this is more of a discipline than a configuration, that's because it is. Shops that succeed tend to approach it as such and commit to developing or improving core competency in concurrent versioning.
You may have noticed in that description several things that shops tend not to do, such as requiring that applications can move easily from node to node for load balancing and patch management, or giving the Ops team authority to dictate patch dates. This is where the problems start.
When I suggest that Ops should have the power to dictate the patch schedule, the usual response is "that would never work here." The problem is that if there are many apps using the hub, that's a bit like saying "a messaging hub would never work here." With some past projects several years out, I know of shops with 10 or 20 applications using the hub and no way to coordinate a time when they can all take an outage, so the queue manager is never patched and never upgraded.
Several shops I know of are un-consolidating to provide platforms for apps that need newer versions or patches of MQ, and that process has its own set of issues. In these cases it is common to discover that relocating applications designed for the hub is at best difficult and at worst a showstopper. For example, many of these apps are designed such that the system of record and the client app are tightly coupled to the same queue manager. Therefore, if the system of record moves, all the client apps need to move at the same time. Worse, some of those client apps need to talk to multiple systems of record, and if they aren't designed to manage multiple connections per interface, the systems of record become coupled to one another through their client applications.
The more client apps use the service, the more likely this outcome becomes. Between the problem of coordination across many client apps and the coupling of the apps to the network topology, gridlock can set in and all progress managing and operating the messaging network comes to a complete halt.
Shops that succeed tend to enforce certain standards including making the apps use client reconnect, queue manager groups in the CCDT, single points of connection to the cluster, and delegating destination resolution to MQ instead of hard-coding it.
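As an illustration of the queue manager group standard, the sketch below emits a JSON-format CCDT (supported in MQ 9.1.2 and later) in which every entry shares one group name, so a client connecting to the group can be satisfied by any member. The channel, host, and group names are invented for the example; only the overall shape follows the documented JSON CCDT format.

```python
# A minimal sketch of a JSON CCDT whose entries all belong to one
# queue manager group. Names here are hypothetical.
import json

def ccdt_entry(channel, host, port, group):
    """One clientConnection entry pointing a logical group at a host."""
    return {
        "name": channel,
        "type": "clientConnection",
        "clientConnection": {
            "connection": [{"host": host, "port": port}],
            # Same group name on every entry: the app asks for the
            # group and MQ selects an available member, which is what
            # decouples the app from any one queue manager.
            "queueManager": "HUBGRP" if group is None else group,
        },
    }

ccdt = {"channel": [
    ccdt_entry("HUB.SVRCONN", "hub01.example.com", 1414, "HUBGRP"),
    ccdt_entry("HUB.SVRCONN", "hub02.example.com", 1414, "HUBGRP"),
]}
print(json.dumps(ccdt, indent=2))
```

Combined with automatic client reconnection, this is what lets an admin drain a node for patching without a coordinated all-apps outage.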
Unfortunately, this isn't the only route to problems in the messaging hub. Many shops run into problems scaling the hub: there's only so much CPU and memory to add, and eventually operating-system limits such as file handles and process table entries begin to encroach. If the apps and network are not designed to be clustered, this too is a path to gridlock.
Even when the hub is managed as a group of resources, managing capacity can be an issue. Shops that provision capacity on an app-by-app basis get mired in extremely detailed, dense, and ultimately error-prone calculations.
Shops that succeed tend to define “hub” not as a single monolithic QMgr but as a cluster of queue managers that deliver a pool of capacity. They manage that capacity in uniform increments and build out ahead of demand. Application-specific capacity is managed by redistributing across queue managers in the pool.
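One way to picture capacity as a pool of uniform increments is the sketch below. The increment size, headroom policy, and queue manager names are all made-up figures for illustration; the point is that apps consume generic units from a pool rather than individually sized slices.

```python
# Illustrative sketch only: capacity managed as a pool of uniform
# increments rather than per-app sizing. Numbers and names are made up.

INCREMENT_MSGS_SEC = 500  # capacity per unit, a hypothetical figure
HEADROOM_UNITS = 2        # build out before free capacity drops below this

pool = {"HUB01": 4, "HUB02": 4}  # queue manager -> free capacity units

def place_app(app_units):
    """Assign an app to the queue manager with the most free units."""
    qmgr = max(pool, key=pool.get)
    if pool[qmgr] < app_units:
        raise RuntimeError("pool exhausted: add another queue manager")
    pool[qmgr] -= app_units
    return qmgr

def needs_buildout():
    """Trigger a buildout before demand catches up with supply."""
    return sum(pool.values()) < HEADROOM_UNITS

print(place_app(3))  # lands on whichever qmgr has the most free units
```

Redistributing an app is then just returning its units to one queue manager and calling `place_app` again, which keeps the arithmetic trivial compared with per-app sizing.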
Classes of Service
Another common pitfall is to assume all apps are equal on the hub. They are not. An app that needs low-latency response and fast failover is going to butt heads with one that processes huge messages in very long-running units of work. MQ is extremely well optimized, and people often believe that means it does all things for all workloads. What that level of optimization actually provides is the opportunity to fine-tune MQ for different types of workloads. But tune it for one extreme and it won't perform well for workloads at the other.
Shops that succeed tend to provide multiple classes of service defined as patterns. Typically there’s a baseline for all queue managers (for example enabling events) and the pattern-specific tuning is layered on top of that. The patterns are well defined and limited to the minimum set capable of meeting the requirements.
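The layering described above can be sketched as a baseline configuration with a per-pattern overlay on top. The attribute names below are illustrative stand-ins, not exact MQ tuning parameters, and the two patterns shown are invented examples of the "minimum set" idea.

```python
# Sketch of class-of-service patterns as layered tuning: a baseline
# applied to every queue manager, plus a per-pattern overlay.
# Attribute names are hypothetical, not real MQ parameters.

BASELINE = {
    "events_enabled": True,   # e.g. queue-depth and channel events
    "max_msg_length_mb": 4,
}

PATTERNS = {
    "low-latency":   {"log_write_integrity": "single",
                      "max_msg_length_mb": 1},
    "bulk-transfer": {"max_msg_length_mb": 100,
                      "channel_batch_size": 200},
}

def class_of_service(pattern):
    """Baseline first, then the pattern's overrides layered on top."""
    config = dict(BASELINE)
    config.update(PATTERNS[pattern])
    return config

print(class_of_service("bulk-transfer")["max_msg_length_mb"])  # 100
```

Because every class starts from the same baseline, adding a new pattern means defining only its deltas, which keeps the set of patterns small and auditable.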
Often the consolidation is driven by cost reduction, but behind that is a notion that fewer nodes are more manageable and therefore defects and outages will go down. It isn't that straightforward. In a shop with many discrete queue managers, the admin challenge is sheer size; in one with a high degree of resource sharing, the admin challenge is orchestration and coordination. It is a major cultural shift, and that fact is most often overlooked.
But it's difficult to make that transition without freeing up resources to learn the new environment and establish processes, procedures, and accountability. Some of the existing practice can be carried forward, but it is usually less than anticipated. Where many shops fail in their hub deployment is in underestimating the investment of resources required.
Shops that succeed tend to utilize Enterprise-Grade tools for more than just monitoring. All the MQ stakeholders need access, and that access must be provisioned quickly, securely, and at an incremental cost of near zero. There are many central web-based tools available that fill this need, and they continue to generate ROI long after the hub project is completed.
Avoiding the anti-pattern
An anti-pattern is a common problem whose intuitive response is at best ineffective and at worst actively harmful. I've been helping with MQ consolidation projects for more than a decade now, at banks, card processors, retailers, insurance companies, government, military, and more. Although the initial implementations are generally straightforward, many of these networks slowly gel in place over the subsequent years, becoming resistant to change and harmful to the organization. I've seen enough damage to believe that consolidating the messaging network has become an anti-pattern.
But it doesn’t have to be that way. Shops that succeed tend to focus on several specific tactics to avoid the anti-pattern:
- Approach the hub as a discipline rather than as a once-and-done configuration.
- Commit to develop or improve core competency in concurrent versioning, typically using swing hardware and by designing for dynamic relocation of apps.
- Enforce implementation standards that decouple apps from each other. For example, do not allow configurations that require a sender and receiver app to rendezvous on a particular queue on a particular queue manager.
- Implement classes of service.
- Approach classes of service as a discipline rather than a once-and-done configuration. For example, once multiple classes of service exist, how does an administrator know which queue managers provide each class and which apps are assigned to them?
- Utilize Enterprise-Grade tools to both reduce costs and provide function that was infeasible otherwise.
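The record-keeping question raised in the list above can be answered by keeping an explicit registry, however it is implemented. The sketch below is one hypothetical shape for it; every name in it is invented.

```python
# Illustrative registry answering two admin questions: which queue
# managers provide each class of service, and which apps run where.
# All names are hypothetical.

PROVIDES = {  # queue manager -> class of service it is tuned for
    "HUB01": "low-latency",
    "HUB02": "low-latency",
    "HUB03": "bulk-transfer",
}
ASSIGNED = {"payments": "HUB01", "archiver": "HUB03"}  # app -> qmgr

def qmgrs_for(cos):
    """Which queue managers provide this class of service?"""
    return sorted(q for q, c in PROVIDES.items() if c == cos)

def apps_on(cos):
    """Which apps are assigned to that class of service?"""
    return sorted(a for a, q in ASSIGNED.items() if PROVIDES[q] == cos)

print(qmgrs_for("low-latency"), apps_on("low-latency"))
```

Whether this lives in a CMDB, a monitoring tool, or a flat file matters less than that it exists and is kept current as apps are redistributed.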
The trick to successfully consolidating the MQ network is a willingness to invest as well as rearrange. It may require cultural change, such as adopting a DevOps approach to infrastructure patch management. It may require some re-engineering of the applications to support location independence and clustered operations. It may require new tooling and automation to manage and operate in the hub-and-spoke environment. And it may require the advice, preferably early in the project, of someone who has done this a dozen or so times, in order to steer clear of the tar pits that await the unsuspecting.
Want to know more? Let’s meet up at MQTC and discuss your consolidation plans. Or we can chat via phone or email. Either way, just don’t step in the tar pit.