MQ start/stop woes

After I tweeted a link to an IBM blog post on how to start and stop IBM MQ using systemd, an IBMer responded to say “it surprised me to hear that some #IBMMQ customers have to manually restart their QMs when the box comes up.”

My reply was brutally frank: “Should be no surprise – the serviceability gap with MQ start/stop API resembles an open-pit mine. As a result most shops either don’t do it well, or don’t do it at all. Mortgage payments on the technical debt needed here desperately.”

Not wanting to leave that hanging out there with no explanation, this post describes in excruciating detail what’s wrong. Hopefully, that’s the first step to getting it fixed.

So you want to start MQ?

Sounds easy enough, right? Just write a systemd service to kick off strmqm and move on, yes? Not so fast! Unless there’s a default queue manager, the service has to at least know the name of the queue manager to start, and if there is more than one queue manager, the service needs to know about all of them.

The IBM blog post addresses this by using systemd templates. There’s one generic file with the ability to pass in a substitution parameter. Each queue manager gets a symbolic link to the template file. The link file name contains the queue manager name which is the source of the substitution.
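
For illustration, a minimal template unit along these lines might look like the sketch below. Everything in it, from the unit name to the paths and options, is my own guess rather than IBM’s published example, so treat it as a shape, not a recipe.

    # /etc/systemd/system/mq@.service   (hypothetical name and location)
    [Unit]
    Description=IBM MQ queue manager %i
    After=network.target

    [Service]
    Type=forking
    User=mqm
    # %i is replaced with the instance name, which here is the queue manager name
    ExecStart=/opt/mqm/bin/strmqm %i
    # Naive stop command; the shutdown sections below explain why this isn't enough
    ExecStop=/opt/mqm/bin/endmqm -w %i

    [Install]
    WantedBy=multi-user.target

Enabling an instance, for example with systemctl enable mq@QM1.service, is what creates the per-queue-manager symbolic link, and %i is substituted with QM1 at start time.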

So, assuming we have a good template and several links we are good to go, yes? Not so fast! Remember that, like any other boot-level service, systemd runs as root and its configuration files must therefore not be writable by anyone who isn’t root. That means if the recommended practice of giving every queue manager a unique name is followed, the MQ administrator must engage the UNIX/Linux System Administration team for every invocation of crtmqm or dltmqm.

 

So you want to start multi-installation MQ?

Multiple MQ installations on the server break the assumption that starting a queue manager is as simple as running strmqm. Now the systemd service is obliged to properly set the environment for each of the queue managers it needs to start. Since systemd cannot source setmqenv, the MQ admin must either write an MQ start script to perform this function or else use crtmqenv to generate an environment file for systemd to reference. There are issues with either of these solutions.
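
A rough sketch of the environment-file option, assuming crtmqenv’s NAME=VALUE output can be consumed directly by systemd and using a file location I made up for this example:

    # Regenerate the root-owned environment file whenever the queue manager's
    # installation association changes:
    crtmqenv -m QM1 > /etc/mqm-env/QM1.env

    # Referenced from the (hypothetical) unit that starts QM1:
    [Service]
    EnvironmentFile=/etc/mqm-env/QM1.env
    ExecStart=/opt/mqm/bin/strmqm QM1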

Using a shell script to set the environment and kick off MQ makes it much harder for systemd to determine which process represents the queue manager’s service. It is customary for daemons to either execute and stay active, or else use one level of fork to run the daemon in the background. MQ natively forks strmqm to start the Execution Controller which becomes the parent of all the other processes. Insert a shell script in there and now the Execution Controller is three processes deep. There is no PID file so systemd just takes a best guess as to which process represents the health of the service.
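
If a start script is used anyway, about the best it can do is keep the process tree shallow by replacing itself with strmqm. A minimal sketch, with the installation path as a placeholder:

    #!/bin/bash
    # Hypothetical start wrapper: establish the right installation's environment,
    # then exec so the wrapper itself adds no extra level to the process tree.
    # strmqm still forks the Execution Controller, so systemd's PID guessing remains.
    QMGR="$1"
    . /opt/mqm/bin/setmqenv -m "$QMGR" -k
    exec strmqm "$QMGR"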

 

The high cost of root

The alternative of creating an environment file and embedding it into the service may be worse. In addition to requiring root to regenerate that file after every invocation of setmqm, this approach suffers from the more general problem of increasing the amount of customization embedded in the systemd configuration files.

Shops that value security impose a fairly high cost to obtain root privileges, specifically to minimize the amount of routine work performed as root. Passing that cost on to the requestor in the form of elevated approval, formal change control, request latency, and other means, creates a strong incentive not to request root in the first place. As a result, the more customization is required in the systemd configuration files, the less dynamic the MQ network becomes. Once something is set up, it is set in stone.

Some shops have a system of streamlined privilege delegation such that the MQ Admin can obtain root and perform maintenance on the systemd files directly. While this reduces the cost to the MQ admin team, some of that cost is transferred to whoever manages privilege escalation. If escalation is managed with a commercial product, the company bears the cost of buying it. In addition to those costs, the burden of proof at audit time rises with the volume of privileged access logs because there is more log content to wade through. Meanwhile, the ability to spot anomalous activity in those same logs decreases as they grow larger.

Anything that requires root for mundane daily tasks or to store tunable per-instance configuration imposes very high barriers to adoption. As long as that day-to-day ongoing cost exceeds the cost of occasional downtime to manually start MQ after a server reboots, MQ won’t ever be enabled under systemd.

 

So you want to start multi-instance MQ?

First question: how can systemd tell if a given queue manager is multi-instance?

It’s a serious question. There is no permanent configuration setting to query and no command one can issue to determine, before starting it, whether a given queue manager is intended to run as a singleton or as multi-instance. The only way systemd can reliably know a queue manager is supposed to be multi-instance is for someone to have obtained root privileges and hard-coded strmqm -x into the service template.

But wait…doesn’t that mean all queue managers on the box must be multi-instance or else all of them must be singletons? Sure, if there’s only one template. If you want to mix singleton and multi-instance queue managers on the same box you need to create and maintain template files for each type, or else get creative (read: complex) with pre-start capabilities of systemd.
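
In practice the difference between the two (hypothetical) templates is a single flag, yet each one is a separate root-owned file to create and maintain:

    # mq@.service      -- singleton template
    ExecStart=/opt/mqm/bin/strmqm %i

    # mq-mi@.service   -- multi-instance template
    ExecStart=/opt/mqm/bin/strmqm -x %i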

At the end of the day what this means is that even more routine root-level access is required in order to maintain the growing set of custom per-instance configuration embedded into systemd files.

 

So you want to stop MQ?

If you got this far, we can assume your service files are both multi-installation and multi-instance aware with regard to starting queue managers. The next question is “What command do you use to stop them?”

Ideally the queue manager shutdown should proceed through attempts at controlled shutdown, then immediate shutdown, then preemptive shutdown, then finally manual shutdown. Unless you opt for a one-size-fits-all solution that doesn’t shut down in stages, there is no getting around the need to run a shutdown script. A shutdown script isn’t optimal, but at least it won’t confuse systemd’s PID tracking like start scripts can.

What’s in the script? The last one of these I wrote worked like this:

  • Set a timer and issue endmqm -c
  • If the timer pops and the queue manager has processes left, repeat using -i, then -p.
  • If any processes remain, kill them in the order prescribed by IBM. Note that the processes and recommended order change by version so it is necessary to query the queue manager’s version and dynamically construct the kill list from a version table.
  • Shut down any satellite processes that were not started as child processes of the queue manager. Since MQ services always run as mqm, anything that has to run with non-administrative privileges will likely have been started some other way and may hold shared memory segments until they are killed.

If there are multiple queue managers, all of this should be happening in parallel if we hope to get everything shut down before the system reaper shows up and starts wantonly killing random processes. Robustly managing multiple parallel child processes from a bash script is not a trivial requirement.
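
For a sense of scale, here is a heavily simplified sketch of the staged part for a single queue manager. The version-specific kill list, the satellite cleanup and the parallelism are all left out, and the dspmq status strings it greps for are approximations:

    #!/bin/sh
    # Illustrative staged shutdown for one queue manager -- not production code.
    QMGR="$1"
    for FLAG in -c -i -p; do
        endmqm $FLAG "$QMGR" >/dev/null 2>&1
        # Give each stage up to 60 seconds before escalating to the next one.
        i=0
        while [ $i -lt 60 ]; do
            if dspmq -m "$QMGR" | grep -q 'STATUS(Ended'; then
                exit 0
            fi
            sleep 1
            i=$((i+1))
        done
    done
    # Still up: fall through to killing processes in IBM's prescribed order (not shown).
    exit 1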

 

So you want to stop multi-instance MQ?

This is where things get a bit complicated. I know you are probably saying “Wait, what? Wasn’t all that other stuff complicated?” Sorry, no. Not by comparison.

The multi-instance queue manager is a single logical construct composed of processes running in two physical places. Up to now we’ve dealt only with the behavior of those local processes. We hard-coded whether a queue manager was intended to run multi-instance by adding -x to the strmqm command if it was. Now we actually need to know the current and the desired state of the logical queue manager before telling the local processes what to do.

For example, let’s say you started the queue manager with -x. Intuitively, you should be able to stop it with -x too, right? Nope. That only works if it is running as the standby instance. If it is the active primary instance you need to use the -s option instead, because the -x is ignored.

Well, sort of. The -s parameter tells the queue manager to switch to the standby node. If the standby instance is not running when endmqm -s is issued, it will not be possible to switch instances. In that case the queue manager stays running and emits the error AMQ7276: WebSphere MQ queue manager cannot switch over. What you really need to do in this case is issue endmqm without the -x and without the -s.

In short, stopping a multi-instance queue manager requires the stop script to first query the state of the logical queue manager, then issue endmqm with either the -x, or the -s, or neither, depending on the current state of the queue manager and the intended state after running the command (i.e. queue manager remains available on another node or shuts down on all nodes).
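
A sketch of that decision logic follows. The dspmq -x strings and the hostname matching are approximations that vary by version and naming conventions, and it assumes the goal is to take this node’s instance down while leaving the queue manager available elsewhere if it can:

    #!/bin/sh
    QMGR="$1"
    STATE=$(dspmq -x -m "$QMGR")
    LOCAL=$(echo "$STATE" | grep "INSTANCE($(hostname -s))")

    if echo "$LOCAL" | grep -q 'MODE(Standby)'; then
        endmqm -x "$QMGR"       # local instance is the standby: end only this instance
    elif echo "$STATE" | grep -q 'MODE(Standby)'; then
        endmqm -s -i "$QMGR"    # local instance is active, a standby exists: switch over
    else
        endmqm -i "$QMGR"       # no standby visible: plain shutdown, no -x and no -s
    fi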

To make sure the MQ admins are never bored, the tools that query the state of a multi-instance queue manager don’t say whether the queue manager is intended to be a singleton or multi-instance, only which mode the running instance was started in; and they say nothing at all about HA if no running instance is visible to the dspmq command.

Just for kicks, IBM threw a -r in there to tell clients whether to try to reconnect or not, and all of the -x, -s, and -r options interact with the -c, -i and -p options. For the sake of brevity I won’t go into detail about the interactions between all of these other than to say that shutting down a multi-instance queue manager sets a high skill bar that every customer must clear.

Interactively we can eventually get MQ shut down using trial and error. That’s the lowest common denominator across all 10,000 or so MQ shops. It is what we do by default. Coding it is a different story. An admin can manually execute thousands of MQ start or stop events without exceeding the resource cost of correctly implementing all the exception handling needed by a shutdown script. Based on that, you should be able to correctly guess what approach many of us take and why.

 

So you want to start or stop MQ from the command line?

You really are a glutton for punishment, aren’t you? When systemd starts processes, it sets their resource limits, their kernel tuning, their environment, and more. It also places every process that is spawned in a control group named after its service. Start MQ as a service, it gets the environment and tuning specified for the MQ service. Start MQ from the command line and MQ gets the environment configuration tailored for the getty service that manages user login sessions. Start several queue managers from the command line, they all share the pool of resources allocated to that user by the getty service. Whoops. The variety of possible outcomes there can make for some interesting troubleshooting.

What’s an admin to do? The standard advice is to use systemctl to start and stop MQ services. In case you are not familiar with systemctl, running it is one more task that requires root. The MQ shop is faced with the choice either to escalate to root for every invocation of strmqm and every invocation of endmqm, or to try to reconcile the environment of the getty service to the environment of the mqm service and hope for the best. Or, perhaps avoid using systemd to start MQ.
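
For the record, the routine commands look something like this, using the hypothetical mq@ template from earlier; the first two need root or a delegated equivalent:

    sudo systemctl start mq@QM1.service
    sudo systemctl stop mq@QM1.service
    systemctl status mq@QM1.service     # status queries generally work without root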

A more subtle effect of this is that in any shop where there is a mix of manual and auto-started MQ, the MQ admin must be aware of which approach is used on any given server and move back and forth between different standard procedures fluently. As with any system that depends on human vigilance for quality assurance, some background rate of failure is inevitable. Over time, the chance of queue managers running inside the getty control group and inheriting from its environment approaches certainty.

Note that avoiding systemd doesn’t actually make things better since we still end up with all the queue managers running under the getty container rather than a tuned MQ service container. It’s just that doing nothing is the default, and getting from doing nothing to doing something robust and reliable requires an investment in skill and resources so great as to become a showstopper for many MQ shops.

 

Still surprised?

Once upon a time, IBM’s strategy for customizing MQ was to expose exit points in the product and trust that each and every customer would have a deep bench of C programmers capable of writing system-level, re-entrant, thread-safe code. Given that legacy it may be understandable why IBM would be surprised to learn that some MQ customers don’t use automation to start and stop MQ. After all, it’s just scripts right? How hard can it be?

 

Man of straw

IBM has always advised that the best approach to getting enhancements accepted is to specify the functionality and not the implementation. I’m going to overstep that boundary a bit here in proposing an idealized straw man solution. My complete lack of Linux system programming experience is sure to be revealed by this but we have to start somewhere. Since this is my own blog and not an RFE, I figure I get to do that here.

For the sake of this post I’m going to assume that the hypothetical new MQ daemon is called strmqd. It seemed the obvious choice given traditional MQ and UNIX naming conventions. The design would be compliant with systemd expectations:

The daemon…

  • Is started by the system at boot.
  • Controls all of the queue managers defined to the server.
  • Manages configuration data such as whether a particular queue manager is intended to run at boot, is enabled or disabled, is singleton or multi-instance, etc.
  • Is capable of figuring out the state of the logical queue manager in order to issue the correct parameters to endmqm.
  • Is capable of progressive, staged queue manager shutdown, including ultimately killing the processes in the prescribed order.
  • Performs something meaningful in response to a refresh instruction from systemd.
  • Provides a command-line interface that MQ administrators can run without requiring root privileges in order to manage configuration data and control queue managers.
  • Provides a command-line interface to set and query the daemon’s configuration parameters individually.
  • Provides a command-line interface to set and query the daemon’s configuration parameters as a group such that the output of the query is capable of being fed back to the set command to restore the daemon configuration.

You may have noticed that so far none of these requirements affects existing MQ code. The proposed strmqd is a stand-alone daemon designed on one hand to play well with systemd, and on the other hand to manage queue managers in a way that makes sense to MQ.
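
To make the shape of it concrete, the systemd side could collapse to one static, never-edited unit plus a non-root CLI. Every name, path and option below is invented for illustration:

    # /etc/systemd/system/strmqd.service   (hypothetical)
    [Unit]
    Description=IBM MQ queue manager controller daemon
    After=network.target

    [Service]
    Type=notify
    User=mqm
    ExecStart=/opt/mqm/bin/strmqd
    # "Refresh" maps to reload; assumes the daemon handles SIGHUP and sd_notify
    ExecReload=/bin/kill -HUP $MAINPID

    [Install]
    WantedBy=multi-user.target

    # Day-to-day control goes through the daemon's own CLI, without root:
    strmqd enable QM1 --boot --multi-instance
    strmqd stop QM1 --switchover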

There are some areas of overlap with existing MQ code that need to be addressed, though. Let’s look at those next.

 

Setting the installation

For some reason, the setmqenv command is capable of figuring out which installation a queue manager belongs to but strmqm is not. On a multi-installation node something is therefore obliged to run setmqenv to establish the environment before starting the queue manager.

An alternative approach would be for the strmqm command to figure out which installation the requested queue manager belongs to and then, if it isn’t the one from which strmqm was run, set the correct installation and exec the strmqm command from there.

Obviously, I don’t know the MQ internals well enough to say whether or why this is a bad idea, but this is exactly the kind of thing IBM objects to when they advise against making recommendations with implementation details. So even though making strmqm dependent on setmqenv may seem like a glaringly obvious deficiency, we have to proceed on the assumption that IBM keeps it this way for reasons we will never know.

Fortunately, strmqd is well positioned to take ownership of this task. Until and unless IBM removes the dependency on externally setting the installation, the strmqd daemon can figure it out, even if that means running setmqenv or crtmqenv and scraping the output.
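
A sketch of that workaround, where the MQ_INSTALLATION_PATH variable name and the parsing are my assumptions about crtmqenv’s output format:

    #!/bin/sh
    # Hypothetical: figure out which installation owns a queue manager by
    # scraping crtmqenv output, then run that installation's strmqm.
    QMGR="$1"
    INST=$(crtmqenv -m "$QMGR" 2>/dev/null | sed -n 's/^MQ_INSTALLATION_PATH=//p')
    exec "$INST/bin/strmqm" "$QMGR"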

 

General instrumentation capability

Here I am referring to the ways in which MQ does or does not interface with external tools to perform administrative or operational tasks or query the state of the queue manager. For example, dmpmqcfg has an option to write each object definition on a single line. This makes it easy to grep through object backup files to identify all objects that meet some criteria. But runmqsc doesn’t have that option, so a script that wants to parse queue status must be smart enough to correctly parse multi-line, multi-column output. One of these is friendlier to instrumentation than the other.
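
A quick illustration of the difference, with the object names invented and the runmqsc output abridged from memory:

    # Friendly: one line per object, so ad-hoc filters are a one-liner
    dmpmqcfg -m QM1 -o 1line | grep 'DEFINE QLOCAL' | grep 'DEFPSIST(YES)'

    # Unfriendly: the same kind of filter against queue status means reassembling
    # multi-line, multi-column runmqsc output along these lines:
    #   AMQ8450: Display queue status details.
    #      QUEUE(APP.REQUEST.QUEUE)    TYPE(QUEUE)
    #      CURDEPTH(42)                IPPROCS(3)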

When it comes to instrumentation capability, an honest evaluation of MQ would rate it poorly. Or, as I phrased it in my tweet, “the serviceability gap with MQ start/stop API resembles an open-pit mine.” This post is about interfacing to systemd rather than instrumentation in general, so I will limit my justification for that claim to an abbreviated bullet list.

  • Many long-standing MQ features lack round-trip capability. To give just one example, if the queue buffers are tuned there is no way to capture that in a backup and apply those settings during a restore.
  • Configuration round-trip does not appear to be a primary baseline requirement for new functionality.
  • There are no primitives to manipulate or query namelists. Even IBM has tripped over this. An early version of the FTE installer overwrote the QPUBSUB namelist without first checking for any user-defined values.
  • We wouldn’t need to torture namelists if MQ actually supported user-defined metadata.
  • The runmqsc utility lacks the ability to generate 1-line output, lacks conditional operations, lacks substitutable inserts.
  • Local scripts and agents have no way to manage the multi-instance queue manager’s state at the logical queue manager level. The only way to issue commands to a logical multi-instance queue manager is over a client connection, and then only if it is running.
  • Lots of query functions fail if the queue manager is not running, even for singleton queue managers.
  • Many configuration elements and operational states lack query capability. For example, there are commands to determine the MQ data directory if the queue manager is running. If it isn’t running the qm.ini and mqs.ini files must be parsed. An operational example is that there is no way to tell if the queue manager’s in-memory kdb is stale, which is critical when staging certificate renewals.

Unfortunately, our hypothetical strmqd daemon needs some of this functionality which is why I mention it here. Ideally, the existing MQ components would be enhanced (some might say fixed) over time to include this functionality. In the meantime, our hypothetical strmqd would need to work around some of these gaps.

 

Native command interception

Let’s say IBM grants my wish and delivers strmqd. What should happen when the MQ administrator runs the strmqm or endmqm command? Or any other operational management command?

Ideally we don’t want to require the MQ administrator to first figure out if strmqd is in use and then alternate between two completely different sets of operational commands. In an ideal world the native commands would just do the right thing. One approach might be to relocate the native commands and replace them with symlinks to strmqd which knows where to find the original executables if it needs them. I’m sure there are more elegant ways to do this but that’s a good description of the functionality required.
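
A crude sketch of that idea, entirely hypothetical including the relocated path and the strmqd subcommands, would be a dispatcher that every native command name links to:

    #!/bin/sh
    # Hypothetical dispatcher installed as strmqm, endmqm, etc., with the real
    # binaries relocated to /opt/mqm/libexec/ (a made-up location).
    CMD=$(basename "$0")
    if strmqd status >/dev/null 2>&1; then       # hypothetical "is the daemon up" check
        exec strmqd run-native "$CMD" "$@"       # let the daemon perform and track it
    else
        exec "/opt/mqm/libexec/$CMD" "$@"        # fall back to the original executable
    fi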

This imposes the greatest impact on existing code compared to everything else proposed so far. Yet if it were up to me I would not want to deliver a solution that did not address this. After all, the point of this exercise is to relieve the MQ administrator of the burden of complexity and the cost of a high skill barrier. If the folks in Hursley solve the problem once so that native commands just work regardless of whether strmqd is active or not, the problem is solved for 10,000 customers of all sizes and skill levels.

 

Wrapping up

One of the benefits of being a consultant is the view of the same product from so many different vantage points. Over time, patterns begin to emerge from commonality of success and failure. Anti-patterns are a peculiar subset that are attractive enough at first glance that many shops tend to gravitate toward them, but that ultimately fail.

Given the current capabilities of the product, MQ start/stop instrumentation under systemd qualifies as an anti-pattern. Everyone wants to use it. IBM seems to think we all are. But walk down that path a little way and suddenly you find yourself in a tar pit where the effort to make additional progress is Herculean but for political reasons you can’t easily go back. The result is that many shops either don’t use systemd at all, or else use whatever primitive implementation of an MQ service they had developed at the time they discovered the tar pit and realized they had seriously underestimated the resource commitment to do it right.

What is most puzzling is that the IBM dev shop in Hursley hasn’t figured this out yet, especially given their emphasis on using MQ for cloud and virtualization since these use cases practically scream for instrumentation.

Instead, IBM implicitly offloads the task of setting up start/stop to the system admin specialists by completely ignoring the subject. When the Linux SA shows up looking for documentation or examples of MQ service files, there are none. Plugging systemd into the search box of the v9.0 Knowledge Center I get back 11 results, yet none of the pages returned actually contains the word systemd, so I’m not sure what the hits are supposed to have found. Look for template files or sample scripts in the distribution and there aren’t any.

The one thing we did have for a while, SupportPac MSL1, was quietly withdrawn without replacement. When I asked Mark Taylor why at the last MQTC he said “because it didn’t work.” Apparently, IBM found themselves in the same tar pit as everyone else except that they actually were able to back out, erase all the trail marks they had blazed along the way, and pretend the whole thing never happened.

My hope is that IBM would implement strmqd or equivalent functionality so that when we create pattern files to provision MQ into the cloud, into containers, or into images spun up on demand, we can bake start/stop functionality into the virtualized host without requiring lots of customization in root-owned files or requiring root escalation.

In the meantime, keep your eyes open for tar pits.


4 Responses to MQ start/stop woes

  1. Carl Rood says:

    So is there an actual solution to this? I’ve tried external scripts with the setmqenv. Every time the second instance hangs on the systemctl start command. The standby comes up according to dspmq -x, but is killed when the systemctl times out.

  2. Pingback: MQGem Monthly (October 2017) | MQGem Software

  3. FJ Brandelik says:

    I guess it all gets a little less complicated if you make a new container per queue manager, and have 2 types, std and multi-instance….

    • T.Rob says:

      Depends – define “container” in this context. If you mean Docker containers or similar, and if the QMgrs all have the same name, then it would be possible to make a static systemd unit file. But then if all the QMgrs have the same name, it means they can’t be in a cluster or connect to more than one instance of a remote node. That solves a problem with systemd but makes work for the admin elsewhere and constrains the network.

      Assuming that the QMgr names were unique and that the name could be propagated to the systemd unit file when the virtual image was deployed, we could get rid of the need to have root to do configuration. But all the other issues mentioned, including the need to get root or delegate authority to start/stop MQ, and the issues with stopping multi-instance QMgrs and stopping QMgrs in general, remain.

      Given all that, I’d say that in the phrase “it all gets a little less complicated if…” the key word is “little” and whether that increment is even measurable is questionable.
