Command Server Gotcha

The title of my last post, Client-based runmqsc gotcha, was a bit of a misnomer.  Although I stumbled onto it using runmqsc client, the root cause of the issue really lies with the Command Server.  That means anything that drives commands through the Command Server can run into this issue, making it a much bigger deal than I originally thought.

IBM has replied to the PMR so I’ll discuss that along with a suggested strategy to mitigate the risk and hopefully avoid Production outages.

How it breaks

Root cause

If a PCF command causes the Command Server to return more result messages than the MAXDEPTH of the reply-to queue, the Command Server may return partial result sets and silently abort further processing.  Usually the client consuming the responses can get them fast enough that a result set only slightly larger than MAXDEPTH will succeed.  However, as the result set grows beyond that, the failure rate increases.  In my case, when the Command Server was asked to return a result set of 4005 objects into a reply-to queue with MAXDEPTH(3000), the failure rate was about 60%.
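To see why a result set only slightly larger than MAXDEPTH usually gets through while a much larger one does not, here is a toy model in Python.  It is only a sketch: the function name and the per-tick put and get rates are invented for illustration, and it makes no claim about how amqpcsea actually schedules its puts.

```python
def first_qfull(total_msgs, maxdepth, puts_per_tick, gets_per_tick):
    """Return the 1-based number of the first reply that meets a full queue.

    Returns None if every reply fits. Toy model only: in each tick the
    producer puts up to puts_per_tick replies, then the consumer removes
    up to gets_per_tick of them.
    """
    depth = 0
    sent = 0
    while sent < total_msgs:
        for _ in range(puts_per_tick):
            if sent == total_msgs:
                break
            if depth >= maxdepth:
                return sent + 1  # this put would fail with QFULL
            depth += 1
            sent += 1
        depth -= min(depth, gets_per_tick)
    return None


if __name__ == "__main__":
    # Consumer keeps pace with the producer: the queue never fills.
    print(first_qfull(4005, 3000, 10, 10))
    # Slow consumer, result set barely above MAXDEPTH: still fits.
    print(first_qfull(3010, 3000, 10, 1))
    # Same slow consumer, result set well above MAXDEPTH: QFULL.
    print(first_qfull(4005, 3000, 10, 1))
```

In this model the outcome depends entirely on how much the consumer drains while the producer is still putting, which is consistent with the failure rate climbing as the result set outgrows MAXDEPTH.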

Differing behaviors

I saw three different failure behaviors in runmqsc.  In the best case runmqsc printed an “AMQ8101: WebSphere MQ error (805) has occurred” to indicate a problem.

But it also sometimes printed “MQSC found the following response to a previous command on the reply q” in the middle of the result set.  Based on the numbers, it appears in these cases to have lost a varying number of the non-persistent responses between when the queue filled and when enough free space opened up for the Command Server to continue.  Since the reply queue is dynamic it should be impossible to find a “response to a previous command”, so this suggests that the Command Server has at least some retry logic, though it doesn’t appear to be very robust. [1]

The other behavior I saw was that no indication of a problem occurred other than receipt of an incomplete result set.  This suggests that some of the non-persistent responses simply disappeared.

The Command Server does place a single message on the Dead Letter Queue when this happens, if a DLQ is defined.  My objection to this is that this is a fatal error in MQ’s primary administrative interface and deserves to be logged in a more robust fashion.  An error log entry is permanent and is virtually guaranteed to be recorded.  On the other hand, a freshly defined QMgr with IBM’s default configuration has no DLQ to receive these messages, so this method of exception reporting can hardly be called reliable.  In short, this is slightly better than trying to put an error message indicating a QFULL onto the queue which is full, but not by any means robust. [1]

Roger Lacroix of Capitalware did some testing using his MQ Auditor exit and discovered that on hitting QFULL, the Command Server attempts to put an AMQ8101 message onto the reply queue.  When this succeeds, the requesting application (runmqsc in my case) has some positive acknowledgement of the error.  However since, by definition, the target queue was full last time the Command Server checked, putting an error message there is probably not the ideal strategy.  In my case runmqsc did not consistently report the AMQ8101 error which suggests the Command Server experienced a second QFULL when putting the error message, at least some of the time.

Extended outage times

The fact that the Command Server fails to write an entry to MQ’s error log when aborting processing has its own set of cascading effects.  From the perspective of the client application, the primary indicator of a problem is that the Command Server returns a variable result set for things that should be relatively stable such as DIS QCLUSTER(*).  If the results are fed into a dashboard, the variability in the result set shows up as large chunks of the MQ estate winking in and out of existence.

Worse, the problem may go completely unnoticed for things that tend to be more dynamic such as DIS CHS(*).  Since connections can be volatile anyway, a slightly higher degree of volatility could appear normal until and unless someone was looking for a specific connection and saw it appear to wink in and out of existence.

Due to the total lack of error log entries on the queue manager side, the MQ Administrator may spend a lot of time focusing on the client application since this is the only component where the variable result set and/or an error shows up.  In my case I ended up taking traces at the client first.  When I saw it was emptying the queue without apparent error, I then took traces of amqpcsea (the process name of the Command Server).  Only after I saw it was getting QFULL did I think to look at MAXDEPTH on the model queue.

 

Widespread impact

The reason for this follow-up post is that the last one characterized the issue as a “runmqsc gotcha” which grossly understates the impact.  This is in fact a Command Server gotcha which means it impacts anything and everything that relies on the Command Server.

In IBM-land that includes runmqsc, of course, as well as MQ Explorer, the REST API, dmpmqcfg, MFT, AMS, MQXR, and probably a few MQ components I’ve overlooked, all of which are in theory fully supported components of MQ that you might expect to be a tad more reliable than this.

Then there are all the IBM products built on top of or that bundle MQ which use the Command Server such as IIB, MessageSight, Tivoli CAM and OMEGAMON agents, InfoSphere Information Server, Rational Integration Tester, and many, many more.  Almost all IBM products stacked atop MQ either talk to the Command Server routinely, or else have setup wizards that do so at build time.  Fortunately, something like WebSphere Retail Server is unlikely to issue the kinds of commands that return massive result sets.  On the other hand, these are exactly the kind of result sets something like Rational Integration Tester is likely to call for.

Obviously, all the 3rd party products that use MQ’s Command Server are also affected.  My current client uses Avada’s IR-360 and by playing around with MAXDEPTH on IR-360’s model queue I was able to get the Command Server to fail in exactly the same way that it did for runmqsc.  This issue potentially affects all the commercial and freeware MQ Explorer replacements, administrative tooling, developer tooling, testing harnesses, monitors, PCF-based agents, etc.

Keep in mind that even though the only error you see will be on the client side, the QFULL occurs on the QMgr side and the Command Server may not be able to place the error message onto the full queue.  The client is limited to doing things like re-issuing the request but there is no “resume DIS QCLUSTER(*) from response #3215” type of functionality that might get it past the previous error.  This means that even Paul Clarke’s MQMon program can’t stop the Command Server from losing non-persistent responses or aborting with an incomplete result set. (Although I would not be surprised to find that MQMon gave a more meaningful error to the user than does runmqsc.)

 

IBM’s response

My PMR proposed three possible actions to address this:

  1. Bump the default MAXDEPTH on the model queue.
    (See below for issues with this approach.)
  2. Enhance amqpcsea to produce an error log entry from its QFULL abort routine.
  3. Add some guidance in the Knowledge Center that might lead the MQ Admin to bump MAXDEPTH without having to trace everything, open a PMR, or undertake other lengthy diagnostics that extend the outage.

The Command Server already emits error log entries for other fatal exceptions.  It’s the primary administrative interface for an otherwise rock-solid product, so one would hope that it would emit an error log entry for any fatal exception.

When reporting to any vendor “hey, the primary administrative component of your product is unreliable under these common conditions” the expected response is along the lines of “Wow, how’d we miss that? Thanks for bringing it to our attention. We’ll get right on it.” Unfortunately, the official response to my PMR amounts to “working as designed,” although thankfully the writer didn’t use those exact words.

Unofficially, I’ve pled my case to some of the MQ Dev team management and am hoping this is taken up as a usability issue. The notion that the primary administrative component of MQ fails to emit an error log entry during a fatal exception and it is not considered a defect seems like a perverse result and I’m hoping the folks in the lab will see it that way too.

For now though it’s up to us to understand this and come up with our own mitigation strategies.  Which brings me to…

 

Mitigation strategies

Use persistent messages

I know you were thinking “just bump MAXDEPTH” but bear with me a moment.  One of the behaviors that I saw was an incomplete result set where some of the missing messages were from the middle of the result set and no AMQ8101 message was received.  In other words, a partial result set that looks completely normal unless you already know exactly how many responses to expect and realize this isn’t it.

There are plenty of conditions that result in loss of non-persistent messages but do not involve QFULL, and any client-based instrumentation or tooling is more likely to encounter these than something running locally and using native MQ shared memory bindings.  Similarly, issuing MQSC commands through an intermediate queue manager can drop non-persistent messages if the XMitQ has a low MAXDEPTH and/or the channel uses NPMSPEED(FAST).  The last message of the result set will have the “last in group” flag set.  Any scenario in which the messages lost do not include the last one therefore shows up as a successful call, even though the result set is incomplete.

On the other hand, a QFULL reason code in a Dead Letter Header would be almost as useful to the MQ Admin as an error log entry.  Possibly more so because the MQ Admin has to know to go looking for an error log entry in the first place.  The Command Server could return incomplete result sets for years without anyone noticing, whereas messages landing in the DLQ are as visible as emergency flares.

Using persistent messages prevents all the routine cases in which MQ silently discards or loses them so there will be evidence in the DLQ or elsewhere of any issues.  The down side is that this strategy requires use of permanent dynamic queues and chances are high that many components aren’t coded to issue delete commands against PermDyn queues.  That means it’s up to the MQ Admin to prune the unused orphaned reply queues from time to time.
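For the pruning chore, a minimal Python sketch might look like the following.  It assumes the admin has already collected queue attributes (for example from DISPLAY QLOCAL output) into dictionaries.  The reply-queue prefix and the "empty with no open handles" rule for deciding a queue is orphaned are my own assumptions, not anything IBM prescribes, so review the generated commands before running them.

```python
def prune_commands(queues, prefix):
    """Build DELETE QLOCAL commands for reply queues that look orphaned.

    A queue qualifies only if it is a permanent dynamic queue whose name
    starts with the given prefix, it is empty, and nothing has it open
    for input or output. These criteria are assumptions for illustration.
    """
    commands = []
    for q in queues:
        idle = q["IPPROCS"] == 0 and q["OPPROCS"] == 0
        if (q["QUEUE"].startswith(prefix)
                and q["DEFTYPE"] == "PERMDYN"
                and q["CURDEPTH"] == 0
                and idle):
            commands.append(f"DELETE QLOCAL({q['QUEUE']})")
    return commands


# Hypothetical sample data; the queue names are made up.
SAMPLE_QUEUES = [
    {"QUEUE": "MQADMIN.REPLY.5B0A1C01", "DEFTYPE": "PERMDYN",
     "CURDEPTH": 0, "IPPROCS": 0, "OPPROCS": 0},
    {"QUEUE": "MQADMIN.REPLY.5B0A1C02", "DEFTYPE": "PERMDYN",
     "CURDEPTH": 12, "IPPROCS": 0, "OPPROCS": 0},
    {"QUEUE": "MQADMIN.REPLY.5B0A1C03", "DEFTYPE": "PERMDYN",
     "CURDEPTH": 0, "IPPROCS": 1, "OPPROCS": 0},
    {"QUEUE": "APP.ORDERS", "DEFTYPE": "PREDEFINED",
     "CURDEPTH": 0, "IPPROCS": 0, "OPPROCS": 0},
]

if __name__ == "__main__":
    for command in prune_commands(SAMPLE_QUEUES, "MQADMIN.REPLY."):
        print(command)
```

Queues that still hold messages are deliberately left alone, since leftover persistent replies are exactly the evidence this strategy is meant to preserve.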

Increase MAXDEPTH on the model queue

The obvious fix is to increase MAXDEPTH so the Command Server never hits a QFULL condition, but this has its own issues.  For starters, which model queue or queues do you change and by how much?  There are a few well-known model queues:

  • SYSTEM.MQEXPLORER.REPLY.MODEL with default MAXDEPTH(5000)
  • SYSTEM.MQSC.REPLY.MODEL with default MAXDEPTH(3000)
  • SYSTEM.DEFAULT.MODEL.QUEUE with default MAXDEPTH(5000)

There are also some not-so-obvious model queues.  Many of IBM’s layered products define their own model queue.  Many 3rd party products do as well.  If the strategy is to bump MAXDEPTH on any queue used by the Command Server, the MQ Admin must be able to identify all of them, including the ones that are not among IBM’s defaults.
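One way to find them all is to list every model queue on the QMgr, for example with DISPLAY QMODEL(*) MAXDEPTH, and flag the ones that are too shallow.  The Python sketch below parses runmqsc-style KEYWORD(value) output.  The sample text is illustrative rather than captured from a real queue manager, ACME.TOOL.REPLY.MODEL is a made-up vendor queue, and the exact layout of runmqsc output varies by version, so treat the parsing as an assumption to verify.

```python
import re

# Matches MQSC attribute pairs such as MAXDEPTH(3000).
PAIR = re.compile(r"([A-Z]+)\(([^)]*)\)")

# Illustrative sample only, not captured from a real queue manager.
SAMPLE = """\
AMQ8409I: Display Queue details.
   QUEUE(SYSTEM.DEFAULT.MODEL.QUEUE)       TYPE(QMODEL)
   MAXDEPTH(5000)
AMQ8409I: Display Queue details.
   QUEUE(SYSTEM.MQSC.REPLY.MODEL)          TYPE(QMODEL)
   MAXDEPTH(3000)
AMQ8409I: Display Queue details.
   QUEUE(ACME.TOOL.REPLY.MODEL)            TYPE(QMODEL)
   MAXDEPTH(1000)
"""


def parse_display_output(text):
    """Group KEYWORD(value) pairs into one dict per QUEUE(...) entry."""
    records = []
    for key, value in PAIR.findall(text):
        if key == "QUEUE":
            records.append({})  # each QUEUE(...) starts a new record
        if records:
            records[-1][key] = value.strip()
    return records


def shallow_model_queues(records, largest_result_set):
    """Names of model queues whose MAXDEPTH does not exceed the result set."""
    return [r["QUEUE"] for r in records
            if r.get("TYPE") == "QMODEL"
            and int(r.get("MAXDEPTH", 0)) <= largest_result_set]


if __name__ == "__main__":
    for name in shallow_model_queues(parse_display_output(SAMPLE), 4005):
        print(name)
```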

Assuming all the model queues are correctly identified, how high should MAXDEPTH be?  At minimum it must be greater than the number of messages in the largest possible result set.  At the other extreme, setting MAXDEPTH(999999999) is one way to ensure the program will never get a QFULL.  From a security perspective this setting opens up lots of new vectors for resource exhaustion attacks, especially if the queue in question is SYSTEM.DEFAULT.MODEL.QUEUE.
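Between those two extremes, one possible approach is to size MAXDEPTH from the largest result set you expect, plus some headroom for growth.  The Python sketch below does that arithmetic and builds the matching ALTER QMODEL command.  The function names and the 2x headroom factor are arbitrary assumptions of mine, not IBM recommendations.

```python
import math

# Largest MAXDEPTH value mentioned above.
MAXDEPTH_LIMIT = 999_999_999


def suggest_maxdepth(largest_result_set, headroom=2.0):
    """Suggest a MAXDEPTH strictly greater than the largest result set.

    headroom is a growth multiplier; 2.0 is an arbitrary assumption.
    """
    if largest_result_set < 0 or headroom < 1.0:
        raise ValueError("need a non-negative count and headroom >= 1.0")
    wanted = max(largest_result_set + 1,
                 math.ceil(largest_result_set * headroom))
    return min(MAXDEPTH_LIMIT, wanted)


def alter_command(model_queue, maxdepth):
    """Build the MQSC command that applies the new depth to a model queue."""
    return f"ALTER QMODEL({model_queue}) MAXDEPTH({maxdepth})"


if __name__ == "__main__":
    depth = suggest_maxdepth(4005)
    print(alter_command("SYSTEM.MQSC.REPLY.MODEL", depth))
```

Note that changing the model queue only affects dynamic queues created afterwards, so already-running tools need to reconnect to pick up the new depth.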

The chance of an attacker exploiting MAXDEPTH on model queues is far smaller than the near certainty in a moderately large MQ shop of the Command Server hitting QFULL.  On the other hand, the potential impact of a resource exhaustion attack is loss of the entire QMgr when the filesystem fills up.  Maybe filesystem usage is monitored.  Maybe not.  Each shop needs to do its own cost/risk/benefit analysis on this one, but all other things being equal, this is one case in which I don’t mind so much setting a really high MAXDEPTH even though I generally advise against this practice for application queues.

Use custom model queues

For your own code and any vendor code that allows it, moving to dedicated model queues provides a bit of useful granularity.  If you are forced, for example, to bump SYSTEM.DEFAULT.MODEL.QUEUE to MAXDEPTH(999999999) but are able to move your applications to a dedicated model queue with a more reasonable MAXDEPTH, then there are fewer vectors for that resource exhaustion attack.  In the event of an actual attack or, more likely, a misbehaving program, it will be easier to identify the source.

Use NPMSPEED(NORMAL)

In the case of sending commands through an intermediate queue manager, it’s possible to lose non-persistent replies that traverse fast channels.  Setting NPMSPEED(NORMAL) eliminates that point of failure but also impacts all other non-persistent message traffic on that channel.  Again, this needs a bit of cost/benefit analysis and I can’t make a blanket recommendation.

Defensive coding

My original post describes some defensive coding techniques for scripts that scrape runmqsc output.  On the MQ Listserv Tim Zielke describes a solution for applications that work directly with the PCF responses.  He proposes “tracking the number of results that come back and then comparing that number to the MQCFH.MsgSeqNumber for the result that has the MQCFH.Control = MQCFC_LAST.” [2]
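Tim’s check can be sketched in a few lines of Python.  This is my own illustration rather than his code: PcfHeader is a hypothetical stand-in holding just the two MQCFH fields involved, and a real application would fill it in from each reply message using whatever MQ binding it has.

```python
from dataclasses import dataclass

# Values of the MQCFH Control field as defined in the MQ PCF headers.
MQCFC_NOT_LAST = 0
MQCFC_LAST = 1


@dataclass
class PcfHeader:
    """Hypothetical stand-in for the two MQCFH fields the check needs."""
    msg_seq_number: int
    control: int


def result_set_complete(headers):
    """Compare the number of replies received to the last reply's MsgSeqNumber.

    Returns False if no reply carried MQCFC_LAST or if the count of
    replies differs from the sequence number on the last one.
    """
    received = 0
    expected = None
    for header in headers:
        received += 1
        if header.control == MQCFC_LAST:
            expected = header.msg_seq_number
    return expected is not None and received == expected


def make_replies(count):
    """Build a well-formed result set of `count` replies for testing."""
    return [PcfHeader(n, MQCFC_LAST if n == count else MQCFC_NOT_LAST)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    replies = make_replies(5)
    print(result_set_complete(replies))                   # full set
    print(result_set_complete(replies[:2] + replies[3:])) # reply 3 lost
    print(result_set_complete(replies[:-1]))              # last reply lost
```

Because the count is checked against the sequence number on the final reply, this catches the case where responses vanish from the middle of the result set, which the “last in group” flag alone does not.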

MAXMSGL – wait, what?

In addition to selecting an unusually small default for MAXDEPTH on the MQSC Reply queue, IBM also set the MAXMSGL on that queue to a very small non-standard value of about 32k.  Presumably someone added up the fields for the largest possible PCF response to come up with this number but I haven’t personally double-checked that.  I’d like to think that’s how they came up with that number but then again, until recently I also liked to think the Command Server had robust exception handling and would cut an error log entry for any fatal exception.

The behavior on hitting MAXMSGL can easily be tested by setting a low MAXMSGL on the model queue and I have left that as an exercise for the reader.  If the Command Server cuts an error log entry or FDC when MAXMSGL is exceeded then perhaps we can leave it at 32k, confident that any failures will be reported.

On the other hand, I’m sure some readers will want to bump it to 4MB or even 100MB just on principle.  I’m not calling anyone out but the name of the person I have in mind rhymes with Bren Glumbaugh.  You know who you are. 😉

 

Final thoughts

I’ve been using runmqsc for 25 years without incident, but that’s always been the local version.  When I noticed the problem I was able to confirm it by running the same commands locally and over client then comparing the results.  The local version consistently gave me the complete result set under all test conditions.

The conclusion I draw from this is that a vast body of existing scripts and other tools that use the Command Server probably contain implicit assumptions about reliability and lack robust exception handling.  Converting them from local to MQ Client-based operation without also adding robust exception handling may result in some level of unreliability, even when using a really large MAXDEPTH.  If you are interested in some defensive coding techniques, go read the original post Client-based runmqsc gotcha.  There isn’t a good way to detect the loss of responses in the middle of a result set but this will at least pick up any errors that do get reported.

As always, any comments or feedback are greatly appreciated.  Much of my feedback tends to be over email for reasons of confidentiality but I try to sanitize it and republish when possible so that the knowledge gained can be useful to others.  This post contains feedback from Roger Lacroix and two others, all of whom I am deeply grateful to.  Thanks!


Updates:

  1. 20180514 – Added paragraph on messages put to the DLQ by amqpcsea.  This was omitted in the original because I wanted to focus on the need for robust error logging and the explanation of how DLQ messages fail did not seem to add constructively.  I have in email been advised this was a bit unfair to the Command Server.
  2. 20180514 – Added material provided by Tim on the MQ list regarding defensive PCF coding.