Client-based runmqsc gotcha

Most shops running more than a handful of queue managers end up writing a script or two that sends commands to runmqsc and parses the responses.  Until recently the options were to run those scripts locally or to access remote queue managers using the MQ network.  Neither of these options is ideal.  Use of local scripts requires maintaining a code base distributed across a variety of hosts and often disparate platforms. Sending runmqsc commands across the MQ network solves those problems but requires a security model in which at least some of the adjacent queue managers can administer each other.

The new MQ Client capabilities in runmqsc give us the best of both worlds. We can maintain a single set of scripts in a central location and use them to connect directly to each of the queue managers using MQ client.  Some functions that require correlation of state across multiple queue managers are actually easier to build using MQ client.
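
For example, a one-off client connection from a central administration host might look something like this (a minimal sketch; the channel name, host, and port are placeholders for whatever SVRCONN channel the target queue manager exposes):

export MQSERVER='SYSTEM.ADMIN.SVRCONN/TCP/mqhost.example.com(1414)'
echo "DIS QMGR CMDLEVEL" | runmqsc -c QM1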

There is, however, a gotcha that anyone using client-based runmqsc should be aware of.  I’ll explain that here along with some techniques to mitigate the issue and feedback from the PMR I opened.

Workflow

The steps to complete a client runmqsc call are as follows:

  1. The runmqsc component is executed in client mode on the remote node and requests a connection to the target queue manager.
  2. The queue manager’s Message Channel Agent process amqrmppa receives the connection…
  3. …creates a dynamic reply-to queue based on SYSTEM.MQSC.REPLY.QUEUE…
  4. …and places the command on SYSTEM.ADMIN.COMMAND.QUEUE.
  5. The Command Server process amqpcsea receives the command from the queue.
  6. The Command Server places the responses on the dynamic Reply-To queue.
  7. The Channel Agent transmits the messages over the network.
  8. The runmqsc client receives and displays the response set.

runmqsc -> amqrmppa -> amqpcsea -> amqrmppa -> runmqsc

Note that runmqsc connecting to a local queue manager might also use the command server.  Although it could directly access MQ objects in this scenario, it would make sense for it to use similar code paths in local and client mode.  This is far less complex than having one code path that uses local, privileged, internal MQ calls and another, completely different, code path for clients.  For the purposes of this discussion, the important thing about runmqsc running locally is that it talks to MQ directly, which is to say it's really fast.
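
If you want to confirm the server-side pieces of this flow are in place before scripting against it, the queue manager can tell you directly. A minimal sketch, run locally on the queue manager host (QM1 is a placeholder name):

echo "DIS QMSTATUS CMDSERV
DIS QLOCAL(SYSTEM.ADMIN.COMMAND.QUEUE) CURDEPTH" | runmqsc QM1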


What’s changed?

When IBM delivered the runmqsc client capability they also gave us a new model queue named SYSTEM.MQSC.REPLY.QUEUE dedicated to that use.  In a curious and typically inscrutable design decision, IBM set the default MAXDEPTH for this queue to 3,000.

Since this is less than the system-global default of 5,000 set on most other queues, one assumes there is some risk that is mitigated by choosing a more conservative setting.  Whatever that risk is, it must be significant, since the trade-off is that client runmqsc behaves unreliably as the number of response messages approaches and exceeds this value.
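
Checking the current setting is a one-liner from a local (or client) runmqsc session; this sketch assumes a queue manager named QM1:

echo "DIS QMODEL(SYSTEM.MQSC.REPLY.QUEUE) MAXDEPTH" | runmqsc QM1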


How it breaks

A client-based runmqsc creates a fast-producer/slow-consumer situation because amqpcsea can dump messages into the reply-to queue much faster than amqrmppa can retrieve them and deliver them over the network. Accordingly, a runmqsc client call that results in more than MAXDEPTH responses runs a risk of overrunning the queue.

Recently I’ve been converting my Cluster Health scripts to use runmqsc client and the cluster I’m working with has more than 4,000 advertised queue objects. Over successive runs the total number of responses returned to runmqsc fell short of the expected total about 60% of the time.

This meant that about 40% of the time amqrmppa was able to pull at least the first 1k messages from the queue before amqpcsea finished writing the full 4k messages, so the reply-to queue never filled up.  The other 60% of the time amqpcsea encountered a 2053 QFULL condition and the exception handling did not recover correctly.

I tested this by adjusting MAXDEPTH very low and very high and got the expected behavior reliably in both configurations: whenever MAXDEPTH was low the error rate increased, and setting it to a very high value completely eliminated the errors.
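
A rough way to reproduce and observe the shortfall is to repeat a large DISPLAY command and count the replies. In this sketch CLUSTER and QMGR are placeholders, and a reply count that varies from run to run is the symptom described above:

for i in 1 2 3 4 5; do
  echo "DIS QCLUSTER(*) CLUSTER($CLUSTER)" | runmqsc -e -c "$QMGR" | grep -c '^AMQ8409'
done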


Why is this a ‘gotcha’?

The problem is that neither runmqsc nor amqpcsea provides an error message that might lead the MQ Admin to adjust MAXDEPTH on the model queue.  In fact, amqpcsea doesn’t provide any error message at all.  Anyone trying to track this down on the queue manager side will find no clues whatsoever.  No error log message, no event message, no FDC.

On the client side the messages can be buried among the responses and go completely unnoticed.  Because the Command Server doesn’t provide header or footer records with totals to reconcile back to, it is not immediately obvious when partial result sets are received.  A script that doesn’t explicitly look for these error messages will probably fail silently and use the partial result set as if it were complete.

The chance this will cause problems is a very high number that approaches certainty.  Over more than 20 years participating on the MQ Listserv, in various other online communities, and attending more conferences than I can count, I don't remember any discussion on the topic of runmqsc results being unreliable.  I point this out to support my next assertion, which is that the vast body of existing scripts built around runmqsc virtually all contain the implicit assumption that the result sets returned are complete.  When using runmqsc client, this level of reliability can no longer be taken for granted.

In making this claim I do realize that converting scripts to use client-based calls introduces the entire class of errors related to connection management and that tech-savvy MQ Admins will have already added some exception handling code to their scripts.  Great!  But what conditions do these scripts look for?  Some of the errors I encountered are brand new.   I looked through the Knowledge Center to assemble a list of error messages but am to this day not confident I’ve identified the complete set.  Even if I have, this is new function and IBM may add a new error message as they refine their internal exception handling.


Practical workarounds

The first and most obvious workaround is to bump MAXDEPTH on the SYSTEM.MQSC.REPLY.QUEUE to some value much higher than 3,000. This is where I would really like to know IBM’s reasoning for choosing the value they did because the last thing I want to do is get into an iterative cycle of fail-and-fix.  My preferred solution would be to set a MAXDEPTH value that I will never need to adjust again, and then make that part of my baseline.  Since I can’t imagine any harm in setting it to a high value, at least nothing that is worse than the behavior it exhibits with the default, I set it to 500,000.
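
The change itself is a single MQSC command, run once per queue manager and then carried in the baseline; QM1 is a placeholder name:

echo "ALTER QMODEL(SYSTEM.MQSC.REPLY.QUEUE) MAXDEPTH(500000)" | runmqsc QM1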

The other approach I took was to avoid looking for specific error messages.  The messages to expect from a good result are well known.  When running DIS CLUSQMGR the replies come back as AMQ8441 messages.  Similarly, DIS QCLUSTER commands return AMQ8409 messages, DIS QSTATUS commands return AMQ8450, and so on.  So rather than look for specific errors, I just look for messages that are not in the expected set.

For example:

CLUSQM="$(echo "DIS CLUSQMGR(*) CLUSTER($CLUSTER) ALL" | runmqsc -e -w 10 -c "$QMGR")"
ERR=$(echo "$CLUSQM" | grep -v -e '^AMQ8441' | grep -E -e '^AMQ[[:digit:]]+' | wc -l)

This fetches the results of DIS CLUSQMGR(*) and then counts all the AMQ messages other than the expected ones.  If $ERR ends up with a non-zero value we can take appropriate action even if the specific error is unknown when the script is written.  If over time unhandled AMQ messages that are routine and expected make themselves known by tripping the error code, we simply extend the grep -v call with another -e pattern.
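
A minimal follow-on check might look like this, where the handling action is a placeholder for whatever alerting or retry logic your scripts already use:

if [ "$ERR" -ne 0 ]; then
  # The result set is suspect; log it and retry or skip this queue manager.
  echo "Unexpected AMQ responses from $QMGR; treating result set as incomplete" >&2
fi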


The PMR and IBM’s response

Over the years my consulting clients have typically been very large banks, insurers, state and federal government agencies, and other large institutions.  Virtually all of these large shops have queue managers and clusters with more than 3,000 objects.   Extrapolating from there I’m assuming that virtually all of IBM’s best MQ customers will find client-based runmqsc to be unreliable.

Based on this I offered IBM two different proposals.  The first was that if runmqsc client is "working as designed" then the Knowledge Center should be updated to explain the issues, what the MQ Admin needs to look for, and the suggested remediation.  If there's a compelling reason for SYSTEM.MQSC.REPLY.QUEUE to have a low MAXDEPTH, it can be explained there and users can weigh it against the unreliable behavior.  Of course, that's the worst-case fix since it requires IBM to document that runmqsc is routinely and silently unreliable, but that this is by design.

My preferred proposal is that IBM enhance amqpcsea to emit error log entries, and possibly event messages, when it encounters a QFULL condition.  This would allow for automated detection of the events as well as leave a breadcrumb trail for the MQ Admin.

Bumping MAXDEPTH on the model queue would be helpful but does not address the root cause unless it gets bumped to 999,999,999, similar to the dead-letter queue.

IBM has confirmed that all of this revolves around the Command Server hitting QFULL.  They have not provided much detail as to the level of exception handling, or accounted for the variety of error messages seen on the client side.  Their initial response was that they have passed my enhancement request to the lab, but that QFULL is a generic condition and not appropriate for error logs or event messages.  I replied that the concern is limited to amqpcsea, and in that context a QFULL should be a reportable error because it makes the result set unreliable.  As an MQ Admin I don't want QFULL instrumented for its own sake; I want MQ to be reliable, or to alert me to conditions in which it isn't, rather than silently omit half a result set.


Next steps

Obviously, I will post the final results when the PMR is closed.  Check back, subscribe, or watch for an announcement on the Vienna MQ Listserv if you are curious.

Regular readers who made it this far in the post are surely thinking “he said ‘MQ client’ but didn’t mention security!”  Correct!  If you attend this year’s MQTC I’ll address this at one of my sessions which will focus on getting runmqsc and dmpmqcfg to run in client mode over mutually authenticated TLS channels.



3 Responses to Client-based runmqsc gotcha

  1. Pingback: MQGem Monthly (May 2018) | MQGem Software

  2. Pingback: Command Server Gotcha | Store and Forward

  3. Ruud van Zundert says:

    Hi T.Rob. I too have been using runmqsc in client mode over the last few months. I am also successfully running dmpmqcfg in client mode, but was surprised at the difference in parameters:

    – runmqsc -c
    – dmpmqcfg -c default

    Note also that there are several bugs in dmpmqcfg:
    – if specifically dumping, say, SYSTEM.DEF.CLNTCONN … it won't
    You have to dump the whole lot and then scan for this channel.

    – dmpmqcfg -m -a -o 1line
    produces the wrong output for SYSTEM.DEFAULT.SUB
    DEFINE SUB('SYSTEM.DEFAULT.SUB') TOPICSTR('') TOPICOBJ(' ') PUBAPPID(' ') SELECTOR('') USERDATA(' ') PUBACCT(0000000000000000000000000000000000000000000000000000000000000000)
    DESTCLAS(PROVIDED) + DEST(' ') + DESTQMGR(' ') + DESTCORL(000000000000000000000000000000000000000000000000) EXPIRY(UNLIMITED) PSPROP(MSGPROP) PUBPRTY(ASPUB)
    REQONLY(NO) SUBSCOPE(ALL) SUBLEVEL(1) VARUSER(ANY) WSCHEMA(TOPIC) SUBUSER(' ') REPLACE

    … works fine if ‘1line’ is left out.

    – if you dump a SENDER chl via dmpmqcfg, eg dmpmqcfg -m -t channel -n RUUD.TEST.CHL -a

    you get an odd layout towards the end of the print. Everything – like the runmqsc output – is alphabetical, except for the following (TRPTYPE):

    STATCHL(QMGR) +
    TRPTYPE(TCP) MODENAME(' ') TPNAME(' ') +
    USEDLQ(YES) +

    When I queried this via a PMR I got the following odd reply:
    received the following information from L3:

    TRPTYPE(TCP) MODENAME(' ') TPNAME(' ') +
    ModeName and TpName attributes are relevant only if 'TRPTYPE' is LU 6.2. Hence these values are grouped together in 'dmpmqcfg' output. This is a design decision.

    To which I replied:
    My comment was not whether the dmpmqcfg could be used – without problems – to do a restore,
    but it had changed not just the layout of the output, but also the order.

    One of the problems with the ‘display’ commands in previous versions was that the output was not alphabetic
    and that searching for a keyword could take some time for the human eye.
    Fortunately that was changed to the alphabetic order.

    So to summarise:
    – the alphabetic order is no longer maintained
    – the ‘one attribute’ per line is no longer maintained
    – and this was a ‘design issue’? …come on, IBM can do better than that.

    Cheers … Ruud
