Sure, it’s always an MQ problem. Why is that a bad thing?

I don’t always have problems, but when I do, I blame them on MQ


One recurring theme in the MQ community is that all problems are MQ problems.  Never mind that they almost always turn out to be application, network, firewall, SAN, account maintenance, resource constraint, human error or even sabotage issues; it’s an MQ problem.  If all the problem tickets have “MQ problem” in the title and are all assigned to the MQ team, the team and MQ itself can get a bad reputation within the organization.  I’ve been working with MQ since the mid-1990s and it has always been thus.  The complaint was raised again today.

I used to try to correct the record and the language used for EVERY incident. This worked for a while, but with a lot of turnover we are back to where we were a couple of years ago. I have had very little success trying to re-educate the current development staff.

Question: how do you deal with an organizational and cultural issue like this?

By the time I started an MQ admin team at my former employer, MQ had a terrible reputation.  In fact, the only reason I was assigned to it full time was that there were frequent outages and management decided it needed a lot of care and feeding.  One of the developers had become the resident MQ expert, and after the first app went into production he provided the MQ support.  After two and then three apps were in production, he spent as much time supporting MQ as writing code.  Since I wasn’t qualified to write C code (I was a COBOL programmer then) and since everyone knew a trained monkey could watch over MQ, he got to go back to coding full time and I was given 4 queue managers and a banana.  I quickly discovered that every problem was an MQ problem.  After a while they weren’t just MQ problems, they were T.Rob problems, since my name was on every trouble ticket.  I began to think taking on the MQ admin role would ruin my career.

There were some genuine problems with channel stability that I solved fairly quickly by consolidating all the QMgr installations to the latest version and applying fixes.  I also standardized the installations and added things like auto-start on server reboot.  It didn’t take long before the genuine MQ problems were rare exceptions, but this didn’t seem to help my reputation.  I’d figure out the non-MQ root cause of a problem and assign it to the right person, often describing the exact fix.  “You see, there’s this thing called a poison message…”  Or “don’t use the physical IP address on the connection, use the virtual IP address.”  Despite my best work, MQ was still in the title of all the trouble tickets, I was doing all the grunt work, and all my management saw was “MQ,” “T.Rob,” “Outages.”  I had tried mounting direct challenges to this perception a couple of times, showing documented cases where root causes were traced back to other systems.  Despite all the chest-beating, management remained unimpressed.
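For anyone who hasn’t met one, a poison message is a message the app can’t process: the transaction rolls back, MQ redelivers the message, it fails again, and the queue backs up behind it while the app spins.  Here is a minimal sketch of what application-side handling might look like, assuming a transacted JMS session against MQ; the class name, threshold and backout destination are illustrative assumptions, not anything from the shop standards described here.

    // Minimal sketch, not production code: sidestep a poison message by
    // checking the redelivery count and parking the message on a backout
    // queue instead of rolling it back forever.
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    public class PoisonAwareConsumer {

        private static final int BACKOUT_THRESHOLD = 5;  // assumed limit before giving up

        public void handle(Session session, Message msg, MessageProducer backoutProducer)
                throws JMSException {
            // JMSXDeliveryCount increments each time the message is redelivered
            // after a rollback; a high count suggests the message itself is bad.
            int deliveryCount = msg.getIntProperty("JMSXDeliveryCount");
            if (deliveryCount > BACKOUT_THRESHOLD) {
                backoutProducer.send(msg);   // park it on the backout queue
                session.commit();            // so it stops blocking the main queue
                return;
            }
            try {
                process(msg);                // normal business logic
                session.commit();
            } catch (Exception e) {
                session.rollback();          // message returns to the queue, count rises
            }
        }

        private void process(Message msg) { /* application logic */ }
    }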

I pondered this over a banana one day and came up with a new approach.  I would build systemic incentives: negative incentives for the application teams to stop attributing every problem to MQ, and positive incentives for them to avoid the problems in the first place.  Since by this time MQ was blamed for everything, MQ had visibility at fairly high levels.  The spotlight is good; it’s just the spin that is bad.  I started working on the stick phase of my plan.

I began to personally participate in every “MQ problem” that was reported.  Because of the high visibility, things were escalated fast.  I always made sure to provide near-continuous status reporting and included at least one tier of management above whoever it had already been escalated to.  Rather than holding the problem until resolution, I would start with MQ and work my way outward, transferring the ticket as soon as I had documented that MQ was in the clear and had enough additional information to indicate the next logical diagnostic owner.

Since MQ was usually eliminated very quickly as the root cause, most of the problem reporting focused on what was now an app issue, network issue, firewall issue, etc.  Very quickly, all involved would hear “well, we now know it’s not an MQ issue but indicators point to…” and whoever had escalated the ticket to light a fire under me soon found themselves the focus of all the attention.  Not that I pointed fingers; I just presented the diagnostics.  I shone a giant spotlight on the problem and followed the trail, confident it would usually lead somewhere else.  During the handoff I made sure to attribute the likely source to an app or a component, or even to an administrative action, but never to a person.  Often the root cause was in the system where the problem was discovered and first reported.  In that case I’d give them a free pass or two and then gently suggest that “I helped you resolve these and didn’t point fingers, how about you don’t call these MQ problems when you write them up?”  If that didn’t work, I’d be a bit more public in resolving things quickly as “not an MQ problem.”

Of course, if it actually were an MQ problem, I’d take full responsibility.  I could afford to take this position because by then it hardly ever was MQ’s fault.  The fact that I embraced and took ownership of the things that actually were MQ problems gave me more credibility when I said something wasn’t MQ.  I also published shop best practices for coding MQ apps and offered lunch-n-learn sessions and other training for project teams.  If someone persisted in blaming MQ despite all my other efforts, I might point out in my rebuttal that “not only was this not an MQ problem, but the internal MQ Admin web site has docs describing how to avoid this exact problem and we cover it once a quarter in the lunch-n-learn.”

Over time this helped rehabilitate MQ’s reputation as well as my own.  Management now appreciated that doing a good job with MQ perhaps required more than just a trained monkey and the team gained a few top-notch members.  The stick had worked well.  The team was finally staffed for 24×7 support, outages were at their lowest levels ever and we provided this level of service while managing more QMgrs per person than ever before.  In the good graces of management and our application team customers, we had some latitude and a bit of clout.  Now it was time to implement the carrot phase of the plan.

When an app labeled a problem as “MDB not receiving messages” instead of “MQ not delivering messages” and then worked with my team to fix it, I was their best friend.  Outage durations were much shorter because of the lack of friction and because teams claimed ownership of tickets rather than reassigning them.  Afterward, I’d sign off on the problem report saying that the MQ team had collaborated on the solution and that I’d been assured it would be permanently fixed as of the next change cycle.  Of the apps using MQ, those who fought with us had the majority of outages and those who worked closely with us were very stable.  This did not pass unnoticed.  Because of that, earning our endorsement on the trouble ticket improved management’s confidence in the solution and generally diminished the negative consequences for the app team.

This sounds like a lot of work, and if I’d had to practice it all the time it would have been.  But the point was to build positive and negative incentives into the system to guide people to do the right thing and reward them when they did.  It became risky to label something incorrectly as an MQ problem but politically rewarding to collaborate on solutions.  Incorrectly attributing a problem to MQ could end up making the app team look like they were trying to shift blame and/or had ignored the training provided.  But state the problem in neutral terms and engage the MQ team early, and we’d help you track down the root cause and resolve the outage.  If we weren’t busy, we’d even help you with things that obviously weren’t MQ related, since by this time we had working relationships with the network team, firewall team, DNS team, sysops, accounts management and every other discipline that MQ depends on, and had become valuable facilitators.

The MQ team earned a rep as the problem solvers rather than the problem source, and that meant something.  It also gave us political capital to enforce best practices and shop standards in an organization over which we had no formal authority.  If the project manager said something was going to Production, the division my team was in had no veto power.  But if I discovered an app wasn’t printing linked exceptions or had some other defect my team considered sev-one, there was a good chance I could block the production implementation even though my boss couldn’t.  No longer content to fend off bad press and aspire to mere neutrality, we used our position as middleware admins to recast ourselves as the A-Team of problem resolution.
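Since the linked exception point comes up so often, here is a minimal sketch of what I mean by printing it.  When a JMS call against MQ fails, the underlying MQ error detail usually rides on the linked exception; an app that only logs the top-level JMSException never gets that detail into its logs.  The class name below is illustrative, not a shop standard.

    // Minimal sketch: walk and print the chain of JMS linked exceptions so
    // the underlying MQ error detail actually shows up in the application log.
    import javax.jms.JMSException;

    public final class JmsErrorLogger {

        public static void log(JMSException e) {
            System.err.println("JMS error " + e.getErrorCode() + ": " + e.getMessage());
            Exception linked = e.getLinkedException();
            while (linked != null) {
                System.err.println("  linked exception: " + linked);
                linked = (linked instanceof JMSException)
                        ? ((JMSException) linked).getLinkedException()
                        : null;
            }
        }
    }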

Once we had established the reputations of MQ and the team, maintaining that position required a lot less work.  In fact, the app teams were some of our best ambassadors.  On several occasions I overheard someone in an adjoining cube say something like “I think there’s a problem with MQ, my app isn’t getting any messages” and another developer would respond “It’s probably a poison message, did you read the poison message guidelines yet?”  Sometimes problems reappeared.  At one point we began to see poison messages after a two- or three-year remission.  Rather than blaming MQ, management encouraged the responsible app team to consult with us.

As middleware administrators, we often lament that our product gets blamed for everything due to, well, being in the middle.  We protect ourselves and the systems we manage from reputational attack by building walls to deflect attribution and blame.  We dream of the day when people stop dumping their problems on the MQ team.

What we often overlook, to our detriment, is that being in the middle gives us an unparalleled vantage point from which to observe and interact with all of the mission-critical systems running the company.  If you have some troubleshooting skill and a goal to become the most valuable person in the company’s IT department, you’d have a hard time finding a better position from which to accomplish that than as an MQ administrator.  From this perspective, the ideal product to manage would be virtually trouble-free yet still attract demand from adjacent systems for your services.

You want to assign that problem to me?
Go ahead. Make my day.

Want to change the culture?  I say tear down the walls.  Don’t deflect blame but use it as the raw material from which you refine your reputation as the go-to team for problem solving.  The day people stop dumping their problems on the MQ team is the day the MQ team ceases to be relevant.  But do give them reasons to focus on solutions rather than casting blame.  Provide real value.  Make the other teams look good for the wins and accept responsibility for the losses.  Show the other teams you are much more valuable as an ally than an adversary and don’t hold grudges.  Make it politically expensive to burn you but extremely valuable to engage you in good faith.  Whatever else you do, be happy they keep coming back.  It’s an MQ problem.  It’s always an MQ problem.  It always will be.  That’s a blessing, not a curse.


3 Responses to Sure, it’s always an MQ problem. Why is that a bad thing?

  1. Teresa says:

    Enjoyed the humorous and positive way T.Rob put it 🙂
    It’s also because MQ (and now ESB) has so much monitoring and is the first thing to indicate that something is going wrong; that is why it’s always the one sounding the alarm… and getting blamed. Applications that don’t even log errors never get blamed!!!
    Every MQ and ESB admin should read this article, and be proud to be one!!

  2. RC says:

    If I didn’t know better, I would have thought I wrote the above myself. It warms the cockles of my heart to know that other companies have the same issues with “It’s an MQ problem”. It always starts as an MQ problem and seldom ends up being one. Since MQ interfaces with just about everything, I know a lot about a lot of technologies, just like the author of this article.

  3. mr. MQ says:

    100% agree.

    I still fight the same issues as 15 years ago….
    I believe that the reason for saying it’s an MQ problem is to get the MQ admins involved,
    because the code worms have learned that we (MQ admins) have a good understanding of most application models and can therefore help pinpoint the root cause….

    Just my $0.02

    Regards
    Jørgen
