A growing number of my clients are deploying IBM MQ on Amazon EC2 instances and a common need that I see emerging is for instrumentation and tooling. When the MQ instance is ephemeral, deploying instances on demand and decommissioning just as suddenly, lots of things the MQ Admin used to do by hand need to be automated. This includes build-time things such as defining objects, run-time tasks like enabling or disabling queues in the cluster, and forensic capabilities such as archiving error logs.
It is this last item that concerned a recent customer. Their main requirement was to ingest MQ error logs in real time, or at least close to it, so those logs would survive death of the virtual host on which they were generated. Getting Splunk to ingest the logs was ridiculously easy. Just define the log files as a Splunk data input and immediately they become available through the Splunk search interface.
That’s all well and good if all you want to do is browse through the logs or search them for particular error codes. To get the benefit of Splunk analytics requires the error logs to be parsed into fields. Then instead of merely searching for error codes you already know about, you can ask Splunk to show you a report of all the error codes sorted by prevalence, by frequency over time, or even which ones are the rare outliers. All the analytic capabilities are usable once the fields are parsed. Better yet, parse logs from many queue managers and now you can spot trends or pick out nodes that are showing early signs of distress. That’s really useful stuff and Splunk provides it right out of the box, but only for log types it knows how to parse. So let’s teach it to parse IBM MQ error logs, shall we?
MQ error log structure
Before wading into the transform code, let’s look at a couple of typical IBM MQ error log entries. The first of these is reporting an error. The second one reports a normal channel startup.
12/18/2017 04:12:54 PM - Process(99475.1) User(mqm) Program(runmqchl) Host(mq8007.ioptconsulting.com) Installation(Installation1) VRMF(126.96.36.199) QMgr(ASH) AMQ9999: Channel 'ASH.BIRCH' to host 'localhost(11412)' ended abnormally. EXPLANATION: The channel program running under process ID 99475 for channel 'ASH.BIRCH' ended abnormally. The host name is 'localhost(11412)'; in some cases the host name cannot be determined and so is shown as '????'. ACTION: Look at previous error messages for the channel program in the error logs to determine the cause of the failure. Note that this message can be excluded completely or suppressed by tuning the "ExcludeMessage" or "SuppressMessage" attributes under the "QMErrorLog" stanza in qm.ini. Further information can be found in the System Administration Guide. ----- amqrccca.c : 1090 ------------------------------------------------------- 12/18/2017 04:12:54 PM - Process(100143.1) User(mqm) Program(runmqchl) Host(mq8007.ioptconsulting.com) Installation(Installation1) VRMF(188.8.131.52) QMgr(ASH) AMQ9002: Channel 'ASH.BIRCH' is starting. EXPLANATION: Channel 'ASH.BIRCH' is starting. ACTION: None. -------------------------------------------------------------------------------
Splunk is really good at finding timestamps so without any customization it splits the multi-line log events as you might expect, based on each event beginning with a timestamp and only one timestamp per event. After that it gets a bit more challenging. Splunk will tease out fields when it finds them in the format key=value but as seen here, MQ doesn’t use that format. At the top are some fields in the format Key(Value). These are followed by the error code and text fields. These are represented as bare values so we will need to provide the key names. After that come the explanation and action fields, and these are delimited by a colon and a newline and contain variable text which may extend over one or more lines.
Notice also that the eyecatcher between entries, the line with all the dash characters, has optional fields that contain the name of the C source file and line number within that. These are not always present but when they are there they are worth capturing.
The example error entries above are from the ASH queue manager in my MQ sandbox. MQ also maintains a set of server-global error logs for use when the error is not specific to a queue manager or the queue manager can’t be determined. Ideally, we want to parse MQ’s server-global error logs along with the queue manager error logs, but entries in the global logs will not have a field containing the queue manager name. That means the queue manager name must be an optional field, similar to the C source and line number fields.
Here is an example of a global error log entry that lacks the QMgr field:
12/18/2017 05:17:50 PM - Process(105765.1) User(peermap) Program(setmqinst) Host(mq8007.ioptconsulting.com) Installation(Installation1) VRMF(184.108.40.206) AMQ8584: Insufficient permission to update installation configuration. EXPLANATION: An attempt was made to update the WebSphere MQ installation configuration for installation 'Installation1' path '/opt/mqm' but the request was rejected as the current user does not have sufficient authority to make the update. ACTION: Issue the command from a user with sufficient authority to update the installation configuration. ----- amqiset0.c : 922 --------------------------------------------------------
Requirements for a transform
Splunk needs us to tell it a few basic things about the structure of the logs:
- Since this is a multi-line log format, Splunk needs to know how to tell when one entry ends and a new one begins.
- Delimiters for any key/value pairs in a log entry. This includes the delimiters between pairs, as well as the delimiter that joins the key to the value in the pair.
- How to find any fields that are not in a key/value format.
- Names for any fields that are not in a key/value format.
- Which fields, if any, are optional.
All of these things must be expressed in regex terms. Although Splunk allows for the definition of regex primitives that are combined to make more complex constructs, I chose to build one big regex. It’s easier for me to work with and test that way, and it meant I didn’t need to learn how to define and reference the primitives. I suspect this is one area where a more experienced Splunk admin would take a different approach. If so, I’d love to hear about it in the comments.
Splunk is really good at finding and parsing timestamps so we don’t need to give it any help there. But we do need to tell it how to find the first field after the timestamp and how to find the end of the entry.
This regex identifies the beginning of the record:
From the beginning of the line, look for one or more characters that are not a dash, followed by a dash, a space and the word ‘Process’. The end of the record is a bit simpler:
That’s one or more dash characters, zero or more space characters to make the regex a bit forgiving of variations in line endings, all leading up to the end of the line on which these appear.
Although it’s possible to specify delimiters and let Splunk parse the field names from the data, I wanted to define an IBM MQ namespace so I assigned my own field names. The first two fields are parsed by the following regex:
This finds each field by its literal name, then extracts the value between the parentheses and assigns it to the field name in angle brackets. The ?P is Splunk’s notation for a named field. The field name appears within angle brackets and the value is expressed as [^)]+)\) which means “one or more characters up to but not including a right parentheses.” The unescaped right parentheses is where the field-capture ends and the escaped one is the literal we see in the log entry.
Between the two fields is \s+ which means “at least one and possibly more whitespace characters.” Based on the log sample we have today a simple literal space would work. Allowing for variable number of whitespace characters makes the regex tolerant of some variation there such as future use of tabs or spaces by IBM to align fields.
The next two fields are the error code and the error text. Strictly speaking this is a key-value pair where the error code is the key. But if we let Splunk parse it that way we lose the ability to have Splunk calculate the distribution of error codes over time. What we really want is a single field with the error code itself as the value. The error text can have substitutable values so that isn’t a fixed string even though the error code is. I chose to capture the error text as its own field. Even though the error text was a single line in all my examples, we can’t count on this always being true. Something like a long hostname inserted into the error text might be sufficient to cause the line to wrap and we need to account for that.
The resulting regex looks like this:
This looks for the literal AMQ at the beginning of the line followed by one or more non-space characters, a space, and a colon, then captures everything from the AMQ to the last non-space character as the error code. The error code in all of the examples I looked at was AMQ and 4 digits but I allowed for characters other than digits and for more or less than 4 of them. This regex does require the error code to start with AMQ and I’m unsure if that holds true for all MQ errors. I suspect that it does not but have yet to see a counter example. For now, this will have to do.
The error text was a bit trickier since it can span a variable number of lines. The way I handled this was to gather everything after the colon and space up to the last non-space character before the EXPLANATION: tag. The field captured will thus span any number of lines but doesn’t pick up stray whitespace at the end.
Note that the field names are assigned here the same way they were for the key-value pairs. The only difference is that we use the position relative to other fields instead of literal strings to identify the field, then assign names of our own choosing.
Line-delimited key-value pairs
The Explanation and Action fields are key-value pairs but with different delimiters than before. As you might expect, we capture them in much the same way as before:
In both cases we look for the literal string of the field name, followed by at least one whitespace character. Then all the text is captured up to the last non-whitespace character before the next field delimiter. The literal ACTION: at the beginning of a line delimits the terminates the Explanation field. A string of at least five dashes at the beginning of a line marks the end of the Action field.
Optional queue manager name
Three of the fields need to be defined as optional. The regex syntax allows for an optional group but care must be taken that when the group is absent, the remainder of the regex does not break. This can be seen in the boundaries of the field as represented in the regex below:
When it is present, the queue manager field is bounded by a space character followed by the literal ‘QMgr’. That space will always be there and when the field is present, at least one more will be present at he end as well. But we cannot count on there being two space characters here when the queue manager field is absent so the terminator uses an asterisk instead of a plus sign. The result reads like “a space followed by the QMgr tag and field, if present, followed by zero or more whitespace characters.” Thus we require at least one whitespace character after the prior field, and it might be a newline.
When the queue manager name is present, it is recognized by the tag ‘QMgr’ with any number of characters prior to the next right parentheses. We don’t need to worry about whether those are valid characters for queue manager names since we know a right parentheses bounds the field and the name can’t contain one of those.
The whole group is bounded by (?: which begins a non-capturing regex group, and )? Which closes the group and uses a question mark to specify zero or one occurrences. Because the whole thing is contained in a \s+…\s* pair, removing the field matches a single whitespace character which is the minimal delimiter that we need to omit the field without breaking the rest of the regex.
Optional C source file and line
The C source code file name and line number that threw the error are optionally embedded into the terminating eye-catcher line at the end of the log entry. When these fields are present there are always five dashes before the first field, and these start at the beginning of the line. There are also always some number of dashes at the end of the line, varying as needed to left-justify the line.
If we wanted to match the eye-catcher without the optional fields, the following regex isn’t exactly pretty but would work:
This says “From the beginning of a line, match four literal dashes, then one or more dashes, then one or more dashes up to the end of the line.” We know there are at least 5 dashes in all lines, regardless of whether the field is present or not. The four literal and one or more variable dashes matches that. The next iteration of one or more dashes matches whatever remains in the case that the fields are present and also in the case that hey are not. All that is needed now is to add the fields in an optional group:
^----+(?: (?P<ibmmq_err_csource>\w+\.c) : (?P<ibmmq_err_lineno>\d+) )?-+$
Note that the spaces which delimit the fields must be contained within the optional group or it won’t work. The source file name us located by looking for a string of one or more alpha-numeric characters ending with the .c suffix. The line number is at least pretty simple. It looks for one or more consecutive digit characters bounded by space characters.
There are a few assumptions embedded here that may break, such as that the source code file will always be written in c and that any insert in this line will look like a file name and line number. If these assumptions do not always hold true then these fields won’t always parse. The good news is that when I intentionally cause the regex to break on these fields all the prior fields still parsed accurately. I suspect this is because these are the last fields in the regex and the line-breaker function resets at the next event boundary. At some point I plan to test how forgiving Splunk is if I break the regex somewhere further upstream but for now this seems to work.
The transform specification
You may have noticed the field names I used all began with ibmmq-err-. My intention is to create a field namespace based on IBMMQ. This accommodates additional future formats such as exit log output or FDC files. In keeping with this naming convention, the transform is called ibmmq-err to match the prefix of all the fields it defines. The Splunk docs for transforms.conf suggest a comment describing the transform stanza and the fields it parses. My stanza with the complete regex is below:
[ibmmq-err] # Extracts fields from IBM MQ error logs # Extracts: ibmmq_err_process, ibmmq_err_user, ibmmq_err_program, ibmmq_err_host, ibmmq_err_installation, ibmmq_err_vrmf, # ibmmq_err_qmgr, ibmmq_err_code, ibmmq_err_text, ibmmq_err_explanation, ibmmq_err_action, ibmmq_err_csource, ibmmq_err_lineno REGEX = ^[^-]+- Process\((?P<ibmmq_err_process>[^\)]+)\)\s+User\((?P<ibmmq_err_user>[^\)]+)\)\s+Program\((?P<ibmmq_err_program>[^\)]+)[^\)\n]*\)\s+Host\((?P<ibmmq_err_host>[^\)]+)\)\s+Installation\((?P<ibmmq_err_installation>[^\)]+)\)\s+VRMF\((?P<ibmmq_err_vrmf>[^\)]+)\)\s+(?:QMgr\((?P<ibmmq_err_qmgr>[^\)]+)\))?\s*^(?P<ibmmq_err_code>AMQ\d+):\s+(?P<ibmmq_err_text>.*?)\s+^EXPLANATION:\s+(?P<ibmmq_err_explanation>.*?)\s*^ACTION:\s+(?P<ibmmq_err_action>.*)\s*^----+(?: (?P<ibmmq_err_csource>amq\w+\.c) : (?P<ibmmq_err_lineno>\d+) )?-+$
The regex in the stanza is one really long line that wraps.
The properties specification
The transform is useless without something that references it. A stanza in props.conf performs that purpose by defining a new source type called ibmmq-error-log that uses the new transform. The stanza I have been using is provided below:
[ibmmq_error_log] pulldown_type = true REPORT-access = ibmmq-err SHOULD_LINEMERGE = True TIME_PREFIX = ^ category = IBMMQ LINE_BREAKER = ----------([\r\n])+ description = Parse IBM MQ error logs
In addition to parsing error logs, I created another transform to parse the output of API exit logs. That transform is specific to a 3rd party vendor program so I did not include it here, but I mention it to explain the use of the IBMMQ category in the props.conf stanza. When selecting source types and in certain displays, all my MQ-related configurations now show up under the IBMMQ category. If you have only the error log transform this adds an extra layer to drill down through and you may wish to omit the category.
Most of this is copied from stanzas I found in the default props.conf file. Although this works, it is the part I’m most unsure of since I’m so new to Splunk. The more I read and use the tool the more nuances I discover which lead me to believe there’s lots of room for improvement here. For example, I read some docs that suggest SHOULD_LINEMERGE can be set to false since a line breaker is specified. I have yet to try this but I plan to set up a virtual machine with checkpoints so I can do A/B testing to figure out how all these stanza elements work.
Configuring data inputs
Once the stanzas are entered into the local props.conf and transforms.conf files and Splunk restarted to pick them up, all that is left is to set up the data inputs. Splunk is good enough to allow wildcards in path names to I used /var/mqm/errors/AMQERR01.LOG and /var/mqm/qmgrs/*/errors/AMQERR01.log and set them to the new source type of ibmmq-error-logs. The first path grabs the global error log files and the second one gets the error logs of any queue managers defined today or in the future. I like configurations that are as future-proof as possible.
These are the default paths and the queue manager’s logs or the global log might be located elsewhere. A possible future enhancement would be to use MQ’s own tooling to tell Splunk where to find the logs for a given queue manager. The dspmqfls command will parse the mqs.ini and the qm.ini and tell you exactly where a queue manager lives.
You might have noticed that I specified only the AMQERR01.LOG file in each case. While it is true that there are three files, all of the info in the 02 and 03 files was run through the 01 file first. What I do not want is to get two or three entries for each log event by having Splunk parse all three logs. By using only the AMQERR01.LOG file there is a slight window in which MQ can write to the log while Splunk is not looking then rotate the logs and a few log entries are lost. That window can be reduced by bumping the size of the error log up from the default so that the files rotate less frequently. This was the option my client chose and why my data inputs capture only the 01 file. At some point I’ll test using all three files and force some log rollover events to see how they are indexed but that’s a future enhancement.
Splunk MQ error source type in action
This was tested with MQ v8.0 and Splunk v7.1. There are many variations in the MQ error logs over the years but the basic format is pretty stable. Supporting minor variations in the key-value pairs should be doable by making some fields optional or adding new optional fields. I will update on GitHub with any changes as I am able to test with more versions of MQ and/or Splunk. Feel free to send a pull request, although you may need to ping me by phone or email if I miss the Splunk notification. I’m not on there every day.
There are undoubtedly many ways to do this within the Splunk framework and I’m not suggesting this is the best way, or even that this is the “right” way according to Splunk best practices. This is merely what worked for me and I’m posting in the hope that others find it useful. Any improvements or suggestions are welcome.
You can find the source code for this on GitHub at /tdotrob/IBMMQ-Logs-Splunk.