Saturday, September 25, 2010

Diagnosing Queuing Problems – Part II


Over the last two days we have been seriously polishing our "magnifying glasses" in an effort to improve our mail delivery efficiency. On Thursday, we covered the basics of the queue manipulation commands. Yesterday, we covered how to use those commands to address the most common mail queuing issues.
In this article we will cover the final four common queuing issues:
  • Messages rejected by destination
  • Cannot route one or more recipients of a message
  • Messages stuck in the submission queue
  • Messages show in the poison queue

Messages rejected by a remote machine

The last delivery error is either a per-message error received during a command like "mail from" or an aggregate of the recipient errors. The admin may see a recipient-specific error when looking at the message but would not know which recipient caused the problem. To get more detailed information on a particular message, run:
$m=get-message <MessageIdentity> -IncludeRecipientInfo
$m.Recipients | fl

This will also retrieve the recipients and display them in a detailed view (the message object itself only prints the email addresses). Each recipient has its own Status and LastError fields, which can help identify which recipient caused the error so that action can be taken (maybe the recipient wasn't found in AD, or the mailbox is full, etc.). In addition, the RetryCount property displays the number of times delivery has been attempted for that message.
When messages are rejected, NDRs are usually generated. If NDR messages start queuing up, there may be two problems: delivering to the recipients and delivering the NDR to the sender. NDRs are easy to identify in the queue viewer: their FromAddress field is "<>" and the subject usually starts with "Undeliverable:". In this case it is useful to look at the NDRs themselves, which can be done by exporting the NDR messages and reading their bodies. A message must be suspended before it can be exported:
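To narrow the output down to the recipients that actually failed, the recipient list can be filtered on LastError. This is a minimal sketch, assuming the recipient objects expose an Address property alongside the Status and LastError fields mentioned above; adjust the property names to whatever the detailed view shows on your server.

```powershell
# <MessageIdentity> is a placeholder for the identity of the message being investigated
$m = get-message <MessageIdentity> -IncludeRecipientInfo

# Keep only recipients with a LastError recorded (an empty or missing value evaluates to false)
$m.Recipients | where { $_.LastError } | ft Address, Status, LastError -AutoSize -Wrap
```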
suspend-message <message identity>
export-message <message identity> -Path: <path to directory or file>
resume-message <message identity>
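As a starting point, queued NDRs can be listed by filtering on the empty sender address described above. This is a sketch, assuming FromAddress is accepted as a filter property by get-message on your version; the selected columns are just one reasonable choice.

```powershell
# List queued NDRs: their sender address is the null reverse-path "<>"
get-message -Filter {FromAddress -eq '<>'} | ft Identity, Subject, Queue, LastError -AutoSize -Wrap
```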

Cannot route one or more recipients of a message

When some recipients cannot be routed, the message with the subset of un-routable recipients will end up in the unreachable queue. The main question an admin must answer is "why did this message end up in the unreachable queue?" As in the previous cases, the LastError field will help diagnose the problem: all messages in the unreachable queue have the LastError field populated. The value of this field is a concatenation of all the errors encountered when routing all recipients. Typical errors are "A matching connector cannot be found to route the external recipient" or "The mailbox recipient does not have a MDB".
Of course, this doesn't help much until we know which recipient caused which error. To find out, run the same sequence of commands as above, adding the IncludeRecipientInfo parameter to the get-message task to dump the recipients. All the recipients of the messages in this queue should have a last error string that describes the reason why they couldn't be routed.
In most cases the actions needed to fix these issues follow from the error description. For example, if the error on the message was "A matching connector cannot be found to route the external recipient" and the recipient is known to be valid, then it is likely that the send connectors are misconfigured (for example, a missing address space or a missing connector). After the connectors are fixed, the unreachable queue is resubmitted automatically; those messages are then drained and routed to the appropriate delivery queues.
Automatic resubmission is not always wanted. For example, when several connectors need to be changed or added, resubmitting after each configuration change may send many messages right back to the same queue. To prevent this, suspend the queue first, make the changes, resume the queue, and then resubmit manually:
suspend-queue Unreachable
 … fix whatever is preventing the connector from connecting
resume-queue Unreachable
retry-queue Unreachable -Resubmit:$true
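While the queue is suspended, it can help to review which address spaces the send connectors currently cover. A minimal sketch using Get-SendConnector; the output columns chosen here are an assumption about what is most useful, not a required set.

```powershell
# Show each send connector with the address spaces it covers and whether it is enabled
Get-SendConnector | ft Name, AddressSpaces, Enabled -AutoSize -Wrap
```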

Messages stuck in the submission queue

The administrator will be alerted by MOM when the size of the Submission queue grows over the accepted limit. Sometimes this simply means that there is a spike in incoming mail and the queue is draining more slowly than usual. At other times, no messages are going through the categorizer and into the delivery queues, which usually indicates that something is wrong inside the categorizer component.
As usual, before attempting to investigate further, make sure the submission queue is not suspended (the status must be Ready).
get-queue Submission
One case where this happens is when messages reach the categorizer but are deferred and put back in the submission queue because of AD errors: the recipients cannot be resolved (this will only happen in the Hub role, as the Edge role doesn't have a resolver). The first step is to look at the messages in Retry:
get-message -Filter:{Queue -eq 'Submission' -and Status -eq 'Retry'}
Note: there is no way to know exactly how long a message will be deferred, but generally, 400-level errors in remote delivery defer a message for the time span configured in the transport server property MessageRetryInterval (default is 1 minute). Messages in the submission queue are usually deferred for 30 minutes (non-configurable) when errors such as AD connectivity failures occur. In addition, a categorizer agent could defer a message for any duration. Unlike for queues, there is no way for the admin to change or reset the retry time for messages.
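The retry interval mentioned in the note can be checked on each transport server. A sketch, assuming the Get-TransportServer cmdlet is available in your version of the shell:

```powershell
# Show the configured retry interval for messages deferred after transient delivery errors
Get-TransportServer | ft Name, MessageRetryInterval -AutoSize
```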
If AD is unavailable, the last error on those messages will say something like "AD transient failure during resolve." The AD connectivity problem must then be investigated.
All message retries that are triggered by the categorizer have a reason associated with them. The LastError messages indicate, for example, that the message has been deferred by an agent ("Message deferred by categorizer agent.") or that a failure happened during content conversion ("A storage transient failure has occurred during content conversion."). These errors do not always give an exact indication of what the problem is, but they make a good starting point. For example, the last error field won't indicate which agent deferred the message or why, but if get-message returns many messages in Retry with the same "deferred by agent" last error, it likely means that one of the categorizer agents has encountered a problem. The next step may be to identify the agent by disabling all categorizer agents or rules and then enabling them one by one. Ultimately, debugging the agent or analyzing its tracing log, if available, may be needed.
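When many messages are in Retry, grouping them by last error makes the dominant failure easier to spot. The sketch below reuses the Retry filter shown earlier and then lists the installed transport agents as candidates to disable; it only reads state and changes nothing.

```powershell
# Count messages in Retry per distinct LastError string, most frequent first
get-message -Filter {Queue -eq 'Submission' -and Status -eq 'Retry'} |
    group LastError | sort Count -Descending | ft Count, Name -AutoSize -Wrap

# List transport agents with their enabled state and priority
Get-TransportAgent | ft Identity, Enabled, Priority -AutoSize
```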
Another case that we have encountered is when a categorizer agent is deadlocked and cannot finish processing a message. By default, the categorizer can only process 20 messages at once (in various stages of categorization). If all 20 of those jobs are stuck, no more messages will be picked up from the submission queue for processing, and as a result the submission queue will grow continuously until someone intervenes. To figure out which messages are currently being processed by the categorizer, run:
get-message -Filter:{Queue -eq 'Submission' -and Status -eq 'Active'} | ft Identity
Active messages are those currently in various stages of categorization (routing, resolving, content conversion). This query should return at most 20 results. Run the query a few times in a row. If the same messages are returned each time (that is, the identities are the same), then it is very likely that a categorizer agent is stuck. The quick fix to restore mail flow is to disable the agents. The in-depth fix is to attach a debugger and identify the call stack on all the stuck threads, which will point to the "stuck" code.
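The repeated check described above can be sketched as two samples compared by identity. The 60-second pause is an arbitrary assumption; pick an interval long enough that healthy messages would normally have left the categorizer.

```powershell
# Take two samples of the Active messages in the submission queue, one minute apart
$first = get-message -Filter {Queue -eq 'Submission' -and Status -eq 'Active'} | foreach { $_.Identity.ToString() }
Start-Sleep -Seconds 60
$second = get-message -Filter {Queue -eq 'Submission' -and Status -eq 'Active'} | foreach { $_.Identity.ToString() }

# Identities present in both samples are candidates for being stuck in the categorizer
$first | where { $second -contains $_ }
```

If the same identities keep coming back across several runs, the agents listed by Get-TransportAgent can be disabled with Disable-TransportAgent to restore mail flow.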

Messages show in the poison queue

The poison queue does not show up when get-queue is run unless there are messages in it. Hopefully, you have never even heard of the poison queue, let alone needed to check it. The poison queue is a special mail queue for holding severely impaired messages: the server has crashed at least twice while processing them. This can have many causes: bugs in our code that does not know how to handle certain kinds of input, bugs in agents, misconfigurations, etc. In general, if poison messages exist, they have uncovered a bug somewhere. It is important to realize that the messages in the poison queue are usually not invalid or malicious. They would only become malicious if attackers discovered that they could crash or exploit Exchange servers in this way.
If you discover poison messages after a crash, you can get more information using the following commands:
First, take a look at the poison queue:
get-queue Poison
or
get-queue <ServerName>\Poison
This will tell how many poison messages exist. Then the messages must be looked at individually and decisions must be made on a case-by-case basis.
get-message -Queue:Poison
All those messages are considered suspended. The admin now has a couple of options:
  1. Resume the messages one by one and figure out whether they still make the transport service crash:
     resume-message <poison message identity>
  2. Export the poison messages to files and have the content looked at by Microsoft developers (the following command exports all poison messages to the temp directory):
     get-message -Queue:Poison | export-message -Path: "C:\temp"
  3. Delete the messages if they are indeed poison (and choose whether or not to send an NDR):
     remove-message <poison message identity> -withNDR:$false

If the problem is suspected to be happening because of some agent, disable the agent and resume the poison messages.
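A cautious way to follow this advice is to disable the suspected agent and resume a single poison message before the rest. This is a sketch with placeholders in angle brackets; whether the transport service has to be restarted for the agent change to take effect is an assumption to verify against the cmdlet's output on your version.

```powershell
# Inspect the poison messages first
get-message -Queue:Poison | fl Identity, FromAddress, Subject, Size, DateReceived

# Disable the suspected agent (placeholder name), then resume one message and watch the transport service
Disable-TransportAgent -Identity "<agent name>"
resume-message <poison message identity>
```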

Messages in the Poison queue never expire; they have to be either resumed or deleted by an admin.
