Friday, September 24, 2010

Diagnosing Mail Queuing Problems – Part I


With yesterday's posting covering the basics of the queue viewer commands (Playing Traffic Cop – Commands for Working with the Exchange Mail Queues), we now have a new set of tools to help us fulfill our role as the "Sherlock Holmes of NDRs". With that said, most mail flow interruptions can be classified into one of the following categories:
  • Cannot connect to destination machine to deliver mail
  • Messages stuck in a delivery queue, but destination is working properly
  • Messages rejected by destination
  • Cannot route one or more recipients of a message
  • Messages stuck in the submission queue
  • Messages show in the poison queue
When faced with one of these situations, one needs to answer the following two questions:
  1. What caused this situation?
  2. How can I fix it so that mail flow is restored again?
We will cover the first two common scenarios today and the remaining four in tomorrow's article.

Cannot connect to destination machine to deliver mail

This is far and away the most common problem. Some of the common causes may be:
  • Destination machine is down or there are network connectivity problems
  • Destination is being overwhelmed with traffic
  • No MX records can be retrieved for the destination
  • Destination rejects connections because of certain limitations
One may be alerted to a possible mail delivery problem in a number of ways: users complaining of slow delivery times, a MOM/SCOM alert saying that the total number of messages in the transport queues on some server has exceeded the established threshold, etc. Since the information in these initial warnings is usually rather vague (for example, the counter used by MOM/SCOM only provides an aggregate message count for all delivery queues), the first step in the discovery process is to identify which queue is accumulating messages without delivering them.

To get this information, run the command:
get-queue -SortOrder:-MessageCount
(Alternatively, one can launch the Queue Viewer from the Exchange Management Console and sort by Message Count.)

 
Note: all commands that do not pass a full queue identity (prefixed with the server name) assume that the administrator is logged on to the transport server. To run these tasks remotely, add the "-Server:<server name>" parameter.

 
This will return all the queues on the server in descending order of message count, so the largest queue comes first in the results (change -MessageCount to +MessageCount to sort ascending by the same field).

 
The first queue returned will likely be the queue that creates the problem (let's assume it's a delivery queue, that is, its DeliveryType member is not Submission or Unreachable). To figure out if this is the case let's get all the data about the queue with the largest number of messages. The "-Results:<desired result count>" argument can be passed to the task to limit the number of queues returned.
get-queue -SortOrder:-MessageCount -Results:1 | fl

 
The Status and LastError queue fields will help diagnose the problem. Usually, if the queue cannot establish a connection with the destination, its Status will be Retry and the LastError will be an error message in the form of an SMTP response. Healthy, active queues should be delivering messages, so the message count should be dropping with time or at least fluctuating up and down. (In the event it is increasing rapidly, check to make sure that you do not have a runaway mail merge, spambot, or other mass-mailing process gone awry using the get-message command as shown later in the article.)
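
The "dropping or at least fluctuating" test above can be made a little more concrete. The following is only a rough sketch in Python, not part of Exchange: it assumes you have written down the MessageCount of one queue from several get-queue runs taken at regular intervals, and the function name and labels are made up for illustration.

```python
def classify_queue_trend(counts):
    """Classify MessageCount samples (oldest first) for a single queue.

    The labels are hypothetical and simply mirror the article's wording:
    "draining" and "fluctuating" look healthy, "growing" deserves a closer
    look, and "flat" with a non-zero count suggests nothing is moving.
    """
    if len(counts) < 2:
        raise ValueError("need at least two samples to see a trend")
    # Difference between each sample and the one before it.
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    if all(d > 0 for d in deltas):
        return "growing"
    if all(d < 0 for d in deltas):
        return "draining"
    if all(d == 0 for d in deltas):
        return "flat"
    return "fluctuating"


if __name__ == "__main__":
    print(classify_queue_trend([120, 180, 260, 410]))
    print(classify_queue_trend([120, 95, 130, 110]))
```

A handful of samples a minute or so apart is usually enough to tell a backlog that is draining from one that only grows.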

Assuming one does not have a runaway mailer, make sure the queue status is not Suspended. It may have been suspended by somebody else and accidentally left in this state. (If you work in an organization with more than one mail administrator, first check that the queue was not suspended intentionally.) A suspended queue will not even attempt to establish connections for delivery. If the queue is suspended and shouldn't be, resume it by running the following command, which will automatically attempt to connect to the destination: resume-queue <QueueIdentity>

From the Status and LastError fields, you may be able to determine whether the remote machine actively rejects connections (e.g. it has reached some maximum number of incoming connections), whether a connection cannot be made because the remote IP address/port combination is not listening, or whether the store driver cannot connect to a mailbox server. DNS lookup failures are also reflected in this field. What can be done to fix these problems? The LastError queue field should be descriptive enough to make it obvious whether the problem lies with the server being diagnosed, with the remote server, or somewhere in between (for example, MX records). The actions differ according to whether the destination belongs to the same org (AD site, routing group) or not (some remote domain on the internet).

If the destination is internal, besides the obvious checks that it is up and working, it is also useful to check the routing information. If some send connectors are left enabled pointing to a smarthost that doesn't exist anymore, or to an AD site which doesn't have any Hub servers, messages may be routed to a queue that acts like a dead end. It is important to detect this scenario early so that the messages don't sit forgotten in that queue until they expire. Note: In most cases, the NextHopDomain property of a queue is the name of either another SMTP server or of a mailbox machine. In these situations the destination can be promptly identified and actions can be taken. But in the case of delivery to another AD site, the next hop domain is the name of the AD site, and the actual machine our server is trying to connect to will be dynamically chosen by DNS.
The next step would be to bring the connectivity log and the protocol log or message tracking log into the picture and identify the IP of the bridgehead that responded with an error. If the destination is rejecting connections because of certain limitations, a wealth of information is available in the protocol log, which has the advantage of showing the history of our attempts to connect and deliver, whereas the queue viewer only exposes the current situation and the last error. This investigation may also uncover other problems. If the destination machine that rejects messages is under your control, it may be useful to inspect the event log on the destination, which can give additional information about why the messages or connections were rejected (connection rate or message rate exceeded, etc.).

Messages stuck in a delivery queue but destination seems OK

Sometimes, after trying to diagnose the state of a queue, the admin realizes that the destination is OK, i.e., it can be contacted via TCP/IP on port 25 (see Whose Fault is it Anyway), or, if the queue is a local delivery queue, the mailbox store is up and running. The queue status may be Ready or even Active, but the message count won't go down. In this situation we want to look at the messages in more depth. It may be that the queue simply has a large backlog and the system is just slow. In this case you have to determine whether or not some messages are being delivered.
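
For the "can it be contacted on port 25" check, here is a minimal sketch in Python that does roughly what a manual telnet test does. It is an illustration under assumptions, not an Exchange tool: the function name is made up, and mail.example.com is a placeholder host.

```python
import socket


def can_connect(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something is listening on the port. It does not
    prove that the SMTP service behind it will accept mail.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # "mail.example.com" is a placeholder; use the queue's next hop instead.
    print(can_connect("mail.example.com", 25))
```

A False result points back to the first scenario (connectivity or DNS). A True result means the queue and its messages deserve the closer look described next.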
The first thing to do is call:
get-queue <QueueIdentity>

As in the previous situation, first make sure the queue status is not Suspended. Next, check whether the queue status is Retry. After the "glitch retry" period (4 attempts at 15-second intervals), the queue will not attempt a connection for the next hour (local delivery queues, however, have a flat 5-minute retry time). The time of the next connection attempt is given by the NextRetryTime queue field. If the destination is known to have had a problem that has since been fixed, run:
retry-queue <QueueIdentity>
This will attempt to establish a connection immediately.
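
To make the retry timing above easier to reason about, here is a small Python sketch of the schedule as described in this article. The numbers are taken from the text (4 glitch retries at 15 seconds, then an hour, with a flat 5 minutes for local delivery). The function name and the way failures are counted are assumptions made for illustration, and the real intervals are configurable on the server.

```python
from datetime import timedelta

# Values as quoted in the article; actual server settings may differ.
GLITCH_RETRY_COUNT = 4
GLITCH_RETRY_INTERVAL = timedelta(seconds=15)
REGULAR_RETRY_INTERVAL = timedelta(hours=1)
LOCAL_DELIVERY_RETRY_INTERVAL = timedelta(minutes=5)


def retry_delay(failed_attempts, local_delivery=False):
    """Delay before the next connection attempt.

    failed_attempts counts consecutive failures so far, including the
    first one (an assumption of this sketch). Failures 1 to 4 are each
    followed by a quick "glitch" retry; after that the queue waits the
    regular interval.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if local_delivery:
        return LOCAL_DELIVERY_RETRY_INTERVAL
    if failed_attempts <= GLITCH_RETRY_COUNT:
        return GLITCH_RETRY_INTERVAL
    return REGULAR_RETRY_INTERVAL


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, retry_delay(n))
```

This is why retry-queue matters: once the glitch retries are used up, a fixed destination could otherwise sit idle for most of an hour.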

If the queue is Active and the message count fluctuates up and down, then the queue is working as expected: messages are being delivered, but they can also be rejected; in both cases they disappear from the queue. If the queue message count seems to be increasing constantly, it's time to take a look at the messages:
get-message -Queue <QueueIdentity>
This will return all messages from the queue identified by <QueueIdentity>.

If there are too many messages this can be slow and spew too much information, so limiting the number of results may help. Here's how to get the first 10 results:

get-message -Queue <QueueIdentity> -Results:10

To diagnose the problem further we must check the status of the messages. However, we must keep in mind that the order in which get-message returns the messages is not related to the order in which they are delivered. If the status of some messages is Active, the queue is doing its job and delivering messages. To get the active messages, run the following command:
get-message -Filter:{Queue -eq '<QueueIdentity>' -and Status -eq 'Active'}
(Run this a few times, since messages in delivery may quickly disappear, so you may well get no messages back on some runs.)

If many or all messages are Suspended, then, as in the case of the queues, no delivery will be attempted. Messages may have been suspended by somebody else and forgotten in this state.

If the status of many messages is Retry (you can obtain them using a query filter like the one for Active messages), check their LastError. This should explain why they were put in Retry. Retry messages are usually accompanied by an SMTP error, which is typically enough to diagnose the problem.
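
When there are many Retry messages, it can help to sort their LastError strings by SMTP reply class: 4xx replies are transient (a retry may succeed), while 5xx replies are permanent. The sketch below is a hypothetical Python helper, not an Exchange cmdlet. It assumes the LastError text contains a standard three-digit reply code, optionally followed by an enhanced status code, and the sample strings are illustrative only.

```python
import re

# Three-digit SMTP reply code, optionally followed by an enhanced status
# code such as 4.4.0.
_REPLY = re.compile(r"\b([245]\d{2})\b(?:[ -](\d\.\d{1,3}\.\d{1,3}))?")

_KINDS = {2: "success", 4: "transient", 5: "permanent"}


def classify_last_error(last_error):
    """Return (code, enhanced_code, kind) or None if no reply code is found."""
    match = _REPLY.search(last_error)
    if match is None:
        return None
    code = int(match.group(1))
    return code, match.group(2), _KINDS[code // 100]


if __name__ == "__main__":
    # Illustrative strings, not captured from a real server.
    print(classify_last_error("451 4.4.0 DNS query failed"))
    print(classify_last_error("550 5.7.1 Unable to relay"))
```

Grouping errors this way shows quickly whether the queue is waiting out a temporary condition or hitting a rejection that will not clear on its own.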

Well, I have rambled on for long enough for one day; more to come tomorrow.
