Enterprise 101: BlackBerry Slowness in an Exchange Environment

If your BlackBerry users are complaining about receiving emails every fifteen to thirty minutes in batches, or not receiving emails for many hours, then you may be experiencing a problem between your BlackBerry Enterprise Server(s) (BES) and your Exchange server(s).

The reasons that this occurs can sometimes be hotly contested between the people that run your Exchange servers and the people that run your BESs. This article should help you diagnose the problem and get your email flowing again in real time.

How Does it Work?

When you add a BlackBerry user to a BES, the user's account is assigned to one of five Message Agents. The Message Agent then registers with your Exchange server (the one that holds the user's mailbox). This registration process is the same one followed when someone starts Microsoft Outlook on their desktop. The registration process asks Exchange to send a notification when a new message arrives in the user's mailbox. It tells Exchange where to send the notification (the IP address of the BES in this case).

From that point onwards, when a new email arrives in the user's mailbox, Exchange sends a notification to the BES. When the Message Agent receives the notification, it uses a MAPI Worker Thread to retrieve the email from the mailbox using a Remote Procedure Call.

Before we continue, let's cover a few of these new terms.

  • MAPI - Messaging Application Programming Interface
    • This is a DLL installed on a computer that allows other applications to become "Exchange email aware". Applications make defined calls to the DLL in order to communicate with Exchange servers.
    • When you install a BES, you also install the Exchange System Manager (ESM). The ESM contains a special, more robust version of MAPI that the BES uses.
      • As a side note, I have been on a support call with Microsoft where one of their senior support people stated that the MAPI that is installed by the ESM was never designed for how the BES uses it. They stated that the BES "abuses" it. Of course this is completely inaccurate because this more robust version of MAPI is designed for multiple clients and multiple sessions. This version of MAPI limits the number of MAPI threads that it can support, so it is impossible for it to be "abused".
  • MAPI Worker Thread
    • This is a part of a program that performs a task. A program can have multiple threads, each doing different things at the same time. In the context of the BES, a MAPI Worker Thread is a thread of the Message Agent that performs a task, such as retrieving a message or scanning a mailbox.
    • MAPI Worker Threads use RPCs to communicate with Exchange
  • RPC - Remote Procedure Call
    • An RPC is a way of programming whereby you send a request to a server. The server then works locally to honor the request (e.g. get a new message from a user's mailbox), and then passes the result back to the process that requested it.
    • This allows a client to request something from a server without having to know how the result is obtained. The client just waits for the result, and the server does all of the work.

To recap, the Message Agent on the BES registers with the Exchange server for event notifications for all of the users it is handling. If a new email arrives in one of those users' mailboxes, Exchange sends the notify message to the BES. For the record, the notify message is a single User Datagram Protocol (UDP) packet. UDP is sometimes called the Unreliable Datagram Protocol because the data is sent without a request for confirmation of delivery.
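To make the "unreliable" part concrete, here is a minimal Python sketch of how a UDP sender behaves. It is not how Exchange or the BES implement the Notify. The function names, the payload, and the port are all made up for illustration. The point is only that the sender gets no confirmation either way.

```python
import socket


def send_notify(payload, host, port):
    """Send a single UDP datagram and return at once.

    sendto() only hands the bytes to the local network stack. No
    acknowledgement comes back, so the sender cannot tell whether the
    datagram ever reached the listener.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


def wait_for_notify(sock, timeout_seconds):
    """Return the next datagram on a bound UDP socket, or None on timeout."""
    sock.settimeout(timeout_seconds)
    try:
        data, _sender = sock.recvfrom(4096)
        return data
    except socket.timeout:
        # Nothing arrived. A lost datagram looks exactly like no datagram.
        return None
```

From the listener's side, a lost packet and a packet that was never sent look identical, which is why a fallback such as a periodic rescan is needed.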

Once the notify is received by the Message Agent, the agent creates a new worker thread and that worker thread sends an RPC to Exchange asking to be sent a copy of that message.

Once the message arrives in the Message Agent, it is forwarded on to the BlackBerry via the RIM Network Operations Center (NOC).

The entire process from notify to receiving the message on your BlackBerry is normally very short, just a few seconds in length in a well performing environment.

What Can Go Wrong?

There are a number of hops in this process and one or more of the hops can cause problems.

Notify Packet
Since Exchange sends the Notify packet as UDP (unreliable), the packet could get lost on the wire. If the Exchange server and BES are on different subnets, or multiple hops apart, this UDP packet may not reach its destination.

The Exchange server may also fail to send the Notify. This can happen because the Notifies go into a queue on the Exchange server. If they sit there too long, the Exchange server flushes the queue, discarding the Notify.

The effect of the BES not receiving the Notify is that it is unaware that new email has arrived for a particular user and therefore does not start the process of retrieving it. The user then does not receive the email.

The BES has a backup mechanism so that it does not rely solely on the Notifies. Since the Notifies are sent unreliably, each Message Agent performs a scan of each mailbox that it supports every fifteen to thirty minutes, depending on how busy it is. If it finds new email during the scan, it retrieves it via RPC and sends it to the user's BlackBerry. You can see this in the Message Agent logs. Each Message Agent logs its activity to a log file that is called _MAGT__.log (e.g. NYBES01_MAGT_01_20090105.log). If you see an entry that includes "Queuing new mail through rescan", that email was picked up by a rescan rather than by a Notify.
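If you want a quick count of how often this happens, a few lines of Python will do. This is a rough sketch: the only thing taken from the real logs is the "Queuing new mail through rescan" phrase. The function name is my own, and you would feed it the lines of whichever Message Agent log you are looking at.

```python
RESCAN_MARKER = "Queuing new mail through rescan"


def count_rescan_entries(log_lines):
    """Count the log lines that contain the rescan marker text."""
    return sum(1 for line in log_lines if RESCAN_MARKER in line)


def count_rescan_entries_in_file(path):
    """Same count, reading the lines from a log file on disk."""
    with open(path, "r", errors="replace") as handle:
        return count_rescan_entries(handle)
```

For example, count_rescan_entries_in_file("NYBES01_MAGT_01_20090105.log") would give the number of rescan pickups recorded in that one log. A count that climbs during the day suggests that Notifies are going missing.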

Exchange Server Too Busy
Any client that requests information from an Exchange server via RPC causes the Exchange server to work on that request. That request may have it opening a database to retrieve your email, or reading something in memory. No matter the task, the Exchange server needs to complete that task and respond back to the requesting application.

You can see how efficiently the Exchange server is performing by looking at a Performance Monitor counter on the Exchange servers called RPC Averaged Latency. The definition of this counter is "The RPC Averaged Latency performance counter records the average time, in milliseconds (ms), that it takes for the last 1024 packets to be processed. The latency represents how long it takes from the time the Store.exe process received the packet to the time it returned the packet". Microsoft goes on further to say "The maximum value of the RPC Averaged Latency performance counter should be less than 50 milliseconds at all times".

So the magic number is 50ms. If RPC Averaged Latency goes above 50ms for a few seconds or more, then the Exchange server is too busy and cannot handle the requests in a timely manner. Microsoft later stated that if most of your Outlook clients are in Cached Mode (Cached Mode is where the Outlook client works offline using a local mail store and only receives updates from Exchange) then the RPC Averaged Latency can go up to 100ms.

This last statement about 100ms is not relevant if you use one or more BESs since the BES cannot operate in Cached Mode. If your RPC Averaged Latency is normally higher than 50ms but below 100ms for periods of time and Microsoft tells you that your Exchange server is healthy, they are wrong. The reason is that the heaviest Exchange user is the BES. For every BlackBerry that a BES supports, the BES adds the equivalent load of an extra 3.6 real-time Outlook users. That means if you had 100 Outlook users, and each one had a BlackBerry, your Exchange server would feel the load of 460 users. If the users were on the road most of the time and not logged into Outlook, the load would still be 360 real-time users.
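The arithmetic above is simple enough to put in a small helper. This is just the 3.6 load factor from this article applied as written. The function name is mine, and the result is a planning estimate, not a measurement.

```python
BES_LOAD_FACTOR = 3.6  # extra real-time Outlook users per BlackBerry, per this article


def effective_outlook_users(outlook_users, blackberry_users,
                            load_factor=BES_LOAD_FACTOR):
    """Estimate the load on Exchange, expressed in equivalent Outlook users."""
    return round(outlook_users + blackberry_users * load_factor)
```

With 100 Outlook users who each carry a BlackBerry, this gives 100 + 100 x 3.6 = 460. With nobody logged into Outlook (0 Outlook users, 100 BlackBerrys), it gives 360.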

If your Exchange server experiences RPC Averaged Latency of higher than 50ms for a few seconds or more, the BES starts to feel the latency and things start breaking down.

It should be noted that RPC Averaged Latency shows that Exchange is taking X number of milliseconds to complete an RPC request. We know that it should never go higher than 50ms. If it does, it is an indication that the Exchange server is underperforming, but it does not tell you why it is underperforming. Most people find that their disk subsystem is taking too long to service requests, but at the end of the day you do need to investigate why there is too much latency.
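If you export the counter samples from Performance Monitor, you can look for sustained breaches of the 50ms mark rather than eyeballing the graph. The sketch below assumes one sample per second and treats three samples in a row as "a few seconds". Both of those are my assumptions, so adjust them to your own collection interval.

```python
def sustained_breaches(samples_ms, threshold_ms=50.0, min_consecutive=3):
    """Find runs of consecutive samples above threshold_ms.

    Returns a list of (start_index, run_length) tuples, one for each run
    that lasts at least min_consecutive samples.
    """
    runs = []
    start = None
    for index, value in enumerate(samples_ms):
        if value > threshold_ms:
            if start is None:
                start = index  # a new run begins here
        else:
            if start is not None and index - start >= min_consecutive:
                runs.append((start, index - start))
            start = None
    # Handle a run that is still open at the end of the data.
    if start is not None and len(samples_ms) - start >= min_consecutive:
        runs.append((start, len(samples_ms) - start))
    return runs
```

A single spike over 50ms is ignored, while three or more samples in a row over the threshold are reported with the position where the run started.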

Effect of Latency on the BES
As mentioned earlier, all communication between BES and Exchange is done via MAPI RPC calls. RPC Averaged Latency measures how long those requests are taking. Remember that the counter itself is already an average, so you cannot use the Average value when viewing the counter. You must look at the real-time measurements. If the latency goes above 50ms then a thread that is requesting an email must wait at least 50ms before getting a response and before it can move on to the next task it needs to perform.

Each Message Agent has multiple threads, each having to wait for a response from Exchange, and each of them queuing up tasks that they need to perform after the current one is complete. Based on the latency, the queue of tasks gets longer and longer.

The effect on the BlackBerry user is that their emails stop arriving in real time. As the latency gets worse, emails take longer and longer to arrive. If the latency is too long, the MAPI thread hangs. This is called a Hung Worker Thread. If a worker thread hangs then that thread cannot complete the task it was assigned. If that task was to retrieve an email for a user, then that user will not receive their email until the Message Agent performs its periodic mailbox rescan.

If worker threads hang then it is an indication that the Exchange server is severely underperforming. A service called the Controller monitors the Message Agents and looks for Hung Worker Threads. The Controller does a health check every 10 minutes and if it observes hung threads it reports them in its log. You will see an entry similar to this:

[30000] (12/01 12:07:12.972):{0x9D4} Performing system health check (BlackBerry Controller Version
[30000] (12/01 12:07:20.409):{0x9FC} 'NY_BES01' agent 1: hung threads detected. WaitCount = 1
[30000] (12/01 12:17:12.968):{0x9D4} Performing system health check (BlackBerry Controller Version
[30000] (12/01 12:17:20.437):{0x9FC} 'NY_BES01' agent 1: hung threads detected. WaitCount = 2
[30000] (12/01 12:27:12.964):{0x9D4} Performing system health check (BlackBerry Controller Version

Upon observing hung threads, the Controller attempts to clear the thread. If it is not successful, that hung thread will be reported as still hung during the next health check as seen above. If after 6 health checks (or 60 minutes), a thread is still hung, the Controller will purposely crash the Message Agent that the thread was a part of in an attempt to clear the thread. Sometimes this attempt goes wrong and a Message Agent stays down. If this occurs, you as the BES Administrator need to manually restart the Controller service to get email flowing again.
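If you have a lot of Controller log to wade through, a small script can pull out just the hung thread reports. The pattern below is written against the sample entries shown above and nothing else, so treat it as a sketch that may need adjusting for your BES version. The "checks_until_restart" value assumes that WaitCount counts consecutive health checks, which is my reading of the log and not something RIM documents here.

```python
import re

# Written against the Controller log entries quoted in this article.
HUNG_PATTERN = re.compile(
    r"\((?P<timestamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\):\{0x[0-9A-Fa-f]+\} "
    r"'(?P<server>[^']+)' agent (?P<agent>\d+): "
    r"hung threads detected\. WaitCount = (?P<wait_count>\d+)"
)

CHECKS_BEFORE_RESTART = 6  # per the article: 6 health checks, about 60 minutes


def find_hung_thread_reports(log_lines):
    """Return one dict per 'hung threads detected' entry found."""
    reports = []
    for line in log_lines:
        match = HUNG_PATTERN.search(line)
        if not match:
            continue
        wait_count = int(match.group("wait_count"))
        reports.append({
            "timestamp": match.group("timestamp"),
            "server": match.group("server"),
            "agent": int(match.group("agent")),
            "wait_count": wait_count,
            # Assumption: WaitCount is the number of checks the thread has been hung.
            "checks_until_restart": max(CHECKS_BEFORE_RESTART - wait_count, 0),
        })
    return reports
```

Run against the five sample lines above, this returns two reports for agent 1 on NY_BES01, with wait counts of 1 and 2.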

If you want to look and see what the thread was doing when it hung, look in the Message Agent log for entries similar to this that occur at the same time of the health check.

[30181] (12/01 12:07:20.347):{0xF70} Performing system health check (BlackBerry Mailbox Agent 1 - BESX Version
[30038] (12/01 12:07:20.347):{0xF70} Worker Thread: *** No Response *** Thread Id=0x2018, Handle=0x12F4, WaitCount=1, WorkingTime=11 min, LastActivity=11 min, Event: RESCAN_CONTACTS, User: [email protected], Server: NY_Exchange1, Activity: Rescan PIM Items

As you can see from the log snippet above, the BES worker threads do not just get email, they perform all actions relating to email, calendar, contacts, tasks, etc. This thread sent a request to scan the contacts for changes via RPC but had not received a response for 11 minutes. At the 10 minute health check, the Controller deemed it a hung thread.
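The "No Response" entry packs a lot into one line, so it can help to break it into named fields. This sketch is based only on the sample line above. It assumes the fields are separated by a comma and a space, and that each field is either Key=Value or Key: Value. Real logs may contain values that break those assumptions.

```python
import re

NO_RESPONSE_MARKER = "*** No Response ***"


def parse_no_response(line):
    """Split a 'No Response' worker thread entry into a dict of fields.

    Returns None if the line is not a 'No Response' entry.
    """
    if NO_RESPONSE_MARKER not in line:
        return None
    detail = line.split(NO_RESPONSE_MARKER, 1)[1].strip()
    fields = {}
    for part in detail.split(", "):
        # Each part looks like 'Key=Value' or 'Key: Value'.
        pieces = re.split(r"=|: ", part, maxsplit=1)
        if len(pieces) == 2:
            fields[pieces[0].strip()] = pieces[1].strip()
    return fields
```

For the sample line, the result includes Event set to RESCAN_CONTACTS, Server set to NY_Exchange1 and WorkingTime set to "11 min". That makes it easy to tally which Exchange servers and which activities show up most often when threads hang.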

What can happen in extreme situations is that all threads can become hung. This happens because when a particular thread hangs and becomes unresponsive, the Message Agent spawns a new thread since it needs to continue communicating with Exchange. If this new thread also becomes hung, the agent spawns yet another thread. Eventually, if all threads hang, the agent becomes unable to process anything. When this happens, all users that are being handled by that agent stop receiving email. In addition, any actions that the agent must perform, like a wireless enterprise activation, will also fail. All other updates to the BlackBerry will stop, with the exception of any MDS-CS communications, like web browsing. This is because MDS-CS is a different service.

If this situation occurs then the only course of action is to restart the Controller Service in order to clear the threads. On some rare occasions this still does not work and the entire server must be restarted.

If you are monitoring RPC Averaged Latency at the time Hung Worker Threads occur, you will see latency much higher than 50ms for sustained periods. This is an indication that Exchange is not able to keep up.

How To Rectify

There are a number of steps you can take to rectify the situation once you have determined what the cause is. Is it Notifies being lost, or is it high latency?

If you suspect that Notifies are being lost, make sure that each BES is on the same subnet as the Exchange server(s) that house the mailboxes being processed by that BES. The maximum network latency between BES and Exchange should be 35ms. Anything over 35ms will tend to cause problems.

If your network design forces your BES and Exchange server(s) apart, say over a Wide Area Network (WAN) connection where latency is higher than 35ms, then the best way to handle the situation is to isolate the users who have a remote Exchange mailbox. You can do this by using static Message Agents. Besides the 5 agents that the BES creates and uses, you as a BES Administrator can create static agents and assign users to them. Try to use the same static agent for each geographic location or network segment. Isolating the users on their own static Message Agent allows any latency to be contained in that agent, so that it does not affect other users.

This trick can also be applied to VIP users.

You can also choose to install a new BES just for the remote users, but that does require an extra BES license. Using static agents works in the same way and is just as effective.

Exchange Too Busy
It is of paramount importance that the BES administrators are fully involved with all Exchange planning and modifications. I am aware of a situation where a large Exchange environment was changed so that 20 Exchange servers were consolidated into 3 Exchange Clusters. These changes were planned and executed without the knowledge of the BES team. After this change, terrible problems were seen on the BESs, including a high percentage of emails obtained by a rescan rather than a Notify, and many hung worker threads.

Because the BES is the most demanding client of Exchange, and because the BES cannot run in Cached Mode like Outlook clients, the extra load that the BES places on Exchange must always be engineered into the Exchange planning. Remember that for each BlackBerry user, Exchange needs to be able to bear the equivalent of an extra 3.6 Outlook users running in real time. If you want to consolidate Exchange servers, you must take into account the steep rise in equivalent users that each server takes on when factoring in the "BES load factor" of 3.6. If an Exchange cluster or Exchange server supports 1000 Outlook users, and 500 of those have BlackBerrys, you need to plan for that Exchange cluster or Exchange server to support 2800 Outlook users (1000 + 500 x 3.6). That planning must have an end result that has RPC Averaged Latency well below 50ms.

If latency is the problem and you know the latency is not being introduced on the network itself (you have to follow the 35ms network latency rule as mentioned above), then you need to investigate why the RPC Averaged Latency is higher than it should be. There are many avenues that you can investigate, including memory utilization, CPU utilization, disk subsystem performance (including Storage Area Networks [SAN] and Network Attached Storage [NAS]), and RPC requests per second. This Microsoft article describes how to troubleshoot Exchange performance issues.

If you still cannot get RPC Averaged Latency under 50ms and your BES is still getting hung threads and emails are being delivered in batches, then you should look at adding Exchange servers to lower each one's load and, hopefully, its latency. Another choice would be to spread out the BlackBerry users as evenly as possible amongst the Exchange servers so that the extra load that the BES places on each Exchange server is reduced.

Tools to Help Monitor and Diagnose

One frustrating thing about diagnosing BESs is that all you have to work with are the BES logs. The BES-specific Performance Counters are all geared towards the BlackBerry device side, and not the BES and its services. Exchange administrators can become wary of you approaching them with concerns when all you have are the BES logs. I know of BES admins who have been severely disrespected by the Exchange support personnel because they appeared not to be proving their case.

I have even heard of the bizarre case where during a BlackBerry slowness crisis, the Exchange team refused to invite the BES Admins to meetings because they decided that the BES Admins were not providing enough data and obviously had no idea what they were talking about. When those BES Admins asked to have permission to review the Exchange PerfMon data, like RPC Averaged Latency, they were told that they would not know what they were looking at. It sounds terrible I know, but it is all true.

That is why it is important for BES Admins to make use of other tools to help them show what is going on. Not only that, the tools can help prevent these situations if action is taken early.

Third Party Tools
RIM now provides a BES Monitoring service that is free. It is useful to a point, but does not trend the important data that we are after. For that you should look to tools like BoxTone. A BoxTone server collects data from all of the BES logs and the BES SQL databases. It analyses this log data in real time and trends it graphically. For example, it figures out the percentage of emails discovered by a rescan as opposed to a Notify. It does this by analyzing the Message Agent logs. The graphical trend is updated every few minutes. This is invaluable in determining how message delivery is affected throughout a day as you see the rescan percentage graph rise and fall every few minutes.

BoxTone has a multitude of other uses. These include special help desk consoles that show message flow for a particular user who calls in, alerts sent by email or SNMP when services go down or when other BES-specific conditions occur, and historical data to help diagnose issues after the fact. For this purpose, though, the graphically trended information is key.

If you cannot afford tools like BoxTone, you can make use of the free BlackBerry Resource Kit (BRK). The BRK is BES version specific, so always make sure you use the BRK for your BES version. The BRK consists of command line tools including one called MessageFlow.exe. This tool analyses the Message Agent, Dispatcher, and Router logs and can be set to export a list of all messages sent to BlackBerrys and how the BES became aware of them (Notify or rescan). Using Excel you can compute the percentage of messages discovered by a rescan instead of a Notify. This does not provide a real-time trend every few minutes like BoxTone, but does provide the overall percentage over the time span of the log files.
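If you would rather script the percentage than work it out in Excel, the calculation itself is tiny. The sketch below takes a plain list of discovery methods, one entry per message. How you get that list out of the exported file depends on the export layout for your BRK version, which I am not assuming anything about here. The "Notify" and "Rescan" labels are placeholders.

```python
def rescan_percentage(discovery_methods):
    """Percentage of messages discovered by a rescan.

    discovery_methods is a list of strings, one per message, such as
    "Notify" or "Rescan" (placeholder labels; match them to your export).
    """
    total = len(discovery_methods)
    if total == 0:
        return 0.0
    rescans = sum(
        1 for method in discovery_methods
        if method.strip().lower() == "rescan"
    )
    return 100.0 * rescans / total
```

One rescan out of four messages comes out as 25.0. Tracking this figure day by day gives you a crude version of the trend that the commercial tools chart for you.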

Besides the BES tools and log files, always keep track of RPC Averaged Latency on all of your Exchange servers or clusters. If this counter starts approaching a high of 50ms, it is time to take action. If you do not act before it reaches 50ms, you will have a BlackBerry slowness problem on your hands. If RPC Averaged Latency increases, investigate why. It could be rogue backups running during the day, a runaway process on a user's desktop (like Google Desktop), or it could just be growth of your Exchange and/or BlackBerry population.

One Last Stop Gap
If you have a situation on your hands where the RPC Averaged Latency cannot be brought under control right away, you can request a special Registry modification from Research In Motion's (RIM) TSupport group that modifies the way that BlackBerry users are distributed between Message Agents. Normally the BES puts as many users from the same Exchange server in one agent as possible to reduce the number of threads that communicate with Exchange. Sometimes in high latency situations, it is better to apply this Registry modification which sets the BES to simply evenly distribute users over the agents. RIM does not recommend running BESs with this Registry modification for long because it makes them much more susceptible to the effects of latency, but it should buy you a couple of months to rectify the source of the problem, which is the latency. At that time you can remove the Registry modification. If you do not address the latency, after a couple of months your rescan percentage will creep back up, causing the BlackBerry slowness problems to return.
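To picture the difference between the two distribution strategies, here is a toy model in Python. It is not the BES algorithm and has nothing to do with the actual Registry setting. The function names, the five-agent default and the "servers per agent" measure are all my own simplifications. It only shows why even distribution means each agent ends up talking to more Exchange servers.

```python
from collections import defaultdict


def group_by_server(users, agent_count=5):
    """Toy 'default' strategy: keep each Exchange server's users on one agent.

    users is a list of (user_name, exchange_server) tuples.
    """
    agent_for_server = {}
    assignment = defaultdict(list)
    for user, server in users:
        if server not in agent_for_server:
            agent_for_server[server] = len(agent_for_server) % agent_count
        assignment[agent_for_server[server]].append(user)
    return dict(assignment)


def distribute_evenly(users, agent_count=5):
    """Toy 'even' strategy: deal users out to the agents in turn."""
    assignment = defaultdict(list)
    for index, (user, _server) in enumerate(users):
        assignment[index % agent_count].append(user)
    return dict(assignment)


def servers_per_agent(users, assignment):
    """Count how many distinct Exchange servers each agent must talk to."""
    server_of = dict(users)
    return {
        agent: len({server_of[user] for user in members})
        for agent, members in assignment.items()
    }
```

With six users split across two Exchange servers and two agents, grouping by server leaves each agent talking to one server, while even distribution has both agents talking to both servers.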

[ Craig Johnston is the author of Professional BlackBerry and is CrackBerry.com's Podcast co-host and resident enterprise guru and all-round BlackBerry expert. If you have an enterprise application or topic that you would like to have addressed by Craig, send him an email at crackberrycraig @ crackberry.com. ]
