Today marks the one-week anniversary of the last BlackBerry Outage. I say last in hopes of it having a double meaning – last, as in the most recent one to occur, and last because I hope it never happens again. As a BIS user and full-fledged CrackBerry Addict, I dealt with the outage the same way many of you likely did – pulled my battery in hopes things would miraculously start working when I put it back in (nope!), checked the forums for updates on what the heck was going on, played some BrickBreaker and prayed to the BlackBerry Gods! But when push comes to shove, I was just one consumer/end-user who wasn’t overly put out by what went down – I just sat back in my living room and coped.
But had I been in my previous job where I was surrounded by literally thousands of BlackBerrys with users who really depended on them, I can just imagine how panicked that work environment could become if the BlackBerrys quit working and nobody knew what was going on.
I sent out a few emails after the outage to people I know in the BES admin world, asking what they did/how they coped with last Monday’s drama. While a lot has been written about the outage, not a lot of ink has been spilled from the perspective of the enterprise BlackBerry platform administrator – the men and women who have to scramble to figure out what they were seeing, what to tell end users, and what they should be doing next. I wanted to get some details.
Last year when I went to the Wireless Enterprise Symposium I heard a lot about BoxTone, a third-party monitoring and management app. At the time it seemed a bit Greek to me, but after reading this reply from a BES admin associate I contacted, it now strikes me just how important and effective an enterprise solution BoxTone is. His response is a good read and explains exactly what went on during the outage AND how he stayed on top of it. Read it after the jump!
[ email published with permission, but sender had to remain nameless due to their company rules ]
Hi Kevin, thanks for the email. Hope it's not too cold in the WinterPeg.
Monday's event made for an interesting day for sure, but it was manageable. As you may recall, we have a very large environment and I have been managing the BES and mail systems for years. Seems each SRP outage is different. For example, compared to the outage in April 2007 when the SRP connection was down for hours, the biggest difference I saw this time was that the SRP for all my BES dropped at 3:22pm, but then recovered pretty quickly, in less than 30 mins. I have spoken to a few people who checked their SRP at 3:45pm and thought everything was fine. But I was able to see in real time that this was not the case.
We got the notification from RIM and from the carriers much later than the alert from BoxTone. The early warning was very helpful and kept us from having to hunt around in the dark and chase RIM and the carriers. In fact, the SRP clearing alert from BoxTone came through before the carrier notification!
When the SRP connection went down, my first indicator was an email alert I got from BT that specifically identified that the SRP connection had gone down. Having an alert of pending messages is helpful, but the specific SRP connection down alert is always better.
I immediately checked my BoxTone console and confirmed that, indeed, SRP had been lost at 3:22pm EST across all my BES in North America. But our BES in Europe and Asia were not affected. I also got alerts for very high pending message counts for my NA users.
In the BT console I could see that >96% of users on my BES had pending messages and that no email was being sent or received by any of my users in NA. I quickly alerted my front line help desk of the outage (although they didn’t need it since they have BoxTone as well). I also alerted key executive assistants and my CIO.
Since SRP connection failures are rare, I hoped it would clear quickly. Sure enough, about 15 minutes later I got a clearing alert from BoxTone that the SRP connection had been restored. But that was only half the story. One lesson learned from earlier outages is that a major side effect of an outage is the recovery cycle itself for all my users.
When the SRP drops and comes back online, e-mails should start flowing again. But the reality is that each user recovers at their own pace. Some get all their emails immediately, others are delayed for hours both sending and receiving, and others can send but not receive for hours. It really is different user by user.
So after the SRP reconnect, I could still see in the BT consoles that there were very high pending message counts across all my users and overall very slow message delivery times. So even though I had the all clear on the SRP, I could clearly see that users still weren’t getting mail. So I updated my CIO and executives that the mail recovery was going slowly and to be patient. And I didn’t have to update the help desk because they have their own BoxTone console where they can look up each user who calls and see the state of their recovery… less work for me! :-)
With BoxTone measuring each individual’s mail delivery performance, we could look up any user and see who was getting email and who wasn’t. They have this cool hop graph widget thing that shows each mail flow. You can see where flows backed up and where flows cleared and restarted. In fact, there were all kinds of crazy line patterns based on the different recovery patterns. My device was back up in about 40 minutes, some of the VIPs came back quickly and others took over 2 hours.
Monday’s outage still made for a stressful day, but things within our organization were always under control because we always had the information we needed right at our fingertips thanks to the software we have in place. Without it, things would have been way worse. I cringe at the thought of not having BoxTone!
So there you have it. One BES Admin's take on the recent BlackBerry outage. I thought it made for a pretty good read! My takeaway - if I'm ever in an organization where there are a lot of BlackBerrys and people really depend on them, I'll definitely be taking a good look at third-party monitoring and management software like BoxTone.