BlackBerry Outages, BES Admins and BoxTone!

By Kevin Michaluk on 18 Feb 2008 11:55 am EST

Today marks the one-week anniversary of the last BlackBerry outage. I say last in hopes of it having a double meaning – last, as in the most recent one to occur, and last because I hope it never happens again. As a BIS user and full-fledged CrackBerry Addict, I dealt with the outage the same way many of you likely did – pulled my battery in hopes things would miraculously start working when I put it back in (nope!), checked the forums for updates on what the heck was going on, played some BrickBreaker and prayed to the BlackBerry Gods! But when push came to shove, I was just one consumer/end-user who wasn’t overly put out by what went down – I just sat back in my living room and coped.

But had I been in my previous job, where I was surrounded by literally thousands of BlackBerrys with users who really depended on them, I can just imagine how panicked that work environment could have become if the BlackBerrys quit working and nobody knew what was going on.

I sent out a few emails after the outage to people I know in the BES admin world, asking what they did and how they coped with last Monday’s drama. While a lot has been written about the outage, not a lot of ink has been spilled from the perspective of the enterprise BlackBerry platform administrator – the men and women who had to scramble to figure out what they were seeing, what to tell end users, and what to do next. I wanted to get some details.

Last year when I went to the Wireless Enterprise Symposium I heard a lot about BoxTone, a 3rd party monitoring and management app. At the time it seemed a bit Greek to me, but after reading this reply from a BES admin associate I contacted, it now strikes me just how important, needed and effective an enterprise solution BoxTone is. His response is a good read and explains exactly what went on during the outage AND how he stayed on top of it. Read it after the jump!

[ email published with permission, but sender had to remain nameless due to their company rules ]


Hi Kevin, thanks for the email. Hope it's not too cold in the WinterPeg.

Monday's event made for an interesting day for sure, but it was manageable. As you may recall, we have a very large environment and I have been managing the BES and mail systems for years. Seems each SRP outage is different. For example, compared to the outage in April 2007, when the SRP connection was down for hours, the biggest difference I saw this time was that the SRP for all my BES dropped at 3:22pm, but then recovered pretty quickly – in less than 30 mins. I have spoken to a few people who checked their SRP at 3:45pm and thought everything was doing fine. But I was able to see in real time that this was not the case.

We got the notification from RIM and from the carriers much later than the alert from BoxTone. The early warning was very helpful; it kept us from having to hunt around in the dark and chase RIM and the carriers. In fact, the SRP clearing alert from BoxTone came through before the carrier notification!

When the SRP connection went down, my first indicator was an email alert I got from BT that specifically identified that the SRP connection had gone down. Having an alert of pending messages is helpful, but the specific SRP connection down alert is always better.

I immediately checked my BoxTone console and confirmed that, indeed, SRP had been lost at 3:22pm EST across all my BES in North America. But our BES in Europe and Asia were not affected. I also got alerts for very high pending message counts for my NA users.

In the BT console I could see that >96% of users on my BES had pending messages and that no email was being sent or received by any of my users in NA. I quickly alerted my front line help desk of the outage (although they didn’t need it since they have BoxTone as well). I also alerted key executive assistants and my CIO.

Since SRP connection failures are rare, I hoped it would clear quickly. Sure enough, about 15 minutes later I got a clearing alert from BoxTone that the SRP connection had been restored. But that was only half the story. Lessons learned from earlier outages were that one of the major side effects of the outage is the actual recovery cycle for all my users.

When the SRP drops and comes back online, e-mails should start flowing again. But the reality is that each user recovers at their own pace. Some get all their emails immediately, others can be delayed for hours both receiving and sending emails, and others can send but not receive for hours. It really is different user by user.

So after the SRP reconnect, I could still see in the BT consoles that there were very high pending message counts across all my users and overall very slow message delivery times. So even though I had the all clear on the SRP, I could clearly see that users still weren’t getting mail. So I updated my CIO and executives that the mail recovery was going slowly and to be patient. And I didn’t have to update the help desk because they have their own BoxTone console where they can look up each user who calls and see the state of their recovery…less work for me! :-)

With BoxTone measuring each individual’s mail delivery performance, we could look up any user and see who was getting email and who wasn’t. They have this cool hop graph widget thing that shows each mail flow. You can see where flows backed up and where flows cleared and restarted. In fact, there were all kinds of crazy line patterns based on the different recovery patterns. My device was back up in about 40 minutes; some of the VIPs came back quickly and others took over 2 hours.

Monday’s outage still made for a stressful day, but things within our organization were always under control because we always had the information we needed right at our fingertips thanks to the software we have in place. Without it, things would have been way worse. I cringe at the thought of not having BoxTone! 

So there you have it. One BES Admin's take on the recent BlackBerry outage. I thought it made for a pretty good read! My takeaway: if I'm ever in an organization where there are a lot of BlackBerrys and people really depend on them, I'll definitely be taking a good look at 3rd party monitoring and management software like BoxTone.

Topics: Editorial

Reader comments

4 Comments

I wrote the following article for CIO.com about how the Boston Red Sox IT shop uses Zenprise, another BlackBerry environment monitoring, troubleshooting and problem resolution solution. Zenprise, or another product like Boxtone, surely would've been valuable to organizations during last week's outage. Depending on the size of the organization, such a product could be a necessity.

Check it out.

http://www.cio.com/article/164650/

This link doesn't work for the BlackBerry browser, though. It does work in Opera Mini...

AZA43

First off - way to shamelessly promote yourself / zenprise.

Secondly - Does this article bash or even mention Zenprise? I don't think so. So how is your post even relevant to this thread?

Thirdly - this article talks about configuration issues being a culprit which is also not relevant to an SRP outage.

I don't know much about either product and have never talked to anyone representing either product, but from what I've been reading online, my observations are as follows:

* Zenprise seems more of a config management product.
* Most of the screenshots tend to show configuration issues or really simple stuff.
* The others are all repeats - it's just funny how the screens in each article that alleges to show "zenprise in our environment" are all identical to ones they've posted before or that are on their own corporate website.
* I can't find any record of Zenprise being proven in a large deployment. The companies that they list on their website all have fewer total employees than my company has BlackBerry users.

* Boxtone seems to be more valuable for large deployments while not as valuable for the smaller ones (sub-100 users, like your Red Sox).
* Boxtone has no mention of detecting config problems, but in all fairness large deployments should have BES Admins who know what they're doing anyway.
* Boxtone's interface is rather bland - which I don't like. And that's the worst I can say about what I've read online in regards to their product.

- MG.

wow..really interesting email. i never realized there was so much information and control within the corporate setting over these devices. i guess that's why blackberry is so big in the business world.