Understanding Reliability

BlackBerry outages happen. They suck. But they happen. Those of us who have been longtime BlackBerry users have dealt with some major outages in the past. We're dealing with one now (I feel really bad for those who have been affected for literally days now). And the odds are this won't be the last BlackBerry service outage you endure, assuming of course you stick with BlackBerry. 

Some of you cope (and are currently coping) with the service interruption quite well. Others, like my friend BlackLion15, think enough is enough and that someone has to go over to Waterloo and slap those in charge at RIM. I usually try to take these kinds of things in stride, but on a day like today, when iMessage becomes available in the Apple world, I have to admit the timing of this BlackBerry service outage couldn't be worse.

This outage got me thinking about BlackBerry reliability, which prompted me to re-read an article we posted here on CrackBerry back in 2008 about Understanding Reliability. Reading the article again today (posted below), most of the points ring true... especially that point about wishing RIM would do a better job of communicating what's going on to people. At least RIM is doing a slightly better job this time around, with updates being posted to their website. Though I do think that, in the three-plus years since this article was written, the competition has heated up and people's tolerance for service downtime is less than it once was. If you've never read the article before, take a few minutes now and read it; if you have, it's worth reading again. And let's hope everybody affected by the outage has their service returned to normal soon. It's hard to get work done when I just keep glancing over at my 9900 hoping the red light will start flashing again.

Understanding Reliability

Written by Neil Sainsbury, originally posted on March 3rd, 2008

With all the recent BlackBerry network outages, I thought it might be a good time to sit back and reflect on how reliable the BlackBerry actually is, what a reasonable expectation of reliability should be, and also to address the criticism some people have that RIM isn’t doing enough to notify people of outages.

The first statement to be made when discussing reliability of modern-day technology is an obvious one, but it needs to be said: “Nothing is 100% reliable.” Later I’ll be discussing in more detail why this is the case and also how this is exactly the trade-off that the market has had to make to bring you what you want at a price you can afford. For now, however, it’s important to keep this little tidbit of information always in the back of your mind.

And really, the fact that technology isn’t 100% reliable should be obvious. At a past workplace, I had more downtime because the office air-conditioner was broken than I have because the BlackBerry network was down. I’ve spent countless hours sitting in gridlock on the highway because somebody’s car broke down and I’ve also probably spent days correcting problems caused by some software problem on my PC that shouldn’t have happened. Unreliability of technology is with us and it’s here to stay. Does that mean we should not strive to make technology more reliable? No. Does it mean we should resignedly give up when technology fails us? No. What it does mean is that we have to adjust our expectations, acknowledge that problems will happen and move on. Really, in terms of reliability, the BlackBerry is probably one of the most reliable pieces of technology in my life – no small feat, given the large number of independent components (device, carrier network, RIM NOCs, email servers) that comprise “the BlackBerry package.” I think if you were to do the math and work out uptime percentage, you would probably find the same.
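To give a rough sense of what "doing the math" looks like for a chain of independent components, here is a minimal sketch. It assumes the components fail independently and that the service works only when every link works, so the end-to-end availability is the product of the individual ones. The availability figures are hypothetical, picked only for illustration; they are not measured numbers for any real device, carrier, or NOC.

```python
from math import prod

# Hypothetical availability of each link in the chain (illustration only).
components = {
    "device": 0.999,
    "carrier network": 0.999,
    "RIM NOC": 0.999,
    "email server": 0.999,
}

def chain_availability(availabilities):
    """Availability of a chain that works only when every link works,
    assuming the links fail independently of one another."""
    return prod(availabilities)

overall = chain_availability(components.values())
print(f"End-to-end availability: {overall:.4%}")
```

Under these assumptions the whole package is always a little less available than its weakest link, which is why keeping a multi-part service running is harder than keeping any single part running.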

So why is technology unreliable? Good question! It’s because we’re cheap. In general, markets are spectacularly good at giving people exactly what they want within fiscal constraints. The reason why technology isn’t 100% reliable is because we, as consumers, are not willing to pay the price for that level of reliability. There comes a point when a company is evaluating their plans to ensure reliability of services when things start getting a little bit ‘kooky’ from an outsider’s perspective. “When part A fails,” they say, “we have three engineers stationed within 100 km who can be there in around an hour.” But what if the traffic is jammed? “We’ll have a standby helicopter that can get them there quicker.” Who’s the helicopter pilot? “Bob. He’s ace, but he is often out all night at the casino so we better get a backup pilot too.”

Don’t laugh. I’ve heard discussions like this.

The money involved in making all these provisions for outlandish circumstances is phenomenal but it can mean the difference between 99.9% (three nines) and 99.999% (five nines) reliability. In a competitive marketplace, a company that tries to provide the golden five nines is going to incur significant costs which they will ultimately pass on to the consumer. The company that doesn’t will be able to offer services cheaper for the end-user and in today’s markets tends to be the one that survives while the others go bankrupt.
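To put the gap between three nines and five nines in concrete terms, the sketch below converts an availability percentage into allowed downtime per year. The function name and the 365-day year are assumptions made for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def downtime_hours_per_year(availability):
    """Hours of downtime per year permitted at the given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR

three_nines = downtime_hours_per_year(0.999)    # about 8.76 hours per year
five_nines = downtime_hours_per_year(0.99999)   # about 0.0876 hours per year

print(f"99.9%   -> {three_nines:.2f} hours of downtime per year")
print(f"99.999% -> {five_nines * 60:.2f} minutes of downtime per year")
```

Three nines allows most of a working day of downtime each year; five nines allows a little over five minutes. Closing that gap is where the helicopters and backup pilots come in.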

At the end of the day, the old adage “you get what you pay for” rings true.

If there is one area, however, where I think you could level a valid complaint against RIM, it’s notifying people of outages. While you might not realize it, this is actually a PR problem. The question that RIM must ultimately answer when an outage occurs is “Do we actually want to tell people there was a problem?” It’s a tough question to answer too. Every time there is an outage there would be a fairly significant chunk of people who may not have been affected and would never even know. Do you risk telling them something bad and sending out false alarms? That could have terrible image consequences, especially for a company that sells itself on reliability. It’s a balancing act and thus far RIM has erred on the side of “image preservation,” preferring to release little, if any, information. The idea is that if RIM were to make a lot of noise about an outage, that would draw unwanted attention, yet if they say nothing “everything might just blow over.”

I personally think RIM may now be finding that this approach is not the best and I have a little solution I humbly offer up for consideration. I think a nice solution would be to provide SMS notification to affected subscribers. The majority of these outages seem to be occurring at a software level at RIM’s NOC. So, technically they could build in independent tracking to determine if data requests from devices are being serviced. If a request and response (for emails, web browser data, etc.) does not make a full round trip, in from the carrier network and back out to the carrier network, several consecutive times over a reasonable period of time, an SMS is instead sent to that person notifying them of a problem (simultaneously sounding warning klaxons in “the war room” :-)). By doing this, you keep the individual up to date and also avoid sending out mass notifications which would inevitably reach people who may not even have a problem (bad PR).
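The round-trip check described above could be sketched roughly as follows. This is only an illustration of the idea, not a description of anything RIM has built: the class name, the threshold of three failures, and the subscriber identifiers are all hypothetical, and the "reasonable period of time" window is left out to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass
class RoundTripMonitor:
    """Tracks consecutive failed round trips per subscriber (hypothetical design)."""
    failure_threshold: int = 3  # assumed value for "several consecutive times"
    consecutive_failures: dict = field(default_factory=dict)
    notified: set = field(default_factory=set)

    def record(self, subscriber, completed):
        """Record one request. Return True if an SMS should be sent right now."""
        if completed:
            # A successful round trip clears the failure streak and re-arms alerts.
            self.consecutive_failures[subscriber] = 0
            self.notified.discard(subscriber)
            return False
        count = self.consecutive_failures.get(subscriber, 0) + 1
        self.consecutive_failures[subscriber] = count
        # Notify once per outage, and only the subscriber who is affected.
        if count >= self.failure_threshold and subscriber not in self.notified:
            self.notified.add(subscriber)
            return True
        return False

monitor = RoundTripMonitor(failure_threshold=3)
alerts = [monitor.record("subscriber-a", completed=False) for _ in range(4)]
print(alerts)  # only the third failure triggers an SMS
```

Because the failure count is kept per subscriber, someone whose requests are still completing never gets a message, which is the point of the proposal: the people affected are told, and everyone else hears nothing.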

I’m sure RIM has considered this approach too and may have even ruled it out internally for some reason I’m not aware of, but I hope that they could find a similar-in-concept approach to solve this notification problem rather than just do nothing.

In conclusion, next time the BlackBerry network goes down - and believe me, it will happen again - relax. Your BlackBerry has not suddenly turned into a block of wood. It’s still a phone. It can still send SMSs. In today’s technologically bustling sprawl, there’s also a good chance you’ll find a PC with Internet access within 500 metres of where you are.