Understanding Reliability

By Neil Sainsbury on 3 Mar 2008 10:01 am EST

With all the recent BlackBerry network outages, I thought it might be a good time to sit back and reflect on how reliable the BlackBerry actually is, what a reasonable expectation of reliability should be, and also to address the criticism some people have that RIM isn’t doing enough to notify people of outages.

The first statement to be made when discussing reliability of modern day technology is an obvious one, but it needs to be said: “Nothing is 100% reliable.” Later I’ll be discussing in more detail why this is the case and also how this is exactly the trade-off that the market has had to make to bring you what you want at a price you can afford. For now however, it’s important to keep this little tidbit of information always in the back of your mind.

And really, the fact that technology isn’t 100% reliable should be obvious. At a past workplace, I had more downtime because the office air-conditioner was broken than I have had because the BlackBerry network was down. I’ve spent countless hours sitting in gridlock on the highway because somebody’s car broke down, and I’ve probably spent days correcting software problems on my PC that shouldn’t have happened. Unreliability of technology is with us and it’s here to stay. Does that mean we should not strive to make technology more reliable? No. Does it mean we should resignedly give up when technology fails us? No. What it does mean is that we have to adjust our expectations, acknowledge that problems will happen and move on. Really, in terms of reliability, the BlackBerry is probably one of the most reliable pieces of technology in my life – no small feat, given the large number of independent components (device, carrier network, RIM NOCs, email servers) that comprise “the BlackBerry package.” I think if you were to do the math and work out the uptime percentage, you would probably find the same.

So why is technology unreliable? Good question! It’s because we’re cheap. In general, markets are spectacularly good at giving people exactly what they want within fiscal constraints. Technology isn’t 100% reliable because we, as consumers, are not willing to pay the price for that level of reliability. There comes a point, when a company is evaluating its plans to ensure reliability of services, at which things start getting a little bit ‘kooky’ from an outsider’s perspective. “When part A fails,” they say, “we have three engineers stationed within 100 km who can be there in around an hour.” But what if the traffic is jammed? “We’ll have a standby helicopter that can get them there quicker.” Who’s the helicopter pilot? “Bob. He’s ace, but he is often out all night at the casino so we better get a backup pilot too.”

Don’t laugh. I’ve heard discussions like this.

The money involved in making all these provisions for outlandish circumstances is phenomenal but it can mean the difference between 99.9% (three nines) and 99.999% (five nines) reliability. In a competitive marketplace, a company that tries to provide the golden five nines is going to incur significant costs which they will ultimately pass on to the consumer. The company that doesn’t will be able to offer services cheaper for the end-user and in today’s markets tends to be the one that survives while the others go bankrupt.
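To put numbers on the gap between three nines and five nines, here is a minimal Python sketch that converts an availability percentage into allowed downtime per year. The function name and the 365-day year are my own assumptions for illustration, not anything RIM or a carrier publishes.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (an assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("three nines", 99.9), ("five nines", 99.999)]:
        hours = downtime_hours_per_year(pct)
        print(f"{label} ({pct}%): {hours:.4f} hours, "
              f"or about {hours * 60:.1f} minutes per year")
```

Three nines leaves room for roughly 8.76 hours of outage a year, while five nines allows only about five minutes, which is why the helicopter and the backup pilot start to look necessary.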

At the end of the day, the old adage “you get what you pay for” rings true.

If there is one area, however, where I think you could level a valid complaint against RIM, it’s notifying people of outages. While you might not realize it, this is actually a PR problem. The question that RIM must ultimately answer when an outage occurs is “Do we actually want to tell people there was a problem?” It’s a tough question to answer, too. Every time there is an outage, a fairly significant chunk of people will not have been affected and would never even know. Do you risk telling them something bad and sending out false alarms? That could have terrible image consequences, especially for a company that sells itself on reliability. It’s a balancing act, and thus far RIM has erred on the side of “image preservation,” preferring to release little, if any, information. The idea is that if RIM were to make a lot of noise about an outage it would draw unwanted attention, whereas if they say nothing “everything might just blow over.”

I personally think RIM may now be finding that this approach is not the best, and I have a little solution I humbly offer up for consideration: provide SMS notification to affected subscribers. The majority of these outages seem to be occurring at a software level at RIM’s NOC. So, technically, they could build in independent tracking to determine whether data requests from devices are being serviced. If a request and response (for emails, web browser data, etc.) fails to make a full round trip, in from the carrier network and back out to the carrier network, several consecutive times over a reasonable period of time, an SMS is sent to that person notifying them of a problem, while warning klaxons simultaneously sound in “the war room” :). By doing this, you keep the individual up to date and also avoid sending out mass notifications, which would inevitably reach people who may not even have a problem (bad PR).
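For the developers reading, the per-subscriber tracking described above could be sketched roughly like this in Python. Everything here is hypothetical: the class name, the threshold of three failures, and the five-minute window are stand-ins I picked for “several consecutive times” and “a reasonable period of time,” and I have no knowledge of how RIM’s NOC software is actually structured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubscriberMonitor:
    """Tracks round-trip results for one subscriber (hypothetical design)."""
    failure_threshold: int = 3         # assumed: "several consecutive times"
    min_outage_seconds: float = 300.0  # assumed: "a reasonable period of time"
    consecutive_failures: int = 0
    first_failure_at: Optional[float] = None
    notified: bool = False

    def record_round_trip(self, completed: bool, now: float) -> bool:
        """Record one request/response attempt at time `now` (in seconds).

        Returns True only at the moment an SMS should be sent.
        """
        if completed:
            # Any successful round trip clears the failure streak.
            self.consecutive_failures = 0
            self.first_failure_at = None
            self.notified = False
            return False

        if self.first_failure_at is None:
            self.first_failure_at = now
        self.consecutive_failures += 1

        sustained = now - self.first_failure_at >= self.min_outage_seconds
        if (self.consecutive_failures >= self.failure_threshold
                and sustained and not self.notified):
            self.notified = True  # send at most one SMS per failure streak
            return True
        return False


if __name__ == "__main__":
    monitor = SubscriberMonitor()
    # (completed, timestamp in seconds) pairs for a single subscriber
    events = [(True, 0), (False, 60), (False, 200),
              (False, 400), (False, 500), (True, 600)]
    for completed, timestamp in events:
        if monitor.record_round_trip(completed, timestamp):
            print(f"t={timestamp}s: send outage SMS to this subscriber")
```

Because the state is kept per subscriber, only people whose own requests are failing would get the SMS, which is the whole point of avoiding a mass notification.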

I’m sure RIM has considered this approach too and may have even discredited it internally for some reason I’m not aware of but I hope that they could find a similar-in-concept approach to solve this notification problem rather than just do nothing.

In conclusion, next time the BlackBerry network goes down - and believe me, it will happen again - relax. Your BlackBerry has not suddenly turned into a block of wood. It’s still a phone. It can still send SMSs. In today’s technologically bustling sprawl, there’s also a good chance you’ll find a PC with Internet access within 500 metres of where you are.

About the Author: Neil is the founder of BlackBerry software company BBSmart and also runs DevBerry, a weblog for BlackBerry application developers. 

Reader comments


12 Comments

Nice take Neil...some valid points made. It's amazing how dependent we can become on the newest technology/conveniences that are available to us.

My favorite: "Relax...there is probably a computer within 200 metres!!"

well written article and it really puts things into perspective. you could look at it differently as well; the blackberry has been so perfect and continues to do well that people expect perfection all the time from such a device that has been entrenched in our culture. but like the article says, no technology is 100% dependable and addicts need to realize that involves their sacred blackberries.

I agree totally..I've yet to have been affected even once by the blackouts. If I do next time...oh well..S#@T HAPPENS!! Move on...

I moved to the RIM world in November and it was after careful research I decided the benefits were there and the risks were not the doomsday everyone made it out to be. I guess users have been so used to RIM not having such huge outages on a regular basis that a few in a short window scares them and well it should.

Being the resident gadget person who everyone knows has the latest/greatest toy I am always asked what should I get. Well I ask what can you afford. You see people want the highest reliability at the lowest price and despite what they may want companies need to earn a profit to stay in business and offer all of those said benefits.

The math is simple: 24 hours x 365 days = 8760 hours a year. If we get 99.9% as Neil suggested, we are at 8751.24 hours a year, meaning we lost 8.76 hours. If we want 99.999%, then the provider can only be offline 0.0876 hours (about 5 minutes) a year.

I do feel RIM can and should do more to alert their users of outages, and yes, they should also provide redundant NOCs, because what if a catastrophic event takes their NOC out? Then what?

This year, RIM is trending a lot worse than 99.9% uptime. Over half of the world is out right now, as far as I can tell. It might be closer to 75% of Blackberry users. No BIS, no BES, no BBM. Basically, well over 80% of the common features are turned off.

99.9% uptime means you're allowed 525 minutes of downtime per year per user. That's a lot of minutes. However, they've blown way past that with this one outage. In fact, they're approaching 99% uptime for the year if it stays out through the rest of today.

This is completely unacceptable. RIM is a service provider and must be held to the same standards to which an AT&T, Verizon or Sprint is held: five-9s. Yes, we need to get to 3 right now, but the target should be 5. Does Google go down for 3 days? Not even close. That's because they're on a pay per use model. If Google (search) goes down for 1 hour, how many millions of searches do they lose? How many people start switching the default to Bing? What about Facebook? Same thing. It's unacceptable that a company with the resources of RIM can't keep the one thing that makes them money (the BlackBerry network) operational even close to three-9s availability.

"and also to address the criticism some people have that RIM isn’t doing enough to notify people of outages".

Why would RIM need to notify you of an outage? Turn on your BlackBerry. If the service doesn't work....guess what..there may be an outage. :)

Also, how will they notify you if your BB is not functional due to an outage? Unless of course it's a planned outage and they notify you beforehand.

Hi tuxtech,

When RIM doesn't notify people of outages, there can be several undesirable consequences all primarily stemming from the fact that the average person does not know there is an outage. As far as the individual is concerned, their BlackBerry is just quiet. This can mean that they miss important emails. If they knew there was an outage however, they might make alternative plans to get access to their email. Also, they might assume something is wrong with their device and spend hours troubleshooting a problem that doesn't exist. I've seen people talk on forums about having both of these problems during an outage.

Also, as I mention in the article, RIM could send SMSs to notify people. While there is a RIM outage, the BlackBerry still functions perfectly for making/receiving calls and receiving/sending SMSs because these things go over the carrier network, not RIM's.

Cheers,
Neil.

While all the reviews and information on Crackberry are great...this has got to be THE BEST article I have read so far...Thank You Neil for this awesome perspective..

Neil, Kudos to you for writing the BEST article that I have seen in a very long time. I think I might print this and stick it on my fridge! lol :) I would if I were you. Really though, you couldn't have written a better article. Thank you.

Great article Neil, I do agree RIM should take better measures to notify customers of an outage. SMS notification seems like a good idea to me.

This sort of lenient article from a dyed-in-the-wool RIM apologist will fall on deaf ears in a week like this! And I actually think that would be quite justified.