Microsoft still doesn't get it. On Tuesday, Arthur de Haan, VP of Windows Live Test and Service Engineering, posted an analysis of the Sept. 8 outage that took down most of Microsoft's cloud applications. On the Windows Team Blog, he explains:
[W]e have identified two streams of work to drive specific service improvements around monitoring, problem identification, and recovery. Along with these service improvements, Microsoft is focused on further hardening the DNS service to improve its overall redundancy and fail-over capability... [and] an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored. In addition, we are reviewing the recovery tools to see if we can make more improvements that will decrease the time it takes to resolve outages.
Nowhere does he mention the single most important failure on Sept. 8: Microsoft didn't keep its customers informed.
Two weeks ago I posted a detailed analysis of the Sept. 8 outage. As such events go, it wasn't a horrible failure, although a worldwide three-hour crash of all of your major online products has to raise a few eyebrows.
The most notable aspect of the outage was a complete collapse of Microsoft's warning system. For varying amounts of time, depending on location, the Microsoft Service Dashboard reported that all was well when it clearly was not. Many people also complained that they couldn't even get to the Service Dashboard to see if Microsoft was aware of the problems or to receive official announcements from Microsoft.
Instead, Microsoft turned to Twitter to keep its customers updated -- belatedly. As best I can tell, the first official tweet occurred almost 90 minutes after customers started complaining. Tellingly, the downrightnow.com site was on top of the problem with its crowd- and cloud-sourcing approach.
A year ago, after a series of BPOS outages, Microsoft rolled out its Online Service Health Dashboard. But Microsoft doesn't make that dashboard available to mere customers: You have to have an Admin account for one of Microsoft's services (Office 365, BPOS) to see what's happening. By contrast, the Windows Azure Service Dashboard is available to anyone.
Systems crash -- d'oh! When they turn belly-up, it's the vendor's responsibility to keep customers in the loop. Yes, admins need to be advised, but so do end-users. What's so hard about setting up a reliable status notification site? Google does it. ISPs do it, with admittedly varying levels of credibility. Every hosted service provider worth its salt has a status site, too. Is anybody in Redmond listening?