Hotmail fail: Microsoft lays an egg in the cloud
Microsoft lost all email for 17,000 Hotmail customers, then botched the response. Is this a harbinger of Office 365 hassles?
If this is the way Microsoft’s going to handle Office 365 outages, we’re in for some interesting times.
On Dec. 30, one of the largest SQL Server databases on the planet started having problems. The database in question just happens to belong to Microsoft. And the way the company reacted to the problems should raise red flags for anyone considering a move to the Microsoft cloud.
According to a blog post by Chris Jones, a Microsoft vice president in Windows Live Engineering, the Hotmail servers had a problem with load balancing, resulting in 17,355 email accounts losing all of their data. It took Microsoft three days to restore the data. At least, Microsoft claims it had the data restored in three days. Voluminous postings on both the Windows Live Engineering site and the Windows Live Solution Center say that some people still haven’t gotten their data back.
Data loss happens in the cloud, on corporate servers, and on the desktop. But this is, arguably, Microsoft’s most widely deployed cloud application, backed by Redmond’s best and brightest, and it failed for 17,000 users for at least three days.
Put aside the obvious technical questions: Why were the servers performing a load-balancing act in the middle of the busiest time of year? How did the data disappear and then suddenly reappear? Why does it take three days to retrieve lost data? Can’t SQL Server scale better than that? Look instead at Microsoft’s response to the disappearing data, and you really have to wonder how the ‘Softies would handle a data-destroying incident involving your company’s data.
Consider: The initial problem notification, predictably, came on the Microsoft support board, the Windows Live Solution Center. Hundreds, then thousands of people reported that all of their messages were gone. The support staff handling the Solution Center must’ve realized they were facing a systemic problem, not a random sampling of clueless users. But instead of coming out with a definitive statement and posting it on the forum in blazing color, the support people just chased after individual reports, using cut-and-paste responses to users’ cries of anguish.