Tuesday, 31 March 2009

Private and Public Cloud Outages and Performance

Some high-profile reported outages of public cloud services (sources: blog reports and service monitoring sites):

·         Nov 26, 2007, Yahoo e-commerce services. Heavy online traffic affected half of the 40,000 sites that subscribe to Yahoo’s e-commerce service. The outage prevented sales from being completed on thousands of web sites that depend on the e-commerce service. Outage: approx 6 hours

·         Feb 11, 2008, Salesforce.com. North American CRM servers (NA5) were up and down for most of the business day on Feb 11, due to a software upgrade installed over the weekend that caused subsequent service degradation. Outage: 24 hours

·         Feb 15, 2008, Amazon S3/EC2. Outage: from 4.30am to 7.00am (approx 2 hours). Affected many startup sites, e.g. Twitter, SmugMug, 37Signals and AdaptiveBlue, that use S3 to store data for their websites.

·         Feb 19, 2008, Yahoo Mail SMTP. Outage: delays in SMTP service, estimated at 24 hours

·         April 28, 2008 Amazon S3. Service authentication system overloaded with user requests. Outage: 3 hours.

·         Jul 20, 2008, Amazon S3. Internal system problems caused S3 to be inaccessible for up to 8 hours. Outage: 5 hours 45 minutes

·         Jul 22, 2008, Apple MobileMe launch. A mail server crash left some subscribers without email access for 5 days. Overall, less than 1% of customers were affected, some of whom permanently lost emails sent between 18 July and 22 July.

·         Aug 6, 2008, Google Gmail. A small number of Apps Premier users affected. Outage: 24 hours for some users

·         Aug 7, 2008, Citrix GoToMeeting and GoToWebinar. Due to a surge in demand. Outage: a few hours

·         Aug 8, 2008, Nirvanix and MediaMax/The Linkup (storage). Cloud service failed and closed. Lost an unspecified amount of customer data, put at approx 45% of all data stored. The Linkup had about 20,000 paying subscribers. The aim was to migrate to the Nirvanix storage delivery network, but only a partial migration was possible before closure.

·         Aug 12, 2008, Google Gmail. Users were unable to access mailboxes as Gmail returned a “Temporary Error (502)”. About 20 million users visit Gmail daily, with more than 100 million accounts in total. The issue was caused by a temporary outage in the contacts system used by Gmail, which prevented Gmail from loading properly. Outage: officially 1 hour 45 minutes (unofficially 2 hours)

·         Aug 15, 2008, Google Gmail. A small number of Apps Premier users affected. Outage: 24 hours for some users

·         Aug 26, 2008, XCalibre FlexiScale cloud. Affected many businesses using FlexiScale on-demand storage, processing and/or network bandwidth. Cited as partly human error. The data structure was not replicated across multiple data centers. Outage: 2-3 days

·         Jan 6, 2009, Salesforce.com. System-wide outage. All Salesforce.com services across all regions were largely unavailable between 12:39pm and 1:17pm. Outage: approx 40 minutes

·         Feb 24, 2009, Google Gmail outage in America and Europe. Third outage in 6 months. One blog estimate suggested 62 hours of downtime in the last 8 months, calculating 99.2% availability and projecting 99.4% over 12 months. Outage: 2 hours 30 minutes

·         March 10, 2009, Google Gmail. A small number of users affected. Gmail has approx 113 million users (comScore). Outage: partially fixed in a few hours, but between 24 and 36 hours to restore all affected accounts
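
The availability percentages quoted by bloggers for these outages come from simple arithmetic on downtime over a period. The following is a minimal sketch of that calculation, not the method any of the cited sources used. The function name and the 730-hour "average month" are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Return availability over a period as a percentage of total time."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie between 0 and period_hours")
    return 100.0 * (1.0 - downtime_hours / period_hours)


# Assumption: an average month is one twelfth of a 365-day year (730 hours).
eight_months = 8 * HOURS_PER_YEAR / 12

# 62 hours of downtime over 8 months, as in the blog estimate for Gmail.
print(f"{availability_percent(62, eight_months):.2f}%")
```

Under these assumptions, 62 hours over 8 months works out to roughly 98.9%, slightly below the 99.2% the blog quoted, so that estimate presumably rested on a different period or downtime total.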

The frequency and duration of outages have improved compared with two to three years ago, during the start-up phases of these services. The current performance should also be seen in light of the number of user accounts that the large public vendors manage, which far outnumbers even large-scale outsourcing and public infrastructure user groups, which may be in the order of 100,000+ unique desktop users to 5-10 million subscriber accounts.

  • Google Gmail has 113 million active accounts,  March 2009

  • Facebook has 175 million active accounts, March 2009

  • Amazon S3 stores more than 29 billion objects, October 2008

  •  Yahoo! Mail has 260 million users with a 67 Petabyte server in the California Region, March 2009

  • Myspace had 106 million accounts in Sept 2006. Myspace was overtaken by its main competitor Facebook in April 2008

  • Twitter has 4-5 million users, November 2008

  • Apple iTunes sold 9 billion songs, representing 70% of worldwide digital sales, Jan 2009

  • Yahoo! websites received 2.4 billion page hits per day in October 2007

These statistics support the “wikinomics paradigm”: online services offer a huge resource and user capacity compared with the storage and product range of physical bricks-and-mortar businesses. The microeconomics and service design benefit from significant economies of scale.

As cloud and on-demand services become more visible in mainstream discussion, these events will become more critical. A learning point from these public cloud failures is the need for transparent communication with user groups: communication of system status to users needs to increase in parallel with any system technology improvements.

Yet most proprietary system failures go unnoticed by all except those directly affected. In the cloud, however, there is more transparency and higher visibility of failure and downtime events.

Google has stated that it guarantees corporate customers who pay for Google Apps Premier Edition that Gmail will be available 99.9% of the time. Taken literally, the remaining 0.1% amounts to 8.76 hours per year. Google also publishes a status dashboard.
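
The 8.76-hour figure follows directly from the SLA percentage. As a rough sketch, the downtime a given availability target allows can be computed as below. The function name is hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) that an availability target permits."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (100.0 - sla_percent) / 100.0


# Compare a few common availability targets over one year.
for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hours per year")
```

For a 99.9% target this gives the 8.76 hours per year mentioned above. Several of the outages listed earlier would use up most or all of that annual allowance in a single incident.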

Amazon has implemented availability zones, persistent storage and elastic IP addresses. Unlike static addresses, elastic IP addresses can be remapped dynamically on the fly to point to compute instances by the user rather than by an Amazon data technician. Amazon announced a 99.9% availability SLA for its S3 storage service back in October 2007. Many companies are reported to use AWS to handle spike overflow, known as “cloud bursting”.

Salesforce.com publishes an operating status dashboard for all its server groups globally. This also includes a maintenance schedule for planned downtime, typically of 1 hour duration.

In conclusion, at least three potential next steps can be drawn if cloud computing is to become enterprise level for public, private and hybrid combinations of cloud services.

·         Use cloud burst technology to hot box failure and continuity of cloud services

·         Accept existing public cloud service levels as these may be higher than your current service levels for a number of non-core or even core services.

·         Build a private cloud that has the elastic compute benefits of the cloud but is preserved and managed as an internal data center standard
