July 20, 2008

Responding to Amazon's S3 outage

Today's S3 outage disrupted a lot of major web sites, including Twitter and SmugMug, both of which rely on S3 to host photos and/or icons.  The service is coming back online now, but it was down for almost 7 hours in total.

We host all of our photos at Amazon S3, but as a new site with low (but growing) traffic levels Planaroo obviously won't feel the impact that these major sites will.  However, we're optimistic that Planaroo.com will be a big site one day so it's interesting to think of ways to prevent bad things from happening.

Here are the lessons we're taking away from today's problems:

  • Backups are a good thing. The one thing we did right was to keep current backups of the photos we host on S3.  That allowed us to manually replace the URL in our photos database table with a new URL quickly and easily. 
  • Fix it. Don't hope for the best.  S3 went down at 9:06 am Pacific, and we had photos up and running by around noon.  We didn't do anything for two hours because we thought -- incorrectly -- that the system would probably come back up quickly.  That was a mistake.
  • Keep backups up-to-date and ready-to-go. Planaroo is based in a private home at the moment, and we depend on our Comcast connection to upload files to our servers at Slicehost.  Normally that's not a problem since the entire code base is small, but Planaroo has well over 1000 photos, and we store small (225x315) and large (500x700) versions of each one on the server.  Comcast's upstream speeds are slow, and they throttle down after the first few MB are uploaded.  It took over an hour to upload all the photos, and that's just not acceptable.  In the future we'll keep a backup ready to go in the event that Amazon S3 goes down, and we'll automate the switchover process.  We'll also think about seeking out a faster upstream solution.
  • We were right not to host critical CSS, JavaScript, or icons on S3.  It's bad enough to serve broken photos, but broken CSS or JavaScript files can render a site unusable.
  • Google's AJAX Libraries could go down, too.  We're using Google's AJAX Libraries API to serve the prototype.js file.  Google could have a similar downtime, and we need a failover plan for that too.

We would like to hear what other small startups are doing in response to today's problems.

Comments

We were happy with statics hosted on S3 and we will keep them there. We will also host them on the EC2 web app instance, just in case this happens again. As you said, the website can work without images, but without JS and CSS it was unusable.

Last time S3 went down, there wasn't any info except in their forums. This time they set up a status page, but without much more info regarding the delay. So yes, as soon as S3 goes down, we will switch from S3 to EC2 for these files.

Regarding the backup, we are implementing something, but not for the same reasons. Any suggestions for backing up and restoring S3 files are welcome.

Interestingly enough, a lot of the philosophy behind your post is already largely covered in the presentation by Randy Shoup on eBay's Architectural Principles:
http://www.infoq.com/presentations/shoup-ebay-architectural-principles

but i gotta say a quick reminder is always a good thing ;)

Jean-Marc

I worked with Randy at eBay, and he's a rock star. I should have picked up some of his tips by osmosis!

This is a really good point.

There wouldn't be anything to stop you from uploading all your stuff to another service and rebooting all the urls.

Most services will let you maintain an account which isn't being actively billed because you aren't using it. Why not have a script to "boot up" another service from your local copies of stuff and switch.

Or even if you have a lot of materials, suck up the small cost of idling stuff on disk with another service and then just swap over to get the bandwidth costs while your main service is down.

It's significantly easier to have local storage for a travel site or some other site with small storage than a site like Smugmug. They have 335 million photos and store the originals along with 7 or 8 display sizes.

If Amazon is down for a few hours a year, does that justify spending millions to maintain local storage?

@Jon, totally agree that live backups may not be practical for sites like SmugMug which have essentially built their business on top of S3. It's also a lot more complicated once users start uploading photos, which we're not supporting (yet).

@Tom, we have a column in the photos table called "base_url". We set it up that way because we wanted to try serving photos from different hosts to compare metrics. Because our backup retains the same path structure that was on S3, we were able to replace the S3 base_url with the backup host's base_url in a single SQL statement once the files were in place on the backup host.
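
For anyone who wants to script that swap, here is a minimal sketch in Python. It is not our production code: the host names, the `path` column, and the `switch_base_url` and `build_demo_db` helpers are all made up for illustration, and it uses an in-memory SQLite database rather than whatever database you actually run. The only thing taken from our setup is a `photos` table with a `base_url` column.

```python
import sqlite3

# Hypothetical hosts; the real values are not given in this post.
S3_BASE_URL = "http://s3.example.com/photos-bucket"
BACKUP_BASE_URL = "http://backup.example.com/photos"


def switch_base_url(conn, old_base, new_base):
    """Point every photo row that uses old_base at new_base.

    Returns the number of rows changed. Paths are left untouched, which
    only works if the backup keeps the same path structure as the original.
    """
    with conn:  # commits on success, rolls back if the UPDATE fails
        cur = conn.execute(
            "UPDATE photos SET base_url = ? WHERE base_url = ?",
            (new_base, old_base),
        )
    return cur.rowcount


def build_demo_db():
    """Create a throwaway in-memory table shaped like the one described."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos (id INTEGER PRIMARY KEY, base_url TEXT, path TEXT)"
    )
    conn.executemany(
        "INSERT INTO photos (base_url, path) VALUES (?, ?)",
        [(S3_BASE_URL, "small/1.jpg"), (S3_BASE_URL, "large/1.jpg")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    changed = switch_base_url(conn, S3_BASE_URL, BACKUP_BASE_URL)
    print("rows switched:", changed)
```

Calling the same function with the arguments reversed switches back once S3 recovers.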

When I first started using S3 I thought it was the most amazing thing ever. Within 24 hours, I had my entire web imaging system (heavy user contribution) using s3. However, I wish amazon's .NET methods had included some setup in web.config that would have specified a local cache folder should their service ever go down.

I'm programming this myself now, and using a getFileLink() method around any s3 hosted file. It checks s3 status every 5 minutes. If s3 is down, it uses the local cache for the next 5 minutes.

Essentially, it adds our own local servers to the 'cloud'. I'm surprised s3 didn't provide this feature in the first place.
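
The wrapper described in this comment could look roughly like the sketch below. It is written in Python rather than .NET, and everything in it is an assumption: the `FileLinker` class, its parameter names, and the example URLs are hypothetical, and the actual health check is left as a callable you supply, since the comment does not say how S3 status is tested.

```python
import time


class FileLinker:
    """Build file URLs, falling back to a local cache while S3 looks down.

    The health check runs at most once per check_interval seconds; between
    checks the last known result is reused.
    """

    def __init__(self, s3_base, local_base, is_s3_up,
                 check_interval=300, clock=time.monotonic):
        self.s3_base = s3_base.rstrip("/")
        self.local_base = local_base.rstrip("/")
        self._is_s3_up = is_s3_up      # caller-supplied callable returning bool
        self._interval = check_interval
        self._clock = clock            # injectable so the timing can be tested
        self._last_check = None
        self._s3_ok = True

    def _refresh(self):
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._interval:
            try:
                self._s3_ok = bool(self._is_s3_up())
            except Exception:
                # A failing health check is treated the same as "S3 is down".
                self._s3_ok = False
            self._last_check = now

    def get_file_link(self, path):
        self._refresh()
        base = self.s3_base if self._s3_ok else self.local_base
        return base + "/" + path.lstrip("/")


if __name__ == "__main__":
    linker = FileLinker(
        "http://s3.example.com/bucket",     # hypothetical S3 base URL
        "http://www.example.com/cache",     # hypothetical local cache URL
        is_s3_up=lambda: False,             # pretend S3 is down
    )
    print(linker.get_file_link("photos/1.jpg"))
```

One consequence of checking only every five minutes is that visitors can see broken files for up to five minutes after an outage starts, and keep getting the local copies for up to five minutes after S3 recovers.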
