Random Thoughts About Amazon's Cloud Crash

As CTO of FreeWheel, I was often challenged on my decision to have a global operations team (a very small one, compared with other companies I know) run and maintain our own colocation (COLO) data centers. Why not put everything on a cloud? The question came up at every stage of FreeWheel. My answer has varied as we have grown.

In the early days of FreeWheel, we didn’t pick a cloud like Amazon EC2 because it was not mature enough and wasn’t proven in the market. We were very small when we first launched; if the Amazon cloud had been as big then as it is today, I would have thought very hard before deciding to rent a COLO. A cloud is actually a perfect choice for a small startup: you can quickly launch your own website or service without much capital commitment, and speed and minimal capital commitment are critical factors when launching a business.

After FreeWheel launched, we grew quickly and signed up many brand-name customers. As part of those contracts we sign service level agreements (SLAs) with them, and when you are talking about several 9s in your uptime requirement, a cloud solution no longer works. Putting everything in a cloud would mean competing for resources with small businesses or personal blogs that never worry about response time or uptime. As long as the service doesn’t go down all the time, it is good enough for them. It is not acceptable for us, however. With a customer portfolio like ours, I would fire myself if I were to put everything on a cloud, because that would mean I was not taking the quality of service we owe our customers to heart. A cloud service is not good enough for mission-critical business, although it can be a perfect alternative for the non-mission-critical pieces of your business, and it can also be a perfect way to absorb spiky traffic overflow. Just recently, I asked our Chief Architect to investigate what it would take to expand our testing environment on the Amazon cloud. Even today, when everybody is talking about the Amazon cloud crash, my opinion has not changed. Choosing a cloud means it is no big deal if your data is lost permanently or the service goes down for a few days. When you make the choice of going with the cloud, you know full well that this is going to happen one day, and you are OK with it. A test or dev environment fits that profile.
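To put “several 9s” in perspective, here is a rough back-of-the-envelope sketch in Python. The function names are mine, made up for illustration, and the numbers are plain arithmetic on a 365-day year, not figures from any actual FreeWheel or Amazon SLA.

```python
# Illustrative arithmetic only: turns an uptime target expressed in "nines"
# into a yearly downtime budget. Not taken from any real SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for a target with `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


def availability_after_outage(outage_days: float) -> float:
    """Yearly availability (as a fraction) if the only downtime is one outage."""
    return 1.0 - outage_days / 365.0


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes/year")
    print(f"3-day outage: {availability_after_outage(3) * 100:.2f}% availability")
```

Three nines leaves about 525.6 minutes a year, which is under nine hours. A single three-day outage already brings the year down to roughly 99.18%, so an outage of “a few days” breaks even a three-nines commitment for that year.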

I find that people often blame technology for problems when, in fact, it is the decision about where to apply the technology that should be blamed. People have come to me and asked: “Why did you choose Ruby on Rails (ROR)? We have had so many problems with it; it is crappy.” I would ask: what do you use ROR for? We chose ROR for speed of UI development, not for high-performance applications. ROR should not be chosen for high-traffic website development or high-performance servers. If I were to choose ROR for our ad server development to support billions of transactions daily, then my crappy choice would be to blame; ROR would not be at fault. In the same way, if you put all your mission-critical data or services on a cloud and one day you find that you have lost all your data or the service is completely down, it is that decision that should be blamed…

Am I arguing that Amazon is not at fault? No. They claimed perfect data backups and guaranteed no data loss in their state-of-the-art cloud, and failing to keep that promise is indeed Amazon’s fault. I do find it hard to believe that they don’t have “near-realtime” off-site data backups. To prevent complete data loss should anything happen in one data center, one typically sets up copying of the data to another data center. That is how we set things up at FreeWheel: if our main site goes down, we can use the data copied to another data center to recover. What I suspect happened in Amazon’s data loss is this: the corrupted data at one site was replicated to the remote data center before they caught it, so the remote backup is no good either. If this is true (Amazon has not come up with an official explanation yet), the fault lies with their monitoring of data integrity.
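To make that suspected failure mode concrete, here is a minimal Python sketch of an integrity check that runs before data is copied off site. It is purely hypothetical: the `replicate` function, the in-memory dictionaries standing in for data centers, and the idea of recording a checksum at write time are my assumptions for illustration, not a description of how Amazon or FreeWheel actually replicates data.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a record, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def replicate(primary: dict, remote: dict, known_good: dict) -> list:
    """Copy records from `primary` to `remote` (hypothetical stores).

    A record is copied only if its checksum still matches the one recorded
    when it was written (`known_good`). Records that fail the check are held
    back, so the remote copy keeps its last good version.
    Returns the list of keys that were held back.
    """
    held_back = []
    for key, data in primary.items():
        if checksum(data) != known_good.get(key):
            held_back.append(key)  # do not propagate suspected corruption
            continue
        remote[key] = data
    return held_back


if __name__ == "__main__":
    primary = {"a": b"alpha", "b": b"beta"}
    known_good = {k: checksum(v) for k, v in primary.items()}
    remote = dict(primary)        # remote starts as a good copy
    primary["b"] = b"b\x00ta"     # simulate corruption at the main site
    print("held back:", replicate(primary, remote, known_good))
    print("remote copy of b:", remote["b"])
```

Without the check, the corrupted record would simply overwrite the good remote copy, which is exactly the situation I suspect happened. With it, the remote site keeps the last good version and someone gets alerted instead.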


 