SeattleStartup board - Lessons Netflix Learned from the AWS Outage
t*****y
Posts: 445
1
On Thursday, April 21st, Amazon experienced a large outage in AWS US-East
which they describe here. This outage was highly publicized because it took
down or severely hampered a number of popular websites that depend on AWS
for hosting. Our post below describes our experience at Netflix with the
outage, and what we've learned from it.
Some Background
Why were some websites impacted while others were not? For Netflix, the
short answer is that our systems are designed explicitly for these sorts of
failures. When we re-designed for the cloud, this Amazon failure was exactly
the sort of issue we wanted to be resilient to. Our architecture avoids
using EBS as our main data storage service, and the SimpleDB, S3 and
Cassandra services that we do depend upon were not affected by the outage.
What Went Well...
The Netflix service ran without intervention but with a higher than usual
error rate and higher latency than normal through the morning, which is the
low traffic time of day for Netflix streaming. We didn't see a noticeable
increase in customer service calls, or any meaningful impact on our customers'
ability to find and start movies.
In some ways this Amazon issue was the first major test to see if our new
cloud architecture put us in a better place than we were in 2008. It wasn't
perfect, but we didn't suffer any big external outages and only had to do a
small amount of scrambling internally. Some of the decisions we made along
the way that contributed to our success include:
Stateless Services
One of the major design goals of the Netflix re-architecture was to move to
stateless services. These services are designed such that any service
instance can serve any request in a timely fashion and so if a server fails
it's not a big deal. In the failure case requests can be routed to another
service instance and we can automatically spin up a new node to replace it.
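As a rough illustration of this pattern (not Netflix's actual client code), here is a minimal Python sketch in which any instance can serve the request and a failed one is simply skipped; the instance addresses, port, and JSON API are placeholder assumptions:

import requests

# Placeholder instance list; in practice this would come from a service
# registry rather than a hard-coded list.
INSTANCES = [
    "http://10.0.1.11:7001",
    "http://10.0.2.12:7001",
    "http://10.0.3.13:7001",
]

def call_stateless_service(path, timeout=0.5):
    """Any instance can serve the request because no session state lives on
    the server; a failing instance is simply skipped."""
    last_error = None
    for base in INSTANCES:
        try:
            resp = requests.get(base + path, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err        # treat this instance as unhealthy
    raise RuntimeError(f"all instances failed: {last_error}")

Replacing the dead node is then left to autoscaling rather than to the caller.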
Data Stored Across Zones
In cases where it was impractical to re-architect in a stateless fashion we
ensure that there are multiple redundant hot copies of the data spread
across zones. In the case of a failure we retry in another zone, or switch
over to the hot standby.
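A minimal sketch of the retry-in-another-zone idea, assuming a hypothetical per-zone client object with a get(key) method (the post does not name the underlying store in this passage):

def read_with_zone_failover(key, replicas, local_zone):
    """Read the hot copy in the caller's own zone first; on any failure retry
    the redundant copies in the other zones. `replicas` maps zone name to a
    client exposing get(key) -- a stand-in, not a specific data service."""
    ordered = [local_zone] + [z for z in replicas if z != local_zone]
    last_error = None
    for zone in ordered:
        try:
            return replicas[zone].get(key)      # hot copy in this zone
        except Exception as err:                # zone unreachable, timed out, ...
            last_error = err
    raise last_error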
Graceful Degradation
Our systems are designed for failure. With that in mind we have put a lot of
thought into what we do when (not if) a component fails. The general
principles are:
Fail Fast: Set aggressive timeouts such that failing components don't make
the entire system crawl to a halt.
Fallbacks: Each feature is designed to degrade or fall back to a lower
quality representation. For example if we cannot generate personalized rows
of movies for a user we will fall back to cached (stale) or un-personalized
results.
Feature Removal: If a feature is non-critical and it's slow, we may remove
the feature from any given page to prevent it from impacting the member
experience (a rough sketch of these patterns follows this list).
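A minimal Python sketch of fail fast plus fallback, assuming a hypothetical compute_rows(user_id) personalization call and a dict-like cache of the last good result; neither is Netflix's real interface:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=16)

def rows_with_fallback(user_id, compute_rows, cache, timeout=0.25):
    """Fail fast, then degrade: give the personalization call a short budget
    and fall back to stale cached rows (or nothing at all) if it misses it."""
    future = _executor.submit(compute_rows, user_id)
    try:
        rows = future.result(timeout=timeout)   # aggressive timeout: fail fast
        cache[user_id] = rows                   # refresh the fallback copy
        return rows
    except FutureTimeout:
        future.cancel()                         # don't let it drag the page down
    except Exception:
        pass                                    # the call failed outright
    # Fallback: stale or un-personalized rows; an empty list means the page
    # simply drops the row (feature removal).
    return cache.get(user_id, [])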
"N+1" Redundancy
Our cloud architecture is designed with N+1 redundancy in mind. In other
words we allocate more capacity than we actually need at any point in time.
This capacity gives us the ability to cope with large spikes in load caused
by member activity or the ripple effects of transient failures; as well as
the failure of up to one complete AWS zone. All zones are active, so we
don't have one hot zone and one idle zone as used by simple master/slave
redundancy. The term N+1 indicates that one extra is needed, and the larger
N is, the less overhead is needed for redundancy. We spread our systems and
services out as evenly as we can across three of the four zones. When
provisioning capacity, we use AWS reserved instances in every zone, and
reserve more than we actually use, so that we will have guaranteed capacity
to allocate if any one zone fails. As a result we have higher confidence
that the other zones are able to grow to pick up the excess load from a zone
that is not functioning properly. This does cost money (reservations are
for one to three years with an advance payment) however, this is money well
spent since it makes our systems more resilient to failures.
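The capacity arithmetic behind this, with purely illustrative numbers rather than Netflix's real traffic figures:

def per_zone_capacity_needed(total_load, zones_active, zones_lost=1):
    """N+1 sizing: each surviving zone must carry its share of the total load
    even after `zones_lost` zones disappear."""
    return total_load / (zones_active - zones_lost)

peak = 90_000                                          # requests/sec, made up
steady_per_zone = peak / 3                             # 30,000 rps per zone
failover_per_zone = per_zone_capacity_needed(peak, 3)  # 45,000 rps per zone
# Each zone therefore needs ~50% headroom over its steady-state share, which
# is why the reserved-instance pool is deliberately larger than daily use.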
Cloud Solutions for the Cloud
We could have chosen the simplest path into the cloud, fork-lifting our
existing applications from our data centers to Amazon's and simply using EC2
as if it was nothing more than another set of data centers. However, that
wouldn't have given us the same level of scalability and resiliency that we
needed to run our business. Instead, we fully embraced the cloud paradigm.
This meant leveraging NoSQL solutions wherever possible to take advantage of
the added availability and durability that they provide, even though it
meant sacrificing some consistency guarantees. In addition, we also use S3
heavily as a durable storage layer and pretend that all other resources are
effectively transient. This does mean that we can suffer if there are
S3-related issues; however, this is a system that Amazon has architected to be
resilient across individual system and zone failures and has proven to be
highly reliable, degrading rather than failing.
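A tiny sketch of treating S3 as the durable layer while instances stay disposable, written with today's boto3 SDK (which postdates this post) and a made-up bucket name:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-durable-state"   # made-up bucket name

def checkpoint(service_name, state):
    """Persist state to S3 so losing the instance itself costs nothing."""
    s3.put_object(Bucket=BUCKET,
                  Key=f"checkpoints/{service_name}.json",
                  Body=json.dumps(state).encode("utf-8"))

def restore(service_name):
    """On a fresh instance, pull the last checkpoint back out of S3."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"checkpoints/{service_name}.json")
    return json.loads(obj["Body"].read())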
What Didn't Go So Well...
While we did weather the storm, it wasn't totally painless. Things appeared
pretty calm from a customer perspective, but we did have to do some
scrambling internally. As it became clear that AWS was unlikely to resolve
the issues before Netflix reached peak traffic in the early evening, we
decided to manually re-assign our traffic to avoid the problematic zone.
Thankfully, Netflix engineering teams were able to quickly coordinate to get
this done. At our current level of scale this is a bit painful, and it is
clear that we need more work on automation and tooling. As we grow from a
single Amazon region and 3 availability zones servicing the USA and Canada
to being a worldwide service with dozens of availability zones, even with
top engineers we simply won't be able to scale our responses manually.
Manual Steps
When Amazon's Availability Zone (AZ) started failing, we decided to get out
of the zone altogether. This meant making significant changes to our AWS
configuration. While we have tools to change individual aspects of our AWS
deployment and configuration they are not currently designed to enact
wholesale changes, such as moving sets of services out of a zone completely.
This meant that we had to engage with each of the service teams to make the
manual (and potentially error prone) changes. In the future we will be
working to automate this process, so it will scale for a company of our size
and growth rate.
Load Balancing
Netflix uses Amazon's Elastic Load Balancing (ELB) service to route traffic to
our front end services. We utilize ELB for almost all our web services.
There is one architectural limitation with the service: losing a large
number of servers in a zone can create a service interruption.
ELBs are set up such that load is balanced across zones first, then
instances. This is because ELB uses a two-tier load balancing scheme. The
first tier consists of basic DNS based round robin load balancing. This gets
a client to an ELB endpoint in the cloud that is in one of the zones that
your ELB is configured to use. The second tier of the ELB service is an
array of load balancer instances (provisioned directly by AWS), which does
round robin load balancing over our own instances that are behind it in the
same zone.
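A toy simulation of that two-tier scheme makes the limitation concrete; the zone names and instance counts below are made up:

from collections import Counter
from itertools import cycle

def simulate_elb(total_requests, instances_by_zone):
    """Toy model of the two-tier scheme: requests are round-robined across
    zones first, then each zone round-robins across its own instances."""
    zone_picker = cycle(instances_by_zone)                  # tier 1: zones
    instance_pickers = {z: cycle(insts)                     # tier 2: per zone
                        for z, insts in instances_by_zone.items()}
    load = Counter()
    for _ in range(total_requests):
        zone = next(zone_picker)
        load[next(instance_pickers[zone])] += 1
    return load

# Three zones, but one has lost most of its instances (made-up numbers):
fleet = {
    "us-east-1a": [f"a-{i}" for i in range(4)],
    "us-east-1b": [f"b-{i}" for i in range(4)],
    "us-east-1c": ["c-0"],
}
print(simulate_elb(9000, fleet))
# Each a-* and b-* instance sees ~750 requests; c-0 alone absorbs ~3000.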
In the case where you are in 3 zones and many of the service instances in
one zone go down, the rest of the nodes in that AZ have to pick up the slack and
handle the extra load. Eventually, if you couldn't launch more nodes to
bring capacity up to previous levels you would likely suffer a cascading
failure where all the nodes in the zone go down. If this happened then a
third of your traffic would essentially go into a "black hole" and fail.
The net effect of this is that we have to be careful to make sure that our
zones stay evenly balanced with frontend servers so that we don't serve
degraded traffic from any one zone. This meant that when the outage happened
last week we had to manually update all of our ELB endpoints to completely
avoid the failed zone, and change the autoscaling groups behind them to do
the same so that the servers would all be in the zones that were getting
traffic.
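Automated, those manual steps would look roughly like the sketch below, written against today's boto3 SDK (which did not exist at the time); the load balancer, autoscaling group, and zone names are placeholders, not Netflix's actual resources:

import boto3

elb = boto3.client("elb")                 # the classic ELB API used in 2011
autoscaling = boto3.client("autoscaling")

def evacuate_zone(bad_zone, load_balancer_names, asg_names, surviving_zones):
    """Pull a failing Availability Zone out of the serving path: stop the
    ELBs from routing traffic there and re-home the autoscaling groups to
    the healthy zones. All resource names here are placeholders."""
    for name in load_balancer_names:
        elb.disable_availability_zones_for_load_balancer(
            LoadBalancerName=name,
            AvailabilityZones=[bad_zone],
        )
    for name in asg_names:
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=name,
            AvailabilityZones=surviving_zones,
        )

# Hypothetical invocation:
# evacuate_zone("us-east-1a",
#               load_balancer_names=["api-frontend", "www-frontend"],
#               asg_names=["api-frontend-asg", "www-frontend-asg"],
#               surviving_zones=["us-east-1b", "us-east-1c", "us-east-1d"])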
It also appears that the instances used by AWS to provide ELB services were
dependent on EBS backed boot disks. Several people have commented that ELB
itself had a high failure rate in the affected zone.
For middle tier load balancing Netflix uses its own software load balancing
service that does balance across instances evenly, independent of which zone
they are in. Services using middle tier load balancing are able to handle
uneven zone capacity with no intervention.
Lessons Learned
This outage gave us some valuable experience and helped us to identify and
prioritize several ways that we should improve our systems to make them more
resilient to failures.
Create More Failures
Currently, Netflix uses a service called "Chaos Monkey" to simulate service
failure. Basically, Chaos Monkey is a service that kills other services. We
run this service because we want engineering teams to be used to a constant
level of failure in the cloud. Services should automatically recover without
any manual intervention. We don't, however, simulate what happens when an
entire AZ goes down and therefore we haven't engineered our systems to
automatically deal with those sorts of failures. Internally we are having
discussions about doing that and people are already starting to call this
service "Chaos Gorilla".
Automate Zone Fail Over and Recovery
Relying on multiple teams using manual intervention to fail over from a
failing AZ simply doesn't scale. We need to automate this process, making it
a "one click" operation that can be invoked if required.
Multiple Region Support
We are currently re-engineering our systems to work across multiple AWS
regions as part of the Netflix drive to support streaming in global markets.
As AWS launches more regions around the world, we want to be able to
migrate the support for a country to a newly launched AWS region that is
closer to our customers. This is essentially the same operation as a
disaster recovery migration from one region to another, so we would have the
ability to completely vacate a region if a large scale outage occurred.
Avoid EBS Dependencies
We had already decided that EBS performance was an issue, so (with one
exception) we have avoided it as a primary data store. With two outages so
far this year we are taking steps to further reduce dependencies by
switching our stateful instances to S3 backed AMIs. These take a lot longer
to create in our build process, but will make our Cassandra storage services
more resilient, by depending purely on internal disks. We already have
plans to migrate our one large MySQL EBS based service to Cassandra.
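An illustrative audit of that dependency, again using today's boto3 SDK; it only flags running instances whose root device is EBS rather than instance store:

import boto3

ec2 = boto3.client("ec2")

def ebs_backed_instances():
    """Flag running instances whose root device is EBS rather than the
    instance store used by S3-backed AMIs. Illustration only."""
    flagged = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            if instance.get("RootDeviceType") == "ebs":
                flagged.append((instance["InstanceId"], instance["ImageId"]))
    return flagged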
Conclusion
We set out to build a highly available Netflix service on AWS and this
outage was a major test of that decision. While we do have some lessons
learned and some improvements to make, we continue to be confident that this
is the right strategy for us.
Authors:
Adrian Cockcroft - Director - Architecture, ECommerce and System Engineering
Cory Hicks - Sr. Manager, Applied Personalization Algorithms
Greg Orzell - Sr. Manager, Streaming Service and Insight Engineering