Programming board - Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
w**z
Posts: 8232
1
https://aws.amazon.com/message/5467D2/?utm_content=buffere5a1e&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer

Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Early Sunday morning, September 20, we had a DynamoDB service event in the
US-East Region that impacted DynamoDB customers in US-East, as well as some
other services in the region. The following are some additional details on
the root cause, subsequent impact to other AWS services that depend on
DynamoDB, and corrective actions we’re taking.
Some DynamoDB Context
Among its many functions, DynamoDB stores and maintains tables for customers.
A single DynamoDB table is separated into partitions, each containing a
portion of the table’s data. These partitions are spread across many servers,
both to provide consistent low-latency access and to replicate the data for
durability.
The specific assignment of a group of partitions to a given server is called
a “membership.” The membership of a set of table/partitions within a
server is managed by DynamoDB’s internal metadata service. The metadata
service is internally replicated and runs across multiple datacenters.
Storage servers hold the actual table data within a partition and need to
periodically confirm that they have the correct membership. They do this by
checking in with the metadata service and asking for their current
membership assignment. In response, the metadata service retrieves the list
of partitions and all related information from its own store, bundles this
up into a message, and transmits it back to the requesting storage server. A
storage server will also get its membership assignment after a network
disruption or on startup. Once a storage server has completed processing its
membership assignment, it verifies that it has the table/partition data
locally stored, creates any new table/partitions assigned, and retrieves
data from other storage servers to replicate existing partitions assigned.
The DynamoDB Event
On Sunday, at 2:19am PDT, there was a brief network disruption that impacted
a portion of DynamoDB’s storage servers. Normally, this type of networking
disruption is handled seamlessly and without change to the performance of
DynamoDB, as affected storage servers query the metadata service for their
membership, process any updates, and reconfirm their availability to accept
requests. If the storage servers aren’t able to retrieve this membership
data back within a specific time period, they will retry the membership
request and temporarily disqualify themselves from accepting requests.
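
To make the normal-case behavior concrete, here is a minimal sketch of the check-in loop described above: the storage server asks the metadata service for its membership assignment, and if the response does not arrive within the allowed time it disqualifies itself from taking customer requests and retries. Every name and constant in it (MetadataClient, MEMBERSHIP_TIMEOUT_SECONDS, the retry interval) is an illustrative assumption, not DynamoDB's actual internals.

```python
import time

# Illustrative values; the real timeout and refresh cadence are not public.
MEMBERSHIP_TIMEOUT_SECONDS = 5.0
RETRY_INTERVAL_SECONDS = 10.0


class MetadataClient:
    """Hypothetical client for DynamoDB's internal metadata service."""

    def get_membership(self, server_id, timeout):
        # In reality this returns the list of partitions (and related
        # information) assigned to the server, or times out under load.
        return {"server_id": server_id, "partitions": []}


class StorageServer:
    def __init__(self, server_id, metadata_client):
        self.server_id = server_id
        self.metadata = metadata_client
        self.available = False      # whether we currently accept customer requests
        self.membership = None      # partitions assigned to this server

    def refresh_membership(self):
        """Confirm our partition assignment with the metadata service."""
        try:
            # The response must arrive within the retrieval time allowance.
            self.membership = self.metadata.get_membership(
                self.server_id, timeout=MEMBERSHIP_TIMEOUT_SECONDS)
        except TimeoutError:
            # Could not confirm membership in time: disqualify ourselves and retry.
            self.available = False
            return False
        # (Here the server would verify local partition data, create newly
        # assigned partitions, and replicate existing ones from peers.)
        self.available = True
        return True

    def run_once(self):
        if not self.refresh_membership():
            time.sleep(RETRY_INTERVAL_SECONDS)   # retry while disqualified
```

In these terms, the event below amounts to get_membership routinely exceeding the timeout once responses grew large, so servers across the fleet kept flipping to unavailable.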
But, on Sunday morning, a portion of the metadata service responses exceeded
the retrieval and transmission time allowed by storage servers. As a result
, some of the storage servers were unable to obtain their membership data,
and removed themselves from taking requests. The reason these metadata
service requests were taking too long relates to a recent development in
DynamoDB. Over the last few months, customers have rapidly adopted a new
DynamoDB feature called Global Secondary Indexes (“GSIs”). GSIs allow
customers to access their table data using alternate keys. Because GSIs are
global, they have their own set of partitions on storage servers and
therefore increase the overall size of a storage server’s membership data.
Customers can add multiple GSIs for a given table, so a table with large
numbers of partitions could have its contribution of partition data to the
membership lists quickly double or triple. With rapid adoption of GSIs by a
number of customers with very large tables, the partitions-per-table ratio
increased significantly. This, in turn, increased the size of some storage
servers’ membership lists significantly. With a larger size, the processing
time inside the metadata service for some membership requests began to
approach the retrieval time allowance by storage servers. We did not have
detailed enough monitoring for this dimension (membership size), and didn’t
have enough capacity allocated to the metadata service to handle these much
heavier requests.
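
As a rough illustration of why GSIs inflate membership size (the numbers are invented for illustration, not actual DynamoDB figures): each GSI carries its own set of partitions, so a table's contribution to a membership list grows roughly linearly with the number of indexes.

```python
def membership_entries(table_partitions, num_gsis, gsi_partitions_each):
    """Rough count of membership entries a table contributes: its own
    partitions plus one set of partitions per global secondary index."""
    return table_partitions + num_gsis * gsi_partitions_each

# A hypothetical large table with 1,000 base partitions:
print(membership_entries(1000, 0, 0))      # 1000 entries with no GSIs
print(membership_entries(1000, 2, 1000))   # 3000 entries with two similarly sized GSIs
```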
So, when the network disruption occurred on Sunday morning, and a number of
storage servers simultaneously requested their membership data, the metadata
service was processing some membership lists that were now large enough
that their processing time was near the time limit for retrieval. Multiple,
simultaneous requests for these large memberships caused processing to slow
further and eventually exceed the allotted time limit. This resulted in the
disrupted storage servers failing to complete their membership renewal,
becoming unavailable for requests, and retrying these requests. With the
metadata service now under heavy load, it also no longer responded as
quickly to storage servers uninvolved in the original network disruption,
which were checking their membership data on their normal refresh
cadence. Many of those storage servers also became
unavailable for handling customer requests. Unavailable servers continued to
retry requests for membership data, maintaining high load on the metadata
service. Though many storage servers’ renewal requests were succeeding,
healthy storage servers that had successfully processed a membership request
previously were having subsequent renewals fail and were transitioning back
to an unavailable state. By 2:37am PDT, the error rate in customer requests
to DynamoDB had risen far beyond any level experienced in the last 3 years,
finally stabilizing at approximately 55%.
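
The dynamics here are a classic retry storm: every server whose request timed out asked again right away, keeping the metadata service saturated. A common way to dampen that kind of storm on the requesting side is capped exponential backoff with jitter; the sketch below is a generic illustration of the technique, not a description of what DynamoDB's storage servers actually do.

```python
import random
import time

def retry_with_backoff(call, base_delay=1.0, max_delay=60.0, max_attempts=10):
    """Retry `call` with capped exponential backoff and full jitter, so a fleet
    of timed-out clients does not hammer a struggling dependency in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))   # full jitter spreads retries out
    raise TimeoutError("gave up after %d attempts" % max_attempts)
```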
Initially, we were unable to add capacity to the metadata service because it
was under such high load, preventing us from successfully making the
requisite administrative requests. After several failed attempts at adding
capacity, at 5:06am PDT, we decided to pause requests to the metadata
service. This action decreased retry activity, which relieved much of the
load on the metadata service. With the metadata service now able to respond
to administrative requests, we were able to add significant capacity. Once
these adjustments were made, we were able to reactivate requests to the
metadata service, put storage servers back into the customer request path,
and allow normal load back on the metadata service. At 7:10am PDT, DynamoDB
was restored to error rates low enough for most customers and AWS services
dependent on DynamoDB to resume normal operations.
There’s one other bit worth mentioning. After we resolved the key issue on
Sunday, we were left with a low error rate, hovering between 0.15% and 0.25%. We
knew there would be some cleanup to do after the event, and while this rate
was higher than normal, it wasn’t a rate that usually precipitates a
dashboard post or creates issues for customers. As Monday progressed, we
started to get more customers opening support cases about being impacted by
tables being stuck in the updating or deleting stage or higher than normal
error rates. We did not realize soon enough that this low overall error rate
was giving some customers disproportionately high error rates. It was
impacting a relatively small number of customers, but we should have posted
the green-i to the dashboard sooner than we did on Monday. The issue turned
out to be a metadata partition that was still not taking the amount of
traffic that it should have been taking. The team worked carefully and
diligently to restore that metadata partition to its full traffic volume,
and closed this out on Monday.
There are several actions we'll take immediately to avoid a recurrence of
Sunday's DynamoDB event. First, we have already significantly increased the
capacity of the metadata service. Second, we are instrumenting stricter
monitoring on performance dimensions, such as the membership size, to allow
us to thoroughly understand their state and proactively plan for the right
capacity. Third, we are reducing the rate at which storage nodes request
membership data and lengthening the time allowed to process queries. Finally,
and longer term, we are segmenting the DynamoDB service so that it will
have many instances of the metadata service each serving only portions of
the storage server fleet. This will further contain the impact of software,
performance/capacity, or infrastructure failures.
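
The last of these actions, many metadata service instances each owning part of the storage fleet, is essentially sharding the metadata service by storage server. One simple way to express such a mapping (purely illustrative; AWS has not described how the segmentation is implemented) is a stable hash from server ID to shard:

```python
import hashlib

# Hypothetical shard names; the real fleet layout is not public.
METADATA_SHARDS = ["metadata-shard-%d" % i for i in range(8)]

def metadata_shard_for(server_id, shards=METADATA_SHARDS):
    """Stable mapping from a storage server to the metadata service instance
    that owns its membership data, so an overload or failure in one shard is
    contained to a fraction of the storage fleet."""
    digest = hashlib.sha256(server_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

print(metadata_shard_for("storage-server-0421"))
```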
Impact on Other Services
There are several other AWS services that use DynamoDB that experienced
problems during the event. Rather than list them all, which had similar
explanations for their status, we’ll list a few that customers asked us
about most, or where the actions are more independent from DynamoDB’s
Correction of Errors (“COE”).
Simple Queue Service (SQS)
In the early stages of the DynamoDB event, the Amazon Simple Queue Service
was delivering slightly elevated errors and latencies. Amazon SQS uses an
internal table stored in DynamoDB to store information describing its queues
. While the queue information is cached within SQS, and is not in the direct
path for “send-message” and “receive-message” APIs, the caches are
refreshed frequently to accommodate creation, deletion, and reassignment
across infrastructure. When DynamoDB finished disabling traffic at 5:45am
PDT (to enable the metadata service to recover), the Simple Queue Service
was unable to read this data to refresh caches, resulting in significantly
elevated error rates. Once DynamoDB began re-enabling customer traffic at 7:
10am PDT, the Simple Queue Service recovered. No data in queues, or
information describing queues was lost as a result of the event.
In addition to the actions being taken by the DynamoDB service, we will be
adjusting our SQS metadata caching to ensure that send and receive
operations continue even without prolonged access to the metadata table.
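
That adjustment amounts to letting the cached queue metadata keep serving sends and receives while the backing DynamoDB table is unreachable. A minimal serve-stale cache along those lines (illustrative only; the loader callable and refresh interval are assumptions, not the SQS implementation):

```python
import time

REFRESH_INTERVAL_SECONDS = 60.0   # assumed refresh cadence


class QueueMetadataCache:
    """Refresh queue metadata from the backing table when possible, but keep
    serving the last known value when the refresh fails."""

    def __init__(self, loader):
        self.loader = loader     # e.g. a function that reads the DynamoDB metadata table
        self.entries = {}        # queue_name -> (metadata, fetched_at)

    def get(self, queue_name):
        entry = self.entries.get(queue_name)
        fresh = entry is not None and time.time() - entry[1] < REFRESH_INTERVAL_SECONDS
        if not fresh:
            try:
                entry = (self.loader(queue_name), time.time())
                self.entries[queue_name] = entry
            except Exception:
                if entry is None:
                    raise        # nothing cached yet; we genuinely cannot serve
                # Backing store unavailable: serve the stale entry rather than
                # fail the send/receive path.
        return entry[0]
```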
EC2 Auto Scaling
Between 2:15am PDT and 7:10am PDT, the EC2 Auto Scaling Service delivered
significantly increased API faults. From 7:10am PDT to 10:52am PDT, the Auto
Scaling service was substantially delayed in bringing new instances into
service, or terminating existing unhealthy instances. Existing instances
continued to operate properly throughout the event.
Auto Scaling stores information about its groups and launch configurations
in an internal table in DynamoDB. When DynamoDB began to experience elevated
error rates starting at 2:19am PDT, Auto Scaling could not update this
internal table when APIs were called. Once DynamoDB began recovery at 7:10am
PDT, the Auto Scaling APIs recovered. Recovery was incomplete at this time,
as a significant backlog of scaling activities had built up throughout the
event. The Auto Scaling service executes its launch and termination
activities in a background scheduling service. Throughout the event, a very
large number of pending activities built up in this job scheduler, and it
took until 10:52am PDT to complete all of these tasks.
In addition to the actions taken by the DynamoDB team, to ensure we can
recover quickly when a large backlog of scaling activities accumulates, we
will adjust the way we partition work on the fleet of Auto Scaling servers
to allow for more parallelism in processing these jobs, integrate mechanisms
to prune older scaling activities that have been superseded, and increase
the capacity available to process scaling activities.
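
One of those mechanisms, pruning superseded activities, can be sketched as collapsing the backlog so that only the newest pending activity per Auto Scaling group gets executed. The data model below is a guess for illustration, not the Auto Scaling service's internals.

```python
def prune_superseded(pending_activities):
    """Keep only the most recent pending activity per Auto Scaling group;
    older requests for the same group have been superseded and can be dropped
    instead of executed one by one.

    Each activity is assumed to look like:
        {"group": "web-asg", "requested_at": 1442736000, "desired_capacity": 12}
    """
    latest = {}
    for activity in pending_activities:
        current = latest.get(activity["group"])
        if current is None or activity["requested_at"] > current["requested_at"]:
            latest[activity["group"]] = activity
    return list(latest.values())

backlog = [
    {"group": "web-asg", "requested_at": 100, "desired_capacity": 8},
    {"group": "web-asg", "requested_at": 200, "desired_capacity": 12},
    {"group": "worker-asg", "requested_at": 150, "desired_capacity": 3},
]
print(prune_superseded(backlog))   # one activity per group survives
```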
CloudWatch
Starting at 2:35am PDT, the Amazon CloudWatch Metrics Service began
experiencing delayed and missing EC2 Metrics along with slightly elevated
errors. CloudWatch uses an internal table stored in DynamoDB to add
information regarding Auto Scaling group membership to incoming EC2 metrics.
From 2:35am PDT to 5:45am PDT, the elevated DynamoDB failure rates caused
intermittent availability of EC2 metrics in CloudWatch. CloudWatch also
observed an abnormally low rate of metrics publication from other services
that were experiencing issues over this time period, further contributing to
missing or delayed metrics.
Then, from approximately 5:51am PDT to 7:10am PDT, CloudWatch delivered
significantly elevated error rates for PutMetricData calls affecting all AWS
Service metrics and custom metrics. The impact was due to the significantly
elevated error rates in DynamoDB for the group membership additions
mentioned above. The CloudWatch Metrics Service was fully recovered at 7:
29am PDT.
We understand how important metrics are, especially during an event. To
further increase the resilience of CloudWatch, we will adjust our caching
strategy for the DynamoDB group membership data and only require refresh for
the smallest possible set of metrics. We also have been developing faster
metrics delivery through write-through caching. This cache will provide the
ability to present metrics directly before persisting them and will, as a
side benefit, provide additional protection during an event.
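
A write-through cache in this sense makes the latest datapoint visible to readers as soon as it is written, rather than only after the durable store acknowledges it. A minimal sketch of the pattern (the class and the persist hook are assumptions for illustration, not CloudWatch's design):

```python
class WriteThroughMetricCache:
    """Serve the most recent datapoint from memory immediately, while also
    writing it through to durable storage."""

    def __init__(self, persist):
        self.persist = persist   # callable that writes a datapoint to the durable store
        self.latest = {}         # metric_name -> most recent datapoint

    def put(self, metric_name, datapoint):
        # Make the datapoint visible to readers before persistence completes,
        # so a slow or failing store does not hide fresh metrics.
        self.latest[metric_name] = datapoint
        try:
            self.persist(metric_name, datapoint)
        except Exception:
            pass                 # persistence is retried elsewhere; reads still see the value

    def get_latest(self, metric_name):
        return self.latest.get(metric_name)
```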
Console
The AWS Console was impacted for some customers from 5:45am PDT to 7:10am
PDT. Customers who were already logged into the Console would have continued
to remain connected. Customers attempting to log into the Console during
this period saw much higher latency in the login process. This was due to a
very long timeout being set on an API call that relied on DynamoDB. The API
call did not have to complete successfully to allow login to proceed but,
with the long timeout, it blocked progress for tens of seconds while it
waited to finish. It should have simply failed quickly and allowed progress
on login to continue.
The timeout had already been changed in a version of the login code that had
entered our test process. Unfortunately, it wasn’t yet rolled out when the
event happened. We will make this change in the coming days. The reduced
timeout will mitigate any impact of latency in the API call on the Console.
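
The fix amounts to treating the DynamoDB-backed call as optional: give it a short timeout and let login proceed if it fails. A sketch of that shape (the function name and the one-second budget are illustrative assumptions, not the Console's code):

```python
import concurrent.futures

OPTIONAL_CALL_TIMEOUT_SECONDS = 1.0   # assumed short budget; fail fast instead of blocking login


def fetch_optional_login_data(fetch, executor):
    """Run a non-critical, DynamoDB-backed call with a short timeout.
    If it is slow or failing, return None and let login continue."""
    future = executor.submit(fetch)
    try:
        return future.result(timeout=OPTIONAL_CALL_TIMEOUT_SECONDS)
    except Exception:
        # Timeout or failure: degrade gracefully rather than stalling the login
        # flow for tens of seconds.
        return None
```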
Final Words
We apologize for the impact to affected customers. While we are proud of the
last three years of availability on DynamoDB (it’s effectively been 100%),
we know how critical this service is to customers, both because many use it
for mission-critical operations and because AWS services also rely on it.
For us, availability is the most important feature of DynamoDB, and we will
do everything we can to learn from the event and to avoid a recurrence in
the future.
g*****g
Posts: 34805
2
If they'd just run a latency monkey against Dynamo every now and then, it wouldn't have blown up into something this big.

w**z
Posts: 8232
3
The client retry handling wasn't done well. I ran into something similar once, on C* (Cassandra); in the end we had to cut off all the clients before it could recover.

x*******1
Posts: 28835
4
Double fault: GSIs caused a geometric blow-up in metadata service queries + heavily partitioned tables + max burst enabled. The question is why the cache broke, and why capacity couldn't be added dynamically. DDB's failover design clearly has problems.
x*******1
Posts: 28835
5
For one thing, there was no emergency stop designed in: a single switch to turn off max burst.
x*******1
Posts: 28835
7
The cache broke because oncall bounced the machines, which invalidated the cache; the request storm formed right at that moment. The double fault between the SN and the RR is what caused this LSE.
g*****g
Posts: 34805
8
Design mistakes and bugs of one kind or another are unavoidable; proactively simulating outages and added latency in production, in a controlled way, is an effective way to make a system more robust.
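
For a sense of what that kind of controlled injection can look like, here is a toy latency-injection wrapper (not Netflix's latency monkey or any Amazon tooling; the rate and delay range are made-up knobs):

```python
import random
import time

INJECTION_RATE = 0.01             # fraction of calls to delay (assumed knob)
DELAY_RANGE_SECONDS = (0.5, 3.0)  # artificial latency to add (assumed knob)

def with_latency_injection(call):
    """Wrap a dependency call so a small fraction of requests see extra latency,
    exercising timeout and retry paths before a real incident does."""
    def wrapped(*args, **kwargs):
        if random.random() < INJECTION_RATE:
            time.sleep(random.uniform(*DELAY_RANGE_SECONDS))
        return call(*args, **kwargs)
    return wrapped
```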

x*******1
Posts: 28835
9
S3 once had an LSE because a NIC was flipping bits, and what came through happened to be the dynamic throttle rules; the maddening part is that S3 didn't validate the rule. In the end.....