[US Region] Elevated error rate of Stream import API
Incident Report for Treasure Data


On September 3rd, 2019 at around 23:28 PDT (September 4th, 2019 at around 15:28 JST), we deployed a new version of the Streaming Import API to all production environments.

In the US region, the deployment caused side effects that culminated in a steep increase in Streaming Import delay, that is, the time between the receipt of a Streaming Import request and the completion of its processing in the Streaming Import backend. The additional delay persisted for around 7 hours.

Over the following 6 hours, the team performed a number of operations that eventually mitigated the situation and restored the system to its usual performance levels.

Impact to customers

All customers using the Streaming Import API through fluentd or td-agent were affected by the increased delay. About 80% of Import requests failed during the incident, causing fluentd / td-agent to buffer the data for an extended amount of time.

All customers using the JavaScript SDK, Mobile SDKs, Postback API, and Audit logging were affected by the Import delay as well, because those systems rely on the Streaming Import subsystem to function.

As an effect of the instability of the Streaming Import API, some customers were affected by data duplication, that is, the data payload of some of their failed requests was imported two or more times.

All customers we have determined to have suffered duplicated imports will be contacted and provided details by our Support staff.


Incident timeline

At 23:28 PDT on September 3rd (15:28 JST on September 4th) we deployed an updated version of the Streaming Import API to all production regions: US, JP, and EU.

At 23:38 PDT (15:38 JST) we noticed the first symptoms of the performance degradation: 80% of the Import requests were failing. The failures caused the fluentd / td-agent clients to start buffering the data and retrying the Import at a fixed interval. The extended buffering built up a large backlog of data, and consequently of requests, in the clients.
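The client-side behavior described above can be sketched as follows. This is a simplified illustration, not fluentd's actual implementation; the retry interval and the always-failing `send` stub are assumptions for the sketch:

```python
import time
from collections import deque

RETRY_INTERVAL = 0.01  # seconds; shortened here, real clients wait much longer

def send(chunk):
    """Stand-in for the HTTP POST to the Streaming Import API.

    Always fails here, mimicking the request failures seen during
    the incident."""
    return False

buffer = deque([{"records": [1, 2, 3]}])  # chunks waiting to be imported

# Failed chunks stay buffered and are re-sent at a fixed interval,
# so a long outage builds a large backlog on every client.
for _ in range(3):
    if buffer and send(buffer[0]):
        buffer.popleft()            # success: drop the delivered chunk
    else:
        time.sleep(RETRY_INTERVAL)  # failure: wait, then retry the same chunk

assert len(buffer) == 1  # nothing was delivered; the backlog persists
```

When every client in a region runs this loop during a multi-hour outage, the buffered chunks all return at once upon recovery, which is exactly the flood described later in this report.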

Starting at 00:30 PDT on September 4th (16:30 JST on September 4th) we observed degraded performance of the connections to the Import task queues, and the number of connections to the two queues (typically expected to be comparable) became imbalanced. We restarted one of the task queues, but this did not resolve the issue.
We repeatedly attempted to roll back the Streaming Import API version but were prevented by the instance lifecycle termination restrictions, which require an instance to complete its in-flight work before being shut down. This eventually led to a large number of Streaming Import API instances being marked as unhealthy and taken out of service by the Auto Scaling Group, leaving the fleet without sufficient capacity to handle the load.

At 05:05 PDT (21:05 JST), we opted to create a new Streaming Import API fleet running the previous version of the software and started shifting traffic onto it. This introduced just enough capacity to take performance out of the warning zone.

However, by 05:40 PDT (21:40 JST), the Streaming Import API was inundated by a large flood of requests from the fluentd / td-agent clients, which were attempting to re-send the Import requests that had failed while the fleet was unable to process them. At 06:31 PDT (22:31 JST), we began introducing an additional (third) Streaming Import task queue to bring in the extra capacity needed to consume the request backlog. This was completed around 07:31 PDT (23:31 JST).

Starting at 08:14 PDT on September 4th (00:14 JST on September 5th) we observed a recurrence of the same performance degradation in the Streaming Import API. Troubleshooting and investigation ensued, leading us to discover the root cause: an implicit and unexpected upgrade (from version 5.6 to 5.7) of the MySQL client on the Streaming Import API instances. The upgrade brought with it a change in the default behavior of the client which, coupled with our specific usage pattern, caused performance to drop drastically under high load (such as that seen in the Streaming Import pipeline in the US region).

By 09:06 PDT (01:06 JST) we had prepared a change reverting the client to its original default configuration, tested it, and rolled it out. Streaming Import performance returned to standard levels and, over the following 7 hours, the backlog of requests was entirely consumed and the Import delay returned to normal levels.

Afterwards, we began investigating the impact of the incident on records imported by Streaming Import: we found that between 23:38 PDT (15:38 JST) and 05:05 PDT (21:05 JST) some requests had been imported twice or more (worst case: 15 times).
The duplication was caused by the performance degradation of the Streaming Import API: the payload of some requests was successfully processed and acquired, but the API failed to respond to the client in a timely manner, causing a timeout at the Application Load Balancer. The fluentd / td-agent clients reacted to the timeout by re-sending the exact same request; whenever this anomalous situation repeated itself, the request's data payload was imported two or more times.
Our Streaming Import API has a built-in deduplication mechanism. Due to the sheer number and size of the Import requests, we have so far limited the deduplication window to 20 minutes: this is normally sufficient to absorb sporadic failures or short planned maintenance windows, but this incident lasted significantly longer, rendering the deduplication protection insufficient.
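A time-windowed deduplication mechanism of this kind can be sketched as follows. This is a minimal illustration, not our production implementation; the class, the request-key scheme, and the injectable clock are assumptions for the sketch:

```python
import time

DEDUP_WINDOW = 20 * 60  # seconds; the 20-minute window mentioned above

class Deduplicator:
    """Remembers request keys seen within the window and rejects repeats."""

    def __init__(self, window=DEDUP_WINDOW, clock=time.time):
        self.window = window
        self.clock = clock
        self.seen = {}  # request key -> time it was last accepted

    def accept(self, key):
        now = self.clock()
        # Evict entries older than the window so memory stays bounded.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t < self.window}
        if key in self.seen:
            return False  # duplicate within the window: drop it
        self.seen[key] = now
        return True

# A retried request inside the window is deduplicated...
t = [0.0]
d = Deduplicator(clock=lambda: t[0])
assert d.accept("req-123") is True
t[0] = 5 * 60
assert d.accept("req-123") is False   # retry at +5 min: dropped
# ...but a retry after the window has expired slips through.
t[0] = 25 * 60
assert d.accept("req-123") is True    # retry at +25 min: duplicated
```

The last assertion shows exactly the failure mode of this incident: retries arriving hours after the original request fall outside the window and are imported again.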


Remediation

We are investigating several possible remediation actions:

  • Introduce load and performance testing in a safe, pre-production environment prior to deploying a new release of the Streaming Import API.
  • Introduce configuration changes to ensure the behavior of the MySQL client is what we expect at all times, regardless of the underlying version, and verify these changes in the pre-production environment.
  • Explicitly fix the version of the base image and container dependencies (a.k.a. version pinning), including the MySQL client version, in all our environments. Standardize this across all services and instance types.
  • Evaluate the possibility of increasing the deduplication window beyond the current 20 minutes.
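As an illustration of the version-pinning idea, a startup check along these lines could fail fast when a dependency drifts. The helper names, the expected "5.6" pin, and the check itself are hypothetical, not our actual tooling:

```python
import re
import subprocess

PINNED_MYSQL_CLIENT = "5.6"  # major.minor we expect (illustrative pin)

def installed_mysql_client_version(output=None):
    """Parse the client version out of `mysql --version` output.

    `output` may be supplied directly for testing; otherwise the
    command is executed on the host."""
    if output is None:
        output = subprocess.run(["mysql", "--version"],
                                capture_output=True, text=True).stdout
    match = re.search(r"\b(\d+\.\d+)\.\d+", output)
    return match.group(1) if match else None

def check_pin(output=None):
    """Raise at startup if the installed client does not match the pin."""
    version = installed_mysql_client_version(output)
    if version != PINNED_MYSQL_CLIENT:
        raise RuntimeError(
            f"MySQL client {version} does not match pin {PINNED_MYSQL_CLIENT}")

# Example against captured output rather than a live binary:
sample = "mysql  Ver 14.14 Distrib 5.6.44, for Linux (x86_64)"
assert installed_mysql_client_version(sample) == "5.6"
```

Pinning at the package-manager or base-image level remains the primary fix; a check like this is just a second line of defense that turns silent drift into a loud deployment failure.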


We know how critical the reliability of our system is to you and your business, and this is especially true for Streaming Import. We regret both the Import delay and the data duplication this incident caused, and we sincerely apologize for the inconvenience and trouble.

Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.

Posted Sep 09, 2019 - 18:08 PDT

This Streaming Import incident started at 11:30 pm Sep 3 PDT, triggered by an API release. It caused an elevated Streaming Import API error rate and slow API responses, which in turn delayed query engine execution. At 08:30 am Sep 4 PDT the Streaming Import API returned to normal; however, the flood of Import requests caused by the incident kept the import delay elevated until 5:00 pm Sep 4 PDT. During the delay period, Streaming Import and the Mobile/JavaScript REST API experienced visibility delays of up to 2 hours.

After the Streaming Import delay was resolved, we kept monitoring our system. We experienced a separate job submission error issue, described at https://status.treasuredata.com/incidents/zs55rsqkg189; however, Streaming Import and the other components are operating normally. This incident is resolved.
Posted Sep 05, 2019 - 00:50 PDT
Visibility delay has returned to normal (under 1 minute). We are continuing to investigate some remaining storage errors, but expect these will not significantly affect our customers' experience.
Posted Sep 04, 2019 - 17:38 PDT
Visibility delay has now decreased to an average of approximately 45 minutes and is continuing to drop as expected.
Posted Sep 04, 2019 - 15:59 PDT
Visibility delay has now decreased to an average of just below 1.5 hours (approximately 80 minutes) and is maintaining a downward trend.
Posted Sep 04, 2019 - 14:58 PDT
Visibility delay is approaching an average of 1.75 hours, though a brief interruption from a database failover caused a small pause in processing.
Posted Sep 04, 2019 - 14:00 PDT
Our visibility delay continues to average approximately 2 hours, though we have begun to see a downward trend in the extremes.
Posted Sep 04, 2019 - 13:01 PDT
Our visibility delay remains at approximately 2 hours, while recent capacity adjustments have begun to positively impact our backlog.
Posted Sep 04, 2019 - 12:00 PDT
Our visibility delay is currently holding steady at approximately 2 hours while client-side Fluentd installations continue to flush their pending buffers. We have made some adjustments to the distribution of our internal capacity to attempt to accelerate our internal backlog processing further.
Posted Sep 04, 2019 - 10:58 PDT
We are experiencing an increase in visibility delay to approximately 2 hours as client-side Fluentd installations flush their pending buffers.
Posted Sep 04, 2019 - 09:59 PDT
We have identified the cause of the Import API issues and have deployed a fix. Our earlier internal processing capacity increase is being used to accelerate backlog processing. Current visibility delay is approximately 1.5 hours. We will update this status page with the visibility delay every hour until the incident is resolved.
Posted Sep 04, 2019 - 09:35 PDT
We are working to expand import capacity; throughput has recovered.
However, due to the Streaming Import backlog, it currently takes about 1 hour for imported data to become visible from the query engines. We are working to resolve the import delay completely.
Posted Sep 04, 2019 - 07:36 PDT
Our Streaming Import API endpoint has been restored after recovery operations and is now receiving customer traffic as expected. We are working to adjust the capacity of the backend workers to process the increased amount of traffic.
Posted Sep 04, 2019 - 05:35 PDT
Connectivity between the application and the database is unstable. We are investigating using the previous stable revision.
Posted Sep 04, 2019 - 04:58 PDT
We are still investigating and implementing a fix to the problem.

Currently we have confirmed that:
- Ingestion throughput has decreased by 80%.
- Streaming Import (including the JS SDK / Mobile SDK / Postback SDK / Audit logging) is affected by delays.

We are still working to find the root cause of the issues. We will update the status again as we find out more.
Posted Sep 04, 2019 - 03:56 PDT
A fix is still being implemented.
Posted Sep 04, 2019 - 03:26 PDT
A fix is still being implemented.
Posted Sep 04, 2019 - 02:36 PDT
The issue has been identified and a fix is being implemented.
Posted Sep 04, 2019 - 00:16 PDT
We are observing elevated error rate of Streaming import API and investigating the issue.
Posted Sep 03, 2019 - 23:56 PDT
This incident affected: US (REST API, Streaming Import REST API, Mobile/Javascript REST API, Hadoop / Hive Query Engine, Presto Query Engine).