On September 3rd, 2019 at around 23:28 PDT (September 4th, 2019 at 15:28 JST), we deployed a new version of the Streaming Import API to all production environments.
In the US region, the deployment caused side effects that culminated in a steep increase in Streaming Import delay, that is, the time between the receipt of a Streaming Import request and the completion of its processing in the Streaming Import backend. The additional delay persisted for around 7 hours.
Over the following 6 hours, the team performed a number of operations that eventually mitigated the situation and restored the system to its usual performance levels.
All customers using the Streaming Import API through fluentd or td-agent were affected by the increased delay. About 80% of Import requests failed during the incident, causing fluentd / td-agent to buffer the data for an extended amount of time.
All customers using the JavaScript SDK, Mobile SDKs, Postback API, and Audit logging were affected by the Import delay as well, because those systems rely on the Streaming Import subsystem to function.
As an effect of the instability of the Streaming Import API, some customers were affected by data duplication, that is, the data payload of some of their failed requests was imported two or more times.
All customers who we have determined suffered from duplicated imports will be contacted and provided with details by our Support staff.
At 23:28 PDT of September 3rd (15:28 JST of September 4th) we deployed an updated version of the Streaming Import API to all production regions: US, JP, and EU.
At 23:38 PDT (15:38 JST) we noticed the first symptoms of the performance degradation: 80% of the Import requests failed. The request failures caused the fluentd / td-agent clients to start buffering the data, regularly attempting to re-import it after a fixed interval. The extended buffering built up a large backlog of data, and consequently of requests, in the clients.
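The buffering behavior described above can be sketched as follows. This is an illustrative model of a fixed-interval retry loop, not fluentd's actual implementation; the function names and parameters are assumptions for the example:

```python
import time
from collections import deque

# Illustrative model of the client behavior described above: failed import
# requests stay buffered and are retried at a fixed interval, so a long
# outage accumulates a large backlog of pending requests.

def run_client(requests, send, retry_interval=1.0, max_rounds=5):
    """Send each request; keep failures buffered and retry them each round.

    Returns whatever is still unsent after max_rounds, i.e. the backlog.
    """
    backlog = deque(requests)
    for _ in range(max_rounds):
        if not backlog:
            break
        pending = deque()
        while backlog:
            req = backlog.popleft()
            if not send(req):           # request failed: keep it in the buffer
                pending.append(req)
        backlog = pending
        if backlog:
            time.sleep(retry_interval)  # fixed retry interval between rounds
    return list(backlog)
```

If `send` fails for most requests over many rounds, nearly all of the input ends up in the returned backlog, mirroring the build-up observed during the incident.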
Starting at 00:30 PDT of September 4th (16:30 JST of September 4th), we observed degraded performance of the connection to the Import task queue, and the number of connections between the two queues (typically expected to be comparable) became imbalanced. We restarted one of the task queues, but that did not resolve the issue.
We repeatedly attempted to roll back the Streaming Import API version, but were prevented from doing so by instance lifecycle termination restrictions, which require an instance to complete its work before being shut down. This eventually led to a large number of Streaming Import API instances being marked as unhealthy and taken out of service by the Auto Scaling Group, leaving the fleet without sufficient capacity to handle the load.
From 05:05 PDT (21:05 JST), we opted to create a new Streaming Import API fleet running the previous version of the software and started shifting traffic onto it. This introduced just enough capacity to take performance out of the warning zone.
However, by 05:40 PDT (21:40 JST), the Streaming Import API was inundated by a large flood of requests from the fluentd / td-agent clients, which were attempting to re-send the Import requests that had failed while the fleet was unable to process them. At 06:31 PDT (22:31 JST), we began to introduce an additional (third) Streaming Import task queue to bring in the extra capacity needed to consume the request backlog. This was completed around 07:31 PDT (23:31 JST).
Starting at 08:14 PDT of September 4th (00:14 JST of September 5th), we observed a recurrence of the same performance degradation in the Streaming Import API. Troubleshooting and investigation ensued, leading us to discover that the root cause of the performance degradation was an implicit and unexpected upgrade (from version 5.6 to 5.7) of the MySQL client on the Streaming Import API instances. The upgrade brought with it a change in the client's default behavior which, coupled with our specific usage pattern, caused performance to drop drastically under high load (such as that seen in the Streaming Import pipeline in the US region).
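One general guard against this class of failure is to fail fast at startup when a dependency's runtime version no longer matches the version the service was validated against. A minimal sketch in Python; the function name, the version strings, and the check itself are illustrative assumptions, not part of the actual service:

```python
# Fail fast at startup if a client library's version has drifted from the
# series the service was tested against. The pinned "5.6" is illustrative.

EXPECTED_CLIENT_SERIES = "5.6"

def check_client_version(actual, expected=EXPECTED_CLIENT_SERIES):
    """Raise if the major.minor series of the client differs from expected."""
    series = ".".join(actual.split(".")[:2])
    if series != expected:
        raise RuntimeError(
            f"MySQL client {actual} does not match pinned series {expected}; "
            "refusing to start to avoid untested default behavior."
        )

check_client_version("5.6.41")      # matches the pinned series: no error
try:
    check_client_version("5.7.23")  # the implicit-upgrade scenario
except RuntimeError as exc:
    print("startup blocked:", exc)
```

A check like this turns a silent behavior change into a loud deployment failure, which is far easier to diagnose than degraded performance under load.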
By 09:06 PDT (01:06 JST) we had developed a change reverting the behavior to the original default configuration, tested it, and rolled it out. Streaming Import performance returned to standard levels and, over the following 7 hours, the backlog of requests was entirely consumed and the Import delay returned to normal.
Afterwards, we began investigating the impact of the incident on records imported by Streaming Import: we found that between 23:38 PDT (15:38 JST) and 05:05 PDT (21:05 JST) some requests had been imported twice or more (in the worst case, 15 times).
The duplication was caused by the performance degradation of the Streaming Import API: the payload of some requests was successfully processed and acquired, but the API failed to respond to the client in a timely manner, causing a timeout at the Application Load Balancer. The fluentd / td-agent clients reacted to the timeout by resending the exact same request; whenever this anomalous situation repeated itself, the request's data payload was imported two or more times.
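The duplication mechanism can be illustrated with a small model. The names and the timeout simulation below are assumptions for the example, not the actual service code:

```python
# Model of the failure mode described above: the backend ingests the payload,
# but the response times out at the load balancer, so the client retries and
# the same payload is ingested again.

imported = []  # stands in for the backend's store

def server_ingest(payload, responds_in_time):
    imported.append(payload)   # the payload is acquired either way
    return responds_in_time    # False models an ALB timeout on the response

def client_send_with_retry(payload, response_outcomes):
    for ok in response_outcomes:
        if server_ingest(payload, ok):  # got a response: stop retrying
            return True
    return False  # every attempt timed out; client keeps the payload buffered

# First attempt times out, the retry succeeds: the payload lands twice.
client_send_with_retry("event-1", [False, True])
print(imported)  # → ['event-1', 'event-1']
```

From the client's point of view the first attempt failed, so retrying is the correct behavior; only the server can tell that the "failed" request was in fact ingested, which is why deduplication has to happen server-side.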
Our Streaming Import API has a built-in deduplication mechanism. Due to the sheer volume and size of import requests, we have so far limited the deduplication window to 20 minutes: this is normally sufficient to cover sporadic failures or short planned maintenance windows, but this incident lasted significantly longer, rendering the deduplication protection insufficient.
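A time-windowed deduplication cache of this kind can be sketched as follows. This is an illustrative model, not the actual implementation; the class name, the use of request IDs, and the injectable clock are assumptions for the example:

```python
import time

WINDOW_SECONDS = 20 * 60  # the 20-minute window mentioned above

class DedupWindow:
    """Flag a request ID as a duplicate only if it was seen inside the window."""

    def __init__(self, window=WINDOW_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock      # injectable for testing
        self.seen = {}          # request_id -> timestamp of last sighting

    def is_duplicate(self, request_id):
        now = self.clock()
        # Forget entries older than the window.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.window}
        dup = request_id in self.seen
        self.seen[request_id] = now
        return dup
```

A request resent within 20 minutes is caught, but one resent hours later, as happened during this incident, passes the check and is imported again.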
We are investigating several possible remediation actions.
We know how critical the reliability of our system is to you and your business, and this is especially true for Streaming Import. We regret both the Import delay and the data duplication this incident caused, and we sincerely apologize for the inconvenience and trouble this issue has caused.
Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.