Streaming import and query start delay
Incident Report for Treasure Data
Postmortem

Summary

On Oct 17th, between 5:00 and 20:17 PDT, our Stream ingestion APIs experienced slow responses and degraded performance because of extremely high load and concurrency on the backend task queue.

For the duration of the incident, Stream ingestion API requests experienced significantly higher latency and sporadic failures due to timeouts. Since the APIs never stopped responding, albeit slowly and less reliably, the built-in request retrying of fluentd / td-agent ensured that Stream ingestion requests were not dropped. The slow response times eventually also impacted the Query submission APIs, although for much shorter periods of time.

Timeline

Starting at around 5:00 PDT, the Stream Import worker, the background process that fetches import requests from the backend task queue and processes the imported data, experienced a slowdown and became less effective at consuming tasks from the queue; the situation lasted for up to 15 minutes. Due to the extremely high concurrency and load on the Stream import API and backend (1.75 million records per second imported and counting), the slower processing quickly caused the queue to back up significantly and grow very long, so new incoming requests were served much later than they normally would have been.
The increased length of the backend task queue (around 150,000 tasks at this time) made every request to acquire a new import task much heavier and more expensive, placing additional load on the task queue itself.
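To illustrate why a longer backlog makes every task acquisition heavier, the sketch below models a hypothetical database-backed queue in Python. The schema, the missing index, and the "oldest pending task" query are assumptions made for the example, not a description of our actual backend; the point is only that the cost of each fetch can grow with the size of the pending set, and concurrent workers then contend on the same hot rows.

```python
# Illustrative sketch only: a hypothetical database-backed task queue, not our
# actual implementation. It shows how acquiring the next task can get heavier
# as the backlog grows when the "next pending task" lookup has no supporting
# index and must examine the whole pending set.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,        -- 'pending' or 'running'
        created_at REAL NOT NULL
    )""")

# Simulate a backlog comparable to the one observed (~150,000 pending tasks).
now = time.time()
conn.executemany(
    "INSERT INTO tasks (status, created_at) VALUES ('pending', ?)",
    [(now + i * 1e-4,) for i in range(150_000)])
conn.commit()

def acquire_next_task(conn: sqlite3.Connection):
    """Pick the oldest pending task and mark it running. Without an index on
    (status, created_at), the ORDER BY has to look at every pending row, so
    each fetch slows down as the backlog grows."""
    row = conn.execute(
        "SELECT id FROM tasks WHERE status = 'pending' "
        "ORDER BY created_at LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row[0]

start = time.time()
acquire_next_task(conn)
print(f"one dequeue against a 150k-task backlog took {time.time() - start:.4f}s")
```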

Within the first hour, and by 6:00 PDT, we started taking the first countermeasures, while continuing to preserve the reliability of the Stream ingestion pipeline and not allowing any Stream import requests to be ultimately dropped. The first measures applied, mostly tuning of database parameters, showed some initial performance improvements. Unfortunately, the ever-growing load eventually rendered this first effort ineffective, and the queue continued to grow, reaching up to 400,000 tasks.

We continued to monitor the incident and, after establishing that the situation persisted under a steady incoming load of requests, between 8:40 and 12:50 PDT we added two additional import task queues to better partition the tasks. While this operation did not improve the situation of the existing queue, it greatly helped control and cap the growth of the task backlog.
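For illustration, the sketch below shows one possible way to partition incoming import tasks across multiple queues by hashing a stable key, so that load is spread instead of concentrated on a single queue. The queue names and the account/database key are assumptions for the example and do not necessarily reflect the exact scheme we used.

```python
# Illustrative sketch only: spreading import tasks across several queues by
# hashing a stable key. Queue names and the routing key are assumptions.
import hashlib

IMPORT_QUEUES = ["import-tasks-0", "import-tasks-1", "import-tasks-2"]

def pick_queue(account_id: str, database: str) -> str:
    """Route a task to one of the queues based on a stable hash, so the same
    account/database always maps to the same queue and load is spread out."""
    key = f"{account_id}/{database}".encode("utf-8")
    bucket = int(hashlib.sha1(key).hexdigest(), 16) % len(IMPORT_QUEUES)
    return IMPORT_QUEUES[bucket]

# Example: different keys may land on different queues.
print(pick_queue("account-42", "web_logs"))
print(pick_queue("account-42", "app_events"))
```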

The heavy load on the task queue started to back-pressure the rest of the system and affect the response times of all APIs (not only Streaming ingestion): API server processes were held up for much longer than usual by the slow task queue responses, bringing the processing pool capacity of the APIs to near saturation. At this point some additional API requests, including Presto and Hive query submissions, started timing out, although much more sporadically.
To alleviate this situation, at 13:15 PDT we deployed a new API server pool to split the Stream ingestion API load in two. This helped reduce the API load and provided additional capacity to consume other API requests, although it unfortunately still had no significant effect on Stream import processing capacity.

Finally, after observing that the Stream import task queue backlog was still not being absorbed because the number of incoming import requests matched the processing capacity (which was throttled by the slow import task fetch time), at around 16:00 PDT we started preparing a brand new backend task queue, 5 times bigger than the original in terms of both storage and processing power. The new queue was ready by 18:30 PDT, and new incoming Stream imports started being routed to it at that time, leaving only the existing tasks (400,000 to 500,000 in total) in the original queue.
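Conceptually, the cut-over followed a simple pattern: new imports go only to the new, larger queue, while the workers attached to the original queue keep draining the backlog already sitting there. The sketch below is a simplified illustration of that pattern using in-process queues; the names and the worker-pool split are assumptions, not our production code.

```python
# Illustrative sketch only: route new work to a new, larger queue while the
# backlog in the original queue is drained by its own workers. In-process
# queues stand in for the real backend systems.
from queue import Empty, Queue
from typing import Optional

old_queue: Queue = Queue()   # still holds the existing 400,000-500,000 tasks
new_queue: Queue = Queue()   # the new, higher-capacity task queue

def enqueue_import(task: str) -> None:
    """After the cut-over, every new incoming import task goes to the new queue."""
    new_queue.put(task)

def next_task(worker_on_old_pool: bool) -> Optional[str]:
    """Each worker pool consumes only its own queue, so the old backlog drains
    independently of the new incoming traffic."""
    source = old_queue if worker_on_old_pool else new_queue
    try:
        return source.get_nowait()
    except Empty:
        return None
```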

At 20:15 PDT, the system had successfully processed all of the remaining tasks and the original queue became empty. The larger-capacity queue also proved able to withstand the request load, and no additional backlog of tasks was observed in it.

Effect on Customers

We continuously monitored the situation throughout the duration of the incident and took conservative measures that ensured the continuous, although slower, operation of the platform.

No data loss

Per our observations, no import request/task was irremediably dropped. This is primarily due to the built-in resilience of fluentd / td-agent towards sporadic network failures and timeouts. Failed requests are automatically retried by fluentd / td-agent up to a predefined number of attempts and amount of time, which is normally sufficient to ride out temporary issues.
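The pattern that protects against data loss is retry-with-backoff over a buffered chunk: the data stays in the td-agent buffer until a flush to our API succeeds, and transient timeouts only delay that flush. The Python sketch below illustrates the general pattern; the function names, limits, and the send_chunk() call are placeholders, not td-agent's actual code or configuration values.

```python
# Illustrative sketch only: retry-with-backoff over a buffered chunk, the
# general pattern fluentd / td-agent relies on. Names, limits, and send_chunk()
# are placeholders, not td-agent's actual implementation.
import random
import time

class TransientError(Exception):
    """Stands in for a request timeout or temporary API failure."""

def send_chunk(chunk: bytes) -> None:
    # Placeholder network call; fails often to exercise the retry path.
    if random.random() < 0.6:
        raise TransientError("request timed out")

def flush_with_retries(chunk: bytes, max_attempts: int = 10,
                       base_wait: float = 0.1, max_wait: float = 2.0) -> bool:
    """The chunk stays buffered (i.e. is not dropped) until a flush succeeds
    or the retry budget is exhausted; waits grow exponentially between tries."""
    for attempt in range(max_attempts):
        try:
            send_chunk(chunk)
            return True
        except TransientError:
            time.sleep(min(base_wait * (2 ** attempt), max_wait))
    return False  # only at this point would data be at risk of being dropped

print("flushed:", flush_with_retries(b"buffered import chunk"))
```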

Small chance of data duplication

There are two possible edge cases that could lead to data duplication.
The chance is very remote and unlikely, but still possible, especially while the backend task queue configuration and infrastructure were being updated.

Under normal circumstances, the Stream ingestion deduplication mechanism prevents duplicate imports, and fluentd / td-agent is set up to fully leverage and trigger that mechanism.
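Conceptually, deduplication works at the chunk level: each buffered chunk carries a unique identifier, and a retried chunk whose identifier has already been accepted is recognised and not imported twice. The sketch below is a simplified, hypothetical illustration of that idea; the in-memory store and helper function are not the actual server-side logic.

```python
# Illustrative sketch only: chunk-level deduplication keyed on a unique chunk
# identifier. The in-memory set stands in for the real server-side dedup store.
import uuid

seen_chunk_ids: set = set()

def import_chunk(chunk_id: str, records: list) -> str:
    """Accept a chunk only once; a retried chunk with an already-seen ID is
    treated as a duplicate and ignored."""
    if chunk_id in seen_chunk_ids:
        return "duplicate-ignored"
    seen_chunk_ids.add(chunk_id)
    # ... persist the records to the destination table ...
    return "imported"

chunk_id = uuid.uuid4().hex  # generated once per chunk on the client side
print(import_chunk(chunk_id, [{"time": 1476700000, "value": 1}]))  # imported
print(import_chunk(chunk_id, [{"time": 1476700000, "value": 1}]))  # retry -> ignored
```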

If you suspect that any data duplication may have occurred on your account, please contact us at your earliest convenience.

Query submission delay

From about halfway through the incident, and more consistently during the last 2 to 3 hours, additional API failures occurred, primarily due to timeouts. While these failures were sporadic, in a handful of cases they prevented other common requests from completing; in the worst cases new queries (Presto or Hive) could not be submitted, or started executing later than they normally would have.

Fortunately, these issues did not last for long periods of time.

Unexpected query results

Throughout the duration of the Stream ingestion slowdown, the major, reversible effect on customers was that import requests took much longer to be processed, principally because of the large import task backlog and the slow responses from the queue.

As a consequence, queries that would normally scan newly ingested data from recent Stream ingestion requests may not have found that data yet. While this does not mean that any data was lost, the results of those queries could differ from expectations.

Remediations

We learned that, while the ‘capacity on paper’ of our backend task queue might be much higher, there are multiple additional factors that can affect the performance and responsiveness of the queue. We are working on making sure all of these factors are well understood and factored into the future improvements we will introduce.

While we were able to put together a fairly elaborate resolution within a matter of hours, we realized the need to establish better remediation strategies and procedures to help us deal more effectively with the ‘unknown and unexpected’ that production systems tend to throw at us on a daily basis.

We are evaluating the final solution of distributing the backend task queue across separate systems in a more general context, and we are convinced this could provide additional room for the growth we know we ought to be prepared for.

Conclusions

We want to assure you that we took this incident very seriously and are determined to take the necessary steps and precautions to make sure it does not happen again. Please accept our sincere apologies for all the inconvenience these problems may have caused to your work and the continuity of your business.

Please don’t hesitate to contact our Support team at support@treasuredata.com should you have any questions.

Posted Oct 19, 2016 - 19:01 PDT

Resolved
We have completed processing all of the queued streaming import requests. Streaming imports as well as queries are now running without delay.
Posted Oct 17, 2016 - 20:17 PDT
Update
We've finished upgrading our RDS instance type.

We estimate that it will take another 2 hours and 30 minutes to catch up for all customers.
Posted Oct 17, 2016 - 18:42 PDT
Update
We're upgrading the RDS instance type to resolve this issue without any downtime or data loss.
Posted Oct 17, 2016 - 18:15 PDT
Update
Query starts have also been delayed because of the degraded performance of our backend DB. Sorry for the inconvenience.
Posted Oct 17, 2016 - 15:46 PDT
Update
We've deployed multiple fixes to mitigate the streaming import delay and are observing the import queue decreasing. All queued chunks are getting imported.

NOTE: Again, NO data loss has occurred. We're receiving all the data; this is an internal processing delay.
Posted Oct 17, 2016 - 14:57 PDT
Update
We are still observing the import delay and working on the fix.
Posted Oct 17, 2016 - 11:14 PDT
Monitoring
We're adjusting the configuration and the number of import worker instances to catch up on the delay. The current delay is 4400 seconds.
Posted Oct 17, 2016 - 07:50 PDT
Identified
The cause of this delay was temporary database lock contention.
Posted Oct 17, 2016 - 06:11 PDT
Investigating
We're observing import delays caused by degraded backend DB performance and are investigating the cause.

NOTE: NO data loss has occurred. All the data received on the cloud side is buffered safely; it is taking longer than usual to move data from the buffer to customers' databases / tables.
Posted Oct 17, 2016 - 05:49 PDT