From 10:20pm PDT on July 30th to 12:50am PDT on July 31st (2:20pm to 4:50pm JST on July 31st), Streaming Import REST API clients experienced request slowdowns and elevated error rates due to a backend database connectivity issue.
Following the resolution of that issue, from 1:00am to 2:30am PDT on July 31st (5:00pm to 6:30pm JST on July 31st), customers in the US region who use api.treasuredata.com / api-import.treasuredata.com experienced delays in their Streaming imports. During this period, imported events took 7 minutes to become visible/queryable.
We’d like to provide you with some additional information about the Streaming Import performance degradation.
Timeline
The timeline of this incident was:
- 10:18pm PDT: We started a release of the Streaming Import API.
- 10:27pm PDT: After the release, we observed that a large number of instances went offline and Streaming Import API performance was degraded.
- 10:38pm~10:49pm PDT: Half of the Streaming Import API instances were automatically terminated due to high CPU usage.
- 10:58pm PDT: We doubled the capacity of the Streaming Import API and temporarily disabled auto-scaling to prevent accidental scale-in.
- 11:15pm PDT: The load balancer had taken many Streaming Import API instances out of service due to health check timeouts. We relaxed the health check configuration to keep more instances in service (see the first sketch after this timeline).
- 11:47pm PDT: We noticed that the number of connections to a database used for the Streaming Import was consistently high.
- 11:50pm PDT: We progressively restarted half of the Streaming Import API instances.
- 12:07am PDT: We restarted one of the databases used for the Streaming Import.
- 12:15am PDT: We temporarily stopped half of the Streaming Import Worker services.
- 12:44am PDT: A large number of the existing instances were still flagged as unhealthy by the load balancer. We rotated some of the unhealthy instances.
- 12:51am PDT: We rolled back the application version of the Streaming Import API to the previous release.
- 12:53am PDT: Most of the instances became healthy and the performance of the Streaming Import API returned to normal levels.
- 01:00am PDT: Once the Streaming Import API returned to normal performance levels, the large amount of data that had queued up during the incident began importing all at once, flooding the pipeline.
- 01:32am PDT: We started to scale out the Streaming Import Worker cluster and raised the maximum connection-error limit on the database used for the Streaming Import (see the second sketch after this timeline).
- 02:30am PDT: The import backlog was fully consumed.
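As an illustration of the 10:58pm and 11:15pm mitigations, here is a minimal sketch assuming an AWS-style setup; the platform, resource names, and parameter values are not stated in this report and are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")
elbv2 = boto3.client("elbv2")

# Hypothetical resource identifiers; the real ones are not in this report.
ASG_NAME = "streaming-import-api"
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/streaming-import-api/..."

# 10:58pm: prevent accidental scale-in by suspending instance termination,
# then double the running capacity.
autoscaling.suspend_processes(
    AutoScalingGroupName=ASG_NAME,
    ScalingProcesses=["Terminate"],
)
current = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]["DesiredCapacity"]
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=current * 2,
    HonorCooldown=False,
)

# 11:15pm: relax the load balancer health check so instances that are slow
# but still working are not taken out of service (values are illustrative).
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=10,  # tolerate slower responses under load
    UnhealthyThresholdCount=5,     # require more consecutive failures
)
```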
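And a second sketch, for the 01:32am mitigation, assuming a MySQL-compatible database where the max_connect_errors limit can block hosts after repeated failed connection attempts; the worker cluster name, hostnames, credentials, and values are hypothetical:

```python
import boto3
import pymysql

# Scale out the Streaming Import Worker cluster (hypothetical ASG name/size).
boto3.client("autoscaling").set_desired_capacity(
    AutoScalingGroupName="streaming-import-worker",
    DesiredCapacity=40,  # illustrative value
    HonorCooldown=False,
)

# Raise the connection-error limit so hosts that failed to connect during
# the outage are not blocked by the database (MySQL-compatible syntax).
conn = pymysql.connect(
    host="streaming-import-db.internal",  # hypothetical endpoint
    user="admin",
    password="...",
    autocommit=True,
)
with conn.cursor() as cur:
    cur.execute("SET GLOBAL max_connect_errors = 1000000")
    # Clear any hosts already blocked by earlier connection errors.
    cur.execute("FLUSH HOSTS")
conn.close()
```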
Remediations
We will continue to investigate the cause of this incident. Based on what we’ve found so far, the remediations will include at least:
- Going forward, we will ensure that at least 75% of the Streaming Import API capacity is kept online during a release (a sketch of such a deployment setting follows this list).
- We’ll consider letting each Streaming Import API instance dedicate one application worker to processing health check requests; this will prevent instances from being accidentally taken out of service when capacity is insufficient (see the second sketch after this list).
- We’ll continue to invest in speeding up and automating Streaming Import Worker scale-out. We’re also working hard to improve the Streaming Import pipeline’s scalability.
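As a sketch of the first remediation, assuming the Streaming Import API is deployed through an orchestrator with rolling-update controls such as Amazon ECS (the cluster and service names are hypothetical; the report does not say what deployment tooling is used):

```python
import boto3

ecs = boto3.client("ecs")

# Keep at least 75% of the Streaming Import API capacity in service while
# a release rolls out; allow up to 200% of capacity during the transition.
ecs.update_service(
    cluster="streaming-import",       # hypothetical cluster name
    service="streaming-import-api",   # hypothetical service name
    deploymentConfiguration={
        "minimumHealthyPercent": 75,
        "maximumPercent": 200,
    },
)
```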
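For the second remediation, the underlying pattern is to answer health checks independently of the workers that serve import traffic, so a saturated instance still looks alive to the load balancer. A minimal, self-contained Python sketch of that pattern (the Streaming Import API’s actual implementation and language are not described here):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers load balancer health checks on a dedicated port and thread,
    independently of the workers that serve import traffic."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

    def log_message(self, fmt, *args):
        pass  # keep health check noise out of the request logs

def start_health_check_worker(port=8081):
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server

if __name__ == "__main__":
    # Even when every application worker is saturated with import requests,
    # this dedicated thread keeps responding, so the load balancer does not
    # take the instance out of service.
    start_health_check_worker()
    # ... start the regular application workers here ...
    threading.Event().wait()  # stand-in for the real serving loop
```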