[US Region] Degraded Streaming import performance
Incident Report for Treasure Data
Postmortem

From 10:20pm PDT on July 30th to 12:50am PDT on July 31st (2:20pm JST to 4:50pm JST on July 31st), Streaming Import REST API clients experienced request slowdowns and an elevated error rate due to a Backend Database connectivity issue.

Following the resolution of the issue, from 1:00am to 2:30am PDT on July 31st (5pm JST to 6:30pm JST on July 31st), customers in the US region who use api.treasuredata.com / api-import.treasuredata.com experienced delays in their Streaming imports. During this period, imported events took 7 minutes to become visible/queryable.
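For readers who want to check the PDT to JST conversions for the first window, the short sketch below redoes the arithmetic. It assumes fixed offsets of UTC-7 for PDT and UTC+9 for JST, and takes the year 2019 from the posting dates; the helper name is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: PDT is UTC-7, JST is UTC+9 (JST has no daylight saving time).
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")


def pdt_to_jst(year, month, day, hour, minute):
    """Convert a PDT wall-clock time to the same instant in JST."""
    return datetime(year, month, day, hour, minute, tzinfo=PDT).astimezone(JST)


# First window: 10:20pm PDT on July 30th to 12:50am PDT on July 31st.
start = pdt_to_jst(2019, 7, 30, 22, 20)
end = pdt_to_jst(2019, 7, 31, 0, 50)

print(start.isoformat())  # 2019-07-31T14:20:00+09:00
print(end.isoformat())    # 2019-07-31T16:50:00+09:00
```

JST is 16 hours ahead of PDT, so both ends of the window fall on July 31st in Japan.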

We’d like to provide you with some additional information about the Streaming Import performance degradation.

Timeline

The timeline of this incident was:

  • 10:18pm PDT: We started a release of the Streaming Import API.
  • 10:27pm PDT: After the release, we observed that a large number of instances went offline and the Streaming Import API performance was degraded.
  • 10:38pm~10:49pm PDT: Half of the Streaming Import API instances were automatically terminated due to high CPU usage.
  • 10:58pm PDT: We doubled the capacity of Streaming Import API and temporarily disabled auto-scaling to prevent accidental scale-in.
  • 11:15pm PDT: The load balancer was randomly taking many Streaming Import API instances out of service due to health check timeouts. We modified the health check configuration to keep more instances in service.
  • 11:47pm PDT: We noticed that the number of connections to a database used for the Streaming Import was consistently high.
  • 11:50pm PDT: We progressively restarted half of the Streaming Import API instances.
  • 12:07am PDT: We restarted one of the databases used for the Streaming Import.
  • 12:15am PDT: We temporarily stopped half of the Streaming Import Worker services.
  • 12:44am PDT: A large number of the existing instances were still flagged as unhealthy by the load balancer. We rotated some of the unhealthy instances.
  • 12:51am PDT: We rolled back the application version of the Streaming Import API to the previous release.
  • 12:53am PDT: Most of the instances became healthy and the performance of the Streaming Import API returned to normal levels.
  • 01:00am PDT: Once the Streaming Import API returned to normal performance levels, a huge amount of data was queued to be imported and caused an access flood.
  • 01:32am PDT: We started to scale out the Streaming Import Worker cluster and increased the maximum number of error connections on the database used for the Streaming Import.
  • 02:30am PDT: The import backlog was fully consumed.
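
The 11:15pm entry describes instances being taken out of service because health checks timed out while the instances were busy. The following is a minimal sketch of that failure mode and of one way to avoid it, assuming a thread-pool based server. The `Dispatcher` class, the `/health` and `/import` paths, and the worker counts are hypothetical and are not taken from the Streaming Import API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

HEALTH_PATH = "/health"  # hypothetical path, not taken from the report


class Dispatcher:
    """Routes health checks to a reserved worker so they never queue behind imports."""

    def __init__(self, request_workers):
        self.request_pool = ThreadPoolExecutor(max_workers=request_workers)
        self.health_pool = ThreadPoolExecutor(max_workers=1)  # reserved worker

    def submit(self, path, handler):
        pool = self.health_pool if path == HEALTH_PATH else self.request_pool
        return pool.submit(handler)

    def shutdown(self):
        self.request_pool.shutdown(wait=True)
        self.health_pool.shutdown(wait=True)


def demo():
    """Saturate the request workers, then show the health check still answers."""
    release = threading.Event()
    dispatcher = Dispatcher(request_workers=2)
    # Four slow "import" requests occupy both request workers and fill the queue.
    for _ in range(4):
        dispatcher.submit("/import", release.wait)
    health = dispatcher.submit(HEALTH_PATH, lambda: "ok")
    try:
        # Returns promptly because the health check has its own worker.
        return health.result(timeout=2)
    finally:
        release.set()  # unblock the slow requests so the pools can drain
        dispatcher.shutdown()
```

If the health check shared the two request workers in this sketch, it would sit in the queue behind the blocked imports and time out, which is the pattern a load balancer would interpret as an unhealthy instance.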

Remediations

We will continue to investigate the cause of this incident. Based on what we’ve noticed so far, the remediation will include at least:

  • Going forward we will ensure at least 75% of the Streaming Import API capacity is kept online during a release.
  • We’ll consider dedicating one application worker on each Streaming Import API instance to processing health check requests. This would prevent instances from being accidentally taken out of service when capacity is insufficient.
  • We’ll continue to invest in speeding up and automating Streaming Import Worker scale-out. We’re also working hard to improve the Streaming Import pipeline’s scalability.
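
As an illustration of the 75% capacity floor in the first remediation item, the sketch below computes how many instances could be released at once and splits a fleet into release batches. The fleet sizes and function names are hypothetical; the report does not state the actual instance count or the deployment tooling used.

```python
import math


def max_batch_size(total_instances, min_online_fraction=0.75):
    """Largest number of instances that may be offline at once
    while at least min_online_fraction of the fleet stays online."""
    min_online = math.ceil(total_instances * min_online_fraction)
    return total_instances - min_online


def release_batches(instances, min_online_fraction=0.75):
    """Split instances into consecutive batches that respect the capacity floor."""
    batch = max_batch_size(len(instances), min_online_fraction)
    if batch < 1:
        raise ValueError("fleet is too small to release without breaching the floor")
    return [instances[i:i + batch] for i in range(0, len(instances), batch)]


# Hypothetical fleet of 20 instances: at most 5 may be offline at a time.
fleet = [f"api-{n:02d}" for n in range(20)]
batches = release_batches(fleet)
print(len(batches), [len(b) for b in batches])  # 4 [5, 5, 5, 5]
```

With a 75% floor, a release proceeds in at least four rounds, and a fleet of fewer than four instances cannot be released in place without adding temporary capacity first.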
Posted Aug 01, 2019 - 19:19 PDT

Resolved
This incident has been resolved.

From 10:20pm to 12:50am PDT, Streaming Import REST API clients experienced request slowdowns and an elevated error rate due to a Backend Database connectivity issue. After the issue was resolved, from 1:00am to 2:30am, US region customers experienced Streaming Import delays. During this period, imported events needed 7 minutes to become queryable.
Posted Jul 31, 2019 - 02:45 PDT
Monitoring
A fix has been implemented and we're monitoring the results.
Posted Jul 31, 2019 - 01:03 PDT
Identified
We identified a performance degradation in our import database. We're working on resolving the issue.
Posted Jul 31, 2019 - 00:37 PDT
Investigating
We are observing a Streaming Import processing slowdown.
Posted Jul 30, 2019 - 23:38 PDT
This incident affected: US (Streaming Import REST API, Mobile/Javascript REST API).