Elevated API error rate

Incident Report for Treasure Data

Postmortem

After a detailed investigation of yesterday's Backend DB connection problem, we found that frequent DB write accesses caused high database write latency, which eventually resulted in elevated API error rates. During that time, API requests from the CLI and REST clients sometimes received 5XX HTTP server errors. Additionally, some queries using "TreasureData Result Export" failed for the same reason, since that mechanism uses the same API to import records into Treasure Data.
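For API clients that see intermittent 5XX responses like the ones described above, retrying with exponential backoff is a common way to ride out short periods of elevated errors. The following is a minimal sketch of that general technique, not the behavior of the Treasure Data CLI or client libraries; `ServerError` and `call_with_retry` are hypothetical names introduced for illustration.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical exception standing in for an HTTP 5XX response."""

    def __init__(self, status):
        super().__init__(f"server error {status}")
        self.status = status


def call_with_retry(request, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request() and retry on ServerError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ServerError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # The base delay doubles on each attempt; random jitter keeps
            # many clients from retrying at the same instant.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Retries of this kind are only safe for requests that can be repeated without side effects, or that the server deduplicates.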

The frequent write accesses were caused by streaming imports, which require updating the counter and schema of the target import table. At 10:15 PM PDT yesterday (9/27), we disabled counter and schema updates to reduce write contention on one of our Backend DBs. This mitigated the elevated API error rate, but it also meant that counter and schema updates were stopped from that point on. After implementing a change to reduce the frequency of write accesses, we restored the update feature at 11:15 AM PDT today (9/28).
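This report does not describe the change that was made to reduce the write frequency. As an illustration only, one common way to cut write contention from per-import counter updates is to accumulate increments in memory and write them in batches. The class name, `write_fn` callback, and threshold below are all hypothetical.

```python
from collections import Counter


class CoalescingCounter:
    """Buffer per-table record counts in memory and write them out in batches."""

    def __init__(self, write_fn, flush_threshold=1000):
        # write_fn is a hypothetical callback that persists {table: delta}.
        self._write_fn = write_fn
        self._flush_threshold = flush_threshold
        self._pending = Counter()
        self._buffered = 0

    def add(self, table, records):
        """Record an import of `records` rows into `table` without writing yet."""
        self._pending[table] += records
        self._buffered += records
        if self._buffered >= self._flush_threshold:
            self.flush()

    def flush(self):
        """Write all pending increments as one batch, then reset the buffer."""
        if not self._pending:
            return
        batch = dict(self._pending)
        self._pending.clear()
        self._buffered = 0
        self._write_fn(batch)
```

With this shape, many small imports into the same table result in a single database write per flush. The trade-off is that stored counters lag behind by up to one batch, so a real system would also flush on a timer and on shutdown.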

Customers who imported data between 10:15 PM PDT on 9/27 and 11:15 AM PDT on 9/28 may have observed that new Presto jobs could not see columns newly added to a table, because the Presto engine depends on the schema definition stored in the Backend DB. The schema definition is updated based on streaming import records when the "Auto-Update Schema" feature switch is enabled. Since this feature is enabled by default, you may have been relying on it. The Auto-Update Schema feature has now been restored.
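To make the dependency concrete, the sketch below shows the general idea of extending a stored schema with columns that appear in newly imported records. It is a simplified illustration, not Treasure Data's implementation: `merge_schema` is a hypothetical helper, and inferring a type from the Python type name is an assumption made for brevity.

```python
def merge_schema(schema, records):
    """Return a new schema dict extended with columns seen in records.

    schema maps column name -> type name. Existing columns are never changed;
    a new column takes the type of the first value seen for it.
    """
    merged = dict(schema)
    for record in records:
        for column, value in record.items():
            if column not in merged:
                merged[column] = type(value).__name__
    return merged
```

While an update step like this is disabled, records containing a new column are still imported, but the stored schema does not list that column, so a query engine that reads the stored schema cannot see it.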

We're very sorry for any inconvenience this incident may have caused. Please don't hesitate to contact our support team if you have any questions or need clarification.

Posted Sep 28, 2016 - 18:24 PDT

Resolved

This problem has been resolved.

Posted Sep 28, 2016 - 01:25 PDT

Update

We are still observing intermittent Backend DB connection problems. We have added API server resources for load balancing and are continuing to monitor the situation.

Posted Sep 27, 2016 - 23:29 PDT

Monitoring

From 21:51 to 22:05 PDT, API servers could not respond in time due to a network issue at the Backend DB server. The network issue has been resolved, and we are continuing to monitor all of our systems.

Posted Sep 27, 2016 - 22:19 PDT