After a detailed investigation of yesterday's Backend DB connection problem, we found that frequent DB write accesses caused high database write latency, which eventually resulted in elevated API error rates. During that time, API requests from the CLI and REST clients sometimes received 5XX HTTP server errors. Additionally, some queries using "TreasureData Result Export" failed for the same reason, since that mechanism uses the same API to import records into Treasure Data.
The frequent write accesses were caused by streaming import, which requires updating the counter and schema of the target import table. At 10:15 PM PDT yesterday (9/27), we disabled counter and schema updates to reduce write contention on one of our Backend DBs. This mitigated the elevated API error rate, but it also meant that counter and schema updates stopped from that point on. After implementing a change to mitigate the frequent write accesses, we restored the update feature at 11:15 AM PDT today (9/28).
Customers who imported data between 10:15 PM PDT (9/27) and 11:15 AM PDT (9/28) may have observed that new Presto jobs could not see columns newly added to a table, because the Presto engine depends on the schema definition stored in the Backend DB. The schema definition is updated from streaming import records when the "Auto-Update Schema" feature switch is enabled. Since this feature is enabled by default, you may have been relying on it. The Auto-Update Schema feature has now been restored.
We're very sorry for any inconvenience this incident may have caused. Please don't hesitate to contact our support team if you have any questions or need clarification.