On January 22nd, 2020 at around 16:45 PST (around 09:45 JST and 01:45 CET on January 23rd), we rolled out an update to the Presto and Hive query engine version tags.
The update applied to all deployment sites equally:
In all sites, the deployment caused both the Presto and Hive engines to stop working: customers were unable to execute queries by any means.
The issue was resolved 23 minutes later by reverting the changes to the query engine version tags, after which the engines immediately returned to normal operation.
All customer accounts in the US, Japan, and EU sites were impacted.
76,867 query job execution requests failed during the 23-minute incident, affecting all accounts across the three sites.
Several customers saw a total of 97 query jobs that were configured to export their results remain stuck in the ‘running’ state for approximately ten and a half hours.
At 16:45 PST on January 22nd (09:45 JST or 01:45 CET on January 23rd), a new configuration was manually added to the REST API system to allow customers to select among multiple versions of the Presto and Hive engines. The manual configuration change was coupled to the latest release of the API system and was applied under the assumption that the release had already been rolled out.
However, due to a miscommunication around the timing of the release, it had not yet reached production. The configuration change was not backward compatible with the release actually running in production, and caused an immediate side effect: all REST API calls to execute Presto or Hive query jobs started failing as soon as the change was applied. The issue affected all customers in all deployment sites: US, Japan, and Europe.
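To illustrate the failure mode, here is a minimal, hypothetical sketch (not Treasure Data's actual code; all names are invented): the release still in production only understood its original configuration keys, so a configuration written for the newer release failed validation, rejecting every query job request.

```python
# Hypothetical sketch of a backward-incompatible configuration change.
# The keys and shapes below are assumptions for illustration only.

OLD_SCHEMA_KEYS = {"presto_version", "hive_version"}  # assumed old schema

def validate_config_old_release(config: dict) -> None:
    """Validator as the *previous* (still deployed) release might have
    implemented it: any unknown key makes the request invalid, which the
    API surfaces as HTTP 422 (Unprocessable Entity)."""
    unknown = set(config) - OLD_SCHEMA_KEYS
    if unknown:
        raise ValueError(
            f"422 Unprocessable Entity: unknown configuration keys {sorted(unknown)}"
        )

# A configuration written for the *new* release, applied on the
# assumption that the new release was already live:
new_config = {
    "engine_versions": {"presto": ["0.205"], "hive": ["0.13"]}
}

try:
    validate_config_old_release(new_config)
except ValueError as exc:
    # Every query-job submission now fails validation in the same way.
    print(exc)
```

The point of the sketch is that the change was valid for the new code path but unparseable by the old one, so the failure began the moment the configuration was applied, independent of any deploy.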
Twelve minutes later, at 16:57 PST (09:57 JST or 01:57 CET), the first support inquiries started coming in, reporting an anomalously high number of request failures when creating query jobs. Customers were unable to execute Presto or Hadoop/Hive query jobs by any means: Web Console, REST API, Command-Line Interface (CLI), or Presto ODBC/JDBC Gateway. Requests received HTTP 422 (Unprocessable Entity) errors when attempting to execute a query job. The same failures also caused Workflows and Scheduled queries that automatically execute query jobs to fail.
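HTTP 422 signals that the server understood the request but rejected it as semantically invalid, so client-side retries cannot succeed; this is why automated Workflows and Scheduled queries failed outright rather than recovering. A small illustrative helper (hypothetical, not part of any official SDK) that makes the distinction explicit:

```python
def is_retryable(status: int) -> bool:
    """Return True for transient HTTP failures worth retrying.

    422 (Unprocessable Entity) is a permanent, request-level rejection:
    during this incident every query-job submission received it, so
    retrying the same request could never succeed until the server-side
    configuration was reverted.
    """
    if status in (408, 429):      # request timeout / rate limiting: transient
        return True
    if 500 <= status <= 599:      # server-side errors: often transient
        return True
    return False                  # other 4xx, including 422: fix the request
```

Treating 422 as non-retryable also avoids retry storms against an API that is rejecting every request for the same deterministic reason.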
By 17:08 PST (10:08 JST or 02:08 CET), 23 minutes after the incident began, the configuration change was reverted. Query job requests resumed being accepted, and Workflows and Scheduled queries resumed executing normally.
At 17:14 PST (10:14 JST or 02:14 CET), while our staff monitored the system after the revert, a status page incident was filed to notify customers retroactively.
By 17:25 PST (10:25 JST or 02:25 CET), the remediation was confirmed to have resolved the issue, and at 18:05 PST (11:05 JST or 03:05 CET) the incident was closed on the status page.
Over the following hours, further investigation revealed that the request failures that occurred during the incident had also caused some query jobs configured to export their results to remain stuck in a ‘running’ state even though they had already completed execution.
Our staff performed a manual operation to correct the status of these jobs to either ‘success’ or ‘error’, as appropriate, for each customer. The operation was completed by 03:19 PST on January 23rd (20:19 JST or 12:19 CET), approximately ten and a half hours after the beginning of the incident.
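The reconciliation step described above can be sketched as follows. This is a hypothetical illustration, not the actual operation our staff ran; the job model and field names are invented for clarity.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    state: str                # 'running', 'success', or 'error'
    actually_completed: bool  # did the engine really finish executing the job?
    failed: bool              # did the job end in an error?

def reconcile_stuck_jobs(jobs):
    """Move jobs that finished execution but were left in 'running'
    to their true terminal state ('success' or 'error').

    Jobs that are genuinely still running, or already in a terminal
    state, are left untouched."""
    fixed = []
    for job in jobs:
        if job.state == "running" and job.actually_completed:
            job.state = "error" if job.failed else "success"
            fixed.append(job.job_id)
    return fixed
```

The guard on `actually_completed` matters: a blanket update of every ‘running’ job would clobber jobs that were legitimately still executing, which is why the correction had to be performed per job, per customer.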
We are investigating several possible remediation actions:
We regret that this incident prevented you from fully leveraging the functionality of the system, in particular the query subsystems.
Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.