[US/JP/EU] Partial Outage with Hive and Presto Query Engine
Incident Report for Treasure Data
Postmortem

Summary

On January 22nd, 2020 at around 16:45 PST (January 23rd at around 09:45 JST and 01:45 CET), we rolled out an update to the Presto and Hive query engine version tags.

The update applied to all deployment sites equally:

  • US site --- deployed in the AWS North Virginia, US region (us-east-1)
  • Japan site --- deployed in the AWS Tokyo, Japan region (ap-northeast-1)
  • EU site --- deployed in the AWS Frankfurt, Germany region (eu-central-1)

In all sites, the deployment caused both the Presto and Hive engines to stop working. Customers were unable to execute queries through any interface.

The issue was resolved 23 minutes later by reverting the changes to the query engine version tags, after which the engines immediately returned to normal operation.

Impact to customers

All customer accounts in the US, Japan, and EU sites were impacted.

76,867 query job execution requests failed during the incident (23 minutes), affecting all accounts across the 3 sites.

Several customers saw a total of 97 query jobs configured to export their results remain stuck in the ‘running’ state for approximately 10 and a half hours.

Details

At 16:45 PST on January 22nd (09:45 JST or 01:45 CET on January 23rd), a new configuration was manually added to the REST API system to allow customers to select among multiple versions of the Presto and Hive engines. The manual configuration change was coupled with the latest release of the API system and was added under the assumption that the release had already been rolled out.

However, due to a miscommunication about the timing of the release, it had not yet been rolled out to production. The configuration change turned out not to be backward compatible with the release actually running in production, with an immediate side effect: all REST API calls to execute Presto or Hive query jobs started failing as soon as the change was applied. The issue affected all customers in all deployment sites: US, Japan, and Europe.
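
As an illustration of the failure mode (the configuration format, field names, and code below are assumptions for explanatory purposes, not our actual implementation), a strict allow-list validation of engine version tags can turn a configuration/release mismatch into rejected requests:

    # Hypothetical sketch; names and structure are assumed, not Treasure Data's actual code.

    # Engine version tags the *old* API release was built to accept.
    KNOWN_VERSION_TAGS = {"presto": {"stable"}, "hive": {"stable"}}

    # New configuration rolled out ahead of the API release that understands it.
    REQUESTED_CONFIG = {"presto": "stable-350", "hive": "stable-2020.1"}


    class UnprocessableEntity(Exception):
        """Maps to an HTTP 422 response in this sketch."""


    def validate_job_request(engine: str, version_tag: str) -> None:
        # The old release only knows the tags it shipped with; an unrecognized
        # tag is rejected instead of falling back to a default.
        if version_tag not in KNOWN_VERSION_TAGS.get(engine, set()):
            raise UnprocessableEntity(f"unknown {engine} version tag: {version_tag!r}")


    for engine, tag in REQUESTED_CONFIG.items():
        try:
            validate_job_request(engine, tag)
        except UnprocessableEntity as exc:
            print(f"422 Unprocessable Entity: {exc}")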

12 minutes later, by 16:57 PST (09:57 JST or 01:57 CET), the first support inquiries started coming in, reporting an anomalously high number of failures when creating query jobs. Customers were unable to execute Presto or Hadoop/Hive query jobs by any means: Web Console, REST API, Command-line Interface (CLI), or Presto ODBC/JDBC Gateway. Requests received HTTP 422 (Unprocessable Entity) errors when attempting to execute a query job. The same failures also caused Workflows and Scheduled queries that automatically execute query jobs to fail.
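
For completeness, the short example below shows how the failure surfaced to scripted REST API clients. The endpoint path, authorization header, and response field are based on Treasure Data's public API documentation but should be treated as illustrative assumptions here:

    # Hedged example: endpoint, auth header, and response field are assumptions
    # based on Treasure Data's public REST API documentation.
    import os
    import requests

    API_BASE = "https://api.treasuredata.com"   # US-site endpoint (assumed)
    APIKEY = os.environ["TD_API_KEY"]

    def issue_presto_job(database: str, query: str) -> str:
        resp = requests.post(
            f"{API_BASE}/v3/job/issue/presto/{database}",
            headers={"Authorization": f"TD1 {APIKEY}"},
            data={"query": query},
            timeout=30,
        )
        if resp.status_code == 422:
            # The error customers saw during the incident: the request was
            # well-formed, but the API refused to create the query job.
            raise RuntimeError(f"query job creation rejected (422): {resp.text}")
        resp.raise_for_status()
        return resp.json().get("job_id")  # field name assumed from public docs

    print(issue_presto_job("sample_datasets", "SELECT COUNT(1) FROM www_access"))

During the incident, a client like this would have raised the 422 error above on every job creation attempt, which is also why Workflows and Scheduled queries failed.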

By 17:08 PST (10:08 JST or 02:08 CET), 23 minutes after the incident began, the configuration change was reverted. Query job requests resumed being accepted, and Workflows and Scheduled queries resumed executing normally.

At 17:14 PST (10:14 JST or 02:14 CET), while our staff monitored the system after the revert operation, a status page incident was filed to notify customers retroactively.

By 17:25 PST (10:25 JST or 02:25 CET), the remediation was confirmed to have resolved the issue, and at 18:05 PST (11:05 JST or 03:05 CET) the incident was closed on the status page.

Over the following hours, further investigation revealed that the request failures that occurred during the incident had also caused some of the query jobs configured to export their results to remain stuck in a ‘running’ state even though they had already completed execution.

Our staff performed a manual operation to correct the status of these jobs to either ‘success’ or ‘error’, as appropriate for each customer. The operation was completed by 03:19 PST on January 23rd (20:19 JST or 12:19 CET on January 23rd), approximately 10 and a half hours after the beginning of the incident.

Remediations

We are investigating several possible remediation actions:

  • Update the release process to make sure the status of a production release can be clearly identified, leaving no ambiguity about whether it has been rolled out to production.
  • Add monitoring and alerting to detect when the API system returns a large number of 4xx or 5xx errors.
  • The issue stemmed from the configuration not matching the engine selection mechanism, which caused query job creation requests to fail. We want to update the engine selection mechanism so that there is always a safe fallback engine version (see the sketch after this list).
  • Update our configuration management practices to make configuration changes more visible and auditable, so as to reduce the time it takes to correlate an anomaly to its cause (e.g. reduce mean-time-to-detection or MTTD).
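
As a rough illustration of the fallback idea in the third bullet (all names below are assumptions, not our actual implementation), the engine selection mechanism could resolve an unknown or missing version tag to a known-good default instead of rejecting the request:

    # Hypothetical sketch of "safe fallback" engine version selection;
    # names and structure are assumptions, not Treasure Data's actual code.
    from typing import Optional

    AVAILABLE_VERSIONS = {
        "presto": {"stable": "350", "experimental": "360"},
        "hive": {"stable": "2.x"},
    }
    FALLBACK_TAG = "stable"  # always present and known to the running release


    def resolve_engine_version(engine: str, requested_tag: Optional[str]) -> str:
        """Return a concrete engine version, falling back instead of failing."""
        versions = AVAILABLE_VERSIONS[engine]
        if requested_tag in versions:
            return versions[requested_tag]
        # Unknown or missing tag: warn and fall back rather than returning 422,
        # so a configuration/release mismatch degrades gracefully.
        print(f"warning: unknown {engine} tag {requested_tag!r}, using {FALLBACK_TAG!r}")
        return versions[FALLBACK_TAG]


    print(resolve_engine_version("presto", "stable-350"))  # unknown tag -> "350"
    print(resolve_engine_version("hive", None))            # missing tag -> "2.x"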

Conclusion

We regret that this incident prevented you from fully leveraging the functionality of the system, and in particular the query subsystems.

Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.

Posted Jan 28, 2020 - 08:48 PST

Resolved
This incident has been resolved.
Posted Jan 22, 2020 - 18:05 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 22, 2020 - 17:31 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 22, 2020 - 17:24 PST
Update
We are continuing to investigate this issue.
Posted Jan 22, 2020 - 17:18 PST
Update
We are continuing to investigate this issue.
Posted Jan 22, 2020 - 17:15 PST
Investigating
We are currently investigating this issue.
Posted Jan 22, 2020 - 17:14 PST
This incident affected: US (Hadoop / Hive Query Engine, Presto Query Engine), Tokyo (Hadoop / Hive Query Engine, Presto Query Engine), and EU (Hadoop / Hive Query Engine, Presto Query Engine).