Unable to Use Presto Query Engine

Incident Report for Treasure Data

Postmortem

We’d like to give you some additional information about the Presto service disruption that occurred on Jul 24, 2017.

On Jul 24, 2017 at 02:51 PDT we began to observe failures in our automated end-to-end systems, indicating trouble executing Presto queries. A few of our customers confirmed the problem shortly afterward.

By 03:04 PDT our investigation identified that the issue resided in our feature flag-based access control mechanism, which began to reject Presto queries from all our customers.

By 03:13 PDT we had recovered the configuration and restored our customers' ability to query with Presto.

Detailed Explanation

The problem was caused by our internal account management system, which provisions and deprovisions features and functionality by associating predetermined roles with an account.

On July 24, 2017 at 02:29 PDT the system was used to deprovision the ability to query via the Presto engine (a.k.a. the presto role) from an account. Because of a bug, the tool removed the presto role in its entirety, which took away the ability to query via Presto from all existing accounts, not just the target one.
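The difference between the intended operation and the one the tool actually performed can be illustrated with a minimal sketch. The data model here (a dictionary mapping a role name to the set of account IDs holding it) and all function names are hypothetical and chosen only for illustration; they are not the actual implementation of the account management system.

```python
# Hypothetical model: role name -> set of account IDs that hold the role.

def revoke_role(grants, role, account_id):
    """Intended behavior: remove the role from one account only."""
    grants.get(role, set()).discard(account_id)


def delete_role(grants, role):
    """Failure mode described above: the role itself is removed,
    so every account loses it at once."""
    grants.pop(role, None)


def can_query(grants, role, account_id):
    """An account may query only if it holds the role."""
    return account_id in grants.get(role, set())


# Revoking from one account leaves the others untouched.
grants = {"presto": {"acct-1", "acct-2", "acct-3"}}
revoke_role(grants, "presto", "acct-2")
others_still_allowed = can_query(grants, "presto", "acct-1")  # True
target_blocked = not can_query(grants, "presto", "acct-2")    # True

# Deleting the role blocks every account, not just the target.
delete_role(grants, "presto")
everyone_blocked = not any(
    can_query(grants, "presto", a) for a in ("acct-1", "acct-3")
)  # True
```

In this sketch the two operations differ by a single call, which is why a bug in this kind of tooling can have an account-wide blast radius.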

Consequently, all customers, regardless of where they issued Presto queries from (Console, REST API, or the Ruby, Java, or Python client libraries), started receiving a "Presto queries are not enabled for your account" error (status code 422). This error affected the execution of:

Individual Presto queries, which were rejected with the error mentioned above.
Scheduled Presto queries, which were delayed until the problem was cleared. All of these queries eventually ran to completion, although later than expected.
TD Workflows executing Presto queries, which failed. These workflows require manual re-execution (e.g. through the retry failed button in the workflow UI).

At 03:13 PDT, once the configuration was recovered, the system resumed accepting Presto queries from the customers who have access to it.

Remediation

Here's our plan to remediate the problem:

Make sure internal tooling is a first-class citizen of our development lifecycle and receives the same amount of development attention, if not more.
Improve our active monitoring system so that it reports the failure encountered more clearly and as quickly as possible, helping us react to this sort of issue faster.
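As an illustration of what clearer failure messaging could look like, the following is a minimal sketch of a classifier that a monitoring probe might apply to a failed query response. The function name, the alert wording, and the assumption that the error text appears verbatim in the response body are all hypothetical; only the error message and the 422 status code come from this incident.

```python
# Error text and status code as reported in this incident.
PRESTO_DISABLED_MESSAGE = "Presto queries are not enabled for your account"


def classify_probe_failure(status_code, body):
    """Turn a failed probe response into a short alert message.

    Assumes (hypothetically) that the error text is present in the body.
    """
    if status_code == 422 and PRESTO_DISABLED_MESSAGE in body:
        return "access control: Presto is not enabled for the probe account"
    if status_code >= 500:
        return "service error: query endpoint returned a server-side failure"
    return f"unclassified failure: HTTP {status_code}"
```

Separating an access-control rejection from a server-side failure in the alert itself would point responders at the feature flag configuration rather than at cluster performance.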

Finally, we want to apologize for the impact this issue caused and assure you that we will do everything we can to learn from this event and use it to improve our availability further.

Posted Jul 24, 2017 - 17:03 PDT

Resolved

The problem was resolved.

The direct cause was a critical bug in our internal account management system. When we changed the feature flag for a specific customer, the account management system unexpectedly deleted the Presto flag itself, rather than just removing the feature flag from that customer.

We will publish a more detailed postmortem as soon as possible.

Posted Jul 24, 2017 - 03:27 PDT

Monitoring

We've fixed the feature flag system. The Presto clusters have resumed processing queries.

Posted Jul 24, 2017 - 03:13 PDT

Identified

We have identified that our internal feature flag system, which manages which users have access to which features, unexpectedly removed access to the Presto cluster. We're working on fixing the feature flags for existing customers.

Posted Jul 24, 2017 - 03:04 PDT

Investigating

We are observing a degradation in the performance of our main Presto cluster.
We are currently investigating the cause.

Posted Jul 24, 2017 - 02:51 PDT

This incident affected: US (Presto Query Engine).