We’d like to give you some additional information about the Presto service disruption that occurred on Jul 24, 2017.
On Jul 24, 2017 at 02:51 PDT we began to observe failures in our automated end-to-end systems indicating trouble executing Presto queries. A few of our customers confirmed the problem shortly afterward.
By 03:04 PDT our investigation had identified that the issue resided in our feature flag-based access control mechanism, which had begun to reject Presto queries from all our customers.
By 03:13 PDT we had recovered the configuration and restored customers' ability to query with Presto.
The problem was caused by our internal account management system, which provisions and deprovisions features and functionality by associating predetermined roles with an account.
On Jul 24, 2017 at 02:29 PDT the system was used to deprovision the ability to query via the Presto engine (a.k.a. the presto role) from a single account. Because of a bug, the tool removed the presto role in its entirety, thus removing the ability to query via Presto from all existing accounts, not just the target one.
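The difference between the intended and the actual behavior can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the RoleStore class, its method names, and the in-memory layout are hypothetical and do not describe our actual account management system.

```python
from collections import defaultdict


class RoleStore:
    """Hypothetical in-memory mapping of role name -> set of account ids."""

    def __init__(self):
        self._members = defaultdict(set)

    def grant(self, role, account_id):
        self._members[role].add(account_id)

    def has_role(self, role, account_id):
        return account_id in self._members.get(role, set())

    def revoke(self, role, account_id):
        # Intended behavior: only the target account loses the role.
        self._members.get(role, set()).discard(account_id)

    def revoke_unscoped(self, role, account_id):
        # Failure mode similar to the one described above (illustrative only):
        # account_id is ignored and the role is dropped for every account.
        self._members.pop(role, None)


store = RoleStore()
for account_id in (1, 2, 3):
    store.grant("presto", account_id)

store.revoke("presto", 2)
scoped_result = [store.has_role("presto", a) for a in (1, 2, 3)]

store.revoke_unscoped("presto", 1)
unscoped_result = [store.has_role("presto", a) for a in (1, 2, 3)]
```

In this sketch the scoped call leaves accounts 1 and 3 untouched, while the unscoped call removes the role from every account, which mirrors the impact described above.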
Consequently, all customers, irrespective of where they were issuing Presto queries from (Console, REST API, or the Ruby, Java, or Python client libraries), started receiving a "Presto queries are not enabled for your account" error (status code 422) when issuing queries.
This error affected the execution of:
"retry failed" button in workflow UI)
By 03:13 PDT, once the configuration was recovered, the system resumed allowing Presto queries from those customers who have access to it.
Here's our plan to remediate the problem:
Finally, we want to apologize for the impact this issue caused and assure you that we will do everything we can to learn from this event and use it to improve our availability further.