[All Regions] Some Presto CREATE TABLE AS / INSERT INTO / DELETE FROM queries failed to write some of the partitions
Incident Report for Treasure Data


During the period between May 7th and May 22nd, our Presto service suffered from a bug that caused some queries containing CREATE TABLE AS SELECT, INSERT INTO, and/or DELETE FROM statements (from here on referred to as update queries) to produce inconsistent results.

The inconsistency was caused by a subset of the update queries failing to write or delete some of the records into/from the Plazma storage.

For all the customers whose Presto update queries were affected by this problem, our Technical Support and Customer Success teams have already reached out to the account owners one by one to report the issue and provide information about which query jobs and tables were affected. In most (but not all) failure cases, the missing records could be recovered, in which case our team would have provided that information as well.

Revision notes

This postmortem was initially published at this link on June 3rd, 2020 without long term remediations.
An updated version was published on June 22nd, 2020 adding the missing information.


On May 7th, at around 01:35 AM PDT (17:35 JST, 10:35 CEST), we released a version of Presto, based on open-source Presto version 317, that addressed the performance limitation issues that some customers had been experiencing.

On May 10th, at 21:25 PDT (May 11th, at 13:25 JST and 06:25 CEST), the same version of Presto was also rolled out to the US and Tokyo regions simultaneously, thus completing the release cycle.

On May 21st, at 21:45 PDT (May 22nd, at 13:45 JST, 06:45 CEST), we received the first customer inquiry reporting that, after executing a CREATE TABLE AS query against a well-known, stable dataset (one or more tables), the resulting destination table contained fewer records than expected. Over the next few hours, our team investigated the problem and identified that the update query had failed to write a subset of the table partitions (each containing one or more records). The missed writes were due to a race condition that caused one of the stages of the update query to assume the write operation had completed when it had not: this in turn caused the overall query to complete successfully, prematurely aborting the remaining pending write operations and resulting in fewer partitions and records being written.
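The failure mode described above can be illustrated with a toy model (this is not Presto's actual code; all names here are hypothetical, and the "storage" is a plain list): partition writes run in parallel, and in "buggy" mode the coordinator declares the query complete before the in-flight writes finish, aborting them.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_update_query(partitions, buggy, write_delay=0.05):
    """Toy model of the race: partition writes run in parallel; in 'buggy'
    mode the coordinator assumes completion as soon as tasks are submitted,
    so writes still in flight are aborted and never land."""
    written = []
    abort = threading.Event()

    def write_partition(p):
        time.sleep(write_delay)          # simulated storage latency
        if not abort.is_set():           # aborted writes never land
            written.append(p)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(write_partition, p) for p in partitions]
        if buggy:
            abort.set()                  # race: success declared too early
        else:
            wait(futures)                # correct: block until every write lands
    return sorted(written)
```

In buggy mode the "query" still appears to succeed, yet the destination ends up with fewer partitions than intended, which mirrors the symptom customers observed.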

The offending change was identified and, in order to favor speed of resolution, we decided to roll back the entire release: the operation was completed by May 22nd at 00:25 PDT (16:25 JST, 09:25 CEST) in all 3 regions.

By May 26th, we had a complete picture of all of the affected queries and customers. Our Technical Support and Customer Success teams had notified customers about the issue, providing a detailed list of the affected update query jobs and tables.

The rollback reintroduced the same performance limitations that had previously been affecting some customers and some of their queries: as the previous release went back in, some customers' query processing times began to increase significantly and our Technical Support team started to receive additional customer inquiries.

As of the publishing of this postmortem, the performance limitations are still present for most affected customers. A release rolling forward the original performance improvements, alongside an additional fix for the write inconsistency bug, has been prepared and is currently under evaluation.

Rolling forward this release is also important as it serves as a baseline to implement further performance improvements which will surely be necessary.

Impact to customers

We have identified that during the May 7th to May 22nd period 226 customer accounts across the US, Tokyo, and EU regions were affected by this issue.

In total, 5,609 update query jobs and 5,338 tables were affected. Of the over 5 thousand affected update queries (0.13% of all update queries run on Presto during the period and 0.05% of all Presto queries), around 3,100 were INSERT INTO, 2,400 were CREATE TABLE AS, and only around 100 were DELETE FROM.

Customers whose update queries were affected by this issue experienced these symptoms:

  • CREATE TABLE AS and INSERT INTO queries
    The queries failed to write a part of the expected records and the destination tables were found to contain fewer records than expected.
  • DELETE FROM queries
    The queries failed to delete a part of the expected records and the target tables were found to contain more records than expected.

When these queries were part of larger ETL workflows (through TD Workflow, Digdag, or other orchestration methods), they could also affect the correctness of downstream queries, producing inconsistent results not only in intermediate tables but across the overall workflow output.

The same may hold true for workflows tied to Audience Master Segments and/or Segments: inconsistent results from update queries that are part of these workflows could affect the end result of Segmentation and Activations, with the most likely scenario being that Segmentation returned fewer profiles than expected. In the case of Audience workflows, however, since the workflow ETL relies on intermediate, temporary tables, these issues were largely self-remediating and should have disappeared after a full refresh of the Master Segment, the Segment, or both.

What went wrong

We could not detect the bug before production rollout

The May 7th release was motivated by an attempt to resolve some performance limitations that, for a handful of customers and their specific use cases, had been found in the previous release. A number of these issues were found in our Presto extensions: in particular, in the code used to integrate Presto with our proprietary storage system, Plazma. The performance bottleneck was caused by high heap memory usage when writing table partitions into our storage; the improvement consisted in parallelizing the writing process, but our implementation pushed the optimization too aggressively, introducing the possibility of a race condition triggered under specific circumstances. These are the main factors that contributed to this issue surfacing in production:

  • The possibility of a race condition went unnoticed during our code review. Furthermore, the code changes that improved the write performance, and therefore introduced this bug, were submitted in bulk: this caused reviewer fatigue and increased the chance of missing this detail.
  • To adapt to the rapidly changing nature of the Presto open-source engine, we have developed a regression testing mechanism. For the past two years this methodology has been essential in greatly increasing our confidence that every new version of Presto we introduced produced accurate results and did not contain significant drops in performance. However, since our Presto system processes over 30 million customer queries per month, verifying 100% correctness is an impossible feat: we had to strike a balance between completeness (i.e. coverage) and time effectiveness, so we sampled down the set of queries our regression test suite runs to around 100 thousand, which still requires several days in aggregate to complete. Even though we strived to build the regression suite by picking the most complete set of unique query patterns seen in our customers' queries, as always happens when downsampling is involved, there is room for corner cases to slip through.
    Furthermore, we are learning that many performance issues can only be observed in a loaded production environment (where multi-tenancy and high query concurrency are involved) and are not easily replicable in a sterile testing environment.
    While our coverage was functionally adequate, it was insufficient to trigger an issue that occurs only when the query runs in a system under high load.

We could not detect the write inconsistency issue until May 22nd

We were unable to detect the write inconsistency of the update queries from May 7th to May 22nd (over 2 weeks), when it was discovered by a customer and reported through a support ticket. These are the main reasons:

  • We make heavy use of our product for internal purposes. A lot of our data pipelines are built on top of our data platform and Presto itself. It is customary for us to roll out new versions of the Presto engine to our internal accounts first, in a canary release fashion: this is normally helpful as it acts as an additional sanity check. In this case, however, it was unfortunately not effective at catching the write inconsistency problem because: (1) our workload ran in isolation inside an empty cluster, (2) our usage pattern did not trigger the problem in a noticeable enough manner, and (3) on top of the preceding, we had not implemented sufficient data integrity testing for our data pipelines.
  • We have a number of metrics in place to monitor the health of our production system and Presto is no exception. These metrics are arranged in dashboards in our metrics provider of choice and set up with alerting thresholds, which can end up paging our team in dire cases. One of these Presto metrics is the per-customer and per-cluster query error count.
    In normal circumstances, when a release that goes out is unequivocally faulty (notwithstanding the fact that it should be caught and stopped earlier), the increase in query errors causes several monitoring alerts to go off. In this case, however, the update queries causing write inconsistencies completed as successful, causing them to fly under the radar, undetected.
    In fairness, we should note that the update queries causing inconsistencies were a mere 0.05% of all the Presto queries run during the period, so we believe that even if errors had occurred, our monitoring would likely still have been ineffective in flagging them.
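The monitoring gap described above can be made concrete with a small sketch (the job-record fields here are hypothetical, not our actual metrics schema): an error-rate alert never fires for queries that report success, whereas a check comparing planned versus written partitions catches the silent loss.

```python
def error_rate_alert(jobs, threshold=0.01):
    """Conventional monitoring: fire only when the fraction of *failed*
    jobs exceeds a threshold."""
    errors = sum(1 for j in jobs if j["status"] == "error")
    return errors / len(jobs) > threshold

def silent_partial_writes(jobs):
    """Integrity-based check: flag jobs that reported success but wrote
    fewer partitions than they planned to."""
    return [j for j in jobs if j["status"] == "success"
            and j["partitions_written"] < j["partitions_planned"]]
```

With 1,000 successful jobs of which one silently dropped partitions, `error_rate_alert` stays quiet while `silent_partial_writes` isolates the offender, which is why integrity-based detection (discussed in the remediations below) matters.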

We silently reintroduced the Presto performance degradation

As soon as the first customer inquiries regarding the write inconsistency started coming in, we determined that our best course of action was rolling back our release, bringing the engine back to the Presto 317-based version in use before May 7th / May 10th.

These decisions are taken under a high amount of pressure as we don’t take query correctness lightly; we know it’s the foundation of any dependable data processing platform. At the moment the rollback decision was taken, our top priority was resolving the write inconsistency issue as quickly as possible, although we were conscious of the fact that the rollback would have brought back the performance limitations for some customers as a side effect.

While we maintain that the choice we made was the best one technically, it was a trade-off, which means it was not the best choice for everyone: some customers were not affected by the update queries' write inconsistency but had been affected by the previous performance limitations, and to them the recurrence of the performance degradation was unexpected, unannounced, and altogether confusing. So while our technical choices may have been sound, our communication with customers left a lot to be desired, which made this problem hurt more than it should have. We discuss customer communication in more detail in the section below.

We did not report the situation until May 26th

While we were able to identify the offending change quickly, understanding the full customer impact took several more days of analysis (May 22nd to May 26th), compounded by the increasing volume of incoming customer inquiries.

When an incident occurs, the first step, triaging, is essential to determine the blast radius and customer impact. This guides prioritization and the choice of reporting strategy. Widespread incidents are reported transparently to all customers on our Status Page, while we might resort to communicating the impact individually when only a handful of customers are affected, to avoid confusing all the others. Additionally, this is done to contain our Technical Support team's workload in responding to support inquiries that could be false positives.

In hindsight, given the gravity of the problem (data integrity issues, such as this write inconsistency, are indeed the biggest problems a data platform can suffer from), we recognize that we should have shortcut the process and taken a less conservative reporting approach: that is, reporting the issue earlier, to the benefit of the 226 affected customers and at the cost of more false positives.


Remediations

Due to the gravity of the incident and the way it came to manifest itself, we believe we need to take a comprehensive approach to remediation. We have identified both short and long term remediations, which we are eager to share transparently with all our customers, affected or not, as a testament to our commitment to improving the dependability of our data platform.

Short term Remediations

The following short term remediations are already in progress.

Data Recovery

We put together an operating procedure to recover table partitions (and all the records contained in them) that update queries failed to write or delete. This is possible in most cases, although not all partitions are recoverable, because the destination table may have already been deleted or overwritten in the meantime.

We developed a set of tools to streamline this operation and we organized an incident response team with technical Subject Matter Experts in Storage systems and Query Engines to cover both the U.S. and Japan time zones.

Our Technical Support and Customer Success teams provided this information to the affected customers where recovery is possible. In most cases, a customer-specific recovery operation has already been defined and recovery is underway and nearing completion.

Fixing the inconsistency and performance issues

We isolated the update queries’ write inconsistency problem and developed a safe fix. The fix has been applied on top of the original May 7th release, which we intend to roll forward.

To arrive at a fix, we followed this rough process:

  • Static analysis / Debugging
    We identified the exact location in the code where the write inconsistency race condition could occur. Although our testing covered these functionalities, it was not sufficient to trigger this behavior: we found the issue happens only when there is a delay in registering partition entries before transactionally committing them into our Plazma storage.
  • Reproduction
    Based on the above information, we developed a set of test cases to reproduce the issue in a testing environment. We took two different approaches: (1) brute-force testing one of the failure cases in a development (10-node) cluster for several hours until the issue occurred, and (2) failure injection in a locally running version of Presto, adding an artificial delay in two critical paths of the offending code. The former approach was time consuming but eventually allowed us to narrow down to a test case able to trigger the bug within 5 minutes. The latter approach reproduced the issue with complete accuracy, under all conditions.
  • Fixes
    A fix was developed on top of the May 7th release (which also fixed some of the performance limitations). Extreme care went into ensuring the release contained only the changes that were absolutely necessary, as a way to keep the release impact contained and reduce risk.
  • Detection
    Since the logic in the original implementation synchronously waited for all partition entries to finish writing before transactionally committing them, it had no guards in place to defend against, or even detect, potential race conditions (e.g. throw an exception, alert about the anomaly, etc.). In addition to the fix, as a failsafe measure, we implemented logic to detect if and when an unexpected commit (this specific one or a similar one) occurs while writing into our Plazma storage. When it happens, an exception is thrown, forcing the query to error out; an alert will also page our staff to flag the need for data recovery.
  • Verification
    Besides the regular regression testing suite, we put the release candidate under a stress test running CREATE TABLE AS, INSERT INTO, and DELETE FROM queries for 24 hours under the above-mentioned test case scenarios. In addition, we deployed the release candidate in all our staging environments and in production for all our internal accounts. We verified the data integrity of the internal data pipeline by confirming that none of the write attempts had left unwritten partitions behind.
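The failsafe described under "Detection" amounts to a guard at commit time. A minimal sketch, with hypothetical names (this is not the actual Plazma commit path): refuse to commit unless every intended partition entry was registered, so the query errors out loudly instead of succeeding with missing data.

```python
class PartialWriteError(RuntimeError):
    """Raised when a transactional commit would leave partitions unwritten."""

def commit_partitions(intended, registered):
    """Failsafe guard sketch: compare the set of partition entries we
    intended to write against those actually registered, and refuse to
    commit on any mismatch."""
    missing = set(intended) - set(registered)
    if missing:
        # In production this would also page on-call staff for data recovery.
        raise PartialWriteError(f"unwritten partitions: {sorted(missing)}")
    return sorted(registered)
```

The design choice is deliberate: a hard failure is recoverable (the query can be retried), while a silent partial write is not detectable after the fact without external auditing.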

Once ready, the new Presto version is expected to address some of the current performance issues as well, bringing back the parts of the May 7th changes that are proven to be safe. Because that version will have the aggressive write-parallelism optimization tuned down, we expect Presto performance to land somewhere between today's and what it was after the May 7th release, but closer to the latter.

Due to the criticality of these fixes, we will exercise additional precautions:

  • Rollout
    The changes will be rolled out in phases, a small group of accounts at a time by using the canary deployment strategy: this will reduce the risk of widespread impact in case of unforeseen issues.
  • Monitoring
    We will improve our monitoring and alerting, as well as increase the attention paid to them post release. While not all failure modes can be predicted or successfully provisioned against, keeping a watchful eye on the existing monitors and metrics for an extended amount of time (at least 24 hours) will decrease the chances of serious issues happening again.

Release freeze

As of May 29th, we stopped all non-business-critical releases. This is primarily to reduce the risk of additional incidents, but also to provide air cover to our team while they focus on data recovery and on fixing the inconsistency and performance issues.

The release freeze will remain in place until the fix has landed safely in production and all data recovery operations have completed: until then, all customer-visible, non-business-critical changes will be held off. In the event business-critical changes are required, including the release of the fix above, they will be communicated transparently through a Scheduled Maintenance notice on our Status Page or via customer-by-customer communication, depending on whether they impact all/most customers or only a handful.

Long term Remediations

The following long term remediations will be carried out over the course of the next 3 months (1 quarter).

Strict validation for partition read/write

As mentioned in the short term remediations (refer to the ‘Fixing the inconsistency and performance issues’ section), as an additional failsafe measure, we have implemented a mechanism to detect when an inconsistency happens while writing data in the Plazma storage.

As a long term remediation we plan on standardizing and implementing this approach for all Plazma clients (including, but not limited to, Hadoop/Hive, Presto, Data Connector, Streaming Ingestion, and many internal components) interacting with the Plazma storage. The detection mechanism will verify that the amount of data actually read from or written into Plazma matches what was intended. In case the verification fails, the most appropriate remedial action will be triggered: in Presto's case, this would be reverting the write and retrying the query but the best next action will vary depending on the system, component, and status.
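A rough sketch of this client-side validation, with hypothetical names (the real remedial action depends on the client, as noted above): compare the record count acknowledged by the storage layer against the intended count, revert the partial write on mismatch, and retry a bounded number of times.

```python
def validated_write(write_fn, revert_fn, records, max_retries=3):
    """Client-side validation sketch: verify that the record count
    acknowledged by the storage layer matches the count we intended
    to write; on mismatch, revert the partial write and retry, giving
    up after a bounded number of attempts."""
    for attempt in range(max_retries):
        acknowledged = write_fn(records)
        if acknowledged == len(records):
            return acknowledged          # verified: every record landed
        revert_fn()                      # undo the partial write before retrying
    raise RuntimeError(f"write validation failed after {max_retries} attempts")
```

Centralizing this check in the Plazma Metadata API (as the text below describes) would let every client inherit the same verification without reimplementing it.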

This approach is enabled by the ongoing effort to migrate all Plazma clients to use the new Plazma Metadata API, as it provides us with a way to centrally implement the core of the validation logic.

Data integrity validation during canary release

To detect data integrity issues as early as possible, we have implemented strict data integrity validation. In addition, we plan on automating the monitoring and alerting for the integrity of our internal data pipeline.

As we already routinely roll out Presto release candidates to our internal accounts first (where our internal data pipeline is implemented), we plan on leveraging the additional validation, monitoring, and alerting as an additional gatekeeper for future Presto releases.

Although our internal data pipeline was also affected by the write inconsistency issues, causing internal reporting errors, we were able to recover fully, avoiding heavy downstream effects. This was made possible by the way our data pipeline workflow is structured and by the checkpointing practices we have put in place.

While this approach may not suit or be easily retrofitted to all use cases, we believe it could benefit others and we intend to share our expertise and best practices more publicly going forward.

Internal Processes and Procedures Improvements

One of the key factors that led to the delay with which the incident was communicated (refer to the 'We did not report the situation until May 26th' section) was the inefficiency of internal processes and procedures across the various departments, functions, and stakeholders.

As the organization has grown considerably over the past year, we now realize that many of the communication protocols and processes we relied upon, which worked well before, did not scale.

We will take these learnings as an opportunity to revamp our internal processes and procedures: this will start by remapping the roles and responsibilities all stakeholders are expected to have in the event of an incident, and by explicitly documenting and setting expectations around the interactions that need to occur and their outcomes.

As a fallback plan, we will also institute a virtual team, formed by the people best equipped to take on an Incident Commander role, to provide their expertise in improving our incident response training program and to uplevel the rest of the organization to handle incidents of higher severity and complexity.

Production-scale workload-ready Query Simulator

As became evident from this incident, our Presto Query Simulator, the regression test suite we run before every release, is effective at catching functionality and significant performance regressions, but not subtle and/or sporadic ones.

The Query Simulator re-executes customer queries against their own data; this is done in a secure, dedicated production environment to prevent the simulated queries from affecting customer query performance. One of the biggest limitations of the simulator is its inability to precisely validate query result correctness, because the customer datasets are ever-changing.

While 100% test coverage is not attainable, we plan on improving the Query Simulator to increase the likelihood of catching these categories of issues, improve query execution repeatability, and raise the overall dependability of the test suite. This will be accomplished by:

  • Managing generations of datasets and their table schemas: this will ensure test queries run against stable datasets and produce stable results, allowing us to improve the accuracy of query result checksums.
    This will also enable us to expand our verification to entire workflows, rather than being limited to individual queries.
    In addition, we expect that managing these dataset generations will prove useful for recovering tables that end up unintentionally deleted by user queries.
  • Introducing an interactive query testing capability so that we can test query correctness and performance more easily and frequently for individual customers. We run the query simulations in a secure environment, and employees will not directly see customer data or query results: correctness is verified through checksums.
  • Continuously executing the Query Simulator against all active release candidates. By pipelining test executions we will speed up the test cycle, allowing us to better evaluate each release candidate.
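The checksum idea mentioned in the first bullet can be sketched as follows (an illustrative approach, not our actual implementation): with a pinned dataset generation, an order-insensitive checksum of a result set lets a release candidate's output be compared against a known-good baseline without anyone looking at the rows themselves.

```python
import hashlib

def result_checksum(rows):
    """Order-insensitive checksum of a query result set: rows are
    serialized, sorted so row ordering does not affect the digest,
    and folded into a single SHA-256 hash."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()
```

Sorting the serialized rows is what makes the comparison robust to nondeterministic result ordering, which SQL permits unless an ORDER BY is present.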


We regret that this incident may have negatively impacted our customers' data pipelines and therefore affected their business. We know that, by building your data pipeline on our platform, you put a lot of trust in our company and our product, and we are determined to do right by you, improving it to become as dependable as it should be.

Please feel free to reach out to our Support team through support@treasuredata.com if you have any questions.

Posted Jun 03, 2020 - 12:32 PDT

This is a backfill for the Presto incident that occurred between May 7th and May 22nd, 2020, concerning some Presto CREATE TABLE AS / INSERT INTO / DELETE FROM queries being unable to write some of the partitions.

This incident post is a placeholder created to allow us to publish a postmortem immediately after.
Posted May 22, 2020 - 00:00 PDT