[All Regions] Some Presto CREATE TABLE AS / INSERT INTO / DELETE FROM queries failed to write some of the partitions
Incident Report for Treasure Data


During the period between May 7th and May 22nd, our Presto service suffered from a bug that caused some queries containing CREATE TABLE AS SELECT, INSERT INTO, and/or DELETE FROM statements (from here on referred to as update queries) to produce inconsistent results.

The inconsistency was caused by a subset of the update queries failing to write or delete some of the records into/from the Plazma storage.

For all the customers whose Presto update queries were affected by this problem, our Technical Support and Customer Success teams have already reached out to the account owners one by one to report the issue and provide information about which query jobs and tables were affected. In most (but not all) failure cases, the missing records could be recovered, in which case our team would have provided that information as well.

Revision notes

This postmortem was initially published at this link on June 3rd, 2020 without long term remediations.
An updated version was published on June 22nd, 2020 adding the missing information.


On May 7th, at around 01:35 AM PDT (17:35 JST, 10:35 CEST), we released a version of Presto, based on open-source Presto version 317, that addressed the performance limitation issues that some customers had been experiencing.

On May 10th, at 21:25 PDT (May 11th, at 13:25 JST and 06:25 CEST), the same version of Presto was also rolled out to the US and Tokyo regions simultaneously, thus completing the release cycle.

On May 21st, at 21:45 PDT (May 22nd, at 13:45 JST, 06:45 CEST), we received the first customer inquiry reporting that, after executing a CREATE TABLE AS query against a well-known, stable dataset (one or more tables), the resulting destination table contained fewer records than expected. Over the next few hours, our team investigated the problem and identified that the update query had failed to write a subset of the table partitions (each containing one or more records). The missed writes were due to a race condition that caused one of the stages of the update query to assume the write operation had completed when it had not: this in turn caused the overall query to complete successfully, prematurely aborting the remaining pending write operations and resulting in fewer partitions and records being written.
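The failure mode described above can be illustrated with a toy model (this is not Presto's actual code; all names here are hypothetical, and the "storage" is a plain list): partition writes run in parallel, and in "buggy" mode the coordinator declares the query complete before the in-flight writes finish, aborting them.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_update_query(partitions, buggy, write_delay=0.05):
    """Toy model of the race: partition writes run in parallel; in 'buggy'
    mode the coordinator assumes completion as soon as tasks are submitted,
    so writes still in flight are aborted and never land."""
    written = []
    abort = threading.Event()

    def write_partition(p):
        time.sleep(write_delay)          # simulated storage latency
        if not abort.is_set():           # aborted writes never land
            written.append(p)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(write_partition, p) for p in partitions]
        if buggy:
            abort.set()                  # race: success declared too early
        else:
            wait(futures)                # correct: block until every write lands
    return sorted(written)
```

In buggy mode the "query" still appears to succeed, yet the destination ends up with fewer partitions than intended, which mirrors the symptom customers observed.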

The offending change was identified and, in order to favor speed of resolution, we decided to roll back the entire release: the operation was completed by May 22nd at 00:25 PDT (16:25 JST, 09:25 CEST) in all 3 regions.

By May 26th, we had a complete picture of all of the affected queries and customers. Our Technical Support and Customer Success teams had notified customers about the issue, providing a detailed list of the affected update query jobs and tables.

The rollback reintroduced the same performance limitations that had previously been affecting some customers and some of their queries: as the previous release went back in, some customers' query processing times began to increase significantly and our Technical Support team started to receive additional customer inquiries.

As of the publishing of this postmortem, the performance limitations are still present for most affected customers. A release rolling forward the original performance improvements, alongside an additional fix for the write inconsistency bug, has been prepared and is currently under evaluation.

Rolling forward this release is also important as it serves as a baseline to implement further performance improvements which will surely be necessary.

Impact to customers

We have identified that during the May 7th to May 22nd period 226 customer accounts across the US, Tokyo, and EU regions were affected by this issue.

In total, 5,609 update query jobs and 5,338 tables were affected. Of the over 5 thousand affected update queries (0.13% of all update queries run on Presto during the period and 0.05% of all Presto queries), around 3,100 were INSERT INTO, 2,400 were CREATE TABLE AS, and only around 100 were DELETE FROM.

Customers whose update queries were affected by this issue experienced these symptoms:

  • CREATE TABLE AS and INSERT INTO queries
    The queries failed to write a part of the expected records and the destination tables were found to contain fewer records than expected.
  • DELETE FROM queries
    The queries failed to delete a part of the expected records and the target tables were found to contain more records than expected.

When these queries were part of larger ETL workflows (through TD Workflow, Digdag, or other orchestration methods), they could also affect the correctness of downstream queries, producing inconsistent results not only in intermediate tables but across the overall workflow output.

The same may hold true for workflows tied to Audience Master Segments and/or Segments: inconsistent results from update queries that are part of these workflows could affect the end result of Segmentation and Activations, with the most likely scenario being that Segmentation returned fewer profiles than expected. In the case of Audience workflows, however, since the workflow ETL relies on intermediate, temporary tables, these issues were largely self-remediating and should have disappeared after a full refresh of the Master Segment, the Segment, or both.

What went wrong

We could not detect the bug before production rollout

The May 7th release was motivated by an attempt to resolve some performance limitations that, for a handful of customers and their specific use cases, had been found in the previous release. A number of these issues were found in our Presto extensions: in particular, in the code used to integrate Presto with our proprietary storage system, Plazma. The performance bottleneck was caused by high heap memory usage when writing table partitions into our storage; the improvement consisted in parallelizing the writing process, but our implementation pushed the optimization too aggressively, introducing the possibility of a race condition triggered under specific circumstances. These are the main factors that contributed to this issue surfacing in production:

  • The possibility of a race condition went unnoticed during our code review. Furthermore, the code changes that improved the write performance, and therefore introduced this bug, were submitted in bulk: this caused reviewer fatigue and increased the chance of missing this detail.
  • To adapt to the rapidly changing nature of the Presto open-source engine, we have developed a regression testing mechanism. For the past two years this methodology has been essential in greatly increasing our confidence that every new version of Presto we introduced produced accurate results and did not contain significant drops in performance. However, since our Presto system processes over 30 million customer queries per month, verifying 100% correctness is an impossible feat: we had to strike a balance between completeness (i.e. coverage) and time effectiveness, so we sampled down the set of queries our regression test suite runs to around 100 thousand, which still requires several days in aggregate to complete. Even though we strived to build the regression suite by picking the most complete set of unique query patterns seen in our customers' queries, as always happens when downsampling is involved, there is room for corner cases to slip through.
    Furthermore, we are learning that many performance issues can only be observed in a loaded production environment (where multi-tenancy and high query concurrency are involved) and are not easily replicable in a sterile testing environment.
    While our coverage was functionally adequate, it was insufficient to trigger an issue that occurs only when the query runs in a system under high load.

We could not detect the write inconsistency issue until May 22nd

We were unable to detect the write inconsistency of the update queries from May 7th to May 22nd (over 2 weeks), when it was discovered by a customer and reported through a support ticket. These are the main reasons:

  • We make heavy use of our product for internal purposes. A lot of our data pipelines are built on top of our data platform and Presto itself. It is customary for us to roll out new versions of the Presto engine to our internal accounts first, in a canary release fashion: this is normally helpful as it acts as an additional sanity check. In this case, however, it was unfortunately not effective at catching the write inconsistency problem because: (1) our workload ran in isolation inside an empty cluster, (2) our usage pattern did not trigger the problem in a noticeable enough manner, and (3) on top of the preceding, we had not implemented sufficient data integrity testing for our data pipelines.
  • We have a number of metrics in place to monitor the health of our production system and Presto is no exception. These metrics are arranged in dashboards in our metrics provider of choice and set up with alerting thresholds, which can end up paging our team in dire cases. One of these Presto metrics is the per-customer and per-cluster query error count.
    In normal circumstances, when a release that goes out is unequivocally faulty (notwithstanding the fact that it should be caught and stopped earlier), the increase in query errors causes several monitoring alerts to go off. In this case, however, the update queries causing write inconsistencies completed as successful, causing them to fly under the radar, undetected.
    In fairness, we should note that the update queries causing inconsistencies were a mere 0.05% of all the Presto queries run during the period, so we believe that even if errors had occurred, our monitoring would likely still have been ineffective in flagging them.
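The monitoring gap described above can be made concrete with a small sketch (the job-record fields here are hypothetical, not our actual metrics schema): an error-rate alert never fires for queries that report success, whereas a check comparing planned versus written partitions catches the silent loss.

```python
def error_rate_alert(jobs, threshold=0.01):
    """Conventional monitoring: fire only when the fraction of *failed*
    jobs exceeds a threshold."""
    errors = sum(1 for j in jobs if j["status"] == "error")
    return errors / len(jobs) > threshold

def silent_partial_writes(jobs):
    """Integrity-based check: flag jobs that reported success but wrote
    fewer partitions than they planned to."""
    return [j for j in jobs if j["status"] == "success"
            and j["partitions_written"] < j["partitions_planned"]]
```

With 1,000 successful jobs of which one silently dropped partitions, `error_rate_alert` stays quiet while `silent_partial_writes` isolates the offender, which is why integrity-based detection (discussed in the remediations below) matters.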

We silently reintroduced the Presto performance degradation

As soon as the first customer inquiries regarding the write inconsistency started coming in, we determined that our best course of action was rolling back our release, bringing the engine back to the Presto 317-based version in use before May 7th / May 10th.

These decisions are taken under a high amount of pressure as we don’t take query correctness lightly; we know it’s the foundation of any dependable data processing platform. At the moment the rollback decision was taken, our top priority was resolving the write inconsistency issue as quickly as possible, although we were conscious of the fact that the rollback would have brought back the performance limitations for some customers as a side effect.

While we maintain that the choice we made was the best one technically, it was a trade-off, which means it was not the best choice for everyone: some customers were not affected by the update queries' write inconsistency but had been affected by the previous performance limitations, and to them the recurrence of the performance degradation was unexpected, unannounced, and altogether confusing. So while our technical choices may have been sound, our communication with customers left a lot to be desired, which made this problem hurt more than it should have. We discuss customer communication in more detail in the section below.

We did not report the situation until May 26th

While we were able to identify the offending change quickly, understanding the full customer impact took several more days of analysis (May 22nd to May 26th), compounded by the increasing volume of incoming customer inquiries.

When an incident occurs, the first step, triaging, is essential to determine the blast radius and customer impact. This guides prioritization and the choice of reporting strategy. Widespread incidents are reported transparently to all customers on our Status Page, while we might resort to communicating the impact individually when only a handful of customers are affected, to avoid confusing all the others. Additionally, this is done to contain our Technical Support team's workload in responding to support inquiries that could be false positives.

In hindsight, given the gravity of the problem (data integrity issues, such as this write inconsistency, are indeed the biggest problems a data platform can suffer from), we recognize that we should have shortcut the process and taken a less conservative reporting approach: that is, reporting the issue earlier, to the benefit of the 226 affected customers and at the cost of more false positives.


Remediations

Due to the gravity of the incident and the way it came to manifest itself, we believe we need to take a comprehensive approach to remediation. We have identified both short and long term remediations, which we are eager to share transparently with all our customers, affected or not, as a testament to our commitment to improving the dependability of our data platform.

Short term Remediations

The following short term remediations are already in progress.

Data Recovery

We put together an operating procedure to recover table partitions (and all the records contained in them) that update queries failed to write or delete. This is possible in most cases, although not all partitions are recoverable, because the destination table may have already been deleted or overwritten in the meantime.

We developed a set of tools to streamline this operation and we organized an incident response team with technical Subject Matter Experts in Storage systems and Query Engines to cover both the U.S. and Japan time zones.

Our Technical Support and Customer Success teams provided this information to the affected customers where recovery is possible. In most cases, a customer-specific recovery operation has already been defined and recovery is underway and nearing completion.

Fixing the inconsistency and performance issues

We isolated the update queries’ write inconsistency problem and developed a safe fix. The fix has been applied on top of the original May 7th release, which we intend to roll forward.

To arrive at a fix, we followed this rough process:

  • Static analysis / Debugging
    We identified the exact location in the code where the write inconsistency race condition could occur. Although our testing covered these functionalities, it was not sufficient to trigger this behavior: we found the issue happens only when there is a delay in registering partition entries before transactionally committing them into our Plazma storage.
  • Reproduction
    Based on the above information, we developed a set of test cases to reproduce the issue in a testing environment. We took two different approaches: (1) brute-force testing one of the failure cases in a development (10-node) cluster for several hours until the issue occurred, and (2) failure injection in a locally running version of Presto, adding an artificial delay in two critical paths of the offending code. The former approach was time consuming but eventually allowed us to narrow down to a test case able to trigger the bug within 5 minutes. The latter approach reproduced the issue with complete accuracy, under all conditions.
  • Fixes
    A fix was developed on top of the May 7th release (which also fixed some of the performance limitations). Extreme care went into ensuring the release contained only the changes that were absolutely necessary, as a way to keep the release impact contained and reduce risk.
  • Detection
    Since the logic in the original implementation synchronously waited for all partition entries to finish writing before transactionally committing them, it had no guards in place to defend against, or even detect, potential race conditions (e.g. throw an exception, alert about the anomaly, etc.). In addition to the fix, as a failsafe measure, we implemented logic to detect if and when an unexpected commit (this specific one or a similar one) occurs while writing into our Plazma storage. When it happens, an exception is thrown, forcing the query to error out; an alert will also page our staff to flag the need for data recovery.
  • Verification
    Besides the regular regression testing suite, we put the release candidate under a stress test running CREATE TABLE AS, INSERT INTO, and DELETE FROM queries for 24 hours under the above-mentioned test case scenarios. In addition, we deployed the release candidate in all our staging environments and in production for all our internal accounts. We verified the data integrity of the internal data pipeline by confirming that none of the write attempts had left unwritten partitions behind.
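The failsafe described under "Detection" amounts to a guard at commit time. A minimal sketch, with hypothetical names (this is not the actual Plazma commit path): refuse to commit unless every intended partition entry was registered, so the query errors out loudly instead of succeeding with missing data.

```python
class PartialWriteError(RuntimeError):
    """Raised when a transactional commit would leave partitions unwritten."""

def commit_partitions(intended, registered):
    """Failsafe guard sketch: compare the set of partition entries we
    intended to write against those actually registered, and refuse to
    commit on any mismatch."""
    missing = set(intended) - set(registered)
    if missing:
        # In production this would also page on-call staff for data recovery.
        raise PartialWriteError(f"unwritten partitions: {sorted(missing)}")
    return sorted(registered)
```

The design choice is deliberate: a hard failure is recoverable (the query can be retried), while a silent partial write is not detectable after the fact without external auditing.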

Once ready, the new Presto version is expected to address some of the current performance issues as well, bringing back the parts of the May 7th changes that are proven to be safe. Because that version will have the aggressive write-parallelism optimization tuned down, we expect Presto performance to land somewhere between today's and what it was after the May 7th release, but closer to the latter.

Due to the criticality of these fixes, we will exercise additional precautions:

  • Rollout
    The changes will be rolled out in phases, a small group of accounts at a time by using the canary deployment strategy: this will reduce the risk of widespread impact in case of unforeseen issues.
  • Monitoring
    We will improve our monitoring and alerting, as well as increase the attention paid to them post release. While not all failure modes can be predicted or successfully provisioned against, keeping a watchful eye on the existing monitors and metrics for an extended amount of time (at least 24 hours) will decrease the chances of serious issues happening again.

Release freeze

As of May 29th, we stopped all non-business-critical releases. This is primarily to reduce the risk of additional incidents, but also to provide air cover to our team while they focus on data recovery and on fixing the inconsistency and performance issues.

The release freeze will remain in place until the fix has landed safely in production and all data recovery operations have completed: until then, all customer-visible, non-business-critical changes will be held off. In the event business-critical changes are required, including the release of the fix above, they will be communicated transparently through a Scheduled Maintenance notice on our Status Page or via customer-by-customer communication, depending on whether they impact all/most customers or only a handful.

Long term Remediations

The following long term remediations will be carried out over the course of the next 3 months (1 quarter).

Strict validation for partition read/write

As mentioned in the short term remediations (refer to the ‘Fixing the inconsistency and performance issues’ section), as an additional failsafe measure, we have implemented a mechanism to detect when an inconsistency happens while writing data in the Plazma storage.

As a long term remediation we plan on standardizing and implementing this approach for all Plazma clients (including, but not limited to, Hadoop/Hive, Presto, Data Connector, Streaming Ingestion, and many internal components) interacting with the Plazma storage. The detection mechanism will verify that the amount of data actually read from or written into Plazma matches what was intended. In case the verification fails, the most appropriate remedial action will be triggered: in Presto's case, this would be reverting the write and retrying the query but the best next action will vary depending on the system, component, and status.
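A rough sketch of this client-side validation, with hypothetical names (the real remedial action depends on the client, as noted above): compare the record count acknowledged by the storage layer against the intended count, revert the partial write on mismatch, and retry a bounded number of times.

```python
def validated_write(write_fn, revert_fn, records, max_retries=3):
    """Client-side validation sketch: verify that the record count
    acknowledged by the storage layer matches the count we intended
    to write; on mismatch, revert the partial write and retry, giving
    up after a bounded number of attempts."""
    for attempt in range(max_retries):
        acknowledged = write_fn(records)
        if acknowledged == len(records):
            return acknowledged          # verified: every record landed
        revert_fn()                      # undo the partial write before retrying
    raise RuntimeError(f"write validation failed after {max_retries} attempts")
```

Centralizing this check in the Plazma Metadata API (as the text below describes) would let every client inherit the same verification without reimplementing it.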

This approach is enabled by the ongoing effort to migrate all Plazma clients to use the new Plazma Metadata API, as it provides us with a way to centrally implement the core of the validation logic.

Data integrity validation during canary release

To detect data integrity issues as early as possible, we have implemented strict data integrity validation. In addition, we plan on automating the monitoring and alerting for the integrity of our internal data pipeline.

As we already routinely roll out Presto release candidates to our internal accounts first (where our internal data pipeline is implemented), we plan on leveraging the additional validation, monitoring, and alerting as an additional gatekeeper for future Presto releases.

Although our internal data pipeline was also affected by the write inconsistency issues, causing internal reporting errors, we were able to recover fully, avoiding heavy downstream effects. This was made possible by the way our data pipeline workflow is structured and by the checkpointing practices we have put in place.

While this approach may not suit or be easily retrofitted to all use cases, we believe it could benefit others and we intend to share our expertise and best practices more publicly going forward.

Internal Processes and Procedures Improvements

One of the key factors that led to the delay with which the incident was communicated (refer to the 'We did not report the situation until May 26th' section) was the inefficiency of internal processes and procedures across the various departments, functions, and stakeholders.

As the organization has grown considerably over the past year, we now realize that many of the communication protocols and processes we relied upon, which worked well before, did not scale.

We will take these learnings as an opportunity to revamp our internal processes and procedures: this will start by remapping the roles and responsibilities all stakeholders are expected to have in the event of an incident, and by explicitly documenting and setting expectations around the interactions that need to occur and their outcomes.

As a fallback plan, we will also institute a virtual team, formed by the people best equipped to take on an Incident Commander role, to provide their expertise in improving our incident response training program and to uplevel the rest of the organization to handle incidents of higher severity and complexity.

Production-scale workload-ready Query Simulator

As became evident from this incident, our Presto Query Simulator, the regression test suite we run before every release, is effective at catching functionality and significant performance regressions, but not subtle and/or sporadic ones.

The Query Simulator re-executes customer queries against their own data; this is done in a secure, dedicated production environment to prevent the simulated queries from affecting customer query performance. One of the biggest limitations of the simulator is its inability to precisely validate query result correctness, because the customer datasets are ever-changing.

While 100% test coverage is not attainable, we plan on improving the Query Simulator to increase the likelihood of catching these categories of issues, improve query execution repeatability, and raise the overall dependability of the test suite. This will be accomplished by:

  • Managing generations of datasets and their table schemas: this will ensure test queries run against stable datasets and produce stable results, allowing us to improve the accuracy of query result checksums.
    This will also enable us to expand our verification to entire workflows, rather than being limited to individual queries.
    In addition, we expect that managing these dataset generations will prove useful for recovering tables that end up unintentionally deleted by user queries.
  • Introducing an interactive query testing capability so that we can test query correctness and performance more easily and frequently for individual customers. We run the query simulations in a secure environment, and employees will not directly see customer data or query results: correctness is verified through checksums.
  • Continuously executing the Query Simulator against all active release candidates. By pipelining test executions we will speed up the test cycle, allowing us to better evaluate each release candidate.
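The checksum idea mentioned in the first bullet can be sketched as follows (an illustrative approach, not our actual implementation): with a pinned dataset generation, an order-insensitive checksum of a result set lets a release candidate's output be compared against a known-good baseline without anyone looking at the rows themselves.

```python
import hashlib

def result_checksum(rows):
    """Order-insensitive checksum of a query result set: rows are
    serialized, sorted so row ordering does not affect the digest,
    and folded into a single SHA-256 hash."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()
```

Sorting the serialized rows is what makes the comparison robust to nondeterministic result ordering, which SQL permits unless an ORDER BY is present.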


We regret that this incident may have negatively impacted our customers' data pipelines and therefore affected their business. We know that, by building your data pipeline on our platform, you put a lot of trust in our company and our product, and we are determined to do right by you, improving it to become as dependable as it should be.

Please feel free to reach out to our Support team through support@treasuredata.com if you have any questions.

Posted Jun 03, 2020 - 12:32 PDT

This is a backfill for the Presto incident that occurred between May 7th and May 22nd, 2020, concerning some Presto CREATE TABLE AS / INSERT INTO / DELETE FROM queries being unable to write some of the partitions.

This incident post is a placeholder created to allow us to publish a postmortem immediately after.
Posted May 22, 2020 - 00:00 PDT