We will release a hotfix for the Presto query engine. No customer-visible impact is expected. This is the same hotfix we released on July 6th PDT for the first set of customers (https://manage.statuspage.io/pages/m7f9cnfgfvdp/incidents/h6zrklqq9x9l) and on July 20th for the second set of customers (https://manage.statuspage.io/pages/m7f9cnfgfvdp/incidents/9hwmbwmcrkg1). Today, we are releasing the hotfix for the third and final set of customers. The details of the release follow.
The release rolls forward the changes introduced on May 7th to address the performance degradation that queries experienced in specific scenarios.
The May 7th release was rolled back on May 22nd because it had a bug causing sporadic write inconsistencies in CREATE TABLE AS, INSERT INTO, and DELETE FROM queries. This new version, based on the May 7th release, contains an additional fix for the write inconsistency issue.
Both the performance degradation and the write inconsistency issues are described in detail in the postmortem at https://status.treasuredata.com/incidents/mrnh2jc0kmqb
The release should have no impact on running queries, which will transparently be transferred to the new Presto query engine version.
As communicated in the last postmortem, because the Presto write inconsistency affected the data integrity of the platform, we have taken the following additional precautions:
We removed the code responsible for the aggressive write optimization that caused the write inconsistency bug.
We reproduced the write inconsistency issue and built a reliable set of tests to confirm that the fix is effective.
We implemented additional application logic to detect potential race conditions when writing into our Plazma storage. Should a race condition ever occur (not expected), the logic will raise an exception, forcing the query to error out, and send an alert to our staff. Upon receiving an alert, our staff will investigate the situation and, when warranted, reach out to the customer to recommend data recovery.
We improved our Presto monitoring and alerting. Our team will follow an on-call rotation to monitor the health of the system for an extended period after the release (96 hours or 3 days) and catch any anomalies.
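For readers who want a concrete picture of the race detection described above, the following is a minimal sketch of one common way to implement such a check: a version comparison at commit time. It is an illustration only and does not represent the actual Presto or Plazma code. The names VersionedTable, WriteConflictError, commit, and the alert callback are all hypothetical.

```python
import threading


class WriteConflictError(Exception):
    """Raised when another writer committed since this writer read the table."""


class VersionedTable:
    """Hypothetical in-memory stand-in for a table in a storage layer."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.rows = []

    def commit(self, expected_version, new_rows, alert):
        """Apply new_rows only if no other commit happened in the meantime.

        expected_version is the version the writer observed when it started.
        alert is a callable that notifies operations staff (assumed here).
        """
        with self._lock:
            if self.version != expected_version:
                # A concurrent write slipped in: alert staff and fail the query
                # instead of silently overwriting or losing data.
                alert(
                    f"write race detected: expected v{expected_version}, "
                    f"found v{self.version}"
                )
                raise WriteConflictError("concurrent write detected; query aborted")
            self.rows = list(new_rows)
            self.version += 1
            return self.version
```

In this sketch, a writer that started from a stale version never modifies the table: the commit raises an exception so the query errors out, and the alert callback gives staff a record to investigate.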
Beyond this notice, we will provide updates at the start and completion of the operation, and again once verification of the new system is complete. At that time, all systems will have returned to full functionality and this Scheduled Maintenance will be closed.
If you have any questions or concerns about this maintenance, please feel free to reach out to our Support team at email@example.com