Presto performance improvement release
Scheduled Maintenance Report for Treasure Data
Completed
After the release of the Presto performance improvements, we monitored the system for 2 hours and observed no anomalies.

We will now mark this scheduled release as closed, but we will continue to monitor the system closely for the next 3 days.
Posted Jul 20, 2020 - 21:23 PDT
Verifying
The scheduled Presto performance improvement release is complete.

We are monitoring the system closely to ensure all systems successfully complete their *recovery* and return as quickly as possible to full functionality.
Posted Jul 20, 2020 - 19:32 PDT
In progress
The scheduled Presto performance improvement release is starting now.

The release should have no impact on running queries, which will transparently be transferred to the new Presto query engine version.
Posted Jul 20, 2020 - 19:15 PDT
Scheduled
We will release a hotfix for the Presto query engine. No customer-visible impact is expected. This is the same release we did on July 6th PDT for the first set of customers (https://manage.statuspage.io/pages/m7f9cnfgfvdp/incidents/h6zrklqq9x9l). Today we are releasing it for the second set of customers. The third and final release is planned for July 27th PDT. The details of the release follow.

The release rolls forward the changes that were introduced on May 7th to address the performance degradation experienced by queries in specific scenarios.

The May 7th release was rolled back on May 22nd because it had a bug causing sporadic write inconsistencies in CREATE TABLE AS, INSERT INTO, and DELETE FROM queries. This new version, based on the May 7th release, contains an additional fix for the write inconsistency issue.

Both the performance degradation and the write inconsistency issues are described in detail in the postmortem at https://status.treasuredata.com/incidents/mrnh2jc0kmqb.

The release should have no impact on running queries, which will transparently be transferred to the new Presto query engine version.

As communicated in the last postmortem, because the Presto write inconsistency affected the data integrity of the platform, we have taken the following additional precautions:

* Remediation
We removed the code responsible for the aggressive write optimization that caused the write inconsistency bug.

* Verification
We reproduced the write inconsistency issue and built a reliable set of tests to confirm the fix is effective.

* Detection
We implemented additional application logic to detect possible race conditions when writing into our Plazma storage. Should a race condition ever occur (not expected), the logic will raise an exception, forcing the query to error out, and send an alert to our staff. Upon receiving an alert, our staff will investigate the situation and, when warranted, reach out to the customer to recommend data recovery.

* Monitoring
Our Presto monitoring and alerting were improved. Our team will follow an on-call duty rotation to monitor the health of the system after the release for an extended period (96 hours or 3 days) and catch any anomalies.
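
For readers who want a concrete picture of the Detection precaution above, the following is a minimal sketch of one common way to detect a write race: a version check at commit time that raises an exception and fires an alert on mismatch. It is an illustration only, not the actual Plazma or Presto implementation. The names `VersionedStore`, `WriteRaceDetected`, `begin`, `commit`, and the `alert` hook are all hypothetical.

```python
import threading


class WriteRaceDetected(Exception):
    """Raised when a write finds that the partition changed underneath it."""


class VersionedStore:
    """Toy in-memory stand-in for a storage layer (hypothetical, not Plazma's API)."""

    def __init__(self, alert):
        self._lock = threading.Lock()
        self._versions = {}  # partition name -> version counter
        self._data = {}      # partition name -> committed rows
        self._alert = alert  # hypothetical hook that notifies on-call staff

    def begin(self, partition):
        """Return the version the writer observed when it started."""
        with self._lock:
            return self._versions.get(partition, 0)

    def commit(self, partition, seen_version, rows):
        """Commit rows only if no other writer committed since begin()."""
        with self._lock:
            current = self._versions.get(partition, 0)
            if current != seen_version:
                # Another writer got in first: alert staff and fail the query.
                self._alert(
                    f"race on {partition}: saw v{seen_version}, now v{current}"
                )
                raise WriteRaceDetected(partition)
            self._data[partition] = list(rows)
            self._versions[partition] = current + 1


# Two writers start from the same version; only the first commit succeeds.
alerts = []
store = VersionedStore(alert=alerts.append)

v_a = store.begin("t1")
v_b = store.begin("t1")
store.commit("t1", v_a, ["row-a"])
try:
    store.commit("t1", v_b, ["row-b"])
    outcome = "committed"
except WriteRaceDetected:
    outcome = "failed"
```

In this sketch the second writer errors out instead of silently overwriting data, and the alert hook records one message for staff to investigate.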

# Communication

Beyond this notice, we will provide updates at the start and completion of the operation and once the verification of the new system is completed: at that time, all systems will have returned to full functionality and this Scheduled Maintenance will be closed.

If you have any questions or concerns about this maintenance, please feel free to reach out to our Support team at support@treasure-data.com.
Posted Jul 20, 2020 - 16:05 PDT
This scheduled maintenance affected: Tokyo (Presto Query Engine) and US (Presto Query Engine).