Database Upgrades
Scheduled Maintenance Report for Treasure Data
Postmortem

Summary

On October 23rd, 2017 starting at around 6:10 PM PDT (October 24th, 2017 starting at 10:10 AM JST), we performed an upgrade of both our API and backend Import task queue databases as planned.

Following the upgrade process, which took approximately 10 minutes as initially planned, we observed that our API database read latency (the time it takes to retrieve a record) was severely impaired. The resulting increase in database load heavily affected the performance of our API and manifested itself as spotty reachability, request timeouts, and various kinds of request failures.

The symptoms of the heavy database latency persisted until around 8:00 PM PDT (12:00 PM JST), when we deployed an emergency workaround suspending a minor part of our notification service. This notification service was found to be the primary cause of the increased database load.

After that, the load on the database subsided drastically and the API gradually resumed normal operation, reaching its steady state at around 8:40 PM PDT (12:40 PM JST), at which point the incident was closed.

Impact on Customers

Customers may have observed the following issues:

  1. Sporadic failures in Web Console login / logout / access.
  2. Sporadic failures in CLI access.
  3. Sporadic failures in REST API access.
  4. Sporadic failures in executing CREATE TABLE queries with Presto.
  5. Incorrect INSERT INTO query job result status: the jobs completed but the system failed to update the job status to match (success/failure).
  6. Incorrect Bulk Import job result status set to ‘killed’: the jobs completed successfully but, due to a race condition, their status was updated to ‘killed’.

Required Customer Actions

Rerun any CREATE TABLE Presto query that may have failed.

No other action has been identified as necessary at the moment.

Detailed Explanation

The database upgrade planned for October 23rd PDT was both mandatory (mandated by the infrastructure provider) and necessary to keep up to date with the latest improvements, fixes, and security enhancements.

We took great care in planning the upgrade process to minimize the API unavailability exposed to customers.

At 6:10 PM PDT the upgrade of both databases was started in parallel.

  • The upgrade process for the backend Import task queue database was straightforward and in fact went smoothly, taking around 10 minutes and requiring practically no intervention.
  • Because of the critical role of the API database, its upgrade process was more complex and consisted of upgrading the read replica, promoting it, and swapping it with the master node (a simplified sketch of this flow follows this list).
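For readers curious about the mechanics, the sketch below illustrates what such a replica-upgrade-and-promote flow can look like on a managed MySQL service along the lines of Amazon RDS. The report only refers to an ‘infrastructure provider’, so the service, the boto3 calls, and the instance identifier are assumptions for illustration, not our exact procedure.

```python
# Hypothetical sketch of the replica-upgrade / promote / swap flow, assuming a
# managed MySQL service along the lines of Amazon RDS (the report only says
# "infrastructure provider"). Instance identifiers are placeholders.
import boto3

rds = boto3.client("rds")
REPLICA = "api-db-replica"  # assumed identifier of the read replica

# 1. Upgrade the read replica to the target engine version.
rds.modify_db_instance(DBInstanceIdentifier=REPLICA,
                       EngineVersion="5.7.19",
                       ApplyImmediately=True)
# ... poll rds.describe_db_instances(DBInstanceIdentifier=REPLICA) until the
#     upgrade has completed and the instance is 'available' again ...

# 2. Promote the upgraded replica to a standalone, writable instance.
rds.promote_read_replica(DBInstanceIdentifier=REPLICA)

# 3. Swap: repoint the application at the promoted instance (typically a DNS
#    or configuration change, performed outside this sketch).
```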

Following the upgrade of the API database, the instance immediately started to exhibit very high read latency when accessing its backing storage volume, up to 10 times higher than normal (from roughly 3 milliseconds to 30 milliseconds). The high read latency increased the pressure on the database, causing the processing capacity of the API to saturate. These issues manifested themselves to customers as spotty reachability, request timeouts, and various kinds of request failures.

The elevated read latency was due to the database instance incurring what is known as a ‘First Touch Penalty’, that is, the overhead of retrieving storage blocks from the attached remote storage volume. The penalty is by design: storage blocks are not prefetched and are read from the remote volume only when first requested. In traditional terms, the database needed to be ‘warmed’.
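For illustration only, here is a minimal sketch of one common warming approach: sequentially scanning the most frequently accessed tables so that their storage blocks are pulled from the remote volume before production traffic needs them. The table names, credentials, and the use of the PyMySQL client are assumptions for the example, not a description of our actual tooling.

```python
# Hypothetical warm-up script: force a sequential read of the most frequently
# accessed tables so their blocks are fetched from the remote storage volume
# ahead of production traffic. Table names and credentials are placeholders.
import pymysql

HOT_TABLES = ["jobs", "schedules", "job_statuses"]  # assumed names

conn = pymysql.connect(host="api-db.example.internal",
                       user="warmup", password="***", database="api")
try:
    with conn.cursor() as cur:
        for table in HOT_TABLES:
            # A clustered-index scan touches every data block of the table
            # once, paying the 'First Touch Penalty' up front.
            cur.execute(f"SELECT COUNT(*) FROM `{table}` FORCE INDEX (PRIMARY)")
            print(table, "warmed,", cur.fetchone()[0], "rows")
finally:
    conn.close()
```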

At around 6:50 PM PDT we started to bring additional API processing capacity online to mitigate the situation, but we soon realized that this workaround could not compensate for the underlying storage read latency.

At around 7:00 PM PDT, we started investigating the database load in an attempt to identify the heaviest queries and come up with a workaround. By 7:30 PM PDT we had identified the heaviest hitter as a query generated by a minor part of our notification system, and we proceeded to suspend that part of the service since it is not essential to the core functionality of the system.
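As an aside, on MySQL 5.7 the statement digest statistics in performance_schema are one common way to rank queries by total execution time. The sketch below only illustrates that general approach (connection details are placeholders) and is not the exact procedure we used.

```python
# Illustrative only: rank query digests by total execution time using
# MySQL 5.7's performance_schema, one way to find the 'heaviest hitters'.
import pymysql

QUERY = """
SELECT DIGEST_TEXT,
       COUNT_STAR              AS calls,
       SUM_TIMER_WAIT / 1e12   AS total_seconds,  -- picoseconds to seconds
       SUM_ROWS_EXAMINED       AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10
"""

conn = pymysql.connect(host="api-db.example.internal", user="ops", password="***")
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for digest, calls, total_s, rows in cur.fetchall():
            print(round(total_s, 1), "s", calls, "calls", rows, "rows",
                  (digest or "")[:80])
finally:
    conn.close()
```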

The mechanism is responsible for dispatching notifications to the owners of Scheduled Queries and Data Transfers when their scheduled executions fail: a notification is sent when the status transitions from success to failure (the Schedule starts ‘failing’) and again when it transitions back from failure to success (the Schedule resumes ‘succeeding’).
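Conceptually the mechanism is edge-triggered: it compares the latest execution outcome with the previous one and notifies only on a change. A minimal sketch of that idea, with invented function and field names (not Treasure Data's actual implementation), might look like this:

```python
# Illustrative only: notify schedule owners on success <-> failure transitions.
from collections import namedtuple

Schedule = namedtuple("Schedule", ["name", "owner"])

def send_notification(owner, message):
    # Stand-in for the real email/notification dispatch.
    print(f"notify {owner}: {message}")

def handle_schedule_result(schedule, previous_status, new_status):
    """Dispatch a notification only when the outcome changes (edge-triggered)."""
    if previous_status == "success" and new_status == "failure":
        send_notification(schedule.owner, f"{schedule.name} started failing")
    elif previous_status == "failure" and new_status == "success":
        send_notification(schedule.owner, f"{schedule.name} resumed succeeding")
    # No notification while consecutive runs have the same outcome.

# Example: only the transition produces a notification.
s = Schedule("nightly_rollup", "owner@example.com")
handle_schedule_result(s, "success", "failure")   # notifies
handle_schedule_result(s, "failure", "failure")   # silent
```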

By around 8:00 PM PDT the workaround was deployed to our production API and performance started to gradually pick up as the load on the database subsided. By 8:40 PM PDT the performance of the API had returned to typical levels and the incident was closed.

Post incident closure (around October 24th, 00:35 AM PDT), the Schedule success/failure notification mechanism was patched by fixing the problematic query (which was causing an inefficient full table scan) and restored, bringing the API back to full functionality.
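As a generic illustration of this class of fix (the actual schema and query are not public, so the table, column, and index names below are invented): a lookup that filters on an unindexed column forces a full table scan, and adding an index matching the filter and sort order lets the optimizer seek directly to the relevant rows.

```python
# Illustrative only: the table, column, and index names are hypothetical.
import pymysql

conn = pymysql.connect(host="api-db.example.internal",
                       user="ops", password="***", database="api")
with conn.cursor() as cur:
    # Before: filtering on an unindexed column examines every row.
    cur.execute("EXPLAIN SELECT * FROM schedule_runs "
                "WHERE schedule_id = 42 ORDER BY finished_at DESC LIMIT 1")
    print(cur.fetchall())  # type=ALL -> full table scan

    # Fix: an index matching the filter and sort order.
    cur.execute("CREATE INDEX idx_schedule_runs_schedule_finished "
                "ON schedule_runs (schedule_id, finished_at)")

    cur.execute("EXPLAIN SELECT * FROM schedule_runs "
                "WHERE schedule_id = 42 ORDER BY finished_at DESC LIMIT 1")
    print(cur.fetchall())  # type=ref -> index lookup, no full scan
conn.close()
```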

Remediations

Here are some of the ideas we have identified to remediate these problems:

  • Reevaluate the performance of our API, with a particular eye for queries that are especially inefficient, even though their lower performance may not normally be noticeable or have an adverse impact while the system is in steady state (and the database instance is ‘warm’).
  • Explore and experiment with ‘database warming’ techniques to mitigate the underlying ‘First Touch Penalty’ by prefetching the storage blocks containing the records most likely to be needed soonest or most often.
  • Fix the race condition causing the job status to be incorrectly set to ‘killed’ for Bulk Import jobs (see the sketch after this list).
  • Fix the job status synchronization issues concerning INSERT INTO queries (both Hive and Presto query engines).
  • Improve CREATE TABLE Presto queries to wait longer for API availability and/or fail explicitly rather than silently.
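For example, the race-condition fix mentioned above could take the form of a guarded, compare-and-set style status update that refuses to mark a job ‘killed’ once it has already reached a terminal state. This is a generic sketch with invented table and column names, not our actual job scheduler code.

```python
# Illustrative only: guard the status transition so a late 'kill' cannot
# overwrite a job that already finished. Table/column names are invented.
import pymysql

def mark_job_killed(conn, job_id):
    """Set status to 'killed' only if the job is not already in a terminal state."""
    with conn.cursor() as cur:
        affected = cur.execute(
            "UPDATE jobs SET status = 'killed' "
            "WHERE id = %s AND status NOT IN ('success', 'error', 'killed')",
            (job_id,))
    conn.commit()
    # affected == 0 means the job already completed; the kill becomes a no-op.
    return affected == 1
```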

Conclusion

We know how critical our services are to you and your business. In spite of our best efforts to minimize the impact of this unavoidable database upgrade, we take this incident seriously and treat it as an opportunity to improve our processes and systems.

In the meantime, we sincerely apologize for the inconvenience and trouble this issue has caused.

Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.

Sincerely,

The Treasure Data Team

Posted Oct 24, 2017 - 11:39 PDT

Completed
Maintenance has completed, and the resulting incident has been resolved. API service has returned to normal.

Email notifications of scheduled job failures and recoveries remain disabled. Global system status is now healthy and we are resolving this incident.

We will work on restoring the notification functionality later during a regular release cycle.
Posted Oct 23, 2017 - 20:40 PDT
Verifying
API has returned to service. Latency is still elevated.

We have temporarily disabled the scheduled job failure and recovery notification feature to reduce load. We will re-enable this feature once performance has recovered.
Posted Oct 23, 2017 - 19:58 PDT
In progress
Latency issues have become more severe; we are working to resolve the high API latency at this time.
Posted Oct 23, 2017 - 18:58 PDT
Verifying
Maintenance is complete, however we are closely monitoring increased latency after the upgrade. We will update again shortly.
Posted Oct 23, 2017 - 18:40 PDT
In progress
Maintenance is now proceeding during the scheduled maintenance window.

The upgrade procedure will last up to 10 minutes - during this time, our REST API will be largely unavailable and respond with 500 or similar error codes. As a consequence:

* The Web Console will not be functional.

* Streaming import requests will fail: Where fluentd / td-agent is being used (as recommended), event collection will continue locally on each device/server and will recover automatically once the maintenance is complete due to the built-in buffering and retry mechanisms.

* Javascript SDK and Mobile SDK (Android, iOS, and Unity) event collection will continue undisturbed but events will not be available for querying until the maintenance is completed.

* Real-time CDP event ingestion will continue undisturbed but events will be delayed and become available only shortly after the maintenance is completed.

* td command-line (CLI) commands will encounter failures and execution will be delayed until the maintenance is complete. In some cases, the commands will return errors.

* Scheduled queries and connector jobs will be delayed. Already running scheduled jobs will be completed successfully or retried internally until they are.

* Workflows using Treasure Data operators (e.g. td>) will retry and regain full functionality after the upgrade. In case of workflow session failures, customers can elect to resume them manually.

* The Presto JDBC / ODBC Gateway will report authentication failures.
Posted Oct 23, 2017 - 18:10 PDT
Update
In about an hour (around 6:10 PM PDT or 10:10 AM JST), we will start our planned Database upgrades.

As a reminder (from the previous communications), during this time:

* The Web Console will not be functional.

* Streaming import requests will fail:
Where fluentd / td-agent is being used (as recommended), event collection will continue locally on each device/server and will recover automatically once the maintenance is complete due to the built-in buffering and retry mechanisms.

* Javascript SDK and Mobile SDK (Android, iOS, and Unity) event collection will continue undisturbed but events will not be available for querying until the maintenance is completed.

* Real-time CDP event ingestion will continue undisturbed but events will be delayed and become available only shortly after the maintenance is completed.

* td command-line (CLI) commands will encounter failures and execution will be delayed until the maintenance is complete. In some cases, the commands will return errors.

* Scheduled queries and connector jobs will be delayed. Already running scheduled jobs will be completed successfully or retried internally until they are.

* Workflows using Treasure Data operators (e.g. td>) will retry and regain full functionality after the upgrade. In case of workflow session failures, customers can elect to resume them manually.

* The Presto JDBC / ODBC Gateway will report authentication failures.

Beyond this notice, we will provide updates at the start and completion of the operation and once the verification of the new system is completed: at that time, all systems will have returned to full functionality and the Scheduled Maintenance will be closed.
Posted Oct 23, 2017 - 17:14 PDT
Scheduled
On Monday, October 23rd starting at 6 PM PDT (October 24th starting at 10 AM JST), we will be performing the upgrade of some of our databases.

The upgrade is necessary to bring our production databases to the latest and greatest version supported (MySQL v5.7.19), allowing us to stay on par with the latest improvements, fixes, and security patches.

The procedure will **last up to 10 minutes**: during this time, our REST API will be largely unavailable and respond with 500 or similar error codes. In particular:

* The Web Console will not be functional.

* Streaming import requests will fail:
Where fluentd / td-agent is being used (as recommended), event collection will continue locally on each device/server and will recover automatically once the maintenance is complete due to the built-in buffering and retry mechanisms.

* Javascript SDK and Mobile SDK (Android, iOS, and Unity) event collection will continue undisturbed but events will not be available for querying until the maintenance is completed.

* Real-time CDP event ingestion will continue undisturbed but events will be delayed and become available only shortly after the maintenance is completed.

* td command-line (CLI) commands will encounter failures and execution will be delayed until the maintenance is complete. In some cases, the commands will return errors.

* Scheduled queries and connector jobs will be delayed. Already running scheduled jobs will be completed successfully or retried internally until they are.

* Workflows using Treasure Data operators (e.g. td>) will retry and regain full functionality after the upgrade. In case of workflow session failures, customers can elect to resume them manually.

* The Presto JDBC / ODBC Gateway will report authentication failures.

Beyond this notice, we will provide updates approximately 1 hour before the beginning of the upgrade window, at the start and completion of the operation, and once the verification is completed. At that time, all systems will have returned to full functionality and the Scheduled Maintenance will be closed.

If you have any questions or concerns about this upgrade, please feel free to reach out to our Support team at support@treasuredata.com.
Posted Oct 16, 2017 - 21:53 PDT
This scheduled maintenance affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Presto JDBC/ODBC Gateway, Workflow).