On October 23rd, 2017 starting at around 6:10 PM PDT (October 24th, 2017 starting at 10:10 AM JST), we performed an upgrade of both our API and backend Import task queue databases as planned.
Following the upgrade process, which took approximately 10 minutes as initially planned, we observed severely impaired read latency on our API database (the time it takes to retrieve a record). The resulting increase in database load heavily affected the performance of our API and manifested itself as spotty reachability, request timeouts, and various flavors of request failures.
The symptoms of the heavy database latency persisted until around 8:00 PM PDT (12:00 PM JST), when we deployed an emergency workaround that suspended a minor part of our notification service. This notification service was found to be the primary cause of the increased database load.
After that, the load on the database subsided drastically and the API gradually resumed normal operation, reaching its steady state at around 8:40 PM PDT (12:40 PM JST), when the incident was closed.
Here is the list of issues customers could have observed:

- Spotty API reachability and request timeouts
- Various flavors of API request failures
- Failed Presto queries, including CREATE TABLE queries

Action required from customers:

- Rerun any CREATE TABLE Presto query that may have failed.

No other action has been identified as necessary at the moment.
The database upgrade planned for October 23rd PDT was both mandatory (mandated by our infrastructure provider) and necessary to keep up to date with the latest improvements, fixes, and security enhancements.
We took great care in planning the upgrade process in order to minimize the API unavailability exposed to customers.
At 6:10 PM PDT, the upgrade of both databases was started in parallel.
Following the upgrade of the API database, the instance immediately started to exhibit very high read latency in accessing the backing storage volume, up to 10 times higher than normal (from the typical 3 milliseconds up to 30 milliseconds). The high read latency increased the pressure on the database, causing the processing capacity of the API to saturate. These issues manifested themselves to customers as spotty reachability, request timeouts, and various flavors of request failures.
The elevated read latency was reportedly due to the database instance incurring what is known as the 'First Touch Penalty': the overhead of retrieving storage blocks from the attached remote storage volume for the first time. The penalty is by design, as a storage block is not read until it is actually requested and is not prefetched. In traditional terms, the database needed to be 'warmed up'.
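A common mitigation for this penalty is to 'pre-warm' the volume by touching every block before the instance serves production traffic. The sketch below is a minimal illustration of that idea, not our actual procedure; the device path and block size are assumptions for demonstration purposes.

```python
# Minimal sketch of volume pre-warming: sequentially read every block of
# the backing device so each block is fetched from remote storage once,
# before production traffic incurs the 'First Touch Penalty'.
# The device path and block size below are illustrative assumptions.

BLOCK_SIZE = 1024 * 1024  # read in 1 MiB chunks

def prewarm(device_path: str) -> int:
    """Read the device end to end, returning the number of bytes touched."""
    total = 0
    with open(device_path, "rb") as dev:
        while True:
            chunk = dev.read(BLOCK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total

if __name__ == "__main__":
    # Hypothetical device node; requires read permission on the block device.
    print(prewarm("/dev/xvdf"), "bytes warmed")
```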
At around 6:50 PM PDT we started to bring additional API processing capacity online to try to mitigate the situation, but we soon realized that this workaround was not effective against the underlying storage read latency.
At around 7:00 PM PDT, we started to investigate the database load in an attempt to identify the heaviest queries and come up with a workaround. By 7:30 PM PDT we had identified the heaviest hitter as a query generated by a minor part of our notification system, and we proceeded to suspend that part of the service, as it is not fundamental to the functionality of the system.
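For illustration, on a PostgreSQL database the heaviest queries can be surfaced with the pg_stat_statements extension, as in the sketch below. The report does not state which engine or tooling we actually used, so treat the connection string, extension, and column names as assumptions.

```python
# Illustrative sketch of identifying the heaviest queries on a PostgreSQL
# database via the pg_stat_statements extension. This is an assumption
# for illustration; the incident report does not state which database
# engine or tooling was used during the investigation.
import psycopg2  # assumes psycopg2 is installed and the extension is enabled

TOP_QUERIES = """
    SELECT query, calls, total_time, total_time / calls AS avg_ms
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;
"""

def heaviest_queries(dsn: str):
    """Return the ten queries with the highest cumulative execution time."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TOP_QUERIES)
            return cur.fetchall()

if __name__ == "__main__":
    # Hypothetical DSN for demonstration.
    for query, calls, total_ms, avg_ms in heaviest_queries("dbname=api"):
        print(f"{total_ms:10.1f} ms total  {avg_ms:8.2f} ms avg  x{calls}  {query[:60]}")
```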
The mechanism was in charge of dispatching notifications to the owners of Scheduled Queries and Data Transfers in case their scheduled execution resulted in failures: a notification is sent when the status goes from success to failure (the Schedule starts 'failing') and when it goes back from failure to success (the Schedule resumes 'succeeding').
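In other words, the mechanism is edge-triggered: it notifies only on a status transition, not on every failed run. The sketch below illustrates that logic; all names in it are hypothetical and not taken from our codebase.

```python
# Conceptual sketch of the edge-triggered notification logic described
# above: a notification is dispatched only when a schedule's status
# changes (success -> failure, or failure -> success), not on every run.
# All names here are hypothetical, for illustration only.

last_status = {}  # schedule id -> "success" | "failure"

def on_execution_finished(schedule_id, status, notify):
    """Record the latest status and notify only on a transition."""
    previous = last_status.get(schedule_id)
    last_status[schedule_id] = status
    if previous is not None and previous != status:
        if status == "failure":
            notify(schedule_id, "Schedule started failing")
        else:
            notify(schedule_id, "Schedule resumed succeeding")

if __name__ == "__main__":
    log = lambda sid, msg: print(f"[notify] {sid}: {msg}")
    on_execution_finished("daily-report", "success", log)  # no notification
    on_execution_finished("daily-report", "failure", log)  # starts failing
    on_execution_finished("daily-report", "failure", log)  # no duplicate
    on_execution_finished("daily-report", "success", log)  # resumes succeeding
```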
By around 8:00 PM PDT the workaround was deployed to our production API, and performance started to gradually pick up as the load on the database subsided. By 8:40 PM PDT the performance of the API had returned to typical levels and the incident was closed.
Post incident closure (around October 24th, 00:35 AM PDT), the Schedule success/failure notification mechanism was patched by fixing the problematic query (which was causing an inefficient full table scan) and restored, bringing the API back to full functionality.
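As an illustration of this general class of fix (the actual schema and query involved are not shown here), the snippet below uses SQLite's EXPLAIN QUERY PLAN to show how adding an index turns a full table scan into an index search.

```python
# Illustration of the general class of fix applied: adding an index so a
# frequent lookup no longer requires an inefficient full table scan. The
# schema and query here are invented for demonstration purposes only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule_runs (id INTEGER PRIMARY KEY, schedule_id TEXT, status TEXT)"
)

LOOKUP = "SELECT status FROM schedule_runs WHERE schedule_id = ?"

def plan(query):
    """Return SQLite's query plan details for the given query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, ("example",))
    return [row[3] for row in rows]

# Exact wording varies by SQLite version.
print(plan(LOOKUP))   # before: ['SCAN schedule_runs'] -- full table scan

conn.execute("CREATE INDEX idx_runs_schedule ON schedule_runs (schedule_id)")
print(plan(LOOKUP))   # after: ['SEARCH schedule_runs USING INDEX ...']
```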
Here are some ideas to remediate these problems:

- Pre-warm the database storage volume as part of future upgrade procedures, so the 'First Touch Penalty' is paid before production traffic is served.
- Audit recurring queries for inefficient access patterns (such as full table scans) before they can degrade the database under load.
We know how critical our services are to you and your business. In spite of our best efforts to minimize the impact of this unavoidable database upgrade, we take the impact of this incident seriously and see it as an opportunity to improve our processes and systems.
In the meantime, we sincerely apologize for the inconvenience and trouble this issue has caused.
Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.
Sincerely,
The Treasure Data Team