On September 4th, 2019 at around 20:56 PDT (September 5th, 2019 at 12:56 JST), customers of the US region experienced an incident affecting the ability to manage jobs (issue, update, and kill).
The issue stemmed from reaching an implicit per-table size limit in the MySQL database used as the persistence layer for the main API (api.treasuredata.com): the table storing the job definitions had reached its maximum file size.
The condition lasted for around 46 minutes, after which the ability to manage jobs quickly resumed, servicing all the pending requests and allowing existing customer jobs to complete.
This incident affected only customers of the US region.
From the start of the incident and for about 46 minutes, job management was affected for the following job types: Hive & Presto queries, Data Connector jobs, Result Output jobs (including those writing back to TD), Bulk Import jobs, Partial Delete jobs, and Export jobs.
On September 4th, 2019 at around 20:50 PDT (September 5th, 2019 at 12:50 JST), we were alerted to a spike in errors in the main API for the US region.
While we began to investigate the API failures and narrow the problem down, we received the first customer inquiries indicating they had become unable to issue new jobs. Our focus shifted immediately to the Job API.
Around 21:00 PDT (13:00 JST) we received more customer inquiries. By then, the investigation had identified the culprit to be the
Mysql2::Error: The table 'jobs' is full
errors in the logs. The jobs table is the MySQL table the main API uses to persist all job definitions and track their status.
The fact that all errors pointed at the jobs table suggested the issue did not affect other tables, and that it was not a matter of overall storage space on the MySQL database but rather a per-table restriction. This was confirmed by 21:15 PDT (13:15 JST).
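The distinction matters for the remediation: a full disk and a per-table file-size ceiling call for different fixes. A minimal sketch of the classification, with hardcoded sample values standing in for figures that would in practice come from MySQL's information_schema and the operating system (the 2 TB limit is the one identified later in the investigation):

```python
# Sketch: classify a "table is full" condition. In practice table_bytes
# would come from information_schema.TABLES (data_length + index_length)
# and disk_free_bytes from the OS; here they are hardcoded sample values.
TABLE_FILE_LIMIT = 2 * 1024**4  # 2 TB per-table file size limit


def diagnose(table_bytes, disk_free_bytes):
    if disk_free_bytes == 0:
        return "disk full"
    if table_bytes >= TABLE_FILE_LIMIT:
        return "per-table limit reached"
    return "ok"


# Plenty of free disk, but the table file has hit its ceiling:
print(diagnose(table_bytes=2 * 1024**4, disk_free_bytes=500 * 1024**3))
# → per-table limit reached
```

This mirrors the observed symptom: ample free storage overall, yet inserts and updates against one specific table failing.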
At 21:20 PDT (13:20 JST) we started deleting the oldest 1 million records in the hope that it would reduce the table size, make room for additional records, and allow updates to resume. This completed at 21:30 PDT (13:30 JST), when we confirmed that API requests for job creation and job updates had started succeeding again. We also confirmed new jobs were being successfully added to the jobs table in the MySQL database.
We continued deleting old job records, up to 1.2 million in total. At 22:15 PDT (14:15 JST), with the main API error count back to normal levels, the incident was marked as resolved.
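The emergency fix, deleting the oldest records in small batches so each transaction stays short, can be sketched as follows. This is a minimal illustration using SQLite in place of MySQL, with a hypothetical table layout; it is not the exact procedure we ran:

```python
import sqlite3


def delete_oldest_jobs(conn, total_to_delete, batch_size=1000):
    """Delete the oldest rows from the jobs table in batches so that
    each transaction stays small and locks are held only briefly."""
    deleted = 0
    while deleted < total_to_delete:
        n = min(batch_size, total_to_delete - deleted)
        cur = conn.execute(
            "DELETE FROM jobs WHERE id IN ("
            "  SELECT id FROM jobs ORDER BY id ASC LIMIT ?)", (n,))
        conn.commit()
        if cur.rowcount == 0:
            break  # table has fewer rows than requested
        deleted += cur.rowcount
    return deleted


# Demo with an in-memory database standing in for the real jobs table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, query TEXT)")
conn.executemany("INSERT INTO jobs (id, query) VALUES (?, ?)",
                 [(i, "SELECT 1") for i in range(1, 5001)])
conn.commit()

removed = delete_oldest_jobs(conn, 1200, batch_size=500)
remaining = conn.execute("SELECT COUNT(*), MIN(id) FROM jobs").fetchone()
print(removed, remaining)  # → 1200 (3800, 1201): the oldest 1200 rows are gone
```

Batching keeps the delete from blocking concurrent inserts and updates for long stretches, which matters when the same table is actively serving API traffic.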
Afterwards, we continued to investigate the root cause of the problem and discovered that MySQL was configured in such a way that the per-table file size limit was 2 TB (terabytes). The US region jobs table contains all job records since the beginning of the business back in 2011, amounting to over 550 million records to date. As customers' analytical needs have evolved, queries have become more complex and long-winded, in turn causing individual job records to grow very large and bloating the table storing them.
After the fact, we were able to confirm that the jobs table file size had hit the 2 TB limit at the time of the incident, and that deleting the oldest 1.2 million records had made enough room (around 8 GB of leftover space) to hold a few days' worth of new job records.
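The back-of-the-envelope arithmetic behind that estimate can be checked: if deleting 1.2 million records freed roughly 8 GB, the average job record is on the order of 7 KB, and the headroom in days follows from the daily job volume. The daily rate below is a hypothetical illustrative figure, not one from this report:

```python
# Rough arithmetic behind the "few days of headroom" estimate.
# Freed space and record counts come from the incident report;
# the daily job volume is an assumed illustrative figure.
freed_bytes = 8 * 1024**3           # ~8 GB freed by the deletion
records_deleted = 1_200_000

avg_record_bytes = freed_bytes / records_deleted
print(round(avg_record_bytes))      # → 7158 bytes per job record on average

jobs_per_day = 300_000              # hypothetical daily job volume
daily_growth_bytes = jobs_per_day * avg_record_bytes
headroom_days = freed_bytes / daily_growth_bytes
print(headroom_days)                # → 4.0 days before the limit is hit again
```

Note that the headroom simplifies to records_deleted / jobs_per_day: freeing 1.2 million records buys time proportional to how many days it takes to create that many new jobs, which is why the fix was explicitly a stopgap.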
Here are some of the ideas we are going to investigate to remediate this issue.
Short term
Medium term
Long term
We regret the trouble this instability has caused and apologize for the inconvenience: we appreciate and want to honor the trust you put in us by choosing the Arm Treasure Data CDP to build and run your analytics workloads. We are renewing our efforts not only to address the present issues but also to find ways to prevent them from happening in the future. ‘We don’t know what we don’t know’ aside, we continue to be reminded that we need to shift our reliability focus toward thinking further ahead.
Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.