[US Region] Job submission

Incident Report for Treasure Data

Postmortem

Summary

On September 4th, 2019 at around 20:56 PDT (September 5th, 2019 at 12:56 JST), customers in the US region experienced an incident affecting their ability to manage jobs (issue, update, and kill).

The issue stemmed from reaching an implicit per-table limit of the MySQL database used as the persistence layer for the main API (api.treasuredata.com). The table storing the job definitions had reached its maximum size.

The condition lasted for around 46 minutes after which the ability to manage jobs quickly resumed, servicing all the pending requests and allowing existing customer jobs to complete.

Impact to customers

This incident affected only customers of the US region.

From the start of the incident and for about 46 minutes:

Customers were unable to issue new jobs [1]. This also applies to manually executed Saved queries.
Running customer jobs [1] continued to execute unaffected but the status could not be updated and marked as complete (success, failure, or killed). This might have suggested that each of these jobs was running for longer than it actually did.
The execution of Scheduled Hive or Presto queries and Data Connector sources was delayed until the end of the incident.
The execution of jobs [1] from a Workflow (scheduled or manually run) was delayed until the end of the incident.
Presto JDBC/ODBC API requests failed until the end of the incident.
Master Segment and Batch Segment execution was delayed until the end of the incident: this is the same pattern as ‘Workflow’ above. Audience Studio showed errors because the ‘Presto JDBC/ODBC API’ requests failed per the pattern described above.

[1] Hive & Presto queries, Data Connector jobs, Result Output jobs (including those writing back to TD), Bulk Import jobs, Partial Delete jobs, and Export jobs.

Details

On September 4th, 2019 at around 20:50 PDT (September 5th, 2019 at 12:50 JST) we were alerted to a spike in errors from the main API for the US region.

While we began to investigate the API failures and narrow the problem down, we received the first customer inquiries indicating they had become unable to issue new jobs. Our focus shifted immediately to the Job API.

Around 21:00 PDT (13:00 JST) we received more customer inquiries. The investigation had identified the culprit to be the

Mysql2::Error: The table 'jobs' is full

errors in the logs. The jobs table is the MySQL table the main API utilizes to persist all job definitions and track the status of:

Hive queries
Presto queries
Data Connector jobs
Result Output jobs (including those writing back to TD)
Bulk Import jobs
Export jobs
Partial Delete jobs

The fact that all errors were pointing at the jobs table suggested the issue did not affect other tables and that it was not a matter of storage space on the MySQL database but rather a per-table restriction. This was confirmed by 21:15 PDT (13:15 JST).

At 21:20 PDT (13:20 JST) we started deleting the oldest 1 million records in the hope that it would reduce the table size, make space for additional records, and allow updates to resume. This was completed at 21:30 PDT (13:30 JST), when we confirmed that API requests for job creation and job updates had started succeeding again. We also confirmed that new jobs had been successfully added to the jobs table in the MySQL database.

We continued deleting old job records, up to 1.2 million in total. At 22:15 PDT (14:15 JST), the main API error count had dropped to normal levels and the incident was marked as resolved.

Afterwards, we continued to investigate the root cause of the problem and discovered that MySQL was configured in such a way that the per-table file size limit was 2TB (Terabytes). The US region jobs table contains all the job records since the beginning of the business back in 2011, amounting to over 550 million records to date. As customers’ analytical needs have evolved, queries have become longer and more complex, in turn causing the individual job records to grow very large and bloating the table that stores them.
In retrospect, we were able to confirm that the jobs table file size had hit the 2TB limit at the time of the incident and that, by deleting the oldest 1.2 million records, we had made enough room to hold a few days' worth of new job records (around 8GB of leftover space).

Remediations

Here are some ideas we are going to investigate to remediate this issue.

Short term

Periodically (daily) check the amount of leftover space and delete older records to maintain sufficient headroom.
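
As an illustration only, the daily check could be sketched as below. This is a minimal sketch under stated assumptions, not the tooling we run: the 2TB limit is treated as 2 TiB, the daily growth rate and the 30-day threshold are hypothetical inputs, and check_headroom is a made-up helper name. The SQL string reads MySQL's information_schema.tables view, whose sizes are estimates and may differ from the on-disk file size that the limit actually applies to.

```python
from dataclasses import dataclass

# Assumption: the 2TB per-table limit from this report, read as 2 TiB.
TABLE_LIMIT_BYTES = 2 * 1024**4

# Query against MySQL's information_schema.tables view. The reported sizes
# are estimates; the placeholder style (%s) depends on the driver in use.
TABLE_SIZE_SQL = """
SELECT data_length + index_length AS used_bytes
FROM information_schema.tables
WHERE table_schema = %s AND table_name = %s
"""


@dataclass
class Headroom:
    free_bytes: int       # space left before the table limit is reached
    days_left: float      # free space divided by the assumed daily growth
    needs_cleanup: bool   # True when days_left is below the threshold


def check_headroom(used_bytes, daily_growth_bytes, min_days=30.0,
                   limit_bytes=TABLE_LIMIT_BYTES):
    """Estimate how long the table can keep growing (hypothetical helper)."""
    free = max(limit_bytes - used_bytes, 0)
    if daily_growth_bytes > 0:
        days = free / daily_growth_bytes
    else:
        days = float("inf")  # no growth means no deadline
    return Headroom(free, days, days < min_days)


if __name__ == "__main__":
    gib = 1024**3
    # 8 GiB free, as after the incident; 2 GiB/day growth is an assumed figure.
    print(check_headroom(TABLE_LIMIT_BYTES - 8 * gib, 2 * gib))
```

With roughly 8GB free and an assumed growth of 2GB per day, the sketch reports about four days of headroom and flags the table for cleanup, which matches the "few days' worth" of space described above.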

Medium term

Compact (e.g. defragment) the jobs table file.
The deletion of 1.2 million records should have given us a lot more free space. The current OS page size limit causes additional storage overhead, which could be eliminated by compacting the jobs table file during a planned maintenance.

Long term

Create an archived_jobs table and automatically migrate jobs records older than a certain amount of time there.
Moving the older jobs to a different table will significantly reduce the size of the jobs table and will enable us to differentiate the guarantees of searchability and retrievability between the two tables. An additional benefit is that this opens up the opportunity to offer additional, enhanced functionality in the future.
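
The migration idea can be sketched as follows. This is a minimal sketch, not a design decision: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the column names (id, created_at, query), the batch size, and the helper names are all hypothetical. Rows are moved in small batches, one transaction per batch, so that a failure never leaves a record in both tables or in neither.

```python
import sqlite3


def archive_old_jobs(conn, cutoff, batch_size=500):
    """Move jobs created before `cutoff` into archived_jobs, batch by batch.

    Returns the number of rows moved. Hypothetical helper for illustration.
    """
    moved = 0
    while True:
        # Pick the next batch: the lowest ids among rows older than the cutoff.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM jobs WHERE created_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size))]
        if not ids:
            return moved
        last_id = ids[-1]
        # One transaction per batch: commits on success, rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO archived_jobs "
                "SELECT * FROM jobs WHERE created_at < ? AND id <= ?",
                (cutoff, last_id))
            conn.execute(
                "DELETE FROM jobs WHERE created_at < ? AND id <= ?",
                (cutoff, last_id))
        moved += len(ids)


def make_demo_db():
    """Build a tiny in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    schema = "(id INTEGER PRIMARY KEY, created_at TEXT, query TEXT)"
    conn.execute("CREATE TABLE jobs " + schema)
    conn.execute("CREATE TABLE archived_jobs " + schema)
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", [
        (1, "2011-06-01", "SELECT 1"),
        (2, "2015-01-01", "SELECT 2"),
        (3, "2019-09-01", "SELECT 3"),
    ])
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(archive_old_jobs(demo, "2019-01-01", batch_size=1))
```

On a real MySQL table, deleting the migrated rows would not by itself shrink the table file, which is why the compaction step listed under the medium term would still be needed.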

Conclusion

We regret the trouble the recent instability has caused and apologize for the inconvenience: we appreciate and want to honor the trust you put in us by choosing the Arm Treasure Data CDP to build and run your analytics workloads. We are renewing our efforts not only to address the present issues but also to find ways to prevent them from happening in the future. ‘We don’t know what we don’t know’ aside, we continue to be reminded that we need to shift our reliability focus toward thinking further ahead.

Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.

Posted Sep 09, 2019 - 18:21 PDT

Resolved

Now operating normally.

The cause of this incident was an increase in the number of records in our core database tables. It is unrelated to other recent incidents. We apologize for any inconvenience caused.

Posted Sep 05, 2019 - 02:37 PDT

Monitoring

We have remediated the issue in our core database and job submission is now operational. We are continuing to monitor the core database.

Posted Sep 04, 2019 - 22:16 PDT

Identified

We have identified the issue with the job submission system. Initial remediation was successful and jobs should now be accepted and progress to completion. We are continuing to remediate the system.

Posted Sep 04, 2019 - 21:33 PDT

Investigating

We have observed an issue with the job submission system where new jobs cannot be submitted. The team is investigating and we will update shortly.

Posted Sep 04, 2019 - 21:10 PDT

This incident affected: US (REST API).