[US Region] Presto partial performance degradation and potential job failure
Incident Report for Treasure Data
Resolved
This incident has been resolved.

One of our Presto clusters in the US region had a potential status inconsistency issue between 1:04 and 3:10 AM UTC on Sep 4. Queries issued during this period may have been affected.

You may see INSERT/DELETE jobs that failed with error messages like those below. Please do not rerun those jobs, especially INSERT jobs: even if a job failed, the data may already have been written to your table, and rerunning could insert it twice.

- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
- Failed to rewrite partition
- Killed by the system because this query stalled for more than 1.00h.

Also, some of your queries issued during this period might have become stuck or even failed with the following error. Those jobs were also affected by this incident.

- Query exceeded the maximum execution time limit of 6.00h
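
As an illustration only, the guidance above could be applied to a list of failed jobs with a small triage helper like the sketch below. It assumes you already have each job's issue time (as a timezone-aware UTC datetime) and its error message; how you obtain them is outside the scope of this report. The function name and the category labels are hypothetical and are not part of any Treasure Data API.

```python
from datetime import datetime, timezone

# Incident window stated in this report (UTC); the year is taken from the post dates.
WINDOW_START = datetime(2023, 9, 4, 1, 4, tzinfo=timezone.utc)
WINDOW_END = datetime(2023, 9, 4, 3, 10, tzinfo=timezone.utc)

# Fragments of the INSERT/DELETE error messages listed in this report.
WRITE_ERRORS = (
    "cannot get transactionId for null transaction",
    "Cannot complete uploading",
    "Failed to rewrite partition",
    "Killed by the system because this query stalled",
)
# Fragment of the execution time limit error listed in this report.
TIMEOUT_ERROR = "Query exceeded the maximum execution time limit"


def classify_failed_job(issued_at, error_message):
    """Classify a failed job (hypothetical helper, not an official tool).

    issued_at must be a timezone-aware datetime.
    Returns one of: "outside_window", "do_not_rerun",
    "possibly_affected", "no_listed_error".
    """
    if not (WINDOW_START <= issued_at <= WINDOW_END):
        return "outside_window"
    if any(fragment in error_message for fragment in WRITE_ERRORS):
        # The write may have completed despite the failure, so a rerun
        # could insert the same data twice.
        return "do_not_rerun"
    if TIMEOUT_ERROR in error_message:
        return "possibly_affected"
    return "no_listed_error"
```

A job classified as "do_not_rerun" should have its target table checked before any retry. The other labels only indicate whether the job matches the conditions described in this report.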
Posted Sep 04, 2023 - 05:45 PDT
Update
One of our Presto clusters in the US region had a potential status inconsistency issue between 1:04 and 3:10 AM UTC on Sep 4. Queries issued during this period may have been affected.

You may see INSERT/DELETE jobs that failed with error messages like those below. Please do not rerun those jobs, especially INSERT jobs, because even if a job failed, the data may already have been written to your table.

- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.

CTAS (CREATE TABLE AS SELECT) jobs that failed with the following error might also have been affected by this incident:

- Query exceeded the maximum execution time limit of 6.00h

Also, some of your queries might have become stuck during this incident.

We are confident that newly issued queries are not affected. We are still working to identify the impact of this incident.
Posted Sep 04, 2023 - 03:56 PDT
Update
We are continuing to investigate this issue.
Posted Sep 04, 2023 - 03:55 PDT
Update
One of our Presto clusters in the US region had a potential status inconsistency issue between 1:04 and 3:10 AM UTC on Sep 4. Queries issued during this period may have been affected.

You may see INSERT/DELETE jobs that failed with error messages like those below. Please do not rerun those jobs, especially INSERT jobs, because even if a job failed, the data may already have been written to your table.

- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.

CTAS (CREATE TABLE AS SELECT) jobs that failed with the following error might also have been affected by this incident:

- Query exceeded the maximum execution time limit of 6.00h

Also, some of your queries might have become stuck during this incident.

We are confident that newly issued queries are not affected. We are still working to identify the impact of this incident.
Posted Sep 04, 2023 - 01:05 PDT
Update
Your INSERT jobs may be in an inconsistent state if they failed with error messages like the following:

- Cannot complete uploading. This error is temporary and should be recovered by retrying
- cannot get transactionId for null transaction

Please do not rerun those jobs, because even if a job failed, the write to your table might have succeeded. We are still working to identify the impact of this incident.
Posted Sep 03, 2023 - 22:33 PDT
Investigating
We are investigating the cause. Queries may be delayed.
Posted Sep 03, 2023 - 20:10 PDT
This incident affected: US (Presto Query Engine).