[US Region] Elevated API error rate and query performance degradation
Incident Report for Treasure Data
Resolved
We confirmed that two consecutive Object Storage connectivity issues were identified and fixed. The incident is now resolved.
Mar 20, 2018 - 10:02 PDT
Update
All systems, including the Hadoop clusters, are operating normally now. We are waiting for the Infrastructure Provider to confirm that the problem is fixed.
Mar 20, 2018 - 09:50 PDT
Update
We recovered the Hadoop cluster at 09:06 PDT and provisioned additional computing resources for recovery at 09:20 PDT. All systems are operating normally now. We will continue monitoring for the next 15 minutes.
Mar 20, 2018 - 09:25 PDT
Monitoring
Object Storage access performance recovered again at 08:48 PDT. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Result Export jobs have already been processed
* We are working on restarting Hadoop jobs that were queued due to the Object Storage access issue
Mar 20, 2018 - 09:02 PDT
Identified
We are again working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Slower Presto/Hive query execution
* Slower S3 Result Export execution
* Stream import delay: it takes an average of 10 minutes for imported data to become visible to queries
* Elevated API error rate for CLI and Streaming Import
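During windows of elevated API error rates like this, client-side retries with exponential backoff and jitter usually ride out the degradation without overloading the recovering service. A minimal sketch, not Treasure Data's official client; the retried operation is a placeholder for any import or CLI call:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    Uses "full jitter": each wait is uniform in [0, base_delay * 2**attempt],
    which spreads retries out so clients don't hammer the API in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

For example, wrapping a streaming-import call in `retry_with_backoff` lets short error spikes pass transparently, while persistent failures still propagate after `max_attempts` tries.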
Mar 20, 2018 - 08:41 PDT
Update
We confirmed an elevated error rate on our API endpoint for CLI and streaming import due to Object Storage access performance degradation.
Mar 20, 2018 - 08:23 PDT
Investigating
We are observing Object Storage access performance degradation again.
Mar 20, 2018 - 08:17 PDT
Update
We confirmed that all backlogged jobs have been processed and there is no delay in job execution or streaming import. We will continue monitoring for 30 minutes.
Mar 20, 2018 - 07:45 PDT
Monitoring
Object Storage access performance has recovered. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Hive/Result Export jobs are being processed quickly
* Stream import delay: it takes an average of 15 minutes for imported data to become visible to queries

Additional computing resources have already been provisioned to the production environment. Queued streaming imports will be processed within 10 minutes.
Mar 20, 2018 - 07:32 PDT
Update
We are working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Result Export start delays and slower S3 output
* Stream import delay: it takes an average of 37 minutes for imported data to become visible to queries
* Slower API responses for Streaming and Bulk Import
Mar 20, 2018 - 07:08 PDT
Identified
We are working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Stream import delay: it takes an average of 25 minutes for imported data to become visible to queries
* Slower API responses for Streaming and Bulk Import
Mar 20, 2018 - 06:44 PDT
Update
The API error rate returned to normal at 06:06 PDT, but we are still experiencing slower API responses and Presto/Hive query executions due to a backend Object Storage access slowdown. We are still investigating the root cause.
Mar 20, 2018 - 06:24 PDT
Investigating
We have been observing an elevated error rate since 5:56 AM PDT. We are investigating the cause.
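An "elevated error rate" like the one reported here is typically computed as the fraction of 5xx responses over a recent sliding window of requests. A minimal sketch, assuming access-log entries arrive as (timestamp, HTTP status) pairs; this is generic monitoring logic, not Treasure Data's actual alerting system:

```python
from collections import deque


class ErrorRateWindow:
    """Track the fraction of 5xx responses over the last `window_s` seconds."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, ts, status):
        self.events.append((ts, status >= 500))
        # Evict events that have fallen out of the sliding window.
        while self.events and self.events[0][0] < ts - self.window_s:
            self.events.popleft()

    def rate(self):
        if not self.events:
            return 0.0
        return sum(is_err for _, is_err in self.events) / len(self.events)
```

An alert would then fire when `rate()` stays above a threshold (say, 1%) for several consecutive windows, which distinguishes a sustained incident like this one from a momentary blip.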
Mar 20, 2018 - 06:08 PDT
This incident affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Hadoop / Hive Query Engine, Presto Query Engine).