[US Region] Elevated API error rate and query performance degradation
Incident Report for Treasure Data
Resolved
We confirmed that two consecutive Object Storage connectivity issues were identified and fixed. This incident has been resolved.
Mar 20, 2018 - 10:02 PDT
Update
All systems, including the Hadoop clusters, are operating normally now. We are waiting for our Infrastructure Provider to confirm the problem is fixed.
Mar 20, 2018 - 09:50 PDT
Update
We recovered the Hadoop cluster at 09:06 PDT and provisioned additional computing resources for recovery at 09:20 PDT. All systems are operating normally now. We will continue monitoring for the next 15 minutes.
Mar 20, 2018 - 09:25 PDT
Monitoring
Object Storage access performance recovered again at 8:48 PDT. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Result Export jobs have already been processed
* We are restarting Hadoop jobs that were queued due to the Object Storage access issue
Mar 20, 2018 - 09:02 PDT
Identified
We are again working with our Infrastructure Provider on the Object Storage access performance degradation. Current impact:
* Slower Presto/Hive query execution
* Slower S3 Result Export execution
* Stream import delay: imported data takes an average of 10 minutes to become visible to queries
* Elevated API error rate for CLI and Streaming Import
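During an elevated-error-rate window like this, clients can usually ride out transient API failures with retries and exponential backoff. A minimal sketch, assuming a caller-supplied `send` callable that raises on a transient failure (the helper name and the `RuntimeError` stand-in for a transient HTTP error are illustrative, not part of any Treasure Data client):

```python
import random
import time

def post_with_retries(send, max_attempts=5, base_delay=1.0):
    """Call `send()` until it succeeds, retrying transient failures
    with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RuntimeError:  # illustrative stand-in for a transient 5xx error
            if attempt == max_attempts - 1:
                raise  # exhausted retries; surface the error to the caller
            # Sleep base_delay * 2^attempt, plus jitter to avoid
            # synchronized retry storms against a degraded endpoint.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The jitter term matters when many clients fail at once: without it, retries from all clients land at the same instants and prolong the overload.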
Mar 20, 2018 - 08:41 PDT
Update
We confirmed an elevated error rate on our API endpoints for CLI and streaming import due to the Object Storage access performance degradation.
Mar 20, 2018 - 08:23 PDT
Investigating
We are observing Object Storage access performance degradation again.
Mar 20, 2018 - 08:17 PDT
Update
We confirmed that all backlog jobs were processed and there are no delays in job execution or streaming import. We will continue monitoring for 30 minutes.
Mar 20, 2018 - 07:45 PDT
Monitoring
Object Storage access performance has recovered. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Hive/Result Export jobs are being processed quickly
* Stream import delay: imported data takes an average of 15 minutes to become visible to queries

Additional computing resources have already been provisioned to the production environment. Queued streaming imports will be processed within 10 minutes.
Mar 20, 2018 - 07:32 PDT
Update
We are working with our Infrastructure Provider on the Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Result Export start delays and slower S3 output
* Stream import delay: imported data takes an average of 37 minutes to become visible to queries
* Slower API responses for Streaming and Bulk Import
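When stream imports are delayed like this, pipelines that depend on freshly imported data can poll until the rows become visible rather than assume a fixed latency. A minimal sketch, assuming a caller-supplied `count_rows` callable (a hypothetical stand-in for issuing a COUNT(*) query against the target table):

```python
import time

def wait_until_visible(count_rows, expected, timeout=45 * 60, interval=60):
    """Poll until at least `expected` rows are visible, or give up
    after `timeout` seconds. Returns True if the rows appeared in time.
    `count_rows` is any zero-argument callable returning the number of
    rows currently visible to queries."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_rows() >= expected:
            return True
        time.sleep(interval)  # avoid hammering an already degraded backend
    return False
```

Sizing `timeout` above the reported worst-case delay (37 minutes here) lets downstream jobs wait out the backlog instead of failing on partially visible data.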
Mar 20, 2018 - 07:08 PDT
Identified
We are working with our Infrastructure Provider on the Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Stream import delay: imported data takes an average of 25 minutes to become visible to queries
* Slower API responses for Streaming and Bulk Import
Mar 20, 2018 - 06:44 PDT
Update
The API error rate returned to normal at 6:06 PDT, but we are still experiencing slower API responses and Presto/Hive query executions due to a backend Object Storage access slowdown. We are still investigating the root cause.
Mar 20, 2018 - 06:24 PDT
Investigating
We have been observing an elevated error rate since 5:56 AM PDT. We are investigating the cause.
Mar 20, 2018 - 06:08 PDT
This incident affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Hadoop / Hive Query Engine, Presto Query Engine).