[US Region] Elevated API error rate and query performance degradation
Incident Report for Treasure Data
Resolved
We confirmed that two consecutive Object Storage connectivity issues were identified and fixed. The incident is now resolved.
Mar 20, 2018 - 10:02 PDT
Update
All systems, including the Hadoop clusters, are operating normally now. We are waiting for the Infrastructure Provider to confirm that the problem is fixed.
Mar 20, 2018 - 09:50 PDT
Update
We recovered the Hadoop cluster at 09:06 PDT and provisioned additional computing resources for recovery at 09:20 PDT. All systems are operating normally now. We will continue monitoring for the next 15 minutes.
Mar 20, 2018 - 09:25 PDT
Monitoring
Object Storage access performance recovered again at 08:48 PDT. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Result Export jobs have already been processed
* We are working on restarting Hadoop jobs that were queued due to the Object Storage access issue
Mar 20, 2018 - 09:02 PDT
Identified
We are again working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Slower Presto/Hive query execution
* Slower S3 Result Export execution
* Stream import delay: it takes an average of 10 minutes for imported data to become visible to queries
* Elevated API error rate for CLI and Streaming Import
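During windows of elevated API error rates like this, client-side retries with exponential backoff and jitter usually ride out the degradation without overloading the recovering service. A minimal sketch, not Treasure Data's official client; the retried operation is a placeholder for any import or CLI call:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    Uses "full jitter": each wait is uniform in [0, base_delay * 2**attempt],
    which spreads retries out so clients don't hammer the API in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

For example, wrapping a streaming-import call in `retry_with_backoff` lets short error spikes pass transparently, while persistent failures still propagate after `max_attempts` tries.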
Mar 20, 2018 - 08:41 PDT
Update
We confirmed an elevated error rate on our API endpoint for CLI and streaming import due to Object Storage access performance degradation.
Mar 20, 2018 - 08:23 PDT
Investigating
We are observing Object Storage access performance degradation again.
Mar 20, 2018 - 08:17 PDT
Update
We confirmed that all backlogged jobs have been processed and there is no delay in job execution or streaming import. We will continue monitoring for 30 minutes.
Mar 20, 2018 - 07:45 PDT
Monitoring
Object Storage access performance has recovered. We confirmed that all affected services have also begun to recover. Current impact:
* Queued Presto/Hive/Result Export jobs are being processed quickly
* Stream import delay: it takes an average of 15 minutes for imported data to become visible to queries

Additional computing resources have already been provisioned to the production environment. Queued streaming imports will be processed within 10 minutes.
Mar 20, 2018 - 07:32 PDT
Update
We are working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Result Export start delays and slower S3 output
* Stream import delay: it takes an average of 37 minutes for imported data to become visible to queries
* Slower API responses for Streaming and Bulk Import
Mar 20, 2018 - 07:08 PDT
Identified
We are working with the Infrastructure Provider on Object Storage access performance degradation. Current impact:
* Presto/Hive query start delays and slower execution
* Stream import delay: it takes an average of 25 minutes for imported data to become visible to queries
* Slower API responses for Streaming and Bulk Import
Mar 20, 2018 - 06:44 PDT
Update
The API error rate returned to normal at 06:06 PDT, but we are still experiencing slower API responses and Presto/Hive query executions due to a backend Object Storage access slowdown. We are still investigating the root cause.
Mar 20, 2018 - 06:24 PDT
Investigating
We have been observing an elevated error rate since 5:56 AM PDT. We are investigating the cause.
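An "elevated error rate" like the one reported here is typically computed as the fraction of 5xx responses over a recent sliding window of requests. A minimal sketch, assuming access-log entries arrive as (timestamp, HTTP status) pairs; this is generic monitoring logic, not Treasure Data's actual alerting system:

```python
from collections import deque


class ErrorRateWindow:
    """Track the fraction of 5xx responses over the last `window_s` seconds."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, ts, status):
        self.events.append((ts, status >= 500))
        # Evict events that have fallen out of the sliding window.
        while self.events and self.events[0][0] < ts - self.window_s:
            self.events.popleft()

    def rate(self):
        if not self.events:
            return 0.0
        return sum(is_err for _, is_err in self.events) / len(self.events)
```

An alert would then fire when `rate()` stays above a threshold (say, 1%) for several consecutive windows, which distinguishes a sustained incident like this one from a momentary blip.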
Mar 20, 2018 - 06:08 PDT
This incident affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Hadoop / Hive Query Engine, Presto Query Engine).