[US Region] Long queue times for Result Export jobs
Incident Report for Treasure Data
Resolved
From 3:30 AM PDT (07:30 PM JST), the Result Export processing workers were saturated, and all customers running Result Export jobs experienced slower job execution from that point. At 10:00 AM PDT (02:00 AM JST), we implemented a fix and provisioned additional Result Export processing capacity. By 11:30 AM PDT (03:30 AM JST), the system had finished catching up on the queued Result Export jobs.

During the incident, some customers received the error message "This account already has its maximum number of jobs queued or running (256 jobs)" for Hive and Presto queries submitted from Workflows or via direct API calls, because the backlog of queued Result Export jobs counted against this limit. We will analyze the detailed impact and then contact the affected customers.
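
For context, the error message above describes a per-account concurrency cap, so new submissions were rejected outright while Result Export jobs sat in the queue. The sketch below is not Treasure Data's client API, only a minimal illustration of how a caller could ride out such a saturation window: submit_query is a hypothetical callable, and we assume the limit surfaces as an exception whose message contains the quoted text.

    import time

    def submit_with_backoff(submit_query, query, max_retries=5):
        # Hypothetical helper: retry submission when the per-account
        # cap of 256 queued or running jobs is hit.
        for attempt in range(max_retries):
            try:
                return submit_query(query)
            except Exception as e:
                if "maximum number of jobs" not in str(e):
                    raise  # unrelated failure; surface it immediately
                # Back off while queued jobs drain (wait capped at 60s).
                time.sleep(min(60, 2 ** attempt))
        raise RuntimeError("job queue still saturated after retries")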

The incident is closed.
Posted Jul 05, 2020 - 12:18 PDT
Update
All of the queued jobs have started processing, and Result Export jobs are running normally. We will continue monitoring for 30 minutes and then report the incident impact.

We will give another status update in 30 minutes.
Posted Jul 05, 2020 - 11:45 PDT
Update
After the system returned to normal, we provisioned additional Result Export processing capacity at 10:00 PDT (02:00 JST, 19:00 CEST). We estimate that all queued jobs will be processed within 1 hour.

We will give another status update in 30 minutes.
Posted Jul 05, 2020 - 11:11 PDT
Monitoring
A fix has been applied. We have observed that the number of queued Result Export jobs is decreasing steadily.
Posted Jul 05, 2020 - 10:04 PDT
Identified
We are now applying a fix for this Result Export queuing issue to improve our Result Export cluster performance.
Posted Jul 05, 2020 - 09:57 PDT
Investigating
We have observed an increased number of queued Result Export jobs in our system and are investigating the cause.
Posted Jul 05, 2020 - 09:23 PDT
This incident affected: US (Hadoop / Hive Query Engine, Presto Query Engine).