On August 24th, 2015, between 2:44 PM and 4:28 PM PDT, following a feature release involving our internal API, customers allocated to Hadoop 1 (CDH4) clusters experienced a problem: all queries that push data to external resources (Result Output) were impacted.
Internally, we make it a point to use our own product to perform many of our calculations. For example, Treasure Data’s own data processing pipeline is built on top of Treasure Data.
Today at 4:15 PM, we noticed that a part of our own internal analytics that relies on Result Output was failing. Within 13 minutes, we had diagnosed the issue and deployed a fix.
Treasure Data runs on two versions of Hadoop: Hadoop 1 (CDH4) and Hadoop 2 (currently HDP2). Each customer is assigned to a cluster based on the date they signed up for our service.
Below is a high-level representation of our backend architecture:
During one of our daily releases, we introduced a change to our backend worker to improve our Hadoop 2 performance, specifically around the ability to push the results of queries to external systems (Result Output).
This change was deployed only to the workers (worker v6) used to execute Hadoop 2 jobs, but not to the workers (worker v5) used to execute Presto and Hadoop 1 queries.
The backend worker change was coupled with a change on the API side that broke compatibility with the earlier worker implementation (worker v5). Consequently, all jobs executed by worker v5 were unable to perform Result Output.
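To illustrate this class of failure (all payload keys and functions below are hypothetical; the report does not describe the actual API change), imagine the API starts returning Result Output settings in a new shape that only the newer worker understands:

```python
# Hypothetical sketch: the API once returned Result Output settings under
# the key "result_output"; the new release nests them under "result".
old_payload = {"result_output": "mysql://db.example.com/results"}
new_payload = {"result": {"type": "mysql", "url": "mysql://db.example.com/results"}}

def worker_v5_get_result_target(payload):
    # worker v5 was written against the old schema and raises KeyError
    # when the expected key is absent from the new payload.
    return payload["result_output"]

def worker_v6_get_result_target(payload):
    # worker v6 was updated alongside the API and reads the new schema.
    return payload["result"]["url"]

print(worker_v5_get_result_target(old_payload))  # old schema: works
try:
    worker_v5_get_result_target(new_payload)     # new schema: breaks
except KeyError as missing:
    print("worker v5 cannot read the new payload, missing key:", missing)
```

Because only worker v6 was updated, every job routed to worker v5 hit the equivalent of that `KeyError` when attempting Result Output.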
All Presto and Hive queries with Result Output configured, issued by customers assigned to Hadoop 1, were affected.
All other queries (including all queries assigned to HDP2) were NOT affected.
As many of you know already, we are in the midst of migrating from Hadoop 1 (CDH4) to Hadoop 2 (currently HDP2) to ensure uniformity in our infrastructure. Once we are fully migrated to Hadoop 2, this type of problem should not recur.
However, this bug has exposed a gap in our integration testing process. Effective immediately, our entire engineering team will conduct a thorough review of that process. Once we identify all gaps in our integration tests, we will swiftly implement the necessary tests to close them.
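One shape such a test could take (a minimal sketch; the parser functions and payload schema are hypothetical, not our actual code) is a pre-release check that exercises every worker version still serving production traffic against the payload the current API emits:

```python
def parse_result_target_v5(payload):
    # Parser as written for the hypothetical pre-release schema.
    return payload.get("result_output")

def parse_result_target_v6(payload):
    # Parser updated alongside the hypothetical API change.
    return payload.get("result", {}).get("url")

# Every worker version still deployed must be listed here.
DEPLOYED_PARSERS = {"v5": parse_result_target_v5, "v6": parse_result_target_v6}

def check_result_output_compatibility(current_payload):
    """Return the worker versions that cannot read the current API payload."""
    return [version for version, parse in DEPLOYED_PARSERS.items()
            if parse(current_payload) is None]

# The payload shape the current API emits (hypothetical).
current = {"result": {"url": "mysql://db.example.com/results"}}
print("incompatible workers:", check_result_output_compatibility(current))
```

Run as a release gate, a check like this would have flagged worker v5 as incompatible before the API change reached production, rather than after customer jobs began failing.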
Treasure Data provides analytics infrastructure as a service, which means that hundreds of businesses like yours rely on our service to perform important calculations and make critical decisions. To put our operations in perspective: last month, 1.9 trillion events were sent to our platform and 2 million questions were asked on Treasure Data.
We are really sorry about this incident and its impact on your data infrastructure. We promise to do better in the future. Thank you for your trust and support.
If you have any further questions, please send us an email (support@treasure-data.com) anytime.
Sincerely,
Kazuki Ohta (CTO and Co-founder)