On August 24th, 2015, between 2:44 PM and 4:28 PM PDT, following a feature release involving our internal API, customers allocated to Hadoop 1 (CDH4) clusters experienced a problem: all queries that push data to external resources (Result Output) were impacted.
Internally, we make it a point to use our own product to perform many of our calculations. For example, Treasure Data’s own data processing pipeline is built on top of Treasure Data.
Today at 4:15 PM, we noticed that a part of our own internal analytics that relies on Result Output was failing. Within 13 minutes, we had diagnosed the issue and deployed a fix.
Treasure Data runs on two versions of Hadoop: Hadoop 1 (CDH4) and Hadoop 2 (currently HDP2). Each customer is assigned to a cluster based on the date they signed up for our service.
Below is a high-level representation of our backend architecture:
During one of our daily releases, we introduced a change to our backend worker to improve our Hadoop 2 performance, specifically around the ability to push the results of queries to external systems (Result Output).
This change was deployed only to the workers (worker v6) used to execute Hadoop 2 jobs, but not to the workers (worker v5) used to execute Presto and Hadoop 1 queries.
The backend worker change was coupled with a change on the API side that broke compatibility with the earlier worker implementation (worker v5). Consequently, all jobs executed by worker v5 were unable to perform Result Output.
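To illustrate this class of failure (all payload keys and functions below are hypothetical; the report does not describe the actual API change), imagine the API starts returning Result Output settings in a new shape that only the newer worker understands:

```python
# Hypothetical sketch: the API once returned Result Output settings under
# the key "result_output"; the new release nests them under "result".
old_payload = {"result_output": "mysql://db.example.com/results"}
new_payload = {"result": {"type": "mysql", "url": "mysql://db.example.com/results"}}

def worker_v5_get_result_target(payload):
    # worker v5 was written against the old schema and raises KeyError
    # when the expected key is absent from the new payload.
    return payload["result_output"]

def worker_v6_get_result_target(payload):
    # worker v6 was updated alongside the API and reads the new schema.
    return payload["result"]["url"]

print(worker_v5_get_result_target(old_payload))  # old schema: works
try:
    worker_v5_get_result_target(new_payload)     # new schema: breaks
except KeyError as missing:
    print("worker v5 cannot read the new payload, missing key:", missing)
```

Because only worker v6 was updated, every job routed to worker v5 hit the equivalent of that `KeyError` when attempting Result Output.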
All Presto and Hive queries with Result Output configured, issued by customers assigned to Hadoop 1, were affected.
All other queries (including all queries assigned to HDP2) were NOT affected.
As many of you know already, we are in the midst of migrating from Hadoop 1 (CDH4) to Hadoop 2 (currently HDP2) to ensure uniformity in our infrastructure. Once we are fully migrated to Hadoop 2, this type of problem should not recur.
However, this bug has exposed a gap in our integration testing process. Effective immediately, our entire engineering team will conduct a thorough review of that process. Once we identify all gaps in our integration tests, we will swiftly implement the necessary tests to close them.
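One shape such a test could take (a minimal sketch; the parser functions and payload schema are hypothetical, not our actual code) is a pre-release check that exercises every worker version still serving production traffic against the payload the current API emits:

```python
def parse_result_target_v5(payload):
    # Parser as written for the hypothetical pre-release schema.
    return payload.get("result_output")

def parse_result_target_v6(payload):
    # Parser updated alongside the hypothetical API change.
    return payload.get("result", {}).get("url")

# Every worker version still deployed must be listed here.
DEPLOYED_PARSERS = {"v5": parse_result_target_v5, "v6": parse_result_target_v6}

def check_result_output_compatibility(current_payload):
    """Return the worker versions that cannot read the current API payload."""
    return [version for version, parse in DEPLOYED_PARSERS.items()
            if parse(current_payload) is None]

# The payload shape the current API emits (hypothetical).
current = {"result": {"url": "mysql://db.example.com/results"}}
print("incompatible workers:", check_result_output_compatibility(current))
```

Run as a release gate, a check like this would have flagged worker v5 as incompatible before the API change reached production, rather than after customer jobs began failing.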
Treasure Data provides analytics infrastructure as a service, which means that hundreds of businesses like yours rely on our service to perform important calculations and make critical decisions. To put our operations in perspective: last month, 1.9 trillion events were sent to our platform and 2 million questions were asked on Treasure Data.
We are really sorry about this incident and its impact on your data infrastructure. We promise to do better in the future. Thank you for your trust and support.
If you have any further questions, please send us an email (support@treasure-data.com) anytime.
Sincerely,
Kazuki Ohta (CTO and Co-founder)