Dear Treasure Data customers,
This is the postmortem for the Hive performance degradation problem that happened on Nov 12, 2014.
From 2:30pm to 5:00pm PST on Nov 12, 2014, a Hive query optimization called 'time index pushdown' did not work correctly. Time index pushdown is an optimization technique in which a query scans only the required time range, avoiding extra I/O requests. It is usually triggered via the TD_TIME_RANGE UDF.
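For reference, a typical Hive query restricts its scan window with the TD_TIME_RANGE UDF along these lines (the table and time range below are illustrative, not from an actual customer workload):

```sql
-- With time index pushdown working, only partitions inside the given
-- time range are read; the rest of the table is skipped entirely.
-- 'access_logs' is a hypothetical table name.
SELECT COUNT(1)
FROM access_logs
WHERE TD_TIME_RANGE(time, '2014-11-01', '2014-11-12', 'PST')
```

During the incident window, queries like this one still returned correct results, but scanned the full table as if the time predicate were absent.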
Because of this issue, queries always scanned the entire table. This caused severe performance degradation, especially for queries that access very large tables.
This problem did not affect data imports or the Presto engine.
Originally, we were trying to fix a bug in 'time index pushdown' where it did not work for self-join queries. We modified Hadoop Hive and our internal middleware, and confirmed the fix worked correctly in our staging environment, which continuously deploys our most recent codebase.
We release weekly, and Tuesday is our day to deploy changes. We deployed the changes, but forgot to deploy one associated change to the server configuration (specifically, an update to the Hadoop configuration file). This missing configuration resulted in the problem described above.
During that time period, we unfortunately could not determine whether the queries were simply slow or the system was heavily loaded. Eventually one of our customers filed a support ticket, and we noticed the problem around 4:00pm PST.
This problem happened mainly because of a miscommunication within the engineering team: not all of the modules to be deployed for the weekly release were listed. We are currently working on a 'pre-production' environment to make the deployment process more systematic.
Again, we want to apologize. We know how critical our services are to our customers' businesses. We will do everything we can to learn from this event and use it to drive improvement across our services.
Sincerely, The Treasure Data Team