Hive Time Filtering Ineffective
Incident Report for Treasure Data
Postmortem

Dear Treasure Data customers,

This is the postmortem for the Hive performance degradation that occurred on Nov 12, 2014.

What happened

Between 2:30pm and 5:00pm PST on Nov 12, 2014, the Hive query optimization called 'time index pushdown' did not work correctly. Time index pushdown is an optimization technique in which a query scans only the required time range, avoiding extra I/O requests. It is usually applied via the TD_TIME_RANGE UDF.
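
As an illustration, here is a minimal sketch of how time index pushdown is normally triggered, assuming the standard TD_TIME_RANGE signature (a unix-timestamp 'time' column plus start, end, and an optional timezone); the table name is hypothetical:

    -- With time index pushdown working, this query scans only the data
    -- covering Nov 12, 2014 (PST) in the hypothetical 'www_access' table:
    SELECT COUNT(1)
    FROM www_access
    WHERE TD_TIME_RANGE(time, '2014-11-12', '2014-11-13', 'PST');

    -- During the incident, the same query behaved as if the predicate gave
    -- no pruning hint, scanning the entire table before filtering rows.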

Because of this issue, queries always scanned the entire table. This caused severe performance degradation, especially for queries that access very large tables.

This problem did not affect data import or the Presto engine.

Why this incident happened

Originally, we were trying to fix a 'time index pushdown' bug in which the optimization did not work for self-join queries. We modified Hadoop Hive and our internal middleware, and confirmed that the fix worked correctly in our staging environment, which continuously deploys our most recent codebase.
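
For reference, a minimal sketch of the kind of self-join the original fix targeted; the table and column names are hypothetical, and only the query shape matters:

    -- Self-join of a hypothetical 'events' table, with both sides restricted
    -- to the same day. Queries of this shape were the ones for which time
    -- index pushdown previously did not apply.
    SELECT a.user_id, COUNT(1) AS cnt
    FROM events a
    JOIN events b
      ON a.user_id = b.user_id
    WHERE TD_TIME_RANGE(a.time, '2014-11-12', '2014-11-13', 'PST')
      AND TD_TIME_RANGE(b.time, '2014-11-12', '2014-11-13', 'PST')
    GROUP BY a.user_id;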

We release weekly, and Tuesday is the day we deploy changes. We deployed the code changes, but forgot to deploy one associated server configuration change (specifically, an update to the Hadoop configuration file). This missing configuration resulted in the problem described above.

During that period, we unfortunately could not tell whether queries were running too slowly or the system was simply heavily loaded. Eventually one of our customers filed a support ticket, and we noticed the problem around 4:00pm PST.

How we will prevent this problem

This problem happened mainly because of miscommunication within the engineering team: not all of the modules to be deployed in the weekly release were listed. We are currently working on a 'pre-production' environment to make the deployment process more systematic.

Again, we want to apologize. We know how critical our services are to our customers' businesses. We will do everything we can to learn from this event and use it to drive improvement across our services.

Sincerely, The Treasure Data Team

Posted Nov 16, 2014 - 23:51 PST

Resolved
This incident has been resolved.
Posted Nov 12, 2014 - 03:55 PST
Update
Most of the pending jobs have completed and the cluster workload is back to normal. We've reverted the configuration to its normal state.
Posted Nov 11, 2014 - 22:35 PST
Monitoring
We have confirmed that the issue has been identified and fixed.

We increased the number of available Hive cores for the service and doubled the maximum number of mappers and reducers for every account. This provides more computational resources to the affected queries that are running slowly (because of the lack of time filtering), so that they complete sooner.

We will continue to monitor the situation until the number of queued jobs and the execution times are stable again.
Posted Nov 11, 2014 - 19:36 PST
Identified
We have identified and fixed an issue concerning the time filtering of all Hive queries.

Hive queries created between ~2:30 PM and ~5:00 PM PST are affected: their execution is significantly slower, especially when they scan large tables that were meant to be sliced by time.

A fix for this problem was deployed at ~5:00 PM PST.
* New queries are once again able to leverage time filtering.
* Existing queries in the 'running' state will, however, still experience the problem and may cause new queries to wait for an execution slot. Customers who are able to do so are invited to terminate their old queries and rerun them to take advantage of the fix.

At the same time, more query execution resources are being added to help consume the job backlog for all customers.
Posted Nov 11, 2014 - 17:41 PST