Presto query race condition with columnar processing optimization
Incident Report for Treasure Data
Postmortem

Summary

We experienced an issue with Presto that affected the processing of queries starting at around 1:30 AM on May 26th PDT (17:30 PM on May 26th JST) when a new version of Presto was released.

The issue was resolved and Presto's behavior rolled back to a stable state on 4:10 AM PDT / 20:10 JST on May 26th.

During that time, all queries containing a WHERE clause on a string column were affected and could have produced a wrong result. Please read on for more details.

Description

In last night's Presto release (May 26th, 1:30 AM PDT, 17:30 JST) we enabled columnar processing optimization. This feature intended to optimize the speed of scan and filter operators for a column by storing and analyzing the column values in batches instead of one by one.

Unfortunately this new mechanism introduced an issue in managing the underlying memory buffers storing the in-flight data.

This problem was specifically triggered by processing string columns, because of the particular way they are handled and stored in memory buffers/pages. When the size of column string values exceeded certain thresholds, the memory buffers began to be reused. This caused the following stages of the query processing pipeline operating to read data that was already overwritten onto, thus producing incorrect query results. This issue became more evident in the context of GROUP BYs, fact that initially prompted us to direct our investigation in that direction.

Since the columnar processing optimization was only active for those queries applying both scanning and filtering operators on the same column, all queries containing a WHERE clause could be affected. For example, these queries would be not affected:

SELECT 
    month, 
    COUNT(1) count
  FROM (
    SELECT 
        TD_DATE_TRUNC(‘month’, time) month 
      FROM 
        laws
  ) 
  GROUP BY 
    1 
  ORDER BY 
    1
SELECT 
    TD_DATE_TRUNC(‘month’, time) month, 
    COUNT(1) count
  FROM
    laws
  GROUP BY 
    1 
  ORDER BY 
    1

These sample queries would be affected:

SELECT 
    month, 
    COUNT(1) count
  FROM (
    SELECT 
        TD_DATE_TRUNC(‘month’, time) month 
      FROM 
        laws
      WHERE 
        state_code = ‘CA’
  ) 
  GROUP BY 
    1 
  ORDER BY 
    1
SELECT 
    month, 
    COUNT(1) count
  FROM (
    SELECT 
        TD_DATE_TRUNC(‘month’, time) month,
        state_code 
      FROM 
        laws
  ) 
  WHERE
    state_code = ‘CA’
  GROUP BY 
    1 
  ORDER BY 
    1
SELECT 
    TD_DATE_TRUNC(‘month’, time) month,
    COUNT(1) count
  FROM 
    laws
  WHERE 
    state_code = ‘CA’
  GROUP BY 
    1 
  ORDER BY 
    1

Because other column types are directly mapped to native Java primitives, processing of those columns was not affected.

We realized the occurrence of this problem at around May 26th 4:10AM PDT (20:10 JST) and immediately proceeded to disabled the columnar processing optimization as a precaution. Unfortunately queries that were ran during this time (around 2 1/2 hours) and affected by this issue, produced incorrect results.

Due to the nature of the problem, it is unfortunately not possible to systematically identify which of the queries executed during this time were affected by this issue. We recommend our customers to rerun the queries that are considered important and/or have business impact. We apologize for the inconvenience this may have caused and please don't hesitate to contact our Support staff at support@treasuredata.com if you have any question.

Posted May 26, 2016 - 16:23 PDT

Resolved
We have identified the cause of this problem. Unlike originally stated, while the problem affects GROUP BYs, it’s not originating from there.

In this release we enabled columnar processing optimization, which intended to speed up filtering of columns.
Unfortunately this new mechanism introduced an issue in managing the underlying memory buffers storing the in-flight data, de facto causing invalid query results.

We apologize for the inconvenience this may have caused.

For more details, please refer to the postmortem.
Posted May 26, 2016 - 15:41 PDT
Update
We have identified a possible issue with sub-query optimization when the inner sub-query of a nested query uses GROUP BY.
We continue to investigate further to narrow down the failure cases and will continue to update the status as we have more findings.
Posted May 26, 2016 - 10:05 PDT
Identified
We have found the root cause. We've enabled the optimization option to further optimize the processing of columns, however it has race conditions that could cause wrong results or error. Not all the jobs were affected, and we are still investigating which jobs were affected during 1:30AM-4:10AM PDT (17:30-20:10 JST).

We're sorry for this inconvenience.
Posted May 26, 2016 - 07:04 PDT
Investigating
During 1:30am - 4:10am PDT (17:30-20:10 JST), we found that some queries by Presto returned wrong results or error due to new Presto release. We're investigating the cause and affected jobs.
Posted May 26, 2016 - 05:01 PDT