Streaming Data Import Delay for td-agent v1.1.20 or earlier
Incident Report for Treasure Data
Postmortem

Dear Treasure Data Customers, this is the postmortem for the streaming data import delay that occurred on Jul 21, 2015, affecting td-agent v1.1.20 or earlier.

What happened?

From 2015-07-21 14:30 to 2015-07-22 11:45 (UTC), our API import endpoint (https://api-import.treasuredata.com/) did not accept SSLv3 connections.

Customers using td-agent1 (old stable) version 1.1.20 or earlier experienced a data import delay, with an error message like the one below in the log (/var/log/td-agent.log).

SSL_connect returned=1 errno=0 state=SSLv3 read server hello A: sslv3 alert handshake failure

This error indicates that td-agent was trying to upload over SSLv3, but our API server refused to establish the connection.

Although the incoming data should have been buffered on disk, data imports were significantly delayed. At 2015-07-22 11:45 (UTC), we re-enabled SSLv3, which resolved the problem.
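For customers who want to check whether an agent was affected, the following is a minimal sketch in Python that counts occurrences of this error in a td-agent log. The function names are hypothetical, and the pattern is taken only from the message quoted above; the exact wording may vary between Ruby and OpenSSL versions, so please treat it as an assumption rather than an official tool.

```python
import re

# Assumption: the failure is identified by this substring, taken from the
# error message quoted in this report. Other versions may word it differently.
SSLV3_FAILURE = re.compile(r"sslv3 alert handshake failure", re.IGNORECASE)


def count_sslv3_failures(lines):
    """Return how many of the given log lines contain the SSLv3 handshake failure."""
    return sum(1 for line in lines if SSLV3_FAILURE.search(line))


def scan_log(path="/var/log/td-agent.log"):
    """Count SSLv3 handshake failures in the log file at `path`."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, errors="replace") as f:
        return count_sslv3_failures(f)
```

Calling scan_log() on a host running td-agent returns the number of matching lines; a non-zero count during the incident window suggests that the agent was attempting SSLv3 uploads.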

Why did this incident happen?

Recently the world discovered that SSLv3 contains weaknesses in its ability to protect and secure communications. This is well known as the POODLE vulnerability.

These weaknesses have been addressed in Transport Layer Security (TLS), which is the replacement for SSLv3 and the new default for most operating systems and clients.

Consistent with our top priority of protecting Treasure Data customers, we had planned to support the more modern TLS versions in place of SSLv3.

Originally, we planned to notify customers of the SSLv3 deprecation by email.

However, when we modified our load balancer (Elastic Load Balancer) configuration, SSLv3 was disabled by default by the underlying cloud provider, and we did not notice it. After the load balancer configuration change, our endpoint started rejecting SSLv3 connections.

How will we prevent the problem?

We will work more closely and quickly to mitigate any potential security issues.

We will also implement additional server-side monitoring that checks each customer's data import rate more strictly. This will allow us to raise a critical alert when the import rate drops dramatically for multiple customers.

We will also contact affected customers and ask them to stop using SSLv3. Our customer success team will reach out to you to recommend upgrading to td-agent2 (current stable) or td-agent1 (old stable) v1.1.21.
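To illustrate the idea, here is a minimal sketch in Python of such a check. It is not our production monitoring; the function names, the 50% drop threshold, and the two-customer minimum are all assumptions chosen for the example.

```python
def dropped_customers(baseline, current, drop_ratio=0.5):
    """Return customers whose current import rate is below drop_ratio * baseline.

    `baseline` and `current` map a customer ID to an import rate
    (for example, records per minute). Both are hypothetical inputs.
    """
    return sorted(
        customer
        for customer, base_rate in baseline.items()
        # Customers missing from `current` are treated as importing nothing.
        if base_rate > 0 and current.get(customer, 0) < base_rate * drop_ratio
    )


def should_alert(baseline, current, drop_ratio=0.5, min_customers=2):
    """Raise a critical alert only when several customers drop at once."""
    return len(dropped_customers(baseline, current, drop_ratio)) >= min_customers
```

Requiring several customers to drop at the same time is meant to separate a service-side problem, such as a rejected protocol, from a single customer pausing their own imports.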

Again, we want to apologize. We know how critical our services are to our customers' businesses. We will do everything we can to learn from this event and use it to drive improvement across our services.

Sincerely, The Treasure Data Team

Posted Jul 23, 2015 - 23:02 PDT

Resolved
This incident has been resolved.
Posted Jul 22, 2015 - 04:45 PDT