[US region] Scheduled Database upgrade

Scheduled Maintenance Report for Treasure Data

Completed

The scheduled Database upgrade is now complete, and we have verified that all systems have fully recovered.

The upgrade took 5 minutes in total, from 18:32 to 18:37 PDT (10:32 to 10:37 JST).

Posted Oct 23, 2019 - 19:06 PDT

Verifying

The scheduled Database upgrade is now complete.

We are now monitoring closely to ensure all systems return to their typical functionality levels as quickly as possible.

Posted Oct 23, 2019 - 18:44 PDT

In progress

The scheduled Database upgrade is starting now.

We expect the maintenance procedure to start between 18:15 PDT (10:15 JST) and 18:30 PDT (10:30 JST) and the bulk of the operation to last no more than 15 minutes. During the maintenance, our REST APIs will be largely unavailable and will respond with error code 500 or similar.

Immediately after the operation completes, there will be a recovery period of at most 15 additional minutes throughout which the REST APIs and Web Interface will gradually return to their typical performance levels.

Posted Oct 23, 2019 - 18:00 PDT

Update

Our Planned Maintenance window will begin in about an hour, starting at 18:00 PDT (10:00 JST). We expect the maintenance procedure to start between 18:15 PDT (10:15 JST) and 18:30 PDT (10:30 JST) and the bulk of the operation to last no more than 15 minutes. Immediately after the operation completes, there will be a recovery period of at most 15 additional minutes throughout which the REST APIs and Web Interface will gradually return to their typical performance levels.

As a reminder (from the previous communications), during this time:

* The REST API will become unreachable or respond with error code 500 or similar. This will prevent all primary actions, for example: read/write/update/delete of databases, tables, scheduled and saved queries, data connector sources, and users; and creation/submission of Presto and Hive queries and Data Connector jobs.

* The Web Interface will not be fully functional. The impact is similar to that described in the point above.

* Commands of the td command-line interface (CLI) will either fail and return errors to the user (read requests) or be delayed until the maintenance is complete (write requests).

* Streaming import requests will fail: where fluentd / td-agent is being used (as recommended), event collection will continue locally on each device/server and will recover automatically once the maintenance is complete thanks to the built-in buffering and retry mechanisms.

* The execution of scheduled queries and connector jobs will be delayed. Scheduled jobs that are already executing will either complete or be retried internally until they do. The retry mechanism may cause jobs to run longer than expected, by up to 15 minutes in the worst case.

* Workflows using Treasure Data operators (e.g. td>) will retry and regain full functionality after the upgrade. If workflow sessions fail, customers can resume them manually.

* JavaScript SDK and Mobile SDK (Android, iOS, and Unity) event collection will continue undisturbed, but the records will not be available for querying until after the maintenance is completed.

* The Presto JDBC / ODBC Gateway will report authentication failures to the clients (ODBC and JDBC clients and tools/services using them).

Beyond this notice, we will provide updates at the start and completion of the operation and once the verification of the new system is completed: at that time, all systems will have returned to full functionality and the Scheduled Maintenance will be closed.

Posted Oct 23, 2019 - 17:00 PDT

Scheduled

On Wednesday, October 23rd starting at 18:15 PDT (Thursday, October 24th starting at 10:15 JST), Treasure Data will be performing an upgrade of the main database for the US region.
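
For readers who want to check the start time in another time zone, here is a minimal Python sketch of the conversion. It assumes Pacific Daylight Time (UTC-7), which is in effect in the US on October 23rd, and Japan Standard Time (UTC+9). The helper name `pdt_to_jst` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumption: daylight time applies on the maintenance date).
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")


def pdt_to_jst(year: int, month: int, day: int, hour: int, minute: int) -> datetime:
    """Interpret the given wall-clock time as PDT and convert it to JST."""
    return datetime(year, month, day, hour, minute, tzinfo=PDT).astimezone(JST)


# Announced start of the maintenance procedure: Oct 23, 2019, 18:15 PDT.
start_jst = pdt_to_jst(2019, 10, 23, 18, 15)
print(start_jst.strftime("%Y-%m-%d %H:%M %Z"))
```

The 16-hour difference between the two zones is why the maintenance falls on the following calendar day in Japan.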

The maintenance is necessary to upgrade the database storage so that it can withstand future usage growth, and it is part of the remediation activities discussed in the postmortem of this incident: https://status.treasuredata.com/incidents/zs55rsqkg189.

# Impact

NOTE: This maintenance notice only concerns customers using the US region, not customers using the Tokyo or EU regions.

The database operation will take at most 15 minutes: we refer to this as the maintenance phase. The systems that connect directly to the database, namely our REST APIs (https://api.treasuredata.com, https://api-import.treasuredata.com) and Web Interface, will be significantly affected and unreachable during this phase.
The maintenance will be followed by a recovery phase of at most 15 minutes. During this phase, the REST APIs and Web Interface will gradually return to their typical performance levels.

Below is a summary of the impact customers will observe:

* The REST API will become unreachable or respond with error code 500 or similar. This will prevent all primary actions, for example: read/write/update/delete of databases, tables, scheduled and saved queries, data connector sources, and users; and creation/submission of Presto and Hive queries and Data Connector jobs.

* The Web Interface will not be fully functional. The impact is similar to that described in the point above.

* Commands of the td command-line interface (CLI) will either fail and return errors to the user (read requests) or be delayed until the maintenance is complete (write requests).

* Streaming import requests will fail: where fluentd / td-agent is being used (as recommended), event collection will continue locally on each device/server and will recover automatically once the maintenance is complete thanks to the built-in buffering and retry mechanisms.

* The execution of scheduled queries and connector jobs will be delayed. Scheduled jobs that are already executing will either complete or be retried internally until they do. The retry mechanism may cause jobs to run longer than expected, by up to 15 minutes in the worst case.

* Workflows using Treasure Data operators (e.g. td>) will retry and regain full functionality after the upgrade. If workflow sessions fail, customers can resume them manually.

* JavaScript SDK and Mobile SDK (Android, iOS, and Unity) event collection will continue undisturbed, but the records will not be available for querying until after the maintenance is completed.

* The Presto JDBC / ODBC Gateway will report authentication failures to the clients (ODBC and JDBC clients and tools/services using them).
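
For custom clients that call the REST API directly, one way to ride out the maintenance and recovery phases is to retry on 5xx responses with exponential backoff. The following is a minimal Python sketch under stated assumptions, not Treasure Data's client implementation: `call_with_retry` and its parameters are hypothetical names, and `send` stands for whatever function your own code uses to issue one request and return a status code and a body.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical caller-supplied function that performs one
    request and returns (status_code, body). The delay doubles after each
    failed attempt and is capped at max_delay seconds.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status < 500:
            break  # success or a client error: retrying will not help
        sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        status, body = send()
    return status, body
```

With the defaults above, a request is attempted at most six times, waiting 1, 2, 4, 8, and 16 seconds between attempts. To cover the full announced window (up to 15 minutes of maintenance plus up to 15 minutes of recovery), a caller would need to raise `max_attempts` or `max_delay` accordingly.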

Beyond this notice, we will provide updates approximately 1 hour before the beginning of the upgrade window, at the start and completion of the operation, and once the verification is completed. At that time, all systems will have returned to full functionality and the Scheduled Maintenance will be closed.

If you have any questions or concerns about this upgrade, please feel free to reach out to our Support team at support@treasuredata.com.

Posted Oct 09, 2019 - 15:33 PDT

This scheduled maintenance affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Hadoop / Hive Query Engine, Presto Query Engine, Presto JDBC/ODBC Gateway, Workflow, CDP API).