[US Region] Scheduled Metadata Database maintenance
Scheduled Maintenance Report for Treasure Data
Postmortem

Summary

On Sep 16th, 2019, from 18:32 to 19:29 PDT (Sep 17th, 2019 from 10:32 to 11:29 JST), we performed maintenance on the PlazmaDB Metadata database to upgrade the underlying PostgreSQL to version 11.

The upgrade was necessary to mitigate the performance limitations caused by database lock contention, which had affected the system since August 21st and caused moderate, frequent Streaming Import delays and sporadic Query execution delays.

The upgrade was successful and brought the performance of the database back to adequate levels given our current workload: all database metrics look healthy once again.

Following the maintenance, on Sep 17th, 2019 at 04:27 PDT (20:27 JST) and around 3 hours later at 07:36 PDT (23:36 JST), we reported two Presto performance degradation incidents: see here and here. We want to clarify that both of these incidents are unrelated to the preceding PlazmaDB maintenance and are exclusively due to issues with the infrastructure nodes the Presto engine runs on.

What’s Next

While this improvement is no guarantee that the performance will be sufficient to withstand a future steep increase in the workload, it has provided relief from the continuous incidents we have been experiencing and created enough headroom to allow us to start exploring options to address future scalability of these components. The most concrete plan is pushing forward the rollout of the Plazma Metadata API: while this improvement is not directly related to improving stability, it will introduce a further abstraction layer between the PlazmaDB database and the systems writing to/reading from it. This will:

  • introduce more efficient and adaptable per-client rate limiting. This will allow us to better isolate failures and reduce the blast radius in case of trouble.
  • enable us to better control the scalability of PlazmaDB. This will provide us with additional abilities to introduce future scalability improvements transparently to the customers and without the need for a planned downtime in most cases.
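
As an illustration only, per-client rate limiting of the kind mentioned above could be sketched as a token bucket keyed by client. The class name, the parameters, and the choice of a token bucket are all assumptions made for this example; they do not describe the actual Plazma Metadata API.

```python
import time
from typing import Callable, Dict, Tuple


class PerClientRateLimiter:
    """Hypothetical sketch: one token bucket per client identifier."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate          # tokens added per second (assumed unit)
        self.capacity = capacity  # maximum burst size per client
        self.clock = clock        # injectable so the behavior can be tested
        # client_id -> (tokens currently available, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_id: str) -> bool:
        """Return True if this client may issue one more request now."""
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, updated_at = self.buckets.get(client_id, (self.capacity, now))
        # Refill in proportion to the elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - updated_at) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[client_id] = (tokens, now)
        return allowed
```

Because each client draws from its own bucket, one client exhausting its budget does not reduce what is available to the others, which is the isolation property the first bullet refers to.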

We are currently focused on the remediations for the two additional major incidents that occurred on September 3rd ([US Region] Elevated error rate of Stream import API) and September 4th ([US] Job submission). We are still evaluating the various options, and this may lead us to schedule an additional maintenance in the near future: stay tuned, we will announce it on the status page as soon as we have determined that to be the best path forward.

Conclusion

We once again regret the trouble the recent instability has caused and apologize for the inconvenience from the incidents and the scheduled maintenance.

Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.

Posted Sep 18, 2019 - 14:23 PDT

Completed
*Maintenance* and system *recovery* have been fully completed.
The scheduled database maintenance is now complete.
Posted Sep 16, 2019 - 19:49 PDT
Update
The Plazma Metadata database downtime for the upgrade operation started at 1:32 am UTC and ended at 1:56 am UTC. However, during the system recovery operation, one of the Presto clusters took additional time to start working and became fully functional at 2:29 am UTC.
We are monitoring Presto clusters to confirm queued queries are processed as expected.
Posted Sep 16, 2019 - 19:44 PDT
Verifying
The scheduled database maintenance is complete.

We are monitoring the system closely to ensure all systems successfully complete their *recovery* and return as quickly as possible to full functionality.
Posted Sep 16, 2019 - 19:31 PDT
Update
Our database upgrade operations finished successfully, but one of the Presto query engine clusters is still not operational, and queued jobs will remain in the queued state for a few more minutes.
Posted Sep 16, 2019 - 19:26 PDT
Update
The database upgrade operations are still in progress. We are monitoring the progress and will require additional maintenance time.
Our maintenance window will be extended until 7:30 PM PDT (11:30 AM JST, 4:30 AM CEST).
Posted Sep 16, 2019 - 18:54 PDT
In progress
The scheduled database maintenance window is starting now.

During the maintenance and recovery, customers may experience the following:

- Streaming, Mobile, and JavaScript/Browser imports delay

Streaming import (through td-agent or fluentd) requests will continue to be accepted as usual but the requests will remain queued until after the database maintenance is complete. We expect stream import processing to be further delayed during recovery.
The same will apply to import requests from browsers (JavaScript SDK) and mobile devices (Android, iOS, and Unity SDKs).

- Jobs execution delay

All jobs (Presto, Hive, Result Export, Data Connector Integrations, Bulk Import, Export, and Partial Delete jobs submitted from Console, API, Workflow or triggered by our system according to the configured schedule) will fail and continue to retry during maintenance. During recovery, we expect jobs to begin processing slowly: within 30 minutes job processing should return to full throughput.

- Presto JDBC / ODBC Gateway errors

The Presto JDBC / ODBC Gateway will report errors during maintenance due to the unreachability of the Metadata database: errors will be propagated to the clients. During recovery, we expect processing of Presto JDBC / ODBC jobs to follow the same recovery pattern as all other jobs (see above).

- Console

Data Workbench and Audience Studio will encounter errors caused by failures of the underlying Master Segments, Segments, and Workflows jobs.
Posted Sep 16, 2019 - 18:01 PDT
Update
In about an hour, from 6:00 PM PDT (10:00 AM JST, 3:00 AM CEST), the maintenance window for the PlazmaDB Metadata database will commence.

During the maintenance and recovery, customers may experience the following:

- Streaming, Mobile, and JavaScript/Browser imports delay

Streaming import (through td-agent or fluentd) requests will continue to be accepted as usual but the requests will remain queued until after the database maintenance is complete. We expect stream import processing to be further delayed during recovery.
The same will apply to import requests from browsers (JavaScript SDK) and mobile devices (Android, iOS, and Unity SDKs).

- Jobs execution delay

All jobs (Presto, Hive, Result Export, Data Connector Integrations, Bulk Import, Export, and Partial Delete jobs submitted from Console, API, Workflow or triggered by our system according to the configured schedule) will fail and continue to retry during maintenance. During recovery, we expect jobs to begin processing slowly: within 30 minutes job processing should return to full throughput.

- Presto JDBC / ODBC Gateway errors

The Presto JDBC / ODBC Gateway will report errors during maintenance due to the unreachability of the Metadata database: errors will be propagated to the clients. During recovery, we expect processing of Presto JDBC / ODBC jobs to follow the same recovery pattern as all other jobs (see above).

- Console

Data Workbench and Audience Studio will encounter errors caused by failures of the underlying Master Segments, Segments, and Workflows jobs.

Beyond this notice, we will provide updates at the start and completion of the operation, and once the verification of the new system is completed. At that time, all systems will have returned to full functionality and this Scheduled Maintenance will be closed.
Posted Sep 16, 2019 - 17:00 PDT
Scheduled
On Monday, September 16th from 6 to 7 PM PDT (Tuesday, September 17th from 10 to 11 AM JST, September 17th from 3 to 4 AM CEST) we will be performing maintenance on the PlazmaDB Metadata database. The maintenance is necessary to upgrade the PostgreSQL database to address the performance limitations that have recently affected the TD system and have surfaced as Streaming Import visibility delays and occasional slowdowns of Queries.

This maintenance was originally scheduled for Tuesday, September 3rd but was cancelled due to the issues our testing and benchmarking had uncovered during and after the test upgrade. These issues have now been addressed.

The database will become unreachable for the duration of the maintenance procedure, which should last no longer than 20 minutes. We expect this to be followed by a recovery period of around 30 minutes during which the system will gradually return to full throughput.


# Impact

During the maintenance and recovery, customers may experience the following:

- Streaming, Mobile, and JavaScript/Browser imports delay

Streaming import (through td-agent or fluentd) requests will continue to be accepted as usual but the requests will remain queued until after the database maintenance is complete. We expect stream import processing to be further delayed during recovery.
The same will apply to import requests from browsers (JavaScript SDK) and mobile devices (Android, iOS, and Unity SDKs).

- Jobs execution delay

All jobs (Presto, Hive, Result Export, Data Connector Integrations, Bulk Import, Export, and Partial Delete jobs submitted from Console, API, Workflow or triggered by our system according to the configured schedule) will fail and continue to retry during maintenance. During recovery, we expect jobs to begin processing slowly: within 30 minutes job processing should return to full throughput.

- Presto JDBC / ODBC Gateway errors

The Presto JDBC / ODBC Gateway will report errors during maintenance due to the unreachability of the Metadata database: errors will be propagated to the clients. During recovery, we expect processing of Presto JDBC / ODBC jobs to follow the same recovery pattern as all other jobs (see above).

- Console

Data Workbench and Audience Studio will encounter errors caused by failures of the underlying Master Segments, Segments, and Workflows jobs.


# Communication

Beyond this notice, we will provide updates approximately 1 hour before the beginning of the maintenance window, at the start and completion of the operation, and once the verification is completed. At that time, all systems will have returned to full functionality and the Scheduled Maintenance will be closed.

If you have any questions or concerns about this upgrade, please feel free to reach out to our Support team at support@treasuredata.com.
Posted Sep 10, 2019 - 18:38 PDT
This scheduled maintenance affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Hadoop / Hive Query Engine, Presto Query Engine, Presto JDBC/ODBC Gateway, Workflow, CDP API).