On Sep 16th, 2019, from 18:32 to 19:29 PDT (Sep 17th, 2019, from 10:32 to 11:29 JST), we performed maintenance on the PlazmaDB Metadata database to upgrade the underlying PostgreSQL to version 11.
The upgrade was necessary to mitigate the performance limitations caused by database lock contention, which the system had suffered from since August 21st and which caused moderate, frequent Streaming Import delays and sporadic Query execution delays.
The upgrade was successful and brought the performance of the database back to adequate levels for our current workload: all database metrics look healthy once again.
Following the maintenance, on Sep 17th, 2019 at 04:27 PDT (Sep 17th, 2019 at 20:27 JST) and around 3 hours later at 07:36 PDT (23:36 JST), we reported two Presto performance degradation incidents: see here and here. We want to clarify that both of these incidents are unrelated to yesterday's PlazmaDB maintenance and are exclusively due to issues with the infrastructure nodes the Presto engine runs on.
While the upgrade is no guarantee that performance will be sufficient to withstand a future steep increase in workload, it has provided relief from the continuous incidents we have been experiencing and created enough headroom for us to start exploring options to address the future scalability of these components. The most concrete plan is to push forward the rollout of the Plazma Metadata API: while this change is not directly aimed at improving stability, it will introduce a further abstraction layer between the PlazmaDB database and the systems writing to and reading from it. This will:
We are currently focused on the remediations for the two additional major incidents that occurred on September 3rd ([US Region] Elevated error rate of Stream import API) and September 4th ([US] Job submission). We are still evaluating the various options, and this may lead us to schedule an additional maintenance in the near future: stay tuned, we will announce it on the status page as soon as we have determined that to be the best path forward.
We once again regret the trouble the recent instability has caused and apologize for the inconvenience from the incidents and the scheduled maintenance.
Please feel free to reach out to our support team through support@treasuredata.com if you have any questions.