All originally affected systems have been operating normally for the last 20 minutes; we are closing this incident.
A quick summary of the incident follows: The partial performance degradation for the Personalization Lookup and Ingest APIs started at 11:12 PDT (18:12 UTC). At that time one of the servers in the cluster was handling 8x the load of the others, making it a hot spot. By 11:48 PDT (18:48 UTC) the server had been restarted and the performance degradation was largely remediated, although the system was still busy processing the accumulated backlog of ingest requests. At 11:54 PDT (18:54 UTC) the system had returned to normal performance levels and the ingest backlog was completely consumed.
Posted Aug 03, 2019 - 12:30 PDT
We observed skewed load on one of the server instances providing the Personalization functionality (Lookup and Ingest). The server instance was restarted, which allowed the load to be properly repartitioned across the other instances in the cluster. The system is now recovering and the ingestion backlog is being consumed. We'll continue to monitor the system closely until the load has returned to normal levels.
Posted Aug 03, 2019 - 12:05 PDT
We are investigating a performance degradation in the Personalization APIs. Both the Lookup and Ingest APIs appear affected. Client requests may receive 50x errors back.
Posted Aug 03, 2019 - 11:37 PDT
This incident affected: Tokyo (CDP Personalization - Lookup API, CDP Personalization - Ingest API).