[EU Region] Elevated error rate and performance degradation for personalization API

Incident Report for Treasure Data

Resolved

We implemented fundamental isolation of a problematic configuration at 14:42 UTC, which caused the cluster workload to drop from 60% to 1%. On Friday, we had implemented write access isolation for the problematic configuration, which stopped the cluster workload from growing. Today, we implemented read access isolation, which restored the cluster workload to its previous level.

The system is now operating normally, and we are closing the incident. We acknowledge that further action is needed to prevent a similar configuration from causing the same incident again. We will post a postmortem when it is ready.

Posted Jan 30, 2025 - 07:43 PST

Update

We are still monitoring the service.

Between 10:00 UTC and 11:05 UTC on Thursday, 30 Jan 2025, customers experienced elevated error rates and higher latency for Profiles API lookups. The cluster workload has since calmed down and is operating normally.

Our response team is ready to provision additional processing capacity, and we are closely monitoring the service status to avoid further downtime during peak times. In addition, we are working on isolating the problematic accesses from the service.

We will keep the status page open and update you on the progress.

Posted Jan 30, 2025 - 06:18 PST

Update

We are continuing to monitor for any further issues.

Posted Jan 30, 2025 - 04:31 PST

Monitoring

The performance degradation and error rate have improved.
We continue to monitor the metrics closely.

Posted Jan 30, 2025 - 03:38 PST

Investigating

We detected degraded performance of the personalization API and an increased error rate.
We are currently investigating this issue.

Posted Jan 30, 2025 - 02:54 PST

This incident affected: EU (CDP API, CDP Personalization - Lookup API, CDP Personalization - Ingest API).