[EU Region] Elevated errors and performance degradation related to the Personalization API

Incident Report for Treasure Data

Postmortem

The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 22, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API.

We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.

Timeline

January 20, 07:45 to 11:15 UTC: 3% error rate
January 21, 07:35 to 10:25 UTC: 33% error rate
January 22, 09:15 to 16:40 UTC: 40% error rate

During these periods, API calls to https://cdp-eu01.in.treasuredata.com/ exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
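For readers who want to quantify the impact, the following minimal sketch computes the length of each window listed in the timeline. The year is taken from the posting dates of this report; the variable and function names are illustrative only and are not part of any Treasure Data tooling.

```python
from datetime import datetime

# Impact windows copied from the timeline (all times UTC).
# Each entry: (start, end, reported error rate).
WINDOWS = [
    ("2025-01-20 07:45", "2025-01-20 11:15", 0.03),
    ("2025-01-21 07:35", "2025-01-21 10:25", 0.33),
    ("2025-01-22 09:15", "2025-01-22 16:40", 0.40),
]

FMT = "%Y-%m-%d %H:%M"


def window_minutes(start: str, end: str) -> int:
    """Length of one impact window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


durations = [window_minutes(start, end) for start, end, _ in WINDOWS]
total_minutes = sum(durations)

for (start, end, rate), minutes in zip(WINDOWS, durations):
    print(f"{start} to {end[-5:]} UTC: {minutes} min at {rate:.0%} errors")
print(f"Total: {total_minutes} min")
```

The three windows come to 210, 170, and 445 minutes, or 13 hours 45 minutes in total.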

Incident Analysis

This is the current analysis snapshot; updates will be provided as more information becomes available.

We noticed a gradual increase in processing workloads on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster. Key observations are:

Symptoms consistently began to appear around 07:30 UTC each day.
Internal system indicators flagged potential issues approximately two hours prior to the incidents.

The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
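As a simplified illustration of how a fixed concurrency limit turns workload growth into errors, the sketch below models the limit as a pool of slots. It is an assumption-laden toy, not a description of our caching cluster: the class name, the limit of 100, and the choice to reject excess work immediately (rather than queue it or let it time out) are all hypothetical.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical limiter: rejects work once `limit` operations are in flight."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def simulate(limit: int, concurrent_requests: int) -> tuple[int, int]:
    """Return (accepted, rejected) when all requests arrive at the same time."""
    limiter = ConcurrencyLimiter(limit)
    accepted = sum(1 for _ in range(concurrent_requests) if limiter.try_acquire())
    rejected = concurrent_requests - accepted
    for _ in range(accepted):
        limiter.release()
    return accepted, rejected


print(simulate(limit=100, concurrent_requests=90))   # under the limit
print(simulate(limit=100, concurrent_requests=150))  # over the limit
```

In this model nothing fails while demand stays under the limit, and every request beyond it fails, which is consistent with errors appearing only after a gradual rise in workload.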

Action Taken

Based on these observations, we mitigated the issue by increasing the concurrency capacity of the caching cluster. We will monitor the symptoms closely today and add further capacity if necessary.

Further Actions

Our development team will conduct a capacity review of the Profiles API infrastructure to prepare for future workload growth. The remediation plan will include the following steps:

Enhanced monitoring and alerting of the caching cluster’s concurrency capacity
Ensuring safe yet rapid capacity provisioning when required
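One possible shape for such an alert is a utilization check against the configured limit, with a warning level that fires before the limit is reached. The sketch below is only an illustration: the 70% and 90% thresholds and all names are hypothetical and do not reflect the values or tooling we will actually deploy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warn: float = 0.70      # hypothetical: start planning extra capacity
    critical: float = 0.90  # hypothetical: page the on-call engineer


def classify(in_flight: int, limit: int, thresholds: Thresholds = Thresholds()) -> str:
    """Map current concurrency usage to an alert level."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    utilization = in_flight / limit
    if utilization >= thresholds.critical:
        return "critical"
    if utilization >= thresholds.warn:
        return "warn"
    return "ok"


for in_flight in (50, 70, 95):
    print(in_flight, classify(in_flight, limit=100))
```

Alerting on utilization rather than on errors is intended to leave time to provision capacity before customers are affected.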

We will provide a follow-up update by the end of Friday, summarizing any additional findings and actions taken.

Hiroshi (Nahi) Nakamura
CTO & VP Engineering
Treasure Data

Posted Jan 22, 2025 - 19:17 PST

Resolved

Between 09:15 UTC and 16:40 UTC on Wednesday, 22 Jan 2025, some customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented and the issue has been resolved.

If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.

Posted Jan 22, 2025 - 18:41 PST

Monitoring

We have fully deployed our fixes to the Personalization API and our monitors show systems operating normally. Our teams will continue to monitor the issue, and we will update this incident if we observe any unusual behavior.

If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective once it is available.

Posted Jan 22, 2025 - 18:15 PST

Update

We have observed some intermittent errors as we roll out a fix to all of our systems, and users may see delays or errors as the change is applied to our systems. Our response team is working to minimize the impact to customers while we deploy this change, but we expect some slower performance while we gradually deploy the fix over the next 3-4 hours.

Posted Jan 22, 2025 - 16:44 PST

Update

Our response team has identified a potential cause for this issue, and we will be deploying a fix shortly. We have not observed any elevated error rates or delays since 16:40 UTC. We will provide an additional update once this fix has been deployed.

If you are observing abnormal errors or long delays from our Personalization API, please reach out to our support team. We will continue to monitor for any issues, and will update once our fix is deployed.

Posted Jan 22, 2025 - 12:24 PST

Update

From 09:00 to 17:00 UTC, we observed elevated 500s and high latency on the CDP KVS server. Customers may have observed elevated errors and timeouts during this period when sending requests to the Personalization API.

Our team has been investigating this issue and has deployed a workaround to our systems while we work to identify the root cause of the problem.

There should be no system impact at this time. Customers who continue to observe delays or elevated error rates should contact our support team, and we'll be happy to assist them further.

We will continue to investigate and will provide another update by 11 PM UTC.

Posted Jan 22, 2025 - 10:34 PST

Update

We have applied several mitigations on the infrastructure side; however, they have not reduced the error rate.
We are continuing to investigate possible causes on our end.

Posted Jan 22, 2025 - 06:52 PST

Investigating

We are currently observing errors and performance degradation with the Personalization API.
We are investigating the cause of the issue now.

Posted Jan 22, 2025 - 01:56 PST

This incident affected: EU (CDP Personalization - Lookup API, CDP Personalization - Ingest API).