This is the follow-up postmortem to https://status.treasuredata.com/incidents/yv5492c02d4l.
The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 23, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API.
During the affected periods, API calls to https://cdp-eu01.in.treasuredata.com/ exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
We observed a gradual increase in the processing workload on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster.
The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
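To illustrate the failure mode (this is not our actual implementation), the sketch below models a client-side concurrency cap on cache operations: once the offered load exceeds the cap, requests spend most of their time queueing for a free slot, which surfaces as elevated latency and, eventually, errors. All names and numbers are illustrative.

```python
import asyncio
import time

# Hypothetical model of a concurrency cap on cache operations.
# The names and numbers below are illustrative, not our production values.
CONCURRENCY_LIMIT = 8        # maximum in-flight cache lookups allowed
CACHE_CALL_SECONDS = 0.05    # simulated service time of one cache lookup

semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

async def cache_lookup() -> float:
    """One Profiles API request: wait for a free slot, then do the cache call."""
    enqueued = time.monotonic()
    async with semaphore:                 # blocks once all slots are occupied
        await asyncio.sleep(CACHE_CALL_SECONDS)
    return time.monotonic() - enqueued    # observed latency includes queueing time

async def main() -> None:
    # Offer 200 concurrent requests against an 8-slot limit: most of the
    # measured latency is time spent waiting in the queue, not cache work.
    latencies = sorted(await asyncio.gather(*(cache_lookup() for _ in range(200))))
    print(f"p50={latencies[100]:.3f}s  p99={latencies[198]:.3f}s  "
          f"(service time alone is {CACHE_CALL_SECONDS:.3f}s)")

asyncio.run(main())
```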
Based on this bottleneck analysis, the response team increased the concurrency capacity of the caching cluster by 100% during the incident handling on January 22. However, on January 23 we found that the workload was still growing. We added more capacity on January 23 and 24, but unfortunately these operations required additional service disruptions.
In parallel, we traced the primary contributing factor behind the growing workload to part of the Realtime Write feature (RT 1.0 Realtime segmentation). We suspect that memory access contention from the online, in-memory calculations performed for Realtime segmentation was slowing down the cluster, which in turn increased Realtime Read latency.
Through a focused analysis of customer configurations, we found that an excessive number of Realtime segments defined under a single Parent Segment was causing the unexpected workload growth. We therefore updated the Realtime Write event routing configuration to isolate the source of the high latency. After this isolation, the cluster workload stopped growing, and the cluster is currently operating stably with 4x the peak-time capacity.
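The isolation step can be pictured with the following hypothetical sketch: events whose Parent Segment has been identified as a heavy Realtime segmentation source are routed to a dedicated partition, so any contention they cause stays off the shared path. The segment identifier and partition names are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the routing change; the segment identifier and
# partition names are invented for illustration only.
ISOLATED_PARENT_SEGMENTS = {"ps_heavy_realtime"}   # identified via configuration analysis

@dataclass
class RealtimeWriteEvent:
    parent_segment_id: str
    payload: dict = field(default_factory=dict)

def route_event(event: RealtimeWriteEvent) -> str:
    """Pick the processing partition for a Realtime Write event."""
    if event.parent_segment_id in ISOLATED_PARENT_SEGMENTS:
        return "realtime-isolated"   # dedicated capacity; contention stays local
    return "realtime-shared"         # default path, now shielded from the hot segment

print(route_event(RealtimeWriteEvent("ps_heavy_realtime")))  # -> realtime-isolated
print(route_event(RealtimeWriteEvent("ps_typical")))         # -> realtime-shared
```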
In addition, we have set up a monitor on concurrency capacity that pages our response team, so that additional capacity can be provisioned whenever demand grows beyond the current peak workload.
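Conceptually, the new alert behaves like the simplified check below, with hypothetical names and thresholds: if concurrency utilization stays above a safety margin for a full evaluation window, the response team is paged before the cluster reaches saturation.

```python
# Hypothetical thresholds and names; the production monitor differs in detail.
CONCURRENCY_LIMIT = 1024     # illustrative cluster-wide concurrency capacity
PAGE_THRESHOLD = 0.75        # page well before the limit is actually reached

def should_page(in_flight_samples: list[int]) -> bool:
    """Page if every sample in the evaluation window exceeds the safety margin."""
    return bool(in_flight_samples) and all(
        sample / CONCURRENCY_LIMIT >= PAGE_THRESHOLD for sample in in_flight_samples
    )

# Example: five one-minute samples of in-flight cache operations.
print(should_page([800, 812, 840, 901, 930]))   # True  -> open a page
print(should_page([400, 420, 980, 300, 350]))   # False -> transient spike only
```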
We continue to investigate the root cause in detail so that we can implement a fix that prevents unexpected capacity saturation from recurring. Short-term remediation includes introducing limits on the configurations that affect cluster stability.
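As a sketch of the kind of guardrail we mean (the threshold and names here are illustrative and not a published quota), a configuration check could reject a Parent Segment that defines more Realtime segments than the cluster can serve safely:

```python
# Hypothetical limit and names; actual thresholds are still being determined.
MAX_REALTIME_SEGMENTS_PER_PARENT = 50

class ConfigurationLimitError(ValueError):
    """Raised when a configuration would put the cluster at risk."""

def validate_parent_segment(parent_segment_id: str, realtime_segment_ids: list[str]) -> None:
    """Reject a Parent Segment that defines too many Realtime segments."""
    if len(realtime_segment_ids) > MAX_REALTIME_SEGMENTS_PER_PARENT:
        raise ConfigurationLimitError(
            f"{parent_segment_id}: {len(realtime_segment_ids)} Realtime segments "
            f"exceeds the limit of {MAX_REALTIME_SEGMENTS_PER_PARENT}"
        )

validate_parent_segment("ps_example", [f"rt_{i}" for i in range(10)])  # passes silently
```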
We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.
Hiroshi (Nahi) Nakamura
CTO & VP Engineering
Treasure Data