This is the follow-up postmortem to https://status.treasuredata.com/incidents/yv5492c02d4l.
The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 23, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API.
During the affected periods, API calls to https://cdp-eu01.in.treasuredata.com/ exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
We observed a gradual increase in the processing workload on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster.
The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
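To illustrate the failure mode (this is not our actual implementation), the sketch below models a client-side concurrency cap on cache operations: once the offered load exceeds the cap, requests spend most of their time queueing for a free slot, which surfaces as elevated latency and, eventually, errors. All names and numbers are illustrative.

```python
import asyncio
import time

# Hypothetical model of a concurrency cap on cache operations.
# The names and numbers below are illustrative, not our production values.
CONCURRENCY_LIMIT = 8        # maximum in-flight cache lookups allowed
CACHE_CALL_SECONDS = 0.05    # simulated service time of one cache lookup

semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

async def cache_lookup() -> float:
    """One Profiles API request: wait for a free slot, then do the cache call."""
    enqueued = time.monotonic()
    async with semaphore:                 # blocks once all slots are occupied
        await asyncio.sleep(CACHE_CALL_SECONDS)
    return time.monotonic() - enqueued    # observed latency includes queueing time

async def main() -> None:
    # Offer 200 concurrent requests against an 8-slot limit: most of the
    # measured latency is time spent waiting in the queue, not cache work.
    latencies = sorted(await asyncio.gather(*(cache_lookup() for _ in range(200))))
    print(f"p50={latencies[100]:.3f}s  p99={latencies[198]:.3f}s  "
          f"(service time alone is {CACHE_CALL_SECONDS:.3f}s)")

asyncio.run(main())
```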
Based on this bottleneck analysis, the response team increased the concurrency capacity of the caching cluster by 100% during the incident handling on January 22. However, on January 23 we found that the workload was still growing. We added more capacity on January 23 and 24, but unfortunately these operations required additional service disruptions.
In parallel, we traced the primary contributing factor behind the growing workload to part of the Realtime Write feature (RT 1.0 Realtime segmentation). We suspect that memory access contention from the online, in-memory calculations performed for Realtime segmentation was slowing down the cluster, which in turn increased Realtime Read latency.
Through a focused analysis of customer configurations, we found that an excessive number of Realtime segments defined under a single Parent Segment was causing the unexpected workload growth. We therefore updated the Realtime Write event routing configuration to isolate the source of the high latency. After this isolation, the cluster workload stopped growing, and the cluster is currently operating stably with 4x the peak-time capacity.
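The isolation step can be pictured with the following hypothetical sketch: events whose Parent Segment has been identified as a heavy Realtime segmentation source are routed to a dedicated partition, so any contention they cause stays off the shared path. The segment identifier and partition names are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the routing change; the segment identifier and
# partition names are invented for illustration only.
ISOLATED_PARENT_SEGMENTS = {"ps_heavy_realtime"}   # identified via configuration analysis

@dataclass
class RealtimeWriteEvent:
    parent_segment_id: str
    payload: dict = field(default_factory=dict)

def route_event(event: RealtimeWriteEvent) -> str:
    """Pick the processing partition for a Realtime Write event."""
    if event.parent_segment_id in ISOLATED_PARENT_SEGMENTS:
        return "realtime-isolated"   # dedicated capacity; contention stays local
    return "realtime-shared"         # default path, now shielded from the hot segment

print(route_event(RealtimeWriteEvent("ps_heavy_realtime")))  # -> realtime-isolated
print(route_event(RealtimeWriteEvent("ps_typical")))         # -> realtime-shared
```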
In addition, we have set up a monitor on concurrency capacity that pages our response team, so that additional capacity can be provisioned whenever demand grows beyond the current peak workload.
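Conceptually, the new alert behaves like the simplified check below, with hypothetical names and thresholds: if concurrency utilization stays above a safety margin for a full evaluation window, the response team is paged before the cluster reaches saturation.

```python
# Hypothetical thresholds and names; the production monitor differs in detail.
CONCURRENCY_LIMIT = 1024     # illustrative cluster-wide concurrency capacity
PAGE_THRESHOLD = 0.75        # page well before the limit is actually reached

def should_page(in_flight_samples: list[int]) -> bool:
    """Page if every sample in the evaluation window exceeds the safety margin."""
    return bool(in_flight_samples) and all(
        sample / CONCURRENCY_LIMIT >= PAGE_THRESHOLD for sample in in_flight_samples
    )

# Example: five one-minute samples of in-flight cache operations.
print(should_page([800, 812, 840, 901, 930]))   # True  -> open a page
print(should_page([400, 420, 980, 300, 350]))   # False -> transient spike only
```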
We continue to investigate the root cause in detail so that we can implement a fix that prevents unexpected capacity saturation from recurring. Short-term remediation includes introducing limits on the configurations that affect cluster stability.
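As a sketch of the kind of guardrail we mean (the threshold and names here are illustrative and not a published quota), a configuration check could reject a Parent Segment that defines more Realtime segments than the cluster can serve safely:

```python
# Hypothetical limit and names; actual thresholds are still being determined.
MAX_REALTIME_SEGMENTS_PER_PARENT = 50

class ConfigurationLimitError(ValueError):
    """Raised when a configuration would put the cluster at risk."""

def validate_parent_segment(parent_segment_id: str, realtime_segment_ids: list[str]) -> None:
    """Reject a Parent Segment that defines too many Realtime segments."""
    if len(realtime_segment_ids) > MAX_REALTIME_SEGMENTS_PER_PARENT:
        raise ConfigurationLimitError(
            f"{parent_segment_id}: {len(realtime_segment_ids)} Realtime segments "
            f"exceeds the limit of {MAX_REALTIME_SEGMENTS_PER_PARENT}"
        )

validate_parent_segment("ps_example", [f"rt_{i}" for i in range(10)])  # passes silently
```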
We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.
Hiroshi (Nahi) Nakamura
CTO & VP Engineering
Treasure Data