The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 22, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API.
We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.
During these periods, API calls to https://cdp-eu01.in.treasuredata.com/ exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
This is the current analysis snapshot; updates will be provided as more information becomes available.
We noticed a gradual increase in processing workloads on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster. Key observations are:
The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
Based on this observation, we mitigated the issue by increasing the concurrency capacity of the caching cluster. We will monitor the symptoms closely today and add further capacity if necessary.
Our development team will conduct a capacity review of the Profiles API infrastructure to prepare for future workload growth. The remediation plan will include the following steps:
We will provide a follow-up update by the end of Friday, summarizing any additional findings and actions taken.
Hiroshi (Nahi) Nakamura
CTO & VP Engineering
Treasure Data