Permutive API outage
Incident Report for Permutive
Postmortem

Over the past week, two separate incidents have impacted our platform. The first took place on March 5th 2024, when we experienced an API outage between 01:36 UTC and 06:49 UTC. The second occurred on March 10th 2024, when we experienced an API outage between 07:50 UTC and 08:50 UTC.

We anticipate minimal impact on campaign delivery, as our SDK continued to segment and target users during these times. However, due to the API outages, no user events were collected during these periods. You may notice a drop in data in our Insights module and lower unique user and pageview numbers for March 5th and March 10th. If you utilise Routing, you will see a drop in event collection for the duration of the outages, and those utilising our daily Routing to AWS S3 will also see a delay in their routed data for March 9th.

At Permutive, we are proud of our platform’s stability. In the full year preceding these events, our core APIs achieved 100% uptime. On the rare occasions when incidents do occur, we aim to be transparent with our customers and explain how we plan to avoid them in the future.

We have carried out a thorough investigation into both incidents. While each incident had a similar level of impact on our services, the two were unrelated and had different underlying causes. The incident on March 5th was caused by an internal error in our event handling API, which led the API to scale down to the point where it could no longer handle traffic. We identified the issue just after 06:00 UTC on March 5th. Our team reacted swiftly and put a manual fix in place by 06:49 UTC. The root cause of the incident on March 10th was a process failure that affected an internal service critical to the functioning of several externally facing services, including our event handling API. Our team was able to resolve this issue in just under an hour.

Following these incidents, we are actively taking steps to prevent recurrences of similar outages:

  • We have implemented mitigations against the scaling issues observed during the incident on March 5th
  • We are taking steps to improve the resiliency of our critical APIs by minimising dependencies on downstream services
  • We have begun to introduce new monitoring and alerting for the process which failed on March 10th, to mitigate the risk of it failing again
  • We are looking at ways to mitigate the risk of long-running outages. This includes reviewing our processes for out-of-hours support to ensure a faster response in the event of a critical outage overnight

We would like to apologise for any inconvenience these incidents may have caused you. If you have any questions, concerns, or further feedback, please do not hesitate to get in touch with technical-services@permutive.com.

Posted Mar 14, 2024 - 18:18 UTC

Resolved
This incident has been resolved with no further downtime expected.
Posted Mar 10, 2024 - 11:11 UTC

Monitoring
The Permutive API suffered an outage between 07:56 UTC and 08:48 UTC on 10th March 2024. During this period, segmentation & activation of users remained operational with degraded service, but Insights and Routing were affected. All services have been restored and our team are now monitoring.
Posted Mar 10, 2024 - 08:57 UTC

This incident affected: Permutive API and Routing.