Mailings Running Slowly
Incident Report for FeedBlitz
Postmortem

Background

Overnight (July 7-8), the component FeedBlitz uses to divide mailings across our servers, so that we can deliver email as quickly as possible via our highly parallel distributed architecture, suddenly and unexpectedly hit significant performance issues.

As a result, the time required to split up any given mailing, and to tell each FeedBlitz server what to work on next, went from milliseconds to many seconds. This in turn led to mailings taking hours to go out instead of their expected timeframes.
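
For the technically curious, here is a minimal, purely illustrative sketch of that coordination step. The names (split_mailing, Batch, the server list, the batch size) are hypothetical, and this is not FeedBlitz's actual code; it simply shows the kind of work involved: dividing one mailing's subscribers into batches and assigning each batch to a delivery server.

    # Hypothetical illustration only; not FeedBlitz's implementation.
    # A coordinator splits a mailing's subscriber list into batches and
    # assigns each batch to a delivery server in round-robin order.
    from dataclasses import dataclass

    @dataclass
    class Batch:
        mailing_id: int
        server: str
        subscribers: list  # email addresses (or subscriber IDs)

    def split_mailing(mailing_id, subscribers, servers, batch_size=500):
        """Divide one mailing into per-server work items."""
        batches = []
        for i in range(0, len(subscribers), batch_size):
            chunk = subscribers[i:i + batch_size]
            server = servers[(i // batch_size) % len(servers)]  # round-robin
            batches.append(Batch(mailing_id, server, chunk))
        return batches

    # Example: 2,000 subscribers across three delivery servers -> 4 batches.
    batches = split_mailing(
        mailing_id=42,
        subscribers=[f"user{n}@example.com" for n in range(2000)],
        servers=["mta-1", "mta-2", "mta-3"],
    )
    print(len(batches), "batches queued")

During the incident it was the real-world equivalent of this step, repeated for every batch of every mailing, that slowed from milliseconds to many seconds.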

The net result was that FeedBlitz did not meet the performance standards we've historically set - and met - for ourselves and for you, our clients. We apologize for the disruption.

Immediate Actions

As a company we attacked the problem from two directions. First, what was going wrong (and so how could we fix it)? Second, how could we get the mailings already underway out more quickly?

The challenge this incident presented was fourfold.

  1. Although the service was clearly not performing well, nothing was actually broken (i.e. nothing in our infrastructure was down). All the moving parts, and the load on each element of the architecture, were exactly the same as in previous days and weeks, yet the FeedBlitz email engine was suddenly performing very slowly. What this meant, as we drilled deeper into diagnosis, was that we had tipped into a significant scalability issue.
  2. Secondly, adding more systems to expedite mailings actually exacerbated the scalability problem. We had to do that, however, to help you get your word out (the sketch after this list illustrates why adding capacity at a coordination bottleneck can backfire).
  3. Thirdly, because system performance was slow, deploying the Wednesday changes that addressed the issue was itself time-consuming. Incremental improvements during the day to alleviate the load on the relevant systems were difficult to roll out, delaying the benefits each one brought.
  4. Finally, because the slow-down affected internal data replication, the option to switch traffic to a different (faster) set of machines wasn't available to us, as that would effectively have forced data loss.
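
To make the second point above more concrete, here is a rough, hypothetical back-of-the-envelope model (the numbers are illustrative, not FeedBlitz's real figures). When a single coordinator hands out work serially, its per-assignment cost caps overall throughput, so adding delivery servers does not speed things up and can even worsen contention on the coordinator itself.

    # Hypothetical numbers for illustration; not FeedBlitz's real metrics.
    # If a coordinator hands out one batch per request, serially, then its
    # per-assignment cost (not the number of delivery servers) determines
    # how long it takes to dispatch a whole mailing.

    def dispatch_minutes(total_batches, assign_cost_seconds):
        """Time to hand out every batch when assignment is the bottleneck."""
        return total_batches * assign_cost_seconds / 60

    # Healthy: ~5 ms per assignment -> 10,000 batches dispatched in under a minute.
    print(dispatch_minutes(10_000, 0.005))  # ~0.8 minutes

    # Degraded: ~5 s per assignment -> the same mailing takes roughly 14 hours,
    # and extra delivery servers only queue up behind the same slow coordinator.
    print(dispatch_minutes(10_000, 5.0))    # ~833 minutes

This is why the lasting remedy was to lighten the coordinating component's load, as described below, rather than simply adding machines.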

The team worked throughout the day and throughout the night to get all the mailings out as quickly as possible. In parallel, the misbehaving component was identified, along with solutions to lighten its load. That, eventually, very late at night (or very early in the morning, depending on one's perspective), mostly worked. For many clients, the next day's mailings were better, and the incident posted here was closed. The team continued to address some edge cases, notably a few large lists on our ASAP schedule, which could still be unexpectedly slow.

Downstream Effects

On Monday morning (July 12) we became aware that, related to this incident, some RSS-based Express schedule mailings were not being sent successfully. That was addressed the same day; if you were affected, you can resend any email by following these instructions.

All mailings are now 100% correct and performing at (or better than) historical levels.

Lessons and Next Steps

Technically, we're reviewing (1) how to make email coordination more scalable in the future, (2) where other latent scaling problems may be lurking, and (3) how to better detect and manage them before they become crises.

From a customer communication perspective, looking back, I was premature in closing the incident. On Monday morning, once we knew that some Express schedule lists had had problems, we should have had an in-app banner directing those clients to check their mailings.

We’ll be more cognizant of communicating post-incident actions and alerts in the future, and I acknowledge that this was a misstep on my part.

Posted Jul 13, 2021 - 12:09 EDT

Resolved
Mailings are now running on time and at their usual pace.
Posted Jul 08, 2021 - 03:13 EDT
Update
We have identified the component responsible for the slow-down in today's mailings. An update is being rolled out across the FeedBlitz architecture and we expect mailing performance to return to normal as that proceeds. We appreciate your patience as we bring the system back to expected performance levels.
Posted Jul 07, 2021 - 14:02 EDT
Monitoring
Due to an earlier database issue, mailings started after 1:00 am Eastern are running slowly as the system catches up. We are tracking progress and apologize for any inconvenience caused.
Posted Jul 07, 2021 - 03:58 EDT
This incident affected: Email Marketing.