Overnight (July 7-8), a component FeedBlitz uses to divide mailings across our servers, part of the highly parallel distributed architecture that lets us deliver email as quickly as possible, suddenly hit significant performance issues.
As a result, the time required to split up any given mailing, and to tell each FeedBlitz server what to work on next, went from milliseconds to many seconds. This in turn led to mailings taking hours to go out instead of the expected timeframes.
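To make the impact concrete, here is a minimal sketch of why a slow coordination step dominates total send time in this kind of architecture. We haven't published the internals of the component involved, so everything here (the names dispatch_mailing and assign_batch, the batch size, the timings) is an illustrative assumption, not our actual code.

```python
# Hypothetical sketch, not FeedBlitz's actual code: shows why a serial
# coordination step that slows from milliseconds to seconds per batch
# stalls an otherwise parallel mailing pipeline.

import time

BATCH_SIZE = 1000  # assumed number of recipients handed to a server at once

def dispatch_mailing(subscribers, servers, assign_batch):
    """Split a mailing into batches and assign each to the next server.

    assign_batch(server, batch) stands in for the coordination step whose
    latency regressed in this incident. It runs once per batch, serially.
    """
    start = time.monotonic()
    for i in range(0, len(subscribers), BATCH_SIZE):
        batch = subscribers[i:i + BATCH_SIZE]
        server = servers[(i // BATCH_SIZE) % len(servers)]
        assign_batch(server, batch)  # the serial bottleneck
    return time.monotonic() - start

# A 1,000,000-recipient mailing needs 1,000 assignments. At ~5 ms each,
# coordination overhead is ~5 seconds; at ~5 s each it becomes ~83 minutes,
# even though the sending servers themselves sit idle and ready to work.
```

The point of the sketch is that the sending servers were never the problem; the per-batch coordination step in front of them was.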
The net result was that FeedBlitz did not meet the performance standards we've historically set - and met - for ourselves and for you, our clients. We apologize for the disruption.
The challenge this incident presented was twofold, and as a company we attacked it from both directions. Firstly - what was going wrong (and so how could we fix it)? Secondly - how could we accelerate the delayed mailings to get them out more quickly?
The team worked throughout the day and through the night to get all the mailings out as quickly as possible. In parallel, the misbehaving component was identified, along with solutions to lighten its load. Those fixes, deployed very late at night (or very early in the morning, depending on one's perspective), mostly worked. For many clients, the next day's mailings were better, and the incident posted here was closed. The team then worked to address some edge cases, notably a few large lists on our ASAP schedule, which could still be unexpectedly slow.
On Monday morning (July 12) we became aware that, related to this incident, some RSS-based Express schedule mailings were not being sent successfully. That was addressed the same day, and if you were affected you can resend any email by following these instructions.
All mailings are now going out correctly and performing at (or better than) historical levels.
Technically, we're reviewing (1) how to make email coordination more scalable in the future, (2) where other latent scaling problems may be lurking, and (3) how to better detect and manage them before they become crises.
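On point (3), one common approach is to track a rolling percentile of a coordination step's latency and alert when it drifts, so that "milliseconds becoming seconds" is caught early rather than discovered through delayed mailings. The sketch below is purely illustrative of that idea; the LatencyWatch class, its window, and its thresholds are assumptions, not a description of our actual monitoring.

```python
# Illustrative sketch of early latency-regression detection, not a
# description of FeedBlitz's monitoring. Names and thresholds are assumed.

from collections import deque

class LatencyWatch:
    """Track recent latency samples for an operation and flag regressions."""

    def __init__(self, window=200, threshold_ms=50.0):
        self.samples = deque(maxlen=window)  # rolling window of samples
        self.threshold_ms = threshold_ms     # "should take ms, not seconds"

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def regressed(self):
        # A persistent p95 above the threshold across a full window is the
        # early-warning signal, firing well before delays compound to hours.
        return len(self.samples) == self.samples.maxlen and self.p95() > self.threshold_ms
```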
From a customer communication perspective, looking back, I was premature in closing the incident. On Monday morning, once we knew that some Express schedule lists had been affected, we should have had an in-app banner directing those clients to check their mailings.
We'll be more diligent about communicating post-incident actions and alerts in the future, and I acknowledge that this was a misstep on my part.