Sunday's Downtime
To all our customers,
This past Sunday, 5/26/24, we had the longest outage in our 14-year history. The most painful part is that it happened on a Sunday morning. We've taken great care to build and maintain a stable, performant application that churches can rely on, week in and week out. This outage challenges that earned reputation, and we're taking significant measures to ensure it never happens again.
Before going into details of what happened, we want to say first that we are deeply sorry for this downtime, and for how it may have impacted your church. Our entire team is committed to serving the local church with excellence and professionalism. A prolonged outage like the one we experienced on Sunday is simply not acceptable.
Many of you have sent us kind and encouraging words since Sunday. Your patience and understanding are greatly appreciated and not taken for granted.
What We Did Wrong
We failed to communicate with you promptly once it became evident that the outage was going to last more than a few minutes.
We did not initiate our failover procedures soon enough. While we were given incorrect information by our infrastructure partner, we still should have acted sooner.
We did not have sufficient redundancies in place. While we do have failover mechanisms, and while a datacenter blackout is an extremely rare event and a “worst case” scenario, we should not have been so dependent on this single infrastructure partner.
Summary of Events
The downtime on Sunday was caused by a complete power outage at our primary datacenter, which also resulted in some damaged network switches inside the datacenter. This was not a security-related or data-leak incident of any kind.
Hivelocity, one of our hosting providers (we also use Amazon AWS, Digital Ocean, and Vultr), lost utility power to one of its 13 datacenters, DAL-1, located in north Dallas. Severe weather and tornadoes struck this area of Dallas early Sunday morning, which caused the utility power failure. However, tier-1 datacenters have power redundancy (both batteries and generators) to guarantee uptime. For reasons we don't yet know, and which don't matter now, their power transfer switch failed when the generators kicked on, meaning the entire datacenter went dark for hours. This sort of outage, a total datacenter blackout, is extremely rare. In fact, it should not have happened at all, since backup power is a crucial component for any datacenter.
As a result, the majority of our core infrastructure, along with thousands of other servers affecting thousands of other companies, abruptly went offline. Our team was immediately notified by our uptime monitoring systems, and we started investigating the cause and communicating with Hivelocity.
After power was restored to the facility, many network switches were found to be damaged and needed physical repairs before they could come back online. This affected a small but critical portion of our infrastructure. We were in constant communication with Hivelocity, and were told all day Sunday that everything would be back up “within an hour”. After many missed ETAs, our engineers decided to initiate a failover onto our backup servers to get back online. We did not take this step sooner due to the communication we were receiving and the risk of prolonging rather than shortening the downtime.
In light of the severity of this downtime, we will be making major infrastructure changes and moving all of our core infrastructure to AWS. This will take several months to accomplish fully and safely.
Timeline of Events
Date: Sunday, May 26, 2024
At approximately 6:17AM CST our primary datacenter, Hivelocity's DAL-1 in Dallas, was hit with a power outage due to severe storms in the Dallas area. Our core infrastructure was abruptly knocked offline.
By 8:41AM CST, power was restored to the datacenter, and our servers began coming back online. We were hopeful that the outage would be resolved soon.
By 10:02AM CST, the majority of our infrastructure was back online. Unfortunately, several critical pieces were still offline.
At 10:24AM CST, Hivelocity informed us that due to the power outage, many network switches were damaged and needed to be repaired or replaced. This was the reason for the continued outage.
At this time, we considered promoting our failover servers to bring everything back online. We decided against this for two reasons. First, due to the severity of the outage (an entire datacenter blackout), there was a risk that we would run into unanticipated issues and actually prolong the downtime. Second, we were told repeatedly that the outage would be resolved within an hour.
Between 10:24AM and 4:49PM CST Hivelocity repeatedly confirmed that work would be completed very soon. We were promised “within an hour” several times, which made it difficult to justify a major failover procedure which would potentially take much longer. During this time, we were reviewing contingency plans and preparing our backup infrastructure.
At 6:08PM CST, we decided to initiate the failover to backup infrastructure.
At 6:35PM CST, our backup infrastructure was brought online and service was restored.
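For readers who want the durations spelled out, the timeline above can be checked with a short sketch. It uses only the times stated in the timeline and treats them as same-day local times; the names `events`, `at`, and `elapsed` are illustrative and not part of any Clearstream system.

```python
from datetime import datetime

# Times copied from the timeline above; all fall on Sunday, May 26, 2024,
# in the same time zone, so plain same-day arithmetic is enough.
DAY = "2024-05-26"
events = {
    "outage_began": "6:17AM",
    "power_restored": "8:41AM",
    "switch_damage_reported": "10:24AM",
    "failover_started": "6:08PM",
    "service_restored": "6:35PM",
}

def at(label: str) -> datetime:
    """Parse one timeline entry into a datetime."""
    return datetime.strptime(f"{DAY} {events[label]}", "%Y-%m-%d %I:%M%p")

def elapsed(start: str, end: str) -> str:
    """Return the time between two timeline entries as 'Hh MMm'."""
    minutes = int((at(end) - at(start)).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"

if __name__ == "__main__":
    print("Total downtime:      ", elapsed("outage_began", "service_restored"))
    print("Without power:       ", elapsed("outage_began", "power_restored"))
    print("Waiting on switches: ", elapsed("switch_damage_reported", "failover_started"))
    print("Failover itself:     ", elapsed("failover_started", "service_restored"))
```

By these figures, the failover step itself accounted for under half an hour of a roughly twelve-hour outage.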
Actions We’re Taking
Before we outline some technical changes we're making, we want to address what we’re doing to help make things right for this past Sunday:
If your church was impacted by Sunday’s downtime, please reach out to our team so we can make things right with you. Different churches have been impacted differently: some may have had important messages that failed to send, others may have had keywords that people texted without getting a response, and others needed to communicate service changes due to severe weather, which ironically was the initial cause of our downtime. In any case, we are committed to making good with each individual customer.
We will be auditing messages that should have been sent Sunday and refunding any credits that you may have used.
Technical Changes
More proactive downtime notifications - In our 14 years as a company, we’ve experienced only a few outages, and none anywhere close to this long. Because this has never been an issue, we’ve never built a formal customer notification process for outages like this. On Sunday, our status page was updated immediately, and our website showed a notification at 7:22AM. But we sent the first email notification at 8:48AM, over two hours after the outage began. That’s unacceptable and a failure on our part, especially on a Sunday morning, when many churches are relying on Clearstream for service communications. An earlier email notification would have helped churches make contingency plans for their morning services.
Over the coming weeks, among other infrastructure changes, we’re building a formal process for downtime notifications. The goal is to communicate with all customers clearly, specifically, and in a timely manner.
Server infrastructure improvements - Up until now, we’ve used Hivelocity and Amazon AWS for different parts of our application. In light of this outage and the way it was handled, we’ve decided to move entirely to AWS. Our engineering team is currently planning this migration.
Better failover procedures - While we had failover mechanisms in place, which we used successfully, we did not have a formal procedure for when to initiate a failover. As a part of our migration to AWS, we will formalize and improve these failover procedures, so that any future downtime can be quickly mitigated.
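Purely as an illustration of what a formal "when to initiate a failover" rule could look like, here is a minimal sketch. It is not the procedure described in this letter: the thresholds, the `OutageState` fields, and the `should_fail_over` function are all hypothetical, chosen only to show a rule that does not depend on a provider's repeated "within an hour" estimates.

```python
from dataclasses import dataclass

@dataclass
class OutageState:
    minutes_down: int        # time since the primary infrastructure went offline
    missed_vendor_etas: int  # provider ETAs that passed without recovery

# Hypothetical thresholds, for illustration only.
MAX_MINUTES_DOWN = 60   # fail over once the outage exceeds this budget
MAX_MISSED_ETAS = 1     # or as soon as one provider ETA has been missed

def should_fail_over(state: OutageState) -> bool:
    """Return True when either limit is reached, regardless of the latest ETA."""
    return (
        state.minutes_down >= MAX_MINUTES_DOWN
        or state.missed_vendor_etas >= MAX_MISSED_ETAS
    )

if __name__ == "__main__":
    # 20 minutes in, no ETA missed yet: keep waiting.
    print(should_fail_over(OutageState(minutes_down=20, missed_vendor_etas=0)))
    # 45 minutes in, but one ETA already missed: fail over.
    print(should_fail_over(OutageState(minutes_down=45, missed_vendor_etas=1)))
```

The point of a rule like this is that the decision is made in advance, so it does not have to be re-argued each time a new estimate arrives.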
Closing
Thousands of churches and non-profits trust Clearstream with their text/email communications. We take that responsibility very seriously. If your church was impacted by Sunday’s outage, we're sorry we failed you. This past Sunday takes the top spot as the worst day in our 14-year history. The actions we’ve listed above (among others) aren’t just lip service to appease a customer base. We’ve built a reputation for quality in our product and customer support, and we intend to keep that reputation.
As stated earlier - if your church was impacted by Sunday’s downtime, please reach out to our team so we can make things right with you. Different churches have been impacted differently and we’re committed to making good with each individual customer.
Sincerely,
Michael Lepinay and Trevor Gehman, co-founders