On Monday, following an internal billing complication, we were informed that one of our clusters had been reset at 1:25 UTC. After reactivating it, we discovered that the SSL edge certificate supplied by our CDN provider, Cloudflare, was not functioning properly.
We determined that Cloudflare was attempting to renew the certificate through Let's Encrypt, its certificate authority, but the renewal was stuck in a pending state and not advancing.
We tried several times to reset the renewal, without success. We also learned that a certificate can sometimes expire before its renewal completes.
To resolve this issue, we purchased a new SSL certificate from DigiCert instead of relying on Let's Encrypt's service. We plan to manage the certificate ourselves moving forward.
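Since a self-managed certificate is no longer renewed for us, its expiry date is something to watch. The following is a minimal sketch of such a check, not our production tooling: the function names, the 14-day warning threshold, and the `example.com` hostname are all assumptions made for illustration. It uses only the Python standard library.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate served by host.

    The default context verifies the certificate, so an already expired
    or otherwise invalid certificate raises ssl.SSLError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between 'now' and a notAfter string such as
    'Jan 15 00:00:00 2030 GMT' (negative if already expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400.0


def needs_attention(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when fewer than warn_days remain (threshold is an assumption)."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    # Hypothetical hostname; replace with the endpoint you want to watch.
    not_after = fetch_not_after("example.com")
    now = datetime.now(timezone.utc)
    print(not_after, needs_attention(not_after, now))
```

Run on a schedule, a check like this would flag a certificate that is approaching expiry while its renewal is still pending.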
The cluster's traffic was restored by approximately 1:45 UTC.
Recommendation for customers:
One customer reported not seeing the updated cluster DNS record for several minutes after service was restored. We believe their DNS resolver cached the response for longer than the record's TTL (time-to-live), which we set to 5 minutes. We strongly advise customers to make sure they do not cache DNS responses for longer than our TTL, or to use a mainstream DNS resolver.
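For customers who cache DNS answers in their own application, the idea can be sketched as a cache that never keeps an entry longer than our 5-minute TTL, whatever lifetime it was asked to use. This is an illustrative sketch only: the class name, the `192.0.2.10` address, and the `cluster.example.com` name are hypothetical, and the code does not perform DNS lookups itself.

```python
import time
from typing import Callable, Dict, Optional, Tuple

OUR_TTL_SECONDS = 300  # the 5-minute TTL mentioned above


class TtlCappedCache:
    """Caches name -> address, capping every entry's lifetime at max_ttl."""

    def __init__(self, max_ttl: float = OUR_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_ttl = max_ttl
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl: float) -> None:
        # Never keep an answer longer than the cap, even if asked to.
        expires_at = self._clock() + min(ttl, self._max_ttl)
        self._entries[name] = (address, expires_at)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller performs a fresh lookup.
            del self._entries[name]
            return None
        return address


# Usage with a fake clock: an entry stored "for an hour" is gone after 5 minutes.
fake_now = [0.0]
cache = TtlCappedCache(clock=lambda: fake_now[0])
cache.put("cluster.example.com", "192.0.2.10", ttl=3600)
fake_now[0] = 299.0
still_cached = cache.get("cluster.example.com")
fake_now[0] = 300.0
after_ttl = cache.get("cluster.example.com")
```

In the usage example, `still_cached` holds the address and `after_ttl` is `None`, so the caller would look the record up again and pick up any change within 5 minutes.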