Gateway

Resolved May 22, 2019, 12:51 EDT

Incident Summary

Incident: 5-minute outage due to a RAID storage failure
Incident Date: 2019-05-19 21:21 - 21:26 EDT
Affected Services: All

Incident Description

A drive failure on our primary database server caused a 5-minute outage. The failure presented in a way that prevented our high-availability database cluster from self-healing. No data or transactions were lost, and no funding events were impacted.

Client Impact

Our gateway and portals were unavailable for approximately 5 minutes. Based on run rates from the prior week, we estimate that approximately 3000 transactions and 1000 merchants were impacted.
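
The report does not say how the run-rate estimate was calculated. As a rough illustration only, one way to derive such a figure is to average the prior week's counts for the same time-of-day window and scale to the outage length. The function name and the sample counts below are hypothetical and are not taken from the incident data.

```python
from statistics import mean


def estimate_impact(prior_counts, window_minutes, outage_minutes):
    """Estimate how many events fell inside an outage.

    prior_counts:   event counts seen in the same time-of-day window
                    on each prior day (hypothetical input)
    window_minutes: length of that observation window, in minutes
    outage_minutes: length of the outage, in minutes
    """
    if not prior_counts:
        raise ValueError("prior_counts must not be empty")
    if window_minutes <= 0 or outage_minutes < 0:
        raise ValueError("window must be positive and outage non-negative")

    # Average per-minute run rate across the prior days.
    rate_per_minute = mean(prior_counts) / window_minutes
    # Scale the run rate to the duration of the outage.
    return rate_per_minute * outage_minutes


# Made-up counts for a 60-minute window on each of seven prior days.
sample_counts = [35000, 36500, 37000, 35500, 36000, 36200, 35800]
estimated = estimate_impact(sample_counts, window_minutes=60, outage_minutes=5)
print(round(estimated))
```

An estimate like this assumes traffic during the outage would have matched the prior week's average, so it is best read as an order-of-magnitude figure.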

Timeline

21:21: Drive failure occurs on our primary database server; gateways and portals are automatically taken offline
21:22: SEV-1 incident is raised to our on-call team
21:25: Server is forced offline
21:26: Service is restored

Resolution

The server was manually forced offline, which allowed our database cluster to self-heal by failing over to another healthy node. The drive causing the issue on the primary node was replaced.

Next Steps

We are replacing the drives in our database servers with new SSDs.
We have opened a support case with Microsoft to understand why the RAID array did not ignore the faulty drive.
The database servers are slated for a refresh in late Q2/early Q3 as part of our initiative to upgrade to SQL Server 2017.
We have ordered hardware RAID controllers for our new database servers, to replace the software RAID currently in use.

Updated May 19, 2019, 21:44 EDT

We experienced approximately 5 minutes of downtime due to a storage failure and may need to perform additional maintenance during Monday morning’s maintenance window (04:30 EDT). We will provide a postmortem within 2 days.

Resolved May 19, 2019, 21:30 EDT

The Gateway issue is resolved and service has been fully restored. We will provide a postmortem / incident report within 2 business days.

Assessed May 19, 2019, 21:25 EDT

We have confirmed that the Gateway service is currently experiencing widespread disruptions. Our network operations center is investigating the issue and will update this incident with additional information as it becomes available.

We will provide regular updates until the situation is corrected.