Gateway

Resolved

Incident Summary

Incident: 5-minute outage due to a RAID storage failure
Incident Date: 2019-05-19 21:21 - 21:26 EDT
Affected Services: All


Incident Description

There was a 5-minute outage caused by a drive failure on our primary database server. The failure presented in a way that prevented our high-availability database cluster from self-healing. No data or transactions were lost, and no funding events were impacted.

Client Impact

Our gateway and portals were unavailable for approximately 5 minutes. Based on run-rates from the prior week, we estimate that approximately 3000 transactions and 1000 merchants were impacted.
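
As a rough illustration of how an estimate like this can be derived from a run-rate, the sketch below multiplies an average per-minute rate by the outage duration. The function name and the rate of 600 transactions per minute are assumptions for illustration only (the rate is simply what 3000 transactions over 5 minutes would imply), not figures taken from our monitoring.

```python
from datetime import datetime


def estimate_impacted(rate_per_minute: float, start: datetime, end: datetime) -> int:
    """Estimate events affected during an outage from an average per-minute run-rate."""
    minutes = (end - start).total_seconds() / 60
    return round(rate_per_minute * minutes)


# Outage window from the incident summary (both times EDT, same day).
start = datetime(2019, 5, 19, 21, 21)
end = datetime(2019, 5, 19, 21, 26)

# Hypothetical run-rate: 600 transactions per minute (assumed, see note above).
estimated = estimate_impacted(600, start, end)
print(estimated)  # 3000
```

An estimate of this kind is only as good as the run-rate behind it, so actual impact may differ from the figure reported.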

Timeline

  • 21:21: Drive failure occurs on our primary database server; gateways and portals are automatically taken offline
  • 21:22: SEV-1 incident is raised to our on-call team
  • 21:25: Server is forced offline
  • 21:26: Service is restored

Resolution

The server was manually forced offline, which allowed our database cluster to self-heal by failing over to another healthy node. The drive causing the issue on the primary node was replaced.

Next Steps

  • We are replacing the drives in our database servers with new SSDs.
  • We have opened a support case with Microsoft to understand why the RAID array did not ignore the faulty drive.
  • The database servers are slated for a refresh in late Q2/early Q3 as part of our initiative to upgrade to SQL Server 2017.
  • We have ordered hardware RAID controllers for our new database servers, to replace our software RAID controllers.

Updated

We experienced approximately 5 minutes of downtime due to a storage failure and may need to perform additional maintenance during Monday morning’s maintenance window (04:30 EDT). We will provide a postmortem within 2 days.

Resolved

The Gateway issue is resolved and service has been fully restored. We will provide a postmortem / incident report within 2 business days.

Assessed

We have confirmed that the Gateway service is currently experiencing widespread disruptions. Our network operations center is investigating the issue and will update this incident with additional information as it becomes available.

We will provide regular updates until the situation is corrected.