Gateway

Postmortem Jan 11, 2019 10:15 AM EST

Postmortem update for 2019-01-11:

Log shipping to a DR SQL instance has been set up for all databases.
Application connectivity to DR SQL instance has been successfully tested.
We have completed a re-evaluation of our RPOs for each database. We have implemented a 5-minute RPO for our core databases and an RTO of under 2 hours for the platform.
RCA from external vendor is complete. We have reviewed their recommendations and begun implementing changes accordingly.
Future state HA/DR design for our data tier from our vendor is complete and we will be reviewing their recommendations later this January.

Resolved Jan 3, 2019 8:39 PM EST

This issue was resolved on October 26th, 2018. We will continue to post periodic updates to the postmortem as we improve our HA and DR solutions.

Updated Dec 15, 2018 8:56 AM EST

Postmortem update for 2018-12-15:

Received RCA from external consultants, reviewing recommendations on 2018-12-17
Log shipping to warm standby DR server will be set up in production starting on 2018-12-18
Review of our DR & HA future state architecture/recommendations from external consultants scheduled for the week of 2019-01-07

Updated Dec 9, 2018 9:39 AM EST

Postmortem update for 2018-12-09:

Quote for DR storage array has been submitted
Third-party vendor’s RCA for the 2018-10-26 outage is due the week of Dec 10th

Updated Nov 30, 2018 12:09 PM EST

Postmortem update for 2018-11-30:

We have increased the frequency of our database log shipping schedule, reducing our RPO from 10 minutes to 5 minutes for our most critical databases
We should have an update next week on our DR database and storage solution
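The link between the log shipping schedule and the RPO can be sketched with a little arithmetic. This is a simplified model, not a description of our production setup: it assumes the worst-case data loss is roughly the interval between log backups plus the time needed to copy a backup off the primary. The function names and the `copy_delay_min` parameter are hypothetical and chosen only for illustration.

```python
def worst_case_loss_minutes(backup_interval_min: float,
                            copy_delay_min: float = 0.0) -> float:
    """Rough upper bound on data loss, in minutes, under a simplified model.

    Assumption: a failure just before the next log backup loses everything
    written since the previous one, plus whatever had not finished copying.
    """
    if backup_interval_min <= 0 or copy_delay_min < 0:
        raise ValueError("interval must be positive and delay non-negative")
    return backup_interval_min + copy_delay_min


def meets_rpo(backup_interval_min: float, rpo_min: float,
              copy_delay_min: float = 0.0) -> bool:
    """True if the modeled worst-case loss fits within the RPO."""
    return worst_case_loss_minutes(backup_interval_min, copy_delay_min) <= rpo_min


if __name__ == "__main__":
    # The two schedules mentioned above, checked against a 5 minute RPO.
    for interval in (10, 5):
        print(f"{interval} min log backups meet a 5 min RPO:",
              meets_rpo(interval, rpo_min=5))
```

Under this model a 10 minute schedule cannot satisfy a 5 minute RPO, while a 5 minute schedule does only if the copy delay is negligible, which is why the delay is kept as an explicit parameter.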

Updated Nov 23, 2018 9:23 AM EST

Postmortem update for 2018-11-23:

This was a short week due to the Thanksgiving holiday in the USA.

We are testing a 5-minute Recovery Point Objective in our Performance Lab by performing log backups every 5 minutes instead of every 10 minutes.
We have received a quote from our storage vendor for a storage array needed to reduce our Recovery Time Objective to under 60 minutes, using the “log shipping” database approach mentioned below.
We have officially kicked off a SOW with Fortified Data to perform a holistic review of our DR & HA setup, and provide recommendations. Their assessment is due by EOY.

Updated Nov 16, 2018 3:20 PM EST

Postmortem update for 2018-11-16:

Scheduled maintenance on 2018-11-16 was successful. This maintenance verified that our backups are in good shape. This window should be the last scheduled maintenance in calendar year 2018.
Future scheduled maintenance windows will be moved earlier (04:30-05:00 ET). This timing corresponds to our lowest traffic volume, and helps better accommodate merchants who open earlier in the morning.
We are working to reduce our Recovery Point Objective to 5 minutes. We should have an update on that within the next 1-2 weeks.
We are working to reduce our Recovery Time Objective to under 60 minutes, using the “log shipping” database approach mentioned below. We should have an update on that effort in early/mid December.

Postmortem Nov 12, 2018 4:39 PM EST

Postmortem update for 2018-11-12:

Scheduled maintenance on 2018-11-09 was successful. This maintenance allows us to restart routine scheduled SQL maintenance jobs (e.g., monitoring the health of our indexes), and it allocates space for a “side by side” restore of our database in the event of DB corruption. This will enable us to commit to an improved RTO.
Additional maintenance scheduled for 2018-11-16. This maintenance will be used to verify that our backups are in good shape. This window should be the last scheduled maintenance in calendar year 2018.
We have begun setting up a DR database that is not attached to our Storage Area Network. It will use a different mechanism (log shipping) to replicate data. This should help insulate this database from the type of corruption we saw on 2018-10-26.
We have begun testing a feature of our SAN that will allow us to take instant online snapshots of our databases. If successful, this approach is likely to be integrated into future scheduled maintenance runbooks, and could help further reduce our RTO in the event of DB corruption.

Postmortem Nov 2, 2018 1:13 PM EDT

Postmortem update for 2018-11-02:

Engaged external consultants to review our high availability (HA) and disaster recovery (DR) posture
Engaged VARs to upgrade our DR infrastructure
Scheduled maintenance planned for 2018-11-09 that will enable a 120 minute RTO
Improved change management process to better notify customers of upcoming maintenance
Working with our status page vendor to subscribe all merchants and ISVs to our status page notifications

Postmortem Oct 29, 2018 3:59 PM EDT

On the morning of Friday, October 26, Cayan experienced a roughly 4.5 hour outage that affected all of its services. All of us at Cayan are deeply apologetic for the impact this incident has had on our merchants, ISVs, VARs, and your respective customers. The facts of the event and Cayan’s efforts to restore your confidence in our products and services are outlined below. We are working diligently to implement these changes and to prevent such an event in the future.

Details

While performing scheduled maintenance in its data centers on the morning of October 26, Cayan’s transactional database became corrupted.
The corrupted database was replicated to both of our data centers, which prevented us from redirecting all traffic to our secondary data center.
The nature of this corruption brought all of Cayan’s services offline. This outage affected all products, services, APIs, and portals hosted by Cayan’s gateway, including but not limited to:
- Its Genius family of products.
- Its portals (e.g., Virtual Terminal / Merchant Portal).
- Its APIs (e.g., MerchantWare and Transport.Web).
- Its e-Commerce integrations (e.g., Genius Checkout, Magento, Salesforce Commerce Cloud).
This outage lasted approximately 4.5 hours, from approximately 6:30 AM ET until approximately 11:00 AM ET.
Due to the severity and nature of this failure, we had to restore our database from a backup in order to restore service.
There was no transactional data lost during this incident.
There may be short delays in funding for some merchants due to this incident. Our Operations team is working to ensure that all merchants are fully funded.

Timeline of Events

All times US EDT on 26 Oct 2018.

06:10: Maintenance begins. No downtime or production impact is anticipated.
06:32: Database becomes corrupted.
06:33: Last transaction processed before the outage.
06:33: Alarms triggered.
06:35: Incident response bridge opened.
06:35-09:00: Various attempts at restoring service happen in parallel:
- Plan “A”: we begin procedures to restore the database from a recent backup.
- Various “plan Bs” are attempted in parallel, in an effort to restore service more quickly.
09:05: All “plan Bs” are exhausted, without success.
09:08: Database is ready to be restored.
09:10: Database restore begins.
10:20: Database restore completes.
10:25: Tests begin to verify the integrity of the restored database.
10:50: Tests complete successfully.
10:55: Service restored.
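The durations implied by this timeline can be worked out directly from the timestamps. The sketch below copies the times listed above and computes how long each phase took; the phase names and the choice of 06:33 (last transaction processed) as the start of the outage are assumptions made for illustration, not official definitions.

```python
from datetime import datetime

# Timestamps copied from the timeline above (US EDT, 26 Oct 2018).
DAY = "2018-10-26"
EVENTS = {
    "last_transaction": "06:33",
    "restore_begins": "09:10",
    "restore_completes": "10:20",
    "service_restored": "10:55",
}


def at(name: str) -> datetime:
    """Parse a named timeline event into a datetime."""
    return datetime.strptime(f"{DAY} {EVENTS[name]}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((at(end) - at(start)).total_seconds() // 60)


def summarize() -> dict:
    """Minutes spent in each (hypothetically named) phase of the outage."""
    return {
        "outage": minutes_between("last_transaction", "service_restored"),
        "before_restore": minutes_between("last_transaction", "restore_begins"),
        "restore": minutes_between("restore_begins", "restore_completes"),
        "verification": minutes_between("restore_completes", "service_restored"),
    }


if __name__ == "__main__":
    for phase, minutes in summarize().items():
        print(f"{phase}: {minutes} min")
```

With these timestamps, the outage comes to 262 minutes: 157 minutes before the restore began, 70 minutes for the restore itself, and 35 minutes for verification and bringing service back.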

Remediation Steps

Our communication during the incident was inadequate. While Cayan’s status page was updated in a timely fashion, the message did not go out through other channels to all of our merchants and ISVs quickly enough, nor did we publish frequent enough updates on our progress or ETA on the resolution beyond our status page.
- We are re-evaluating our incident notification protocols, and will likely be subscribing all merchants and ISVs to Cayan’s status page. We are committed to posting timely and transparent updates to all merchants through our status page.
- Within 2 weeks, we will be able to notify all merchants and ISVs within 15 minutes of an outage.
Our recovery time was inadequate. A recovery time of 4.5 hours is unacceptable.
- Within 3 weeks, we will have implemented a solution that ensures a recovery time objective (RTO) of less than 120 minutes.
- Before the end of 2018, we will have implemented a solution that ensures an RTO of less than 60 minutes.
We will provide weekly updates every Friday with regard to our progress on the above.

Resolved Oct 26, 2018 9:35 PM EDT

The Gateway issue is resolved and service has been fully restored.

Monitoring Oct 26, 2018 11:19 AM EDT

The service outage has been resolved as of approximately 11am ET. We will be posting a full incident report including remediation steps within 1 business day.

Resolved Oct 26, 2018 11:15 AM EDT

The service outage has been resolved as of approximately 11am ET. We will be posting a full incident report including remediation steps within 1 business day.

Investigating Oct 26, 2018 9:13 AM EDT

Our Gateway services are continuing to experience widespread disruptions at this time. Our network operations center is attempting several recovery paths in parallel, and will update this incident with additional information as it becomes available.

Investigating Oct 26, 2018 8:24 AM EDT

Our Gateway services are continuing to experience widespread disruptions at this time. Our network operations center is investigating the issue and will update this incident with additional information as it becomes available.

Assessed Oct 26, 2018 7:40 AM EDT

We have confirmed that the Gateway service is currently experiencing widespread disruptions. Our network operations center is investigating the issue and will update this incident with additional information as it becomes available.