Parsec Extended Maintenance

Minor incident · Parsec, Parsec API, Parsec Billing, Parsec Signaling Service
2023-12-05 07:12 UTC · 1 hour, 38 minutes

Updates

Post-mortem

Summary

During our planned maintenance window from Dec 4 10:00pm to Dec 5 02:00am EST, our team encountered unexpected database corruption that had not appeared in previous testing. This forced us to move the end of the maintenance window from 02:00am EST to 04:00am EST, and kept our backend services and host connectivity offline for longer than planned. If you have a host machine that has appeared offline since this incident, we recommend opening and restarting the Parsec app on that machine, and contacting support if that does not immediately bring it back online.

Detailed incident report

The main purpose of the maintenance was to test our new multi-region redundant infrastructure, including cross-region database replication. We performed an intentional database failover from our primary region (us-east-1) to our secondary region (us-west-2). After the failover, our production MySQL database began reporting corruption of its internal transaction log, which sent it into a continuous reboot loop, restarting every 30 seconds.

After we reverted the failover back to our primary us-east-1 region, the problem persisted despite multiple attempts at DB shutdown, recovery, and same-region instance failover. Unfortunately, our cross-region replication strategy had also replicated the corrupt transaction log. Restoring a snapshot from an error-free timestamp to a new database instance presented the same corruption issue. The primary DB in us-east-1 stayed in an unrecoverable crash loop from approximately 1:10am EST until 2:40am EST, when we found an alternate method to successfully force recovery on the primary database.
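Because the update timestamps on this page are in UTC while the report quotes EST, the short sketch below converts the crash loop window quoted above to UTC and computes its length. It is an illustration only: the helper name `est_to_utc` is hypothetical, and the times are the approximate figures from the report.

```python
from datetime import datetime, timedelta, timezone

# EST is a fixed UTC-5 offset (no daylight saving time in December).
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(year, month, day, hour, minute):
    """Return the UTC datetime for a wall-clock time given in EST."""
    local = datetime(year, month, day, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc)


# Approximate crash loop window, as quoted in the report.
crash_loop_start = est_to_utc(2023, 12, 5, 1, 10)
crash_loop_end = est_to_utc(2023, 12, 5, 2, 40)

# Length of the crash loop in whole minutes.
crash_loop_minutes = int((crash_loop_end - crash_loop_start).total_seconds() // 60)

print(crash_loop_start.strftime("%H:%M UTC"))  # 06:10 UTC
print(crash_loop_end.strftime("%H:%M UTC"))    # 07:40 UTC
print(crash_loop_minutes)                      # 90
```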

Support engineers from our cloud infrastructure provider have informed us that the underlying cause is a known, intermittent bug in MySQL 8 that can occur under high write loads on tables with specific types of indexes, such as Parsec’s table for maintaining the online status of host machines. After significant time spent diagnosing the issue with AWS engineers, we fully remediated it by temporarily disabling a low-level MySQL feature related to InnoDB undo log configuration, dropping a high-usage database index that was triggering the bug, and finally rebooting all primary and secondary database instances.

After the remediation at 2:40am EST, our backend systems gradually worked through queued jobs and reconnection attempts, returning to normal levels around 4:00am EST. These systems include the REST APIs, WebSocket APIs, and background jobs required for Parsec to operate. We continue to investigate how our database schema triggered this specific MySQL issue, and we are also improving our high-volume database write and host reconnect strategies to prevent a recurrence.
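One common way to keep a large number of clients from reconnecting all at once after an outage is exponential backoff with random jitter. The sketch below is a generic illustration of that technique, not a description of Parsec's actual reconnect logic; the function names, the `connect` callback, and the default `base`, `cap`, and `max_attempts` values are all assumptions chosen for the example.

```python
import random
import time


def reconnect_delay(attempt, base=1.0, cap=300.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def reconnect(connect, max_attempts=10, sleep=time.sleep):
    """Call connect() until it returns True or attempts run out.

    Returns the number of attempts used on success, or None on failure.
    `connect` is a hypothetical callback supplied by the caller.
    """
    for attempt in range(max_attempts):
        if connect():
            return attempt + 1
        # Wait a randomized, growing interval before the next attempt so that
        # many hosts recovering at the same time spread out their retries.
        sleep(reconnect_delay(attempt))
    return None
```

The random component matters as much as the exponential growth: without it, hosts that went offline at the same moment would all retry on the same schedule and arrive at the backend in synchronized waves.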

December 7, 2023 · 19:13 UTC
Resolved

We have confirmed platform stability and consider this incident resolved. If you are still seeing issues, please contact our support team.

December 5, 2023 · 08:49 UTC
Monitoring

We’re starting to see services recover and will keep monitoring.

December 5, 2023 · 08:39 UTC
Issue

Parsec is extending a previously scheduled maintenance window to complete necessary changes and restore service.

December 5, 2023 · 07:12 UTC
