GitHub Reports Service Disruptions in January 2025

Jessie A Ellis
Feb 13, 2025 20:05

GitHub experienced three incidents in January 2025 that disrupted its services. They were caused by a faulty deployment, a configuration change, and a hardware failure, according to GitHub’s availability report.





Service Disruptions in January

In January 2025, GitHub experienced three significant incidents that degraded performance across its services, as detailed in its availability report. The disruptions were attributed to a deployment error, a configuration change, and a hardware failure.

Incident Details

January 9, 2025 (31 minutes)

The first incident occurred on January 9, from 01:26 to 01:56 UTC. A deployment introduced a problematic query that saturated a primary database server, producing an average error rate of 6% with a peak of 6.85%. Users received HTTP 500 errors across several services. GitHub spent 14 minutes investigating, identified the errant query with its internal tools and dashboards, and mitigated the issue by rolling back the deployment.

January 13, 2025 (49 minutes)

Between 23:35 UTC on January 13 and 00:24 UTC on January 14, Git operations were unavailable because of a configuration change related to traffic routing. The change caused the internal load balancer to drop requests that Git operations depend on. GitHub resolved the incident by reverting the configuration change. It is now enhancing its monitoring and deployment practices to shorten detection times and automate mitigation.
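Because this incident window crosses midnight UTC, the 49-minute figure is easy to misread. The following is a minimal sketch of the arithmetic; the helper name `outage_minutes` is made up for illustration and does not come from GitHub's report.

```python
from datetime import datetime, timedelta


def outage_minutes(start_hhmm, end_hhmm):
    """Minutes between two UTC clock times, allowing one midnight rollover."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    if end < start:
        # The end time falls on the following calendar day.
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


# Incident window reported for January 13: 23:35 UTC to 00:24 UTC.
print(outage_minutes("23:35", "00:24"))  # 49
```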

January 30, 2025 (26 minutes)

The final incident, on January 30 from 14:22 to 14:48 UTC, caused web requests to github.com to fail, with a peak error rate of 44% and an average successful request time exceeding three seconds. The cause was a hardware failure in the caching layer that supports rate limiting. Because automated failover was not in place, the impact was prolonged. GitHub mitigated the incident by manually failing over to trusted hardware. To prevent recurrence, it plans to implement a high availability cache configuration that is more resilient to similar failures.
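To illustrate what automated failover for a rate-limiting cache could look like, here is a minimal sketch. It is not GitHub's implementation: the classes `CacheNode` and `FailoverRateLimiter`, the in-memory counters, and the choice to allow requests when every node is down are all assumptions made for the example.

```python
class CacheNodeDown(Exception):
    """Raised by a hypothetical cache node that has failed."""


class CacheNode:
    """In-memory stand-in for one cache server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.counters = {}

    def incr(self, key):
        if not self.healthy:
            raise CacheNodeDown(self.name)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


class FailoverRateLimiter:
    """Counts requests per client and switches nodes when the active one fails."""

    def __init__(self, nodes, limit):
        self.nodes = list(nodes)
        self.active = 0  # index of the node currently serving traffic
        self.limit = limit

    def allow(self, client_id):
        # Try each node at most once, starting with the active one.
        for _ in range(len(self.nodes)):
            node = self.nodes[self.active]
            try:
                return node.incr(client_id) <= self.limit
            except CacheNodeDown:
                # Automated failover: promote the next node without an operator.
                self.active = (self.active + 1) % len(self.nodes)
        # Assumption: if every node is down, allow the request (fail open).
        return True


primary, standby = CacheNode("primary"), CacheNode("standby")
limiter = FailoverRateLimiter([primary, standby], limit=2)
limiter.allow("client-1")        # served by the primary
primary.healthy = False          # simulate the hardware failure
limiter.allow("client-1")        # transparently served by the standby
```

In this sketch the counters are not replicated, so a client's count restarts on the standby after failover. A production high availability setup would typically replicate state between nodes.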

Future Improvements

GitHub is investing in tooling that detects problematic queries before deployment and in improving its cache resilience to prevent future disruptions. These measures aim to reduce detection and mitigation times for potential issues.
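One common way to catch a problematic query before it ships is to inspect its execution plan in a pre-deployment check. The sketch below shows the idea under stated assumptions: it uses SQLite only because it ships with Python, the `users` table and the helper `full_scans` are hypothetical, and nothing here reflects GitHub's actual tooling or database.

```python
import sqlite3


def full_scans(conn, query):
    """Return the plan steps of `query` that scan an entire table."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    # Each row is (id, parent, notused, detail). The detail text looks like
    # "SCAN users" for a full scan or "SEARCH users USING INDEX ..." otherwise.
    return [row[3] for row in rows if row[3].startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
query = "SELECT id FROM users WHERE email = 'a@example.com'"

# Without an index, the lookup has to scan the whole table.
before = full_scans(conn, query)

# With an index on email, the plan becomes an index search instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = full_scans(conn, query)

print(len(before), len(after))
```

A check like this could fail a build whenever `full_scans` returns anything for a query that runs against a large table.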

For real-time updates on service status and post-incident reports, users can visit GitHub’s status page. Further insights into GitHub’s engineering efforts can be found on the GitHub Engineering Blog.



