In May, we experienced three separate incidents that resulted in significant impact and degraded availability for multiple services on GitHub.com. This report also provides a follow-up on the billing incident that affected GitHub Actions and Codespaces customers in April.
During this incident, our alerting systems detected increased CPU usage on one of the GitHub Container registry databases at 08:59 UTC, and we immediately began to investigate. Thanks to preventive monitoring added after a similar incident in April, the on-call engineer was already watching this metric and preparing to execute a mitigation.
As the CPU usage on the database continued to increase, the container registry started responding to requests with increased latency, followed by internal server errors for a percentage of requests. At this point, we knew customers were impacted and changed the public status of the service. The increased CPU activity was caused by a high volume of the “Put Manifest” command. Other package registries were not impacted.
The CPU usage increased because the throttling criteria configured on the API side for this command were too permissive, and a database query was found to underperform at that scale. This degraded service for anyone using the GitHub Container registry: users experienced latency when pushing or pulling packages, as well as slow access to the packages UI.
To limit the impact, we throttled requests from all organizations and users. To restore normal operation, we then reset the state of our database by restarting our front-end servers, followed by the database itself.
To prevent a recurrence, we have added separate rate limits for specific organization/user operation types, and we will continue to improve the performance of the underlying SQL queries.
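The idea of rate limiting per operation type can be sketched as follows. This is a minimal illustration, not GitHub's implementation: the class, the fixed one-minute window, and the example limits are all assumptions.

```ruby
# Minimal sketch of per-operation rate limiting keyed by organization/user.
# Illustrative only; names, windowing, and limits are assumptions, not
# GitHub's actual implementation.
class OperationRateLimiter
  def initialize(limits)
    @limits  = limits      # e.g. { put_manifest: 100 } requests per window
    @windows = Hash.new(0) # request counts keyed by [owner, operation, window]
  end

  # Returns true if the request is allowed, false if it should be throttled.
  def allow?(owner, operation, now = Time.now)
    limit = @limits.fetch(operation, Float::INFINITY)
    key = [owner, operation, now.to_i / 60] # one-minute fixed window
    @windows[key] += 1
    @windows[key] <= limit
  end
end

limiter = OperationRateLimiter.new(put_manifest: 2)
t = Time.at(0)
limiter.allow?("acme", :put_manifest, t) # => true
limiter.allow?("acme", :put_manifest, t) # => true
limiter.allow?("acme", :put_manifest, t) # => false (over the per-operation limit)
limiter.allow?("acme", :get_manifest, t) # => true  (other operations unaffected)
```

Keying the counter on both the owner and the operation type means a burst of one expensive command (such as “Put Manifest”) can be throttled without affecting that owner's other requests.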
Our alerting systems detected degraded availability for API requests during this time. Because these incidents occurred so recently, we are still investigating the contributing factors and will provide a more detailed update on causes and remediations in the June Availability Report, which will be published the first Wednesday of July.
During this incident, services such as GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks were impacted. As we continue to investigate contributing factors, we will provide a more detailed update in the June Availability Report. We will also share more information about our efforts to minimize the impact of similar incidents in the future.
As mentioned in the April Availability Report, we are now providing a more detailed update on this incident following further investigation.
On April 14, GitHub Actions and Codespaces customers began reporting incorrect charges for metered services listed in their GitHub billing settings. As a result, customers were hitting their GitHub spending limits and were unable to run new Actions workflows or create new codespaces. We immediately started an incident bridge. Our first step was to unblock all customers by providing unlimited use of Actions and Codespaces at no additional cost for the duration of the incident.
By reviewing the timeline and the list of recently applied changes, we determined that the issue was caused by a code change in the metered billing pipeline. In an attempt to improve the performance of our metered usage processor, the minutes for Actions and Codespaces were mistakenly multiplied by 1,000,000,000 (the gigabyte-to-byte conversion factor) even though this conversion was not necessary for these products. The change came from shared metered billing code that was not intended to impact these products.
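The failure mode can be illustrated with a toy version of the conversion logic. This is a reconstruction for illustration only; the product names and processor shape are assumptions, not GitHub's actual code.

```ruby
GB_TO_BYTES = 1_000_000_000

# Toy metered-usage processor (a reconstruction for illustration, not
# GitHub's code). Storage products are metered in gigabytes and must be
# converted to bytes; Actions and Codespaces are metered in minutes and
# must NOT be converted.
def billable_quantity(product, raw_quantity)
  case product
  when :storage_gb
    raw_quantity * GB_TO_BYTES # correct: GB -> bytes
  else
    raw_quantity               # minutes pass through unchanged
  end
end

# The buggy change effectively dropped the per-product check, so
# minute-based products were also multiplied:
def buggy_billable_quantity(_product, raw_quantity)
  raw_quantity * GB_TO_BYTES
end

billable_quantity(:actions_minutes, 30)       # => 30
buggy_billable_quantity(:actions_minutes, 30) # => 30_000_000_000 (overcharge)
```

A billion-fold inflation of the invoiced quantity is exactly what would push customers past their spending limits immediately, matching the reported impact.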
To resolve the issue, we reverted the code change and began repairing the incorrect billing data recorded during the incident. We did not re-enable metered billing for the affected GitHub products until the billing data was fixed, roughly 24 hours after the incident began.
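The data repair amounts to undoing the erroneous conversion for the affected usage rows. A toy pass might look like the following; the product names, row shape, and inflation factor applying uniformly are all assumptions for illustration.

```ruby
# Toy repair pass (illustrative only, not GitHub's actual backfill):
# undo the erroneous gigabyte-to-byte conversion for minute-based
# products recorded during the incident window.
INFLATION_FACTOR = 1_000_000_000

def repair_usage_rows(rows)
  rows.map do |row|
    if %i[actions_minutes codespaces_minutes].include?(row[:product])
      row.merge(quantity: row[:quantity] / INFLATION_FACTOR)
    else
      row # storage rows were converted correctly and are left alone
    end
  end
end

rows = [
  { product: :actions_minutes, quantity: 30_000_000_000 },
  { product: :storage_gb,      quantity: 2 },
]
repair_usage_rows(rows)
# => [{ product: :actions_minutes, quantity: 30 },
#     { product: :storage_gb, quantity: 2 }]
```

Keeping billing disabled until a pass like this completes, as the report describes, avoids issuing further invoices on top of inflated quantities.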
To prevent this class of incident in the future, we added a RuboCop (Ruby static code analyzer) rule to block pull requests that contain unsafe billing code changes. We also added anomaly monitoring on invoiced quantities so that we are alerted before customers are impacted, and we tightened the release process to require a feature flag and end-to-end testing when submitting such changes.
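Anomaly monitoring of this kind can be as simple as flagging an invoiced quantity that is wildly out of line with recent history. The sketch below is an illustration of the idea only; the threshold and data shape are assumptions, not GitHub's monitoring stack.

```ruby
# Minimal sketch of anomaly detection on invoiced quantities
# (illustrative only, not GitHub's actual monitoring). Flags a new value
# that deviates too far from the recent mean.
def anomalous?(history, value, threshold: 10.0)
  return false if history.empty?
  mean = history.sum.to_f / history.size
  value > mean * threshold
end

recent = [120, 95, 110, 130]            # typical daily invoiced minutes
anomalous?(recent, 125)                 # => false (within normal range)
anomalous?(recent, 125 * 1_000_000_000) # => true  (would alert immediately)
```

Even a crude ratio check like this would have caught a billion-fold inflation long before customers noticed the charges.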
We will continue to keep you updated on the progress and investments we are making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we are working on over at the GitHub Engineering Blog.