The GitHub Blog (https://github.blog/): Updates, ideas, and inspiration from GitHub to help developers build and design software.

Continuous AI for accessibility: How GitHub transforms feedback into inclusion
https://github.blog/ai-and-ml/github-copilot/continuous-ai-for-accessibility-how-github-transforms-feedback-into-inclusion/
Thu, 12 Mar 2026 16:00:00 +0000

AI automates triage for accessibility feedback, allowing us to focus on fixing barriers—turning a chaotic backlog into continuous, rapid resolutions.

The post Continuous AI for accessibility: How GitHub transforms feedback into inclusion appeared first on The GitHub Blog.


For years, accessibility feedback at GitHub didn’t have a clear place to go.

Unlike typical product feedback, accessibility issues don’t belong to any single team—they cut across the entire ecosystem. For example, a screen reader user might report a broken workflow that touches navigation, authentication, and settings. A keyboard-only user might hit a trap in a shared component used across dozens of pages. A low vision user might flag a color contrast issue that affects every surface using a shared design element. No single team owns any of these problems—but every one of them blocks a real person.

These reports require coordination that our existing processes weren’t originally built for. Feedback was often scattered across backlogs, bugs lingered without owners, and users’ follow-ups went unanswered. Improvements were often promised for a mythical “phase two” that rarely materialized.

We knew we needed to change this. But before we could build something better, we had to lay the groundwork—centralizing scattered reports, creating templates, and triaging years of backlog. Only once we had that foundation in place could we ask: How can AI make this easier?

The answer was an internal workflow, powered by GitHub Actions, GitHub Copilot, and GitHub Models, that ensures every piece of user and customer feedback becomes a tracked, prioritized issue. When someone reports an accessibility barrier, their feedback is captured, reviewed, and followed through until it’s addressed. We didn’t want AI to replace human judgment—we wanted it to handle repetitive work so humans could focus on fixing the software.

This is how we went from chaos to a system where every piece of accessibility feedback is tracked, prioritized, and acted on—not eventually, but continuously.

Accessibility as a living system

Continuous AI for accessibility weaves inclusion into the fabric of software development. It’s not a single product or a one-time audit—it’s a living methodology that combines automation, artificial intelligence, and human expertise.

This philosophy connects directly to our support for the 2025 Global Accessibility Awareness Day (GAAD) pledge: strengthening accessibility across the open source ecosystem by ensuring user and customer feedback is routed to the right teams and translated into meaningful platform improvements.

The most important breakthroughs rarely come from code scanners—they come from listening to real people. But listening at scale is hard, which is why we needed technology to help amplify those voices. We built a feedback workflow that functions less like a static ticketing system and more like a dynamic engine—leveraging GitHub products to clarify, structure, and track user and customer feedback, turning it into implementation-ready solutions.

Designing for people first

Before jumping into solutions, we stepped back to understand who this system needed to serve:

  • Issue submitters: Community managers, support agents, and sales reps submit issues on behalf of users and customers. They aren’t always accessibility experts, so they need a system that guides them and teaches accessibility concepts in the flow of work.
  • Accessibility and service teams: Engineers and designers responsible for fixes need structured, actionable data—reproducible steps, WCAG mapping, severity scores, and clear ownership.
  • Program and product managers: Leadership needs visibility into pain points by category, trends, and progress over time to allocate resources strategically.

With these personas in mind, we knew we wanted to 1) treat feedback as data flowing through a pipeline and 2) build a system able to evolve with us.

How feedback flows

With that foundation set, we built an architecture around an event-driven pattern, where each step triggers a GitHub Action that orchestrates what comes next—ensuring consistent handling no matter where the feedback originates. We built this system largely by hand starting in mid-2024. Today, tools like Agentic Workflows let you create GitHub Actions using natural language—meaning this kind of system could be built in a fraction of the time.

The workflow reacts to key events: Issue creation launches GitHub Copilot analysis via the GitHub Models API, status changes initiate hand-offs between teams, and resolution triggers submitter follow-up with the user. Every Action can also be triggered manually or re-run as needed—automation covers the common path, while humans can step in at any point.
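As a rough sketch, this event-driven pattern looks like a small dispatch table: each event kind maps to a handler, and any handler can also be invoked manually for re-runs. The event names and handler bodies below are illustrative, not GitHub's internal implementation.

```python
# Minimal event-dispatch sketch of the workflow: issue creation, status
# changes, and resolution each trigger their own handler, mirroring how
# each step triggers a GitHub Action. Names are illustrative.

def run_copilot_analysis(issue):
    issue["status"] = "analyzing"
    return issue

def hand_off_to_team(issue):
    issue["status"] = "triaged"
    return issue

def follow_up_with_user(issue):
    issue["status"] = "awaiting user confirmation"
    return issue

HANDLERS = {
    "issues.opened": run_copilot_analysis,
    "issues.status_changed": hand_off_to_team,
    "issues.resolved": follow_up_with_user,
}

def dispatch(event_name, issue):
    """Route an event to its handler. Unknown events are ignored, and any
    handler can also be called directly, which is the manual re-run path."""
    handler = HANDLERS.get(event_name)
    return handler(issue) if handler else issue
```

Because the automation only covers the common path, the same handlers stay callable by a human at any point, which is what keeps manual intervention cheap.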

Feedback isn’t just captured—it continuously flows through the right channels, providing visibility, structure, and actionability at every stage.


A left-to-right flowchart showing the seven steps of the feedback workflow in sequence: Intake, Copilot Analysis, Submitter Review, Accessibility Team Review, Link Audits, Close Loop, and Improvement. Feedback loops show that Submitter Review can re-run Copilot Analysis, Close Loop can return to Accessibility Team Review, and Improvement feeds updated prompts back to Copilot Analysis.

1. Actioning intake

Feedback can come from anywhere—support tickets, social media posts, email, direct outreach—but most users choose the GitHub accessibility discussion board. It’s where they can work together and build community around shared experiences. Today, 90% of the accessibility feedback flows through that single channel. Because posts are public, other users can confirm the problem, add context, or suggest workarounds—so issues often arrive with richer detail than a support ticket ever could. Regardless of the source, every piece of feedback gets acknowledged within five business days, and even feedback we can’t act on gets a response pointing to helpful resources.

When feedback requires action from internal teams, a team member manually creates a tracking issue using our custom accessibility feedback issue template. Issue templates are pre-defined forms that standardize how information is collected when opening a new issue. The template captures the initial context—what the user reported, where it came from, and which components are involved—so nothing is lost between intake and triage.

This is where automation kicks in. Creating the issue triggers a GitHub Action that engages GitHub Copilot, and a second Action adds the issue to a project board, providing a centralized view of current status, surfacing trends, and helping identify emerging needs.

A left-to-right flowchart where user or customer feedback enters through Discussion Board, Support Ticket, Social Media, Email, or Direct Outreach, moves to an Acknowledge and Validate step, branches at a validity decision, and either proceeds to Create Issue or loops back through Request More Details to the user.

2. GitHub Copilot analysis

With the tracking issue created, a GitHub Action workflow programmatically calls the GitHub Models API to analyze the report. We chose stored prompts over model fine-tuning so that anyone on the team can update the AI’s behavior through a pull request—no retraining pipeline, no specialized ML knowledge required.

We configured GitHub Copilot using custom instructions developed by our accessibility subject matter experts. Our prompt serves two roles: triage analysis, which classifies issues by WCAG violation, severity, and affected user group, and accessibility coaching, where GitHub Copilot acts as a subject-matter expert to help teams write and review accessible code.

These instruction files point to our accessibility policies, component library, and internal documentation that details how we interpret and apply WCAG success criteria. When our standards evolve, the team updates the markdown and instruction files via pull request—the AI’s behavior changes with the next run, not the next training cycle. For a detailed walkthrough of this approach, see our guide on optimizing GitHub Copilot custom instructions for accessibility.
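A minimal sketch of the stored-prompt approach: the instruction markdown files are read at run time and composed into an OpenAI-style chat payload, so a merged pull request changes the very next run. The directory layout, field names, and model id below are assumptions for illustration, not GitHub's actual configuration.

```python
from pathlib import Path

def build_triage_request(prompt_dir, report_text, model="openai/gpt-4o"):
    """Compose a chat-completion payload from stored prompt files.
    Editing the markdown via pull request changes the next run's behavior;
    no fine-tuning or retraining pipeline is involved. The model id is
    a placeholder."""
    instructions = "\n\n".join(
        p.read_text() for p in sorted(Path(prompt_dir).glob("*.md"))
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": instructions},
            {"role": "user", "content": report_text},
        ],
    }
```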

The automation works in two steps. First, an Action fires on issue creation and triggers GitHub Copilot to analyze the report. GitHub Copilot populates approximately 80% of the issue’s metadata automatically—over 40 data points including issue type, user segment, original source, affected components, and enough context to understand the user’s experience. The remaining 20% requires manual input from the team member. GitHub Copilot then posts a comment on the issue containing:

  • A summary of the problem and user impact
  • Suggested WCAG success criteria for potential violations
  • Severity level (sev1 through sev4, where sev1 is critical)
  • Impacted user groups (screen reader users, keyboard users, low vision users, etc.)
  • Recommended team assignment (design, engineering, or both)
  • A checklist of low-barrier accessibility tests so the submitter can verify the issue

Then a second Action fires on that comment, parses the response, applies labels based on the severity GitHub Copilot assigned, updates the issue’s status on the project board, and assigns it to the submitter for review.
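The parse-and-label step can be as simple as matching labeled fields in the comment. The "Severity:" field format, the WCAG criterion pattern, and the label names below are hypothetical, shown only to illustrate the idea.

```python
import re

# Hypothetical severity-to-label mapping; actual label names may differ.
SEVERITY_LABELS = {
    "sev1": "severity: critical",
    "sev2": "severity: high",
    "sev3": "severity: medium",
    "sev4": "severity: low",
}

def parse_analysis_comment(comment):
    """Extract severity and suggested WCAG success criteria from a
    structured analysis comment, and derive the labels to apply."""
    severity = None
    m = re.search(r"Severity:\s*(sev[1-4])", comment, re.IGNORECASE)
    if m:
        severity = m.group(1).lower()
    # WCAG success criteria look like 1.4.3, 2.1.1, etc.
    wcag = re.findall(r"\b(\d\.\d+\.\d+)\b", comment)
    labels = [SEVERITY_LABELS[severity]] if severity else []
    return {"severity": severity, "wcag": wcag, "labels": labels}
```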

If GitHub Copilot’s analysis seems off, anyone can flag it by opening an issue describing what it got wrong and what it should have said—feeding directly into our continuous improvement process.

A left-to-right flowchart where a newly created issue triggers Action 1, which feeds the report along with custom instructions and WCAG documentation into Copilot Analysis. Copilot posts a comment with its findings, then Action 2 parses that comment and branches into four parallel outcomes: applying labels, applying metadata, adding to the project board, and assigning the submitter.

3. Submitter review

Before we act on GitHub Copilot’s recommendations, two layers of review happen—starting with the issue submitter.

The submitter attempts to replicate the problem the user reported. The checklist GitHub Copilot provides in its comment guides our community managers, support agents, and sales reps through expert-level testing procedures—no accessibility expertise required. Each item includes plain-language explanations, step-by-step instructions, and links to tools and documentation.

Example questions include:

  • Can you navigate the page using only a keyboard? Press “Tab” to move through interactive elements. Can you reach all buttons, links, and form fields? Can you see where your focus is at all times?
  • Do images have descriptive alt text? Right-click an image and select “Inspect” to view the markup. Does the alt attribute describe the image’s purpose, or is it a generic file name?
  • Are interactive elements clearly labeled? Using a screen reader, navigate to a button or link. Is its purpose announced clearly? Alternatively, review the accessibility tree in your browser’s developer tools to inspect how elements are exposed to assistive technologies.

If the submitter can replicate the problem, they mark the issue as reviewed, which triggers the next GitHub Action. If they can’t reproduce it, they reach out to the user for more details. Once new information arrives, the submitter can re-run the GitHub Copilot analysis—either by manually triggering the Action from the Actions tab or by removing and re-adding the relevant label to kick it off automatically. AI provides the draft, but humans provide the verification.

A left-to-right flowchart where the submitter receives the issue with Copilot’s checklist, attempts to replicate the problem, and reaches a decision. If replicable, the issue is marked as reviewed and moves to the accessibility team. If not replicable, the submitter contacts the user for more details. When new information arrives, the submitter re-runs Copilot analysis, which loops back to the replication step.

4. Accessibility team review

Once the submitter marks the issue as reviewed, a GitHub Action updates its status on the workflow project board and adds it to a separate accessibility first responder board. This alerts the accessibility team—engineers, designers, champions, testing vendors, and managers—that GitHub Copilot’s analysis is ready for their review.

The team validates GitHub Copilot’s analysis—checking the severity level, WCAG mapping, and category labels—and corrects anything the AI got wrong. When there’s a discrepancy, we assume the human is correct. We log these corrections and use them to refine the prompt files, improving future accuracy.

Once validated, the team determines the resolution approach:

  • Documentation or settings update: Provide the solution directly to the user.
  • Code fix by the accessibility team: Create a pull request directly.
  • Service team needed: Assign the issue to the appropriate service team and track it through resolution.

With a path forward set, the team marks the issue as triaged. An Action then reassigns it to the submitter, who communicates the plan to the user—letting them know what’s being done and what to expect.

A left-to-right flowchart where a reviewed issue triggers an Action that updates the project board and adds it to the first responder board. The accessibility team validates Copilot’s analysis, logs any corrections, then determines a resolution: provide documentation, create a code fix, or assign to a service team. All three paths converge at marking the issue as triaged, which triggers an Action that reassigns it to the submitter to communicate the plan to the user.

5. Linking to audits

As part of the review process, the team connects user and customer feedback to our formal accessibility audit system.

Roughly 75–80% of the time, reported issues correspond to something we already know about from internal audits. Instead of creating duplicates, we find the existing internal audit issue and add a customer-reported label. This lets us prioritize based on real-world impact: a sev2 issue is technically less severe than a sev1, but if multiple users are reporting it, we bump up its priority.
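That prioritization heuristic can be sketched in a few lines: severity sets the baseline, and repeated independent reports bump an issue up one level. The three-report threshold below is illustrative, not GitHub's actual rule.

```python
def effective_priority(severity, report_count):
    """Baseline priority follows severity (sev1 is most critical), but
    multiple independent user reports bump an issue up one level to
    reflect real-world impact. The bump threshold is illustrative."""
    level = int(severity[3:])  # "sev2" -> 2
    if report_count >= 3 and level > 1:
        level -= 1
    return f"sev{level}"
```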

If the feedback reveals something new, we create a new audit issue and link it to the tracking issue.

A left-to-right flowchart where the team checks whether an existing audit issue covers the reported problem. If one exists, they link it and add a customer-reported label. If not, they create a new audit issue and link it. Both paths converge at updating priority based on real-world impact.

6. Closing the loop

This is the most critical step for trust. Users who take the time to report accessibility barriers deserve to know their feedback led to action.

Once a resolution path is set, the submitter reaches out to the original user to let them know the plan—what’s being fixed, and what to expect. When the fix ships, the submitter follows up again and asks the user to test it. Because most issues originate from the community discussion board, we post confirmations there for everyone to see.

If the user confirms the fix works, we close the tracking issue. If the fix doesn’t fully address the problem, the submitter gathers more details and the process loops back to the accessibility team review. We don’t close issues until the user confirms the fix works for them.

A left-to-right flowchart where the submitter communicates the resolution plan to the user and monitors until the fix ships. The user is asked to test the fix. If it works, the issue is closed. If it doesn’t, the submitter gathers more details and the process loops back to the accessibility team review.

7. Continuous improvement

The workflow doesn’t end when an issue closes—it feeds back into itself.

When submitters or accessibility team members spot inaccuracies in GitHub Copilot’s output, they open a new issue requesting a review of the results. Every GitHub Copilot analysis comment includes a link to create this issue at the bottom, so the feedback loop is built into the workflow itself. The team reviews the inaccuracy, and the correction becomes a pull request to the custom instruction and prompt files described earlier.

We also automate the integration of new accessibility guidance. A separate GitHub Action scans our internal accessibility guide repository weekly and incorporates changes into GitHub Copilot’s custom instructions automatically.
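A weekly sync job like this can be sketched as two small steps: hash the guide files to detect changes cheaply, then regenerate an auto-managed section of the instructions file between markers, leaving hand-written content untouched. The paths and marker strings below are illustrative.

```python
import hashlib
from pathlib import Path

# Hypothetical markers delimiting the auto-managed section.
MARKER_START = "<!-- accessibility-guide:start -->"
MARKER_END = "<!-- accessibility-guide:end -->"

def guide_digest(guide_dir):
    """Hash all guide files so the weekly job can cheaply detect changes."""
    h = hashlib.sha256()
    for p in sorted(Path(guide_dir).glob("**/*.md")):
        h.update(p.read_bytes())
    return h.hexdigest()

def refresh_instructions(instructions_text, guide_text):
    """Replace the auto-managed section between markers with the latest
    guide content, leaving hand-written instructions untouched."""
    before, _, rest = instructions_text.partition(MARKER_START)
    _, _, after = rest.partition(MARKER_END)
    return f"{before}{MARKER_START}\n{guide_text}\n{MARKER_END}{after}"
```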

The goal isn’t perfection—it’s continuous improvement. Each quarter, we review accuracy metrics and refine our instructions. These reviews feed into quarterly and fiscal year reports that track resolution times, WCAG failure patterns, and feedback volume trends—giving leadership visibility into both progress and persistent gaps. The system gets smarter over time, and now we have the data to show it.

A left-to-right flowchart with two parallel loops. In the first, an inaccuracy is spotted, a review issue is opened, the team creates a pull request to update the prompt files, and the changes merge to improve future analyses. In the second, a weekly Action scans the accessibility guide repository and auto-updates Copilot's custom instructions. Both loops feed into quarterly reviews that produce fiscal year reports tracking resolution times, WCAG failure patterns, and feedback volume trends.

Impact in numbers

A year ago, nearly half of accessibility feedback sat unresolved for over 300 days. Today, that backlog isn’t just smaller—it’s gone. And the improvements don’t stop there.

  • 89% of issues now close within 90 days (up from 21%)
  • 62% reduction in average resolution time (118 days → 45 days)
  • 70% reduction in manual administrative time
  • 1,150% increase in issues resolved within 30 days (4 → 50 year-over-year)
  • 50% reduction in critical sev1 issues
  • 100% of issues closed within 60 days in our most recent quarter

We track this through automated weekly and quarterly reports generated by GitHub Actions—surfacing which WCAG criteria fail most often and how resolution times trend over time.
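Metrics like these fall out of simple date arithmetic over issue open and close timestamps. A minimal sketch of the two headline numbers:

```python
from datetime import date

def resolution_metrics(issues):
    """Compute the share of issues closed within 90 days and the average
    resolution time from (opened, closed) date pairs."""
    days = [(closed - opened).days for opened, closed in issues]
    within_90 = sum(1 for d in days if d <= 90)
    return {
        "pct_within_90_days": round(100 * within_90 / len(days)),
        "avg_resolution_days": round(sum(days) / len(days)),
    }
```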

Beyond the numbers

A user named James emailed us to report that the GitHub Copilot CLI was inaccessible. Decorative formatting created noise for screen readers, and interactive elements were impossible to navigate.

A team member created a tracking issue. Within moments, GitHub Copilot analyzed the report—mapping James’s description to specific technical concepts, linking to internal documentation, and providing reproduction steps so the submitter could experience the product exactly as James did.

With that context, the team member realized our engineering team had already shipped accessible CLI updates earlier in the year—James simply wasn’t aware.

They replied immediately. His response? “Thanks for pointing out the --screen-reader mode, which I think will help massively.”

Because the AI workflow identified the problem correctly, we turned a frustration into a resolution in hours.

But the most rewarding result isn’t the speed—it’s the feedback from users. Not just that we responded, but that the fixes actually worked for them:

  • “Huge thanks to the team for updating the contributions graph in the high contrast theme. The addition of borders around the grid edges is a small but meaningful improvement. Keep it up!”
  • “Let’s say you want to create several labels for your GitHub-powered workflow: bug, enhancement, dependency updates… But what if you are blind? Before you had only hex codes randomly thrown at you… now it’s fixed, and those colors have meaningful English names. Well done, GitHub!”
  • “This may not be very professional but I literally just screamed! This fix has actually made my day… Before this I was getting my wife to manage the GitHub issues but now I can actually navigate them by myself! It means a lot that I can now be a bit more independent so thank you again.”

That independence is the point. Every workflow, every automation, every review—it all exists so moments like these are the expectation, not the exception.

The bigger picture

Stories like these remind us why the foundation matters. Design annotations, code scanners, accessibility champions, and testing with people with disabilities—these aren’t replaced by AI. They are what make AI-assisted workflows effective. Without that human foundation, AI is just a faster way to miss the point.

We’re still learning, and the system is still evolving. But every piece of feedback teaches us something, and that knowledge now flows continuously back to our team, our users, and the tools we build. 

If you maintain a repository—whether it’s a massive enterprise project or a weekend open-source library—you can build this kind of system today. Start small. Create an issue template for accessibility. Add a .github/copilot-instructions.md file with your team’s accessibility standards. Let AI handle the triage and formatting so your team can focus on what really matters: writing more inclusive code.

And if you hit an accessibility barrier while using GitHub, please share your feedback. It won’t disappear into a backlog. We’re listening—and now we have the system to follow through.


GitHub availability report: February 2026
https://github.blog/news-insights/company-news/github-availability-report-february-2026/
Thu, 12 Mar 2026 03:23:54 +0000

The post GitHub availability report: February 2026 appeared first on The GitHub Blog.

In February, we experienced six incidents that resulted in degraded performance across GitHub services.

We recognize the impact these outages have had on teams, workflows, and overall confidence in our platform. Earlier today, we released a blog post outlining the root causes of recent incidents and the steps GitHub is taking to make our systems more resilient moving forward. Thank you for your patience as we make these near-term and long-term investments.

Below, we go over the six major incidents specific to February.

February 02 17:41 UTC (lasting 1 hour and 5 minutes)

From January 31, 2026, 00:30 UTC, to February 2, 2026, 18:00 UTC, the Dependabot service was degraded and failed to create 10% of automated pull requests. This was caused by a cluster failover that left Dependabot connected to a read-only database.

We mitigated the incident by pausing Dependabot queues until traffic was properly routed to healthy clusters. All failed jobs were identified and restarted.

We added new monitors and alerts to reduce our time to detect and prevent this in the future.

February 02 19:03 UTC (lasting 5 hours and 53 minutes)

On February 2, 2026, between 18:35 UTC and 22:20 UTC, GitHub Actions hosted runners and GitHub Codespaces were unavailable, with service degraded until full recovery at 23:10 UTC for standard runners, February 3, 2026 at 00:30 UTC for larger runners, and February 3 at 00:15 UTC for Codespaces. During this time, Actions jobs queued and timed out while waiting to acquire a hosted runner. Other GitHub features that leverage this compute infrastructure were similarly impacted, including Copilot coding agent, Copilot code review, CodeQL, Dependabot, GitHub Enterprise Importer, and GitHub Pages. All regions and runner types were impacted. Codespaces creation and resume operations also failed in all regions. Self-hosted runners for Actions on other providers were not impacted.

This outage was caused by a loss in telemetry that cascaded to mistakenly applying security policies to backend storage accounts in our underlying compute provider. Those policies blocked access to critical VM metadata, causing all VM create, delete, reimage, and other operations to fail. This was mitigated by rolling back the policy changes, which started at 22:15 UTC. As VMs came back online, our runners worked through the backlog of requests that hadn’t timed out.

We are working with our compute provider to improve our incident response and engagement time, improve early detection, and ensure safe rollout should similar changes occur in the future.

February 09 16:19 UTC (lasting 1 hour and 21 minutes) and February 09 19:01 UTC (lasting 1 hour and 8 minutes)

On February 9, 2026, GitHub experienced two related periods of degraded availability affecting github.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other services. The first period occurred between 16:12 UTC and 17:39 UTC, and the second between 18:53 UTC and 20:09 UTC. In total, users experienced approximately 2 hours and 43 minutes of degraded service across the two incidents.

During both incidents, users encountered errors loading pages on github.com, failures when pushing or pulling code over HTTPS, failures starting or completing GitHub Actions workflow runs, and errors using GitHub Copilot. Additional services including GitHub Issues, pull requests, webhooks, Dependabot, GitHub Pages, and GitHub Codespaces experienced intermittent errors. SSH-based Git operations were not affected during either incident.

Our investigation determined that both incidents shared the same underlying cause: a configuration change to a user settings caching mechanism caused a large volume of cache rewrites to occur simultaneously. In the first incident, asynchronous rewrites overwhelmed a shared infrastructure component responsible for coordinating background work, which led to cascading failures and connection exhaustion in the service proxying Git operations over HTTPS. We mitigated this incident by disabling async cache rewrites and restarting the affected Git proxy service across multiple datacenters.

The second incident arose when an additional source of cache updates, not addressed by the initial mitigation, introduced a high volume of synchronous writes. This caused replication delays, resulting in a similar cascade of failures and again leading to connection exhaustion in the Git HTTPS proxy. We mitigated by disabling the source of the cache rewrites and again restarting Git proxy.

We are taking the following immediate steps:

  • We optimized the caching mechanism to avoid write amplification and added self-throttling during bulk updates.
  • We are adding safeguards to ensure the caching mechanism responds more quickly to rollbacks and strengthening how changes to these caching systems are planned, validated, and rolled out with additional checks.
  • We are fixing the underlying cause of connection exhaustion in our Git HTTPS proxy layer so the proxy can recover from this failure mode automatically without requiring manual restarts.
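The self-throttling safeguard in the first bullet can be sketched as a batched rewrite loop with a per-interval cap, so a bulk update spreads its writes instead of bursting them all at once. The cap and the injectable sleep below are illustrative.

```python
import time

def throttled_rewrite(keys, write, max_per_second=100, sleep=time.sleep):
    """Rewrite cache entries in capped batches, pausing between batches so
    a bulk update cannot overwhelm shared infrastructure. `write` is the
    per-key rewrite callback; the cap is illustrative, and `sleep` is
    injectable for testing."""
    written = 0
    for i in range(0, len(keys), max_per_second):
        batch = keys[i:i + max_per_second]
        for key in batch:
            write(key)
        written += len(batch)
        if i + max_per_second < len(keys):
            sleep(1)  # spread load across time instead of bursting
    return written
```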

February 12 07:53 UTC (lasting 2 hours and 3 minutes)

On February 12, 2026, between 00:51 UTC and 09:35 UTC, users attempting to create or resume Codespaces experienced elevated failure rates across Europe, Asia, and Australia, peaking at a 90% failure rate. Impact began in UK South and spread progressively to other regions. US regions were not impacted.

The failures were caused by an authorization claim change in a core networking dependency, which led to codespace pool provisioning failures. Alerts detected the issue but were not assigned the appropriate severity, delaying detection and response. Learning from this, we have improved our validation of changes to this backend service and our monitoring during rollout. We have also updated alerting thresholds to catch issues before they impact customers and improved our automated failover mechanisms to cover this area.

February 12 10:38 UTC (lasting 34 minutes)

On February 12, 2026, from 09:16 to 11:01 UTC, users attempting to download repository archives (tar.gz/zip) that include Git LFS objects received errors. Standard repository archives without LFS objects were not affected. On average, the archive download error rate was 0.0042% and peaked at 0.0339% of requests to the service. This was caused by the deployment of an incorrect network configuration in the LFS Service that caused service health checks to fail and an internal service to be incorrectly marked as unreachable.

We mitigated the incident by manually applying the corrected network setting. Additional checks for corruption and auto-rollback detection were added to prevent this type of configuration issue.


Follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the engineering section on the GitHub Blog.

Addressing GitHub’s recent availability issues
https://github.blog/news-insights/company-news/addressing-githubs-recent-availability-issues-2/
Wed, 11 Mar 2026 21:41:51 +0000

GitHub recently experienced several availability incidents. We understand the impact these outages have on our customers and are sharing details on the stabilization work we’re prioritizing right now.

The post Addressing GitHub’s recent availability issues appeared first on The GitHub Blog.


Over the past several weeks, GitHub has experienced significant availability and performance issues affecting multiple services. Three of the most significant incidents happened on February 2, February 9, and March 5.

First and foremost, we take responsibility. We have not met our own availability standards, and we know that reliability is foundational to the work you do every day. We understand the impact these outages have had on your teams, your workflows, and your confidence in our platform.

Here, we’ll unpack what’s been causing these incidents and what we’re doing to make our systems more resilient moving forward.

What happened

These incidents have occurred during a period of extremely rapid usage growth across our platform, exposing scaling limitations in parts of our current architecture. Specifically, we’ve found that recent platform instability was primarily driven by rapid load growth, architectural coupling that allowed localized issues to cascade across critical services, and the system’s inability to adequately shed load from misbehaving clients.

Before we cover what we are doing to prevent these issues going forward, it is worth diving into the details of the most impactful incidents.

February 9 incident

On Monday, February 9, we experienced a high‑impact incident due to a core database cluster that supports authentication and user management becoming overloaded. The mistakes that led to the problem were made days and weeks earlier.

In early February, two very popular client-side applications that make a significant number of API calls against our servers were released, with unintentional changes driving a more-than-tenfold increase in the read traffic they generated. Because users update these applications gradually, the increase in traffic didn’t become evident right away; it built as more users upgraded.

On Saturday, February 7, we deployed a new model. Because limited capacity meant the model could only be released to a narrower set of customers, getting it to them as quickly as possible required shortening the refresh TTL on a cache storing user settings from 12 hours to 2 hours. At this point, everything was operating normally because weekend load is significantly lower, and we didn’t have sufficiently granular alarms to detect the looming issue.

Three things then compounded on February 9: our regular peak load, many customers updating to the new version of the client apps as they started their week, and another new model release. At this point, the write volume from the shortened TTL and the read volume from the client apps combined to overwhelm the database cluster. While the TTL change was quickly identified as a culprit, it took much longer to understand why the read load kept increasing, which prolonged the incident. Further, because of the interaction between different services after the database cluster became overwhelmed, we needed to block the extra load further up the stack, and we didn’t have sufficiently granular switches to block only the traffic that needed to be blocked at that level.
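To make the compounding concrete, here is a back-of-envelope sketch. The baseline traffic numbers are invented for illustration; only the 12-to-2-hour TTL change and the more-than-tenfold read increase come from the incident description itself.

```python
# Back-of-envelope illustration of how the two changes compound.
# Baseline figures are hypothetical; the post does not publish actual traffic numbers.

BASELINE_READS = 1_000      # client reads per second (assumed)
BASELINE_REFRESHES = 100    # cache refresh writes per second at a 12-hour TTL (assumed)

# Cutting the refresh TTL from 12 hours to 2 hours means each cached entry
# is refreshed 6x as often, so refresh write volume scales by 12 / 2.
ttl_factor = 12 / 2

# The client applications drove a more-than-tenfold increase in read traffic.
read_factor = 10

peak_writes = BASELINE_REFRESHES * ttl_factor
peak_reads = BASELINE_READS * read_factor

print(f"refresh writes: {BASELINE_REFRESHES}/s -> {peak_writes:.0f}/s")
print(f"client reads:   {BASELINE_READS}/s -> {peak_reads:.0f}/s")
```

Either change alone might have stayed within capacity; multiplied together at Monday peak load, they did not.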

The investigation for the February 9 incident raised a lot of important questions about why the user settings were stored in this particular database cluster and in this particular way. The architecture was originally selected for simplicity at a time when there were very few models and very few governance controls and policies related to those models. But over time, something that was a few bytes per user grew into kilobytes. We didn’t catch how dangerous that was because the load was visible only during new model or policy rollouts and was masked by the TTL. Since this database cluster houses data for authentication and user management, any services that depend on these were impacted.

GitHub Actions incidents on February 2 and March 5

We also had two significant instances where our failover solution was either insufficient or didn’t function correctly:

  • Actions hosted runners had a significant outage on February 2. Most cloud infrastructure issues in this area typically do not cause impact because they occur in a limited number of regions, and we automatically shift traffic to healthy regions. In this case, however, a cascading set of events triggered by a telemetry gap caused existing security policies to be applied to key internal storage accounts across all regions. This blocked access to VM metadata during VM creation and halted hosted runner lifecycle operations.
  • Another impactful incident for Actions occurred on March 5. Automated failover has been progressively rolling out across our Redis infrastructure, and on this day, a failover occurred for a Redis cluster used by Actions job orchestration. The failover performed as expected, but a latent configuration issue meant the failover left the cluster in a state with no writable primary. With writes failing and failover not available as a mitigation, we had to correct the state manually to mitigate. This was not an aggressive rollout or missing resiliency mechanism, but rather latent configuration that was only exposed by an event in production infrastructure.

For both of these incidents, the investigations surfaced unexpected single points of failure that we need to protect against, and showed that we need to dry-run failover procedures in production more rigorously.

Across these incidents, contributing factors made the impact much broader or longer-lasting than necessary, including:

  • Insufficient isolation between critical path components in our architecture
  • Inadequate safeguards for load shedding and throttling
  • Gaps in end-to-end validation, in monitoring that draws attention to early warning signals, and in partner coordination during incident response

What we are doing now

Our engineering teams are fully engaged in both near-term mitigations and durable longer-term architecture and process investments. We are addressing two common themes: managing rapidly increasing load by focusing on the resilience and isolation of critical paths, and preventing localized failures from ever causing broad service degradation.

In the near term, we are prioritizing stabilization work to reduce the likelihood and impact of incidents. This includes:

  1. Redesigning our user cache system, which hosts model policies and more, to accommodate significantly higher volume in a segmented database cluster.
  2. Expediting capacity planning and completing a full audit of fundamental health for critical data and compute infrastructure to address urgent growth.
  3. Further isolating key dependencies so that critical systems like GitHub Actions and Git will not be impacted by shared infrastructure issues, reducing cascade risk. We are doing this through a combination of removing dependencies, handling dependency failures gracefully, and isolating the dependencies that remain.
  4. Protecting downstream components during spikes to prevent cascading failures while prioritizing critical traffic loads.
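As an illustration of item 4, one common way to shed excess load while still prioritizing critical traffic is a token bucket with a priority bypass. This sketch is a generic example of the technique, not GitHub's actual throttling implementation.

```python
import time

class TokenBucket:
    """Shed non-critical requests when the bucket is empty, but always admit
    critical traffic. Illustrative only -- a generic technique, not GitHub's
    actual throttling implementation."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate                  # tokens replenished per second
        self.capacity = burst             # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self, critical: bool = False) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if critical:                      # critical traffic bypasses shedding
            return True
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # shed this request

bucket = TokenBucket(rate=1.0, burst=5)
results = [bucket.allow() for _ in range(10)]
print(sum(results))  # → 5: the burst is admitted, the overflow is shed
```

The key property is that shedding decisions stay local and deterministic, so a spike degrades non-critical traffic instead of cascading into dependent services.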

In parallel, we are accelerating deeper platform investments to deliver on GitHub’s commitment to supporting sustained, high-rate growth with high availability. These include:

  1. Migrating our infrastructure to Azure to accommodate rapid growth, enabling both vertical scaling within regions and horizontal scaling across regions. In the short term, this provides a hybrid approach for infrastructure resiliency. As of today, 12.5% of all GitHub traffic is served from our Azure Central US region, and we are on track to serve 50% of all GitHub traffic by July. Longer term, this enables simplification of our infrastructure architecture and more global resiliency by adopting managed services.
  2. Breaking apart the monolith into more isolated services and data domains as appropriate, so we can scale independently, enable more isolated change management, and implement localized decisions about shedding traffic when needed.

We are also continuing tactical repair work from every incident.

Our commitment to transparency

We recognize that it’s important to provide you with clear communication and transparency when something goes wrong. We publish summaries of all incidents that result in degraded performance of GitHub services on our status page and in our monthly availability reports. The February report will publish later today with a detailed explanation of incidents that occurred last month, and our March report will publish in April.

Given the scope of recent incidents, we felt it was important to address them with the community today. We know GitHub is critical digital infrastructure, and we are taking urgent action to ensure our platform is available when and where you need it. Thank you for your patience as we strengthen the stability and resilience of the GitHub platform.

The post Addressing GitHub’s recent availability issues appeared first on The GitHub Blog.

The era of “AI as text” is over. Execution is the new interface. https://github.blog/ai-and-ml/github-copilot/the-era-of-ai-as-text-is-over-execution-is-the-new-interface/ Tue, 10 Mar 2026 20:16:01 +0000 https://github.blog/?p=94415 AI is shifting from prompt-response interactions to programmable execution. See how the GitHub Copilot SDK enables agentic workflows directly inside your applications.

The post The era of “AI as text” is over. Execution is the new interface. appeared first on The GitHub Blog.


Over the past two years, most teams have interacted with AI the same way: provide text input, receive text output, and manually decide what to do next.

But production software doesn’t operate on isolated exchanges. Real systems execute. They plan steps, invoke tools, modify files, recover from errors, and adapt under constraints you define.

As a developer, you’ve gotten used to using GitHub Copilot as your trusted AI in the IDE. But I bet you’ve thought more than once: “Why can’t I use this kind of agentic workflow inside my own apps too?”

Now you can.

The GitHub Copilot SDK makes that execution layer available as a programmable capability inside your software.

Instead of maintaining your own orchestration stack, you can embed the same production-tested planning and execution engine that powers GitHub Copilot CLI directly into your systems.

If your application can trigger logic, it can now trigger agentic execution. This shift changes the architecture of AI-powered systems.

So how does it work? Here are three concrete patterns teams are using to embed agentic execution into real applications.

Pattern #1: Delegate multi-step work to agents

For years, teams have relied on scripts and glue code to automate repetitive tasks. But the moment a workflow depends on context, changes shape mid-run, or requires error recovery, scripts become brittle. You either hard-code edge cases, or start building a homegrown orchestration layer.

With the Copilot SDK, your application can delegate intent rather than encode fixed steps.

For example:

Your app exposes an action like “Prepare this repository for release.”

Instead of defining every step manually, you pass intent and constraints. The agent:

  • Explores the repository
  • Plans required steps
  • Modifies files
  • Runs commands
  • Adapts if something fails

All while operating within defined boundaries.
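The shape of that delegation can be sketched in a few lines. The class and field names below are hypothetical stand-ins, not the Copilot SDK's actual API; the point is that the application declares intent and constraints, and the execution engine plans the steps.

```python
from dataclasses import dataclass, field

# Toy illustration of "delegate intent, not steps." All class and field names
# here are hypothetical stand-ins, not the Copilot SDK's real API.

@dataclass
class AgentTask:
    intent: str                           # what to accomplish, not how
    allowed_tools: list[str] = field(default_factory=list)  # constrain the tool surface
    max_file_changes: int = 10            # constrain the blast radius
    require_review: bool = True           # writes are staged, not applied directly

task = AgentTask(
    intent="Prepare this repository for release",
    allowed_tools=["read_files", "edit_files", "run_tests"],
)

# The application hands the task to an execution engine, which plans the
# concrete steps (explore, modify, run, adapt) within these boundaries.
print(task)
```

Contrast this with a script: the script encodes the steps, while the task encodes the goal and its limits, leaving the ordering and error recovery to the engine.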

Why this matters: As systems scale, fixed workflows break down. Agentic execution allows software to adapt while remaining constrained and observable, without rebuilding orchestration from scratch.

View multi-step execution examples →

Pattern #2: Ground execution in structured runtime context

Many teams attempt to push more behavior into prompts. But encoding system logic in text makes workflows harder to test, reason about, and evolve. Over time, prompts become brittle substitutes for structured system integration.

With the Copilot SDK, context becomes structured and composable.

You can:

  • Define domain-specific tools or agent skills
  • Expose tools via Model Context Protocol (MCP)
  • Let the execution engine retrieve context at runtime

Instead of stuffing ownership data, API schemas, or dependency rules into prompts, your agents access those systems directly during planning and execution.

For example, an internal agent might:

  • Query service ownership
  • Pull historical decision records
  • Check dependency graphs
  • Reference internal APIs
  • Act under defined safety constraints

Why this matters: Reliable AI workflows depend on structured, permissioned context. MCP provides the plumbing that keeps agentic execution grounded in real tools and real data, without guesswork embedded in prompts.
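A toy registry shows the general shape of exposing context as callable tools rather than prompt text. This mimics the idea of MCP-style tools but is not the real Model Context Protocol implementation; all names are invented for illustration.

```python
# Toy sketch of exposing structured context as callable tools instead of
# stuffing data into prompts. This mimics the *shape* of an MCP-style tool
# registry; it is not the real Model Context Protocol implementation.

TOOLS = {}

def tool(name: str):
    """Register a function as a tool the agent can call at runtime."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("service_owner")
def service_owner(service: str) -> str:
    # In a real system this would query an ownership database.
    owners = {"billing": "payments-team", "search": "discovery-team"}
    return owners.get(service, "unknown")

# During planning, the execution engine resolves context by calling tools
# rather than relying on data baked into the prompt text.
print(TOOLS["service_owner"]("billing"))  # → payments-team
```

Because the tool is code, it can be permissioned, versioned, and tested like any other integration, which is exactly what brittle prompt-embedded context cannot be.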

Pattern #3: Embed execution outside the IDE

Much of today’s AI tooling assumes meaningful work happens inside the IDE. But modern software ecosystems extend far beyond an editor.

Teams want agentic capabilities inside:

  • Desktop applications
  • Internal operational tools
  • Background services
  • SaaS platforms
  • Event-driven systems

With the Copilot SDK, execution becomes an application-layer capability.

Your system can listen for an event—such as a file change, deployment trigger, or user action—and invoke Copilot programmatically.

The planning and execution loop runs inside your product, not in a separate interface or developer tool.

Why this matters: When execution is embedded into your application, AI stops being a helper in a side window and becomes infrastructure. It’s available wherever your software runs, not just inside an IDE or terminal.
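A minimal event dispatcher illustrates the pattern. The handler below is a stub standing in for a real SDK invocation, and the event and handler names are invented for illustration.

```python
# Toy event dispatcher: the application listens for events and invokes an
# agent programmatically. The handler is a stub standing in for a real
# Copilot SDK call; all names here are illustrative.

handlers = {}

def on(event_type: str):
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@on("deployment.finished")
def run_release_checks(event: dict) -> str:
    # In a real integration, this is where the SDK would be invoked with an
    # intent like "verify the deployed service is healthy."
    return f"agent invoked for {event['service']}"

def dispatch(event_type: str, event: dict) -> list:
    return [fn(event) for fn in handlers.get(event_type, [])]

print(dispatch("deployment.finished", {"service": "api"}))
```

The planning loop runs wherever the dispatcher runs, which is the point: the agent becomes part of the application's control flow, not a separate window.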

Build your first Copilot-powered app →

Execution is the new interface

The shift from “AI as text” to “AI as execution” is architectural. Agentic workflows are programmable planning and execution loops that operate under constraints, integrate with real systems, and adapt at runtime.

The GitHub Copilot SDK makes those execution capabilities accessible as a programmable layer. Teams can focus on defining what their software should accomplish, rather than rebuilding how orchestration works every time they introduce AI.

If your application can trigger logic, it can trigger agentic execution.

Explore the GitHub Copilot SDK →

Under the hood: Security architecture of GitHub Agentic Workflows https://github.blog/ai-and-ml/generative-ai/under-the-hood-security-architecture-of-github-agentic-workflows/ Mon, 09 Mar 2026 16:00:00 +0000 https://github.blog/?p=94363 GitHub Agentic Workflows are built with isolation, constrained outputs, and comprehensive logging. Learn how our threat model and security architecture help teams run agents safely in GitHub Actions.

The post Under the hood: Security architecture of GitHub Agentic Workflows appeared first on The GitHub Blog.


Whether you’re an open-source maintainer or part of an enterprise team, waking up to documentation fixes, new unit tests, and refactoring suggestions can be a true “aha” moment. But automation also raises an important concern: how do you put guardrails on agents that have access to your repository and the internet? Will you be wondering if your agent relied on documentation from a sketchy website, or pushed a commit containing an API token? What if it decides to add noisy comments to every open issue one day? Automations must be predictable to offer durable value.

But what is the safest way to add agents to existing automations like CI/CD? Agents are non-deterministic: They must consume untrusted inputs, reason over repository state, and make decisions at runtime. Letting agents operate in CI/CD without real-time supervision allows you to scale your software engineering, but it also requires novel guardrails to keep you from creating security problems.

GitHub Agentic Workflows run on top of GitHub Actions. By default, everything in an action runs in the same trust domain. Rogue agents can interfere with MCP servers, access authentication secrets, and make network requests to arbitrary hosts. A buggy or prompt-injected agent with unrestricted access to these resources can act in unexpected and insecure ways.

That’s why security is baked into the architecture of GitHub Agentic Workflows. We treat agent execution as an extension of the CI/CD model—not as a separate runtime. We separate open‑ended authoring from governed execution, then compile a workflow into a GitHub Action with explicit constraints such as permissions, outputs, auditability, and network access.

This post explains how we built Agentic Workflows with security in mind from day one, starting with the threat model and the security architecture that it needs.

Threat model

There are two properties of agentic workflows that change the threat model for automation.

First, agents’ ability to reason over repository state and act autonomously makes them valuable, but it also means they cannot be trusted by default—especially in the presence of untrusted inputs.

Second, GitHub Actions provides a highly permissive execution environment. A shared trust domain is a feature for deterministic automation, enabling broad access, composability, and good performance. But when combined with untrusted agents, a single trust domain can create a large blast radius if something goes wrong.

Under this model, we assume an agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. By default, GitHub Agentic Workflows run in a strict security mode with this threat model in mind, and their design is guided by four security principles: defense in depth, don’t trust agents with secrets, stage and vet all writes, and log everything.

Defend in depth

GitHub Agentic Workflows provide a layered security architecture consisting of substrate, configuration, and planning layers. Each layer limits the impact of failures above it by enforcing distinct security properties that are consistent with its assumptions.

Diagram of a three-layer system architecture with labeled sections Planning layer, Configuration layer, and Substrate layer. Each layer contains three blue tiles:

Planning: Safe Outputs MCP (GitHub write operations), Call filtering (call availability, volume), Output sanitization (secret removal, moderation).
Configuration: Compiler (GH AW extension), Firewall policies (allowlist), MCP config (Docker image, auth token).
Substrate: Action runner VM (OS, hypervisor), Docker containers (Docker daemon, network), Trusted containers (firewall, MCP gateway, API proxy).

The substrate layer rests on a GitHub Actions runner virtual machine (VM) and several trusted containers that limit the resources an agent can access. Collectively, the substrate layer provides isolation among components, mediation of privileged operations and system calls, and kernel-enforced communication boundaries. These protections hold even if an untrusted user-level component is compromised and executes arbitrary code within its container isolation boundary.

Above the substrate layer is a configuration layer that includes declarative artifacts and the toolchains that interpret them to instantiate a secure system structure and connectivity. The configuration layer dictates which components are loaded, how components are connected, what communication channels are permitted, and what privileges are assigned. Externally minted tokens, such as agent API keys and GitHub access tokens, are critical inputs that bound components’ external effects—configuration controls which tokens are loaded into which containers.

The final layer of defense is the planning layer. The configuration layer dictates which components exist and how they communicate, but it does not dictate which components are active over time. The planning layer’s primary responsibility is to create a staged workflow with explicit data exchanges between stages. The safe outputs subsystem, described in greater detail below, is the primary instance of secure planning.

Don’t trust agents with secrets

From the beginning, we wanted workflow agents to have zero access to secrets. Agentic workflows execute as GitHub Actions, in which components share a single trust domain on top of the runner VM. In that model, sensitive material like agent authentication tokens and MCP server API keys reside in environment variables and configuration files visible to all processes in the VM.

This is dangerous because agents are susceptible to prompt injection: Attackers can craft malicious inputs like web pages or repository issues that trick agents into leaking sensitive information. For example, a prompt-injected agent with access to shell-command tools can read configuration files, SSH keys, Linux /proc state, and workflow logs to discover credentials and other secrets. It can then upload these secrets to the web or encode them within public-facing GitHub objects like repository issues, pull requests, and comments.

Our first mitigation was to isolate the agent in a dedicated container with tightly controlled egress: firewalled internet access, MCP access through a trusted MCP gateway, and LLM API calls through an API proxy. To limit internet access, agentic workflows create a private network between the agent and firewall. The MCP gateway runs in a separate trusted container, launches MCP servers, and has exclusive access to MCP authentication material.

Although agents like Claude, Codex, and Copilot must communicate with an LLM over an authenticated channel, we avoid exposing those tokens directly to the agent’s container. Instead, we place LLM auth tokens in an isolated API proxy and configure agents to route model traffic through that proxy.

Architecture diagram showing several connected Docker containers. A Codex token connects to an api-proxy container, which connects to an OpenAI service icon. A separate flow shows an agent container (linked to chroot/host) communicating over http to a gh-aw-firewall container, then over http to a gh-aw-mcpg container (linked to Host Docker Socket), then over stdio to a GitHub MCP container (linked to a GitHub PAT). A GitHub icon appears above the GitHub MCP container.

Zero-secret agents require a fundamental trade-off between security and utility. Coding workloads require broad access to compilers, interpreters, scripts, and repository state, but expanding the in-container setup would duplicate existing actions provisioning logic and increase the set of network destinations that must be allowed through the firewall.

Instead, we carefully expose host files and executables using container volume mounts and run the agent in a chroot jail. We start by mounting the entire VM host file system read-only at /host. We then overlay selected paths with empty tmpfs layers and launch the agent in a chroot jail rooted at /host. This approach keeps the host-side setup intact while constraining the agent’s writable and discoverable surface to what it needs for its job.

Stage and vet all writes

Prompt-injected agents can still do harm even if they do not have access to secrets. For example, a rogue agent could spam a repository with pointless issues and pull requests to overwhelm repository maintainers, or add objectionable URLs and other content in repository objects.

To prevent this kind of behavior, the agentic workflows compiler decomposes workflows into explicit stages and defines, for each stage:

  • The active components and permissions (read vs. write)
  • The data artifacts emitted by that stage
  • The admissible downstream consumers of those artifacts

While the agent runs, it can read GitHub state through the GitHub MCP server and can only stage its updates through the safe outputs MCP server. Once the agent exits, write operations that have been buffered by the safe outputs MCP server are processed by a suite of safe outputs analyses.

Diagram showing a GitHub-centric workflow with green arrows and two rows of components. At the top, a GitHub icon points down into three boxes: Agent (Untrusted), GitHub MCP (Read-only), and MCP config (Write-buffered). Below are three processing steps labeled Filter operations, Moderate content, and Remove secrets, each marked 'Deterministic analysis.' Green arrows indicate data flow from GitHub into the system, down through configuration to 'Remove secrets,' then left through 'Moderate content' and 'Filter operations,' looping back toward the agent.

First, safe outputs allow workflow authors to specify which write operations an agent can perform. Authors can choose which subset of GitHub updates are allowed, such as creating issues, comments, or pull requests. Second, safe outputs limits the number of updates that are allowed, such as restricting an agent to creating at most three pull requests in a given run. Third, safe outputs analyzes update content to remove unwanted patterns, such as output sanitization to remove URLs. Only artifacts that pass through the entire safe outputs pipeline can be passed on, ensuring that each stage’s side effects are explicit and vetted.
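The three checks can be sketched as a small deterministic pipeline. This is an illustrative toy, not the actual safe outputs implementation; the operation names, the cap of three, and the URL-stripping rule are examples drawn from the description above.

```python
import re

# Toy version of the three safe-outputs checks: filter operations, cap
# volume, and sanitize content. Illustrative only -- not the actual GitHub
# Agentic Workflows implementation.

ALLOWED_OPS = {"create_issue", "add_comment"}   # author-declared allowlist
MAX_PER_OP = 3                                  # volume cap per operation type
URL_PATTERN = re.compile(r"https?://\S+")

def vet(staged_writes):
    admitted, counts = [], {}
    for op, body in staged_writes:
        if op not in ALLOWED_OPS:               # 1. filter operations
            continue
        counts[op] = counts.get(op, 0) + 1
        if counts[op] > MAX_PER_OP:             # 2. limit volume
            continue
        body = URL_PATTERN.sub("[link removed]", body)  # 3. sanitize content
        admitted.append((op, body))
    return admitted

staged = [("create_issue", "See https://evil.example for details"),
          ("create_pull_request", "not on the allowlist")]
print(vet(staged))
```

Because every check is deterministic, the same staged writes always produce the same admitted set, which is what makes the agent's side effects auditable.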

Log everything

Even with zero secrets and vetted writes, an agent can still transform repository data and invoke tools in unintended ways or try to break out of the constraints that we impose upon it. Agents are determined to accomplish their tasks by any means and have a surprisingly deep toolbox of tricks for doing so. If an agent behaves unexpectedly, post-incident analysis requires visibility into the complete execution path.

Agentic workflows make observability a first-class property of the architecture by logging extensively at each trust boundary. Network and destination-level activity is recorded at the firewall layer; model request/response metadata and authenticated requests are captured by the API proxy; and tool invocations are logged by the MCP gateway and MCP servers. We also add internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses. Together, these logs support end-to-end forensic reconstruction, policy validation, and rapid detection of anomalous agent behavior.

Pervasive logging also lays the foundation for future information-flow controls. Every location where communication can be observed is also a location where it can be mediated. Agentic workflows already support the GitHub MCP server’s lockdown mode, and in the coming months, we’ll introduce additional safety controls that enforce policies across MCP servers based on visibility (public vs. private) and the role of a repository object’s author.

What’s next?

We’d love for you to be involved! Share your thoughts in the Community discussion or join us (and tons of other awesome makers) in the #agentic-workflows channel of the GitHub Next Discord. We look forward to seeing what you build with GitHub Agentic Workflows. Happy automating, and keep an eye out for more updates!

How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework https://github.blog/security/how-to-scan-for-vulnerabilities-with-github-security-labs-open-source-ai-powered-framework/ Fri, 06 Mar 2026 21:09:04 +0000 https://github.blog/?p=94370 GitHub Security Lab Taskflow Agent is very effective at finding Auth Bypasses, IDORs, Token Leaks, and other high-impact vulnerabilities.

The post How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework appeared first on The GitHub Blog.


For the last few months, we’ve been using the GitHub Security Lab Taskflow Agent along with a new set of auditing taskflows that specialize in finding web security vulnerabilities. They also turn out to be very successful at finding high-impact vulnerabilities in open source projects. 

As security researchers, we’re used to losing time on possible vulnerabilities that turn out to be unexploitable, but with these new taskflows, we can now spend more of our time manually verifying the results and sending out reports. Furthermore, the severity of the vulnerabilities that we’re reporting is uniformly high. Many of them are authorization bypasses or information disclosure vulnerabilities that allow one user to log in as somebody else or to access the private data of another user.

Using these taskflows, we’ve reported more than 80 vulnerabilities so far. At the time of writing, approximately 20 of them have already been disclosed. And we’re continually updating our advisories page when new vulnerabilities are disclosed. In this blog post, we’ll show a few concrete examples of high-impact vulnerabilities that are found by these taskflows, like accessing personally identifiable information (PII) in shopping carts of ecommerce applications or signing in with any password into a chat application.

We’ll also explain how the taskflows work, so you can learn how to write your own. The security community moves faster when it shares knowledge, which is why we’ve made the framework open source and easy to run on your own project. The more teams using and contributing to it, the faster we collectively eliminate vulnerabilities.

How to run the taskflows on your own project

Want to get started right away? The taskflows are open source and easy to run yourself! Please note: A GitHub Copilot license is required, and the prompts will use premium model requests. (Note that running the taskflows can result in many tool calls, which can easily consume a large amount of quota.)

  1. Go to the seclab-taskflows repository and start a codespace.
  2. Wait a few minutes for the codespace to initialize.
  3. In the terminal, run ./scripts/audit/run_audit.sh myorg/myrepo

It might take an hour or two to finish on a medium-sized repository. When it finishes, it’ll open an SQLite viewer with the results. Open the “audit_results” table and look for rows with a check-mark in the “has_vulnerability” column.
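If you prefer to query the results programmatically instead of using the SQLite viewer, something like the following works. The table name audit_results and the has_vulnerability column come from the output described above; the other column names, the sample rows, and the in-memory database are assumptions for illustration, so substitute the path to your real results database and its actual schema.

```python
import sqlite3

# Minimal example of querying audit results with Python instead of the
# bundled SQLite viewer. The "audit_results" table and "has_vulnerability"
# column come from the post; the other columns and sample rows are invented.

conn = sqlite3.connect(":memory:")  # replace with the path to the real results DB
conn.execute("""CREATE TABLE audit_results (
    component TEXT, issue TEXT, has_vulnerability INTEGER)""")
conn.executemany(
    "INSERT INTO audit_results VALUES (?, ?, ?)",
    [("auth", "password check bypass", 1),
     ("search", "possible SQLi", 0)])

rows = conn.execute(
    "SELECT component, issue FROM audit_results WHERE has_vulnerability = 1"
).fetchall()
print(rows)  # only the rows flagged as real vulnerabilities
```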

Tip: Due to the non-deterministic nature of LLMs, it is worthwhile to perform multiple runs of these audit taskflows on the same codebase. In certain cases, a second run can lead to entirely different results. In addition to this, you might perform those two runs using different models (e.g., the first using GPT 5.2 and the second using Claude Opus 4.6).

The taskflows also work on private repos, but you’ll need to modify the codespace configuration to do so because it won’t allow access to your private repos by default.

Introduction to taskflows

Taskflows are YAML files that describe a series of tasks that we want to do with an LLM. With them, we can write prompts to complete different tasks and have tasks that depend on each other. The seclab-taskflow-agent framework takes care of running the tasks sequentially and passing the results from one task to the next.

For example, when auditing a repository, we first divide the repository into different components according to their functionality. Then, for each component, we collect information such as the entry points where it takes untrusted input, its intended privilege level, and its purpose. These results are then stored in a database to provide context for subsequent tasks.

Based on the context data, we can then create different auditing tasks. Currently, we have a task that suggests some generic issues for each component and another task that carefully audits each suggested issue. However, it’s also possible to create other tasks, such as tasks with specific focus on a certain type of issue.

These become a list of tasks we specify in a taskflow file.

Diagram of the auditing taskflows, showing the context gathering taskflow communicating with different auditing taskflows via a database named repo_context.db.

We use tasks instead of one big prompt because LLMs have limited context windows, and complex, multi-step tasks are often not completed properly. For example, some steps can be left out. Even though some LLMs have larger context windows, we find that taskflows are still useful in providing a way for us to control and debug the tasks, as well as for accomplishing bigger and more complex projects.

The seclab-taskflow-agent can also run the same task across many components asynchronously (like a for loop). During audits, we often reuse the same prompt and task for every component, varying only the details. The seclab-taskflow-agent lets us define templated prompts, iterate through components, and substitute component-specific details as it runs.
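The templating idea can be sketched with Python's string.Template. The prompt text and component fields below are invented for illustration; the real taskflows are defined in the framework's YAML format.

```python
from string import Template

# Toy illustration of running one templated audit prompt across many
# components, as the taskflow agent does. The prompt wording and component
# fields are invented; real taskflows are YAML files.

prompt = Template(
    "Audit the $name component. Entry points: $entry_points. "
    "Only report issues that cross its security boundary.")

components = [
    {"name": "auth", "entry_points": "login API, token refresh"},
    {"name": "uploads", "entry_points": "file upload endpoint"},
]

rendered = [prompt.substitute(c) for c in components]
for p in rendered:
    print(p)
```

One template plus a component table replaces dozens of hand-written prompts, and each rendered prompt gets a fresh, small context.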

Taskflows for general security code audits

After using seclab-taskflow-agent to triage CodeQL alerts, we decided we didn’t want to restrict ourselves to specific types of vulnerabilities and started to explore using the framework for more general security auditing. The main challenge in giving LLMs more freedom is the possibility of hallucinations and an increase in false positives. After all, the success with triaging CodeQL alerts was partly due to the fact that we gave the LLM a very strict and well-defined set of instructions and criteria, so the results could be verified at each stage to see if the instructions were followed. 

So our goal here was to find a good way to allow the LLM the freedom to look for different types of vulnerabilities while keeping hallucinations under control.

We’re going to show how we used agent taskflows to discover high-impact vulnerabilities with high true positive rate using just taskflow design and prompt engineering.

General taskflow design

To minimize hallucinations and false positives at the taskflow design level, our taskflow starts with a threat modeling stage, where a repository is divided into different components based on functionality, and information such as entry points and the intended use of each component is collected. This information helps us determine the security boundary of each component and how much exposure it has to untrusted input.

The information collected through the threat modeling stage is then used to determine the security boundary of each component and to decide what should be considered a security issue. For example, a command injection in a CLI tool designed to execute any user-supplied script may be a bug but not a security vulnerability, as an attacker able to inject a command through the CLI tool can already execute arbitrary scripts.

At the prompt level, the intended use and security boundary discovered here are used to give the LLM strict guidelines on whether a finding should be considered a vulnerability:

You need to take into account of the intention and threat model of the component in component notes to determine if an issue is a valid security issue or if it is an intended functionality. You can fetch entry points, web entry points and user actions to help you determine the intended usage of the component.

Asking an LLM something as vague as "look for any type of vulnerability anywhere in the codebase" would give poor results with many hallucinated issues. Ideally, we'd like to simulate the triage environment, where some potential issues serve as the starting point of analysis and the LLM applies rigorous criteria to determine whether each potential issue is valid.

To bootstrap this process, we break the auditing task into two steps. 

  • First, we ask the LLM to go through each component of the repository and suggest types of vulnerabilities that are more likely to appear in the component. 
  • Second, these suggestions are passed to another task, where they are audited according to rigorous criteria. 

In this setup, the suggestions from the first step act as noisy vulnerability alerts flagged by an "external tool," while the second step serves as a triage step. This may look like a self-validating process, but because it is broken into two steps—each with a fresh context and different prompts—the second step is able to provide an accurate assessment of the suggestions.
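The two-step pattern can be sketched as follows. Here `callModel` and the prompt strings are illustrative stand-ins (the demo implementation just echoes the last message), not the framework's actual interface; what matters is that each step builds its own fresh message list:

```typescript
// Suggest-then-triage with fresh contexts per step (illustrative sketch).
type Message = { role: "system" | "user"; content: string };

async function callModel(messages: Message[]): Promise<string> {
  // Stand-in for an LLM call; echoes the last user message for demo purposes.
  return messages[messages.length - 1].content;
}

async function suggestIssues(componentNotes: string): Promise<string> {
  // Step 1: brainstorm likely vulnerability types, no auditing yet.
  const messages: Message[] = [
    { role: "system", content: "Suggest likely vulnerability types. Do not audit them." },
    { role: "user", content: componentNotes },
  ];
  return callModel(messages);
}

async function auditSuggestion(suggestion: string, componentNotes: string): Promise<string> {
  // Step 2: fresh context; the suggestion is treated as an unvalidated external alert.
  const messages: Message[] = [
    { role: "system", content: "Verify this unvalidated suggestion with strict criteria." },
    { role: "user", content: `${componentNotes}\nSuggestion: ${suggestion}` },
  ];
  return callModel(messages);
}
```

Because the audit step never sees the suggester's reasoning, it cannot simply rationalize the earlier output.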

We’ll now go through these tasks in detail.

Threat modeling stage

When triaging alerts flagged by automatic code scanning tools, we found that a large proportion of false positives results from improper threat modeling. Most static analysis tools do not take into account the intended usage and security boundary of the source code and often report results that have no security implications. For example, in a reverse proxy application, many SSRF (server-side request forgery) findings flagged by automated tools are likely to fall within the intended use of the application. Similarly, some web services, for example in continuous integration pipelines, are designed to execute arbitrary code and scripts within a sandboxed environment; remote code execution vulnerabilities in these applications without a sandbox escape are generally not considered a security risk.

Given these caveats, it pays to first go through the source code to get an understanding of the functionalities and intended purpose of code. We divide this process into the following tasks: 

  • Identify applications: A GitHub repository is an imperfect boundary for auditing: It may be a single component within a larger system or contain multiple components, so it’s worth identifying and auditing each component separately to match distinct security boundaries and keep scope manageable. We do this with the identify_applications taskflow, which asks the LLM to inspect the repository’s source code and documentation and divide it into components by functionality. 
  • Identify entry points: We identify how each entry point is exposed to untrusted inputs to better gauge risk and anticipate likely vulnerabilities. Because “untrusted input” varies significantly between libraries and applications, we provide separate guidelines for each case.
  • Identify web entry points: This is an extra step to gather further information about entry points in the application and append information that is specific to web application entry points such as noting the HTTP method and paths that are required to access a certain endpoint. 
  • Identify user actions: We have the LLM review the code and identify what functionality a user can access under normal operation. This clarifies the user’s baseline privileges, helps assess whether vulnerabilities could enable privilege gains, and informs the component’s security boundary and threat model, with separate instructions depending on whether the component is a library or an application. 

At each of the above steps, information gathered about the repository is stored in a database. This includes components in the repository, their entry points, web entry points, and intended usage. This information is then available for use in the next stage.
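A minimal sketch of what such per-component records might look like; the field names and the in-memory `Map` are our assumptions for illustration, not the framework's actual schema:

```typescript
// Illustrative schema for the information gathered during threat modeling.
interface ComponentRecord {
  id: number;
  name: string;
  intendedUse: string;                              // what the component is designed to do
  entryPoints: string[];                            // how untrusted input can reach it
  webEntryPoints: { method: string; path: string }[]; // HTTP method + path per endpoint
  userActions: string[];                            // baseline actions a normal user may perform
}

// Stand-in for the database; later stages read records back by component id.
const db = new Map<number, ComponentRecord>();

function saveComponent(record: ComponentRecord): void {
  db.set(record.id, record);
}

function getComponent(id: number): ComponentRecord | undefined {
  return db.get(id);
}
```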

Issue suggestion stage

At this stage, we instruct the LLM to suggest types of vulnerabilities or general areas of high security risk for each component, based on the entry point information and intended use gathered in the previous step. In particular, we put emphasis on the intended usage of the component and its exposure to untrusted input:

Base your decision on:
- Is this component likely to take untrusted user input? For example, remote web request or IPC, RPC calls?
- What is the intended purpose of this component and its functionality? Does it allow high privileged action?
Is it intended to provide such functionalities for all user? Or is there complex access control logic involved?
- The component itself may also have its own `README.md` (or a subdirectory of it may have a `README.md`). Take a look at those files to help understand the functionality of the component.

We also explicitly instruct the LLM to not suggest issues that are of low severity or are generally considered non-security issues.

However, you should still take care not to include issues that are of low severity or requires unrealistic attack scenario such as misconfiguration or an already compromised system.

In general, we keep this stage relatively free of restrictions and allow the LLM freedom to explore and suggest different types of vulnerabilities and potential security issues. The idea is to have a reasonable set of focus areas and vulnerability types for the actual auditing task to use as a starting point.

One problem we ran into was that the LLM would sometimes start auditing the issues that it suggested, which would defeat the purpose of the brainstorming phase. To prevent this, we instructed the LLM to not audit the issues.

Issue audit stage

This is the final stage of the taskflows. Once we’ve gathered all the information we need about the repository and have suggested some types of vulnerabilities and security risks to focus on, the taskflow goes through each suggested issue and audits them by going through the source code. At this stage, the task starts with fresh context to scrutinize the issues suggested from the previous stage. The suggestions are considered to be unvalidated, and this taskflow is instructed to verify these issues:

The issues suggested have not been properly verified and are only suggested because they are common issues in these types of application. Your task is to audit the source code to check if this type of issues is present.

To avoid the LLM coming up with issues that are non-security related in the context of the component, we once again emphasize that intended usage must be taken into consideration.

You need to take into account of the intention and threat model of the component in component notes to determine if an issue is a valid security issue or if it is an intended functionality.

To avoid the LLM hallucinating issues that are unrealistic, we also instruct it to provide a concrete and realistic attack scenario and to only consider issues that stem from errors in the source code:

Do not consider scenarios where authentication is bypassed via stolen credential etc. We only consider situations that are achievable from within the source code itself.
...
If you believe there is a vulnerability, then you must include a realistic attack scenario, with details of all the file and line included, and also what an attacker can gain by exploiting the vulnerability. Only consider the issue a vulnerability if an attacker can gain privilege by performing an action that is not intended by the component.

To further reduce hallucinations, we also instruct the LLM to provide concrete evidence from the source code, with file path and line information:

Keep a record of the audit notes, be sure to include all relevant file path and line number. Just stating an end point, e.g. `IDOR in user update/delete endpoints (PUT /user/:id)` is not sufficient. I need to have the file and line number.

Finally, we also instruct the LLM that it is possible that there is no vulnerability in the component and that it should not make things up:

Remember, the issues suggested are only speculation and there may not be a vulnerability at all and it is ok to conclude that there is no security issue.

The emphasis of this stage is to provide accurate results while following strict guidelines—and to provide concrete evidence of the findings. With all these strict instructions in place, the LLM indeed rejects many unrealistic and unexploitable suggestions with very few hallucinations. 

The first prototype was designed with hallucination prevention as a priority, which raised a question: Would it become too conservative, rejecting most vulnerability candidates and failing to surface real issues?

The answer is clear after we ran the taskflow on a few repositories.

Three examples of vulnerabilities found by the taskflows

In this section, we’ll show three examples of vulnerabilities that were found by the taskflows and that have already been disclosed. In total, we have found and reported over 80 vulnerabilities so far. We publish all disclosed vulnerabilities on our advisories page.

Privilege escalation in Outline (CVE-2025-64487)

Our information-gathering taskflows are optimized toward web applications, which is why we first pointed our audit taskflows at a collaborative web application called Outline.

Outline is a multi-user collaboration suite with properties we were especially interested in: 

  • Documents have owners and different visibility, with permissions per users and teams.
  • Access rules like these are hard to analyze with a static application security testing (SAST) tool: they rely on custom access mechanisms, and existing SAST tools typically don’t know what actions a normal “user” should be able to perform. 
  • Such permission schemes are often also hard to analyze for humans by only reading the source code (if you didn’t create the scheme yourself, that is).
Screenshot showing an opened document in Outline. Outline is a collaborative web application.

And success: Our taskflows found a bug in the authorization logic on the very first run!

The notes in the audit results read like this:

Audit target: Improper membership management authorization in component server (backend API) of outline/outline (component id 2).

Summary conclusion: A real privilege escalation vulnerability exists. The document group membership modification endpoints (documents.add_group, documents.remove_group) authorize with the weaker "update" permission instead of the stronger "manageUsers" permission that is required for user membership changes. Because "update" can be satisfied by having only a ReadWrite membership on the document, a non‑admin document collaborator can grant (or revoke) group memberships – including granting Admin permission – thereby escalating their own privileges (if they are in the added group) and those of other group members. This allows actions (manageUsers, archive, delete, etc.) that were not intended for a mere ReadWrite collaborator.

Reading the TypeScript-based source code and verifying this finding on a test instance revealed that it was exploitable exactly as described. In addition, the described steps to exploit this vulnerability were on point:

Prerequisites:
- Attacker is a normal team member (not admin), not a guest, with direct ReadWrite membership on Document D (or via a group that grants ReadWrite) but NOT Admin.
- Attacker is a member of an existing group G in the same team (they do not need to be an admin of G; group read access is sufficient per group policy).

Steps:
1. Attacker calls POST documents.add_group (server/routes/api/documents/documents.ts lines 1875-1926) with body:
   {
     "id": "<document-D-id>",
     "groupId": "<group-G-id>",
     "permission": "admin"
   }
2. Authorization path:
   - Line 1896: authorize(user, "update", document) succeeds because attacker has ReadWrite membership (document.ts lines 96-99 allow update).
   - Line 1897: authorize(user, "read", group) succeeds for any non-guest same-team user (group.ts lines 27-33).
   No "manageUsers" check occurs.
3. Code creates or updates GroupMembership with permission Admin (lines 1899-1919).
4. Because attacker is a member of group G, their effective document permission (via groupMembership) now includes DocumentPermission.Admin.
5. With Admin membership, attacker now satisfies includesMembership(Admin) used in:
   - manageUsers (document.ts lines 123-134) enabling adding/removing arbitrary users via documents.add_user / documents.remove_user (lines 1747-1827, 1830-1872).
   - archive/unarchive/delete (document.ts archive policy lines 241-252; delete lines 198-208) enabling content integrity impact.
   - duplicate, move, other admin-like abilities (e.g., duplicate policy lines 136-153; move lines 155-170) beyond original ReadWrite scope.

Using these instructions, a low-privileged user could add arbitrary groups to a document that the user was only allowed to update (without possessing the “manageUsers” permission normally required for such changes).
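The flaw boils down to checking the wrong permission. Here is a stripped-down model of it, using the permission names from the advisory but otherwise purely illustrative code (Outline's real policy layer is far richer than this):

```typescript
// Simplified model of the authorization flaw: membership changes were gated
// on the weaker "update" permission instead of "manageUsers".
type Permission = "read" | "update" | "manageUsers";

function can(userPerms: Permission[], required: Permission): boolean {
  return userPerms.includes(required);
}

function addGroupBuggy(userPerms: Permission[]): boolean {
  // Bug: a mere ReadWrite collaborator satisfies "update" and gets through.
  return can(userPerms, "update");
}

function addGroupFixed(userPerms: Permission[]): boolean {
  // Fix: membership changes require the stronger "manageUsers" permission.
  return can(userPerms, "manageUsers");
}
```

With the buggy check, a ReadWrite collaborator (`["read", "update"]`) is authorized to modify group memberships; with the fixed check, they are not.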

In this sample, the group “Support” was added to the document by the low-privileged user named “gg.”

A screenshot of the share/document permissions functionality in Outline. The group “Support” was added by the “gg@test.test” user without having enough permissions for that action.

The Outline project fixed this and another issue we reported within three days! (Repo advisory)

The shopping cartocalypse (CVE-2025-15033, CVE-2026-25758)

We didn’t realize what systematic issues we’d uncover in the cart logic of ecommerce applications until we pointed our taskflows at the first online shop in our list. In the PHP-based WooCommerce project, the taskflows promptly found a way for normally signed-in shop users to view all guest orders—including personally identifiable information such as names, addresses, and phone numbers. After we reported this, Automattic (the company behind WooCommerce) quickly released an update (CVE-2025-15033) and accompanying blog post.

Intrigued by that vulnerability, we added more ecommerce applications to our list of applications for our agent to audit. And sure enough, we found more vulnerabilities. The popular Ruby-based Spree commerce application contained two similar vulnerabilities (CVE-2026-25758 and CVE-2026-25757). The more critical one allowed unauthenticated users to enumerate the addresses (and phone numbers) of all guest orders by little more than incrementing a sequential number.

In this screenshot, the attacker “test66” linked their session to an existing address of a guest user, thus being able to view the full address and phone number.

An unauthenticated guest user revealing addresses of other guest users by enumerating through them.

Our bug-hunting spree didn’t stop with Spree. Our taskflows uncovered similar issues in two additional ecommerce applications.

These authorization logic bugs had been undiscovered for years.

Signing in to Rocket.Chat using any password (CVE-2026-28514)

(This is not what passwordless authentication should look like!)

Every so often you can’t believe your eyes. This finding reported by our taskflows in Rocket.Chat was one of those moments.

When your agent comes back with a note like this:

VULNERABILITY: password authentication bypass in account-service allows logging in as any user with a password set.

You might find it hard to believe at first.

When you then continue reading the output:

Root cause:
- ee/apps/account-service/src/lib/utils.ts:60-61: `validatePassword` returns `Promise<boolean>` (bcrypt.compare(...)).
- ee/apps/account-service/src/lib/loginViaUsername.ts:18-21: `const valid = user.services?.password?.bcrypt && validatePassword(password, user.services.password.bcrypt);` but does NOT `await` the Promise; since a Promise is truthy, `if (!valid) return false;` is never triggered when bcrypt hash exists.
- ee/apps/account-service/src/lib/loginViaUsername.ts:23-35: proceeds to mint a new login token and saves it, returning `{ uid, token, hashedToken, tokenExpires }`.

It might make more sense, but you’re still not convinced.

It turns out the suspected finding is in the microservices-based setup of Rocket.Chat. In that particular setup, Rocket.Chat exposes its user account service via its DDP Streamer service.

Rocket.Chat’s microservices deployment. (This architecture diagram is from Rocket.Chat’s documentation; copyright Rocket.Chat.)

Once our Rocket.Chat test setup was working properly, we had to write proof-of-concept code to exploit this potential vulnerability. The notes of the agent already contained the JSON construct that we could use to connect to the endpoint using Meteor’s DDP protocol.

We connected to the WebSocket endpoint for the DDP Streamer service, and yes: it was truly possible to log in to the exposed Rocket.Chat DDP service using any password. Once signed in, it was also possible to perform other operations, such as connecting to arbitrary chat channels and listening for messages sent to those channels.

Here we received the message “HELLO WORLD!!!” while listening on the “General” channel.

The proof of concept code connected to the DDP streamer endpoint received “HELLO WORLD!!!” in the general channel.

The technical details of this issue are interesting (and scary as well). Rocket.Chat, primarily a TypeScript-based web application, uses bcrypt to store local user passwords. The bcrypt.compare function (used to compare a password against its stored hash) returns a Promise—a fact that is reflected in Rocket.Chat’s own validatePassword function, which returns Promise<boolean>:

export const validatePassword = (password: string, bcryptPassword: string): Promise<boolean> =>
    bcrypt.compare(getPassword(password), bcryptPassword);

However, when that function was used, the returned Promise was never awaited (e.g., by adding an await keyword in front of validatePassword):

const valid = user.services?.password?.bcrypt && validatePassword(password, user.services.password.bcrypt);

if (!valid) {
    return false;
}

This meant the result of validatePassword—a pending Promise, not a boolean—was ANDed with a truthy value. Since a Promise is always “truthy” in JavaScript terms, valid was always truthy whenever a user had a bcrypt password set.
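The bug pattern is easy to reproduce in isolation. In this sketch, validatePassword is a simplified stand-in for the bcrypt-based original (plain string comparison instead of bcrypt.compare, and "hunter2" is a demo value):

```typescript
// Simplified stand-in for the bcrypt-backed validatePassword.
const validatePassword = async (password: string, stored: string): Promise<boolean> =>
  password === stored; // placeholder for bcrypt.compare

async function loginBuggy(password: string, storedHash?: string): Promise<boolean> {
  // Missing await: when a hash exists, `valid` holds a Promise, which is truthy.
  const valid = storedHash !== undefined && validatePassword(password, storedHash);
  if (!valid) {
    return false;
  }
  return true; // reached for ANY password once a hash is set
}

async function loginFixed(password: string, storedHash?: string): Promise<boolean> {
  // Awaiting settles the Promise to an actual boolean before the check.
  const valid = storedHash !== undefined && (await validatePassword(password, storedHash));
  if (!valid) {
    return false;
  }
  return true;
}
```

In the buggy version, any password is accepted for any user with a stored hash; the single `await` restores the intended behavior.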

Severity aside, it’s fascinating that the LLM was able to pick up this rather subtle bug, follow it through multiple files, and arrive at the correct conclusion.

What we learned

After running the taskflows across more than 40 repositories—mostly multi-user web applications—the LLM suggested 1,003 issues (potential vulnerabilities). 

After the audit stage, 139 were marked as having vulnerabilities, meaning that the LLM decided they were exploitable. After deduplicating the issues—duplicates happen because each repository is run a couple of times on average and the results are aggregated—we ended up with 91 vulnerabilities, which we decided to manually inspect before reporting.

  • We rejected 20 (22%) results as false positives: findings we couldn’t reproduce manually.
  • We rejected 52 (57%) results as low severity: issues with very limited potential impact (e.g., blind SSRF with only an HTTP status code returned, or issues that require a malicious admin during the installation stage).
  • We kept only 19 (21%) results that we considered impactful enough to report, all serious vulnerabilities with the majority having high or critical severity (e.g., vulnerabilities that can be triggered without specific requirements and that impact confidentiality or integrity, such as disclosure of personal data, overwriting of system settings, or account takeover). 
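As a quick sanity check, the funnel percentages above can be recomputed from the raw counts:

```typescript
// Triage funnel arithmetic from the data above.
const suggested = 1003;      // issues suggested across the repositories
const flagged = 139;         // marked as vulnerabilities by the audit stage
const deduped = 91;          // unique findings after deduplication
const falsePositives = 20;
const lowSeverity = 52;
const reported = 19;

const pct = (part: number, whole: number): number =>
  Math.round((part / whole) * 100);

const auditPassRate = pct(flagged, suggested); // ≈ 14% of suggestions survive the audit
const fpRate = pct(falsePositives, deduped);   // 22%
const lowRate = pct(lowSeverity, deduped);     // 57%
const keptRate = pct(reported, deduped);       // 21%
```

The three manual-inspection buckets (20 + 52 + 19) account for all 91 deduplicated findings.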

This data was collected using gpt-5.x as the model for code analysis and audit tasks.

Note that we have run the taskflows on more repositories since this data was collected, so this table does not represent all the data we’ve collected and all vulnerabilities we’ve reported.

| Issue category | All | Has vulnerability | Vulnerability rate |
| --- | ---: | ---: | ---: |
| IDOR/Access control issue | 241 | 38 | 15.8% |
| XSS | 131 | 17 | 13.0% |
| CSRF | 110 | 17 | 15.5% |
| Authentication issue | 91 | 15 | 16.5% |
| Security misconfiguration | 75 | 13 | 17.3% |
| Path traversal | 61 | 10 | 16.4% |
| SSRF | 45 | 7 | 15.6% |
| Command injection | 39 | 5 | 12.8% |
| Remote code execution | 24 | 1 | 4.2% |
| Business logic issue | 24 | 6 | 25.0% |
| Template injection | 24 | 1 | 4.2% |
| File upload handling issues (excludes path traversal) | 18 | 2 | 11.1% |
| Insecure deserialization | 17 | 0 | 0.0% |
| Open redirect | 16 | 0 | 0.0% |
| SQL injection | 9 | 0 | 0.0% |
| Sensitive data exposure | 8 | 0 | 0.0% |
| XXE | 4 | 0 | 0.0% |
| Memory safety | 3 | 0 | 0.0% |
| Others | 66 | 7 | 10.6% |

If we divide the findings into two rough categories—logical issues (IDOR, authentication, security misconfiguration, business logic issues, sensitive data exposure) and technical issues (XSS, CSRF, path traversal, SSRF, command injection, remote code execution, template injection, file upload issues, insecure deserialization, open redirect, SQL injection, XXE, memory safety)—we get 439 logical issues and 501 technical issues. Although more technical issues were suggested, the difference isn’t significant because some broad categories (such as remote code execution and file upload issues) can also involve logical issues depending on the attacker scenario.
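This split can be recomputed directly from the suggestion counts in the table above (the “Others” category is excluded from both groups):

```typescript
// Suggestion counts per category, taken from the table above.
const suggestions: Record<string, number> = {
  "IDOR/Access control issue": 241, "XSS": 131, "CSRF": 110,
  "Authentication issue": 91, "Security misconfiguration": 75,
  "Path traversal": 61, "SSRF": 45, "Command injection": 39,
  "Remote code execution": 24, "Business logic issue": 24,
  "Template injection": 24, "File upload handling issues": 18,
  "Insecure deserialization": 17, "Open redirect": 16, "SQL injection": 9,
  "Sensitive data exposure": 8, "XXE": 4, "Memory safety": 3,
};

const logical = [
  "IDOR/Access control issue", "Authentication issue",
  "Security misconfiguration", "Business logic issue", "Sensitive data exposure",
];

const sum = (names: string[]): number =>
  names.reduce((total, name) => total + suggestions[name], 0);

const logicalTotal = sum(logical);                                 // 439
const technicalTotal = sum(
  Object.keys(suggestions).filter((n) => !logical.includes(n)),    // 501
);
```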

Only three suggested issues concern memory safety. This isn’t too surprising, given that the majority of the repositories tested are written in memory-safe languages. But we also suspect that the current taskflows may not be very efficient at finding memory-safety issues, especially compared with other automated tools such as fuzzers. This is an interesting area that could be improved by creating more specific taskflows and making more tools, like fuzzers, available to the LLM.

This data led us to the following observations.

LLMs are particularly good at finding logic bugs

What stands out from the data is the 25% vulnerability rate for “Business logic issue” and the large number of IDOR issues. In fact, the total number of IDOR issues flagged as vulnerable is greater than the next two categories combined (XSS and CSRF). Overall, we get the impression that the LLM does an excellent job of understanding the code space and following the control flow while taking into account the access control model and intended usage of the application—more or less what we’d expect from LLMs that excel at tasks like code review. This also makes it great for finding logic bugs that are difficult to find with traditional tools.

LLMs are good at rejecting low-severity issues and false positives

Curiously, none of the false positives are what we’d consider hallucinations. All the reports, including the false positives, have sound evidence backing them up, and we were able to follow each report to locate the endpoints and apply the suggested payload. Many of the false positives are due to more complex circumstances beyond what is visible in the code, such as browser mitigations for XSS, or are genuine mistakes that a human auditor is also likely to make. For example, when multiple layers of authentication are in place, the LLM could sometimes miss some of the checks, resulting in false positives.

We have since tested more repositories with more vulnerabilities reported, but the ratio between vulnerabilities and repositories remains roughly the same.

To demonstrate the extensibility of taskflows and how extra information can be incorporated, we created a new taskflow that runs after the audit stage and uses our newfound knowledge to filter out low-severity vulnerabilities. We found that it can filter out roughly 50% of the low-severity vulnerabilities, though a couple of borderline vulnerabilities that we did report were also marked as low severity. The taskflow and the prompt can be adjusted to fit the user’s preference; for our part, we’re happy to keep it inclusive so we don’t miss anything impactful.

LLMs are good at threat modeling

The LLM performs well in threat modeling in general. During the experiment, we tested it on a number of applications with different threat models, such as desktop applications, multi-tenant web applications, applications that are designed to run code in sandbox environments (code injection by design), and reverse proxy applications (applications where SSRF-like behavior is intended). The taskflow is able to take into account the intended usage of these applications and make sound decisions. It struggles most with threat modeling of desktop applications, as it is often unclear whether other processes running on the user’s desktop should be considered trusted.

We’ve also observed some remarkable reasoning by the LLM that excludes issues with no privilege gains. For example, in one case, the LLM noticed that while there are inconsistencies in access control, the issue does not give the attacker any advantages over a manual copy and paste action:

Security impact assessment:

A user possessing only read access to a document (no update rights) can duplicate it provided they also have updateDocument rights on the destination collection. This allows creation of a new editable copy of content they could already read. This does NOT grant additional access to other documents nor bypass protections on the original; any user with read access could manually copy-paste the content into a new document they are permitted to create (creation generally allowed for non-guest, non-viewer members in ReadWrite collections per createDocument collection policy)

We’ve also seen some more sophisticated techniques used in the reasoning. For example, in one application that runs scripts in a sandboxed Node.js environment, the LLM suggested the following technique to escape the sandbox:

In Node’s vm, passing any outer-realm function into a contextified sandbox leaks that function’s outer-realm Function constructor through the `constructor` property. From inside the sandbox:
  const F = console.log.constructor; // outer-realm Function
  const hostProcess = F('return process')(); // host process object
  // Bypass module allowlist via host dynamic import
  const cp = await F('return import("node:child_process")')();
  const out = cp.execSync('id').toString();
  return [{ json: { out } }];

The presence of host functions (console.log, timers, require, RPC methods) is sufficient to obtain the host Function constructor and escape the sandbox. The allowlist in require-resolver is bypassed by constructing host-realm functions and using dynamic import of built-in modules (e.g., node:child_process), which does not go through the sandbox’s custom require.

While the result turned out to be a false positive due to other mitigating factors, it demonstrates the LLM’s technical knowledge.

Get involved!

The taskflows we used to find these vulnerabilities are open source and easy to run on your own project, so we hope you’ll give them a try! We also want to encourage you to write your own taskflows. The results showcased in this blog post are just small examples of what’s possible. There are other types of vulnerabilities to find, and there are other security-related problems, like triaging SAST results or building development setups, which we think taskflows can help with. Let us know what you’re building by starting a discussion on our repo!

The post How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework appeared first on The GitHub Blog.

60 million Copilot code reviews and counting (https://github.blog/ai-and-ml/github-copilot/60-million-copilot-code-reviews-and-counting/, Thu, 05 Mar 2026): How Copilot code review helps teams keep up with AI-accelerated code changes.


Since our initial launch of Copilot code review (CCR) last April, usage has grown 10X, now accounting for more than one in five code reviews on GitHub.

Behind the scenes, we’ve been running continuous experiments to enhance comment quality. We also moved to an agentic architecture that retrieves repository context and reasons across changes. At every step of the way, we’ve listened to your feedback: your survey answers and even your simple thumbs-up and thumbs-down reactions on comments have helped us identify key issues and iterate on our UX to provide a comprehensive review experience.

Copilot code review handles pull request reviews and summaries, allowing teams to focus on more complex tasks.

Suvarna Rane, Software Development Manager, General Motors

Redefining a “good” code review

As Copilot code review has evolved over time, so has our definition of a “good code review.” When we started building it in 2024, our goal was simple: thoroughness. Since then, we’ve learned that what developers actually value is high-signal feedback that helps them move a pull request forward quickly. Today, Copilot code review leverages the best models, memory, and agentic tool-calling to conduct comprehensive reviews. To get here, we’ve used a continuous evaluation loop to tune the agent’s judgment, focusing on three qualities that shape that experience: accuracy, signal, and speed.

Accuracy

Our aim has been for Copilot code review to deliver sound judgment, prioritizing consequential logic and maintainability issues. We evaluate performance in two ways: through internal testing against known code issues, and through production signals from real pull requests. In production, we track two key indicators:

  • Developer feedback: Thumbs-up and thumbs-down reactions on comments help us understand whether suggestions are helpful.
  • Production signals: We measure whether flagged issues are resolved before merging.

Together, these signals help ensure that Copilot code review surfaces issues that matter, and that faster merges come from confident fixes, not less scrutiny.

Copilot code review comment identifying a missing dependency in a React useCallback hook and suggesting a code change to add handleKeyboardDrag to the dependency array.

Signal

In code review, more comments don’t necessarily mean a better review. Our goal isn’t to maximize comment volume, but to surface issues that actually matter.

A high-signal comment helps a developer understand both the problem and the fix:

Copilot code review comment warning that a retry loop could run indefinitely when an API returns HTTP 429 without a Retry-After header and suggesting adding a retry limit and backoff.

Silence is better than noise. In 71% of the reviews, Copilot code review surfaces actionable feedback. In the remaining 29%, the agent says nothing at all.

As our ability to identify high-signal findings improves, we’re also able to comment more confidently, now averaging about 5.1 comments per review without increasing review churn or lowering our quality threshold.

Speed

In code review, speed matters, but signal matters more. Copilot code review is designed to provide a reliable first pass shortly after a pull request is opened. That being said, meaningful reviews still require analysis. As reasoning capabilities improve, so does the computation required to surface deeper issues.

We treat this as a deliberate trade-off. In one recent change, adopting a more advanced reasoning model improved positive feedback rates by 6%, even though review latency increased by 16%.

For us, that’s the right exchange. A slightly slower review that surfaces real issues is far more valuable than instant feedback that adds noise. We continue to reduce latency wherever possible, but never at the expense of high-signal findings developers can trust.

About the agentic architecture

Given our new definition of “good,” we redeveloped our code review system. Today’s agentic design can retrieve context intelligently and explore the repository to understand logic, architecture, and specific invariants.

This shift alone has driven an initial 8.1% increase in positive feedback.

Here’s why:

  • It catches issues as it reads, not just at the end: Previously, agents waited until the end of a review to finalize results, which often led to “forgetting” early discoveries.
  • It can maintain memory across reviews: Now, every pull request doesn’t need to be an isolated event. If it flags a pattern in one part of the codebase, it can reuse that context in future reviews.
  • It keeps long pull requests reviewable with an explicit plan: It can map out its review strategy ahead of time, significantly improving its performance on long, complex pull requests, where context is easily lost.
  • It reads linked issues and pull requests: That extra context helps it flag subtle gaps. This includes cases where the code looks reasonable in isolation but doesn’t match the project’s requirements.
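The first behaviors in that list can be sketched as a toy loop: plan the review order up front, record findings while reading each file, and carry them into memory for future reviews. The `files` records and `analyze_file` callback below are made up for illustration:

```python
def review_pull_request(files, analyze_file, memory):
    """Toy sketch of an incremental review pass; not GitHub's implementation."""
    # Map out an explicit plan before reviewing (here, riskiest files first).
    plan = sorted(files, key=lambda f: f["risk"], reverse=True)
    findings = []
    for f in plan:
        # Record issues as each file is read, instead of finalizing at the end.
        findings.extend(analyze_file(f, prior_context=memory))
    memory.extend(findings)  # memory persists as context for future reviews
    return findings
```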

Making reviews easier to navigate

By iterating on how the agent interacts with pull requests, we’ve reduced noise and made feedback more actionable. Here’s what that means for you.

  • Quickly understand feedback (and the fix) with multi-line comments: We moved away from pinning comments to single lines. By attaching feedback to logical code ranges, Copilot makes it easier to see what it’s referring to and apply the suggested change.
Copilot code review comment on a GitHub Actions workflow identifying a missing use_caches input parameter and suggesting a code change to add the boolean input to the workflow configuration.
  • Keep your pull request timeline readable: Instead of multiple separate comments for the same pattern error, which can be overwhelming, the agent clusters them into a single, cohesive unit to reduce cognitive load.
  • Fix whole classes of issues at once with batch autofixes: Apply suggested fixes in batches, resolving an entire class of logic bugs or style issues at once, rather than context-switching through a dozen individual suggestions.
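The clustering behavior described above can be sketched in a few lines; the finding dicts below are illustrative, not Copilot's real data model:

```python
from collections import defaultdict

def cluster_findings(findings):
    """Group repeated findings of the same pattern into one unit, so a
    recurring issue becomes a single comment instead of a dozen."""
    clusters = defaultdict(list)
    for finding in findings:
        clusters[finding["rule"]].append(finding["location"])
    return dict(clusters)
```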

Take this with you

As AI continues to accelerate software development, it’s more important than ever to help teams review and trust code at scale. Copilot code review helps teams keep pace by surfacing high-signal feedback directly in pull requests, enabling developers to catch issues earlier and merge with greater confidence.

More than 12,000 organizations now run Copilot code review automatically on every pull request. At WEX, this shift toward default AI-assisted reviews has helped scale Copilot adoption across the engineering organization:

Today, two-thirds of developers are using Copilot — including the organization’s most active contributors. WEX has since expanded adoption by making Copilot code review a default across every repository. Developers are also heavily utilizing agent mode and the coding agent to drive autonomy, helping WEX see a huge lift in deployments, with ~30% more code shipped. — WEX customer story

Going forward, we’re focused on deeper personalization and high-fidelity interactivity, refining the agent to learn your team’s unwritten preferences while enabling two-way conversations that let you refine fixes and explore alternatives before merging.

As Copilot capabilities continue to evolve, from coding and planning to review and automation, the goal is simple: help developers move faster while maintaining the trust and quality that great software demands.

Get started today

Copilot code review is a premium feature available with Copilot Pro, Copilot Pro+, Copilot Business, and Copilot Enterprise. See our documentation to enable it and get started.

Already enabled Copilot code review? See these docs to set up automatic Copilot code reviews on every pull request within your repository or organization.

Have thoughts or feedback? Please let us know in our community discussion post.

The post 60 million Copilot code reviews and counting appeared first on The GitHub Blog.

Scaling AI opportunity across the globe: Learnings from GitHub and Andela https://github.blog/developer-skills/career-growth/scaling-ai-opportunity-across-the-globe-learnings-from-github-and-andela/ Thu, 05 Mar 2026 17:00:00 +0000 https://github.blog/?p=94275 Developers connected to Andela share how they’re learning AI tools inside real production workflows.

The post Scaling AI opportunity across the globe: Learnings from GitHub and Andela appeared first on The GitHub Blog.


Across the globe, developer talent is abundant. But what has been historically inequitable is the access to emerging technologies, mentorship, and enablement when those technologies are reshaping the industry. Developers in regions like Africa, South America, and Southeast Asia can build products at scale, yet access to emerging tools and learning pathways often varies by geography and employer.

Andela is a global talent marketplace built on the belief that where you live should not determine your access to opportunity. Over the past two years, GitHub and Andela have been working together to expand structured AI access across Andela’s 5.5-million-member global talent network. As of now, 3,000 Andela engineers have been trained on GitHub Copilot through Andela’s AI Academy.

Starting in 2024, Andela began rolling out structured AI training to selected developers across Africa and Latin America whose day-to-day work directly involved complex production systems. Instead of treating AI as a standalone experiment, the program integrated Copilot directly into day-to-day development processes—within IDE environments, pull request reviews, and active refactoring work—ensuring it was evaluated under real production constraints.

To understand how the approach worked in practice, we spoke with Andela developers. Below, you’ll learn how they introduced AI into active production systems and identified a model that you can apply, whether you’re experimenting independently or integrating AI tools at work.

The challenges that global developers face today

Developers in regions such as Africa, South America, and Southeast Asia face a distinct set of challenges when it comes to AI skilling and reskilling.

Many developers contend with unreliable connectivity, limited access to high‑performance compute, and the high cost of cloud tools and data—all of which are foundational to learning and practicing modern AI. Training content is often designed for well‑resourced environments, assumes constant internet access, and is rarely localized for language, context, or regional use cases. At the same time, many developers are navigating informal or contract‑based work, leaving little time or financial cushion to invest in reskilling.

Without intentional investment in affordable access, localized learning pathways, and community‑driven ecosystems, the rapid pace of AI advancement risks widening existing inequities—excluding talented developers from across the globe from fully participating in, shaping, and benefiting from the future of AI.

Learning AI inside real work

For most mid-career developers, stepping away from production responsibilities to experiment with AI tools is not realistic. Deadlines continue, systems remain live, and reputations are earned over time. This is why learning has to happen inside real work.

In many organizations, AI tooling is provisioned broadly, and teams are told to experiment. Access is assumed to be enough. But without clarity around which roles benefit most, what jobs are being targeted, and how review standards evolve, adoption can stall or fragment.

Andela took a different approach. Developers were identified based on the relevance of AI to their responsibilities, job profiles were defined, and training programs reflected the actual systems developers were accountable for maintaining.

This is because the team at Andela knows that developers are rarely starting from scratch. More often, they are working inside dense, high-stakes systems where mistakes carry consequences. For many engineers across the globe, access to structured experimentation with emerging tools has not always been guaranteed, which makes learning inside real work both necessary and consequential.

Stephen N’nouka A’ Issah, a React developer from Cameroon who works in Rwanda, assumed early on that AI tools would not perform well under that level of complexity.

I thought it might help with simple things. But I didn’t expect it to work with advanced patterns or legacy code.

Stephen N’nouka A’ Issah, React developer

That skepticism reflected experience. Many developers have seen tools demonstrate well in controlled environments and struggle once deployed in production systems.

Recognizing this reality, Andela chose not to treat AI as a separate discipline or certification exercise removed from day-to-day work. Instead, through its AI Academy, it embedded learning directly into production workflows.

Abraham Omomoh, a learning program manager at Andela, explained the philosophy clearly.

Training has to reflect what developers are actually asked to do at work, not idealized exercises.

Abraham Omomoh, learning program manager at Andela

This way, learning occurs within the same systems developers are already accountable for maintaining.

The first payoff: Faster orientation

One of the earliest benefits developers recognized wasn’t increased output, but faster orientation within unfamiliar systems.

Daniel Nascimento, a senior engineer in Brazil with more than 25 years of experience, described what it’s like to work on legacy code that “nobody wants to touch,” where the real risk isn’t speed so much as unintended consequences. “The first thing I ask is: what does this project actually do?” he said. “What’s the architecture? What are the weaknesses? What are the strengths?”

To make change safer, he now uses AI tools to generate unit tests before refactoring, creating clearer boundaries for what can be modified without breaking behavior.

Legacy code usually doesn’t have coverage. So I use it to build that coverage first. Then I know what I’m playing with.

Daniel Nascimento, senior engineer

Stephen described a similar pattern when onboarding to unfamiliar systems. In his experience, AI doesn’t replace understanding; it compresses the time it takes to surface intent, architectural patterns, and constraints before making changes. Much of this work involves:

  • Generating tests to understand behavior
  • Drafting refactors to clarify control flow
  • Sketching diagrams to reason about system boundaries

Even then, many suggestions still require cleanup or introduce subtle issues, reinforcing the importance of disciplined reviews.
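The “build coverage first” pattern Daniel describes boils down to characterization tests: pin down what the legacy code does today so a refactor can’t silently change it. A minimal sketch, with `legacy_discount` standing in for real legacy logic:

```python
def legacy_discount(total, is_member):
    # Stand-in for legacy logic nobody wants to touch (integer currency units).
    if total > 100:
        return total * 9 // 10 if is_member else total * 19 // 20
    return total

def test_characterizes_current_behavior():
    # Pin down current behavior before refactoring, so any change in
    # output during the refactor fails loudly instead of silently.
    assert legacy_discount(50, True) == 50
    assert legacy_discount(200, True) == 180
    assert legacy_discount(200, False) == 190
```

These tests don’t assert that the behavior is *correct*, only that it is *unchanged*, which is exactly the safety boundary a refactor needs.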

With AI, confidence compounds

After several weeks of applying AI inside production systems, we could start to measure incremental improvements.

Developers reported:

  • Faster onboarding to unfamiliar systems
  • More confidence taking ownership of ambiguous work
  • Less time spent on setup and more on decisions

Daniel estimated a significant productivity gain, largely driven by working differently.

“Using GitHub Copilot, I boosted my productivity by around 50%,” he said.

But it’s not just speed. It gives me more time to connect with the business and focus on real impact.

Daniel Nascimento

He emphasized much of that gain came from reducing repetitive overhead rather than replacing core engineering judgment.

For developers who previously lacked structured exposure to AI tooling, that access translated into expanded professional skills. Certifications strengthened their credibility and AI fluency expanded the scope of work they could take on.

The AI skills gap shows up as access, not ability

This work reinforces a broader pattern: the AI skills gap is, at its core, about structured access to tools, mentorship, and practical enablement.

Developers who adapt faster typically have:

  • Access to modern tools
  • Space to experiment safely
  • Teams aligned on how those tools should be used

Where those conditions exist, learning compounds. Where they don’t, AI impact is limited.

And this also matters for developers across the globe where increased skilling translates to better job and economic opportunities. Koffi Kelvin, an Andela engineer based in Kenya, shared, “GitHub Copilot is a portal that catapulted my professional trajectory into a literal other dimension.”

Between the workflows, security, testing and high-octane pipelines, it’s been less like a career path and more like a rocket launch.

Koffi Kelvin, Andela engineer

Expanding structured access in the Global South isn’t about catching up. Instead, it’s about ensuring that the developers shaping AI-assisted systems reflect the full diversity of global engineering talent.

Everyone benefits with access

When everyone across the globe has structured access to learning, we all benefit from it. AI upskilling is not about chasing hype or predicting the future. It is about learning how to integrate new tools into real systems without stepping away from the job. It allows developers to take on more complex work, contribute more confidently to global teams, and continue building at the edge of modern practice regardless of geography.

When that learning is structured—when access is intentional rather than incidental—it compounds. Sammy Kiogara Mati, an Andela engineer who works on GitHub, shared that “GitHub Copilot has expanded my view of what’s possible for global tech talent.”

AI does not level the playing field on its own. Structured access does.

To start your own AI learning journey,
check out GitHub Learn >


How we rebuilt the search architecture for high availability in GitHub Enterprise Server https://github.blog/engineering/architecture-optimization/how-we-rebuilt-the-search-architecture-for-high-availability-in-github-enterprise-server/ Tue, 03 Mar 2026 18:45:09 +0000 https://github.blog/?p=94225 Here's how we made the search experience better, faster, and more resilient for GHES customers.

The post How we rebuilt the search architecture for high availability in GitHub Enterprise Server appeared first on The GitHub Blog.


So much of what you interact with on GitHub depends on search. The search bars and filtering experiences like the GitHub Issues page are the obvious examples, but search is also at the core of the releases page, the projects page, the counts for issues and pull requests, and more. Given that search is such a core part of the GitHub platform, we’ve spent the last year making it even more durable. That means less time spent managing GitHub Enterprise Server, and more time working on what your customers care most about.

In recent years, GitHub Enterprise Server administrators had to be especially careful with search indexes, the special database tables optimized for searching. If they didn’t follow maintenance or upgrade steps in exactly the right order, search indexes could become damaged and need repair, or they might get locked and cause problems during upgrades. Quick context if you’re not familiar with High Availability (HA) setups: they’re designed to keep GitHub Enterprise Server running smoothly even if part of the system fails. You have a primary node that handles all the writes and traffic, and replica nodes that stay in sync and can take over if needed.

Diagram labeled 'HA Architecture' with two boxes: 'Primary Node' and 'Replica Node.' Across both of them there exists an 'Elasticsearch Cluster' with a nested box on each node labeled 'ES Instance.' A pink arrow points from the Primary Node’s ES Instance to the Replica Node’s ES Instance, indicating replication or failover in a high-availability setup.

Much of this difficulty comes from how previous versions of Elasticsearch, our search database of choice, were integrated. HA GitHub Enterprise Server installations use a leader/follower pattern. The leader (primary server) receives all the writes, updates, and traffic. Followers (replicas) are designed to be read-only. This pattern is deeply ingrained into all of the operations of GitHub Enterprise Server.

This is where Elasticsearch started running into issues. Since Elasticsearch didn’t support a leader/follower split between separate instances, GitHub engineering had to create a single Elasticsearch cluster spanning the primary and replica nodes. This made replicating data straightforward and also provided some performance benefits, since each node could handle search requests locally.

Diagram showing 'Primary Node' and 'Replica Node' as part of an 'Elasticsearch Cluster.' The Primary Node contains 'Primary Shard 1,' and the Replica Node contains 'Primary Shard 2.' A pink arrow points from an empty shard slot on the 'Primary Node' to Shard 2, representing the unwanted move of a primary shard to the 'Replica Node.'

Unfortunately, the problems of clustering across servers eventually began to outweigh the benefits. For example, at any point Elasticsearch could move a primary shard (responsible for receiving/validating writes) to a replica. If that replica was then taken down for maintenance, GitHub Enterprise Server could end up in a locked state. The replica would wait for Elasticsearch to be healthy before starting up, but Elasticsearch couldn’t become healthy until the replica rejoined.

For a number of GitHub Enterprise Server releases, engineers at GitHub tried to make this mode more stable. We implemented checks to ensure Elasticsearch was in a healthy state, as well as other processes to try and correct drifting states. We went as far as attempting to build a “search mirroring” system that would allow us to move away from the clustered mode. But database replication is incredibly challenging, and these efforts never achieved the consistency we needed.

What changed?

After years of work, we’re now able to use Elasticsearch’s Cross-Cluster Replication (CCR) feature to support HA GitHub Enterprise Server installations.

“But David,” you say, “That’s replication between clusters. How does that help here?” 

I’m so glad you asked. With this mode, we’re moving to several single-node Elasticsearch clusters: each GitHub Enterprise Server instance now operates as its own independent, single-node Elasticsearch cluster.

Diagram showing two boxes labeled 'Primary Node' and 'Replica Node.' Each box contains a dashed rectangle labeled 'Elasticsearch Instance / Cluster.' A double-headed pink arrow labeled 'Replicate Index Data (CCR)' connects the two boxes, illustrating bidirectional data replication between the primary and replica Elasticsearch clusters.

CCR lets us share the index data between nodes in a way that is carefully controlled and natively supported by Elasticsearch. It copies data once it’s been persisted to the Lucene segments (Elasticsearch’s underlying data store). This ensures we’re replicating data that has been durably persisted within the Elasticsearch cluster.

In other words, now that Elasticsearch supports a leader/follower pattern, GitHub Enterprise Server administrators will no longer be left in a state where critical data winds up on read-only nodes.

Under the hood

Elasticsearch has an auto-follow API, but it only applies to indexes created after the policy exists. GitHub Enterprise Server HA installations already have a long-lived set of indexes, so we need a bootstrap step that attaches followers to existing indexes, then enables auto-follow for anything created in the future.

Here’s a sample of what that workflow looks like:

def bootstrap_ccr(primary, replica):
    # Fetch the current indexes on each node
    primary_indexes = list_indexes(primary)
    replica_indexes = list_indexes(replica)

    # Filter out the system indexes
    managed = [i for i in primary_indexes if is_managed_ghe_index(i)]

    # For indexes without follower patterns, we need to
    # initialize that contract
    for index in managed:
        if index not in replica_indexes:
            ensure_follower_index(replica, leader=primary, index=index)
        else:
            ensure_following(replica, leader=primary, index=index)

    # Finally, we set up auto-follow patterns
    # so new indexes are automatically followed
    ensure_auto_follow_policy(
        replica,
        leader=primary,
        patterns=[managed_index_patterns],
        exclude=[system_index_patterns],
    )

This is just one of the new workflows we’ve created to enable CCR in GitHub Enterprise Server. We’ve needed to engineer custom workflows for failover, index deletion, and upgrades. Elasticsearch only handles the document replication, and we’re responsible for the rest of the index’s lifecycle. 
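For reference, helpers like `ensure_follower_index` and `ensure_auto_follow_policy` in the sketch above map onto Elasticsearch’s CCR REST APIs, `PUT /<follower_index>/_ccr/follow` and `PUT /_ccr/auto_follow/<policy>`. The payload builders below show the rough request shape; the cluster and index names used here are invented, not GitHub’s actual configuration:

```python
def follow_payload(leader_cluster, leader_index):
    # Request body for PUT /<follower_index>/_ccr/follow
    return {
        "remote_cluster": leader_cluster,
        "leader_index": leader_index,
    }

def auto_follow_payload(leader_cluster, patterns, exclusions):
    # Request body for PUT /_ccr/auto_follow/<policy_name>
    return {
        "remote_cluster": leader_cluster,
        "leader_index_patterns": patterns,
        "leader_index_exclusion_patterns": exclusions,
    }
```

Excluding system index patterns from the auto-follow policy mirrors the `exclude` argument in the bootstrap sketch: only managed application indexes should be replicated.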

How to get started with CCR mode 

To get started using the new CCR mode, reach out to [email protected] and let them know you’d like to use the new HA mode for GitHub Enterprise Server. They’ll set up your organization so that you can download the required license.

Once you’ve downloaded your new license, you’ll need to set `ghe-config app.elasticsearch.ccr true`. With that finished, administrators can run a `config-apply` or an upgrade on your cluster to move to 3.19.1, which is the first release to support this new architecture.  

When your GitHub Enterprise Server restarts, Elasticsearch will migrate your installation to use the new replication method. This will consolidate all the data onto the primary nodes, break clustering across nodes, and restart replication using CCR. This update may take some time depending on the size of your GitHub Enterprise Server instance.

While the new HA method is optional for now, we’ll be making it our default over the next two years. We want to ensure there’s ample time for GitHub Enterprise administrators to get their feedback in, so now is the time to try it out. 

We’re excited for you to start using the new HA mode for a more seamless experience managing GitHub Enterprise Server. 

Want to get the most out of search on your High Availability GitHub Enterprise Server deployment? Reach out to support to get set up with our new search architecture!


Join or host a GitHub Copilot Dev Days event near you https://github.blog/ai-and-ml/github-copilot/join-or-host-a-github-copilot-dev-days-event-near-you/ Tue, 03 Mar 2026 16:55:00 +0000 https://github.blog/?p=94244 GitHub Copilot Dev Days is a global series of hands-on, in-person, community-led events designed to help developers explore real-world, AI-assisted coding.

The post Join or host a GitHub Copilot Dev Days event near you appeared first on The GitHub Blog.


The way we build software is changing fast. AI is no longer a “someday” tool. It’s reshaping how we plan, write, review, and ship code right now. As products evolve faster than ever, developers are expected to keep up just as quickly. That’s why GitHub Copilot Dev Days exists: for developers to level up together on how they can use AI-assisted coding today.

GitHub Copilot Dev Days is a global series of hands-on, in-person, community-led events designed to help developers explore real-world AI-assisted coding with GitHub Copilot. Join us for the knowledge, stay for the great food, good vibes, and plenty of fun along the way. Find an event near you and register today.

Who is GitHub Copilot Dev Days for?

Anyone and everyone who is looking to improve their development workflow and learn something new! We have events run by and for folks from professional developers to students. Sessions cover various levels and programming backgrounds.

If it’s your first time trying out AI-assisted development, this event will introduce you to the tools and best practices to succeed from day one. If you’re more advanced, we’re excited to show you the latest tips and tricks to ensure you’re fully up to date.

What to expect from a GitHub Copilot Dev Day

Each event will feature live demos, practical sessions, and interactive workshops with high-quality training content. We will focus on real workflows you can use right away, whether you’re already using Copilot daily or just getting started. Your hosts are development experts: GitHub Stars, Microsoft MVPs, GitHub Campus Experts, Microsoft Student Ambassadors, GitHub and Microsoft employees, to name a few.

We will have training materials covering the GitHub Copilot CLI, Cloud Agent, GitHub Copilot in VS Code, Visual Studio, and other editors, and more! Different events will focus on different topics, so be sure to review the registration page beforehand.

The specific event details will vary, as each community event organizer might tweak the event to fit the interests of their local developer community. Here is a sample agenda:

  • Introductory session: 30-45 minutes on GitHub Copilot.
  • Local community session: 30-45 minutes by a local developer or community leader on relevant topics.
  • Hands-on workshop: 1 hour of coding and practical exercises.

All events are an opportunity to connect with your local developer community, learn something new, and enjoy some snacks and swag!

Events begin in March

Events are now live in cities around the world starting in March. Spots are limited and dates are approaching—now’s the time to grab a seat.

Want to bring GitHub Copilot Dev Days to your user group? Fill out our form.

Find a GitHub Copilot Dev Days event near you and register today >

