How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework

The GitHub Blog, Fri, 06 Mar 2026

The GitHub Security Lab Taskflow Agent is very effective at finding auth bypasses, IDORs, token leaks, and other high-impact vulnerabilities.


For the last few months, we’ve been using the GitHub Security Lab Taskflow Agent along with a new set of auditing taskflows that specialize in finding web security vulnerabilities. These taskflows have turned out to be very successful at finding high-impact vulnerabilities in open source projects.

As security researchers, we’re used to losing time on possible vulnerabilities that turn out to be unexploitable, but with these new taskflows, we can now spend more of our time manually verifying the results and sending out reports. Furthermore, the severity of the vulnerabilities that we’re reporting is uniformly high. Many of them are authorization bypasses or information disclosure vulnerabilities that allow one user to log in as somebody else or to access the private data of another user.

Using these taskflows, we’ve reported more than 80 vulnerabilities so far. At the time of writing, approximately 20 of them have already been disclosed, and we’re continually updating our advisories page as new vulnerabilities are disclosed. In this blog post, we’ll show a few concrete examples of high-impact vulnerabilities found by these taskflows, like accessing personally identifiable information (PII) in the shopping carts of ecommerce applications or signing in with any password to a chat application.

We’ll also explain how the taskflows work, so you can learn how to write your own. The security community moves faster when it shares knowledge, which is why we’ve made the framework open source and easy to run on your own project. The more teams using and contributing to it, the faster we collectively eliminate vulnerabilities.

How to run the taskflows on your own project

Want to get started right away? The taskflows are open source and easy to run yourself! Please note: a GitHub Copilot license is required, and the prompts use premium model requests. Running the taskflows can result in many tool calls, which can easily consume a large amount of quota.

  1. Go to the seclab-taskflows repository and start a codespace.
  2. Wait a few minutes for the codespace to initialize.
  3. In the terminal, run ./scripts/audit/run_audit.sh myorg/myrepo

It might take an hour or two to finish on a medium-sized repository. When it finishes, it’ll open an SQLite viewer with the results. Open the “audit_results” table and look for rows with a check-mark in the “has_vulnerability” column.

Tip: Due to the non-deterministic nature of LLMs, it is worthwhile to perform multiple runs of these audit taskflows on the same codebase. In certain cases, a second run can lead to entirely different results. In addition to this, you might perform those two runs using different models (e.g., the first using GPT 5.2 and the second using Claude Opus 4.6).

The taskflows also work on private repos, but you’ll need to modify the codespace configuration to do so because it won’t allow access to your private repos by default.

Introduction to taskflows

Taskflows are YAML files that describe a series of tasks that we want to do with an LLM. With them, we can write prompts to complete different tasks and have tasks that depend on each other. The seclab-taskflow-agent framework takes care of running the tasks sequentially and passing the results from one task to the next.
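As an illustration, a minimal taskflow might look something like the following. This is a hypothetical sketch of the general shape only; the field names are assumptions rather than the seclab-taskflow-agent’s actual schema, so refer to the seclab-taskflows repository for real examples.

```yaml
# Hypothetical taskflow sketch (illustrative only; field names are
# assumptions, not the seclab-taskflow-agent's actual schema).
tasks:
  - name: identify_components
    prompt: |
      Inspect the repository's source code and documentation and
      divide it into components by functionality.
    output: repo_context.db        # results stored for later tasks

  - name: identify_entry_points
    depends_on: identify_components
    prompt: |
      For each component, list the entry points that take untrusted input.
    output: repo_context.db
```

The key idea is the dependency chain: each task’s results land in the database, where downstream tasks can read them as context.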

For example, when auditing a repository, we first divide the repository into different components according to their functionality. Then, for each component, we collect information such as the entry points where it takes untrusted input, its intended privileges, and its purpose. These results are then stored in a database to provide context for subsequent tasks.

Based on the context data, we can then create different auditing tasks. Currently, we have a task that suggests some generic issues for each component and another task that carefully audits each suggested issue. However, it’s also possible to create other tasks, such as tasks that focus on a specific type of issue.

These become a list of tasks we specify in a taskflow file.

Diagram of the auditing taskflows, showing the context gathering taskflow communicating with different auditing taskflows via a database named repo_context.db.

We use tasks instead of one big prompt because LLMs have limited context windows, and complex, multi-step tasks are often not completed properly. For example, some steps can be left out. Even though some LLMs have larger context windows, we find that taskflows are still useful in providing a way for us to control and debug the tasks, as well as for accomplishing bigger and more complex projects.

The seclab-taskflow-agent can also run the same task across many components asynchronously (like a for loop). During audits, we often reuse the same prompt and task for every component, varying only the details. The seclab-taskflow-agent lets us define templated prompts, iterate through components, and substitute component-specific details as it runs.
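To make this concrete, the substitution step can be sketched as simple template expansion. This is an illustrative sketch only; the `{{...}}` syntax and the component fields here are hypothetical, not the seclab-taskflow-agent’s actual implementation.

```javascript
// Hypothetical sketch of per-component prompt templating (the {{...}}
// placeholder syntax and component fields are assumptions, not the
// agent's real format).
const promptTemplate =
  'Audit the "{{name}}" component. Known entry points: {{entryPoints}}.';

const components = [
  { name: 'server', entryPoints: 'POST /login, GET /users/:id' },
  { name: 'cli', entryPoints: 'argv, stdin' },
];

// Substitute each {{key}} placeholder with the component's value.
function renderPrompt(template, component) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => component[key]);
}

// One concrete prompt per component, ready to run as parallel tasks.
const prompts = components.map((c) => renderPrompt(promptTemplate, c));
```

Each rendered prompt carries only that component’s details, which keeps the per-task context small even when the repository has many components.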

Taskflows for general security code audits

After using seclab-taskflow-agent to triage CodeQL alerts, we decided we didn’t want to restrict ourselves to specific types of vulnerabilities and started to explore using the framework for more general security auditing. The main challenge in giving LLMs more freedom is the possibility of hallucinations and an increase in false positives. After all, the success with triaging CodeQL alerts was partly due to the fact that we gave the LLM a very strict and well-defined set of instructions and criteria, so the results could be verified at each stage to see if the instructions were followed. 

So our goal here was to find a good way to allow the LLM the freedom to look for different types of vulnerabilities while keeping hallucinations under control.

We’re going to show how we used agent taskflows to discover high-impact vulnerabilities with a high true-positive rate using just taskflow design and prompt engineering.

General taskflow design

To minimize hallucinations and false positives at the taskflow design level, our taskflow starts with a threat modeling stage, in which a repository is divided into components by functionality and information such as the entry points and intended use of each component is collected. This information helps us determine the security boundary of each component and how much exposure it has to untrusted input.

The information collected during the threat modeling stage is then used to determine the security boundary of each component and to decide what should be considered a security issue. For example, a command injection in a CLI tool designed to execute any user-supplied script may be a bug but not a security vulnerability, since an attacker who can inject a command through the CLI tool can already execute arbitrary scripts.

At the level of prompts, the intended use and security boundary discovered in this stage are used to provide strict guidelines on whether a finding should be considered a vulnerability.

You need to take into account of the intention and threat model of the component in component notes to determine if an issue is a valid security issue or if it is an intended functionality. You can fetch entry points, web entry points and user actions to help you determine the intended usage of the component.

Asking an LLM something as vague as “look for any type of vulnerability anywhere in the code base” would give poor results with many hallucinated issues. Ideally, we’d like to simulate the triage environment, where we have some potential issues as the starting point of analysis and ask the LLM to apply rigorous criteria to determine whether each potential issue is valid.

To bootstrap this process, we break the auditing task into two steps. 

  • First, we ask the LLM to go through each component of the repository and suggest types of vulnerabilities that are more likely to appear in the component. 
  • These suggestions are then passed to another task, where they will be audited according to rigorous criteria. 

In this setup, the suggestions from the first step act like inaccurate vulnerability alerts flagged by an “external tool,” while the second step serves as a triage step. While this may look like a self-validating process, breaking it down into two steps, each with a fresh context and different prompts, allows the second step to provide an accurate assessment of the suggestions.

We’ll now go through these tasks in detail.

Threat modeling stage

When triaging alerts flagged by automatic code scanning tools, we found that a large proportion of false positives are the result of improper threat modeling. Most static analysis tools do not take into account the intended usage and security boundary of the source code and often give results that have no security implications. For example, in a reverse proxy application, many SSRF (server-side request forgery) vulnerabilities flagged by automated tools are likely to fall within the intended use of the application, while some web services used, for example, in continuous integration pipelines are designed to execute arbitrary code and scripts within a sandboxed environment. Remote code execution vulnerabilities in these applications without a sandbox escape are generally not considered a security risk.

Given these caveats, it pays to first go through the source code to get an understanding of the functionalities and intended purpose of code. We divide this process into the following tasks: 

  • Identify applications: A GitHub repository is an imperfect boundary for auditing: It may be a single component within a larger system or contain multiple components, so it’s worth identifying and auditing each component separately to match distinct security boundaries and keep scope manageable. We do this with the identify_applications taskflow, which asks the LLM to inspect the repository’s source code and documentation and divide it into components by functionality. 
  • Identify entry points: We identify how each entry point is exposed to untrusted inputs to better gauge risk and anticipate likely vulnerabilities. Because “untrusted input” varies significantly between libraries and applications, we provide separate guidelines for each case.
  • Identify web entry points: This is an extra step to gather further information about entry points in the application and append information that is specific to web application entry points such as noting the HTTP method and paths that are required to access a certain endpoint. 
  • Identify user actions: We have the LLM review the code and identify what functionality a user can access under normal operation. This clarifies the user’s baseline privileges, helps assess whether vulnerabilities could enable privilege gains, and informs the component’s security boundary and threat model, with separate instructions depending on whether the component is a library or an application. 

At each of the above steps, information gathered about the repository is stored in a database. This includes components in the repository, their entry points, web entry points, and intended usage. This information is then available for use in the next stage.

Issue suggestion stage

At this stage, we instruct the LLM to suggest some types of vulnerabilities, or a general area of high security risk, for each component based on the information about the entry points and intended use of the component gathered in the previous step. In particular, we put emphasis on the intended usage of the component and its risk from untrusted input:

Base your decision on:
- Is this component likely to take untrusted user input? For example, remote web request or IPC, RPC calls?
- What is the intended purpose of this component and its functionality? Does it allow high privileged action?
Is it intended to provide such functionalities for all user? Or is there complex access control logic involved?
- The component itself may also have its own `README.md` (or a subdirectory of it may have a `README.md`). Take a look at those files to help understand the functionality of the component.

We also explicitly instruct the LLM to not suggest issues that are of low severity or are generally considered non-security issues.

However, you should still take care not to include issues that are of low severity or requires unrealistic attack scenario such as misconfiguration or an already compromised system.

In general, we keep this stage relatively free of restrictions and allow the LLM freedom to explore and suggest different types of vulnerabilities and potential security issues. The idea is to have a reasonable set of focus areas and vulnerability types for the actual auditing task to use as a starting point.

One problem we ran into was that the LLM would sometimes start auditing the issues that it suggested, which would defeat the purpose of the brainstorming phase. To prevent this, we instructed the LLM to not audit the issues.

Issue audit stage

This is the final stage of the taskflows. Once we’ve gathered all the information we need about the repository and have suggested some types of vulnerabilities and security risks to focus on, the taskflow goes through each suggested issue and audits it against the source code. At this stage, the task starts with a fresh context to scrutinize the issues suggested in the previous stage. The suggestions are considered unvalidated, and this taskflow is instructed to verify them:

The issues suggested have not been properly verified and are only suggested because they are common issues in these types of application. Your task is to audit the source code to check if this type of issues is present.

To avoid the LLM coming up with issues that are non-security related in the context of the component, we once again emphasize that intended usage must be taken into consideration.

You need to take into account of the intention and threat model of the component in component notes to determine if an issue is a valid security issue or if it is an intended functionality.

To avoid the LLM hallucinating issues that are unrealistic, we also instruct it to provide a concrete and realistic attack scenario and to only consider issues that stem from errors in the source code:

Do not consider scenarios where authentication is bypassed via stolen credential etc. We only consider situations that are achievable from within the source code itself.
...
If you believe there is a vulnerability, then you must include a realistic attack scenario, with details of all the file and line included, and also what an attacker can gain by exploiting the vulnerability. Only consider the issue a vulnerability if an attacker can gain privilege by performing an action that is not intended by the component.

To further reduce hallucinations, we also instruct the LLM to provide concrete evidence from the source code, with file path and line information:

Keep a record of the audit notes, be sure to include all relevant file path and line number. Just stating an end point, e.g. `IDOR in user update/delete endpoints (PUT /user/:id)` is not sufficient. I need to have the file and line number.

Finally, we also instruct the LLM that it is possible that there is no vulnerability in the component and that it should not make things up:

Remember, the issues suggested are only speculation and there may not be a vulnerability at all and it is ok to conclude that there is no security issue.

The emphasis of this stage is to provide accurate results while following strict guidelines—and to provide concrete evidence of the findings. With all these strict instructions in place, the LLM indeed rejects many unrealistic and unexploitable suggestions with very few hallucinations. 

The first prototype was designed with hallucination prevention as a priority, which raised a question: Would it become too conservative, rejecting most vulnerability candidates and failing to surface real issues?

The answer is clear after we ran the taskflow on a few repositories.

Three examples of vulnerabilities found by the taskflows

In this section, we’ll show three examples of vulnerabilities that were found by the taskflows and that have already been disclosed. In total, we have found and reported over 80 vulnerabilities so far. We publish all disclosed vulnerabilities on our advisories page.

Privilege escalation in Outline (CVE-2025-64487)

Our information-gathering taskflows are optimized toward web applications, which is why we first pointed our audit taskflows to a collaborative web application called Outline.

Outline is a multi-user collaboration suite with properties we were especially interested in: 

  • Documents have owners and different visibility levels, with permissions per user and team.
  • Access rules like that are hard to analyze with a Static Application Security Testing (SAST) tool, since they use custom access mechanisms and existing SAST tools typically don’t know what actions a normal “user” should be able to perform. 
  • Such permission schemes are often also hard to analyze for humans by only reading the source code (if you didn’t create the scheme yourself, that is).
Screenshot showing an opened document in Outline. Outline is a collaborative web application.

And success: Our taskflows found a bug in the authorization logic on the very first run!

The notes in the audit results read like this:

Audit target: Improper membership management authorization in component server (backend API) of outline/outline (component id 2).

Summary conclusion: A real privilege escalation vulnerability exists. The document group membership modification endpoints (documents.add_group, documents.remove_group) authorize with the weaker "update" permission instead of the stronger "manageUsers" permission that is required for user membership changes. Because "update" can be satisfied by having only a ReadWrite membership on the document, a non‑admin document collaborator can grant (or revoke) group memberships – including granting Admin permission – thereby escalating their own privileges (if they are in the added group) and those of other group members. This allows actions (manageUsers, archive, delete, etc.) that were not intended for a mere ReadWrite collaborator.

Reading the TypeScript-based source code and verifying this finding on a test instance revealed that it was exploitable exactly as described. In addition, the described steps to exploit this vulnerability were on point:

Prerequisites:
- Attacker is a normal team member (not admin), not a guest, with direct ReadWrite membership on Document D (or via a group that grants ReadWrite) but NOT Admin.
- Attacker is a member of an existing group G in the same team (they do not need to be an admin of G; group read access is sufficient per group policy).

Steps:
1. Attacker calls POST documents.add_group (server/routes/api/documents/documents.ts lines 1875-1926) with body:
   {
     "id": "<document-D-id>",
     "groupId": "<group-G-id>",
     "permission": "admin"
   }
2. Authorization path:
   - Line 1896: authorize(user, "update", document) succeeds because attacker has ReadWrite membership (document.ts lines 96-99 allow update).
   - Line 1897: authorize(user, "read", group) succeeds for any non-guest same-team user (group.ts lines 27-33).
   No "manageUsers" check occurs.
3. Code creates or updates GroupMembership with permission Admin (lines 1899-1919).
4. Because attacker is a member of group G, their effective document permission (via groupMembership) now includes DocumentPermission.Admin.
5. With Admin membership, attacker now satisfies includesMembership(Admin) used in:
   - manageUsers (document.ts lines 123-134) enabling adding/removing arbitrary users via documents.add_user / documents.remove_user (lines 1747-1827, 1830-1872).
   - archive/unarchive/delete (document.ts archive policy lines 241-252; delete lines 198-208) enabling content integrity impact.
   - duplicate, move, other admin-like abilities (e.g., duplicate policy lines 136-153; move lines 155-170) beyond original ReadWrite scope.

Using these instructions, a low-privileged user could add arbitrary groups to a document that the user was only allowed to update, without possessing the “manageUsers” permission that was normally required for such changes.

In this sample, the group “Support” was added to the document by the low-privileged user named “gg.”

A screenshot of the share/document permissions functionality in Outline. The group “Support” was added by the “gg@test.test” user without having enough permissions for that action.

The Outline project fixed this and another issue we reported within three days! (Repo advisory)

The shopping cartocalypse (CVE-2025-15033, CVE-2026-25758)

We didn’t realize what systematic issues we’d uncover in the cart logic of ecommerce applications until we pointed our taskflows at the first online shop in our list. In the PHP-based WooCommerce project, the taskflows promptly found a way for normally signed-in shop users to view all guest orders, including personally identifiable information such as names, addresses, and phone numbers. After we reported this, Automattic (the company behind WooCommerce) quickly released an update (CVE-2025-15033) and an accompanying blog post.

Intrigued by that vulnerability, we’ve added additional ecommerce applications to our list of applications to be audited by our agent. And sure enough, we found more vulnerabilities. The popular Ruby-based Spree commerce application contained two similar vulnerabilities (CVE-2026-25758 and CVE-2026-25757). The more critical one allowed unauthenticated users to simply enumerate the addresses (and phone numbers) of all guest orders by more or less incrementing a sequential number.

In this screenshot, the attacker “test66” linked their session to an existing address of a guest user, thus being able to view the full address and phone number.

An unauthenticated guest user revealing addresses of other guest users by enumerating through them.

Our bug-hunting spree didn’t stop with Spree. Our taskflows uncovered similar issues in two additional ecommerce applications.

These authorization logic bugs had been undiscovered for years.

Signing in to Rocket.Chat using any password (CVE-2026-28514)

(This is not what passwordless authentication should look like!)

Every so often you can’t believe your eyes. This finding reported by our taskflows in Rocket.Chat was one of those moments.

When your agent comes back with a note like this:

VULNERABILITY: password authentication bypass in account-service allows logging in as any user with a password set.

You might find it hard to believe at first.

When you then continue reading the output:

Root cause:
- ee/apps/account-service/src/lib/utils.ts:60-61: `validatePassword` returns `Promise<boolean>` (bcrypt.compare(...)).
- ee/apps/account-service/src/lib/loginViaUsername.ts:18-21: `const valid = user.services?.password?.bcrypt && validatePassword(password, user.services.password.bcrypt);` but does NOT `await` the Promise; since a Promise is truthy, `if (!valid) return false;` is never triggered when bcrypt hash exists.
- ee/apps/account-service/src/lib/loginViaUsername.ts:23-35: proceeds to mint a new login token and saves it, returning `{ uid, token, hashedToken, tokenExpires }`.

It might make more sense, but you’re still not convinced.

It turns out the suspected finding is in the microservices-based setup of Rocket.Chat. In that particular setup, Rocket.Chat exposes its user account service via its DDP Streamer service.

Rocket.Chat’s microservices deployment. Copyright Rocket.Chat. (This architecture diagram is from Rocket.Chat’s documentation.)

Once our Rocket.Chat test setup was working properly, we had to write proof of concept code to exploit this potential vulnerability. The notes of the agent already contained the JSON construct that we could use to connect to the endpoint using Meteor’s DDP protocol.

We connected to the WebSocket endpoint for the DDP streamer service, and yes: It was truly possible to log in to the exposed Rocket.Chat DDP service using any password. Once signed in, it was also possible to perform other operations, such as connecting to arbitrary chat channels and listening for messages sent to those channels.

Here we received the message “HELLO WORLD!!!” while listening on the “General” channel.

The proof of concept code connected to the DDP streamer endpoint received “HELLO WORLD!!!” in the general channel.

The technical details of this issue are interesting (and scary as well). Rocket.Chat, primarily a TypeScript-based web application, uses bcrypt to store local user passwords. The bcrypt.compare function (used to compare a password against its stored hash) returns a Promise—a fact that is reflected in Rocket.Chat’s own validatePassword function, which returns Promise<boolean>:

export const validatePassword = (password: string, bcryptPassword: string): Promise<boolean> =>
    bcrypt.compare(getPassword(password), bcryptPassword);

However, when that function was used, the returned Promise was never settled (e.g., by adding an await keyword in front of validatePassword):

const valid = user.services?.password?.bcrypt && validatePassword(password, user.services.password.bcrypt);

if (!valid) {
    return false;
}

This led to the truthy bcrypt hash being ANDed with a pending Promise. Since a Promise is always “truthy” in JavaScript terms, valid was subsequently always truthy when a user had a bcrypt password set, no matter which password was supplied.
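The bug is easy to reproduce in isolation. The sketch below uses a hypothetical stand-in for bcrypt.compare (so it runs without any dependencies), but the control flow mirrors the vulnerable pattern: the unawaited Promise makes the check pass for any password.

```javascript
// Hypothetical stand-in for bcrypt.compare: async, resolves to false
// for a wrong password (this is NOT Rocket.Chat's actual code).
async function validatePassword(password, storedHash) {
  return password === 'the real password'; // pretend hash comparison
}

function buggyLogin(password) {
  const hasBcryptHash = true; // user.services?.password?.bcrypt is set
  // Bug: validatePassword is not awaited, so `valid` is a pending Promise.
  const valid = hasBcryptHash && validatePassword(password, 'stored-hash');
  if (!valid) return false; // never taken: a Promise is always truthy
  return true; // a login token would be minted here
}

async function fixedLogin(password) {
  const hasBcryptHash = true;
  // Fix: await settles the Promise to an actual boolean.
  const valid =
    hasBcryptHash && (await validatePassword(password, 'stored-hash'));
  if (!valid) return false;
  return true;
}

console.log(buggyLogin('anything at all')); // true: authentication bypassed
fixedLogin('anything at all').then((ok) => console.log(ok)); // false: rejected
```

A linter rule such as requiring Promise-returning calls to be awaited in boolean contexts would have flagged this pattern immediately.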

Severity aside, it’s fascinating that the LLM was able to pick up this rather subtle bug, follow it through multiple files, and arrive at the correct conclusion.

What we learned

After running the taskflows over 40 repositories—mostly multi-user web applications—the LLM suggested 1,003 issues (potential vulnerabilities). 

After the audit stage, 139 were marked as having vulnerabilities, meaning that the LLM decided they were exploitable. After deduplicating the issues—duplicates happen because each repository is run a couple of times on average and the results are aggregated—we ended up with 91 vulnerabilities, which we decided to manually inspect before reporting.

  • We rejected 20 (22%) results as FP: False Positives that we couldn’t reproduce manually.
  • We rejected 52 (57%) results as low severity: Issues that have very limited potential impact (e.g., blind SSRF with only an HTTP status code returned, issues that require a malicious admin during the installation stage, etc.).
  • We kept only 19 (21%) results that we considered vulnerabilities impactful enough to report, all serious vulnerabilities with the majority having a high or critical severity (e.g., vulnerabilities that can be triggered without specific requirements with impact to confidentiality or integrity, such as disclosure of personal data, overwriting of system settings, account takeover, etc.). 

This data was collected using gpt-5.x as the model for code analysis and audit tasks.

Note that we have run the taskflows on more repositories since this data was collected, so this table does not represent all the data we’ve collected and all vulnerabilities we’ve reported.

Issue category | All | Has vulnerability | Vulnerability rate
IDOR/Access control issue | 241 | 38 | 15.8%
XSS | 131 | 17 | 13.0%
CSRF | 110 | 17 | 15.5%
Authentication issue | 91 | 15 | 16.5%
Security misconfiguration | 75 | 13 | 17.3%
Path traversal | 61 | 10 | 16.4%
SSRF | 45 | 7 | 15.6%
Command injection | 39 | 5 | 12.8%
Remote code execution | 24 | 1 | 4.2%
Business logic issue | 24 | 6 | 25.0%
Template injection | 24 | 1 | 4.2%
File upload handling issues (excludes path traversal) | 18 | 2 | 11.1%
Insecure deserialization | 17 | 0 | 0.0%
Open redirect | 16 | 0 | 0.0%
SQL injection | 9 | 0 | 0.0%
Sensitive data exposure | 8 | 0 | 0.0%
XXE | 4 | 0 | 0.0%
Memory safety | 3 | 0 | 0.0%
Others | 66 | 7 | 10.6%

If we divide the findings into two rough categories—logical issues (IDOR, authentication, security misconfiguration, business logic issues, sensitive data exposure) and technical issues (XSS, CSRF, path traversal, SSRF, command injection, remote code execution, template injection, file upload issues, insecure deserialization, open redirect, SQL injection, XXE, memory safety)—we get 439 logical issues and 501 technical issues. Although more technical issues were suggested, the difference isn’t significant because some broad categories (such as remote code execution and file upload issues) can also involve logical issues depending on the attacker scenario.

There are only three suggested issues that concern memory safety. This isn’t too surprising, given the majority of the repositories tested are written in memory-safe languages. But we also suspect that the current taskflows may not be very efficient in finding memory-safety issues, especially when comparing to other automated tools such as fuzzers. This is an interesting area that can be improved by creating more specific taskflows and making more tools, like fuzzers, available to the LLM.

This data led us to the following observations.

LLMs are particularly good at finding logic bugs

What stands out from the data is the 25% rate for “Business logic issue” and the large number of IDOR issues. In fact, the total number of IDOR issues flagged as vulnerable is more than the next two categories (XSS and CSRF) combined. Overall, we get the impression that the LLM does an excellent job of understanding the codebase and following the control flow while taking into account the access control model and intended usage of the application, which is more or less what we’d expect from LLMs that excel at tasks like code review. This also makes it great for finding logic bugs that are difficult to find with traditional tools.

LLMs are good at rejecting low-severity issues and false positives

Curiously, none of the false positives are what we’d consider to be hallucinations. All the reports, including the false positives, have sound evidence backing them up, and we were able to follow each report to locate the endpoints and apply the suggested payload. Many of the false positives are due to more complex circumstances beyond what is available in the code, such as browser mitigations for XSS issues, or are what we would consider genuine mistakes that a human auditor is also likely to make. For example, when multiple layers of authentication are in place, the LLM could sometimes miss some of the checks, resulting in false positives.

We have since tested more repositories with more vulnerabilities reported, but the ratio between vulnerabilities and repositories remains roughly the same.

To demonstrate the extensibility of taskflows and how extra information can be incorporated into the taskflows, we created a new taskflow to run after the audit stage, which incorporates our new-found knowledge to filter out low-severity vulnerabilities. We found that the taskflow can filter out roughly 50% of the low-severity vulnerabilities with a couple of borderline vulnerabilities that we reported also getting marked as low severity. The taskflow and the prompt can be adjusted to fit the user’s own preference, but for us, we’re happy to make it more inclusive so we don’t miss out on anything impactful.

LLMs are good at threat modeling

The LLM performs well at threat modeling in general. During the experiment, we tested it on a number of applications with different threat models, such as desktop applications, multi-tenant web applications, applications that are designed to run code in sandbox environments (code injection by design), and reverse proxy applications (where SSRF-like behavior is intended). The taskflow is able to take into account the intended usage of these applications and make sound decisions. It struggles most with threat modeling of desktop applications, as it is often unclear whether other processes running on the user’s desktop should be considered trusted.

We’ve also observed some remarkable reasoning by the LLM that excludes issues with no privilege gains. For example, in one case, the LLM noticed that while there are inconsistencies in access control, the issue does not give the attacker any advantages over a manual copy and paste action:

Security impact assessment:

A user possessing only read access to a document (no update rights) can duplicate it provided they also have updateDocument rights on the destination collection. This allows creation of a new editable copy of content they could already read. This does NOT grant additional access to other documents nor bypass protections on the original; any user with read access could manually copy-paste the content into a new document they are permitted to create (creation generally allowed for non-guest, non-viewer members in ReadWrite collections per createDocument collection policy)

We’ve also seen more sophisticated techniques used in the reasoning. For example, in one application that runs scripts in a sandboxed Node.js environment, the LLM suggested the following technique to escape the sandbox:

In Node’s vm, passing any outer-realm function into a contextified sandbox leaks that function’s outer-realm Function constructor through the `constructor` property. From inside the sandbox:
  const F = console.log.constructor; // outer-realm Function
  const hostProcess = F('return process')(); // host process object
  // Bypass module allowlist via host dynamic import
  const cp = await F('return import("node:child_process")')();
  const out = cp.execSync('id').toString();
  return [{ json: { out } }];

The presence of host functions (console.log, timers, require, RPC methods) is sufficient to obtain the host Function constructor and escape the sandbox. The allowlist in require-resolver is bypassed by constructing host-realm functions and using dynamic import of built-in modules (e.g., node:child_process), which does not go through the sandbox’s custom require.

While the result turns out to be a false positive due to other mitigating factors, it demonstrates the LLM’s technical knowledge.

Get involved!

The taskflows we used to find these vulnerabilities are open source and easy to run on your own project, so we hope you’ll give them a try! We also want to encourage you to write your own taskflows. The results showcased in this blog post are just small examples of what’s possible. There are other types of vulnerabilities to find, and there are other security-related problems, like triaging SAST results or building development setups, which we think taskflows can help with. Let us know what you’re building by starting a discussion on our repo!

The post How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework appeared first on The GitHub Blog.

What to expect for open source in 2026 https://github.blog/open-source/maintainers/what-to-expect-for-open-source-in-2026/ Wed, 18 Feb 2026 18:41:42 +0000 https://github.blog/?p=93939 Let’s dig into the 2025’s open source data on GitHub to see what we can learn about the future.

The post What to expect for open source in 2026 appeared first on The GitHub Blog.


Over the years (decades), open source has grown and changed along with software development, evolving as the open source community becomes more global.

But with any growth comes pain points. In order for open source to continue to thrive, it’s important for us to be aware of these challenges and determine how to overcome them.

To that end, let’s take a look at what Octoverse 2025 reveals about the direction open source is taking. Feel free to check out the full Octoverse report, and make your own predictions.

Growth that’s global in scope

In 2025, GitHub saw about 36 million new developers join our community. While that number alone is huge, it’s also important to see where in the world that growth comes from. India added 5.2 million developers, and there was significant growth across Brazil, Indonesia, Japan, and Germany. 

What does this mean? It’s clear that open source is becoming more global than it was before. It also means that oftentimes, the majority of developers live outside the regions where the projects they’re working on originated. This is a fundamental shift. While there have always been projects with global contributors, it’s now starting to become a reality for a greater number of projects.

Given this global scale, open source can’t rely on contributors sharing work hours, communication strategies, cultural expectations, or even language. The projects that are going to thrive are the ones that support the global community.

One of the best ways to do this is through explicit communication maintained in areas like contribution guidelines, codes of conduct, review expectations, and governance documentation. These are essential infrastructure for large projects that want to support this community. Projects that don’t include these guidelines will have trouble scaling as the number of contributors increases across the globe. Those that do provide them will be more resilient, sustainable, and will provide an easier path to onboard new contributors.

The double-edged sword of AI

AI has had a major role in accelerating global participation over 2025. It’s created a pathway that makes it easier for new developers to enter the coding world by dramatically lowering the barrier to entry. It helps contributors understand unfamiliar codebases, draft patches, and even create new projects from scratch. Ultimately, it has helped new developers make their first contributions sooner.

However, it has also created a lot of noise, or what is called “AI slop”: a large quantity of low-quality—and oftentimes inaccurate—contributions that don’t add value to a project, or contributions that would require so much work to incorporate that it would be faster to implement the solution yourself.

This makes it harder than ever to maintain projects and keep them moving in the intended direction. Auto-generated issues and pull requests increase volume without always increasing quality. As a result, maintainers need to spend more time reviewing contributions from developers with widely varying skill levels. In many cases, the time it takes to review these additional suggestions has grown faster than the number of maintainers.

Even if you remove AI slop from the equation, the sheer volume of contributions has grown, potentially to unmanageable levels. It can feel like a denial of service attack on human attention.

This is why maintainers have been asking: how do you sift through the noise and find the most important contributions? Luckily, we’ve added some tools to help. There are also a number of open source AI projects specifically trying to address the AI slop issue. In addition, maintainers have been using AI defensively, using it to triage issues, detect duplicate issues, and handle simple maintenance like the labeling of issues. By helping to offload some of the grunt work, it gives maintainers more time to focus on the issues that require human intervention and decision making.
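
One building block of that defensive triage, duplicate-issue detection, can start from something as cheap as token overlap between titles. A minimal sketch (a hypothetical helper for illustration, not a GitHub feature or any specific project’s tool):

```javascript
// Hypothetical duplicate-issue heuristic: Jaccard similarity over title tokens.
// Real triage bots typically add embeddings or fuzzy matching on the issue body.
function titleSimilarity(a, b) {
  const tokens = s => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const A = tokens(a);
  const B = tokens(b);
  const intersection = [...A].filter(t => B.has(t)).length;
  const union = new Set([...A, ...B]).size;
  return union === 0 ? 0 : intersection / union;
}

// Pairs scoring above some threshold (say 0.6) get flagged for human review.
console.log(titleSimilarity('Crash on startup', 'App crash on startup')); // 0.75
```

Even a crude signal like this lets a maintainer review likely duplicates together instead of one at a time.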

Expect the open source projects that continue to expand and grow over the next year to be those that incorporate AI as part of their community infrastructure. To deal with this quantity of information, AI cannot be just a coding assistant: it needs to ease the pressure of being a maintainer and make that work more scalable.

Record growth is healthy, if it’s planned for

On the surface, record global growth looks like success. But this influx of newer developers can also be a burden. The sheer popularity of projects that cover the basics, such as contributing your first pull request on GitHub, shows that many of these new developers are still early in their comfort with open source. There’s uncertainty about how to move forward and how to interact with the community, not to mention challenges with repetitive onboarding questions and duplicate issues.

This results in a growing gap between the number of participants in open source projects and the number of maintainers with a sense of ownership. As new developers grow at record rates, this gap will widen.

The way to address this is going to be less about having individuals serving as mentors—although that will still be important. It will be more about creating durable systems that show organizational maturity. What does this mean? While not an exhaustive list, here are some items:

  • Having a clear, defined path to move from contributor to reviewer to maintainer. Be aware that this can be difficult without a mentor to help guide along this path.
  • Shared governance models that don’t rely on a single timezone or small group of people.
  • Documentation that provides guidance on how to contribute and the goals of the project.

By helping to make sure that the number of maintainers keeps relative pace with the number of contributors, projects will be able to take advantage of the record growth. This does create an additional burden on the current maintainers, but the goal is to invest in a solid foundation that will result in a more stable structure in the future. Projects that don’t do this will have trouble functioning at the increased global scale and might start to stall or see problems like increased technical debt.

But what are people building?

It can’t be denied that AI was a major focus—about 60% of the top growing projects were AI focused. However, there were several that had nothing to do with AI. These projects (e.g., Home Assistant, VS Code, Godot) continue to thrive because they meet real needs and support broad, international communities.

A list of the fastest-growing open source projects by contribution: zen-browser/desktop, cline/cline, vllm-project/vllm, astral-sh/uv, microsoft/vscode, infiniflow/ragflow, sgl-project/sglang, continuedev/continue, comfyanonymous/ComfyUI, and home-assistant/core.

Just as the developer space is growing on a global scale, the same can be said about the projects that garner the most interest. These types of projects that support a global community and address their needs are going to continue to be popular and have the most support. 

This just continues to reinforce how open source is really embracing being a global phenomenon as opposed to a local one.

What this year will likely hold

Open source in 2026 won’t be defined by a single trend that emerged over 2025. Instead, it will be shaped by how the community responds to the pressures identified over the last year, particularly with the surge in AI and an explosively growing global community.

For developers, this means that it’s important to invest in processes as much as code. Open source is scaling in ways that would have been impossible to imagine a decade ago, and the important question going forward isn’t how much it will grow—it’s how you can make that growth sustainable.

Read the full Octoverse report >

Securing the AI software supply chain: Security results across 67 open source projects https://github.blog/open-source/maintainers/securing-the-ai-software-supply-chain-security-results-across-67-open-source-projects/ Tue, 17 Feb 2026 19:00:00 +0000 https://github.blog/?p=93831 Learn how The GitHub Secure Open Source Fund helped 67 critical AI‑stack projects accelerate fixes, strengthen ecosystems, and advance open source resilience.

The post Securing the AI software supply chain: Security results across 67 open source projects appeared first on The GitHub Blog.


Modern software is built on open source projects. In fact, you can trace almost any production system today, including AI, mobile, cloud, and embedded workloads, back to open source components. These components are the invisible infrastructure of software: the download that always works, the library you never question, the build step you haven’t thought about in years, if ever.

A few examples:

  • curl moves data for billions of systems, from package managers to CI pipelines.
  • Python, pandas, and SciPy sit underneath everything from LLM research to ETL workflows and model evaluation.
  • Node.js, LLVM, and Jenkins shape how software is compiled, tested, and shipped across industries.

When these projects are secure, teams can adopt automation, AI‑enhanced tooling, and faster release cycles without adding risk or slowing down development. When they aren’t, the blast radius crosses project boundaries, propagating through registries, clouds, transitive dependencies, and production systems, including AI systems, that react far faster than traditional workflows.

Securing this layer is not only about preventing incidents; it’s about giving developers confidence that the systems they depend on—whether for model training, CI/CD, or core runtime behavior—are operating on hardened, trustworthy foundations. Open source is shared industrial infrastructure that deserves real investment and measurable outcomes.

That is the mission of the GitHub Secure Open Source Fund: to secure open source projects that underpin the digital supply chain, catalyze innovation, and are critical to the modern AI stack. 

We do this by directly linking funding to verified security outcomes and by giving maintainers resources, hands‑on security training, and a security community where they can raise their highest‑risk concerns and get expert feedback. 

Why securing critical open source projects matters 

A single production service can depend on hundreds or even thousands of transitive dependencies. As Log4Shell demonstrated, when one widely used project is compromised, the impact is rarely confined to a single application or company.

Investing in the security of widely used open source projects does three things at once:

  • It reinforces that security is a baseline requirement for modern software, not optional labor.
  • It gives maintainers time, resources, and support to perform proactive security work.
  • It reduces systemic risk across the global software supply chain.

This security work benefits everyone who writes, ships, or operates code, even if they never interact directly with the projects involved. The gap between that broad benefit and the limited support maintainers typically receive is exactly what the GitHub Secure Open Source Fund was built to close. In Sessions 1 and 2, 71 projects made significant security improvements. In Session 3, 67 open source projects delivered concrete security improvements to reduce systemic risk across the software supply chain.


Session 3, by the numbers

  • 67 projects
  • 98 maintainers
  • $670,000 in non-dilutive funding powered by GitHub Sponsors
  • 99% of projects completed the program with core GitHub security features enabled

Real security results across all sessions:

  • 138 projects
  • 219 maintainers
  • 38 countries represented by participating projects
  • $1.38M in non-dilutive funding powered by GitHub Sponsors
  • 191 new CVEs issued
  • 250+ new secrets prevented from being leaked
  • 600+ leaked secrets detected and resolved
  • Billions of monthly downloads powered by alumni projects

Plus, in just the last 6 months:

  • 500+ CodeQL alerts fixed
  • 66 secrets blocked

Where security work happened in Session 3

Session 3 focused on improving security across the systems developers rely on every day. The projects below are grouped by the role they play in the software ecosystem.

Core programming languages and runtimes 🤖

CPython • Himmelblau • LLVM • Node.js • Rustls

These projects define how software is written and executed. Improvements here flow downstream to entire ecosystems.

This group includes CPython, Node.js, LLVM, Rustls, and related tooling that shapes compilation, execution, and cryptography at scale.

Quote from Node: GitHub SOSF trailblazed critical security knowledge for Open Source in the AI era.

For example, improvements to CPython directly benefit millions of developers who rely on Python for application development, automation, and AI workloads. LLVM maintainers identified security improvements that complement existing investments and reduce risk across toolchains used throughout the industry.

When language runtimes improve their security posture, everything built on top of them inherits that resilience.

Python quote: This program made it possible to enhance Python's security, directly benefitting millions of developers.

Web, networking, and core infrastructure libraries 📚

Apache APISIX • curl • evcc • kgateway • Netty • quic-go • urllib3 • Vapor

These projects form the connective tissue of the internet. They handle HTTP, TLS, APIs, and network communication that nearly every application depends on.

This group includes curl, urllib3, Netty, Apache APISIX, quic-go, and related libraries that sit on the hot path of modern software.

Quote from curl: The program brings together security best practices in a concise, actionable form to give us assurance we're on the right track.

Build systems, CI/CD, and release tooling 🧰

Apache Airflow • Babel • Foundry • Gitoxide • GoReleaser • Jenkins • Jupyter Docker Stacks • node-lru-cache • oapi-codegen • PyPI / Warehouse • rimraf • webpack

Compromising build tooling compromises the entire supply chain. These projects influence how software is built, tested, packaged, and shipped.

Session 3 included projects such as Jenkins, Apache Airflow, GoReleaser, PyPI Warehouse, webpack, and related automation and release infrastructure.

Maintainers in this category focused on securing workflows that often run with elevated privileges and broad access. Improvements here help prevent tampering before software ever reaches users.

Quote from Webpack: We've greatly enhanced our security to protect web applications against threats.

Data science, scientific computing, and AI foundations 📊

ACI.dev • ArviZ • CocoIndex • OpenBB Platform • OpenMetadata • OpenSearch • pandas • PyMC • SciPy • TraceRoot

These projects sit at the core of modern data analysis, research, and AI development. They are increasingly embedded in production systems as well as research pipelines.

Projects such as pandas, SciPy, PyMC, ArviZ, and OpenSearch participated in Session 3. Maintainers expanded security coverage across large and complex codebases, often moving from limited scanning to continuous checks on every commit and release.

Many of these projects also engaged deeply with AI-related security topics, reflecting their growing role in AI workflows.

Quote from SciPy: The program took us from 0 to security scans on every line of code, on every commit, and on every release.

Developer tools and productivity utilities ⚒️

AssertJ • ArduPilot • AsyncAPI Initiative • Bevy • calibre • DIGIT • fabric.js • ImageMagick • jQuery • jsoup • Mastodon • Mermaid • Mockoon • p5.js • python-benedict • React Starter Kit • Selenium • Sphinx • Spyder • ssh_config • Thunderbird for Android • Two.js • xyflow • Yii framework

These projects shape the day-to-day experience of writing, testing, and maintaining software.

The group includes tools such as Selenium, Sphinx, ImageMagick, calibre, Spyder, and other widely used utilities that appear throughout development and testing environments.

Improving security here reduces the risk that developer tooling becomes an unexpected attack vector, especially in automated or shared environments.

Quote from Mermaid: We're not just well equipped for security; we're equipped to lift others up with the same knowledge.

Identity, secrets, and security frameworks 🔒

external-secrets • Helmet.js • Keycloak • Keyshade • Oauth2 (Ruby) • varlock • WebAuthn (Go)

These projects form the backbone of authentication, authorization, secrets management, and secure configuration.

Session 3 participants included projects such as Keycloak, external-secrets, oauth2 libraries, WebAuthn tooling, and related security frameworks.

Maintainers in this group often reported shifting from reactive fixes to systematic threat modeling and long-term security planning, improving trust for every system that depends on them.

Quote from Keyshade: The GitHub SOSF was invaluable, helping us strengthen our security approach and making us more confident and effective organization-wide.

Security as shared infrastructure

One of the most durable outcomes of the program was a shift in mindset.

Maintainers moved security from a stretch goal to a core requirement. They shifted from reactive patching to proactive design, and from isolated work to shared practice. Many are now publishing playbooks, sharing incident response exercises, and passing lessons on to their contributor communities.

That is how security scales: one-to-many.

What’s next: Help us make open source more secure 

Securing open source is basic maintenance for the internet. By giving 67 heavily used projects real funding, three focused weeks, and direct help, we watched maintainers ship fixes that now protect millions of builds a day. This training, taught by the GitHub Security Lab and top cybersecurity experts, allows us to go beyond one-on-one education and enable one-to-many impact. 

For example, many maintainers are working to make their playbooks public. The incident-response plans they rehearsed are forkable. The signed releases they now ship flow downstream to every package manager and CI pipeline that depends on them.

Join us in this mission to secure the software supply chain at scale. 

  • Projects and maintainers: Apply now to the GitHub Secure Open Source Fund and help make open source safer for everyone. Session 4 begins April 2026. If you write code, rely on open source, or want the systems you depend on to remain trustworthy, we encourage you to apply.
  • Funding and Ecosystem Partners: Become a Funding or Ecosystem Partner and support a more secure open source future. Join us on this mission to secure the software supply chain at scale!

Thank you to all of our partners

We couldn’t do this without our incredible network of partners. Together, we are helping secure the open source ecosystem for everyone! 

Funding Partners: Alfred P. Sloan Foundation, American Express, Chainguard, Datadog, Herodevs, Kraken, Mayfield, Microsoft, Shopify, Stripe, Superbloom, Vercel, Zerodha, 1Password


Ecosystem Partners: Atlantic Council, Ecosyste.ms, CURIOSS, Digital Data Design Institute Lab for Innovation Science, Digital Infrastructure Insights Fund, Microsoft for Startups, Mozilla, OpenForum Europe, Open Source Collective, OpenUK, Open Technology Fund, OpenSSF, Open Source Initiative, OpenJS Foundation, University of California, OWASP, Santa Cruz OSPO, Sovereign Tech Agency, SustainOSS


Welcome to the Eternal September of open source. Here’s what we plan to do for maintainers. https://github.blog/open-source/maintainers/welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/ Thu, 12 Feb 2026 20:14:11 +0000 https://github.blog/?p=93789 Open source is hitting an “Eternal September.” As contribution friction drops, maintainers are adapting with new trust signals, triage approaches, and community-led solutions.

The post Welcome to the Eternal September of open source. Here’s what we plan to do for maintainers. appeared first on The GitHub Blog.


Open collaboration runs on trust. For a long time, that trust was protected by a natural, if imperfect filter: friction.

If you were on Usenet in 1993, you’ll remember that every September a flood of new university students would arrive online, unfamiliar with the norms, and the community would patiently onboard them. Then mainstream dial-up ISPs became popular and a continuous influx of new users came online. It became the September that never ended.

Today, open source is experiencing its own Eternal September. This time, it’s not just new users. It’s the sheer volume of contributions.

When the cost to contribute drops

In the era of mailing lists, contributing to open source required real effort. You had to subscribe, lurk, understand the culture, format a patch correctly, and explain why it mattered. The effort didn’t guarantee quality, but it filtered for engagement. Most contributions came from someone who had genuinely engaged with the project.

It also excluded people. The barrier to entry was high. Many projects worked hard to lower it in order to make open source more welcoming.

A major shift came with the pull request. Hosting projects on GitHub, using pull requests, and labeling “Good First Issues” reduced the friction needed to contribute. Communities grew and contributions became more accessible.

That was a good thing.

But friction is a balancing act. Too much keeps people and their ideas out; too little can strain the trust open source depends on.

Today, a pull request can be generated in seconds. Generative AI makes it easy for people to produce code, issues, or security reports at scale. The cost to create has dropped, but the cost to review has not.

It’s worth saying: most contributors are acting in good faith. Many want to help projects they care about. Others are motivated by learning, visibility, or the career benefits of contributing to widely used open source. Those incentives aren’t new and they aren’t wrong.

The challenge is what happens when low-quality contributions arrive at scale. When volume accelerates faster than review capacity, even well-intentioned submissions can overwhelm maintainers. And when that happens, trust, the foundation of open collaboration, starts to strain.

The new scale of noise

It is tempting to frame “low-quality” or “AI slop” contributions as a uniquely recent phenomenon. They aren’t. Maintainers have always dealt with noisy inbound.

  • The Linux kernel operates under a “web of trust” philosophy; it formalized its SubmittingPatches guide and introduced the Developer Certificate of Origin (DCO) in 2004 for a reason.
  • Mozilla and GNOME built formal triage systems around the reality that most incoming bug reports needed filtering before maintainers invested deeper time.
  • Automated scanners: Long before GenAI, maintainers dealt with waves of automated security and code quality reports from commercial and open source scanning tools.

The question from maintainers has often been the same: “Are you really trying to help me, or just help yourself?”

Just because a tool—whether a static analyzer or an LLM—makes it easy to generate a report or a fix doesn’t mean that contribution is valuable to the project. The ease of creation often shifts a burden onto the maintainer because the benefits are imbalanced: the contributor gets the credit (or the CVE, or the visibility), while the maintainer gets the maintenance burden.

Maintainers are feeling that directly. For example:

  • curl ended its bug bounty program after AI-generated security reports exploded, each taking hours to validate.
  • Projects like Ghostty are moving to invitation-only contribution models, requiring discussion before accepting code contributions.
  • Multiple projects are adopting explicit rules about AI-generated contributions.

These are rational responses to an imbalance.

What we’re doing at GitHub

At GitHub, we aren’t just watching this happen. Maintainer sustainability is foundational to open source, and foundational to us. As the home of open source, we have a responsibility to help you manage what comes through the door.

We are approaching this from multiple angles: shipping immediate relief now, while building toward longer-term, systemic improvements. Some of this is about tooling. Some is about creating clearer signals so maintainers can decide where to spend their limited time.

Features we’ve already shipped

  • Repo-level pull request controls: Gives maintainers the option to limit pull request creation to collaborators or disable pull requests entirely. While the introduction of the pull request was fundamental to the growth of open source, maintainers should have the tools they need to manage their projects.
  • Pinned comments on issues: You can now pin a comment to the top of an issue from the comment menu.
  • Banners to reduce comment noise: Experience fewer unnecessary notifications with a banner that encourages people to react or subscribe instead of leaving noise like “+1” or “same here.”
  • Pull request performance improvements: Pull request diffs have been optimized for greater responsiveness and large pull requests in the new files changed experience respond up to 67% faster.
  • Faster issue navigation: Easier bug triage thanks to significantly improved speeds when browsing and navigating issues as a maintainer.
  • Temporary interaction limits: You can temporarily enforce a period of limited activity for certain users on a public repository.

Plus, coming soon: pull request deletion from the UI. This will let maintainers remove spam or abusive pull requests so repositories stay manageable.

These improvements focus on reducing review overhead.

Exploring next steps

We know that walls don’t build communities. As we explore next steps, our focus is on giving maintainers more control while helping protect what makes open source communities work.

Some of the directions we’re exploring in consultation with maintainers include:

  • Criteria-based gating: Requiring a linked issue before a pull request can be opened, or defining rules that contributions must meet before submission.
  • Improved triage tools: Potentially leveraging automated triage to evaluate contributions against a project’s own guidelines (like CONTRIBUTING.md) and surface which pull requests should get your attention first.
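
The criteria-based gating described above could be as small as a rule over the pull request body. A hedged sketch (a hypothetical helper for illustration, not GitHub’s planned implementation) of the kind of check such gating might apply, using GitHub’s closing keywords:

```javascript
// Hypothetical gating rule: does a pull request body reference an issue with
// a closing keyword (close/fix/resolve + #number)? Illustration only; not
// GitHub's implementation of criteria-based gating.
function referencesIssue(body) {
  return /\b(close[sd]?|fix(e[sd])?|resolve[sd]?)\s+#\d+/i.test(body || '');
}

console.log(referencesIssue('Fixes #123: handle empty input')); // true
console.log(referencesIssue('drive-by refactor'));              // false
```

A maintainer-configured rule like this could, for example, leave a comment or apply a label instead of blocking the pull request outright.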

These tools are meant to support decision-making, not replace it. Maintainers should always remain in control.

We are also aware of tradeoffs. Restrictions can disproportionately affect first-time contributors acting in good faith. That’s why these controls are optional and configurable.

The community is building ladders

One of the things I love most about open source is that when the community hits a wall, people build ladders. We’re seeing a lot of that right now.

Maintainers across the ecosystem are experimenting with different approaches. Some projects have moved to invitation-only workflows. Others are building custom GitHub Actions for contributor triage and reputation scoring.

Mitchell Hashimoto’s Vouch project is an interesting example. It implements an explicit trust management system where contributors must be vouched for by trusted maintainers before they can participate. It’s experimental and some aspects will be debated, but it fits a longer lineage, from Advogato’s trust metric to Drupal’s credit system to the Linux kernel’s Signed-off-by chain.
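
The core mechanism of such systems can be sketched as reachability in a trust graph: a contributor is trusted if a chain of vouches connects them back to a root maintainer. This toy model assumes a simple adjacency-list data shape and is not Vouch’s actual design:

```javascript
// Toy web-of-trust check: a user is trusted if reachable from a root maintainer
// via vouch edges (breadth-first search). Assumed data shape; not Vouch's design.
function isVouched(vouches, root, user) {
  const seen = new Set([root]);
  const queue = [root];
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === user) return true;
    for (const next of vouches[current] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return false;
}

const vouches = { alice: ['bob'], bob: ['carol'] };
console.log(isVouched(vouches, 'alice', 'carol'));   // true: alice → bob → carol
console.log(isVouched(vouches, 'alice', 'mallory')); // false: nobody vouched
```

Real systems layer revocation, weighting, and accountability (who vouched for whom) on top of this basic reachability idea.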

At the same time, many communities are investing heavily in education and onboarding to widen who can contribute while setting clearer expectations. The Python community, for example, emphasizes contributor guides, mentorship, and clearly labeled entry points. Kubernetes pairs strong governance with extensive documentation and contributor education, helping new contributors understand not just how to contribute, but what a useful contribution looks like.

These approaches aren’t mutually exclusive. Education helps good-faith contributors succeed. Guardrails help maintainers manage scale.

There is no single correct solution. That’s why we are excited to see maintainers building tools that match their project’s specific values. The tools communities build around the platform often become the proving ground for what might eventually become features. So we’re paying close attention.

Building community, not just walls

We also need to talk about incentives. If we only build blocks and bans, we create a fortress, not a bazaar.

Right now, the concept of “contribution” on GitHub still leans heavily toward code authorship. The WordPress community, for example, uses manually written “props”: credit given not just for code, but for writing, reproduction steps, user testing, and community support. It recognizes the many forms of contribution that move a project forward.

We want to explore how GitHub can better surface and celebrate those contributions. Someone who has consistently triaged issues or merged documentation PRs has proven they understand your project’s voice. These are trust signals we should be surfacing to help you make decisions faster.

Tell us what you need

We’ve opened a community discussion to gather feedback on the directions we’re exploring: Exploring Solutions to Tackle Low-Quality Contributions on GitHub.

We want to hear from you. Share what is working for your projects, where the gaps are, and what would meaningfully improve your experience maintaining open source.

Open source’s Eternal September is a sign of something worth celebrating: more people want to participate than ever before. The volume of contributions is only going to grow — and that’s a good thing. But just as the early internet evolved its norms and tools to sustain community at scale, open source needs to do the same. Not by raising the drawbridge, but by giving maintainers better signals, better tools, and better ways to channel all that energy into work that moves their projects forward.

Let’s build that together.

The post Welcome to the Eternal September of open source. Here’s what we plan to do for maintainers. appeared first on The GitHub Blog.

AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent https://github.blog/security/ai-supported-vulnerability-triage-with-the-github-security-lab-taskflow-agent/ Tue, 20 Jan 2026 19:52:50 +0000 https://github.blog/?p=93282 Learn how we are using the newly released GitHub Security Lab Taskflow Agent to triage categories of vulnerabilities in GitHub Actions and JavaScript projects.

The post AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent appeared first on The GitHub Blog.


Triaging security alerts is often very repetitive because false positives are caused by patterns that are obvious to a human auditor but difficult to encode as a formal code pattern. But large language models (LLMs) excel at matching the fuzzy patterns that traditional tools struggle with, so we at the GitHub Security Lab have been experimenting with using them to triage alerts. We are using our recently announced GitHub Security Lab Taskflow Agent AI framework to do this and are finding it to be very effective.

💡 Learn more about it and see how to activate the agent in our previous blog post.

In this blog post, we’ll introduce these triage taskflows, showcase results, and share tips on how you can develop your own—for triage or other security research workflows.

By using the taskflows described in this post, we quickly triaged a large number of code scanning alerts and have discovered many (~30) real-world vulnerabilities since August, many of which have already been fixed and published. When triaging the alerts, the LLMs were only given tools for basic file fetching and searching. We did not use any static or dynamic code analysis tools other than CodeQL, which generated the alerts.

While this blog post showcases how we used LLM taskflows to triage CodeQL alerts, the general process applies to building any automation with LLMs and taskflows. Your process will be a good candidate for this if:

  1. You have a task that involves many repetitive steps, and each one has a clear and well-defined goal.
  2. Some of those steps involve looking for logic or semantics in code that are hard for conventional programming to identify, but fairly easy for a human auditor to spot. Attempts to encode them conventionally often devolve into ad hoc heuristics, badly written regexps, etc. (These are potential sweet spots for LLM automation!)

If your project meets those criteria, then you can create taskflows to automate these sweet spots using LLMs, and use MCP servers to perform tasks that are well suited for conventional programming.

Both the seclab-taskflow-agent and seclab-taskflows repos are open source, allowing anyone to develop LLM taskflows to perform similar tasks. At the end of this blog post, we’ll also give some development tips that we’ve found useful.

Introduction to taskflows

Taskflows are YAML files that describe a series of tasks that we want to do with an LLM. In this way, we can write prompts to complete different tasks and have tasks that depend on each other. The seclab-taskflow-agent framework takes care of running the tasks one after another and passing the results from one task to the next.

For example, when auditing CodeQL alert results, we first want to fetch the code scanning results. Then, for each result, we have a list of checks to perform: for example, whether the alert can be reached by an untrusted attacker and whether there are authentication checks in place. These become the tasks we specify in a taskflow file.
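Concretely, a taskflow file might look something like the following. This is a simplified, hypothetical sketch: the real schema is defined by the seclab-taskflow-agent repository, and the task names, prompt text, and templating syntax here are illustrative only.

```yaml
# Hypothetical taskflow sketch -- illustrative only, not the real schema.
name: triage_code_scanning_alerts
tasks:
  - name: fetch_alerts
    prompt: >
      Fetch the open code scanning alerts for {{ repo }} using the
      code scanning toolbox and store them for the next task.
  - name: audit_alert
    # Applied to each alert fetched by the previous task.
    prompt: >
      For alert {{ alert_id }}, determine whether the flagged sink is
      reachable by an untrusted attacker and whether authentication
      checks are in place. Record your findings, with file and line
      references, in the audit notes.
  - name: create_issue
    prompt: >
      If the alert is still considered valid, create a GitHub Issue
      summarizing the audit notes for {{ alert_id }}.
```

Each task runs in order, with the results of one task passed to the next, matching the three-stage flow in the diagram.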

Simplified depiction of taskflow with three tasks in order: fetch code scanning results, audit each result, create issues containing verdict.

We use tasks instead of one big prompt because LLMs have limited context windows, and complex, multi-step tasks often are not completed properly. Some steps are frequently left out, so having a taskflow to organize the task avoids these problems. Even with LLMs that have larger context windows, we find that taskflows are useful to provide a way for us to control and debug the task, as well as to accomplish bigger and more complex tasks.

The seclab-taskflow-agent can also perform a batch “for loop”-style task asynchronously. When we audit alerts, we often want to apply the same prompts and tasks to every alert, but with different alert details. The seclab-taskflow-agent allows us to create templated prompts to iterate through the alerts and replace the details specific to each alert when running the task.

Triaging taskflows from a code scanning alert to a report

The GitHub Security Lab periodically runs a set of CodeQL queries against a selected set of open source repositories. The process of triaging these alerts is usually fairly repetitive, and for some alerts, the causes of false positives are usually fairly similar and can be spotted easily. 

For example, when triaging alerts for GitHub Actions, false positives often result from some checks that have been put in place to make sure that only repo maintainers can trigger a vulnerable workflow, or that the vulnerable workflow is disabled in the configuration. These access control checks come in many different forms without an easily identifiable code pattern to match and are thus very difficult for a static analyzer like CodeQL to detect. However, a human auditor with general knowledge of code semantics can often identify them easily, so we expect an LLM to be able to identify these access control checks and remove false positives.

Over the course of a couple of months, we tested our taskflows with a few CodeQL rules, mostly using Claude Sonnet 3.5, and identified a number of real, exploitable vulnerabilities. The taskflows do not perform an “end-to-end” analysis, but rather produce a bug report with all the details and conclusions so that we can quickly verify the results. We did not instruct the LLM to validate the results by creating an exploit, nor did we provide any runtime environment for it to test its conclusions. The results, however, remain fairly accurate even without an automated validation step, and we were able to remove false positives from the CodeQL queries quickly.

The rules are chosen based on our own experience of triaging these types of alerts and whether the list of tasks can be formulated into clearly defined instructions for LLMs to consume. 

General taskflow design

Taskflows generally consist of tasks that are divided into a few different stages. In the first stage, the tasks collect various bits of information relevant to the alert. This information is then passed to an auditing stage, where the LLM looks for common causes of false positives from our own experience of triaging alerts. After the auditing stage, a bug report is generated using the information gathered. In the actual taskflows, the information gathering and audit stage are sometimes combined into a single task, or they may be separate tasks, depending on how complex the task is.

To ensure that the generated report has sufficient information for a human auditor to make a decision, an extra step checks that the report has the correct formatting and contains the correct information. After that, a GitHub Issue is created, ready to be reviewed. 

Creating a GitHub Issue not only makes it easy for us to review the results, but also provides a way to extend the analysis. After reviewing and checking the issues, we often find that there are causes for false positives that we missed during the auditing process. Also, if the agent determines that the alert is valid, but the human reviewer disagrees and finds that it’s a false positive for a reason that was unknown to the agent so far, the human reviewer can document this as an alert dismissal reason or issue comment. When the agent analyzes similar cases in the future, it will be aware of all the past analysis stored in those issues and alert dismissal reasons, incorporate this new intelligence in its knowledge base, and be more effective at detecting false positives.

Information collection

During this stage, we instruct the LLM (examples are provided in the Triage examples section below) to collect relevant information about the alert, taking into account the threat model and general human knowledge of the alert type. For example, in the case of GitHub Actions alerts, it will look at what permissions are set in the GitHub workflow file, what events trigger the workflow, whether the workflow is disabled, etc. These are generally independent tasks that follow simple, well-defined instructions to ensure the information collected is consistent. For example, checking whether a GitHub workflow is disabled involves making a GitHub API call via an MCP server.

To ensure that the information collected is accurate and to reduce hallucination, we instruct the LLM to include precise references to the source code that includes both file and line number to back up the information it collected:

You should include the line number where the untrusted code is invoked, as well as the untrusted code or package manager that is invoked in the notes.

Each task then stores the information it collected in audit notes, which form a kind of running commentary on the alert. Once a task is completed, its notes are serialized to a database, and the next task appends its own notes when it is done.
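The agent framework handles this persistence itself; as a rough sketch of the pattern (not the agent’s actual storage code), appending per-task notes to a SQLite database and merging them for the next stage might look like this:

```python
import json
import sqlite3

def append_notes(db: sqlite3.Connection, alert_id: str, task: str, notes: dict) -> None:
    """Serialize one task's audit notes so later tasks can read and extend them."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS audit_notes (alert_id TEXT, task TEXT, notes TEXT)"
    )
    db.execute(
        "INSERT INTO audit_notes VALUES (?, ?, ?)",
        (alert_id, task, json.dumps(notes)),
    )
    db.commit()

def load_notes(db: sqlite3.Connection, alert_id: str) -> dict:
    """Merge all notes recorded so far for an alert into one bag of information."""
    rows = db.execute(
        "SELECT notes FROM audit_notes WHERE alert_id = ?", (alert_id,)
    ).fetchall()
    merged: dict = {}
    for (notes_json,) in rows:
        merged.update(json.loads(notes_json))
    return merged

db = sqlite3.connect(":memory:")
append_notes(db, "alert-1", "trigger_analysis", {"triggers": ["pull_request_target"]})
append_notes(db, "alert-1", "injection_point", {"sanitizers": []})
```

Each information gathering task only writes its own notes; later auditing and report tasks would call something like `load_notes` to retrieve the full “bag of information.”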

Two tasks in order displaying which notes are added to the general notes in each step. In the trigger analysis step, the notes added include triggers, permissions, and secrets, among others. The second task, “audit injection point,” potentially adds notes such as sanitizers.

In general, the information gathering tasks are independent of one another and do not need to read each other’s notes. This helps each task focus on its own scope without being distracted by previously collected information.

The end result is a “bag of information” in the form of notes associated with an alert that is then passed to the auditing tasks.

Audit issue

At this stage, the LLM goes through the information gathered and performs a list of specific checks to reject alert results that turned out to be false positives. For example, when triaging a GitHub Actions alert, we may have collected information about the events that trigger the vulnerable workflow. In the audit stage, we’ll check if these events can be triggered by an attacker or if they run in a privileged context. After this stage, a lot of the false positives that are obvious to a human auditor will be removed.

Decision-making and report generation

For alerts that have made it through the auditing stage, the next step is to create a bug report using the information gathered, as well as the reasoning for the decision at the audit stage. Again, in our prompt, we are being very precise about the format of the report and what information we need. In particular, we want it to be concise but also include information that makes it easy for us to verify the results, with precise code references and code blocks.

The report generated uses the information gathered from the notes in previous stages and only looks at the source code to fetch code snippets that are needed in the report. No further analysis is done at this stage. Again, the very strict and precise nature of the tasks reduces the amount of hallucination.

Report validation and issue creation

After the report is written, we instruct the LLM to check the report to ensure that all the relevant information is contained in the report, as well as the consistency of the information:

Check that the report contains all the necessary information:
- This criteria only applies if the workflow containing the alert is a reusable action AND has no high privileged trigger. 
You should check it with the relevant tools in the gh_actions toolbox.
If that's not the case, ignore this criteria.
In this case, check that the report contains a section that lists the vulnerable action users. 
If there isn't any vulnerable action users and there is no high privileged trigger, then mark the alert as invalid and using the alert_id and repo, then remove the memcache entry with the key {{ RESULT_key }}.

Missing or inconsistent information often indicates hallucinations or other causes of false positives (for example, not being able to track down an attacker controlled input). In either case, we dismiss the report.

If the report contains all the information and is consistent, then we open a GitHub Issue to track the alert.

Issue review and repo-specific knowledge

The GitHub Issue created in the previous step contains all the information needed to verify the issue, with code snippets and references to lines and files. This provides a kind of “checkpoint” and a summary of the information that we have, so that we can easily extend the analysis.

In fact, after creating the issue, we often find that there are repo-specific permission checks or sanitizers that render the issue a false positive. We can incorporate this knowledge by creating taskflows that review these issues with repo-specific context added to the prompts. One approach that we’ve experimented with is to collect dismissal reasons for alerts in a repo and instruct the LLM to take these dismissal reasons into account when reviewing the GitHub Issue. This allows us to remove false positives for reasons specific to a repo.

Image showing LLM output that dismisses an alert.

In this case, the LLM is able to identify the alert as a false positive after taking into account a custom check-run permission check that was recorded in the alert dismissal reasons.

Triage examples and results

In this section we’ll give some examples of what these taskflows look like in practice. In particular, we’ll show taskflows for triaging some GitHub Actions and JavaScript alerts.

GitHub Actions alerts

The specific Actions alerts that we triaged are checkout of untrusted code in a privileged context and code injection.

Triaging these queries involves many of the same checks. For example, both involve checking the workflow triggering events and the permissions of the vulnerable workflow, and tracking workflow callers. The main differences involve local analysis of specific details of the vulnerabilities. For code injection, this means checking whether the injected code has been sanitized, how the expression is evaluated, and whether the input is truly arbitrary (for example, a pull request ID is unlikely to cause a code injection issue). For untrusted checkout, it means checking whether there is a valid code execution point after the checkout.

Since many elements in these taskflows are the same, we’ll use the code injection triage taskflow as an example. Note that because these taskflows have a lot in common, we made heavy use of reusable features in the seclab-taskflow-agent, such as prompts and reusable tasks.

When manually triaging GitHub Actions alerts for these rules, we commonly run into false positives because of:

  1. Vulnerable workflow doesn’t run in a privileged context. This is determined by the events that trigger the vulnerable workflow. For example, a workflow triggered by the pull_request_target runs in a privileged context, while a workflow triggered by the pull_request event does not. This can usually be determined by simply looking at the workflow file.
  2. Vulnerable workflow disabled explicitly in the repo. This can be checked easily by checking the workflow settings in the repo.
  3. Vulnerable workflow explicitly restricts permissions and does not use any secrets, in which case there is little privilege to gain.
  4. Vulnerability specific issues, such as invalid user input or sanitizer in the case of code injection and the absence of a valid code execution point in the case of untrusted checkout.
  5. Vulnerable workflow is a reusable workflow but not reachable from any workflow that runs in privileged context.

Very often, triaging these alerts involves many simple but tedious checks like the ones listed above, and an alert can be determined to be a false positive very quickly by one of the above criteria. We therefore model our triage taskflows based on these criteria. 

So, our action-triage taskflows consist of the following tasks during information gathering and the auditing stage:

  • Workflow trigger analysis: This stage performs both information gathering and auditing. It first collects events that trigger the vulnerable workflow, as well as permission and secrets that are used in the vulnerable workflow. It also checks whether the vulnerable workflow is disabled in the repo. All information is local to the vulnerable workflow itself. This information is stored in running notes which are then serialized to a database entry. As the task is simple and involves only looking at the vulnerable workflow, preliminary auditing based on the workflow trigger is also performed to remove some obvious false positives. 
  • Code injection point analysis: This is another task that only analyzes the vulnerable workflow and combines information gathering and audit in a single task. This task collects information about the location of the code injection point and the user input that is injected. It also performs local auditing to check whether a user input is a valid injection risk and whether it has a sanitizer.
  • Workflow user analysis: This performs a simple caller analysis that looks for the caller of the vulnerable workflow. As it can potentially retrieve and analyze a large number of files, this step is divided into two main tasks that perform information gathering and auditing separately. In the information gathering task, callers of the vulnerable workflow are retrieved and their trigger events, permissions, use of secrets are recorded in the notes. This information is then used in the auditing task to determine whether the vulnerable workflow is reachable by an attacker.

Each of these tasks is applied to the alert and at each step, false positives are filtered out according to the criteria in the task.

After the information gathering and audit stage, our notes will generally include information such as the events that trigger the vulnerable workflow, permissions and secrets involved, and (in case of a reusable workflow) other workflows that use the vulnerable workflow as well as their trigger events, permissions, and secrets. This information will form the basis for the bug report. As a sanity check to ensure that the information collected so far is complete and consistent, the review_report task is used to check for missing or inconsistent information before a report is created. 

After that, the create_report task is used to create a bug report which will form the basis of a GitHub Issue. Before creating an issue, we double check that the report contains the necessary information and conforms to the format that we required. Missing information or inconsistencies are likely the results of some failed steps or hallucinations and we reject those cases.

The following diagram illustrates the main components of the triage_actions_code_injection taskflow:

Seven tasks of a taskflow connected in order with arrows: fetch alerts, trigger analysis, injection point analysis, workflow user analysis, review notes, create bug report and review bug report. All tasks but fetch alerts symbolize how they either iterate over alerts or alert notes.

We then create GitHub Issues using the create_issue_actions taskflow. As mentioned before, the GitHub Issues created contain sufficient information and code references to verify the vulnerability quickly, as well as serving as a summary for the analysis so far, allowing us to continue further analysis using the issue. The following shows an example of an issue that is created:

Image showing an issue created by the LLM.

In particular, we can use GitHub Issues and alert dismissal reasons as a means to incorporate repo-specific security measures and to further the analysis. To do so, we use the review_actions_injection_issues taskflow to first collect alert dismissal reasons from the repo. These dismissal reasons are then checked against the alert stated in the GitHub Issue. In this case, we simply use the issue as the starting point and instruct the LLM to audit the issue and check whether any of the alert dismissal reasons applies to the current issue. Since the issue contains all the relevant information and code references for the alert, the LLM is able to use the issue and the alert dismissal reasons to further the analysis and discover more false positives. The following shows an alert that is rejected based on the dismissal reasons:

Image showing LLM output of reasons to reject an alert after taking into account of the dismissal reasons.

The following diagram illustrates the main components of the issue creation and review taskflows:

Five tasks separated in two swim lanes: the first swim lane named “create action issues” depicts tasks that are used for the issue creation taskflow starting with dismissing false positives and continuing with the tasks for issue creation for true and false positives. The second swim lane is titled “review action issues” and contains the tasks “collect alert dismissal reasons” and “review issues based on dismissal reasons.”

JavaScript alerts

Similarly to triaging Actions alerts, we also triaged code scanning alerts for the JavaScript/TypeScript languages, to a lesser extent. In the JavaScript world, we triaged code scanning alerts for the client-side cross-site scripting CodeQL rule (js/xss).

The client-side cross-site scripting alerts have more variety with regards to their sources, sinks, and data flows when compared to the GitHub Actions alerts.

The prompts for analyzing those XSS vulnerabilities are focused on helping the person responsible for triage make an educated decision, not making the decision for them. This is done by highlighting the aspects that seem to make a given alert exploitable by an attacker and, more importantly, what likely prevents the exploitation of a given potential issue. Other than that, the taskflows follow a similar scheme as described in the GitHub Actions alerts section.

While triaging XSS alerts manually, we’ve often identified false positives due to these reasons:

  • Custom or unrecognized sanitization functions (e.g., using regexes) that the SAST tool cannot verify.
  • Reported sources that are likely unreachable in practice (e.g., would require an attacker to send a message directly from the webserver).
  • Untrusted data flowing into potentially dangerous sinks, whose output is then only used in a non-exploitable way.
  • The SAST tool not knowing the full context where the given untrusted data ends up.

Based on these false positives, the prompts in the relevant taskflow, or even in the active personality, were extended and adjusted. If you repeatedly encounter certain false positives while auditing a project, it makes sense to extend the prompt so that those false positives are correctly marked (and likewise if alerts for certain sources/sinks are not considered vulnerabilities).

In the end, after executing the taskflows triage_js_ts_client_side_xss and create_issues_js_ts, the alert would result in GitHub issues such as:

A screenshot of a GitHub Issue titled 'Code scanning alert #72 triage report for js/xss,' showing two lists with reasons why the alert is or is not an exploitable vulnerability.

While this is a sample of an alert worth following up on (which turned out to be a true positive, being exploitable by using a javascript: URL), alerts that the taskflow agent decided were false positives get their issues labelled with “FP” (for false positive):

A screenshot of a GitHub Issue titled 'Code scanning alert #1694 triage report for js/xss.' The section listing factors that make the alert exploitable is empty, because the taskflow identified none. However, the issue shows a list of 7 items describing why the vulnerability is not exploitable.

Taskflows development tips

In this section we share some of our experiences when working on these taskflows, and what we think are useful in the development of taskflows. We hope that these will help others create their own taskflows.

Use of database to store intermediate state

While developing a taskflow with multiple tasks, we sometimes encounter problems in tasks that run at a later stage. These can be simple software problems, such as API call failures, MCP server bugs, prompt-related problems, token problems, or quota problems.

By keeping tasks small and storing the results of each task in a database, we avoided rerunning lengthy tasks when a failure happened. When a task in a taskflow fails, we simply rerun the taskflow from the failed task, reusing the results from earlier tasks stored in the database. Apart from saving us time when a task failed, this also helped us isolate the effects of each task and tweak each task using the database created by the previous task as a starting point.
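A minimal sketch of this resume-from-failure pattern, using SQLite purely for illustration (the agent’s actual storage mechanism may differ):

```python
import sqlite3

def run_taskflow(db, tasks, alert_id):
    """Run tasks in order, reusing stored results so a failed run can resume."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "alert_id TEXT, task TEXT, output TEXT, PRIMARY KEY (alert_id, task))"
    )
    for name, fn in tasks:
        row = db.execute(
            "SELECT output FROM results WHERE alert_id = ? AND task = ?",
            (alert_id, name),
        ).fetchone()
        if row is not None:
            continue  # completed in an earlier run; don't pay for it again
        output = fn(alert_id)  # may raise (API failure, quota, ...); a rerun resumes here
        db.execute("INSERT INTO results VALUES (?, ?, ?)", (alert_id, name, output))
        db.commit()

calls = []
def trigger_analysis(alert_id):
    calls.append(alert_id)  # stands in for an expensive LLM task
    return "notes for " + alert_id

db = sqlite3.connect(":memory:")
run_taskflow(db, [("trigger_analysis", trigger_analysis)], "alert-1")
run_taskflow(db, [("trigger_analysis", trigger_analysis)], "alert-1")  # cached, no rerun
```

Because completed results are keyed by (alert_id, task), rerunning the whole taskflow after a quota error only re-executes the tasks that never finished.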

Breaking down complex tasks into smaller tasks

When we were developing the triage taskflows, the models that we used did not handle large contexts and complex tasks very well. When trying to perform multiple complex tasks within the same context, we often ran into problems such as tasks being skipped or instructions not being followed.

To counter that, we divided the work into smaller, independent tasks, each starting with a fresh context. This reduced the context window size and alleviated many of the problems that we had.

One particular example is the use of templated repeat_prompt tasks, which loop over a list of tasks and start a new context for each of them. By doing this, instead of going through a list in the same prompt, we ensured that every single task was performed, while the context of each task was kept to a minimum.
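The actual repeat_prompt syntax lives in the taskflow YAML; the Python below is only an analogy for the iteration pattern it implements, with `run_in_fresh_context` standing in for a brand-new LLM session and the prompt text being purely illustrative:

```python
from string import Template

# Hypothetical prompt template; $alert_id and $repo are filled in per alert.
AUDIT_PROMPT = Template(
    "Audit code scanning alert $alert_id in $repo. Check whether the flagged "
    "sink is reachable by an attacker and record your verdict with file and "
    "line references."
)

def audit_each(alerts, run_in_fresh_context):
    """Apply the same templated prompt to every alert, one fresh context each."""
    verdicts = {}
    for alert in alerts:
        prompt = AUDIT_PROMPT.substitute(alert)
        # Each call starts a brand-new LLM context, keeping the window small
        # and ensuring no single alert in the batch gets skipped.
        verdicts[alert["alert_id"]] = run_in_fresh_context(prompt)
    return verdicts

alerts = [
    {"alert_id": "42", "repo": "octo/app"},
    {"alert_id": "43", "repo": "octo/app"},
]
# A stub LLM call for illustration; the real agent would talk to a model.
out = audit_each(alerts, run_in_fresh_context=lambda p: "TP" if "alert 42" in p else "FP")
```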

A task named “audit results” which exemplifies the “repeat prompt” feature. It depicts that by containing three boxes of the same size called 'audit result #1,' 'audit result #2,' and 'audit result n,' while between the #2 and the n box an ellipsis is displayed.

An added benefit is that we are able to tweak and debug the taskflows with more granularity. By having small tasks and storing results of each task in a database, we can easily separate out part of a taskflow and run it separately. 

Delegate to MCP server whenever possible

Initially, when checking and gathering information, such as workflow triggers, from the source code, we simply incorporated instructions in prompts because we thought the LLM should be able to gather the information from the source code. While this worked most of the time, we also noticed some inconsistencies due to the non-deterministic nature of the LLM. For example, the LLM sometimes would only record a subset of the events that trigger the workflow, or it would sometimes make inconsistent conclusions about whether the trigger runs the workflow in a privileged context or not.

Since this information can easily be gathered, and these checks performed, programmatically, we ended up creating tools in the MCP servers to gather the information and perform the checks. This led to much more consistent outcomes.

By moving most of the tasks that can easily be done programmatically into MCP server tools, while leaving the more complex logical reasoning tasks, such as finding permission checks, to the LLM, we were able to leverage the power of the LLM while keeping the results consistent.

Reusable taskflow to apply tweaks across taskflows

As we were developing the triage taskflows, we realized that many tasks could be shared between different triage taskflows. To make sure that tweaks in one taskflow could be applied to the rest, and to reduce the amount of copy and paste, we needed a way to refactor the taskflows and extract reusable components.

We added features like reusable tasks and prompts, which let us share components and apply changes consistently across different taskflows.
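The agent provides its own reusable-task and prompt mechanisms; as a generic illustration of the idea, plain YAML anchors already allow one prompt definition to be shared across taskflows (the keys and prompt text below are hypothetical):

```yaml
# Illustration using plain YAML anchors; seclab-taskflow-agent has its own
# dedicated mechanisms for reusable tasks and prompts.
prompts:
  report_format: &report_format >
    The report must contain precise file and line references for every
    claim, plus the code blocks needed to verify them.
taskflows:
  triage_code_injection:
    create_report_prompt: *report_format
  triage_untrusted_checkout:
    create_report_prompt: *report_format
```

With this kind of factoring, tightening the report-format requirements in one place propagates to every taskflow that reuses the prompt.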

Configuring models across taskflows

As LLMs are constantly developing and new versions are released frequently, it soon became apparent that we needed a way to update model version numbers across taskflows. So, we added a model configuration feature that allows us to change models across taskflows, which is useful when the model version needs updating or when we just want to experiment and rerun the taskflows with a different model.

Closing

In this post we’ve shown how we created taskflows for the seclab-taskflow-agent to triage code scanning alerts. 

By breaking down the triage into precise and specific tasks, we were able to automate many of the more repetitive steps using LLMs. By setting out clear and precise criteria in the prompts, and by asking the LLM to include code references in its answers, the LLM was able to perform the tasks as instructed while keeping hallucination to a minimum. This allows us to leverage LLMs to triage alerts and greatly reduces the number of false positives, without the need to validate the alerts dynamically.

As a result, we were able to discover ~30 real-world vulnerabilities from CodeQL alerts after running the triage taskflows.

The discussed taskflows are published in our repo and we’re looking forward to seeing what you’re going to build using them! More recently, we’ve also done some further experiments in the area of AI assisted code auditing and vulnerability hunting, so stay tuned for what’s to come!

Get the guide to setting up the GitHub Security Lab Taskflow Agent >


Disclaimers: 

  1. When we use these taskflows to report vulnerabilities, our researchers carefully review all generated output before sending the report. We strongly recommend you do the same.
  2. Running the taskflows can result in many tool calls, which can easily consume a large amount of quota.
  3. The taskflows may create GitHub Issues. Please be considerate and seek the repo owner’s consent before running them on somebody else’s repo.

The post AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent appeared first on The GitHub Blog.

Community-powered security with AI: an open source framework for security research https://github.blog/security/community-powered-security-with-ai-an-open-source-framework-for-security-research/ Wed, 14 Jan 2026 18:45:09 +0000 https://github.blog/?p=93243 Announcing GitHub Security Lab Taskflow Agent, an open source and collaborative framework for security research with AI.

The post Community-powered security with AI: an open source framework for security research appeared first on The GitHub Blog.


Since its founding in 2019, GitHub Security Lab has had one primary goal: community-powered security. We believe that the best way to improve software security is by sharing knowledge and tools, and by using open source software so that everybody is empowered to audit the code and report any vulnerabilities that they find.

Six years later, a new opportunity has emerged to take community-powered security to the next level. Thanks to AI, we can now use natural language to encode, share, and scale our security knowledge, which will make it even easier to build and share new security tools. And under the hood, we can use Model Context Protocol (MCP) interfaces to build on existing security tools like CodeQL.

As a community, we can eliminate software vulnerabilities far more quickly if we share our knowledge of how to find them. With that goal in mind, our team has been experimenting with an agentic framework called the GitHub Security Lab Taskflow Agent. We’ve been using it internally for a while, and we also recently shared it with the participants of the GitHub Secure Open Source Fund. Although it’s still experimental, it’s ready for others to use.

Demo: Variant analysis

It takes only a few steps to get started with seclab-taskflow-agent:

  1. Create a personal access token.
  2. Add codespace secrets.
  3. Start a codespace.
  4. Run a taskflow with a one-line command.

Please follow along and give it a try! 

Note: This demo will use some of your token quota, and it’s possible that you’ll hit rate limits, particularly if you’re using a free GitHub account. But I’ve tried to design the demo so that it will work on a free account. The quotas will refresh after one day if you do hit the rate limits.

Create a fine-grained personal access token

Go to your developer settings page and create a personal access token (PAT).

Screenshot of the developer settings page where I am creating a new PAT.

Scroll down and add the “models” permission:

Screenshot of the developer settings page where I am adding the "Models" permission to my new PAT.

Add codespaces secrets

For security reasons, it’s not a good idea to save the PAT that you just created in a file on disk. Instead, I recommend saving it as a “codespace secret,” which means it’ll be available as an environment variable when you start a codespace in the next step.

Go to your codespaces settings and create a secret named GH_TOKEN:

Screenshot of the codespaces settings page, where I am adding a new secret.

Under “Repository access,” add GitHubSecurityLab/seclab-taskflows, which is the repo that we’ll start the codespace from.

Now go back to your codespaces settings and create a second secret named AI_API_TOKEN. You can use the same PAT for both secrets.

We want to use two secrets so that GH_TOKEN is used to access GitHub’s API and do things like read the code, whereas AI_API_TOKEN can access the AI API. Only one PAT is needed for this demo because it uses the GitHub Models API, but the framework also supports using other (not GitHub) APIs for the AI requests.

Start a codespace

Now go to the seclab-taskflows repo and start a codespace:

Screenshot of starting a new codespace from the seclab-taskflows repo.

After the codespace starts, wait a few minutes until you see a prompt like this:

Screenshot of the terminal window in the newly started codespace, showing the (.venv) prompt.

It’s important to wait until you see (.venv) before the prompt, as it indicates that the Python virtual environment has been created.

Run a taskflow with a one-line command

In the codespace terminal, enter this command to run the variant analysis demo taskflow:

python -m seclab_taskflow_agent -t seclab_taskflows.taskflows.audit.ghsa_variant_analysis_demo -g repo=github/cmark-gfm -g ghsa=GHSA-c944-cv5f-hpvr

Answer “yes” when it asks for permission to run memcache_clear_cache; this is the first run so the cache is already empty. The demo downloads and analyzes a security advisory from the repository (in this example, GHSA-c944-cv5f-hpvr from cmark-gfm). It tries to identify the source code file that caused the vulnerability, then it downloads that source code file and audits it for other similar bugs. It’s not a sophisticated demo, and (thankfully) it has not found any new bugs in cmark-gfm 🫣. But it’s short and simple, and I’ll use it later to explain what a taskflow is. You can also try it out on a different repository, maybe one of your own, by changing the repo name at the end of the command.

Other ways to run

I recommend using a codespace because it’s a quick, reliable way to get started. It’s also a sandboxed environment, which is good for security. But there are other ways to run the framework if you prefer.

Running in a Linux terminal

These are the commands to install and run the demo locally on a Linux system:

export AI_API_TOKEN=github_pat_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
export GH_TOKEN=$AI_API_TOKEN
python3 -m venv .venv
source .venv/bin/activate
pip install seclab-taskflows
python -m seclab_taskflow_agent -t seclab_taskflows.taskflows.audit.ghsa_variant_analysis_demo -g repo=github/cmark-gfm -g ghsa=GHSA-c944-cv5f-hpvr

These commands download our latest release from PyPI. Note that some of the toolboxes included with the framework may not work out-of-the-box with this approach because they depend on other software being installed. For example, the CodeQL toolbox depends on CodeQL being installed. You can copy the installation instructions from the devcontainer configuration that we use to build our codespaces environment.

Running in Docker

We publish a docker image with tools like CodeQL pre-installed. You can run it with this script. Be aware that this docker image only includes seclab-taskflow-agent. We are planning to publish a second “batteries included” image that also includes seclab-taskflows in the future. Note: I’ll explain the relationship between seclab-taskflow-agent and seclab-taskflows in the section about the collaboration model.

Taskflows

A taskflow is a YAML file containing a list of tasks for the framework to execute. Let’s look at the taskflow for my demo (source):

seclab-taskflow-agent:
  filetype: taskflow
  version: 1

globals:
  repo:
  ghsa:

taskflow:
  - task:
      must_complete: true
      agents:
        - seclab_taskflow_agent.personalities.assistant
      toolboxes:
        - seclab_taskflow_agent.toolboxes.memcache
      user_prompt: |
        Clear the memory cache.

  - task:
      must_complete: true
      agents:
        - seclab_taskflow_agent.personalities.assistant
      toolboxes:
        - seclab_taskflows.toolboxes.ghsa
        - seclab_taskflows.toolboxes.gh_file_viewer
        - seclab_taskflow_agent.toolboxes.memcache
      user_prompt: |
        Fetch the details of the GHSA {{ GLOBALS_ghsa }} of the repo {{ GLOBALS_repo }}.

        Analyze the description to understand what type of bug caused
        the vulnerability. DO NOT perform a code audit at this stage, just 
        look at the GHSA details.

        Check if any source file is mentioned as the cause of the GHSA.
        If so, identify the precise file path and line number.

        If no file path is mentioned, then report back to the user that 
        you cannot find any file path and end the task here.

        The GHSA may not specify the full path name of the source
        file, or it may mention the name of a function or method
        instead, so if you have difficulty finding the file, try
        searching for the most likely match.

        Only identify the file path for now, do not look at the code or
        fetch the file contents yet.

        Store a summary of your findings in the memcache with the GHSA
        ID as the key. That should include the file path and the function that 
        the file is in.

  - task:
      must_complete: true
      agents:
        - seclab_taskflow_agent.personalities.assistant
      toolboxes:
        - seclab_taskflows.toolboxes.gh_file_viewer
        - seclab_taskflow_agent.toolboxes.memcache
      user_prompt: |
        Fetch the GHSA ID and summary that were stored in the memcache
        by the previous task.

        Look at the file path and function that were identified. Use the 
        get_file_lines_from_gh tool to fetch a small portion of the file instead of
        fetching the entire file.

        Fetch the source file that was identified as the cause of the
        GHSA in repo {{ GLOBALS_repo }}. 

        Do a security audit of the code in the source file, focusing
        particularly on the type of bug that was identified as the
        cause of the GHSA.

You can see that it’s quite similar in structure to a GitHub Actions workflow. There’s a header at the top, followed by the body, which contains a series of tasks. The tasks are completed one by one by the agent framework. Let’s go through the sections one by one, focusing on the most important bits:

The first part of the header defines the file type. The most frequently used file types are:

  • taskflow: Describes a sequence of tasks for the framework to execute.
  • personality: It’s often useful to ask the agent to assume a particular personality while executing a task. For example, we have an action_expert personality that is useful for auditing Actions workflows.
  • toolbox: Contains instructions for running an MCP server. For example, the demo uses the gh_file_viewer toolbox for downloading source code files from GitHub.

The globals section defines global variables named “repo” and “ghsa,” which we initialized with the command-line arguments -g repo=github/cmark-gfm and -g ghsa=GHSA-c944-cv5f-hpvr. It’s a crude way to parameterize a taskflow.

Task 1

Tasks always specify a “personality” to use. For non-specialized tasks, we often just use the assistant personality.

Each task starts with a fresh context, so the only way to communicate a result from one task to the next is by using a toolbox as an intermediary. In this demo, I’ve used the memcache toolbox, which is a simple key-value store. We find that this approach is better for debugging, because it means that you can rerun an individual task with consistent inputs when you’re testing it.
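To make the handoff concrete, here’s a toy sketch of the key-value idea. The class name and sample values are illustrative; this is not the real memcache toolbox, which runs as an MCP server:

```python
class MemCache:
    """Toy illustration of the memcache idea -- NOT the real toolbox."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)

    def clear(self):
        self._store.clear()


# One task stores its findings under the GHSA ID as the key...
cache = MemCache()
cache.set("GHSA-c944-cv5f-hpvr", "summary: suspect file path and function")

# ...and the next task, starting from a fresh context, looks it up by key.
summary = cache.get("GHSA-c944-cv5f-hpvr")
```

Because each task only sees what it explicitly fetches from the store, a failing task can be rerun in isolation with exactly the same inputs.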

This task also demonstrates that toolboxes can ask for confirmation before doing something potentially destructive, which is an important protection against prompt injection attacks.

Task 2

This task uses the ghsa toolbox to download the security advisory from the repository and the gh_file_viewer toolbox to find the source file that’s mentioned in the advisory. It creates a summary and uses the memcache toolbox to pass it to the next task.

Task 3

This task uses the memcache toolbox to fetch the results from the previous task and the gh_file_viewer toolbox to download the source code and audit it.

Often, the wording of a prompt is more subtle than it looks, and this third task is an example of that. Previous versions of this task tried to analyze the entire source file in one go, which used too many tokens. So the second paragraph, which asks to analyze a “small portion of the file,” is very important to make this task work successfully.

Taskflows summary

I hope this demo has given you a sense of what a taskflow is. You can find more detailed documentation in README.md and GRAMMAR.md. You can also find more examples in this subdirectory of seclab-taskflow-agent and this subdirectory of seclab-taskflows.

Collaboration model

We would love for members of the community to publish their own suites of taskflows. To make collaboration easy, we have built on top of Python’s packaging ecosystem. Our own two repositories are published as packages on PyPI:

  1. seclab-taskflow-agent: the implementation of the taskflow framework.
  2. seclab-taskflows: a suite of taskflows written by our team.

The reason why we have two repositories is that we want to separate the “engine” from the suites of taskflows that use it. Also, seclab-taskflows is intended to be an easy-to-copy template for anybody who would like to publish their own suite of taskflows. To get started on your package, we recommend using the hatch new command to create the initial project structure. It will generate things like the pyproject.toml file, which you’ll need for uploading to PyPI. Next we recommend creating a directory structure like ours, with sub-directories for taskflows, toolboxes, etc. Feel free to also copy other parts of seclab-taskflows, such as our publish-to-pypi.yaml workflow, which automatically uploads your package to PyPI when you push a tag with a name like “v1.0.0.”
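For orientation, a minimal pyproject.toml for such a suite might look roughly like the sketch below. The package name is a placeholder and the dependency on seclab-taskflow-agent is an assumption (your suite needs the engine to run); hatch new generates the real scaffold:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-taskflows"                     # placeholder name for your suite
version = "1.0.0"
dependencies = ["seclab-taskflow-agent"]  # assumption: the engine your suite runs on
```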

An important feature of the collaboration model is that it is also easy to share MCP servers. For example, check out the MCP servers that are included with the seclab-taskflows package. Each MCP server has a corresponding toolbox YAML file (in the toolboxes directory) which contains the instructions for running it.

The import system

Taskflows often need to refer to other files, like personalities or toolboxes. And for the collaboration model to work well, we want you to be able to reuse personalities and toolboxes from other packages. We are leveraging Python’s importlib to make it easy to reference a file from a different package. To illustrate how it works, here’s an example in which seclab-taskflows is using a toolbox from seclab-taskflow-agent:

toolboxes:
  - seclab_taskflow_agent.toolboxes.memcache

The implementation splits the name seclab_taskflow_agent.toolboxes.memcache into a directory (seclab_taskflow_agent.toolboxes) and a filename (memcache). Then it uses Python’s importlib.resources.files to locate the directory and loads the file named memcache.yaml from that directory. The only quirk of this system is that names always need to have at least two parts, which means that your files always need to be stored at least one directory deep. But apart from that, we’re using Python’s import system as is, which means that there’s plenty of documentation and advice available online.
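The mechanism described above can be sketched in a few lines of standard-library Python. The helper names here are mine, not the framework’s API:

```python
import importlib.resources


def split_resource_name(dotted: str, suffix: str = ".yaml"):
    """Split 'pkg.subdir.name' into a package path and a resource filename."""
    package, _, stem = dotted.rpartition(".")
    if not package:
        # Names need at least two parts: a package directory plus a filename.
        raise ValueError("resource names need at least two dot-separated parts")
    return package, stem + suffix


def load_toolbox_yaml(dotted: str) -> str:
    # Locate the package directory via importlib.resources and read the file.
    package, filename = split_resource_name(dotted)
    return importlib.resources.files(package).joinpath(filename).read_text()


# e.g. "seclab_taskflow_agent.toolboxes.memcache" resolves to the file
# memcache.yaml inside the seclab_taskflow_agent.toolboxes package.
```

Since this rides on the normal import machinery, any installed package’s toolboxes are addressable the same way, which is what makes cross-package reuse work.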

Project vision

We have two main goals with this project. First is to encourage community-powered security. Many of the agentic security tools that are currently popping up are closed-source black boxes, which is the antithesis of what we stand for as a team. We want people to be able to look under the hood and see how the taskflows work. And we want people to be able to easily create and share their own taskflows. As a community, we can eliminate software vulnerabilities far more quickly if we share our knowledge of how to find them. We’re hoping that taskflows can be an effective tool for that.

Second is to create a tool that we want to use ourselves. As a research team, we want a tool that’s good for rapid experimentation. We need to be able to quickly create a new security rule and try it out. With that in mind, we’re not trying to create the world’s most polished or efficient tool, but rather something that’s easy to modify.

For a deeper technical dive into how we’re using the framework for security research, explore a blog post by my colleagues Peter Stöckli and Man Yue Mo, where they share how they’re using the framework for triaging CodeQL alerts.

Check out the latest security news >

Light waves, rising tides, and drifting ships: Game Off 2025 winners https://github.blog/open-source/gaming/light-waves-rising-tides-and-drifting-ships-game-off-2025-winners/ Sat, 10 Jan 2026 21:36:00 +0000 https://github.blog/?p=93139 Out of more than 700 games submitted to Game Off 2025, these ten winners stand out for creativity, craft, and bold interpretations of the WAVES theme. All are free to play, with source code available to explore.

The post Light waves, rising tides, and drifting ships: Game Off 2025 winners appeared first on The GitHub Blog.


This year’s Game Off winners weren’t just inspired by the WAVES theme. They surfed it. From sound-bending platformers and tide-controlled puzzles to glow-slug exploration, naval drift madness, and typing so intense it collapses stadium morale, these top ten games prove that “waves” are not only in the ocean. They can mean physics, emotion, memory, or even pure chaos.

Game Off is GitHub’s annual game jam, now in its 13th year, inviting developers from around the world to build games around a single theme and share the source code. It’s part experiment, part playground, and part community showcase where beginners and veterans ship games, learn in public, and play, rate, and review each other’s work.

The top ten highest-rated games overall are shown below, all rated and reviewed by the Game Off participants themselves. Congratulations, builders! So grab a controller. Or a keyboard. Or a frog. Then, get ready to surf these games! 🏄🏻

1 Evaw

Evaw is a moody platformer where light waves, sound waves, and poetic radio transmissions guide you through the ruins of a fallen nation. Bend physics, unlock paths, ignore useless feathers, and slowly realize this world is listening back. This game includes adjustable difficulty, hidden secrets, and a speedrun timer for when “cozy” turns competitive. Hydration encouraged.

2 Where the Water Flows

A calm, clever isometric puzzle adventure where water levels are the level design. Raise or lower the sea level, reveal paths, and think before you move. This game’s serene visuals, smart pacing, and puzzles reward observation over reflexes. Clean, chill, and dangerously close to being meditative.

3 BEACON

You are a slug. You glow. You are lost. BEACON drops you into a dark, abandoned world where light waves reveal paths, secrets, and missing friends. Short, atmospheric puzzles reward curiosity and patience, not button mashing. Cozy, quiet, and proof that illumination solves most problems, including loneliness.

4 A Kingdom Slightly Out Of Tune

A puzzle-strategy mashup where sound waves are weapons, and matching colors triggers musical shockwaves. Slide tiles, chain colors, remix upgrades, and blast enemies back into harmony. Experience cute art, punchy audio, and combos that clear the screen in one perfectly tuned hit.

5 Wave Drifter

Peaceful sailing is canceled. Wave Drifter throws you into high-speed naval chases where drifting is mandatory and ramming is encouraged. Boost, blast, collect purple orbs, stack upgrades, and turn your ship into a floating mistake for the royal navy. Enjoy simple controls, busted builds, and “one more run” energy until sunrise.

6 Tidal Town

City planning meets divine disaster. Slide buildings into color groups to score points and calm Poseidon, who is absolutely not calm. This game is easy to learn, surprisingly tactical, and constantly evolving as waves reshape the board. Bright visuals, smooth play, and a playful take on managing chaos one house at a time.

7 Froggy Love

A cozy ripple-physics puzzle about guiding a lovestruck frog across a pond using carefully placed splashes. Three handcrafted levels turn wave timing into romance logistics. Snakes are bad. Vibes are good. Proof that even small ripples can lead to big feelings.

8 Ooqo

A slick arcade score-chaser where movement becomes rhythm. Glide across fish to survive, chain combos to build momentum, and never step where you shouldn’t. Simple rules, deep flow state, plus a hypnotic soundtrack and leaderboard that will steal your time.

9 La Ola

Tired of running endlessly? Try typing endlessly! La Ola turns “the wave” you’ll see at big soccer games into a high-stakes typing gauntlet where your WPM (words per minute) holds the stadium vibe together. Miss letters, the crowd naps. Miss a space, the wave collapses. Procedurally generated text is powered by Markov Chains, because science. Brutal, addictive, and extremely honest about your keyboard skills.

10 The Last Wave

A haunting mystery where waves carry memory, grief, and identity. Work alongside Saja, Korea’s grim reaper, to reconstruct lives lost in a fatal fire by matching names, locations, and overlapping wave signals. Inspired by Return of the Obra Dinn, this game is deeply thoughtful, culturally grounded, and quietly devastating. Short, meticulous, unforgettable.


These ten games are the top-ranked entries overall from more than 700 submissions to Game Off 2025. They’re not the whole story. The quality across this year’s jam was consistently high, with hundreds more games worth exploring. Play more on itch.io.

Thanks <3

Thanks for building—and for playing. Seriously.

Game Off only works because people show up. Builders, players, raters, reviewers. Every comment, every score, every “hey this is cool” helped turn this jam into a real community moment. Thanks for making waves with us. We’ll see you in the next one.

5 podcast episodes to help you build with confidence in 2026 https://github.blog/open-source/maintainers/5-podcast-episodes-to-help-you-build-with-confidence-in-2026/ Tue, 23 Dec 2025 00:15:00 +0000 https://github.blog/?p=93068 Looking ahead to the New Year? These GitHub Podcast episodes help you cut through the noise and build with more confidence across AI, open source, and developer tools.

The post 5 podcast episodes to help you build with confidence in 2026 appeared first on The GitHub Blog.

]]>

The end of the year creates a rare kind of quiet. It is the kind that makes it easier to reflect on how you have been building, what you have been learning, and what you want to do differently next year. It is also the perfect moment to catch up on the mountain of browser tabs you’ve kept open and podcast episodes you’ve bookmarked. Speaking of podcasts, we have one! (Wow, smooth transition, Cassidy).

If you’re looking to level-up your thinking around AI, open source software sustainability, and the future of software, we have some great conversations you can take on the road with you. 

This year on the GitHub Podcast, we talked with maintainers, educators, data experts, and builders across the open source ecosystem. These conversations were not just about trends or tools. They offered practical ways to think more clearly about where software is headed and how to grow alongside it. If 2026 is about building with more intention, these five episodes are a great place to start.

Understand where AI tooling is actually heading

If this year left you feeling overwhelmed by the pace of change in AI tooling, you are not alone. New models, new agents, and new workflows seemed to appear every week, often without much clarity on how they fit together or which ones would actually last.

Our Unlocking the power of MCP episode slows things down. It introduces the Model Context Protocol (MCP) as a way to make sense of that chaos, explaining how an open standard can help AI systems interact with tools in consistent and transparent ways. Rather than adding to the noise, the conversation gives you a clearer mental model for how modern AI tools are being built and why open standards matter for trust, interoperability, and long-term flexibility. Most importantly, MCP makes building better for everyone. Learn about how the standard works (and you can check out GitHub’s open sourced MCP server, too).

Ship smaller, smarter software—faster

Not every meaningful piece of software needs a pitch deck or a product roadmap. Building tools and the future of DIY development explores a growing shift toward personal, purpose-built tools. These are tools created to solve one specific problem well, often by the people who feel that pain most acutely. Developers and non-developers alike are empowered by open source and AI tools to build faster and with less mental overhead. It is a great reminder that modern tooling and AI have lowered the barrier to turning ideas into working software, without stripping away creativity or craftsmanship. After listening to this one, you might just pick up that unused domain name and make something! 😉

Understand what keeps open source sustainable

If you were around the tech scene in 2021, you probably remember the absolute chaos that came with the Log4Shell vulnerability that was exposed in November that year. That vulnerability (and others since then) put a spotlight on the world’s dependence on underfunded open source infrastructure. But, money can’t solve all of the world’s problems, unfortunately. From Log4Shell to the Sovereign Tech Fund is a really interesting conversation about why success is not just about funding, but also community health, processes, and communication. By the end, you come away with a deeper appreciation for the invisible labor behind the tools you rely on, and a clearer sense of how individuals, companies, and governments can show up more responsibly.

2025 really has been the year of growth and change across projects. The Octoverse report analyzes the state of open source across 1.12 billion open source contributions, 518 million merged pull requests, 180 million developers… you get the idea, a lot of numbers and a lot of data. TypeScript’s Takeover, AI’s Lift-Off: Inside the 2025 Octoverse Report grounds the conversation in data from GitHub’s Octoverse report, turning billions of contributions into meaningful signals. The discussion helps connect trends like TypeScript’s rise, AI-assisted workflows, and even COBOL’s unexpected resurgence to real decisions developers face: what to learn next, where to invest time, and how to stay adaptable. Rather than predicting the future, it offers something more useful: a clearer picture of the present and how to navigate what comes next.

Understand what privacy-first software looks like in practice

As more everyday devices become connected, it is getting harder to tell where convenience ends and control begins. This episode offers a refreshing counterpoint. Recorded live at GitHub Universe 2025, the conversation with Frank “Frenck” Nijhof explores how Home Assistant has grown into one of the most active open source projects in the world by prioritizing local control, privacy, and long-term sustainability.

Listening to Privacy-First Smart Homes with Frenck from Home Assistant shifts how you think about automation and ownership. You hear how millions of households run smart homes without relying on the cloud, why the Open Home Foundation exists to fight vendor lock-in and e-waste, and how a welcoming community scaled to more than 21,000 contributors. The discussion also opens up what contribution can look like beyond writing code, showing how documentation, testing, and community support play a critical role. It is a reminder that building better technology often starts with clearer values and more inclusive ways to participate. Plus, you get to hear about the weird and wonderful ways people use Home Assistant to power their lives. 

Take this with you

As we look toward 2026, these episodes share a common thread. They encourage building with clarity, curiosity, and care for your tools, your community, and yourself. Whether you are listening while traveling, winding down for the year, or planning what you want to focus on next, we hope these conversations help you start the year feeling more grounded and inspired.

And if you speed through these episodes, don’t worry; we have so many more fantastic episodes from this season. You can listen to every episode of the GitHub Podcast wherever you get your podcasts, or watch them on YouTube. We are excited to see what you build in 2026.

Subscribe to the GitHub Podcast >

This year’s most influential open source projects https://github.blog/open-source/maintainers/this-years-most-influential-open-source-projects/ Mon, 22 Dec 2025 23:48:52 +0000 https://github.blog/?p=93051 From Appwrite to Zulip, Universe 2025’s Open Source Zone was stacked with standout projects showing just how far open source can go. Meet the maintainers—and if you want to join them in 2026, you can now apply for next year’s cohort.

The post This year’s most influential open source projects appeared first on The GitHub Blog.


From Appwrite to Zulip, the Open Source Zone at Universe 2025 was stacked with projects that pushed boundaries and turned heads. These twelve open source teams brought the creativity, the engineering craft, and the “I need to try that” demos that make Universe special. Here’s a closer look at what they showcased this year.

If you want to join them in 2026, applications for next year’s Open Source Zone are open now!

Appwrite: Backend made simple

Appwrite is an open source backend platform that helps developers build secure and scalable apps without boilerplate. With APIs for databases, authentication, storage, and more, it’s become a go-to foundation for web and mobile developers who want to ship faster.

Screenshot of Appwrite.

Origin story: Appwrite was created in 2019 by Eldad Fux as a weekend side project, and it quickly grew into one of the fastest-growing developer platforms on GitHub, with over 50,000 stars and hundreds of contributors worldwide.

Photo of Appwrite's @divanov11 and @stnguyen90 in the Open Source Zone.
Appwrite’s @divanov11 and @stnguyen90 give the Open Source Zone a 👍🏻.

GoReleaser: Effortless release automation for Go

GoReleaser automates packaging, publishing, and distributing Go projects so developers can ship faster with less stress. With strong support from its contributor base, it has become the go-to release engineering tool for Go maintainers who want to focus on building rather than busywork.

🚦 Go go go, GoReleaser: GoReleaser started life in 2015 as a simple release.sh script. Within a year, @caarlos0 rewrote it in Go with YAML configs during his holiday break—instead of, you know, actually taking a holiday. That rewrite became the foundation of what’s now a tool with over 15,000 stars and paying customers worldwide, GitHub included: it’s used to release the GitHub CLI, for example.

And can we all just take a minute to applaud the GoReleaser logo?!

A logo of a gopher on a rocket.

💡 Fun fact: one of my colleagues, @ashleymcnamara, has created a secession (that’s the word for a bunch of Gophers—I checked!) of iconic Gopher designs that have become part of Go’s visual culture. If you’ve seen a Gopher sticker at a conference, odds are it came from her repo. Watch out, Ashley. Looks like you have some competition.

Homebrew: The missing package manager for macOS

Speaking of great logos. Homebrew is the de facto package manager for macOS, beloved by developers for making it simple to install, manage, and update software from the command line. From data scientists to DevOps engineers, millions rely on Homebrew every day to bootstrap their environments, automate workflows, and keep projects running smoothly.

Thanks for having us! GitHub Universe was a great opportunity to re-energize by meeting users and fellow maintainers.

Issy Long, Senior Software Engineer & Homebrew Lead Maintainer
Photo of Homebrew at the GitHub Universe Open Source Zone.
Homebrew lead maintainers @p-linnane and @issyl0 were on hand to meet users and answer questions. Cheers! 🍻

Ladybird: A browser for the bold

Ladybird is an ambitious and independent open source browser being built from scratch with performance, security, and privacy in mind. What began as a humble HTML viewer is now evolving into one of the most exciting projects in the browser space, supported by a rapidly growing global community.

Ladybird publishes a monthly update showcasing bug fixes, performance improvements, and feature additions like variable font support and enhanced WebGL support.

💡 Did you know: Ladybird started life in 2018 as a tiny HTML viewer tucked inside the SerenityOS operating system. Fast-forward a few years and it’s grown up into a full-fledged, from-scratch browser with a buzzing open source community—1200 contributors and counting!

Moondream: Tiny AI, big vision

Moondream is an open source visual language model that brings visual intelligence to everyone. With a tiny 1 GB footprint and blazing performance, it runs anywhere from laptops to edge devices without the need for GPUs or complex infrastructure. Developers can caption images, detect objects, follow gaze, read documents, and more using natural language prompts. With more than 6 million downloads and thousands of GitHub stars, Moondream is trusted across industries from healthcare to robotics, making state-of-the-art vision AI as simple as writing a line of code.

Oh My Zsh: Supercharge your shell

Oh My Zsh is a community-driven framework that makes the Zsh shell stylish, powerful, and endlessly customizable. With hundreds of plugins and themes and millions of users, it is one of the most beloved ways to supercharge the command line.

People get really into customizing their prompts—myself included—but GitHub’s @casidoo raised the bar with her blog post. Safe to say her prompt looks way cooler than mine. For now… 😈

Photo of Oh My Zsh at the GitHub Universe Open Source Zone.
Oh my gosh, it’s the Oh My Zsh creator @robbyrussell and maintainer @carlosala discussing why your shell deserves nice things.

💡 Fun fact: Oh My Zsh started in 2009 as a weekend project by Robby Russell, and it’s now one of the most popular open-source frameworks for managing Zsh configs, with thousands of plugins and themes contributed by the community. <3

OpenCV: The computer vision powerhouse

OpenCV is the most widely used open source computer vision library in the world, powering robotics, medical imaging, and cutting-edge AI research. With a vast community of contributors, it remains the essential toolkit for developers working with images and video.

🧐 Did you know: OpenCV started in 1999 at Intel as a research project and today it powers everything from self-driving cars to Instagram filters, with over 40,000 stars on GitHub and millions of users worldwide!

Open Source Project Security Baseline (OSPSB): Raising the bar

Security isn’t glamorous, but maintaining a healthy open source ecosystem depends on it—and that’s where the Open Source Project Security Baseline (OSPSB) comes in. OSPSB, an initiative from the OpenSSF community, gives maintainers a practical, no-nonsense checklist of what “good security” actually looks like. Instead of vague best practices, it focuses on realistic, minimum requirements that any project can meet, no matter the size of the team.

At Universe 2025, OSPSB resonated with maintainers looking for clarity in a world of shifting threats. The maturity levels and self-assessment tools make it simple to understand where your project is strong, where it needs improvement, and how users can contribute back to security work — a win for the entire ecosystem.

💡 Fun fact: OSPSB is used by hundreds of projects as a self-assessment tool, and it’s supported by the GitHub Secure Open Source Fund to help maintainers keep their software resilient.

The resilience and sustainability of open source is a shared responsibility between maintainers and users. Beyond telling consumers why they should trust your project, Baseline will also tell them where they can contribute to security improvements.

Xavier René-Corail, Senior Director, GitHub Security Research

p5.js and Processing for Creative Coding

p5.js is a beginner-friendly JavaScript library that makes coding accessible for artists, educators, and developers alike. From interactive art to generative visuals, it empowers millions to express ideas through code and brings creative coding into classrooms and communities worldwide.

Processing is an open-source programming environment designed to teach code through visual art and interactive media. Used by artists, educators, and students worldwide, it bridges technology and creativity, making programming accessible, playful, and expressive.

PixiJS: Powering graphics on the web

PixiJS is a powerful HTML5 engine for creating stunning 2D graphics on the web. Built on top of WebGL and WebGPU, it delivers one of the fastest and most flexible rendering experiences available. With an intuitive API, support for custom shaders, advanced text rendering, multi-touch interactivity, and accessibility features, PixiJS empowers developers to craft beautiful, interactive experiences that run smoothly across desktop, mobile, and beyond. With over 46,000 stars on GitHub and adoption by hundreds of global brands, PixiJS has become the go-to toolkit for building games, applications, and large-scale visualizations in the browser.

💡 Fun fact: PixiJS has been around for more than 12 years and has powered everything from hit games like Happy Wheels and Subway Surfers to immersive art installations projected onto city buildings. Developer Simone Seagle used PixiJS to bring The Met’s Open Access artworks to life, animating Kandinsky’s Violett with spring physics and transforming Monet’s water lilies into a swirling, interactive experience.

SparkJS: Splat the limits of 3D

Spark (no, not that one!) is an advanced 3D Gaussian Splatting renderer for THREE.js, letting developers blend cutting-edge research with the most popular JavaScript 3D engine on the web. Portable, fast, and surprisingly lightweight, SparkJS brings real-time splat rendering to almost any device with correct sorting, animation support, and compatibility for major splat formats like .PLY, .SPZ, and .KSPLAT.

What is Gaussian Splatting? Gaussian Splatting is a graphics technique that represents 3D objects as millions of tiny, semi-transparent ellipsoids (“splats”) instead of heavy polygon meshes. It delivers photorealistic detail, smooth surfaces, and fast real-time performance, making it a rising star in computer vision, neural rendering, and now, thanks to Spark, everyday web development.

Zulip: Conversations that scale

Zulip is the open source team chat platform built for thoughtful communication at scale. Unlike traditional chat apps where conversations quickly become noise, Zulip’s unique topic-based threading keeps discussions organized and discoverable, even days later. With integrations, bots, and clients for every platform, Zulip helps distributed teams collaborate without the chaos.

💡 Fun fact: Zulip began as a small startup in 2012, was acquired by Dropbox in 2014, and open sourced in 2015. Today it has over 1500 contributors worldwide, powering communities, classrooms, nonprofits, and companies that need conversations to stay useful.

Photo of Zulip’s booth in the GitHub Universe Open Source Zone.
From left to right: @gnprice, @alya, and @timabbott stand at the Zulip booth.

We want to thank the maintainers for participating at GitHub Universe in the Open Source Zone, and for your projects that are making our world turn. You all are what open source is about! <3

Even if you didn’t get to meet these folks at Universe, it’s never too late to check out their work. Or, you can keep powering open source by contributing to or sponsoring a project.

Want to showcase your project at GitHub Universe next year? Apply now! You’ll get two free tickets and a space on the show floor.

The post This year’s most influential open source projects appeared first on The GitHub Blog.

MCP joins the Linux Foundation: What this means for developers building the next era of AI tools and agents
https://github.blog/open-source/maintainers/mcp-joins-the-linux-foundation-what-this-means-for-developers-building-the-next-era-of-ai-tools-and-agents/
Tue, 09 Dec 2025

MCP is moving to the Linux Foundation. Here’s how that will affect developers.


Over the past year, AI development has exploded. More than 1.1 million public GitHub repositories now import an LLM SDK (+178% YoY), and developers created nearly 700,000 new AI repositories, according to this year’s Octoverse report. Agentic tools like vllm, ollama, continue, aider, ragflow, and cline are quickly becoming part of the modern developer stack.

As this ecosystem expands, we’ve seen a growing need to connect models to external tools and systems—securely, consistently, and across platforms. That’s the gap the Model Context Protocol (MCP) has rapidly filled. 

Born as an open source idea inside Anthropic, MCP grew quickly because it was open from the very beginning and designed for the community to extend, adopt, and shape together. That openness is a core reason it became one of the fastest-growing standards in the industry. That also allowed companies like GitHub and Microsoft to join in and help build out the standard.  

Now, Anthropic is donating MCP to the Agentic AI Foundation, which will be managed by the Linux Foundation, and the protocol is entering a new phase of shared stewardship. This gives developers a stable foundation for long-term tooling, production agents, and enterprise systems. It’s an exciting moment for those of us who have been involved in the MCP community, and given our long-term support of the Linux Foundation, we are hugely supportive of this move.

The past year has seen incredible growth and change for MCP. I thought it would be great to review how MCP got here and what its transition to the Linux Foundation means for the next wave of AI development.

Before MCP: Fragmented APIs and brittle integrations

LLMs started as isolated systems: you sent them prompts and got responses back. Patterns like retrieval-augmented generation (RAG) helped bring in data to give the LLM more context, but they were limited. OpenAI’s introduction of function calling was a huge change: for the first time, you could call an external function from a model. This is what we initially built on top of as part of GitHub Copilot.

By early 2023, developers were connecting LLMs to external systems through a patchwork of incompatible APIs: bespoke extensions, IDE plugins, and platform-specific agent frameworks, among other things. Every provider had its own integration story, and none of them worked in exactly the same way. 

Nick Cooper, an OpenAI engineer and MCP steering committee member, summarized it plainly: “All the platforms had their own attempts like function calling, plugin APIs, extensions, but they just didn’t get much traction.”

This wasn’t a tooling problem. It was an architecture problem.

Connecting a model to the realtime web, a database, ticketing system, search index, or CI pipeline required bespoke code that often broke with the next model update. Developers had to write deep integration glue one platform at a time.

As David Soria Parra, a senior engineer at Anthropic and one of the original architects of MCP, put it, the industry was running headfirst into an n×m integration problem with too many clients, too many systems, and no shared protocol to connect them.

In practical terms, the n×m integration problem describes a world where every model client (n) must integrate separately with every tool, service, or system developers rely on (m). This would mean five different AI clients talking to ten internal systems, resulting in fifty bespoke integrations—each with different semantics, authentication flows, and failure modes. MCP collapses this by defining a single, vendor-neutral protocol that both clients and tools can speak. With something like GitHub Copilot, where we are connecting to all of the frontier labs models and developers using Copilot, we also need to connect to hundreds of systems as part of their developer platform. This was not just an integration challenge, but an innovation challenge. 
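The arithmetic above can be sketched in a few lines. This is an illustrative back-of-the-envelope model, not real MCP code: with n clients and m systems, bespoke integrations scale as n×m, while a shared protocol scales as n+m.

```python
# Illustrative sketch of the n×m integration problem (not real MCP code):
# 5 model clients and 10 internal systems, as in the example above.
clients = [f"client_{i}" for i in range(5)]
systems = [f"system_{j}" for j in range(10)]

# Without a shared protocol, every (client, system) pair needs its own
# bespoke adapter, each with its own semantics, auth flow, and failure modes.
bespoke_adapters = [(c, s) for c in clients for s in systems]
print(len(bespoke_adapters))  # 50 adapters to build and maintain

# With a single vendor-neutral protocol, each client and each system
# implements the protocol exactly once.
protocol_implementations = len(clients) + len(systems)
print(protocol_implementations)  # 15 implementations
```

Adding a sixth client then costs one protocol implementation instead of ten new adapters, which is why the shared protocol compounds in value as the ecosystem grows.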

And the absence of a standard wasn’t just inefficient; it slowed real-world adoption. In regulated industries like finance, healthcare, and security, developers needed secure, auditable, cross-platform ways to let models communicate with systems. What they got instead were proprietary plugin ecosystems with unclear trust boundaries.

MCP: A protocol built for how developers work

Across the industry, including at Anthropic, GitHub, and Microsoft, engineers kept running into the same wall: reliably connecting models to context and tools. Inside Anthropic, teams noticed that their internal prototypes kept converging on similar patterns for requesting data, invoking tools, and handling long-running tasks.

Soria Parra described MCP’s origin simply: it was a way to standardize patterns Anthropic engineers were reinventing. MCP distilled those patterns into a protocol designed around communication, or how models and systems talk to each other, request context, and execute tools.

Anthropic’s Jerome Swanwick recalled an early internal hackathon where “every entry was built on MCP … went viral internally.”

That early developer traction became the seed. Once Anthropic released MCP publicly alongside high-quality reference servers, we saw the value immediately, and it was clear the broader community did too. MCP offered a shared way for models to communicate with external systems, regardless of client, runtime, or vendor.

Why MCP clicked: Built for real developer workflows

When MCP launched, adoption was immediate and unlike any standard I have seen before.

Developers building AI-powered tools and agents had already experienced the pain MCP solved. As Microsoft’s Den Delimarsky, a principal engineer and core MCP steering committee member focused on security and OAuth, said: “It just clicked. I got the problem they were trying to solve; I got why this needs to exist.”

Within weeks, contributors from Anthropic, Microsoft, GitHub, OpenAI, and independent developers began expanding and hardening the protocol. Over the next nine months, the community added:

  • OAuth flows for secure, remote servers
  • Sampling semantics (These help ensure consistent model behavior when tools are invoked or context is requested, giving developers more predictable execution across different MCP clients.)
  • Refined tool schemas
  • Consistent server discovery patterns
  • Expanded reference implementations
  • Improved long-running task support

Long-running task APIs are a critical feature. They allow builds, indexing operations, deployments, and other multi-minute jobs to be tracked predictably, without polling hacks or custom callback channels. This was essential for the long-running AI agent workflows that we now see today.

Delimarsky’s OAuth work also became an inflection point. Prior to it, most MCP servers ran locally, which limited usage in enterprise environments and caused installation friction. OAuth enabled remote MCP servers, unlocking secure, compliant integrations at scale. This shift is what made MCP viable for multi-machine orchestration, shared enterprise services, and non-local infrastructure.

Just as importantly, OAuth gives MCP a familiar and proven security model with no proprietary token formats or ad-hoc trust flows. That makes it significantly easier to adopt inside existing enterprise authentication stacks.

Similarly, the MCP Registry—developed in the open by the MCP community with contributions and tooling support from Anthropic, GitHub, and others—gave developers a discoverability layer and gave enterprises governance control. Toby Padilla, who leads MCP Server and Registry efforts at GitHub, described this as a way to ensure “developers can find high-quality servers, and enterprises can control what their users adopt.”

But no single company drove MCP’s trajectory. What stands out across all my conversations with the community is the sense of shared stewardship.

Cooper articulated it clearly: “I don’t meet with Anthropic, I meet with David. And I don’t meet with Google, I meet with Che.” The work was never about corporate boundaries. It was about the protocol.

This collaborative culture, reminiscent of the early days of the web, is the absolute best of open source. It’s also why, in my opinion, MCP spread so quickly.

Developer momentum: MCP enters the Octoverse

The 2025 Octoverse report, our annual deep dive into open source and public activity on GitHub, highlights an unprecedented surge in AI development:

  • 1.13M public repositories now import an LLM SDK (+178% YoY)
  • 693k new AI repositories were created this year
  • 6M+ monthly commits to AI repositories
  • Tools like vllm, ollama, continue, aider, cline, and ragflow dominated fastest-growing repos
  • Standards are emerging in real time, with MCP alone hitting 37k stars in under eight months

These signals tell a clear story: developers aren’t just experimenting with LLMs, they’re operationalizing them.

With hundreds of thousands of developers building AI agents, local runners, pipelines, and inference stacks, the ecosystem needs consistent ways to connect models to tools, services, and context.

MCP isn’t riding the wave. The protocol aligns with where developers already are and where the ecosystem is heading.

The Linux Foundation move: The protocol becomes infrastructure

As MCP adoption accelerated, the need for neutral governance became unavoidable. Openness is what drove its initial adoption, but that also demands shared stewardship—especially once multiple LLM providers, tool builders, and enterprise teams began depending on the protocol.

By transitioning governance to the Linux Foundation, Anthropic and the MCP steering committee are signaling that MCP has reached the maturity threshold of a true industry standard.

Open, vendor-neutral governance offers everyone:

1. Long-term stability

A protocol is only as strong as its longevity. Linux Foundation’s backing reduces risk for teams adopting MCP for deep integrations.

2. Equal participation

Whether you’re a cloud provider, startup, or individual maintainer, Linux Foundation governance processes support equal contribution rights and transparent evolution.

3. Compatibility guarantees

As more clients, servers, and agent frameworks rely on MCP, compatibility becomes as important as the protocol itself.

4. The safety of an open standard

In an era where AI is increasingly part of regulated workloads, neutral governance makes MCP a safer bet for enterprises.

MCP is now on the same path as technologies like Kubernetes, SPDX, GraphQL, and the CNCF stack—critical infrastructure maintained in the open.

Taken together, this move aligns with the Agentic AI Foundation’s intention to bring together multiple model providers, platform teams, enterprise tool builders, and independent developers under a shared, neutral process. 

What MCP unlocks for developers today

Developers often ask: “What do I actually get from adopting MCP?”

Here’s the concrete value as I see it:

1. One server, many clients

Expose a tool once. Use it across multiple AI clients, agents, shells, and IDEs.

No more bespoke function-calling adapters per model provider.

2. Predictable, testable tool invocation

MCP’s schemas make tool interaction debuggable and reliable, which is closer to API contracts than prompt engineering.
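To make the contrast with prompt engineering concrete, here is a rough sketch of what contract-like tool interaction looks like: a tool definition with a JSON Schema–style input contract and a minimal structural validator. The tool name, its fields, and the `validate_call` helper are all invented for illustration; they are not taken from the MCP specification.

```python
# Hypothetical tool definition in the spirit of MCP's schema-based tool
# contracts. Everything here (name, fields, helper) is illustrative only.
search_issues_tool = {
    "name": "search_issues",
    "description": "Search open issues in a repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string"},
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["repo", "query"],
    },
}

def validate_call(tool: dict, arguments: dict) -> list[str]:
    """Minimal structural check of call arguments against the tool's schema."""
    schema = tool["inputSchema"]
    errors = []
    # Every required field must be present...
    for field in schema.get("required", []):
        if field not in arguments:
            errors.append(f"missing required field: {field}")
    # ...and every supplied field must have the declared type.
    type_map = {"string": str, "integer": int}
    for field, value in arguments.items():
        spec = schema["properties"].get(field)
        if spec and not isinstance(value, type_map[spec["type"]]):
            errors.append(f"wrong type for {field}")
    return errors

# A well-formed call passes; a malformed one fails deterministically,
# before any model or network round trip happens.
print(validate_call(search_issues_tool, {"repo": "octo/hello", "query": "bug"}))
print(validate_call(search_issues_tool, {"query": "bug", "limit": "ten"}))
```

Because validation failures are plain data rather than free-form model output, they can be asserted on in unit tests and surfaced in logs, which is what makes tool calls debuggable in the same way API contracts are.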

3. A protocol for agent-native workloads

As Octoverse shows, agent workflows are moving into mainstream engineering:

  • 1M+ agent-authored pull requests via GitHub Copilot coding agent alone in the five months since it was released
  • Rapid growth of key AI projects like vllm and ragflow
  • Local inference tools exploding in popularity

Agents need structured ways to call tools and fetch context. MCP provides exactly that.

4. Secure, remote execution

OAuth and remote-server support mean MCP works for:

  • Enterprises
  • Regulated workloads
  • Multi-machine orchestration
  • Shared internal tools

5. A growing ecosystem of servers

With a growing set of community and vendor-maintained MCP servers (and more added weekly), developers can connect to:

  • Issue trackers
  • Code search and repositories
  • Observability systems
  • Internal APIs
  • Cloud services
  • Personal productivity tools

Soria Parra emphasized that MCP isn’t just for LLMs calling tools. It can also invert the workflow by letting developers use a model to understand their own complex systems.

6. It matches how developers already build software

MCP aligns with developer habits:

  • Schema-driven interfaces (JSON Schema–based)
  • Reproducible workflows
  • Containerized infrastructure
  • CI/CD environments
  • Distributed systems
  • Local-first testing

Most developers don’t want magical behavior—they want predictable systems. MCP meets that expectation.

MCP also intentionally mirrors patterns developers already know from API design, distributed systems, and standards evolution—favoring predictable, contract-based interactions over “magical” model behaviors.

What happens next

The Linux Foundation announcement is the beginning of MCP’s next phase, and the move signals:

  • Broader contribution
  • More formal governance
  • Deeper integration into agent frameworks
  • Cross-platform interoperability
  • An expanding ecosystem of servers and clients

Given the global developer growth highlighted in Octoverse—36M new developers on GitHub alone this year—the industry needs shared standards for AI tooling more urgently than ever.

MCP is poised to be part of that future. It’s a stable, open protocol that lets developers build agents, tools, and workflows without vendor lock-in or proprietary extensions.

The next era of software will be shaped not just by models, but by how models interact with systems. MCP is becoming the connective tissue for that interaction.

And with its new home in the Linux Foundation, that future now belongs to the community.

Explore the MCP specification and the GitHub MCP Registry to join the community working on the next phase of the protocol.

The post MCP joins the Linux Foundation: What this means for developers building the next era of AI tools and agents appeared first on The GitHub Blog.
