The latest on LLMs - The GitHub Blog
https://github.blog/ai-and-ml/llms/
Updates, ideas, and inspiration from GitHub to help developers build and design software.

Under the hood: Security architecture of GitHub Agentic Workflows
https://github.blog/ai-and-ml/generative-ai/under-the-hood-security-architecture-of-github-agentic-workflows/
Mon, 09 Mar 2026 16:00:00 +0000
GitHub Agentic Workflows are built with isolation, constrained outputs, and comprehensive logging. Learn how our threat model and security architecture help teams run agents safely in GitHub Actions.

The post Under the hood: Security architecture of GitHub Agentic Workflows appeared first on The GitHub Blog.


Whether you’re an open-source maintainer or part of an enterprise team, waking up to documentation fixes, new unit tests, and refactoring suggestions can be a true “aha” moment. But automation also raises an important concern: how do you put guardrails on agents that have access to your repository and the internet? Will you find yourself wondering whether your agent relied on documentation from a sketchy website, or pushed a commit containing an API token? What if it decides to add noisy comments to every open issue one day? Automations must be predictable to offer durable value.

But what is the safest way to add agents to existing automations like CI/CD? Agents are non-deterministic: They must consume untrusted inputs, reason over repository state, and make decisions at runtime. Letting agents operate in CI/CD without real-time supervision allows you to scale your software engineering, but it also requires novel guardrails to keep you from creating security problems.

GitHub Agentic Workflows run on top of GitHub Actions. By default, everything in an action runs in the same trust domain. Rogue agents can interfere with MCP servers, access authentication secrets, and make network requests to arbitrary hosts. A buggy or prompt-injected agent with unrestricted access to these resources can act in unexpected and insecure ways.

That’s why security is baked into the architecture of GitHub Agentic Workflows. We treat agent execution as an extension of the CI/CD model—not as a separate runtime. We separate open‑ended authoring from governed execution, then compile a workflow into a GitHub Action with explicit constraints such as permissions, outputs, auditability, and network access.

This post explains how we built Agentic Workflows with security in mind from day one, starting with the threat model and the security architecture that it needs.

Threat model

There are two properties of agentic workflows that change the threat model for automation.

First, agents’ ability to reason over repository state and act autonomously makes them valuable, but it also means they cannot be trusted by default—especially in the presence of untrusted inputs.

Second, GitHub Actions provide a highly permissive execution environment. A shared trust domain is a feature for deterministic automation, enabling broad access, composability, and good performance. But when combined with untrusted agents, having a single trust domain can create a large blast radius if something goes wrong.

Under this model, we assume an agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. By default, GitHub Agentic Workflows run in a strict security mode with this threat model in mind, and their design is guided by four security principles: defense in depth, don’t trust agents with secrets, stage and vet all writes, and log everything.

Defend in depth

GitHub Agentic Workflows provide a layered security architecture consisting of substrate, configuration, and planning layers. Each layer limits the impact of failures above it by enforcing distinct security properties that are consistent with its assumptions.

Diagram of a three-layer system architecture with labeled sections Planning layer, Configuration layer, and Substrate layer. Each layer contains three blue tiles:

Planning: Safe Outputs MCP (GitHub write operations), Call filtering (call availability, volume), Output sanitization (secret removal, moderation).
Configuration: Compiler (GH AW extension), Firewall policies (allowlist), MCP config (Docker image, auth token).
Substrate: Action runner VM (OS, hypervisor), Docker containers (Docker daemon, network), Trusted containers (firewall, MCP gateway, API proxy).

The substrate layer rests on a GitHub Actions runner virtual machine (VM) and several trusted containers that limit the resources an agent can access. Collectively, the substrate layer provides isolation among components, mediation of privileged operations and system calls, and kernel-enforced communication boundaries. These protections hold even if an untrusted user-level component is compromised and executes arbitrary code within its container isolation boundary.

Above the substrate layer is a configuration layer that includes declarative artifacts and the toolchains that interpret them to instantiate a secure system structure and connectivity. The configuration layer dictates which components are loaded, how components are connected, what communication channels are permitted, and what privileges are assigned. Externally minted tokens, such as agent API keys and GitHub access tokens, are critical inputs that bound components’ external effects—configuration controls which tokens are loaded into which containers.

The final layer of defense is the planning layer. The configuration layer dictates which components exist and how they communicate, but it does not dictate which components are active over time. The planning layer’s primary responsibility is to create a staged workflow with explicit data exchanges between stages. The safe outputs subsystem, which will be described in greater detail below, is the primary instance of secure planning.

Don’t trust agents with secrets

From the beginning, we wanted workflow agents to have zero access to secrets. Agentic workflows execute as GitHub Actions, in which components share a single trust domain on top of the runner VM. In that model, sensitive material like agent authentication tokens and MCP server API keys reside in environment variables and configuration files visible to all processes in the VM.

This is dangerous because agents are susceptible to prompt injection: Attackers can craft malicious inputs like web pages or repository issues that trick agents into leaking sensitive information. For example, a prompt-injected agent with access to shell-command tools can read configuration files, SSH keys, Linux /proc state, and workflow logs to discover credentials and other secrets. It can then upload these secrets to the web or encode them within public-facing GitHub objects like repository issues, pull requests, and comments.

Our first mitigation was to isolate the agent in a dedicated container with tightly controlled egress: firewalled internet access, MCP access through a trusted MCP gateway, and LLM API calls through an API proxy. To limit internet access, agentic workflows create a private network between the agent and firewall. The MCP gateway runs in a separate trusted container, launches MCP servers, and has exclusive access to MCP authentication material.
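Conceptually, the firewall reduces each outbound connection to a policy decision against an allowlist. Here is a minimal sketch of that decision, assuming a simple model of exact hostnames plus `*.` wildcard entries; it illustrates the idea only and is not the actual gh-aw firewall:

```python
# Illustrative egress policy check -- NOT the real gh-aw firewall.
# Assumes allowlist entries are either exact hostnames or "*." wildcards.

def is_egress_allowed(host: str, allowlist: list[str]) -> bool:
    """Return True if `host` matches an allowlist entry."""
    for entry in allowlist:
        if entry.startswith("*."):
            # Wildcard: match any subdomain of the suffix (entry[1:] keeps the dot).
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            return True
    return False

policy = ["api.github.com", "*.githubusercontent.com"]
is_egress_allowed("raw.githubusercontent.com", policy)  # matches the wildcard
is_egress_allowed("evil.example.net", policy)           # denied
```

A production firewall must also handle concerns this sketch ignores, such as DNS resolution and redirects, which is part of why it runs in its own trusted container rather than inside the agent's.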

Although agents like Claude, Codex, and Copilot must communicate with an LLM over an authenticated channel, we avoid exposing those tokens directly to the agent’s container. Instead, we place LLM auth tokens in an isolated API proxy and configure agents to route model traffic through that proxy.
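One way to picture the proxy is as a credential rewrite at the trust boundary: whatever headers the agent sends, the proxy drops any authorization material and attaches the real token, which only exists on the trusted side. A hypothetical sketch (the function name and shape are ours, not gh-aw internals):

```python
# Hypothetical header rewrite performed by an API proxy: the LLM token
# never enters the agent container, so the agent cannot leak it.

def proxy_outbound_headers(client_headers: dict[str, str],
                           server_token: str) -> dict[str, str]:
    # Drop anything the (untrusted) agent put in the Authorization header.
    headers = {k: v for k, v in client_headers.items()
               if k.lower() != "authorization"}
    # Attach the real credential, held only by the trusted proxy.
    headers["Authorization"] = f"Bearer {server_token}"
    return headers

out = proxy_outbound_headers({"Content-Type": "application/json"}, "real-token")
# out["Authorization"] == "Bearer real-token"
```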

Architecture diagram showing several connected Docker containers. A Codex token connects to an api-proxy container, which connects to an OpenAI service icon. A separate flow shows an agent container (linked to chroot/host) communicating over http to a gh-aw-firewall container, then over http to a gh-aw-mcpg container (linked to Host Docker Socket), then over stdio to a GitHub MCP container (linked to a GitHub PAT). A GitHub icon appears above the GitHub MCP container.

Zero-secret agents require a fundamental trade-off between security and utility. Coding workloads require broad access to compilers, interpreters, scripts, and repository state, but expanding the in-container setup would duplicate existing actions provisioning logic and increase the set of network destinations that must be allowed through the firewall.

Instead, we carefully expose host files and executables using container volume mounts and run the agent in a chroot jail. We start by mounting the entire VM host file system read-only at /host. We then overlay selected paths with empty tmpfs layers and launch the agent in a chroot jail rooted at /host. This approach keeps the host-side setup intact while constraining the agent’s writable and discoverable surface to what it needs for its job.

Stage and vet all writes

Prompt-injected agents can still do harm even if they do not have access to secrets. For example, a rogue agent could spam a repository with pointless issues and pull requests to overwhelm repository maintainers, or add objectionable URLs and other content in repository objects.

To prevent this kind of behavior, the agentic workflows compiler decomposes workflows into explicit stages and defines, for each stage:

  • The active components and permissions (read vs. write)
  • The data artifacts emitted by that stage
  • The admissible downstream consumers of those artifacts

While the agent runs, it can read GitHub state through the GitHub MCP server and can only stage its updates through the safe outputs MCP server. Once the agent exits, write operations that have been buffered by the safe outputs MCP server are processed by a suite of safe outputs analyses.

Diagram showing a GitHub-centric workflow with green arrows and two rows of components. At the top, a GitHub icon points down into three boxes: Agent (Untrusted), GitHub MCP (Read-only), and MCP config (Write-buffered). Below are three processing steps labeled Filter operations, Moderate content, and Remove secrets, each marked 'Deterministic analysis.' Green arrows indicate data flow from GitHub into the system, down through configuration to 'Remove secrets,' then left through 'Moderate content' and 'Filter operations,' looping back toward the agent.

First, safe outputs allow workflow authors to specify which write operations an agent can perform. Authors can choose which subset of GitHub updates are allowed, such as creating issues, comments, or pull requests. Second, safe outputs limits the number of updates that are allowed, such as restricting an agent to creating at most three pull requests in a given run. Third, safe outputs analyzes update content to remove unwanted patterns, such as output sanitization to remove URLs. Only artifacts that pass through the entire safe outputs pipeline can be passed on, ensuring that each stage’s side effects are explicit and vetted.
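Read as a toy model, the pipeline above is deterministic code, not another agent: filter operation types, cap volume per type, then sanitize content. The operation names, limits, and URL rule below are illustrative stand-ins, not the real safe-outputs schema:

```python
import re

# Illustrative stand-ins for a workflow author's configuration.
ALLOWED_OPS = {"create-issue", "add-comment"}       # permitted write operations
MAX_PER_OP = {"create-issue": 3, "add-comment": 5}  # volume limits per run

def vet_outputs(staged: list[dict]) -> list[dict]:
    """Apply the three safe-outputs checks to buffered write operations."""
    vetted, counts = [], {}
    for op in staged:
        kind = op["type"]
        if kind not in ALLOWED_OPS:                 # 1. filter operations
            continue
        counts[kind] = counts.get(kind, 0) + 1
        if counts[kind] > MAX_PER_OP[kind]:         # 2. limit volume
            continue
        body = re.sub(r"https?://\S+", "[link removed]", op["body"])
        vetted.append({"type": kind, "body": body}) # 3. sanitize content
    return vetted

vet_outputs([
    {"type": "create-pull-request", "body": "not allowed here"},
    {"type": "create-issue", "body": "see https://sketchy.example"},
])
# -> [{"type": "create-issue", "body": "see [link removed]"}]
```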

Log everything

Even with zero secrets and vetted writes, an agent can still transform repository data and invoke tools in unintended ways or try to break out of the constraints that we impose upon it. Agents are determined to accomplish their tasks by any means and have a surprisingly deep toolbox of tricks for doing so. If an agent behaves unexpectedly, post-incident analysis requires visibility into the complete execution path.

Agentic workflows make observability a first-class property of the architecture by logging extensively at each trust boundary. Network and destination-level activity is recorded at the firewall layer; model request/response metadata and authenticated requests are captured by the API proxy; and tool invocations are logged by the MCP gateway and MCP servers. We also add internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses. Together, these logs support end-to-end forensic reconstruction, policy validation, and rapid detection of anomalous agent behavior.
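The idea of logging at every trust boundary can be sketched as a wrapper that records an audit event before forwarding each tool call. The names here are hypothetical; real gh-aw logging happens inside the firewall, proxy, and gateway components themselves:

```python
import functools, json, time

AUDIT_LOG: list[str] = []  # stand-in for a per-run log file

def audited(tool_name: str):
    """Decorator that records an audit event before forwarding a call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            AUDIT_LOG.append(json.dumps(
                {"ts": time.time(), "tool": tool_name, "args": repr(args)}))
            return fn(*args, **kwargs)
        return inner
    return wrap

@audited("list_issues")
def list_issues(repo: str) -> list:
    return []  # stand-in for a real MCP tool invocation

list_issues("octo/repo")  # leaves a JSON audit record behind
```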

Pervasive logging also lays the foundation for future information-flow controls. Every location where communication can be observed is also a location where it can be mediated. Agentic workflows already support the GitHub MCP server’s lockdown mode, and in the coming months, we’ll introduce additional safety controls that enforce policies across MCP servers based on visibility (public vs. private) and the role of a repository object’s author.

What’s next?

We’d love for you to be involved! Share your thoughts in the Community discussion or join us (and tons of other awesome makers) in the #agentic-workflows channel of the GitHub Next Discord. We look forward to seeing what you build with GitHub Agentic Workflows. Happy automating, and keep an eye out for more updates!

Automate repository tasks with GitHub Agentic Workflows
https://github.blog/ai-and-ml/automate-repository-tasks-with-github-agentic-workflows/
Fri, 13 Feb 2026 14:00:00 +0000
Discover GitHub Agentic Workflows, now in technical preview. Build automations using coding agents in GitHub Actions to handle triage, documentation, code quality, and more.

The post Automate repository tasks with GitHub Agentic Workflows   appeared first on The GitHub Blog.


Imagine visiting your repository in the morning and feeling calm because you see:

  • Issues triaged and labeled
  • CI failures investigated with proposed fixes
  • Documentation updated to reflect recent code changes
  • Two new pull requests that improve testing awaiting your review

All of it visible, inspectable, and operating within the boundaries you’ve defined.

That’s the future powered by GitHub Agentic Workflows: automated, intent-driven repository workflows that run in GitHub Actions, authored in plain Markdown and executed with coding agents. They’re designed for people working in GitHub, from individuals automating a single repo to teams operating at enterprise or open-source scale.

At GitHub Next, we began GitHub Agentic Workflows as an investigation into a simple question: what does repository automation with strong guardrails look like in the era of AI coding agents? A natural place to start was GitHub Actions, the heart of scalable repository automation on GitHub. By bringing automated coding agents into actions, we can enable their use across millions of repositories, while keeping decisions about when and where to use them in your hands.

GitHub Agentic Workflows are now available in technical preview. In this post, we’ll explain what they are and how they work. We invite you to put them to the test, to explore where repository-level AI automation delivers the most value.

Graphic showing quotes from customers. “Home Assistant has thousands of open issues. No human can track what’s trending or which problems affect the most users. I’ve built GitHub Agentic Workflows that analyze issues and surface what matters: that’s the kind of judgment amplification that actually helps maintainers.” - Franck Nijhof, lead of the Home Assistant project, one of the top projects on GitHub by contributor count.

Agentic workflows also allow maintainers and community to experiment with repository automation together. “Adopting GitHub’s Agentic Workflows has lowered the barrier for experimentation with AI tooling, making it significantly easier for staff, maintainers and newcomers alike. Inside of CNCF, we are benefiting from improved documentation automation along with improving team reporting across the organization. This isn’t just a technical upgrade for our community, it’s part of a cultural shift that empowers our ecosystem to innovate faster with AI and agentic tooling.” - Chris Aniszczyk, CTO of the Cloud Native Computing Foundation (CNCF), whose mission is to make cloud native computing ubiquitous across the world.

Enterprises are seeing similar benefits at scale. “With GitHub Agentic Workflows, we’re able to expand how we apply agents to real engineering work at scale, including changes that span multiple repositories. The flexibility and built-in controls give us confidence to leverage Agentic Workflows across complex systems at Carvana.” - Alex Devkar, Senior Vice President, Engineering and Analytics, at Carvana.

AI repository automation: A revolution through simplicity 

The concept behind GitHub Agentic Workflows is straightforward: you describe the outcomes you want in plain Markdown, add this as an automated workflow to your repository, and it executes using a coding agent in GitHub Actions.

This brings the power of coding agents into the heart of repository automation. Agentic workflows run as standard GitHub Actions workflows, with added guardrails for sandboxing, permissions, control, and review. When they execute, they can use different coding agent engines—such as Copilot CLI, Claude Code, or OpenAI Codex—depending on your configuration.

The use of GitHub Agentic Workflows makes entirely new categories of repository automation and software engineering possible, in a way that fits naturally with how developer teams already work on GitHub. All of them would be difficult or impossible to accomplish with traditional YAML workflows alone:

  1. Continuous triage: automatically summarize, label, and route new issues.
  2. Continuous documentation: keep READMEs and documentation aligned with code changes.
  3. Continuous code simplification: repeatedly identify code improvements and open pull requests for them.
  4. Continuous test improvement: assess test coverage and add high-value tests.
  5. Continuous quality hygiene: proactively investigate CI failures and propose targeted fixes.
  6. Continuous reporting: create regular reports on repository health, activity, and trends.

These are just a few examples of repository automations that showcase the power of GitHub Agentic Workflows. We call this Continuous AI: the integration of AI into the SDLC, enhancing automation and collaboration similar to continuous integration and continuous deployment (CI/CD) practices.

GitHub Agentic Workflows and Continuous AI are designed to augment existing CI/CD rather than replace it. They do not replace build, test, or release pipelines, and their use cases largely do not overlap with deterministic CI/CD workflows. Agentic workflows run on GitHub Actions because that is where GitHub provides the necessary infrastructure for permissions, logging, auditing, sandboxed execution, and rich repository context.

In our own usage at GitHub Next, we’re finding new uses for agentic workflows nearly every day. Throughout GitHub, teams have been using agentic workflows to create custom tools for themselves in minutes, replacing chores with intelligence or paving the way for humans to get work done by assembling the right information, in the right place, at the right time. A new world of possibilities is opening for teams and enterprises to keep their repositories healthy, navigable, and high-quality.

Let’s talk guardrails and control 

Designing for safety and control is non-negotiable. GitHub Agentic Workflows implements a defense-in-depth security architecture that protects against unintended behaviors and prompt-injection attacks.

Workflows run with read-only permissions by default. Write operations require explicit approval through safe outputs, which map to pre-approved, reviewable GitHub operations such as creating a pull request or adding a comment to an issue. Sandboxed execution, tool allowlisting, and network isolation help ensure that coding agents operate within controlled boundaries.

Guardrails like these make it practical to run agents continuously, not just as one-off experiments. See our security architecture for more details.

One alternative approach to agentic repository automation is to run coding agent CLIs, such as Copilot or Claude, directly inside a standard GitHub Actions YAML workflow. This approach often grants these agents more permission than is required for a specific task. In contrast, GitHub Agentic Workflows run coding agents with read-only access by default and rely on safe outputs for GitHub operations, providing tighter constraints, clearer review points, and stronger overall control.

A simple example: A daily repo report  

Let’s look at an agentic workflow that creates a daily status report for repository maintainers.

In practice, you will usually use AI assistance to create your workflows. The easiest way to do this is with an interactive coding agent. For example, with your favorite coding agent, you can enter this prompt:

Generate a workflow that creates a daily repo status report for a maintainer. Use the instructions at https://github.com/github/gh-aw/blob/main/create.md

The coding agent will interact with you to confirm your specific needs and intent, write the Markdown file, and check its validity. You can then review, refine, and validate the workflow before adding it to your repository.

This will create two files in .github/workflows:

  • daily-repo-status.md (the agentic workflow)  
  • daily-repo-status.lock.yml (the corresponding agentic workflow lock file, which is executed by GitHub Actions) 

The file daily-repo-status.md will look like this: 

--- 
on: 
  schedule: daily 
 
permissions: 
  contents: read 
  issues: read 
  pull-requests: read 
 
safe-outputs: 
  create-issue: 
    title-prefix: "[repo status] " 
    labels: [report] 
 
tools: 
  github: 
---  
 
# Daily Repo Status Report 
 
Create a daily status report for maintainers. 
 
Include 
- Recent repository activity (issues, PRs, discussions, releases, code changes) 
- Progress tracking, goal reminders and highlights 
- Project status and recommendations 
- Actionable next steps for maintainers 
 
Keep it concise and link to the relevant issues/PRs.

This file has two parts: 

  1. Frontmatter (YAML between --- markers) for configuration 
  2. Markdown instructions that describe the job in natural language

The Markdown is the intent, but the trigger, permissions, tools, and allowed outputs are spelled out up front.

If you prefer, you can add the workflow to your repository manually: 

  1. Create the workflow: Add  daily-repo-status.md with the frontmatter and instructions.
  2. Create the lock file:  
    • gh extension install github/gh-aw  
    • gh aw compile
  3. Commit and push: Commit and push files to your repository.
  4. Add any required secrets: For example, add a token or API key for your coding agent.

Once you add this workflow to your repository, it will run automatically or you can trigger it manually using GitHub Actions. When the workflow runs, it creates a status report issue like this:

Screenshot of a GitHub issue titled "Daily Repo Report - February 9, 2026" showing key highlights, including 2 new releases, 1,737 commits from 16 contributors, 100 issues closed with 190 new issues opened, 50 pull requests merged from 93 opened pull requests, and 5 code quality issues opened.

What you can build with GitHub Agentic Workflows 

If you’re looking for further inspiration, Peli’s Agent Factory is a guided tour through a wide range of workflows, with practical patterns you can adapt, remix, and standardize across repos.

A useful mental model: if repetitive work in a repository can be described in words, it might be a good fit for an agentic workflow.

If you’re looking for design patterns, check out ChatOps, DailyOps, DataOps, IssueOps, ProjectOps, MultiRepoOps, and Orchestration.

Uses for agent-assisted repository automation often depend on particular repos and development priorities. Your team’s approach to software development will differ from those of other teams. It pays to be imaginative about how you can use agentic automation to augment your team, tailored to your repositories and your goals.

Practical guidance for teams 

Agentic workflows bring a shift in thinking. They work best when you focus on goals and desired outputs rather than perfect prompts. You provide clarity on what success looks like, and allow the workflow to explore how to achieve it. Some boundaries are built into agentic workflows by default, and others are ones you explicitly define. This means the agent can explore and reason, but its conclusions always stay within safe, intentional limits.

You will find that your workflows can range from very general (“Improve the software”) to very specific (“Check that all technical documentation and error messages for this educational software are written in a style suitable for an audience of age 10 or above”). You can choose the level of specificity that’s appropriate for your team.

GitHub Agentic Workflows use coding agents at runtime, which incur billing costs. When using Copilot with default settings, each workflow run typically incurs two premium requests: one for the agentic work and one for a guardrail check through safe outputs. The models used can be configured to help manage these costs. Today, automated uses of Copilot are associated with a user account. For other coding agents, refer to our documentation for details. Here are a few more tips to help teams get value quickly:

  • Start with low-risk outputs such as comments, drafts, or reports before enabling pull request creation.
  • For coding, start with goal-oriented improvements such as routine refactoring, test coverage, or code simplification rather than feature work.
  • For reports, use instructions that are specific about what “good” looks like, including format, tone, links, and when to stop.
  • Agentic workflows create an agent-only sub-loop that can run autonomously because agents act under defined terms. But it’s important that humans stay in the broader loop of forward progress in the repository, through reports, issues, and pull requests. With GitHub Agentic Workflows, pull requests are never merged automatically, and humans must always review and approve.
  • Treat the workflow Markdown as code. Review changes, keep it small, and evolve it intentionally.

Continuous AI works best if you use it in conjunction with CI/CD. Don’t use agentic workflows as a replacement for GitHub Actions YAML workflows for CI/CD. This approach extends continuous automation to more subjective, repetitive tasks that traditional CI/CD struggles to express.

Build the future of automation with us   

GitHub Agentic Workflows are available now in technical preview and are a collaboration between GitHub, Microsoft Research, and Azure Core Upstream. We invite you to try them out and help us shape the future of repository automation.

We’d love for you to be involved! Share your thoughts in the Community discussion, or join us (and tons of other awesome makers) in the #agentic-workflows channel of the GitHub Next Discord. We look forward to seeing what you build with GitHub Agentic Workflows. Happy automating!

Try GitHub Agentic Workflows in a repo today! Install gh-aw, add a starter workflow or create one using AI, and run it. Then, share what you build (and what you want next)!

Why AI is pushing developers toward typed languages
https://github.blog/ai-and-ml/llms/why-ai-is-pushing-developers-toward-typed-languages/
Thu, 08 Jan 2026 22:25:54 +0000
AI is settling the “typed vs. untyped” debate by turning type systems into the safety net for code you didn’t write yourself.

The post Why AI is pushing developers toward typed languages appeared first on The GitHub Blog.


It’s a tale as old as time: tabs vs. spaces, dark mode vs. light mode, typed languages vs. untyped languages. It all depends!

But as developers use AI tools, not only are they choosing the more popular (thus more trained into the model) libraries and languages, they are also using tools that reduce risk. When code comes not just from developers, but also from their AI tools, reliability becomes a much bigger part of the equation. 

Typed vs. untyped

Dynamic languages like Python and JavaScript make it easy to move quickly when building, and developers who argue for those languages push for the speed and flexibility they provide. But that agility lacks the safety net you get with typed languages.

Untyped code is not gone, and can still be great. I love, personally, that I can just write code and not define every aspect of something on my average side project. But when you don’t control every line of code, subtle errors can pass unchecked. That’s when the types-driven safety net concept becomes a lot more appealing, and even necessary. AI just increases the volume of “code you didn’t personally write,” which raises the stakes.

Type systems fill a unique role: surfacing ambiguous logic and mismatches between expected inputs and outputs. They ensure that code from any source can conform to project standards. They’ve basically become a shared contract between developers, frameworks, and AI tools that are generating more and more scaffolding and boilerplate for developers.

With AI tools and agents producing larger volumes of code and features than ever, it only makes sense that reliability is more critical. And… that is where typed languages win the debate. Not because untyped languages are “bad,” but because types catch the exact class of surprises that AI-generated code can sometimes introduce.

Is type safety that big of a deal?

Yes!

Next question.

But actually though, a 2025 academic study found that a whopping 94% of LLM-generated compilation errors were type-check failures. Imagine how much your projects would improve if 94% of your failures went away! Your life would be better. Your skin would clear. You’d get taller. Or at least you’d have fewer “why does this return a string now?” debugging sessions.
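To make the failure class concrete, here is the kind of silent surprise a checker catches. This tiny example is ours, not from the study; run it with Python and nothing errors, even though the second result is almost certainly a bug:

```python
def total(prices):            # untyped: accepts anything that supports `+`
    result = prices[0]
    for p in prices[1:]:
        result = result + p
    return result

total([1, 2, 3])        # 6
total(["1", "2", "3"])  # "123" -- concatenation, not a sum, and no error

# With an annotation -- def total(prices: list[int]) -> int: -- a static
# checker such as mypy or pyright rejects the second call before it runs.
```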

What Octoverse 2025 says about the rise of typed languages

Octoverse 2025 confirmed it: TypeScript is now the most used language on GitHub, overtaking both Python and JavaScript as of August 2025. TypeScript grew by over 1 million contributors in 2025 (+66% YoY, Aug ‘25 vs. Aug ‘24) with an estimated 2.6 million developers total. This was driven, in part, by frameworks that scaffold projects in TypeScript by default (like Astro, Next.js, and Angular). But the report also found correlative evidence that TypeScript’s rise got a boost from AI-assisted development.

That means AI is influencing not only how fast code is written, but which languages and tools developers use. And typed ecosystems are benefiting too, because they help AI slot new code into existing projects without breaking assumptions. 

It’s not just TypeScript. Other typed languages are growing fast, too! 

Luau, Roblox’s scripting language, saw >194% YoY growth as a gradually typed language. Typst, often compared to LaTeX, but with functional design and strong typing, saw >108% YoY growth. Even older languages like Java, C++, and C# saw more growth than ever in this year’s report.

That means gradual typing, optional typing, and strong typing are all seeing momentum—and each offers different levels of guardrails depending on what you’re building and how much you want AI to automate.  

Where do we go from here?

Type systems don’t replace dynamic languages. But, they have become a common safety feature for developers working with and alongside AI coding tools for a reason. As we see AI-assisted development and agent development increase in popularity, we can expect type systems to become even more central to how we build and ship reliable software.

Static types help ensure that code is more trustworthy and more maintainable. They give developers a shared, predictable structure. That reduction in surprises means you can be in the flow (pun intended!) more.

Looking to stay one step ahead? Read the latest Octoverse report and try Copilot CLI.

The post Why AI is pushing developers toward typed languages appeared first on The GitHub Blog.

Solving the inference problem for open source AI projects with GitHub Models https://github.blog/ai-and-ml/llms/solving-the-inference-problem-for-open-source-ai-projects-with-github-models/ Wed, 23 Jul 2025 16:00:00 +0000 https://github.blog/?p=89716 How using GitHub’s free inference API can make your AI-powered open source software more accessible.

The post Solving the inference problem for open source AI projects with GitHub Models appeared first on The GitHub Blog.


AI features can make an open source project shine. At least, until setup asks for a paid inference API key.  Requiring contributors or even casual users to bring their own large language model (LLM) key stops adoption in its tracks:

$ my-cool-ai-tool
Error: OPENAI_API_KEY not found

Developers may not want to buy a paid plan just to try out your tool, and self-hosting a model can be too heavy for laptops or GitHub Actions runners.

GitHub Models solves that friction with a free, OpenAI-compatible inference API that every GitHub account can use with no new keys, consoles, or SDKs required. In this article, we’ll show you how to drop it into your project, run it in CI/CD, and scale when your community takes off.

Let’s jump in.

The hidden cost of “just add AI”

AI features feel ubiquitous today, but getting them running locally is still a challenge for a few reasons:

  • Paid APIs: The simplest path is to ask users for an OpenAI or Anthropic key. That’s a non-starter for many hobbyists and students because paid APIs are too expensive.
  • Local models: Running a 2B-parameter LLM can work for lightweight tasks, but anything that requires more intelligence will quickly blow past typical laptop memory — let alone the 14 GB container that backs a GitHub Actions runner.
  • Docker images and weights: You can bundle a model with your app, but distributing multi-gigabyte weights balloons install size and slows CI.

Every additional requirement filters out potential users and contributors. What you need is an inference endpoint that’s:

  1. Free for public projects
  2. Compatible with existing OpenAI SDKs
  3. Available wherever your code runs, like your laptop, server, or Actions runner

That’s what GitHub Models provides.

GitHub Models in a nutshell

  • What it is: A REST endpoint that speaks the chat/completions spec you already know.
  • What you get: A curated set of models (GPT-4o, DeepSeek-R1, Llama 3, and more) hosted by GitHub.
  • Who can call it: Anyone with a GitHub Personal Access Token (PAT), or a repository’s built-in GITHUB_TOKEN when you opt-in via permissions.
  • How much it costs: Free tier for all personal accounts and OSS orgs; metered paid tier unlocks higher throughput and larger context windows.

Because the API mirrors OpenAI’s, any client that accepts a baseURL will work without code changes. This includes OpenAI-JS, OpenAI Python, LangChain, llama.cpp, or your own curl script.

How to get started with GitHub Models

Since GitHub Models is compatible with the OpenAI chat/completions API, almost every inference SDK can use it. To get started, you can use the OpenAI SDK:

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://models.github.ai/inference", // the SDK appends /chat/completions
  apiKey: process.env.GITHUB_TOKEN  // or any PAT with models:read
});

const res = await openai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hi!" }]
});
console.log(res.choices[0].message.content);

If you write your AI open source software with GitHub Models as an inference provider, all GitHub users will be able to get up and running with it just by supplying a GitHub Personal Access Token (PAT).

And if your software runs in GitHub Actions, your users won’t even need to supply a PAT. By requesting the models: read permission in your workflow file, the built-in GitHub token will have permissions to make inference requests to GitHub Models. This means you can build a whole array of AI-powered Actions that can be shared and installed with a single click. For instance:

  • Code review or PR triage bots
  • Smart issue tagging workflows
  • Weekly repository activity report generators
  • And anything else that a GitHub Action can do
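As a rough sketch of what such an Action’s script might do, the helper below builds a chat-completions request for the GitHub Models endpoint. The function shape, prompt, and label-suggestion idea are hypothetical illustrations; the endpoint URL and model name come from this post, and the `Bearer` authorization header follows the usual OpenAI-compatible convention:

```python
import json
import os

# Endpoint for direct REST calls to GitHub Models (OpenAI-compatible).
MODELS_URL = "https://models.github.ai/inference/chat/completions"

def build_triage_request(issue_title: str, issue_body: str):
    """Build (url, headers, body) for a hypothetical label-suggestion request.

    In a workflow that requests the models: read permission, the built-in
    GITHUB_TOKEN can be passed in as an environment variable; no extra
    API key is needed.
    """
    payload = {
        "model": "openai/gpt-4o",  # any model available in GitHub Models
        "messages": [
            {"role": "system", "content": "Suggest a single label for this issue."},
            {"role": "user", "content": f"{issue_title}\n\n{issue_body}"},
        ],
    }
    headers = {
        "Authorization": f"Bearer {os.environ.get('GITHUB_TOKEN', '')}",
        "Content-Type": "application/json",
    }
    return MODELS_URL, headers, json.dumps(payload)

url, headers, body = build_triage_request("App crashes on start", "Stack trace attached.")
print(url)
```

Sending the request with any HTTP client, then posting the suggested label back through the Issues API, would complete the bot.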

Plus, using GitHub Models makes it easy for your users to set up AI inference. And that has another positive effect: it’s easier for your contributors to set up AI inference as well. When anyone with a GitHub account can run your code end to end, you’ll be able to get contributions from the whole range of GitHub users, not just the ones with an OpenAI key.

Zero-configuration CI with GitHub Actions

Publishing an Action that relies on AI used to require users to add their inference API key as a GitHub Actions secret. Now you can ship a one-click install:


# .github/workflows/triage.yml
on:
  issues:
    types: [opened]

permissions:
  contents: read
  issues: write
  models: read   # 👈 unlocks GitHub Models for the GITHUB_TOKEN

jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Smart issue triage
        run: node scripts/triage.js
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

The runner’s GITHUB_TOKEN carries the models:read scope, so your Action can call any model without extra setup. This makes it well suited for:

  • Automated pull request summaries
  • Issue deduplication and tagging
  • Weekly repository digests
  • Anything else you can script in an Action

Scaling when your project takes off

The GitHub Models inference API is free for everyone. But if you or your users want to do more inference than the free rate limits allow, you can turn on paid inference in your settings for significantly larger context windows and higher requests-per-minute. 

When your community grows, so will traffic. So it’s important to consider the following: 

  • Requests per minute (RPM): The free tier has default rate limits, while the paid tier raises them by several multiples.
  • Context window: Free tier tops out at standard model limits; paid enables 128k tokens on supported models.
  • Latency: The paid tier runs in its own separate deployment, so you’re not in the same queue as free tier users.

To get started, you can enable paid usage in Settings > Models for your org or enterprise. Your existing clients and tokens will keep working (but they’ll be faster and support bigger contexts).

Take this with you

LLMs are transforming how developers build and ship software, but requiring users to supply their own paid API key can be a barrier to entry. The magic only happens when the first npm install, cargo run, or go test just works.

If you maintain an AI-powered open source codebase, you should consider adding GitHub Models as a default inference provider. Your users already have free AI inference via GitHub, so there’s little downside to letting them use it with your code. That’s doubly true if your project is able to run in GitHub Actions. The best API key is no API key!

By making high-quality inference a free default for every developer on GitHub, GitHub Models gets rid of the biggest blocker to OSS AI adoption. And that opens the door to more contributions, faster onboarding, and happier users.

Want to give it a try? Check out the GitHub Models documentation or jump straight into the API reference and start shipping AI features that just work today.

A guide to deciding what AI model to use in GitHub Copilot https://github.blog/ai-and-ml/github-copilot/a-guide-to-deciding-what-ai-model-to-use-in-github-copilot/ Thu, 24 Apr 2025 16:00:51 +0000 https://github.blog/?p=86942 What to look for with each model and how to test them in your workflows—with tips, tricks, and pointers.

The post A guide to deciding what AI model to use in GitHub Copilot appeared first on The GitHub Blog.


To ensure that you have access to the best technology available, we’re continuously adding support for new models to GitHub Copilot. That being said, we know it can be hard to keep up with so many new models being released all the time.

All of this raises an obvious question: Which model should you use?

You can read our recent blog post for an overview of the models currently available in Copilot and their strengths, or check out our documentation for a deep dive comparing different models and tasks. But the AI landscape moves quickly. In this article we’ll explore a framework—including a few strategies—for evaluating whether any given AI model is a good fit for your use, even as new models continue to appear at a rapid pace.

It’s hard to go wrong with our base model, which has been fine-tuned specifically for programming-related tasks. But depending on what you’re working on, you likely have varying needs and preferences. There’s no single “best” model. Some may favor a more verbose model for chat, while others prefer a terse one, for example.

We spoke with several developers about their model selection process. Keep reading to discover how to apply their strategies to your own needs.

💡 Watch the video below for tips on prompt engineering to get the best results.

Why use multiple models?

There’s no reason you have to pick one model and stick with it. Since you can easily switch between models for both chat and code completion with GitHub Copilot, you can use different models for different use cases.

It's kind of like dogfooding your own stack: You won’t know if it really fits your workflow until you've shipped some real code with it.

- Anand Chowdhary, FirstQuadrant CTO and co-founder

Chat vs. code completion

Using one model for chat and another for autocomplete is one of the most common patterns we see among developers. Generally, developers prefer autocompletion models because they’re fast and responsive, which they need if they’re looking for suggestions as they think and type. Developers are more tolerant of latency in chat, when they’re in more of an exploratory state of mind (like considering a complex refactoring job, for instance).

Reasoning models for certain programming tasks

Reasoning models like OpenAI o1 often respond slower than traditional LLMs such as GPT-4o or Claude 3.5 Sonnet. That’s in large part because these models break a prompt down into parts and consider multiple approaches to a problem. That introduces latency in their response times, but makes them more effective at completing complex tasks. Many developers prefer these more deliberative models for particular tasks.

For instance, Fatih Kadir Akın, a developer relations manager, uses o1 when starting new projects from scratch. “Reasoning models better ‘understand’ my vision and create more structured projects than non-reasoning models,” he explains.

FirstQuadrant CTO and co-founder Anand Chowdhary favors reasoning models for large-scale code refactoring jobs. “A model that rewrites complex backend code without careful reasoning is rarely accurate the first time,” he says. “Seeing the thought process also helps me understand the changes.”

When creating technical interview questions for her newsletter, GitHub Senior Director of Developer Advocacy, Cassidy Williams mixes models for certain tasks. When she writes a question, she uses GPT-4o to refine the prose, and then Claude 3.7 Sonnet Thinking to verify code accuracy. “Reasoning models help ensure technical correctness because of their multi-step process,” she says. “If they initially get something wrong, they often correct themselves in later steps so the final answer is more accurate.”

There’s some subjectivity, but I compare model output based on the code structure, patterns, comments, and adherence to best practices.

- Portilla Edo, cloud infrastructure engineering lead

What to look for in a new AI model

Let’s say a new model just dropped and you’re ready to try it out. Here are a few things to consider before making it your new go-to.

Recentness

Different models use different training data. That means one model might have more recent data than another, and therefore might be trained on new versions of the programming languages, frameworks, and libraries you use.

“When I’m trying out a new model, one of the first things I do is check how up to date it is,” says Xavier Portilla Edo, a cloud infrastructure engineering lead. He typically does this by creating a project manifest file for the project to see what version numbers Copilot autocomplete suggests. “If the versions are quite old, I’ll move on,” he says.

Speed and responsiveness

As mentioned, developers tend to tolerate more latency in a chat than in autocomplete. But responsiveness is still important in chat. “I enjoy bouncing ideas off a model and getting feedback,” says Rishab Kumar, a staff developer evangelist at Twilio. “For that type of interaction, I need fast responses so I can stay in the flow.”

Accuracy

Naturally, you need to evaluate which models produce the best code. “There’s some subjectivity, but I compare model output based on the code structure, patterns, comments, and adherence to best practices,” Portilla Edo says. “I also look at how readable and maintainable the code is—does it follow naming conventions? Is it modular? Are the comments helpful or just restating what the code does? These are all signals of quality that go beyond whether the code simply runs.”

How to test an AI model in your workflow

OK, so now you know what to look for in a model. But how do you actually evaluate it for responsiveness and correctness? You use it, of course.

Start with a simple app

Akın will generally start with a simple todo app written in vanilla JavaScript. “I just check the code, and how well it’s structured,” he says. Similarly, Kumar will start with a websocket server in Python. The idea is to start with something that you understand well enough to evaluate, and then layer on more complexity. “Eventually I’ll see if it can build something in 3D using Three.js,” Akın says.

Portilla Edo starts by prompting a new model he wants to evaluate in Copilot Chat. “I usually ask it for simple things, like a function in Go, or a simple HTML file,” he says. Then he moves on to autocompletion to see how the model performs there.

Use it as a “daily driver” for a while

Chowdhary prefers to just jump in and start using a model. “When a new model drops, I swap it into my workflow as my daily driver and just live with it for a bit,” he says. “Available benchmarks and tests only tell you part of the story. I think the real test is seeing if it actually improves your day to day.”

For example, he checks to see if it actually speeds up his debugging jobs or produces cleaner refactors. “It’s kind of like dogfooding your own stack: You won’t know if it really fits your workflow until you’ve shipped some real code with it,” he says. “After evaluating it for a bit, I decide whether to stick with the new model or revert to my previous choice.”

Take this with you

What just about everyone agrees on is that the best way to evaluate a model is to use it.

The important thing is to keep learning. “You don’t need to be switching models all the time, but it’s important to know what’s going on,” Chowdhary says. “The state of the art is moving quickly. It’s easy to get left behind.”

Additional resources

Learn more about AI models.

What the heck is MCP and why is everyone talking about it? https://github.blog/ai-and-ml/llms/what-the-heck-is-mcp-and-why-is-everyone-talking-about-it/ Fri, 11 Apr 2025 16:00:44 +0000 https://github.blog/?p=86321 Everyone's talking about MCP these days when it comes to large language models (LLMs)—here’s what you need to know.

The post What the heck is MCP and why is everyone talking about it? appeared first on The GitHub Blog.


It feels like everyone’s talking about MCP (Model Context Protocol) these days when it comes to large language models (LLMs), but hardly anyone is actually defining it.

TL;DR: It’s an open standard for connecting LLMs to data and tools.

Let’s dive in deeper!

The context problem for LLMs

LLMs often struggle when they are asked for information outside of their training data. They’ll sometimes either hallucinate and say something incorrect, or simply say, “I don’t know.”

Giving them the right amount of context when you prompt them (whether it’s your codebase, your repository data, your documentation, etc.) is necessary for AI agents built on top of LLMs to be useful.

Usually, you have to really refine your prompting to give LLMs that context, or use some sort of external tool. For example, GitHub Copilot has tools like @workspace to give relevant information from your codebase to your prompts. This type of “extra tooling” is cool, but can get fairly complex fairly quickly as you implement things across different APIs and services.

A solution: Model Context Protocol, or MCP

In November, Anthropic open sourced the Model Context Protocol as a standard for connecting LLMs and AI assistants to data and tools!

MCP grew the way you fall asleep… slowly, and then all at once. As tools and organizations have adopted the MCP standard, it has only become more and more valuable. And because MCP is model agnostic, anyone can use and create MCP integrations. As with all open standards, a rising tide lifts all boats: the more people that use it, the better it becomes.

I think that MCP has “won” the hearts of so many AI developers and tools because of this openness, and also because it’s a very “AI-first” version of existing ideas.

This isn’t the first time we’ve seen a protocol like this become a standard, either. Back in 2016, Microsoft released the Language Server Protocol (LSP), which provided standards for code editors to support programming languages. Fast forward to today: because of LSP, programming language support across editors is better than ever, to the point where developers don’t even need to think about it anymore!

MCP takes a lot of its inspiration from LSP, and could be absolutely transformative for AI tooling. It allows for everyone, from the largest tech giants to the smallest indie developer shops, to enable robust AI solutions in any AI client with minimal setup.

That’s why this is a huge deal! An open standard that is backed more and more by the tech community means better tools, better developer experiences, and better user experiences for everyone.

GitHub and MCP

We’re not just talking about MCP: we’re contributing, too!

We’re SO excited to have recently released our new open source, official, local GitHub MCP Server! It provides seamless integration with GitHub APIs, allowing for advanced automation and integration capabilities for developers to build with!

You can chat more with us about it in the GitHub Community or you can check out the official announcement.

How do I contribute and learn more?

Hoorah, I thought you’d never ask! Here’s some resources to get you on your way:

Also, if you don’t mind the shameless plug, you can also use it with agent mode now. Go forth and code!

So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/ Thu, 12 Dec 2024 13:51:13 +0000 https://github.blog/?p=81578 We released a new open source byte-pair tokenizer that is faster and more flexible than popular alternatives.

The post So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer appeared first on The GitHub Blog.


Large language models (LLMs), such as those used by GitHub Copilot, do not operate directly on bytes but on tokens, which can complicate scaling. In this post, we explain how we solved that challenge at GitHub to support the growing number of Copilot users and features since first launching Copilot two years ago.

Tokenization is the process of turning bytes into tokens. The byte-pair encoding (BPE) algorithm is such a tokenizer, used (for example) by the OpenAI models we use at GitHub. Even the fastest of these algorithms has at least O(n log(n)) complexity; they are also not incremental, and are therefore poorly suited for our use cases, which go beyond encoding an upfront-known input. This limitation resulted in a number of scaling challenges that led us to create a novel algorithm to address them. Our algorithm not only scales linearly, but also easily outperforms popular libraries for all inputs.

Read on to find out more about why and how we created the open source bpe algorithm that substantially improves state of the art BPE implementations in order to address our broader set of use cases.

The importance of fast, flexible tokenization

Retrieval augmented generation (RAG) is essential for GitHub Copilot’s capabilities. RAG is used to improve model output by augmenting the user’s prompt with relevant snippets of text or code. A typical RAG approach works as follows:

  • Index repository content into an embeddings database to allow semantic search.
  • Given a user’s prompt, search the embeddings database for relevant snippets and include those snippets in the prompt for the LLM.

Tokenization is important for both of these steps. Most code files exceed the number of tokens that can be encoded into a single embedding, so we need to split the files into chunks that are within the token limit. When building a prompt there are also limits on the number of tokens that can be used. The amount of tokens can also impact response time and cost. Therefore, it is common to have some kind of budgeting strategy, which requires being able to track the number of tokens that components of the prompt contribute. In both of these cases, we are dynamically constructing a text and constantly need the updated token count during that process to decide how to proceed. However, most tokenizers provide only a single operation: encode a complete input text into tokens.
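To make the budgeting pattern concrete, here is a minimal sketch of assembling a prompt under a token budget. The `count_tokens` stand-in (whitespace splitting) and the snippet-selection policy are illustrative only, not GitHub’s implementation; the point is tracking the running count as the text is built up:

```python
def count_tokens(text):
    # Stand-in tokenizer for illustration only: a real system would use a
    # BPE tokenizer matching the target model, which is exactly where
    # incremental counting matters.
    return len(text.split())

def build_prompt(snippets, budget):
    # Track the running token count as snippets are appended, instead of
    # re-encoding the whole prompt after every addition.
    parts, used = [], 0
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            break  # budget exhausted: stop adding snippets
        parts.append(snippet)
        used += cost
    return "\n".join(parts), used

prompt, used = build_prompt(["alpha beta", "gamma delta epsilon", "zeta"], budget=4)
print(used)  # 2
```

With a real BPE tokenizer, each `count_tokens` call on a fresh chunk is where the cost of non-incremental encoding shows up.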

When scaling to millions of repositories and billions of embeddings, the efficiency of token calculation really starts to matter. Additionally, we need to consider the worst-case performance of tokenization for the stability of our production systems. A system that processes untrusted user input in the form of billions of source code files cannot allow data that, intentionally or not, causes pathological run times and threatens availability. (See this discussion of potential denial-of-service issues in the context of OpenAI’s tiktoken tokenizer.)

Some of the features that can help address these needs are:

  • Keep track of the token count for a chunk while it is being built up.
  • Count tokens for slices of an original text or abort counting when the text exceeds a given limit.
  • Split text within a certain number of tokens at a proper UTF-8 character boundary.

Implementing these operations using current tokenization algorithms would result in at least quadratic runtime, when we would like the runtime to be linear.

bpe: A fast tokenizer

We were able to make substantial improvements to the state of the art BPE implementations in order to address the broader set of use cases that we have. Not only were we able to support more features, but we do so with much better performance and scalability than the existing libraries provide.

Our implementation is open source with an MIT license and can be found at https://github.com/github/rust-gems. The Rust crates are also published to crates.io as bpe and bpe-openai. The former contains the BPE implementation itself. The latter exposes convenience tokenizers (including pre-tokenization) for recent OpenAI token models.

Read on for benchmark results and an introduction to the algorithm itself.

Performance comparison

We compare performance with two benchmarks. Both compare tokenization on randomly generated inputs of different sizes.

  • The first uses any inputs, most of which will be split into smaller pieces during pre-tokenization. Pre-tokenization splits the input text into pieces (for example, using regular expressions). BPE is applied to those pieces instead of the complete input text. Since these pieces are typically small, this can significantly impact overall performance and hide the performance characteristics of the underlying BPE implementation. This benchmark allows comparing the performance that can be expected in practice.
  • The second uses pathological inputs that won’t be split by pre-tokenization. This benchmark allows comparing the performance of the underlying BPE implementations and reflects worst-case performance.

We used OpenAI’s o200k_base token model and compared our implementation with tiktoken-rs, a wrapper for OpenAI’s tiktoken library, and Huggingface’s tokenizers. All benchmarks were run single-threaded on an Apple M1 MacBook Pro.

Here are the results, showing single-threaded throughput in MiB/s:

Line graph displaying results for the benchmark that includes pre-tokenization. Our tokenizer outperforms tiktoken by almost 4x and Huggingface by about 10x.

Line graph displaying the worst case complexity difference between our linear, Huggingface’s heap-based, and tiktoken’s quadratic implementation.

The first figure shows the results for the benchmark that includes pre-tokenization. We see that our tokenizer outperforms tiktoken by almost 4x and Huggingface by about 10x. (These numbers are in line with tiktoken’s reported performance results. Note that tiktoken matches our single-threaded performance only when it uses eight threads.) The second figure shows the worst-case complexity difference between our linear, Huggingface’s heap-based, and tiktoken’s quadratic implementations.

The rest of this post details how we achieved this result. We explain the basic principle of byte-pair encoding, the insight that allows the faster algorithm, and a high-level description of the algorithm itself.

Byte-pair encoding

BPE is a technique to encode text as a sequence of tokens from a token dictionary. The token dictionary is just an ordered list of tokens. Each token is either a single byte, or the concatenation of a pair of previously defined tokens. A string is encoded by replacing bytes with single-byte tokens and token pairs by concatenated tokens, in dictionary order.

Let’s see how the string abacbb is tokenized using the following dictionary:

a b c ac bb ab acbb

Initially, the string is tokenized into the single-byte tokens. Next, all occurrences (left to right) of the token pair a c are replaced by the token ac. This procedure is repeated until no more replacements are possible. For our input string abacbb, tokenization proceeds as follows:

1. a b a c b b
2. a b ac  b b
3. a b ac  bb
4. ab  ac  bb
5. ab  acbb

Note that initially we have several pairs of single-byte tokens that appear in the dictionary, such as a b and a c. Even though ab appears earlier in the string, ac is chosen because the token appears first in the token dictionary. It is this behavior that makes BPE non-incremental with respect to string operations such as slicing or appending. For example, the substring abacb is tokenized as ab ac b, but if another b is added, the resulting string abacbb is tokenized as ab acbb. Two tokens from the prefix abacb are gone, and the encoding for the longer string even ends up being shorter.

The two main strategies for implementing BPE are:

  • A naive approach that repeatedly iterates over the tokens to merge the next eligible token pair, resulting in quadratic complexity.
  • A heap-based approach that keeps eligible token pairs sorted, resulting in O(n log(n)) complexity.
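The naive strategy is easy to write down for the toy dictionary above. The sketch below is a reference implementation for illustration only (not the optimized algorithm this post introduces), and it reproduces the abacbb example step by step:

```python
# Toy token dictionary from the example: each multi-byte token is formed
# by merging a pair of earlier tokens, in dictionary (priority) order.
MERGES = [("a", "c", "ac"), ("b", "b", "bb"), ("a", "b", "ab"), ("ac", "bb", "acbb")]

def bpe_encode(text):
    seq = list(text)  # start from single-byte tokens
    while True:
        # Find the highest-priority pair that occurs anywhere in the sequence.
        for left, right, merged in MERGES:
            if any(seq[i] == left and seq[i + 1] == right for i in range(len(seq) - 1)):
                # Replace all its occurrences, left to right, then start over.
                out, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(seq[i])
                        i += 1
                seq = out
                break
        else:
            return seq  # no pair applies: tokenization is done

print(bpe_encode("abacbb"))  # ['ab', 'acbb']
print(bpe_encode("abacb"))   # ['ab', 'ac', 'b']
```

The outer restart after every merge pass is what makes this approach quadratic in the worst case.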

However, all tokenizers require the full text up front. Tokenizing a substring or an extended string means starting the encoding from scratch. For that reason, the more interesting use cases from above quickly become very expensive (at least O(n² log(n))). So, how can we do better?

Composing valid encodings

The difficulty of the byte-pair encoding algorithm (as described above) is that token pair replacements can happen anywhere in the string and can influence the final tokens at the beginning of the string. However, it turns out that there is a property, which we call compatibility, that allows us to build up tokenizations left-to-right:

Given a valid encoding, we can append an additional token to produce a new valid encoding if the pair of the last token and the appended token are a valid encoding.

A valid encoding means that the original BPE algorithm produces that same encoding. We’ll show what this means with an example, and refer to the crate’s README for a detailed proof.

The sequence ab ac is a valid encoding for our example token dictionary.

  • Is ab ac b a valid encoding? Check if the pair ac b is compatible:
    1. a c b
    2. ac  b
    

    We got the same tokens back, which means ab ac b is a valid encoding.

  • Is ab ac bb a valid encoding? Again, check if the pair ac bb is compatible:

    1. a c b b
    2. ac  b b
    3. ac  bb 
    4. acbb 
    

    In this case, the tokens are incompatible, and ab ac bb is not valid.

The next section explains how we can go from building valid encodings to finding the encoding for a given input string.

Linear encoding

Using the compatibility rule, we can implement linear encoding with a dynamic programming algorithm.

The algorithm works by checking for each of the possible last tokens whether it is compatible with the tokenization of the remaining prefix. As we saw in the previous section, we only need the last token of the prefix’s encoding to decide this.

Let’s apply this idea to our example abacbb and write down the full encodings for every prefix:

  • a      → a
  • ab     → ab
  • aba    → ab a
  • abac   → ab ac
  • abacb  → ab ac b
  • abacbb → ab acbb

We only store the last token for every prefix. This gives us a ab a ac b for the first five prefixes. We can find the last token for a prefix with a simple lookup in the list. For example, the last token for ab is ab, and the last token for abac is ac.

For the last token of abacbb we have three token candidates: b, bb, and acbb. For each of these we must check whether it is compatible with the last token of the remaining prefix: b b, ac bb, or ab acbb. Retokenizing these combinations gives bb, acbb, and ab acbb, which means acbb is the only valid choice here. The algorithm works forward by computing the last token for every position in the input, using the last tokens for previous positions in the way we just described.

The resulting algorithm looks roughly like this:

let mut last_tokens = vec![];
for pos in 0..text.len() {
  for candidate in all_potential_tokens_for_suffix(text[0..pos + 1]) {
    if token_len(candidate) == pos + 1 {
      last_tokens.push(candidate);
      break;
    } else if is_compatible(
      last_tokens[pos + 1 - token_len(candidate)],
      candidate,
    ) {
      last_tokens.push(candidate);
      break;
    }
  }
}

How do we implement this efficiently?

  • Use an Aho-Corasick string matching automaton to get all suffix tokens of the string until position i. Start with the longest token, since those are more likely to survive than shorter tokens.
  • Retokenize the token pair efficiently to implement the compatibility check. We could precompute and store all valid pairings, but with large token dictionaries this will be a lot of data. It turns out that we can retokenize pairs on the fly in effectively constant time.

The string matching automaton runs in time linear in the input length. Both the number of overlapping tokens and the cost of retokenizing a pair are bounded by constants determined by the token dictionary. Together, this gives us a linear runtime.
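The full forward pass can be sketched in runnable Rust. Greedy longest-match over a toy dictionary (assumed here to be a, b, c, ab, ac, bb, and acbb, consistent with the example encodings above) stands in for a real BPE encoder, and is_compatible simply retokenizes the candidate pair:

```rust
use std::collections::HashSet;

// Reference encoding used for the compatibility check. Greedy longest-match is
// only a stand-in for a real BPE encoder here; the dictionary is assumed from
// the example encodings (a, b, c, ab, ac, bb, acbb).
fn encode(text: &str, dict: &HashSet<&str>, max_len: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < text.len() {
        let end = (i + max_len).min(text.len());
        let len = (1..=end - i)
            .rev()
            .find(|&l| dict.contains(&text[i..i + l]))
            .expect("every single byte is assumed to be a token");
        tokens.push(text[i..i + len].to_string());
        i += len;
    }
    tokens
}

// A candidate last token is valid iff retokenizing (left, candidate) keeps the pair.
fn is_compatible(left: &str, candidate: &str, dict: &HashSet<&str>, max_len: usize) -> bool {
    let pair = format!("{left}{candidate}");
    encode(&pair, dict, max_len) == [left, candidate]
}

// Forward pass: compute the last token of every prefix of `text`.
fn last_tokens(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let max_len = dict.iter().map(|t| t.len()).max().unwrap();
    let mut last: Vec<String> = Vec::new();
    for pos in 0..text.len() {
        // All dictionary tokens that end at this position, longest first.
        let mut candidates: Vec<&str> = dict
            .iter()
            .copied()
            .filter(|t| text[..=pos].ends_with(*t))
            .collect();
        candidates.sort_by_key(|t| std::cmp::Reverse(t.len()));
        for cand in candidates {
            if cand.len() == pos + 1
                || is_compatible(&last[pos - cand.len()], cand, dict, max_len)
            {
                last.push(cand.to_string());
                break;
            }
        }
    }
    last
}

fn main() {
    let dict: HashSet<&str> = ["a", "b", "c", "ab", "ac", "bb", "acbb"].into_iter().collect();
    // Matches the table above: a, ab, a, ac, b, acbb.
    println!("{:?}", last_tokens("abacbb", &dict));
}
```

With this dictionary, is_compatible("b", "b", …) is false because the pair re-encodes to the single token bb, so acbb wins for the final position, exactly as in the worked example.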

Putting it together

The Rust crate contains several different encoders based on this approach:

  • Appending and prepending encoders that incrementally encode a text when content is added to it. The algorithm works using the approach outlined above, storing the last token for every text position. Whenever the text is extended, the last tokens for the new positions are computed using the previously computed last tokens. At any point the current token count is available in constant time. Furthermore, constant time snapshots and rollbacks are supported, which make it easy to implement dynamic chunk construction approaches.
  • A fast full-text encoder based on backtracking. Instead of storing the last token for every text position, it only stores the tokens for the full input. The algorithm works left to right, picking a candidate token for the remaining input, and checking if it is compatible with the last token. The algorithm backtracks on the last token if none of the candidates are compatible. By trying the longest candidates first, very little backtracking happens in practice.
  • An interval encoder which allows O(1) token counting on subranges of the original text (after preprocessing the text in O(n) time). The algorithm works by encoding the substring until the last token lines up with the token at that position for the original text. Usually, only a small prefix needs to be encoded before this alignment happens.

We have explained our algorithm at a high level so far. The crate’s README contains more technical details and is a great starting point for studying the code itself.

The post So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer appeared first on The GitHub Blog.

Unlocking the power of unstructured data with RAG https://github.blog/ai-and-ml/llms/unlocking-the-power-of-unstructured-data-with-rag/ Thu, 13 Jun 2024 16:00:28 +0000 https://github.blog/?p=78382 Unstructured data holds valuable information about codebases, organizational best practices, and customer feedback. Here are some ways you can leverage it with RAG, or retrieval-augmented generation.


Whether they’re building a new product or improving a process or feature, developers and IT leaders need data and insights to make informed decisions.

When it comes to software development, this data exists in two ways: unstructured and structured. While structured data follows a specific and predefined format, unstructured data—like email, an audio or visual file, code comment, or commit message—doesn’t. This makes unstructured data hard to organize and interpret, which means teams can miss out on potentially valuable insights.

To make the most of their unstructured data, development teams are turning to retrieval-augmented generation, or RAG, a method for customizing large language models (LLMs). They can use RAG to keep LLMs up to date with organizational knowledge and the latest information available on the web. They can also use RAG and LLMs to surface and extract insights from unstructured data.

GitHub data scientists Pam Moriarty and Jessica Guo explain unstructured data’s unique value in software development, and how developers and organizations can use RAG to create greater efficiency and value in the development process.

Unstructured data in software development

When it comes to software development, unstructured data includes source code and the context surrounding it, as these sources of information don’t follow a predefined format.

Here are some examples of unstructured data on GitHub:

  • README files describe in text the purpose behind project source code, and include instructions for source code use, how to contribute, and other details that developers decide are important to include. While they’re usually written in Markdown, README files don’t follow a predefined structure.
  • Code files are more orderly than README files in that they follow the syntax of a programming language. But not all code files have the same fields, nor are they all written in the same format. Additionally, some parts of the file, like coding logic and variable names, are decided by individual developers.
  • Package documentation explains how the software works and how to use it. Documentation, written in natural language, can include installation instructions, troubleshooting tips, a description of the package’s API, and a list of any dependencies required to use the package. It can also include code snippets that highlight the package’s features.
  • Code comments explain the function behind certain code blocks in a code file. They’re text comments written in natural language and make the source code easier to understand by other developers.
  • Wiki pages, while not limited to unstructured data, can contain helpful text documentation about installation instructions, API references, and other information.
  • Commit messages describe in natural language text the changes a developer made to a codebase and why.
  • Issue and pull request descriptions are written in natural language and in a text field. They can contain any kind of information a developer chooses to include about a bug, feature request, or general task in a project.
  • Discussions contain a wealth and variety of information, from developer and end-user feedback to open-ended conversations about a topic. As long as a repository enables discussions, anyone with a GitHub account can start a discussion.
  • Review comments are where developers can discuss changes before they’re merged into a codebase. Consequently, they contain information in natural language about code quality, context behind certain decisions, and concerns about potential bugs.

The value of unstructured data

The same features that make unstructured data valuable also make it hard to analyze.

Unstructured data lacks inherent organization, as it often consists of free-form text, images, or multimedia content.

“Without clear boundaries or predefined formats, extracting meaningful information from unstructured data becomes very challenging,” Guo says.

But LLMs can help to identify complex patterns in unstructured data—especially text. Though not all unstructured data is text, a lot of text is unstructured. And LLMs can help you to analyze it.

“When dealing with ambiguous, semi-structured or unstructured data, LLMs dramatically excel at identifying patterns, sentiments, entities, and topics within text data and uncover valuable insights that might otherwise remain hidden,” Guo explains.

Here are a few reasons why developers and IT leaders might consider using RAG-powered LLMs to leverage unstructured data:

  • Surface organizational best practices and establish consistency. Through RAG, an LLM can receive a prompt with additional context pulled from an organization’s repositories and documents. So, instead of sifting through and piece-mealing documents, developers can quickly receive answers from an LLM that align with their organization’s knowledge and best practices.
  • Accelerate and deepen understanding of an existing codebase—including its conventions, functions, common issues, and bugs. Understanding and familiarizing yourself with code written by another developer is a persistent challenge for several reasons, including but not limited to: code complexity, use of different coding styles, a lack of documentation, use of legacy code or deprecated libraries and APIs, and the buildup of technical debt from quick fixes and workarounds.

RAG can help to mediate these pain points by enabling developers to ask and receive answers in natural language about a specific codebase. It can also guide developers to relevant documentation or existing solutions.

Accelerated and deepened understanding of a codebase enables junior developers to contribute their first pull request with less onboarding time and senior developers to mitigate live site incidents, even when they’re unfamiliar with the service that’s failing. It also means that legacy code suffering from “code rot” and natural aging can be more quickly modernized and easily maintained.

Unstructured data doesn’t just help to improve development processes. It can also improve product decisions by surfacing user pain points.

Moriarty says, “Structured data might show a user’s decision to upgrade or renew a subscription, or how frequently they use a product or not. While those decisions represent the user’s attitude and feelings toward the product, it’s not a complete representation. Unstructured data allows for more nuanced and qualitative feedback, making for a more complete picture.”

A lot of information and feedback is shared during informal discussions, whether those discussions happen on a call, over email, on social platforms, or in an instant message. From these discussions, decision makers and builders can find helpful feedback to improve a service or product, and understand general public and user sentiment.

What about structured data?

Contrary to unstructured data, structured data—like relational databases, Protobuf files, and configuration files—follows a specific and predefined format.

We’re not saying unstructured data is more valuable than structured. But the processes for analyzing structured data are more straightforward: you can use SQL functions to modify the data and traditional statistical methods to understand the relationship between different variables.

That’s not to say AI isn’t used for structured data analysis. “There’s a reason that machine learning, given its predictive power, is and continues to be widespread across industries that use data,” according to Moriarty.

However, “Structured data is often numeric, and numbers are simply easier to analyze for patterns than words are,” Moriarty says. Not to mention that methods for analyzing structured data have been around longer than those for analyzing unstructured data: “A longer history with more focus just means there are more established approaches, and more people are familiar with it,” she explains.

That’s why the demand to enhance structured data might seem less urgent, according to Guo. “The potential for transformative impact is significantly greater when applied to unstructured data,” she says.

How does RAG extract value from unstructured data?

With RAG, an LLM can use data sources beyond its training data to generate an output.

RAG is a prompting method that uses retrieval—a process for searching for and accessing information—to add more context to a prompt that generates an LLM response.

This method is designed to improve the quality and relevance of an LLM’s outputs. Additional data sources include a vector database, traditional database, or search engine. So, developers who use an enterprise AI tool equipped with RAG can receive AI outputs customized to their organization’s best practices and knowledge, and proprietary data.

We break down these data sources in our RAG explainer, but here’s a quick summary:

  • Vector databases. While you code in your IDE, algorithms create embeddings for your code snippets, which are stored in a vector database. An AI coding tool can search that database to find snippets from across your codebase that are similar to the code you’re currently writing and generate a suggestion.

And when you’re engaging with GitHub Copilot Chat on GitHub.com or in the IDE, your query or code is transformed into an embedding. Our retrieval service then fetches relevant embeddings from the vector database for the repository you’ve indexed. These embeddings are turned back into text and code when they’re added to the prompt as additional context for the LLM. This entire process leverages unstructured data, even though the retrieval system uses embeddings internally.

  • General text search. When developers engage with GitHub Copilot Chat under a GitHub Copilot Enterprise plan, they can index repositories—specifically code and documentation. So, when a developer on GitHub.com or in the IDE asks GitHub Copilot Chat a question about an indexed repository, the AI coding tool can retrieve data from all of those indexed, unstructured data sources. And on GitHub.com, GitHub Copilot Chat can tap into a collection of unstructured data in Markdown files from across repositories, which we call knowledge bases.

Learn about GitHub Copilot Enterprise features >

But wait, why is Markdown considered unstructured data? Though you can use Markdown to format a file, the file itself can contain essentially any kind of data. Think about it this way: how would you put the contents of a Markdown file in a table?

  • External or internal search engine. The retrieval method searches and pulls information from a wide range of sources from the public web or your internal platforms and websites. That information is used for RAG, which means the AI model now has data from additional files—like text, image, video, and audio—to answer your questions.

Retrieval also taps into internal search engines. So, if a developer wants to ask a question about a specific repository, they can index the repository and then send their question to GitHub Copilot Chat on GitHub.com. Retrieval uses our internal search engine to find relevant code or text from the indexed files, which are then used by RAG to prompt the LLM for a contextually relevant response.
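Stripped of the infrastructure, the retrieval step that all three of these data sources feed is the same loop: embed, rank by similarity, and splice the winners into the prompt. The snippets, four-dimensional "embeddings", and prompt template below are invented for illustration; a real system would call an embedding model and a vector database instead:

```rust
// Minimal sketch of RAG's retrieval step. The snippets, four-dimensional
// "embeddings", and prompt template are invented for illustration; a real
// system calls an embedding model and a vector database instead.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Rank every document by similarity to the query embedding and keep the top k.
fn retrieve<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = docs
        .iter()
        .map(|(text, emb)| (cosine(query, emb), *text))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    scored.into_iter().take(k).map(|(_, t)| t).collect()
}

// Splice the retrieved snippets into the prompt as extra context for the LLM.
fn build_prompt(question: &str, context: &[&str]) -> String {
    format!(
        "Answer using the context below.\n\nContext:\n{}\n\nQuestion: {question}",
        context.join("\n")
    )
}

fn main() {
    let docs = vec![
        ("Our API rate limit is 5000 requests/hour.", vec![0.9, 0.1, 0.0, 0.1]),
        ("Deploys run from the main branch only.", vec![0.1, 0.8, 0.2, 0.0]),
        ("Use feature flags for risky changes.", vec![0.0, 0.2, 0.9, 0.1]),
    ];
    let query_embedding = [0.85, 0.15, 0.05, 0.1]; // stands in for embedding the question
    let context = retrieve(&query_embedding, &docs, 1);
    println!("{}", build_prompt("What is the API rate limit?", &context));
}
```

The point of the sketch is the shape of the loop, not the math: whatever the store, retrieval ends with relevant text being turned back into plain context inside the prompt, just as described above.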

Stay smart: LLMs can do things they weren’t trained to do, so it’s important to always evaluate and verify their outputs.

Use RAG to unlock insights from unstructured data

As developers improve their productivity and write more code with AI tools like GitHub Copilot, there’ll be even more unstructured data. Not just in the code itself, but also the information used to build, contextualize, maintain, and improve that code.

That means even more data containing rich insights that organizations can surface and leverage, or let sink and disappear.

Developers and IT leaders can use RAG as a tool to help improve their productivity, produce high-quality and consistent code at greater speed, preserve and share information, and increase their understanding of existing codebases, which can reduce onboarding time.

With a RAG-powered AI tool, developers and IT leaders can quickly discover, analyze, and evaluate a wealth of unstructured data—simply by asking a question.

A RAG reading list 📚

How AI enhances static application security testing (SAST) https://github.blog/ai-and-ml/llms/how-ai-enhances-static-application-security-testing-sast/ Thu, 09 May 2024 16:00:24 +0000 https://github.blog/?p=77987 Here’s how SAST tools combine generative AI with code scanning to help you deliver features faster and keep vulnerabilities out of code.


In a 2023 GitHub survey, developers reported that, after writing code (32%), their top task was finding and fixing security vulnerabilities (31%).

As their teams “shift left” and integrate security checks earlier into the software development lifecycle (SDLC), developers have become the first line of defense against vulnerabilities.

Unfortunately, we’ve found that “shifting left” has often meant shifting the burden of security practices onto developers rather than sharing the benefits. But with AI, there’s promise: 45% of developers think teams will benefit from using AI to facilitate security reviews. And they’re not wrong.

We spoke with Tiferet Gazit, the AI lead for GitHub Advanced Security, and Keith Hoodlet, principal security specialist at GitHub, to discuss security pain points for developers, the value of using an AI-powered security tool, and how AI enhances static application security testing (SAST).

Why are developers frustrated with security?

Before sharing insights from Gazit and Hoodlet, let’s hear from developers directly.

In late 2019, Microsoft’s One Engineering System team sat down with a handful of developers to understand their frustrations with following security and compliance guidelines. Though that was a few years ago, their pain points still resonate today:

  • When conducting security reviews, some developers are forced to use tools that weren’t designed for them, which negatively impacts their ability to find and address security vulnerabilities.
  • Also, the priority for most developers is to write and review code. Yet, in the age of shifting left, they’re also expected to review, understand, and remediate vulnerabilities as part of their day-to-day responsibilities.

When developers execute a program, they have everything they need in a run-time environment. Completing a security review is less straightforward. Often, developers need to exit their IDEs to view vulnerability alerts, research vulnerability types online, and then revisit their IDEs to address the vulnerability. This is what we call context-switching, and it can increase cognitive load and decrease productivity.

In short, security isn’t an inherent part of the development process, and developers often feel less confident in how secure their code is.

Without intervention, these frustrations will only increase over time. 75% of enterprise software engineers are expected to use AI coding assistants by 2028, according to Gartner. That means as developers improve their productivity and write more code with AI tools like GitHub Copilot, there will be even more code to review.

Security experts are stretched thin, too

It’s typically reported that for every 100 developers, there’s one security expert who ends up being the last line of defense against vulnerabilities (and is responsible for setting and enforcing security policies), which is a significant undertaking. While the exact numbers might vary, the ISC2 (International Information System Security Certification Consortium) reported a demand for four million more security professionals in its 2023 workforce study.

While AI doesn’t replace security experts, it can help them augment their knowledge and capabilities, especially when their expertise is in high demand.

“AI can help with those code and security reviews to ensure that increased momentum doesn’t lead to increased vulnerabilities,” Gazit says.

How AI enhances SAST tools

SAST tools aren’t the only kind of security tool used by developers, but they’re one of the most popular. Let’s look at how AI can help SAST tools do their job more efficiently.

Increased vulnerability detection

In order for SAST tools to detect vulnerabilities in code, they need to be shown what to look for. So, security experts use a process called modeling to identify points where exploitable user-controlled data enters and flows throughout a codebase. But given how often those components change, modeling popular libraries and frameworks is hard work.

That’s where AI comes in.

Security teams are experimenting with AI to model an extensive range of open source frameworks and libraries, improving the teams’ understanding of what’s inside of each software component.

Watch Nick Liffen, director of GitHub Advanced Security, and Niroshan Rajadurai, VP of GTM strategy for AI and DevSecOps, demonstrate how AI could model unknown packages.

Contextualized vulnerabilities directly in a workspace

Code scanning autofix is an example of an AI-powered security feature that combines a SAST tool—in this case, GitHub’s CodeQL—with the generative AI capabilities of GitHub Copilot.

With code scanning autofix, developers receive an AI-suggested code fix alongside an alert directly in a pull request. Then, they get a clear explanation of the vulnerability and the fix suggestion, specific to their particular use case. To view and apply autofix suggestions directly in the CLI, they can enable the GitHub CLI extension.

In its first iteration, code scanning autofix analyzes and suggests fixes in JavaScript, TypeScript, Python, Java, C#, and Go. It can generate a fix for more than 90% of vulnerability types—and over two-thirds of those fixes can be merged with little to no edits. More languages like C++ and Ruby will be supported in the future.

The payoff is that developers can remediate vulnerabilities faster and in their workflows, rather than catching those vulnerabilities later in production.

A fortified SDLC

Developers use SAST tools to protect their code throughout the SDLC.

Once you enable a code scanning solution like CodeQL, the SAST tool scans your source code, integrating security checks into your CI/CD workflow:

  • When you make changes to a codebase and create pull requests on GitHub, CodeQL will automatically conduct a full scan of your code as if the pull request was merged. It will then alert you if a vulnerability is found in the files changed in the pull request.

    That means developers have the ability to continuously monitor the security posture of their source code as modules come together—even before changes are merged to their main branch. As a result, developers can remediate vulnerabilities right away, in development, and before their code is sent to production.

  • Outside of commits and pull requests, you can also set CodeQL to run at specified times in your GitHub Actions workflow. So, if you want CodeQL to regularly scan your code at specific time intervals, you can schedule that using a GitHub Actions workflow.

Are you already using code scanning autofix?

Share your feedback and ask questions here >

See code scanning autofix in action

“Autofix makes CodeQL friendlier for developers by suggesting a fix and providing contextual explanations of the vulnerability and its remediation,” Gazit says. “This use of AI lowers the barrier of entry for developers who are tasked with fixing vulnerabilities.”

Let’s say a bad actor inserts a SQL injection into your application. The SQL injection enters your codebase through a user input field, and if the code comprising the injection exploits unintentional vulnerabilities, then the bad actor gets unauthorized access to sensitive data in your application.

SQL injections are a common type of vulnerability often found with a SAST tool.
Here’s a step-by-step look at how code scanning autofix, powered by GitHub Copilot, would detect a SQL injection and then surface it in an alert with an AI-suggested fix.

[Figure: a flow chart showing a SQL injection entering an application, the steps GitHub’s SAST tool CodeQL takes to trace the injection through a codebase and generate an alert, and the steps GitHub Copilot takes to augment that alert with an AI-generated fix and context.]

Step 1: Hunt for vulnerabilities. Code scanning with CodeQL can be enabled for free on all public repositories and scheduled to run automatically. The scanning process has four main parts, all centered around your source code: tokenization, abstraction, semantic analysis, and taint analysis. Here’s a detailed breakdown of each of those steps.

In short, tokenizing your source code standardizes it, and that allows CodeQL to analyze it later. Abstracting your source code transforms your lines of code into a hierarchical structure that shows the relationship between those lines of code. Semantic analysis uses that abstraction to understand the meaning of your source code.

Finally, taint analysis looks at the way your source code handles user input data. It identifies data sources (where input data enters the source code), flow steps (where data is passed through the code), sanitizers (functions that make input data safe), and sinks (functions that if called with unsanitized data could cause harm). Advanced SAST tools like CodeQL can evaluate how well input data is sanitized or validated, and decide from there whether to raise the path as a potential vulnerability.
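As a toy illustration of that last step, taint tracking can be reduced to a single pass over an ordered list of operations. This is only a sketch of the source/sanitizer/sink idea; real engines like CodeQL analyze a full program representation, not a flat list:

```rust
// Toy version of the taint analysis described above: a "program" is an ordered
// list of operations on one value. Real SAST engines like CodeQL work on a
// full program representation, so this shows only the core idea.
#[derive(Clone, Copy)]
enum Op {
    Source,    // user-controlled data enters the code
    FlowStep,  // data is passed along unchanged
    Sanitizer, // a function makes the data safe
    Sink,      // a function that is dangerous to call with unsanitized data
}

// Flag the path if tainted data can reach a sink without passing a sanitizer.
fn has_vulnerable_path(ops: &[Op]) -> bool {
    let mut tainted = false;
    for &op in ops {
        match op {
            Op::Source => tainted = true,
            Op::Sanitizer => tainted = false,
            Op::Sink if tainted => return true,
            _ => {} // flow steps and safe sink calls leave taint unchanged
        }
    }
    false
}

fn main() {
    use Op::*;
    // Unsanitized user input flows into a SQL query: raised as a vulnerability.
    assert!(has_vulnerable_path(&[Source, FlowStep, Sink]));
    // The same path with escaping in between is not raised.
    assert!(!has_vulnerable_path(&[Source, Sanitizer, Sink]));
    println!("taint sketch ok");
}
```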

Step 2: Construct a prompt to generate a fix. For all languages supported by CodeQL, developers will see a SQL injection alert surfaced in a pull request in their repository, along with a natural language description of the vulnerability and contextual documentation. These alerts will also include a suggested fix that developers can accept, edit, or dismiss.

Here’s what’s included in the prompt, that’s sent to GitHub Copilot, to generate the enhanced alert:

  • The initial CodeQL alert and general information about the type of vulnerability detected. This will usually include an example of the vulnerability and how to fix it, extracted from the CodeQL query help.
  • Code snippets and line numbers, potentially from multiple source-code files, along the data flow identified during CodeQL’s taint analysis. These code snippets signal the places where edits are most likely needed in your source.

To guide the format of GitHub Copilot’s response, our machine learning engineers:

  • Constrain GitHub Copilot’s underlying model to only edit the code included in the prompt.
  • Ask the model to generate outputs in Markdown, including a detailed natural language explanation of the vulnerability and the suggested fix.
  • Ask for “before” and “after” code blocks, demonstrating the snippets that require changes (including some surrounding context lines) and the edits to be made.
  • Instruct the model to list any external dependencies used in the fix, such as data sanitization libraries.

Step 3: Check for undesirable code. Code snippets that match or nearly match runs of about 150 characters of public code on GitHub are then filtered from AI-generated coding suggestions. Vulnerable code, and off-topic, harmful, or offensive content are also filtered out.

You can explore the GitHub Copilot Trust Center to learn more about GitHub Copilot’s filters and responsible data handling.

Step 4: Apply finishing touches. Before developers see GitHub Copilot’s suggested fix, a fix generator processes and refines the LLM output to detect and correct any small errors.

The fix generator does this by:

  • Conducting a fuzzy search to ensure the “after” code blocks and line numbers, which contain the AI-generated suggested code fixes, match the “before” code blocks and line numbers. A fuzzy search looks for exact and similar matches between the code blocks, so the fix generator can catch and correct small errors, like those related to indentation, semicolon, or code comment differences between the two code blocks.
  • Using a parser to check for syntax errors.
  • Conducting semantic checks to evaluate the logic of the AI-suggested code fix. Name-resolution and type checks, for example, help ensure that the suggested code matches and maintains the intention and functionality of the original code.
  • Verifying any dependencies suggested by GitHub Copilot. This means locating the relevant configuration file containing information about the project’s dependencies to see if the needed dependency already exists in the project. If not, the fix generator verifies that the suggested dependencies exist in the ecosystem’s package registry, and checks for known vulnerable or malicious packages. It then adds new and needed dependencies to the configuration file as part of the fix suggestion.
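To make the first of these steps concrete, here is a minimal sketch of a whitespace-insensitive comparison between a “before” block and the code actually in the file. A real fuzzy search also tolerates semicolon and comment drift and hunts for the best-matching line range; the functions below only show the normalization idea:

```rust
// Sketch of the normalization behind a fuzzy match between an alert's "before"
// block and the code actually in the file. A real fuzzy search also tolerates
// semicolon and comment drift and scans for the best-matching line range; this
// shows only how indentation and blank-line differences are ignored.
fn normalize(code: &str) -> String {
    code.lines()
        .map(str::trim)            // ignore indentation differences
        .filter(|l| !l.is_empty()) // ignore blank lines
        .collect::<Vec<_>>()
        .join("\n")
}

fn fuzzy_equal(before: &str, actual: &str) -> bool {
    normalize(before) == normalize(actual)
}

fn main() {
    let before = "if user.is_admin():\n    grant()";
    let actual = "  if user.is_admin():\n\n      grant()";
    assert!(fuzzy_equal(before, actual));    // same code, different layout
    assert!(!fuzzy_equal(before, "deny()")); // genuinely different code
    println!("fuzzy match sketch ok");
}
```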

Step 5: Explain the vulnerability and suggested fix. The final step is to surface the CodeQL alert to developers in a pull request. With code scanning autofix, the original CodeQL alert is enhanced with an AI-suggested fix, a natural language explanation of the vulnerability and suggested fix, and a diff patch. Developers can accept the suggested edit as is, refine the suggested edit, or dismiss it.

[Figure: a flow chart detailing how a prompt to GitHub Copilot results in a security alert enhanced with an AI-suggested fix and additional context.]

How developers, the SDLC, and organizations benefit from AI-powered SAST tools

With AI, security checks have the ability to smoothly integrate into a developer’s workflow, making security a feature of the SDLC rather than an afterthought dealt with in production. When developers can help secure code more easily in the development phase, the SDLC as a whole is hardened. And when the SDLC is better protected, organizations can focus more on innovation.

“When you treat security as a feature of the SDLC, your applications become more robust against increasingly complex attacks, which saves you time and money,” Hoodlet says. “You can direct those saved costs towards other improvements and experimentation with new features. The result? Organizations build a reputation for building secure products while freeing up resources for innovation.” Additionally, security teams are free to focus on the strategic initiatives that deserve their expertise.

Organizations that adopt AI-enhanced SAST tools can help developers to feel supported and productive in their security practices, so that developers can:

  • Help secure more code in development. Just look at the numbers. Code scanning autofix powered by GitHub Copilot can generate a fix for more than 90% of vulnerability types detected in your codebase, and more than two-thirds of its suggestions can be merged with little to no edits.
  • Become faster and better at remediating vulnerabilities. Through code scanning autofix, developers are given natural language explanations about an AI-generated code fix. They’re also given a description of the detected vulnerability that’s tailored to its detection in a specific codebase, rather than a general one. This specific context helps developers to better understand the nature of a detected vulnerability, why it exists in a codebase, and how to fix it.

  • Receive security guidance directly in their workspace. Developers receive all the benefits of an AI-enhanced SAST tool directly in a pull request. Unlike traditional security tools, this one is made for them.

Looking to secure your organization with the power of AI?

Learn more about SAST or get started today.

Customizing and fine-tuning LLMs: What you need to know https://github.blog/ai-and-ml/llms/customizing-and-fine-tuning-llms-what-you-need-to-know/ Wed, 28 Feb 2024 18:00:52 +0000 https://github.blog/?p=76775 Learn how your organization can customize its LLM-based solution through retrieval augmented generation and fine-tuning.


How to write function in Python to reverse a string
How to write SQL query to select users from a database by age
How to implement binary search in Java

How often do you have to break the flow, leave your IDE, and search for answers to questions like the ones above? And how often do you get distracted and end up watching cat videos instead of getting back to work? (It happens to the best of us, even to GitHub’s VP of Developer Relations, Martin Woodward.)

It doesn’t have to be that way. Getting AI coding assistance directly in the workspace has been found to reduce context switching and conserve developers’ mental energy. When directly integrated into workspaces, these tools become familiar enough with a developer’s code to quickly provide tailored suggestions. Now, without getting sidetracked, developers can get customized answers to coding questions like:

Can you suggest a better way to structure my code for scalability?
Can you help me debug this function? It's not returning the expected results.
Can you help me understand this piece of code in this repository?

But how do AI coding assistants provide customized answers? What can organizations and developers do to receive more tailored solutions? And how, ultimately, do customized AI coding assistants benefit organizations as a whole?

We talked to Alireza Goudarzi, a senior machine learning researcher at GitHub, to get the answers. ⬇️

How AI coding assistants provide customized answers

When it comes to problem solving, context is everything.

Business decision makers use information gathered from internal metrics, customer meetings, employee feedback, and more to make decisions about what resources their companies need. Meanwhile, developers use details from pull requests, a folder in a project, open issues, and more to solve coding problems.

Large language models, or LLMs, do something similar:

  • Generative AI coding tools are powered by LLMs, which are sets of algorithms trained on large amounts of code and human language.
  • Today’s LLMs are structured as transformers, a kind of architecture that makes the model good at connecting the dots between data. Following the transformer architecture is what enables today’s LLMs to generate responses that are more contextually relevant than previous AI models.
  • Though transformer LLMs are good at connecting the dots, they need to learn what data to process and in what order.
  • A generative AI coding assistant in the IDE can be instructed to use data from open files or code written before and after the cursor to understand the context around the current line of code and suggest a relevant code completion.
  • As a chatbot in an IDE or on a website, a generative AI coding assistant can provide guidance by using data from indexed repositories, customized knowledge bases, developer-provided input in a prompt or query, and even search engine integrations.

All input data—the code, query, and additional context—passes through something called a context window, which is present in all transformer-based LLMs. The size of the context window represents the capacity of data an LLM can process. Though it can’t process an infinite amount of data, it can grow larger. But because that window is limited, prompt engineers have to figure out what data, and in what order, to feed the model so it generates the most useful, contextually relevant responses for the developer.
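To make the context-window budgeting idea concrete, here is a minimal, hypothetical sketch of how a prompt assembler might pack the highest-priority context into a fixed window. All function names are illustrative, and a real system would count tokens with a model-specific tokenizer rather than whitespace splitting.

```python
# Hypothetical sketch: assembling a prompt within a fixed context window.
# Token counts are approximated by whitespace splitting for illustration only.

def count_tokens(text: str) -> int:
    """Very rough token estimate; real systems use a model-specific tokenizer."""
    return len(text.split())

def build_prompt(query: str, context_snippets: list[str], max_tokens: int = 4096) -> str:
    """Pack the query plus the highest-priority snippets that fit in the window.

    Snippets are assumed to be pre-sorted by relevance (most relevant first),
    which is the prompt engineer's "what data, and in what order" decision.
    """
    budget = max_tokens - count_tokens(query)
    chosen = []
    for snippet in context_snippets:
        cost = count_tokens(snippet)
        if cost <= budget:
            chosen.append(snippet)
            budget -= cost
    return "\n\n".join(chosen + [query])
```

When the window is too small, lower-priority snippets are simply dropped, which is why ordering the context by relevance matters so much.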

How to customize your LLM

Customizing an LLM is not the same as training it. Training an LLM means building the scaffolding and neural networks to enable deep learning. Customizing an LLM means adapting a pre-trained LLM to specific tasks, such as generating information about a specific repository or updating your organization’s legacy code into a different language.

There are a few approaches to customizing your LLM: retrieval augmented generation, in-context learning, and fine-tuning.

We broke these down in this post about the architecture of today’s LLM applications and how GitHub Copilot is getting better at understanding your code. Here’s a recap.

Retrieval-augmented generation (RAG)

RAG typically uses something called embeddings to retrieve information from a vector database. Vector databases are a big deal because they transform your source code into retrievable data while maintaining the code’s semantic complexity and nuance.

In practice, that means an LLM-based coding assistant using RAG can generate relevant answers to questions about a private repository or proprietary source code. It also means that LLMs can use information from external search engines to generate their responses.

If you’re wondering what a vector database is, we have you covered:

  • Vector databases store embeddings of your repository code and documentation. The embeddings are what make your code and documentation readable by an LLM. (This is similar to the way programming languages are compiled into binary so a computer can understand them.)
  • As developers code in an IDE, algorithms transform code snippets in the IDE into embeddings. Algorithms then make approximate matches between the embeddings that are created for those IDE snippets and the embeddings already stored in the vector database.
  • When asking a question to a chat-based AI coding assistant, the questions and requests written in natural language are also transformed into embeddings. A similar process to the one described above takes place: the embeddings created for the natural language prompts are matched to embeddings already stored in vector databases.

Vector databases and embeddings allow algorithms to quickly search for approximate matches (not just exact ones) on the data they store. This is important because if an LLM’s algorithms only make exact matches, it could be the case that no data is included as context. Embeddings improve an LLM’s semantic understanding, so the LLM can find data that might be relevant to a developer’s code or question and use it as context to generate a useful response.
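As a rough illustration of approximate matching, the sketch below brute-forces cosine similarity over a few stored embeddings with NumPy. This is a stand-in for a real vector database, which would use an approximate nearest-neighbor index to stay fast at scale; the function names here are hypothetical.

```python
# Illustrative sketch of approximate matching over stored embeddings.
# A real vector database uses an approximate nearest-neighbor index;
# here we brute-force cosine similarity to show the core idea.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors, 1.0 meaning identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb: np.ndarray, stored: dict[str, np.ndarray], k: int = 3) -> list[str]:
    """Return the ids of the k stored embeddings closest to the query."""
    ranked = sorted(stored, key=lambda doc_id: cosine_similarity(query_emb, stored[doc_id]), reverse=True)
    return ranked[:k]
```

Because the match is by similarity rather than exact equality, nearby-but-not-identical code or documentation can still surface as context.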

Have questions about what data GitHub Copilot uses and how?

 

Read this for answers to frequently asked questions and visit the GitHub Copilot Trust Center for more details.

In-context learning

In-context learning, a method sometimes referred to as prompt engineering, is when developers give the model specific instructions or examples at the time of inference (also known as the time they’re typing or vocalizing a question or request). By providing these instructions and examples, the LLM understands the developer is asking it to infer what they need and will generate a contextually relevant output.

In-context learning can be done in a variety of ways, like providing examples, rephrasing your queries, and adding a sentence that states your goal at a high-level.
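A minimal sketch of what that looks like in practice: the goal statement, examples, and query are assembled into a single prompt at inference time, and nothing about the model's weights changes. The helper below is hypothetical; it simply shows the shape of a few-shot prompt.

```python
# Sketch: building a few-shot (in-context learning) prompt.
# The goal and examples are supplied at inference time; no weights change.

def few_shot_prompt(goal: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a goal statement, worked examples, and the new query."""
    lines = [f"Goal: {goal}", ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)
```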

Fine-tuning

Fine-tuning your model can result in a highly customized LLM that excels at a specific task. There are two ways to customize your model with fine-tuning: supervised learning and reinforcement learning from human feedback (RLHF).

Under supervised learning, there is a predefined correct answer that the model is taught to generate. Under RLHF, there is high-level feedback that the model uses to gauge whether its generated response is acceptable or not.

Let’s dive deeper.

Supervised learning

This method is when the model’s generated output is evaluated against an intended or known output. For example, you know that the sentiment behind a statement like this is negative: “This sentence is unclear.” To evaluate the LLM, you’d feed this sentence to the model and query it to label the sentiment as positive or negative.

If the model labels it as positive, then you’d adjust the model’s parameters (variables that can be weighed or prioritized differently to change a model’s output) and try prompting it again to see if it can classify the sentiment as negative.

But even smaller models can have over 300 million parameters. That’s a lot of variables to sift through and adjust (and readjust). This method also requires time-intensive labeling: each input sample needs an output labeled with exactly the correct answer, such as “Negative” for the example above. That label gives the output something to measure against, so adjustments can be made to the model’s parameters.
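The adjust-and-retry loop above can be illustrated with a toy one-parameter logistic model. This is a deliberately tiny stand-in, not how an LLM is fine-tuned: a real fine-tune updates millions of parameters via backpropagation, but the shape of each step (predict, compare against the known label, nudge the parameters) is the same.

```python
# Toy illustration of the supervised loop: predict a label, compare it to
# the known answer, and nudge the parameter to reduce the error.
# A real fine-tune adjusts millions of parameters via backpropagation.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def training_step(weight: float, feature: float, label: int, lr: float = 0.5) -> float:
    """One gradient step for a one-parameter logistic classifier.

    label: 1 = positive sentiment, 0 = negative sentiment (the known answer).
    """
    prediction = sigmoid(weight * feature)        # model's current guess
    gradient = (prediction - label) * feature     # derivative of the log loss
    return weight - lr * gradient                 # adjust toward the label
```

Repeating the step with the label "negative" (0) pushes the weight further negative each time, i.e., the model's output moves toward the known answer.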

Reinforcement learning from human feedback (RLHF)

RLHF requires either direct human feedback or creating a reward model that’s trained to model human feedback (by predicting if a user will accept or reject the output from the pre-trained LLM). The learnings from the reward model are passed to the pre-trained LLM, which will adjust its outputs based on user acceptance rate.

The benefit to RLHF is that it doesn’t require supervised learning and, consequently, expands the criteria for what’s an acceptable output. For example, with enough human feedback, the LLM can learn that if there’s an 80% probability that a user will accept an output, then it’s fine to generate.
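That acceptance-probability idea can be sketched as a simple threshold filter. The reward model here is a stand-in callable; in practice it is itself a trained network that predicts whether a user will accept the output.

```python
# Sketch of the acceptance threshold described above: a reward model scores
# each candidate output with a predicted probability of user acceptance,
# and the system keeps candidates that clear the bar.
from typing import Callable

def keep_acceptable(candidates: list[str],
                    reward_model: Callable[[str], float],
                    threshold: float = 0.8) -> list[str]:
    """Keep candidates whose predicted acceptance probability clears the bar."""
    return [c for c in candidates if reward_model(c) >= threshold]
```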

For more on LLMs and how they process data, see our post on the architecture of today’s LLM applications.

How to customize GitHub Copilot

GitHub Copilot’s contextual understanding has evolved continuously. The first version could only consider the file you were working on in your IDE as contextually relevant. We then expanded the context to neighboring tabs: all the open files in your IDE that GitHub Copilot can comb through to find additional context.

Just a year and a half later, we launched GitHub Copilot Enterprise, which uses an organization’s indexed repositories to provide developers with coding assistance that’s customized to their codebases. With GitHub Copilot Enterprise, organizations can tailor GitHub Copilot suggestions in the following ways:

  • Index their source code repositories in vector databases, which improves semantic search and gives their developers a customized coding experience.
  • Create knowledge bases, which are Markdown files from a collection of repositories that provide GitHub Copilot with additional context through unstructured data, or data that doesn’t live in a database or spreadsheet.

In practice, this can benefit organizations in several ways:

  • Enterprise developers gain a deeper understanding of your organization’s unique codebase. Senior and junior developers alike can prompt GitHub Copilot for code summaries, coding suggestions, and answers about code behavior. As a result of this streamlined code navigation and comprehension, enterprise developers implement features, resolve issues, and modernize code faster.
  • Complex data is quickly translated into organizational knowledge and best practices. Because GitHub Copilot receives context through the repositories and documentation your organization chooses to index, developers receive coding suggestions and guidance that are more useful because they align with organizational knowledge and best practices.
    It’s not just developers, but also their non-developer and cross-functional team members who can use natural language to prompt Copilot Chat in GitHub.com for answers and guidance on relevant documentation or existing solutions. Data and solutions captured in repositories become more accessible across the organization, improving collaboration and increasing awareness of business goals and practices.
  • Faster pull requests create smart, efficient, and accessible development workflows. With GitHub Copilot Enterprise, developers can use GitHub Copilot to generate pull request summaries directly in GitHub.com, helping them communicate clearly with reviewers while also saving valuable time. For developers reviewing pull requests, GitHub Copilot can be used to help them quickly gain a strong understanding of proposed changes and, as a result, focus more time on providing valuable feedback.

GitHub Copilot Enterprise is now generally available.

 

Read more about GitHub’s most advanced AI offering, and how it’s customized to your organization’s knowledge and codebase.

Best practices for customizing your LLM

Customized LLMs help organizations increase value out of all of the data they have access to, even if that data’s unstructured. Using this data to customize an LLM can reveal valuable insights, help you make data-driven decisions, and make enterprise information easier to find overall.

Here are our top tips for customizing an LLM.

Select an AI solution that uses RAG

Like we mentioned above, not all of your organization’s data will be contained in a database or spreadsheet. A lot of data comes in the form of text, like code documentation.

Organizations that opt into GitHub Copilot Enterprise will have a customized chat experience with GitHub Copilot in GitHub.com. GitHub Copilot Chat will have access to the organization’s selected repositories and knowledge base files (also known as Markdown documentation files) across a collection of those repositories.

Adopt innersource practices

Kyle Daigle, GitHub’s chief operating officer, previously shared the value of adapting communication best practices from the open source community to their internal teams in a process known as innersource. One of those best practices is writing something down and making it easily discoverable.

How does this practice pay off? It provides more documentation, which means more context for an AI tool to generate tailored solutions to our organization. Effective AI adoption requires establishing this foundation of context.

Moreover, developers can use GitHub Copilot Chat in their preferred natural language—from German to Telugu. That means more documentation, and therefore more context for AI, improves global collaboration. All of your developers can work on the same code while using their own natural language to understand and improve it.

Here are Daigle’s top tips for innersource adoption:

  1. If you like what you hear, record it and make it discoverable (and remember: plenty of video and productivity tools now provide AI-powered summaries and action items).
  2. If you come up with a useful solution for your team, share it out with the wider organization so they can benefit from it, too.
  3. Offer feedback to publicly shared information and solutions. But remember to critique the work, not the person.
  4. If you request a change to a project or document, explain why you’re requesting that change.

✨ Bonus points if you add all of these notes to your relevant GitHub repositories and format them in Markdown.

How do you expand your LLM results?

The answer lies in search engine integration.

Transformer-based LLMs have impressive semantic understanding even without embeddings and high-dimensional vectors. That’s because they’re trained on a large amount of unlabeled natural language data and publicly available source code. They also learn through a self-supervised process: the model uses a portion of its input data to learn basic objectives, then applies what it has learned to the rest of the input.

When a search engine is integrated into an LLM application, the LLM is able to retrieve search engine results relevant to your prompt because of the semantic understanding it’s gained through its training. That means an LLM-based coding assistant with search engine integration (made possible through a search engine’s API) will have a broader pool of current information that it can retrieve information from.

Why does this matter to your organization?

Let’s say a developer asks an AI coding tool a question about the most recent version of Java. However, the LLM was trained on data from before the release, and the organization hasn’t updated its repositories’ knowledge with information about the latest release. The AI coding tool can still answer the developer’s question by conducting a web search to retrieve the answer.
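The fallback described above can be sketched as a simple two-step retrieval: try the internal index first, and fall back to the web when nothing relevant is found. Both retrieval functions below are placeholders for real search APIs, not actual product behavior.

```python
# Hypothetical sketch: answer from indexed repositories when a relevant
# match exists, otherwise fall back to a web search API.
from typing import Callable

def answer_with_fallback(question: str,
                         search_index: Callable[[str], list[str]],
                         web_search: Callable[[str], str]) -> str:
    """Prefer internal knowledge; fall back to the web when the index is stale."""
    hits = search_index(question)
    if hits:
        return hits[0]          # best internal match
    return web_search(question)  # index had nothing relevant
```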

A generative AI coding assistant that can retrieve data from both custom and publicly available data sources gives employees customized and comprehensive guidance.

The path forward

According to Gartner, 50% of enterprise software engineers are expected to use machine learning-powered coding tools by 2027.

Today, developers are using AI coding assistants to get a head start on complex code translation tasks, build better test coverage, tackle new problems with creative solutions, and find answers to coding-related questions without leaving their IDEs. With customization, developers can also quickly find solutions tailored to an organization’s proprietary or private source code, and build better communication and collaboration with their non-technical team members.

In the future, we imagine a workspace that offers more customization for organizations. For example, your ability to fine-tune a generative AI coding assistant could improve code completion suggestions. Additionally, integrating an AI coding tool into your custom tech stack could feed the tool with more context that’s specific to your organization and from services and data beyond GitHub.

The post Customizing and fine-tuning LLMs: What you need to know appeared first on The GitHub Blog.
