The latest on machine learning - The GitHub Blog

Under the hood: Security architecture of GitHub Agentic Workflows

Landon Cox — Mon, 09 Mar 2026 16:00:00 +0000

Whether you’re an open-source maintainer or part of an enterprise team, waking up to documentation fixes, new unit tests, and refactoring suggestions can be a true “aha” moment. But automation also raises an important concern: how do you put guardrails on agents that have access to your repository and the internet? Will you be wondering if your agent relied on documentation from a sketchy website, or pushed a commit containing an API token? What if it decides to add noisy comments to every open issue one day? Automations must be predictable to offer durable value.

But what is the safest way to add agents to existing automations like CI/CD? Agents are non-deterministic: They must consume untrusted inputs, reason over repository state, and make decisions at runtime. Letting agents operate in CI/CD without real-time supervision allows you to scale your software engineering, but it also requires novel guardrails to keep you from creating security problems.

GitHub Agentic Workflows run on top of GitHub Actions. By default, everything in an action runs in the same trust domain. Rogue agents can interfere with MCP servers, access authentication secrets, and make network requests to arbitrary hosts. A buggy or prompt-injected agent with unrestricted access to these resources can act in unexpected and insecure ways.

That’s why security is baked into the architecture of GitHub Agentic Workflows. We treat agent execution as an extension of the CI/CD model—not as a separate runtime. We separate open‑ended authoring from governed execution, then compile a workflow into a GitHub Action with explicit constraints such as permissions, outputs, auditability, and network access.

This post explains how we built Agentic Workflows with security in mind from day one, starting with the threat model and the security architecture that it needs.

Threat model

There are two properties of agentic workflows that change the threat model for automation.

First, agents’ ability to reason over repository state and act autonomously makes them valuable, but it also means they cannot be trusted by default—especially in the presence of untrusted inputs.

Second, GitHub Actions provide a highly permissive execution environment. A shared trust domain is a feature for deterministic automation, enabling broad access, composability, and good performance. But when combined with untrusted agents, having a single trust domain can create a large blast radius if something goes wrong.

Under this model, we assume an agent will try to read and write state that it shouldn’t, communicate over unintended channels, and abuse legitimate channels to perform unwanted actions. By default, GitHub Agentic Workflows run in a strict security mode with this threat model in mind, and their design is guided by four security principles: defense in depth, don’t trust agents with secrets, stage and vet all writes, and log everything.

Defend in depth

GitHub Agentic Workflows provide a layered security architecture consisting of substrate, configuration, and planning layers. Each layer limits the impact of failures above it by enforcing distinct security properties that are consistent with its assumptions.

The substrate layer rests on a GitHub Actions runner virtual machine (VM) and several trusted containers that limit the resources an agent can access. Collectively, the substrate level provides isolation among components, mediation of privileged operations and system calls, and kernel-enforced communication boundaries. These protections hold even if an untrusted user-level component is compromised and executes arbitrary code within its container isolation boundary.

Above the substrate layer is a configuration layer that includes declarative artifacts and the toolchains that interpret them to instantiate a secure system structure and connectivity. The configuration layer dictates which components are loaded, how components are connected, what communication channels are permitted, and what privileges are assigned. Externally minted tokens, such as agent API keys and GitHub access tokens, are critical inputs that bound components’ external effects—configuration controls which tokens are loaded into which containers.

The final layer of defense is the planning layer. The configuration layer dictates which components exist and how they communicate, but it does not dictate which components are active over time. The planning layer’s primary responsibility is to create a staged workflow with explicit data exchanges between them. The safe outputs subsystem, which will be described in greater detail below, is the primary instance of secure planning.

Don’t trust agents with secrets

From the beginning, we wanted workflow agents to have zero access to secrets. Agentic workflows execute as GitHub Actions, in which components share a single trust domain on top of the runner VM. In that model, sensitive material like agent authentication tokens and MCP server API keys reside in environment variables and configuration files visible to all processes in the VM.

This is dangerous because agents are susceptible to prompt injection: Attackers can craft malicious inputs like web pages or repository issues that trick agents into leaking sensitive information. For example, a prompt-injected agent with access to shell-command tools can read configuration files, SSH keys, Linux /proc state, and workflow logs to discover credentials and other secrets. It can then upload these secrets to the web or encode them within public-facing GitHub objects like repository issues, pull requests, and comments.

Our first mitigation was to isolate the agent in a dedicated container with tightly controlled egress: firewalled internet access, MCP access through a trusted MCP gateway, and LLM API calls through an API proxy. To limit internet access, agentic workflows create a private network between the agent and firewall. The MCP gateway runs in a separate trusted container, launches MCP servers, and has exclusive access to MCP authentication material.

Although agents like Claude, Codex, and Copilot must communicate with an LLM over an authenticated channel, we avoid exposing those tokens directly to the agent’s container. Instead, we place LLM auth tokens in an isolated API proxy and configure agents to route model traffic through that proxy.

Zero-secret agents require a fundamental trade-off between security and utility. Coding workloads require broad access to compilers, interpreters, scripts, and repository state, but expanding the in-container setup would duplicate existing actions provisioning logic and increase the set of network destinations that must be allowed through the firewall.

Instead, we carefully expose host files and executables using container volume mounts and run the agent in a chroot jail. We start by mounting the entire VM host file system read-only at /host. We then overlay selected paths with empty tmpfs layers and launch the agent in a chroot jail rooted at /host. This approach keeps the host-side setup intact while constraining the agent’s writable and discoverable surface to what it needs for its job.

Stage and vet all writes

Prompt-injected agents can still do harm even if they do not have access to secrets. For example, a rogue agent could spam a repository with pointless issues and pull requests to overwhelm repository maintainers, or add objectionable URLs and other content in repository objects.

To prevent this kind of behavior, the agentic workflows compiler decomposes workflows into explicit stages and defines, for each stage:

The active components and permissions (read vs. write)
The data artifacts emitted by that stage
The admissible downstream consumers of those artifacts

While the agent runs, it can read GitHub state through the GitHub MCP server and can only stage its updates through the safe outputs MCP server. Once the agent exits, write operations that have been buffered by the safe outputs MCP server are processed by a suite of safe outputs analyses.

First, safe outputs allow workflow authors to specify which write operations an agent can perform. Authors can choose which subset of GitHub updates are allowed, such as creating issues, comments, or pull requests. Second, safe outputs limits the number of updates that are allowed, such as restricting an agent to creating at most three pull requests in a given run. Third, safe outputs analyzes update content to remove unwanted patterns, such as output sanitization to remove URLs. Only artifacts that pass through the entire safe outputs pipeline can be passed on, ensuring that each stage’s side effects are explicit and vetted.

Log everything

Even with zero secrets and vetted writes, an agent can still transform repository data and invoke tools in unintended ways or try to break out of the constraints that we impose upon it. Agents are determined to accomplish their tasks by any means and have a surprisingly deep toolbox of tricks for doing so. If an agent behaves unexpectedly, post-incident analysis requires visibility into the complete execution path.

Agentic workflows make observability a first-class property of the architecture by logging extensively at each trust boundary. Network and destination-level activity is recorded at the firewall layer, model request/response metadata and authenticated requests are captured by the API proxy; and tool invocations are logged by the MCP gateway and MCP servers. We also add internal instrumentation to the agent container to audit potentially sensitive actions like environment variable accesses. Together, these logs support end-to-end forensic reconstruction, policy validation, and rapid detection of anomalous agent behavior.

Pervasive logging also lays the foundation for future information-flow controls. Every location where communication can be observed is also a location where it can be mediated. Agentic workflows already support the GitHub MCP server’s lockdown mode, and in the coming months, we’ll introduce additional safety controls that enforce policies across MCP servers based on visibility (public vs. private) and the role of a repository object’s author.

What’s next?

We’d love for you to be involved! Share your thoughts in the Community discussion or join us (and tons of other awesome makers) in the #agentic-workflows channel of the GitHub Next Discord. We look forward to seeing what you build with GitHub Agentic Workflows. Happy automating, and keep an eye out for more updates!

The post Under the hood: Security architecture of GitHub Agentic Workflows appeared first on The GitHub Blog.

Automate repository tasks with GitHub Agentic Workflows

Don Syme — Fri, 13 Feb 2026 14:00:00 +0000

Imagine visiting your repository in the morning and feeling calm because you see:

Issues triaged and labelled
CI failures investigated with proposed fixes
Documentation has been updated to reflect recent code changes.
Two new pull requests that improve testing await your review.

All of it visible, inspectable, and operating within the boundaries you’ve defined.

That’s the future powered by GitHub Agentic Workflows: automated, intent-driven repository workflows that run in GitHub Actions, authored in plain Markdown and executed with coding agents. They’re designed for people working in GitHub, from individuals automating a single repo to teams operating at enterprise or open-source scale.

At GitHub Next, we began GitHub Agentic Workflows as an investigation into a simple question: what does repository automation with strong guardrails look like in the era of AI coding agents? A natural place to start was GitHub Actions, the heart of scalable repository automation on GitHub. By bringing automated coding agents into actions, we can enable their use across millions of repositories, while keeping decisions about when and where to use them in your hands.

GitHub Agentic Workflows are now available in technical preview. In this post, we’ll explain what they are and how they work. We invite you to put them to the test, to explore where repository-level AI automation delivers the most value.

AI repository automation: A revolution through simplicity

The concept behind GitHub Agentic Workflows is straightforward: you describe the outcomes you want in plain Markdown, add this as an automated workflow to your repository, and it executes using a coding agent in GitHub Actions.

This brings the power of coding agents into the heart of repository automation. Agentic workflows run as standard GitHub Actions workflows, with added guardrails for sandboxing, permissions, control, and review. When they execute, they can use different coding agent engines—such as Copilot CLI, Claude Code, or OpenAI Codex—depending on your configuration.

The use of GitHub Agentic Workflows makes entirely new categories of repository automation and software engineering possible, in a way that fits naturally with how developer teams already work on GitHub. All of them would be difficult or impossible to accomplish traditional YAML workflows alone:

Continuous triage: automatically summarize, label, and route new issues.
Continuous documentation: keep READMEs and documentation aligned with code changes.
Continuous code simplification: repeatedly identify code improvements and open pull requests for them.
Continuous test improvement: assess test coverage and add high-value tests.
Continuous quality hygiene: proactively investigate CI failures and propose targeted fixes.
Continuous reporting: create regular reports on repository health, activity, and trends.

These are just a few examples of repository automations that showcase the power of GitHub Agentic Workflows. We call this Continuous AI: the integration of AI into the SDLC, enhancing automation and collaboration similar to continuous integration and continuous deployment (CI/CD) practices.

GitHub Agentic Workflows and Continuous AI are designed to augment existing CI/CD rather than replace it. They do not replace build, test, or release pipelines, and their use cases largely do not overlap with deterministic CI/CD workflows. Agentic workflows run on GitHub Actions because that is where GitHub provides the necessary infrastructure for permissions, logging, auditing, sandboxed execution, and rich repository context.

In our own usage at GitHub Next, we’re finding new uses for agentic workflows nearly every day. Throughout GitHub, teams have been using agentic workflows to create custom tools for themselves in minutes, replacing chores with intelligence or paving the way for humans to get work done by assembling the right information, in the right place, at the right time. A new world of possibilities is opening for teams and enterprises to keep their repositories healthy, navigable, and high-quality.

Let’s talk guardrails and control

Designing for safety and control is non-negotiable. GitHub Agentic Workflows implements a defense-in-depth security architecture that protects against unintended behaviors and prompt-injection attacks.

Workflows run with read-only permissions by default. Write operations require explicit approval through safe outputs, which map to pre-approved, reviewable GitHub operations such as creating a pull request or adding a comment to an issue. Sandboxed execution, tool allowlisting, and network isolation help ensure that coding agents operate within controlled boundaries.

Guardrails like these make it practical to run agents continuously, not just as one-off experiments. See our security architecture for more details.

One alternative approach to agentic repository automation is to run coding agent CLIs, such as Copilot or Claude, directly inside a standard GitHub Actions YAML workflow. This approach often grants these agents more permission than is required for a specific task. In contrast, GitHub Agentic Workflows run coding agents with read-only access by default and rely on safe outputs for GitHub operations, providing tighter constraints, clearer review points, and stronger overall control.

A simple example: A daily repo report

Let’s look at an agentic workflow which creates a daily status report for repository maintainers.

In practice, you will usually use AI assistance to create your workflows. The easiest way to do this is with an interactive coding agent. For example, with your favorite coding agent, you can enter this prompt:

Generate a workflow that creates a daily repo status report for a maintainer. Use the instructions at https://github.com/github/gh-aw/blob/main/create.md

The coding agent will interact with you to confirm your specific needs and intent, write the Markdown file, and check its validity. You can then review, refine, and validate the workflow before adding it to your repository.

This will create two files in .github/workflows:

daily-repo-status.md (the agentic workflow)
daily-repo-status.lock.yml (the corresponding agentic workflow lock file, which is executed by GitHub Actions)

The file daily-repo-status.md will look like this:

--- 
on: 
  schedule: daily 
 
permissions: 
  contents: read 
  issues: read 
  pull-requests: read 
 
safe-outputs: 
  create-issue: 
    title-prefix: "[repo status] " 
    labels: [report] 
 
tools: 
  github: 
---  
 
# Daily Repo Status Report 
 
Create a daily status report for maintainers. 
 
Include 
- Recent repository activity (issues, PRs, discussions, releases, code changes) 
- Progress tracking, goal reminders and highlights 
- Project status and recommendations 
- Actionable next steps for maintainers 
 
Keep it concise and link to the relevant issues/PRs.

This file has two parts:

Frontmatter (YAML between --- markers) for configuration
Markdown instructions that describe the job in natural language in natural language

The Markdown is the intent, but the trigger, permissions, tools, and allowed outputs are spelled out up front.

If you prefer, you can add the workflow to your repository manually:

Create the workflow: Add  daily-repo-status.md with the frontmatter and instructions.
Create the lock file:
- gh extension install github/gh-aw 
- gh aw compile
Commit and push: Commit and push files to your repository.
Add any required secrets: For example, add a token or API key for your coding agent.

Once you add this workflow to your repository, it will run automatically or you can trigger it manually using GitHub Actions. When the workflow runs, it creates a status report issue like this:

What you can build with GitHub Agentic Workflows

If you’re looking for further inspiration Peli’s Agent Factory is a guided tour through a wide range of workflows, with practical patterns you can adapt, remix, and standardize across repos.

A useful mental model: if repetitive work in a repository can be described in words, it might be a good fit for an agentic workflow.

If you’re looking for design patterns, check out ChatOps, DailyOps, DataOps, IssueOps, ProjectOps, MultiRepoOps, and Orchestration.

Uses for agent-assisted repository automation often depend on particular repos and development priorities. Your team’s approach to software development will differ from those of other teams. It pays to be imaginative about how you can use agentic automation to augment your team for your repositories for your goals.

Practical guidance for teams

Agentic workflows bring a shift in thinking. They work best when you focus on goals and desired outputs rather than perfect prompts. You provide clarity on what success looks like, and allow the workflow to explore how to achieve it. Some boundaries are built into agentic workflows by default, and others are ones you explicitly define. This means the agent can explore and reason, but its conclusions always stay within safe, intentional limits.

You will find that your workflows can range from very general (“Improve the software”) to very specific (“Check that all technical documentation and error messages for this educational software are written in a style suitable for an audience of age 10 or above”). You can choose the level of specificity that’s appropriate for your team.

GitHub Agentic Workflows use coding agents at runtime, which incur billing costs. When using Copilot with default settings, each workflow run typically incurs two premium requests: one for the agentic work and one for a guardrail check through safe outputs. The models used can be configured to help manage these costs. Today, automated uses of Copilot are associated with a user account. For other coding agents, refer to our documentation for details. Here are a few more tips to help teams get value quickly:

Start with low-risk outputs such as comments, drafts, or reports before enabling pull request creation.
For coding, start with goal-oriented improvements such as routine refactoring, test coverage, or code simplification rather than feature work.
For reports, use instructions that are specific about what “good” looks like, including format, tone, links, and when to stop.
Agentic workflows create an agent-only, sub-loop that’s able to be autonomous because agents are acting under defined terms. But it’s important that humans stay in the broader loop of forward progress in the repository, through reports, issues, and pull requests. With GitHub Agentic Workflows, pull requests are never merged automatically, and humans must always review and approve.
Treat the workflow Markdown as code. Review changes, keep it small, and evolve it intentionally.

Continuous AI works best if you use it in conjunction with CI/CD. Don’t use agentic workflows as a replacement for GitHub Actions YAML workflows for CI/CD. This approach extends continuous automation to more subjective, repetitive tasks that traditional CI/CD struggle to express.

Build the future of automation with us

GitHub Agentic Workflows are available now in technical preview and are a collaboration between GitHub, Microsoft Research, and Azure Core Upstream. We invite you to try them out and help us shape the future of repository automation.

We’d love for you to be involved! Share your thoughts in the Community discussion, or join us (and tons of other awesome makers) in the #agentic-workflows channel of the GitHub Next Discord. We look forward to seeing what you build with GitHub Agentic Workflows. Happy automating!

Try GitHub Agentic Workflows in a repo today! Install gh-aw, add a starter workflow or create one using AI, and run it. Then, share what you build (and what you want next).

The post Automate repository tasks with GitHub Agentic Workflows appeared first on The GitHub Blog.

How students teamed up to decode 2,000-year-old texts using AI

Juan Pablo Flores Cortés — Thu, 03 Oct 2024 17:51:58 +0000

research and copywriting by Emily Akers

What began as a challenge to decipher texts buried for centuries under the ashes of Mount Vesuvius quickly turned into a historical breakthrough using AI and GitHub tools. Three students—Youssef Nader, Luke Farritor, and Julian Schilliger—worked together across time zones and borders to unlock the secrets of 2,000-year-old scrolls, ultimately winning the Vesuvius Challenge and earning $700,000 in prizes. Their work showcases the powerful intersection of AI, open-source collaboration, and the drive to solve mysteries that have puzzled scientists for centuries.

About the challenge

In March of 2023, a group of leading technologists created the Vesuvius Challenge, a competition to decipher the Herculaneum Papyri which were buried after the eruption of Mount Vesuvius 2,000 years ago. Due to carbonization, these scrolls cannot be opened without falling to pieces. So how do we access these ancient texts?

In an interview with NPR’s Simon Scott, Brent Seals, a computer science professor at the University of Kentucky, spoke about the virtual unwrapping of scrolls and evolution of technology in this realm. This technology has been around for decades, but there was a breakthrough in 2015 which led to the reading of a scroll in the Dead Sea Scrolls collection. The scientists used tomography and X-rays and found success with the Dead Sea Scrolls, but were unsuccessful when it came to Herculaneum texts. “Not only were those scrolls difficult to apply virtual unwrapping to, but the ink from the ancient world did not readily show up in the scans that we made, and we needed an AI-based approach to be able to see that ink,” Seals said. Once the scientists captured these scans of the Herculaneum papyri, technologists from around the world set out to analyze them.

See Brent Seals speak about the process of using virtual unwrapping to recover some of the text on the Herculaneum and how this led to the creation of the Vesuvius Challenge.

Meet the Students

With hundreds of engineers around the world working on this, it’s no surprise that the winning group met and worked together completely online. Youssef Nader, Luke Farritor, and Julian Schilliger teamed up through the challenge’s Discord server and chatted about the other projects they had worked on under the Vesuvius Challenge. Luke won the First Letters Prize and reached out to Youssef upon his being named runner-up. While we weren’t able to sit down with Luke, he described the process of realizing he had found the first letter in this interview with his school, University of Nebraska-Lincoln.

Click to hear Luke explain his experience and the moment he realized that he had discovered the first letters in the Vesuvius Challenge.

At this point in the competition, Luke was working on segmentation and Youssef had created AI detection models. Shortly thereafter, the two teamed up with Julian who made a breakthrough automating the segmentation working extensively with GitHub Copilot.

“There are so many benefits to using GitHub as a student, from a free GitHub Pro account to the use of GitHub Copilot in Visual Studio Code,” Julian said. “For the Vesuvius challenge, I had to write pipeline most of the time. I’d write one piece of code that I needed to use to achieve the next goal and if I knew what I wanted to write, I could use the auto completion tool to help me write faster. It was a huge time-saver!”

We also sat down with Youssef who shared insight into his experience.

“We were working in different time zones, so sometimes, we had to intentionally overlap with each other which led to some very late nights and early mornings,” he said. “All of it was worth it because not only did the team win the challenge, but they were able to create special memories together. I remember a night or so before the submission, Julian sent me an exciting message that around 1600 cm^2 of the scrolls were being segmented by his software. We spent the better part of the night hunting for the title of the scroll.”

Youssef recalled the morning that everything started coming together. “I had almost given up hope. I had tried so many different things. In the morning, I was running one last experiment and to my surprise, it worked. There were some parts that made me feel connected to this 2000 year old scroll in a way. It required tracing the writing of this ancient writer on the scroll and finding smart ways of figuring out what letter it would be based on ink deposits.”

These three students realized the dream of all papyrologists from 1754 AD onward. Papyrologist Gianluca Del Mastro recalled meeting Luke in Kentucky for the First Letter Prize. “I saw this young student in front of me. It amazed me as I was expecting someone older. It made me realize we have entered a new world of information technology in which it’s possible to make new discoveries even if you are very young.”

The team invites everyone to take a look at their winning team’s code on GitHub. Having used GitHub for many years and utilizing the tools in the Student Developer Pack, Youssef and Julian felt it was the perfect place to share their team’s findings. “This challenge from the very beginning was to foster collaboration even in the face of competition. Housing our code on GitHub was the only thing that made sense so the community can continue to build and to have easy access to collaborate and push progress forward,” said Youssef.

After the challenge

Dr. Del Mastro had the chance to meet two of the winning teammates after his team sponsored their flight out to Naples, Italy to see the scrolls in person. It was the first time that Youssef and Julian were able to meet in person. “It was surreal to meet in person after spending so much time collaborating over the internet,” Youssef shared. While in Naples, the two went to a conference where they were able to meet some of the professors who were behind the evaluation of their work. Youssef happily reports that he, Julian, and Luke are still in touch and hope they can all work on a project together in the future.

Left to Right: Aya Elzoheiry, Youssef Nader, Julian Schilliger, Marzia D’Angelo, Claudio Vergara, Fabrizio Diozzi, Alessia Lavorante.
Front: Rossella Villa

The experience was life changing in so many ways. Not only did the three winners help uncover part of the past, but one of them found his future. Julian shared that through the challenge, he’s met so many wonderful teachers and mentors who opened his eyes to all the work to be done at the intersection of code and history. Since completing the challenge, he accepted a full-time role working at the Vesuvius Project where he spends his time decoding the scrolls and learning new information about the ancient past.

Julian said, “Youssef, Luke, and I won this grand prize, but this is only a small piece in the ongoing efforts to decode the scrolls. Lots of people have worked on this before 2023 and there is plenty left to be done in 2024.” If you’re interested in getting involved, check out their Discord.

Are you a student or teacher? Get started with the Student Developer Pack.

The post How students teamed up to decode 2,000-year-old texts using AI appeared first on The GitHub Blog.

How GitHub harnesses AI to transform customer feedback into action

Mariana Borges — Tue, 30 Jul 2024 17:00:12 +0000

In today’s rapidly evolving tech landscape, the buzz around “generative AI” is impossible to ignore. It’s everywhere—on TV and social media, in productivity tools, at conferences, in our phones, you name it. The hype is real and what excites us at GitHub is the transformative potential of AI that we’re just beginning to unlock.

At GitHub, our primary goal is to continuously improve our platform to better serve our beloved developer community. We receive countless pieces of feedback through our support portal every day. The sheer volume of feedback we receive can be daunting. Despite our best efforts, manually sifting through all of that text data is an overwhelming challenge that results in a lot of untapped opportunities. Manual data classification and analysis by humans is error-prone and very time-consuming, as it often involves handling vast amounts of data, leading to fatigue and inconsistency. A Harvard Business Review study reveals that data scientists spend about 80% of their time on tasks like data collection and organization, including manual classification, which impedes efficiency and delays the discovery of valuable insights. This inefficiency is driving a shift toward automated systems that offer greater accuracy and speed, moving away from traditional analytics methods.

This challenge drove us to combine powerful data-mining techniques with machine learning algorithms to extract, interpret, and analyze customer feedback at scale. By transforming customer feedback into actionable insights through advanced AI analytics, we are able to advance our products and reinforce our commitment to user trust, ensuring that every voice is heard and valued.

Amplifying developer voices with AI

When I joined GitHub’s Customer Success Engineering team as a program manager, I started working closely with multiple product and engineering teams to provide actionable insights on product performance. One question remained constant during my tenure at GitHub: what are the main pain points customers are experiencing with our products? Even though it seems an easy question to answer, I found it very difficult to distill down the vast amount of data and ensure I was highlighting the right opportunities to improve customer experience. After reading hundreds of support tickets, I was driven by a focused mission: finding a way to honor the insights and feedback we receive from our vast user base and—in particular—let their voices help guide us as we prioritize the development of new features.

Although I have a passion for analytics, and I helped build multiple internal tools in the past, I knew this wouldn’t be an easy task due to the complexity of customer feedback text hidden in support tickets. I turned to my colleague, Steven Solomon, a staff software engineer with extensive experience, to explore potential solutions. Eventually, inspiration struck us: what if we could leverage the power of AI to systematically analyze and interpret our developer community’s feedback?

We then began to explore the market for AI-driven analytics solutions, but we quickly realized that we needed a tool that adhered to strict security and privacy regulations and incorporated tailored business metrics to be able to tell a compelling story to our product teams. We were inspired by the idea that “being able to visualize data and tell stories with it is key to turning it into information that can be used to drive better decision making” (Storytelling with Data). Motivated by this principle, we assembled a team of software engineers who shared the same mission and passion to create a unique internal AI analytics tool that presents the most relevant and actionable trends, complete with business context specifically tailored to GitHub’s product areas.

Experimenting with open source models

As the world’s largest open source code ecosystem, our journey with AI-driven analytics started by looking into open-source AI models hosted in our platform, including BERTopic. BERTopic is an open source topic modeling framework that leverages Bidirectional Encoder Representations from Transformers (BERT) embeddings to create dynamic and interpretable topics. BERT language model is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. BERTopic combines BERT’s ability to generate high-quality document embeddings with a clustering algorithm, typically Hierarchical Density-Based Spatial Clustering of Applications with Noise, (HDBSCAN), to group similar documents together. The topics are then derived by extracting and aggregating the most representative words from each cluster.

One of the standout capabilities of BERT is its ability to understand and process multiple languages. This multilingual capability stems from its training on diverse datasets that include text in various languages. As a result, BERTopic can effectively analyze feedback from our global user base, identifying themes and issues regardless of the language in which the feedback is provided. This multilingual proficiency ensures that we capture a comprehensive picture of our user feedback, allowing us to more effectively address the needs of our international community.

One key aspect to highlight is that we don’t train any models with customer feedback from support tickets. Instead, we apply the pre-trained models to analyze the feedback text data and generate the insights.

The representative words generated by this model kick-started our project, but there was still a key piece missing—the outputs needed to be easily understandable by humans. A group of words is different from a full representative sentence of customer pain. This led to the next phase of development: summarizing those clusters into actionable insights.

Summarizing insights with GPT-4

To ensure we could display the customer feedback insights in a more comprehensible form, we decided to summarize them using GPT-4, a powerful and popular large language model (LLM).

GPT-4 is particularly effective at summarizing topic clusters because of its advanced natural language processing capabilities. The model can also be optimized to better understand the specific context of data. Optimizing GPT-4 without retraining the model involves adjusting the way we use it, such as adjusting prompts and setting parameters for specific tasks.

Optimizing GPT-4 without retraining the model involves:

Optimizing prompts. Crafting and refining prompts to guide the model in generating relevant summaries.
Setting parameters. Adjusting settings like temperature, max tokens, top-p, frequency, and presence penalties to control the model’s output.
Iterative feedback. Continuously improving the model’s performance through human feedback and A/B testing.

This approach allows us to provide more precise and relevant summaries, ensuring that we surface valuable patterns to help uncover untapped opportunities and make more informed decisions.

Ship to learn

At GitHub, our “ship to learn” ethos is deeply rooted in our history. We truly value the journey as much as the destination. We believe that we can learn from every failure and that those failures lead us closer to success.

At first, we weren’t sure how to effectively communicate the data we generated. Generating useful AI insights might be a difficult task, but telling a good story with them can be an even more difficult task. Good visuals can help inform better business decisions, while bad visuals can confuse the audience and impede efficiency. Understanding the context, choosing the right visuals, and only displaying the important information are key aspects to successfully communicate data. To understand the context completely, we needed to fully understand our audience’s needs, so we revisited the fundamental question of why we needed these insights in the first place. The specifics of what data to show and how to present it would follow.

Embracing the “ship to learn” mindset, we decided to quickly generate an Azure Data Explorer (ADX) dashboard for the first trials. We developed multiple visuals and shared them across the company to collect feedback. This process helped us identify which visualizations our internal users found valuable and which ones were less effective. It became clear that we needed a tailored tool that incorporated business-specific context into our data. Only then could we effectively tell stories such as “Here are the top 10 customer pain points in support tickets for X product/feature.” This meant that we needed to create our own tool with advanced filtering capabilities to effectively navigate the intricacies of our feedback insights. Additionally, we needed the ability to connect the insights generated by our internal systems, enabling us to prioritize actions more effectively.

This marked the beginning of developing our internal web application to communicate insights through visuals. We now had the data, the context, and the effective visuals. The final piece was ensuring we focused our audience’s attention on the most important insights. Attributes, such as position on the page, color, and size, can help direct the audience’s attention to the most important information. Once again, we decided to ship our minimum viable product (MVP) to start collecting feedback and iterating on the visuals most relevant to our product teams. Following its official internal launch, our tool began revealing valuable insights in massive customer feedback text datasets, unlocking an array of new use cases that we were eager to explore.

Real-world impact

Integrating AI into our feedback analysis process has driven impactful outcomes:

Transitioning from manual classification to automated trend identification. Using automated AI-driven trend identification has significantly enhanced our ability to scale our data analysis efforts. This shift saves time and increases the precision with which we understand and respond to developer feedback in support tickets.
Identifying and addressing common pain points. Clustering feedback helps us identify recurring problems quicker and address them more efficiently. This can minimize disruption and enhance user productivity on the platform.
Improving feature prioritization. By understanding what our developer community needs most, we can focus our efforts on the features that will provide the greatest benefit to them.
Making data-driven decisions. By taking advantage of the clear, summarized insights our tool provides, our internal teams can make more informed decisions that are more aligned with the needs and desires of our developer community.
Discovering new self-serve opportunities. The insights generated enable the identification of self-help opportunities that empower customers to resolve issues on their own more swiftly. This expedites problem resolution for users and enhances their capability to manage future issues independently, reducing dependency on direct support.

Moving forward

As we continue to refine our AI-driven analytics capabilities and incorporate more sophisticated techniques, we are excited about the potential to further enhance our understanding of customer feedback. Our commitment to leveraging AI not only demonstrates our dedication to innovation but also ensures that the voice of our developers remains at the heart of everything we do.

In conclusion, using AI to analyze customer feedback has transformed how we interact with and respond to our developer community. By turning vast amounts of feedback text data into actionable insights, we are better equipped to meet the needs of our users and drive the future of software development.

Next time you provide feedback on one of our platforms or through a support ticket, keep this in mind and add as many details as you can. Your detailed feedback helps us make more informed decisions and improve GitHub for everyone.

The post How GitHub harnesses AI to transform customer feedback into action appeared first on The GitHub Blog.

How MLOps can drive governance for machine learning: A conversation with Algorithmia

GitHub Partnerships — Thu, 11 Mar 2021 16:59:50 +0000

This post features a guest interview with Diego M. Oppenheimer, CEO at Algorithmia

Over the past few years, machine learning has grown in adoption within the enterprise. More organizations are realizing the importance of machine learning to deliver results for their businesses. Increasingly, data scientists and machine learning engineers are storing models and even canonical data sets on GitHub so they can productize their work.

To explore how businesses are making machine learning enterprise-ready, our VP of Business Development and Partner Engineering, Dana Lawson, sat down with Diego M. Oppenheimer, CEO at Algorithmia, to talk about the current state of MLOps (machine learning operations) and what companies should consider as part of their own machine learning workflows.

Let’s dive in!

For our audience that is unfamiliar with this topic, can you tell them what MLOps is?

Machine learning operations (MLOps) is the discipline of delivering machine learning models through repeatable and efficient workflows. In short, it’s what enables businesses to scale their production capacities to the point of delivering significant results from machine learning. And it’s going to be an essential component to enterprises industrializing their AI efforts in the future.

Besides common architectural challenges (such as hardware orchestration, container management, load balancing, and inference API management), organizations also struggle with security, governance, and versioning of ML artifacts—this is a challenge that must be solved to ensure that machine learning can be widely productized in the future. This is where MLOps comes in, and why businesses need it to unlock the value in their AI.

What are most machine learning teams doing today, and what are their biggest challenges in operationalizing their work?

Today, most machine learning teams in the enterprise are working with a disparate toolchain. They do this because there are different tools optimized for each part of the ML lifecycle—there’s tools for storage data sets, tools for large models, then also for versioning notebooks, evaluating and testing models, and then deploying to production.

In addition to maintaining and productizing their own models, ML teams need a way to continuously collaborate with other teams. Many ML teams are incorporating DevOps principles into their work, but others still work in their own research silos, apart from product or go-to-market teams that can help them put all the pieces together in a secure, compliant, and reliable way to reach business goals.

Tell us about the current trends you’re seeing in machine learning

In our 2021 enterprise trends in machine learning report, we found that 83% of organizations have increased their budgets for AI and ML year-on-year, and the average number of data scientists employed has increased by 76% over the same period. But those same organizations are struggling to manage and scale those efforts. Our report found that the time required to deploy an ML model has actually increased—implying that many organizations are manually scaling their ML efforts rather than focusing on the underlying operational efficiencies that enable businesses to achieve greater results through ML. In other words, they’re taking on more technical debt instead of fixing a broken ML lifecycle.

Related to the challenge of technical debt, we’re also seeing that organizations are struggling with a variety of operational issues, especially when it comes to governance. In our report, 56% of survey respondents indicated that they struggle with IT governance, security, and auditability requirements—and 67% reported needing to comply with multiple regulations. This was the top issue reported by respondents, but a variety of other issues spanned across the ML lifecycle. For example, 49% reported integration and compatibility issues surrounding ML technologies, programming languages, and frameworks, making that the second-most-common challenge.

In a few years, what will MLOps look like?

To solve the problem of disparate ML tooling, there needs to be a clearly defined, canonical stack for AI and machine learning. Algorithmia has recently joined the AI Infrastructure Alliance, a group of like-minded companies that are coming together to define this canonical stack. Our hope is that open standards will help accelerate adoption and innovation with machine learning, making it truly portable and scalable, with endpoints to help organizations manage security, governance, and monitoring, much as we see with open source standards like Linux or Kubernetes today.

Ideally, with those open standards, businesses will be able to train, evaluate, and host models anywhere. Similar to containerized applications today, you’ll be able to deploy models anywhere—whether it’s in the cloud, on the edge, or somewhere in between for fog computing.

At GitHub, we love automation and the interconnected toolchain. What do you see as some of the benefits of embracing MLOps for deployment automation?

Besides freeing you from time-consuming and error-prone manual operations, automated deployments are an important component of model governance. As a policy, some organizations require that deployments involve multiple approvals and are only done through automated processes.

In addition, continuous deployment workflows help with tracing the deployed artifacts to their sources, while also making the deployment process repeatable. So productionizing your models on Algorithmia through rule-based automations is a good practice as part of your organization’s model governance framework.

Productionize machine learning with GitHub Actions and Algorithmia

Algorithmia is machine learning operations (MLOps) software that manages all stages of the production ML lifecycle within existing operational processes so you can put models into production quickly, securely, and cost-effectively. To get started making your machine learning enterprise-ready, check out Algorithmia’s actions in GitHub Marketplace, and visit the walkthrough on the Algorithmia blog.

The post How MLOps can drive governance for machine learning: A conversation with Algorithmia appeared first on The GitHub Blog.

Using GitHub Actions for MLOps & Data Science

Hamel Husain — Wed, 17 Jun 2020 15:00:29 +0000

Background

Machine Learning Operations (or MLOps) enables Data Scientists to work in a more collaborative fashion, by providing testing, lineage, versioning, and historical information in an automated way. Because the landscape of MLOps is nascent, data scientists are often forced to implement these tools from scratch. The closely related discipline of DevOps offers some help, however many DevOps tools are generic and require the implementation of “ML awareness” through custom code. Furthermore, these platforms often require disparate tools that are decoupled from your code leading to poor debugging and reproducibility.

To mitigate these concerns, we have created a series of GitHub Actions that integrate parts of the data science and machine learning workflow with a software development workflow. Furthermore, we provide components and examples that automate common tasks.

An Example Of MLOps Using GitHub Actions

Consider the below example of how an experiment tracking system can be integrated with GitHub Actions to enable MLOps. In the below example, we demonstrate how you can orchestrate a machine learning pipeline to run on the infrastructure of your choice, collect metrics using an experiment tracking system, and report the results back to a pull request.

A screenshot of this pull request.

For a live demonstration of the above example, please see this talk.

MLOps is not limited to the above example. Due to the composability of GitHub Actions, you can stack workflows in many ways that can help data scientists. Below is a concrete example of a very simple workflow that adds links to mybinder.org on pull requests:

name: Binder
on: 
  pull_request:
    types: [opened, reopened]

jobs:
  Create-Binder-Badge:
    runs-on: ubuntu-latest
    steps:

    - name: checkout pull request branch
      uses: actions/checkout@v2
      with:
        ref: ${{ github.event.pull_request.head.sha }}

    - name: comment on PR with Binder link
      uses: actions/github-script@v1
      with:
        github-token: ${{secrets.GITHUB_TOKEN}}
        script: |
          var BRANCH_NAME = process.env.BRANCH_NAME;
          github.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: `[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/${context.repo.owner}/${context.repo.repo}/${BRANCH_NAME}) 👈 Launch a binder notebook on this branch`
          }) 
      env:
        BRANCH_NAME: ${{ github.event.pull_request.head.ref }}

When the above YAML file is added to a repository’s .github/workflow directory, pull requests can be annotated with a useful link as illustrated below [1]:

A Growing Ecosystem of MLOps & Data Science Actions

There is a growing number of Actions available for machine learning ops and data science. Below are some concrete examples that are in use today, categorized by topic.

Orchestrating Machine Learning Pipelines:

Submit Argo Workflows – allows you to orchestrate machine learning pipelines that run on Kubernetes.
Publish Kubeflow Pipelines to GKE – Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

Jupyter Notebooks:

Run parameterized Notebooks – run notebooks programmatically using papermill.
Repo2Docker Action – Automatically turn data-science repositories into Jupyter-enabled Docker containers using repo2docker.
fastai/fastpages – share information from Jupyter notebooks as blog posts using GitHub Actions & GitHub Pages.

End-To-End Workflow Orchestration:

Examples and templates for utilizing Azure Machine Learning from GitHub Actions.

Experiment Tracking

Fetch runs from Weights & Biases – W&B is an experiment tracking and logging system for machine learning and is free for open-source projects.

This is by no means an exhaustive list of the things you might want to automate with GitHub Actions with respect to data science and machine learning. You can follow our progress towards this goal on our page, which contains links to blog posts, GitHub Actions, talks, and examples that are relevant to this topic.

We invite the community to create other Actions that might be useful for the community. Some ideas for getting started include data and model versioning, model deployment, data validation, as well as expanding upon some of the areas mentioned above. A great place to start is the documentation for GitHub Actions, particularly on how to build Actions for the community!

Our page with relevant materials.
GitHub Actions official documentation.
Hello world Docker Action: A template to demonstrate how to build a Docker Action for other people to use
Using self-hosted runners.
This talk introducing Actions for data science, including some live-coding!
Awesome Actions: A curated list of interesting GitHub Actions by topic
Useful GitHub Actions to know about when getting started:
- actions/checkout: Allows you to quickly clone the contents of your repository into your environment, which you often want to do. This does a number of other things such as automatically mount your repository’s files into downstream Docker containers.
- mxschmitt/action-tmate: Provides a way to debug Actions interactively. This uses port forwarding to give you a terminal in the browser that is connected to your Actions runner. Be careful not to expose sensitive information if you use this.
- actions/github-script: Gives you a pre-authenticated ocotokit.js client that allows you to interact with the GitHub API to accomplish almost any task on GitHub automatically. Only these endpoints are supported.

Footnotes:

[1] This example workflow will not work on pull requests from forks. To enable this, you have to trigger a PR comment to occur via a different event.

The post Using GitHub Actions for MLOps & Data Science appeared first on The GitHub Blog.

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Kavita Ganesan — Tue, 02 Jul 2019 16:51:00 +0000

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and Javascript to esoteric languages such as Befunge, only known to very small communities.

Figure 1: Top 10 programming languages hosted by GitHub by repository count

One of the necessary challenges that GitHub faces is to be able to recognize these different languages. When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite the appearance, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within README’s, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance.

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python, Keras with TensorFlow backend—and is built to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories internally crowdsourced. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform where some languages only have a handful of files and some files were just too sparsely populated. Augmenting our training set with some additional sources was therefore necessary and substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet a minimum qualifying criteria such as having a minimum number of forks, covering the target language and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.

Features: leveraging prior knowledge

Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory Networks (LSTM) are often employed. However, given that programming languages have differences in vocabulary, commenting style, file extensions, structure, libraries import style and other minor differences, we opted for a simpler approach that leverages all this information by extracting some relevant features in tabular form to be fed to our classifier. The features currently extracted are as follows:

Top five special characters per file
Top 20 tokens per file
File extension
Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with Tensorflow backend.

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source).

Here we show three tests. The first test is with the test set untouched in any way. The second test uses the same set of test files with file extension information removed and the third test also uses the same set of files but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java”) extension.

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

Figure 3: Performance of OctoLingua vs. Linguist on the same test set

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time.

Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features.

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a bulk of files in the new language (we can do this programmatically as described in data sources). These files are split into a training and a test set and then run through our preprocessor and feature extractor. This new train and test set is added to our existing pool of training and testing data. The new testing set allows us to verify that the accuracy of our model remains acceptable.

Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far fetched to take the model to the stage where it can reliably detect and classify embedded languages.

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day to day development work in addition to helping them write quality code. If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

Authors

Kavita Ganesan, @kavgan, Machine Learning Engineer
Romano Foti, @romanofoti, Senior Machine Learning Engineer

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Towards Natural Language Semantic Code Search

GitHub Engineering — Tue, 18 Sep 2018 07:00:00 +0000

This blog post complements a live demonstration on our recently announced site: experiments.github.com

Motivation

Searching code on GitHub is currently limited to keyword search. This assumes either the user knows the syntax, or can anticipate what keywords might be in comments surrounding the code they are looking for. Our machine learning scientists have been researching ways to enable the semantic search of code.

To fully grasp the concept of semantic search, consider the below search query, “ping REST api and return results”:

Note that the demonstrated semantic search returns reasonable results even though there are no keywords in common between the search query and the text (the code & comments found do not contain the words “Ping”, “REST” or “api”)! The implications of augmenting keyword search with semantic search are profound. For example, such a capability would expedite the process of on-boarding new software engineers onto projects and bolster the discoverability of code in general.

In this post, we want to share how we are leveraging deep learning to make progress towards this goal. We also share an open source example with code and data that you can use to reproduce these results!

Introduction

One of the key areas of machine learning research underway at GitHub is representation learning of entities, such as repos, code, issues, profiles and users. We have made significant progress towards enabling semantic search by learning representations of code that share a common vector space as text. For example, consider the below diagram:

In the above example, Text 2 (blue) is a reasonable description of the code, whereas Text 1 (red) is not related to the code at all. Our goal is to learn representations where (text, code) pairs that describe the same concept are close neighbors, whereas unrelated (text, code) pairs are further apart. By representing text and code in the same vector space, we can vectorize a user’s search query and lookup the nearest neighbor that represents code. Below is a four-part description of the approach we are currently using to accomplish this task:

1. Learning Representations of Code

In order to learn a representation of code, we train a sequence-to-sequence model that learns to summarize code. A way to accomplish this for Python is to supply (code, docstring) pairs where the docstring is the target variable the model is trying to predict. One active area of research for us is incorporating domain specific optimizations like tree-based LSTMs, gated-graph networks and syntax-aware tokenization. Below is a screenshot that showcases the code summarizer model at work. In this example, there are two python functions supplied as input, and in both cases the model produces a reasonable summary of the code as output:

It should be noted that in the above examples, the model produces the summary by using the entire code blob, not merely the function name.

Building a code summarizer is a very exciting project on its own, however, we can utilize the encoder from this model as a general purpose feature extractor for code. After extracting the encoder from this model, we can fine-tune it for the task of mapping code to the vector space of natural language.

We can evaluate this model objectively using the BLEU score. Currently we have been able to achieve a BLEU score of 13.5 on a holdout set of python code, using the fairseq-py library for sequence to sequence models.

2. Learning Representations of Text Phrases

In addition to learning a representation for code, we needed to find a suitable representation for short phrases (like sentences found in Python docstrings). Initially, we experimented with the Universal Sentence Encoder, a pre-trained encoder for text that is available on TensorFlow Hub. While the embeddings from worked reasonably well, we found that it was advantageous to learn embeddings that were specific to the vocabulary and semantics of software development. One area of ongoing research involves evaluating different domain-specific corpuses for training our own model, ranging from GitHub issues to third party datasets.

To learn this representation of phrases, we trained a neural language model by leveraging the fast.ai library. This library gave us easy access to state of the art architectures such as AWD LSTMs, and to techniques such as cyclical learning rates with random restarts. We extracted representations of phrases from this model by summarizing the hidden states using the concat pooling approach found in this paper.

One of the most challenging aspects of this exercise was to evaluate the quality of these embeddings. We are currently building a variety of downstream supervised tasks similar to those outlined here that will aid us in evaluating the quality of these embeddings objectively. In the meantime, we sanity check our embeddings by manually examining the similarity between similar phrases. The below screenshot illustrates examples where we search the vectorized docstrings for similarity against user-supplied phrases:

3. Mapping Code Representations To The Same Vector-Space as Text

Next, we map the code representations we learned from the code summarization model (part 1) to the vector space of text. We accomplish this by fine-tuning the encoder of this model. The inputs to this model are still code blobs, however the target variable the model is now the vectorized version of docstrings. These docstrings are vectorized using the approach discussed in the previous section.

Concretely, we perform multi-dimensional regression with cosine proximity loss to bring the hidden state of the encoder into the same vector-space as text.

We are actively researching alternate approaches that directly learn a joint vector space of code and natural language, borrowing from some ideas outlined here.

4. Creating a Semantic Search System

Finally, after successfully creating a model that can vectorize code into the same vector-space as text, we can create a semantic search mechanism. In its most simple form, we can store the vectorized version of all code in a database, and perform nearest neighbor lookups to a vectorized search query.

Another active area of our research is determining the best way to augment existing keyword search with semantic results and how to incorporate additional information such as context and relevance.
Furthermore, we are actively exploring ways to evaluate the quality of search results that will allow us to iterate quickly on this problem. We leave these topics for discussion in a future blog post.

Summary

The below diagram summarizes all the steps in our current semantic-search workflow:

We are exploring ways to improve almost every component of this approach, including data preparation, model architecture, evaluation procedures, and overall system design. What is described in this blog post is only a minimal example that scratches the surface.

Open Source Examples

Our open-source end-to-end tutorial contains a detailed walkthrough of the approach outlined in this blog, along with code and data you can use to reproduce the results.

This open source example (with some modifications) is also used as a tutorial for the kubeflow project, which is implemented here.

Limitations and Intended Use Case(s)

We believe that semantic code search will be most useful for targeted searches of code within specific entities such as repos, organizations or users as opposed to general purpose “how to” queries. The live demonstration of semantic code search hosted on our recently announced Experiments site does not allow users to perform targeted searches of repos. Instead, this demonstration is designed to share a taste of what might be possible and searches only a limited, static set of python code.

Furthermore, like all machine learning techniques, the efficacy of this approach is limited by the training data used. For example, the data used to train these models are (code, docstring) pairs. Therefore, search queries that closely resemble a docstring have the greatest chance of success. On the other hand, queries that do not resemble a docstring or contain concepts for which there is little data may not yield sensible results. Therefore, it is not difficult to challenge our live demonstration and discover the limitations of this approach. Nevertheless, our initial results indicate that this is an extremely fruitful area of research that we are excited to share with you.

There are many more use cases for semantic code search. For example, we could extend the ideas presented here to allow users to search for code using the language of their choice (French, Mandarin, Arabic, etc.) across many different programming languages simultaneously.

Get In Touch

This is an exciting time for the machine learning research team at GitHub and we are looking to expand. If our work interests you, please get in touch!

Contributors

hamelsmu

hohsiangwu

The post Towards Natural Language Semantic Code Search appeared first on The GitHub Blog.