How we architect and optimize GitHub - The GitHub Blog
https://github.blog/engineering/architecture-optimization/
Updates, ideas, and inspiration from GitHub to help developers build and design software.

Continuous AI for accessibility: How GitHub transforms feedback into inclusion
https://github.blog/ai-and-ml/github-copilot/continuous-ai-for-accessibility-how-github-transforms-feedback-into-inclusion/
Thu, 12 Mar 2026 16:00:00 +0000

AI automates triage for accessibility feedback, allowing us to focus on fixing barriers—turning a chaotic backlog into continuous, rapid resolutions.

The post Continuous AI for accessibility: How GitHub transforms feedback into inclusion appeared first on The GitHub Blog.


For years, accessibility feedback at GitHub didn’t have a clear place to go.

Unlike typical product feedback, accessibility issues don’t belong to any single team—they cut across the entire ecosystem. For example, a screen reader user might report a broken workflow that touches navigation, authentication, and settings. A keyboard-only user might hit a trap in a shared component used across dozens of pages. A low vision user might flag a color contrast issue that affects every surface using a shared design element. No single team owns any of these problems—but every one of them blocks a real person.

These reports require coordination that our existing processes weren’t originally built for. Feedback was often scattered across backlogs, bugs lingered without owners, and users’ follow-ups were met with silence. Improvements were often promised for a mythical “phase two” that rarely materialized.

We knew we needed to change this. But before we could build something better, we had to lay the groundwork—centralizing scattered reports, creating templates, and triaging years of backlog. Only once we had that foundation in place could we ask: How can AI make this easier?

The answer was an internal workflow, powered by GitHub Actions, GitHub Copilot, and GitHub Models, that ensures every piece of user and customer feedback becomes a tracked, prioritized issue. When someone reports an accessibility barrier, their feedback is captured, reviewed, and followed through until it’s addressed. We didn’t want AI to replace human judgment—we wanted it to handle repetitive work so humans could focus on fixing the software.

This is how we went from chaos to a system where every piece of accessibility feedback is tracked, prioritized, and acted on—not eventually, but continuously.

Accessibility as a living system

Continuous AI for accessibility weaves inclusion into the fabric of software development. It’s not a single product or a one-time audit—it’s a living methodology that combines automation, artificial intelligence, and human expertise.

This philosophy connects directly to our support for the 2025 Global Accessibility Awareness Day (GAAD) pledge: strengthening accessibility across the open source ecosystem by ensuring user and customer feedback is routed to the right teams and translated into meaningful platform improvements.

The most important breakthroughs rarely come from code scanners—they come from listening to real people. But listening at scale is hard, which is why we needed technology to help amplify those voices. We built a feedback workflow that functions less like a static ticketing system and more like a dynamic engine—leveraging GitHub products to clarify, structure, and track user and customer feedback, turning it into implementation-ready solutions.

Designing for people first

Before jumping into solutions, we stepped back to understand who this system needed to serve:

  • Issue submitters: Community managers, support agents, and sales reps submit issues on behalf of users and customers. They aren’t always accessibility experts, so they need a system that guides them and teaches accessibility concepts in the flow of work.
  • Accessibility and service teams: Engineers and designers responsible for fixes need structured, actionable data—reproducible steps, WCAG mapping, severity scores, and clear ownership.
  • Program and product managers: Leadership needs visibility into pain points by category, trends, and progress over time to allocate resources strategically.

With these personas in mind, we knew we wanted to 1) treat feedback as data flowing through a pipeline and 2) build a system able to evolve with us.

How feedback flows

With that foundation set, we built an architecture around an event-driven pattern, where each step triggers a GitHub Action that orchestrates what comes next—ensuring consistent handling no matter where the feedback originates. We built this system largely by hand starting in mid-2024. Today, tools like Agentic Workflows let you create GitHub Actions using natural language—meaning this kind of system could be built in a fraction of the time.

The workflow reacts to key events: Issue creation launches GitHub Copilot analysis via the GitHub Models API, status changes initiate hand-offs between teams, and resolution triggers submitter follow-up with the user. Every Action can also be triggered manually or re-run as needed—automation covers the common path, while humans can step in at any point.
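As a rough sketch, this event-driven pattern boils down to a dispatch table: each event kicks off the next stage, and any handler can also be invoked directly, mirroring a manual re-run. All names below are hypothetical illustrations, not GitHub's internal code.

```python
# Minimal sketch of the event-driven pattern: each GitHub event type maps
# to a handler that returns the next steps to run. Event names and handler
# steps are assumptions for illustration.

def on_issue_created(issue):
    # Issue creation launches Copilot analysis and project-board tracking.
    return ["run_copilot_analysis", "add_to_project_board"]

def on_status_changed(issue):
    # Status changes initiate hand-offs between teams.
    return ["hand_off_to_next_team"]

def on_issue_closed(issue):
    # Resolution triggers submitter follow-up with the user.
    return ["notify_submitter_for_followup"]

HANDLERS = {
    "issues.opened": on_issue_created,
    "issues.status_changed": on_status_changed,
    "issues.closed": on_issue_closed,
}

def dispatch(event_type, issue):
    """Route an incoming event to its handler; unknown events are no-ops.

    The same handler functions can be called directly, which models
    re-running an Action manually from the Actions tab."""
    handler = HANDLERS.get(event_type)
    return handler(issue) if handler else []
```

The key property is that automation covers the common path while every step stays individually invokable, so a human can pick up the flow at any point.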

Feedback isn’t just captured—it continuously flows through the right channels, providing visibility, structure, and actionability at every stage.


A left-to-right flowchart showing the seven steps of the feedback workflow in sequence: Intake, Copilot Analysis, Submitter Review, Accessibility Team Review, Link Audits, Close Loop, and Improvement. Feedback loops show that Submitter Review can re-run Copilot Analysis, Close Loop can return to Accessibility Team Review, and Improvement feeds updated prompts back to Copilot Analysis.

1. Actioning intake

Feedback can come from anywhere—support tickets, social media posts, email, direct outreach—but most users choose the GitHub accessibility discussion board. It’s where they can work together and build community around shared experiences. Today, 90% of the accessibility feedback flows through that single channel. Because posts are public, other users can confirm the problem, add context, or suggest workarounds—so issues often arrive with richer detail than a support ticket ever could. Regardless of the source, every piece of feedback gets acknowledged within five business days, and even feedback we can’t act on gets a response pointing to helpful resources.

When feedback requires action from internal teams, a team member manually creates a tracking issue using our custom accessibility feedback issue template. Issue templates are pre-defined forms that standardize how information is collected when opening a new issue. The template captures the initial context—what the user reported, where it came from, and which components are involved—so nothing is lost between intake and triage.

This is where automation kicks in. Creating the issue triggers a GitHub Action that engages GitHub Copilot, and a second Action adds the issue to a project board, providing a centralized view of current status, surfacing trends, and helping identify emerging needs.

A left-to-right flowchart where user or customer feedback enters through Discussion Board, Support Ticket, Social Media, Email, or Direct Outreach, moves to an Acknowledge and Validate step, branches at a validity decision, and either proceeds to Create Issue or loops back through Request More Details to the user.

2. GitHub Copilot analysis

With the tracking issue created, a GitHub Action workflow programmatically calls the GitHub Models API to analyze the report. We chose stored prompts over model fine-tuning so that anyone on the team can update the AI’s behavior through a pull request—no retraining pipeline, no specialized ML knowledge required.

We configured GitHub Copilot using custom instructions developed by our accessibility subject matter experts. Our prompt serves two roles: triage analysis, which classifies issues by WCAG violation, severity, and affected user group, and accessibility coaching, where GitHub Copilot acts as a subject-matter expert to help teams write and review accessible code.

These instruction files point to our accessibility policies, component library, and internal documentation that details how we interpret and apply WCAG success criteria. When our standards evolve, the team updates the markdown and instruction files via pull request—the AI’s behavior changes with the next run, not the next training cycle. For a detailed walkthrough of this approach, see our guide on optimizing GitHub Copilot custom instructions for accessibility.
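A stored-prompt setup along these lines might look like the sketch below: the system prompt lives in a versioned markdown file, and each run composes it with the issue text into a chat-style request. The file path, payload shape, and model name are assumptions for illustration, not GitHub's actual configuration.

```python
from pathlib import Path

def build_analysis_request(prompt_path, issue_title, issue_body,
                           model="gpt-4o"):
    """Compose a chat-style payload from a stored prompt file plus a report.

    Because the prompt is an ordinary file, editing it via pull request
    changes the AI's behavior on the next run; no retraining involved.
    The model name here is a placeholder assumption."""
    system_prompt = Path(prompt_path).read_text(encoding="utf-8")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user",
             "content": f"Title: {issue_title}\n\n{issue_body}"},
        ],
    }
```

This is the practical payoff of stored prompts over fine-tuning: the prompt file is reviewable, diffable, and owned by the whole team.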

The automation works in two steps. First, an Action fires on issue creation and triggers GitHub Copilot to analyze the report. GitHub Copilot populates approximately 80% of the issue’s metadata automatically—over 40 data points including issue type, user segment, original source, affected components, and enough context to understand the user’s experience. The remaining 20% requires manual input from the team member. GitHub Copilot then posts a comment on the issue containing:

  • A summary of the problem and user impact
  • Suggested WCAG success criteria for potential violations
  • Severity level (sev1 through sev4, where sev1 is critical)
  • Impacted user groups (screen reader users, keyboard users, low vision users, etc.)
  • Recommended team assignment (design, engineering, or both)
  • A checklist of low-barrier accessibility tests so the submitter can verify the issue

Then a second Action fires on that comment, parses the response, applies labels based on the severity GitHub Copilot assigned, updates the issue’s status on the project board, and assigns it to the submitter for review.
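The parse-and-label step in that second Action could be approximated as follows. The "Severity:" line format is an assumption about the comment structure for illustration, not the real internal format.

```python
import re

def parse_severity(comment_body):
    """Extract a severity marker (sev1 through sev4) from an analysis
    comment. Assumes the comment contains a line like 'Severity: sev2';
    the actual comment format is internal and may differ."""
    match = re.search(r"Severity:\s*(sev[1-4])", comment_body, re.IGNORECASE)
    return match.group(1).lower() if match else None

def labels_for(comment_body):
    """Map the parsed severity to the labels the second Action would apply.
    The label names here are hypothetical."""
    labels = ["accessibility"]
    sev = parse_severity(comment_body)
    if sev:
        labels.append(sev)
        if sev == "sev1":
            labels.append("critical")
    return labels
```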

If GitHub Copilot’s analysis seems off, anyone can flag it by opening an issue describing what it got wrong and what it should have said—feeding directly into our continuous improvement process.

A left-to-right flowchart where a newly created issue triggers Action 1, which feeds the report along with custom instructions and WCAG documentation into Copilot Analysis. Copilot posts a comment with its findings, then Action 2 parses that comment and branches into four parallel outcomes: applying labels, applying metadata, adding to the project board, and assigning the submitter.

3. Submitter review

Before we act on GitHub Copilot’s recommendations, two layers of review happen—starting with the issue submitter.

The submitter attempts to replicate the problem the user reported. The checklist GitHub Copilot provides in its comment guides our community managers, support agents, and sales reps through expert-level testing procedures—no accessibility expertise required. Each item includes plain-language explanations, step-by-step instructions, and links to tools and documentation.

Example questions include:

  • Can you navigate the page using only a keyboard? Press “Tab” to move through interactive elements. Can you reach all buttons, links, and form fields? Can you see where your focus is at all times?
  • Do images have descriptive alt text? Right-click an image and select “Inspect” to view the markup. Does the alt attribute describe the image’s purpose, or is it a generic file name?
  • Are interactive elements clearly labeled? Using a screen reader, navigate to a button or link. Is its purpose announced clearly? Alternatively, review the accessibility tree in your browser’s developer tools to inspect how elements are exposed to assistive technologies.

If the submitter can replicate the problem, they mark the issue as reviewed, which triggers the next GitHub Action. If they can’t reproduce it, they reach out to the user for more details. Once new information arrives, the submitter can re-run the GitHub Copilot analysis—either by manually triggering the Action from the Actions tab or by removing and re-adding the relevant label to kick it off automatically. AI provides the draft, but humans provide the verification.

A left-to-right flowchart where the submitter receives the issue with Copilot’s checklist, attempts to replicate the problem, and reaches a decision. If replicable, the issue is marked as reviewed and moves to the accessibility team. If not replicable, the submitter contacts the user for more details. When new information arrives, the submitter re-runs Copilot analysis, which loops back to the replication step.

4. Accessibility team review

Once the submitter marks the issue as reviewed, a GitHub Action updates its status on the workflow project board and adds it to a separate accessibility first responder board. This alerts the accessibility team—engineers, designers, champions, testing vendors, and managers—that GitHub Copilot’s analysis is ready for their review.

The team validates GitHub Copilot’s analysis—checking the severity level, WCAG mapping, and category labels—and corrects anything the AI got wrong. When there’s a discrepancy, we assume the human is correct. We log these corrections and use them to refine the prompt files, improving future accuracy.

Once validated, the team determines the resolution approach:

  • Documentation or settings update: Provide the solution directly to the user.
  • Code fix by the accessibility team: Create a pull request directly.
  • Service team needed: Assign the issue to the appropriate service team and track it through resolution.

With a path forward set, the team marks the issue as triaged. An Action then reassigns it to the submitter, who communicates the plan to the user—letting them know what’s being done and what to expect.

A left-to-right flowchart where a reviewed issue triggers an Action that updates the project board and adds it to the first responder board. The accessibility team validates Copilot’s analysis, logs any corrections, then determines a resolution: provide documentation, create a code fix, or assign to a service team. All three paths converge at marking the issue as triaged, which triggers an Action that reassigns it to the submitter to communicate the plan to the user.

5. Linking to audits

As part of the review process, the team connects user and customer feedback to our formal accessibility audit system.

Roughly 75–80% of the time, reported issues correspond to something we already know about from internal audits. Instead of creating duplicates, we find the existing internal audit issue and add a customer-reported label. This lets us prioritize based on real-world impact—a sev2 issue might technically be less critical than a sev1, but if multiple users are reporting it, we bump up its priority.

If the feedback reveals something new, we create a new audit issue and link it to the tracking issue.
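That report-count-driven bumping can be expressed as a small ranking function. The exact bump rule below is an illustrative assumption, not GitHub's actual policy.

```python
def effective_priority(severity, report_count):
    """Rank issues so widely reported lower-severity barriers can outrank
    a single-report critical one. Lower return values mean higher
    priority; the bump rule is a simplified, assumed policy."""
    base = {"sev1": 1, "sev2": 2, "sev3": 3, "sev4": 4}[severity]
    # Each user report beyond the first raises priority by one level,
    # capped so nothing outranks sev1.
    bump = min(report_count - 1, base - 1)
    return base - bump
```

Under this rule, a sev2 with three independent reports ranks alongside a freshly filed sev1, which captures the "prioritize by real-world impact" idea.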

A left-to-right flowchart where the team checks whether an existing audit issue covers the reported problem. If one exists, they link it and add a customer-reported label. If not, they create a new audit issue and link it. Both paths converge at updating priority based on real-world impact.

6. Closing the loop

This is the most critical step for trust. Users who take the time to report accessibility barriers deserve to know their feedback led to action.

Once a resolution path is set, the submitter reaches out to the original user to let them know the plan—what’s being fixed, and what to expect. When the fix ships, the submitter follows up again and asks the user to test it. Because most issues originate from the community discussion board, we post confirmations there for everyone to see.

If the user confirms the fix works, we close the tracking issue. If the fix doesn’t fully address the problem, the submitter gathers more details and the process loops back to the accessibility team review. We don’t close issues until the user confirms the fix works for them.

A left-to-right flowchart where the submitter communicates the resolution plan to the user and monitors until the fix ships. The user is asked to test the fix. If it works, the issue is closed. If it doesn’t, the submitter gathers more details and the process loops back to the accessibility team review.

7. Continuous improvement

The workflow doesn’t end when an issue closes—it feeds back into itself.

When submitters or accessibility team members spot inaccuracies in GitHub Copilot’s output, they open a new issue requesting a review of the results. Every GitHub Copilot analysis comment includes a link to create this issue at the bottom, so the feedback loop is built into the workflow itself. The team reviews the inaccuracy, and the correction becomes a pull request to the custom instruction and prompt files described earlier.

We also automate the integration of new accessibility guidance. A separate GitHub Action scans our internal accessibility guide repository weekly and incorporates changes into GitHub Copilot’s custom instructions automatically.

The goal isn’t perfection—it’s continuous improvement. Each quarter, we review accuracy metrics and refine our instructions. These reviews feed into quarterly and fiscal year reports that track resolution times, WCAG failure patterns, and feedback volume trends—giving leadership visibility into both progress and persistent gaps. The system gets smarter over time, and now we have the data to show it.

A left-to-right flowchart with two parallel loops. In the first, an inaccuracy is spotted, a review issue is opened, the team creates a pull request to update the prompt files, and the changes merge to improve future analyses. In the second, a weekly Action scans the accessibility guide repository and auto-updates Copilot's custom instructions. Both loops feed into quarterly reviews that produce fiscal year reports tracking resolution times, WCAG failure patterns, and feedback volume trends.

Impact in numbers

A year ago, nearly half of accessibility feedback sat unresolved for over 300 days. Today, that backlog isn’t just smaller—it’s gone. And the improvements don’t stop there.

  • 89% of issues now close within 90 days (up from 21%)
  • 62% reduction in average resolution time (118 days → 45 days)
  • 70% reduction in manual administrative time
  • 1,150% increase in issues resolved within 30 days (4 → 50 year-over-year)
  • 50% reduction in critical sev1 issues
  • 100% of issues closed within 60 days in our most recent quarter

We track this through automated weekly and quarterly reports generated by GitHub Actions—surfacing which WCAG criteria fail most often and how resolution times trend over time.
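A report generator of this kind might aggregate along these lines; the issue-record shape is an assumption for illustration.

```python
from collections import Counter
from datetime import date

def report_metrics(issues):
    """Aggregate closed issues into two headline metrics: the most-failed
    WCAG criteria and the median resolution time in days. Each issue is
    a dict with 'wcag', 'opened', and 'closed' keys (an assumed shape)."""
    failures = Counter()
    durations = []
    for issue in issues:
        failures.update(issue["wcag"])
        durations.append((issue["closed"] - issue["opened"]).days)
    durations.sort()
    median = durations[len(durations) // 2] if durations else 0
    return failures.most_common(3), median
```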

Beyond the numbers

A user named James emailed us to report that the GitHub Copilot CLI was inaccessible. Decorative formatting created noise for screen readers, and interactive elements were impossible to navigate.

A team member created a tracking issue. Within moments, GitHub Copilot analyzed the report—mapping James’s description to specific technical concepts, linking to internal documentation, and providing reproduction steps so the submitter could experience the product exactly as James did.

With that context, the team member realized our engineering team had already shipped accessible CLI updates earlier in the year—James simply wasn’t aware.

They replied immediately. His response? “Thanks for pointing out the --screen-reader mode, which I think will help massively.”

Because the AI workflow identified the problem correctly, we turned a frustration into a resolution in hours.

But the most rewarding result isn’t the speed—it’s the feedback from users. Not just that we responded, but that the fixes actually worked for them:

  • “Huge thanks to the team for updating the contributions graph in the high contrast theme. The addition of borders around the grid edges is a small but meaningful improvement. Keep it up!”
  • “Let’s say you want to create several labels for your GitHub-powered workflow: bug, enhancement, dependency updates… But what if you are blind? Before you had only hex codes randomly thrown at you… now it’s fixed, and those colors have meaningful English names. Well done, GitHub!”
  • “This may not be very professional but I literally just screamed! This fix has actually made my day… Before this I was getting my wife to manage the GitHub issues but now I can actually navigate them by myself! It means a lot that I can now be a bit more independent so thank you again.”

That independence is the point. Every workflow, every automation, every review—it all exists so moments like these are the expectation, not the exception.

The bigger picture

Stories like these remind us why the foundation matters. Design annotations, code scanners, accessibility champions, and testing with people with disabilities—these aren’t replaced by AI. They are what make AI-assisted workflows effective. Without that human foundation, AI is just a faster way to miss the point.

We’re still learning, and the system is still evolving. But every piece of feedback teaches us something, and that knowledge now flows continuously back to our team, our users, and the tools we build. 

If you maintain a repository—whether it’s a massive enterprise project or a weekend open-source library—you can build this kind of system today. Start small. Create an issue template for accessibility. Add a .github/copilot-instructions.md file with your team’s accessibility standards. Let AI handle the triage and formatting so your team can focus on what really matters: writing more inclusive code.

And if you hit an accessibility barrier while using GitHub, please share your feedback. It won’t disappear into a backlog. We’re listening—and now we have the system to follow through.

How we rebuilt the search architecture for high availability in GitHub Enterprise Server
https://github.blog/engineering/architecture-optimization/how-we-rebuilt-the-search-architecture-for-high-availability-in-github-enterprise-server/
Tue, 03 Mar 2026 18:45:09 +0000

Here's how we made the search experience better, faster, and more resilient for GHES customers.

The post How we rebuilt the search architecture for high availability in GitHub Enterprise Server appeared first on The GitHub Blog.


So much of what you interact with on GitHub depends on search—obviously the search bars and filtering experiences like the GitHub Issues page, but it’s also at the core of the releases page, the projects page, the counts for issues and pull requests, and more. Given that search is such a core part of the GitHub platform, we’ve spent the last year making it even more durable. That means less time spent managing GitHub Enterprise Server, and more time working on what your customers care most about.

In recent years, GitHub Enterprise Server administrators had to be especially careful with search indexes, the special database tables optimized for searching. If they didn’t follow maintenance or upgrade steps in exactly the right order, search indexes could become damaged and need repair, or they might get locked and cause problems during upgrades. Quick context if you’re not familiar with High Availability (HA) setups: they’re designed to keep GitHub Enterprise Server running smoothly even if part of the system fails. You have a primary node that handles all the writes and traffic, and replica nodes that stay in sync and can take over if needed.

Diagram labeled 'HA Architecture' with two boxes: 'Primary Node' and 'Replica Node.' Across both of them there exists an 'Elasticsearch Cluster' with a nested box on each node labeled 'ES Instance.' A pink arrow points from the Primary Node’s ES Instance to the Replica Node’s ES Instance, indicating replication or failover in a high-availability setup.

Much of this difficulty comes from how previous versions of Elasticsearch, our search database of choice, were integrated. HA GitHub Enterprise Server installations use a leader/follower pattern. The leader (primary server) receives all the writes, updates, and traffic. Followers (replicas) are designed to be read-only. This pattern is deeply ingrained into all of the operations of GitHub Enterprise Server.

This is where Elasticsearch started running into issues. Since Elasticsearch couldn’t natively support a separate primary node and replica node, GitHub engineering had to create a single Elasticsearch cluster spanning the primary and replica nodes. This made replicating data straightforward and gave some performance benefits, since each node could locally handle search requests.

Diagram showing 'Primary Node' and 'Replica Node' as part of an 'Elasticsearch Cluster.' The Primary Node contains 'Primary Shard 1,' and the Replica Node contains 'Primary Shard 2.' A pink arrow points from an empty shard slot on the 'Primary Node' to Shard 2, representing the unwanted move of a primary shard to the 'Replica Node.'

Unfortunately, the problems of clustering across servers eventually began to outweigh the benefits. For example, at any point Elasticsearch could move a primary shard (responsible for receiving/validating writes) to a replica. If that replica was then taken down for maintenance, GitHub Enterprise Server could end up in a locked state. The replica would wait for Elasticsearch to be healthy before starting up, but Elasticsearch couldn’t become healthy until the replica rejoined.

For a number of GitHub Enterprise Server releases, engineers at GitHub tried to make this mode more stable. We implemented checks to ensure Elasticsearch was in a healthy state, as well as other processes to try and correct drifting states. We went as far as attempting to build a “search mirroring” system that would allow us to move away from the clustered mode. But database replication is incredibly challenging, and these efforts struggled to deliver the consistency we needed.

What changed?

After years of work, we’re now able to use Elasticsearch’s Cross Cluster Replication (CCR) feature to support HA GitHub Enterprise Server deployments.

“But David,” you say, “That’s replication between clusters. How does that help here?” 

I’m so glad you asked. With this mode, we’re moving to several single-node Elasticsearch clusters: each GitHub Enterprise Server node now operates as its own independent, single-node Elasticsearch cluster.

Diagram showing two boxes labeled 'Primary Node' and 'Replica Node.' Each box contains a dashed rectangle labeled 'Elasticsearch Instance / Cluster.' A double-headed pink arrow labeled 'Replicate Index Data (CCR)' connects the two boxes, illustrating bidirectional data replication between the primary and replica Elasticsearch clusters.

CCR lets us share the index data between nodes in a way that is carefully controlled and natively supported by Elasticsearch. It copies data once it’s been persisted to the Lucene segments (Elasticsearch’s underlying data store). This ensures we’re replicating data that has been durably persisted within the Elasticsearch cluster.

In other words, now that Elasticsearch supports a leader/follower pattern, GitHub Enterprise Server administrators will no longer be left in a state where critical data winds up on read-only nodes.

Under the hood

Elasticsearch has an auto-follow API, but it only applies to indexes created after the policy exists. GitHub Enterprise Server HA installations already have a long-lived set of indexes, so we need a bootstrap step that attaches followers to existing indexes, then enables auto-follow for anything created in the future.

Here’s a sample of what that workflow looks like:

function bootstrap_ccr(primary, replica):
  # Fetch the current indexes on each node
  primary_indexes = list_indexes(primary)
  replica_indexes = list_indexes(replica)

  # Filter out the system indexes
  managed = filter(primary_indexes, is_managed_ghe_index)

  # For indexes without follower counterparts, initialize
  #   the follower relationship
  for index in managed:
    if index not in replica_indexes:
      ensure_follower_index(replica, leader=primary, index=index)
    else:
      ensure_following(replica, leader=primary, index=index)

  # Finally, set up auto-follow patterns so new indexes
  #   are automatically followed
  ensure_auto_follow_policy(
    replica,
    leader=primary,
    patterns=[managed_index_patterns],
    exclude=[system_index_patterns]
  )

This is just one of the new workflows we’ve created to enable CCR in GitHub Enterprise Server. We’ve needed to engineer custom workflows for failover, index deletion, and upgrades. Elasticsearch only handles the document replication, and we’re responsible for the rest of the index’s lifecycle. 
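Under the hood, steps like ensure_follower_index and ensure_auto_follow_policy map onto Elasticsearch's CCR REST APIs (`PUT /<follower_index>/_ccr/follow` and `PUT /_ccr/auto_follow/<policy>`). Here is a sketch of how those requests might be built; the policy name and the remote-cluster alias are assumptions for illustration.

```python
import json

def follow_request(index, remote_cluster="primary"):
    """Build the CCR request that makes an existing index a follower
    (Elasticsearch's PUT /<follower_index>/_ccr/follow API)."""
    body = {"remote_cluster": remote_cluster, "leader_index": index}
    return ("PUT", f"/{index}/_ccr/follow", json.dumps(body))

def auto_follow_request(patterns, excludes, remote_cluster="primary"):
    """Build the auto-follow policy request so indexes created later are
    followed automatically (PUT /_ccr/auto_follow/<policy>). The policy
    name 'ghe-managed' is hypothetical."""
    body = {
        "remote_cluster": remote_cluster,
        "leader_index_patterns": patterns,
        "leader_index_exclusion_patterns": excludes,
    }
    return ("PUT", "/_ccr/auto_follow/ghe-managed", json.dumps(body))
```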

How to get started with CCR mode 

To get started using the new CCR mode, reach out to [email protected] and let them know you’d like to use the new HA mode for GitHub Enterprise Server. They’ll set up your organization so that you can download the required license.

Once you’ve downloaded your new license, you’ll need to set `ghe-config app.elasticsearch.ccr true`. With that finished, you can run a `config-apply` or an upgrade on your cluster to move to 3.19.1, the first release to support this new architecture.

When your GitHub Enterprise Server restarts, Elasticsearch will migrate your installation to use the new replication method. This will consolidate all the data onto the primary nodes, break clustering across nodes, and restart replication using CCR. This update may take some time depending on the size of your GitHub Enterprise Server instance.

While the new HA method is optional for now, we’ll be making it our default over the next two years. We want to ensure there’s ample time for GitHub Enterprise administrators to get their feedback in, so now is the time to try it out. 

We’re excited for you to start using the new HA mode for a more seamless experience managing GitHub Enterprise Server. 

Want to get the most out of search on your High Availability GitHub Enterprise Server deployment? Reach out to support to get set up with our new search architecture!

GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it
https://github.blog/developer-skills/application-development/github-issues-search-now-supports-nested-queries-and-boolean-operators-heres-how-we-rebuilt-it/
Tue, 13 May 2025 16:00:00 +0000

Plus, considerations in updating one of GitHub's oldest and most heavily used features.

The post GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it appeared first on The GitHub Blog.


Originally, Issues search was limited to a simple, flat query structure. But with advanced search syntax, you can now construct searches using logical AND/OR operators and nested parentheses, pinpointing the exact set of issues you care about.

Building this feature presented significant challenges: ensuring backward compatibility with existing searches, maintaining performance under high query volume, and crafting a user-friendly experience for nested searches. We’re excited to take you behind the scenes to share how we took this long-requested feature from idea to production.

Here’s what you can do with the new syntax and how it works behind the scenes

Issues search now supports building queries with logical AND/OR operators across all fields, with the ability to nest query terms. For example, `is:issue state:open author:rileybroughten (type:Bug OR type:Epic)` finds all issues that are open AND were authored by rileybroughten AND are either of type bug or epic.

Screenshot of an Issues search query involving the logical OR operator.

How did we get here?

Previously, as mentioned, Issues search only supported a flat list of query fields and terms, which were implicitly joined by a logical AND. For example, the query assignee:@me label:support new-project translated to “give me all issues that are assigned to me AND have the label support AND contain the text new-project.”

But the developer community has been asking for more flexibility in issue search, repeatedly, for nearly a decade now. They wanted to be able to find all issues that had either the label support or the label question, using the query label:support OR label:question. So, we shipped an enhancement towards this request in 2021, when we enabled an OR style search using a comma-separated list of values.

However, they still wanted the flexibility to search this way across all issue fields, and not just the labels field. So we got to work. 

Technical architecture and implementation

The architecture of the Issues search system (and the changes needed to build this feature).

From an architectural perspective, we swapped out the existing search module for Issues (IssuesQuery), with a new search module (ConditionalIssuesQuery), that was capable of handling nested queries while continuing to support existing query formats.

This involved rewriting IssuesQuery, the search module that parsed query strings and mapped them into Elasticsearch queries.

Search Architecture

To build a new search module, we first needed to understand the existing search module, and how a single search query flowed through the system. At a high level, when a user performs a search, there are three stages in its execution:

  1. Parse: Breaking the user input string into a structure that is easier to process (like a list or a tree)
  2. Query: Transforming the parsed structure into an Elasticsearch query document, and making a query against Elasticsearch.
  3. Normalize: Mapping the results obtained from Elasticsearch (JSON) into Ruby objects for easy access and pruning the results to remove records that had since been removed from the database.

Each stage presented its own challenges, which we’ll explore in more detail below. The Normalize step remained unchanged during the re-write, so we won’t dive into that one.
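To make the flow concrete, here is a minimal sketch of how three such stages might be wired together. All class and method names here are illustrative, not GitHub’s actual code.

```ruby
# Illustrative sketch of the three-stage search flow described above.
# Every name here is hypothetical; GitHub's real modules differ.
class SearchPipeline
  def initialize(parser:, query_builder:, normalizer:)
    @parser = parser
    @query_builder = query_builder
    @normalizer = normalizer
  end

  def run(user_input)
    ast     = @parser.call(user_input)  # 1. Parse into an intermediate structure
    results = @query_builder.call(ast)  # 2. Build and execute the Elasticsearch query
    @normalizer.call(results)           # 3. Normalize results into Ruby objects
  end
end

# Usage with stand-in lambdas for each stage:
pipeline = SearchPipeline.new(
  parser:        ->(s)   { s.split },
  query_builder: ->(ast) { { query: ast } },
  normalizer:    ->(res) { res[:query] }
)
pipeline.run("is:issue codespaces") # => ["is:issue", "codespaces"]
```

The value of this shape is that each stage can be swapped independently, which is exactly what made it possible to replace only the parse and query stages while leaving normalize untouched.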

Parse stage

The user input string (the search phrase) is first parsed into an intermediate structure. The search phrase could include:

  • Query terms: The relevant words the user is trying to find more information about (ex: “models”)
  • Search filters: These restrict the set of returned search documents based on some criteria (ex: “assignee:Deborah-Digges”)

Example search phrases:

  • Find all issues assigned to me that contain the word “codespaces”:
    • is:issue assignee:@me codespaces
  • Find all issues with the label documentation that are assigned to me:
    • assignee:@me label:documentation

The old parsing method: flat list

When only flat, simple queries were supported, it was sufficient to parse the user’s search string into a list of search terms and filters, which would then be passed along to the next stage of the search process.
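A flat parse like that can be sketched in a few lines of Ruby. This is a hypothetical illustration, not the real parser: it simply partitions whitespace-separated tokens into filters and free-text terms.

```ruby
# Hypothetical sketch of flat-list parsing: split a search phrase into
# filter terms ("field:value") and free-text query terms.
def parse_flat(search_phrase)
  filters, terms = search_phrase.split.partition { |token| token.include?(":") }
  {
    filters: filters.map { |f| f.split(":", 2) }.to_h,
    terms:   terms
  }
end

parse_flat("assignee:@me label:support new-project")
# => { filters: { "assignee" => "@me", "label" => "support" }, terms: ["new-project"] }
```

A flat structure like this has no way to express grouping or OR, which is why nested queries required replacing it entirely.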

The new parsing method: abstract syntax tree

As nested queries may be recursive, parsing the search string into a list was no longer sufficient. We changed this component to parse the user’s search string into an Abstract Syntax Tree (AST) using the parsing library parslet.

We defined a grammar (a PEG or Parsing Expression Grammar) to represent the structure of a search string. The grammar supports both the existing query syntax and the new nested query syntax, to allow for backward compatibility.

A simplified grammar for a boolean expression described by a PEG grammar for the parslet parser is shown below:

require "parslet"

class Parser < Parslet::Parser
  rule(:space)  { match[" "].repeat(1) }
  rule(:space?) { space.maybe }

  rule(:lparen) { str("(") >> space? }
  rule(:rparen) { str(")") >> space? }

  rule(:and_operator) { str("and") >> space? }
  rule(:or_operator)  { str("or")  >> space? }

  rule(:var) { str("var") >> match["0-9"].repeat(1).as(:var) >> space? }

  # The primary rule deals with parentheses.
  rule(:primary) { lparen >> or_operation >> rparen | var }

  # Note that following rules are both right-recursive.
  rule(:and_operation) { 
    (primary.as(:left) >> and_operator >> 
      and_operation.as(:right)).as(:and) | 
    primary }
    
  rule(:or_operation)  { 
    (and_operation.as(:left) >> or_operator >> 
      or_operation.as(:right)).as(:or) | 
    and_operation }

  # We start at the lowest precedence rule.
  root(:or_operation)
end

For example, this user search string:
is:issue AND (author:deborah-digges OR author:monalisa ) 
would be parsed into the following AST:

{
  "root": {
    "and": {
      "left": {
        "filter_term": {
          "attribute": "is",
          "value": [
            {
              "filter_value": "issue"
            }
          ]
        }
      },
      "right": {
        "or": {
          "left": {
            "filter_term": {
              "attribute": "author",
              "value": [
                {
                  "filter_value": "deborah-digges"
                }
              ]
            }
          },
          "right": {
            "filter_term": {
              "attribute": "author",
              "value": [
                {
                  "filter_value": "monalisa"
                }
              ]
            }
          }
        }
      }
    }
  }
}

Query

Once the query is parsed into an intermediate structure, the next steps are to:

  1. Transform this intermediate structure into a query document that Elasticsearch understands
  2. Execute the query against Elasticsearch to obtain results

Executing the query in step 2 remained the same between the old and new systems, so let’s only go over the differences in building the query document below.

The old query generation: linear mapping of filter terms using filter classes

Each filter term (Ex: label:documentation) has a class that knows how to convert it into a snippet of an Elasticsearch query document. During query document generation, the correct class for each filter term is invoked to construct the overall query document.
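As a hedged illustration of that pattern (the class name and Elasticsearch field are made up for this example, not GitHub’s real filter classes), a filter class might look like:

```ruby
# Illustrative filter class: converts one filter term into a snippet of an
# Elasticsearch query document. Names here are hypothetical.
class LabelFilter
  def initialize(value)
    @value = value
  end

  # Emits an Elasticsearch "term" clause for the labels field.
  def to_es
    { term: { "labels" => @value } }
  end
end

LabelFilter.new("documentation").to_es
# => { term: { "labels" => "documentation" } }
```

With one such class per filter type, generating the overall document is just a matter of invoking the right class for each term and combining the snippets.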

The new query generation: recursive AST traversal to generate Elasticsearch bool query

We recursively traversed the AST generated during parsing to build an equivalent Elasticsearch query document. The nested structure and boolean operators map nicely to Elasticsearch’s boolean query, with the AND, OR, and NOT operators mapping to the must, should, and must_not clauses.

We re-used the building blocks for the smaller pieces of query generation to recursively construct a nested query document during the tree traversal.
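A simplified sketch of that recursive traversal, using a hypothetical node shape rather than the real AST classes, might look like this:

```ruby
# Hedged sketch of recursive AST-to-query translation: AND maps to "must",
# OR to "should". The node shape here is invented for illustration.
def to_es_query(node)
  if node.key?(:and)
    { bool: { must: [to_es_query(node[:and][:left]), to_es_query(node[:and][:right])] } }
  elsif node.key?(:or)
    { bool: { should: [to_es_query(node[:or][:left]), to_es_query(node[:or][:right])] } }
  else
    # Leaf: a single filter term like author:monalisa
    { term: { node[:attribute] => node[:value] } }
  end
end

ast = {
  and: {
    left:  { attribute: "is", value: "issue" },
    right: {
      or: {
        left:  { attribute: "author", value: "deborah-digges" },
        right: { attribute: "author", value: "monalisa" }
      }
    }
  }
}
to_es_query(ast)
```

Because each recursive call returns a complete sub-document, nesting to any depth falls out of the recursion for free.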

Continuing from the example in the parsing stage, the AST would be transformed into a query document that looked like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "bool": {
                  "must": {
                    "prefix": {
                      "_index": "issues"
                    }
                  }
                }
              },
              {
                "bool": {
                  "should": {
                    "terms": {
                      "author_id": [
                        "<DEBORAH_DIGGES_AUTHOR_ID>",
                        "<MONALISA_AUTHOR_ID>"
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
    // SOME TERMS OMITTED FOR BREVITY
  }
}

With this new query document, we execute a search against Elasticsearch. This search now supports logical AND/OR operators and parentheses to search for issues in a more fine-grained manner.

Considerations

Issues is one of the oldest and most heavily used features on GitHub. Changing core functionality like Issues search, a feature with an average of nearly 2,000 queries per second (QPS)—that’s almost 160M queries a day!—presented a number of challenges to overcome.

Ensuring backward compatibility

Issue searches are often bookmarked, shared among users, and linked in documents, making them important artifacts for developers and teams. Therefore, we wanted to introduce this new capability for nested search queries without breaking existing queries for users. 

We validated the new search system before it even reached users by:

  • Testing extensively: We ran our new search module against all unit and integration tests for the existing search module. To ensure that the GraphQL and REST API contracts remained unchanged, we ran the tests for the search endpoint both with the feature flag for the new search system enabled and disabled.
  • Validating correctness in production with dark-shipping: For 1% of issue searches, we ran the user’s search against both the existing and new search systems in a background job, and logged differences in responses. By analyzing these differences we were able to fix bugs and missed edge cases before they reached our users.
    • We weren’t sure at the outset how to define “differences,” but we settled on “number of results” for the first iteration. Our reasoning: if the same search, run against both systems within a second of each other, returned a different number of results, a user would likely be surprised by the new system’s behavior.
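The dark-ship pattern boils down to something like the following sketch (the method names and sampling mechanism are hypothetical, and the real comparison ran in a background job):

```ruby
# Minimal sketch of dark-shipping: serve results from the existing system,
# and for a small sample of searches also run the new system and log any
# difference in result counts. All names are illustrative.
def dark_ship(query, old_search:, new_search:, logger:, sample_rate: 0.01)
  old_results = old_search.call(query)

  if rand < sample_rate
    new_results = new_search.call(query)
    if old_results.size != new_results.size
      logger.call(query: query, old: old_results.size, new: new_results.size)
    end
  end

  old_results # users always see the existing system's results
end
```

Because users only ever see the old system’s output, bugs surfaced by the comparison cost nothing but a log line.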

Preventing performance degradation

We expected more complex nested queries to use more resources on the backend than simpler queries, so we needed to establish a realistic baseline for nested queries, while ensuring no regression in the performance of existing, simpler ones.

For 1% of Issue searches, we ran equivalent queries against both the existing and the new search systems. We used scientist, GitHub’s open source Ruby library, for carefully refactoring critical paths, to compare the performance of equivalent queries to ensure that there was no regression.

Preserving user experience

We didn’t want users to have a worse experience than before just because more complex searches were possible.

We collaborated closely with product and design teams to ensure usability didn’t decrease as we added this feature by:

  • Limiting the number of nested levels in a query to five. From customer interviews, we found this to be a sweet spot for both utility and usability.
  • Providing helpful UI/UX cues: We highlight the AND/OR keywords in search queries, and provide users with the same auto-complete feature for filter terms in the UI that they were accustomed to for simple flat queries.

Minimizing risk to existing users

For a feature that is used by millions of users a day, we needed to be intentional about rolling it out in a way that minimized risk to users.

We built confidence in our system by:

  • Limiting blast radius: To gradually build confidence, we only integrated the new system in the GraphQL API and the Issues tab for a repository in the UI to start. This gave us time to collect, respond to, and incorporate feedback without risking a degraded experience for all consumers. Once we were happy with its performance, we rolled it out to the Issues dashboard and the REST API.
  • Testing internally and with trusted partners: As with every feature we build at GitHub, we tested this feature internally for the entire period of its development by shipping it to our own team during the early days, and then gradually rolling it out to all GitHub employees. We then shipped it to trusted partners to gather initial user feedback.

And there you have it, that’s how we built, validated, and shipped the new and improved Issues search!

Feedback

Want to try out this exciting new functionality? Head to our docs to learn about how to use boolean operators and parentheses to search for the issues you care about!

If you have any feedback for this feature, please drop us a note on our community discussions.

Acknowledgements

Special thanks to AJ Schuster, Riley Broughten, Stephanie Goldstein, Eric Jorgensen, Mike Melanson, and Laura Lindeman for the feedback on several iterations of this blog post!

The post GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it appeared first on The GitHub Blog.

]]>
87730
Introducing sub-issues: Enhancing issue management on GitHub https://github.blog/engineering/architecture-optimization/introducing-sub-issues-enhancing-issue-management-on-github/ Fri, 11 Apr 2025 22:33:30 +0000 https://github.blog/?p=86361 Explore the iterative development journey of GitHub's sub-issues feature. Learn how we leveraged sub-issues to build and refine sub-issues, breaking down larger tasks into smaller, manageable ones.

The post Introducing sub-issues: Enhancing issue management on GitHub appeared first on The GitHub Blog.

]]>

Recently we launched sub-issues, a feature designed to tackle complex issue management scenarios. This blog post delves into the journey of building sub-issues, what we learned along the way, how we implemented the feature, and the benefits of using sub-issues to build sub-issues.

What are sub-issues?

Sub-issues are a way to break a larger issue into smaller, more manageable tasks. With this feature, you can now create hierarchical lists within a single issue, making it easier to track progress and dependencies. By providing a clear structure, sub-issues help teams stay organized and focused on their goals.

For example, I often realize that a batch of work requires multiple steps, like implementing code in different repositories. Breaking this task into discrete sub-issues makes it easier to track progress and more clearly define the work I need to do. In practice we’ve noticed this helps keep linked PRs more concise and easier to review.

A screenshot showing a list of sub-issues on GitHub.

A brief history

Issues have long been at the heart of project management on GitHub. From tracking bugs to planning feature development, issues provide a flexible and collaborative way for teams to organize their work. Over time, we’ve enriched this foundation with tools like labels, milestones, and task lists, all to make project management even more intuitive and powerful.

One of the key challenges we set out to solve was how to better represent and manage hierarchical tasks within issues. As projects grow in complexity, breaking down work into smaller, actionable steps becomes essential. We want to empower users to seamlessly manage these nested relationships while maintaining the simplicity and clarity GitHub is known for.

Our journey toward sub-issues began with a fundamental goal: to create a system that integrates deeply into the GitHub Issues experience, enabling users to visually and functionally organize their work without adding unnecessary complexity. Achieving this required careful design and technical innovation.

Building sub-issues

To build sub-issues, we designed a new hierarchical structure for tasks rather than modifying the existing task list functionality, introducing the ability to nest tasks within tasks. This required updates to our data models and rendering logic to support nested sub-issues.

From a data modeling perspective, the sub-issues table stores the relationships between parent and child issues. For example, if Issue X is a parent of Issue Y, the sub-issues table would store this link, ensuring the hierarchical relationship is maintained.

In addition, we roll up sub-issue completion information into a sub-issue list table. This allows us to performantly get progress without having to traverse through a list of sub-issues. For instance, when Issue Y is completed, the system automatically updates the progress of Issue X, eliminating the need to manually check the status of all sub-issues.
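The rollup idea can be sketched with a small in-memory model. This is purely illustrative; the real implementation lives in MySQL tables, and all names here are invented:

```ruby
# Hedged sketch of progress rollup: instead of traversing every child issue
# to compute progress, keep per-parent counters updated as sub-issues change.
class SubIssueRollup
  def initialize
    # parent_id => { total:, completed: }
    @rollups = Hash.new { |h, k| h[k] = { total: 0, completed: 0 } }
  end

  def add_sub_issue(parent_id)
    @rollups[parent_id][:total] += 1
  end

  def complete_sub_issue(parent_id)
    @rollups[parent_id][:completed] += 1
  end

  # O(1) progress read -- no traversal of the issue hierarchy required.
  def progress(parent_id)
    r = @rollups[parent_id]
    return 0.0 if r[:total].zero?
    r[:completed].to_f / r[:total]
  end
end
```

Updating counters at write time trades a little bookkeeping for cheap reads, which matters when a parent issue’s progress is rendered on every page view.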

We wanted a straightforward representation of sub-issues as relationships in MySQL. This approach provided several benefits, including easier support for sub-issues in environments like GitHub Enterprise Server and GitHub Enterprise Cloud with data residency.

We exposed sub-issues through GraphQL endpoints, which let us build upon the new Issues experience and leverage newly crafted list-view components. This approach provided some benefits, including more efficient data fetching and enhanced flexibility in how issue data is queried and displayed. Overall, we could move faster because we reused existing components and leveraged new components that would be used in multiple features. This was all made possible by building sub-issues in the React ecosystem.

We also focused on providing intuitive controls for creating, editing, and managing sub-issues. To this end, we worked closely with accessibility designers and GitHub’s shared components team that built the list view that powers sub-issues.

Our goal was to make it as easy as possible for users to break down their tasks without disrupting their workflow.

Using sub-issues in practice

Dogfooding is a best practice at GitHub and it’s how we build GitHub! We used sub-issues extensively within our own teams throughout the company to manage complex projects and track progress. Having a discrete area to manage our issue hierarchy resulted in a simpler, more performant experience. Through this hands-on experience, we identified areas for improvement and ensured that the feature met our high standards.

Our teams found that sub-issues significantly improved their ability to manage large projects. By breaking down tasks into smaller, actionable items, they maintained better visibility and control over their work. The hierarchical structure also made it easier to identify dependencies and ensure nothing fell through the cracks.

Gathering early feedback

Building sub-issues was a team effort. Feedback from our beta testers was instrumental in shaping the final product and ensuring it met the needs of our community. For example, understanding how much metadata to display in the sub-issue list was crucial. We initially started with only issue titles, but eventually added the issue number and repository name, if the issue was from another repository.

Building features at GitHub makes it really easy to improve our own features as we go. It was really cool to start breaking down the sub-issues work using sub-issues. This allowed us to experience the feature firsthand and identify any pain points or areas for improvement. For example, the has:sub-issues-progress and has:parent-issue filters evolved from early discussions around filtering syntax. This hands-on approach ensured that we delivered a polished and user-friendly product.

These lessons have been invaluable in not only improving sub-issues, but also in shaping our approach to future feature development. By involving users early and actively using our own features, we can continue to build products that truly meet the needs of our community. These practices will be important to our development process going forward, ensuring that we deliver high-quality, user-centric solutions.

Call to action

Sub-issues are designed to help you break down complex tasks into manageable pieces, providing clarity and structure to your workflows. Whether you’re tracking dependencies, managing progress, or organizing cross-repository work, sub-issues offer a powerful way to stay on top of your projects.

We’d love for you to try sub-issues and see how they can improve your workflow. Your feedback is invaluable in helping us refine and enhance this feature. Join the conversation in our community discussion to share your thoughts, experiences, and suggestions.

Thank you for being an integral part of the GitHub community. Together, we’re shaping the future of collaborative development!

The post Introducing sub-issues: Enhancing issue management on GitHub appeared first on The GitHub Blog.

]]>
86361
Breaking down CPU speed: How utilization impacts performance https://github.blog/engineering/architecture-optimization/breaking-down-cpu-speed-how-utilization-impacts-performance/ Mon, 25 Nov 2024 17:00:20 +0000 https://github.blog/?p=81409 The Performance Engineering team at GitHub assessed how CPU performance degrades as utilization increases and how this relates to capacity.

The post Breaking down CPU speed: How utilization impacts performance appeared first on The GitHub Blog.

]]>

Introduction ⛵

The GitHub Performance Engineering team regularly conducts experiments to observe how our systems perform under varying load conditions. A consistent pattern in these experiments is the significant impact of CPU utilization on system performance: as CPU utilization rises, latency increases. Addressing this challenge allows us to maintain performance levels while reducing the need for additional machines, keeping our infrastructure efficient.

Although we recognized the correlation between higher CPU utilization and increased latency, we saw an opportunity to explore the specific thresholds and impacts at various stages in greater detail. With a diverse set of instance types powered by different CPU families, we focused on understanding the unique performance characteristics of each CPU model. This deeper insight empowered us to make smarter, data-driven decisions, enabling us to provision our infrastructure with greater efficiency and confidence.

With these goals in mind, we embarked on a new journey of exploration and experimentation to uncover these insights.

Experiment setup 🧰

Collecting accurate data for this type of experiment was no easy feat. We needed to gather data from workloads that were as close to our production as possible, while also capturing how the system behaves under different phases of load. Since CPU usage patterns vary across workloads, we focused primarily on our flagship workloads. However, increasing the load could introduce small performance discrepancies, so our goal was to minimize disruption for our users.

Fortunately, a year ago, the Performance Engineering team developed an environment designed to meet these requirements, codenamed Large Unicorn Collider (LUC). This environment operates within a small portion of our Kubernetes clusters, mirroring the same architecture and configuration as our flagship workloads. It also has the flexibility to be hosted on dedicated machines, preventing interference from or with other workloads. Typically, the LUC environment remains idle, but when needed, we can direct a small, adjustable amount of traffic towards it. Activating or deactivating this traffic takes only seconds, allowing us to react quickly if performance concerns arise.

To accurately assess the impact of CPU utilization, we first established a baseline by sending moderate production traffic to a LUC Kubernetes pod hosted on one of its dedicated machines. This provided us with a benchmark for comparison. Importantly, the number of requests handled by the LUC pods remained constant throughout the experiment, ensuring consistent CPU load over time.

Once the baseline was set, we gradually increased CPU utilization using a tool called “stress,” which artificially occupies a specified number of CPU cores by running random processing tasks. Each instance type has a different number of CPU cores, so we adjusted the steps accordingly. However, the common factor across all instances was the total CPU utilization.

Note: It’s important to recognize that this is not a direct 1:1 comparison to the load generated by actual production workloads. The stress tool continuously runs mathematical operations, while our production workloads involve I/O operations and interrupts, which place different demands on system resources. Nevertheless, this approach still offers valuable insights into how our CPUs perform under load.

With the environment set up and our plan in place, we proceeded to collect as much data as possible to analyze the impact.

Results 📃

With our experiment setup finalized, let’s examine the data we gathered. As previously mentioned, we repeated the process across different instance types. Each instance type showed unique behavior and varying thresholds where performance started to decline.

As anticipated, CPU time increased for all instance types as CPU utilization rose. The graph below illustrates the CPU time per request as CPU utilization increases.

CPU time per request vs CPU utilization

The latency differences between instance types are expected due to the variations in CPU models. Focusing on the percentage increase in latency may provide more meaningful insights.

Latency percentage increase vs CPU utilization

In both graphs, one line stands out by deviating more than the others. We’ll examine this case in detail shortly.

Turbo Boost effect

An interesting observation is how CPU frequency changes as utilization increases, which can be attributed to Intel’s Turbo Boost Technology. Since all the instances we used are equipped with Intel CPUs, the impact of Turbo Boost is noticeable across all of them. In the graph below, you can see how the CPU frequency decreases as the CPU utilization increases. The red arrows are showing the CPU utilization level.

CPU Cores Frequency

When CPU utilization remains at lower levels (around 30% or below), we benefit from increased core frequencies, leading to faster CPU times and, consequently, lower overall latency. However, as the demand for more CPU cores rises and utilization increases, we are likely to reach the CPU’s thermal and power limits, causing frequencies to decrease. In essence, lower CPU utilization results in better performance, while higher utilization leads to a decline in performance. For instance, a workload running on a specific node with approximately 30% CPU utilization will report faster response times compared to the same workload on the same VM when CPU utilization exceeds 50%.

Hyper-Threading

Variations in CPU frequency are not the only factors influencing performance changes. All our nodes have Hyper-Threading enabled, an Intel technology that allows a single physical CPU core to operate as two virtual cores. Although there is only one physical core, the Linux kernel recognizes it as two virtual CPU cores. The kernel attempts to distribute the CPU load across these cores, aiming to keep only one hardware thread (virtual core) busy per physical core. This approach is effective until we reach a certain level of CPU utilization. Beyond this threshold, we cannot fully utilize both virtual CPU cores, resulting in reduced performance compared to normal operation.

Finding the “Golden Ratio” of CPU utilization

Underutilized nodes lead to wasted resources, power, and space in our data centers, while nodes that are excessively utilized also create inefficiencies. As noted, higher CPU utilization results in decreased performance, which can give a misleading impression that additional resources are necessary, resulting in a cycle of over-provisioning. This issue is particularly pronounced with blocking workloads that do not follow an asynchronous model. As CPU performance deteriorates, each process can manage fewer tasks per second, making existing capacity inadequate. To achieve the optimal balance—the “Golden Ratio” of CPU utilization—we must identify a threshold where CPU utilization is sufficiently high to ensure efficiency without significantly impairing performance. Striving to keep our nodes near this threshold will enable us to utilize our current hardware more effectively alongside our existing software.

Since we already have experimental data demonstrating how CPU time increases with rising utilization, we can develop a mathematical model to identify this threshold. First, we need to determine what percentage of CPU time degradation is acceptable for our specific use case. This may depend on user expectations or performance Service Level Agreements (SLAs). Once we establish this threshold, it will help us select a level of CPU utilization that remains within acceptable limits.

We can plot the CPU utilization vs. CPU time (latency) and find the point where:

  • CPU utilization is high enough to avoid resource underutilization.
  • CPU time degradation does not exceed your acceptable limit.
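Given such measurements, the threshold selection is straightforward. The data points below are invented for illustration; real numbers would come from the experiment:

```ruby
# Illustrative "Golden Ratio" selection: given measured
# (utilization %, latency increase %) points, pick the highest utilization
# whose degradation stays within the acceptable limit. Data is made up.
def golden_ratio(measurements, max_degradation_pct:)
  measurements
    .select { |_util, degradation| degradation <= max_degradation_pct }
    .keys
    .max
end

# Hypothetical measurements: utilization % => latency increase %
measurements = { 30 => 5, 40 => 12, 50 => 22, 60 => 38, 70 => 55, 80 => 80 }
golden_ratio(measurements, max_degradation_pct: 40) # => 60
```

Tightening the acceptable degradation lowers the target utilization; loosening it raises the target, trading latency for hardware efficiency.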

A specific example derived from the data above can be illustrated in the following graph.

Percentage Increase in P50 Latency vs CPU Utilization

In this example, we aim to achieve less than 40% CPU time degradation, which would correspond to a CPU utilization of 61% on the specific instance.

Outlier case

As previously mentioned, there was a specific instance that displayed some outlying data points. Our experiment confirmed an already recognized issue where certain instances were not achieving their advertised maximum Turbo Boost CPU frequency. Instead, we observed steady CPU frequencies that fell below the maximum advertised value under low CPU utilization. In the example below, you can see an instance from a CPU family that advertises Turbo Boost frequencies above 3 GHz, but it is only reporting a maximum CPU frequency of 2.8 GHz.

CPU cores frequency

This issue turned out to be caused by a disabled CPU C-state, which prevented the CPU cores from halting even when they were not in use. As a result, these cores were perceived as “busy” by the turbo driver, limiting our ability to take advantage of Turbo Boost benefits with higher CPU frequencies. By enabling the C-state and allowing for optimization and power reduction during idle mode, we observed the expected Turbo Boost behavior. This change had an immediate impact on the CPU time spent by our test workloads. The images below illustrate the prompt changes in CPU frequencies and latency reported following the C-state adjustment.

CPU cores frequency

P50 CPU time on a request

Upon re-evaluating the percentage change in CPU time, we now observe similar behavior across all instances.

Percentage Increase in P50 Latency vs CPU Utilization

Wrap-up

As we anticipated many of these insights, our objective was to validate our theories using data from our complex system. We confirmed that performance degrades as CPU utilization increases across different CPU families, and by identifying optimal CPU utilization thresholds, we can achieve a better balance between performance and efficiency, ensuring that our infrastructure remains both cost-effective and high performing. Going forward, these insights will inform our resource provisioning strategies and help us maximize the effectiveness of our hardware investments.


Thank you for sticking with us until the end!! A special shout-out to @adrmike, @schlubbi, @terrorobe, the @github/compute-platform and finally the @github/performance-engineering team for their invaluable assistance throughout these experiments, data analysis, and for reviewing the content for accuracy and consistency. ❤️

The post Breaking down CPU speed: How utilization impacts performance appeared first on The GitHub Blog.

]]>
81409
How we improved push processing on GitHub https://github.blog/engineering/architecture-optimization/how-we-improved-push-processing-on-github/ Tue, 11 Jun 2024 17:00:33 +0000 https://github.blog/?p=78344 Pushing code to GitHub is one of the most fundamental interactions that developers have with GitHub every day. Read how we have significantly improved the ability of our monolith to correctly and fully process pushes from our users.

The post How we improved push processing on GitHub appeared first on The GitHub Blog.


What happens when you push to GitHub? The answer, “My repository gets my changes” or maybe, “The refs on my remote get updated” is pretty much right—and that is a really important thing that happens, but there’s a whole lot more that goes on after that. To name a few examples:

  • Pull requests are synchronized, meaning the diff and commits in your pull request reflect your newly pushed changes.
  • Push webhooks are dispatched.
  • Workflows are triggered.
  • If you push an app configuration file (like for Dependabot or GitHub Actions), the app is automatically installed on your repository.
  • GitHub Pages are published.
  • Codespaces configuration is updated.
  • And much, much more.

Those are some pretty important things, and this is just a sample of what goes on for every push. In fact, in the GitHub monolith, there are over 60 different pieces of logic owned by 20 different services that run in direct response to a push. That’s actually really cool—we should be doing a bunch of interesting things when code gets pushed to GitHub. In some sense, that’s a big part of what GitHub is, the place you push code[1] and then cool stuff happens.

The problem

What’s not so cool is that, up until recently, all of these things were the responsibility of a single, enormous background job. Whenever GitHub’s Ruby on Rails monolith was notified of a push, it enqueued a massive job called the RepositoryPushJob. This job was the home for all push processing logic, and its size and complexity led to many problems. The job triggered one thing after another in a long, sequential series of steps, kind of like this:

A flow chart from left to right. The first step is "Push". Then second step is "GitHub Rails monolith". The third step is a large block labeled "RepositoryPushJob" which contains a sequence of steps inside it. These steps are: "Apps callback", "Codespaces callback", "PRs callback", followed by a callout that there are 50+ tasks after this one. The final step is "processing task n".

There are a few things wrong with this picture. Let’s highlight some of them:

  • This job was huge, and hard to retry. The size of the RepositoryPushJob made it very difficult for different push processing tasks to be retried correctly. On a retry, all the logic of the job is repeated from the beginning, which is not always appropriate for individual tasks. For example:
    • Writing Push records to the database can be retried liberally on errors and reattempted any amount of time after the push, and will gracefully handle duplicate data.
    • Sending push webhooks, on the other hand, is much more time-sensitive and should not be reattempted too long after the push has occurred. It is also not desirable to dispatch multiples of the same webhook.
  • Most of these steps were never retried at all. The above difficulties with conflicting retry concerns ultimately led to retries of RepositoryPushJob being avoided in most cases. To prevent one step from killing the entire job, however, much of the push handling logic was wrapped in code catching any and all errors. This lack of retries led to issues where crucial pieces of push processing never occurred.
  • Tight coupling of many concerns created a huge blast radius for problems. While most of the dozens of tasks in this job rescued all errors, for historical reasons, a few pieces of work in the beginning of the job did not. This meant that all of the later steps had an implicit dependency on the initial parts of the job. As more concerns are combined within the same job, the likelihood of errors impacting the entire job increases.
    • For example, writing data to our Pushes MySQL cluster occurred in the beginning of the RepositoryPushJob. This meant that everything occurring after that had an implicit dependency on this cluster. This structure led to incidents where errors from this database cluster meant that user pull requests were not synchronized, even though pull requests have no explicit need to connect to this cluster.
  • A super long sequential process is bad for latency. It’s fine for the first few steps, but what about the things that happen last? They have to wait for every other piece of logic to run before they get a chance. In some cases, this structure led to a second or more of unnecessary latency for user-facing push tasks, including pull request synchronization.

What did we do about this?

At a high level, we took this very long sequential process and decoupled it into many isolated, parallel processes. We used the following approach:

  • We added a new Kafka topic that we publish an event to for each push.
  • We examined each of the many push processing tasks and grouped them by owning service and/or logical relationships (for example, order dependency, retry-ability).
  • For each coherent group of tasks, we placed them into a new background job with a clear owner and appropriate retry configuration.
  • Finally, we configured these jobs to be enqueued for each publish of the new Kafka event.
    • To do this, we used an internal system at GitHub that facilitates enqueueing background jobs in response to Kafka events via independent consumers.
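The fan-out step can be sketched in plain Ruby. Everything below is a hypothetical illustration, not GitHub’s internal code: the event shape, the enqueue API, and the retry counts are invented. It only shows the key property of the new design, that one push event maps onto many independently enqueued, independently retried jobs.

```ruby
# Hypothetical sketch (not GitHub's internal code): fan one push event
# out to many independent jobs. Job names echo the new pipeline, but the
# enqueue interface and retry counts here are invented for illustration.
PUSH_JOBS = [
  { job: "PullRequestsOnPushJob",  retries: 5 },
  { job: "PagesOnPushJob",         retries: 3 },
  { job: "NotificationsOnPushJob", retries: 1 },
].freeze

def fan_out(push_event, enqueue:)
  PUSH_JOBS.map do |spec|
    begin
      # Each enqueue is independent: a failure here cannot block the
      # other jobs, and each job carries its own retry policy.
      enqueue.call(spec[:job], push_event, retries: spec[:retries])
      [spec[:job], :enqueued]
    rescue StandardError
      [spec[:job], :failed]
    end
  end
end
```

In the real system, the independent Kafka consumers perform this enqueue per job group; the point of the sketch is simply that each job is isolated from its siblings.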

We had to make investments in several areas to support this architecture, including:

  • Creating a reliable publisher for our Kafka event, one that would retry until broker acknowledgement.
  • Setting up a dedicated pool of job workers to handle the new job queues we’d need for this level of fan out.
  • Improving observability to ensure we could carefully monitor the flow of push events throughout this pipeline and detect any bottlenecks or problems.
  • Devising a system for consistent per-event feature flagging, to ensure that we could gradually roll out (and roll back if needed) the new system without risk of data loss or double processing of events between the old and new pipelines.
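The “reliable publisher” idea, retrying until the broker acknowledges the write, can be sketched generically. The publish callable, attempt limit, and backoff numbers below are assumptions for illustration; real code would use a Kafka client’s delivery acknowledgement in place of the lambda.

```ruby
# Sketch of publish-with-retry: keep trying until the broker acks,
# backing off exponentially between attempts. The publish callable is a
# stand-in for a real Kafka client delivery call.
def publish_reliably(event, publish:, max_attempts: 5, sleep_fn: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    publish.call(event) # assumed to raise unless the broker acknowledges
    attempts
  rescue StandardError
    raise if attempts >= max_attempts
    sleep_fn.call(0.1 * (2**attempts)) # exponential backoff before retrying
    retry
  end
end
```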

Now, things look like this:

A flow chart from left to right. The first step is "Push". The second step is "GitHub Rails monolith". The connection between the second and third step is labeled "Push event". The third step is "Kafka". The fourth step is "Kafka to job queue bridge". Then, there are 16 parallel connectors branching out from the fourth step to the next steps. These are: "AppsOnPushJob", "CodespacesOnPushJob", "PullRequestsOnPushJob", "MarketPlaceOnPushJob", "ProjectStackOnPushJob", "SecurityCenterPushJob", "IssuesOnPushJob", "PagesOnPushJob", "MaintenanceOnPushJob", "NotificationsOnPushJob", "RepositoriesOnPushJob", "ReleasesOnPushJob", "ActionsOnPushJob", "WikisOnPushJob", "SearchOnPushJob", and "Push job n".

A push triggers a Kafka event, which is fanned out via independent consumers to many isolated jobs that can process the event without worrying about any other consumers.

Results

  • A smaller blast radius for problems.
    • This can be clearly seen from the diagram. Previously, an issue with a single step in the very long push handling process could impact everything downstream. Now, issues with one piece of push handling logic don’t have the ability to take down much else.
    • Structurally, this decreases the risk of dependencies. For example, there are around 300 million push processing operations executed per day in the new pipeline that previously implicitly depended on the Pushes MySQL cluster and now have no such dependency, simply as a product of being moved into isolated processes.
    • Decoupling also means better ownership. In splitting up these jobs, we distributed ownership of the push processing code from one owning team to 15+ more appropriate service owners. New push functionality in our monolith can be added and iterated on by the owning team without unintentional impact to other teams.
  • Pushes are processed with lower latency.
    • By running these jobs in parallel, no push processing task has to wait for others to complete. This means better latency for just about everything that happens on push.
    • For example, we can see a notable decrease in pull request sync time:

    A line chart depicting the p50 pull request sync time since head ref update over several previous months. The line hovers around 3 seconds from September 2023 through November 2023. In December 2023, it drops to around 2 seconds.

  • Improved observability.

    • By breaking things up into smaller jobs, we get a much clearer picture of what’s going on with each job. This lets us set up observability and monitoring that is much more finely scoped than anything we had before, and helps us to quickly pinpoint any problems with pushes.
  • Pushes are more reliably processed.
    • By reducing the size and complexity of the jobs that process pushes, we are able to retry more things than in the previous system. Each job can have retry configuration that’s appropriate for its own small set of concerns, without having to worry about re-executing other, unrelated logic on retry.
    • If we define a “fully processed” push as a push event for which all the desired operations are completed with no failures, the old RepositoryPushJob system fully processed about 99.897% of pushes.
    • In the worst-case estimate, the new pipeline fully processes 99.999% of pushes.

Conclusion

Pushing code to GitHub is one of the most fundamental interactions that developers have with GitHub every day. It’s important that our system handles everyone’s pushes reliably and efficiently, and over the past several months we have significantly improved the ability of our monolith to correctly and fully process pushes from our users. Through platform level investments like this one, we strive to make GitHub the home for all developers (and their many pushes!) far into the future.

Notes


  1. People push to GitHub a whole lot, as you can imagine. In the last 30 days, we’ve received around 500 million pushes from 8.5 million users. 

How we organize and get things done with SERVICEOWNERS https://github.blog/engineering/architecture-optimization/how-we-organize-and-get-things-done-with-serviceowners/ Tue, 19 Dec 2023 18:00:08 +0000 https://github.blog/?p=75858 Take CODEOWNERS and GitHub teams to the next level. Learn about how GitHub engineering solves the age old problem of who owns what.

The post How we organize and get things done with SERVICEOWNERS appeared first on The GitHub Blog.


GitHub’s primary codebase is a large Ruby on Rails monolith with over 4.2 million lines of code across roughly 30,000 files. As the platform has grown over the years, we have come to realize that we need a new way to organize and think about the systems we run. Our traditional approach to organizing Hubbers and code has been through CODEOWNERS in combination with GitHub teams, organizations, issues, and repositories. However, as GitHub’s user base continues to grow, we have discovered we need a new layer of abstraction. This is where SERVICEOWNERS comes in.

Service-oriented architecture is not new, but we do not often talk about how large engineering teams organize around services, especially in a hybrid monolith/services architecture. GitHub engineering determined that we were missing a layer in between CODEOWNERS, how we group humans, and the work to be done. Injecting a “service” layer between groups of functionality and the people maintaining them opens up a number of interesting possibilities.

One side-effect of adopting SERVICEOWNERS was realizing that the “ownership” model does not quite express how we work. Given our place in the open source ecosystem and what we value as a company, we thought the “maintainer” model more accurately describes the relationships between services and teams. So, no team “owns” anything, but instead they “maintain” various services.

Consistent, company-wide grouping

Achieving consistency in how we map running code, both inside and outside our monolith, to humans has had a number of positive outcomes. It promotes a shared lexicon and, therefore, a shared understanding. The durability of services reduces the disruption of team reorganization. As priorities shift, services stay the same and require only minimal updates to YAML metadata to remain accurate. Consistency in our service definitions also allows us to centralize information about the services we run into a service catalog. Our service catalog is the one-stop shop for up-to-date information on the services that power GitHub. Within the catalog, Hubbers of all stripes can find information on, for example, how a service is performing vis-à-vis an SLO. Each service in the service catalog also has a number of scorecards as part of our fundamentals program.

With clearly defined services collected in a service catalog, we can easily visualize the relationships between services. We can identify dependencies and map how information flows through the platform. All this information improves the onboarding experience for new engineers, too, as the relationships between services define the platform architecture—without having to rely on out-of-date docs or hand-waving to explain our architecture.

The service catalog also has the huge benefit of centralizing information about which teams maintain which services, how to reach an on-call engineer, and what expectations the service has set in terms of support SLAs. Clean lines of communication to maintainers of running services has been a huge help in reducing our incident remediation time. Incident commanders know how to contact on-call engineers because they can find it in the service catalog. All of this is only possible thanks to SERVICEOWNERS.

How it works

a diagram depicting the connections between the GitHub monolith, the service catalog, repositories, and teams.

The SERVICEOWNERS file

A SERVICEOWNERS file lives next to the CODEOWNERS file within our monolith. Like a traditional CODEOWNERS file, SERVICEOWNERS consists of a series of glob patterns (for example, app/api/integration*), directory names (for example, config/access_control/) and filenames (for example, app/api/grants.rb) followed by a service name (for example :apps maps to the team github/apps). Our CI enforces rules like:

  • There can be no duplicate patterns/directories/files in SERVICEOWNERS.
  • All new files added to the github/github repository must have a service owner.
  • All patterns/directories/files must match at least one existing file.
  • Files matched by multiple glob patterns must be disambiguated by a file or directory definition.
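Putting the description above together, a SERVICEOWNERS file might look something like this (the paths and service names here are illustrative, not GitHub’s actual entries):

```
# glob pattern                service
app/api/integration*          :apps
# directory
config/access_control/        :authorization
# individual file
app/api/grants.rb             :apps
```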

The service-mappings.yaml file

A service-mappings file defines how services referenced in the SERVICEOWNERS file relate to services in the service catalog and GitHub teams. This configuration can define a service’s product manager, engineering manager, repository, and chat information. Service mappings can also define information about a service’s various classifications, such as its “tier” rating, with zero being critical to the GitHub platform and three being experimental/non-critical.
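An illustrative service-mappings.yaml entry with the kinds of metadata described above might look like this; the exact key names are assumptions, not the real schema:

```yaml
# Hypothetical service-mappings.yaml entry (key names are assumptions)
apps:
  name: Apps
  maintainer: github/apps
  engineering_manager: "@an-em"
  product_manager: "@a-pm"
  chat: "#apps-team"
  tier: 1   # 0 = platform-critical ... 3 = experimental/non-critical
```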

The serviceowners gem

We have developed a Ruby gem we integrate with our Rails app that combines data from the SERVICEOWNERS and service-mappings files to produce several types of output. The serviceowners gem generates our CODEOWNERS file. So, instead of manually updating CODEOWNERS, changing which team or teams maintain a service is a one-line YAML change. The serviceowners gem also has an executable which allows engineers to query information about the maintainer of a file or which files a service maintains.
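The file-to-service lookup the gem’s executable performs can be approximated with Ruby’s built-in glob matching. This is a simplified sketch, not the gem’s actual implementation; the precedence (exact file and directory entries beating glob patterns) follows the CI disambiguation rule described above, and the pattern table is invented.

```ruby
# Simplified sketch of resolving a file to its service. Exact file and
# directory entries take precedence over glob patterns, matching the CI
# disambiguation rule. The pattern table below is hypothetical.
PATTERNS = {
  "app/api/integration*"   => :apps,  # glob pattern
  "config/access_control/" => :authz, # directory
  "app/api/grants.rb"      => :apps,  # individual file
}.freeze

def service_for(path)
  # 1. Exact file entry wins outright.
  exact = PATTERNS[path]
  return exact if exact
  # 2. Then directory entries (keys ending in "/").
  dir = PATTERNS.find { |pat, _| pat.end_with?("/") && path.start_with?(pat) }
  return dir[1] if dir
  # 3. Finally, glob patterns via File.fnmatch.
  glob = PATTERNS.find { |pat, _| File.fnmatch?(pat, path) }
  glob && glob[1]
end
```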

Because it’s GitHub, there’s of course also a chat-op for that:

me: hubot serviceowners for test/jobs/do_the_thing_with_the_stuff_test.rb

hubot: The file test/jobs/do_the_thing_with_the_stuff_test.rb is part of the github/some_service service and is maintained by the cool-fun-team team who can be reached in #hijinx.

The ownership.yaml file

The above examples mostly focus on breaking up the monolith into services, but our service catalog can slurp up service information from any repository within the GitHub org that has an ownership.yaml file. Like the service-mappings file, ownership.yaml expresses version-controlled values for various service metadata. This allows the boundaries of a service to span multiple repositories; for example, the GitHub Desktop app can have a component service within the monolith while also having its own standalone artifact from a different repository. Another benefit of the ownership file is that it keeps code changes to the monolith codebase focused primarily on functionality rather than maintainership.
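An illustrative ownership.yaml for a standalone repository might mirror the service-mappings shape; again, the field names here are assumptions:

```yaml
# Hypothetical ownership.yaml in a standalone repository
desktop:
  name: GitHub Desktop
  maintainer: github/desktop
  chat: "#desktop"
  tier: 2
```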

Conclusion

The combination of CODEOWNERS and SERVICEOWNERS has provided serious value for us at GitHub. The tooling we continue to build atop these primitives will serve to make maintaining services clearer and easier. That’s good news for the future of GitHub. It also pairs quite nicely with our open source identity and access management project, entitlements, too. If SERVICEOWNERS sounds like something your organization, open source or corporate alike, would benefit from, let us know on X at @github.

Measuring Git performance with OpenTelemetry https://github.blog/engineering/architecture-optimization/measuring-git-performance-with-opentelemetry/ Mon, 16 Oct 2023 15:28:35 +0000 https://github.blog/?p=74770 Use our new open source Trace2 receiver component and OpenTelemetry to capture and visualize telemetry from your Git commands.

The post Measuring Git performance with OpenTelemetry appeared first on The GitHub Blog.

When I think about large codebases, the repositories for Microsoft Windows and Office are top of mind. When Microsoft began migrating these codebases to Git in 2017, they contained 3.5M files and a full clone was more than 300GB. The scale of that repository was so much bigger than anything that had been tried with Git to date. As a principal software engineer on the Git client team, I knew how painful and frustrating it could be to work in these gigantic repositories, so our team set out to make it easier. Our first task: understanding and improving the performance of Git at scale.

Collecting performance data was an essential part of that effort. Having this kind of performance data helped guide our engineering efforts and let us track our progress, as we improved Git performance and made it easier to work in these very large repositories. That’s why I added the Trace2 feature to core Git in 2019—so that others could do similar analysis of Git performance on their repositories.

Trace2 is an open source performance logging/tracing framework built into Git that emits messages at key points in each command, such as process exit and expensive loops. You can learn more about it here.
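You can see the Trace2 event stream for yourself by pointing the GIT_TRACE2_EVENT environment variable (a real Git setting, available since Git 2.22) at a file; the throwaway repository below is just for demonstration:

```shell
# Capture Trace2 event telemetry (JSON lines) from a single Git command.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
GIT_TRACE2_EVENT="$tmp/trace2.jsonl" git -C "$tmp/repo" status >/dev/null
head -n 2 "$tmp/trace2.jsonl"   # the stream opens with "version" and "start" events
```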

Whether they’re Windows-sized or not, organizations can benefit from understanding the work their engineers do and the types of tools that help them succeed. Today, we see enterprise customers creating ever-larger monorepos and placing heavy demands on Git to perform at scale. At the same time, users expect Git to remain interactive and responsive no matter the size or shape of the repository. So it’s more important than ever to have performance monitoring tools to help us understand how Git is performing for them.

Unfortunately, it’s not sufficient to just run Git in a debugger/profiler on test data or a simulated load. Meaningful results come from seeing how Git performs on real monorepos under daily use by real users, both in isolation and in aggregate. Making sense of the data and finding insights also requires tools to visualize the results.

Trace2 writes very detailed performance data, but it may be a little difficult to consume without some help. So today, we’re introducing an open source tool to post-process the data and move it into the OpenTelemetry ecosystem. With OpenTelemetry visualization tools, you’ll be able to easily study your Git performance data.

This tool can be configured by users to identify where data shapes cause performance deterioration, to notice problematic trends early on, and to realize where Git’s own performance needs to be improved. Whether you’re simply interested in your own statistics or are part of an engineering systems/developer experience team, we believe in democratizing the power of this kind of analysis. Here’s how to use it.

Open sourcing trace2receiver

The emerging standard for analyzing software’s performance at scale is OpenTelemetry.

An article from the Cloud Native Computing Foundation (CNCF) gives an overview of the OpenTelemetry technologies.

The centerpiece in their model is a collector service daemon. You can customize it with various receiver, pipeline, and exporter component modules to suit your needs. You can also collect data from different telemetry sources or in different formats, normalize and/or filter it, and then send it to different data sinks for analysis and visualization.

We wanted a way to let users capture their Trace2 data and send it to an OpenTelemetry-compatible data sink, so we created an open source trace2receiver receiver component that you can add to your custom collector. With this new receiver component your collector can listen for Trace2 data from Git commands, translate it into a common format (such as OTLP), and relay it to a local or cloud-based visualization tool.

Want to jump in and build and run your own custom collector using trace2receiver? See the project documentation for all the tool installation and platform-specific setup you’ll need to do.
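A custom collector wires the receiver into a pipeline through the standard OpenTelemetry Collector configuration format. The sketch below is illustrative only: the trace2receiver’s actual configuration keys are in the project documentation, and the socket path and exporter endpoint are assumptions.

```yaml
# Illustrative collector config. The trace2receiver's real settings are
# documented in the project; the socket path and endpoint are assumptions.
receivers:
  trace2receiver:
    socket: /var/run/trace2.socket   # where Git commands send Trace2 data
exporters:
  otlp:
    endpoint: my-backend:4317        # any OTLP-compatible data sink
service:
  pipelines:
    traces:
      receivers: [trace2receiver]
      exporters: [otlp]
```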

Open sourcing a sample collector

If you want a very quick start, I’ve created an open source sample collector that uses the trace2receiver component. It contains a ready-to-go sample collector, complete with basic configuration and platform installers. This will let you kick the tires with minimal effort. Just plug in your favorite data sink/cloud provider, build it, run one of the platform installers, and start collecting data. See the README for more details.

See trace2receiver in action

We can use trace2receiver to collect Git telemetry data for two orthogonal purposes. First, we can dive into an individual command from start to finish and see where time is spent. This is especially important when a Git command spawns a (possibly nested) series of child commands, which OpenTelemetry calls a “distributed trace.” Second, we can aggregate data over time from different users and machines, compute summary metrics such as average command times, and get a high level picture of how Git is performing at scale, plus perceived user frustration and opportunities for improvement. We’ll look at each of these cases in the following sections.

Distributed tracing

Let’s start with distributed tracing. The CNCF defines distributed tracing as a way to track a request through a distributed system. That’s a broader definition than we need here, but the concepts are the same: We want to track the flow within an individual command and/or the flow across a series of nested Git commands.

I previously wrote about Trace2, how it works, and how we can use it to interactively study the performance of an individual command, like git status, or a series of nested commands, like git push which might spawn six or seven helper commands behind the scenes. When Trace2 was set to log directly to the console, we could watch in real-time as commands were executed and see where the time was spent.

This is essentially equivalent to an OpenTelemetry distributed trace. What the trace2receiver does for us here is map the Trace2 event stream into a series of OpenTelemetry “spans” with the proper parent-child relationships. The transformed data can then be forwarded to a visualization tool or database with a compatible OpenTelemetry exporter.

Let’s see what happens when we do this on an instance of the torvalds/linux.git repository.

Git fetch example

The following image shows data for a git fetch command using a local instance of the SigNoz observability tools. My custom collector contained a pipeline to route data from the trace2receiver component to an exporter component that sent data to SigNoz.

Summary graph of git fetch in SigNoz

I configured my custom collector to send data to two exporters, so we can see the same data in an Application Insights database. This is possible and simple because of the open standards supported by OpenTelemetry.

Summary graph of git fetch in App Insights

Both examples show a distributed trace of git fetch. Notice the duration of the top-level command and of each of the various helper commands that were spawned by Git.

This graph tells me that, for most of the time, git fetch was waiting on git-remote-https (the grandchild) to receive the newest objects. It also suggests that the repository is well-structured, since git maintenance runs very quickly. We likely can’t do very much to improve this particular command invocation, since it seems fairly optimal already.

As a long-time Git expert, I can further infer that the received packfile was small, because Git unpacked it (and wrote individual loose objects) rather than writing and indexing a new packfile. Even if your team doesn’t yet have the domain experts to draw detailed insights from the collected data, these insights could help support engineers or outside Git experts to better interpret your environment.

In this example, the custom collector was set to report dl:summary level telemetry, so we only see elapsed process times for each command. In the next example, we’ll crank up the verbosity to see what else we can learn.

Git status example

The following images show data for git status in SigNoz. In the first image, the FSMonitor and Untracked Cache features are turned off. In the second image, I’ve turned on FSMonitor. In the third, I’ve turned on both. Let’s see how they affect Git performance. Note that the horizontal axis is different in each image. We can see how command times decreased from 970 to 204 to 40 ms as these features were turned on.

In these graphs, the detail level was set to dl:verbose, so the collector also sent region-level details.

The git:status span (row) shows the total command time. The region(...) spans show the major regions and nested sub-regions within the command. Basically, this gives us a fuller accounting of where time was spent in the computation.

Verbose graph of git status in SigNoz fsm=0 uc=0

The total command time here was 970 ms.

In the above image, about half of the time (429 ms) was spent in region(progress,refresh_index) (and the sub-regions within it) scanning the worktree for recently modified files. This information will be used later in region(status,worktree) to compute the set of modified tracked files.

The other half (489 ms) was in region(status,untracked) where Git scans the worktree for the existence of untracked files.

As we can see, on large repositories, these scans are very expensive.

Verbose graph of git status in SigNoz fsm=1 uc=0

In the above image, FSMonitor was enabled. The total command time here was reduced from 970 to 204 ms.

With FSMonitor, Git doesn’t need to scan the disk to identify the recently modified files; it can just ask the FSMonitor daemon, since it already knows the answer.

Here we see a new region(fsm_client,query) where Git asks the daemon and a new region(fsmonitor,apply_results) where Git uses the answer to update its in-memory data structures. The original region(progress,refresh_index) is still present, but it doesn’t need to do anything. The time for this phase has been reduced from 429 to just 15 ms.

FSMonitor also helped reduce the time spent in region(status,untracked) from 489 to 173 ms, but it is still expensive. Let’s see what happens when we enable both and let FSMonitor and the untracked cache work together.

Verbose graph of git status in SigNoz fsm=1 uc=1

In the above image, FSMonitor and the Untracked Cache were both turned on. The total command time was reduced to just 40 ms.

This gives the best result for large repositories. In addition to the FSMonitor savings, the time in region(status,untracked) drops from 173 to 12 ms.

This is a massive savings on a very frequently run command.

For more information on FSMonitor and Untracked Cache and an explanation of these major regions, see my earlier FSMonitor article.
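For reference, the two features compared above are enabled per repository with real Git config settings (the throwaway repository below is just for demonstration):

```shell
# Enable the built-in filesystem monitor daemon and the untracked cache.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config core.fsmonitor true
git -C "$tmp/repo" config core.untrackedCache true
git -C "$tmp/repo" config core.fsmonitor   # prints: true
```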

Data aggregation

Looking at individual commands is valuable, but it’s only half the story. Sometimes we need to aggregate data from many command invocations across many users, machines, operating systems, and repositories to understand which commands are important, frequently used, or are causing users frustration.

This analysis can be used to guide future investments. Where is performance trending in the monorepo? How fast is it getting there? Do we need to take preemptive steps to stave off a bigger problem? Is it better to try to speed up a very slow command that is used maybe once a year or to try to shave a few milliseconds off of a command used millions of times a day? We need data to help us answer these questions.

When using Git on large monorepos, users may experience slow commands (or rather, commands that run more slowly than they were expecting). But slowness can be very subjective. So we need to be able to measure the performance that they are seeing, compare it with their peers, and inform the priority of a fix. We also need enough context so that we can investigate it and answer questions like: Was that a regular occurrence or a fluke? Was it a random network problem? Or was it a fetch from a data center on the other side of the planet? Is that slowness to be expected on that class of machine (laptop vs server)? By collecting and aggregating over time, we were able to confidently answer these kinds of questions.

The raw data

Let’s take a look at what the raw telemetry looks like when it gets to a data sink and see what we can learn from the data.

We saw earlier that my custom collector was sending data to both Azure and SigNoz, so we should be able to look at the data in either. Let’s switch gears and use my Azure Application Insights (AppIns) database here. There are many different data sink and visualization tools, so the database schema may vary, but the concepts should carry over.

Earlier, I showed the distributed trace of a git fetch command in the Azure Portal. My custom collector is configured to send telemetry data to an Application Insights (AppIns) database and we can use the Azure Portal to query the data. However, I find the Azure Data Explorer a little easier to use than the portal, so let’s connect Data Explorer to my AppIns database. From Data Explorer, I’ll run my queries and let it automatically pull data from my AppIns database.

show 10 data rows

The above image shows a Kusto query on the data. In the top-left panel I’ve asked for the 10 most-recent commands on any repository with the “demo-linux” nickname (I’ll explain nicknames later in this post). The bottom-left panel shows (a clipped view of) the 10 matching database rows. The panel on the right shows an expanded view of the ninth row.

The AppIns database has a legacy schema that predates OpenTelemetry, so some of the OpenTelemetry fields are mapped into top-level AppIns fields and some are mapped into the customDimensions JSON object/dictionary. Additionally, some types of data records are kept in different database tables. I’m going to gloss over all of that here and point out a few things in the data.

The record in the expanded view shows a git status command. Let’s look at a few rows here. In the top-level fields:

  • The normalized command name is git:status.
  • The command duration was 671 ms. (AppIns tends to use milliseconds.)

In the customDimensions fields:

  • The original command line is shown (as a nested JSON record in "trace2.cmd.argv").
  • The "trace2.machine.arch" and "trace2.machine.os" fields show that it ran on an arm64 mac.
  • The user was running Git version 2.42.0.
  • "trace2.process.data"["status"]["count/changed"] shows that it found 13 modified files in the working directory.

Command frequency example

[Image: Kusto query showing Linux command counts and durations]

The above image shows a Kusto query with command counts and the P80 command duration grouped by repository, operating system, and processor. For example, there were 21 instances of git status on “demo-linux” and 80% of them took less than 0.55 seconds.
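A query of roughly that shape might look like the sketch below. This assumes the classic Application Insights dependencies table and the customDimensions field mappings shown earlier; the exact table and column names may differ in your data sink.

```kusto
dependencies
| where name startswith "git:"
| extend nickname = tostring(todynamic(tostring(customDimensions["trace2.param.set"]))["otel.trace2.nickname"]),
         os   = tostring(customDimensions["trace2.machine.os"]),
         arch = tostring(customDimensions["trace2.machine.arch"])
| summarize Count = count(), P80_ms = percentile(duration, 80)
    by nickname, name, os, arch
| order by nickname asc, Count desc
```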

Grouping status by nickname example

[Image: Kusto query comparing status counts and durations for Chromium vs. Linux]

The above image shows a comparison of git status times between “demo-linux” and my “demo-chromium” clone of chromium/chromium.git.

Without going too deep into Kusto queries or Azure, the above examples are intended to demonstrate how you can focus on different aspects of the available data and motivate you to create your own investigations. The exact layout of the data may vary depending on the data sink that you select and its storage format, but the general techniques shown here can be used to build a better understanding of Git regardless of the details of your setup.

Data partition suggestions

Your custom collector will send all of your Git telemetry data to your data sink. That is a good first step. However, you may want to partition the data by various criteria, rather than reporting composite metrics. As we saw above, the performance of git status on the “demo-linux” repository is not really comparable with the performance on the “demo-chromium” repository, since the Chromium repository and working directory are so much larger than their Linux counterparts. So a single composite P80 value for git:status across all repositories might not be that useful.

Let’s talk about some partitioning strategies to help you get more from the data.

Partition on repo nicknames

Earlier, we used a repo nickname to distinguish between our two demo repositories. We can tell Git to send a nickname with the data for every command and we can use that in our queries.

The way I configured each client machine in the previous example was to:

  1. Tell the collector that otel.trace2.nickname is the name of the Git config key in the collector’s filter.yml file.
  2. Globally set trace2.configParams to tell Git to send all Git config values with the otel.trace2.* prefix to the telemetry stream.
  3. Locally set otel.trace2.nickname to the appropriate nickname (like “demo-linux” or “demo-chromium” in the earlier example) in each working directory.

Telemetry will arrive at the data sink with trace2.param.set["otel.trace2.nickname"] in the metadata. We can then use the nickname to partition our Kusto queries.
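Steps 2 and 3 can be sketched as shell commands. The collector-side filter.yml change from step 1 is not shown, the repo path is a throwaway example, and a scratch global config file is used so nothing permanent is modified.

```shell
# Scratch "global" config so the demo doesn't touch ~/.gitconfig (Git 2.32+).
export GIT_CONFIG_GLOBAL=/tmp/gitconfig-otel
# Step 2: tell Git to send all otel.trace2.* config values with each command.
git config --global trace2.configParams 'otel.trace2.*'
# Step 3: give this particular clone a nickname.
git init -q /tmp/demo-linux
git -C /tmp/demo-linux config otel.trace2.nickname 'demo-linux'
git -C /tmp/demo-linux config --get otel.trace2.nickname   # prints: demo-linux
```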

Partition on other config values

There’s nothing magic about the otel.trace2.* prefix. You can also use existing Git config values or create some custom ones.

For example, you could globally set trace2.configParams to 'otel.trace2.*,core.fsmonitor,core.untrackedcache' and let Git send the repo nickname and whether the FSMonitor and untracked cache features were enabled.

[Image: Kusto query showing additional config values in the telemetry data]

You could also set a global config value to define user cohorts for some A/B testing or a machine type to distinguish laptops from build servers.
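For example, that could look like the commands below. The otel.trace2.cohort and otel.trace2.machinetype key names are invented for illustration; any key matching a prefix listed in trace2.configParams will be sent. A scratch global config file keeps the demo self-contained.

```shell
# Scratch "global" config so the demo doesn't touch ~/.gitconfig (Git 2.32+).
export GIT_CONFIG_GLOBAL=/tmp/gitconfig-cohort
# Send the nickname keys plus two feature flags with every command.
git config --global trace2.configParams 'otel.trace2.*,core.fsmonitor,core.untrackedcache'
# Hypothetical custom keys for A/B cohorts and machine class:
git config --global otel.trace2.cohort 'fsmonitor-beta'
git config --global otel.trace2.machinetype 'laptop'
git config --global --get otel.trace2.cohort   # prints: fsmonitor-beta
```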

These are just a few examples of how you might add fields to the telemetry stream to partition the data and help you better understand Git performance.

Caveats

When exploring your own Git data, it’s important to be aware of several limitations and caveats that may skew your analysis of the performance or behaviors of certain commands. I’ve listed a few common issues below.

Laptops can sleep while Git commands are running

Laptops can go to sleep or hibernate without notice. If a Git command is running when the laptop goes to sleep and finishes after the laptop is resumed, Git will accidentally include the time spent sleeping in the Trace2 event data because Git always reports the current time in each event. So you may see an arbitrary span with an unexpected and very large delay.1

So if you occasionally find a command that runs for several days, see if it started late on a Friday afternoon and finished first thing Monday morning before sounding any alarms.

Git hooks

Git lets you define hooks to be run at various points in the lifespan of a Git command. Hooks are typically shell scripts, usually used to test a pre-condition before allowing a Git command to proceed or to ensure that some system state is updated before the command completes. They do not emit Trace2 telemetry events, so we will not have any visibility into them.

Since Git blocks while the hook is running, the time spent in the hook will be attributed to the process span (and a child span, if enabled).

If a hook shell script runs helper Git commands, those Git child processes will inherit the span context for the parent Git command, so they will appear as immediate children of the outer Git command rather than the missing hook script process. This may help explain where time was spent, but it may cause a little confusion when you try to line things up.

Interactive commands

Some Git commands have a (sometimes unexpected) interactive component:

  1. Commands like git commit will start and wait for your editor to close before continuing.
  2. Commands like git fetch or git push might require a password from the terminal or an interactive credential helper.
  3. Commands like git log or git blame can automatically spawn a pager and may cause the foreground Git command to block on I/O to the pager process or otherwise just block until the pager exits.

In all of these cases, it can look like it took hours for a Git command to complete because it was waiting on you to respond.

Hidden child processes

We can use the dl:process or dl:verbose detail levels to gain insight into hidden hooks, your editor, or other interactive processes.

The trace2receiver creates child(...) spans from Trace2 child_start and child_exit event pairs. These spans capture the time that Git spent waiting for each child process. This works whether the child is a shell script or a helper Git command. In the case of a helper command, there will also be a process span for the Git helper process (that will be slightly shorter because of process startup overhead), but in the case of a shell script, this is usually the only hint that an external process was involved.

[Image: trace graph of a git commit command with child spans]

In the above image we see a git commit command on a repository with a pre-commit hook installed. The child(hook:pre-commit) span shows the time spent waiting for the hook to run. Since Git blocks on the hook, we can infer that the hook itself did something (a sleep) for about five seconds and then ran four helper commands. The process spans for the helper commands appear to be direct children of the git:commit process span rather than of a synthetic shell script process span or of the child span.

From the child(class:editor) span we can also see that an editor was started and it took almost seven seconds for it to appear on the screen and for me to close it. We don’t have any other information about the activity of the editor besides the command line arguments that we used to start it.

Finally, I should mention that when we enable dl:process or dl:verbose detail levels, we will also get some child spans that may not be that helpful. Here the child(class:unknown) span refers to the git maintenance process immediately below it.2

What’s next

Once you have some telemetry data you can:

  1. Create various dashboards to summarize the data and track it over time.
  2. Consider the use of various Git performance features, such as Scalar, Sparse Checkout, Sparse Index, Partial Clone, FSMonitor, and Commit Graph.
  3. Consider adding a Git Bundle Server to your network.
  4. Use git maintenance to keep your repositories healthy and efficient.
  5. Consider enabling parallel checkout on your large repositories.
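As a quick sketch of items 4 and 5, using a throwaway repo and a scratch global config file so nothing permanent is changed:

```shell
# Scratch "global" config so the demo doesn't touch ~/.gitconfig (Git 2.32+).
export GIT_CONFIG_GLOBAL=/tmp/gitconfig-maint
git init -q /tmp/big-repo
# Item 4: register the clone for scheduled background maintenance tasks.
git -C /tmp/big-repo maintenance register
# Item 5: parallel checkout; a value below 1 means one worker per logical core.
git config --global checkout.workers 0
git config --global --get-all maintenance.repo
```

Note that `git maintenance register` only records the repository in the (scratch) global config; `git maintenance start` would additionally schedule the background jobs.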

You might also look at what other large organizations have shared about running Git at scale.

Conclusion

My goal in this article was to help you start collecting Git performance data and to show some examples of how you might use it. Git performance often depends heavily on the data-shape of your repository, so I can’t make a single, sweeping recommendation that will help everyone. (Though trying Scalar is a good place to start.)

But with the new trace2receiver component and an OpenTelemetry custom collector, you should now be able to collect performance data for your repositories and begin to analyze and find your organization’s Git pain points. Let that guide you to making improvements — whether that is upstreaming a new feature into Git, adding a network cache server to reduce latency, or making better use of some of the existing performance features that we’ve created.

The trace2receiver component is open source and covered by the MIT License, so grab the code and try it out.

See the contribution guide for details on how to contribute.

Notes


  1. It is possible on some platforms to detect system suspend/resume events and modify or annotate the telemetry data stream, but the current release of the trace2receiver does not support that. 
  2. The term “unknown” is misleading here, but it is how the child_start event is labeled in the Trace2 data stream. Think of it as “unclassified”. Git tries to classify child processes when it creates them, for example “hook” or “editor”, but some call-sites in Git have not been updated to pass that information down, so they are labeled as unknown. 

The post Measuring Git performance with OpenTelemetry appeared first on The GitHub Blog.

Optimize your GitHub Codespaces costs with upgraded virtual machines https://github.blog/engineering/architecture-optimization/how-github-reduces-costs-with-upgraded-codespaces/ Thu, 31 Aug 2023 15:00:52 +0000 https://github.blog/?p=73937 See how much more you can get out of GitHub Codespaces by taking advantage of the improved processing power and increased headroom in the next generation of virtual machines.

The post Optimize your GitHub Codespaces costs with upgraded virtual machines appeared first on The GitHub Blog.

Since we released GitHub Codespaces in 2021, we’ve made a number of updates aimed at improving usability, controlling cost, and more (for example, free usage for all, one click into templates, and concurrency policies). Now, by adopting new Advanced Micro Devices (AMD)-based virtual machine hosts, GitHub has improved the developer experience and reduced usage costs at the same time: every user gets twice the RAM and approximately 10-30% better CPU performance. These changes enable you to achieve the same (or better) machine performance for half the cost of the previous machine generation.

How this change helps you

In our previous VM generation, memory-intensive workloads often had to overprovision CPUs just to get enough RAM, particularly when running multiple services. For professional developers this was especially frustrating, given the complexity of their development environments and their higher expectations for performance. Now, rather than having to choose between paying a premium for larger developer machines or sacrificing developer experience and productivity, you can get the best of both worlds.

For example, at GitHub we use our own software and services to build GitHub itself. GitHub uses Codespaces to build not only Codespaces, but the entire platform. GitHub has a large Ruby monolith that requires significant CPU and RAM to test, and also sets an extremely high bar for developer experience. In order to operate these environments while maximizing developer happiness, GitHub used the largest virtual machines available in Codespaces.

Once the new machine types were available, GitHub’s internal developer experience (DX) team started by moving a few dev teams with RAM-hungry workflows to machines with half the CPU count, but the same RAM, to test whether they would be sufficient. With very little effort, and nearly zero developer impact, testing showed that developers were just as successful on the smaller machines, and GitHub incurred half the cost. As additional teams tried moving to the fewer-core machines, only one build process turned out to be CPU-architecture dependent. The fix was simple: specify the CPU architecture so that QEMU could emulate it appropriately. No other negative impacts were identified.

Due to the success of the initial trials, we quickly rolled out the changes to more teams. The result? Approximately 50% savings!

Figure 1: Codespaces cost for GitHub during the introduction of the AMD machines

Since we’ve rolled out the AMD machines for GitHub, we’ve seen no problems and had only happy users.

You can do the same in your organization by working with your development teams using GitHub Codespaces to test smaller machines on your existing development environments. All Codespaces virtual machines have been upgraded, so testing is as simple as having some developers try working in a smaller machine than they usually do. In most cases, no other configuration changes are necessary!
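One place where configuration can matter: if a repository already pins a minimum machine size in its dev container configuration, that floor determines the smallest machine developers can pick. A sketch of .devcontainer/devcontainer.json (the values here are illustrative, not recommendations):

```json
{
  "hostRequirements": {
    "cpus": 4,
    "memory": "16gb",
    "storage": "32gb"
  }
}
```

Lowering (or removing) these requirements lets developers choose the smaller machine types during testing.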

Once you have found the sweet spot for performance and experience, you can set a policy within your organization to restrict machine types, ensuring cost controls while providing environments that allow your developers to do their best work.

Save costs while empowering your developers

Now that these changes are in your hands, we invite you to see how much more you can get out of GitHub Codespaces by taking advantage of the improved processing power and increased headroom the RAM provides. As ever, please reach out to your account team, or participate in the GitHub Codespaces Community Discussions to provide us your feedback.


How we build containerized services at GitHub using GitHub https://github.blog/engineering/architecture-optimization/how-we-build-containerized-services-at-github-using-github/ Wed, 02 Aug 2023 12:31:17 +0000 https://github.blog/?p=73456 Learn about how we build containerized services that power microservices on the GitHub.com platform and many internal tools.

The post How we build containerized services at GitHub using GitHub appeared first on The GitHub Blog.

The developer experience engineering team at GitHub works on creating safe, delightful, and inclusive solutions for GitHub engineers to efficiently code, ship, and operate software–setting an example for the world on how to build software with GitHub. To achieve this we provide our developers with a paved path–a comprehensive suite of automated tools and applications to streamline our runtime platforms, deployment, and hosting that helps power some of the microservices on the GitHub.com platform and many internal tools. Let’s take a deeper look at how one of our main paved paths works.

Our development ecosystem

GitHub’s main paved path covers everything that’s needed for running software–creating, deploying, scaling, debugging, and running applications. It is an ecosystem of tools like Kubernetes, Docker, load balancers, and many custom apps that work together to create a cohesive experience for our engineers. It isn’t just infrastructure and isn’t just Kubernetes. Kubernetes is our base layer, and the paved path is a mix of conventions, tools, and settings built on top of it.

The kind of services that we typically run using the paved path include web apps, computation pipelines, batch processors, and monitoring systems.

Kubernetes, which is the base layer of the paved path, runs in a multi-cluster, multi-region topology.

Benefits of the paved path

There are hundreds of services at GitHub–from a small internal tool to an external API supporting production workloads. For a variety of reasons, it would be inefficient to spin up virtual machines for each service.

  • Planning and capacity usage across all services wouldn’t be efficient. We would encounter significant overhead in managing both physical and Kubernetes infrastructure on an ongoing basis.
  • Teams would need to build deep expertise in managing their own Kubernetes clusters and would have less time to focus on their application’s unique needs.
  • We would have less central visibility of applications.
  • Security and compliance would be difficult to standardize and enforce.

With the paved path based on Kubernetes and other runtime apps, we’re instead able to:

  • Plan capacity centrally and only for the Kubernetes nodes, so we can optimally use capacity across nodes, as small workloads and large workloads coexist on the same machines.
  • Scale rapidly thanks to central capacity planning.
  • Easily manage configuration and deployments across services in one central control plane.
  • Consistently provide insights into app and deployment performance for individual services.

Onboarding a service

Onboarding a service with the code living in its own repository has been made easy with our ChatOps command service, called Hubot, and GitHub Apps. Service owners can easily generate some basic scaffolding needed to deploy the service by running a command like:

hubot gh-platform app scaffold monalisa-app

A custom GitHub App installed on the service’s GitHub repository will then automatically generate a pull request to add the necessary configurations, which includes:

  • A deployment.yaml file that defines the service’s deployment environments.
  • Kubernetes manifests that define Deployment and Service objects for deploying the service.
  • A Debian-based Dockerfile that initially runs a trivial web server, which the Kubernetes manifests reference as a starting point.
  • CI builds, set up as GitHub Checks, that build the Docker images on every push and store them in a container registry, ready for deployment.

Each service that is onboarded to the paved path has its unique Kubernetes namespace that is defined by <app-name>-<environment> and generally has a staging and production environment. This helps separate the workloads of multiple services, and also multiple environments for the same service since each environment gets its own Kubernetes namespace.
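As an illustration only (GitHub’s internal templates are not public, so every name below is hypothetical), the scaffolded manifests for a monalisa-app staging environment might look roughly like:

```yaml
# Hypothetical minimal manifests for monalisa-app's staging environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monalisa-app
  namespace: monalisa-app-staging   # <app-name>-<environment>
spec:
  replicas: 2
  selector:
    matchLabels:
      app: monalisa-app
  template:
    metadata:
      labels:
        app: monalisa-app
    spec:
      containers:
        - name: web
          # Image reference is replaced with the freshly built tag at deploy time.
          image: registry.example.com/monalisa-app:placeholder
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: monalisa-app
  namespace: monalisa-app-staging
spec:
  selector:
    app: monalisa-app
  ports:
    - port: 80
      targetPort: 8080
```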

Deploying a service

At GitHub, we deploy branches and perform deployments through Hubot ChatOps commands. To deploy a branch named bug-fixes in the monalisa-app repository to the staging environment, a developer would run a ChatOps command like:

hubot deploy monalisa-app/bug-fixes to staging

This triggers a deployment that fetches the Docker image associated with the latest commit in the bug-fixes branch, updates the Kubernetes manifests, and applies those manifests to the clusters in the runtime platform relevant to that environment.

Typically, the Docker image would be deployed to multiple Kubernetes clusters across multiple geographical sites in a region that forms a part of the runtime platform.

To automate pull request merges into the busiest branches and orchestrate the rollout across environments we’re also using merge queue and deployment pipelines, which our engineers can observe and interact with during their deployment.

Securing our services

For any company, the security of the platform itself, along with services running within it, is critical. In addition to our engineering-wide practices, such as requiring two-person reviews on every pull request, we also have Security and Platform teams automating security measures, such as:

  • Pre-built Docker images to be used as base images for the Dockerfiles. These base images contain only the necessary packages/dependencies with security updates, a set of installed software that is auditable and curated according to shared needs.
  • Build-time and periodic scanning of all packages and running container images, for any vulnerabilities or dependencies needing a patch, powered by our own products for software supply chain security like Dependabot.
  • Build time and periodic scanning of GitHub repositories of services for exposed secrets and vulnerabilities, using GitHub’s native security features for advanced security like code scanning and secret scanning.
  • Multiple authentication and authorization mechanisms that allow only the relevant individuals to directly access underlying Kubernetes resources.
  • Comprehensive telemetry for threat detection.
  • Services running in the platform are by default accessible only within GitHub’s internal networks and not exposed on the public internet.
  • Branch protection policies are enforced on all production repositories. These policies prevent merging a pull request until designated automated tests pass and the change has been reviewed by a different developer from the one who proposed the change.

Another key aspect of security for an application is how secrets like keys and tokens are managed. At GitHub, we use a centralized secret store to manage secrets. Each service and each environment within the service has its own vault to store secrets. These secrets are then injected into the relevant pods in Kubernetes, which are then exposed to the containers.

The deployment flow, from merge to rollout

The whole deployment process would look something like this:

  1. A GitHub engineer merges a pull request to a branch in a repository. In the above example, it is the bug-fixes branch in the monalisa-app repository. This repository would also contain the Kubernetes manifest template files for deploying the application.
  2. The pull request merge triggers relevant CI workflows. One of them is a build of the Docker image, which builds the container image based on the Dockerfile specified in the repository, and pushes the image to an internal artifact registry.
  3. Once all the CI workflows have completed successfully, the engineer initiates a deployment by running a ChatOps command like hubot deploy monalisa-app/bug-fixes to staging. This triggers a deployment to our environments such as Staging.
  4. The build systems fetch the Kubernetes manifest files from the repository branch, replace the image reference with the latest build from the artifact registry, inject app secrets from the secret store, and run some custom operations. At the end of this stage, a ready-to-deploy Kubernetes manifest is available.
  5. Our deployment systems then apply the Kubernetes manifest to relevant clusters and monitor the rollout status of new changes.

Conclusion

GitHub’s internal paved path helps developers at GitHub focus on building services and delivering value to our users, with minimal focus on the infrastructure. We accomplish this by providing a streamlined path to our GitHub engineers that uses the power of containers and Kubernetes; scalable security, authentication, and authorization mechanisms; and the GitHub.com platform itself.

Want to try some of these for yourself? Learn more about all of GitHub’s features on github.com/features. If you have adopted any of our practices for your own development, give us a shout on Twitter!
