Taylor Blau, Author at The GitHub Blog
https://github.blog/author/ttaylorr/

Highlights from Git 2.52
https://github.blog/open-source/git/highlights-from-git-2-52/
Mon, 17 Nov 2025 17:54:31 +0000

The open source Git project just released Git 2.52. Here is GitHub’s look at some of the most interesting features and changes introduced since last time.

The post Highlights from Git 2.52 appeared first on The GitHub Blog.


The open source Git project just released Git 2.52 with features and bug fixes from over 94 contributors, 33 of them new. We last caught up with you on the latest in Git back when 2.51 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Tree-level blame information

If you’re a seasoned Git user, then you are no doubt familiar with git blame, Git’s tool for figuring out which commit most recently modified each line at a given filepath. Git’s blame functionality is great for figuring out when a bug was introduced, or why some code was written the way it was.

If you want to know which commit last modified any portion of a given filepath, that’s easy enough to do with git log -1 -- path/to/my/file, since -1 will give us only the first commit which modifies that path. But what if instead you want to know which commit most recently modified every file in some directory? Answering that question may seem contrived, but it’s not. If you’ve ever looked at a repository’s file listing on GitHub, the middle column of information has a link to the commit which most recently modified that path, along with (part of) its commit message.

[Screenshot: GitHub’s repository file listing, showing tree-level blame information.]

The question remains: how do we efficiently determine which commit most recently modified each file in a given directory? You could imagine that you might enumerate each tree entry, feeding it to git log -1 and collecting the output there, like so:

$ git ls-tree -z --name-only HEAD^{tree} | xargs -0 -I{} sh -c '
    git log -1 --format="$1 %h %s" -- "$1"
  ' -- {} | column -t -l3
.cirrus.yml     1e77de10810  ci: update FreeBSD image to 14.3
.clang-format   37215410730  clang-format: exclude control macros from SpaceBeforeParens
.editorconfig   c84209a0529  editorconfig: add .bash extension
.gitattributes  d3b58320923  merge-file doc: set conflict-marker-size attribute
.github         5db9d35a28f  Merge branch 'js/ci-github-actions-update'
[...]

That works, but not efficiently. To see why, consider a case with files A, B, and C introduced by commits C1, C2, and C3, respectively. To blame A, we walk from C3 back to C1 in order to determine that C1 was the most recent commit to modify A. That traversal passed through C3 and C2, but since we were only looking for modifications to A, we’ll end up revisiting those commits when trying to blame B and C. In this example, we visit those three commits six times in total, which is twice the number of commit visits actually needed.
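To avoid that duplicated work, you can invert the problem: walk history once from the tip, and record the first commit seen to touch each path that still lacks an answer. Here’s a toy sketch of that idea in Python (operating on a mock history rather than real Git data; an illustration of the single-walk idea, not Git’s actual implementation):

```python
# Toy model: newest-first history, where each commit lists the paths it
# modified. Illustrates the single-walk idea, not Git's implementation.
history = [
    ("C3", {"C"}),   # newest
    ("C2", {"B"}),
    ("C1", {"A"}),   # oldest
]

def last_modified(paths, history):
    """Walk history once, newest to oldest, resolving every path."""
    remaining = set(paths)
    result = {}
    for commit, touched in history:
        for path in touched & remaining:
            result[path] = commit
            remaining.discard(path)
        if not remaining:   # stop as soon as every path has an answer
            break
    return result

print(last_modified({"A", "B", "C"}, history))
# {'C': 'C3', 'B': 'C2', 'A': 'C1'}
```

Each commit is visited at most once, no matter how many paths you ask about.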

Git 2.52 introduces a new command which comes up with the same information in a fraction of the time: git last-modified. To get a sense for how much faster last-modified is than the example above, here are some hyperfine results:

Benchmark 1: git ls-tree + log
  Time (mean ± σ):      3.962 s ±  0.011 s    [User: 2.676 s, System: 1.330 s]
  Range (min … max):    3.940 s …  3.984 s    10 runs

Benchmark 2: git last-modified
  Time (mean ± σ):     722.7 ms ±   4.6 ms    [User: 682.4 ms, System: 40.1 ms]
  Range (min … max):   717.3 ms … 731.3 ms    10 runs

Summary
  git last-modified ran
    5.48 ± 0.04 times faster than git ls-tree + log

The core functionality behind git last-modified was written by GitHub over many years (originally called blame-tree in GitHub’s fork of Git), and is what has powered our tree-level blame since 2012. Earlier this year, we shared those patches with engineers at GitLab, who tidied up years of development into a reviewable series of patches which landed in this release.

There are still some features in GitHub’s version of this command that have yet to make their way into a Git release, including an on-disk format to cache the results of previous runs. In the meantime, check out git last-modified, available in Git 2.52.

[source, source, source]

Advanced repository maintenance strategies

Returning readers of this series may recall our coverage of the git maintenance command. If this is your first time reading along, or you could use a refresher, we’ve got you covered.

git maintenance is a Git command which can perform repository housekeeping tasks either on a scheduled or ad-hoc basis. The maintenance command can perform a variety of tasks, like repacking the contents of your repository, updating commit-graphs, expiring stale reflog entries, and much more. Put together, maintenance ensures that your repository continues to operate smoothly and efficiently.

By default (or when running the gc task), git maintenance relies on git gc internally to repack your repository, and remove any unreachable objects. This has a notable drawback: git gc performs “all-into-one” repacks to consolidate the contents of your repository, which can be sluggish for very large repositories. As an alternative, git maintenance has an incremental-repack strategy, but this never prunes out any unreachable objects.

Git 2.52 bridges this gap by introducing a new geometric task within git maintenance that avoids all-into-one repacks when possible, and prunes unreachable objects on a less frequent basis. This new task uses tools (like geometric repacking) that were designed at GitHub and have powered GitHub’s own repository maintenance for many years. Those tools have been in Git since 2.33, but were awkward to use or discover since their implementation was buried within git repack, not git gc.

The geometric task here works by inspecting the contents of your repository to determine if we can combine some number of packfiles to form a geometric progression by object count. If it can, it performs a geometric repack, condensing the contents of your repository without pruning any objects. Alternatively, if a geometric repack would pack the entirety of your repository into a single pack, then a full git gc is performed instead, which consolidates the contents of your repository and prunes out unreachable objects.
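Here’s a rough sketch (in Python, assuming a factor of 2, which happens to be git repack’s default geometric factor; an illustration of the idea, not Git’s actual implementation) of how such a geometric test might decide which packs to roll up:

```python
def packs_to_combine(object_counts, factor=2):
    """Given per-pack object counts, return the smallest packs that must
    be combined so the remaining packs form a geometric progression:
    each surviving pack holds at least `factor` times as many objects as
    the (combined) packs below it. A sketch of the idea behind geometric
    repacking, not Git's code."""
    counts = sorted(object_counts)
    total = 0
    split = 0  # everything in counts[:split] gets rolled up
    for i, n in enumerate(counts):
        if total > 0 and n < factor * total:
            split = i + 1  # progression violated up to and including here
        total += n
    return counts[:split]

print(packs_to_combine([10, 1, 2]))  # [] – already geometric, nothing to do
print(packs_to_combine([3, 2, 1]))   # [1, 2, 3] – roll everything into one pack
```

Notice that when the progression is violated all the way up, everything rolls into a single pack, which is exactly the case where the task falls back to a full git gc.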

Git 2.52 makes it a breeze to keep even your largest repositories running smoothly. Check out the new geometric strategy, or any of the many other things git maintenance can do in 2.52.

[source]


The tip of the iceberg…

Now that we’ve covered some of the larger changes in more detail, let’s take a closer look at a selection of some other new features and updates in this release.

  • This release saw a couple of new sub-commands added to git refs, Git’s relatively new tool for providing low-level access to your repository’s references. Prior to this release, git refs was capable of migrating between reference backends (e.g., to have your repository store reference data in the reftable format), along with verifying the internal representation of those references.

    git refs now includes two new sub-commands: git refs list and git refs exists. The former is an alias for git for-each-ref and supports the same set of options. The latter works like git show-ref --exists, and can be used to quickly determine whether or not a given reference exists.

    Neither of these new sub-commands introduces new functionality, but they do consolidate a couple of common reference-related operations into a single Git command rather than many individual ones.

    [source]

  • If you’ve ever scripted around Git, you are likely familiar with Git’s rev-parse command. If not, you’d be forgiven for thinking that rev-parse is designed to just resolve the various ways to describe a commit into a full object ID. In reality, rev-parse offers functionality totally unrelated to resolving object IDs, including shell quoting, option parsing (as a replacement for getopt), printing local GIT_ environment variables, resolving paths inside of $GIT_DIR and so much more.

    Git 2.52 introduces the first step to giving some of this functionality a new home via its new git repo command. The git repo command—currently designated as experimental—is designed to be a general-purpose tool for retrieving pieces of information about your repository. For example, you can check whether or not a repository is shallow or bare, along with what type of object and reference format it uses, like so:

    $ keys='layout.bare layout.shallow object.format references.format'
    $ git repo info $keys
    layout.bare=false
    layout.shallow=false
    object.format=sha1
    references.format=files

    The new git repo command can also print out some general statistics about your repository’s structure and contents via its git repo structure sub-command:

    $ git repo structure
    Counting objects: 497533, done.
    | Repository structure | Value  |
    | -------------------- | ------ |
    | * References         |        |
    |   * Count            |   2871 |
    |     * Branches       |     58 |
    |     * Tags           |   1273 |
    |     * Remotes        |   1534 |
    |     * Others         |      6 |
    |                      |        |
    | * Reachable objects  |        |
    |   * Count            | 497533 |
    |     * Commits        |  91386 |
    |     * Trees          | 208050 |
    |     * Blobs          | 197103 |
    |     * Tags           |    994 |

    [source, source, source]

  • Back in 2.28, the Git project introduced the init.defaultBranch configuration option to provide a default branch name for any repositories created with git init. Since its introduction, the default value of that configuration option was “master”, though many set init.defaultBranch to “main” instead.

    Beginning in Git 3.0, the default value for init.defaultBranch will change to “main”. That means that any repositories created in Git 3.0 or newer using git init will have their default branch named “main” without the need for any additional configuration.

    If you want to get a sneak peek of that, or any other planned change for Git 3.0, you can build Git locally with the WITH_BREAKING_CHANGES build flag to try out the new changes today.

    [source, source]

  • By default, Git uses SHA-1 to provide a content-addressable hash of any object in your repository. In Git 3.0, Git will instead use SHA-256 which offers more appealing security properties. Back in our coverage of Git 2.45, we talked about some new changes which enable writing out separate copies of new objects using both SHA-1 and SHA-256 as a transitory step towards interoperability between the two.

    In Git 2.52, the rest of that work towards interoperability begins. Though the changes that landed in this release are focused on laying the groundwork for future interoperability features, the hope is that eventually you can use a Git repository with one hash algorithm, while pushing and pulling from another repository using a different hash algorithm.

    [source]

  • Speaking of other bleeding-edge changes in Git, this release is the first to (optionally) use Rust code for some internal functionality within Git. This mode is optional and guarded behind a new WITH_RUST build flag. When built with this mode enabled, Git will use a Rust implementation for encoding and decoding variable-width integers.

    Though this release only introduces a Rust variant of some minor utility functionality, it sets up the infrastructure for much more interesting parts of Git to be rewritten in Rust.

    Rust support is not yet mandatory, so Git 2.52 will continue to run just fine on platforms that don’t have a Rust compiler. However, Rust support will be required for Git 3.0, at which point many more components of Git will likely depend on Rust code.

    [source, source, source]

  • Long-time readers may recall our coverage of changed-path Bloom filters within Git from back in 2.28. If not, a changed-path Bloom filter is a probabilistic data structure that can approximate which file path(s) were modified by a commit (relative to its first parent). Since Bloom filters never have false negatives (i.e., they never indicate that a commit did not modify some path when it in fact did), they can be used to accelerate many path-scoped traversals throughout Git (including last-modified above!).

    More recently, we covered new ways of using Bloom filters within Git, like providing multiple paths of interest at the same time (e.g., git log /my/subdir /my/other/subdir) which previously were not supported with Bloom filters. At that time, we wrote that there were ongoing discussions about supporting Bloom filters in even more of Git’s expressive pathspec syntax.

    This release delivers the result of those discussions, extending the performance benefits of Bloom filters to even more scenarios. One example is when a pathspec contains wildcards in some, but not all, of its components, like foo/bar/*/baz, where Git will now use its Bloom filters for the non-wildcard components of the path. To read about even more scenarios that can now leverage Bloom filters, check out the link below.

    [source]

  • This release also saw a number of performance improvements across many areas of the project. git describe learned how to use a priority queue to speed up performance by 30%. git remote picked up a couple of new tricks to optimize renaming references with its rename sub-command. git ls-files can keep the index sparse in cases where it couldn’t before. git log -L became significantly faster by avoiding some unnecessary tree-level diffs when processing merge commits. Finally, xdiff (the library that powers Git’s file-level diff and merge engine) benefitted from a pair of optimizations (here, and here) in this release, and even more optimizations that will likely land in a future release.

    [source, source, source, source]

  • Last but not least, some updates to Git’s sparse-checkout feature, which learned a new “clean” sub-command. git sparse-checkout clean can help you recover from tricky cases where some files are left outside of your sparse-checkout definition when changing which part(s) of the repository you have checked out.

    The details of how one might get into this situation, and why recovering from it with pre-2.52 tools alone was so difficult, are surprisingly technical. If you’re interested in all of the gory details, this commit has all of the information about this change.

    In the meantime, if you use sparse-checkout and have ever had difficulty cleaning up when switching your sparse-checkout definition, give git sparse-checkout clean a whirl with Git 2.52.

    [source]
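As an aside on the Rust item above: the variable-width integers in question are encoded with 7 data bits per byte, a continuation bit, and an offset-by-one trick on continuation bytes that gives every value exactly one encoding. Here’s a Python sketch modeled on Git’s varint.c (treat the details as illustrative rather than authoritative):

```python
def encode_varint(value):
    """Encode a non-negative integer in Git's varint format: 7 data bits
    per byte, most-significant group first, with the high bit set on
    every byte except the last. Continuation bytes are offset by one, so
    each value has exactly one encoding. (Modeled on Git's varint.c.)"""
    out = [value & 127]
    value >>= 7
    while value:
        value -= 1                       # the offset-by-one trick
        out.append(128 | (value & 127))
        value >>= 7
    return bytes(reversed(out))

def decode_varint(buf):
    """Inverse of encode_varint, reading from the start of buf."""
    value = buf[0] & 127
    i = 0
    while buf[i] & 128:
        i += 1
        value = ((value + 1) << 7) + (buf[i] & 127)
    return value

for n in (0, 127, 128, 300, 2**20):
    assert decode_varint(encode_varint(n)) == n
print(encode_varint(300).hex())  # '812c'
```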

…the rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.52, or any previous version in the Git repository.

Post-quantum security for SSH access on GitHub
https://github.blog/engineering/platform-security/post-quantum-security-for-ssh-access-on-github/
Mon, 15 Sep 2025 16:00:00 +0000

GitHub is introducing post-quantum secure key exchange methods for SSH access to better protect Git data in transit.

The post Post-quantum security for SSH access on GitHub appeared first on The GitHub Blog.


Today, we’re announcing some changes that will improve the security of accessing Git data over SSH.

What’s changing?

We’re adding a new post-quantum secure SSH key exchange algorithm, known alternately as sntrup761x25519-sha512 and sntrup761x25519-sha512@openssh.com, to our SSH endpoints for accessing Git data.

This only affects SSH access and doesn’t impact HTTPS access at all.

It also does not affect GitHub Enterprise Cloud with data residency in the United States region.

Why are we making these changes?

These changes will keep your data secure both now and far into the future by ensuring it is protected against future decryption attacks carried out on quantum computers.

When you make an SSH connection, a key exchange algorithm is used for both sides to agree on a secret. The secret is then used to generate encryption and integrity keys. While today’s key exchange algorithms are secure, new ones are being introduced that are secure against cryptanalytic attacks carried out by quantum computers.

We don’t know if it will ever be possible to produce a quantum computer powerful enough to break traditional key exchange algorithms. Nevertheless, an attacker could save encrypted sessions now and, if a suitable quantum computer is built in the future, decrypt them later. This is known as a “store now, decrypt later” attack.

To protect your traffic to GitHub when using SSH, we’re rolling out a hybrid post-quantum key exchange algorithm: sntrup761x25519-sha512 (also known by the older name sntrup761x25519-sha512@openssh.com). This provides security against quantum computers by combining a new post-quantum-secure algorithm, Streamlined NTRU Prime, with the classical Elliptic Curve Diffie-Hellman algorithm using the X25519 curve. Even though these post-quantum algorithms are newer and thus have received less testing, combining them with the classical algorithm ensures that security won’t be weaker than what the classical algorithm provides.
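The combination step itself is conceptually simple. Here’s a deliberately simplified Python sketch of the hybrid idea (not OpenSSH’s exact key derivation, which is specified in its PROTOCOL documents and differs in detail): both shared secrets feed into a single hash, so recovering the session keys requires breaking both algorithms.

```python
import hashlib

def combine_shared_secrets(pq_secret: bytes, classical_secret: bytes) -> bytes:
    """Hash the concatenation of the post-quantum and classical shared
    secrets. Predicting the output requires knowing *both* inputs, so
    the hybrid is at least as strong as whichever algorithm survives.
    (A simplified sketch of the principle, not OpenSSH's derivation.)"""
    return hashlib.sha512(pq_secret + classical_secret).digest()

# Toy example with placeholder secrets:
pq = b"\x01" * 32    # stand-in for a Streamlined NTRU Prime shared secret
ecdh = b"\x02" * 32  # stand-in for an X25519 shared secret
key = combine_shared_secrets(pq, ecdh)
print(len(key))  # 64 bytes of SHA-512 output

# Knowing only one input leaves the output unpredictable:
assert combine_shared_secrets(pq, b"\x00" * 32) != key
```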

These changes are rolling out to github.com and non-US resident GitHub Enterprise Cloud regions. Only FIPS-approved cryptography may be used within the US region, and this post-quantum algorithm isn’t approved by FIPS.

When are these changes effective?

We’ll enable the new algorithm on September 17, 2025 for GitHub.com and GitHub Enterprise Cloud with data residency (with the exception of the US region).

This will also be included in GitHub Enterprise Server 3.19.

How do I prepare?

This change only affects connections with a Git client over SSH. If your Git remotes start with https://, you won’t be impacted by this change.

For most uses, the new key exchange algorithm won’t result in any noticeable change. If your SSH client supports sntrup761x25519-sha512@openssh.com or sntrup761x25519-sha512 (for example, OpenSSH 9.0 or newer), it will automatically choose the new algorithm if your client prefers it. No configuration change should be necessary unless you modified your client’s defaults.

If you use an older SSH client, your client should fall back to an older key exchange algorithm. That means you won’t experience the security benefits of using a post-quantum algorithm until you upgrade, but your SSH experience should continue to work as normal, since the SSH protocol automatically picks an algorithm that both sides support.

If you want to test whether your version of OpenSSH supports this algorithm, you can run the following command: ssh -Q kex. That lists all of the key exchange algorithms supported, so if you see sntrup761x25519-sha512 or sntrup761x25519-sha512@openssh.com, then it’s supported.

To check which key exchange algorithm OpenSSH uses when you connect to GitHub.com, run the following command on Linux, macOS, Git Bash, or other Unix-like environments:

$ ssh -v git@github.com exit 2>&1 | grep 'kex: algorithm:'

For other implementations of SSH, please see the documentation for that implementation.

What’s next?

We’ll keep an eye on the latest developments in security. As the SSH libraries we use begin to support additional post-quantum algorithms, including ones that comply with FIPS, we’ll update you on our offerings.

Highlights from Git 2.51
https://github.blog/open-source/git/highlights-from-git-2-51/
Mon, 18 Aug 2025 17:04:36 +0000

The open source Git project just released Git 2.51. Here is GitHub’s look at some of the most interesting features and changes introduced since last time.

The post Highlights from Git 2.51 appeared first on The GitHub Blog.


The open source Git project just released Git 2.51 with features and bug fixes from over 91 contributors, 21 of them new. We last caught up with you on the latest in Git back when 2.50 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Cruft-free multi-pack indexes

Returning readers will have likely seen our coverage of cruft packs, multi-pack indexes (MIDXs), and reachability bitmaps. In case you’re new around here or otherwise need a refresher, here’s a brief overview:

Git stores repository contents as “objects” (blobs, trees, commits), either individually (“loose” objects, e.g. $GIT_DIR/objects/08/10d6a05...) or grouped into “packfiles” ($GIT_DIR/objects/pack). Each pack has an index (*.idx) that maps object hashes to offsets. With many packs, lookups slow down to O(M*log(N)) (where M is the number of packs in your repository, and N is the number of objects within a given pack).

A MIDX works like a pack index but covers the objects across multiple individual packfiles, reducing the lookup cost to O(log(N)), where N is the total number of objects in your repository. We use MIDXs at GitHub to store the contents of your repository after splitting it into multiple packs. We also use MIDXs to store a collection of reachability bitmaps for some selection of commits to quickly determine which object(s) are reachable from a given commit[1].
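Here’s a small Python sketch (using toy integer object IDs in place of real hashes) contrasting the per-pack lookup with a merged, MIDX-style lookup:

```python
import bisect

# Toy "pack indexes": each is a sorted list of object IDs (ints here).
packs = [[2, 9, 14], [5, 7, 21], [1, 11, 30]]

def lookup_per_pack(oid, packs):
    """O(M*log(N)): binary-search each pack's index until a hit."""
    for pack_no, idx in enumerate(packs):
        pos = bisect.bisect_left(idx, oid)
        if pos < len(idx) and idx[pos] == oid:
            return pack_no, pos
    return None

# A multi-pack index merges every pack's entries into one sorted list,
# so a lookup is a single O(log(N)) binary search over all objects.
midx = sorted((oid, pack_no) for pack_no, idx in enumerate(packs)
              for oid in idx)

def lookup_midx(oid, midx):
    pos = bisect.bisect_left(midx, (oid, -1))
    if pos < len(midx) and midx[pos][0] == oid:
        return midx[pos][1]  # which pack holds the object
    return None

assert lookup_per_pack(7, packs) == (1, 1)
assert lookup_midx(7, midx) == 1
```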

However, we store unreachable objects separately in what is known as a “cruft pack”. We had hoped to keep cruft packs (and the unreachable objects they store) out of the MIDX, but we realized pretty quickly that doing so was impossible. The exact reasons are spelled out in this commit, but the gist is as follows: if a once-unreachable object (stored in a cruft pack) later becomes reachable from some bitmapped commit, but the only copy of that object is stored in a cruft pack outside of the MIDX, then that object has no bit position, making it impossible to write a reachability bitmap.

Git 2.51 introduces a change to how the non-cruft portion of your repository is packed. When generating a new pack, Git used to exclude any object which appeared in at least one pack that would not be deleted during a repack operation, including cruft packs. In 2.51, Git now will store additional copies of objects (and their ancestors) whose only other copy is within a cruft pack. Carrying this process out repeatedly guarantees that the set of non-cruft packs does not have any object which reaches some other object not stored within that set of packs. (In other words, the set of non-cruft packs is closed under reachability.)

As a result, Git 2.51 has a new repack.MIDXMustContainCruft configuration which uses the new repacking behavior described above to store cruft packs outside of the MIDX. Using this at GitHub has allowed us to write significantly smaller MIDXs in a fraction of the time, resulting in faster repository read performance overall. (In our primary monorepo, MIDXs shrunk by about 38%, we wrote them 35% faster, and improved read performance by around 5%.)

Give cruft-less MIDXs a try today using the new repack.MIDXMustContainCruft configuration option.

[source]

Smaller packs with path walk

In Git 2.49, we talked about Git’s new “name-hash v2” feature, which changed the way that Git selects pairs of objects to delta-compress against one another. The full details are covered in that post, but here’s a quick gist. When preparing a packfile, Git computes a hash of all objects based on their filepath. Those hashes are then used to sort the list of objects to be packed, and Git uses a sliding window over that sorted list to identify good delta/base candidates.

Prior to 2.49, Git used a single hash function based on the object’s filepath, with a heavy bias towards the last 16 characters of the path. That hash function, dating back all the way to 2006, works well in many circumstances, but can fall short when, say, unrelated blobs appear in paths whose final 16 characters are similar. Git 2.49 introduced a new hash function which takes more of the directory structure into account[2], resulting in significantly smaller packs in some circumstances.
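For the curious, the classic hash function is small enough to sketch in a few lines of Python (modeled on pack_name_hash() in Git’s source; treat the details as approximate). Since the hash is 32 bits wide and shifts down 2 bits per character, characters more than about 16 positions from the end of the path contribute little or nothing:

```python
def name_hash(path):
    """Approximation of Git's classic (pre-2.49) pack name hash:
    whitespace is skipped, and each byte shifts the accumulated hash
    right two bits before being mixed into the top byte. Earlier
    characters are progressively shifted out of the 32-bit result,
    which is where the bias toward the last ~16 characters comes from."""
    h = 0
    for c in path.encode():
        if chr(c).isspace():
            continue
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h

# Exact behavior on short names:
assert name_hash("a") == 97 << 24
assert name_hash("ab") == (97 << 22) + (98 << 24)
assert name_hash("a b") == name_hash("ab")   # whitespace is ignored

# Unrelated paths with similar endings hash close together, so they
# land near one another in the delta-search window:
for p in ("drivers/net/Makefile", "fs/ext4/Makefile", "Makefile"):
    print(f"{p:22} {name_hash(p):08x}")
```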

Git 2.51 takes the spirit of that change and goes a step further by introducing a new way to collect objects when repacking, called “path walk”. Instead of walking objects in revision order with Git emitting objects with their corresponding path names along the way, the path walk approach emits all objects from a given path at the same time. This approach avoids the name-hash heuristic altogether and can look for deltas within groups of objects that are known to be at the same path.

As a result, Git can generate packs using the path walk approach that are often significantly smaller than even those generated with the new name hash function described above. Its timings are competitive even with generating packs using the existing revision order traversal.

Try it out today by repacking with the new --path-walk command-line option.

[source]

Stash interchange format

If you’ve ever needed to switch to another branch, but wanted to save any uncommitted changes, you have likely used git stash. The stash command stores the state of your working copy and index, and then restores your local copy to match whatever was in HEAD at the time you stashed.

If you’ve ever wondered how Git actually stores a stash entry, then this section is for you. Whenever you push something onto your stash, Git creates three[3] commits behind the scenes. There are two commits generated which capture the staged and unstaged changes. The staged changes represent whatever was in your index at the time of stashing, and the working directory changes represent everything you changed in your local copy but didn’t add to the index. Finally, Git creates a third commit listing the other two as its parents, capturing the entire snapshot.

Those internally generated commits are stored in the special refs/stash ref, and multiple stash entries are managed with the reflog. They can be accessed with git stash list, and so on. Since there is only one stash entry in refs/stash at a time, it’s extremely cumbersome to migrate stash entries from one machine to another.

Git 2.51 introduces a variant of the internal stash representation that allows multiple stash entries to be represented as a sequence of commits. Instead of using the first two parents to store changes from the index and working copy, this new representation adds one more parent to refer to the previous stash entry. That results in stash entries that contain four[4] parents, and can be treated like an ordinary log of commits.

As a consequence of that, you can now export your stashes to a single reference, and then push or pull it like you would a normal branch or tag. Git 2.51 makes this easy by introducing two new sub-commands to git stash to import and export, respectively. You can now do something like:

$ git stash export --to-ref refs/stashes/my-stash
$ git push origin refs/stashes/my-stash

on one machine to push the contents of your stash to origin, and then:

$ git fetch origin '+refs/stashes/*:refs/stashes/*'
$ git stash import refs/stashes/my-stash

on another, preserving the contents of your stash between the two.

[source]


All that…

Now that we’ve covered some of the larger changes in more detail, let’s take a quicker look at a selection of some other new features and updates in this release.

  • If you’ve ever scripted around the object contents of your repository, you have no doubt encountered git cat-file, Git’s dedicated tool to print the raw contents of a given object.

    git cat-file also has specialized --batch and --batch-check modes, which take a sequence of objects over stdin and print each object’s information (and contents, in the case of --batch). For example, here’s some basic information about the README.md file in Git’s own repository.

    $ echo HEAD:README.md | git cat-file --batch-check
    d87bca1b8c3ebf3f32deb557ae9796ddc5b792ca blob 3662

    Here, Git is telling us the object ID, type, and size for the object we specified, just as we expect. cat-file produces the same information for tree and commit objects. But what happens if we give it the path to a submodule? Prior to Git 2.51, cat-file would just print missing. But Git 2.51 improves this output, making cat-file more useful in a variety of new scripting scenarios:

    [ pre-2.51 git ]
    $ echo HEAD:sha1collisiondetection | git cat-file --batch-check
    HEAD:sha1collisiondetection missing

    [ git 2.51 ]
    $ echo HEAD:sha1collisiondetection | git cat-file --batch-check
    855827c583bc30645ba427885caa40c5b81764d2 submodule

    [source]

  • Back in our coverage of 2.28, we talked about Git’s new changed-path Bloom filters. If you aren’t familiar with Bloom filters, or could use a refresher about how they’re used in Git, then read on.

    A Bloom filter is a probabilistic data structure that behaves like a set, with one difference. It can only tell you with 100% certainty whether an element is not in the set, but may have some false positives when indicating that an item is in the set.

    Git uses Bloom filters in its commit-graph data structure to store a probabilistic set of which paths were modified by that commit relative to its first parent. That allows history traversals like git log origin -- path/to/my/file to quickly skip over commits which are known not to modify that path (or any of its parents). However, because Git’s full pathspec syntax is far more expressive than that, Bloom filters can’t always optimize pathspec-scoped history traversals.

    Git 2.51 addresses part of that limitation by adding support for using multiple pathspec items, like git log -- path/to/a path/to/b, which previously could not make use of changed-path Bloom filters. At the time of writing, there is ongoing discussion about adding support for even more special cases.

    [source]

  • The modern equivalents of git checkout, known as git switch and git restore, have been considered experimental since their introduction back in Git 2.23. These commands separate the many jobs that git checkout performs into more purpose-built commands. Six years later[5], these commands are no longer considered experimental, making their command-line interface stable and backwards compatible across future releases.

    [source]

  • Even if you’re a veteran Git user, you may still encounter an unfamiliar Git command (among the 144![6]) every once in a while. One such command is git whatchanged, which behaves like its modern alternative git log --raw.

    That command is now marked as deprecated with eventual plans to remove it in Git 3.0. As with other similar deprecations, you can still use this command behind the aptly-named --i-still-use-this flag[7].

    [source]

  • Speaking of Git 3.0, this release saw a few more entries added to the BreakingChanges list. First, Git’s reftable backend (which we talked about extensively in our coverage of Git 2.45) will become the new default format in repositories created with Git 3.0, when it is eventually released. Git 3.0 will also use the SHA-256 hash function as its default hash when initializing new repositories.

    Though there is no official release date yet planned for Git 3.0, you can get a feel for some of the new defaults by building Git yourself with the WITH_BREAKING_CHANGES flag.

    [source, source]

  • Last but not least, a couple of updates on Git’s internal development process. Git has historically prioritized wide platform compatibility, and, as a result, has taken a conservative approach to adopting features from newer C standards. Though Git has required a C99-compatible compiler since near the end of 2021, it has adopted features from that standard gradually, since some of the compilers Git targets only have partial support for the standard.

    One example is the bool keyword, which became part of the C standard in C99. The project began experimenting with the bool keyword back in late 2023. This release declares that experiment a success and now permits the use of bool throughout its codebase. This release also began documenting C99 features that the project is using experimentally along with C99 features that the project doesn’t use.

    Finally, this release saw an update to Git’s guidelines on submitting patches, which have historically required contributions to be non-anonymous, and submitted under a contributor’s legal name. Git now aligns more closely with the Linux kernel’s approach, to permit submitting patches with an identity other than the contributor’s legal name.

    [source, source, source]

…and a bag of chips

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.51, or any previous version in the Git repository.


1 For some bit position (corresponding to a single object in your repository), a 1 means that object can be reached from that bitmap’s associated commit, and a 0 means it is not reachable from that commit. There are also four type-level bitmaps (for blobs, trees, commits, and annotated tags); the XOR of those bitmaps is the all-1s bitmap. For more details on multi-pack reachability bitmaps, check out our previous post on Scaling monorepo maintenance. ⤴️

2 For the curious, each layer of the directory is hashed individually, then downshifted and XORed into the overall result. This results in a hash function which is more sensitive to the whole path structure, rather than just the final 16 characters. ⤴️

3 Usually. Git will sometimes generate a fourth commit if you stashed untracked files (new files that haven’t yet been committed) or ignored files (those that match one or more patterns in a .gitignore). ⤴️

4 Or five. ⤴️

5 Almost to the day; Git 2.23 was released on August 16, 2019, and Git 2.51 was released on August 18, 2025. ⤴️

6 It’s true; git --list-cmds=builtins | wc -l outputs “144” with Git 2.51. ⤴️

7 If you are somehow a diehard git whatchanged user, please let us know by sending a message to the Git mailing list. ⤴️

The post Highlights from Git 2.51 appeared first on The GitHub Blog.

]]>
Git security vulnerabilities announced https://github.blog/open-source/git/git-security-vulnerabilities-announced-6/ Tue, 08 Jul 2025 17:02:11 +0000 https://github.blog/?p=89409 Today, the Git project released new versions to address seven security vulnerabilities that affect all prior versions of Git.

The post Git security vulnerabilities announced appeared first on The GitHub Blog.

]]>

Today, the Git project released new versions to address seven security vulnerabilities that affect all prior versions of Git.

Vulnerabilities in Git

CVE-2025-48384

When reading a configuration value, Git will strip any trailing carriage return (CR) and line feed (LF) characters. When writing a configuration value, however, Git does not quote trailing CR characters, causing them to be lost when they are read later on. When initializing a submodule whose path contains a trailing CR character, the stripped path is used, causing the submodule to be checked out in the wrong place.

If a symlink already exists between the stripped path and the submodule’s hooks directory, an attacker can execute arbitrary code through the submodule’s post-checkout hook.

[source]

CVE-2025-48385

When cloning a repository, Git can optionally fetch a bundle, allowing the server to offload a portion of the clone to a CDN. The Git client does not properly validate the advertised bundle(s), allowing the remote side to perform protocol injection. When a specially crafted bundle is advertised, the remote end can cause the client to write the bundle to an arbitrary location, which may lead to code execution similar to the previous CVE.

[source]

CVE-2025-48386 (Windows only)

When cloning from an authenticated remote, Git uses a credential helper in order to authenticate the request. Git includes a handful of credential helpers, including Wincred, which uses the Windows Credential Manager to store its credentials.

Wincred uses the contents of a static buffer as a unique key to store and retrieve credentials. However, it does not properly bounds check the remaining space in the buffer, leading to potential buffer overflows.

[source]

Vulnerabilities in Git GUI and Gitk

This release resolves four new CVEs related to Gitk and Git GUI. Both tools are Tcl/Tk-based graphical interfaces used to interact with Git repositories. Gitk is focused on showing a repository’s history, whereas Git GUI focuses on making changes to existing repositories.

CVE-2025-27613 (Gitk)

When running Gitk in a specially crafted repository without additional command-line arguments, Gitk can write and truncate arbitrary writable files. The “Support per-file encoding” option must be enabled; however, the operation of “Show origin of this line” is affected regardless.

[source]

CVE-2025-27614 (Gitk)

If a user is tricked into running gitk filename (where filename has a particular structure), they may run arbitrary scripts supplied by the attacker, leading to arbitrary code execution.

[source]

CVE-2025-46334 (Git GUI, Windows only)

If a malicious repository includes an executable sh.exe, or common textconv programs (e.g., astextplain, exif, or ps2ascii), path lookup on Windows may locate these executables in the working tree. If a user running Git GUI in such a repository selects either the “Git Bash” or the “Browse Files” option from the menu, these programs may be invoked, leading to arbitrary code execution.

[source]

CVE-2025-46835 (Git GUI)

When a user is tricked into editing a file in a specially named directory in an untrusted repository, Git GUI can create and overwrite arbitrary writable files, similar to CVE-2025-27613.

[source]

Upgrade to the latest Git version

The most effective way to protect against these vulnerabilities is to upgrade to Git 2.50.1, the newest release containing fixes for the aforementioned vulnerabilities. If you can’t upgrade immediately, you can reduce your risk by doing the following:

  • Avoid running git clone with --recurse-submodules against untrusted repositories.
  • Disable auto-fetching bundle URIs by setting the transfer.bundleURI configuration value to “false.”
  • Avoid using the wincred credential helper on Windows.
  • Avoid running Gitk and Git GUI in untrusted repositories.
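
For example, disabling auto-fetching of bundle URIs is a one-line configuration change (shown here applied globally; a per-repository setting works too):

```shell
# Sketch: disable auto-fetching of server-advertised bundle URIs,
# one of the suggested mitigations for CVE-2025-48385.
git config --global transfer.bundleURI false
git config --global transfer.bundleURI   # prints: false
```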

In order to protect users against attacks related to these vulnerabilities, GitHub has taken proactive steps. Specifically, we have scheduled releases of GitHub Desktop. GitHub Codespaces and GitHub Actions will update their versions of Git shortly. GitHub itself, including Enterprise Server, is unaffected by these vulnerabilities.


CVE-2025-48384, CVE-2025-48385, and CVE-2025-48386 were discovered by David Leadbeater. Justin Tobler and Patrick Steinhardt provided fixes for CVE-2025-48384 and CVE-2025-48385, respectively. The fix for CVE-2025-48386 is joint work between Taylor Blau and Jeff King.

CVE-2025-46835 was found and fixed by Johannes Sixt. Mark Levedahl discovered and fixed CVE-2025-46334. Avi Halachmi discovered both CVE-2025-27613 and CVE-2025-27614, and fixed the latter. CVE-2025-27613 was fixed by Johannes Sixt.

The post Git security vulnerabilities announced appeared first on The GitHub Blog.

]]>
Highlights from Git 2.50 https://github.blog/open-source/git/highlights-from-git-2-50/ Mon, 16 Jun 2025 17:12:27 +0000 https://github.blog/?p=88787 The open source Git project just released Git 2.50. Here is GitHub’s look at some of the most interesting features and changes introduced since last time.

The post Highlights from Git 2.50 appeared first on The GitHub Blog.

]]>

The open source Git project just released Git 2.50 with features and bug fixes from 98 contributors, 35 of them new. We last caught up with you on the latest in Git back when 2.49 was released.

💡 Before we get into the details of this latest release, we wanted to remind you that Git Merge, the conference for Git users and developers, is back this year on September 29-30 in San Francisco. Git Merge will feature talks from developers working on Git and in the Git ecosystem. Tickets are on sale now; check out the website to learn more.

With that out of the way, let’s take a look at some of the most interesting features and changes from Git 2.50.

Improvements for multiple cruft packs

When we covered Git 2.43, we talked about newly added support for multiple cruft packs. Git 2.50 improves on that with better command-line ergonomics, and some important bugfixes. In case you’re new to the series, need a refresher, or aren’t familiar with cruft packs, here’s a brief overview:

Git objects may be either reachable or unreachable. The set of reachable objects is everything you can walk to starting from one of your repository’s references: traversing from commits to their parent(s), trees to their sub-tree(s), and so on. Any object that you didn’t visit by repeating that process over all of your references is unreachable.

In Git 2.37, Git introduced cruft packs, a new way to store your repository’s unreachable objects. A cruft pack looks like an ordinary packfile with the addition of an .mtimes file, which is used to keep track of when each object was most recently written in order to determine when it is safe1 to discard it.

However, updating the cruft pack could be cumbersome, particularly in repositories with many unreachable objects, since a repository’s cruft pack must be rewritten in order to add new objects. Git 2.43 began to address this through a new command-line option: git repack --max-cruft-size. This option was designed to split unreachable objects across multiple packs, each no larger than the value specified by --max-cruft-size. But there were a couple of problems:

  • If you’re familiar with git repack’s --max-pack-size option, --max-cruft-size’s behavior is quite confusing. The former option specifies the maximum size an individual pack can be, while the latter involves how and when to move objects between multiple packs.
  • The feature was broken to begin with! Since --max-cruft-size also imposes on cruft packs the same pack-size constraints as --max-pack-size does on non-cruft packs, it is often impossible to get the behavior you want.

For example, suppose you had two 100 MiB cruft packs and ran git repack --max-cruft-size=200M. You might expect Git to merge them into a single 200 MiB pack. But since --max-cruft-size also dictates the maximum size of the output pack, Git will refuse to combine them, or worse: rewrite the same pack repeatedly.

Git 2.50 addresses both of these issues with a new option: --combine-cruft-below-size. Instead of specifying the maximum size of the output pack, it determines which existing cruft pack(s) are eligible to be combined. This is particularly helpful for repositories that have accumulated many unreachable objects spread across multiple cruft packs. With this new option, you can gradually reduce the number of cruft packs in your repository over time by combining existing ones together.

With the introduction of --combine-cruft-below-size, Git 2.50 repurposed --max-cruft-size to behave as a cruft pack-specific override for --max-pack-size. Now --max-cruft-size only determines the size of the outgoing pack, not which packs get combined into it.

Along the way, a bug was uncovered that prevented objects stored in multiple cruft packs from being “freshened” in certain circumstances. In other words, some unreachable objects don’t have their modification times updated when they are rewritten, leading to them being removed from the repository earlier than they otherwise would have been. Git 2.50 squashes this bug, meaning that you can now efficiently manage multiple cruft packs and freshen their objects to your heart’s content.

[source, source]

Incremental multi-pack reachability bitmaps

Back in our coverage of Git 2.47, we talked about preliminary support for incremental multi-pack indexes. Multi-pack indexes (MIDXs) act like a single pack *.idx file for objects spread across multiple packs.

Multi-pack indexes are extremely useful to accelerate object lookup performance in large repositories by binary searching through a single index containing most of your repository’s contents, rather than repeatedly searching through each individual packfile. But multi-pack indexes aren’t just useful for accelerating object lookups. They’re also the basis for multi-pack reachability bitmaps, the MIDX-specific analogue of classic single-pack reachability bitmaps. If neither of those is familiar to you, don’t worry; here’s a brief refresher: single-pack reachability bitmaps store a collection of bitmaps corresponding to a selection of commits. Each bit position in a pack bitmap refers to one object in that pack. In each individual commit’s bitmap, the set bits correspond to objects that are reachable from that commit, and the unset bits represent those that are not.

Multi-pack bitmaps were introduced to take advantage of the substantial performance increase afforded to us by reachability bitmaps. Instead of having bitmaps whose bit positions correspond to the set of objects in a single pack, a multi-pack bitmap’s bit positions correspond to the set of objects in a multi-pack index, which may include objects from arbitrarily many individual packs. If you’re curious to learn more about how multi-pack bitmaps work, you can read our earlier post Scaling monorepo maintenance.

However, like cruft packs above, multi-pack indexes can be cumbersome to update as your repository grows larger, since each update requires rewriting the entire multi-pack index and its corresponding bitmap, regardless of how many objects or packs are being added. In Git 2.47, the file format for multi-pack indexes became incremental, allowing multiple multi-pack index layers to be layered on top of one another forming a chain of MIDXs. This made it much easier to add objects to your repository’s MIDX, but the incremental MIDX format at the time did not yet have support for multi-pack bitmaps.

Git 2.50 brings support for the multi-pack reachability format to incremental MIDX chains, with each MIDX layer having its own *.bitmap file. These bitmap layers can be used in conjunction with one another to provide reachability information about selected commits at any layer of the MIDX chain. In effect, this allows extremely large repositories to quickly and efficiently add new reachability bitmaps as new commits are pushed to the repository, regardless of how large the repository is.

This feature is still considered highly experimental, and support for repacking objects into incremental multi-pack indexes and bitmaps is still fairly bare-bones. This is an active area of development, so we’ll make sure to cover any notable developments to incremental multi-pack reachability bitmaps in this series in the future.

[source]

The ORT merge engine replaces recursive

This release also saw some exciting updates related to merging. Way back when Git 2.33 was released, we talked about a new merge engine called “ORT” (standing for “Ostensibly Recursive’s Twin”).

ORT is a from-scratch rewrite of Git’s old merging engine, called “recursive.” ORT is significantly faster, more maintainable, and has many new features that were difficult to implement on top of its predecessor.

One of those features is the ability for Git to determine whether or not two things are mergeable without actually persisting any new objects necessary to construct the merge in the repository. Previously, the only way to tell whether two things were mergeable was to run git merge-tree --write-tree on them. That works, but in doing so merge-tree writes any new objects generated by the merge into the repository. Over time, these can accumulate and cause performance issues. In Git 2.50, you can make the same determination without writing any new objects by using merge-tree’s new --quiet mode and relying on its exit code.

Most exciting in this release is that ORT has entirely superseded recursive, and recursive is no longer part of Git’s source code. When ORT was first introduced, it was only accessible through git merge’s -s option to select a strategy. In Git 2.34, ORT became the default choice over recursive, though the latter was still available in case there were bugs or behavior differences between the two. Now, 16 versions and three and a half years later, recursive has been completely removed from Git, with its author, Elijah Newren, writing:

As a wise man once told me, “Deleted code is debugged code!”

As of Git 2.50, recursive has been completely ~~debugged~~ deleted. For more about ORT’s internals and its development, check out this five-part series from Elijah here, here, here, here, and here.

[source, source, source]


  • If you’ve ever scripted around your repository’s objects, you are likely familiar with git cat-file, Git’s purpose-built tool to list objects and print their contents. git cat-file has many modes, like --batch (for printing out the contents of objects), or --batch-check (for printing out certain information about objects without printing their contents).

    Oftentimes it is useful to dump the set of all objects of a certain type in your repository. For commits, git rev-list can easily enumerate a set of commits. But what about, say, trees? In the past, to filter down to just the tree objects from a list of objects, you might have written something like:

    $ git cat-file --batch-check='%(objecttype) %(objectname)' \
        --buffer <in | perl -ne 'print "$1\n" if /^tree ([0-9a-f]+)/'

    Git 2.50 brings Git’s object filtering mechanism used in partial clones to git cat-file, so the above can be rewritten a little more concisely like:

    $ git cat-file --batch-check='%(objectname)' --filter='object:type=tree' <in

    [source]

  • While we’re on the topic, let’s discuss a little-known git cat-file command-line option: --allow-unknown-type. This arcane option was used with objects that have a type other than blob, tree, commit, or tag. This is a quirk dating back a little more than a decade that allows git hash-object to write objects with arbitrary types. In the time since, this feature has gotten very little use. In fact, git cat-file -p --allow-unknown-type can’t even print out the contents of one of these objects!

    $ oid="$(git hash-object -w -t notatype --literally /dev/null)"
    $ git cat-file -p $oid
    fatal: invalid object type
    

    This release makes the --allow-unknown-type option silently do nothing, and removes support from git hash-object to write objects with unknown types in the first place.

    [source]

  • The git maintenance command learned a number of new tricks this release as well. It can now perform a few new kinds of tasks, like worktree-prune, rerere-gc, and reflog-expire. worktree-prune mirrors git gc’s functionality to remove stale or broken Git worktrees. rerere-gc also mirrors existing functionality exposed via git gc to expire old rerere entries from previously recorded merge conflict resolutions. Finally, reflog-expire can be used to expire stale reflog entries.

    git maintenance also ships with new configuration for the existing loose-objects task. This task removes lingering loose objects that have since been packed away, and then makes new pack(s) for any loose objects that remain. The number of objects in those packs was previously capped at 50,000, and can now be configured by the maintenance.loose-objects.batchSize configuration.
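
    A sketch of invoking the new tasks by hand (the repository name and the batch size are illustrative values; requires Git 2.50+ for these task names):

```shell
# Demo: run the new maintenance tasks on demand in a throwaway repository.
git init -q maint-demo && cd maint-demo
git maintenance run --task=worktree-prune
git maintenance run --task=rerere-gc
git maintenance run --task=reflog-expire

# Raise the loose-objects batch size from its previously fixed 50,000:
git config maintenance.loose-objects.batchSize 100000
```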

    [source, source, source]

  • If you’ve ever needed to recover some work you lost, you may be familiar with Git’s reflog feature, which allows you to track changes to a reference over time. For example, you can go back and revisit earlier versions of your repository’s main branch by doing git show main@{2} (to show main prior to the two most recent updates) or main@{1.week.ago} (to show where your copy of the branch was at a week ago).

    Reflog entries can accumulate over time, and you can reach for git reflog expire in the event you need to clean them up. But how do you delete the entirety of a branch’s reflog? If you’re not yet running Git 2.50 and thought “surely it’s git reflog delete”, you’d be wrong! Prior to Git 2.50, the only way to drop a branch’s entire reflog was to do git reflog expire $BRANCH --expire=all.

    In Git 2.50, a new drop sub-command was introduced, so you can accomplish the same as above with the much more natural git reflog drop $BRANCH.

    [source]

  • Speaking of references, Git 2.50 also received some attention to how references are processed and used throughout its codebase. When using the low-level git update-ref command, Git used to spend time checking whether or not the proposed refname could also be mistaken for a valid object ID, which would make lookups of it ambiguous. Since update-ref is such a low-level command, this check is no longer done, delivering some performance benefits to higher-level commands that rely on update-ref for their functionality.

    Git 2.50 also learned how to cache whether or not any prefix of a proposed reference name already exists (for example, you can’t create a reference refs/heads/foo/bar/baz if either refs/heads/foo/bar or refs/heads/foo already exists).

    Finally, in order to make those checks, Git used to create a new reference iterator for each individual prefix. Git 2.50’s reference backends learned how to “seek” existing iterators, saving time by being able to reuse the same iterator when checking each possible prefix.

    [source]

  • If you’ve ever had to tinker with Git’s low-level curl configuration, you may be familiar with Git’s configuration options for tuning HTTP connections, like http.lowSpeedLimit and http.lowSpeedTime which are used to terminate an HTTP connection that is transferring data too slowly.

    These options can be useful when fine-tuning Git to work in complex networking environments. But what if you want to tweak Git’s TCP Keepalive behavior? This can be useful to control when and how often to send keepalive probes, as well as how many to send, before terminating a connection that hasn’t sent data recently.

    Prior to Git 2.50, this wasn’t possible, but this version introduces three new configuration options: http.keepAliveIdle, http.keepAliveInterval, and http.keepAliveCount which can be used to control the fine-grained behavior of curl’s TCP probing (provided your operating system supports it).
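
    A sketch of the new knobs (the values and repository name are illustrative, not recommendations):

```shell
# Demo: tune TCP keepalive probing for Git's HTTP transport (Git 2.50+;
# the OS must support the underlying socket options for them to take effect).
git init -q http-demo && cd http-demo
git config http.keepAliveIdle 60      # start probing after 60s of idleness
git config http.keepAliveInterval 10  # probe every 10s thereafter
git config http.keepAliveCount 5      # give up after 5 unanswered probes
git config --get-regexp '^http\.keepalive'   # keys are stored lowercased
```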

    [source]

  • Git is famously portable and runs on a wide variety of operating systems and environments with very few dependencies. Over the years, various parts of Git have been written in Perl, including some commands like the original implementation of git add -i. These days, very few remaining Git commands are written in Perl.

    This version reduces Git’s usage of Perl by removing it as a dependency of the test suite and documentation toolchain. Many Perl one-liners from Git’s test suite were rewritten to use shell functions or builtins, and some were rewritten as tiny C programs. For the handful of remaining hard dependencies on Perl, those tests will be skipped on systems that don’t have a working Perl.

    [source, source]

  • This release also shipped a minor cosmetic update to git rebase -i. When starting a rebase, your $EDITOR might appear with contents that look something like:

    pick c108101daa foo
    pick d2a0730acf bar
    pick e5291f9231 baz
    

    You can edit that list to break, reword, or exec (among many others), and Git will happily execute your rebase. But if you change the commit messages in your rebase’s TODO script, they won’t actually change!

    That’s because the commit messages shown in the TODO script are just meant to help you identify which commits you’re rebasing. (If you want to rewrite any commit messages along the way, you can use the reword command instead). To clarify that these messages are cosmetic, Git will now prefix them with a # comment character like so:

    pick c108101daa # foo
    pick d2a0730acf # bar
    pick e5291f9231 # baz
    

    [source]

  • Long time readers of this series will recall our coverage of Git’s bundle feature (when Git added support for partial bundles), though we haven’t covered Git’s bundle-uri feature. Git bundles are a way to package your repository’s contents, both its objects and the references that point at them, into a single *.bundle file.

    While Git has had support for bundles since as early as v1.5.1 (nearly 18 years ago!), its bundle-uri feature is much newer. In short, the bundle-uri feature allows a server to serve part of a clone by first directing the client to download a *.bundle file. After the client does so, it will try to perform a fill-in fetch to gather any missing data advertised by the server but not part of the bundle.

    To speed up this fill-in fetch, your Git client will advertise any references that it picked up from the *.bundle itself. But in previous versions of Git, this could sometimes result in slower clones overall! That’s because up until Git 2.50, Git would only advertise the branches in refs/heads/* when asking the server to send the remaining set of objects.

    Git 2.50 now advertises all references it knows about from the *.bundle when doing its fill-in fetch from the server, making bundle-uri-enabled clones much faster.

    For more details about these changes, you can check out this blog post from Scott Chacon.

    [source]

  • Last but not least, git add -p (and git add -i) now work much more smoothly in sparse checkouts by no longer having to expand the sparse index. This follows in a long line of work that has been gradually adding sparse-index compatibility to Git commands that interact with the index.

    Now you can interactively stage parts of your changes before committing in a sparse checkout without having to wait for Git to populate the sparsified parts of your repository’s index. Give it a whirl on your local sparse checkout today!

    [source]


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.50, or any previous version in the Git repository.

🎉 Git turned 20 this year! Celebrate by watching our interview of Linus Torvalds, where we discuss how it forever changed software development.

1 It’s never truly safe to remove an unreachable object from a Git repository that is accepting incoming writes, because marking an object as unreachable can race with incoming reference updates, pushes, etc. At GitHub, we use Git’s --expire-to feature (which we wrote about in our coverage of Git 2.39) in something we call “limbo repositories” to quickly recover objects that shouldn’t have been deleted, before deleting them for good. ↩️

The post Highlights from Git 2.50 appeared first on The GitHub Blog.

]]>
Git turns 20: A Q&A with Linus Torvalds https://github.blog/open-source/git/git-turns-20-a-qa-with-linus-torvalds/ Mon, 07 Apr 2025 22:58:14 +0000 https://github.blog/?p=86171 To celebrate two decades of Git, we sat down with Linus Torvalds—the creator of Git and Linux—to discuss how it forever changed software development.

The post Git turns 20: A Q&A with Linus Torvalds appeared first on The GitHub Blog.

]]>

Exactly twenty years ago, on April 7, 2005, Linus Torvalds made the very first commit to a new version control system called Git. Torvalds famously wrote Git in just 10 days after Linux kernel developers lost access to their proprietary tool, BitKeeper, due to licensing disagreements. In fact, in that first commit, he’d written enough of Git to use Git to make the commit!

Git’s unconventional and decentralized design—nowadays ubiquitous and seemingly obvious—was revolutionary at the time, and reshaped how software teams collaborate and develop. (To wit, GitHub!)

To celebrate two decades of Git, we sat down with Linus himself to revisit those early days, explore the key design decisions behind Git’s lasting success, and discuss how it forever changed software development.

Check out the transcript of our interview below, and watch the full video above.

The following transcript has been lightly edited for clarity.


Taylor Blau: It’s been 20 years, almost to the hour, since Git was self-hosted enough to write its initial commit. Did you expect to be sitting here 20 years later, still using it and talking about it?

Linus Torvalds: Still using it, yes. Maybe not talking about it. I mean, that has been one of the big surprises—basically how much it took over the whole SCM world. I saw it as a solution to my problems, and I obviously thought it was superior. Even literally 20 years ago to the day, I thought that first version, which was pretty raw—to be honest, even that version was superior to CVS.

But at the same time, I’d seen CVS just hold on to the market—I mean, SVN came around, but it’s just CVS in another guise, right?—for many, many decades. So I was like, okay, this market is very sticky. I can’t use CVS because I hate it with a passion, so I’ll do my own thing. I couldn’t use BitKeeper, obviously, anymore. So I was like, okay, I’ll do something that works for me, and I won’t care about anybody else. And really that showed in the first few months and years—people were complaining that it was kind of hard to use, not intuitive enough. And then something happened, like there was a switch that was thrown.

“I’ll do something that works for me, and I won’t care about anybody else.”

Well, you mentioned BitKeeper. Maybe we can talk about that.

Sure.

Pretty famously, you wrote the initial version of Git in around 10 or so days as a replacement for the kernel.

Yes and no. It was actually fewer than—well, it was about 10 days until I could use it for the kernel, yes. But to be fair, the whole process started like December or November the year before, so 2004.

What happened was BitKeeper had always worked fairly well for me. It wasn’t perfect, but it was light years ahead of anything else I’ve tried. But BitKeeper in the kernel community was always very, like, not entirely welcomed by the community because it was commercial. It was free for open source use because Larry McVoy, who I knew, really liked open source. I mean, at the same time, he was making a business around it and he wanted to sell BitKeeper to big companies. [It] not being open source and being used for one of the biggest open source projects around was kind of a sticking point for a lot of people. And it was for me, too.

I mean, to some degree I really wanted to use open source, but at the same time I’m very pragmatic and there was nothing open source that was even remotely good enough. So I was kind of hoping that something would come up that would be better. But what did come up was that Tridge in Australia basically reverse engineered BitKeeper, which wasn’t that hard because BitKeeper internally was basically a good wrapper around SCCS, which goes back to the 60s. SCCS is almost worse than CVS.

But that was explicitly against the license rules for BitKeeper. BitKeeper was like, you can use this for open source, but you can’t reverse engineer it. And you can’t try to clone BitKeeper. And that made for huge issues. And this was all in private, so I was talking to Larry and I was emailing with Tridge and we were trying to come up with a solution, but Tridge and Larry were really on completely opposite ends of the spectrum and there was no solution coming up.

So by the time I started writing Git, I had actually been thinking about the issue for four months and thinking about what worked for me and thinking about “How do I do something that does even better than BitKeeper does but doesn’t do it the way BitKeeper does it?” I did not want to be in the situation where Larry would say, “Hey, you did the one thing you were not supposed to do.”

“…how do I do something that does even better than BitKeeper does, but doesn’t do it the way BitKeeper does it.”

So yes, the writing part was maybe 10 days until I started using Git for the kernel, but there was a lot of mental going over what the ideas should be.

I want to talk about maybe both of those things. We can start with that 10-day period. So as I understand it, you had taken that period as a time away from the kernel and had mostly focused on Git in isolation. What was that transition like for you to just be working on Git and not thinking about the kernel?

Well, since it was only two weeks, it ended up being that way. It wasn’t actually a huge deal. I’d done things like that just for—I’ve been on, like in the last 35 years, I’ve been on vacation a couple of times, right, not very many times. But I have been away from the kernel for two weeks at a time before.

And it was kind of interesting because it was—one of my reactions was how much easier it is to do programming in the userspace. There’s so much less you need to care about. You don’t need to worry about memory allocations. You don’t need to worry about a lot of things. And debugging is so much easier when you have all this infrastructure that you’re writing when you’re doing a kernel.

So it was actually somewhat—I mean, I wouldn’t say relaxing, but it was fun to do something userspace-y where I had a fairly clear goal of what I wanted. I mean, a clear goal in the sense I knew the direction. I didn’t know the details.

One of the things I find so interesting about Git, especially 20 years on, is it’s so… the development model that it encourages, to me, seems so simple that it’s almost obvious at this point. But I don’t say that as a reductive term. I think there must have been quite a lot of thought into distilling down from the universe of source control ideas down into something that became Git. Tell me, what were the non-obvious choices you made at the time?

The fact that you say it’s obvious now, I think it wasn’t obvious at the time. I think one of the reasons people found Git to be very hard to use was that most people who started without using Git were coming from a background of something CVS like. And the Git mindset, I came at it from a file system person’s standpoint, where I had this disdain and almost hatred of most source control management projects, so I was not at all interested in maintaining the status quo.

And like the biggest issue for me—well, there were two huge issues. One was performance—back then I still applied a lot of patches, which I mean, Git has made almost go away because now I just merge other people’s code.

But for me, one of the goals was that I could apply a patch series in basically half a minute, even when it was like 50, 100 patches.

You shouldn’t need a coffee to…

Exactly. And that was important to me because it’s actually a quality-of-life thing. It’s one of those things where if things are just instant, some mistake happens, you see the result immediately and you just go on and you fix it. And some of the other projects I had been looking at took like half a minute per patch, which was not acceptable to me. And that was because the kernel is a very large project and a lot of these SCMs were not designed to be scalable.

“And that was important to me because it’s actually a quality-of-life thing.”

So that was one of the issues. But one of the issues really was, I knew I needed it to be distributed, but it needed to be really, really stable. And people kind of think that using the SHA-1 hashes was a huge mistake. But to me, SHA-1 hashes were never about the security. It was about finding corruption.

Because we’d actually had some of that during the BitKeeper things, where BitKeeper used CRCs and MD5s, right, but didn’t use it for everything. So one of the early designs for me was absolutely everything was protected by a really good hash.

And that kind of drove the whole project—having two or three really fundamental design ideas. Which is why, at a low level, it is actually fairly simple, and the complexities are in the details and the user interfaces and in all the things it has to be able to do—because everybody wants it to do crazy things. But having a low-level design that has a few core concepts made it easier to write, much easier to think about, and also, to some degree, to explain to people what the ideas are.

And I kind of compare it to Unix. Unix has like a core philosophy of everything is a process, everything is a file, you pipe things between things. And then the reality is it’s not actually simple. I mean, there’s the simple concepts that underlie the philosophy, but then all the details are very complicated.

I think that’s what made me appreciate Unix in the first place. And I think Git has some of the same kind of, there’s a fundamental core simplicity to the design and then there’s the complexity of implementation.

There’s a through line from Unix into the way that Git was designed.

Yes.

You mentioned SHA-1. One of the things that I think about in this week or two where you were developing the first version of Git is you made a lot of decisions that have stuck with us.

Yeah.

Were there any, including SHA-1 or not, that you regretted or wish you had done differently?

Well, I mean, SHA-1 I regret in the sense that I think it caused a lot of pointless churn with the whole “trying to support SHA-256 as well as SHA-1.” And I understand why it happened, but I do think it was mostly pointless.

I don’t think there was a huge, real need for it, but people were worried, so it happened. So I think there’s a lot of wasted effort there. There’s a number of other small issues. I think I made a mistake in how the index file entries are sorted. I think there’s these stupid details that made things harder than they should be.

But at the same time, many of those things could be fixed, but they’re small enough. It doesn’t really matter. All the complexities are elsewhere in the end.

So it sounds like you have few regrets. I think that’s good. Were there any moments where you weren’t sure that what you were trying to achieve was going to work or come together or be usable? Or did you already have a pretty clear idea?

I had a clear idea of the initial stages but I wasn’t sure how it would work in the long run. So honestly, after the first week, I had something that was good for applying patches, but not so much for everything else. I had the basics for doing merges, and the data structures were in place for that, but it actually took, I think it took an additional week before I did my first merge.

There were a number of things where I had kind of the big picture and result in mind, but I wasn’t sure if I’d get there. Yeah, the first steps, I mean the first week or two, I mean, you can go and look at the code—and people have—and it is not complicated code.

No.

I think the first version was 10,000 lines or something.

You can more or less read it in a single sitting.

Yeah, and it’s fairly straightforward and doesn’t do a lot of error checking and stuff like that. It’s really a, “Let’s get this working because I have another project that I consider to be more important than I need to get back to.” It really was. It happened where I would hit issues that required me to do some changes.

“There were a number of things where I had kind of the big picture and result in mind, but I wasn’t sure if I’d get there.”

The first version—I think we ended up doing a backwards incompatible object store transfer at one point. At least fsck complains about some of the old objects we had because I changed the data format.

I didn’t know where that came from.

Yeah, no. The first version just was not doing everything it needed to do.

And I forget if I actually did a conversion or not. I may not have ever needed to convert. And we just have a few warnings for a few objects in the kernel where fsck will say, “Hey, this is an old, no longer supported format.” That kind of thing. But on the other, on the whole, it really worked, I mean, surprisingly well.

The big issue was always people’s acceptance of it.

Right.

And that took a long time.

“But on the other, on the whole, it really worked, I mean, surprisingly well.”

Well, we talked a little bit about how merging was put in place but not functional until maybe week two or week three. What were the other features that you left out of the initial version that you later realized were actually quite essential to the project?

Well, it wasn’t so much “later realized,” it was stuff that I didn’t care about, but I knew that if this is going to go anywhere, somebody else will. I mean, the first week when I was using it for the kernel, I was literally using the raw, what are now called “plumbing commands” by hand.

Of course.

Because there was no so-called porcelain. There was nothing above that to make it usable. So to make a commit, you’d do these very arcane things.

Set your index, commit-tree.

Yeah, commit-tree, write, and that just returns an SHA that you write by hand into the head file and that was it.
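That plumbing-only workflow still works in modern Git. Here is a rough sketch of it (file name, message, and directory are illustrative; modern Git updates the branch ref via `update-ref` rather than writing a raw head file by hand):

```shell
# Create a commit using only plumbing commands, no porcelain.
git init -q demo && cd demo
git config user.email you@example.com   # commit-tree needs an identity
git config user.name "You"

echo 'hello' > README

# 1. Stage the file directly into the index.
git update-index --add README

# 2. Serialize the index as a tree object; prints the tree's hash.
tree=$(git write-tree)

# 3. Create a commit object pointing at that tree; prints the commit's hash.
commit=$(echo 'initial commit' | git commit-tree "$tree")

# 4. Point the current branch at the new commit "by hand".
git update-ref HEAD "$commit"
```

After the last step, porcelain commands like `git log` see the commit as if it had been made with `git commit`.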

Did hash-object exist in the first version?

I think that was one of the first binaries that I had where I could just check that I could hash everything by hand and it would return the hash to standard out, then you could do whatever you wanted to it. But it was like the early porcelain was me scripting shell scripts around these very hard-to-use things.

And honestly, it wasn’t easy to use even with my shell scripts.

But to be fair, the first initial target audience for this were pretty hardcore kernel people who had been using BitKeeper. They at least knew a lot of the concepts I was aiming for. People picked it up.

It didn’t take that long before some other kernel developers started actually using it. I was actually surprised by how quickly some source control people started coming in. And I started getting patches from the outside within days of making the first Git version public.

I want to move forward a bit. You made the decision to hand off maintainership to Junio pretty early on in the project. I wonder if you could tell me a little bit about what it’s been like to watch him run the project and really watch the community interact with it at a little bit of a distance after all these years?

I mean, to be honest, I maintained Git for like three or four months. I think I handed it off in August [of 2005] or something like that.

And when I handed it off, I truly just handed it off. I was like, “I’m still around.” I was still reading the Git mailing list, which I don’t do anymore. Junio wanted to make sure that if he asked me anything, I’d be okay.

But at the same time, I was like, this is not what I want to do. I mean, this is… I still feel silly. My oldest daughter went off to college, and two months later, she sends this text to me and says that I’m more well-known at the computer science lab for Git than for Linux because they actually use Git for everything there. And I was like, Git was never a big thing for me. Git was an “I need to get this done to do the kernel.” And it’s kind of ridiculous that, yes, I used four months of my life maintaining it, but now, 20 years later…

Yes, you should definitely talk to Junio, not to me because he’s been doing a great job and I’m very happy it worked out so well. But to be honest I’ll take credit for having worked with people on the internet for long enough that I was like—during the four months I was maintaining Git, I was pretty good at picking up who has got the good taste to be a good maintainer.

“My oldest daughter went off to college, and two months later, she sends this text to me and says that I’m more well known at the computer science lab for Git than for Linux because they actually use Git for everything there.”

That’s what it’s about—taste—for you.

For me, it’s hard to describe. You can see it in patches, you can see it in how they react to other people’s code, “how they think” kind of things. Junio was not the first person in the project, but he was one of the early ones that was around from pretty much week one after I had made it public.

So he was one of the early persons—but it wasn’t like you’re the first one, tag you’re it. It was more like okay, I have now seen this person work for three months and I don’t want to maintain this project. I will ask him if he wants to be the maintainer. I think he was a bit nervous at first, but it really has been working out.

Yeah, he’s certainly run the project very admirably in the…

Yeah, I mean, so taste is to me very important, but practically speaking, the fact that you stick around with a project for 20 years, that’s the even more important part, right? And he has.

I think he’s knowledgeable about almost every area of the tree to a surprising degree.

Okay, so we’ve talked a lot about early Git. I want to talk a little bit about the middle period of Git maybe, or maybe even the period we’re in now.

One of the things that I find so interesting about the tool, given how ubiquitous it’s become, it’s clearly been effective at aiding the kernel’s development, but it’s also been really effective for university students writing little class projects on their laptops. What do you think was unique about Git that made it effective at both extremes of the software engineering spectrum?

So the distributed nature really ends up making so many things so easy and that was one big part that set Git apart from pretty much all SCMs before, was… I mean there had been distributed SCMs, but there had, as far as I know, never been something where it was like the number one design goal—I mean, along with the other number one design goals—where it means that you can work with Git purely locally and then later if you want to make it available in any other place it’s so easy.

And that’s very different from, say, CVS where you have to set up this kind of repository and if you ever want to move it anywhere else it’s just very very painful and you can’t share it with somebody else without losing track of it.

Or there’s always going to be one special repository when you’re using a traditional SCM and the fact that Git didn’t do that, and very much by design didn’t do that, I mean that’s what made services like GitHub trivial. I mean I’m trivializing GitHub because I realized there’s a lot of work in making all the infrastructure around Git, but at the same time the basic Git hosting site is basically nothing because the whole design of Git is designed around making it easy to copy, and every repository is the same and equal.

And I think that ended up being what made it so easy to then use as an individual developer. When you make a new Git repository, it’s not a big deal. It’s like you do git init and you’re done. And you don’t need to set up any infrastructure and you don’t need to do any of the stuff that you traditionally needed to do with an SCM. And then if that project ever grows to be something where you decide, “Oh, maybe I want other people to work with it,” that works too. And again, you don’t have to do anything about it. You just push it to GitHub and again, you’re done.

That was something I very much wanted. I didn’t realize how many other people wanted it, too. I thought people were happy with CVS and SVN. Well, I didn’t really think that, but I thought they were sufficient for most people, let’s put it that way.

I’ve lived my whole life with version control as part of software development, and one of the things I’m curious about is how you see Git’s role in shaping how software development gets done today.

That’s too big of a question for me. I don’t know. It wasn’t why I wrote Git. I wrote it for my own issues.

I think GitHub and the other hosting services have made it clear how easy it is now to make all these random small projects in ways that it didn’t used to be. And that has resulted in a lot of dead projects too. You find these one-off things where somebody did something and left it behind and it’s still there.

But does that really change how software development is done in the big picture? I don’t know. I mean, it changes the details. It makes collaboration easier to some degree. It makes it easier to do these throwaway projects. And if they don’t work, they don’t work. And if they do work, now you can work together with other people. But I’m not sure it changed anything fundamentally in software development.

“It makes collaboration easier to some degree.”

Moving ahead a little bit, modern software development has never been changing faster than it is today…

Are you going to say the AI word?

I’m not going to say the AI word, unless you want me to.

No, no, no.

…what are some of the areas of the tool that you think have evolved or maybe still need to evolve to continue to support the new and demanding workflows that people are using it for?

I’d love to see more bug tracking stuff. I mean, everybody is doing that. I mean, there are, whether you call it bug tracking or issues or whatever you want to call it, I’d love to see that be more unified. Because right now it’s very fragmented where every single hosting site does their own version of it.

And I understand why they do it. A, there is no kind of standard good base. And B, it’s also a way to do the value add and keep people in that ecosystem even when Git itself means that it’s really easy to move the code.

But I do wish there was a more unified thing where bug tracking and issues in general would be something that would be more shared among the hosting sites.

You mentioned earlier that it’s at least been a while since you regularly followed the mailing list.

Yeah.

In fact, it’s been a little bit of time since you even committed to the project. I think by my count, August of 2022 was the last time…

Yeah, I have a few experimental patches in my tree that I just keep around. So these days I do a pull of the Git sources and I have, I think, four or five patches that I use myself. And I think I’ve posted a couple of them to the Git mailing list, but they’re not very important. They’re like details that tend to be very specific to my workflow.

But honestly, I mean, this is true of the Linux kernel, too. I’ve been doing Linux for 35 years, and it did everything I needed in the first year—right? And the thing that keeps me going on the kernel side is, A, hardware keeps evolving, and a kernel needs to evolve with that, of course. But B, it’s all the needs of other people. Never in my life would I need all of the features that the kernel does. But I’m interested in kernels, and I’m still doing that 35 years later.

When it came to Git, it was like Git did what I needed within the first year. In fact, mostly within the first few months. And when it did what I needed, I lost interest. Because when it comes to kernels, I’m really interested in how they work, and this is what I do. But when it comes to SCMs, it’s like—yeah, I’m not at all interested.

“When it came to Git, it was like Git did what I needed within the first year. In fact, mostly within the first few months.”

Have there been any features that you’ve followed in the past handful of years from the project that you found interesting?

I liked how the merge strategies got slightly smarter. I liked how some of the scripts were finally rewritten in C just to make them faster, because even though I don’t apply, like, 100 patch series anymore, I do end up doing things like rebasing for test trees and stuff like that and having some of the performance improvements.

But then, I mean, those are fairly small implementation details in the end. They’re not the kind of big changes that, I mean—I think the biggest change that I was still tracking a few years ago was all the multiple hashes thing, which really looks very painful to me.

Have there been any tools in the ecosystem that you’ve used alongside? I mean, I’m a huge tig user myself. I don’t know if you’ve ever used this.

I never—no, even early on when we had, like when Git was really hard to use and they were like these add-on UIs, the only wrapper around Git I ever used was gitk. And that was obviously integrated into Git fairly quickly, right? But I still use the entire command language. I don’t use any of the editor integration stuff. I don’t do any of that because my editor is too stupid to integrate with anything, much less Git.

I mean, I occasionally do statistics on my Git history usage just because I’m like, “What commands do I use?” And it turns out I use five Git commands. And git merge and git blame and git log are three of them, pretty much. So, I’m a very casual user of Git in that sense.

I have to ask about what the other two are.

I mean obviously git commit and git pull. I did this top five thing at some point and it may have changed, but there’s not a lot of—I do have a few scripts that then do use git rev-list and go really low and do statistics for the project…
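For anyone curious, a similar tally can be run against your own shell history. This is a rough sketch, assuming bash and its default history file; adjust the path and pattern for your shell:

```shell
# Count which git subcommands appear most often in your shell history.
grep -oE '^git [a-z-]+' ~/.bash_history | sort | uniq -c | sort -rn | head -5
```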

In terms of your interaction with the project, what do you feel like have been some of the features in the project either from early on or in the time since that maybe haven’t gotten the appreciation they deserve?

I mean Git has gotten so much more appreciation than it deserves. But that’s the reverse of what I would ask me. A big thing for me was when people actually started appreciating what Git could do instead of complaining about how different it was.

And that, I mean, that was several years after the initial Git. I think it was these strange web developers who started using Git in a big way. It’s like Ruby on Rails, I think. Which I had no idea, I still don’t know what Ruby even is. But the Ruby on Rails people started using Git sometime in 2008, something like this.

It was strange because it brought in a completely new kind of Git user—at least one that I hadn’t seen before. It must have existed in the background; it just made it very obvious that suddenly you had all these young people who had never used SCM in their life before and Git was the first thing they ever used and it was what the project they were using was using, so it was kind of the default thing.

And I think it changed the dynamics. When you didn’t have these old timers who had used a very different SCM their whole life, and suddenly you had young people who had never seen anything else and appreciated it, and instead of saying, “Git is so hard,” I started seeing these people who were complaining about “How do I do this when this old project is in CVS?” So, that was funny.

But yeah, no. The fact that people are appreciating Git, I mean, way more than I ever thought. Especially considering the first few years when I got a lot of hate for it.

Really?

Oh, the complaints kept coming.

Tell me about it.

Oh, I mean, it’s more like I can’t point to details. You’d have to Google it. But the number of people who sent me, “Why does it do this?” And the flame wars over my choice of names. For example, I didn’t have git status, which actually is one of the commands I use fairly regularly now.

It’s in the top five?

It’s probably not in the top five, but it’s still something fairly common. I don’t think I’d ever used it with CVS because it was so slow.

And people had all these expectations. So I just remember the first few years, the complaints about why the names of the subcommands are different for no good reason. And the main reason was I just didn’t like CVS very much, so I did things differently on purpose sometimes.

And the shift literally between 2007 and 2010—those years, when people went from complaining about how hard Git was to use to really appreciating some of the power of Git, was interesting to me.

I want to spend maybe just a moment thinking about the future of the project. In your mind, what are the biggest challenges that Git either is facing or will face?

I don’t even know. I mean, it has just been so much more successful than I ever… I mean, the statistics are insane. It went from use for the kernel and a couple of other projects to being fairly popular to now being like 98% of the SCMs used. I mean, that’s a number I saw in some report from last year.

So, I mean, it’s—I don’t know how true that is, but it’s like big. And in that sense, I wouldn’t worry about challenges because I think SCMs, there is a very strong network effect. And that’s probably why, once it took off, it took off in a big way. Just when every other project is using Git, by default, all the new projects will use Git, too. Because the pain of having two different SCMs for two different projects to work on is just not worth it.

So I would not see that as a challenge for Git as much as I would see it as a challenge for anybody else who thinks they have something better. And honestly, because Git does everything that I need, the challenges would likely come from new users.

I mean, we saw some of that. We saw some of that with people who used Git in ways that explicitly were things I consider to be the wrong approach. Like Microsoft, the monorepo for everything, which showed scalability issues. I’m not saying Microsoft was wrong to do that. I’m saying this is literally what Git was not designed to do.

I assume most of those problems have been solved because I’m not seeing any complaints, but at the same time I’m not following the Git mailing list as much as I used to.

I don’t even know if the large file issue is considered to be solved. If you want to put a DVD image in Git, that was like, why would you ever want to do that?

But, I mean, that’s the challenge. When Git is everywhere, you find all these people who do strange things that you would never imagine—that I didn’t imagine and that I consider to be actively wrong.

But hey, I mean, that’s a personal opinion. Clearly other people have very different personal opinions. So that’s always a challenge. I mean, that’s something I see in the kernel, too, where I go, why the hell are you doing that? I mean, that shouldn’t work, but you’re clearly doing it.

“When Git is everywhere, you find all these people who do strange things that you would never imagine—that I didn’t imagine and that I consider to be actively wrong.”

We talked about how Git is obviously a huge dominant component in software development. At the same time, there are new version control upstarts that seem to pop up. Pijul comes to mind, Jujutsu, Piper, and things like that. I’m curious if you’ve ever tried any of them.

No, I don’t. I mean, literally, since I came from this, from being completely uninterested in source control, why would I look at alternatives now that I have something that works for me?

I really came into Git not liking source control, and now I don’t hate it anymore. And I think that databases are my particular—like, that’s the most boring-thing-in-life thing. But SCMs still haven’t been something I’m really interested in.

“I really came into Git not liking source control, and now I don’t hate it anymore.”

You’ve given me a little bit of an end to my last question for you. So on schedule, Linux came about 34 years ago, Git 20…

Oh, that question.

And so we’re maybe five or so years overdue for the next big thing.

No, no, I see it the other way around. All the projects that I’ve had to make, I had to make because I couldn’t find anything better that somebody else did.

But I much prefer other people solving my problems for me. So me having to come up with a project is actually a failure of the world—and the world just hasn’t failed in the last 20 years for me.

I started doing Linux because I needed an operating system and there was nothing that suited my needs. I started doing Git for the same reason. And there hasn’t been any… I started Subsurface, which is my divelog, well, no longer my divelog software, but that was so specialized that it never took off in a big way. And that solved one particular problem, but my computer use is actually so limited that I think I’ve solved all the problems.

Part of it is probably, I’ve been doing it so long that I can only do things in certain ways. I’m still using the same editor that I used when I was in college because my fingers have learned one thing and there’s no going back. And I know the editor is crap and I maintain it because it’s a dead project that nobody else uses.

“But I much prefer other people solving my problems for me. So me having to come up with a project is actually a failure of the world—and the world just hasn’t failed in the last 20 years for me.”

So, I have a source tree and I compile my own version every time I install a new machine and I would suggest nobody ever use that editor but I can’t. I’ve tried multiple times finding an editor that is more modern and does fancy things like colorize my source code and do things like that. And every time I try it, I’m like, “Yeah, these hands are too old for this.” So I really hope there’s no project that comes along that makes me go, “I have to do this.”

Well, on that note.

On that note.

Thank you for 20 years of Git.

Well, hey, I did it for my own very selfish reasons. And really—I mean, this is the point to say again that yes, out of the 20 years, I spent four months on it. So really, all the credit goes to Junio and all the other people who are involved in Git that have by now done so much more than I ever did.

In any event, thank you.

The post Git turns 20: A Q&A with Linus Torvalds appeared first on The GitHub Blog.

Highlights from Git 2.49 https://github.blog/open-source/git/highlights-from-git-2-49/ Fri, 14 Mar 2025 17:19:46 +0000 https://github.blog/?p=83226 The open source Git project just released Git 2.49. Here is GitHub’s look at some of the most interesting features and changes introduced since last time.

The post Highlights from Git 2.49 appeared first on The GitHub Blog.


The open source Git project just released Git 2.49 with features and bug fixes from over 89 contributors, 24 of them new. We last caught up with you on the latest in Git back when 2.48 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Faster packing with name-hash v2

Many times over this series of blog posts, we have talked about Git’s object storage model, where objects can be written individually (known as “loose” objects), or grouped together in packfiles. Git uses packfiles in a wide variety of functions, including local storage (when you repack or GC your repository), as well as when sending data to or from another Git repository (like fetching, cloning, or pushing).

Storing objects together in packfiles has a couple of benefits over storing them individually as loose. One obvious benefit is that object lookups can be performed much more quickly in pack storage. When looking up a loose object, Git has to make multiple system calls to find the object you’re looking for, open it, read it, and close it. These system calls can be made faster using the operating system’s block cache, but because objects are looked up by a SHA-1 (or SHA-256) of their contents, this pseudo-random access isn’t very cache-efficient.

But most interesting to our discussion is that since loose objects are stored individually, we can only compress their contents in isolation, and can’t store objects as deltas of other similar objects that already exist in your repository. For example, say you’re making a series of small changes to a large blob in your repository. When those objects are initially written, they are each stored individually and zlib compressed. But if the majority of the file’s content remains unchanged among edit pairs, Git can further compress these objects by storing successive versions as deltas of earlier ones. Roughly speaking, this allows Git to store the changes made to an object (relative to some other object) instead of multiple copies of nearly identical blobs.
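You can inspect this delta layout in any packed repository. A rough sketch (the glob matches whatever pack `git gc` produced):

```shell
# In a packed repository: delta-stored objects list a smaller size-in-pack,
# a delta-chain depth, and the hash of their base object on each line.
git gc -q
git verify-pack -v .git/objects/pack/pack-*.idx | head -20
```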

But how does Git figure out which pairs of objects are good candidates to store as delta-base pairs? One useful proxy is to compare objects that appear at similar paths. Git does this today by computing what it calls a “name hash”, which is effectively a sortable numeric hash that weights more heavily towards the final 16 non-whitespace characters in a filepath (source). This function comes from Linus all the way back in 2006, and excels at grouping files with similar extensions (all ending in .c, .h, etc.), or files that were moved from one directory to another (a/foo.txt to b/foo.txt).
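The v1 function is tiny. The sketch below mirrors the behavior described above (it closely follows Git's `pack_name_hash()`, but treat it as illustrative rather than authoritative): whitespace is skipped, and each new character shifts all earlier contributions two bits to the right, so roughly only the final 16 characters influence the result.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of Git's v1 pack name hash (illustrative). Each character
 * pushes earlier contributions two bits to the right, so the final
 * ~16 non-whitespace characters dominate, grouping "a/foo.txt" near
 * "b/foo.txt" and all "*.c" files near one another.
 */
uint32_t pack_name_hash_v1(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```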

But the existing name-hash implementation can lead to poor compression when there are many files that have the same basename but very different contents, like having many CHANGELOG.md files for different subsystems stored together in your repository. Git 2.49 introduces a new variant of the hash function that takes more of the directory structure into account when computing its hash. Among other changes, each layer of the directory hierarchy gets its own hash, which is downshifted and then XORed into the overall hash. This creates a hash function which is more sensitive to the whole path, not just the final 16 characters.
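A sketch of the v2 idea follows. This is an illustration of the structure described above, not Git's exact `pack_name_hash_v2()` (the real shift constants may differ): each directory component is hashed separately, and each finished component is downshifted and XORed into a running "base" hash, so the whole path shapes the result rather than just its tail.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Illustrative sketch of the v2 name hash: every directory level gets
 * its own hash, which is downshifted and XORed into the running base
 * at each '/'. The exact shifts in Git's implementation may differ.
 */
uint32_t name_hash_v2_sketch(const char *name)
{
	uint32_t c, hash = 0, base = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		if (c == '/') {
			/* fold the finished directory component into the base */
			base = (base >> 6) ^ hash;
			hash = 0;
		} else {
			hash = (hash << 5) ^ (hash >> 27) ^ c;
		}
	}
	return base ^ hash;
}
```

Unlike v1, two CHANGELOG.md files in different directories now hash differently, so they are no longer forced into the same delta search window.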

This can lead to significant improvements in both packing performance and the resulting pack’s overall size. For instance, using the new hash function improved the time it took to repack microsoft/fluentui from ~96 seconds to ~34 seconds, and slimmed the resulting pack down from 439 MiB to just 160 MiB (source).

While this feature isn’t (yet) compatible with Git’s reachability bitmaps feature, you can try it out for yourself using the new --name-hash-version flag of either git repack or git pack-objects in the latest release.

[source]

Backfill historical blobs in partial clones

Have you ever been working in a partial clone and gotten this unfriendly output?

$ git blame README.md
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (1/1), 1.64 KiB | 8.10 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (1/1), 1.64 KiB | 7.30 MiB/s, done.
[...]

What happened here? To understand the answer to that question, let’s work through an example scenario:

Suppose that you are working in a partial clone that you cloned with --filter=blob:none. In this case, your repository is going to have all of its trees, commits, and annotated tag objects, but only the set of blobs which are immediately reachable from HEAD. Put otherwise, your local clone only has the set of blobs it needs to populate a full checkout at the latest revision, and reading any historical blob will fault in the missing object from wherever you cloned your repository.

In the above example, we asked for a blame of the file at path README.md. To construct that blame, however, Git needs every historical version of the file so that it can compute the diff at each revision and figure out whether that revision modified a given line. But here we see Git loading each historical version of the object one by one, leading to bloated storage and poor performance.

Git 2.49 introduces a new tool, git backfill, which can fault in any missing historical blobs from a --filter=blob:none clone in a small number of batches. These requests use the new path-walk API (also introduced in Git 2.49) to group together objects that appear at the same path, resulting in much better delta compression in the packfile(s) sent back from the server. Since these requests are sent in batches instead of one-by-one, we can easily backfill all missing blobs in only a few packs instead of one pack per blob.

After running git backfill in the above example, our experience looks more like:

$ git clone --sparse --filter=blob:none git@github.com:git/git.git
[...] # downloads historical commits/trees/tags
$ cd git
$ git sparse-checkout add builtin
[...] # downloads current contents of builtin/
$ git backfill --sparse
[...] # backfills historical contents of builtin/
$ git blame -- builtin/backfill.c
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000 1) /* We need this macro to access core_apply_sparse_checkout */
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000 2) #define USE_THE_REPOSITORY_VARIABLE
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000 3)
[...]

But running git backfill immediately after cloning a repository with --filter=blob:none doesn’t bring much benefit, since it would have been more convenient to simply clone the repository without an object filter enabled in the first place. When using the backfill command’s --sparse option (the default whenever the sparse checkout feature is enabled in your repository), Git will only download blobs that appear within your sparse checkout, avoiding objects that you wouldn’t checkout anyway.

To try it out, run git backfill in any --filter=blob:none clone of a repository using Git 2.49 today!

[source, source]


  • We discussed above that Git uses compression powered by zlib when writing loose objects, or individual objects within packs and so forth. zlib is an incredibly popular compression library, and has an emphasis on portability. Over the years, there have been a couple of popular forks (like intel/zlib and cloudflare/zlib) that contain optimizations not present in upstream zlib.

The zlib-ng fork merges many of the optimizations made in those forks, as well as removes dead code and workarounds for historical compilers from upstream zlib, placing a further emphasis on performance. For instance, zlib-ng has support for SIMD instruction sets (like SSE2, and AVX2) built into its core algorithms. Though zlib-ng is a drop-in replacement for zlib, the Git project needed to update its compatibility layer to accommodate zlib-ng.

In Git 2.49, you can now build Git with zlib-ng by passing ZLIB_NG when building with GNU Make, or the zlib_backend option when building with Meson. Early experimental results show a ~25% speed-up when printing the contents of all objects in the Git repository (from ~52.1 seconds down to ~40.3 seconds).

[source]

  • This release marks a major milestone in the Git project with the first pieces of Rust code being checked in. Specifically, this release introduces two Rust crates, libgit-sys and libgit, which are low- and high-level wrappers around a small portion of Git’s library code, respectively.

The Git project has long been evolving its code to be more library-oriented, doing things like replacing functions that exit the program with ones that return an integer and let the caller decide whether to exit, cleaning up memory leaks, etc. This release takes advantage of that work to provide a proof-of-concept Rust crate that wraps part of Git’s config.h API.

This isn’t a fully-featured wrapper around Git’s entire library interface, and there is still much more work to be done throughout the project before that can become a reality, but this is a very exciting step along the way.

[source]

  • Speaking of the “libification” effort, there were a handful of other related changes that went into this release. The ongoing effort to move away from global variables like the_repository continues, and many more commands in this release use the provided repository instead of using the global one.

This release also saw a lot of effort being put into squelching -Wsign-compare warnings, which occur when a signed value is compared against an unsigned one. Such comparisons can behave surprisingly, because the signed operand is implicitly converted to unsigned: a comparison like -1 < 2 (which should return true) ends up returning false instead.
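A two-line C example shows the trap that the -Wsign-compare cleanup guards against:

```c
/*
 * Demonstrates the -Wsign-compare hazard: when a signed value meets an
 * unsigned one, the usual arithmetic conversions turn the signed
 * operand into an unsigned value, so -1 becomes UINT_MAX and the
 * "obvious" comparison goes the wrong way.
 */
int signed_lt_unsigned(int a, unsigned int b)
{
	return a < b; /* a is implicitly converted to unsigned here */
}
```

A compiler invoked with -Wsign-compare will flag exactly this comparison, which is why the project spent effort eliminating such patterns.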

Hopefully you won’t notice these changes in your day-to-day use of Git, but they are important steps along the way to bringing the project closer to being able to be used as a standalone library.

[source, source, source, source, source]

  • Long-time readers might remember our coverage of Git 2.39 where we discussed git repack’s new --expire-to option. In case you’re new around here or could use a refresher, we’ve got you covered. The --expire-to option in git repack controls the behavior of unreachable objects which were pruned out of the repository. By default, pruned objects are simply deleted, but --expire-to allows you to move them off to the side in case you want to hold onto them for backup purposes, etc.

git repack is a fairly low-level command though, and most users will likely interact with Git’s garbage collection feature through git gc. In large part, git gc is a wrapper around functionality that is implemented in git repack, but up until this release, git gc didn’t expose its own command-line option to use --expire-to. That changed in Git 2.49, where you can now experiment with this behavior via git gc --expire-to!

[source]

  • You may have read that Git’s help.autocorrect feature is too fast for Formula One drivers. In case you haven’t, here are the details. If you’ve ever seen output like:
$ git psuh
git: 'psuh' is not a git command. See 'git --help'.

The most similar command is
push

…then you have used Git’s autocorrect feature. But its configuration options don’t quite match the convention of other, similar options. For instance, in other parts of Git, specifying values like “true”, “yes”, “on”, or “1” for boolean-valued settings all mean the same thing. But help.autocorrect deviates from that trend slightly: it has special meanings for “never”, “immediate”, and “prompt”, but interprets a numeric value to mean that Git should automatically run whatever command it suggests after waiting that many deciseconds.

So while you might have thought that setting help.autocorrect to “1” would enable the autocorrect behavior, you’d be wrong: it will instead run the corrected command before you can even blink your eyes[1]. Git 2.49 changes the convention of help.autocorrect to interpret “1” like other boolean-valued settings, and positive numbers greater than 1 as it would have before. While you can’t specify that you want the autocorrect behavior in exactly 1 decisecond anymore, you probably never meant to anyway.

[source, source]

  • You might be aware of git clone’s --branch option, which allows you to clone a repository’s history leading up to a specific branch or tag instead of the whole thing. It is often used by CI farms to clone a specific branch or tag for testing.

But what if you want to clone history leading up to a specific revision that isn’t pointed at by any branch or tag? Prior to Git 2.49, the only option was to initialize an empty repository, add the repository you’re fetching from as a remote, and then fetch the specific revision.

Git 2.49 introduces a convenient alternative to round out --branch’s functionality with a new --revision option, which fetches history leading up to the specified revision, regardless of whether or not there is a branch or tag pointing at it.

[source]

  • Speaking of remotes, you might know that the git remote command uses your repository’s configuration to store the list of remotes that it knows about. You might not know that there were actually two different mechanisms which preceded storing remotes in configuration files. In the very early days, remotes were configured via separate files in $GIT_DIR/branches (source). A couple of weeks later, the convention changed to use $GIT_DIR/remote instead of the /branches directory (source).

Both conventions have long since been deprecated and replaced with the configuration-based mechanism we’re familiar with today (source, source). But Git has maintained support for them over the years as part of its backwards compatibility. When Git 3.0 is eventually released, these features will be removed entirely.

If you want to learn more about Git’s upcoming breaking changes, you can read all about them in Documentation/BreakingChanges.adoc. If you really want to live on the bleeding edge, you can build Git with the WITH_BREAKING_CHANGES compile time switch, which compiles out features that will be removed in Git 3.0.

[source, source]

  • Last but not least, the Git project had two wonderful Outreachy interns that recently completed their projects! Usman Akinyemi worked on adding support to include uname information in Git’s user agent when making HTTP requests, and Seyi Kuforiji worked on converting more unit tests to use the Clar testing framework.

You can learn more about their projects here and here. Congratulations, Usman and Seyi!

[source, source, source, source]

🎉 Join us at Git Merge 2025

Graphic promoting Git Merge 2025

To celebrate Git’s 20th anniversary, we’re hosting Git Merge 2025 at GitHub HQ in San Francisco on September 29–30, 2025. It’s a conference dedicated to the version control tool that started it all—and the people who use it every day. The call for speakers is open until May 13, so if you’ve got a great Git story to tell, we’d love to hear it.

The full schedule will be available in July. See you there!

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.49, or any previous version in the Git repository.


  1. It’s true. It takes humans about 100-150 milliseconds to blink their eyes, and setting help.autocorrect to “1” will run the suggested command after waiting only 100 milliseconds (1 decisecond). 

The post Highlights from Git 2.49 appeared first on The GitHub Blog.

Git security vulnerabilities announced https://github.blog/open-source/git/git-security-vulnerabilities-announced-5/ Tue, 14 Jan 2025 18:04:36 +0000 https://github.blog/?p=82019 A new set of Git releases were published to address a variety of security vulnerabilities. All users are encouraged to upgrade. Take a look at GitHub’s view of the latest round of releases.

The post Git security vulnerabilities announced appeared first on The GitHub Blog.


Today, the Git project released new versions to address a pair of security vulnerabilities, CVE-2024-50349 and CVE-2024-52006, that affect all prior versions of Git.

CVE-2024-50349

When Git needs to fill in credentials interactively without the use of a credential helper, it prints out the hostname and asks the user to fill in the appropriate username/password pair for that host. However, Git prints out the hostname after URL-decoding it. This allows an attacker to craft URLs containing ANSI escape sequences that may be used to construct an intentionally misleading prompt. The attacker may then tweak the prompt to trick a user into providing credentials for a different Git host back to the attacker.

[source]

CVE-2024-52006

When using a credential helper (as opposed to asking the user for their credentials interactively as above), Git uses a line-based protocol to pass information between itself and the credential helper. A specially-crafted URL containing a carriage return can be used to inject unintended values into the protocol stream, causing the helper to retrieve the password for one server while sending it to another.

This vulnerability is related to CVE-2020-5260, but relies on behavior where single carriage return characters are interpreted by some credential helper implementations as newlines.

[source]

Upgrade to the latest Git version

The most effective way to protect against these vulnerabilities is to upgrade to Git 2.48.1. If you can’t upgrade immediately, reduce your risk by taking the following steps:

  • Avoid running git clone with --recurse-submodules against untrusted repositories.
  • Avoid using the credential helper by only cloning publicly available repositories.

In order to protect users against attacks related to these vulnerabilities, GitHub has taken proactive steps. Specifically, we have scheduled releases of GitHub Desktop (CVE-2025-23040), Git LFS (CVE-2024-53263), and Git Credential Manager (CVE-2024-50338) for today, January 14, that prevent these vulnerabilities from being exploited.

GitHub has also proactively patched our products that were affected by similar vulnerabilities, including GitHub Codespaces and the GitHub CLI.


CVE-2024-50349 and CVE-2024-52006 were both reported by RyotaK. The fixes for both CVEs were developed by Johannes Schindelin, with input and review from members of the private git-security mailing list.

Highlights from Git 2.48 https://github.blog/open-source/git/highlights-from-git-2-48/ Fri, 10 Jan 2025 18:28:50 +0000 https://github.blog/?p=81991 The open source Git project just released Git 2.48. Here is GitHub's look at some of the most interesting features and changes introduced since last time.

The post Highlights from Git 2.48 appeared first on The GitHub Blog.


The open source Git project just released Git 2.48 with features and bug fixes from over 93 contributors, 35 of them new. We last caught up with you on the latest in Git back when 2.47 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Faster SHA-1s without compromising security

When we published our coverage of Git’s 2.47 release, we neglected to mention a handful of performance changes that went in toward the very end of the cycle. Because this version contains a bugfix related to those changes, we figured now was as good a time as any[1] to discuss those changes.

You likely know that Git famously uses SHA-1 hashes by default to identify objects within your repository. (We have covered Git’s capability to use SHA-256 as its primary hash function instead, but for this tidbit we’ll focus on Git in its SHA-1 mode). What you may not know is that Git uses SHA-1 hashes internally in a couple of spots, too. Most notable for our purposes is that the pack format includes a trailing SHA-1 that stores the checksum of the preceding bytes. Git uses this data to validate that a pack’s contents matches what was advertised, and didn’t get corrupted in transit.
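Conceptually, verifying a pack's trailer looks like the sketch below, with a trivial one-byte sum standing in for SHA-1 so the shape of the check is easy to see (a real pack ends with a 20-byte SHA-1 of everything before it):

```c
#include <stddef.h>

/*
 * Toy model of a pack trailer check. A real pack ends with a 20-byte
 * SHA-1 of all preceding bytes; here a one-byte sum stands in for the
 * hash. The trailer must match a checksum recomputed over everything
 * that precedes it, or the pack is corrupt.
 */
int trailer_ok(const unsigned char *buf, size_t len)
{
	unsigned char sum = 0;
	size_t i;

	if (len < 1)
		return 0;
	for (i = 0; i + 1 < len; i++)
		sum = (unsigned char)(sum + buf[i]);
	return buf[len - 1] == sum;
}
```

This also shows why the trailer is an integrity measure rather than a security one: it detects accidental corruption in transit, not a deliberate forgery.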

By default, Git uses a collision detecting implementation of SHA-1, hardening it against common SHA-1 attacks like SHAttered and Shambles. (GitHub also uses the collision detecting SHA-1 implementation). While the collision detecting SHA-1 implementation protects Git against collision attacks, it does so at the cost of a few extra CPU cycles to look for the telltale signs of these attacks while checksumming.

In most cases, the performance impact is negligible, and the benefit outweighs the minor performance cost. But when computing the checksum of a large pack (like when cloning a large repository), the cost adds up. For instance, we used Callgrind and measured that Git spends around 78% of its CPU computing a checksum during a simulated clone of torvalds/linux.

Luckily, the trailing checksum is a data integrity measure, not a security one. For our purposes, this means that we can safely use a faster, non-collision-detecting implementation of SHA-1 specifically when computing trailing checksums (but nowhere else) without compromising security. Git 2.47 introduced new build-time options to specify a separate hash function implementation used specifically when computing trailing checksums. GitHub has used this option, and as a result measured a 10-13% performance improvement in serving fetches/clones across all repositories.

You can try out Git’s ability to select alternative hash function implementations by building with make OPENSSL_SHA1_UNSAFE=1, or other _UNSAFE variants.

[source, source, source]

Bringing --remerge-diff to range-diff

Regular readers of this series will no doubt recall our coverage of Git’s range-diff command (introduced back in Git 2.19), and the newer --remerge-diff option (released in Git 2.36). In case you’re a first-time reader, or neither of those ring a bell for you, don’t worry; here’s a brief refresher.

Git’s range-diff command allows you to compare two sequences of commits, including changes to their order, commit messages, and the actual content changes they introduce. This can be useful when comparing a sequence of commits that were rebased (potentially tweaking the order and changes within the patches along the way), to what that set of commits looked like before the rebase.

Git’s --remerge-diff option tells git show, git log, and other diff-related commands to show the difference between what an automatic merge would have produced (conflict markers and all) and what is actually recorded in the merge commit. This can be useful when dealing with merge conflicts, since the --remerge-diff view shows the difference between the conflicts and their resolution, revealing how a given merge conflict was handled.

In Git 2.48, these two features meet for the first time, and range-diff now accepts a --remerge-diff option, so that if someone rebases a sequence of commits with --rebase-merges and potentially needs to make some changes, then the changes in merge commits can also be reviewed as part of the range-diff.

As a side effect of this work, a longstanding bug with --remerge-diff was also fixed, which in particular allows git log --remerge-diff to be used together with options that change the order of commit traversal (such as --reverse).

[source, source]

Memory leak-free tests in Git

Beginning all the way back in Git 2.34, the Git project has been focused on reducing memory leaks with the goal of ultimately making Git leak-free. Since Git is a command line tool, each execution typically only lasts for a brief period of time, after which the kernel will free any memory allocated to Git that Git itself did not free. Because of this, memory leaks in Git have not posed a significant practical issue for everyday use.

But, having memory leaks in Git makes it difficult to convert much of Git’s internals into a callable library, where having memory leaks would be a significant issue. To address this, there has been a concerted effort over many years to reduce the number of memory leaks in Git’s codebase, with the ultimate goal of eliminating them altogether.

After much effort toward that end, Git can now run its test suite successfully with leak checking enabled. As a satisfying end result, much of the test infrastructure we talked about back in 2.34 was removed, resulting in a simpler test infrastructure. Making Git memory leak-free represents significant progress toward being able to convert parts of Git’s internals into a callable library.

[source, source, source, source, source, source, source, source, source, source, source, source]

Introducing Meson into Git

The Git project uses GNU Make as the primary means to compile Git, meaning that if you can obtain a copy of Git’s source, running make should be all you need to get a working Git binary (provided you have the necessary dependencies, etc.). There are a couple of exceptions, namely that Git has some support for Autoconf and CMake, though they are not as up-to-date as Git’s Makefile.

But as the Git project approaches its 20th anniversary later this year, its Makefile is starting to show its age. There have been over 2,000 commits to the Makefile, resulting in a build script that is nearly 4,000 lines long.

In this release, the Git project gained support for a new build system, Meson, as an alternative to building with GNU Make. While support for Meson is not yet as robust as building with Make, Meson does offer a handful of advantages over Make. Meson is easier to use than Make, making the project more accessible to newcomers or contributors who don’t have significant experience working with Make. Meson also has extensive IDE support, and supports out-of-tree and cross-platform builds, which are both significant assets to the Git project.

Git will retain support for Make and CMake in addition to Meson for the foreseeable future, and retain support for Autoconf for a little longer. But if you’re curious to experiment with Git’s newest build system, you can run meson setup build && ninja -C build on a fresh copy of Git 2.48 or newer.

[source]


  • As the Git project has grown over the years, it has accumulated a number of features and modes that, while reasonable when first introduced, have since become outdated or superseded and are now deprecated. In Git 2.48, the Git project began collecting these now-deprecated features in a list stored in Documentation/BreakingChanges.txt.

    This document enables the Git project to discuss deprecating certain features and collects the project’s anticipated deprecations in a single place. On the other side of the equation, it allows users to see if they might be affected by an upcoming deprecation, and share their use-case of a particular feature with the project. Check out the list to see if there is anything on there that you might miss, and to get an early picture of what an eventual Git 3.0 release might look like!

    [source]

  • If you’ve ever scripted around your repository’s references, you are likely familiar with Git’s for-each-ref command. In case you’re not, for-each-ref is a flexible tool that allows you to list references in your repository, apply custom formatting specifiers to them, and much more.

    Back in Git 2.44, we talked about some performance improvements that allowed git for-each-ref to run significantly faster by combining reference filtering and formatting into the same codepath, eliminating the need to store and sort the results in certain conditions.

    Git 2.48 extends those changes by allowing us to take advantage of the same optimizations even when asked to output the references in sorted order (under certain conditions). As long as those conditions are met, you can quickly output a small number of references even under --sort=refname independent of how many references your repository actually has.

    [source]

  • While we’re on the topic of references, the reftable subsystem has received some more attention in this release. Git’s reftable implementation was updated to avoid explicit dependencies on some of Git’s convenience APIs, making further progress on being able to compile the reftable code without libgit.a. The reftable implementation was also updated to gracefully handle memory allocation failures instead of exiting the process immediately. Last but not least, the reftable code was updated to be able to reuse reference iterators, resulting in faster reference creation and lower memory usage when using reftables.

    For more about reftables, check out our previous coverage of reftables.

    [source, source, source, source]

  • When you clone from a remote repository, the default branch that the remote repository uses is reflected in refs/remotes/origin/HEAD locally[2]. In prior versions of Git, subsequent fetches and pulls did not update this symbolic reference. With Git 2.48, if the remote has a default branch but refs/remotes/origin/HEAD is missing locally, then a fetch will update it.

    If you want to take it a step further, you can set the remote.origin.followRemoteHead configuration to warn or always; if you do so, when refs/remotes/origin/HEAD already exists but does not match the default branch on the remote side, running git fetch will either warn you about the change or automatically update refs/remotes/origin/HEAD to the appropriate value, depending on which setting you used.

    [source, source]

  • Partial clones also received some love this cycle, fixing an infinite loop and avoiding promisor-to-non-promisor references that could break the repository after a git gc.

    For those unfamiliar with partial clones or want to learn more about their internals, you can read the guide “Get up to speed with partial clone and shallow clone.”

    [source, source, source, source]


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.48, or any previous version in the Git repository.

Notes


  1. A better time would have been in the Highlights from Git 2.47 blog post, but who’s counting? 
  2. By default, anyway. You can specify a name other than “origin” with the -o option. 

Highlights from Git 2.47 https://github.blog/open-source/git/highlights-from-git-2-47/ Mon, 07 Oct 2024 15:59:17 +0000 https://github.blog/?p=80315 Git 2.47 is here, with features like incremental multi-pack indexes and more. Check out our coverage of some of the highlights here.

The post Highlights from Git 2.47 appeared first on The GitHub Blog.


The open source Git project just released Git 2.47 with features and bug fixes from over 83 contributors, 28 of them new. We last caught up with you on the latest in Git back when 2.46 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Incremental multi-pack indexes

Returning readers of this series will no doubt remember our coverage of all things related to multi-pack indexes (MIDXs). If you’re new here, or could use a refresher, here’s a brief recap.

Git stores objects (the blobs, trees, commits, and tags that make up your repository’s contents) in one of two formats: either loose or packed. Loose objects are the individual files stored in the two-character sub-directories of $GIT_DIR/objects, each representing a shard of the total set of loose objects. For instance, the object 08103b9f2b6e7fbed517a7e268e4e371d84a9a10 would be stored loose at $GIT_DIR/objects/08/103b9f2b6e7fbed517a7e268e4e371d84a9a10.
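The fan-out scheme above is easy to express in code. Here is a small sketch (a hypothetical helper, not Git's actual API) that builds a loose object's on-disk path from its hex object ID:

```c
#include <stdio.h>

/*
 * Build the on-disk path of a loose object from its hex object ID:
 * the first two hex characters name the fan-out directory, and the
 * remaining characters name the file within it.
 * (Hypothetical helper for illustration, not Git's API.)
 */
void loose_object_path(char *out, size_t n,
                       const char *gitdir, const char *oid)
{
	snprintf(out, n, "%s/objects/%.2s/%s", gitdir, oid, oid + 2);
}
```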

Objects can also be packed together in a single file known as a packfile. Packfiles store multiple objects together in a binary format, which has a couple of advantages over storing objects loose. Packfiles often have better cache locality because similar objects are often packed next to or near each other. Packfiles also have the advantage of being able to represent objects as deltas of one another, enabling a more compact representation of pairs of similar objects.

However, repositories can start to experience poor performance when they accumulate many packfiles, since Git has to search through each packfile to perform every object lookup. To improve performance when a repository accumulates too many packs, a repository must repack to generate a single new pack which contains the combined contents of all existing packs. This leaves the repository with only a single pack (resulting in faster lookup times), but the cost of generating that pack can be expensive.

In Git 2.21, multi-pack indexes were introduced to mitigate this expense. A MIDX is an index mapping each object to the pack (and the location within that pack) at which it appears. Because MIDXs can store information about objects across multiple packs, they enable fast object lookups for repositories that have many individual packs, like so:

non-incremental multi-pack index (MIDX) with objects in three packs

Here the multi-pack index is shown as a series of colored rectangles, each representing an object. The arrows point to those objects’ location within the pack from which they were selected in the MIDX, and encode the information stored in the MIDX itself.

But generating and updating the repository’s MIDX takes time, too: each object in the packs which are part of the MIDX need to be examined to record their object ID and offset within their source pack. This time can stretch even further if you are using multi-pack reachability bitmaps, since it adds a potentially large number of traversals covering significant portions of the repository to the runtime.

So what is there to do? Repacking your repository to optimize object lookups can be slow, but so can updating your repository’s multi-pack index.

Git 2.47 introduces a new experimental feature known as incremental multi-pack indexes, which allow storing more than one multi-pack index together in a chain of MIDX layers. Each layer contains packs and objects which are distinct from earlier layers, so the MIDX can be updated quickly via an append operation that only takes time proportional to the new objects being added, not the size of the overall MIDX. Here’s an example:

[Figure: an incremental MIDX with two layers, describing objects in six unique packs]

The first half of the figure is the same as before, but the second half shows a new incremental layer in the multi-pack index chain. The objects contained in the second layer’s MIDX are distinct from the ones in the first. But note that the source packs which appear in the second layer have some overlap with the packs containing objects tracked by the first.

In Git 2.47, the incremental multi-pack index feature is still considered experimental, and doesn’t yet support multi-pack reachability bitmaps. But support for incremental multi-pack bitmaps is currently under review and will hopefully appear in a future release.

(At GitHub, we plan to use incremental multi-pack bitmaps as part of further scaling efforts to support even larger repositories during repository maintenance. When we do, expect a blog post from us covering the details.)

You can experiment with incremental multi-pack indexes by running:

$ git multi-pack-index write --incremental

to add new packs to your repository’s existing MIDX today.

[source]

Quickly find base branches with for-each-ref

Have you ever been working on a branch, or spelunking through a new codebase, and wondered to yourself, “what is this branch based on”? It’s a common question, but one that can be surprisingly difficult to answer with previously existing tools.

A good approximation for determining what branch was the likely starting point for some commit C is to select the branch which minimizes the first-parent commits which are unique to C. (Here, “first parent commits” are the commits which are reachable by only walking through a merge commit’s first parent instead of traversing through all of its parents).

If you’re wondering: “why limit the traversal to the first-parent history?”, the answer is because the first-parent history reflects the main path through history which leads up to a commit. By minimizing the number of unique first-parent commits among a set of candidate base branches, you are essentially searching for the one whose primary development path is closest to commit C. So the branch with the fewest unique first-parent commits is likely where C originated or was branched from.

You might think that you could use something like git rev-list --count --first-parent to count the number of first-parent commits between two endpoints. But that doesn’t quite work, since rev-list removes all commits reachable from the base before returning its count.
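For reference, the naive per-candidate count looks something like this (a sketch using a throwaway repository; in real use you would have to repeat it for every candidate branch and compare the counts yourself):

```shell
# Sketch: count first-parent commits on main that topic does not have.
repo=$(mktemp -d)
git init --quiet -b main "$repo"
cd "$repo"
git config user.email demo@example.com
git config user.name Demo

git commit --quiet --allow-empty -m base
git branch topic                        # topic forks from here
git commit --quiet --allow-empty -m more-main-work

# Commits reachable from main (first-parent only) but not from topic.
git rev-list --count --first-parent topic..main
```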

Git 2.47 introduces a new tool for figuring out which branch was the likely starting point of some commit via a new atom in for-each-ref's --format specification. For example, let’s say I’m trying to figure out which branch name was picked for a topic I worked on upstream.

$ needle=fcb2205b77470c60f996a3206b2d4aebf6e951e3
$ git for-each-ref --contains $needle refs/remotes/origin | wc -l
63

Naively searching for the set of branches which contain the thing I’m looking for can return many results, for example if my commit was merged and is now contained in many other branches. But the new %(is-base:) atom can produce the right answer:

$ git for-each-ref --format="%(refname) %(is-base:$needle)" refs/remotes/origin \
  | grep '('
refs/remotes/origin/tb/incremental-midx-part-1 (fcb2205b77470c60f996a3206b2d4aebf6e951e3)

[source]


  • Git is famously portable and compatible with a wide variety of systems and architectures, including some fairly exotic ones. But until this most recent release, Git has lacked a formal platform support policy.

    This release includes a new “Platform Support Policy” document which outlines Git’s official policy on the matter. The exact details can be found in the source link below, but the current gist is that platforms must have C99 or C11, use versions of dependencies which are stable or have long-term support, and must have an active security support system. Discussions about adding additional requirements, including possibly depending upon Rust in a future version, are ongoing.

    The policy also has suggestions for platform maintainers on which branches to test and how to report and fix compatibility issues.

    [source]

  • A couple of releases ago, we discussed Git’s preliminary support for a new reference backend known as reftable. If you’re fuzzy on the details, our previous post is chock full of them.

    This release brings a number of unit tests which were written in the reftable implementation’s custom testing framework to Git’s standard unit test framework. These migrations were done by Chandra Pratap, one of the Git project’s Google Summer of Code (GSoC) contributors.

    This release also saw reftable gain better support when dealing with concurrent writers, particularly during stack compaction. The reftable backend also gained support for git for-each-ref’s --exclude option, which we wrote about when Git 2.42 was released.
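    As a refresher, --exclude filters entire hierarchies out of for-each-ref's output, and now works with reftable-backed repositories as well. A quick sketch using a throwaway repository (ref names are illustrative):

    ```shell
    # Sketch: list branches while excluding a 'wip/' hierarchy.
    repo=$(mktemp -d)
    git init --quiet -b main "$repo"
    cd "$repo"
    git config user.email demo@example.com
    git config user.name Demo

    git commit --quiet --allow-empty -m one
    git branch wip/scratch

    git for-each-ref --format='%(refname:short)' \
      --exclude=refs/heads/wip refs/heads
    ```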

    [source, source, source, source, source, source, source, source, source, source, source, source]

  • While we’re on the topic of unit testing, a number of other areas of the project received more thorough unit test coverage, or had existing tests migrated over from Git’s shell-based integration test suite.

    Git’s hashmap API, OID array, and urlmatch normalization features all were converted from Shell-based tests with custom helpers to unit tests. The unit test framework itself also received significant attention, ultimately resulting in using the Clar framework, which was originally written to replace the unit test framework in libgit2.

    Many of these unit test conversions were done by Ghanshyam Thakkar, another one of Git’s GSoC contributors. Congratulations, Ghanshyam!

    [source, source, source, source, source, source, source]

  • While we’re on the topic of Google Summer of Code contributors, we should mention, last (but not least!), another student, shejialuo, who improved git fsck to check the reference storage backend for integrity in addition to the regular object store. They introduced a new git refs verify sub-command, which is also run via git fsck and catches many reference corruption issues.
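    You can run the new checks directly; on a healthy repository both commands exit successfully and report nothing wrong. A quick sketch using a throwaway repository:

    ```shell
    # Sketch: verify reference integrity directly, and via a full fsck.
    repo=$(mktemp -d)
    git init --quiet "$repo"
    cd "$repo"
    git config user.email demo@example.com
    git config user.name Demo
    git commit --quiet --allow-empty -m one

    git refs verify   # check the reference backend by itself
    git fsck          # runs the same reference checks alongside object checks
    ```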

    [source]

  • Since at least 2019, there has been an effort to find and annotate unused parameters in functions across Git’s codebase. Annotating parameters as unused can help identify better APIs, and often the presence of an unused parameter can point out a legitimate bug in that function’s implementation.

    For many years, the Git project has sought to compile with -Wunused-parameter under its special DEVELOPER=1 mode, making it a compile-time error to have or introduce any unused parameters across the codebase. During that time, there have been many unused parameter cleanups and bug fixes, all done while working around other active development going on in related areas.

    In this release, that effort came to a close: when compiling with DEVELOPER=1, it is now a compile-time error to have unused parameters, making Git’s codebase cleaner and safer going forward.

    [source, source, source, source, source, source, source, source]

  • Way back when Git 2.34 was released, we covered a burgeoning effort to find and fix memory leaks throughout the Git codebase. Back then, we wrote that since Git typically has a very short runtime, it is much less urgent to free memory than it is in, say, library code, since a process’s memory will be “freed” by the operating system when the process stops.

    But as Git internals continue to be reshaped with the eventual goal of having them be call-able as a first party library, plugging any memory leaks throughout the codebase is vitally important.

    That effort has continued in this release, with more leaks throughout the codebase being plugged. For all of the details, check out the source links below:

    [source, source, source, source]

  • The git mergetool command learned a new tool configuration for Visual Studio Code. While it has always been possible to configure Git to run VSCode’s 3-way merge resolution, doing so previously required manual configuration.

    In Git 2.47, you can now easily configure your repository by running:

    $ git config set merge.tool vscode
    

    and subsequent runs of git mergetool will automatically open VSCode in the correct configuration.

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.47, or any previous version in the Git repository.

The post Highlights from Git 2.47 appeared first on The GitHub Blog.

]]>