Git: Clear unreachable files from the .git directory

Whale-sized objects can be problematic.

Every object you commit in Git is copied into its object store, within the .git directory. If you undo a commit, the commit object and associated file objects remain in Git’s object store, at least for a while. The normal garbage collection process will clean them out, by default in 30 days.

But if you accidentally commit large files, this behaviour can lead to unmanageable disk usage. In this case, you can force the garbage collection by running git gc directly:

$ git -c gc.reflogExpireUnreachable=now gc --prune=now

This command clears out old commits that are not on any branch, along with the files they refer to. It reclaims disk space at the cost of losing reflog entries, which will prevent you from undoing some operations. To be sure you don’t lose anything you care about, run it only after checking that all your in-progress branches are in the expected state, and perhaps after pushing them to a remote server.
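Before running it, you may want to preview what would be deleted. Here is a minimal sketch, assuming a POSIX shell. The helper name list_prunable is made up for illustration. It relies on git fsck with --unreachable and --no-reflogs, which reports objects that only the reflog keeps alive, so treat its output as an approximation of what the gc command removes rather than an exact list.

```shell
# Hypothetical helper (the name is illustrative, not part of Git).
# Prints objects that nothing reaches once reflog entries are ignored.
list_prunable() {
    git fsck --unreachable --no-reflogs 2>/dev/null | grep '^unreachable' || true
}

# Usage, from inside the repository you are about to clean:
#   list_prunable
#   git branch -vv    # also check each branch's upstream state
```

If a commit you still want appears in the list, create a branch pointing at it before running the gc command.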

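The -c option temporarily sets gc.reflogExpireUnreachable to 0 days (“now”), down from the 30-day default, so reflog entries for unreachable commits expire immediately. The --prune option, also set to 0 days (“now”) instead of the 14-day default, then deletes the unreachable objects those entries were keeping alive.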
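To make the two settings concrete, the same cleanup can be sketched as two explicit steps, since git gc runs git reflog expire as part of its work. This is a rough equivalent under stated assumptions, not a replacement for the one-liner: it runs in a throwaway repository created with mktemp, and the identity values are placeholders.

```shell
#!/bin/sh
# Sketch: the one-liner split into its two underlying steps, in a throwaway repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so commits work without global configuration.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

echo "docs" > README.md
git add README.md
git commit -q -m "Add README"

# Make a commit, then amend it so the original is reachable only from the reflog.
echo "more docs" >> README.md
git commit -q -a -m "Explain caching"
old_commit=$(git rev-parse HEAD)
git commit -q --amend -m "Explain caching (amended)"

# Step 1: drop reflog entries that point at unreachable commits.
git reflog expire --expire-unreachable=now --all
# Step 2: repack, and delete unreachable objects regardless of their age.
git gc -q --prune=now

# Check whether the old commit object still exists in the object store.
if git cat-file -e "$old_commit" 2>/dev/null; then
    result="still present"
else
    result="pruned"
fi
echo "old commit: $result"
```

The one-liner is usually preferable because the -c setting applies only to that single invocation and leaves your configuration untouched.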

Worked example

Let’s walk through a worked example. Imagine you’re improving the documentation on a project with a caching feature. Whilst testing, you set up a temporary cache file that is quite large:

$ du -sh tmp_cache
1.0G  tmp_cache

You make your changes to the documentation and commit:

$ git status
On branch main
Changes to be committed:
        modified:   README.md

Untracked files:
        tmp_cache

$ git add --all

$ git commit -m "Explain caching"
[main 831b925] Explain caching
 2 files changed, 2 insertions(+)
 create mode 100644 tmp_cache

Ah, whoops. You committed the cache file as well, due to using git add --all. You can see this by looking at the latest commit with file stats:

$ git show --oneline --stat
831b925 (HEAD -> main) Explain caching
 README.md |   2 ++
 tmp_cache | Bin 0 -> 1073741824 bytes
 2 files changed, 2 insertions(+)

The tmp_cache file has been copied into the object store, increasing its size:

$ du -sh .git
1.0G  .git

To remove the cache file from the commit, first use git rm --cached to mark the file as removed, but keep it on disk (perhaps for further testing):

$ git rm --cached tmp_cache
rm 'tmp_cache'

$ git status
On branch main
Changes to be committed:
        deleted:    tmp_cache

Untracked files:
        tmp_cache

Then, use git commit --amend to add that change into the previous commit:

$ git commit --amend
[main 25082ae] Explain caching
 Date: Thu Aug 31 10:04:35 2023 +0100
 1 file changed, 2 insertions(+)

The commit stats now show only the changes to README.md:

$ git show --stat --oneline
25082ae (HEAD -> main) Explain caching
 README.md | 2 ++
 1 file changed, 2 insertions(+)

So far, we’ve removed the problematic commit from our branch. But it still lives in the reflog:

$ git reflog
25082ae (HEAD -> main) HEAD@{0}: commit (amend): Explain caching
831b925 HEAD@{1}: commit: Explain caching
...

…and the file object still consumes space in the object store:

$ du -sh .git
1.0G  .git

Use the git gc command to clear out the reflog and file objects:

$ git -c gc.reflogExpireUnreachable=now gc --prune=now
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 10 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 5 (delta 0), pack-reused 0

That brings the .git directory back down to size:

$ du -sh .git
124K  .git
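If you would like to try this without risking a real repository, the walkthrough can be replayed at a smaller scale. The following is a sketch under assumptions: it uses a temporary directory from mktemp, an 8 MiB file of random bytes standing in for the 1 GB cache, and placeholder identity values. The exact sizes it prints will vary between systems.

```shell
#!/bin/sh
# Sketch: replay the walkthrough in a throwaway repository with an 8 MiB stand-in file.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so commits work without global configuration.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

echo "# Project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Random bytes do not compress, so the object store grows by about the file size.
head -c 8388608 /dev/urandom > tmp_cache
echo "Caching notes" >> README.md
git add --all
git commit -q -m "Explain caching"

# Undo the mistake: untrack the file but keep it on disk, then amend.
git rm -q --cached tmp_cache
git commit -q --amend --no-edit

# Measure the .git directory (in KiB) before and after the forced gc.
size_before=$(du -sk .git | cut -f1)
git -c gc.reflogExpireUnreachable=now gc -q --prune=now
size_after=$(du -sk .git | cut -f1)

echo "before: ${size_before}K, after: ${size_after}K"
```

The script leaves the temporary repository in place so you can inspect it afterwards, for example with git reflog.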

Fin

Thanks to William Berg for prompting this post, by asking about the issue whilst beta-reading my new book Boost Your Git DX.

May your repositories remain a manageable size,

—Adam

