Finding and removing large files in git history

Sometimes it behooves one to erase commits of large binaries from history. Binaries don’t play nicely in git. They slow it down. This post will help you find them and remove them like they were never there.

In svn, people only download the latest version of the repository, so if you commit a large binary, you can delete it in a future commit and the rest of your team is unharmed. In git, everyone who clones a repo gets the entire history, so they’re stuck downloading every version of every binary ever committed. Yuck!

Therefore, the svn-to-git conversion is a good time to delete all the large binaries from history. Do this before anyone has cloned the repository and before you push the commits to a shared place like Bitbucket or GitHub.

Caution: Never alter commits that are in your repo and someone else’s, if you ever plan to talk to their repo again.

Step 1: Identify the large files. We need to search through all of the history to find the files that are good candidates for deletion. As far as I can tell, this is nontrivial, so here is a complicated command that lists the sum of the sizes of all revisions of files that are over a million bytes. Run it on a mac.

git rev-list master | while read rev; do git ls-tree -lr "$rev" | cut -c54- | grep -v '^ '; done | sort -u | perl -e '
  # each input line is: size<TAB>path
  while (<>) {
    chomp;
    @stuff=split("\t");
    $sums{$stuff[1]} += $stuff[0];
  }
  print "$sums{$_} $_\n" for (keys %sums);
' | sort -rn >> large_files.txt

Please replace master with a list of all branches you care about.
This command says: List all commits in the history of these branches. For each one, list all the files; descend into directories recursively; include the size of the file. Cut out everything before the size of the file (which starts at character 54). Anything that starts with space is under a million bytes, so skip it. Now, choose only the unique lines; that’s approximately the unique large revisions. Sum the sizes for each filename, and output these biggest-first. Store the output in a file.
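
To see why the cut and grep steps isolate the large files, here is a small sketch that runs them on made-up lines in the `git ls-tree -l` layout: mode, type, a 40-character object name, the size right-aligned in a 7-character column, a tab, then the path. The object name and the paths are invented for illustration; no repository is needed.

```shell
#!/bin/sh
# Hypothetical object name; real ones are 40 hex characters like this placeholder.
sha=0123456789abcdef0123456789abcdef01234567

# Emit two fake lines in `git ls-tree -lr` layout: one small file, one large file.
fake_ls_tree() {
  printf '100644 blob %s %7s\t%s\n' "$sha" 2048 docs/readme.txt
  printf '100644 blob %s %7s\t%s\n' "$sha" 94973848 installers/jdk.exe
}

# Same filter as the pipeline above: keep size and path (column 54 onward),
# then drop lines whose size is padded with a leading space (under 7 digits).
large_only=$(fake_ls_tree | cut -c54- | grep -v '^ ')
printf '%s\n' "$large_only"
```

Only the `installers/jdk.exe` line survives, because its size has more than six digits and so has no leading padding.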

If this works, large_files.txt will look something like mine:

186028032 AccessibilityNative/WindowsAccessibleHandler/WindowsAccessibleHandler.sdf
94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe
93300120 quorum/installers/windows/jdk-7u21-windows-i586.exe
84144520 quorum/installers/windows/jdk-7-windows-x64.exe
83345288 quorum/installers/windows/jdk-7-windows-i586.exe
57712115 quorum/Run/Default.jar

Yeah, let’s not retain multiple versions of the jdk in our repository.

Step 2: Decide which large files to keep. For any file you want to keep in the history, delete its line from large_files.txt.
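
If you would rather not edit the list by hand, one possible approach (an assumption on my part, not something the steps above require) is to filter lines out with grep. This sketch works on a scratch copy holding two sample lines from the listing above; the choice of which file to keep is only for illustration.

```shell
#!/bin/sh
work=$(mktemp -d)
list="$work/large_files.txt"

# Two sample lines copied from the listing above.
printf '%s\n' \
  '94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe' \
  '57712115 quorum/Run/Default.jar' > "$list"

# Path to keep in history (chosen only for illustration).
keep='quorum/Run/Default.jar'

# -F treats the pattern as a literal string; -v keeps every line that does NOT match.
# The leading space anchors the match to the start of the path field.
grep -vF " $keep" "$list" > "$list.tmp"
mv "$list.tmp" "$list"
cat "$list"
```

Note that this matches any path that begins with the kept name, so check the result before moving on to Step 3.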

Step 3: Remove them like they were never there. This is the fun part. If large_files.txt is still in the same format as before, do this:

git filter-branch --tree-filter 'rm -rf `cat /full/path/to/large_files.txt | cut -d " " -f 2` ' --prune-empty

This says: Create an alternate universe with a history that looks like your branch's, except that for each commit, take its files and remove everything in large_files.txt (which contains the filename in the second space-delimited field). Drop any commits which only affected files that don’t exist anymore. Point the branch at this new version of history.
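
The backtick form can break when the list is very long or when a path contains spaces, since the shell splits the output on whitespace and the argument list has a size limit. One possible workaround, offered as a sketch rather than as the method used above, is a helper that reads the list line by line. The function name `remove_listed` is hypothetical, and the demo at the bottom runs in a scratch directory with invented file names.

```shell
#!/bin/sh
# remove_listed (hypothetical helper): delete every path named in a list file.
# Each line of the list is "<size> <path>", as in large_files.txt.
remove_listed() {
  # read puts the first field in size and the rest of the line in path,
  # so paths containing spaces stay in one piece.
  while read -r size path; do
    if [ -n "$path" ]; then
      rm -rf -- "$path"
    fi
  done < "$1"
  return 0
}

# Demo in a scratch directory (no git involved).
demo=$(mktemp -d)
list=$(mktemp)
cd "$demo" || exit 1
mkdir -p "my installers"
: > "my installers/jdk 7.exe"
: > keep.txt
printf '%s\n' '83345288 my installers/jdk 7.exe' > "$list"
remove_listed "$list"
ls
```

To use it as a tree filter, you could save the function in a file and source it inside the filter string, for example `--tree-filter '. /full/path/to/remove_listed.sh; remove_listed /full/path/to/large_files.txt'` (the script path is a placeholder).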

Whew. If this worked, then when you push to a brand-new repository for sharing, those binaries won’t go. Not in the current revision, not in any history. It is like they were never there.

————————-

OH GOD WHAT DID I DO: If you change your mind or mess up, you can undo this operation.
First, look at the history of where your branch has pointed recently:
git reflog <your-branch>

Here’s my output:

→ git reflog bbm2
e9429a7 bbm2@{0}: filter-branch: rewrite
08d7da5 bbm2@{1}: branch: Created from HEAD

The top line is the filter-branch I just did. The line below it shows the tip of the branch before that crazy filter operation.
I can do git log 08d7da5 to check on it, and git ls-tree 08d7da5 to see what’s in it. (If you want all the files to be listed, then git ls-tree -r 08d7da5.)

When I’m sure I want to undo the filter-branch, then:
git checkout <your-branch>
git reset @{1}

will put the branch riiiight back where it was. (Add --hard to the reset if you also want the files in your working directory restored to match.) If you don’t like the weird @{1} notation, you can use the specific commit name instead, and tell the branch exactly where you want it to be.
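
If you want to practice the undo before trying it on real work, here is a minimal sketch in a throwaway repository. It assumes git is installed, and it uses `git commit --amend` as a stand-in for the history rewrite, since both simply move the branch to a new commit. The identity passed with `-c` and the file name are invented for the demo.

```shell
#!/bin/sh
# Wrapper so the demo does not depend on your global git identity.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

repo=$(mktemp -d)
cd "$repo" || exit 1
g init -q .

echo one > file.txt
g add file.txt
g commit -q -m "original tip"
before=$(g rev-parse HEAD)

# Stand-in for a rewrite: amend replaces the branch tip with a new commit.
g commit -q --amend -m "rewritten tip"
rewritten=$(g rev-parse HEAD)

# @{1} is where the current branch pointed one move ago.
g reset -q --hard "@{1}"
after=$(g rev-parse HEAD)

echo "before:    $before"
echo "rewritten: $rewritten"
echo "after:     $after"
```

The `before` and `after` lines should show the same commit name, while `rewritten` differs.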

It’s important to feel safe to experiment. In git, as long as something was committed within the last 30 days (the default expiry for reflog entries that are no longer reachable from a branch), you won’t lose it.

27 thoughts on “Finding and removing large files in git history”

  1. Thanks! That's useful! It makes a good point that the same blob may have multiple names. I focused on multiple versions of a file with the same name, and the other post gets out the same large file with multiple names.

  2. Hi, thanks for that good info. However, I tried step 2 after step 1 went ok. I got this:
     Rewrite 3cfac5e96468a9f1dcacf447189a5b6c1358ead1 (1/6641)
     /usr/libexec/git-core/git-filter-branch: line 336: /bin/rm: Argument list too long
     Where is the syntax wrong here?

  3. Ah! You have so many files to remove that the command line can't handle it. Let's put multiple rm statements in a script instead.
     Edit your large_files.txt, and change all the size numbers to "rm -rf". In vi this would be
     :g/^[0-9]*/s//rm -rf/
     When your file looks like
     rm -rf /path/to/a/file
     rm -rf /path/to/another/file
     then make it executable:
     chmod u+x large_files.txt
     and then do this instead of the line that failed for you above:
     git filter-branch --tree-filter '/full/path/to/large_files.txt' --prune-empty

  4. (This comment is a modified version of the one I made at http://blog.jessitron.com/2014/01/removing-files-from-git.html, cross-posted at Jessica's request.)
     This post is a great description of how you can use git-filter-branch to remove unwanted large files from Git repo history, but it's worth noting that there is a new alternative which I would say is faster and simpler to use, i.e. The BFG Repo-Cleaner:
     http://rtyley.github.io/bfg-repo-cleaner/
     The BFG is a Scala/JGit-based tool that, ok, I wrote, but also is specifically designed for the removal of unwanted data from Git repository history. Typically in the past git-filter-branch has been used for this, which worked, but managed to be an order of magnitude more complicated even than the command you're describing above… it also worked quite slowly, in a single-threaded manner with a degree of flexibility that severely constrained performance. Cleaning a medium-biggish repo could be an overnight task.
     The BFG, unlike git-filter-branch, does not give you the opportunity to handle a file differently based on where or when it was committed within your history (you don't care _where_ the bad data is, you just want it _gone_), and this constraint gives a dramatic performance benefit. In addition, The BFG can easily make use of all the cores on your machine using Scala parallelism. The overall speed benefit is around 10-50x even on small repos, and on *large* repos, it's more like 500x. You can see a nice demonstration of this here, where we race a quad-core Mac against a Raspberry Pi, and the Raspberry Pi wins:
     http://www.youtube.com/watch?v=Ir4IHzPhJuI
     Finally, it's much simpler to use. Here's an example of using it to remove all blobs bigger than 1 megabyte; as you can see it's pretty simple:
     $ bfg --strip-blobs-bigger-than 1M my-repo.git
     (It might be that you actually have some /useful/ files bigger than 1MB, so by default The BFG protects all files in your latest commit, so you only lose the old unused files that you no longer require.)

  9. Why do I get 'Permission Denied'?
     luka$ git filter-branch --force --tree-filter 'rm -rf `/Volumes/RamDisk/FF/large_files.txt | cut -d " " -f 2` ' --prune-empty master
     Rewrite 072caf825338a50130903528862caa12cebd1c87 (1/3214)
     /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-filter-branch: line 318: /Volumes/RamDisk/FF/large_files.txt: Permission denied
     Rewrite 2e7c35b1dd73b2b2ceb010a3cd98ad6906b0716e (2/3214)
     /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-filter-branch: line 318: /Volumes/RamDisk/FF/large_files.txt: Permission denied
     …
     I made sure .git-rewrite is not in the repo (deleted) and the force flag does not help?
