Sunday, August 18, 2013

Finding and removing large files in git history

Sometimes it behooves one to erase commits of large binaries from history. Binaries don't play nicely in git. They slow it down. This post will help you find them and remove them like they were never there.

In svn, people only download the latest version of the repository, so if you commit a large binary, you can delete it in a future commit and the rest of your team is unharmed. In git, everyone who clones a repo gets the entire history, so they're stuck downloading every version of every binary ever committed. Yuck!

Therefore, the svn-to-git conversion is a good time to delete all the large binaries from history. Do this before anyone has cloned the repository, before you push all the commits to a shared place like bitbucket or github.

Caution: Never alter commits that are in your repo and someone else's, if you ever plan to talk to their repo again.

Step 1: Identify the large files. We need to search through all of the history to find the files that are good candidates for deletion. As far as I can tell, this is nontrivial, so here is a complicated command that lists the sum of the sizes of all revisions of files that are over a million bytes. Run it on a mac.

git rev-list master | while read rev; do git ls-tree -lr "$rev" | cut -c54- | grep -v '^ '; done | sort -u | perl -e '
  while (<>) {
    chomp;
    @stuff=split("\t");
    $sums{$stuff[1]} += $stuff[0];
  }
  print "$sums{$_} $_\n" for (keys %sums);
' | sort -rn > large_files.txt

Please replace master with a list of all branches you care about.
This command says: List all commits in the history of these branches. For each one, list all the files; descend into directories recursively; include the size of the file. Cut out everything before the size of the file (which starts at character 54). Anything that starts with space is under a million bytes, so skip it. Now, choose only the unique lines; that's approximately the unique large revisions. Sum the sizes for each filename, and output these biggest-first. Store the output in a file.
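
To see why the "starts with space" trick works, here is a minimal sketch that fakes two lines in the layout git ls-tree -lr prints (mode, type, 40-character object name, size right-justified to at least 7 columns, a tab, then the path). The object name, sizes and file names are placeholders invented for illustration, not output from a real repository.

```shell
# Placeholder object name, 40 characters long like a real SHA-1.
hash=0123456789abcdef0123456789abcdef01234567

# Build two fake ls-tree lines: one small file, one large file.
small=$(printf '100644 blob %s %7s\t%s' "$hash" 512 README.txt)
large=$(printf '100644 blob %s %7s\t%s' "$hash" 94973848 jdk.exe)

# Character 54 is where the size column begins.
small_cut=$(printf '%s\n' "$small" | cut -c54-)
large_cut=$(printf '%s\n' "$large" | cut -c54-)

# Sizes with fewer than 7 digits are padded with leading spaces,
# so dropping lines that start with a space keeps only the big ones.
kept=$(printf '%s\n%s\n' "$small_cut" "$large_cut" | grep -v '^ ')
printf '%s\n' "$kept"
```

Only the jdk.exe line survives the grep, because its size already fills the column.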

If this works, large_files.txt will look something like mine:

186028032 AccessibilityNative/WindowsAccessibleHandler/WindowsAccessibleHandler.sdf
94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe
93300120 quorum/installers/windows/jdk-7u21-windows-i586.exe
84144520 quorum/installers/windows/jdk-7-windows-x64.exe
83345288 quorum/installers/windows/jdk-7-windows-i586.exe
57712115 quorum/Run/Default.jar

Yeah, let's not retain multiple versions of the jdk in our repository.

Step 2: Decide which large files to keep. For any file you want to keep in the history, delete its line from large_files.txt.
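
Deleting lines in an editor works fine. If you would rather script it, something like the following sketch should do; it uses three lines copied from the sample output above and a scratch directory, and assumes you want to keep Default.jar in history.

```shell
dir=$(mktemp -d)
list="$dir/large_files.txt"

# Three lines taken from the sample large_files.txt shown earlier.
printf '%s\n' \
  '94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe' \
  '84144520 quorum/installers/windows/jdk-7-windows-x64.exe' \
  '57712115 quorum/Run/Default.jar' > "$list"

# Keep Default.jar in history by dropping its line from the list.
# -F treats the pattern as a literal string, -v inverts the match.
grep -v -F ' quorum/Run/Default.jar' "$list" > "$list.tmp"
mv "$list.tmp" "$list"

count=$(wc -l < "$list" | tr -d ' ')
cat "$list"
```

Whatever is still listed in large_files.txt afterwards is what Step 3 will remove.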

Step 3: Remove them like they were never there. This is the fun part. If large_files.txt is still in the same format as before, do this:

git filter-branch --tree-filter 'rm -rf `cat /full/path/to/large_files.txt | cut -d " " -f 2` ' --prune-empty <BRANCHES>

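This says: Create an alternate universe with a history that looks like <BRANCHES>, except for each commit, take its files and remove everything in large_files.txt (which contains the filename in the second space-delimited field). Drop any commits which only affected files that don't exist anymore. Point <BRANCHES> at this new version of history. Replace <BRANCHES> with real branch names such as master; typed literally, the shell reads < and > as redirections and fails with a syntax error near unexpected token `newline'.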
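
If large_files.txt lists a lot of paths, the backtick expansion can exceed the shell's argument-length limit, and paths containing spaces get cut off at the first space. Here is a hedged sketch of an alternative removal command that pipes the paths through xargs. It is exercised on a scratch directory standing in for one commit's checked-out tree; the file names and sizes are invented, and xargs -0 is assumed to be available (it is in both GNU and BSD xargs).

```shell
# Scratch directory standing in for one commit's checked-out tree.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p installers
: > installers/jdk.exe
: > "installers/old build.jar"
: > keep.txt

# Same "size path" format as large_files.txt; kept outside the tree.
list="$work.list"
printf '%s\n' \
  '94973848 installers/jdk.exe' \
  '57712115 installers/old build.jar' > "$list"

# Drop the size field, keep the rest of each line as the path,
# and hand NUL-separated paths to rm in batches.
cut -d ' ' -f 2- "$list" | tr '\n' '\0' | xargs -0 rm -rf

remaining=$(find . -type f | sort)
printf '%s\n' "$remaining"

# As a tree filter this might look like (untested sketch):
# git filter-branch --tree-filter 'cut -d " " -f 2- /full/path/to/large_files.txt | tr "\n" "\0" | xargs -0 rm -rf' --prune-empty master
```

Only keep.txt is left in the scratch directory once the pipeline has run.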

Whew. If this worked, then when you push to a brand-new repository for sharing, those binaries won't go. Not in the current revision, not in any history. It is like they were never there.

-------------------------

OH GOD WHAT DID I DO: If you change your mind or mess up, you can undo this operation.
First, look at the history of where your branch has pointed recently:
git reflog <BRANCH>

Here's my output:
→ git reflog bbm2
e9429a7 bbm2@{0}: filter-branch: rewrite
08d7da5 bbm2@{1}: branch: Created from HEAD

The top line is the filter-branch I just did. The line below it shows the tip of the branch before that crazy filter operation.
I can do git log 08d7da5 to check on it, and git ls-tree 08d7da5 to see what's in it. (If you want all the files to be listed, then git ls-tree -r 08d7da5.)

When I'm sure I want to undo the filter-branch, then:
git checkout <BRANCH>
git reset <BRANCH>@{1}

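will put the branch riiiight back where it was. If you don't like the weird @{1} notation, you can use the specific commit name instead, and tell the branch exactly where you want it to be. A plain git reset moves the branch and the index but leaves your working directory alone, so the removed files may show up as deleted; git reset --hard <BRANCH>@{1} brings them back too.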
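
To try the undo without risking a real repository, here is a minimal sketch on a throwaway repo. The branch name demo, the file name and the identity settings are all made up, and a hard reset to the previous commit stands in for the filter-branch rewrite. It uses git reset --hard (not the plain reset above) so the working tree is restored as well.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/demo      # hypothetical branch name
git config user.name "Example"
git config user.email "example@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"
before=$(git rev-parse demo)

# Stand-in for a history rewrite: move the branch somewhere else.
git reset -q --hard HEAD~1

# demo@{1} is where the branch pointed before its latest move.
git reflog demo
git reset -q --hard "demo@{1}"
after=$(git rev-parse demo)
```

After the last reset, the branch points at the "second" commit again and file.txt has its newer contents.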

It's important to feel safe to experiment. In git, as long as it was committed in the last 30 days, you won't lose it. (30 days is the default reflog expiry for commits that are no longer reachable from the branch.)





27 comments:

  1. Hi Jessica,

    Just on your large-file script, I've found the one here to be incredibly useful:

    http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

    The nice thing is it's across all the objects in your repository, not just the tip of the branches.

    I hope that helps.
    Charles

    Replies
    1. Thanks! That's useful!
      It makes a good point that the same blob may have multiple names. I focused on multiple versions of a file with the same name, and the other post gets out the same large file with multiple names.

  2. I've also used https://github.com/cmaitchison/git_diet as it's a nice collection to search for large objects in git history and then purge them

  3. Hi - thanks for that good info. however, I tried step 2 after step 1 went ok. I got this :
    Rewrite 3cfac5e96468a9f1dcacf447189a5b6c1358ead1 (1/6641)/usr/libexec/git-core/git-filter-branch: line 336: /bin/rm: Argument list too long
    where the syntax is wrong here ?

    Replies
    1. ah! You have so many files to remove that the command line can't handle it. Let's put multiple rm statements in a script instead.

      Edit your large_files.txt, and change all the size numbers to "rm -rf".
      In vi this would be :g/^[0-9]*/s//rm -rf/

      when your file looks like
      rm -rf /path/to/a/file
      rm -rf /path/to/another/file

      then make it executable:
      chmod u+x large_files.txt

      and then do this instead of the line that failed for you above:
      git filter-branch --tree-filter '/full/path/to/large_files.txt' --prune-empty

    2. thanks - figured it out after running it outside of git....but thanks a lot. I used xargs instead which worked fine.

  4. (this comment is a modified version of the one I made at http://blog.jessitron.com/2014/01/removing-files-from-git.html, cross-posted at Jessica's request)

    This post is a great description of how you can use git-filter-branch to remove unwanted large files from Git repo history, but it's worth noting that there is a new alternative which I would say is faster and simpler to use, ie The BFG Repo-Cleaner:

    http://rtyley.github.io/bfg-repo-cleaner/

    The BFG is a Scala/JGit-based tool that, ok, I wrote, but also is specifically designed for the removal of unwanted data from Git repository history. Typically in the past git-filter-branch has been used for this, which worked, but managed to be an order of magnitude more complicated even than the command you're describing above... it also worked quite slowly, in a single-threaded manner with a degree of flexibility that severely constrained performance. Cleaning a medium-biggish repo could be an overnight task. The BFG, unlike git-filter-branch, does not give you the opportunity to handle a file differently based on where or when it was committed within your history (you don't care _where_ the bad data is, you just want it _gone_), and this constraint gives a dramatic performance benefit. In addition, The BFG can easily make use of all the cores on your machine using Scala parallelism - the overall speed benefit is around 10-50x even on small repos, and on *large* repos, it's more like 500x. You can see a nice demonstration of this here, where we race a quad-core Mac against a Raspberry Pi - and the Raspberry Pi wins:

    http://www.youtube.com/watch?v=Ir4IHzPhJuI

    Finally, it's much simpler to use. Here's an example of using it to remove all blobs bigger than 1 megabyte - as you can see it's pretty simple:

    $ bfg --strip-blobs-bigger-than 1M my-repo.git

    (It might be that you actually have some /useful/ files bigger than 1MB, so by default The BFG protects all files in your latest commit, so you only lose the old unused files, that you no longer require.)

  5. Awesome. Saved GitHub folks a beefy chunk of .jars :-P

  6. I get this error message -bash: syntax error near unexpected token `newline'

  8. When I run the step 3 I get this error:

    -bash: syntax error near unexpected token `newline'

  20. Why do I get 'Permission Denied'?
    luka$ git filter-branch --force --tree-filter 'rm -rf `/Volumes/RamDisk/FF/large_files.txt | cut -d " " -f 2` ' --prune-empty master
    Rewrite 072caf825338a50130903528862caa12cebd1c87 (1/3214)/Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-filter-branch: line 318: /Volumes/RamDisk/FF/large_files.txt: Permission denied
    Rewrite 2e7c35b1dd73b2b2ceb010a3cd98ad6906b0716e (2/3214)/Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-filter-branch: line 318: /Volumes/RamDisk/FF/large_files.txt: Permission denied...

    I made sure .git-rewrite is not in the repo (deleted) and force flag does not help?
