Sunday, January 12, 2014

Removing files from git

TL;DR - "git rm --cached <file>" means "Yo git, as far as you know, this file is gone"

In a git repo, there are three places for files to be:
1) In your working directory
2) In the staging area
3) In the most recent commit.

Getting rid of a file means moving it out of all 3 places.
1) by deleting it
   rm <file>
2) by removing it from the staging area:
   git rm <filename>
3) by committing that removal.
   git commit -m "die stupid file die"
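Here is a rough sketch of those three steps in a throwaway repo, in case it helps to watch them happen. The temp directory, the demo identity, and the filename junk.txt are all made up for illustration.

```shell
#!/bin/sh
# Sketch: remove a committed file from all three places, in a scratch repo.
set -eu
repo=$(mktemp -d)                 # throwaway directory, not a real project
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "temp" > junk.txt            # junk.txt is a hypothetical filename
git add junk.txt
git commit -q -m "add junk"

rm junk.txt                       # 1) gone from the working directory
git rm -q junk.txt                # 2) removal recorded in the staging area
git commit -q -m "die stupid file die"     # 3) removal committed

git ls-files                      # prints nothing: no tracked files remain
```

Note that "git rm <filename>" on its own also deletes the working copy, so steps 1 and 2 can be a single command.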

When you want the file to remain on your filesystem but NOT in the repo, then tell git to ignore it. But that isn't enough! You also have to get it out of the repo.

1) tell git to ignore it: add the file or directory name to .gitignore[1]
2) get it out of the staging area BUT NOT the working directory:
  git rm --cached <filename>
3) If the file has been committed before, commit that removal, along with your .gitignore changes:
  git add .gitignore
  git commit -m "hide stupid file hide"
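The same recipe, sketched in a scratch repo so you can see that the file stays on disk while git forgets it. The filename local.cfg and the demo identity are assumptions, not anything from a real project.

```shell
#!/bin/sh
# Sketch: stop tracking a committed file but keep it on the filesystem.
set -eu
repo=$(mktemp -d)                 # throwaway directory
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "secret=1" > local.cfg       # local.cfg is a hypothetical filename
git add local.cfg
git commit -q -m "oops, committed local.cfg"

echo "local.cfg" >> .gitignore    # 1) tell git to ignore it
git rm -q --cached local.cfg      # 2) out of the staging area, still on disk
git add .gitignore
git commit -q -m "hide stupid file hide"   # 3) commit removal + .gitignore together

git ls-files                      # prints: .gitignore
git status --short                # prints nothing: local.cfg is ignored
ls local.cfg                      # the file is still there
```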

Git etiquette: package the .gitignore updates along with the removal of the newly-ignored files in one commit.
Warning: once the file is ignored, git treats your local copy as expendable. If you ever check out a commit that still tracks that file, whatever's in the commit will overwrite your current copy. No warnings. I hope this was some sort of build output that you can regenerate.
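To make that warning concrete, here is a small sketch that reproduces the overwrite in a scratch repo. The filename build.out, its contents, and the demo identity are invented for the example.

```shell
#!/bin/sh
# Sketch: an ignored local file gets overwritten by checking out an older commit.
set -eu
repo=$(mktemp -d)                 # throwaway directory
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "from the commit" > build.out         # build.out is a hypothetical filename
git add build.out
git commit -q -m "track build.out (the mistake)"

echo "build.out" > .gitignore
git rm -q --cached build.out
git add .gitignore
git commit -q -m "ignore build.out"

echo "local edits I care about" > build.out   # now untracked and ignored
git checkout -q HEAD~1            # the older commit still tracks build.out
cat build.out                     # prints: from the commit
```

If that worries you, git checkout has a documented --no-overwrite-ignore option that is meant to abort instead of clobbering ignored files.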

Sometimes the file has never been committed, but it was accidentally added to the staging area, and now you want git to leave you alone and ignore that file already!

Delete the file and remove it from the staging area in one easy step:
1) git rm -f <filename>

Or keep it, and tell git to leave it the freak alone:
1) tell git to ignore it: add the file or directory name to .gitignore
2) get it out of the staging area BUT NOT the working directory:
  git rm --cached <filename>
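Both options, sketched side by side in a scratch repo where nothing has been committed yet. The filenames oops.tmp and keep.local are made up, as is the demo identity.

```shell
#!/bin/sh
# Sketch: two ways to un-stage a file that was never committed.
set -eu
repo=$(mktemp -d)                 # throwaway directory
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "scratch" > oops.tmp         # hypothetical file we want gone entirely
echo "notes" > keep.local         # hypothetical file we want to keep, untracked
git add oops.tmp keep.local       # the accidental add

git rm -q -f oops.tmp             # option 1: delete it and un-stage it

echo "keep.local" >> .gitignore   # option 2: ignore it...
git rm -q --cached keep.local     # ...and un-stage it, leaving it on disk

git status --short                # prints: ?? .gitignore
```

The -f is there because git refuses a plain "git rm" on a file whose staged content has never been committed.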

Terminology: the "staging area" is also called the "index" and the "cache," for historical reasons.
If this seems complicated... yeah, I agree. If you know that "git rm --cached <file>" means "Yo git, take this file out of you," that'll get you through most of the frustration.

---------------
[1] For more ways to ignore files, and when to use each: http://jessitron.github.io/git-happens/ignore.html



4 comments:

  1. You can also ignore a file with `git update-index --assume-unchanged`. However, if you use `git rm --cached`, it will override that, and you'll have to ignore it again.

  2. Thanks for that excellent breakdown of the state of a recently committed file, and how to change it - I wanted to remark though that your opening sentence ("In a git repo, there are three places for files to be:") for completeness should add a 4th option: "In one or more *older* commits".

    Removing unwanted data, committed from _far back_ in Git repo history, is a very different task to the one you're describing, but I can't resist the compulsion to check that you know about The BFG: http://rtyley.github.io/bfg-repo-cleaner/

    The BFG is a Scala/JGit-based tool that, ok, I wrote, but also is specifically designed for the removal of unwanted data from Git repository history. Typically in the past git-filter-branch has been used for this, which worked, but managed to be an order of magnitude more complicated even than the command you're describing above... it also worked quite slowly, in a single-threaded manner with a degree of flexibility that severely constrained performance. Cleaning a medium-biggish repo could be an overnight task. The BFG, unlike git-filter-branch, does not give you the opportunity to handle a file differently based on where or when it was committed within your history (you don't care _where_ the bad data is, you just want it _gone_), and this constraint gives a dramatic performance benefit. In addition, The BFG can easily make use of all the cores on your machine using Scala parallelism - the overall speed benefit is around 10-50x even on small repos, and on *large* repos, it's more like 500x.

    Finally, it's much simpler to use:

    Delete all files named 'id_rsa' or 'id_dsa' :
    $ bfg --delete-files id_{dsa,rsa} my-repo.git

    Remove all blobs bigger than 1 megabyte :
    $ bfg --strip-blobs-bigger-than 1M my-repo.git

    Replace all passwords listed in a file with ***REMOVED*** wherever they occur in your repository :
    $ bfg --replace-text passwords.txt my-repo.git

    Ah... please forgive my rant - it's just a subject very close to my heart, even if it wasn't exactly the one you were addressing!

    1. Roberto, thanks for this comment! And thanks for writing such a useful tool.
      I have another post about removing big files - in my case, I needed to find the large files, evaluate each for whether it was OK to remove, and then delete some of them.

      Would you mind adding your comment there too?
      http://blog.jessitron.com/2013/08/finding-and-removing-large-files-in-git.html

    2. Sure, my pleasure! Have added comment here:

      http://blog.jessitron.com/2013/08/finding-and-removing-large-files-in-git.html?showComment=1389653326759#c6977471023524810917
