Removing sensitive data from a GitHub repository

Jul 31, 22

Sometimes, things make their way into a GitHub repository that should never be stored there. Common examples include credentials, secrets, and private keys.

GitHub provides guidance on how to fix these mistakes in “Removing sensitive data from a repository”.

In my experience, however, the documentation doesn’t sufficiently describe the organizational and procedural work that a history rewrite requires.

https://twitter.com/QuinnyPig/status/1364766166783123456

I’ve put together two runbooks to follow: one for when the data is in a pull request, and another for when it has already been merged into the main branch. Where necessary, git filter-repo is used.

Pull Request

  1. Identify all affected commits and pull requests:
    1. Search the repo for other instances of the same data, in case it was introduced elsewhere as well
    2. Ensure it has not made its way into the main branch. If it has, you’ll need to follow the other runbook
    3. Make a note of the hash(es) of the affected commits
    4. Confirm the commit does not appear in any other branches: git branch -a --contains [HASH]
    5. Confirm the commit does not appear in any tags: git tag --contains [HASH]
    6. Check every ref at once: git for-each-ref --contains [HASH]
  2. Make a fresh clone of the repo, dedicated for the rewrite
  3. Force-push a new commit without the data to the branch
    1. The easiest way to do this is to check out the PR’s source branch, use git log to find the commit the branch was based on, and git reset --hard the branch to that commit
    2. If completely wiping the branch is sub-optimal (generally because of significant changes that can’t easily be reproduced), git filter-repo can be used to rewrite the branch’s history
      1. git filter-repo --refs [branch] --path [file to wipe] --invert-paths
    3. git push -f the branch to origin, effectively wiping out all changes made on the branch
    4. Check that, in GitHub, the PR no longer shows the sensitive data
  4. Submit a support ticket to GitHub to request deletion of the PR and a cache clear (which will also remove the commit).
  5. Confirm deletion of both the PR and commit from both the web and from a full clone of the repository
    1. git clone ...
    2. git fetch --all
    3. git checkout [HASH] - should fail
    4. Should not exist - github.com/{organization}/{repo}/commit/{hash}
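
The identification and confirmation steps in this runbook can be sketched as a few shell helpers. This is a rough sketch rather than a definitive implementation: the function names (refs_containing, classify_refs, commit_is_gone) are hypothetical, and it assumes the clone has already fetched every ref you care about.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of git.

# Print every ref in the clone at $1 that still contains commit $2.
refs_containing() {
  git -C "$1" for-each-ref --contains "$2" --format='%(refname)'
}

# Read ref names on stdin and label each one by kind, so a hit on
# the main branch or on a tag stands out immediately.
classify_refs() {
  local ref
  while read -r ref; do
    case "$ref" in
      refs/heads/*)   echo "branch ${ref#refs/heads/}" ;;
      refs/remotes/*) echo "remote-branch ${ref#refs/remotes/}" ;;
      refs/tags/*)    echo "tag ${ref#refs/tags/}" ;;
      *)              echo "other ${ref}" ;;
    esac
  done
}

# Succeed (exit 0) only if commit $2 is absent from the clone at $1.
commit_is_gone() {
  ! git -C "$1" cat-file -e "${2}^{commit}" 2>/dev/null
}
```

Before the rewrite, refs_containing . [HASH] | classify_refs lists where the commit still lives. After GitHub support confirms the deletion, commit_is_gone [path to fresh clone] [HASH] should succeed.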

In the main branch

Buckle up.

  1. Document all locations of data. Optional: Create cleaned replacements for all files to be removed.
  2. Set a code freeze, and notify everyone that all changes should be pushed to GitHub by that time. Advise users to:
    1. Run git branch -vv | grep -v origin and push any branches they need that are not currently tracked remotely to GitHub. (Make sure these branches do not include sensitive data)
    2. Run git log --branches --not --remotes and push any commits found to GitHub
  3. Conduct the rewrite, by:
    1. git clone --mirror (and then make a backup of the mirror, in case recovery to pre-filter state is later needed)
    2. git filter-repo --path [file to wipe] --invert-paths
      1. Optional: If Protected Branches is in use for main, it must be temporarily disabled (at least for Admins)
    3. git remote add origin git@github.com:{organization}/{repo}.git; git push --force --all origin
      1. Optional: Where a large repository is in use, such as an active monorepo, these commands may fail with an Internal Server Error. As a result, a few hacky scripts are required
        1. Generate lists of refs that still need pushing: git for-each-ref --contains [HASH] | grep "remotes" | cut -d" " -f2 | cut -d$'\t' -f2 > /tmp/badbranches.txt; git tag --contains [HASH] > /tmp/badtags.txt
        2. Push branches:
          while read -r p; do
            echo "Force-pushing ${p}"
            command git push -f https://github.com/cedar-team/cedar.git "${p}"
          done < /tmp/badbranches.txt
          
        3. Push tags (a refspec with an empty source, such as :refs/tags/[TAG], deletes that tag on the remote):
          while read -r p; do
            echo "Force-pushing :refs/tags/${p}"
            command git push -f https://github.com/cedar-team/cedar.git ":refs/tags/${p}"
          done < /tmp/badtags.txt
          
        4. Use git pull; git fetch origin; git for-each-ref --contains [HASH] to find any lingering changes that need pushing
        5. Optional: Directly commit and push the cleaned replacement file(s) from step 1 to main
        6. Re-enable protected branches
      2. Other possibilities, if a monorepo isn’t in place:
        1. git push --force --all origin
        2. git push origin -f link
        3. git push origin --force 'refs/heads/*'
        4. git push origin --force 'refs/tags/*'
        5. git push origin --force 'refs/replace/*'
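  4. Advise everyone to make a fresh clone, or to rebase (not merge) any local work onto the rewritten history, so the old commits are not pushed back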
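
The rewrite in step 3 can be sketched as a script that defaults to a dry run, so the sequence can be reviewed before anything is force-pushed. This is a minimal sketch under some assumptions: run, rewrite_history, and push_refs_one_by_one are hypothetical helper names, the scratch directory layout is made up, and the final --tags push is my stand-in for the refs/tags/* alternative listed above.

```shell
#!/usr/bin/env bash
set -eu

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# $1 = remote URL, $2 = file to wipe from history, $3 = scratch directory
rewrite_history() {
  local url="$1" target="$2" work="$3"
  run git clone --mirror "$url" "$work/repo.git"
  # Keep a pre-filter backup in case recovery is needed later.
  run cp -a "$work/repo.git" "$work/repo-backup.git"
  run git -C "$work/repo.git" filter-repo --path "$target" --invert-paths
  # filter-repo removes the origin remote by default, so add it back.
  run git -C "$work/repo.git" remote add origin "$url"
  run git -C "$work/repo.git" push --force --all origin
  run git -C "$work/repo.git" push --force --tags origin
}

# Fallback for large repositories: push refs one at a time.
# $1 = remote URL, $2 = file with one ref name per line
push_refs_one_by_one() {
  local url="$1" list="$2" ref
  while read -r ref; do
    [ -n "$ref" ] || continue   # skip blank lines
    run git push --force "$url" "$ref"
  done < "$list"
}
```

Calling rewrite_history git@github.com:{organization}/{repo}.git [file to wipe] [scratch dir] prints the six commands in order. Setting DRY_RUN=0 executes them for real, which requires git filter-repo to be installed.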

Longer Term

  1. You can use a tool like gitleaks as a pre-commit hook to detect sensitive data before it is committed and pushed to the remote.
  2. You can expand this runbook to add violating commit hashes to the pre-commit hook, and prevent their reintroduction
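
The second idea could be sketched roughly as follows. One swap to note: a commit’s hash does not exist until the commit has been created, so this sketch uses git’s pre-push hook rather than pre-commit, and checks outgoing commits against a deny list. The deny-list path and the function names are hypothetical, and the all-zero hash assumes a SHA-1 repository.

```shell
#!/usr/bin/env bash
# Hypothetical deny-list location: one full commit hash per line.
DENYLIST="${DENYLIST:-.git/denied-commits.txt}"
# git passes an all-zero hash for refs that are new or being deleted.
ZERO="0000000000000000000000000000000000000000"

# Read commit hashes on stdin; print those found in the deny list at $1.
find_denied() {
  [ -s "$1" ] || return 0
  grep -Fx -f "$1" || true
}

# git feeds a pre-push hook one line per ref being pushed:
#   <local ref> <local hash> <remote ref> <remote hash>
pre_push_check() {
  local local_ref local_sha remote_ref remote_sha range hits
  while read -r local_ref local_sha remote_ref remote_sha; do
    [ "$local_sha" = "$ZERO" ] && continue   # ref deletion, nothing to scan
    if [ "$remote_sha" = "$ZERO" ]; then
      range="$local_sha"                     # new ref: scan its whole history
    else
      range="${remote_sha}..${local_sha}"    # existing ref: scan new commits
    fi
    hits="$(git rev-list "$range" | find_denied "$DENYLIST")"
    if [ -n "$hits" ]; then
      printf 'Refusing to push %s; denied commits:\n%s\n' "$local_ref" "$hits" >&2
      return 1
    fi
  done
}

# When installed as .git/hooks/pre-push, end the file with:
# pre_push_check
```

Client-side hooks can be skipped with --no-verify, so this is a guard rail rather than an enforcement mechanism.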