Removing sensitive data from a GitHub repository
Jul 31, 22

Sometimes, things make their way into a GitHub repository that cannot be stored in that context. The common examples include credentials, secrets, and private keys.
GitHub provides guidance on how to fix these mistakes in “Removing sensitive data from a repository”.
In my experience, however, the documentation doesn’t sufficiently describe the organizational and procedural work that a rewrite requires.
I’ve put together two different runbooks to follow: one for when the data is in a Pull Request, and another for when it has been merged into the main branch. Where necessary, git filter-repo is used.
Pull Request
- Identify all affected commits and pull requests:
  - Search the repo for other instances of the same data, in case it was introduced elsewhere as well.
  - Ensure it has not made its way into main. If it has, you’ll need to follow the other runbook.
  - Make a note of the hash(es) for the affected commits.
  - Confirm the commit did not appear in any other branches: git branch -a --contains [HASH]
  - Confirm the commit did not appear in any tags: git tag --contains [HASH]
  - Check everything: git for-each-ref --contains [HASH]
- Make a fresh clone of the repo, dedicated for the rewrite
- Force-push a new commit without the data to the branch
  - The easiest way to do this is to check out the branch the PR is based on, check the git log to find the commit on which the branch is based, and git reset --hard the branch to that commit
  - If completely wiping the branch is sub-optimal (generally due to significant changes that can’t be easily reproduced), git filter-repo can be used to rewrite the branch’s history instead:
    git filter-repo --refs [branch] --path [file to wipe] --invert-paths
  - git push -f the branch to origin, effectively wiping out all changes made on the branch
  - Check that, in GitHub, the PR no longer shows the sensitive data (e.g. PHI)
- Submit a support ticket to GitHub to request deletion of the PR and a cache clear (which will also remove the commit).
- Confirm deletion of both the PR and commit from both the web and from a full clone of the repository
  - In a full clone, checking out the commit should fail:
    git clone ...
    git fetch --all
    git checkout [HASH]
  - On the web, this page should not exist: github.com/{organization}/{repo}/commit/{hash}
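
The branch, tag, and "check everything" commands in this runbook can be folded into one small helper, which is handy both before the rewrite (to see what is affected) and after it (to confirm nothing is left). This is a minimal sketch rather than part of the original runbook: the function names refs_containing and commit_is_gone are my own, and it only inspects the refs of whatever clone you run it in.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of git.

# Print every ref (branch, tag, remote-tracking ref, ...) that still
# contains the given commit. Prints nothing if this clone has never
# seen the commit, or if no ref can reach it.
refs_containing() {
    commit="$1"
    # An object that does not exist locally cannot be contained in any ref.
    git cat-file -e "${commit}^{commit}" 2>/dev/null || return 0
    git for-each-ref --format='%(refname)' --contains "$commit"
}

# Succeed only when no ref in this clone contains the commit.
commit_is_gone() {
    [ -z "$(refs_containing "$1")" ]
}
```

Run refs_containing [HASH] after a git fetch --all to list what still needs cleaning. Note that commit_is_gone succeeding in a local clone says nothing about GitHub's own cached views of the PR, which is why the support ticket is still needed.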
In the main branch
Buckle up.
- Document all locations of data. Optional: Create cleaned replacements for all files to be removed.
- Set a code freeze, and notify everyone that all changes should be committed to GitHub by that time. Advise users to:
  - Run git branch -vv | grep -v origin and push any branches you need that are not currently tracked remotely to GitHub. (Make sure these branches do not include sensitive data.)
  - Run git log --branches --not --remotes and push any commits found to GitHub
- Conduct the rewrite, by:
  - git clone --mirror (and then make a backup of the mirror, in case recovery to the pre-filter state is later needed)
  - git filter-repo --path [file to wipe] --invert-paths
  - Optional: If Protected Branches is in use for main, it must be temporarily disabled (at least for Admins)
  - git remote add origin git@github.com:{organization}/{repo}.git; git push --force --all origin
    - Optional: Where a large repository is in use, such as an active monorepo, these commands will fail on an Internal Server Error. As a result, a few hacky scripts are required
      - Generate lists of refs that still need pushing:
        git branch --contains [HASH] | grep "remotes" | cut -d" " -f2 | cut -d$'\t' -f2 > /tmp/badbranches.txt; git tag --contains [HASH] > /tmp/badtags.txt
      - Push branches:
        while read p; do echo "Force-pushing ${p}"; command git push -f https://github.com/cedar-team/cedar.git "${p}"; done </tmp/badbranches.txt
      - Push tags:
        while read p; do echo "Force-pushing :refs/tags/${p}"; command git push -f https://github.com/cedar-team/cedar.git ":refs/tags/${p}"; done </tmp/badtags.txt
      - Use git pull; git fetch origin; git for-each-ref --contains [HASH] to find any lingering changes that need pushing
      - Optional: Directly commit and push the cleaned replacement file(s) from the first step to main
      - Re-enable protected branches
    - Other possibilities, if a monorepo isn’t in place:
      git push --force --all origin
      git push origin -f
      git push origin --force 'refs/heads/*'
      git push origin --force 'refs/tags/*'
      git push origin --force 'refs/replace/*'
- Advise everyone to rebase if necessary
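
Two of the steps above are easy to get wrong under pressure: finding work that only exists on someone's machine before the freeze, and keeping an untouched copy of the mirror before rewriting it. Here is a rough sketch of both as shell functions. The names branches_without_upstream and mirror_with_backup, and the ".pre-filter-backup" suffix, are assumptions of mine and not anything git or GitHub defines.

```shell
#!/bin/sh
# Sketch only: function names and the backup suffix are illustrative.

# Code freeze helper: print local branches with no upstream configured,
# i.e. branches that would be missed unless they are pushed first.
# Branch names cannot contain spaces, so a line with a single field
# means the upstream column was empty.
branches_without_upstream() {
    git for-each-ref --format='%(refname:short) %(upstream)' refs/heads |
        awk 'NF == 1 { print $1 }'
}

# Rewrite helper: take a mirror clone, then copy it aside untouched so
# the pre-filter state can be recovered later.
# $1 = repository URL or path, $2 = directory for the mirror
mirror_with_backup() {
    git clone --quiet --mirror "$1" "$2" || return 1
    cp -R "$2" "$2.pre-filter-backup"
}
```

With the mirror and its backup in place, the git filter-repo command from the runbook would be run inside the mirror directory only. branches_without_upstream is a slightly stricter take on the git branch -vv | grep -v origin check, since it does not depend on the remote being named origin.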
Longer Term
- You can use a tool like gitleaks as a pre-commit hook to detect sensitive data before it is committed and pushed to the remote.
- You can expand this runbook to add violating commit hashes to the pre-commit hook, and prevent their reintroduction
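
As an illustration of the second idea, a hook could keep a list of purged commit hashes and refuse to continue if any of them shows up in the current history, which is what happens when someone merges from a stale pre-rewrite clone. This is a hypothetical sketch: the .banned-commits file (one full hash per line) and the function name are assumptions, and it has nothing to do with how gitleaks itself works.

```shell
#!/bin/sh
# Hypothetical hook fragment. BANNED_FILE is an assumed location holding
# one full commit hash per line for every commit that was purged.
BANNED_FILE="${BANNED_FILE:-.banned-commits}"

# Succeed (exit 0) if any banned commit is an ancestor of HEAD.
history_has_banned_commit() {
    [ -f "$BANNED_FILE" ] || return 1
    while read -r banned; do
        [ -n "$banned" ] || continue
        # Skip hashes this clone has never seen.
        git cat-file -e "${banned}^{commit}" 2>/dev/null || continue
        if git merge-base --is-ancestor "$banned" HEAD 2>/dev/null; then
            echo "Banned commit found in history: $banned" >&2
            return 0
        fi
    done < "$BANNED_FILE"
    return 1
}
```

In a hook script this would be called as history_has_banned_commit && exit 1. It only looks at the history of the branch that is checked out at the time, so treat it as a safety net rather than a guarantee.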