Git Version Control with Jupyter Notebooks
Version control is a vital part of data science workflows. Across multiple experiments, it is essential to know what changed and which team member made each update. Unfortunately, the default Jupyter setup is severely lacking in this regard.
Jupyter notebooks are stored in JSON format, which makes it difficult to read diffs of notebooks and to do code reviews as a team. To solve this, we will store notebooks in both .ipynb and .Rmd (R-Markdown) format. Because .Rmd is just a text file, this lets us do diffs as well as merges.
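To make the diff problem concrete, here is a minimal sketch using only the Python standard library. It builds two tiny notebook-like JSON documents by hand (the make_notebook helper and the cell contents are illustrative assumptions, not output from a real Jupyter session) and compares how many lines change in the JSON versus in the plain source text.

```python
import difflib
import json


def make_notebook(source, output_text, execution_count):
    """Build a minimal nbformat-4 style notebook with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "name": "stdout",
                        "output_type": "stream",
                        "text": [output_text],
                    }
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def changed_lines(before, after):
    """Count the lines that ndiff marks as removed ('- ') or added ('+ ')."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


# A one-character code change: x = 1 becomes x = 2.
old_src, new_src = "x = 1\nprint(x)", "x = 2\nprint(x)"

# The same change as stored in the notebook JSON, where the output and
# the execution counter change along with the source.
old_nb = json.dumps(make_notebook(old_src, "1\n", 1), indent=1)
new_nb = json.dumps(make_notebook(new_src, "2\n", 2), indent=1)

json_changes = changed_lines(old_nb, new_nb)
text_changes = changed_lines(old_src, new_src)
print(f"JSON diff lines: {json_changes}, text diff lines: {text_changes}")
```

Even in this toy case the JSON diff touches more lines than the source change itself, because the stored output and execution counter change too. Real notebooks with plots amplify this considerably.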
An Intro to R-Markdown and advantages over plain markdown
R-Markdown is the same as the markdown format, with the added advantage that you can produce publication-quality PDF, Word, and other output formats from it. So if your notebook is stored in .Rmd format, you can not only version control it but also convert it into a format suitable for publishing a paper.
More info about R-Markdown can be found here. Also, if you intend to write a book using R-Markdown, check out the R bookdown package. It allows you to write a book using markdown+code+outputs.
Ways to save in multiple formats
- Jupyter Hooks and git hooks
- Jupytext and saving as .Rmd
Why did we choose Jupytext even though the first method is easier to set up?
The first method saves the notebook only as code, while Jupytext saves it as a readme-style markdown document, so it renders well on GitHub and keeps your documentation intact.
Install and setup Jupytext
You need to do this on every system that will use the git repo with notebooks. That means all your teammates must have Jupytext configured.
We use the Jupytext library (https://github.com/mwouts/jupytext):
pip install jupytext --upgrade
- Next generate a Jupyter config, if you don’t have one yet, with
jupyter notebook --generate-config
- Edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class="jupytext.TextFileContentsManager"
c.ContentsManager.default_jupytext_formats = ".ipynb,.Rmd"
- Restart Jupyter, i.e. run
jupyter notebook
Note: the .jupyter directory is usually located in your home directory.
- Open an existing notebook or create a new one.
- Disable Jupyter’s autosave so that round-trip editing works: add the following to the top cell and execute it.
%autosave 0
- You can edit the .Rmd file in Jupyter as well as in text editors, and you can use it to review changes under version control.
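Since every teammate has to repeat the config edit from the steps above, it can help to script it. The following is a minimal sketch, not part of Jupytext itself: the helper names are hypothetical, the two settings are the ones quoted above, and the default path assumes the config lives in .jupyter under your home directory and that the config has already been generated.

```python
from pathlib import Path

# The two settings quoted in the setup steps.
JUPYTEXT_LINES = [
    'c.NotebookApp.contents_manager_class="jupytext.TextFileContentsManager"',
    'c.ContentsManager.default_jupytext_formats = ".ipynb,.Rmd"',
]

# Assumed default location; adjust if your config lives elsewhere.
DEFAULT_CONFIG = Path.home() / ".jupyter" / "jupyter_notebook_config.py"


def add_jupytext_settings(config_text):
    """Return config_text with any missing Jupytext setting appended."""
    lines = config_text.splitlines()
    for setting in JUPYTEXT_LINES:
        if setting not in lines:
            lines.append(setting)
    return "\n".join(lines) + "\n"


def update_config_file(config_path=DEFAULT_CONFIG):
    """Apply add_jupytext_settings to the config file in place."""
    path = Path(config_path)
    original = path.read_text() if path.exists() else ""
    path.write_text(add_jupytext_settings(original))
```

Calling update_config_file() once per machine appends the settings. Running it again leaves the file unchanged, because settings that are already present are skipped.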
Possible Git Workflows (2 Ways)
1. Saving only the Rmd file
We will remove the .ipynb files and make a small change to the .Rmd file.
ipynb files keep all cell output in their JSON source, so storing them in source control produces huge diffs even when the actual change is very small. They also hold every image and plot encoded as a string, which makes them heavy on source control. For these reasons, you can check in just the Rmd file.
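To get a feel for how much of a notebook file is output rather than code, here is a rough sketch using only the standard library. The output_share helper is hypothetical, and the notebook below is a hand-built stand-in whose "image" is just a long placeholder string, not a real plot.

```python
import json


def output_share(notebook):
    """Return the approximate fraction of a notebook's serialized size
    that is taken up by cell outputs."""
    total = len(json.dumps(notebook))
    outputs = sum(
        len(json.dumps(cell["outputs"]))
        for cell in notebook.get("cells", [])
        if cell.get("outputs")
    )
    return outputs / total


# Stand-in for a base64-encoded plot; not a real image.
fake_png = "A" * 20000

notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "data": {"image/png": fake_png},
                    "metadata": {},
                    "output_type": "display_data",
                }
            ],
            "source": ["plot_results()"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

share = output_share(notebook)
print(f"{share:.0%} of the serialized notebook is output")
```

To try it on a real file, load it first with json.loads(Path("notebook.ipynb").read_text()), where notebook.ipynb is a placeholder for your own file name.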
To check in only the Rmd file, add the line below on a new line in your .gitignore file.
*.ipynb
Note: All your teammates need to do this too, so that they don’t commit ipynb files to git.
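Because each teammate has to make the same .gitignore change, a small script can keep it consistent. This is a minimal sketch with a hypothetical helper name; it appends the pattern only when it is not already listed.

```python
from pathlib import Path


def ignore_pattern(gitignore_path, pattern="*.ipynb"):
    """Append pattern to the .gitignore file unless it is already listed.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(gitignore_path)
    text = path.read_text() if path.exists() else ""
    if pattern in (line.strip() for line in text.splitlines()):
        return False
    # Make sure the pattern starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + pattern + "\n")
    return True
```

For example, ignore_pattern(".gitignore") run from the repository root adds the line the first time and does nothing on later runs.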
If you have already checked in ipynb files, remove them from source control after checking in the .Rmd files. To remove files from the git repo but not from your local directory (Refer here):
git rm --cached file1.ipynb
git commit -m "remove file1.ipynb"
Next, we make a small change to the .Rmd file.
- Open the .Rmd file in vi
- Find the line that begins with jupytext_formats: ... and change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
- Save the file and exit vi
Note: this change to the .Rmd file is needed only once, when you create the file and push it to the git remote for the first time.
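If you would rather not make this edit by hand in vi, the same one-line change can be sketched in Python. The helper name is hypothetical, and the sample header in the usage note is an assumed layout for illustration; check it against the header of your own .Rmd file.

```python
import re

TARGET = "jupytext_formats: ipynb,Rmd:rmarkdown"

# Match a whole line that begins (after optional indentation) with the key.
_FORMATS_LINE = re.compile(r"^([ \t]*)jupytext_formats:.*$", flags=re.MULTILINE)


def set_jupytext_formats(rmd_text):
    """Replace the first line starting with 'jupytext_formats:' by TARGET.

    Leading indentation is kept, since the key may be nested inside the
    header. Text without such a line is returned unchanged.
    """
    return _FORMATS_LINE.sub(lambda m: m.group(1) + TARGET, rmd_text, count=1)
```

For instance, given the assumed header text "---\njupyter:\n  jupytext_formats: ipynb,Rmd\n---\n", the function returns the same text with that line replaced by the target value at the same indentation. Read the file, pass its text through the function, and write the result back.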
Cloning the repo and creating notebooks
Once you have removed the ipynb notebooks, you will need to recreate them whenever you clone the repo. Let’s see how.
- Open the .Rmd file in Jupyter from its file browser.
- You can use the .Rmd file directly, but it will not persist output between sessions, so we are going to create a Jupyter notebook.
- Click File->Save (Cmd/Ctrl+S).
- Close the .Rmd file (File->Close and Halt).
- Now open the ipynb in Jupyter.
- Start editing and saving. Your .Rmd file will keep updating itself.
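After a fresh clone it is easy to lose track of which .Rmd files still need the steps above. The following sketch, using a hypothetical helper name, lists every .Rmd file that does not yet have an .ipynb file next to it.

```python
from pathlib import Path


def rmd_without_notebook(repo_root):
    """Return the .Rmd files under repo_root with no .ipynb beside them."""
    root = Path(repo_root)
    return sorted(
        rmd
        for rmd in root.rglob("*.Rmd")
        if not rmd.with_suffix(".ipynb").exists()
    )
```

Running rmd_without_notebook(".") from the repository root gives the list of files to open and save in Jupyter; an empty list means every .Rmd file already has its notebook.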
2. Saving both the Rmd and ipynb files
Don’t add .ipynb to your .gitignore.
Make a small change to the .Rmd file.
- Open the .Rmd file in vi
- Find the line that begins with jupytext_formats: ... and change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
- Save the file and exit vi
Note: this change to the .Rmd file is needed only once, when you create the file and push it to the git remote for the first time.
In this workflow you save both files, so you don’t need to do anything extra after cloning: the .ipynb will already be available, so just start using it.
Remember to always put %autosave 0 in the first cell of your notebook and run it. Since autosave is disabled, remember to save your notebook frequently.
Updates
- You can also use the .Rmd format on its own. Basically, instead of using .ipynb, use .Rmd, and check .Rmd into git for version control. The disadvantage is that your plots, results, etc. will not be included when you check in to git, since .Rmd is a text-only format. Anyone who downloads your notebook has to run the code to see the plots and results.