Git Version Control with Jupyter Notebooks

Faizan Ahemad
Towards Data Science
4 min read · Oct 8, 2018

Don’t lose your version control history.

Version control is a vital part of data science workflows. Across multiple experiments, it is essential to know what changed and which team member made each update. Unfortunately, the default Jupyter setup is severely lacking in this regard.

Jupyter notebooks are stored in JSON format, which makes it difficult to read diffs of notebooks and to do code reviews as a team. To solve this, we will store notebooks in both .ipynb and .Rmd (R-Markdown) formats. This lets us diff as well as merge, since .Rmd is just a text file.
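To make the diff problem concrete, here is a minimal sketch using only the Python standard library. The notebook below is a hand-written stand-in rather than a real file, and `notebook_json` is a hypothetical helper. It illustrates that re-running a cell after a one-character edit changes the source, the execution count, and the stored output, and all of them show up in a line diff of the JSON.

```python
import difflib
import json


def notebook_json(source, execution_count, output_text):
    """Serialize a tiny one-cell notebook as JSON lines, similar to an .ipynb file."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }
    return json.dumps(notebook, indent=1, sort_keys=True).splitlines()


# Same cell before and after changing a single character and re-running it.
before = notebook_json("print(1 + 1)", 1, "2\n")
after = notebook_json("print(1 + 2)", 7, "3\n")

diff = list(difflib.unified_diff(before, after, lineterm="", n=0))
# Keep only added/removed content lines, not the "---"/"+++" file headers.
changed = [
    line
    for line in diff
    if line[:1] in ("+", "-") and not line.startswith(("+++", "---"))
]
print(f"{len(changed)} changed lines for a one-character code edit")
```

A real notebook with plots makes this much worse, because each image is stored as a long encoded string inside the same JSON.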

An Intro to R-Markdown and advantages over plain markdown

R-Markdown is the same as the Markdown format, with the added advantage that you can produce publication-quality PDF, Word, and other formats from it. So if your notebook is stored as .Rmd, you can not only version control it but also convert it into a format suitable for publishing a paper.

More info about R-Markdown can be found here. Also, if you intend to write a book using R-Markdown, check out the R bookdown package. It allows you to write a book using markdown+code+outputs.

Ways to save in multiple formats

Why did we choose Jupytext even though the 1st method is easier to set up?

The 1st method saves the notebook only as code, whereas Jupytext saves it as a readable, Markdown-style document, so it renders well on GitHub and keeps your documentation intact.

Install and setup Jupytext

You need to perform this setup on every system that will use the git repo with notebooks. That means all your teammates must have Jupytext configured.

We use the Jupytext library (https://github.com/mwouts/jupytext).

pip install jupytext --upgrade
  • Next generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
  • Edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class="jupytext.TextFileContentsManager"
c.ContentsManager.default_jupytext_formats = ".ipynb,.Rmd"
  • Then restart Jupyter, i.e. run
jupyter notebook

Note: the .jupyter directory is usually located in your home directory.
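Since every teammate has to repeat this edit, it may help to script it. The following is a minimal sketch, not part of Jupytext: `ensure_jupytext_settings` is a hypothetical helper that appends the two settings shown above only if they are missing, so it is safe to run more than once. The demo runs against a throwaway file; pointing it at the real config under your home directory is left to you.

```python
import tempfile
from pathlib import Path

# The two settings from the step above, copied verbatim.
JUPYTEXT_SETTINGS = [
    'c.NotebookApp.contents_manager_class="jupytext.TextFileContentsManager"',
    'c.ContentsManager.default_jupytext_formats = ".ipynb,.Rmd"',
]


def ensure_jupytext_settings(config_path):
    """Append any missing setting to the config file and return the lines added."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [line for line in JUPYTEXT_SETTINGS if line not in present]
    if missing:
        # Start on a fresh line if the file does not already end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + prefix + "\n".join(missing) + "\n")
    return missing


# Demo on a temporary file standing in for jupyter_notebook_config.py.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "jupyter_notebook_config.py"
    demo_config.write_text("# existing settings\n")
    first_run = ensure_jupytext_settings(demo_config)   # adds both lines
    second_run = ensure_jupytext_settings(demo_config)  # nothing left to add
    final_lines = demo_config.read_text().splitlines()

print(first_run)
print(second_run)
```

For a real setup you would call it with something like `Path.home() / ".jupyter" / "jupyter_notebook_config.py"`, assuming that is where your config lives.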

  • Open an existing notebook or create a new one.
  • Disable Jupyter’s autosave so that round-trip editing works. Just add the following to the top cell and execute it.
%autosave 0
  • You can edit the .Rmd file in Jupyter as well as in a text editor, and you can use it to review changes in version control.
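To give a feel for why the .Rmd file is easy to review, here is a simplified stand-in for the conversion. It is not Jupytext's converter, and `cells_to_rmd` is a hypothetical function. It assumes each cell is a plain dictionary, writes markdown cells as-is, wraps code cells in `{python}` chunks, and drops outputs and execution counts, which is what keeps the text version small.

```python
FENCE = "`" * 3  # three backticks, the R-Markdown chunk delimiter


def cells_to_rmd(cells):
    """Render cells as R-Markdown-style text: prose as-is, code inside {python} chunks."""
    parts = []
    for cell in cells:
        if cell["cell_type"] == "markdown":
            parts.append(cell["source"])
        elif cell["cell_type"] == "code":
            # Outputs and execution counts are deliberately left out.
            parts.append(FENCE + "{python}\n" + cell["source"] + "\n" + FENCE)
    return "\n\n".join(parts) + "\n"


# Hand-written cells for illustration only.
cells = [
    {"cell_type": "markdown", "source": "# Experiment 1"},
    {
        "cell_type": "code",
        "source": "print(1 + 1)",
        "execution_count": 1,
        "outputs": ["2\n"],
    },
]
rmd_text = cells_to_rmd(cells)
print(rmd_text)
```

Editing the code in such a file changes only the lines you touched, so an ordinary `git diff` stays readable.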

Possible Git Workflows (2 Ways)

1. Saving only the Rmd file

We will remove the .ipynb files and make a small change to the .Rmd file.

ipynb files keep all of the output in their JSON source, so when they are stored in source control they produce huge diffs even when the actual change is very small. They also store every image and plot as an encoded string, which makes them heavy on source control. For these reasons, you can check in just the Rmd file.

To do that, add the line below on a new line in your .gitignore file.

*.ipynb

Note: All your teammates also need to do this so that they don’t commit ipynb files into git.

If you have already checked in ipynb files, remove them from source control after checking in the .Rmd files. To remove files from the git repo but not from your local directory (Refer here):

git rm --cached file1.ipynb
git commit -m "remove file1.ipynb"
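With many notebooks, it is easy to untrack an .ipynb before its .Rmd has been checked in. The sketch below is one hedged way to guard against that: `untrack_commands` is a hypothetical helper that takes a list of tracked paths (for example the output of `git ls-files`) and builds `git rm --cached` commands only for notebooks whose .Rmd twin is also tracked. The file names are made up for illustration, and the sketch only builds the commands without running them.

```python
from pathlib import PurePosixPath


def untrack_commands(tracked_paths):
    """Build `git rm --cached` commands for notebooks whose .Rmd twin is also tracked."""
    tracked = set(tracked_paths)
    commands = []
    for path in sorted(tracked):
        if not path.endswith(".ipynb"):
            continue
        rmd_twin = str(PurePosixPath(path).with_suffix(".Rmd"))
        if rmd_twin in tracked:
            commands.append(["git", "rm", "--cached", path])
    return commands


# Hypothetical list of tracked files, as `git ls-files` might print them.
tracked_files = ["analysis.ipynb", "analysis.Rmd", "draft.ipynb", "README.md"]
commands = untrack_commands(tracked_files)
for command in commands:
    print(" ".join(command))
```

Here `draft.ipynb` is skipped because no `draft.Rmd` is tracked yet. Each command list could be passed to `subprocess.run(command, check=True)` once you are happy with the selection.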

Next, we make a small change to the .Rmd file.

  • Open the .Rmd file in vi
  • On the line that begins with jupytext_formats: ... , change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
  • Save file and exit vi

Note: the change to the .Rmd file is needed only once, when you create it and push it to the git remote for the first time.
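If you would rather not open each file in vi, the same one-line edit can be scripted. This is a minimal sketch under the assumption that the header contains a line starting with `jupytext_formats:` as described above. `set_jupytext_formats` is a hypothetical helper, and the sample header is a trimmed, made-up example rather than exact Jupytext output.

```python
import re

# Target value taken from the step above.
NEW_VALUE = "ipynb,Rmd:rmarkdown"


def set_jupytext_formats(rmd_text, new_value=NEW_VALUE):
    """Replace the value on the first line that begins with `jupytext_formats:`."""
    pattern = re.compile(r"^([ \t]*jupytext_formats:).*$", flags=re.MULTILINE)
    # A function replacement keeps the indentation and avoids escape surprises.
    return pattern.sub(lambda m: f"{m.group(1)} {new_value}", rmd_text, count=1)


# Trimmed, hypothetical header for illustration.
sample = "---\njupyter:\n  jupytext_formats: ipynb,Rmd\n---\n\n# Title\n"
updated = set_jupytext_formats(sample)
print(updated)
```

Running it a second time leaves the text unchanged, and the rest of the file is untouched.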

Cloning the repo and creating the notebook

Once the ipynb notebooks have been removed from the repo, a fresh clone contains only the .Rmd files, so you need to recreate the notebooks from them. Let’s see how.

  • Open the .Rmd file in jupyter from its file browser.
  • You can use the .Rmd file directly but it will not persist output between sessions, so we are gonna create a jupyter notebook.
  • Click File->Save (Cmd/Ctrl+S).
  • Close the .Rmd file (File->Close and Halt)
  • Now open the ipynb in Jupyter.
  • Start editing and saving. Your .Rmd file will keep updating itself.
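After a fresh clone it can be handy to see which .Rmd files still need the steps above. The following sketch uses only the standard library. `rmd_without_notebook` is a hypothetical helper, and the temporary directory with made-up file names stands in for a cloned repo.

```python
import tempfile
from pathlib import Path


def rmd_without_notebook(repo_dir):
    """Return .Rmd files under repo_dir that have no .ipynb file next to them."""
    repo = Path(repo_dir)
    return sorted(
        rmd for rmd in repo.rglob("*.Rmd") if not rmd.with_suffix(".ipynb").exists()
    )


# Demo on a throwaway directory standing in for a fresh clone.
with tempfile.TemporaryDirectory() as tmp:
    clone = Path(tmp)
    (clone / "analysis.Rmd").write_text("# Analysis\n")
    (clone / "report.Rmd").write_text("# Report\n")
    (clone / "report.ipynb").write_text("{}")
    pending = [path.name for path in rmd_without_notebook(clone)]

print(pending)
```

Each file it reports is one you would open in Jupyter and save once to get its paired notebook.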

2. Saving both the Rmd and ipynb files

Don’t add .ipynb to your .gitignore.

Make a small change to .Rmd file.

  • Open the .Rmd file in vi
  • On the line that begins with jupytext_formats: ... , change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
  • Save file and exit vi

Note: the change to the .Rmd file is needed only once, when you create it and push it to the git remote for the first time.

In this workflow you save both files, so you don’t need to do anything extra after cloning. The .ipynb will already be available, so just start using it once you have cloned the repo.

Remember to put %autosave 0 in the 1st cell of your notebook and always run it. And since autosave is disabled, remember to save your notebook frequently.

Updates

  • You can also use the .Rmd format on its own. Basically, instead of using .ipynb, use .Rmd, and check the .Rmd into git for version control. The disadvantage is that your plots, results, etc. will not be included when you check in to git, since .Rmd is a text-only format. Anyone who downloads your notebook has to run the code to see the plots and results.
