Fix merge conflicts in jupyter notebooks

When working with Jupyter notebooks (which are JSON files behind the scenes) and GitHub, a merge conflict will often break the notebooks you are working on, because git adds new lines to the notebook source file. This module defines the function fix_conflicts, which repairs those notebooks for you and attempts to merge standard conflicts automatically. The remaining conflicts are delimited by markdown cells like this:

Figure: a notebook fixed after a merge conflict. The file couldn't be opened before the command was run; afterwards, the conflict is highlighted by markdown cells.

Walk cells

This is an example of a broken notebook, which we defined in tst_nb. The JSON format is broken by the lines git adds automatically. Such a file can no longer be opened in Jupyter Notebook, leaving the user with no choice but to fix the text file manually.

print(tst_nb)
{
 "cells": [
  {
   "cell_type": "code",
<<<<<<< HEAD
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "z=3
",
    "z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
=======
   "execution_count": 5,
>>>>>>> a7ec1b0bfb8e23b05fd0a2e6cafcb41cd0fb1c35
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
<<<<<<< HEAD
     "execution_count": 7,
=======
     "execution_count": 5,
>>>>>>> a7ec1b0bfb8e23b05fd0a2e6cafcb41cd0fb1c35
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x=3
",
    "y=3
",
    "x+y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Note that in this example, the second conflict is easily solved: it only concerns the execution count of the second cell, and choosing either option has no real impact on your notebook. This is the kind of conflict fix_conflicts will (by default) fix automatically. The first conflict is more complicated, as it spans two cells and one cell is present in one version but not in the other. Such conflicts (and, more generally, those where the inputs of the cells change from one version to the other) aren't fixed automatically, but fix_conflicts will return a proper JSON file in which the annotations introduced by git are placed in markdown cells.

The first step is to walk the raw text file and extract the cells. We can't load it as JSON since it's broken, so we have to parse the text ourselves.
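To see why a regular JSON parser is not an option here, the minimal sketch below feeds a conflicted file to Python's standard json module. The sample text and the helper name is_valid_json are made up for illustration and are not part of this module.

```python
import json

# A tiny, made-up notebook-like file containing git conflict markers
broken = '\n'.join([
    '{',
    ' "cells": [],',
    '<<<<<<< HEAD',
    ' "nbformat": 4,',
    '=======',
    ' "nbformat": 4,',
    '>>>>>>> other-branch',
    ' "nbformat_minor": 2',
    '}',
])

def is_valid_json(txt):
    "Hypothetical helper: True if `txt` parses as JSON, False otherwise"
    try:
        json.loads(txt)
        return True
    except json.JSONDecodeError:
        return False

# The conflict markers are not valid JSON tokens, so parsing fails
print(is_valid_json(broken))
print(is_valid_json('{"cells": []}'))
```

Any tool that relies on json.loads (including the notebook server) fails in the same way, which is why the text has to be walked line by line.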

extract_cells

extract_cells(raw_txt)

Manually extract cells in potentially broken JSON raw_txt

This function returns the beginning of the text (before the cells are defined), the list of cells and the end of the text (after the cells are defined).

start,cells,end = extract_cells(tst_nb)
test_eq(len(cells), 3)
test_eq(cells[0], """  {
   "cell_type": "code",
<<<<<<< HEAD
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "z=3\n",
    "z"
   ]
  },""")
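As a rough illustration of the idea, here is a minimal sketch of a line-based cell extractor. It is not the module's implementation: the name sketch_extract_cells is hypothetical, and it assumes the usual notebook layout (one-space indentation, a non-empty "cells" list, each cell closed by a line that is exactly two spaces followed by } or },).

```python
def sketch_extract_cells(raw_txt):
    "Hypothetical sketch: split `raw_txt` into (start, cells, end) by walking lines"
    lines = raw_txt.split('\n')
    # Everything up to and including the line that opens the "cells" list
    i = next(k for k, l in enumerate(lines) if l.startswith(' "cells": ['))
    start = '\n'.join(lines[:i + 1])
    cells, current = [], []
    i += 1
    # Accumulate lines until the list closes; a cell ends on '  },' or '  }'
    while i < len(lines) and not lines[i].startswith(' ]'):
        current.append(lines[i])
        if lines[i] in ('  },', '  }'):
            cells.append('\n'.join(current))
            current = []
        i += 1
    # Keep any leftover lines (e.g. stray conflict markers) rather than drop them
    if current: cells.append('\n'.join(current))
    end = '\n'.join(lines[i:])
    return start, cells, end

# A small made-up notebook with one conflict inside the first cell
nb = '\n'.join([
    '{', ' "cells": [',
    '  {', '   "cell_type": "code",',
    '<<<<<<< HEAD', '   "execution_count": 7,',
    '=======', '   "execution_count": 5,',
    '>>>>>>> other-branch',
    '   "metadata": {},', '   "outputs": [],',
    '   "source": [', '    "x=3"', '   ]', '  },',
    '  {', '   "cell_type": "code",', '   "execution_count": null,',
    '   "metadata": {},', '   "outputs": [],', '   "source": []', '  }',
    ' ],', ' "metadata": {},', ' "nbformat": 4,', ' "nbformat_minor": 2', '}',
])
start, cells, end = sketch_extract_cells(nb)
```

Because conflict markers are kept inside the cell text, joining start, the cells and end with newlines gives back the original file unchanged.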

When walking the broken cells, we will add conflict markers, as markdown cells, before and after the cells with conflicts. To do that we use this function.

get_md_cell

get_md_cell(txt)

A markdown cell with txt

tst = '''  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A bit of markdown"
   ]
  },'''
assert get_md_cell("A bit of markdown") == tst
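One way to produce text in this exact shape is to serialize a small dict with the standard json module and re-indent it. The sketch below is only an illustration under that assumption; sketch_md_cell is a hypothetical name and the module may build the string differently.

```python
import json

def sketch_md_cell(txt):
    "Hypothetical sketch: text of a markdown cell containing `txt`"
    cell = {"cell_type": "markdown", "metadata": {}, "source": [txt]}
    # indent=1 matches the one-space-per-level layout of notebook files
    body = json.dumps(cell, indent=1)
    # Cells sit two levels deep in the file, so shift every line by two spaces,
    # and add the trailing comma that separates the cell from the next one
    return '\n'.join('  ' + line for line in body.split('\n')) + ','

print(sketch_md_cell("A bit of markdown"))
```

Using json.dumps also takes care of escaping quotes or backslashes that might appear in txt.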

same_inputs

same_inputs(t1, t2)

Test if the cells described in t1 and t2 have the same inputs

ts = ['''  {
   "cell_type": "code",
   "source": [
    "'''+code+'''"
   ]
  },''' for code in ["a=1", "b=1",  "a=1"]]
assert same_inputs(ts[0],ts[2])
assert not same_inputs(ts[0], ts[1])
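A comparison of this kind can be sketched by pulling the lines of the "source" list out of each cell's text and comparing them. The helpers below (sketch_source_lines, sketch_same_inputs) are hypothetical and assume each source string sits on its own line, as in the examples above.

```python
def sketch_source_lines(cell_txt):
    "Hypothetical sketch: the stripped lines inside the cell's \"source\" list"
    out, inside = [], False
    for line in cell_txt.split('\n'):
        s = line.strip()
        if s.startswith('"source": ['):
            # An empty list ('"source": []') closes on the same line
            if s.rstrip(',').endswith(']'): break
            inside = True
        elif inside:
            if s.startswith(']'): break
            out.append(s)
    return out

def sketch_same_inputs(t1, t2):
    "Hypothetical sketch: True if both cell texts have identical source lines"
    return sketch_source_lines(t1) == sketch_source_lines(t2)

def make_cell(count, code):
    "Build the text of a made-up code cell for the demo"
    return '\n'.join(['  {', '   "cell_type": "code",',
                      f'   "execution_count": {count},',
                      '   "source": [', f'    "{code}"', '   ]', '  },'])

# Same code, different execution counts: the inputs still match
print(sketch_same_inputs(make_cell(7, "x+y"), make_cell(5, "x+y")))
# Different code: the inputs differ
print(sketch_same_inputs(make_cell(7, "x+y"), make_cell(7, "x-y")))
```

Ignoring everything outside the source list is what makes conflicts on metadata or outputs safe to resolve automatically.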

analyze_cell

analyze_cell(cell, cf, names, prev=None, added=False, fast=True, trust_us=True)

Analyze and solve conflicts in cell

This is the main function used in the walk through the cells of a notebook. cell is the cell we're at, and cf is the conflict state: 0 if we're not in any conflict, 1 if we are inside the first part of a conflict (between <<<<<<< and =======) and 2 if we are inside the second part of a conflict. names contains the names of the branches (they start at [None,None] and get updated as we pass through conflicts). prev contains a copy of what should be included at the start of the second version (if cf=1 or cf=2). added starts at False and keeps track of whether we added any markdown cells (this flag lets us know whether a fast merge left any conflicts at the end). fast and trust_us are passed along by fix_conflicts: if fast is True, we don't point out a conflict between cells when the inputs in the two versions are the same. Instead we merge using the local or the remote branch, depending on trust_us.

The function then returns the updated text (with one or several cells, depending on the conflicts to solve), the updated cf, names, prev and added.

tst = '\n'.join(['a', f'{conflicts[0]} HEAD', 'b', conflicts[1], 'c'])
c,cf,names,prev,added = analyze_cell(tst, 0, [None,None], None, False,fast=False)
test_eq(c, get_md_cell('`<<<<<<< HEAD`')+'\na\nb')
test_eq(cf, 2)
test_eq(names, ['HEAD', None])
test_eq(prev, ['a\nc'])
test_eq(added, True)

In this example, we enter cell tst with no conflict state. At the end of the cell, we are still in the second part of the conflict, hence cf=2. The result contains a marker for the branch HEAD, then the whole cell in version 1 (a + b). We save a (which comes before the conflict and is therefore common to the two versions) and c (only in version 2) in prev for the next cell (which should contain the resolution of this conflict).
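The conflict state described above can be illustrated with a small line-by-line state machine that splits a cell's text into its two versions. This is a sketch under assumptions, not the module's code: sketch_split_cell and the CONFLICTS tuple are hypothetical names, and markers are assumed to start at column 0 as git writes them.

```python
CONFLICTS = ('<<<<<<<', '=======', '>>>>>>>')  # assumed marker prefixes

def sketch_split_cell(cell, cf, names):
    "Hypothetical sketch: split `cell` into (version 1, version 2, new cf, names)"
    v1, v2, names = [], [], list(names)
    for line in cell.split('\n'):
        if line.startswith(CONFLICTS[0]):
            cf = 1  # entering the first part of a conflict
            names[0] = line[len(CONFLICTS[0]):].strip()
        elif line.startswith(CONFLICTS[1]):
            cf = 2  # switching to the second part
        elif line.startswith(CONFLICTS[2]):
            cf = 0  # conflict closed
            names[1] = line[len(CONFLICTS[2]):].strip()
        else:
            # Lines outside a conflict belong to both versions
            if cf in (0, 1): v1.append(line)
            if cf in (0, 2): v2.append(line)
    return '\n'.join(v1), '\n'.join(v2), cf, names

# Same input as the test above: 'a', an opening marker, 'b', a separator, 'c'
tst = '\n'.join(['a', '<<<<<<< HEAD', 'b', '=======', 'c'])
v1, v2, cf, names = sketch_split_cell(tst, 0, [None, None])
print(repr(v1), repr(v2), cf, names)
```

Here v1 is 'a\nb', v2 is 'a\nc' and cf ends at 2, which lines up with the cell text and the prev value in the test above. Calling the function again on the next cell with cf=2 continues the second part until the closing marker is reached.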

Main function

fix_conflicts

fix_conflicts(fname, fast=True, trust_us=True)

Fix broken notebook in fname

The function begins by backing up the notebook fname to fname.bak in case something goes wrong. Then it parses the broken JSON, solving conflicts in cells. If fast=True, every conflict that only involves the metadata or outputs of cells is solved automatically by using the local (trust_us=True) or the remote (trust_us=False) branch. Otherwise, or for conflicts involving the inputs of cells, the JSON is repaired by including the two versions of the conflicted cell(s), with markdown cells indicating the conflicts. You will then be able to open the notebook again, search for the conflicts (look for <<<<<<<) and fix them as you wish.

If fast=True, the function will print a message indicating whether the notebook was fully merged or if conflicts remain.
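The backup step and the final check for leftover conflicts could look roughly like the sketch below. Both helpers (sketch_backup, sketch_has_conflicts) are hypothetical, and the backup name is assumed to be the original file name with .bak appended, following the description above.

```python
import shutil
from pathlib import Path

MARKERS = ('<<<<<<<', '=======', '>>>>>>>')  # assumed git marker prefixes

def sketch_backup(fname):
    "Hypothetical sketch: copy `fname` to `fname` + '.bak' and return the backup path"
    fname = Path(fname)
    bak = fname.with_name(fname.name + '.bak')
    shutil.copy(fname, bak)
    return bak

def sketch_has_conflicts(raw_txt):
    "Hypothetical sketch: True if any line of `raw_txt` starts with a conflict marker"
    # str.startswith accepts a tuple of prefixes
    return any(line.startswith(MARKERS) for line in raw_txt.split('\n'))
```

Taking the backup before touching the file means the original conflicted text can always be restored if the automatic merge picks the wrong side.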