Git: Improve diff generation with diff.algorithm=histogram

This post is an adapted extract from my book Boost Your Git DX, available now.

Contrary to common belief, Git doesn’t store diffs. It stores snapshots of whole files, heavily compressed to reduce redundancy. When a diff needs to be displayed, Git generates it on the fly.

Git has several built-in algorithms for generating diffs. These work by aligning matching lines in the two versions of a file to detect removals and additions.

The default algorithm, “myers”, is decent and fast, but for some changes it produces hard-to-read output. The “histogram” algorithm tries to correct this by aligning the rarest lines first. Switching to it can also lead to more accurate code change metrics and use less CPU time.

The “histogram” algorithm was added in Git 1.7.7 (2011) and was noted as an improvement at the time. It has not been made the default, though, due to a lack of hard data. At least one academic paper recommends using “histogram”, so Git may make it the default at some point.

Make “histogram” your default algorithm by setting the diff.algorithm option:

$ git config --global diff.algorithm histogram
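
If you want to try “histogram” before changing your global config, Git also accepts the setting per repository and per command. The sketch below runs in a throwaway repository created with mktemp (an assumption for illustration, so it touches nothing real). The --diff-algorithm and --histogram options are standard git diff flags.

```shell
# Work in a throwaway repository so nothing here affects real projects.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Without --global, the setting applies to this repository only.
git config diff.algorithm histogram

# Read back the value Git will use here.
effective=$(git config --get diff.algorithm)
echo "diff.algorithm is $effective"

# Override the algorithm for a single command, whatever the config says.
git diff --diff-algorithm=myers
git diff --histogram  # shorthand for --diff-algorithm=histogram
```

In the empty repository both git diff commands print nothing. Run them in a real repository with uncommitted changes to compare the output.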

Let’s look at an example where “histogram” improves on “myers”. Below are diffs from each algorithm for changes to a C function called getDinosaur().

With “myers”:

 Dinosaur* getDinosaur(char* name)
 {
-  char* dataURL = getResource("dinosaurs", name);
-
-  if (dataURL != NULL)
+  if (name == NULL)
   {
-    return createDinosaur(dataURL);
+      log.error("Dinosaur name is null!");
+      return NULL;
   }
-  else
+
+  char* dataURL = getResource("dinosaurs", name);
+
+  if (dataURL == NULL)
   {
     fprintf(stderr, "Couldn't find data: %s", name);
+    return NULL;
   }
-  return NULL;
+  else
+    return createDinosaur(dataURL);
 }

With “histogram”:

 Dinosaur* getDinosaur(char* name)
 {
+  if (name == NULL)
+  {
+      log.error("Dinosaur name is null!");
+      return NULL;
+  }
+
   char* dataURL = getResource("dinosaurs", name);

-  if (dataURL != NULL)
-  {
-    return createDinosaur(dataURL);
-  }
-  else
+  if (dataURL == NULL)
   {
     fprintf(stderr, "Couldn't find data: %s", name);
+    return NULL;
   }
-  return NULL;
+  else
+    return createDinosaur(dataURL);
 }
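
To compare the two algorithms on your own machine, one option is git diff --no-index, which diffs two files outside any repository. In this sketch the file names before.c and after.c are hypothetical, and their contents are reconstructed from the diffs shown here. Your output may differ slightly between Git versions.

```shell
# Scratch directory for the two versions of the function.
workdir=$(mktemp -d)

# The function before the change (reconstructed from the diffs).
cat > "$workdir/before.c" <<'EOF'
Dinosaur* getDinosaur(char* name)
{
  char* dataURL = getResource("dinosaurs", name);

  if (dataURL != NULL)
  {
    return createDinosaur(dataURL);
  }
  else
  {
    fprintf(stderr, "Couldn't find data: %s", name);
  }
  return NULL;
}
EOF

# The function after the change (reconstructed from the diffs).
cat > "$workdir/after.c" <<'EOF'
Dinosaur* getDinosaur(char* name)
{
  if (name == NULL)
  {
      log.error("Dinosaur name is null!");
      return NULL;
  }

  char* dataURL = getResource("dinosaurs", name);

  if (dataURL == NULL)
  {
    fprintf(stderr, "Couldn't find data: %s", name);
    return NULL;
  }
  else
    return createDinosaur(dataURL);
}
EOF

# --no-index compares two paths on disk, so no repository is needed.
# git diff exits with status 1 when the files differ, hence "|| true".
myers=$(cd "$workdir" && git diff --no-index --diff-algorithm=myers before.c after.c || true)
histogram=$(cd "$workdir" && git diff --no-index --diff-algorithm=histogram before.c after.c || true)

printf '%s\n\n%s\n' "$myers" "$histogram"
```

Both diffs describe the same change, so the net number of added lines is identical. Only the choice of which lines to treat as unchanged context differs.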

Note these improvements in the “histogram” version:

  1. The new block of code checking name against NULL appears as a single contiguous group of added lines.
  2. The line calling getResource() is left as unchanged context, rather than being marked as removed and then added again.
  3. The changes to the if-return block are separated from the other changes, making them a bit easier to read.

These improvements occur because “histogram” detects that lines containing just “{” and “}” occur often, so it deprioritizes aligning them.
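
You can get a rough feel for the frequencies involved with standard shell tools. This is only an illustration of counting repeated lines, not how Git implements “histogram”. The input is the pre-change function reconstructed from the diffs above, written to a temporary file.

```shell
# The function before the change (reconstructed from the diffs).
before=$(mktemp)
cat > "$before" <<'EOF'
Dinosaur* getDinosaur(char* name)
{
  char* dataURL = getResource("dinosaurs", name);

  if (dataURL != NULL)
  {
    return createDinosaur(dataURL);
  }
  else
  {
    fprintf(stderr, "Couldn't find data: %s", name);
  }
  return NULL;
}
EOF

# Count how often each distinct line occurs, most frequent first.
# LC_ALL=C makes sort group lines byte by byte, whitespace included.
counts=$(LC_ALL=C sort "$before" | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

Here the indented “{” and “}” lines each appear twice, while every other line appears only once. Lines that occur once, such as the getResource() call, are the kind of rare line that makes a reliable anchor.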

Fin

May you understand diffs with ease,

—Adam

