Making helm-projectile-find-file Fast In Large Projects

I am an avid Emacs user and helm and projectile are big parts of my workflow, which I combine using helm-projectile. This enables helm file completion using projectile as the backend.[^1] When I'm in a project and want to open a new file I type SPC p f to invoke helm-projectile-find-file. This opens a fuzzy matching helm dialog that navigates the files in my current project.

A few months ago I started a new job where I'm working in a very large git repository (containing well over 100k checked-in files). For a project this size helm-projectile-find-file with fuzzy matching is very slow. I'd hit SPC p f to find a file, and it would take 3 to 5 seconds to update the helm dialog after every keystroke. If I was typing a file name like logging.h it would take helm several seconds to refresh after I typed the l, several more to refresh after typing the o, and so on. I would end up typing most of the file name blindly. Worse, there seems to be a race condition in helm in this situation: sometimes the helm dialog gets out of sync and stops updating or responding to keystrokes if too many keys are typed blindly this way.

A Red Herring: Projectile Caching

In my initial investigations to fix this the only thing I was able to find was the projectile-enable-caching variable, which can be set to t to enable projectile caching:

;; XXX: Don't actually do this!
(setq projectile-enable-caching t)

The idea is that projectile normally generates the candidate list for helm-projectile-find-file by running git ls-files each time the command is invoked. It uses git rather than searching the filesystem so that the list contains only files that are actually checked in, and git-ignored files don't appear in the helm query. Setting projectile-enable-caching causes the git ls-files list to be generated once and then cached, avoiding the need to invoke git every time.

Unfortunately this didn't speed anything up. Helm felt just as slow as before. To check, I ran git ls-files a few times and found it took only 0.40s of wall time, an insignificant fraction of the time I was spending waiting. Watching top while I searched for a file confirmed this: the Emacs process was spinning at 100% CPU (in a single thread, of course), indicating that the time was being spent in some CPU-bound operation inside Emacs rather than waiting for the results of the git command.
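If you'd rather take the same measurement from inside Emacs, something like the following should work. It is only a rough sketch: my/time-git-ls-files is a made-up helper name, and it times a plain git ls-files, which may not be the exact command line your projectile version runs.

```elisp
;; Sketch: time one run of `git ls-files' from inside Emacs.
;; `my/time-git-ls-files' is a hypothetical helper, not part of projectile.
(require 'benchmark)

(defun my/time-git-ls-files (dir)
  "Return (SECONDS . FILE-COUNT) for one run of `git ls-files' in DIR.
DIR is assumed to be inside a git work tree."
  (let* ((default-directory (file-name-as-directory (expand-file-name dir)))
         (files nil)
         ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
         (timing (benchmark-run 1
                   (setq files (process-lines "git" "ls-files")))))
    (cons (car timing) (length files))))

;; Example, evaluated in a buffer visiting a file in the project:
;; (my/time-git-ls-files default-directory)
```

If the number of seconds that comes back is small compared to the delay you see after each keystroke, the file listing is not your bottleneck either.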

One last note: if you turn on projectile-enable-caching, the projectile file cache will get out of sync as the repository changes. New files won't appear in your queries, and deleted or moved files will still appear even though they're no longer present. If you use this option you'll have to refresh the projectile cache by hand whenever you notice things are out of sync. I would recommend using this option only as a last resort, once you've confirmed that your VCS is extremely slow.
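For reference, the manual refresh is done with projectile's projectile-invalidate-cache command. A minimal sketch, assuming projectile is installed and the current buffer belongs to the project whose cache you want to drop:

```elisp
;; Sketch: throw away projectile's cached file list for the current project.
;; Interactively this is M-x projectile-invalidate-cache.
(require 'projectile)

;; A nil argument means "use the current project" instead of prompting
;; for one; the next projectile file lookup rebuilds the list.
(projectile-invalidate-cache nil)
```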

The Solution: Exact Matching Helm Queries

Recently a fellow helm and projectile user at work found a solution to this problem! It turns out that there's a simple way to disable helm fuzzy matching: you precede the helm query with a space. For example, let's say I know the file I want to open is named logging.h (but I don't necessarily know or want to type out the directory it's in). Instead of typing just logging.h into the helm dialog, I would enter a leading space character before typing out the filename. I call this mode "exact matching".

The difference between exact matching and the default fuzzy matching is that exact matching will only match file names that contain the exact string you type as a substring, whereas fuzzy matching will match a broader selection. For example, consider the query foo. In fuzzy matching the string foo is implicitly converted to a regex like /f.*o.*o/. This will match any filename that has an f, followed by an o anywhere later in the string, followed by another o anywhere after that. In exact matching mode the same query is converted to the fixed regex /foo/, i.e. a regex that matches any string containing the literal substring foo.

As a concrete example: foo will match a file named src/files/blog.socket in fuzzy matching mode, but not in exact matching mode. Both modes will match a file named src/files/foo.socket. In general, fuzzy matching results are always a superset of exact matching results.
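To make the difference concrete, here is a small Emacs Lisp sketch of the two modes as described above. This is a simplified model and not helm's actual implementation (helm's real matcher does more than this), and the my/ helper names are made up for illustration.

```elisp
;; Simplified model of the two matching modes. Helm's real matcher is
;; more involved; these helper names are hypothetical.

(defun my/fuzzy-regexp (query)
  "Return a regexp matching QUERY's characters in order, with any gaps."
  (mapconcat (lambda (ch) (regexp-quote (char-to-string ch)))
             query
             ".*"))

(defun my/exact-regexp (query)
  "Return a regexp matching QUERY as a literal substring."
  (regexp-quote query))

(defun my/matches-p (regexp file)
  "Return t if REGEXP matches somewhere in FILE, case-sensitively."
  (let ((case-fold-search nil))
    (and (string-match-p regexp file) t)))

(my/fuzzy-regexp "foo")                                         ; "f.*o.*o"
(my/matches-p (my/fuzzy-regexp "foo") "src/files/blog.socket")  ; t
(my/matches-p (my/exact-regexp "foo") "src/files/blog.socket")  ; nil
(my/matches-p (my/fuzzy-regexp "foo") "src/files/foo.socket")   ; t
(my/matches-p (my/exact-regexp "foo") "src/files/foo.socket")   ; t
```

The fuzzy regexp has far more ways to match each candidate, which is consistent with it producing a larger result set than the exact one.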

Using exact matching might sound less convenient than fuzzy matching, but I've found it often works just as well (and sometimes better). It's very common to know an exact substring of a filename, but not to know what directory it's in, or some leading or trailing component of the filename. Exact matching works great in this situation. In the example I gave earlier, in a huge project I might know there is a header named logging.h somewhere in the project without being sure of the exact subdirectory it's in, and exact matching works perfectly for this. In fact, since I learned this trick I use it quite often on smaller projects too, because it gives better matching results in some cases, particularly when typing a very short string.

[^1]: I actually use Spacemacs, which is a wonderful Emacs distribution that has helm and projectile set up and integrated out of the box.