Write cleaner, more reusable code by forgetting what you know.

I wrote the first draft of this post while I was still at Sequin, and it uses examples from their domain. Publishing here with permission.

When implementing a new feature in a codebase, we usually have the whole problem loaded into our brains, and we’re intimately familiar with how all of the relevant pieces work and how they fit together. This is good, as it helps us get things done quickly.

But if we’re not careful, we can use this deep knowledge to accidentally embed implicit assumptions into the code we’re writing. When we come back to it later, we find it harder to understand, more difficult to reuse in a slightly different context, and more likely to need refactoring the next time we’re in the area.

One of the ways I try to avoid this mess is to play dumb. That is, I try to take off my omniscient-creator-of-all-the-things hat and put myself in the shoes of the code I’m writing. I’ll ask some questions from that perspective:

  • What is my job?
  • What should I know?
  • What do I have no business knowing?

This technique applies at many levels: subsystems, individual services, modules/classes, and functions/methods.

Sandi Metz says it very well in her article, What Does OO Afford?

“OO asks you to blithely trust others to correctly do their bit. It wants you to strive for ignorance to protect yourself from the consequences of distant changes to other objects.”

Let’s look at some examples. The first two are both in Elixir, which is a functional language, so I’ll be talking about modules and functions below. But the same principles apply in an object-oriented language where we work with classes, objects, and methods.

Module Level

In the Sequin codebase, a Job contains the state needed to replicate a collection of objects from an upstream API to a customer’s database (e.g. HubSpot contacts). Every time a job runs, it fetches a page of objects and writes them to the database. We then update the job with information needed to fetch the next page as well as some other metadata.

We have two general kinds of jobs for each collection: backfill and delta. Backfill jobs are responsible for fetching historical data, and delta jobs are responsible for keeping up with the latest changes made upstream. When a customer starts replicating a new collection, we start with a backfill job. When it is done, we then kick off a delta job.

We keep the latest version of each job in memory for speed, but we also need to persist the jobs so that we don’t have to start over after a system restart. For this, we have a JobStore.
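
For the examples that follow, you can picture a Job and the JobStore roughly like this. This is a simplified sketch, not our actual code; the struct fields and function signatures are placeholders based on how they’re used below.

defmodule Job do
  # Simplified sketch: which collection the job replicates, whether it is a
  # :backfill or :delta job, and the cursor needed to fetch the next page.
  defstruct [:table_id, :kind, :cursor]
end

defmodule JobStore do
  # Simplified sketch of the public surface this post relies on. How jobs are
  # keyed and persisted is deliberately hidden behind these functions.

  def load(resource_id) do
    # Omitted; returns all persisted jobs for `resource_id`.
  end

  def add(resource_id, job) do
    # Omitted; persists `job` under `resource_id`.
  end

  def replace_job(resource_id, old_job, new_jobs) do
    # Omitted; removes `old_job` and persists each job in `new_jobs`.
  end
end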

We were recently writing some new code related to job execution, and we ended up with something like this function:

defp note_job_completed(%State{} = state, table_id, row_count) do
  job =
    state
    |> get_job!(table_id)
    |> Jobs.note_job_completed!(state.resource_id, row_count, nil)

  if job.kind == :backfill do
    JobStore.replace_job(state.resource_id, job, [Map.put(job, :kind, :delta)])
  else
    JobStore.add(state.resource_id, job)
  end
end

defp get_job!(%State{} = state, table_id) do
  # Omitted; uses `JobStore.load()` to load all of the jobs for
  # `state.resource_id` and then finds the one for the collection
  # represented by table_id.
end

This code makes some critical assumptions about how the JobStore is implemented. If we ever chose to change our implementation, this code would break in interesting and confusing ways.

This seemed perfectly reasonable when we first wrote it. But because you don’t know the implementation of the JobStore, you probably noticed some problems right away. Our intimate knowledge of the implementation of the JobStore kept us from seeing these problems at first.

Here are some of the assumptions this code makes; a sketch of how they could break follows the list:

  • The JobStore derives a key for each job and stores the job under that key.
  • Calling Jobs.note_job_completed!/4 doesn’t change the job’s key, so when we call JobStore.replace_job with the updated job, the JobStore can still find the job in order to replace it.
  • Calling JobStore.add/2 will actually replace any existing job that has the same key rather than adding a second copy of the job.
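
To make the last two assumptions concrete, imagine (purely hypothetically; this is not how the JobStore is actually implemented) that it derived its key from fields that change as the job runs:

# Hypothetical JobStore internals, purely for illustration.
defp job_key(job), do: {job.table_id, job.cursor}

# With a key like this, anything that changes the cursor (say, Jobs.note_job_completed!/4
# advancing it) moves the job to a different key: replace_job can no longer find the
# entry it's supposed to replace, and add/2 stores a second copy next to the stale one.

Nothing in note_job_completed/3 guards against that; it only works because of how the JobStore happens to be written today.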

Once we recognized that we were making these assumptions, we immediately refactored the code. To do that, we put ourselves in the shoes of this code and answered the questions from above:

  • What is my job? To update a job after it completes and persist the updated version back into the JobStore in place of the previous version.
  • What should I know? There is a JobStore, and it exposes some public functions for interacting with it.
  • What do I have no business knowing? How and where the JobStore stores jobs; what mechanisms it uses to find stored jobs; what parts of the Job are relevant to how jobs are stored.

Given those answers, we were able to temporarily forget what we know about the internals of the JobStore (aka play dumb) and make this code much cleaner and safer:

defp note_job_completed(%State{} = state, job, row_count) do
  next_job =
    job
    |> Jobs.note_job_completed!(state.resource_id, row_count, nil)
    |> Map.put(:kind, :delta)

  JobStore.replace_job(state.resource_id, job, [next_job])
end

# get_job! as before

Here are the relevant changes:

  • We now save the next version of the job into a new variable, next_job.
  • We no longer call JobStore.add at all; we always call JobStore.replace_job.
  • We call replace_job with the job we originally loaded from the store (job) and replace it with the updated job, next_job.

This will continue to work even if we make significant changes to the underlying implementation of the JobStore.
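
The lookup now happens at the call site instead. The caller isn’t shown in this post, but it might look something like this (hypothetical):

# Hypothetical call site; the real caller isn't shown in this post.
job = get_job!(state, table_id)
note_job_completed(state, job, row_count)

Passing in the job rather than a table_id also makes note_job_completed easier to test and to reuse from other call sites.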

Function Level

When the shape of the data changes upstream (for example, a new custom property is added to a HubSpot contact), Sequin has to react to those changes by updating the structure of the database we’re replicating to. We call these changes “migrations”, as do many ORM libraries.

Sequin recently released an improvement that allows a replication to continue running while attempting to perform the migration. If the migration fails, it is scheduled to be retried later.

The new code that runs when a replication (sync) process starts up looked something like this:

defp initialize_sync(state) do
  {state, jobs} =
    case run_pending_migrations(state) do
      {:ok, state} ->
        # Do migration follow-up work
        # Create any new jobs required by the migrations
        {state, jobs}
      _error ->
        Process.send_after(self(), :restart_sync, @migration_retry_interval)
        jobs = load_jobs!(state)
        {state, jobs}
    end

  # Start running `jobs` in order to continue replicating data
end

def handle_info(:restart_sync, state) do
  stop_pipeline(state, :migration_failed)
end

Again, there were some implicit assumptions built into this code because we knew too much when writing it:

  • We know that we’re going to retry running migrations by stopping the replication pipeline and allowing it to restart, where it will try running migrations again.
  • We know that this is the only place we’re sending the :restart_sync message, so we assume that the reason for stopping the pipeline is :migration_failed.

These assumptions may not always be true. We might move migrations completely away from the replication process and run them out-of-band. We might find other reasons to stop and restart the sync pipeline. Any of those changes would cause us to have to do more refactoring than if we had played dumb while writing this code.
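
As a hypothetical example of how the second assumption could bite, suppose a future change also restarted the pipeline after a configuration reload (the attribute and scenario here are invented for illustration):

# Hypothetical future addition elsewhere in this module.
Process.send_after(self(), :restart_sync, @config_reload_interval)

# The existing callback would then stop the pipeline with the wrong reason:
def handle_info(:restart_sync, state) do
  stop_pipeline(state, :migration_failed)  # but no migration failed!
end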

Once we realized that we were making these assumptions, we again asked ourselves the questions, this time from two perspectives:

From the perspective of initialize_sync:

  • What is my job? Schedule a retry of failed migrations.
  • What should I know? I can send a message to myself after the retry interval to trigger the retry.
  • What do I have no business knowing? How the retry is performed, in particular that it will be performed by restarting the entire replication process.

From the perspective of the handle_info callback:

  • What is my job? Retry running migrations after a failure.
  • What should I know? That we currently re-run migrations by stopping and restarting the replication pipeline.
  • What do I have no business knowing? Who is sending me the message to retry migrations and why they’re sending it.

Given this set of answers, we might refactor to something like this:

defp initialize_sync(state) do
  {state, jobs} =
    case run_pending_migrations(state) do
      {:ok, state} ->
        # Do migration follow-up work
        # Create any new jobs required by the migrations
        {state, jobs}
      _error ->
        Process.send_after(self(), :migration_failed, @migration_retry_interval)
        jobs = load_jobs!(state)
        {state, jobs}
    end

  # Start running `jobs` in order to continue replicating data
end

def handle_info(:migration_failed, state) do
  stop_pipeline(state, :migration_failed)
end

All we did was change the message being sent from :restart_sync to :migration_failed, but with this change we’re cleanly separating “what happened” (migrations failed) from “what we do about it” (restart the pipeline to retry).
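
One way to see the payoff: if we later decide to retry migrations in place instead of restarting the whole pipeline, only the receiving side has to change; initialize_sync keeps sending the same message. A hypothetical future version might look like this:

# Hypothetical future version that retries migrations in place instead of
# restarting the pipeline. The sender in initialize_sync doesn't change at all.
def handle_info(:migration_failed, state) do
  case run_pending_migrations(state) do
    {:ok, state} ->
      # Migration follow-up work omitted
      {:noreply, state}

    _error ->
      Process.send_after(self(), :migration_failed, @migration_retry_interval)
      {:noreply, state}
  end
end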

System Level

I don’t have permission to use the code for this example, but several years ago I was reviewing a pull request where a permission check was removed from part of the code. The authors’ comment was, “We removed the permission check because any user can change the topology, so there’s no need for the permission check.”

The original authors were removing what they saw as dead code. If permission checking was rare in the codebase, that would likely have been a good decision.

However, given that the application consistently checked permissions on various operations, they were inappropriately using their knowledge that this particular operation wasn’t currently restricted, which made it harder to change that decision in the future.

By asking the questions, they might have come to a different decision:

  • What is my job? Change the topology.
  • What should I know? We often restrict who can perform potentially dangerous operations; we have a standard mechanism for checking these permissions.
  • What do I have no business knowing? Whether or not this particular operation is restricted.

Keeping the permission check in place made the code more resilient to future changes to permissions.
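
I can’t share the real code, but the shape of the decision looks roughly like this; every name below is invented for illustration:

# Entirely hypothetical sketch; module and function names are made up.
def change_topology(user, new_topology) do
  # Keep the standard permission check even though the current policy happens
  # to allow every user to perform this operation.
  with :ok <- Permissions.check(user, :change_topology) do
    apply_topology(new_topology)
  end
end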

Wrapping Up

I’ve found the technique of playing dumb to be extremely valuable in writing code that is clean and easy to work with. It helps reduce inappropriate coupling and allows code to be more easily reused without first refactoring it.

Note that I’m not advocating doing extra work. Rather, I’m advocating doing the same amount of work in a better way. We’re still writing the same code, but we’re writing it in different places or using better names.

It’s hard to stop and remind myself that I know too much when I’m writing code. I try to make it a habit to put myself in the position of the code I’m writing and ask those questions:

  • What’s my job?
  • What should I know?
  • What do I have no business knowing?