Transform Your Data Object-Oriented Style with Formatters
I admit it. I’m a design pattern nerd. I’ve been a massive fan of object-oriented design patterns ever since I read Patterns of Enterprise Application Architecture in the mid-2000s. One of the aspects that first drew me to Ruby, via Rails, was its deep adoption of so many of these patterns and how easy it is to write elegant Ruby code using OOP (Object-Oriented Programming) design patterns.
Today, we’ll talk about a variant of the Template Method design pattern I’ve used in a number of scenarios. I don’t know of a formal name for this pattern, so I like to call it the Formatter pattern. By executing a sequence of formatters, you achieve a process which I think of as a data pipeline. The data pipeline itself can also be represented as an object.
(At first glance, you might think this pattern is better described as a Chain of Responsibility, but that pattern is better suited to cases where each link in the chain should determine whether it can handle all processing and stop, or pass some of the processing directly on to the next link in the chain.)
Perfect for Refactors
The Formatter pattern is so useful that I recently refactored a huge “spaghetti function” in a TypeScript (yes, that TypeScript) codebase to use this OOP pattern instead. The number of bugs and “gotchas” working with that part of the codebase plummeted to nearly zero—and we were able to write unit tests for each formatter in the pipeline to verify the data transformations therein.
It’s safe to say that you wouldn’t necessarily reach for the Formatter pattern the moment you need to author a sequence of steps to transform some data. Rather, you’d start by writing your code in a simple linear fashion, and then as you slowly begin to realize your process is turning into a big ball of mud, it’s time to refactor. (Remember DOEY!)
NOTE: fans of functional programming, you’ll probably want to leave now. This pattern encompasses everything you don’t like about object-oriented programming: abstract classes, inheritance, encapsulation, mutation, and polymorphism. So don’t at me—you’ve been warned!
The Format of Formatters
Formatters in a particular data pipeline all inherit from a base class. Let’s create a new `ExampleFormatters` Ruby module to house our `Base` class and future subclasses:
```ruby
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      @data = data
    end

    def format
      raise "Method should be implemented in subclass"
    end
  end
end
```
You can name the `format` method however you wish; perhaps in your case `convert` would be more appropriate. At any rate, the idea here is you pass a data object of some type into your initializer, and then by calling the `format` method, a transformation occurs for that data. Let’s implement a subclass to see this in action (and let’s assume the data object is a `Hash`):
```ruby
module ExampleFormatters
  class TitleFormatter < Base
    def format
      # ensure the title string is in Title Case
      data[:title] = data[:title].split.map do |str|
        str.capitalize
      end.join(" ")

      self
    end
  end
end
```
Now let’s call this formatter with some data:
```ruby
hsh = { title: "hello world!" }

p ExampleFormatters::TitleFormatter.new(hsh).format.data
# => {:title=>"Hello World!"}
```
Note that we’re not returning a copy of the data object. We’re mutating the data in-place. This is by design (generally better performance and lower memory usage). So in the above example, both `hsh` and the formatter’s `data` reference the same, now-transformed hash after running the formatter. If you need to preserve the original state of the data, make sure you duplicate the object before passing it along to the formatters, perhaps in the data pipeline (described further below).
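To illustrate the duplication approach, here’s a self-contained sketch (with condensed versions of the classes above) showing how passing a shallow clone into the formatter preserves the caller’s original hash:

```ruby
# Condensed versions of the Base and TitleFormatter classes above,
# just enough to demonstrate the dup behavior.
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      @data = data
    end
  end

  class TitleFormatter < Base
    def format
      data[:title] = data[:title].split.map(&:capitalize).join(" ")
      self
    end
  end
end

original = { title: "hello world!" }
copy = original.dup # shallow clone; nested objects would still be shared

ExampleFormatters::TitleFormatter.new(copy).format

p original # unchanged: still the lowercase title
p copy     # transformed: title is now in Title Case
```

Keep in mind `Hash#dup` is shallow: if your data contains nested hashes or arrays that formatters mutate, you’ll need a deep copy instead.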
This is the most basic example of how to write a formatter. However, you might want to add a bit of smarts, such that you can’t run `format` more than once on the same dataset. We can change our base class to this:
```ruby
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      self.data = data
    end

    def data=(new_data)
      @already_formatted = false
      @data = new_data
    end

    def format
      if @already_formatted
        raise "Cannot run formatter a second time on the same dataset."
      end

      process
      @already_formatted = true

      self
    end

    protected

    def process
      raise "Method should be implemented in subclass"
    end
  end
end
```
Now refactor your subclasses to override `process` rather than `format`:
```ruby
module ExampleFormatters
  class TitleFormatter < Base
    protected # keep `process` hidden from the public API:

    def process
      # ensure the title string is in Title Case
      data[:title] = data[:title].split.map do |str|
        str.capitalize
      end.join(" ")
    end
  end
end
```
This will ensure you can only run `format` once on the same data. In order to run it again, you’ll need to use `formatter.data = some_new_data` to reset the process.
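Here’s a condensed, self-contained sketch of that guard in action: running `format` twice raises, and assigning fresh data resets the flag so the formatter can run again:

```ruby
# Condensed copies of the guarded Base class and TitleFormatter above.
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      self.data = data
    end

    def data=(new_data)
      @already_formatted = false
      @data = new_data
    end

    def format
      raise "Cannot run formatter a second time on the same dataset." if @already_formatted

      process
      @already_formatted = true

      self
    end

    protected

    def process
      raise "Method should be implemented in subclass"
    end
  end

  class TitleFormatter < Base
    protected

    def process
      data[:title] = data[:title].split.map(&:capitalize).join(" ")
    end
  end
end

formatter = ExampleFormatters::TitleFormatter.new({ title: "hello world" })
formatter.format
formatter.format rescue puts "second run raised, as expected"

formatter.data = { title: "brand new data" } # resets the guard
formatter.format # runs cleanly again
```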
One reason it’s so nice to have a base class from which to inherit is you can add to the base API and your formatter subclasses will have full access to those methods. For instance, if this “titleize” formatting is something you’ll need to do several times across different formatters, you can abstract the implementation:
```ruby
module ExampleFormatters
  class Base
    # …

    # convert a "lower case" string to "Title Case"
    def titleize(str)
      str.to_s.split.map { |word| word.capitalize }.join(" ")
    end
  end
end
```
```ruby
module ExampleFormatters
  class TitleFormatter < Base
    protected # keep `process` hidden from the public API:

    def process
      data[:title] = titleize(data[:title])
    end
  end
end
```
Not only does this DRY up your code, but it fixes a subtle bug as well. Before, if `data[:title]` was `nil`, your code would error and crash. Now, by using a separate method which is written defensively (i.e., don’t naively assume the incoming object will always be a `String`), the worst you’ll get is a value of `""`. If a blank string itself will cause a problem down the road, make sure you write that code defensively as well!
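A standalone sketch of that defensive helper makes the behavior easy to verify: `to_s` converts `nil` (or any non-string) into a string before splitting, so the worst case is an empty result rather than a crash:

```ruby
# Defensive titleize: coerce the input to a String first so nil,
# symbols, numbers, etc. can't crash the formatter.
def titleize(str)
  str.to_s.split.map(&:capitalize).join(" ")
end

titleize("hello world") # => "Hello World"
titleize(nil)           # => ""
titleize(:some_symbol)  # => "Some_symbol"
```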
Constructing the Data Pipeline
Now that we have some formatters working and transforming our data in bite-sized pieces, we need a way to kick off the whole process. Let’s write our data pipeline object. You’ll notice it contains some similar shapes to the formatter base class.
```ruby
class ExampleDataPipeline
  attr_reader :data

  def initialize(data)
    self.data = data
  end

  # Construct the sequence of formatters for the pipeline.
  # Modify or add to this list to update the pipeline:
  def formatters
    [
      ExampleFormatters::TitleFormatter,
      ExampleFormatters::DatesFormatter,
      ExampleFormatters::LineItemsFormatter,
      # future formatters here
    ]
  end

  def data=(new_data)
    @already_formatted = false
    @data = new_data.dup # make a shallow clone
  end

  def run
    if @already_formatted
      raise "Cannot run pipeline a second time on the same dataset."
    end

    formatters.each do |formatter_class|
      formatter_class.new(data).format
    end

    # further processing here?

    @already_formatted = true

    data
  end
end
```
For the final step, you can run the pipeline with a single statement!

```ruby
transformed_data = ExampleDataPipeline.new(input_data).run
```
Boom.
And thanks to the power of object-oriented programming, you can easily experiment with your pipeline—for example by changing the order of formatters or swapping out one formatter for another.
```ruby
class FasterDataPipeline < ExampleDataPipeline
  def formatters
    [
      ExampleFormatters::TitleFormatter,
      ExampleFormatters::DatesFormatter,
      ExampleFormatters::FasterLineItemsFormatter
    ]
  end
end

# Try benchmarking these two pipelines…
ExampleDataPipeline.new(input_data).run
# vs.
FasterDataPipeline.new(input_data).run
```
It goes without saying that unit testing formatters in isolation will help ensure your formatters don’t become too dependent on each other (i.e., it would be bad if a formatter only worked when executed after another particular formatter).
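As a sketch of what such an isolated unit test might look like, here’s a Minitest example with a condensed copy of the formatter (the test class and method names are hypothetical):

```ruby
require "minitest/autorun"

# Condensed formatter under test (see the full classes above).
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      @data = data
    end
  end

  class TitleFormatter < Base
    def format
      data[:title] = data[:title].to_s.split.map(&:capitalize).join(" ")
      self
    end
  end
end

# Each formatter gets its own isolated unit test: feed it a bare hash,
# run it, and assert on the transformed result. No pipeline and no
# other formatters are involved.
class TitleFormatterTest < Minitest::Test
  def test_titleizes_the_title
    result = ExampleFormatters::TitleFormatter.new({ title: "hello world!" }).format
    assert_equal({ title: "Hello World!" }, result.data)
  end

  def test_handles_missing_title
    result = ExampleFormatters::TitleFormatter.new({}).format
    assert_equal({ title: "" }, result.data)
  end
end
```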
Audit Trails? External APIs? Dynamic Converters?
You can build upon this pattern in increasingly sophisticated ways as your business logic needs grow. One idea might be to log the before/after of each formatter as it runs in the pipeline in order to perform an audit of data integrity after the fact. Another idea might be to encapsulate expensive or complex external API calls within formatters so you can work with your overall pipeline steps at a higher level without messy API calls getting in your way. (And then swapping those formatters out with stubs during unit tests becomes quite straightforward.)
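As a rough sketch of the audit trail idea, here’s a hypothetical `AuditedDataPipeline` (the class name and `audit_trail` accessor are invented for illustration) that snapshots the data before and after each formatter runs:

```ruby
# Condensed formatter setup for the demonstration.
module ExampleFormatters
  class Base
    attr_reader :data

    def initialize(data)
      @data = data
    end
  end

  class TitleFormatter < Base
    def format
      data[:title] = data[:title].to_s.split.map(&:capitalize).join(" ")
      self
    end
  end
end

# A pipeline which records a before/after snapshot for every formatter,
# so you can audit exactly what each step changed after the fact.
class AuditedDataPipeline
  attr_reader :data, :audit_trail

  def initialize(data)
    @data = data.dup
    @audit_trail = []
  end

  def formatters
    [ExampleFormatters::TitleFormatter]
  end

  def run
    formatters.each do |formatter_class|
      before = Marshal.load(Marshal.dump(data)) # deep-copy snapshot
      formatter_class.new(data).format
      audit_trail << {
        formatter: formatter_class.name,
        before: before,
        after: Marshal.load(Marshal.dump(data))
      }
    end
    data
  end
end

pipeline = AuditedDataPipeline.new({ title: "audit me" })
pipeline.run
pp pipeline.audit_trail
```

The `Marshal` round-trip is a quick (if blunt) way to deep-copy plain data so the snapshots can’t be mutated by later formatters; a real implementation might use something lighter.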
In Bridgetown, the subsystem of template converter plugins (which itself is based on Jekyll’s converter subsystem) is architected in a similar fashion to this Formatter pattern. In the Bridgetown case, the order of converters (aka formatters) is determined by a priority designation. High priority converters run first, then medium priorities, then lower priorities. In Bridgetown, I actually did implement an “audit trail” for the conversions, so if you’re curious how a particular resource changed throughout all the conversions, you can inspect that trail and see the state of the template before/after a particular conversion. The precise nature of the converter pipeline is constructed based on extension matches or other metadata (e.g., `.liquid` will be converted differently from `.erb`). Your data pipeline might similarly need to dynamically construct its list of formatters based on the nature of the incoming data.
In summary, I truly adore this design pattern. Once you know it, you start to see its usefulness across a wide variety of scenarios, codebases, and even programming languages!