Kiba recently introduced the ability to yield multiple rows in a transform. This is great when you need to explode rows. For example, I get a lot of Excel files where a single cell holds comma-separated values. Aye!

# Given an incoming row like:
{
  url: 'www.example.org',
  zip_codes: '55802, 90210, 10108'
}

# I want to process three separate rows:
[
  {
    url: 'www.example.org',
    zip_code: '55802'
  },  {
    url: 'www.example.org',
    zip_code: '90210'
  },  {
    url: 'www.example.org',
    zip_code: '10108'
  }
]

A limitation is that it must be a class-based transformation, and not a block-based transformation. Sometimes this isn’t practical. I’ll show you how to snap this technical limitation into pieces with a small helper transformation!

Here’s the file we’ll work with as input:

// example_input.json
[
  {
    "url": "www.example.org",
    "zip_codes": "55802, 90210, 10108"
  },
  {
    "url": "www.timtilberg.org",
    "zip_codes": "55720"
  }
]
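
Kiba doesn’t bundle a JSON source, so for these examples assume a small JSONSource along these lines (a sketch; only the class name and the filename: keyword come from the pipelines below):

require 'json'

# Reads the whole file and emits each record as a hash with symbol keys,
#   matching the row[:zip_codes] style of access used in the transforms below.
class JSONSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    JSON.parse(File.read(@filename), symbolize_names: true).each do |record|
      yield record
    end
  end
end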

Before Yielding Transforms

Before the yielding transform was available, I would structure my destination class to account for potential arrays as input using recursion:

class CSVDestination
  # ...
  def write(row)
    # I usually expect my incoming rows to be hashes.
    #   If it's an array, break it down and try again.
    return row.each{|r| write(r)} if row.is_a? Array

    csv << @headers = row.keys unless @headers
    csv << row
  end

end

That single guard line lets you send arrays of rows to the destination without interfering with the usual processing. Cool!
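
For completeness, here’s roughly what the rest of that destination might look like (a sketch; the filename: keyword and the CSV plumbing around #write are my assumptions, not the original code):

require 'csv'

class CSVDestination
  attr_reader :csv

  def initialize(filename:)
    @csv = CSV.open(filename, 'w')
    @headers = nil
  end

  def write(row)
    # If it's an array, break it down and try again.
    return row.each { |r| write(r) } if row.is_a? Array

    # Write the header row once, based on the first row's keys
    csv << @headers = row.keys unless @headers
    # values_at keeps the column order stable even if later rows
    #   happen to order their keys differently
    csv << row.values_at(*@headers)
  end

  def close
    csv.close
  end
end

With a destination like that in place, the old-style pipeline looked like this: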

source JSONSource, filename: 'example_input.json'

transform do |row|
  parsed_zip_codes = row[:zip_codes].split(/, */)
  # => ['55802', '90210', '10108']

  # For each zip code, create a new output row.
  # Our destination is now equipped to handle Arrays,
  #   so each of these records will still get processed individually
  parsed_zip_codes.map do |zip_code|
    {
      url: row[:url],
      zip_code: zip_code
    }
  end
  # => [
  #      {url: 'www.example.org', zip_code: '55802'},
  #      {url: 'www.example.org', zip_code: '90210'},
  #      {url: 'www.example.org', zip_code: '10108'}
  #    ]
end

destination CSVDestination, filename: 'output.csv'

This is a pretty low-cost way to get this to work, so I didn’t mind having it in my everyday CSVDestination. However, the downside is that every component downstream of the explosion must also be prepared to handle an array. Because of this, it’s best to do these explosions at the very end of the chain, right before the destination.
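
To make that concrete, any transform placed after the explosion has to check for both shapes. A later block transform that tags every row (the :country field here is purely illustrative) would need something like:

transform do |row|
  if row.is_a? Array
    row.map { |r| r.merge(country: 'US') }
  else
    row.merge(country: 'US')
  end
end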

Enter The Yielding Transform

With the StreamingRunner in version 2 (and by default in version 3), Kiba introduced the Yielding Transform. This brought support for exploding records in the pipeline without downstream components needing to be aware of Arrayness. Each row gets yielded one at a time, just like from a source class.

From the wiki:

Since Kiba v3 (or Kiba v2 with StreamingRunner enabled), you can also yield as many rows as you want for a given input row, using the yield keyword.

For technical reasons, this will only work in class transforms, not in block transforms.

Emphasis on the technical limitation mine. More on this later – it’s the point of this post!

Here’s a handy exploder class leveraging the new feature. This looks useful!

# A record exploder that splits a value and yields multiple rows for each value.
# Similar utilities would be a `PatternScanner` or a `JsonArray...Exploder`.
# This concept can easily be extracted further into new moves.
#
class Splitter
  # The keys for input and output.
  # If no output_key is given, the input_key gets replaced.
  attr_reader :input_key, :output_key

  # The pattern to split on. You can use either a string or a regex.
  # The default splits on `,` plus any spaces after it.
  attr_reader :split

  def initialize(input_key: , output_key: nil, split: /, */)
    @input_key = input_key
    @output_key = output_key || input_key

    @split = split
  end

  def process(row)
    # Extract each value from the input key
    row[input_key].split(split).each do |val|
      # yield a new row, adding the single value to the output_key
      yield row.merge(output_key => val)
    end

    # We already yielded our relevant rows, don't return any more rows:
    nil
  end
end
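
A quick sanity check of the Splitter on its own (outside of Kiba), using the first record from the sample input:

splitter = Splitter.new(input_key: :zip_codes, output_key: :zip_code)

row = { url: 'www.example.org', zip_codes: '55802, 90210, 10108' }

splitter.process(row) do |new_row|
  puts new_row[:zip_code]
end
# 55802
# 90210
# 10108

Each yielded row still carries :url and the original :zip_codes string alongside the new :zip_code, which is why the pipeline below deletes :zip_codes.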

Our pipeline now looks like:

source JSONSource, filename: 'example_input.json'

transform Splitter, input_key: :zip_codes, output_key: :zip_code

# Previously at this point we were piping an Array of rows.
#   Anything after this transformation needed to account for this.
#   Now, we are back to piping individual rows. 

# Oh, hey, that's handy. We already need another transform here anyway:
#
# The desired output at the top did not include the original :zip_codes list.
transform do |row|
  row.delete :zip_codes

  row
end

destination CSVDestination, filename: 'output.csv'
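
Run against example_input.json, output.csv should come out looking roughly like this (the column order comes from the first row’s keys):

url,zip_code
www.example.org,55802
www.example.org,90210
www.example.org,10108
www.timtilberg.org,55720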

This is great! A lot of useful tools can be made leveraging moves like #split, #scan, and array values. But honestly, I use block-based transforms a lot. They are great for those quick 1-line-of-code moves that clients often need for integration. Rename a column. Format a value. Explode… some… records…

About that limitation…

For technical reasons, this will only work in class transforms, not in block transforms.

I lament the need to create a class just to explode some rows in a pipeline. Don’t get me wrong, class transforms bring a ton of value. They are reusable and testable. In a large project, they provide substantial glue. But block transforms hit a sweet spot for certain client-specific requests.

A recent example is “For each price record, can you create a duplicate row that has 500 subtracted from our price region id?” Sure!

# For each record, create an additional row with -500 from the region code.
# This uses the old style of sending Arrays down the pipe:
transform do |row|
  region = row[:region].to_i
  new_row = row.merge(region: region - 500)
  [new_row, row]
end

This is the kind of move I might not write a dedicated class for. Practically speaking, this is a quirky request that will not see any re-use. If I created a class for each of these transformations, it would be harder to follow the pipeline overall because it would be filled with FixThatThingForThatOneGuyASpecialWay classes that I have to look up for details. Small block transformations are perfect for these – they keep the context close at hand and allow you to review complex pipelines without jumping around to 20 files. They can always be upgraded to a class later if they prove useful or an abstraction arrives.

Okay. So, I can’t yield rows from a block. But I can from a class!

Behold, the adapter to glue the old and new together, enabling the usage of Yielding Transforms in a block transformation:

class ArrayExploder
  def process(row)
    if row.is_a? Array  # Perhaps Array === row is faster, I'm not sure.
      row.each{|r| yield r}
      return nil
    end
    row
  end
end

We can leverage the best of both worlds. Simple, quirky block transformations now have an easy tool to get that same normalization.

source JSONSource, filename: 'example_input.json'

transform do |row|
  row[:zip_codes].split(/, */).map do |zip_code|
    {
      url: row[:url],
      zip_code: zip_code
    }
  end
end

# Our last transformation gave us an array of rows -- let's normalize those:
transform ArrayExploder

# Back to those sweet, sweet individual rows.
transform do |row|
  row[:for] = 'great justice'
  row
end

destination CSVDestination, filename: 'output.csv'
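
And the quirky price-region request from earlier only needs the exploder appended to get back to individual rows:

transform do |row|
  region = row[:region].to_i
  [row.merge(region: region - 500), row]
end

transform ArrayExploder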