Let’s think about what the current drafts of summing and grouping tell us about our system. Then code (anyway).

I have, over on the other iPad, some further code on the summing / grouping implementation. But I got to thinking about what I have so far. Here is all the relevant code from yesterday:

function XSet:statsNoGroup(controlTable)
    local summed = {}
    for record,s in self:elements() do
        for operation,fields in pairs(controlTable) do
            if operation=="sum" then
                for i,field in ipairs(fields) do
                    local current = summed[field] or 0
                    local value = record:elementProject(field) or 0
                    summed[field] = current + value
                end
            end
        end
    end
    local result = XSet()
    for field,sum in pairs(summed) do
        result:addAt(sum.."",field)
    end
    return XSet():addAt(result,NULL)
end

function XSet:stats(controlTable)
    if not controlTable.groups then
        return self:statsNoGroup(controlTable)
    end
    local result = {}
    local groupControl = controlTable.groups
    local sumControl = controlTable.sum
    for record,s in self:elements() do
        local gs = record:getGroupString(groupControl)
        local statsForGroup = result[gs] or {}
        for i,field in ipairs(sumControl) do
            local current = statsForGroup[field] or 0
            local value = record:elementProject(field) or 0
            statsForGroup[field] = current + value
        end
        results[gs] = statsForGroup
    end
end

function XSet:getGroupString(groupControl)
    key = ""
    local groupBy = groupControl or {}
    for i,scope in ipairs(groupBy) do
        local value = self:elementProject(scope)
        key = key.."."..(value or "MISSING"..scope)
    end
    return key
end

I think this code is trying to tell us something. There seems to be some muffled mumbling going on here, but I think I’m getting some messages:

This doesn’t look much like set theory here, does it, MathMan?

What in the [DELETED] is that concatenated scope string about? I don’t recall Cantor mentioning those in his work, nor anything in Bourbaki’s Éléments de mathématique.

Even seen as vanilla code, I can’t clearly see what this method is trying to build. Isn’t code supposed to express ideas?

OK, setting aside my desire to stuff a tennis ball in the mouth of that sarcastic code, I think it has a point. Taking the points in order:

  1. I admit that I’ve devised a sort of Lua table structure that is easy to build up, and then I translate it into XSets–in the “no group” case. The grouped one isn’t there yet. This probably tells us that we need some additional methods for creating and accessing sets.
  2. Yes, I admit it. When I think of simple summing over one scope, it makes sense to use the scope name as the scope of the set of sum sets. (The deep nesting here is part of the issue: I’m not at all sure we can get past that.)
  3. Here again, guilty as charged. I tried to write that code “by intention” but instead found myself writing it out longhand. That’s never a good sign: it tells me that I don’t have a solid idea of what I’m doing. If I had a three-step plan clearly in mind, I could at least code s1();s2();s3() for a bit of expressiveness credit.

I can only imagine what you, the reader, thought about yesterday’s code. I suspect that even more than usual it was “what is happening here????”

Let’s see if we can be more clear about what we want to accomplish with summing and grouping.

Summing and Grouping

We should keep in mind that these two are very different things. We might not want to imagine a combined solution, but one that is somehow separated into two (or more) different chunks. Let’s start with summing.

When we say something like “From set People, sum pay and bonus, we’re surely envisioning a result set like this:

{ 37000000pay, 420000bonus }

I think that’s clear, because in the People set, the individual values for pay and bonus show up just like that.

Ha. This gives me a simplifying idea already. Suppose we wanted our summing to be by state. Why not just include state in the individual summing records:

{ ppay, bbonus, MIstate }

Currently, I’ve been creating a group set for a single field like this:

{
{ ppay, bbonus }MI,
{ ppay, bbonus }OH,
}

And that led me to the need for some kind of multiple superscript, and, afraid to create a Tuple or something up there, I decided on the concatenated string trick. Now, in my defense, I did say yesterday that we were surely going to change all this, but even so it seems like we’re not on a good path with that idea.

So what if, instead, we just included the group-by values directly in the output set, exactly like they occur anyway? Makes sense. We’d go from a set of records to a smaller set of records, where we used to have many record for each state and county, but now we just have one record for each combination. It might look like this:

{
{ ppay, bbonus, MIstate },
{ ppay, bbonus, OHstate},
}

With that structure, if we group by state, and then want the record for Michigan, we just have, roughly,

local sums = people, summed and grouped by state
local mi = XSet():record{state="MI"}
local miSums = sums:restrict(mi)

That seems rather good to me. Much simpler than creating and unwinding a nasty curly structure. I wonder why I didn’t think of it right off the bat. I think the answer is: I was thinking too much about how to do it, and not enough about what was needed. It happens. You saw it here.

We do have a performance concern here, and should face it right up front. We would prefer to do this operation in one pass over the input, because in principle, the input is large. All the people in the world or something.

One common algorithm for this problem is to sort the input on the grouping values, because after the sort you can just go through the records in order, creating a single sum record, which you reload when the sort key changes. That algorithm passes over the input more like 2 point something times, one and a fraction for the sort, and one for the summing, and that assumes you have a really good sort.

We don’t have a sort at all. Bit of a drawback.

Another approach is like the one we’ve taken so far. Take the records as they show up, and create an output “table” of the summary records. With reasonably fast access to that table, your speed will be pretty close to the time to read the set once. “Reasonably fast” probably means some keyed indexing structure, not a scan of the growing result set every time you want to find the Michigan record.

That said … what prevents us from defining the operation as if it requires a scan of the output set collection to find the desired summary record, and then replacing that scan with something faster? Nothing prevents that. We could do it.

Let’s try to define grouping in a sort of set-theoretic way. I’m not going to try to do math here, but to speak in the terms we’re familiar with, including some set operation names but still mostly using words, not squiggly marks.

Grouping and Summing

I’ve reversed the section headings, and I did it for a reason that I just got a glimmer of. We’ll see if it comes out in what follows.

We are given a list of “field names” (scopes) to group by. There is no hierarchy in this list, it’s just a breakout. (We may need a hierarchy later. We’re at a disadvantage here, because we have no real problem at hand. We’re inventing an operation we think we need.)

We’re going to create a set of output records, one for each combination of the grouping scopes. And there’s a function F that we want to apply to each record of the input, and to the corresponding output record, that accumulates values into that record. (The only function we have is sum, but we can apply it to multiple fields in the input.)

Let’s sketch an algorithm:

For each input record
  if there's no output record with the input's key
    create the output record with initial field values
  otherwise just use the existing one
  apply F(input,output)

Is that all there is? That’s so stunningly simple that I almost can’t believe it. It is simple for two reasons:

First, I changed the story by defining a simpler output set. And, second, I expressed the story in terms of set operations rather than table creation and accessing. We’ll come back to that second claim. For now, I want to change the tests and redo yesterday’s code.

Where Are We?

We have a couple of tests on summing, one with a grouping and one not. I did the non-grouping one first. This time let’s do it differently, using the first test to perform the application of F on the input set. And let’s go a step further, leaving initialization to the “summing” function, since the grouping function doesn’t naturally know what is needed besides the keys.

So the summing function doesn’t know the keys, and the grouping function doesn’t know what the summing function will do.

Here’s all I had for tests. Pretty thin:

        _:test("Summing", function()
            local control = {sum={"pay","bonus"}}
            local result = People:stats(control)
            _:expect(result:card(),"result").is(1)
            local foundRec = result:elementProject(NULL)
            _:expect(foundRec:at("pay")).is("3300")
            _:expect(foundRec:at("bonus")).is("200")
        end)
        
        _:ignore("Grouping", function()
            local control = {sum={"pay"}, group={"state"} }
            local result = People:stats(control)
            _:expect(result:card(),"result").is(2)
        end)

The plan here is that the first test should concern itself only with the actual set it’s summing into, not the set containing it. So we can recast this test.

I think we want to test a “blank” result set, and then apply the summing function to it, and then get the totals. Let me try to write that:

        _:test("Summing", function()
            local control = {sum={"pay","bonus"}}
            local summingSet = XSet()
            -- do something for each record
            _:expect(summingSet:at("pay")).is("3300")
            _:expect(summingSet:at("bonus")).is("200")
        end)

That comment line there should be named “programming by vague intention”. I think that I’d like to provide the actual function that does the job for us right here, so that the grouping method “just” applies it.

This is definitely a spike. Setting my mind in this mode frees me to expect to try multiple times and to revert freely. I’m on a clean repo (with an ignored test).

Here’s my test. It has one big issue, the putAt.

        _:test("Summing", function()
            local control = {sum={"pay","bonus"}}
            local summingSet = XSet()
            local sum = function(input,accumulator, control)
                local fields = control.sum
                for i,field in ipairs(fields) do
                    local accum = accumulator:at(field) or 0
                    accum = accum + input:at(field) or 0
                    accumulator:putAt(accum,field)
                end
            end
            sum(p1,summingSet,control)
            _:expect(summingSet:at("pay")).is("3300")
            _:expect(summingSet:at("bonus")).is("200")
        end)

The values in the test are wrong. I want to see it fail until it’s right. I’m just applying one record at this point, so the sum should be the values in p1.

But what’s the deal with putAt? I want to put away the new value of the accumulator, and I’m thinking I want to put it back into the existing record. (We could create a new output record each time, but that seems bad, and we’d still have issues.)

I can’t use addAt. That would happily add another value in the set, building up a batch of partial sums. But can I even implement putAt? It’s a set-theoretic nightmare to modify an existing set, but of course we do it with addAt all the time. They are kind of secret operations known only to the elves inside the machine.

Run the test. I expect it to fail on putAt. I am prepared to be surprised.

4: Summing -- TestSum:44: attempt to index a nil value (local 'input')

Yeah, right. The set p1 is local to the before. I’ll bring them out.

local People
local p1,p2,p3,p4

function testSummingEtc()
    
    _:describe("Summing etc", function()
        
        _:before(function()
            p1 = XSet():record{name="p1",state="MI",county="Livingston",pay=1000,bonus=50,age=50}
            p2 = XSet():record{name="p2",state="OH",county="Butler",pay="100", bonus="50", age="30"}
            p3 = XSet():record{name="p3",state="MI",county="Wayne",pay="2000",bonus="50",age="60"}
            p4 = XSet():record{name="p4",state="OH",county="Adams",pay="200",bonus="50",age="40"}
            People = Tuple():add(p1):add(p2):add(p3):add(p4)
        end)
4: Summing -- TestSum:46: attempt to call a nil value (method 'putAt')

There we go. Now let’s compare with addAt and hasAt and see what we can do. That’s in XGeneric:

function XGeneric:addAt(element, scope)
    assert(scope~=nil,"nil scope "..tostring(element))
    if self:hasAt(element,scope) then return end
    if not self.contents[scope] then
        self.contents[scope] = {}
    end
    self.contents[scope][element] = true
    return self
end

function XGeneric:hasAt(element, scope)
    local tab = self.contents[scope]
    if not tab then return false end
    if type(element) ~= "table" then
        return tab[element] == true
    else
        for e,_true in pairs(tab) do
            if type(e) == "table" and e:equals(element) then
                return true
            end
        end
        return false
    end
end

I’m prepared to bend the law enough here to allow me not to worry about someone calling putAt on a set of rank bigger than one. Someday we’ll implement rank and maybe check it.

I’m going to try this one by intention.

function XGeneric:putAt(element,scope)
    assert(scope~=nil,"nil scope "..tostring(element))
    self.contents[scope] = {} -- remove any existing contents at this scope
    self.contents[scope][element] = true
    return self
end

Well, sort of by intention. These are such elementary steps it seems senseless to give them names. That’s probably a mistake. Anyway, what fails now?

Oh, we need to add it on XSet as a forwarder:

function XSet:putAt(element,scope)
    self.data:putAt(element,scope)
    return self
end

Test. Check this out:

4: Summing  -- Actual: 1000, Expected: 3300
4: Summing  -- Actual: 50, Expected: 200

Those are in fact the values in p1. Extend the test:

...
            sum(p1,summingSet,control)
            _:expect(summingSet:at("pay")).is("1000")
            _:expect(summingSet:at("bonus")).is("50")
            sum(p2,summingSet,control)
            _:expect(summingSet:at("pay")).is("1100")
            _:expect(summingSet:at("bonus")).is("100")

I am hopeful that this works. And it does … except that I’m returning numbers not strings. My bad. Fix the summing function? Or should we make this an official restriction? For now, leave it at the higher level:

        _:test("Summing", function()
            local control = {sum={"pay","bonus"}}
            local summingSet = XSet()
            local sum = function(input,accumulator, control)
                local fields = control.sum
                for i,field in ipairs(fields) do
                    local accum = accumulator:at(field) or 0
                    accum = accum + input:at(field) or 0
                    accumulator:putAt(tostring(accum),field)
                end
            end
            sum(p1,summingSet,control)
            _:expect(summingSet:at("pay")).is("1000")
            _:expect(summingSet:at("bonus")).is("50")
            sum(p2,summingSet,control)
            _:expect(summingSet:at("pay")).is("1100")
            _:expect(summingSet:at("bonus")).is("100")
        end)

I just added the tostring to the putAt. Tests are green.

I’ll do the other two records as a special treat.

        _:test("Summing", function()
            local control = {sum={"pay","bonus"}}
            local summingSet = XSet()
            local sum = function(input,accumulator, control)
                local fields = control.sum
                for i,field in ipairs(fields) do
                    local accum = accumulator:at(field) or 0
                    accum = accum + input:at(field) or 0
                    accumulator:putAt(tostring(accum),field)
                end
            end
            sum(p1,summingSet,control)
            _:expect(summingSet:at("pay")).is("1000")
            _:expect(summingSet:at("bonus")).is("50")
            sum(p2,summingSet,control)
            _:expect(summingSet:at("pay")).is("1100")
            _:expect(summingSet:at("bonus")).is("100")
            sum(p3,summingSet,control)
            sum(p4,summingSet,control)
            _:expect(summingSet:at("pay")).is("3300")
            _:expect(summingSet:at("bonus")).is("200")
        end)

Tests still green. Commit: New summing logic and record layout, works at summing level. Grouping still ignored.

Let’s look into histry back.

Retro

Well, putting the best face on it that I can, yesterday I did the best that I possibly could, with the brain, energy, and ideas that I had. And, yay me, I stayed open to the very real possibility that the best I have on any given day isn’t the best I can imagine, or even the best I can do on some future day. So when the code leveled those heinous accusations at me, I remained calm, and non-defensive, and accepted the criticism as useful, even if the code was really rather cruel and sarcastic about it all. I can see its point, it was pretty tortured, and I can see how it might hold a grudge.

Now we’re working toward a simpler output structure, a set containing any grouping fields, just as they’d appear in the input record, and containing statistics fields, just the summed fields for now, with the accumulated values.

We are planning to use set operations to find these summary records in a growing output set. We’ll worry about the performance later. Since each group-by field combination will appear in just one accumulating record, a simple restrict will find the right one.

I like the notion of applying a provided function at each summing step. That should make the final method into an enumeration of the input set, find corresponding accumulator, call function. Seems like it’s going to be sweet.

We’ll wait to find that out on another day. Maybe not Christmas Eve or Christmas, but we’ll see. I tend to get up before anything else is going on.

See you then, unless you are sensibly spending your time with family and friends. Merry Christmas, Season’s Greetings, Happy Holidays, and the best of whatever days you’re celebrating or just experiencing.