Saturday 2 March 2019 — This is six years old. Be careful.

Mutation testing is an old idea that I haven’t yet seen work out, but it’s fascinating. The idea is that your test suite should catch any bugs in your code, so what if we artificially insert bugs into the code, and see if the test suite catches them?

Mutation testers modify (mutate) your project code in small ways, then run your test suite. If the tests all pass, then that mutation is considered a problem: a bug that your tests didn’t catch. The theory is that a mutation will change the behavior of your program, so if your test suite is testing closely enough, some test should fail for each mutation. If a mutation doesn’t produce a test failure, then you need to add to your tests.

There are a few problems with this plan. The first is that it is time-consuming. Most people feel like it takes too long to run their entire test suite just once. Mutation testers run the whole suite once for each mutation, and there can be thousands of mutations.

But my larger concern is false positives: not all mutations are bugs, and if the mutation tester reports too many non-bugs as bugs, then its usefulness is diminished or even negated. I wanted to examine this idea more closely.

There are a few mutation testers out there for Python. I thought I would give them a try, starting with mutmut. [Mutmut’s author Anders Hovmöller helped by commenting on a draft of this post. I’ve included some of his commentary.]

I needed a test suite to use, so I created a slightly artificial project. The templite module in coverage.py is almost standalone, and is well-tested. And it’s small enough that its test suite runs in less than a second. I extracted templite, wrote some project scaffolding, and gave it its own repository.

Now I had a project that tested well:

$ coverage run -m pytest
============================= test session starts ==============================
platform darwin -- Python 3.7.1, pytest-4.3.0, py-1.8.0, pluggy-0.9.0
rootdir: /Users/ned/lab/templite, inifile:
collected 26 items

test_templite.py ..........................                              [100%]

========================== 26 passed in 0.09 seconds ===========================

$ coverage report -m
Name              Stmts   Miss Branch BrPart  Cover   Missing
-------------------------------------------------------------
src/templite.py     144      1     60      1    99%   137, 136->137

(The one line missing coverage is a conditional for Python 2 vs Python 3.)

Running mutmut was easy:

$ pip install mutmut
Collecting mutmut
...
Installing collected packages: mutmut
Successfully installed mutmut-1.3.1

$ mutmut run

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Mutants are written to the cache in the .mutmut-cache
directory. Print found mutants with `mutmut results`.

Legend for output:
🎉 Killed mutants. The goal is for everything to end up in this bucket.
⏰ Timeout. Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious. Tests took a long time, but not long enough to be fatal.
🙁 Survived. This means your tests needs to be expanded.

mutmut cache is out of date, clearing it...
1. Running tests without mutations
⠇ Running... Done

2. Checking mutants
⠧ 154/154  🎉 146  ⏰ 0  🤔 0  🙁 8

This ran 154 different mutations, which took about a minute for my half-second-ish test suite. 146 of them resulted in test suite failures, as they should. But 8 passed the test suite, so they have to be examined as potential test gaps.

One nice touch: if you interrupt mutmut, when you run it again, it picks up where it left off, which is great for a long-running process like this.

I’m not sure how mutmut decides where to find the code to mutate. In this case it found it implicitly. For other projects I tried, I had to add some configuration to setup.cfg, even though I thought the projects were laid out similarly.

[Anders says it looks for “src”, “lib”, or a directory with the same name as the current directory. My other project has a quirk: edx-lint/edx_lint has the code, so the punctuation difference threw it off.]

To look at the mutants, use the results command:

$ mutmut results
To apply a mutant on disk:
    mutmut apply <id>

To show a mutant:
    mutmut show <id>


Survived 🙁 (8)

---- src/templite.py (8) ----

10, 29, 37, 45, 46, 58, 108, 152

This gives me the ids of the mutants that survived, that is, the mutations that didn’t cause a failure in the test suite.

We can see the actual code mutation with the show command:

$ mutmut show 10
--- src/templite.py
+++ src/templite.py
@@ -48,7 +48,7 @@
         self.code.append(section)
         return section

-    INDENT_STEP = 4      # PEP8 says so!
+    INDENT_STEP = 5      # PEP8 says so!

     def indent(self):
         """Increase the current indent for following lines."""

The mutation is shown as a diff. The old line is prefixed with minus, and the new line with plus. Here the INDENT_STEP constant was changed from 4 to 5.

Right off the bat, we have a philosophical decision to make. A bit about how templite works: it converts template files into Python code. Rendering a template is done by executing the generated Python code. This INDENT_STEP constant is the indentation amount used in the generated code.

I have no tests that examine the generated code. That code is an implementation detail. The important thing is that the templates render properly, so that is what’s tested. When mutmut changed the indent level to 5, the generated code was different, but only in white space, so it ran the same, and still produced the right output.
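
A small sketch of why the indent width is invisible to the tests. This is not templite's real code generator: `generate` and `compile_render` are hypothetical helpers that build a tiny function as source text with a configurable indent, then execute it.

```python
def generate(indent_step):
    """Build the source of a tiny function, indenting its body by indent_step spaces."""
    pad = " " * indent_step
    lines = [
        "def render(name):",
        pad + "result = []",
        pad + "result.append('Hello, ')",
        pad + "result.append(name)",
        pad + "return ''.join(result)",
    ]
    return "\n".join(lines) + "\n"


def compile_render(indent_step):
    """Execute the generated source and return the resulting function."""
    namespace = {}
    exec(generate(indent_step), namespace)
    return namespace["render"]


render4 = compile_render(4)
render5 = compile_render(5)

print(generate(4) != generate(5))        # the generated source differs...
print(render4("Ned") == render5("Ned"))  # ...but the rendered output does not
```

Python only requires that the indentation within a block is consistent, so any test that looks only at rendered output cannot tell the two versions apart.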

Does this mutation point to a problem in the test suite? I don’t think I should test that the indentation level in the generated code is 4 spaces. Mutmut provides a way to mark the line to exempt it from mutation, but I’m not sure I want to start adding those pragmas. This is one of the things I wanted to understand: what kind of false positives would appear, and how would I deal with them?

Let’s see how the next mutant looks:

$ mutmut show 29
--- src/templite.py
+++ src/templite.py
@@ -134,7 +134,7 @@
         code.add_line("append_result = result.append")
         code.add_line("extend_result = result.extend")
         if sys.version_info.major == 2:
-            code.add_line("to_str = unicode")
+            code.add_line("XXto_str = unicodeXX")
         else:
             code.add_line("to_str = str")

The second mutant has found the one line of code that is not covered by the test suite, because it’s for Python 2, and we are only running under Python 3. Mutmut has a --use-coverage flag, which uses coverage data to skip mutations on lines that are not covered by the test suite. If I had used it to begin with, this mutant wouldn’t have appeared. Nice.

$ mutmut show 37
--- src/templite.py
+++ src/templite.py
@@ -144,7 +144,7 @@
             """Force `buffered` to the code builder."""
             if len(buffered) == 1:
                 code.add_line("append_result(%s)" % buffered[0])
-            elif len(buffered) > 1:
+            elif len(buffered) >= 1:
                 code.add_line("extend_result([%s])" % ", ".join(buffered))
             del buffered[:]

This is a classic false positive. The condition has been changed from greater to greater-or-equal, but it doesn’t change the behavior of the code. This mutation is in an “elif” clause and the equal case was already handled by the previous if clause, so greater-or-equal is the same as greater.
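
The equivalence can be checked directly. In this simplified sketch, `flush_original` and `flush_mutant` are hypothetical stand-ins that return the generated line instead of calling code.add_line, so the two versions can be compared for buffers of several sizes.

```python
def flush_original(buffered):
    """The original branch structure, returning the line it would emit."""
    if len(buffered) == 1:
        return "append_result(%s)" % buffered[0]
    elif len(buffered) > 1:
        return "extend_result([%s])" % ", ".join(buffered)
    return None


def flush_mutant(buffered):
    """The mutant: `>` changed to `>=` in the elif."""
    if len(buffered) == 1:
        return "append_result(%s)" % buffered[0]
    elif len(buffered) >= 1:
        return "extend_result([%s])" % ", ".join(buffered)
    return None


for size in range(4):
    buffered = ["'x%d'" % i for i in range(size)]
    # The length-1 case never reaches the elif, so the two always agree.
    print(size, flush_original(buffered) == flush_mutant(buffered))
```

No test can distinguish the two, which is what makes this an equivalent mutant rather than a gap in the suite.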

On this point, Anders commented:

Mutmut here does point out that your code is overly complex. Just “elif buffered” can’t be mutated but has the same functionality. I’ve found this to be a weird little side effect to using mutation testing. If I follow this the code gets better and more “just so”. This specific case isn’t a super strong argument, but I’ve had many similar things that build on top of each other in small increments.

I can see Anders’ point here, though I’m not sure I want to change the code that way.

Mutant 45 gives us our first true success:

$ mutmut show 45
--- src/templite.py
+++ src/templite.py
@@ -153,7 +153,7 @@
         # Split the text to form a list of tokens.
         tokens = re.split(r"(?s)({{.*?}}|{%.*?%}|{#.*?#})", text)

-        squash = False
+        squash = True

         for token in tokens:
             if token.startswith('{'):

Templite can squash white space around tokens, and here we are changing the initial value of the “should I squash white space?” flag. How can it not cause a test failure? Because we never tested a template that started with white space! Adding this simple test kills the mutant:

self.try_render("  hello  ", {}, "  hello  ")

I thought that running mutmut run again would clear the mutant from the results, but the only way I could find to clear it was to delete the mutmut cache and run all the mutations again. [Anders wrote an issue about this.]

Mutant 46 is another false positive:

$ mutmut show 46
--- src/templite.py
+++ src/templite.py
@@ -153,7 +153,7 @@
         # Split the text to form a list of tokens.
         tokens = re.split(r"(?s)({{.*?}}|{%.*?%}|{#.*?#})", text)

-        squash = False
+        squash = None

         for token in tokens:
             if token.startswith('{'):

Here squash is the same boolean flag we saw in mutant 45. I only ever check it with if squash:, so of course False and None produce the same results. Notice that if I wanted to prevent this mutant by adding a pragma to the line, I would also have prevented the first success we had. Adding that pragma would be counter-productive.
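
As a tiny illustration (with a hypothetical `describe` function standing in for the real loop), any code that consults only the truthiness of the flag cannot tell False from None:

```python
def describe(squash):
    """Only the truthiness of `squash` is consulted, as in `if squash:`."""
    if squash:
        return "squash"
    return "keep"


# False and None take the same branch; only True behaves differently.
print(describe(False), describe(None), describe(True))
```

A test could only kill this mutant by checking the flag's identity, which would mean testing an implementation detail rather than behavior.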

$ mutmut show 58
--- src/templite.py
+++ src/templite.py
@@ -160,7 +160,7 @@
                 start, end = 2, -2
                 squash = (token[-3] == '-')
                 if squash:
-                    end = -3
+                    end = -4

                 if token.startswith('{#'):
                     # Comment: ignore it and move on.

This is another useful result. Turns out in my tests, I always wrote space-squashing tags with a space, like {{a -}}. The line being mutated adjusts the trimming of punctuation to account for the dash. Because I always had a space before the dash, the change to -4 went unnoticed. I killed this mutant by changing some tags in my tests to have no space: {{a-}}, and also added some with many spaces for good measure.
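
Here is a simplified sketch of the slicing involved. `tag_expression` is a hypothetical helper, and the `.strip()` on the sliced text is an assumption about how the tag contents are cleaned up afterwards; the real templite code handles more tag types than this.

```python
def tag_expression(token, end_when_squashing=-3):
    """Return the expression text inside a `{{...}}` tag (simplified)."""
    start, end = 2, -2
    squash = (token[-3] == '-')
    if squash:
        end = end_when_squashing
    # Assumption: surrounding white space is stripped from the sliced text.
    return token[start:end].strip()


for token in ["{{a -}}", "{{a-}}"]:
    original = tag_expression(token)      # end = -3, as in the original code
    mutant = tag_expression(token, -4)    # end = -4, as in the mutant
    print(repr(token), repr(original), repr(mutant))
```

With a space before the dash, the extra character the mutant trims is white space that would have been stripped anyway. Without the space, the mutant eats the last character of the expression, and a test notices.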

Mutant 108 sure looks like it’s real:

$ mutmut show 108
--- src/templite.py
+++ src/templite.py
@@ -211,7 +211,7 @@
             else:
                 # Literal content.  If it isn't empty, output it.
                 if squash:
-                    token = token.lstrip()
+                    token = None
                 if token:
                     buffered.append(repr(token))

Seems like we have no tests of non-white-space literal content after a squashing tag. Add that test, and that mutant is killed.
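
A minimal sketch of the difference, using a hypothetical `literal_code` function that mirrors only the lines shown in the diff and returns what would be appended to the buffer:

```python
def literal_code(token, squash, mutant=False):
    """Return the repr that would be buffered for literal content, or None."""
    if squash:
        token = None if mutant else token.lstrip()
    if token:
        return repr(token)
    return None


for token in ["   ", "  hello"]:
    original = literal_code(token, squash=True)
    mutated = literal_code(token, squash=True, mutant=True)
    print(repr(token), original, mutated)
```

When the literal after a squashing tag is only white space, both versions emit nothing. Only a literal with real content after the white space exposes the mutant.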

Our last mutant is another interesting case:

$ mutmut show 152
--- src/templite.py
+++ src/templite.py
@@ -283,7 +283,7 @@
                     value = value[dot]
                 except (TypeError, KeyError):
                     raise TempliteValueError(
-                        "Couldn't evaluate %r.%s" % (value, dot)
+                        "XXCouldn't evaluate %r.%sXX" % (value, dot)
                     )
             if callable(value):
                 value = value()

Here the error message has been mutated by adding chaff to the beginning and end. We do have a test for this error, including its message:

def test_exception_during_evaluation(self):
    msg = "Couldn't evaluate None.bar"
    with self.assertRaisesRegex(TempliteValueError, msg):
        self.try_render(
            "Hey {{foo.bar.baz}} there", {'foo': None}, "Hey ??? there"
        )

The test still passes because it’s finding the expected error message somewhere in the actual error message. If mutmut had added chaff in the middle of the string as well, it would have failed the test. Is this clever of mutmut? Hard to say!

When I change the test, the mutant is killed:

regex = "^Couldn't evaluate None.bar$"
with self.assertRaisesRegex(TempliteValueError, regex):
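
The difference comes from assertRaisesRegex using a regex search rather than a full match. The sketch below reproduces that with re.search directly; the variable names are mine, chosen for illustration.

```python
import re

loose = "Couldn't evaluate None.bar"
anchored = "^Couldn't evaluate None.bar$"

original_msg = "Couldn't evaluate None.bar"
mutated_msg = "XXCouldn't evaluate None.barXX"

# An unanchored search finds the text anywhere in the message,
# so the chaff at the ends goes unnoticed.
print(bool(re.search(loose, original_msg)), bool(re.search(loose, mutated_msg)))

# Anchoring both ends means the chaff breaks the match.
print(bool(re.search(anchored, original_msg)), bool(re.search(anchored, mutated_msg)))
```

Note that the unescaped dot in the pattern still matches any character, so a stricter pattern would escape it or build the regex with re.escape.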

BTW, the first time I ran mutmut, it created another nonsensical mutant:

--- src/__init__.py
+++ src/__init__.py
@@ -1,2 +1,2 @@
-from .templite import *
+from .templite import /

This mutant survived because this file was never executed. That in itself was a useful clue to the fact that I had made a useless file. Delete the file, and the mutant is killed. [mutmut has changed so that it won’t create this mutation any more.]

So after all this, how did mutmut do? Setting aside mutant 29, which --use-coverage would have skipped, it gave me seven mutations, four of which resulted in improving the tests. That’s not a bad outcome. But I don’t know how I would use this regularly. I don’t have a good way to silence the three false positives, so if I run mutmut again in the future, I will have to consider them again.

As another data point about the cost of mutation testing, I tried mutmut on another project with a 10-second test suite. It took mutmut 43 minutes to run 513 mutants, of which 165 survived. I haven’t looked through them yet to see what they mean.
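
A quick back-of-the-envelope calculation using only the figures quoted in this post (the one-minute figure for templite is approximate, and `seconds_per_mutant` is just a helper name I made up):

```python
def seconds_per_mutant(total_minutes, mutants):
    """Average wall-clock seconds spent on each mutant."""
    return total_minutes * 60 / mutants


small = seconds_per_mutant(1, 154)    # templite: about a minute for 154 mutants
large = seconds_per_mutant(43, 513)   # other project: 43 minutes for 513 mutants
print(round(small, 2), round(large, 2))

# Upper bound if every mutant ran the full 10-second suite, in minutes.
print(round(513 * 10 / 60, 1))
```

The larger project averaged roughly 5 seconds per mutant, about half of its 10-second suite. One possible explanation, which I haven't confirmed, is that a run stops as soon as some test fails, so killed mutants cost less than a full suite run.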

All in all, I am pleased with the results. As an occasional (but expensive) way to assess your test suite, mutmut works well.

Comments

Ionel Cristian Mărieș 12:12 PM on 3 Mar 2019

What I would like to see is a nice way to do negative testing. Something that can tell you: "this test fails if there's a regression"

This is especially useful for timing sensitive integration tests, way too often you get a passing test when there is a regression.

One way to do it is to offer some formalized way to patch/disable code that implement the feature/fix, and then run test and see if it fails. If it doesn't then you got a problem. Is there something like that already implemented?

Ionel Cristian Mărieș 12:20 PM on 3 Mar 2019

Uh, so what I wanted to say: a _nicer_ and faster way to do negative testing.

Skip Montanaro 10:32 AM on 5 Mar 2019

Interesting tool. I wonder if coverage could be used to identify the code touched by each test case. The mutmut probes could then limit test case execution to those tests which actually exercise the modified code. Might speed things up.

mwchase 4:06 PM on 7 Mar 2019

Looking over some of the false positives, I wonder if there's some way to express constraints like "we expect this to work as long as the value is a positive integer".

toonarmycaptain 5:28 PM on 12 Sep 2020

This indeed seems useful, however you have to set up your tests very specifically in some cases. If you have UI code/error messages, and what you really want is to check that a message is returned, and that it has 'balloon' and not 'aeroplane' in it, leaving you not to care if the message is modified in some way. You could set it up as an accessible variable and import that and test against it, but then if mutmut mutates your error message...your test will still pass, as what you expect and receive is going to be the mutated value, so maybe you keep a copy in your test suite...which is sounding like a greater than desirable maintenance burden, particularly if you're not running mutmut as part of CI.

So I guess usability depends on how you want to set up your suite, and what your tests are designed to catch, and if you want to go to the effort of setting them in such a way that mutmut will be helpful without false flags everywhere.

Definitely useful to see what it picks out though, maybe more occasionally than frequently. Adding flags for CI-exceptions in coverage/mypy/pylint plus mutmut starts to hurt readability and/or constrain code structure too much at some point.
