
You should include your tests in coverage

Tuesday 11 August 2020

This seems to be a recurring debate: should you measure the coverage of your tests? In my opinion, definitely yes.

Just to clarify: I’m not talking about using coverage measurement with your test suite to see what parts of your product are covered. I’ll assume we’re all doing that. The question here is, do you measure how much of your tests themselves are executed? You should.

The reasons all boil down to one idea: tests are real code. Coverage measurement can tell you useful things about that code:

  • You might have tests you aren’t running. It’s easy to copy and paste a test to create a new test, but forget to change the name. Since test names are arbitrary and never used except in the definition, this is a very easy mistake to make. Coverage can tell you where those mistakes are (see the sketch after this list).
  • In any large enough project, the tests directory has code that is not a test itself, but is a helper for the tests. This code can become obsolete, or can have mistakes. Helpers might have logic meant for a test to use that somehow isn’t being used. Coverage can point you to these problems.
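
Here’s a minimal sketch of that copy-and-paste mistake (the file and test names are invented for illustration):

    # test_math.py: the second def silently replaces the first, so the
    # first test never runs.  A coverage report that includes this file
    # will show the body of the first function as never executed.

    def test_addition():
        assert 1 + 1 == 2

    def test_addition():    # pasted from above, but the name wasn't changed
        assert 2 + 2 == 4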

Let’s flip the question around: why not measure coverage for your tests? What’s the harm?

  • “It skews my results”: This is the main complaint. A project has a goal for coverage measurement: coverage has to be above 80%, or some other number. Measuring the tests feels like cheating, because for the most part, tests are straight-line code executed by the test runner, so their coverage will all be close to 100%.

    Simple: change your goal. 80% was just a number your team picked out of the air anyway. If your tests are 100% covered, and you include them, your total will go up. So use (say) 90% as a goal. There is no magic number that is the “right” level of coverage.

  • “It clutters the output”: Coverage.py has a --skip-covered option that will leave all the 100% files out of the report, so that you can focus on the files that need work.
  • “I don’t intend to run all the tests”: Some people run only their unit tests in CI, saving integration or system tests for another time. This will require some care, but you can configure coverage.py to measure only the part of the test suite you mean to run (a configuration sketch follows this list).
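
As a rough sketch of how these settings fit together (the package and directory names are invented, not from a real project), a .coveragerc might look something like this:

    [run]
    # Measure the product package and the tests you actually run in CI.
    source =
        myproject
        tests/unit

    [report]
    # Raise the goal, since test files tend to sit near 100%.
    fail_under = 90
    # Leave fully-covered files out of the report to reduce clutter.
    skip_covered = True

The same choices are available on the command line, for example: coverage report --skip-covered --fail-under=90.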

Whenever I discuss this idea with people, I usually get one of two responses:

  • “There are people who don’t measure their tests!?”
  • “Interesting, I had a problem this could have found for me.”

If you haven’t been measuring your tests, give it a try. I bet you will learn something interesting. There’s no downside to measuring the coverage of your tests, only benefits. Do it.

Comments

I'd argue against this for a _different_ reason (although closely related, and one could probably argue that it's part of 'skewing the results' - but the rebuttal you give here does not work for this!):

If you're measuring coverage of the tests themselves, you're measuring the wrong thing.

Consider the following two scenarios, both with 100 (total) code paths in the application and 200 (total) code paths in the tests:

1. 99/100 code paths in the application are executed, and 101/200 code paths in the tests are executed.
2. 1/100 code paths in the application are executed, and 199/200 code paths in the tests are executed.

1 is obviously better than 2 (at least to me). At the end of the day, the tests are not an end in and of themselves - they are a tool for making the application better.

But if you count test code coverage the same as application code coverage... these two are indistinguishable (both come out to 200 of 300 total code paths executed, about 67%). And hence you can't just say "shift the target" - because _there is no target_ that distinguishes scenarios 1 and 2.

Hence, the metric is flawed.

You can argue that it's an extreme example - but the same effect applies even in more subtle cases. It's just more obvious when taken to extremes.

Note that the "standard" approach of "measure the application codepath coverage when running tests" _does_ distinguish these two cases. Hence, "There’s no downside to measuring the coverage of your tests" is incorrect.

*****

The other component to this that you don't really get into is that test code is often very generic, and is deliberately set up for future extensibility, which very often ends up with codepaths that aren't _currently_ executed.

This is one of the issues with heavily coverage-metric-driven development _in general_ - it encourages people to write very specialized and often difficult-to-extend code, because writing the framework for extensibility either a) hurts test coverage or b) requires writing a bunch of tests for code that the application isn't (currently) exercising.

If you've got the choice between:

a) writing something with the hooks now for future extensibility, with good testing of the core functionality you're using now, knowing that the 'fringe' isn't tested yet (but you'll go back and test the fringe when you get there);
b) writing something that can't easily be extended later, because an extensible design would take too much time in the short term once you include the testing it would need;
c) missing the short-to-mid-term goals; or
d) shipping a release with a bunch of bugs, because effort that could have gone into testing the functionality exposed in the current release was spent instead on testing code paths the application isn't even using yet...

I know which one _I'd_ prefer.

*****

In general, you seem to be ignoring that time _is a cost_. It's not a question of "just A, or A and B". It's a case of "more A, or less A and some B". The time that would be spent looking at codepath coverage of tests did not magically appear out of thin air. It's time that could be spent e.g. improving testing of the application itself.

If you have infinite time? Sure, test everything up front. If you're e.g. writing something for aerospace? Sure, test every component. Other than that? It's not as clearcut as you seem to make it out to be.
@TLW, you give us two scenarios that produce the same total coverage number, and point out that one is clearly better than the other even though the metric can't tell them apart. That is true, but it's not a reason to skip coverage measurement of tests. It's a reason not to rely solely on a single number to understand your coverage.

You then argue that people write generalized code with un-executed paths. That's a bad idea. YAGNI says you shouldn't write code just because you think you might need it sometime in the future. You correctly point out that time is a cost. Save your time and don't write code you aren't using. Of course, this is a trade-off. If you have code you aren't using yet, and want to keep it, but don't want to be distracted by it in the coverage reports, you can omit that code from measurement. This applies to product code as well as testing code, so it's not a reason to skip measuring your tests.
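
As a minimal sketch (the paths here are made up), the omission can go in a .coveragerc:

    [run]
    # Hypothetical paths: keep not-yet-used scaffolding out of measurement.
    omit =
        src/myproject/future_hooks.py
        tests/helpers/planned_fixtures.py

Individual lines can also be excluded with a "# pragma: no cover" comment.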

My overarching point about coverage (which I perhaps did not state clearly here) is that a simple numeric result is not a full picture of your coverage. Coverage reports are the full information, and you should use them to understand the complicated world of your code coverage.

I’d like to add something that might not be obvious: measuring coverage (as in recording which lines executed and which didn’t) and displaying/counting it can be two separate steps.

Some people measure coverage in different environments, then merge the data together and show one combined report (just like measuring and showing the report right away!). And this may create the impression that the resulting overall number is the only thing to rely on when linting or deciding whether to fail the CI, especially when just following the examples Hynek gives with a common --fail-under=100 metric.

But it shouldn’t be this way!

Really, Coverage.py allows showing the total coverage number for just a portion of the code, such as a single folder. This essentially means that we can extract as many distinct coverage metrics as we need and have separate asserts on them. For example, it’s an absolutely great idea to have 100% coverage in tests, no matter what, while also having a separate metric like 80% for the non-test code:

python -Im coverage report --omit=src/<your-library>/** --fail-under=100
python -Im coverage report --include=src/<your-library>/** --fail-under=80
