
Having all the source code in one file

An early, and supposedly influential, analysis of the Coronavirus outbreak was based on results from a model whose 15,000-line C implementation was contained in a single file. There has been lots of tut-tutting from the peanut gallery about the code all being in one file, rather than distributed over many files. The source on Github has since been heavily reworked.

Why do programmers work with all the code in one file, rather than splitting it across multiple files? What are the costs and benefits of having 15K lines of source in one file, compared to distributing them across multiple files?

There are two kinds of people who work with all the code in one file: novices and really capable developers. Richard Stallman is an example of a very capable developer who worked using files containing huge amounts of code, as anybody who has looked at the early sources of gcc will know all too well.

The benefit of having all the code in one file is that it is easy to find things and to make global changes. If the source is scattered over multiple files, then working on the code entails knowing which file to look in to find whatever is needed; there is a learning curve (these days screens have lots of pixels, and editors support multiple windows with a different file in each window; I’m sure lots of readers work like this).

Many years ago, when 64K was a lot of memory, I sometimes had to do developer support: people would come to me complaining that the computer was preventing them from writing a larger program. What had happened was that they had hit the capacity limit of the editor. The source then had to be spread over multiple files to get around this ‘limitation’. In practice people experienced the benefits of using multiple files, e.g., the editor loading files faster (because they were a lot smaller) and reduced program build time (because only the code that changed needed to be recompiled).
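As a minimal sketch of the build-time point (file and function names are invented for illustration, not taken from the model's source), splitting a program across a header and several .c files means an edit to one file only requires recompiling that file before relinking; with everything in one file, every edit means recompiling the lot:

    /* model.h -- shared declarations */
    #ifndef MODEL_H
    #define MODEL_H
    double infection_rate(double contacts, double transmissibility);
    #endif

    /* model.c -- recompiled only when it changes:  cc -c model.c */
    #include "model.h"
    double infection_rate(double contacts, double transmissibility)
    {
        return contacts*transmissibility;
    }

    /* main.c -- after an edit here, the build is just:
          cc -c main.c
          cc main.o model.o -o sim   */
    #include <stdio.h>
    #include "model.h"
    int main(void)
    {
        printf("rate = %f\n", infection_rate(10.0, 0.3));
        return 0;
    }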

These days, 15K lines of source can be loaded or compiled in the blink of an eye (unless a really cheap laptop is being used). Growth in computing power has significantly reduced the benefits that splitting used to provide.

What costs might be associated with keeping all the source in one file?

Monolithic code makes sharing difficult. I don’t know anything about the development environment within which these researchers worked. If there were lots of different programs using the same algorithms, or reading/writing the same file formats, then code reuse often provides a benefit that makes it worthwhile splitting off the common functionality. But then the researchers have to learn how to build a program from multiple source files, which a surprising number are unwilling to do (at least it has always been surprising to me).

Within a research group, sharing across researchers might be possible (assuming they are making some use of the same algorithms and file formats). Involving multiple people in the ongoing evolution of software creates a need for some coordination. At the individual level it may be more cost-efficient for people to have their own private copies of the source, with savings only occurring at the group level. With software development having a low status in academia, I don’t see any of the senior researchers willingly taking on a management role for this code. Perhaps one of the people working on the code is much better than the others (it often happens), but are they going to volunteer themselves as chief dogsbody for the code?

In the world of Open Source, where source code is available, cut-and-paste is rampant (along with wholesale copying of files). Working with a copy of somebody else’s source removes a dependency, and if their code works well enough, then go for it.

A cost often claimed by the peanut gallery is that having all the code in a single file is a signal of buggy code. Given that most of the programmers who do this are novices, rather than really capable developers, such code is likely to contain many mistakes. But splitting the code up into multiple files will not reduce the number of mistakes it contains, just distribute them among the files. Correlation is not causation.

For an individual developer, the main benefit of splitting code across multiple files is that it makes developers think about the structure of their code.

For multi-person projects there are the added potential benefits of reusing code, and reducing the time spent reading other people’s code (it’s no fun having to deal with 10K lines when only a few functions are of interest).

I’m not saying that the original code is good, bad, or indifferent. What I am saying is that having all the source in one file may, or may not, be the most effective way of working. It’s complicated, and I have no problem going with the flow (and limiting the size of the source files I write), but let’s not criticise others for doing what works for them.

  1. John Carter
    May 11, 2020 04:27 | #1

    Let me throw some peanuts….

    Here’s a list of why I tend to find Big Files a code smell….

    * Scope: Every file-scoped symbol is in scope from its point of declaration to the end of the file, i.e., there is no hint when you are creating a totally tangled cyclic architecture. (Ye Olde Pascal at least required you to define everything before you used it, and provided lexical scoping rules.)

    * State: Most Big Files I have worked on have had an enormous amount of global state. At least if the global variables are file-scoped and there are lots of files, that state is roughly encapsulated (see the sketch at the end of this comment). In a single-file implementation, with all globals declared at the top of the file, causal analysis is extraordinarily difficult.

    * Reuse: You have already noted reuse, but I will add one important fact: test is the first reuse.

    * Testing: It’s very hard to do high branch / state space coverage testing of a Big File.

    * Change: Any change in a function’s clients is the second reuse. If there is connascent coupling between the client and the function being called, there often exist undocumented and unchecked preconditions. Thus changing the client is effectively reuse, and often results in subtle breakage.

    * Configuration: Most Big Files I have met have become a #ifdef spaghetti of uncompiled, untested, and broken configuration variants.
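    A minimal sketch of the scope/state points (names invented for illustration): mutable state declared static in its own small file is invisible to the rest of the program, which has to go through the functions that manage it; in one big file, every function below the declaration can modify it directly.

        /* counts.c -- hypothetical example: state visible only within this file */
        static long infected_total = 0;   /* file scope: other .c files cannot touch it */

        void add_infections(long n)
        {
            infected_total += n;
        }

        long total_infections(void)
        {
            return infected_total;
        }

        /* counts.h -- the only names other files get to see */
        void add_infections(long n);
        long total_infections(void);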

  2. nick j
    May 11, 2020 07:28 | #2

    sqlite distribute their source as a single file, but develop it in multiple files.

    claimed advantages are simplicity for the end user (true imo) and improved optimisation (don’t know, don’t care personally)

  3. May 11, 2020 13:06 | #3

    @John Carter
    The scope issue is a bit of a red herring, because when people are forced to split up large files, without knowing why it might be a good idea, they simply put all the variables in a header, which gets included everywhere. Because software engineering often has low status, researchers are not willing to spend time learning more effective ways of doing things.
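    A sketch of the two approaches (invented names, not code from the model): the quick fix is to put the variable definitions in a header that every .c file includes; the more effective way is to declare them extern in the header and define each one in a single .c file.

        /* globals.h -- the quick fix: definitions in a header included everywhere.
           Each including .c file gets its own definition; linking either relies on the
           old 'common symbol' behaviour or fails with multiple-definition errors. */
        int population;
        double beta;

        /* params.h -- the alternative: declarations only */
        extern int population;
        extern double beta;

        /* params.c -- the one and only definition of each variable */
        #include "params.h"
        int population = 0;
        double beta = 0.0;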

    Testing is impacted by the size of the functions, not the size of the files. Now for novices the two are sometimes highly correlated. I have seen ‘smaller’ files each containing one large function.

    The #ifdef point may be a good one. A misplaced #ifdef will cause less havoc in a smaller file, and should be easier to track down.

  4. Nemo
    May 12, 2020 23:02 | #4

    This is an interesting post. I have never seen large files in any of the companies I worked in. No longer having access to their repositories, I cannot provide any numbers but I recall no file larger than 100 lines. (All were in embedded systems or cellphones.)

    Ease of understanding, maintenance, sub-system replacement, testing, and review come to mind. But everywhere I worked, the philosophy of small files was already extant.

  5. Magnum
    May 24, 2020 07:40 | #5

    I had a look at the code at the time this story broke and it seemed straightforward. The function and variable names were clear, so you could find what you were looking for pretty easily. The code was laid out logically and consistently (if quite a bit repetitively) in a way that gave assurance it would work, and that it would be pretty easy to debug.

    I read the comments in three threads where the code was posted. It wouldn’t take long for box-checking code bureaucrats to start moaning about the lack of unit tests and non-reproducibility of exact results (due to threading and hardware fp) as well as everything else. The thing is it was pretty clear they had an agenda, and sure enough it was all taken as proof that global warming models are also all wrong. So the people getting excited about this were global-warming deniers as well as being anti-lockdown.

    Sorry for being a couple of weeks late but I saw your comment Derek on the ‘unherd’ site and it was the first one which made any kind of sense.

  6. May 24, 2020 13:20 | #6

    @Magnum
    Always happy to receive comments.

    A great deal of online commentary about code has a non-technical agenda. Similar comments were made when the ‘Climategate’ code appeared.

    The cleaned-up Imperial code looked better than a lot of code I have seen that had been worked on by many students over many years. The original version may be a lot worse, but that is the nature of research software.
