Fantastic Models and How to Train Them

Experimenting with software development pipelines in machine learning projects.

Christian Melchiorre
Towards Data Science


In my long years as a software engineer I’ve been involved in software projects of varying sizes and application domains. So far, I’ve been lucky enough to be given the opportunity to experiment with different technology stacks, keeping my professional skills up to date and in line with the continuously accelerating rhythm at which today’s technology landscape is evolving. Only relatively recently, though, have I started experimenting and building up skills in machine learning related technologies, partly in the context of my day-to-day job and partly out of personal interest (I’m sort of a geek, I confess). So, after brushing up on my math theory and learning the basics of one or two of the principal software libraries, I set to work to try and reproduce some of the most interesting examples found on the web.

At first, when you start playing with a completely new technology stack, it might be tempting to just jot down some random script from a web tutorial, or try some variations of it in no particular order, just to get a feel for it…

I soon realized, though, that in order to exploit my new skills in a more professional setting, I needed to proceed in a somewhat more structured way. So I considered integrating these new techniques and tools with what I’ve learnt so far from designing and developing “classical” enterprise-level applications, by adapting the usual software development pipeline and tools to the specific case where machine learning models are involved. I plan to document my efforts and experiments here, in this blog post and the following ones, and offer them as a guide to those who, like me, are starting to move into this fascinating technological field.

For a start, I began considering the whole software development and distribution process from a bird’s eye view.

Best practices and patterns provide a useful perspective on the issue. To give an example, some of these practices¹ recommend strictly separating the three stages of build, release and run in the software development and distribution lifecycle. The build stage deals with the actual writing of the source code for our application or service in our favourite programming language, dealing with aspects such as source code versioning, and usually results in a binary executable bundle, known as a “build”. The following release (also known as deploy) stage takes this binary bundle and combines it with configuration data strictly dependent on the target deployment environment (development, testing and quality assurance, staging, production and so on). Such configuration data may include connection details and credentials for external databases or other analogous services (the so-called “backing services”), deployment-specific settings such as the canonical host name for the deployment, etc. What this stage produces is hence something ready for immediate execution. As a last stage, the run phase executes the application or service’s main process(es) in the target environment.
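Just to make the release-time configuration idea a bit more concrete, here is a minimal sketch in Python, assuming settings are injected through environment variables; the variable names and default values are purely illustrative, not taken from any particular project.

```python
import os

# Minimal sketch of 12-factor-style configuration: the same build artifact is
# combined with environment-specific settings read from environment variables
# (all names and defaults below are illustrative assumptions).
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev_db")  # backing service
CANONICAL_HOST = os.environ.get("CANONICAL_HOST", "localhost")                # deploy-specific host name
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"

if __name__ == "__main__":
    print(f"Release configured for host {CANONICAL_HOST}, "
          f"database {DATABASE_URL}, debug={DEBUG}")
```

The point of the pattern is that the code above ships unchanged in every build, and only the environment it runs in decides the actual values.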

This strict separation of stages implies, for instance, that it is forbidden to make changes to the application code directly in the runtime environment: a new release cycle must be initiated to change the source code, produce a new bundle and execute it on the target runtime as a new version. This guarantees many nice properties, such as traceability and reproducibility of the process.

All this consequently leads to setting up separate (at least conceptually) runtime environments, one for each stage, each equipped with the resources and software infrastructure tools best fitted to support the specific activities of the stage in question.

A development environment will need an IDE (like Eclipse, Visual Studio or PyCharm) and a language runtime that allow developers to produce their code; a Continuous Integration (CI) environment will contain tools to share, version and manage both source code and compiled component binaries (e.g. Git, Maven, Jenkins, etc.); and finally the target runtime environment should be equipped with a solid runtime infrastructure that allows the execution of the software components while managing issues like distribution, load balancing, fault tolerance, and so on (lately I’ve been getting passionate about Kubernetes).

Now, one of the first aspects that comes to mind when considering machine learning projects is that the build phase gets conceptually slightly more complicated. Developers don’t just write source code that gets built and then executed; rather, this first piece of source code is a program which is built and executed somewhere to produce another “program” (namely, the ML model) that will finally be executed somewhere else to serve clients’ requests.
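Just to fix ideas, here is a toy sketch of the “program that produces a program” step, assuming scikit-learn and joblib are available; the dataset, model choice and file name are purely illustrative.

```python
# train.py - the "first" program: its output is not an executable binary
# but a trained model, serialized to disk as the artifact to hand over
# to the serving environment.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

# Illustrative toy dataset standing in for the real training data.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # the (potentially expensive) training step

# The serialized model is the "second program" mentioned above.
joblib.dump(model, "model.joblib")
```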

To be clear, the actual “physical” separation between the CI and training environments is basically imposed by the fundamentally different needs of building application binaries from source code as compared to training an ML model. The latter usually requires much higher computational power and resources than the former, possibly with the support of GPUs, TPUs and so on. I realize that all this might sound trivial to experienced ML practitioners, but I confess that at first it took me some time to focus correctly on this issue, leading to some initial confusion in which I viewed model training as just a further stage in the build process of a final executable artifact.

Similarly, the requirements of the training environment and of the target execution environment might differ greatly: the final target might even be just some mobile device running a trained model as part of some “AI-powered” mobile application.
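Continuing the same toy example, the serving side can be as small as loading the stored model and calling it. This is only a sketch of the general idea; on a mobile target the model would typically first be converted to a format such as TensorFlow Lite or Core ML rather than loaded with joblib.

```python
# serve.py - the "second" program: it only loads and runs the trained model,
# with no need for the training data or the training-time resources.
import joblib

model = joblib.load("model.joblib")   # artifact produced in the training environment

# Illustrative input with the same feature layout used at training time.
sample = [[5.1, 3.5, 1.4, 0.2]]
print(model.predict(sample))
```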

What I plan to do in the following posts is examine the issues that might arise when setting up and operating each of the pieces of the puzzle mentioned above, with a general focus on solutions relevant to enterprise-level contexts: where teams of more than a single developer are involved, where large-scale models are designed or (most importantly) large datasets are used, and where considerations like run-time performance, high scalability or resiliency are important. All this from the point of view of someone who is (little more than) a beginner, as I am, as far as software based on machine learning is concerned… a sort of “diary” of the work of exploration I’m carrying out. At the same time, my intention is not to provide an in-depth step-by-step guide on how to do things, but rather to take a high-level view of the problems and issues that might arise, the choices you face and the available options (best practices, software tools, services etc.) that might help in finding the desired solutions.

I hope this is a subject that could interest a few people, and also hope to receive good feedback and useful suggestions on how to improve this series of posts.

References and notes

[1] See for example the so-called “12 factors” methodology, a set of patterns and recommendations that has become quite popular lately, initially devised by engineers at Heroku (the well-known PaaS platform). These are meant to provide guidelines for building portable, highly scalable and resilient applications, with a particular focus on web-based (software-as-a-service) or cloud-based services, as most modern software applications are supposed to be. https://12factor.net/
