July 6, 2023
9 min read

Empower ModelOps and HPC workloads with GPU-enabled runners integrated with CI/CD

Learn how to leverage our GitLab-hosted GPU-enabled runners for ModelOps and high-performance computing workloads.


This blog post is the latest in an ongoing series about GitLab's journey to build and integrate AI/ML into our DevSecOps platform. Start with the first blog post: What the ML is up with DevSecOps and AI?. Throughout the series, we'll feature blogs from our product, engineering, and UX teams to showcase how we're infusing AI/ML into GitLab.

In today's fast-paced world, organizations are constantly looking to improve their ModelOps and high-performance computing (HPC) capabilities. Leveraging powerful graphics processing units (GPUs) has become a game-changer for accelerating machine learning workflows and compute-intensive tasks. To help meet these evolving needs, we recently released our first GPU-enabled runners on GitLab.com.

Securely hosting a GitLab Runner environment for ModelOps and HPC is non-trivial and requires a lot of knowledge and time to set up and maintain. In this blog post, we'll look at some real-world examples of how you can harness the potential of GPU computing for ModelOps or HPC workloads while taking full advantage of a SaaS solution.

What are GPU-enabled runners?

GPU-enabled runners are dedicated computing resources for the AI-powered DevSecOps platform. They provide accelerated processing power for ModelOps and HPC workloads, such as training or deploying large language models (LLMs). In the first iteration of releasing GPU-enabled runners, GitLab.com SaaS offers the GCP n1-standard-4 machine type (4 vCPU, 15 GB memory) with 1 NVIDIA T4 (16 GB memory) attached. The runner behaves like a GitLab Runner on Linux, using the docker+machine executor.

Using GPU-enabled runners

To take advantage of GitLab GPU-enabled runners, follow these steps:

1. Have a project on GitLab.com

All projects on GitLab.com SaaS with a Premium or Ultimate subscription have GPU-enabled runners enabled by default; no additional configuration is required.

2. Create a job running on GPU-enabled runners

Create a job in your .gitlab-ci.yml configuration file, and set the runner tag to the saas-linux-medium-amd64-gpu-standard value.

gpu-job:
  stage: build
  tags:
    - saas-linux-medium-amd64-gpu-standard

3. Select a Docker image with the NVIDIA CUDA driver

The CI/CD job runs in an isolated virtual machine (VM) with a bring-your-own-image policy, as with GitLab SaaS runners on Linux. GitLab mounts the GPU from the host VM into your isolated environment. To use the GPU, you must use a Docker image with the GPU driver installed. For NVIDIA GPUs, you can use the official CUDA Toolkit images directly, or third-party images with NVIDIA drivers installed, such as the TensorFlow GPU image.

The CI/CD job configuration for the NVIDIA CUDA base Ubuntu image looks like this:

  image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04

4. Verify that the GPU is working

To verify that the GPU drivers are working correctly, you can execute the nvidia-smi command in the CI/CD job script section.

  script:
    - nvidia-smi
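
Putting steps 2 through 4 together, the complete gpu-job from this walkthrough looks like this:

gpu-job:
  stage: build
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
  script:
    - nvidia-smi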

Basic usage examples

Let's explore some basic scenarios where GPU-enabled runners can supercharge your ModelOps and HPC workloads:

Example 1: ModelOps with Python

In this example, we train a model defined in the train.py file on our GPU-enabled runner, using the NVIDIA CUDA base Ubuntu image mentioned earlier.

.gitlab-ci.yml file:

model-training:
  stage: build
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
  script:
    - apt update
    - apt install -y --no-install-recommends python3 python3-pip 
    - pip3 install -r requirements.txt
    - python3 --version
    - python3 train.py
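
What train.py contains depends on your project; the script and requirements.txt are not shown here. As a minimal, hypothetical sketch, assuming TensorFlow is listed in requirements.txt, the script could verify GPU visibility and train a small model (the dataset and architecture below are illustrative choices, not from the example project):

import tensorflow as tf

# Confirm that TensorFlow can see the GPU mounted into the job's VM.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# Load and normalize the MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define and train a small model; TensorFlow places the work on the T4 GPU automatically.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))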

Example 2: Scientific simulations and HPC

Complex scientific simulations require significant computing resources. GPU-enabled runners can accelerate these simulations, allowing you to get results in less time.

.gitlab-ci.yml file:

simulation-run:
  stage: build
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
  script:
    - ./run_simulation --input input_file.txt
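
Here, run_simulation stands in for any prebuilt GPU binary in your repository. Note that the CUDA base image does not include the nvcc compiler; if you need to compile CUDA sources in the pipeline, a hypothetical build job (simulation.cu is a placeholder source file) could use the devel image variant instead:

simulation-build:
  stage: build
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: nvcr.io/nvidia/cuda:12.1.1-devel-ubuntu22.04
  script:
    - nvcc -O2 -o run_simulation simulation.cu
  artifacts:
    paths:
      - run_simulation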

Advanced usage examples

Let's go through some real-world scenarios of how we use GPU-enabled runners at GitLab.

Example 3: Python model training with a custom Docker image

For our third example, we will use this handwritten digit recognition model. We are using this project as a demo to showcase or try out new ModelOps features.

Open the project and fork it into your preferred namespace. You can follow the next steps using the Web IDE in the browser, or clone the project locally to create and edit the files. Some of the next steps require you to override existing configuration in the Dockerfile and .gitlab-ci.yml.

As we need more pre-installed components and want to save installation time when training the model, we decided to create a custom Docker image with all dependencies pre-installed. This also gives us full control over the build environment and allows us to reuse the image locally without relying on the .gitlab-ci.yml implementation.

In addition, we are using a more complete pipeline configuration with the following stages:

stages:
  - build
  - test
  - train
  - publish


Building a custom Docker image

The first step is to define a Dockerfile. In this example, we start with the NVIDIA CUDA base Ubuntu image and then install Python 3.10. Using pip install, we then add all the required libraries specified in a requirements.txt file.

FROM nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04

# 1. Update and install required packages
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# 2. Set Python 3.10 as the default Python version
RUN ln -s /usr/bin/python3.10 /usr/bin/python

# 3. Copy the requirements.txt file
COPY requirements.txt /tmp/requirements.txt

# 4. Install Python dependencies
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
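
Because the image is self-contained, you can also build and run it locally, assuming Docker and the NVIDIA Container Toolkit are installed on your machine (the image tag digit-recognizer below is an arbitrary local name):

docker build -t digit-recognizer .
docker run --rm --gpus all digit-recognizer nvidia-smi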

In the .gitlab-ci.yml file, we use Kaniko to build the Docker image and push it to the GitLab Container Registry.

variables:
  IMAGE_PATH: "${CI_REGISTRY_IMAGE}:latest"
  GIT_STRATEGY: fetch

docker-build:
  stage: build
  tags:
    - saas-linux-medium-amd64
  image:
    name: gcr.io/kaniko-project/executor:v1.9.0-debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "${IMAGE_PATH}"
      --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_TAG}"
  rules:
    - if: $CI_COMMIT_TAG

In rules, we define that the Docker image build is only triggered for a new Git tag. The reason is simple: we don't want to run the image build process every time we train the model.

To start the image build job, create a new Git tag. You can do this either with the git tag -a v0.0.1 command or via the UI: navigate to Code > Tags and click New tag. Enter v0.0.1 as the tag name to create a new Git tag and trigger the job.
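
If you create the tag on the command line, remember to push it so the tag pipeline runs on GitLab.com (the tag message is arbitrary):

git tag -a v0.0.1 -m "Build training image"
git push origin v0.0.1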

Navigate to Build > Pipelines to verify the docker-build job status, and then locate the tagged image under Deploy > Container Registry.


Testing the Docker image

To test the image, we use the following test-image job, which runs nvidia-smi to check that the GPU drivers are working correctly.

The job configuration in .gitlab-ci.yml file looks as follows:

test-image:
  stage: test
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: $IMAGE_PATH
  script:
    - nvidia-smi
  rules:
    - if: $CI_COMMIT_TAG

We also include container scanning and other security scanning templates in the .gitlab-ci.yml file.

include:
  - template: Security/Secret-Detection.gitlab-ci.yml
  - template: Security/Container-Scanning.gitlab-ci.yml
  - template: Jobs/Dependency-Scanning.gitlab-ci.yml
  - template: Security/SAST.gitlab-ci.yml

Training the model with our custom Docker image

Now that we have built our custom Docker image, we can train the model without installing any further dependencies in the job.

The train job in our .gitlab-ci.yml looks like this:

train:
  stage: train
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: $IMAGE_PATH
  script:
    - python train_digit_recognizer.py
  artifacts:
    paths:
      - mnist.h5
    expose_as: 'trained model'

Navigate to Build > Pipelines to see the job logs.


From here, you can also inspect the train job artifacts.

Publishing the model

In the last step of our .gitlab-ci.yml file, we publish the trained model to the GitLab Package Registry.

publish:
  stage: publish
  when: manual
  dependencies:
    - train
  image: curlimages/curl:latest
  script:
    - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" --upload-file mnist.h5 "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/MNIST-Model/${CI_COMMIT_TAG}/mnist.h5"'
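
Once uploaded, the model can be retrieved from the same generic package registry endpoint, for example with curl outside of CI (<your_access_token>, <project_id>, and the version are placeholders):

curl --header "PRIVATE-TOKEN: <your_access_token>" --output mnist.h5 "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/MNIST-Model/v0.0.1/mnist.h5"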

Navigate to Build > Pipelines and trigger the publish job manually. After that, navigate to Deploy > Package Registry to verify the uploaded trained model.


Example 4: Jupyter notebook model training for ML-powered GitLab Issue triage

In the last example, we use our GPU-enabled runner to train GitLab's internal issue triage model. We use this model at GitLab to determine the right team for an issue and assign it based on the issue description.

Unlike the previous examples, we now use the TensorFlow GPU container image and install the requirements in the job itself.

.gitlab-ci.yml configuration:

train:
  tags:
    - saas-linux-medium-amd64-gpu-standard
  image: tensorflow/tensorflow:2.4.1-gpu
  script:
    - nvidia-smi
    - cd notebooks
    - pip install -r requirements.tensorflow-gpu.txt
    - jupyter nbconvert --to script classify_groups.ipynb
    - apt-get install -y p7zip-full
    - cd ../data
    - 7z x -p${DATA_PASSWORD} gitlab-issues.7z
    - cd ../notebooks
    - python3 classify_groups.py
  artifacts:
    paths:
      - models/
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event" || $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH  
      when: manual
      allow_failure: true


If you are interested in another Jupyter notebook example, check out our recently published video on training ML models using a GPU-enabled runner.

Results

The integration of GPU-enabled runners on GitLab.com SaaS opens up a new realm of possibilities for ModelOps and HPC workloads. By harnessing the power of GPU-enabled runners, you can accelerate your machine learning workflows, enable faster data processing, and improve scientific simulations, all while taking full advantage of a SaaS solution and avoiding the hurdles of hosting and maintaining your own build hardware.

When you try the GPU-enabled runners, please share your experience in our feedback issue.

Compute-heavy workloads can take a long time. A known limitation is the three-hour job timeout in the current configuration of GitLab SaaS runners. We plan to release more powerful compute in future iterations to handle heavier workloads faster. You can follow updates about GPU-enabled runners in the GPU-enabled runners epic and learn more in the GPU-enabled runners documentation.

We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum.
