Using a mini-HPC cluster to test batch jobs

Testing applications intended to run on high-performance computing clusters can be a painful process, especially when you do not have access to the resources you need. Sometimes you just want to check if you have the right compile flag specified, if the script you wrote actually works, or if your application can successfully use multiple cores. One way to test your HPC software is by emulation; instead of wasting precious compute time on your bare-metal cluster, you use a mini-HPC cluster that provides the same functionality as the bare-metal cluster on a smaller scale.

Using cleantest, you can build yourself a mini-HPC cluster anywhere you need it, whether that is on your laptop or on a continuous integration pipeline runner. In this tutorial, you will learn how to build a mini-HPC cluster and submit a test batch job to the cluster's resource manager/workload scheduler. Below is a diagram outlining the architecture of the mini-HPC cluster that we are going to build with cleantest.

flowchart LR
    subgraph identity [Identity]
        ldap(OpenLDAP)
    end
    subgraph filesystem [Shared Filesystem]
        nfs(NFS)
    end
    subgraph cluster [Resource Management & Compute Service]
        controller(slurmctld)
        compute0(slurmd)
        compute1(slurmd)
        compute2(slurmd)
    end

    identity --> filesystem
    identity --> cluster
    filesystem --> cluster
    cluster --> filesystem
    controller --> compute0
    compute0 --> controller
    controller --> compute1
    compute1 --> controller
    controller --> compute2
    compute2 --> controller

Setting up the cleantest environment

Test dependencies

This tutorial will be using the LXD test environment provider to provide the test environment instances that will compose the mini-HPC cluster. If you do not have LXD installed on your system, please visit the Installation guide for instructions on how to set up LXD.

This tutorial will also be using the Jinja templating engine to render configuration files for the services used inside the mini-HPC cluster. You can use the pip package manager to install Jinja on your system:

python3 -m pip install Jinja2
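
If you have not used Jinja before, the pattern is simple: load templates from a directory, then render one with the variables it expects. Below is a minimal sketch; the template name example.conf.tmpl and the variable value are placeholders rather than files used in this tutorial, and the test file we write later uses the same Environment and FileSystemLoader pattern.

# Minimal Jinja sketch with a placeholder template name and variable value.
from jinja2 import Environment, FileSystemLoader

templates = Environment(loader=FileSystemLoader("templates"))
rendered = templates.get_template("example.conf.tmpl").render(
    ldap_server_address="10.0.0.1"
)
print(rendered)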

We will also be using pytest to run our "clean tests". pytest can be installed using pip as well:

python3 -m pip install pytest

Required template files

The following Jinja template files are needed for rendering service configuration files. Please create a templates directory in your current working directory and copy the templates to the newly created directory.

sssd.conf.tmpl

We will be using sssd (System Security Services Daemon) to connect clients to the mini-HPC cluster's identity service. This Jinja template renders the sssd.conf file that the sssd service uses to locate the identity service.

[sssd]
config_file_version = 2
domains = mini-hpc.org

[domain/mini-hpc.org]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://{{ ldap_server_address }}
cache_credentials = True
ldap_search_base = dc=mini-hpc,dc=org

slurm.conf.tmpl

We will be using the SLURM workload manager to provide resource management, workload scheduling, and compute service in the mini-HPC cluster. This Jinja template will be used to configure SLURM after the controller and compute nodes have been created.

SlurmctldHost={{ slurmctld_name }}({{ slurmctld_address }})
ClusterName=mini-hpc

AuthType=auth/munge
FirstJobId=65536
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
ProctrackType=proctrack/linuxproc
KillWait=30
MaxJobCount=10000
MinJobAge=3600
ReturnToService=0
SchedulerType=sched/backfill
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPort=7002
SlurmdPort=7003
SlurmdSpoolDir=/var/spool/slurmd.spool
StateSaveLocation=/var/spool/slurm.state
SwitchType=switch/none
TmpFS=/tmp
WaitTime=30

# Node Configurations
NodeName={{ slurmd_0_name }} NodeAddr={{ slurmd_0_address }} CPUs=1 RealMemory=1000 TmpDisk=10000
NodeName={{ slurmd_1_name }} NodeAddr={{ slurmd_1_address }} CPUs=1 RealMemory=1000 TmpDisk=10000
NodeName={{ slurmd_2_name }} NodeAddr={{ slurmd_2_address }} CPUs=1 RealMemory=1000 TmpDisk=10000

# Partition Configurations
PartitionName=all Nodes={{ slurmd_0_name }},{{ slurmd_1_name }},{{ slurmd_2_name }} MaxTime=30 MaxNodes=3 State=UP

Create the test file

Create the file test_mini_hpc.py in your current working directory; this file is where we will write our test. Once you have created test_mini_hpc.py, add the following lines to the top of the file:

#!/usr/bin/env python3

"""Test batch job using mini-HPC cluster created with cleantest."""

import json
import os
import pathlib
from io import StringIO

from jinja2 import Environment, FileSystemLoader

from cleantest.control.hooks import StopEnvHook
from cleantest.control.lxd import InstanceConfig
from cleantest.data import File
from cleantest.provider import lxd, LXDArchon

root = pathlib.Path(os.path.dirname(os.path.realpath(__file__)))
templates = Environment(loader=FileSystemLoader(root / "templates"))

# Where the testlet will be added.
...


def test_mini_hpc() -> None:
    """Test batch job inside mini-hpc cluster."""
    archon = LXDArchon()
    archon.config.register_hook(
        StopEnvHook(name="get_result", download=[File("/tmp/result", root / "result")])
    )
    placeholder = archon.config.get_instance_config("ubuntu-jammy-amd64").dict()
    placeholder["name"] = "mini-hpc-sm"
    archon.config.add_instance_config(
        InstanceConfig(
            config={
                "limits.cpu": "1",
                "limits.memory": "8GB",
                "security.privileged": "true",
                "raw.apparmor": "mount fstype=nfs*, mount fstype=rpc_pipefs,",
            },
            **placeholder,
        )
    )

    # Where most of the code snippets will be appended 
    # for creating the mini-HPC cluster.
    ...

These imports/variable declarations at the beginning of the test file will be used throughout the code snippets in the rest of this tutorial.

Inside our test function, we create an instance of LXDArchon that will be used to control the LXD test environment provider. We also define a stop environment hook that will be used to retrieve the result of our test batch job. Lastly, we create a custom test environment instance configuration. This configuration sets resource limits so the mini-HPC cluster does not consume all of your workstation's resources, and it grants the privileges needed to support NFS mounts inside the test environment instances.

Configuring the identity service

To start, we will be configuring an identity service for our mini-HPC cluster. We cannot let user root be the primary user of our mini-HPC cluster; that would create a disaster. To prevent the mini-HPC cluster from going up into flames, we are going to use slapd (Stand-alone LDAP Daemon) to provide the identity service for our cluster.

Provision script for ldap-0

cleantest has a built-in provisioning mechanism for test environment instances. You can write custom Python scripts that use utilities provided by cleantest to set up the test environment instance how you like. Below is the provision script that we will use to provision the ldap-0 instance in our mini-HPC cluster. Save the provision script to the file ldap_provision_script.py in your current working directory.

#!/usr/bin/env python3
# Copyright 2023 Jason C. Nucciarone
# See LICENSE file for licensing details.

"""Provision LDAP server nodes."""

import pathlib
import tempfile
import textwrap

from cleantest.utils import apt, systemd, run

# Define resources needed to set up LDAP.
slapd_preseed = textwrap.dedent(
    """
    slapd slapd/no_configuration boolean false
    slapd slapd/domain string mini-hpc.org
    slapd shared/organization string mini-hpc
    slapd slapd/password1 password test
    slapd slapd/password2 password test
    slapd slapd/purge_database boolean true
    slapd slapd/move_old_database boolean true
    """
).strip("\n")
default_ldif = textwrap.dedent(
    """
    dn: ou=People,dc=mini-hpc,dc=org
    objectClass: organizationalUnit
    ou: People

    dn: ou=Groups,dc=mini-hpc,dc=org
    objectClass: organizationalUnit
    ou: Groups

    dn: uid=nucci,ou=People,dc=mini-hpc,dc=org
    uid: nucci
    objectClass: inetOrgPerson
    objectClass: posixAccount
    cn: nucci
    sn: nucci
    givenName: nucci
    mail: nucci@example.com
    userPassword: test
    uidNumber: 10000
    gidNumber: 10000
    loginShell: /bin/bash
    homeDirectory: /home/nucci

    dn: cn=nucci,ou=Groups,dc=mini-hpc,dc=org
    cn: nucci
    objectClass: posixGroup
    gidNumber: 10000
    memberUid: nucci

    dn: cn=research,ou=Groups,dc=mini-hpc,dc=org
    cn: research
    objectClass: posixGroup
    gidNumber: 10100
    memberUid: nucci
    """
).strip("\n")

# Set up slapd service.
apt.update()
apt.install("slapd", "ldap-utils", "debconf-utils")
with tempfile.NamedTemporaryFile() as preseed, tempfile.NamedTemporaryFile() as ldif:
    pathlib.Path(preseed.name).write_text(slapd_preseed)
    pathlib.Path(ldif.name).write_text(default_ldif)
    results = run(
        f"debconf-set-selections < {preseed.name}",
        "dpkg-reconfigure -f noninteractive slapd",
        (
            "ldapadd -x -D cn=admin,dc=mini-hpc,dc=org -w "
            f"test -f {ldif.name} -H ldap:///"
        ),
    )
    for result in results:
        assert result.exit_code == 0

systemd.restart("slapd")

The provision script works through the following steps:

  1. Import the apt, systemd, and run utilities for interfacing with the APT package manager and systemd on the instance, and executing shell commands respectively.
  2. Define a preseed file that will be used by debconf-set-selections to configure the slapd service on the instance.
  3. Define an LDIF (LDAP Data Interchange Format) file that will be used to create our test user and group.
  4. Update the APT cache and install slapd, ldap-utils, and debconf-utils.
  5. Configure the slapd service using debconf-set-selections and add the test user and group to the LDAP server.
  6. Restart slapd so that the new configuration takes effect (an optional verification sketch follows this list).
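
If you would like to sanity-check the directory before moving on, you could append a query like the one below to ldap_provision_script.py. This is an optional sketch that reuses the run utility already imported by the script; the ldapsearch invocation (binding as the admin user created by the preseed) is my own addition, not part of the tutorial's script.

# Optional sanity check: confirm the test user entry exists in the directory.
for result in run(
    "ldapsearch -x -H ldap:/// -D cn=admin,dc=mini-hpc,dc=org "
    "-w test -b dc=mini-hpc,dc=org '(uid=nucci)'"
):
    assert result.exit_code == 0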

Use cleantest archon to create ldap-0

Now that we have our provision script ldap_provision_script.py written, use the following block of code to add the instance ldap-0 to our LXD test environment provider:

archon.add(
    "ldap-0",
    image="mini-hpc-sm",
    provision_script=root / "ldap_provision_script.py",
)

The provision script will be injected into ldap-0 and executed after the instance becomes active.

Setting up the shared filesystem

We will be using NFS (Network File System) to provide the shared filesystem in the mini-HPC cluster. Compute instances generally operate on the same set of data across the cluster, so every compute node needs access to that data. Using NFS, we can ensure that each compute instance sees the same files and directories.

Provision script for nfs-0

With the ldap-0 instance created, create the provision script nfs_provision_script.py in your current working directory and add the following code block to the file:

#!/usr/bin/env python3
# Copyright 2023 Jason C. Nucciarone
# See LICENSE file for licensing details.

"""Provision NFS server nodes."""

import pathlib
import textwrap

from cleantest.utils import apt, systemd, run

# Define resources needed to set up nfs-kernel-server.
default_exports = textwrap.dedent(
    """
    /srv     *(ro,sync,subtree_check)
    /home    *(rw,sync,no_subtree_check)
    /data    *(rw,sync,no_subtree_check,no_root_squash)
    /opt     *(rw,sync,no_subtree_check,no_root_squash)
    """
).strip("\n")

# Set up SSSD service.
apt.update()
apt.install("nfs-kernel-server", "sssd-ldap")
for result in run(
    "mv /root/.init/sssd.conf /etc/sssd/sssd.conf",
    "chmod 0600 /etc/sssd/sssd.conf",
    "pam-auth-update --enable mkhomedir",
):
    assert result.exit_code == 0

systemd.restart("sssd")

# Set up NFS kernel server.
for result in run(
    "mkdir -p /data/nucci",
    "mkdir -p /home/nucci",
    "chown -R nucci:nucci /data/nucci",
    "chown -R nucci:nucci /home/nucci",
    "chmod 0755 /data",
    "chmod -R 0750 /data/nucci",
    "chmod -R 0740 /home/nucci",
    "ln -s /data/nucci /home/nucci/data",
):
    assert result.exit_code == 0

pathlib.Path("/etc/exports").write_text(default_exports)
for result in run("exportfs -a"):
    assert result.exit_code == 0

systemd.restart("nfs-kernel-server")
Connecting to the LDAP server on ldap-0 using sssd

Notice how, in nfs_provision_script.py, when we configure the sssd service we first move an sssd.conf file from /root/.init/sssd.conf to /etc/sssd/sssd.conf? You may be wondering how that file got there. Worry not! In the next section, we will generate that sssd.conf file using a Jinja template and upload it to nfs-0 as a "provisioning resource."

This script will work through the following steps to configure the NFS server that will provide the shared filesystem for the mini-HPC cluster:

  1. Import apt, systemd, and run utilities for interfacing with the APT package manager and systemd on the instance, and executing shell commands respectively.
  2. Define the exports file that will be used to tell the NFS kernel server which directories to export.
  3. Update the APT cache and install nfs-kernel-server and sssd-ldap for running the NFS server and connecting to the LDAP server respectively.
  4. Connect to the LDAP server on ldap-0 using sssd.
  5. Set up the test user's home and data directories.
  6. Share the /data and /home directories across the network.
  7. Restart the nfs-kernel-server service so the new /etc/exports file takes effect (an optional export check is sketched after this list).
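
If you want the provision script to catch a bad /etc/exports entry early, you could append a quick listing of the active exports to nfs_provision_script.py. This is an optional sketch of my own, not part of the tutorial's script; it only checks that exportfs runs cleanly.

# Optional check: list the active exports after restarting the NFS server.
for result in run("exportfs -v"):
    assert result.exit_code == 0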

Use cleantest archon to create nfs-0

Now with our nfs_provision_script.py file created, use the following block of code to add the nfs-0 instance to our LXD test environment provider.

sssd_conf = StringIO(
    templates.get_template("sssd.conf.tmpl").render(
        ldap_server_address=archon.get_public_address("ldap-0")
    )
)
archon.add(
    "nfs-0",
    image="mini-hpc-sm",
    provision_script=root / "nfs_provision_script.py",
    resources=[File(sssd_conf, "/root/.init/sssd.conf")],
)

The provision script and generated sssd.conf file will be injected into nfs-0 after the instance becomes active.

Starting the SLURM cluster

SLURM is one of the most popular open-source workload managers for HPC. SLURM scales very well, so we can use it inside the mini-HPC cluster even though it is designed to manage clusters with thousands of nodes.

Provision script for slurmctld-0

Now that we have both the ldap-0 and nfs-0 instances created, it is time to provision the controller server slurmctld-0 for the mini-HPC cluster. Create the file slurmctld_provision_script.py in your current working directory and copy the following code block to it:

#!/usr/bin/env python3
# Copyright 2023 Jason C. Nucciarone
# See LICENSE file for licensing details.

"""Provision slurmctld nodes."""

import pathlib
import json
from io import StringIO

from cleantest.utils import apt, systemd, run

# Set up SSSD service.
apt.update()
apt.install("slurmctld", "nfs-common", "sssd-ldap")
for result in run(
    "mv /root/.init/sssd.conf /etc/sssd/sssd.conf",
    "chmod 0600 /etc/sssd/sssd.conf",
    "pam-auth-update --enable mkhomedir",
):
    assert result.exit_code == 0

systemd.restart("sssd")

# Set up NFS mount.
nfs_ip = json.load(StringIO(pathlib.Path("/root/.init/nfs-0").read_text()))
for result in run(
    f"mount {nfs_ip['nfs-0']}:/home /home",
    "mkdir -p /data",
    f"mount {nfs_ip['nfs-0']}:/data /data",
):
    assert result.exit_code == 0

Mounting directories shared by nfs-0 and connecting to LDAP server on ldap-0

Notice how, just like when we provisioned the nfs-0 instance, we use a "provisioning resource" to connect the sssd service to the LDAP server? In the provisioning script above for slurmctld-0, we use the same mechanism to mount the shared directories exported by nfs-0. We will create this /root/.init/nfs-0 resource in the section about creating the slurmctld-0 instance.
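
For reference, the /root/.init/nfs-0 resource is just a small JSON document that maps the instance name to its address. With a made-up address, its contents would look like this:

{"nfs-0": "10.170.34.21"}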

This script will work through the following steps to configure the controller service for the mini-HPC cluster. We will be starting the controller server manually rather than having the provision script start it for us:

  1. Import apt, systemd, and run utilities for interfacing with the APT package manager and systemd on the instance, and executing shell commands respectively.
  2. Update the APT cache and install slurmctld, nfs-common, and sssd-ldap for running the controller server, mounting the shared directories, and connecting to the LDAP server respectively.
  3. Connect to the LDAP server running on ldap-0 using sssd.
  4. Mount the shared directories exported by the NFS server running on nfs-0.
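
If you would like the provision script to fail fast when the NFS mounts do not attach, you could append a check like the following to slurmctld_provision_script.py. This is an optional sketch of my own, not part of the tutorial's script; mountpoint exits non-zero when the given directory is not a mount point.

# Optional check: verify that /home and /data are actually NFS mounts.
for result in run("mountpoint /home", "mountpoint /data"):
    assert result.exit_code == 0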

Provision script for slurmd-{0,1,2}

We can also provision the compute nodes alongside the controller server slurmctld-0. Create the file slurmd_provision_script.py in your current working directory and copy the following code block to it:

#!/usr/bin/env python3
# Copyright 2023 Jason C. Nucciarone
# See LICENSE file for licensing details.

"""Provision slurmd nodes."""

import json
import pathlib
from io import StringIO

from cleantest.utils import apt, systemd, run

# Set up SSSD service.
apt.update()
apt.install("slurmd", "nfs-common", "sssd-ldap")
for result in run(
    "mv /root/.init/sssd.conf /etc/sssd/sssd.conf",
    "chmod 0600 /etc/sssd/sssd.conf",
    "pam-auth-update --enable mkhomedir",
):
    assert result.exit_code == 0

systemd.restart("sssd")

# Set up NFS mount.
nfs_ip = json.load(StringIO(pathlib.Path("/root/.init/nfs-0").read_text()))
for result in run(
    f"mount {nfs_ip['nfs-0']}:/home /home",
    "mkdir -p /data",
    f"mount {nfs_ip['nfs-0']}:/data /data",
):
    assert result.exit_code == 0

# Set up munge key.
for result in run(
    "mv /root/.init/munge.key /etc/munge/munge.key",
    "chown munge:munge /etc/munge/munge.key",
    "chmod 0600 /etc/munge/munge.key",
):
    assert result.exit_code == 0

systemd.restart("munge")
Setting up MUNGE authentication service

Similar to how nfs-0 and slurmctld-0 are provisioned, the slurmd instances also use "provisioning resources." However, unlike those instances, the slurmd instances require a MUNGE key from slurmctld-0; MUNGE is the authentication service that the SLURM workload manager uses to verify nodes in the SLURM cluster. The section on creating the slurmd instances will show you how to pull resources from other instances.

This script will work through the following steps to configure the compute service for the mini-HPC cluster. We will be starting the compute servers manually rather than having the provision script start them for us:

  1. Import apt, systemd, and run utilities for interfacing with the APT package manager and systemd on the instance, and executing shell commands respectively.
  2. Update the APT cache and install slurmd, nfs-common, and sssd-ldap for running the compute server, mounting the shared directories, and connecting to the LDAP server respectively.
  3. Connect to the LDAP server running on ldap-0 using sssd.
  4. Mount the shared directories exported by the NFS server running on nfs-0.
  5. Set up the MUNGE key pulled from the slurmctld-0 instance (an optional MUNGE check is sketched after this list).
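
To confirm that MUNGE authentication works once the key is in place, you could append a local round trip to slurmd_provision_script.py. This is an optional sketch of my own, not part of the tutorial's script; munge -n encodes a credential and unmunge decodes it, which fails if the key is missing or unreadable.

# Optional check: encode and decode a MUNGE credential locally.
for result in run("munge -n | unmunge"):
    assert result.exit_code == 0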

Use cleantest archon to create slurmctld-0

Now with our slurmctld_provision_script.py file created, use the following block of code to add the slurmctld-0 instance to our LXD test environment provider.

nfs_ip = json.dumps({"nfs-0": str(archon.get_public_address("nfs-0"))})
archon.add(
    "slurmctld-0",
    image="mini-hpc-sm",
    provision_script=root / "slurmctld_provision_script.py",
    resources=[
        File(sssd_conf, "/root/.init/sssd.conf"),
        File(StringIO(nfs_ip), "/root/.init/nfs-0"),
    ],
)

The provision script, generated sssd.conf file, and generated nfs-0 file will be injected into slurmctld-0 after the instance becomes active.

Pull MUNGE key from slurmctld-0 and create slurmd-{0,1,2}

Using our other provisioning script slurmd_provision_script.py, use the following code block to pull the munge.key file from slurmctld-0 and create the instances slurmd-0, slurmd-1, and slurmd-2.

archon.pull(
    "slurmctld-0", data_obj=[File("/etc/munge/munge.key", root / "munge.key")]
)
archon.add(
    ["slurmd-0", "slurmd-1", "slurmd-2"],
    image="mini-hpc-sm",
    provision_script=root / "slurmd_provision_script.py",
    resources=[
        File(sssd_conf, "/root/.init/sssd.conf"),
        File(StringIO(nfs_ip), "/root/.init/nfs-0"),
        File(root / "munge.key", "/root/.init/munge.key"),
    ],
)

The provision script, generated sssd.conf file, generated nfs-0 file, and pulled munge.key file will be injected into slurmd-0, slurmd-1, and slurmd-2 after the instances become active.

Sync slurm.conf across slurmctld-0 and slurmd-{0,1,2} and start SLURM services

Now that slurmctld-0, slurmd-0, slurmd-1, and slurmd-2 have been created, we can use the following code block to start the SLURM cluster:

slurm_node_info = {
    "slurmctld_name": "slurmctld-0",
    "slurmctld_address": archon.get_public_address("slurmctld-0"),
    "slurmd_0_name": "slurmd-0",
    "slurmd_0_address": archon.get_public_address("slurmd-0"),
    "slurmd_1_name": "slurmd-1",
    "slurmd_1_address": archon.get_public_address("slurmd-1"),
    "slurmd_2_name": "slurmd-2",
    "slurmd_2_address": archon.get_public_address("slurmd-2"),
}
slurm_conf = StringIO(
    templates.get_template("slurm.conf.tmpl").render(**slurm_node_info)
)
for node in ["slurmctld-0", "slurmd-0", "slurmd-1", "slurmd-2"]:
    archon.push(node, data_obj=[File(slurm_conf, "/etc/slurm/slurm.conf")])
archon.execute("slurmctld-0", command="systemctl start slurmctld")
archon.execute(
    ["slurmd-0", "slurmd-1", "slurmd-2"], command="systemctl start slurmd"
)

This code block will retrieve the IPv4 addresses of slurmctld-0, slurmd-0, slurmd-1, and slurmd-2, render slurm.conf from its Jinja template, push slurm.conf into all of the SLURM instances, and then start each SLURM service.
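
As an optional smoke test, you could also ask the controller for the partition state once the services are up. This sketch is my own addition; the tutorial does not show whether archon.execute surfaces command output, so treat it purely as a "does sinfo run" check rather than an assertion on cluster state.

# Optional smoke test: sinfo fails if slurmctld is unreachable.
archon.execute("slurmctld-0", command="sinfo")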

Write a testlet to submit test batch job

Congratulations, we finally have all the code that we need to create the mini-HPC cluster when we go to run our test. Now it is time to write the testlet. In the code block below, @lxd.target("...") is used to target a specific test environment instance rather than creating a unique instance for the testlet:

@lxd.target("slurmctld-0")
def run_job():
    import os
    import pathlib
    import shutil
    import textwrap
    from time import sleep

    from cleantest.utils import run

    tmp_dir = pathlib.Path("/tmp")
    (tmp_dir / "research.submit").write_text(
        textwrap.dedent(
            """
            #!/bin/bash
            #SBATCH --job-name=research
            #SBATCH --partition=all
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --cpus-per-task=1
            #SBATCH --mem=500mb
            #SBATCH --time=00:00:30
            #SBATCH --error=research.err
            #SBATCH --output=research.out

            echo "I love doing research!"
            """
        ).strip("\n")
    )

    # Set user to test cluster user nucci.
    os.setuid(10000)
    os.chdir("/home/nucci")
    for result in run(
        f"cp {(tmp_dir / 'research.submit')} .",
        "sbatch research.submit",
    ):
        assert result.exit_code == 0
    sleep(60)
    shutil.copy("research.out", (tmp_dir / "result"))
A warning about using cleantest with the SLURM workload manager

One thing to note about the SLURM workload manager is that it does not let user root submit jobs to the scheduler (for obvious reasons). Unfortunately, cleantest currently only allows for testlets to be run as user root. To get around this limitation, we can use Python's os module to change to our test user, but we cannot change back to user root after the switch. You should have your "power-user" operations taken care of before switching to the test user.
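
If you need to keep running privileged commands after the job is submitted, a hypothetical workaround is to stay root and drop privileges only for the submission itself, for example by wrapping the sbatch call with su. This sketch is my own alternative to the os.setuid() call in the testlet above, not part of the tutorial:

# Hypothetical alternative: stay root and run only the submission as nucci,
# so later steps in the testlet can still perform privileged operations.
for result in run(
    "su - nucci -c 'cp /tmp/research.submit . && sbatch research.submit'"
):
    assert result.exit_code == 0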

Bringing it all together

Your completed test file test_mini_hpc.py should look like the following:

#!/usr/bin/env python3

"""Test batch job using mini-HPC cluster created with cleantest."""

import json
import os
import pathlib
from io import StringIO

from jinja2 import Environment, FileSystemLoader

from cleantest.control.hooks import StopEnvHook
from cleantest.control.lxd import InstanceConfig
from cleantest.data import File
from cleantest.provider import lxd, LXDArchon

root = pathlib.Path(os.path.dirname(os.path.realpath(__file__)))
templates = Environment(loader=FileSystemLoader(root / "templates"))


@lxd.target("slurmctld-0")
def run_job():
    import os
    import pathlib
    import shutil
    import textwrap
    from time import sleep

    from cleantest.utils import run

    tmp_dir = pathlib.Path("/tmp")
    (tmp_dir / "research.submit").write_text(
        textwrap.dedent(
            """
            #!/bin/bash
            #SBATCH --job-name=research
            #SBATCH --partition=all
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --cpus-per-task=1
            #SBATCH --mem=500mb
            #SBATCH --time=00:00:30
            #SBATCH --error=research.err
            #SBATCH --output=research.out

            echo "I love doing research!"
            """
        ).strip("\n")
    )

    # Set user to test cluster user nucci.
    os.setuid(10000)
    os.chdir("/home/nucci")
    for result in run(
        f"cp {(tmp_dir / 'research.submit')} .",
        "sbatch research.submit",
    ):
        assert result.exit_code == 0
    sleep(60)
    shutil.copy("research.out", (tmp_dir / "result"))


def test_mini_hpc() -> None:
    """Test batch job inside mini-hpc cluster."""
    archon = LXDArchon()
    archon.config.register_hook(
        StopEnvHook(name="get_result", download=[File("/tmp/result", root / "result")])
    )
    placeholder = archon.config.get_instance_config("ubuntu-jammy-amd64").dict()
    placeholder["name"] = "mini-hpc-sm"
    archon.config.add_instance_config(
        InstanceConfig(
            config={
                "limits.cpu": "1",
                "limits.memory": "8GB",
                "security.privileged": "true",
                "raw.apparmor": "mount fstype=nfs*, mount fstype=rpc_pipefs,",
            },
            **placeholder,
        )
    )
    archon.add(
        "ldap-0",
        image="mini-hpc-sm",
        provision_script=root / "ldap_provision_script.py",
    )
    sssd_conf = StringIO(
        templates.get_template("sssd.conf.tmpl").render(
            ldap_server_address=archon.get_public_address("ldap-0")
        )
    )
    archon.add(
        "nfs-0",
        image="mini-hpc-sm",
        provision_script=root / "nfs_provision_script.py",
        resources=[File(sssd_conf, "/root/.init/sssd.conf")],
    )
    nfs_ip = json.dumps({"nfs-0": str(archon.get_public_address("nfs-0"))})
    archon.add(
        "slurmctld-0",
        image="mini-hpc-sm",
        provision_script=root / "slurmctld_provision_script.py",
        resources=[
            File(sssd_conf, "/root/.init/sssd.conf"),
            File(StringIO(nfs_ip), "/root/.init/nfs-0"),
        ],
    )
    archon.pull(
        "slurmctld-0", data_obj=[File("/etc/munge/munge.key", root / "munge.key")]
    )
    archon.add(
        ["slurmd-0", "slurmd-1", "slurmd-2"],
        image="mini-hpc-sm",
        provision_script=root / "slurmd_provision_script.py",
        resources=[
            File(sssd_conf, "/root/.init/sssd.conf"),
            File(StringIO(nfs_ip), "/root/.init/nfs-0"),
            File(root / "munge.key", "/root/.init/munge.key"),
        ],
    )
    slurm_node_info = {
        "slurmctld_name": "slurmctld-0",
        "slurmctld_address": archon.get_public_address("slurmctld-0"),
        "slurmd_0_name": "slurmd-0",
        "slurmd_0_address": archon.get_public_address("slurmd-0"),
        "slurmd_1_name": "slurmd-1",
        "slurmd_1_address": archon.get_public_address("slurmd-1"),
        "slurmd_2_name": "slurmd-2",
        "slurmd_2_address": archon.get_public_address("slurmd-2"),
    }
    slurm_conf = StringIO(
        templates.get_template("slurm.conf.tmpl").render(**slurm_node_info)
    )
    for node in ["slurmctld-0", "slurmd-0", "slurmd-1", "slurmd-2"]:
        archon.push(node, data_obj=[File(slurm_conf, "/etc/slurm/slurm.conf")])
    archon.execute("slurmctld-0", command="systemctl start slurmctld")
    archon.execute(
        ["slurmd-0", "slurmd-1", "slurmd-2"], command="systemctl start slurmd"
    )
    for name, result in run_job():
        assert "I love doing research!" in pathlib.Path(root / "result").read_text()
    (root / "munge.key").unlink(missing_ok=True)
    (root / "result").unlink(missing_ok=True)
    archon.execute(
        ["slurmctld-0", "slurmd-0", "slurmd-1", "slurmd-2"],
        command=f"umount /home /data",
    )
    archon.destroy()

Do not forget to add the final part that evaluates the result of the testlet! After the evaluation, some basic cleanup is performed so that the mini-HPC cluster does not continue to use your workstation's resources after the test has completed. Now use pytest to run your "clean test"!

pytest test_mini_hpc.py

Now sit back, relax, and wait for your test to complete!

Where to go from here

This was quite a long tutorial, so make sure to grant yourself a well-earned coffee break!

If you are interested in learning more about the underpinnings of a mini-HPC cluster, check out my intro to open-source HPC workshop that I gave at the 2022 Ubuntu Summit in Prague. The workshop takes you through building your own mini-HPC cluster manually, and it goes beyond setting up the infrastructure, covering topics such as building containers and deploying your own software stack.

If you want to sink your teeth further into cleantest, I suggest reading more of the tutorials or heading over to the User Guide page to learn about what cleantest is capable of!