Machine Learning Guide

This guide shows you how to create and reproduce a comprehensive machine learning experiment that leverages the full potential of the CC framework. Before you continue, make sure that you have understood the contents of the RED Beginner’s Guide. A basic understanding of machine learning methods is recommended.

This guide contains an experiment that trains a Convolutional Neural Network (CNN) on the PCAM dataset to classify tumor tissue in pathological image slides.

Teaching Goals

The main teaching goals of this guide are:

  • how to use a read-only SSHFS directory to mount a large training dataset located on a remote server
  • how to use a writable SSHFS directory for live logging of the training process
  • how to use batch processing for hyperparameter optimization of machine learning methods
  • how to use Nvidia GPUs to accelerate the processing
  • how to send experiments to the CC-Agency execution engine

Prerequisites

Using an Nvidia GPU for training acceleration is recommended but not required.

The dataset used in this guide is located on the avocado01.f4.htw-berlin.de storage server, which is not available to the public. You can still follow the guide using your own SSH server.

Download Dataset to Storage Server

You can skip this section if you have SSH access to avocado01.f4.htw-berlin.de.

Log in to your SSH server, create a PCAM folder in your home directory, download the PCAM dataset using curl and extract the files using gunzip.

SSH_USERNAME=christoph
SSH_HOST=avocado01.f4.htw-berlin.de
ssh ${SSH_USERNAME}@${SSH_HOST}
mkdir PCAM
cd PCAM
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_x.h5.gz
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_y.h5.gz
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_x.h5.gz
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_y.h5.gz
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_x.h5.gz
curl -fO https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_y.h5.gz
gunzip *.h5.gz

When following the tutorial, you have to replace all occurrences of /data/ldap/histopathologic/original_read_only/PCAM_extracted with PCAM and all occurrences of avocado01.f4.htw-berlin.de with your own SSH server.
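Before moving on, it can help to confirm that all six extracted files are in place on the server. The following is a minimal sketch using only the Python standard library. The helper name missing_pcam_files is hypothetical and not part of CC or the PCAM tooling, and the ~/PCAM path assumes the folder created above.

```python
import os

# File names as they exist after the download and gunzip steps above.
PCAM_FILES = [
    'camelyonpatch_level_2_split_{}_{}.h5'.format(split, kind)
    for split in ('train', 'valid', 'test')
    for kind in ('x', 'y')
]


def missing_pcam_files(data_dir):
    """Return the expected PCAM file names that are not present in data_dir."""
    data_dir = os.path.expanduser(data_dir)
    return [
        name for name in PCAM_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


if __name__ == '__main__':
    missing = missing_pcam_files('~/PCAM')
    if missing:
        print('Missing files:', ', '.join(missing))
    else:
        print('All PCAM files found.')
```

An empty result means the directory contains every file name that the training code and the test split rely on.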

Training Experiment

This part of the guide describes the setup of the experiment.

Training Code

The following code uses a pre-release of TensorFlow 2. The standard CNN architecture NASNetMobile is used to learn the binary classification of tumor tiles. Store this code in a file using nano cnn-training.py, or another editor of your choice, and make the file executable with chmod u+x cnn-training.py.

#!/usr/bin/env python3

import os
import sys
import argparse
import random

import h5py
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.nasnet import NASNetMobile
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC


# Constants
WEIGHTS_FILE = 'weights.h5'
TRAIN_X_FILE = 'camelyonpatch_level_2_split_train_x.h5'
TRAIN_Y_FILE = 'camelyonpatch_level_2_split_train_y.h5'
VALID_X_FILE = 'camelyonpatch_level_2_split_valid_x.h5'
VALID_Y_FILE = 'camelyonpatch_level_2_split_valid_y.h5'
INPUT_SHAPE = (96, 96, 3)


# Arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    dest='data_dir', type=str,
    help='Data: Path to read-only directory containing PCAM *.h5 files.'
)
parser.add_argument(
    '--learning-rate', type=float, default=0.0001,
    help='Training: Learning rate. Default: 0.0001'
)
parser.add_argument(
    '--batch-size', type=int, default=64,
    help='Training: Batch size. Default: 64'
)
parser.add_argument(
    '--num-epochs', type=int, default=5,
    help='Training: Number of epochs. Default: 5'
)
parser.add_argument(
    '--steps-per-epoch', type=int, default=None,
    help='Training: Steps per epoch. Default: data_size / batch_size'
)
parser.add_argument(
    '--log-dir', type=str, default=None,
    help='Debug: Path to writable directory for a log file to be created. Default: log to stdout / stderr'
)
parser.add_argument(
    '--log-file-name', type=str, default='training.log',
    help='Debug: Name of the log file, generated when --log-dir is set. Default: training.log'
)
args = parser.parse_args()


# Redirect output streams for logging
if args.log_dir:
    log_file = open(os.path.join(os.path.expanduser(args.log_dir), args.log_file_name), 'w')
    sys.stdout = log_file
    sys.stderr = log_file

print('GPU available:', tf.test.is_gpu_available())


# Model
model = NASNetMobile(weights=None, input_shape=INPUT_SHAPE, classes=2)
model.compile(
    loss='categorical_crossentropy',
    optimizer=Adam(args.learning_rate),
    metrics=['accuracy', AUC()]
)
model.summary()


# Input
def data_generator(x, y, batch_size=None):
    index = range(len(x))
    labels = np.array([[1, 0], [0, 1]])

    while True:
        index_sample = index
        if batch_size is not None:
            index_sample = sorted(random.sample(index, batch_size))

        x_data = x[index_sample] / 255.0
        y_data = y[index_sample]
        y_data = labels[y_data[:, 0, 0, 0]]
        yield x_data, y_data


data_dir = os.path.expanduser(args.data_dir)

train_x = h5py.File(os.path.join(data_dir, TRAIN_X_FILE), 'r', libver='latest', swmr=True)['x']
train_y = h5py.File(os.path.join(data_dir, TRAIN_Y_FILE), 'r', libver='latest', swmr=True)['y']
valid_x = h5py.File(os.path.join(data_dir, VALID_X_FILE), 'r', libver='latest', swmr=True)['x']
valid_y = h5py.File(os.path.join(data_dir, VALID_Y_FILE), 'r', libver='latest', swmr=True)['y']


# Training
data_size = len(train_x)
steps_per_epoch = data_size // args.batch_size

if args.steps_per_epoch:
    steps_per_epoch = args.steps_per_epoch

model.fit_generator(
    data_generator(train_x, train_y, batch_size=args.batch_size),
    steps_per_epoch=steps_per_epoch,
    epochs=args.num_epochs,
    validation_data=data_generator(valid_x, valid_y, batch_size=args.batch_size),
    validation_steps=1
)

# Output
model.save_weights(WEIGHTS_FILE)

if args.log_dir:
    sys.stdout.close()

The output of the script is a weights file named weights.h5. The name is defined by the constant WEIGHTS_FILE, and the file is created in your current working directory. These weights represent what the model has learned.

The training and validation data is contained in the PCAM directory. Since this directory will be mounted in the container by CC, the local path inside the container is not known beforehand. Therefore the path cannot be hard-coded and is provided as a mandatory CLI argument at the first argument position. You could, for example, run this program locally on your computer as ./cnn-training.py /path/to/PCAM --batch-size 32. This requires the PCAM data directory to be available locally as /path/to/PCAM. For the sake of simplicity, the expected file names inside the directory are hard-coded.

Additional training parameters, like --steps-per-epoch and --learning-rate, are optional arguments. Of course, in a real-world application, many more parameters should be exposed via the CLI.
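To illustrate how the mandatory positional argument and the optional flags interact, here is a reduced sketch of the parser from cnn-training.py. It keeps only three of the arguments and passes an explicit argument list to parse_args, so it can be tried without the dataset or TensorFlow.

```python
import argparse

# Reduced copy of the parser in cnn-training.py: one mandatory positional
# argument and two of the optional arguments.
parser = argparse.ArgumentParser()
parser.add_argument(dest='data_dir', type=str)
parser.add_argument('--learning-rate', type=float, default=0.0001)
parser.add_argument('--batch-size', type=int, default=64)

# Equivalent to: ./cnn-training.py /path/to/PCAM --batch-size 32
args = parser.parse_args(['/path/to/PCAM', '--batch-size', '32'])

print(args.data_dir)       # /path/to/PCAM
print(args.batch_size)     # 32
print(args.learning_rate)  # 0.0001 (default, because the flag was omitted)
```

Note that argparse converts the dashes in flag names to underscores, which is why --batch-size is read as args.batch_size.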

A special feature is the optional --log-dir argument. If a path to an existing directory is specified, the stdout and stderr streams are redirected into a file that is created in this log directory. The default name for this file is training.log. If multiple training runs are executed in parallel, this name can be changed using --log-file-name to avoid conflicts. We will use this feature to mount a writable SSHFS network filesystem for logging, so that the training progress can be viewed live from outside the container.
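The redirection idea can be tried in isolation. The sketch below is a variation on the approach in cnn-training.py, not a copy of it: instead of assigning to sys.stdout directly, it uses contextlib.redirect_stdout and contextlib.redirect_stderr with a line-buffered file. The helper name run_with_log is hypothetical, and a temporary directory stands in for the mounted log directory.

```python
import contextlib
import os
import tempfile


def run_with_log(log_dir, log_file_name, func):
    """Call func() while stdout and stderr are written to a file in log_dir."""
    log_path = os.path.join(os.path.expanduser(log_dir), log_file_name)
    # buffering=1 selects line buffering in text mode, so every completed
    # line is flushed to the file instead of waiting for the file to close.
    with open(log_path, 'w', buffering=1) as log_file:
        with contextlib.redirect_stdout(log_file), contextlib.redirect_stderr(log_file):
            func()
    return log_path


# Usage with a temporary directory standing in for the mounted log directory.
with tempfile.TemporaryDirectory() as log_dir:
    path = run_with_log(log_dir, 'training_0.log', lambda: print('epoch 1 done'))
    with open(path) as f:
        content = f.read()

print(content)
```

Like the original script, this only captures output written through Python's sys.stdout and sys.stderr. How quickly new lines become visible on the remote side of an SSHFS mount also depends on the mount's caching behavior.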

CWL

Now create a CWL description of the cnn-training.py command line interface using nano cnn-training.cwl.yml.

cwlVersion: "v1.0"
class: "CommandLineTool"
baseCommand: "cnn-training.py"
doc: "Train a CNN on PCAM data in HDF5 format."

inputs:
  data_dir:
    type: "Directory"
    inputBinding:
      position: 0
    doc: "Data: Path to read-only directory containing PCAM *.h5 files."
  learning_rate:
    type: "float?"
    inputBinding:
      prefix: "--learning-rate"
    doc: "Training: Learning rate. Default: 0.0001"
  batch_size:
    type: "int?"
    inputBinding:
      prefix: "--batch-size"
    doc: "Training: Batch size. Default: 64"
  num_epochs:
    type: "int?"
    inputBinding:
      prefix: "--num-epochs"
    doc: "Training: Number of epochs. Default: 5"
  steps_per_epoch:
    type: "int?"
    inputBinding:
      prefix: "--steps-per-epoch"
    doc: "Training: Steps per epoch. Default: data_size / batch_size"
  log_dir:
    type: "Directory?"
    inputBinding:
      prefix: "--log-dir"
    doc: "Debug: Path to writable directory for a log file to be created. Default: log to stdout / stderr"
  log_file_name:
    type: "string?"
    inputBinding:
      prefix: "--log-file-name"
    doc: "Debug: Name of the log file, generated when --log-dir is set. Default: training.log"

outputs:
  weights_file:
    type: "File"
    outputBinding:
      glob: "weights.h5"
    doc: "CNN model weights in HDF5 format."

Please note that data_dir and log_dir are defined as directories. This allows us to use RED connectors to download or mount these directories. The ? in type descriptions like int? denotes an optional argument. The output is defined as a File named weights.h5. Since the file is mandatory, CC will report an error if the training code does not create it.
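To make the relation between the CWL description and the resulting command line more concrete, the following sketch assembles an argument list from a subset of the inputs above. This is a simplified illustration and not how CC or any CWL implementation actually works: the function build_command is hypothetical, directories are represented as plain path strings, and the ordering rule (sort by position, keep declaration order for ties) is an assumption.

```python
def build_command(base_command, inputs, job):
    """Assemble an argument list from simplified CWL input bindings."""
    parts = []
    for name, spec in inputs.items():
        value = job.get(name)
        if value is None:
            # Types ending in "?" are optional and may be left out.
            if spec['type'].endswith('?'):
                continue
            raise ValueError('missing mandatory input: {}'.format(name))
        binding = spec['inputBinding']
        args = [str(value)]
        if 'prefix' in binding:
            args = [binding['prefix']] + args
        parts.append((binding.get('position', 0), args))

    command = [base_command]
    # sorted() is stable: entries with equal position keep declaration order.
    for _, args in sorted(parts, key=lambda part: part[0]):
        command.extend(args)
    return command


inputs = {
    'data_dir': {'type': 'Directory', 'inputBinding': {'position': 0}},
    'learning_rate': {'type': 'float?', 'inputBinding': {'prefix': '--learning-rate'}},
    'batch_size': {'type': 'int?', 'inputBinding': {'prefix': '--batch-size'}},
}

job = {'data_dir': '/path/to/PCAM', 'batch_size': 32}
print(build_command('cnn-training.py', inputs, job))
# ['cnn-training.py', '/path/to/PCAM', '--batch-size', '32']
```

Leaving out learning_rate is accepted because its type is float?, while leaving out data_dir raises an error, which mirrors the difference between optional and mandatory inputs.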

Docker Image

Create the following Dockerfile using nano Dockerfile.

FROM docker.io/nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

RUN apt-get update \
&& apt-get install -y python3-venv python3-pip sshfs \
&& useradd -ms /bin/bash cc

# switch user
USER cc

ENV PATH /home/cc/.local/bin:${PATH}

RUN mkdir -p /home/cc/.local/bin

# install connectors
RUN python3 -m venv /home/cc/.local/red \
&& . /home/cc/.local/red/bin/activate \
&& pip install wheel \
&& pip install red-connector-ssh==1.2 \
&& ln -s /home/cc/.local/red/bin/red-connector-* /home/cc/.local/bin

# install app
RUN pip3 install --user --upgrade pip six \
&& pip install --user numpy h5py tensorflow-gpu==2.*

ADD --chown=cc:cc cnn-training.py /home/cc/.local/bin/cnn-training.py

This image installs red-connector-ssh. We plan to use the directory mounting feature of this connector, so sshfs must be installed as well.

The Python library dependencies for cnn-training.py are installed in the cc user’s home directory via pip3.

You can build a new image and push it to the Docker Hub registry using the following commands.

IMAGE=docker.io/curiouscontainers/cnn-training
docker build --tag ${IMAGE} .
docker run --rm ${IMAGE} cnn-training.py --help
docker run --rm ${IMAGE} red-connector-ssh --version
docker run --rm ${IMAGE} sshfs --version
docker push ${IMAGE}

Of course, you won’t be allowed to push the image to the curiouscontainers organization, but you can choose another image name for your own organization. To follow this guide, you do not have to publish the image yourself, because it is already available under the given image URL.

RED

The batches keyword in a RED file can be used to provide inputs / outputs pairs as a list. This batch processing feature can be used to run multiple containers in different configurations. Since the inputs / outputs configurations are all very similar, we will use a Python script to generate a RED file containing two batches. This makes the process of creating a large RED file much easier.

Create the Python script using nano cnn-training.red.py and make it executable with chmod u+x cnn-training.red.py.

#!/usr/bin/env python3

import json

SSH_SERVER = 'avocado01.f4.htw-berlin.de'
SSH_AUTH = {'username': '{{ssh_username}}', 'password': '{{ssh_password}}'}
DATA_DIR = '/data/ldap/histopathologic/original_read_only/PCAM_extracted'
AGENCY_URL = 'https://agency.f4.htw-berlin.de/cc'
LEARNING_RATES = [0.0001, 0.0005]
STEPS_PER_EPOCH = 10


batches = []

for i, learning_rate in enumerate(LEARNING_RATES):
    batch = {
        'inputs': {
            'data_dir': {
                'class': 'Directory',
                'connector': {
                    'command': 'red-connector-ssh',
                    'mount': True,
                    'access': {
                        'host': SSH_SERVER,
                        'auth': SSH_AUTH,
                        'dirPath': DATA_DIR
                    }
                }
            },
            'learning_rate': learning_rate,
            'steps_per_epoch': STEPS_PER_EPOCH,
            'log_dir': {
                'class': 'Directory',
                'connector': {
                    'command': 'red-connector-ssh',
                    'mount': True,
                    'access': {
                        'host': SSH_SERVER,
                        'auth': SSH_AUTH,
                        'dirPath': 'cnn-training/log',
                        'writable': True
                    }
                }
            },
            'log_file_name': 'training_{}.log'.format(i)
        },
        'outputs': {
            'weights_file': {
                'class': 'File',
                'connector': {
                    'command': 'red-connector-ssh',
                    'access': {
                        'host': SSH_SERVER,
                        'auth': SSH_AUTH,
                        'filePath': 'weights_{}.h5'.format(i),
                    }
                }
            }
        }
    }
    batches.append(batch)

with open('cnn-training.cwl.json') as f:
    cli = json.load(f)

red = {
    'redVersion': '9',
    'cli': cli,
    'batches': batches,
    'container': {
        'engine': 'docker',
        'settings': {
            'image': {
                'url': 'docker.io/curiouscontainers/cnn-training',
            },
            'ram': 32000,
            'gpus': {
                'vendor': 'nvidia',
                'count': 1
            }
        }
    },
    'execution': {
        'engine': 'ccagency',
        'settings': {
            'access': {
                'url': AGENCY_URL,
                'auth': {
                    'username': '{{agency_username}}',
                    'password': '{{agency_password}}'
                }
            }
        }
    }
}

with open('cnn-training.red.json', 'w') as f:
    json.dump(red, f, indent=4)

This script uses the json module from the Python standard library to load a CWL file named cnn-training.cwl.json, because Python does not provide a built-in YAML module. Since we created the CWL file in YAML format earlier, we first have to convert it to JSON using the faice convert format tool.

faice convert format --format=json cnn-training.cwl.yml > cnn-training.cwl.json

If you read the cnn-training.red.py script, you can see that the CWL information gets embedded in the RED file under the cli keyword. Furthermore, the script loops over two learning rates to create two distinct inputs / outputs pairs. Both data_dir and log_dir will use SSHFS, because mount: True is specified in the connector section, but only log_dir has the writable: True flag set. Remember to change the SSH_SERVER constant if you are not using avocado01.f4.htw-berlin.de.
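The same loop pattern extends to more than one hyperparameter. As a sketch, the snippet below combines learning rates with batch sizes using itertools.product and produces only the input values that differ between batches. The BATCH_SIZES values and the helper name varying_inputs are assumptions made for illustration; the remaining keys of each batch (data_dir, log_dir, outputs) would stay as in cnn-training.red.py.

```python
import itertools

LEARNING_RATES = [0.0001, 0.0005]
BATCH_SIZES = [32, 64]  # assumed values, chosen only for illustration


def varying_inputs(learning_rates, batch_sizes):
    """Return one dictionary of varying input values per combination."""
    combinations = itertools.product(learning_rates, batch_sizes)
    return [
        {
            'learning_rate': learning_rate,
            'batch_size': batch_size,
            # The index keeps log file names unique across batches.
            'log_file_name': 'training_{}.log'.format(i)
        }
        for i, (learning_rate, batch_size) in enumerate(combinations)
    ]


for entry in varying_inputs(LEARNING_RATES, BATCH_SIZES):
    print(entry)
```

Each resulting dictionary can be merged into the inputs of one batch. The same index should then also be used in the filePath of the weights_file output, so that batches do not overwrite each other's results.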

Under the execution keyword CC-Agency is specified as engine. The URL is set to https://agency.f4.htw-berlin.de/cc. If you do not have an account for this instance of CC-Agency, you can change the URL to point to a self-hosted agency.

As an alternative, you can switch to CC-FAICE and execute the experiment locally using the following code snippet.

red = {
    'execution': {
        'engine': 'ccfaice',
        'settings': {}
    }
}

If you want to use an Nvidia GPU in your system, you must have the docker-ce version of Docker and the Nvidia Container Toolkit installed, which is only possible on Linux. If you do not have access to GPUs, or your system does not fulfill the aforementioned requirements, remove the gpus section from the RED file to run the program on a CPU. Using the CPU may slow down the processing.
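Removing the gpus section can also be done programmatically, for example inside cnn-training.red.py before the file is written. The following is a minimal sketch; the helper name without_gpus is hypothetical, and the red dictionary is reduced to the keys relevant here.

```python
import copy


def without_gpus(red):
    """Return a copy of a RED dictionary with the gpus section removed."""
    result = copy.deepcopy(red)
    # pop with a default does nothing if the section is already absent.
    result['container']['settings'].pop('gpus', None)
    return result


# Reduced RED dictionary containing only the container settings.
red = {
    'container': {
        'engine': 'docker',
        'settings': {
            'ram': 32000,
            'gpus': {'vendor': 'nvidia', 'count': 1}
        }
    }
}

cpu_red = without_gpus(red)
print('gpus' in cpu_red['container']['settings'])  # False
print('gpus' in red['container']['settings'])      # True, the original is untouched
```

Working on a deep copy keeps the GPU variant available, so both versions of the RED file can be written from the same script if needed.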

You can now run cnn-training.red.py to create the cnn-training.red.json file.

./cnn-training.red.py

If you are using engine: 'ccagency', use faice exec to submit the batches to CC-Agency.

faice exec cnn-training.red.json

If you are using engine: 'ccfaice' instead, you have to add the --insecure flag to set SYS_ADMIN capabilities in Docker. These capabilities are required to use FUSE file systems like SSHFS in a Docker container.

faice exec --insecure cnn-training.red.json

You can watch the progress using the live log files.

SSH_USERNAME=christoph
SSH_HOST=avocado01.f4.htw-berlin.de

ssh ${SSH_USERNAME}@${SSH_HOST}
tail -f cnn-training/log/training_0.log
# tail -f cnn-training/log/training_1.log

In addition, you can check the status of your experiment’s batches using the CC-Agency API. For example, the following Python snippet uses the external packages requests and keyring to get information about the last two registered batches. The keyring part only works if you have used faice exec or the keyring CLI tool to store the variable values.

#!/usr/bin/env python3

from pprint import pprint

import requests
import keyring


url = 'https://agency.f4.htw-berlin.de/cc'
auth = (
    keyring.get_password('red', 'agency_username'),
    keyring.get_password('red', 'agency_password')
)

r = requests.get(
    f'{url}/batches?limit=2',
    auth=auth
)
r.raise_for_status()
batches = r.json()
pprint(batches)
