MPI Operator

The upcoming example shows how to use MPI in Horovod.

Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It aims to make distributed deep learning fast and easy to use by relying on ring-allreduce, and it requires only a few lines of changes to user code.
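To make the ring-allreduce idea concrete, here is a minimal single-process sketch in plain Python. It is an illustration only, not Horovod's implementation (Horovod relies on optimized communication libraries); the function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made for this example. Each worker's vector is split into as many chunks as there are workers, partial sums travel around the ring, and the finished chunks are then circulated so every worker ends up with the full sum.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring-allreduce over n workers inside one process.

    worker_data[i] is the vector held by worker i. The return value has the
    same shape, with every worker holding the element-wise sum.
    """
    n = len(worker_data)
    if n == 0:
        return []
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of the same length")

    # Work on copies so the caller's data is not mutated.
    data = [list(v) for v in worker_data]
    # Chunk c covers the indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c: int) -> range:
        return range(bounds[c % n], bounds[c % n + 1])

    # Phase 1 (scatter-reduce): in each step, worker i sends one chunk to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends behave as simultaneous.
        messages = []
        for i in range(n):
            c = (i - step) % n
            messages.append(((i + 1) % n, c, [data[i][k] for k in chunk(c)]))
        for dest, c, values in messages:
            for k, v in zip(chunk(c), values):
                data[dest][k] += v

    # Phase 2 (allgather): the reduced chunks travel around the ring and
    # overwrite the stale copies held by the other workers.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            c = (i + 1 - step) % n
            messages.append(((i + 1) % n, c, [data[i][k] for k in chunk(c)]))
        for dest, c, values in messages:
            for k, v in zip(chunk(c), values):
                data[dest][k] = v

    return data


if __name__ == "__main__":
    # Three simulated workers, each holding a four-element "gradient".
    print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]))
```

In this scheme each worker only ever talks to its two ring neighbours, and the amount of data it sends per step is one chunk rather than the whole vector.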

MPI

The MPI Operator plugin for Flyte uses the Kubeflow MPI Operator, which makes it easy to run allreduce-style distributed training on Kubernetes. It provides a greatly simplified interface for executing distributed training using MPI.

MPI and Horovod together can be leveraged to simplify the process of distributed training. The MPI Operator provides a convenient wrapper to run the Horovod scripts.

Installation

To use the Flytekit MPI Operator plugin, run the following command:

pip install flytekitplugins-kfmpi

Example of an MPI-enabled Flyte Task

In this code block, you can see the three parameters that an MPIJob can accept.

from flytekit import Resources, task
from flytekitplugins.kfmpi import MPIJob


@task(
    task_config=MPIJob(
        # number of workers to be spawned in the cluster for this job
        num_workers=2,
        # number of launcher replicas to be spawned in the cluster for this job
        # the launcher pod invokes mpirun and communicates with worker pods through MPI
        num_launcher_replicas=1,
        # number of slots per worker used in the hostfile
        # the available slots (GPUs) in each pod
        slots=1,
    ),
    requests=Resources(cpu="1", mem="3000Mi"),
    limits=Resources(cpu="2", mem="6000Mi"),
    retries=3,
    cache=True,
    cache_version="0.5",
)
def mpi_task(...):
    # do some work
    pass
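The relationship between `num_workers`, `slots`, and the hostfile can be sketched in a few lines of Python. This is an illustrative sketch, not the plugin's code: the helper names `build_hostfile` and `total_processes` and the worker host names are hypothetical, and the process count assumes that every slot on every worker is filled. The hostfile lines follow the Open MPI `<host> slots=<n>` convention.

```python
from typing import List


def build_hostfile(worker_hosts: List[str], slots: int) -> str:
    """Return Open MPI hostfile text with one `<host> slots=<n>` line per worker."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return "\n".join(f"{host} slots={slots}" for host in worker_hosts)


def total_processes(num_workers: int, slots: int) -> int:
    """Number of MPI processes when every slot on every worker is used."""
    return num_workers * slots


if __name__ == "__main__":
    # Hypothetical pod host names for a job with num_workers=2 and slots=1.
    hosts = ["mpi-task-worker-0", "mpi-task-worker-1"]
    print(build_hostfile(hosts, slots=1))
    print(total_processes(num_workers=2, slots=1))
```

With the values from the task above (two workers, one slot each), the sketch yields two hostfile lines and two MPI processes in total.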

Dockerfile for MPI on K8s

The Dockerfile has to have the installation commands for MPI and Horovod, amongst others.

FROM ubuntu:focal
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root
ENV DEBIAN_FRONTEND=noninteractive

# Install Python3 and other basics
RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:ubuntu-toolchain-r/test \
    && apt-get install -y \
    build-essential \
    cmake \
    g++-7 \
    curl \
    git \
    wget \
    python3.8 \
    python3.8-venv \
    python3.8-dev \
    make \
    libssl-dev \
    python3-pip \
    python3-wheel \
    libuv1

ENV VENV /opt/venv
# Virtual environment
RUN python3.8 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install AWS CLI to run on AWS (for GCS install GSutil). This will be removed
# in future versions to make it completely portable
RUN pip3 install awscli

# Install wheel after venv is activated
RUN pip3 install wheel

# MPI
# Install Open MPI
RUN mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
    tar zxf openmpi-4.0.0.tar.gz && \
    cd openmpi-4.0.0 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

# Install Python dependencies
COPY kfmpi/requirements.txt /root

RUN pip install -r /root/requirements.txt

# Enable GPU
# ENV HOROVOD_GPU_OPERATIONS NCCL
RUN HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod[tensorflow]==0.22.1

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
COPY kfmpi/sandbox.config /root

# Copy the actual code
COPY kfmpi/ /root/kfmpi/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
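
Once an image has been built from a Dockerfile like this one, a quick sanity check can confirm that the pieces it installs are visible from Python. The following is a minimal sketch that uses only the standard library; the function name `check_mpi_environment` is hypothetical, and the check only tests that `mpirun` and `ssh` are on the `PATH` and that a `horovod` package can be located, not that they work end to end.

```python
import importlib.util
import shutil
from typing import Dict


def check_mpi_environment() -> Dict[str, bool]:
    """Report whether the MPI-related tools are discoverable in this environment."""
    return {
        # Open MPI launcher, installed by `make install` in the image
        "mpirun": shutil.which("mpirun") is not None,
        # OpenSSH client, used for communication between containers
        "ssh": shutil.which("ssh") is not None,
        # Horovod Python package, installed with pip
        "horovod": importlib.util.find_spec("horovod") is not None,
    }


if __name__ == "__main__":
    for name, found in check_mpi_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Running the script inside the container prints one line per component, which makes a missing installation step easy to spot before a task is registered.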

Backend installation documentation coming soon!
