AWS SageMaker PyTorch

Tags: Integration, MachineLearning, AWS, Advanced

This example shows how to use the Flytekit AWS SageMaker plugin to run SageMaker custom training with PyTorch distributed training.

Installation

To use the Flytekit AWS SageMaker plugin, run the following:

pip install flytekitplugins-awssagemaker
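
After installing, it can be useful to confirm that the distribution is visible to the interpreter that will run your tasks. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of the plugin.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name matches the pip command above.
    version = installed_version("flytekitplugins-awssagemaker")
    print(version or "flytekitplugins-awssagemaker is not installed")
```

Run this inside the same virtual environment (or container) that you installed the plugin into, since each environment has its own set of installed distributions.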

Creating a Dockerfile for SageMaker Custom Training [Required]

The Dockerfile for SageMaker custom training is like any regular Dockerfile, except that it uses an NVIDIA CUDA base image so that training can use GPUs.

Note

If you train on CPU, this special Dockerfile is NOT REQUIRED. If GPUs or TPUs are required, the Dockerfile differs only in the driver setup. The Dockerfile below is enabled for GPU-accelerated training using CUDA. The checked-in version of the Dockerfile uses python:3.8-slim-buster for faster CI, but you can use the Dockerfile pasted below, which uses the CUDA base image. Additionally, requirements.in uses the CPU version of PyTorch. Remove the +cpu suffix from torch and torchvision in requirements.in, then regenerate the requirements with the following command:

make -C integrations/aws/sagemaker_pytorch requirements
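
The `+cpu` edit described in the note can also be scripted. The following is a minimal sketch, assuming the suffix appears as a local version label on the `torch` and `torchvision` lines of requirements.in; the function name `strip_cpu_suffix` and the version numbers in the sample are hypothetical and only illustrate the shape of the file.

```python
import re

# Only lines that start with one of these package names are modified.
_TORCH_LINE = re.compile(r"(torch|torchvision)\b")
# The "+cpu" local version label, e.g. "torch==1.7.0+cpu".
_CPU_SUFFIX = re.compile(r"\+cpu\b")


def strip_cpu_suffix(requirements_text: str) -> str:
    """Remove the "+cpu" label from torch and torchvision requirement lines."""
    cleaned = []
    for line in requirements_text.splitlines():
        if _TORCH_LINE.match(line.strip()):
            line = _CPU_SUFFIX.sub("", line)
        cleaned.append(line)
    result = "\n".join(cleaned)
    # Preserve a trailing newline if the input had one.
    if requirements_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical file contents; the real pins may differ.
    sample = "flytekit\ntorch==1.7.0+cpu\ntorchvision==0.8.1+cpu\n"
    print(strip_cpu_suffix(sample), end="")
```

To apply it, read requirements.in, pass its contents through the function, write the result back, and then run the make target above to regenerate requirements.txt.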

FROM pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONPATH=/root

# Install the AWS cli separately to prevent issues with boto being written over
RUN pip install awscli

ENV VENV=/opt/venv
# Virtual environment
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install Python dependencies
COPY sagemaker_pytorch/requirements.txt /root/.
RUN pip install -r /root/requirements.txt

# Setup Sagemaker entrypoints
ENV SAGEMAKER_PROGRAM=/opt/venv/bin/flytekit_sagemaker_runner.py

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
COPY sagemaker_pytorch/sandbox.config /root

# Copy the actual code
COPY sagemaker_pytorch/ /root/sagemaker_pytorch

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE=$tag
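
The Dockerfile above expects a `tag` build argument, which ends up in `FLYTE_INTERNAL_IMAGE`. The following is a minimal sketch of how a build script might assemble the corresponding `docker build` invocation. It assumes the value passed as `tag` is the full image reference; the helper name, image name, and file paths are hypothetical, and the actual build script in the repository may differ.

```python
import shlex
from typing import List


def docker_build_command(image: str, dockerfile: str, context: str = ".") -> List[str]:
    """Assemble a docker build command that forwards the image name as the `tag` build arg."""
    return [
        "docker", "build",
        "--file", dockerfile,
        # Consumed by `ARG tag` in the Dockerfile and stored in FLYTE_INTERNAL_IMAGE.
        "--build-arg", f"tag={image}",
        "--tag", image,
        context,
    ]


if __name__ == "__main__":
    # Hypothetical image name and paths, for illustration only.
    cmd = docker_build_command(
        image="example/sagemaker-pytorch:v1",
        dockerfile="sagemaker_pytorch/Dockerfile",
    )
    print(shlex.join(cmd))
```

Because the `COPY` instructions reference `sagemaker_pytorch/...` and `in_container.mk`, the build context needs to be the directory that contains both.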

Distributed PyTorch on SageMaker