This workflow demonstrates how to train an XGBoost model, using the Pima Indian Diabetes dataset.
An example dataset is available here.
Why a Workflow?
A common question when reading through this example is whether it is really necessary to split the training of XGBoost into multiple steps. There is no single answer, but the pros and cons of doing so are worth weighing:
- Each task/step is standalone and can be reused in other pipelines.
- Each step can be unit tested.
- Data splitting, cleaning, and processing can be delegated to a more scalable system like Spark.
- State is always saved between steps, so recovering from failures is cheap, especially when a late step fails after expensive earlier steps have already completed.
- On the downside, performance for small datasets suffers: the intermediate data is durably stored and the state is recorded at every step, so each step is essentially a checkpoint.
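The checkpoint-per-step behaviour can be sketched in plain Python. The `step` helper and the `checkpoints/` directory below are illustrative stand-ins, not Flyte APIs; a real system would also key cached results on the step's inputs and code version, not just its name.

```python
from pathlib import Path
import json

CKPT_DIR = Path("checkpoints")  # hypothetical local stand-in for durable intermediate storage

def step(name, fn, *inputs):
    """Run fn only if no checkpoint for this step exists yet.

    Keyed by step name alone for simplicity; a production system would
    also key on the inputs and the task's version.
    """
    out = CKPT_DIR / f"{name}.json"
    if out.exists():
        # Recover the prior result instead of recomputing the step.
        return json.loads(out.read_text())
    result = fn(*inputs)
    CKPT_DIR.mkdir(exist_ok=True)
    out.write_text(json.dumps(result))
    return result
```

Rerunning a pipeline built from such steps after a failure only recomputes the steps whose checkpoints are missing, which is why per-step state makes recovery cheap.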
Steps of the Pipeline
1. Gather the data and split it into training and validation sets.
2. Fit the actual model.
3. Run predictions on the validation set. The function is written generically, so it can also be used to predict on any given set of observations (a dataset).
4. Calculate the accuracy score for the predictions.
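The four steps above can be sketched as plain functions wired together. This is a shape-of-the-pipeline sketch only: a toy threshold "model" stands in for XGBoost, and all names are illustrative, not the example's actual task signatures.

```python
import random

def split_data(rows, ratio=0.75, seed=7):
    """Step 1: shuffle and split (feature, label) pairs into train/validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

def fit(train):
    """Step 2: toy stand-in for XGBoost -- a threshold halfway between class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, observations):
    """Step 3: generic prediction over any set of observations."""
    return [1 if x >= threshold else 0 for x in observations]

def accuracy(preds, truth):
    """Step 4: fraction of correct predictions."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def pipeline(rows):
    """Wire the steps together, mirroring the workflow's data flow."""
    train, val = split_data(rows)
    model = fit(train)
    preds = predict(model, [x for x, _ in val])
    return accuracy(preds, [y for _, y in val])
```

In the real workflow each of these functions becomes a separate Flyte task, so each one can be tested, cached, and reused on its own.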
A note on the FlyteSchema type: a schema allows passing a type-safe vector from one task to another. The vector is loaded directly into a pandas dataframe. We could instead use an unstructured schema (by simply omitting the column types), which would allow any data to be accepted by the training algorithm.
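What a column-typed schema buys can be illustrated in plain Python, without flytekit. The column names and types below are hypothetical examples, not the dataset's full schema:

```python
import csv
import io

# Hypothetical typed columns, analogous to a column-typed FlyteSchema.
COLUMNS = {"#preg": int, "bmi": float, "class": int}

def load_typed(csv_text, columns):
    """Parse CSV text, coercing each declared column to its type.

    A value that cannot be coerced raises ValueError -- the type safety
    a typed schema provides. Omitting the types (an unstructured schema)
    would accept any columns with any values.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: typ(row[name]) for name, typ in columns.items()}
            for row in reader]

sample = "#preg,bmi,class\n6,33.6,1\n1,26.6,0\n"
rows = load_typed(sample, COLUMNS)
```

The trade-off is the usual one: typed columns catch malformed data at the task boundary, while an untyped schema is more permissive about what the training algorithm will accept.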
We pass the file (which is auto-loaded) as a CSV input.
Run the workflows in this directory with the custom-built base image:

```shell
pyflyte run --remote diabetes.py:diabetes_xgboost_model --image ghcr.io/flyteorg/flytecookbook:pima_diabetes-latest
```