EDA, Feature Engineering, and Modeling With Papermill
Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
EDA cannot be implemented solely within Flyte because it requires visual analysis of the data. In such scenarios, we prefer a Jupyter notebook, since it helps us visualize the data and engineer features.
Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?
Papermill is a tool that lets you parameterize and execute Jupyter Notebooks.
We have a pre-packaged version of Papermill with Flyte that lets you leverage the power of Jupyter Notebook within Flyte pipelines.
To install the plugin, run the following command:
pip install flytekitplugins-papermill
There are three code examples that you can refer to in this tutorial:
Run the whole pipeline (EDA + Feature Engineering + Modeling) in one notebook
Run EDA and feature engineering in one notebook, fetch the result (the EDA’ed and feature-engineered dataset), and model the data as a Flyte task by sending the dataset as an argument
Run EDA and feature engineering in one notebook, fetch the result (the EDA’ed and feature-engineered dataset), and model the data in another notebook by sending the dataset as an argument
If you want to send inputs and receive outputs, your Jupyter notebook has to have parameters and outputs tags, respectively. To set up tags in a notebook, follow this guide.
The parameters cell must only have the input variables.
The outputs cell looks like the following:
from flytekitplugins.papermill import record_outputs
record_outputs(variable_name=variable_name)
Of course, you can have any number of variables!
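To make the tagging requirement more concrete, here is a minimal sketch of how the parameters and outputs tags appear inside a notebook file, and how the variable names in those cells could be listed. It uses only the standard library. The helper functions and the variable names (n_estimators, max_depth, mae_score) are hypothetical and are not part of Flyte or Papermill; the sketch only assumes the usual notebook JSON layout, where tags live under each cell's metadata.

```python
import ast


def tagged_cell_source(notebook: dict, tag: str) -> str:
    """Return the source of the first code cell carrying `tag`, or '' if none."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if tag in cell.get("metadata", {}).get("tags", []):
            source = cell.get("source", "")
            # Notebook files store source as one string or as a list of lines.
            return "".join(source) if isinstance(source, list) else source
    return ""


def assigned_names(source: str) -> list:
    """Names bound by simple top-level assignments, in order of appearance."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        names.extend(t.id for t in targets if isinstance(t, ast.Name))
    return names


def recorded_output_names(source: str) -> list:
    """Keyword names passed to any record_outputs(...) call in the source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and getattr(node.func, "id", None) == "record_outputs":
            names.extend(kw.arg for kw in node.keywords if kw.arg is not None)
    return names


# A minimal in-memory notebook; json.load on an .ipynb file yields the same shape.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            # Hypothetical input variables with default values.
            "source": ["n_estimators = 150\n", "max_depth = 3\n"],
        },
        {
            "cell_type": "code",
            "metadata": {"tags": ["outputs"]},
            "source": [
                "from flytekitplugins.papermill import record_outputs\n",
                "record_outputs(mae_score=mae_score)\n",
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

parameter_names = assigned_names(tagged_cell_source(notebook, "parameters"))
output_names = recorded_output_names(tagged_cell_source(notebook, "outputs"))
print(parameter_names)  # ['n_estimators', 'max_depth']
print(output_names)     # ['mae_score']
```

Listing the names this way is one possible sanity check before wiring a notebook into a pipeline; the cells are only parsed, never executed.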
The inputs and outputs variable names in the NotebookTask must match the variable names in the notebook.
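As a rough illustration of this naming rule, the sketch below declares a NotebookTask whose input and output names mirror the variables in the tagged cells. The task name, the variable names (n_estimators, max_depth, mae_score), and the wrapper function are hypothetical; the notebook path is left to the caller and should point at an existing notebook file.

```python
def make_eda_notebook_task(notebook_path: str):
    """Build a NotebookTask for a notebook with tagged parameters and outputs cells."""
    # Imported inside the function so this sketch can be loaded even where
    # flytekit and flytekitplugins-papermill are not installed.
    from flytekit import kwtypes
    from flytekitplugins.papermill import NotebookTask

    return NotebookTask(
        name="eda_feature_engineering_nb",  # hypothetical task name
        notebook_path=notebook_path,  # should point at an existing .ipynb file
        # Keys here must equal the variable names in the cell tagged "parameters".
        inputs=kwtypes(n_estimators=int, max_depth=int),
        # Keys here must equal the names passed to record_outputs
        # in the cell tagged "outputs".
        outputs=kwtypes(mae_score=float),
    )
```

If a name declared here has no counterpart in the notebook, the mismatch typically only surfaces when the notebook is executed, so it helps to keep both lists side by side while editing.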
You will see three outputs when you run the Python code files, although only a single output is returned. The two extra outputs are the executed notebook and the rendered HTML of that notebook.