Note
Click here to download the full example code
EDA and Feature Engineering in One Jupyter Notebook and Modeling in the Other#
In this example, we will implement a simple pipeline that takes hyperparameters, does EDA, feature engineering (step 1: EDA and feature engineering in notebook), and measures the Gradient Boosting model’s performace using mean absolute error (MAE) (step 2: Modeling in notebook).
First, let’s import the libraries we will use in this example.
We define a NotebookTask
to run the Jupyter notebook (EDA).
This notebook returns dummified_data
and dataset
as the outputs.
Note
dataset
is used in this example, and dummified_data
is used in the previous example.
dataset
lets us send the DataFrame as a JSON string to the subsequent notebook because DataFrame input cannot be sent
directly to the notebook as per Papermill.
nb_1 = NotebookTask(
name="eda-featureeng-nb",
notebook_path=os.path.join(
pathlib.Path(__file__).parent.absolute(), "supermarket_regression_1.ipynb"
),
outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
requests=Resources(mem="500Mi"),
)
We define a NotebookTask
to run the Jupyter notebook
(Modeling).
This notebook returns mae_score
as the output.
nb_2 = NotebookTask(
name="regression-nb",
notebook_path=os.path.join(
pathlib.Path(__file__).parent.absolute(),
"supermarket_regression_2.ipynb",
),
inputs=kwtypes(
dataset=str,
n_estimators=int,
max_depth=int,
max_features=str,
min_samples_split=int,
random_state=int,
),
outputs=kwtypes(mae_score=float),
requests=Resources(mem="500Mi"),
)
We define a Workflow
to run the notebook tasks.
@workflow
def notebook_wf(
n_estimators: int = 150,
max_depth: int = 3,
max_features: str = "sqrt",
min_samples_split: int = 4,
random_state: int = 2,
) -> float:
eda_output = nb_1()
regression_output = nb_2(
dataset=eda_output.dataset,
n_estimators=n_estimators,
max_depth=max_depth,
max_features=max_features,
min_samples_split=min_samples_split,
random_state=random_state,
)
return regression_output.mae_score
We can now run the two notebooks locally.
if __name__ == "__main__":
print(notebook_wf())
Total running time of the script: ( 0 minutes 0.000 seconds)