Analytics#

Flyte is ideal for data cleaning, statistical summarization, and plotting because with flytekit you can leverage the rich Python ecosystem of data processing and visualization tools.

Cleaning Data#

In this example, we're going to analyze some COVID-19 vaccination data:

import pandas as pd
import plotly
import plotly.graph_objects as go
from flytekit import Deck, task, workflow, Resources


@task(requests=Resources(mem="1Gi"))
def clean_data() -> pd.DataFrame:
    """Clean the dataset."""
    df = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
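    # Sort by vaccination count (highest first), then take the first
    # non-null value of each column per location.
    filled_df = (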
        df.sort_values(["people_vaccinated"], ascending=False)
        .groupby("location")
        .first()
        .reset_index()
    )[["location", "people_vaccinated", "population", "date"]]
    return filled_df

As you can see, we're using pandas for data processing, and in the task below we use plotly to create a choropleth map of the percentage of each country's population that has received at least one COVID-19 vaccine dose.
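To make the selection step in clean_data easier to follow, here is a minimal plain-Python sketch of the same idea: for each location, keep the row with the highest people_vaccinated value. The sample rows and the helper name top_row_per_location are hypothetical and only for illustration. The sketch is a simplification: pandas' first() takes the first non-null value of each column separately, whereas this version keeps whole rows.

```python
from math import isnan

# Hypothetical sample rows; the numbers are made up for illustration.
rows = [
    {"location": "A", "date": "2023-01-01", "people_vaccinated": 10.0},
    {"location": "A", "date": "2023-02-01", "people_vaccinated": 25.0},
    {"location": "B", "date": "2023-01-15", "people_vaccinated": float("nan")},
    {"location": "B", "date": "2023-03-01", "people_vaccinated": 7.0},
]


def _score(row):
    """Rank missing counts below every real count."""
    value = row["people_vaccinated"]
    return float("-inf") if isnan(value) else value


def top_row_per_location(rows):
    """Keep, for each location, the row with the largest people_vaccinated."""
    best = {}
    for row in rows:
        current = best.get(row["location"])
        if current is None or _score(row) > _score(current):
            best[row["location"]] = row
    # Return the rows ordered by location name.
    return [best[name] for name in sorted(best)]


latest = top_row_per_location(rows)
for row in latest:
    print(row["location"], row["date"], row["people_vaccinated"])
```

A location whose counts are all missing is still kept, with a missing count, rather than being dropped.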

Rendering Plots#

We can use Flyte Decks to render a static HTML report of the map. In this case, we normalize people_vaccinated by each country's population count:

@task(enable_deck=True)
def plot(df: pd.DataFrame):
    """Render a Choropleth map."""
    df["text"] = df["location"] + "<br>" + "Last updated on: " + df["date"]
    fig = go.Figure(
        data=go.Choropleth(
            locations=df["location"],
            z=df["people_vaccinated"].astype(float) / df["population"].astype(float),
            text=df["text"],
            locationmode="country names",
            colorscale="Blues",
            autocolorscale=False,
            reversescale=False,
            marker_line_color="darkgray",
            marker_line_width=0.5,
            zmax=1,
            zmin=0,
        )
    )

    fig.update_layout(
        title_text="Percent population with at least one dose of COVID-19 vaccine",
        geo_scope="world",
        geo=dict(
            showframe=False, showcoastlines=False, projection_type="equirectangular"
        ),
    )
    Deck("Choropleth Map", plotly.io.to_html(fig))


@workflow
def analytics_workflow():
    """Prepare a data analytics workflow."""
    plot(df=clean_data())

Running this workflow, we get an interactive plot, courtesy of plotly:

analytics_workflow()
User Content

Name                               Wall Time(s)  Process Time(s)
Translate literal to python value  0.000015      0.000013
Execute user level code            3.759147      2.420174
Translate the output to literals   0.349885      0.125454
Translate literal to python value  0.076039      0.013439
Execute user level code            0.145021      0.054118
Translate the output to literals   0.000007      0.000007

Note:

  1. If the duration is too small (< 1 ms), it may be difficult to see on the timeline graph.
  2. For accurate execution time measurements, refer to wall time and process time.

     location        people_vaccinated  population    date        text
0    Afghanistan     1.889700e+07       4.112877e+07  2023-11-26  Afghanistan<br>Last updated on: 2023-11-26
1    Africa          5.549983e+08       1.426737e+09  2023-11-19  Africa<br>Last updated on: 2023-11-19
2    Albania         1.349255e+06       2.842318e+06  2023-09-10  Albania<br>Last updated on: 2023-09-10
3    Algeria         7.840131e+06       4.490323e+07  2022-04-24  Algeria<br>Last updated on: 2022-04-24
4    American Samoa  NaN                4.429500e+04  2020-01-05  American Samoa<br>Last updated on: 2020-01-05
...  ...             ...                ...           ...         ...
250  Western Sahara  NaN                5.760050e+05  2022-04-20  Western Sahara<br>Last updated on: 2022-04-20
251  World           5.630409e+09       7.975105e+09  2024-03-14  World<br>Last updated on: 2024-03-14
252  Yemen           1.050112e+06       3.369661e+07  2023-11-26  Yemen<br>Last updated on: 2023-11-26
253  Zambia          1.171156e+07       2.001767e+07  2023-06-25  Zambia<br>Last updated on: 2023-06-25
254  Zimbabwe        6.437808e+06       1.632054e+07  2022-10-09  Zimbabwe<br>Last updated on: 2022-10-09
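
The normalization behind the map's z values can be checked by hand on a couple of rows from the output above. The following is a small sketch in plain Python, not part of the workflow; the helper name vaccinated_fraction is hypothetical. It returns None where a value is missing, whereas the pandas expression in the plot task yields NaN for those rows.

```python
import math


def vaccinated_fraction(people_vaccinated, population):
    """Return people_vaccinated / population, or None if it cannot be computed."""
    values = (people_vaccinated, population)
    if any(v is None or math.isnan(v) for v in values) or population <= 0:
        return None
    return people_vaccinated / population


# Figures copied from the output shown above.
albania = vaccinated_fraction(1.349255e06, 2.842318e06)
american_samoa = vaccinated_fraction(float("nan"), 4.429500e04)

print(f"Albania: {albania:.1%}")
print(f"American Samoa: {american_samoa}")
```

Because the plot task fixes zmin=0 and zmax=1, these fractions map directly onto the color scale.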

Custom Flyte Deck Renderers#

You can also create your own custom Flyte Deck renderers to visualize data with any plotting/visualization library of your choice, as long as you can render HTML for the objects of interest.
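
As a rough illustration of what "render HTML for the objects of interest" can look like, here is a minimal sketch of a renderer-style class. It assumes the convention that a renderer exposes a to_html method returning an HTML string; the class name TableRenderer and the sample row are hypothetical, and the sketch uses only the standard library, so check the flytekit documentation for the exact renderer interface.

```python
from html import escape


class TableRenderer:
    """Hypothetical renderer: turns a list of dicts into an HTML table."""

    def to_html(self, rows):
        if not rows:
            return "<p>No data</p>"
        # Use the keys of the first row as the column order.
        columns = list(rows[0])
        head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>"
            + "".join(f"<td>{escape(str(row.get(c, '')))}</td>" for c in columns)
            + "</tr>"
            for row in rows
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


html_report = TableRenderer().to_html([{"location": "Albania", "share": "47.5%"}])
print(html_report)
```

Inside a task with decks enabled, the resulting string could be passed to Deck in the same way the plot task passes plotly.io.to_html(fig), for example Deck("Summary", html_report).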

Important

Prefer other data processing frameworks? Flyte ships with Polars, Dask, Modin, Spark, Vaex, and DBT integrations.

If you need to connect to a database, Flyte provides first-party support for AWS Athena, Google BigQuery, Snowflake, SQLAlchemy, and SQLite3.