GCP (GKE) Setup

Flyte Deployment - Manual GCE/GKE Deployment

This guide helps you set up Flyte from scratch on GCP (GKE), without using an automated approach. It provides step-by-step instructions to go from a bare GCP account to a fully functioning Flyte deployment that members of your company can use.

Prerequisites

  • Access to the Google Cloud console

  • A domain name for the Flyte installation like flyte.example.org that allows you to set a DNS A record.

Before you begin, please ensure that you have the following tools installed (all of them are used later in this guide):

  • gcloud CLI (including gsutil)

  • kubectl

  • helm

  • flytectl

  • docker

Initialize Gcloud

Authorize the gcloud SDK to access GCP using your credentials, set up the config for an existing project, and optionally set the default compute zone:

gcloud init

Create an Organization (Optional)

This step is optional if you already have an organization linked to a billing account. Use the following docs to understand the organization creation process in Google Cloud: Organization Management.

Get the organization ID to be used for creating the project. Billing should be linked with the organization so that all projects under the org use the same billing account.

gcloud organizations list

Sample output:

DISPLAY_NAME    ID              DIRECTORY_CUSTOMER_ID
example-org     123456789999    C02ewszsz

Set <id> to the ID column value for your organization:

export ORG_ID=<id>

Create a GCP Project

export PROJECT_ID=<my-project>
gcloud projects create $PROJECT_ID --organization $ORG_ID

You can also use an existing project if your account has the appropriate permissions to create the required resources.

Set project <my-project> as the default in gcloud or use gcloud init to set this default:

gcloud config set project ${PROJECT_ID}

We assume that <my-project> has been set as the default for all gcloud commands below.

Permissions

Configure Workload Identity for the Flyte namespace service accounts. This creates the GSAs (Google service accounts), which are mapped to KSAs (Kubernetes service accounts) through annotations and used to authorize pod access to Google Cloud services.

  • Create a GSA for flyteadmin

gcloud iam service-accounts create gsa-flyteadmin
  • Create a GSA for flytescheduler

gcloud iam service-accounts create gsa-flytescheduler
  • Create a GSA for datacatalog

gcloud iam service-accounts create gsa-datacatalog
  • Create a GSA for flytepropeller

gcloud iam service-accounts create gsa-flytepropeller
  • Create a GSA for cluster resource manager

Production

gcloud iam service-accounts create gsa-production

Staging

gcloud iam service-accounts create gsa-staging

Development

gcloud iam service-accounts create gsa-development
  • Create a new role DataCatalogRole with the following permissions
    • storage.buckets.get

    • storage.objects.create

    • storage.objects.delete

    • storage.objects.update

    • storage.objects.get

  • Create a new role FlyteAdminRole with the following permissions
    • storage.buckets.get

    • storage.objects.create

    • storage.objects.delete

    • storage.objects.get

    • storage.objects.getIamPolicy

    • storage.objects.update

  • Create a new role FlyteSchedulerRole with the following permissions
    • storage.buckets.get

    • storage.objects.create

    • storage.objects.delete

    • storage.objects.get

    • storage.objects.getIamPolicy

    • storage.objects.update

  • Create a new role FlytePropellerRole with the following permissions
    • storage.buckets.get

    • storage.objects.create

    • storage.objects.delete

    • storage.objects.get

    • storage.objects.getIamPolicy

    • storage.objects.update

  • Create a new role FlyteWorkflowRole with the following permissions
    • storage.buckets.get

    • storage.objects.create

    • storage.objects.delete

    • storage.objects.get

    • storage.objects.list

    • storage.objects.update

Refer to the roles page for more details. An example of creating these custom roles with gcloud is shown below.
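
As a sketch (adjust titles and stages as needed), FlyteAdminRole could be created like this; repeat with the corresponding permission lists for the other roles:

gcloud iam roles create FlyteAdminRole --project=${PROJECT_ID} --title="FlyteAdminRole" \
  --permissions=storage.buckets.get,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.getIamPolicy,storage.objects.update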

  • Add IAM policy binding for flyteadmin GSA using FlyteAdminRole.

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-flyteadmin@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlyteAdminRole"
  • Add IAM policy binding for flytescheduler GSA using FlyteSchedulerRole.

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-flytescheduler@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlyteSchedulerRole"
  • Add IAM policy binding for datacatalog GSA using DataCatalogRole.

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-datacatalog@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/DataCatalogRole"
  • Add IAM policy binding for flytepropeller GSA using FlytePropellerRole.

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-flytepropeller@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlytePropellerRole"
  • Add IAM policy binding for cluster resource manager GSA using FlyteWorkflowRole.

Production

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-production@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlyteWorkflowRole"

Staging

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-staging@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlyteWorkflowRole"

Development

gcloud projects add-iam-policy-binding ${PROJECT_ID}  --member "serviceAccount:gsa-development@${PROJECT_ID}.iam.gserviceaccount.com"    --role "projects/${PROJECT_ID}/roles/FlyteWorkflowRole"
  • Allow the Kubernetes service account (KSA) to impersonate the Google service account (GSA) by creating an IAM policy binding between the two. This binding allows the KSA to act as the GSA; the corresponding KSA annotation is illustrated at the end of this section.

flyteadmin

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[flyte/flyteadmin]" gsa-flyteadmin@${PROJECT_ID}.iam.gserviceaccount.com

flytepropeller

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[flyte/flytepropeller]" gsa-flytepropeller@${PROJECT_ID}.iam.gserviceaccount.com

datacatalog

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[flyte/datacatalog]" gsa-datacatalog@${PROJECT_ID}.iam.gserviceaccount.com

Cluster Resource Manager

We create bindings for the production, staging, and development domains for Flyte workflows to use.

Production

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[production/default]" gsa-production@${PROJECT_ID}.iam.gserviceaccount.com

Staging

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[staging/default]" gsa-staging@${PROJECT_ID}.iam.gserviceaccount.com

Development

gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[development/default]" gsa-development@${PROJECT_ID}.iam.gserviceaccount.com
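
The Flyte Helm chart typically applies the matching Workload Identity annotation to each KSA through its values; purely as an illustration, annotating a KSA by hand (namespace and names as used above) would look like:

kubectl annotate serviceaccount flyteadmin -n flyte \
  iam.gke.io/gcp-service-account=gsa-flyteadmin@${PROJECT_ID}.iam.gserviceaccount.com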

Create a GKE Cluster

Create a GKE cluster with VPC-native networking and Workload Identity enabled. Navigate to the Kubernetes Engine tab of the Google Cloud console to start creating the Kubernetes cluster.

Ensure that VPC-native traffic routing is enabled. Under Security, enable Workload Identity and use the project default pool, which is ${PROJECT_ID}.svc.id.goog.

Creating the cluster from the console is recommended, since it ensures that VPC-native networking and Workload Identity are configured correctly; achieving the same from the command line takes several commands. The basic gcloud equivalent is:

gcloud container clusters create <my-flyte-cluster> \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --region us-west1 \
  --num-nodes 6
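
To double-check that Workload Identity is enabled on the cluster, you can inspect its workload pool (adjust the cluster name and region):

gcloud container clusters describe <my-flyte-cluster> --region us-west1 \
  --format="value(workloadIdentityConfig.workloadPool)"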

Create the GKE Context

Initialize your kubecontext to point to the GKE cluster:

gcloud container clusters get-credentials <my-flyte-cluster> --region us-west1

Verify by creating a test namespace:

kubectl create ns test

Create a Cloud SQL Database

Next, create a relational Cloud SQL for PostgreSQL database. This database will be used by both the primary control plane service (Flyte Admin) and the Flyte memoization service (Data Catalog). Create the instance from the Cloud SQL console:

  • Select PostgreSQL

  • Provide an Instance ID

  • Provide a password for the instance <DB_INSTANCE_PASSWD>

  • Use PostgreSQL 13 or higher

  • Select the Zone based on your availability requirements.

  • Select "Customize your instance" and enable Private IP in the Connections tab. This is required for private communication between the GKE apps and the Cloud SQL instance. Follow the steps to create the private connection (default).

  • Create the SQL instance.

  • After the instance is created, note the private IP of the database, <CLOUD-SQL-IP>.

  • Create a flyteadmin database and a flyteadmin user account on that instance with <DBPASSWORD> (example gcloud commands are shown at the end of this section).

  • Verify connectivity to the DB from the GKE cluster.
    • Create a testdb namespace:

    kubectl create ns testdb
    
    • Verify the connectivity using a postgres client:

    kubectl run pgsql-postgresql-client --rm --tty -i --restart='Never' --namespace testdb --image docker.io/bitnami/postgresql:11.7.0-debian-10-r9 --env="PGPASSWORD=<DBPASSWORD>" --command -- psql --host <CLOUD-SQL-IP> -U flyteadmin -d flyteadmin -p 5432
    

Creating the instance from the console is recommended, since it ensures that private IP connectivity with the Cloud SQL instance is configured correctly; achieving the same from the command line takes several commands. The basic gcloud equivalent is:

gcloud sql instances create <my-flyte-db> \
  --database-version=POSTGRES_13 \
  --cpu=1 \
  --memory=3840MB \
  --region=us-west1
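
As a convenience sketch (assuming the instance name <my-flyte-db> from above), the flyteadmin database and user can also be created with gcloud:

gcloud sql databases create flyteadmin --instance=<my-flyte-db>
gcloud sql users create flyteadmin --instance=<my-flyte-db> --password=<DBPASSWORD>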

SSL Certificate

To use SSL (which is required for gRPC clients), we need to create an SSL certificate. We use Google-managed SSL certificates.

Save the following certificate resource definition as flyte-certificate.yaml:

apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: flyte-certificate
spec:
  domains:
    - flyte.example.org

Then apply it to your cluster:

kubectl apply -f flyte-certificate.yaml
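
Provisioning a Google-managed certificate can take a while; you can check its status with kubectl using the resource name above:

kubectl describe managedcertificate flyte-certificate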

An alternative is to use cert-manager:

  • Install the cert manager

helm install cert-manager --namespace flyte --version v0.12.0 jetstack/cert-manager
  • Create cert issuer

apiVersion: cert-manager.io/v1alpha2
kind: Issuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: issue-email-id
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
    - selector: {}
      http01:
        ingress:
          class: nginx
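
If the jetstack chart repository has not been added yet, and to apply the issuer above, the steps would look roughly like this (the manifest filename flyte-issuer.yaml is just an example):

helm repo add jetstack https://charts.jetstack.io
helm repo update
kubectl apply -n flyte -f flyte-issuer.yaml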

Ingress

  • Add the ingress repo

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
  • Install the nginx-ingress

helm install nginx-ingress ingress-nginx/ingress-nginx
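
To confirm the controller is up and has an external IP assigned, you can list its pods and services by the chart's standard label (a quick check, not a required step):

kubectl get pods,svc -l app.kubernetes.io/name=ingress-nginx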

Create the GCS Bucket

Create <BUCKETNAME> with uniform access:

gsutil mb -b on -l us-west1 gs://<BUCKETNAME>/

Add access permissions on the bucket for the service accounts that need it, i.e. the GSAs created earlier (gsa-flyteadmin, gsa-flytescheduler, gsa-datacatalog, gsa-flytepropeller, and the per-domain gsa-production, gsa-staging, and gsa-development accounts).
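
As an illustrative sketch (the role shown, roles/storage.objectAdmin, is only an example; pick one appropriate for your setup), bucket-level access can be granted with gsutil:

gsutil iam ch serviceAccount:gsa-flyteadmin@${PROJECT_ID}.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<BUCKETNAME>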

Time for Helm

Installing Flyte

  1. Add the Flyte helm repo

helm repo add flyteorg https://flyteorg.github.io/flyte
  2. Download the GCP values file

curl https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-gcp.yaml >values-gcp.yaml
  3. Update the following values in values-gcp.yaml:

<RELEASE-NAME> to be used as prefix for ssl certificate secretName
<PROJECT_ID> of your GCP project
<CLOUD-SQL-IP> private IP of cloud sql instance
<DBPASSWORD> of the flyteadmin user created for the cloud sql instance
<BUCKETNAME> of the GCS bucket created
<HOSTNAME> to the flyte FQDN (e.g. flyte.example.org)
  4. (Optional) Configure Flyte projects and domains

To change which projects are created, update the Helm values. By default, Flyte creates three projects: flytesnacks, flytetester, and flyteexamples.

# You can define as many projects as you need
flyteadmin:
  initialProjects:
    - flytesnacks
    - flytetester
    - flyteexamples

To change the domains, update the Helm values again. By default, Flyte creates three domains per project: development, staging, and production.

# -- Domain configuration for Flyte projects. The domains listed here apply to all projects in Flyte.
configmap:
  domain:
    domains:
      - id: development
        name: development
      - id: staging
        name: staging
      - id: production
        name: production

# Update the cluster resource manager only if you are using the Flyte resource manager. It creates the required resources in each project-domain namespace.
cluster_resource_manager:
  enabled: true
  config:
    cluster_resources:
      customData:
        - development:
            - projectQuotaCpu:
                value: "5"
            - projectQuotaMemory:
                value: "4000Mi"
            - defaultIamRole:
                value: "gsa-development@{{ .Values.userSettings.googleProjectId }}.iam.gserviceaccount.com"
        - staging:
            - projectQuotaCpu:
                value: "2"
            - projectQuotaMemory:
                value: "3000Mi"
            - defaultIamRole:
                value: "gsa-staging@{{ .Values.userSettings.googleProjectId }}.iam.gserviceaccount.com"
        - production:
            - projectQuotaCpu:
                value: "2"
            - projectQuotaMemory:
                value: "3000Mi"
            - defaultIamRole:
                value: "gsa-production@{{ .Values.userSettings.googleProjectId }}.iam.gserviceaccount.com"
  5. Update helm dependencies

helm dep update
  6. Install Flyte

helm install -n flyte -f values-gcp.yaml --create-namespace flyte flyteorg/flyte-core
  7. Verify that all pods have come up correctly

kubectl get pods -n flyte
  8. Get the ingress IP to update your DNS zone and fetch the name server records (an example Cloud DNS command is shown after the output below)

kubectl get ingress -n flyte
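
As a sketch, assuming the domain is managed in a Cloud DNS managed zone named <my-zone>, an A record for the Flyte hostname can be created as follows (<INGRESS-IP> is the address shown by the ingress):

gcloud dns record-sets create <HOSTNAME>. --zone=<my-zone> --type=A --ttl=300 --rrdatas=<INGRESS-IP>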

Uninstalling Flyte

helm uninstall -n flyte flyte

Upgrading Flyte

helm upgrade -n flyte -f values-gcp.yaml --create-namespace flyte flyteorg/flyte-core

Connecting to Flyte

Flyte can be accessed using the UI console or your terminal.

  • First, find the Flyte endpoint created by the GKE ingress controller.

$ kubectl -n flyte get ingress

Sample output:

NAME         CLASS    HOSTS              ADDRESS         PORTS     AGE
flyte        <none>   <FLYTE-ENDPOINT>   34.136.165.92   80, 443   18m
flyte-grpc   <none>   <FLYTE-ENDPOINT>   34.136.165.92   80, 443   18m
  • Connecting to flytectl CLI

Add <FLYTE-ENDPOINT> to ~/.flyte/config.yaml, e.g.:

admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///<FLYTE-ENDPOINT>
  insecure: false
logger:
  show-source: true
  level: 0
storage:
  type: stow
  stow:
    kind: google
    config:
      json: ""
      project_id: myproject # GCP Project ID
      scopes: https://www.googleapis.com/auth/devstorage.read_write
  container: mybucket # GCS Bucket Flyte is configured to use
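
To verify that flytectl can reach the new deployment, run any simple read-only command, for example:

flytectl get projects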

Accessing Flyte Console (web UI)

  • Open https://<FLYTE-ENDPOINT>/console to access the Flyte Console UI

  • Ignore the certificate error if using a self-signed cert

Running Workflows

  • Dockerfile changes

Make sure the Dockerfile contains the gcloud SDK installation steps, which Flyte needs in order to upload results.

# Install gcloud for GCP
RUN apt-get update && apt-get install --assume-yes curl

RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin
  • Serializing workflows

For running the flytecookbook examples on GCP, make sure you use the right registry during serialization. The following example assumes Google Artifact Registry in the us-central1 region, with project name flyte-gcp and repo name flyterepo:

REGISTRY=us-central1-docker.pkg.dev/flyte-gcp/flyterepo make serialize
  • Uploading the image to the registry

The following example uploads the cookbook core image to Artifact Registry. This step must be performed before registering the workflows in Flyte:

docker push us-central1-docker.pkg.dev/flyte-gcp/flyterepo/flytecookbook:core-2bd81805629e41faeaa25039a6e6abe847446356
  • Registering workflows

Register the workflows through flytectl by pointing to the serialization output folder and providing the version to use:

flytectl register files /Users/<user-name>/flytesnacks/cookbook/core/_pb_output/* -d development -p flytesnacks --version v1
  • Generating exec spec file for workflow

The following example generates the exec spec file for the latest version of the core.flyte_basics.lp.go_greet workflow, which is part of the flytecookbook examples:

flytectl  get launchplan -p flytesnacks -d development core.flyte_basics.lp.go_greet --latest --execFile lp.yaml
  • Modifying the exec spec file to set workflow inputs

Edit the exec spec file lp.yaml and set the inputs for the workflow:

iamRoleARN: ""
inputs:
    am: true
    day_of_week: "Sunday"
    number: 5
kubeServiceAcct: ""
targetDomain: ""
targetProject: ""
version: v1
workflow: core.flyte_basics.lp.go_greet
  • Creating execution using the exec spec file

flytectl create execution -p flytesnacks -d development --execFile lp.yaml

Sample output:

execution identifier project:"flytesnacks" domain:"development" name:"f12c787de18304f4cbe7"
  • Getting the execution details

flytectl get executions  -p flytesnacks -d development f12c787de18304f4cbe7

Troubleshooting

  • If any pod is not coming up, describe the pod and check which container or init container had an error.

kubectl describe pod/<pod-instance> -n flyte

Then check the logs of the container that failed. For example, to check the logs of the <init-container> init container:

kubectl logs -f <pod-instance> <init-container> -n flyte
  • Increasing log level for flytectl

Change your logger config to this:

logger:
  show-source: true
  level: 6
  • If you have a new ingress IP for your Flyte deployment, you may need to flush your local DNS cache.

  • If you need access logs for your buckets, follow the GCP documentation on bucket usage and access logs.

  • If you get the following error:

ERROR: Policy modification failed. For a binding with condition, run "gcloud alpha iam policies lint-condition" to identify issues in condition.
ERROR: (gcloud.iam.service-accounts.add-iam-policy-binding) INVALID_ARGUMENT: Identity Pool does not exist

this means that you haven't enabled Workload Identity on the cluster. Refer to the GKE Workload Identity docs; a sketch for enabling it on an existing cluster is shown below.
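
As a sketch, assuming the cluster name and region used earlier, Workload Identity can be enabled on an existing cluster with:

gcloud container clusters update <my-flyte-cluster> --region us-west1 \
  --workload-pool=${PROJECT_ID}.svc.id.goog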