AWS (EKS) Manual Setup#
Flyte Deployment - Manual AWS/EKS Deployment#
This guide helps you set up Flyte from scratch on AWS without using an automated approach. It details, step by step, how to go from a bare AWS account to a fully functioning Flyte deployment that members of your company can use.
Prerequisites#
Before you begin, please ensure that you have the following tools installed.
AWS CLI
eksctl
Access to AWS console
Helm
kubectl
OpenSSL
AWS Permissioning#
Create a series of roles. These roles control Flyte’s access to the AWS account. Since this is a setup guide, you can use the default policies that AWS IAM comes with. If this is too broad, you can consult with your infrastructure team.
EKS Cluster Role#
Create a role for the EKS cluster. This is the role that the Kubernetes platform will use to monitor and scale the cluster, create ASGs, and run the etcd store, the K8s API server, and so on.
Navigate to your AWS console and choose the IAM Roles page.
Under the EKS service, select the EKS - Cluster use case.
Ensure that the AmazonEKSClusterPolicy is selected.
Create this role without any permission boundary. Advanced users can restrict the permissions for their use cases.
Choose any tags that help you track this role based on your DevOps rules.
Choose a name for your cluster role that is easy to search, for example: <ClusterName-EKS-Cluster-Role>.
Refer to the AWS docs for details.
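If you prefer the CLI over the console, the following is a minimal sketch of the same role creation (the local file name is illustrative):
# Trust policy that lets the EKS control plane assume the role (saved as eks-cluster-trust.json)
cat > eks-cluster-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "eks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
# Create the role and attach the managed AmazonEKSClusterPolicy
aws iam create-role \
    --role-name <ClusterName-EKS-Cluster-Role> \
    --assume-role-policy-document file://eks-cluster-trust.json
aws iam attach-role-policy \
    --role-name <ClusterName-EKS-Cluster-Role> \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy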
EKS Node IAM Role#
Create a role that your compute nodes use. This is the role given to the nodes that actually run the user pods (including Flyte pods).
Navigate to your AWS console and choose IAM role service.
Choose EC2 as service while choosing the use case.
Choose the following policies as mentioned in this AWS doc:
AmazonEKSWorkerNodePolicy, which allows EKS nodes to connect to EKS clusters.
AmazonEC2ContainerRegistryReadOnly, if you are using Amazon ECR as your container registry.
AmazonEKS_CNI_Policy, which allows pod and node networking within your VPC. This is required even though it is marked as optional in the AWS guide.
Create this role without any permission boundary. Advanced users can try to restrict the permissions for their use cases.
Choose any tags that would help you in tracking this role based on your DevOps rules.
Choose a name for your node role that is easy to search, for example: <ClusterName-EKS-Node-Role>.
Flyte System Role#
Create a role for the Flyte platform. When pods run, they shouldn’t run with the node role created above; they should assume a separate role with permissions suitable for that pod’s containers. This role will be used for Flyte’s own API servers and associated agents.
Create a role named iam-role-flyte from the IAM console. Select “AWS service” for the type and EC2 for the use case.
Add the AmazonS3FullAccess policy. S3 access can be tweaked later to narrow down the scope.
Flyte User Role#
Lastly, create a role for Flyte users. This is the role that user pods will assume when Flyte kicks them off.
Create a role named flyte-user-role from the IAM console. Select “AWS service” for the type and EC2 for the use case.
Add the AmazonS3FullAccess policy.
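If you prefer the CLI, a minimal sketch that covers both Flyte roles (the file name and loop are illustrative; the trust policy mirrors the EC2 use case selected in the console):
# EC2 trust policy for roles created with the EC2 use case (saved as ec2-trust.json)
cat > ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
# Create both roles and attach AmazonS3FullAccess (narrow the S3 access later as needed)
for role in iam-role-flyte flyte-user-role; do
  aws iam create-role --role-name "$role" --assume-role-policy-document file://ec2-trust.json
  aws iam attach-role-policy --role-name "$role" --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
done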
Create an EKS Cluster#
Create an EKS cluster from the AWS console:
Pick a name for your cluster, for example: <Name-EKS-Cluster>
Pick Kubernetes version >= 1.19
Choose the EKS cluster role <ClusterName-EKS-Cluster-Role>, created in previous steps.
Keep secrets encryption off.
Use the same VPC in which you intend to deploy your RDS instance. Keep the default VPC if none was created, and have RDS use the default as well.
Use the subnets for all supported AZs in that VPC.
Choose the security group to be used for this cluster and the RDS instance (use the default if you use the default VPC).
Provide public access to your cluster, or restrict access based on your DevOps settings.
Choose the default version of the network add-ons.
You can choose to enable the control plane logging to CloudWatch.
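If you prefer the CLI, the following is a hedged sketch of the same cluster creation (the subnet and security group IDs are placeholders; any Kubernetes version >= 1.19 works):
aws eks create-cluster \
    --name <Name-EKS-Cluster> \
    --region <region> \
    --kubernetes-version <k8s-version> \
    --role-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/<ClusterName-EKS-Cluster-Role> \
    --resources-vpc-config subnetIds=<subnet-id-1>,<subnet-id-2>,<subnet-id-3>,securityGroupIds=<security-group-id>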
Connect to an EKS Cluster#
Export your AWS account access keys so the AWS CLI can authenticate:
export AWS_ACCESS_KEY_ID=<YOUR-AWS-ACCOUNT-ACCESS-KEY-ID>
export AWS_SECRET_ACCESS_KEY=<YOUR-AWS-SECRET-ACCESS-KEY>
export AWS_SESSION_TOKEN=<YOUR-AWS-SESSION-TOKEN>
Switch to the EKS cluster context <Name-EKS-Cluster>:
aws eks update-kubeconfig --name <Name-EKS-Cluster> --region <region>
Verify the context is switched:
$ kubectl config current-context
arn:aws:eks:<region>:<AWS_ACCOUNT_ID>:cluster/<Name-EKS-Cluster>
Test it with kubectl. It should tell you there aren’t any resources:
$ kubectl get pods
No resources found in default namespace.
OIDC Provider for the EKS Cluster#
Create the OIDC provider for the EKS cluster and associate it with the cluster; you will then add a trust relationship between the provider and the two Flyte roles (iam-role-flyte and flyte-user-role).
The EKS cluster is created with an OIDC issuer URL, so the following command returns the provider:
aws eks describe-cluster --region <region> --name <Name-EKS-Cluster> --query "cluster.identity.oidc.issuer" --output text
Example output:
https://oidc.eks.<REGION>.amazonaws.com/id/<UUID-OIDC>
The following command creates the OIDC provider using the address provided by the cluster:
eksctl utils associate-iam-oidc-provider --cluster <Name-EKS-Cluster> --approve
Refer to this AWS documentation for details.
Verify that the OIDC provider is created by navigating to https://console.aws.amazon.com/iamv2/home?#/identity_providers and confirming that a new provider entry has been created with the same <UUID-OIDC> issuer as the cluster’s.
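You can also verify from the CLI; the returned provider ARN should end with the same <UUID-OIDC>:
aws iam list-open-id-connect-providers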
Next, add a trust relationship between this OIDC provider and the two Flyte roles:
Navigate to the newly created OIDC provider with <UUID-OIDC> and copy its ARN.
Navigate to IAM Roles and select the iam-role-flyte role.
Under the Trust relationships tab, hit the Edit button.
Replace the Principal:Federated value in the policy JSON below with the copied ARN.
Replace the <UUID-OIDC> placeholder in the Condition:StringEquals with the last part of the copied ARN. It will look something like 8DCF90D22E386AA3975FC4DCD2ECD23BC and should match the tail end of the issuer ID from the first step. Ensure you don’t accidentally remove the :aud suffix; you need that.
Repeat these steps for the flyte-user-role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "eks.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::<AWS_ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<UUID-OIDC>"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.<REGION>.amazonaws.com/id/<UUID-OIDC>:aud": "sts.amazonaws.com"
}
}
}
]
}
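If you prefer the CLI to the console editor, a minimal sketch of applying this trust policy (assuming the JSON above is saved locally as trust-policy.json, an illustrative file name):
for role in iam-role-flyte flyte-user-role; do
  aws iam update-assume-role-policy --role-name "$role" --policy-document file://trust-policy.json
done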
Create an EKS Node Group#
The initial EKS cluster will not have any instances configured to operate the cluster. Create a node group which provides resources for the Kubernetes cluster (a CLI equivalent is sketched after this list):
Navigate to your EKS cluster’s Configuration -> Compute tab.
Provide a suitable name, for example <Name-EKS-Node-Group>.
Use the EKS node IAM role <ClusterName-EKS-Node-Role> created in the steps above.
Use it without any launch template, Kubernetes labels, taints, or tags.
Choose the default Amazon EC2 AMI (AL2_x86_64).
Capacity type (on-demand), instance type, and size can be chosen based on your DevOps requirements. Keep the defaults if in doubt.
Create the node group with a minimum of 5, maximum of 10, and desired count of 5 instances.
Use the default subnets, which are selected based on the subnets accessible to your EKS cluster.
Disallow remote access to the nodes (if needed, provide the SSH key pair to use from your account).
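A hedged AWS CLI equivalent of the node group above (the instance type and subnet IDs are assumptions to adjust):
aws eks create-nodegroup \
    --cluster-name <Name-EKS-Cluster> \
    --nodegroup-name <Name-EKS-Node-Group> \
    --node-role arn:aws:iam::<AWS_ACCOUNT_ID>:role/<ClusterName-EKS-Node-Role> \
    --subnets <subnet-id-1> <subnet-id-2> <subnet-id-3> \
    --scaling-config minSize=5,maxSize=10,desiredSize=5 \
    --instance-types m5.xlarge \
    --capacity-type ON_DEMAND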
Create an RDS Database#
Next, create a relational database. This database will be used by both the primary control plane service (FlyteAdmin) and the Flyte memoization service (Data Catalog).
Navigate to RDS and create an Aurora database with PostgreSQL compatibility.
Leave the Template as Production.
Change the default cluster identifier to flyteadmin.
Set the master username to flyteadmin.
Choose a master password, which you will later use in your Helm template.
Leave Public access off.
Choose the same VPC that your EKS cluster is in.
In a separate tab, navigate to the EKS cluster page and make note of the security group attached to your cluster.
Go back to the RDS page and, in the security group section, add the EKS cluster’s security group (feel free to leave the default as well). This ensures you don’t have to adjust security group rules for pods running in the cluster to access the RDS instance.
Under the top-level Additional configuration (there’s a sub-menu by the same name), under “Initial database name”, enter flyteadmin as well.
Leave all the other settings as is and hit Create.
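A hedged AWS CLI sketch of the same database setup (the instance class, writer identifier, and the <DB-PASSWORD>/<EKS-CLUSTER-SG> placeholders are assumptions; the VPC and security group settings still need to match your EKS cluster as described above):
aws rds create-db-cluster \
    --db-cluster-identifier flyteadmin \
    --engine aurora-postgresql \
    --master-username flyteadmin \
    --master-user-password <DB-PASSWORD> \
    --database-name flyteadmin \
    --vpc-security-group-ids <EKS-CLUSTER-SG>
aws rds create-db-instance \
    --db-instance-identifier flyteadmin-writer \
    --db-cluster-identifier flyteadmin \
    --db-instance-class db.r6g.large \
    --engine aurora-postgresql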
Check Connectivity to the RDS Database From the EKS Cluster#
Get the <RDS-HOST-NAME> by navigating to the database cluster and copying the writer instance endpoint.
We will use a pgsql-postgresql-client pod to verify DB connectivity:
Create a testdb namespace for trial.
kubectl create ns testdb
Run the following command with the username and password you used, and the host returned by AWS.
kubectl run pgsql-postgresql-client --rm --tty -i --restart='Never' --namespace testdb --image docker.io/bitnami/postgresql:11.7.0-debian-10-r9 --env="PGPASSWORD=<Password>" --command -- psql testdb --host <RDS-HOST-NAME> -U <Username> -d flyteadmin -p 5432
If things are working fine, you should drop into a psql command prompt. Type \q to quit. If you make a mistake in the above command, you may need to delete the pod it created with:
kubectl -n testdb delete pod pgsql-postgresql-client
If there are connectivity issues, you will see the following error. Check the security groups on the database and the EKS cluster.
psql: warning: extra command-line argument "testdb" ignored
psql: could not translate host name "database-2-instance-1.ce40o2y3b4os.us-east-2.rds.amazonaws.co" to address: Name or service not known
pod "pgsql-postgresql-client" deleted
pod flyte/pgsql-postgresql-client terminated (Error)
Install an Amazon Load Balancer Ingress Controller#
The cluster doesn’t come with any ingress controllers so we have to install one separately. This one will create an AWS load balancer for K8s Ingress objects.
Before we begin, make sure all the subnets are tagged correctly for subnet discovery. The controller uses these tags when creating the ALBs.
Go to your default VPC subnets. There should be three subnets, one per AZ.
Add the following two tags to all three subnets (a CLI sketch follows below):
Key kubernetes.io/role/elb, Value 1
Key kubernetes.io/cluster/<Name-EKS-Cluster>, Value shared
Refer to this document for additional details.
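A hedged CLI sketch of the same tagging (the subnet IDs are placeholders for your three subnets):
aws ec2 create-tags \
    --resources <subnet-id-1> <subnet-id-2> <subnet-id-3> \
    --tags Key=kubernetes.io/role/elb,Value=1 Key=kubernetes.io/cluster/<Name-EKS-Cluster>,Value=shared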
Download the IAM policy for the AWS Load Balancer Controller:
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.2.0/docs/install/iam_policy.json
Create an IAM policy called AWSLoadBalancerControllerIAMPolicy (delete it from the IAM service first if it already exists):
aws iam create-policy \
    --policy-name AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam-policy.json
Create an IAM role and ServiceAccount for the AWS Load Balancer controller, using the ARN from the step above:
eksctl create iamserviceaccount \
    --cluster=<Name-EKS-Cluster> \
    --region=<region> \
    --namespace=kube-system \
    --name=aws-load-balancer-controller \
    --attach-policy-arn=arn:aws:iam::<AWS_ACCOUNT_ID>:policy/AWSLoadBalancerControllerIAMPolicy \
    --override-existing-serviceaccounts \
    --approve
Add the EKS chart repo to helm:
helm repo add eks https://aws.github.io/eks-charts
Install the TargetGroupBinding CRDs:
kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master"
Install the load balancer controller using helm:
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=<Name-EKS-Cluster> --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller
Verify that the load balancer webhook service is running in the kube-system namespace:
kubectl get service -n kube-system
Sample output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
aws-load-balancer-webhook-service ClusterIP 10.100.255.5 <none> 443/TCP 95s
kube-dns ClusterIP 10.100.0.10 <none> 53/UDP,53/TCP 75m
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
aws-load-balancer-controller-674869f987-brfkj 1/1 Running 0 11s
aws-load-balancer-controller-674869f987-tpwvn 1/1 Running 0 11s
Use this document for any additional installation instructions.
SSL Certificate#
To use SSL (needed for gRPC clients), you need an SSL certificate. You can either generate a self-signed certificate (described below) or acquire a legitimate one by working with your infrastructure team. Self-signed certificates are not secure and will show up as a security warning to any users, so it is recommended that you deploy a legitimate certificate.
Self-Signed Method (Insecure)#
Generate a self-signed cert using OpenSSL to obtain the <KEY> and <CRT> files.
Define a req.conf file with the following contents:
[req]
distinguished_name = req_distinguished_name
x509_extensions = v3_req
prompt = no
[req_distinguished_name]
C = US
ST = WA
L = Seattle
O = Flyte
OU = IT
CN = flyte.example.org
emailAddress = dummyuser@flyte.org
[v3_req]
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = flyte.example.org
Use openssl to generate the KEY and CRT files.
openssl req -x509 -nodes -days 3649 -newkey rsa:2048 -keyout key.out -out crt.out -config req.conf -extensions 'v3_req'
Import the cert into ACM to create an ARN for it:
aws acm import-certificate --certificate fileb://crt.out --private-key fileb://key.out --region <REGION>
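The import command returns the certificate ARN; if you need to look it up again later, list the certificates in the region:
aws acm list-certificates --region <REGION>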
Production#
Generate a cert from the CA used by your org and get the <KEY> and <CRT>. Flyte doesn’t manage the lifecycle of certificates so this will need to be managed by your security or infrastructure team.
Refer to the AWS docs to import the cert. You can also request a public certificate from ACM, or a private certificate issued by AWS Private CA (see the sketch below).
Note the generated ARN. This doc calls it <CERT-ARN>, and you will use it to replace the corresponding placeholder in values-eks.yaml.
Use AWS Certificate Manager to manage the SSL certificate for your hosted Flyte installation.
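For example, a hedged sketch of requesting a public certificate from ACM with DNS validation (the domain is illustrative, and validation must still be completed through your DNS provider):
aws acm request-certificate \
    --domain-name flyte.example.org \
    --validation-method DNS \
    --region <REGION>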
Create an S3 Bucket#
Create an S3 bucket without public access.
Choose a name for it, for example: <ClusterName-Bucket>
Use the same region as the EKS cluster.
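A hedged CLI sketch (bucket names must be globally unique; omit --create-bucket-configuration when the region is us-east-1):
aws s3api create-bucket \
    --bucket <ClusterName-Bucket> \
    --region <REGION> \
    --create-bucket-configuration LocationConstraint=<REGION>
aws s3api put-public-access-block \
    --bucket <ClusterName-Bucket> \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true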
Create a Log Group#
Navigate to the AWS Cloudwatch page and create a Log Group.
Give it a name like flyteplatform.
Note
Pushing logs to CloudWatch is not native to K8s; you need to use a K8s agent to push logs.
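A CLI equivalent for creating the log group:
aws logs create-log-group --log-group-name flyteplatform --region <REGION>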
Time for Helm#
Installing Flyte#
Add the Flyte chart repo to Helm
helm repo add flyteorg https://flyteorg.github.io/flyte
Download the EKS values for Helm.
To use the Flyte native scheduler (the default), download values-eks.yaml:
curl -sL -o values-eks.yaml https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-eks.yaml
To use the AWS scheduler instead, also download the override values:
curl -sL -o values-eks.yaml https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-eks.yaml
curl -sL -o values-eks-override.yaml https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-eks-override.yaml
Update values in the YAML file
Search and replace the following:
| Placeholder | Description | Sample Value |
|---|---|---|
| <AWS_ACCOUNT_ID> | The AWS account ID, within quotation marks | |
| <REGION> | The region your EKS cluster is in | |
| <RDS-HOST-NAME> | DNS entry for your Aurora instance | |
| <ClusterName-Bucket> | Bucket used by Flyte | |
| | The password in plaintext for your RDS instance | awesomesauce |
| | CloudWatch Log Group | flyteplatform |
| <CERT-ARN> | ARN of the self-signed (or official) certificate | |
(Optional) Configure Flyte project and domain
To restrict projects, update the Helm values. By default, Flyte creates three projects: flytesnacks, flytetester, and flyteexamples.
# you can define projects as per your need
flyteadmin:
  initialProjects:
    - flytesnacks
    - flytetester
    - flyteexamples
To restrict domains, update the Helm values again. By default, Flyte creates three domains per project: development, staging and production.
# -- Domain configuration for Flyte project. This enables the specified number of domains across all projects in Flyte.
configmap:
  domain:
    domains:
      - id: development
        name: development
      - id: staging
        name: staging
      - id: production
        name: production
# Update Cluster resource manager only if you are using Flyte resource manager. It will create the required resource in the project-domain namespace.
cluster_resource_manager:
  enabled: true
  config:
    cluster_resources:
      customData:
        - development:
            - projectQuotaCpu:
                value: "5"
            - projectQuotaMemory:
                value: "4000Mi"
            - defaultIamRole:
                value: "arn:aws:iam::{{ .Values.userSettings.accountNumber }}:role/flyte-user-role"
        - staging:
            - projectQuotaCpu:
                value: "2"
            - projectQuotaMemory:
                value: "3000Mi"
            - defaultIamRole:
                value: "arn:aws:iam::{{ .Values.userSettings.accountNumber }}:role/flyte-user-role"
        - production:
            - projectQuotaCpu:
                value: "2"
            - projectQuotaMemory:
                value: "3000Mi"
            - defaultIamRole:
                value: "arn:aws:iam::{{ .Values.userSettings.accountNumber }}:role/flyte-user-role"
Install Flyte
Install Flyte with Flyte native scheduler
helm install -n flyte -f values-eks.yaml --create-namespace flyte flyteorg/flyte-core
Install Flyte with Flyte AWS Scheduler
helm install -n flyte -f values-eks.yaml -f values-eks-override.yaml --create-namespace flyte flyteorg/flyte-core
Verify that all the pods have come up correctly:
kubectl get pods -n flyte
Uninstalling Flyte#
helm uninstall -n flyte flyte
Upgrading Flyte#
Upgrade Flyte with the Flyte native scheduler:
helm upgrade -n flyte -f values-eks.yaml --create-namespace flyte flyteorg/flyte-core
Upgrade Flyte with the Flyte AWS scheduler:
helm upgrade -n flyte -f values-eks.yaml -f values-eks-override.yaml --create-namespace flyte flyteorg/flyte-core
Connecting to Flyte#
Flyte can be accessed using the UI console or your terminal.
First, find the Flyte endpoint created by the ALB ingress controller.
$ kubectl -n flyte get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
flyte <none> * k8s-flyte-8699360f2e-1590325550.us-east-2.elb.amazonaws.com 80 3m50s
flyte-grpc <none> * k8s-flyte-8699360f2e-1590325550.us-east-2.elb.amazonaws.com 80 3m49s
<FLYTE-ENDPOINT> is the value in the ADDRESS column; both ingresses share the same address because the same load balancer serves both gRPC and HTTP.
Configure the flytectl CLI.
Add <FLYTE-ENDPOINT> to ~/.flyte/config.yaml, for example:
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///<FLYTE-ENDPOINT>
  insecureSkipVerify: true # only required if using a self-signed cert. Caution: not to be used in production
  insecure: false # only set to true when using insecure ingress. Setting it to true with secure ingress may cause an "unavailable desc" error. A self-signed cert counts as secure ingress but should not be used in production.
logger:
  show-source: true
  level: 0
storage:
  type: s3
  connection:
    auth_type: iam
    region: <REGION> # Example: us-east-2
  container: <ClusterName-Bucket> # Example: my-bucket. The Flyte k8s cluster / service account for execution should have access to this bucket
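Once this config is in place, a quick way to verify connectivity from the terminal is to list projects (flytectl reads ~/.flyte/config.yaml by default; the --config flag is only needed for a non-default path):
flytectl get project
# or, pointing at an explicit config file
flytectl --config ~/.flyte/config.yaml get project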
Accessing Flyte Console (Web UI)#
Open https://<FLYTE-ENDPOINT>/console to access the Flyte Console UI.
Ignore the certificate error if you are using a self-signed cert.
Troubleshooting#
If a flyteadmin pod is not coming up, describe the pod and check which of the containers or init containers had an error.
kubectl describe pod/<flyteadmin-pod-instance> -n flyte
Then check the logs for the container which failed.
For example, to check the run-migrations init container:
kubectl logs -f <flyteadmin-pod-instance> run-migrations -n flyte
If the ADDRESS column is empty after getting the ingress, describe ingress to find out if there are error messages.
kubectl describe ingress -n flyte
If you see connectivity issues, check your security group rules on the DB and the EKS cluster.
For authentication issues, check that you used the same password in Helm and during RDS DB creation.
(Note: when using CloudFormation templates, make sure the passwords are not single- or double-quoted.)
To increase the log level for flytectl, change your logger config to the following:
logger:
  show-source: true
  level: 6