EMR on EKS is a feature for running Spark on EKS. While regular EMR also manages Hadoop clusters, EMR on EKS is only responsible for launching containers.
Launch an EMR cluster with AWS CLI and run Spark applications - sambaiz-net
By running on Kubernetes, you can use Kubernetes tools and features for management and monitoring, and utilize leftover resources if you already have a cluster. Knowledge of Kubernetes gained from other applications can also be applied to aggregation and analysis, and vice versa. However, some knowledge of Kubernetes and EKS is required: you can start using regular EMR without knowing YARN, but I think EMR on EKS would be difficult to even operate without Kubernetes knowledge.
By the way, Docker is supported from EMR 6.x (Hadoop 3).
Pricing is based on the amount of vCPU and memory requested by the pods.
Launch a cluster
Launch an EKS cluster with CDK.
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';
const vpc = new ec2.Vpc(this, 'vpc', {
  cidr: '10.0.0.0/16',
  maxAzs: 2,
  subnetConfiguration: [
    {
      cidrMask: 18,
      name: 'public',
      subnetType: ec2.SubnetType.PUBLIC,
    },
    {
      cidrMask: 18,
      name: 'private',
      subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
    },
  ]
})
cdk.Tags.of(vpc).add("Name", this.stackName)

const mastersRole = new iam.Role(this, 'masters-role', {
  assumedBy: new iam.CompositePrincipal(
    new iam.ServicePrincipal('eks.amazonaws.com'),
    new iam.AccountRootPrincipal(),
  ),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSClusterPolicy'),
  ],
  inlinePolicies: {
    'eksctl-iamidentitymapping': new iam.PolicyDocument({
      statements: [
        new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          actions: ['eks:DescribeCluster'],
          resources: ['*'],
        })
      ]
    })
  }
})

return new eks.Cluster(this, 'cluster', {
  vpc,
  mastersRole,
  clusterName: 'emr-on-eks-test',
  version: eks.KubernetesVersion.V1_23,
  defaultCapacity: 3,
  defaultCapacityInstance: new ec2.InstanceType('m5.large')
})
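The masters role ARN is referenced from ~/.aws/config below, so it may be convenient to export it as a stack output, for example as follows (an optional addition placed before the return above; the output name is arbitrary).
// Optional: export the masters role ARN that is set as role_arn in ~/.aws/config below
new cdk.CfnOutput(this, 'masters-role-arn', { value: mastersRole.roleArn })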
"eksctl create iamidentitymapping", executed later, requires eks:DescribeCluster in addition to Kubernetes permissions, so I attached that policy to the masters role and set up a profile that assumes the role. Alternatively, you can add a user or role to aws-auth and grant it the Kubernetes permissions.
$ brew tap weaveworks/tap
$ brew install weaveworks/tap/eksctl
$ eksctl version
0.124.0
$ cat ~/.aws/config
...
[profile eks-master]
region = ap-northeast-1
role_arn = arn:aws:iam::xxxxx:role/EmrOnEksTestStack-clusterMastersRoleEABFBB9C-10Q5X4BE7QN51
source_profile = <user_profile_name>
Grant the cluster access
Bind a Kubernetes Role to the user linked to the service-linked role for EMR on EKS.
Using Helm on GKE with RBAC enabled - sambaiz-net
You can do that manually, but the following command adds the resources and settings automatically (a CDK sketch of the manual approach follows the confirmation below).
$ CLUSTER_NAME=emr-on-eks-test
$ NAMESPACE=default
$ eksctl create iamidentitymapping \
--cluster $CLUSTER_NAME \
--namespace $NAMESPACE \
--service-name emr-containers \
--profile eks-master
2022-12-31 17:40:34 [ℹ] created "default:Role.rbac.authorization.k8s.io/emr-containers"
2022-12-31 17:40:34 [ℹ] created "default:RoleBinding.rbac.authorization.k8s.io/emr-containers"
2022-12-31 17:40:34 [ℹ] adding identity "arn:aws:iam::xxxxx:role/AWSServiceRoleForAmazonEMRContainers" to auth ConfigMap
You can confirm that the role has been bound.
$ kubectl describe -n kube-system configmap aws-auth
...
mapRoles:
----
...
- rolearn: arn:aws:iam::xxxxx:role/AWSServiceRoleForAmazonEMRContainers
username: emr-containers
...
$ kubectl get rolebinding -n $NAMESPACE emr-containers
NAME ROLE AGE
emr-containers Role/emr-containers 14m
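As mentioned above, these resources can also be created by hand. If everything is managed with CDK, a rough sketch of the manual alternative could look like the following, assuming the eks.Cluster defined earlier is available as cluster. The Role rules here are an abbreviated subset for illustration; see the EMR on EKS setup documentation for the full rule set that eksctl creates.
// Rough sketch of the manual alternative (this post uses eksctl instead).
cluster.addManifest('emr-containers-rbac',
  {
    apiVersion: 'rbac.authorization.k8s.io/v1',
    kind: 'Role',
    metadata: { name: 'emr-containers', namespace: 'default' },
    // Abbreviated subset of the rules for illustration only
    rules: [
      {
        apiGroups: [''],
        resources: ['pods', 'pods/log', 'configmaps', 'secrets'],
        verbs: ['get', 'list', 'watch', 'create', 'delete', 'patch'],
      },
    ],
  },
  {
    apiVersion: 'rbac.authorization.k8s.io/v1',
    kind: 'RoleBinding',
    metadata: { name: 'emr-containers', namespace: 'default' },
    subjects: [{ kind: 'User', name: 'emr-containers', apiGroup: 'rbac.authorization.k8s.io' }],
    roleRef: { kind: 'Role', name: 'emr-containers', apiGroup: 'rbac.authorization.k8s.io' },
  },
)

// Map the service-linked role for EMR on EKS to the emr-containers user in aws-auth,
// which is what eksctl create iamidentitymapping did above
cluster.awsAuth.addRoleMapping(
  iam.Role.fromRoleArn(this, 'emr-containers-service-linked-role',
    `arn:aws:iam::${this.account}:role/AWSServiceRoleForAmazonEMRContainers`),
  { username: 'emr-containers', groups: [] },
)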
Register the cluster
Register the cluster with EMR as a virtual cluster.
import * as emrcontainers from 'aws-cdk-lib/aws-emrcontainers';

new emrcontainers.CfnVirtualCluster(this, 'virtual-cluster', {
  name: 'test_virtual_cluster',
  containerProvider: {
    id: cluster.clusterName,
    type: 'EKS',
    info: {
      eksInfo: {
        namespace: 'default'
      }
    }
  }
})
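The virtual cluster ID is needed for start-job-run later. If the CfnVirtualCluster above is assigned to a variable (say virtualCluster), its attrId can be exported as a stack output so it doesn't have to be looked up afterwards; a small optional addition:
// Optional: export the virtual cluster ID passed to start-job-run below
new cdk.CfnOutput(this, 'virtual-cluster-id', { value: virtualCluster.attrId })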
Create an execution role
Create a role with the policies required for job execution.
new iam.Role(this, 'job-execution-role', {
  roleName: "emr-on-eks-test-job-execution-role",
  assumedBy: new iam.AccountRootPrincipal(), // updated later by update-role-trust-policy
  inlinePolicies: {
    'job-execution-policy': new iam.PolicyDocument({
      statements: [
        new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          actions: [
            's3:PutObject',
            's3:GetObject',
            's3:ListBucket',
          ],
          resources: ['arn:aws:s3:::*'],
        }),
        new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          actions: [
            'logs:PutLogEvents',
            'logs:CreateLogStream',
            'logs:DescribeLogGroups',
            'logs:DescribeLogStreams'
          ],
          resources: ['arn:aws:logs:*:*:*'],
        })
      ]
    })
  }
})
The following command adds a trust policy allowing the service account that EMR on EKS creates when a job is submitted to assume the role.
$ EXECUTION_ROLE_NAME=emr-on-eks-test-job-execution-role
$ aws emr-containers update-role-trust-policy \
--cluster-name $CLUSTER_NAME \
--namespace $NAMESPACE \
--role-name $EXECUTION_ROLE_NAME
Successfully updated trust policy of role emr-on-eks-test-job-execution-role
{
    "Effect": "Allow",
    "Principal": {
        "Federated": "arn:aws:iam::xxxxx:oidc-provider/oidc.eks.ap-northeast-1.amazonaws.com/id/aaaaa"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringLike": {
            "oidc.eks.ap-northeast-1.amazonaws.com/id/aaaaa:sub": "system:serviceaccount:default:emr-containers-sa-*-*-xxxxx-<BASE36_ENCODED_ROLE_NAME>"
        }
    }
}
The issuer URL of the cluster can be obtained with the following command.
$ aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text
https://oidc.eks.ap-northeast-1.amazonaws.com/id/aaaaa
Register it as an IAM OIDC identity provider.
Create a role that can be assumed via OIDC from GitHub Actions with CDK - sambaiz-net
$ eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
2023-01-02 13:39:28 [ℹ] will create IAM Open ID Connect provider for cluster "emr-on-eks-test" in "ap-northeast-1"
2023-01-02 13:39:29 [✔] created IAM Open ID Connect provider for cluster "emr-on-eks-test" in "ap-northeast-1"
$ aws iam list-open-id-connect-providers
{
    "OpenIDConnectProviderList": [
        {
            "Arn": "arn:aws:iam::xxxxx:oidc-provider/oidc.eks.ap-northeast-1.amazonaws.com/id/aaaaa"
        },
        ....
    ]
}
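If the execution role is defined with CDK as above, the same trust policy could alternatively be written there instead of running update-role-trust-policy. A rough sketch, assuming the OIDC provider has already been associated as above and using the placeholder issuer from the output; the real condition generated by update-role-trust-policy pins the account ID and base36-encoded role name, whereas this sketch uses a broader wildcard.
// Rough sketch: an alternative definition of the execution role above
// with the trust policy expressed in CDK instead of update-role-trust-policy
const issuer = 'oidc.eks.ap-northeast-1.amazonaws.com/id/aaaaa'
new iam.Role(this, 'job-execution-role-oidc', {
  roleName: 'emr-on-eks-test-job-execution-role',
  assumedBy: new iam.FederatedPrincipal(
    `arn:aws:iam::${this.account}:oidc-provider/${issuer}`,
    {
      StringLike: {
        // broader than the condition shown above, which also encodes the account ID and role name
        [`${issuer}:sub`]: 'system:serviceaccount:default:emr-containers-sa-*',
      },
    },
    'sts:AssumeRoleWithWebIdentity',
  ),
  // ...the same inlinePolicies as the execution role above
})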
Run Spark jobs
start-job-run launches the pods. If resources are insufficient, they stay pending. If the driver fails with S3 AccessDenied, the execution role may have failed to be assumed, so check whether the trust policy and the IAM OIDC provider are registered.
$ VIRTUAL_CLUSTER_ID=xxxxx
$ EXECUTION_ROLE_ARN=arn:aws:iam::xxxxx:role/emr-on-eks-test-job-execution-role
$ aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name pi-2 \
--execution-role-arn $EXECUTION_ROLE_ARN \
--release-label emr-6.7.0-latest \
--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
"sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
}
}'
{
    "id": "000000031bdlkdo0q1k",
    "name": "pi-2",
    "arn": "arn:aws:emr-containers:ap-northeast-1:xxxxx:/virtualclusters/xxxxx/jobruns/000000031bdlkdo0q1k",
    "virtualClusterId": "xxxxx"
}
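If you submit jobs programmatically rather than with the CLI, the same request can be made with the AWS SDK for JavaScript v3 (@aws-sdk/client-emr-containers). A rough sketch, not used in this post; the log group name and bucket in the commented-out part are hypothetical examples.
import { EMRContainersClient, StartJobRunCommand } from '@aws-sdk/client-emr-containers'

const client = new EMRContainersClient({ region: 'ap-northeast-1' })

// Submit the same pi.py job as the CLI command above
async function submitPiJob(virtualClusterId: string, executionRoleArn: string) {
  return client.send(new StartJobRunCommand({
    virtualClusterId,
    name: 'pi-2',
    executionRoleArn,
    releaseLabel: 'emr-6.7.0-latest',
    jobDriver: {
      sparkSubmitJobDriver: {
        entryPoint: 's3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py',
        sparkSubmitParameters: '--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1',
      },
    },
    // Optional: write driver/executor logs to CloudWatch Logs and S3, which is what the
    // logs and s3 permissions of the execution role can be used for (names are examples)
    // configurationOverrides: {
    //   monitoringConfiguration: {
    //     cloudWatchMonitoringConfiguration: { logGroupName: '/emr-on-eks/test', logStreamNamePrefix: 'pi' },
    //     s3MonitoringConfiguration: { logUri: 's3://<bucket>/logs/' },
    //   },
    // },
  }))
}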
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
000000031bdlkdo0q1k-sn72c 2/2 Running 0 21s
pythonpi-c4ae29856dad6d33-exec-1 0/1 ContainerCreating 0 1s
spark-000000031bdlkdo0q1k-driver 2/2 Running 0 12s
$ kubectl describe pod spark-000000031bdlkdo0q1k-driver
...
  spark-kubernetes-driver:
    Image: 059004520145.dkr.ecr.ap-northeast-1.amazonaws.com/spark/emr-6.7.0:latest
    Args:
      driver
      --properties-file
      /usr/lib/spark/conf/spark.properties
      --class
      org.apache.spark.deploy.PythonRunner
      local:///usr/lib/spark/examples/src/main/python/pi.py
...