Karpenter is an open-source node autoscaler for Kubernetes clusters, developed by AWS. Compared to Cluster Autoscaler, it can provision nodes quickly and flexibly because it does not go through Auto Scaling Groups. It supports only EKS so far, but it may come to support other cloud providers.
(PS: 2023-12-06) Karpenter is now also available on AKS.
In this article, I install Karpenter on an EKS cluster with CDK and confirm that the cluster scales automatically when the number of replicas changes. The entire code is on GitHub.
Create AWS resources used by Karpenter
Create the AWS resources that Karpenter uses, such as IAM Roles and an SQS queue that receives events such as Spot Instance interruption notices, from the provided CloudFormation template, which is included as a NestedStack.
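import * as cdk from 'aws-cdk-lib';
import * as cfninc from 'aws-cdk-lib/cloudformation-include';
import { Construct } from 'constructs';

class KarpenterResourcesStack extends cdk.NestedStack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps & {karpenterVersion: string, clusterName: string}) {
    super(scope, id, props);
    // Include the CloudFormation template provided by Karpenter as-is
    new cfninc.CfnInclude(this, 'KarpenterResources', {
      templateFile: `karpenter_${props?.karpenterVersion}.yaml`,
      parameters: {
        "ClusterName": props?.clusterName
      }
    });
  }
}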
Mapping NodeRole
Update the aws-auth ConfigMap to map the NodeRole created by the template to the nodes.
awsAuth.addRoleMapping(nodeRole, {
  groups: ["system:bootstrappers", "system:nodes"],
  username: "system:node:{{EC2PrivateDNSName}}"
})
If you create multiple mappings for the same Role, the later one seems to overwrite the earlier one, so be careful. This can leave the nodes with insufficient permissions. For example, the kubernetes.io/kubelet-serving CertificateSigningRequest can remain in a Pending state, which leaves the kubelet without a serving certificate and makes it non-functional.
$ journalctl -u kubelet
handshake error from x.x.x.x:x no serving certificate available for the kubelet
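One way to avoid the overwrite described above is to merge the mappings for the same Role before registering them. The following is a minimal sketch in plain TypeScript, not CDK code: the RoleMapping shape and the mergeRoleMappings helper are hypothetical names introduced for illustration, and the ARN is a placeholder.

```typescript
// Hypothetical shape of one aws-auth mapRoles entry (not a CDK type).
interface RoleMapping {
  rolearn: string;
  username: string;
  groups: string[];
}

// Merge entries that share a role ARN so that their groups are unioned
// instead of a later entry replacing an earlier one.
function mergeRoleMappings(mappings: RoleMapping[]): RoleMapping[] {
  const byArn: Record<string, RoleMapping> = {};
  const order: string[] = [];
  for (const m of mappings) {
    if (!(m.rolearn in byArn)) {
      byArn[m.rolearn] = { rolearn: m.rolearn, username: m.username, groups: m.groups.slice() };
      order.push(m.rolearn);
      continue;
    }
    const existing = byArn[m.rolearn];
    if (existing.username !== m.username) {
      throw new Error(`conflicting usernames for ${m.rolearn}`);
    }
    for (const g of m.groups) {
      if (existing.groups.indexOf(g) === -1) {
        existing.groups.push(g);
      }
    }
  }
  return order.map((arn) => byArn[arn]);
}

// Placeholder ARN for illustration only.
const nodeRoleArn = "arn:aws:iam::111122223333:role/KarpenterNodeRole-example";
const merged = mergeRoleMappings([
  { rolearn: nodeRoleArn, username: "system:node:{{EC2PrivateDNSName}}", groups: ["system:bootstrappers"] },
  { rolearn: nodeRoleArn, username: "system:node:{{EC2PrivateDNSName}}", groups: ["system:nodes"] },
]);
console.log(JSON.stringify(merged, null, 2));
```

Each merged entry could then be passed to a single addRoleMapping call per Role, so that no group is lost.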
Create ControllerRole
Create a ControllerRole that the karpenter ServiceAccount, created later by the Helm chart, can assume through OIDC authentication. It corresponds to the iam section of the ClusterConfig passed to eksctl create cluster.
const controllerRole = new iam.Role(this, 'KarpenterControllerRole', {
  roleName: `${cluster.clusterName}-karpenter`,
  path: "/",
  assumedBy: new iam.WebIdentityPrincipal(
    cluster.openIdConnectProvider.openIdConnectProviderArn,
    {
      // delay resolution to deployment-time to use tokens in object keys
      "StringEquals": new cdk.CfnJson(this, 'KarpenterControllerRoleStringEquals', { value: {
        [`${cluster.clusterOpenIdConnectIssuer}:aud`]: "sts.amazonaws.com",
        [`${cluster.clusterOpenIdConnectIssuer}:sub`]: "system:serviceaccount:karpenter:karpenter"
      }})
    }
  ),
  managedPolicies: [controllerPolicy]
})
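To make the trust condition easier to picture, here is a small sketch of the object that the CfnJson value resolves to once the issuer is known. It is plain TypeScript with no CDK dependency; the irsaTrustCondition helper and the issuer value are hypothetical and used only for illustration. In the real stack the issuer is a deploy-time token, which is why the keys are wrapped in CfnJson.

```typescript
// Build the StringEquals condition of a trust policy for a ServiceAccount.
// `issuer` is assumed to be the OIDC issuer without the "https://" scheme.
function irsaTrustCondition(
  issuer: string,
  namespace: string,
  serviceAccount: string
): Record<string, string> {
  const condition: Record<string, string> = {};
  // The audience must be STS for AssumeRoleWithWebIdentity.
  condition[`${issuer}:aud`] = "sts.amazonaws.com";
  // The subject pins the role to one ServiceAccount in one namespace.
  condition[`${issuer}:sub`] = `system:serviceaccount:${namespace}:${serviceAccount}`;
  return condition;
}

// Placeholder issuer; the real value is only known at deployment time.
const exampleIssuer = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE";
const trustCondition = irsaTrustCondition(exampleIssuer, "karpenter", "karpenter");
console.log(JSON.stringify({ StringEquals: trustCondition }, null, 2));
```

If the namespace or ServiceAccount name in the Helm values differs from the sub value, AssumeRole is rejected, so the two need to be kept in sync.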
Install Karpenter
Install Karpenter with the Helm chart. If the ControllerRole’s Principal is wrong and AssumeRole fails, the deployment fails and is rolled back, so it may be better to set wait: false at first and observe the behavior.
cluster.addHelmChart('karpenter', {
  chart: 'karpenter',
  repository: 'oci://public.ecr.aws/karpenter/karpenter',
  version: karpenterVersion,
  namespace: 'karpenter',
  createNamespace: true,
  values: {
    serviceAccount: {
      name: 'karpenter', // added
      annotations: {
        "eks.amazonaws.com/role-arn": controllerRole.roleArn,
      }
    },
    settings: {
      clusterName: cluster.clusterName,
      interruptionQueue: interruptionQueueName
    },
    controller: {
      resources: {
        requests: {
          cpu: 1,
          memory: "1Gi"
        },
        limits: {
          cpu: 1,
          memory: "1Gi"
        }
      }
    }
  },
  wait: true
})
Settings for Provisioner and AWSNodeTemplate
Configure the instances to be provisioned.
(PS: 2023-12-06) The v1beta1 API was added in v0.32.0, and the old Provisioner and AWSNodeTemplate were deprecated in favor of NodePool and EC2NodeClass. The tags attached to AWS resources also changed, so the conditions in the policies changed as well.
cluster.addManifest('DefaultProvisionerAndNodeTemplate', {
  "apiVersion": "karpenter.k8s.aws/v1beta1",
  "kind": "EC2NodeClass",
  "metadata": {
    "name": "default"
  },
  "spec": {
    "amiFamily": "AL2",
    "role": `KarpenterNodeRole-${cluster.clusterName}`,
    // publicSubnets is an array of ISubnet objects, so select each subnet by its ID
    "subnetSelectorTerms": cluster.vpc.publicSubnets.map(subnet => ({
      "id": subnet.subnetId
    })),
    "securityGroupSelectorTerms": [{
      "id": cluster.clusterSecurityGroup.securityGroupId
    }],
    // optional
    "blockDeviceMappings": [{
      "deviceName": "/dev/xvda",
      "ebs": {
        "volumeSize": "100Gi",
        "volumeType": "gp3",
        "deleteOnTermination": true,
      }
    }]
  }
}, {
  "apiVersion": "karpenter.sh/v1beta1",
  "kind": "NodePool",
  "metadata": {
    "name": "default"
  },
  "spec": {
    "template": {
      "spec": {
        "nodeClassRef": {
          "name": "default"
        },
        "requirements": [{
          "key": "karpenter.sh/capacity-type",
          "operator": "In",
          "values": ["spot"]
        }],
      }
    },
    "disruption": {
      "consolidationPolicy": "WhenUnderutilized"
    },
    "limits": {
      "cpu": "1000"
    }
  }
})
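If you want several NodePools that differ only in capacity type or CPU limit, the manifest can be generated from a small function. The sketch below is plain TypeScript; nodePoolManifest and its parameters are hypothetical names, and the returned object only mirrors the fields used in this article rather than the full NodePool schema.

```typescript
type CapacityType = "spot" | "on-demand";

// Hypothetical helper: returns a plain object shaped like the NodePool above.
function nodePoolManifest(
  name: string,
  nodeClassName: string,
  capacityTypes: CapacityType[],
  cpuLimit: number
) {
  if (capacityTypes.length === 0) {
    throw new Error("at least one capacity type is required");
  }
  return {
    apiVersion: "karpenter.sh/v1beta1",
    kind: "NodePool",
    metadata: { name: name },
    spec: {
      template: {
        spec: {
          nodeClassRef: { name: nodeClassName },
          requirements: [{
            key: "karpenter.sh/capacity-type",
            operator: "In",
            values: capacityTypes,
          }],
        },
      },
      disruption: { consolidationPolicy: "WhenUnderutilized" },
      // Resource quantities are written as strings in the manifest.
      limits: { cpu: String(cpuLimit) },
    },
  };
}

const spotOnly = nodePoolManifest("default", "default", ["spot"], 1000);
console.log(JSON.stringify(spotOnly, null, 2));
```

The resulting object can be passed to cluster.addManifest in the same way as the literal manifest.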
Confirm the behavior
When I increased the number of replicas of the Deployment, a spot instance started up according to the NodePool’s requirements, and when I deleted the Deployment, the instance shut down.
$ kubectl scale deployment inflate --replicas 5
$ kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
...
...,"message":"found provisionable pod(s)",...
...,"message":"computed new nodeclaim(s) to fit pod(s)",...
...,"message":"created nodeclaim",...
...,"message":"created launch template",...
...,"message":"launched nodeclaim",...
...,"message":"discovered subnets",...
...,"message":"registered nodeclaim",...
...,"message":"initialized nodeclaim",...
Karpenter adds labels such as the following to the nodes. By specifying them in a nodeSelector, pods can be scheduled onto the cheapest node that meets the requirements.
karpenter.k8s.aws/instance-category=r
karpenter.k8s.aws/instance-cpu=16
karpenter.k8s.aws/instance-encryption-in-transit-supported=false
karpenter.k8s.aws/instance-family=r6gd
karpenter.k8s.aws/instance-generation=6
...
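As an illustration of using these labels, the sketch below builds a Deployment manifest whose pods carry a nodeSelector. It is plain TypeScript; deploymentWithNodeSelector is a hypothetical helper, the image is an assumed placeholder, and the selected label values are examples taken from the list above. The NodePool requirements must still permit the selected values for a node to be provisioned.

```typescript
// Hypothetical helper: builds a Deployment whose pods are constrained by nodeSelector.
function deploymentWithNodeSelector(
  name: string,
  image: string,
  replicas: number,
  nodeSelector: Record<string, string>
) {
  return {
    apiVersion: "apps/v1",
    kind: "Deployment",
    metadata: { name: name },
    spec: {
      replicas: replicas,
      selector: { matchLabels: { app: name } },
      template: {
        metadata: { labels: { app: name } },
        spec: {
          // Label values in nodeSelector are always strings.
          nodeSelector: nodeSelector,
          containers: [{
            name: name,
            image: image,
            resources: { requests: { cpu: "1" } },
          }],
        },
      },
    },
  };
}

const inflate = deploymentWithNodeSelector(
  "inflate",
  "public.ecr.aws/eks-distro/kubernetes/pause:3.7", // assumed placeholder image
  5,
  {
    "karpenter.sh/capacity-type": "spot",
    "karpenter.k8s.aws/instance-category": "r",
  }
);
console.log(JSON.stringify(inflate.spec.template.spec.nodeSelector, null, 2));
```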
Pod disruption
The karpenter.sh/do-not-disrupt: "true" annotation prevents running pods from being disrupted and moved to another node by Consolidation. You can see Consolidation events in the Karpenter logs or with kubectl get event.
However, this setting may extend the lifespan of the instance and so increase the chance of hitting a spot instance interruption. Karpenter handles the two-minute interruption notice, which it retrieves from the interruptionQueue during reconciliation. Disruption in that case is unavoidable and has a time limit. If you are concerned about capacity, I think it is better to set minAvailable/maxUnavailable in a PodDisruptionBudget rather than do-not-disrupt, so that the minimum required capacity is kept even during disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: testapp-pdb
spec:
  minAvailable: 90%
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: testapp
The default for unhealthyPodEvictionPolicy is IfHealthyBudget, under which unhealthy pods can be evicted only while the number of healthy pods meets the PDB’s desiredHealthy. Setting it to AlwaysAllow allows unhealthy pods to be evicted unconditionally, so you can drain a node without being blocked by pods that never become healthy.
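To get a feel for how a percentage minAvailable and the eviction policy interact, here is a rough model in plain TypeScript. It is a simplification and not the actual disruption controller logic; the function names are hypothetical, and the rounding assumes that Kubernetes rounds a percentage up to the nearest integer.

```typescript
type UnhealthyPodEvictionPolicy = "IfHealthyBudget" | "AlwaysAllow";

// Number of pods that must stay healthy for a percentage minAvailable.
// The percentage is rounded up to the nearest integer.
function desiredHealthy(replicas: number, minAvailablePercent: number): number {
  return Math.ceil((replicas * minAvailablePercent) / 100);
}

// Number of healthy pods that may be evicted voluntarily right now.
function disruptionsAllowed(
  currentHealthy: number,
  replicas: number,
  minAvailablePercent: number
): number {
  return Math.max(0, currentHealthy - desiredHealthy(replicas, minAvailablePercent));
}

// Whether a running but unhealthy pod may be evicted under the given policy.
function canEvictUnhealthyPod(
  policy: UnhealthyPodEvictionPolicy,
  currentHealthy: number,
  replicas: number,
  minAvailablePercent: number
): boolean {
  if (policy === "AlwaysAllow") {
    return true;
  }
  return currentHealthy >= desiredHealthy(replicas, minAvailablePercent);
}

for (const replicas of [5, 10, 20]) {
  console.log(
    `replicas=${replicas}`,
    `desiredHealthy=${desiredHealthy(replicas, 90)}`,
    `disruptionsAllowed=${disruptionsAllowed(replicas, replicas, 90)}`
  );
}
```

Under this model, minAvailable: 90% with 5 replicas rounds up to 5 required pods, so no voluntary eviction is allowed until the replica count reaches 10. That is worth checking before combining a high percentage with a small Deployment.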