Helm 2 to 3 Migration Review

My first project at Calm was migrating from Helm 2 to Helm 3. In this post, I’m going to share a few resources I used as guidance, the detailed steps I followed, and the minor problems I ran into during the migration.

Recommended Read

Migration Scenario at Calm

At Calm, we have multiple k8s clusters throughout the environment. To minimize the risk, start the migration from the least impactful cluster.
On top of Helm, we also utilize Helmfile to organize the Helm charts and values.
Most of the in-house applications are deployed with Jenkins using a generic Helm chart.
A handful of charts from the open-source community are used throughout the environments.

Step 1: Manually convert and re-deploy the least impactful environment.

To get familiar with the process, pick a cluster that is the least impactful for the development process, convert each release into Helm 3, and re-deploy with Helm 3 so you can be sure that there won’t be problems when you deploy in the future. At Calm, we have a QA cluster that had some kube-system components and a sample app so that was a great starting point.

Get the list of the releases with helm ls -a command.
Migrate a release with helm3 2to3 convert $RELEASE_NAME.
You could run with --dry-run if you want to be extra careful but it wasn’t really helpful for me.
Now try to deploy the same chart with Helm 3, using either helm3 upgrade or helmfile sync -b helm3.
The -b flag selects the Helm binary that helmfile uses. During the migration, it’s best to keep both the Helm 2 and Helm 3 binaries installed.
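The steps above can be sketched as a small shell function for a single release. This is only an illustration, not the exact commands used at Calm: the function name, the HELM3 variable, and the example release, chart path, and namespace are all hypothetical. The namespace is passed explicitly because Helm 3 releases are scoped to a namespace.

```shell
# Sketch: convert one Helm 2 release, then re-deploy it with Helm 3.
# HELM3 and migrate_release are illustrative names, not part of any tool.
HELM3="${HELM3:-helm3}"

migrate_release() {
    local release="$1"     # Helm 2 release name, as listed by `helm ls -a`
    local chart="$2"       # chart reference or path for the Helm 3 upgrade
    local namespace="$3"   # namespace the release lives in

    # Preview the conversion first; stop if the plugin reports a problem.
    "$HELM3" 2to3 convert "$release" --dry-run || return 1
    # Convert the release data from Helm 2 to Helm 3.
    "$HELM3" 2to3 convert "$release" || return 1
    # Re-deploy the unchanged chart so Helm 3 problems surface now.
    "$HELM3" upgrade "$release" "$chart" --namespace "$namespace"
}

# Example (hypothetical names):
#   migrate_release sample-app ./charts/sample-app default
```

Stopping at the first failure keeps a release from being half-migrated without anyone noticing.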

In my experience, the conversion step never caused any problems, but deploying the same charts with Helm 3 even without any changes could throw some errors. Most of the community charts showed the changes in label from tiller to helm which wasn’t problematic at all. However some charts needed some extra work, so here are some common problems I encountered.

Kubernetes API validation failure.

One of the changes introduced in Helm 3 is API validation. Before the Helm templates get rendered into k8s manifests and applied to the cluster, Helm 3 validates the k8s manifests against the API definitions. The error message looked like the following.

Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "admin-www" with kind Deployment: Deployment.apps "admin-www" is invalid: spec.template.spec.containers[0].env[10].valueFrom: Invalid value: "": may not be specified when value is not empty: cannot patch "admin-www" with kind Deployment: Deployment.apps "admin-www" is invalid: spec.template.spec.containers[0].env[10].valueFrom: Invalid value: "": may not be specified when value is not empty

In this specific example, the EnvVar validation requires value to be empty whenever valueFrom is set. Our Helm chart had no value: key at all on the entries that used valueFrom, and the fix was simply to add value: "" to them.

This process can be cumbersome, as the validation also depends on your Kubernetes version. One way to bypass it is to disable the API validation with the --disable-openapi-validation flag, but I’d rather fix these errors now than encounter different problems down the line.
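For reference, the bypass could look like the sketch below. The wrapper function and the SKIP_OPENAPI_VALIDATION variable are hypothetical names introduced for illustration; only the --disable-openapi-validation flag itself comes from Helm 3. Validation stays on unless it is switched off explicitly.

```shell
# Sketch: run a Helm 3 upgrade, optionally skipping OpenAPI validation.
# upgrade_release and SKIP_OPENAPI_VALIDATION are illustrative names.
HELM3="${HELM3:-helm3}"

upgrade_release() {
    local release="$1"
    local chart="$2"
    local namespace="$3"

    if [ "${SKIP_OPENAPI_VALIDATION:-false}" = "true" ]; then
        # Escape hatch: the rendered manifests are not validated first.
        "$HELM3" upgrade "$release" "$chart" --namespace "$namespace" \
            --disable-openapi-validation
    else
        # Default: keep validation on so chart problems are caught early.
        "$HELM3" upgrade "$release" "$chart" --namespace "$namespace"
    fi
}

# Example (hypothetical names):
#   SKIP_OPENAPI_VALIDATION=true upgrade_release admin-www ./charts/generic default
```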

Immutable fields

I only encountered this with external-dns. In short, some fields in k8s resources are immutable. If the changes from Helm 2 to Helm 3 happen to touch one of these immutable fields, the deployment will fail even if the conversion was successful. Luckily, external-dns is not a critical component, so my fix was simply to delete the Deployment resource and run the upgrade command again.
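A minimal sketch of that workaround follows. The function name, the KUBECTL and HELM3 variables, and the example arguments are hypothetical. Deleting a Deployment removes its pods until the upgrade recreates them, so this only suits components that can tolerate a short outage.

```shell
# Sketch: delete a Deployment whose immutable fields changed, then upgrade.
# recreate_and_upgrade, KUBECTL and HELM3 are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"
HELM3="${HELM3:-helm3}"

recreate_and_upgrade() {
    local release="$1"
    local chart="$2"
    local namespace="$3"
    local deployment="$4"

    # Remove the Deployment so the upgrade creates it fresh instead of
    # trying to patch an immutable field.
    "$KUBECTL" delete deployment "$deployment" --namespace "$namespace" || return 1
    # Run the same upgrade again; Helm 3 recreates the missing Deployment.
    "$HELM3" upgrade "$release" "$chart" --namespace "$namespace"
}

# Example (hypothetical names):
#   recreate_and_upgrade external-dns ./charts/external-dns kube-system external-dns
```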

Chart repository changes

Some of the chart repositories turned out to be deprecated. If you are lucky, this might be just a matter of updating the repository URL. In my case, some charts were missing the matching version in the new repositories. If that’s the case, you can either try to upgrade the chart or find the old chart from source code somewhere and host it in your own chart repo. As upgrading and testing the components would add more time for me, I hosted the older version of charts myself.
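Hosting an older chart yourself could be sketched roughly as below. This assumes you already have the old chart’s source directory on disk and a place to serve static files from; the function name, directory names, and repository URL are all hypothetical.

```shell
# Sketch: package an old chart version and build a repository index for it.
# publish_chart and HELM3 are illustrative names.
HELM3="${HELM3:-helm3}"

publish_chart() {
    local chart_dir="$1"   # directory containing the old chart's Chart.yaml
    local repo_dir="$2"    # local directory that will be served as the repo
    local repo_url="$3"    # URL under which repo_dir will be reachable

    mkdir -p "$repo_dir" || return 1
    # Create the versioned .tgz archive inside the repo directory.
    "$HELM3" package "$chart_dir" --destination "$repo_dir" || return 1
    # Generate index.yaml so Helm clients can discover the packaged chart.
    "$HELM3" repo index "$repo_dir" --url "$repo_url"
}

# Example (hypothetical paths and URL):
#   publish_chart ./vendor/old-chart ./chart-repo https://charts.example.com
```

The resulting directory can then be served from any static file host and added with helm3 repo add.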

I got pretty comfortable after converting and re-deploying all releases in the QA cluster manually.

The next step was to convert the dev cluster, which developers actively deploy to with Jenkins.

Step 2: Preparing automated deployment in CI with Helm3

At Calm, we mainly use Jenkins for CI/CD at the moment, but this step wouldn’t be too different for other CI tools. Unless you are migrating all environments at the same time, you will likely have a period where both Helm 2 and Helm 3 are in use across the environments. To minimize the impact, I kept the Helm 2 binary as helm and introduced a new helm3 binary.

We have a dedicated Docker image to be used as Jenkins builder, so the following lines are added to the Dockerfile.

Dockerfile changes for helm3

ARG helm3_ver=3.4.1
RUN wget https://get.helm.sh/helm-v${helm3_ver}-linux-amd64.tar.gz \
    && mkdir /tmp/helm3 \
    && tar -xvf helm-v${helm3_ver}-linux-amd64.tar.gz -C /tmp/helm3 \
    && mv /tmp/helm3/linux-amd64/helm /usr/local/bin/helm3
RUN helm3 plugin install https://github.com/helm/helm-2to3.git \
    && yes | helm3 2to3 move config

These changes add a build argument for the Helm 3 version so it is easy to bump later, install the helm3 binary, install the 2to3 plugin, and then move the existing Helm 2 configuration to Helm 3.
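Before pointing Jenkins at the new image, a quick smoke check inside the container can confirm that both the binary and the plugin are in place. The function below is a hypothetical sketch; it assumes the plugin shows up under the name 2to3 at the start of a line in the plugin list output.

```shell
# Sketch: verify that helm3 and the 2to3 plugin are usable in the image.
# check_helm3_image and HELM3 are illustrative names.
HELM3="${HELM3:-helm3}"

check_helm3_image() {
    # Fail if the helm3 binary is missing or cannot report its version.
    "$HELM3" version --short || return 1
    # Fail if no installed plugin is listed under the name 2to3.
    "$HELM3" plugin list | grep -q '^2to3' || {
        echo "2to3 plugin is not installed" >&2
        return 1
    }
}

# Example: run inside the builder image
#   check_helm3_image
```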

Once the new image is built, make sure to update the Jenkins builder to use it. Then we are ready to update the deployment script to use helm3.

Calm uses declarative Jenkins pipelines, and we have a shared function that takes the following arguments and passes them to the helmfile sync command.

helmfile_deploy arguments

def call(Map parameters) {
    service = parameters.repo_name
    region = parameters.region
    env = parameters.env
    namespace = parameters.namespace
    short_hash = parameters.short_hash
    helm_binary = parameters.helm_binary ?: "helm"

Here we introduce a new argument, helm_binary, which defaults to helm (Helm 2), so existing calls to the function behave exactly as before. Inside the function, where we call helmfile sync, we add -b $helm_binary to the helmfile command so that helm3 can be selected per call.

Now with this change, we can selectively convert an environment to use Helm 3 by setting helm_binary: helm3 where the deployment is happening. Once this helm3 binary change is tested, go ahead and convert all releases in the given environment.
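The same default-to-Helm-2 pattern can be sketched in plain shell, which may help if your CI tool is not Jenkins. The function name and the HELMFILE variable are hypothetical. The sketch uses --helm-binary, the long form of -b, and places it before the sync subcommand because it is a global helmfile option.

```shell
# Sketch: run helmfile sync with a selectable Helm binary.
# helmfile_sync and HELMFILE are illustrative names.
HELMFILE="${HELMFILE:-helmfile}"

helmfile_sync() {
    # Default to "helm" (Helm 2 here) so existing callers are unaffected;
    # pass "helm3" only for environments that are already converted.
    local helm_binary="${1:-helm}"
    "$HELMFILE" --helm-binary "$helm_binary" sync
}

# Examples:
#   helmfile_sync          # still deploys with Helm 2
#   helmfile_sync helm3    # deploys with Helm 3
```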

From Migrating to Helm 3: What You Need to Know by Jack Morris

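# List Helm 2 releases
# omit the --tls flag if you're not using TLS
RELEASES=$(helm list --tls -aq)
# Loop through the releases and, for each one, test the conversion
while IFS= read -r release; do
    # Skip blank lines so an empty release list does not call convert with no name
    [ -n "$release" ] || continue
    helm3 2to3 convert "$release" --dry-run
done <<< "$RELEASES"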

Run it without --dry-run once everything is ready. Time to chill and monitor the deployment Slack channel.

At Calm we have 5 dev environments, and I converted just one of them first. A few deployments failed due to unrelated issues, but it’s better to move slowly here and make sure all the moving parts are working correctly.

Once this part is done, we pretty much repeat the same steps up to the production cluster.

Step 3: Cleanup

Once the releases are converted and CI tools are switched to Helm 3, it’s time to clean up the Helm 2 information.

If you’d like to back up the Helm 2 release configmaps just in case:

$ kubectl get configmaps \
    --namespace "kube-system" \
    --selector "OWNER=TILLER" \
    --output "yaml" > helm2-backup-cm.yaml
# From https://geeksocket.in/posts/helm-2-3-migration/

After that, you can run helm3 2to3 cleanup to remove the Helm 2 release configmaps and Tiller deployment.
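Putting the backup and the cleanup together, a cautious sequence could look like the sketch below. The function name and the KUBECTL and HELM3 variables are hypothetical. It takes the same configmap backup as above, previews the cleanup with --dry-run, and only then runs the real cleanup. It is meant to be run interactively, since the plugin may ask for confirmation.

```shell
# Sketch: back up Helm 2 release configmaps, preview the cleanup, then run it.
# backup_and_cleanup, KUBECTL and HELM3 are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"
HELM3="${HELM3:-helm3}"

backup_and_cleanup() {
    local backup_file="${1:-helm2-backup-cm.yaml}"

    # Save every Tiller-owned configmap before anything is deleted.
    "$KUBECTL" get configmaps --namespace kube-system \
        --selector OWNER=TILLER --output yaml > "$backup_file" || return 1
    # Refuse to continue if the backup file is empty, for example because
    # kubectl produced no output at all.
    [ -s "$backup_file" ] || { echo "backup file is empty" >&2; return 1; }

    # Show what would be removed, then perform the actual cleanup.
    "$HELM3" 2to3 cleanup --dry-run || return 1
    "$HELM3" 2to3 cleanup
}

# Example:
#   backup_and_cleanup helm2-backup-cm.yaml
```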

Too many connections warning message in EKS during cleanup process

While running the cleanup command from my local machine, I was seeing thousands of messages like

W0303 09:43:12.111352 8329 exec.go:201] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1961 clients constructed calling "aws"

This might just be a poorly worded warning, but I didn’t want to risk a network problem in the production cluster. I was able to avoid it by running the cleanup command inside a pod in the target k8s cluster, using a service account with admin permissions.

Create the resources with following manifest:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: helm-cleanup
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: helm-cleanup
subjects:
  - kind: ServiceAccount
    name: helm-cleanup
    namespace: default
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: ""

Create a new pod and attach:

kubectl run --serviceaccount=helm-cleanup --generator=run-pod/v1 -i --tty helm-cleanup --image=DOCKER_IMAGE_WITH_HELM_3 -- sh

Then execute helm3 2to3 cleanup inside the container to avoid the unnecessary client connections for auth.
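Once the cleanup has finished, the helper pod and its cluster-wide permissions are no longer needed. A minimal teardown sketch, assuming the resource names from the manifest above and that the pod was started in the default namespace (the function name and the KUBECTL variable are hypothetical):

```shell
# Sketch: remove the temporary pod, binding, and service account.
# remove_cleanup_helpers and KUBECTL are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"

remove_cleanup_helpers() {
    # --ignore-not-found makes the function safe to run more than once.
    "$KUBECTL" delete pod helm-cleanup --namespace default --ignore-not-found
    "$KUBECTL" delete clusterrolebinding helm-cleanup --ignore-not-found
    "$KUBECTL" delete serviceaccount helm-cleanup --namespace default --ignore-not-found
}

# Example:
#   remove_cleanup_helpers
```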

Conclusion

It has been about 2 weeks since I converted the production cluster and removed Tiller from the cluster.

There have been no issues at all so far and the conversion process was overall pretty smooth.

Now we are ready for all the latest and greatest charts that the community has to offer.