You Don’t Have to Optimize Your Docker Images; Just Bake ‘em

Vishnu Deva
8 min read · Oct 16, 2022

Container Images have become the de facto mechanism for packaging dependencies and code. A few years ago, when the task of building these Images was under the purview of operations specialists, there existed the luxury of optimization — with beautiful multi-stage builds, clean-up after every step, etc. But today, any developer working on a modern stack gets to build their own images and pass them along to a container orchestrator like Kubernetes or Docker Swarm to deploy their application.

This decoupling between operations and application development has been a boon. A tiny team of Ops engineers — whatever you call them: Ops, DevOps, Platform, Infra — can serve a wide range of products at virtually any scale. At no point does an operations specialist need to get involved in this process of Image building once a simple build pipeline has been set up.

On the flip side, this decoupling now means that the Images built by app developers are barely optimized, carry a lot of unused packages and dead weight, and are rarely split into smaller pieces. Which is… completely fair. What we lose in Image efficiency, we gain many times over in developer velocity. There’s nothing worse than having Ops be a bottleneck to development. So, I’ll gladly take a few extra gigabytes any day.

But, that said, here’s one of the biggest problems I face as a platform engineer: Large Startup Times. It’s pretty much the villain in my book, the silent killer of efficiency, and has implications up and down the stack.

In the case of Kubernetes with AWS EKS, here’s what happens when a new pod is launched but no node currently exists to serve it.

  1. A new EC2 instance is launched.
  2. The instance joins the cluster and becomes a K8s ‘node’.
  3. The Docker image necessary for the pod to run is pulled to the node.
  4. The container is launched and the application is started on it.

Each step takes time, but there’s nuance to this:

  • Steps 1 and 2 are inevitable, and only EKS and K8s devs can contribute to any improvements here.
  • Step 4 is dependent on the app developer. If they use a potato language like Python (I say this with love) with a lot of module imports, the app launch is going to be quite drawn out.

So, steps 1, 2, and 4 for a particular application take a fixed amount of time that’s out of the hands of an Ops engineer. But step 3, the Image pull time, is interesting, because it introduces the biggest opportunity for guaranteed improvement in this process.

Let’s say a particular task takes four minutes to run, and the 4 GB Image needed for the task takes two minutes to be pulled. This means that 33% of the computing time is taken up by the Image pull step, which is significant. And this only gets worse when the Image gets even larger and autoscaling gets aggressive. You could end up spending thousands of minutes of computing time a day on nothing but Image pulls.

How does this affect the larger picture? Say the app team wants a task to start in 20 seconds flat from a cold start. That’s pretty much impossible unless you always have a pool of warmed-up nodes that have the Container Images pulled and are waiting for new tasks to be scheduled. This “headroom” is incredibly wasteful, and since many workloads are rather hard to predict, you’re paying round the clock for compute capacity that is guaranteed never to be used. (Disclaimer: I refuse to take responsibility if your finance team comes across this.)

The Solution

Every major cloud has the concept of a machine image from which a virtual machine can be launched. If the Container Image is ‘baked’ into a Machine Image, then when the machine is booted, the Container Image is already present and ready to go, thus completely eliminating the Container Image pull time.

The following steps assume the use of AWS EKS since it’s the most popular way to deploy K8s today, but the underlying concepts are incredibly straightforward and can be used for any cloud/platform.

Overview

Steps

  1. Trigger Container Image build.
  2. Store Image in a remote Container Image Registry.
  3. Trigger Machine Image build with Packer.
  4. Packer starts an EC2 instance from an EKS AMI.
  5. It then runs a script on the instance that pulls the newly built Container Image from ECR.
  6. Once done, it stops the EC2 instance and triggers the creation of an AMI from the stopped instance.
  7. Once the AMI is created, Packer cleans up everything else.
  8. We now make use of this AMI in our Kubernetes cluster manager — Karpenter, Cluster Autoscaler, Spotinst Ocean, etc.
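
Steps 1 and 2 are whatever your CI pipeline already does; as a rough sketch (the registry URL, repo, and tag below are just the placeholders reused later in this article), they boil down to something like:

docker build -t xyz.dkr.ecr.us-east-1.amazonaws.com/app:potato-0.2 .

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin xyz.dkr.ecr.us-east-1.amazonaws.com

docker push xyz.dkr.ecr.us-east-1.amazonaws.com/app:potato-0.2

Steps 3 through 7 are a single packer build run, which is what the rest of this article walks through.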

Packer

Packer is an automation tool that lets you build, among a lot of other things, EC2 AMIs, which are AWS’ version of Machine Images. We’ll be using it to bake our Container Images into EKS AMIs so that we can use these new AMIs to launch the EC2 instances that will back the nodes of our K8s cluster.

Packer templates are written in HCL, HashiCorp’s own configuration language, but it’s well thought out and thus easy to grasp. Glance through the following script, and we’ll break it down in the section that follows it. I do have to note at this point that this is a bare-bones introduction to Packer, just enough for our current needs. For a comprehensive overview, I suggest following the official tutorials; they’re structured in an easy-to-consume manner.
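
Here’s a minimal sketch of what prepull.pkr.hcl can look like for our purposes. The region, instance type, volume size, and IAM instance profile are placeholders you’d adjust for your own account; the only hard requirement is that the instance can pull from ECR.

packer {
  required_plugins {
    amazon = {
      version = ">= 1.1.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "IMAGE" {
  type = string
}

variable "AMI_NAME" {
  type = string
}

variable "SOURCE_AMI" {
  type = string
}

source "amazon-ebs" "eks-ami" {
  ami_name             = var.AMI_NAME
  region               = "us-east-1"      # placeholder
  instance_type        = "m5.xlarge"      # placeholder
  iam_instance_profile = "packer-builder" # placeholder; needs ECR read access
  ssh_username         = "ec2-user"

  # Find the latest AMI matching the name pattern passed in via SOURCE_AMI.
  source_ami_filter {
    filters = {
      name                = "${var.SOURCE_AMI}"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["amazon"] # or the EKS AMI account ID for your region
    most_recent = true
  }

  # Make sure the root volume is large enough to hold the baked Image.
  launch_block_device_mappings {
    device_name           = "/dev/xvda"
    volume_size           = 100
    volume_type           = "gp3"
    delete_on_termination = true
  }
}

build {
  sources = ["source.amazon-ebs.eks-ami"]

  # Run the pull script on the instance, passing in the Image to bake.
  provisioner "shell" {
    environment_vars = ["IMAGE=${var.IMAGE}"]
    script           = "nerdctl-script.sh"
  }
}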

This is executed using the following command:

packer build \
  --var IMAGE=xyz.dkr.ecr.us-east-1.amazonaws.com/app:potato-0.2 \
  --var AMI_NAME=baked-potato-ami-0.2 \
  --var 'SOURCE_AMI=amazon-eks-gpu-node-1.23-*' \
  prepull.pkr.hcl

Breakdown

Our goal is to:

  1. Launch an instance from a base EKS AMI. This is the default AMI that is used when new nodes are launched.
  2. Execute the nerdctl-script.sh file on the launched instance.
  3. Create a new AMI.

Let’s forget about the nerdctl-script.sh file for now. Its contents have no bearing on the packer build process; we only care that it modifies the instance in some beneficial way.

In the packer build command above, you’ll notice that we specify three var arguments. These are variables that configure the specific build. Each var is a key=value pair, and each key must have a corresponding variable “key” block defined in the pkr.hcl script.

We’ve used this mechanism to configure:

  • IMAGE: This is the Container Image that will be baked into the AMI.
  • AMI_NAME: The name of the AMI that will be created.
  • SOURCE_AMI: The base AMI on top of which the customization will take place.

The source “amazon-ebs” “eks-ami” section captures information about the EC2 instance that will be launched: the instance type, the volume size, the IAM instance profile attached to it, and the source AMI that’s used. You’ll spot SOURCE_AMI being referenced as ${var.SOURCE_AMI} in the source_ami_filter block.

The build section makes use of the source section and adds the shell provisioner on top of it. As the name suggests, the shell provisioner does nothing but execute a shell command or script on the source. In this case, the source is an EC2 instance and nerdctl-script.sh is the script that gets executed on it.

The Script

Kubernetes is an incredible container orchestrator, but for all its magic, it still requires a container runtime to run the Images it’s given. Up to K8s version 1.22, at least on EKS, Docker was the default runtime, but from version 1.23 onward, containerd takes its place. For our purposes, this just means that for versions 1.22 and lower, we use Docker to pull the Images from the remote Image registry. For version 1.23+, we use nerdctl, a Docker-compatible CLI for containerd.

The “Docker-compatible” part is pretty important for us because it lets us trivially authenticate nerdctl with AWS ECR and retrieve our Images the same way we’d do it with Docker.

The following script, meant for K8s nodes with containerd instead of Docker, installs and uses nerdctl to pull Images to the machine. The comments in the script provide an explanation for each step.
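
A sketch of what nerdctl-script.sh can look like; the nerdctl version, the region, and the IMAGE environment variable (passed in by the shell provisioner) are assumptions to adapt to your setup.

#!/usr/bin/env bash
set -euo pipefail

NERDCTL_VERSION=1.0.0 # placeholder version
REGION=us-east-1      # placeholder region

# Install nerdctl, the Docker-compatible CLI for containerd.
curl -fsSL "https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-${NERDCTL_VERSION}-linux-amd64.tar.gz" \
  | sudo tar -xz -C /usr/local/bin

# containerd ships with recent EKS AMIs; make sure the daemon is up before pulling.
sudo systemctl enable --now containerd

# Authenticate against ECR exactly as you would with Docker.
REGISTRY="$(echo "${IMAGE}" | cut -d/ -f1)"
aws ecr get-login-password --region "${REGION}" \
  | sudo /usr/local/bin/nerdctl login --username AWS --password-stdin "${REGISTRY}"

# Pull into the k8s.io namespace so the kubelet (via containerd) can see the Image.
sudo /usr/local/bin/nerdctl --namespace k8s.io pull "${IMAGE}"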

The next script is meant for K8s nodes using Docker. It’s quite straightforward compared to the one above.
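
A sketch along the same lines (same IMAGE and region assumptions), using the Docker daemon that’s already on the node:

#!/usr/bin/env bash
set -euo pipefail

REGION=us-east-1 # placeholder region

# Authenticate the Docker daemon against ECR and pull the Image.
REGISTRY="$(echo "${IMAGE}" | cut -d/ -f1)"
aws ecr get-login-password --region "${REGION}" \
  | sudo docker login --username AWS --password-stdin "${REGISTRY}"

sudo docker pull "${IMAGE}"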

As discussed earlier, the shell provisioner in the packer script just points to a script that should be executed on the machine. So your folder structure now looks like the following, and you’d run the packer build command above from the packer-build folder.

packer-build/
├── prepull.pkr.hcl
└── nerdctl-script.sh

Karpenter

Once you have your new AMI from the previous steps, you just need to update your cluster manager. In this case, it’s Karpenter, an incredible K8s-native cluster manager that currently only supports EKS. If you haven’t come across it, I urge you to read up on it; it’s a true game-changer in my opinion.

Karpenter has the concept of a provisioner which is used to configure the instances it launches. There can be multiple provisioners, each with different characteristics for different purposes. For example, I have a provisioner each for CPU, GPU, and High Availability workloads. We can also link an AWSNodeTemplate to each provisioner to precisely configure the EC2 instances that are launched. Here’s a detailed example of a provisioner and the AWSNodeTemplate it uses, just for reference.
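
A rough sketch of that pair, using the v1alpha5/v1alpha1 APIs Karpenter ships at the time of writing (the names, requirements, discovery tags, and AMI ID are all placeholders):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  providerRef:
    name: baked-potato              # links to the AWSNodeTemplate below
  ttlSecondsAfterEmpty: 60
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: baked-potato
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  amiSelector:
    aws-ids: "ami-0123456789abcdef0"     # the AMI Packer just baked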

TL;DR: The AWSNodeTemplate has an amiSelector field whose aws-ids key can be used to select the AMI by ID. There are also other ways of specifying the AMI.

With Karpenter + AMI baking, I’ve seen cold starts of 40–50 seconds even with 4–5 GB Docker Images. (This feels like a poor man’s version of discussing quarter-mile times in drag racing.)

Conclusion

Optimizing large Container Images is a long-term goal that should be addressed through thorough training sessions, code reviews, and more. In the meantime, Ops teams can’t ignore the nightmare of torturous Image pull times and the effect it has on their platforms. So as much as I hate the bodging that this article promotes, we really do need it until a solution comes along to fix this issue for good (maybe it’s image streaming?). I can only hope it’s sooner rather than later.
