Cost-Effective Github Actions
This is a short article that covers:
- The cost of use for Github Actions (GA).
- Setting up autoscaling, self-hosted GA runners on K8s.
- A brief discussion of how it works.
The Economics 💸
The convenience of Github Actions (GA) is unparalleled, allowing for a seamless CICD/workflow-automation experience colocated with your code. Combined with the generous 2,000 minutes/month provided for free for private repos, this makes it a natural choice for many small private projects, and a no-brainer for any open source project, which gets unlimited minutes/month.
But what happens when we try to implement a comprehensive test suite with GA? Suddenly 2000 minutes (~33 hours) starts to look like small change. Any test suite serving a dozen developers could chew through it in a couple of days or less.
And that’s when the costs from Github kick in. For every minute over 2000, we pay $0.008, which seems like a minuscule amount, but the equivalent hourly rate comes to $0.48. And for this price, you get 2 CPU cores and 7 GB of RAM; you can barely toast bread with that!
But setting aside the issue of the woefully inadequate compute, is the cost at least on par with what we’d get if we paid a cloud provider for comparable compute? Not even close.
4.8x the cost of equivalent on demand instances.
23x the cost of equivalent spot instances.
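As a rough check on those ratios, the arithmetic can be written out using only the figures quoted above. The two instance prices below are not quoted provider prices; they are simply what the 4.8x and 23x ratios imply.

```latex
\begin{align*}
\text{GA hourly rate} &= \$0.008/\text{min} \times 60\ \text{min/h} = \$0.48/\text{h} \\
\text{implied on-demand rate} &\approx \frac{\$0.48/\text{h}}{4.8} = \$0.10/\text{h} \\
\text{implied spot rate} &\approx \frac{\$0.48/\text{h}}{23} \approx \$0.021/\text{h}
\end{align*}
```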
Thankfully, Github Actions has excellent, free support for self-hosted runners. So we pay Github $0 for any workflows that are executed on our own compute.
BYOC (Bring Your Own Compute) 💻
Github provides an option to manually create our own runners.
What’s the point of Kubernetes if you have to start machines *shudders* by hand?
Instead, we make use of a great project, Actions Runner Controller, which lets us bring all the autoscaling glory of Kubernetes to the table. It’s an understatement to say that this changes the whole equation when considering the use of Github Actions.
Prerequisites
- A Kubernetes Cluster
Github Actions Concepts and Terminology
This document should serve as a nice primer to the concepts underlying Github Actions.
Actions Runner Controller (ARC)
Benefits
- If the cluster on which you set up ARC is optimized well, you’ll be running 100% spot instances, so that’s a 23x lower bill when compared to using Github-hosted compute.
- It auto-scales! So the runners are only up if they absolutely need to be, which makes it very cost-effective compared to statically provisioned CI/CD systems that need runners up even if they’re idle.
- We can have machines of any size or type serving our workflows, allowing us to execute tests on environments that closely mirror the real deployment scenario. This is doubly important for Machine Learning code, much of which runs in production on GPU-enabled machines, so comprehensive tests of that code need GPUs too.
High-Level Breakdown
We care about three components:
- The Actions Runner Controller extends the Kubernetes API to help the cluster manage custom resources specific to the ARC.
- The Runner Deployment is responsible for creating the runner pods that execute the jobs defined in a GA workflow.
- The Horizontal Runner Autoscaler is responsible for scaling the deployment up and down based on a metric provided by Github.
Technical Breakdown
- A GA Workflow is an automated process that is defined by one or more jobs.
- A Job is a set of steps that will be executed sequentially on the same server. A step can be a command, script execution, or an “action”.
- An action is just a set of predefined steps that perform a common task, for example: git checking out repositories, Docker image builds, etc.
- Each job runs in its own isolated environment. The steps within a job share that environment, so each step has direct access to the outputs of the previous steps. The same, however, doesn’t apply between steps in different jobs.
- A runner in Github Actions is a server that executes a single job at a time.
- In the ARC implementation, each runner is a pod that waits for and executes a job and is immediately killed off to provide a new pod with a fresh environment for the next job that comes in. Zero maintenance FTW!
- ARC doesn’t create its own runner implementation, but rather makes use of Github’s self-hosted runner. This runner software works by long-polling Github for 50 seconds to receive job assignments. If nothing was received, the poll closes and a new one is started.
- Ultimately, the role of ARC is not to care about the contents of the jobs themselves, but rather to understand that there are in fact jobs that need execution and provide enough runners that will directly poll for the details of those jobs from Github. This means that from the perspective of the runner, there’s really no concept of the ARC system that created it. This perfectly adheres to K8s/Docker principles wherein the system that created an application is of no relevance to the application as long as its requirements are met.
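To make the terminology above concrete, here is a minimal sketch of a workflow file. The workflow name, job names, step id and output name are made up for illustration; the point is only to show two jobs, steps that share an environment within a job, a predefined action (`actions/checkout`), and the fact that a value has to be passed explicitly to reach a different job.

```yaml
# Hypothetical workflow: names and commands are illustrative only.
name: example-ci
on: [push]

jobs:
  build:
    runs-on: self-hosted
    # Expose a step output so that another job can read it.
    outputs:
      commit: ${{ steps.record.outputs.commit }}
    steps:
      # An "action": a predefined set of steps for a common task.
      - uses: actions/checkout@v4
      # A plain command. It sees the files checked out by the previous
      # step because both steps run in the same job environment.
      - id: record
        run: echo "commit=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"

  test:
    # A separate job gets a fresh runner; it only sees what is passed explicitly.
    needs: build
    runs-on: self-hosted
    steps:
      - run: echo "Testing commit ${{ needs.build.outputs.commit }}"
```

With ARC, each of these two jobs would be picked up by its own runner pod.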
Usage
Let’s take a look at a sample implementation of the Runner Autoscaler and Deployment before looking at how it works for us. Like most things in Kubernetes, the definition is rather straightforward and intuitive.
The specifications for the RunnerDeployment and the HorizontalRunnerAutoscaler are contained in the single YAML file below. I’ll use the terms Deployment to mean RunnerDeployment and Autoscaler to mean HorizontalRunnerAutoscaler.
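A minimal sketch of what such a file can look like is given here, assuming the `actions.summerwind.dev/v1alpha1` API group used by ARC's custom resources. Only `general-runner`, `private_repository` and `cpu-medium` come from the discussion that follows; the owner `example-org`, the Autoscaler's name, the replica bounds and the choice of metric are illustrative assumptions, so check the field names against the ARC version you install.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: general-runner
spec:
  # No "replicas" field: the Autoscaler below manages the count.
  template:
    spec:
      # OWNER/REPO on Github; "example-org" is a placeholder.
      repository: example-org/private_repository
      labels:
        - cpu-medium
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: general-runner-autoscaler   # hypothetical name
spec:
  scaleTargetRef:
    name: general-runner
  minReplicas: 0    # assumed bounds, tune for your workload
  maxReplicas: 5
  metrics:
    # One of the pull-based metrics; assumed here for illustration.
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - private_repository
```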
Points of Note:
- There is no replicas field specified in the Deployment because it will be taken care of by the Autoscaler, which has minReplicas and maxReplicas configured.
- By default, Github applies the self-hosted label to all self-hosted runners, so using the following parameter on your Job definitions will cause them to run on a self-hosted runner. No other configuration is required.
    jobs:
      build_image:
        runs-on: self-hosted
- But what if you have multiple Deployments in your cluster, and jobs that need a runner from a specific Deployment to function? In the Deployment specification above, you’ll note the labels for the deployment specifying cpu-medium. This lets us refer to it from a workflow job definition as follows:
    jobs:
      build_image:
        runs-on: cpu-medium
- The repository field under the Deployment spec tells the deployment the exact repository on Github it’ll be serving. In this case, it’s private_repository.
- The scaleTargetRef variable defined in the Autoscaler contains the name of the Deployment that it will be managing; in this case, it’s general-runner.
- The metrics field can contain any of the metrics described here. The Autoscaler is now pull-based, meaning that it will poll Github every minute and make scaling decisions based on the returned metric. This can only be as responsive as the poll time, so in the worst-case scenario, we have a minute’s delay between the workflow run being requested and the Autoscaler triggering a scale-up. This can be adjusted using the sync-period setting, but note that a really small sync-period will cause issues with Github rate limiting.
- The poll-based method is more than good enough because there are usually not a lot of real-time uses for Github Actions. But if that doesn’t work for you, there’s the option of configuring the Autoscaler to be driven by a webhook server. So a Github event can actively trigger a scale-up, allowing for real-time scale-up. More details can be found here.
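For the webhook-driven option, a hedged sketch of an Autoscaler might look like the following. It assumes the `scaleUpTriggers` field of the `actions.summerwind.dev/v1alpha1` HorizontalRunnerAutoscaler and that ARC's webhook server is deployed and reachable by Github. The resource name, replica bounds and duration are illustrative, and the supported event types differ between ARC versions, so treat this as a starting point rather than a verified configuration.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: general-runner-webhook-autoscaler   # hypothetical name
spec:
  scaleTargetRef:
    name: general-runner
  minReplicas: 0    # assumed bounds
  maxReplicas: 5
  scaleUpTriggers:
    # Scale up when Github reports a queued workflow job via webhook.
    - githubEvent:
        workflowJob: {}
      # How long the added capacity is kept before it may be scaled down.
      duration: "30m"
```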
Management
Authentication
The ARC can communicate with Github on your behalf using either:
- A Personal Access Token (PAT) that works for personal use or for small projects.
- A Github App’s authentication key. This method doesn’t run into the potential rate-limiting issues of the PAT method and is more suitable for use by an organization.
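As an illustration of the PAT option, the token is typically handed to the controller as a Kubernetes Secret. The sketch below assumes the Secret name `controller-manager`, the namespace `actions-runner-system` and the key `github_token`, which are the defaults described in the ARC project's documentation; confirm them against the version you install, and never commit a real token in plain text.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: controller-manager          # assumed default name
  namespace: actions-runner-system  # assumed default namespace
type: Opaque
stringData:
  # Placeholder value: substitute your own Personal Access Token.
  github_token: "<your-personal-access-token>"
  # For the Github App option, ARC's documentation describes separate keys
  # for the app ID, installation ID and private key instead of github_token.
```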
Installation
The Actions Runner Controller can be deployed as a Helm chart, though you need to ensure that cert-manager is installed before you do so.
Detailed instructions here.
Managing Configuration with ArgoCD
Since the primary mode of configuring ARC is through custom K8s resources, you can commit these YAML files to a repository and have ArgoCD take over for the actual creation and management of these resources on your cluster.
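A minimal sketch of such a setup is an ArgoCD Application that points at the directory holding the runner manifests. The repository URL, path, Application name and target namespace below are hypothetical placeholders; the rest uses the standard `argoproj.io/v1alpha1` Application fields.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-runners        # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    # Placeholder repository and path holding the RunnerDeployment
    # and HorizontalRunnerAutoscaler YAML files.
    repoURL: https://github.com/example-org/infrastructure.git
    targetRevision: main
    path: arc/runners
  destination:
    server: https://kubernetes.default.svc
    namespace: actions-runner-system
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from the repository
      selfHeal: true   # revert manual drift on the cluster
```

With automated sync enabled, a merged change to a runner definition is applied to the cluster without anyone running kubectl by hand.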
Conclusion
ARC is a must-have tool that brings out the true potential of Github Actions. The fact that we get a modern, comprehensive CICD/workflow automation system at no cost beyond that of our own compute is incredible.
In part 2, we’ll delve into building custom Docker Images for the runners, using GPU nodes, and using GA for ML use-cases.