If you’ve tried to run K8s Jobs from within your CI/CD system, you know it’s tricky. You fire off several kubectl commands and hope they do the job, but getting all the details right is challenging. I’m going to show you how to do it in an elegant and robust way.
Triggering Jobs and handling them correctly requires us to address several concerns:
- Reliability (triggering a Job even if the previous one has failed),
- Tracking the Job’s progress and streaming its logs to STDOUT,
- Properly cleaning up after each Job,
- Properly handling the Job’s exit code.
Let me stress the importance of the last one (handling exit codes). Within a CI/CD environment, you’ll find that you want to run workflow steps conditionally, depending on the previous operation’s status. This enables deeper integration into your existing CI/CD flows.
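For instance, a later pipeline step can branch on the wrapper script’s exit code. A minimal sketch (the `run_job` function is a placeholder standing in for the real `./run-job.sh`; here it pretends the Job failed):

```shell
# Branch on a wrapper script's exit code. run_job is a placeholder
# standing in for ./run-job.sh; here it pretends the Job failed.
run_job() { return 1; }

if run_job; then
  status="succeeded"
else
  status="failed"
fi
echo "Job ${status}"   # a real pipeline would notify or abort here
```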
Job and kubectl handling

First, let’s create a Job resource (job.yaml):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
        - name: myjob
          image: bash:latest
          command: ["/bin/sh", "-c"]
          args:
            - echo "Starting job..";
              sleep 1;
              echo "Working (1/3)..";
              sleep 1;
              echo "Working (2/3)..";
              sleep 1;
              echo "Working (3/3)..";
              sleep 1;
              echo "Done!";
      restartPolicy: Never
```
Note the backoffLimit: 0. This instructs Kubernetes to run the Job only once. If you increase this value (the default is 6, so it’s non-zero by default), K8s will retry the Job’s Pod several times until it succeeds. You may opt into retries depending on your use case.
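If retries do make sense for your workload, a non-zero limit opts back in (the value below is a hypothetical example; tune it to your use case):

```yaml
spec:
  backoffLimit: 3   # retry the Job's Pod up to 3 times before marking it failed
```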
Now, add the kubectl handling (run-job.sh):
```shell
#!/usr/bin/env bash

NS="mynamespace"
JOB="myjob"

# Delete the Job if it already exists (a previous run could have
# failed before cleanup)
kubectl delete job "$JOB" -n "$NS" 2>/dev/null || true

# Create the Job
kubectl apply -f job.yaml -n "$NS"

# Wait for the Job's Pod to be created and become ready
kubectl wait --for=condition=ready -n "$NS" \
  "$(kubectl get pod -l job-name="$JOB" -n "$NS" -o name)"

# Stream logs to STDOUT (with -f follow flag)
kubectl logs -f "job/$JOB" -n "$NS"

# Handle the terminal status (complete|failed).
# kubectl wait watches a single condition, so run one wait per
# condition in the background and save the PIDs.
# A negative --timeout avoids the 30s default giving up too early.
kubectl wait --for=condition=complete --timeout=-1s \
  "job/$JOB" -n "$NS" > /dev/null 2>&1 &
completion_pid=$!

# The trailing `exit 1` marks the failure path for `wait -n` below
kubectl wait --for=condition=failed --timeout=-1s \
  "job/$JOB" -n "$NS" > /dev/null 2>&1 && exit 1 &
failure_pid=$!

# Wait until either background wait finishes
# (passing PIDs to `wait -n` requires bash 5.1+)
wait -n "$completion_pid" "$failure_pid"
exit_code=$?

# Stop the background wait that is still running
kill "$completion_pid" "$failure_pid" 2>/dev/null

# Display a friendly Job status message
if (( exit_code == 0 )); then
  echo "Job completed"
else
  echo "Job failed with exit code ${exit_code}, exiting..."
fi

# Clean up the Job afterwards
kubectl delete job "$JOB" -n "$NS"

# Exit with the Job's exit code
exit "$exit_code"
```
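The key trick is `wait -n`, which returns the exit status of whichever background process finishes first. A self-contained sketch of that mechanism, with plain `sleep`/`exit` stand-ins instead of kubectl:

```shell
# Two background processes racing, mirroring run-job.sh:
# here the "failure" path finishes first and exits 1.
( sleep 2; exit 0 ) &   # stands in for the condition=complete wait
( sleep 1; exit 1 ) &   # stands in for the condition=failed wait

# wait -n returns the exit status of the first job to finish
# (run-job.sh additionally passes PIDs, which needs bash 5.1+)
wait -n
exit_code=$?

echo "first finisher exited with ${exit_code}"
```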
To test how this code handles errors, simply inject one (in job.yaml):
```yaml
              echo "Working (2/3)..";
              sleep 1;
              echo "ERROR!";
              exit 1;
              echo "Working (3/3)..";
              sleep 1;
```
Caveat: compute minutes

Notice that with this method you’re wasting compute. A CI/CD runner process triggers a K8s Job, then waits for its completion. For longer jobs, you’ll block the runner for the whole Job duration, even though its compute load is close to zero. With self-hosted runners, this might be a non-issue, but if you pay for CI/CD minutes, it can quickly ramp up your bill.
Why bother with CI/CD

If it’s tricky to set up, and might cost extra, why bother running Jobs this way? There are a number of valid reasons to do so:
- A K8s Job has access to the cluster’s private network,
- You can run Jobs in the context of existing K8s namespaces,
- You can exec into running Pods using the same method,
- You can use the same Job definition in CI and when running locally.
CI/CD workflows run in a somewhat unique environment. They have triggers that can’t be reproduced otherwise:
- New deployments (automated, continuous delivery),
- Pull requests (and related events, like comments),
- New repository commits.
Finally, the most popular CI/CD systems have a robust UI that makes managing jobs and workflows a breeze. You can view workflow runs, their logs, retry failed jobs, and do a lot more.
Many systems support manual workflow triggers. With a simple click of a button, a non-technical staff member can trigger powerful automation that manipulates K8s resources in a safe way. This can greatly simplify many complex RBAC and kubectl access patterns.
Example workflows:
- Clear cache after successful deployment,
- Restore or rollback an in-cluster database,
- Run DB migrations.
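As a sketch of the manual-trigger idea, here is what one of these workflows could look like on GitHub Actions (a hypothetical example; the runner is assumed to have kubectl access to the cluster):

```yaml
name: run-db-migrations
on:
  workflow_dispatch:   # adds a "Run workflow" button in the UI
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run migration Job
        run: ./run-job.sh   # the wrapper script from this article
```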