Notifications when Kubernetes (Cron)Jobs fail?


What do you do when you have CronJobs running in your Kubernetes cluster and want to know when a job fails? Do you manually check the execution status? Painful. Or do you perhaps rely on roundabout Prometheus queries, adding unnecessary overhead? Not ideal… But worry not! Instead, let me suggest a way to immediately receive notifications when jobs fail to execute, using two nifty tools:

  • cmaster11/Overseer — an open-source monitoring tool.
  • Notify17 — a notification app that lets you receive notifications on Android/iOS and web.

Note: cmaster11/Overseer is a heavily modified fork of the amazing skx/Overseer tool, e.g. with added support for Kubernetes eventing. All credit for the original tool goes to skx!

Brief tech excursion: Kubernetes events

The underlying trick we will use is watching the stream of Kubernetes events. (A list of basic events can be found in the Kubernetes source code.)

Try running the following command in your cluster:

kubectl get events --all-namespaces

Most likely, you will see some interesting events happening. In my stream, I see a job that failed and exhausted its retries. Womp womp.

50s     Normal    Pulling                Pod    pulling image "alpine"
23s     Normal    Pulled                 Pod    Successfully pulled image "alpine"
23s     Normal    Created                Pod    Created container
23s     Normal    Started                Pod    Started container
2m39s   Normal    SuccessfulCreate       Job    Created pod: test-74rz4
22s     Warning   BackoffLimitExceeded   Job    Job has reached the specified backoff limit

You might notice that one of the events is BackoffLimitExceeded. This event is generated whenever a Job fails and there are no more retries available. This is the event we're going to watch with Overseer.
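If you want to check for this event by hand, you can filter the event stream by reason with a field selector (whether anything shows up depends, of course, on whether a job has actually failed recently):

kubectl get events --all-namespaces --field-selector reason=BackoffLimitExceeded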

Overseer

Overseer can easily be run in Kubernetes using the provided example manifests. More specifically, we will use the files referenced in the commands below.

To start, we’ll set up the core of Overseer with the following commands:

kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/000-namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/redis.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/001-service-account-k8s-event-watcher.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-k8s-event-watcher.yaml

You can monitor the progress (on Linux) with:

watch kubectl -n overseer get pod

When all pods are up and running, let’s proceed with the notifier!

Notify17

To set up the notifier, first import the notification template into Notify17 using the Import button:

Import button

Once you’ve imported the template, save it by clicking the Save button.

Save button

The last step is to set up Overseer’s webhook bridge.

Copy the file https://github.com/cmaster11/overseer/blob/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml to a local directory and replace REPLACE_TEMPLATE_API_KEY with your notification template API key. Then apply the file with kubectl apply -f FILE_PATH.
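On Linux, that step could look roughly like this (a minimal sketch; the local filename is simply where you choose to save the manifest, and YOUR_TEMPLATE_API_KEY stands for your own key):

curl -sLo overseer-bridge-webhook-n17.yaml https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml
sed -i 's/REPLACE_TEMPLATE_API_KEY/YOUR_TEMPLATE_API_KEY/' overseer-bridge-webhook-n17.yaml
kubectl apply -f overseer-bridge-webhook-n17.yaml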

And we’re done!

Test

To test the whole system, you can try to apply the failing job example file:

kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/master/example-kubernetes/example-failing-job/job-fail.yaml

The job will fail and in a few seconds Overseer should generate an alert and send it through Notify17!

A new alert!
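If you'd rather not use the example file, you can also create a throwaway failing job imperatively (the job name here is just an example; with the default backoffLimit of 6, Kubernetes retries with growing delays, so the BackoffLimitExceeded event can take a few minutes to appear):

kubectl create job test-failing-job --image=alpine -- sh -c "exit 1"

Afterwards, kubectl delete job test-failing-job cleans it up.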

P.S. If something doesn’t work, remember that kubectl get pod and kubectl logs POD_NAME are your friends.
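For example, assuming everything was deployed into the overseer namespace as above:

kubectl -n overseer get pod
kubectl -n overseer logs POD_NAME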

Cleanup

To clean up Overseer, just delete its namespace with:

kubectl delete ns overseer

