Notifications when Kubernetes (Cron)Jobs fail?


What do you do when you have CronJobs running in your Kubernetes cluster and want to know when a job fails? Do you manually check the execution status? Painful. Or do you perhaps rely on roundabout Prometheus queries, adding unnecessary overhead? Not ideal… But worry not! Instead, let me suggest a way to immediately receive notifications when jobs fail to execute, using two nifty tools:

  • cmaster11/Overseer — an open-source monitoring tool.
  • Notify17 — a notification app that lets you receive notifications on Android/iOS and web.

Note: cmaster11/Overseer is a heavily modified fork of the amazing skx/Overseer tool, e.g. with added support for Kubernetes eventing. All credit for the original tool goes to skx!

Brief tech excursion: Kubernetes events

The underlying trick we will use is watching the stream of Kubernetes events. (A list of basic events can be found in the Kubernetes source code.)

Try running the following command in your cluster:

kubectl get events --all-namespaces

Most likely, you will see some interesting events happening. In my stream, I see a job that failed and exhausted its retries. Womp womp.

50s     Normal    Pulling                Pod    pulling image "alpine"
23s     Normal    Pulled                 Pod    Successfully pulled image "alpine"
23s     Normal    Created                Pod    Created container
23s     Normal    Started                Pod    Started container
2m39s   Normal    SuccessfulCreate       Job    Created pod: test-74rz4
22s     Warning   BackoffLimitExceeded   Job    Job has reached the specified backoff limit

You might notice that one of the events is BackoffLimitExceeded. This event is generated whenever a Job fails and there are no more retries available. This is the event we're going to watch with Overseer.
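If you want to check for this event by hand, you can filter the event stream by reason with a field selector (whether anything shows up depends, of course, on whether a job has actually failed recently):

kubectl get events --all-namespaces --field-selector reason=BackoffLimitExceeded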

Overseer

Overseer can easily be run in Kubernetes using the provided example manifests. More specifically, we will use the files referenced in the commands below.

To start, we’ll set up the core of Overseer with the following commands:

kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/000-namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/redis.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/001-service-account-k8s-event-watcher.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-k8s-event-watcher.yaml

You can monitor the progress (on Linux) with:

watch kubectl -n overseer get pod

When all pods are up and running, let’s proceed with the notifier!

Notify17

To set up the notifier, first import the notification template into Notify17 using the Import button:

Import button

Once you’ve imported the template, save it by clicking the Save button.

Save button

The last step is to set up Overseer’s webhook bridge.

Copy the file https://github.com/cmaster11/overseer/blob/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml to a local directory and replace REPLACE_TEMPLATE_API_KEY with your notification template API key. Then apply the file with kubectl apply -f FILE_PATH.
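On Linux, that step could look roughly like this (a minimal sketch; the local filename is simply where you choose to save the manifest, and YOUR_TEMPLATE_API_KEY stands for your own key):

curl -sLo overseer-bridge-webhook-n17.yaml https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml
sed -i 's/REPLACE_TEMPLATE_API_KEY/YOUR_TEMPLATE_API_KEY/' overseer-bridge-webhook-n17.yaml
kubectl apply -f overseer-bridge-webhook-n17.yaml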

And we’re done!

Test

To test the whole system, you can try to apply the failing job example file:

kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/master/example-kubernetes/example-failing-job/job-fail.yaml

The job will fail and in a few seconds Overseer should generate an alert and send it through Notify17!

A new alert!
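If you'd rather not use the example file, you can also create a throwaway failing job imperatively (the job name here is just an example; with the default backoffLimit of 6, Kubernetes retries with growing delays, so the BackoffLimitExceeded event can take a few minutes to appear):

kubectl create job test-failing-job --image=alpine -- sh -c "exit 1"

Afterwards, kubectl delete job test-failing-job cleans it up.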

P.S. If something doesn’t work, remember that kubectl get pod and kubectl logs POD_NAME are your friends.
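For example, assuming everything was deployed into the overseer namespace as above:

kubectl -n overseer get pod
kubectl -n overseer logs POD_NAME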

Cleanup

To clean up Overseer, just delete its namespace with:

kubectl delete ns overseer

