How to stop fearing and start using Kubernetes

The KISS principle (keep it simply stupid) is important for modern software development, and even more so in the Data Engineering, where due to big data and big costs every additional system or layer without clear benefits can quickly generate waste and money loss.

Many data engineers are therefore wary when it goes about implementing and rolling out Kubernetes into their operational infrastructure. After all, 99,999% of the organizations out there are not Google, Meta, Netflix or OpenAI and for their tiny gigabytes of data and two or three data science-related microservices running as prototypes internally on a single hardware node, just bare Docker (or at most, docker-compose) is more than adequate.

So, why Kubernetes?

Before answering this question, let me show you how flat the learning curve of the modern Kubernetes starts.

First of all, we don’t need the original k8s, we can use a simple and reasonable k3s instead. To install a fully functional cluster, just login to a Linux host and execute the following:

curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="644" sh -s - --secrets-encryption

You can then execute

kubectl get node

to check if the cluster is running.

Now, if you have a Docker image with a web service inside (for example implemented with Python and Flask) listening on port 5000, you only need to create the following YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-microservice
  labels:
    app: my-microservice
spec:
  selector:
    matchLabels:
      app: my-microservice
  replicas: 1
  template:
    metadata:
      labels:
        app: my-microservice
    spec:
      containers:
       -name: my-microservice
          image: my-artifactory.repo/path-to-docker-image
          ports:
           -containerPort: 5000

---

kind: Service 
apiVersion: v1 
metadata:
  name: my-microservice
spec:
  type: LoadBalancer
  selector:
    app: my-microservice
  ports:
    - port: 5000
      targetPort: 5000

Conceptually, Kubernetes manages the computing resources of the nodes belonging to the cluster to run Pods. Pod is something akin a Docker container. Usually, you don’t create Pods manually. Instead, you create a Deployment object and it will then take care to start defined number of Pods, watch their health and re-start them if necessary. So, in the first object defined above, with the kind of Deployment, we define a template, which will be used whenever a Deployment needs to run yet another Pod. As you can see, inside the template you are specifying the path to the Docker image to run. There, you can also specify everything else necessary for Docker to run it: environment variables, volumes, command line, etc.

A Kubernetes cluster assigns IP addresses from its very own IP network to the nodes and Pods running there, and because usually your company network doesn’t know how to route to this network, the microservices are not accessible by default. You make them accessible by creating another Kubernetes object of kind Service. There are different types of the Services, but for now everything you need to know is that if you set it to be LoadBalancer, the k3s will expose your microservice to the rest of your corporate network by leasing a corporate network IP address and hosting a proxy service on it (Traefik) that will forward the communication to the corresponding Pod.

Now, when we have our YAML file, we can roll out our tiny happy microservice to our Kubernetes cluster with

kubectl apply -f my-microservice.yaml

We can see if it is running, watch its logs or get a shell access to the running docker container with

kubectl get pod
kubectl logs -f pod/my-pod-name-here
kubectl exec -it pod/my-pod-name-here bash

And if we don’t need our service any more, we just delete it with

kubectl delete -f my-microservice.yaml

Why Kubernetes?

So far, we didn’t see any advantages compared to Docker, did we?

Well, yes, we did:

We’ve got a watch-dog that monitors Pods and can (re)start them for example after server reboot or if they crash for any reason.
If we have two hardware nodes, we can deploy our Pods with “replicas: 2” and because we already have a load balancer in front of them, we can get high availability almost for free
If the microservice supports scalability by running several instances in parallel, we already get a built-in industrial grade loadbalancer for scaling out.

Besides, hosting your services in Kubernetes has the following advantages:

If at some point you will need to pass your internal prototypes for professional operations to a separate devop team, they will hug you to death when they learn your service is already kubernetized
If you need to move your services from on-premises into the cloud, the efforts to migrate, for example, to Amazon ECS is much much higher than the changes you need to do to go from k3s to Amazon EKS.
You can execute batched workflows scheduled by time with a CronJob object, without the need to access the /etc/crontab on the hardware nodes.
You can define DAG (directed acyclic graphs) for complicated workflows and pipelines using Airflow, Prefect, Flyte, Kubeflow or other Python frameworks that will deploy and host your workflow steps on Kubernetes for you
You can deploy Hashicorp Vault or other secret manager to Kubernetes and manage your secrets in a professional, safer way.
If your microservices need some standard, off-the-shelf software like Apache Kafka, RabbitMQ, Postgres, MongoDB, Redis, ClickHouse, etc, they all can be installed into Kubernetes with one command, and deploying additional cluster nodes will be just a matter of changing the number of replicas in the YAML file.

Summary

If you only need to host a couple of prototypes and microservices, Kubernetes will immediately improve their availability, and, more importantly, will be a future-proof, secure, scalable and standartized foundation for coming operational challenges.

Now when you’ve seen how easy the entry into the world of Kubernetes is, you don’t have the “steep learning curve” as an excuse for not using Kubernetes already today.

Why Kubernetes?

Summary

Leave a comment