Investigating the On-prem Kubernetes Network Stack

This introduction to the components of Kubernetes cluster networking is the fifth post of a multi-part series by Bikash Choudhury and Emily Watkins, where we discuss how to configure a Kubernetes-based AI Data Hub for data scientists. See the previous posts in the series, Visualizing Prometheus Data with Grafana Dashboards for FlashBlade™, Providing Data Science Environments with Kubernetes and FlashBlade, Storing a Private Docker Registry on FlashBlade S3, and Scraping FlashBlade Metrics Using a Prometheus Exporter.

Networking is a critical component of any Kubernetes cluster setup. A typical Kubernetes cluster consists of pods, each with one or more containers, running on one or more virtual machines (VMs) or bare-metal nodes. Containers communicate among themselves inside a pod, between pods on the same host, and across hosts to services.

By default, in a private datacenter deployment, Kubernetes provides a flat network without NAT (Network Address Translation). CNI plugins like Calico and Flannel add network policies and connectivity options on top of this, providing a unified networking solution. Without the network policies these plugins provide, multi-user environments, like those running AI/ML workloads, do not scale well because of the limited number of network connections available by default.

The Container Network Interface (CNI) is a standard created to interface between the container runtime and the rest of the network implementation on the host, enabling communication between pods and across hosts in the Kubernetes cluster. CNI plugins allow containers to be attached to and detached from the network without restarting it.

There are many CNI plugins that can be used with Docker-based Kubernetes clusters, such as Flannel, Calico, Weave, and Canal. Canal is a popular choice for on-prem Kubernetes workloads.

Canal is a combination of Calico and Flannel, and it builds an overlay network across all the nodes in the Kubernetes cluster. Flannel provides a simple Layer 2 (L2) overlay network that encapsulates data and tunnels it over Layer 3 (L3). This reduces network complexity without much additional configuration. Calico enforces network policy evaluation on top of the networking layer, using iptables, routing tables, and so on, for additional security and control.

In this deep-dive blog post, we walk through the components of an example configuration. The goal of this post is to orient readers to the key components they will configure as they create their own customized cluster network.

This post uses some real examples to illustrate the network flow in three different sections:

  1. Network communication path for a single container
  2. Pod communication with a CNI plugin
  3. Network configuration of a Kubernetes Service

Because network requirements are highly variable, implementations differ between organizations, and there are many networking solutions available. This post decodes the network stack of a Kubernetes cluster that uses Docker containers on a flat network with a CNI plugin called Canal.

Example Network Communication Path for Single Container

Note: in preparation for Kubernetes installation, some basic networking rules should be followed to minimize cluster complexity.

  1. Utilize a flat, public IP range to configure all the nodes in the cluster. Avoid any network hops between the cluster nodes.
  2. Utilize a flat, private IP range for internal communication for the Kubernetes pods.
  3. Reserve an IP range for Docker containers and the Docker bridge.
     a. Manually set up a gateway IP for the docker0 bridge.

In the /etc/docker/daemon.json file on each node, we assign an IP subnet to the default Docker bridge (docker0), which acts as the gateway for the containers.
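A minimal sketch of that file is shown below. The “bip” key is the standard daemon option that sets docker0’s address and subnet; the 172.17.0.5 gateway matches the address that appears later in this post, while the /24 mask is only an assumption for illustration.

    $ cat /etc/docker/daemon.json
    {
      "bip": "172.17.0.5/24"
    }
    # Restart the daemon so docker0 picks up the new gateway address.
    $ sudo systemctl restart docker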

The following diagram illustrates the network flow for a single Docker container. The eth0 interface inside the container is paired with a virtual ethernet (veth) interface that connects to the default Docker bridge, docker0. From docker0, packets are forwarded to the external network through the host’s interface.

Docker Network Stack

Let us now analyze the network flow for a Docker container named “busybox” as an example.
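If you want to follow along, a container like this can be started with a plain docker run; the image and the sleep command are just placeholders to keep the container alive.

    # Start a long-running busybox container to inspect.
    $ docker run -d --name busybox busybox sleep 3600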

The container is a process running on a host (VM or bare metal) and has its own Process ID (PID).

The following commands check the internal IP address assigned to the busybox container and then use it to examine the routing table, which describes how the container communicates with the bridge docker0.
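A sketch of those commands, assuming the busybox container started above (addresses will differ in your environment):

    # Internal IP address assigned to the container's eth0.
    $ docker exec busybox ip addr show eth0
    # Routing table inside the container; the default route points at the
    # docker0 gateway (172.17.0.5 in this post).
    $ docker exec busybox ip route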

With the network configured according to the recommendations above, the container’s veth can establish a connection with docker0.

Use “docker inspect” to identify the PID for the busybox container. (Note: using “ps” to find the PID will only show the external PID of the parent Docker process.)
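One way to do this is with an inspect format string (the container name matches the example above):

    # Print the container's PID as seen from the host.
    $ docker inspect -f '{{.State.Pid}}' busybox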

The following steps show how the busybox container uses the veth3f80d9c to reach the bridge docker0.

The following sequence of commands identifies how the busybox PID is mapped to the veth3f80d9c and connects to docker0.
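A hedged sketch of that sequence, assuming the PID comes from the docker inspect command above; the interface index and the veth3f80d9c name are the values used in this post and will differ on other hosts.

    # Save the container's PID.
    $ PID=$(docker inspect -f '{{.State.Pid}}' busybox)
    # Inside the container's network namespace, eth0 is shown as "eth0@ifN",
    # where N is the interface index of its veth peer on the host.
    $ nsenter -t $PID -n ip link show eth0
    # On the host, list the interfaces attached to docker0 and match the
    # index to find the peer (veth3f80d9c in this example).
    $ ip link show master docker0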

We can also trace the broadcast path: from within the busybox container, broadcast packets terminate at the MAC address “02:42:2d:46:c5:ed”.
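One way to reproduce this (a sketch: ping the docker0 gateway from inside the container, then read the container’s ARP cache):

    # Ping the docker0 gateway, then check which hardware address answered
    # (02:42:2d:46:c5:ed in this post).
    $ docker exec busybox ping -c 1 172.17.0.5
    $ docker exec busybox cat /proc/net/arp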

And we can see that the MAC address corresponds to the Docker bridge, docker0.
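On the host, this can be checked with:

    # The link-layer address of the Docker bridge.
    $ ip link show docker0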

Once the busybox container is able to reach the bridge docker0, packets can be forwarded through the host’s interface to the external network. In the following example, the FlashBlade IP address 10.21.236.201 is pingable from within the busybox container.
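A sketch of that check:

    # External connectivity test from inside the container; 10.21.236.201
    # is the FlashBlade data VIP used as the example target in this post.
    $ docker exec busybox ping -c 3 10.21.236.201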

This initial busybox container setup on a single node is a quick test to check whether the core network is functioning properly.

Pod Communication with a CNI Plugin

Now that we’ve illustrated the network communication path for a single container, we’ll walk through the network communication path between pods. Pods are the basic Kubernetes objects that are created, managed, and destroyed during the lifetime of an application.

There are two namespaces involved in Kubernetes pod networking: the pod namespace (podns) and the host, or root, namespace (rootns). The pod’s eth0 is part of the podns. The veth, the bridge docker0, the routing tables, and the host port are part of the rootns. The nsenter command (short for “namespace enter”) is used to map the container PID to the veth in the podns and the rootns.

Example CNI usage:

An example pod, called “pure-exporter”, is used to scrape metrics from FlashBlade into Prometheus. (For more information on how the custom Pure exporter scrapes FlashBlade metrics for Grafana dashboards, see this blog post.)

The pure-exporter pod’s veth is named cali65bb5da2c3b.
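As a sketch of how that veth can be identified, the same nsenter approach from the Docker section applies; this assumes Docker is the container runtime and that a single pure-exporter container is running on the node.

    # Find the PID of the pure-exporter container on its node.
    $ PID=$(docker inspect -f '{{.State.Pid}}' \
        $(docker ps -q --filter name=pure-exporter | head -1))
    # eth0 inside the pod namespace reports the index of its host-side peer.
    $ nsenter -t $PID -n ip link show eth0
    # Match that index against the Calico-managed veth interfaces on the host
    # (cali65bb5da2c3b in this example).
    $ ip link show | grep cali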

We can locate the pod’s internal IP address via kubectl.
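For example (the pod name may carry a generated suffix in your cluster):

    # Shows the pod IP (10.42.6.25 in this post) and the node it runs on.
    $ kubectl get pods -o wide | grep pure-exporter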

As expected, the routing table shows the following:

  • For the example pod, the veth cali65bb5da2c3b is paired with its IP address 10.42.6.25.
  • The bridge docker0 that we configured earlier is paired with its gateway address 172.17.0.5.

We can also see additional routing components:

  • The Flannel component of the Canal CNI uses flannel.1 as the overlay (tunnel) interface to communicate with pods on the six other nodes in the network.
  • The eno3 physical interface is part of the bond0.2236 VLAN and is responsible for physical packet transfer over the network.
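All of these entries come from the routing table of the node hosting the pod, which can be inspected directly; the output on your nodes will show your own pod subnets and interface names.

    # Kernel routing table on the Kubernetes node: expect per-pod routes via
    # the cali* veths, per-node pod subnets via flannel.1, the docker0
    # subnet, and the default route out of the physical/VLAN interface.
    $ ip route show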

Network Configuration of Services

The Pure exporter is one of the services running in our cluster.
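One way to list it is shown below; the service name follows the example in this post, and a -n <namespace> flag may be needed in your cluster.

    # TYPE shows NodePort and the PORT(S) column shows the
    # internal-port:node-port mapping (9491:31175 in this post).
    $ kubectl get svc pure-exporter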

The pure-exporter service is of type NodePort, which maps the internal port 9491 to host port 31175 for external communication.

Containers in a pod are ephemeral: they disappear when the pod is destroyed. Applications achieve persistence through an abstraction called a “service”, which logically groups a set of pods and makes them accessible over the network. While pods are the back-end processes that spin up and down, the service is the front-end process that stays alive and provides a stable endpoint for the life of the application.

Load Balancer

In addition to NodePorts, a load balancer provides an external IP address for services in the Kubernetes cluster, so that clients and end users outside the cluster can access the application over HTTP (or HTTPS), as shown in the diagram below.

Load balancer for Kubernetes cluster for clients and end-users outside the cluster

MetalLB is one of the most popular load balancers for on-premises Kubernetes clusters. Applications in machine-learning pipelines with multi-user access benefit from a load balancer that distributes the workload across nodes. In this example, we use MetalLB to balance traffic for our JupyterHub-as-a-Service.

The MetalLB ConfigMap should include a predefined IP address pool from which public IPs are automatically assigned to services and advertised on the network.
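A minimal sketch of such a ConfigMap, assuming MetalLB’s layer 2 mode and the legacy ConfigMap-based configuration (newer MetalLB releases use custom resources instead); the address range is purely illustrative and must be routable in your environment.

    # Save as metallb-config.yaml: a single address pool in layer 2 mode.
    # The address range below is only an example.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        address-pools:
        - name: default
          protocol: layer2
          addresses:
          - 10.21.236.50-10.21.236.99

    # Then apply it:
    $ kubectl apply -f metallb-config.yaml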

The proxy-public service, of Service type “LoadBalancer”, picks up a public IP address after JupyterHub is installed in the Kubernetes cluster.
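A quick check (the jhub namespace follows the common JupyterHub Helm chart convention and is an assumption here):

    # EXTERNAL-IP should show an address drawn from the MetalLB pool.
    $ kubectl get svc proxy-public -n jhub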

Networking for our hosted JupyterHub-as-a-Service is now complete. For further steps on deploying the service, stay tuned for our upcoming JupyterHub blog post.

Conclusion

Networking is core to any Kubernetes cluster, and there are many ways to build the network stack for applications running in Kubernetes pods. We recommend getting the basics right by using a flat network and avoiding unnecessary layers of complexity. With a strong, stable network layout in place, load-distributing layers for TCP and HTTP requests on Kubernetes become possible.