# Set Up Kubernetes on bare metal

<!-- vim-markdown-toc GFM -->

* [Apt repo](#apt-repo)
  * [Docker](#docker)
  * [Kubernetes](#kubernetes)
* [containerd](#containerd)
* [General Concepts](#general-concepts)
  * [Enable cri](#enable-cri)
  * [Enable systemd cgroups](#enable-systemd-cgroups)
    * [Kubernetes - Docker compatibility](#kubernetes---docker-compatibility)
* [kubeadm-config.yaml](#kubeadm-configyaml)
* [kubeadm init](#kubeadm-init)
  * [Old, calico-based setup](#old-calico-based-setup)
  * [New](#new)
    * [Set up kubeconfig](#set-up-kubeconfig)
    * [Install Cilium (kube-proxy replacement mode)](#install-cilium-kube-proxy-replacement-mode)
    * [Final validation](#final-validation)
    * [Other issues](#other-issues)
    * [Pod not allowed to run on node because of default control-plane taint](#pod-not-allowed-to-run-on-node-because-of-default-control-plane-taint)
    * [Cilium nodeport issue](#cilium-nodeport-issue)
    * [Cilium LoadBalancer](#cilium-loadbalancer)
    * [Cilium Ingress](#cilium-ingress)
    * [Next Steps](#next-steps)
    * [High-impact next steps](#high-impact-next-steps)

<!-- vim-markdown-toc -->


Build a modern kubeadm + Cilium (no kube-proxy) cluster from scratch, including:
  - swap-enabled kubelet tuning
  - nftables awareness
  - DNS debugging (CoreDNS + musl edge case)
  - eBPF datapath validation


## Apt repo

- install `containerd.io`
- install `kubeadm`, `kubectl`, `kubelet`

### Docker

- needed for `containerd.io`

/etc/apt/sources.list.d/docker.sources
```deb822sources
X-Repolib-Name: Docker
# sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
Types: deb
URIs: https://download.docker.com/linux/debian
Suites: trixie
Components: stable
Architectures: amd64
Signed-By: /etc/apt/keyrings/docker.asc
Enabled: yes
```

### Kubernetes

- install kubeadm kubectl kubelet kubernetes-cni

/etc/apt/preferences.d/99kubectl
```
#  500 https://pkgs.k8s.io/core:/stable:/v1.31/deb  Packages
#      release o=obs://build.opensuse.org/isv:kubernetes:core:stable:v1.31/deb,n=deb,l=isv:kubernetes:core:stable:v1.31,c=
#      origin pkgs.k8s.io
Package: *
Pin: release l=isv:kubernetes:core:stable:v1.36
Pin-Priority: 550
```

/etc/apt/sources.list.d/kubernetes.sources
```deb822sources
X-Repolib-Name: Kubernetes
# https://v1-34.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
# sudo wget -O /etc/apt/keyrings/kubernetes-apt-keyring.asc 'https://keyserver.ubuntu.com/pks/lookup?op=get&search=0xDE15B14486CD377B9E876E1A234654DA9A296436'
Types: deb
URIs: https://pkgs.k8s.io/core:/stable:/v1.36/deb/
Suites: /
#Components:
Signed-By: /etc/apt/keyrings/kubernetes-apt-keyring.asc
Enabled: yes

X-Repolib-Name: cri-o
# sudo wget -O /etc/apt/keyrings/cri-o-apt-keyring.asc 'https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x85B67D5C50100B1AC8CEFE49CBA9C85640A2B579'
Types: deb
URIs: https://download.opensuse.org/repositories/isv:/cri-o:/stable:/v1.35/deb/
Suites: /
Signed-By: /etc/apt/keyrings/cri-o-apt-keyring.asc
Enabled: yes
```
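
With both repositories in place, the packages listed at the top of this section install with the standard apt workflow; holding the cluster packages is the usual guard against surprise upgrades:
```bash
sudo apt-get update
sudo apt-get install containerd.io kubeadm kubectl kubelet kubernetes-cni
# keep the cluster packages from being upgraded behind kubeadm's back
sudo apt-mark hold kubeadm kubectl kubelet
```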

## containerd

- install either `docker-ce` or just `containerd.io`
- Docker runs fine with Kubernetes
- Install `containerd.io` from Docker repo
- <https://download.docker.com/linux/debian>


## General Concepts

Containerd containers are usually application containers. They’re part of the modern container stack used by Docker/Kubernetes and are optimized for running single apps or services with image-based workflows.

systemd-nspawn containers are more like lightweight system containers. They’re commonly used to boot and manage a fuller Linux userspace, often with a writable root filesystem, and integrate tightly with systemd.


### Enable cri

`sudo crictl info`

/etc/containerd/config.toml

default has:
```toml
disabled_plugins = ["cri"]
```
change to:
```toml
disabled_plugins = []
```

### Enable systemd cgroups

`crictl info | grep -i cgroup`

`containerd config default | sudo tee /etc/containerd/config.toml`
```toml
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
SystemdCgroup = true
```
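
After changing the config, restart containerd and confirm the driver took effect (the grep target matches how current containerd/crictl report it; exact output wording may differ by version):
```bash
sudo systemctl restart containerd
sudo crictl info | grep -i systemdcgroup
```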

#### Kubernetes - Docker compatibility

Prefer aligning everything to systemd cgroups:

/etc/docker/daemon.json
```json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```
`sudo systemctl restart docker`


## kubeadm-config.yaml
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.12.35
  bindPort: 6444
nodeRegistration:
  criSocket: "unix:///run/containerd/containerd.sock"
  kubeletExtraArgs:
    feature-gates: "NodeSwap=true"

---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: stable
networking:
  podSubnet: 10.90.0.0/16

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap
cgroupDriver: systemd
```
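
Before running init, the file can be sanity-checked (assuming the installed kubeadm is recent enough to have `config validate`):
```bash
kubeadm config validate --config kubeadm-config.yaml
# compare against the full defaults if something looks off
kubeadm config print init-defaults
```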

## kubeadm init


### Old, calico-based setup
```bash
# Before running kubeadm init, load:
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
```
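
If this path is used (e.g. kube-proxy in IPVS mode), the modules can be persisted across reboots with a modules-load.d drop-in; a small sketch:
```bash
cat <<'EOF' | sudo tee /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
```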

### New

Modern, optimized cluster:
  - kubeadm control plane (custom IP + port)
  - swap-aware kubelet (non-default, advanced)
  - containerd with systemd cgroups
  - no kube-proxy
  - eBPF networking via Cilium
  - bridge-based networking (br0)
  - nftables-compatible stack (via kernel path, not legacy iptables)


Keep in mind what this config enables:
  - swap (NodeSwap)
  - custom eviction tuning

That is powerful, but:
  - keep an eye on memory pressure
  - monitor with `kubectl top nodes`
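
`kubectl top` relies on the metrics-server addon, which kubeadm does not install. A sketch for once the cluster is up, assuming the upstream manifest and the common lab-only `--kubelet-insecure-tls` workaround for kubeadm's self-signed kubelet certs:
```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# lab-only: let metrics-server scrape kubelets without verifying their certs
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl top nodes
```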

```bash
# Initialize WITHOUT kube-proxy:
kubeadm init --config kubeadm-config.yaml --skip-phases=addon/kube-proxy
```
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:
```bash
  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
```
Alternatively, if you are the root user, you can run:
```bash
  export KUBECONFIG=/etc/kubernetes/admin.conf
```
You should now deploy a pod network to the cluster.
Run `kubectl apply -f [podnetwork].yaml` with one of the options listed at:
  <https://kubernetes.io/docs/concepts/cluster-administration/addons/>

Then you can join any number of worker nodes by running the following on each as root:
```bash
kubeadm join 192.168.12.35:6444 --token r46hc6.yl399kkf8w0cq3r4 \
  --discovery-token-ca-cert-hash sha256:6a04ba0e15a6231254b40688fac15f408c2cf99eaecf81dd8ddd49f26fa0df81
```
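
The bootstrap token in that output expires (24h by default); a fresh join command can be printed on the control plane at any time:
```bash
kubeadm token create --print-join-command
```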


#### Set up kubeconfig
```bash
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# verify
kubectl get nodes
kubectl get nodes -o wide

kubectl cluster-info


kubectl get componentstatuses  # deprecated since v1.19, still a quick control-plane sanity check
kubectl get pods -n kube-system

kubectl -n kube-system logs cilium-mdr96
# if multiple containers exist, use:
kubectl -n kube-system logs cilium-mdr96 -c cilium-agent
```

#### Install Cilium (kube-proxy replacement mode)
```bash
curl -L --remote-name https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
tar xzvf cilium-linux-amd64.tar.gz
sudo mv cilium /usr/local/bin/


# Make sure bpffs is mounted
mount | grep bpf
# If not mounted:
sudo mount -t bpf bpf /sys/fs/bpf
# persist
echo "bpffs /sys/fs/bpf bpf defaults 0 0" | sudo tee -a /etc/fstab


# Install Cilium with eBPF + kube-proxy replacement
cilium install \
  --version 1.15.6 \
  --pod-cidr 10.90.0.0/16 \
  --kube-proxy-replacement strict \
  --set ipam.mode=kubernetes \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set autoDirectNodeRoutes=false


# remove broken install
sudo cilium uninstall --kubeconfig /etc/kubernetes/admin.conf


# Verify
sudo cilium status
sudo cilium status --kubeconfig /etc/kubernetes/admin.conf
kubectl get nodes
kubectl get pods -n kube-system

# 5.10+ minimum, 5.15+ recommended
uname -r

# Enable
mount bpffs -t bpf /sys/fs/bpf

# Persist
echo "bpffs /sys/fs/bpf bpf defaults 0 0" >> /etc/fstab

# Enable native routing (if the L2 network is simple)
#  Removes encapsulation → best performance. Pass these to cilium install/upgrade:
#   --set routingMode=native
#   --set autoDirectNodeRoutes=true
```
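
Once the agent is running, the CLI's built-in end-to-end check exercises pod-to-pod, service, and egress paths (it deploys test workloads into a `cilium-test` namespace):
```bash
sudo cilium status --wait --kubeconfig /etc/kubernetes/admin.conf
sudo cilium connectivity test --kubeconfig /etc/kubernetes/admin.conf
```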

#### Final validation
```bash
# CoreDNS
kubectl -n kube-system get pods -l k8s-app=kube-dns

# End-to-end test pod
kubectl run test --image=busybox -it --rm -- sh
# test networking in the shell:
ping -c 3 8.8.8.8
# BusyBox uses a very minimal DNS client, and short names depend on search domains + ndots behavior
#nslookup kubernetes.default  # will fail to resolve
# Instead:
nslookup kubernetes.default.svc.cluster.local

kubectl delete pod test


# Better test pod:
kubectl run test --image=debian:stable-slim -it --rm -- bash
# inside pod run:
apt update && apt install -y dnsutils
nslookup kubernetes.default


# Alpine test pod:
kubectl run test --image=alpine -it --rm -- sh
# inside pod run:
apk add bind-tools
nslookup kubernetes.default


# DNS search domain issue workaround (NOT NEEDED):
kubectl run test \
  --image=alpine \
  -it --rm \
  --overrides='
{
  "spec": {
    "dnsConfig": {
      "options": [
        {"name": "ndots", "value": "2"}
      ]
    }
  }
}' \
  -- sh
```

#### Other issues

ndots:5 + Alpine (musl) resolver behavior

With the default `ndots:5`, names with fewer than five dots, such as `dl-cdn.alpinelinux.org`, are expanded through every search domain first, and musl sometimes fails to recover correctly from the resulting NXDOMAIN chain.

/etc/systemd/network/10-br0.network
```systemd
# comment out:
Domains = dunes.ixlo.io
```

Recommended approach (practical + production-friendly)

Add this to workloads that use Alpine (or anything musl-based):
```yaml
dnsConfig:
  options:
    - name: ndots
      value: "2"
```
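
For placement, `dnsConfig` sits at the pod spec level (under `spec.template.spec` in a Deployment). A minimal hypothetical pod as a sketch:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alpine-ndots            # hypothetical example name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: shell
      image: alpine
      command: ["sleep", "infinity"]
```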
```bash
# NOT NEEDED
kubectl -n kube-system edit configmap coredns
# replace:
#   forward . /etc/resolv.conf {
#       max_concurrent 1000
#   }
# with:
#forward . 1.1.1.1 1.0.0.1 {
#    max_concurrent 1000
#}
# or with:
# forward . 1.1.1.1 1.0.0.1
# Apply change:
kubectl -n kube-system rollout restart deployment coredns


# Inspect CoreDNS resolv.conf from inside the pod (note: likely fails, the image ships no cat; see below)
kubectl -n kube-system exec -it deployment/coredns -- cat /etc/resolv.conf

# Check CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Verify the Kubernetes service exists
# (it always lives in the default namespace, so -A is not needed)
kubectl get svc kubernetes

# Check CoreDNS config
kubectl -n kube-system get configmap coredns -o yaml
# look for:
# kubernetes cluster.local in-addr.arpa ip6.arpa

# List all services
kubectl get svc -A


# CoreDNS containers are extremely minimal and don’t include tools like cat
#  so this will fail:
kubectl -n kube-system exec -it deployment/coredns -- cat /etc/resolv.conf
# attach temporary container with tools into pod:
kubectl -n kube-system debug -it deployment/coredns --image=busybox -- sh
# or exec using a shell:
# some CoreDNS images don’t even include sh, so this may fail:
kubectl -n kube-system exec -it deployment/coredns -- /bin/sh


# attach a debug container to the CoreDNS pod:
# get the CoreDNS pod name:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl -n kube-system debug -it <pod-name> --image=alpine --target=coredns -- sh
# inside container:
cat /etc/resolv.conf
apk add --no-cache bind-tools
nslookup google.com 1.1.1.1
nslookup google.com
```

#### Pod not allowed to run on node because of default control-plane taint

By default, Kubernetes protects control-plane nodes with:
`node-role.kubernetes.io/control-plane:NoSchedule`
which prevents workloads from running there.

fix (single-node cluster):
`kubectl taint nodes --all node-role.kubernetes.io/control-plane-`


Instead of removing the taint globally, you can add a toleration per pod:
```yaml
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"
```
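
In a Deployment the toleration goes under the pod template spec, not the top-level spec. A minimal hypothetical sketch:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx                   # hypothetical example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: nginx
          image: nginx
```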

#### Cilium nodeport issue
```bash
sudo cilium upgrade --kubeconfig /etc/kubernetes/admin.conf --set nodePort.enabled=true
kubectl -n kube-system rollout restart ds/cilium
# wait for 'running'
kubectl -n kube-system get pods -l k8s-app=cilium


kubectl get endpoints nginx

# Other cilium commands
curl -v http://127.0.0.1:32049

sudo cilium status --kubeconfig /etc/kubernetes/admin.conf


sudo nft monitor trace
# then try connecting to http://192.168.12.35:32049/

sudo iptables -L KUBE-FIREWALL -n -v
```

#### Cilium LoadBalancer

- NodePort is kind of a “debug-level” exposure method
- better to use LoadBalancer
```bash
kubectl expose deployment nginx --type=LoadBalancer --port=80
```
Then access at: http://192.168.12.35
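
On bare metal nothing hands out external IPs by itself, so a LoadBalancer service stays `<pending>` unless Cilium's LB-IPAM (or something like MetalLB) is given an address pool. A hedged sketch for Cilium 1.15, assuming `192.168.12.240/29` is an unused range on the LAN (CRD fields vary between Cilium releases, so check the docs for the installed version):
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lan-pool                # assumed name
spec:
  blocks:
    - cidr: 192.168.12.240/29   # assumed free range on the node's L2 segment
```
Note that advertising those addresses on the LAN additionally needs Cilium's L2 announcements or BGP; without that, the assigned IP is only reachable from the node itself.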


#### Cilium Ingress

- Best option
- Gives clean URLs like: http://nginx.local
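
Cilium's ingress controller is disabled by default; it is switched on via Helm values and then consumed through a standard Ingress object. A hedged sketch, reusing the nginx service from the examples above and an assumed `nginx.local` host name:
```bash
# ingressController.* are standard Cilium Helm values
sudo cilium upgrade --kubeconfig /etc/kubernetes/admin.conf \
  --set ingressController.enabled=true
kubectl -n kube-system rollout restart ds/cilium deployment/cilium-operator

# hypothetical Ingress routing nginx.local to the nginx service
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx
spec:
  ingressClassName: cilium
  rules:
    - host: nginx.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx
                port:
                  number: 80
EOF
```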


#### Next Steps

Suggested next steps (in order):
  - ingress (Cilium native)
  - TLS (cert-manager)
  - persistent storage
  - multi-node expansion


#### High-impact next steps

1. Enable Hubble (strongly recommended)
```bash
cilium hubble enable --ui --kubeconfig /etc/kubernetes/admin.conf
# opens a port-forward to the Hubble UI on localhost:12000
cilium hubble ui --kubeconfig /etc/kubernetes/admin.conf
```
Open:
http://localhost:12000
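
For flow logs on the command line (instead of the UI), the separate hubble binary can talk to Hubble Relay over a port-forward; a sketch, assuming the standard release asset name:
```bash
curl -L --remote-name https://github.com/cilium/hubble/releases/latest/download/hubble-linux-amd64.tar.gz
tar xzvf hubble-linux-amd64.tar.gz
sudo mv hubble /usr/local/bin/

# forward the relay API locally, then watch flows
cilium hubble port-forward --kubeconfig /etc/kubernetes/admin.conf &
hubble observe --follow
```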


2. Remove VXLAN (performance boost)

Benefits:
  - lower latency
  - higher throughput
  - simpler dataplane


Since you’re on a single node / same L2 network:
```bash
cilium upgrade \
  --kubeconfig /etc/kubernetes/admin.conf \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set devices=br0
```
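
After the upgrade, restart the agent and confirm the datapath actually switched (a sketch; key names in `cilium config view` can differ slightly between releases):
```bash
kubectl -n kube-system rollout restart ds/cilium
sudo cilium status --wait --kubeconfig /etc/kubernetes/admin.conf
# expect routing-mode=native rather than tunnel/vxlan
sudo cilium config view --kubeconfig /etc/kubernetes/admin.conf | grep -i -E 'routing|tunnel'
```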

3. Allow scheduling on control plane (if single-node lab)
```bash
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```

4. Install a test workload
```bash
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get svc

kubectl delete svc nginx

kubectl get endpoints nginx
```
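
To hit the NodePort without hard-coding the randomly assigned port, it can be read back with jsonpath (the node IP below is this cluster's control-plane address):
```bash
NODE_PORT=$(kubectl get svc nginx -o jsonpath='{.spec.ports[0].nodePort}')
curl -v "http://192.168.12.35:${NODE_PORT}"
```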

