
Configure and measure realtime worker performance on OpenShift

Depending on your workloads, you may want workers running a realtime kernel in your cluster.
Configuring those nodes and enrolling them into an OpenShift cluster is out of the scope of this article; we assume that the worker node is already up and running with a realtime kernel.
Once you have it, how do you properly schedule workloads that make use of it?

CPU manager and kubelet static policy - https://docs.openshift.com/container-platform/4.2/scalability_and_performance/using-cpu-manager.html

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs. This is useful in several cases; in ours, for low-latency applications. To enable CPU Manager, follow these steps:
  • Label the node on which CPU Manager will be enabled:
    # oc label node perf-node.example.com cpumanager=true
  • Edit the worker machineconfigpool and add a label that the custom KubeletConfig will use to select the pool:
    # oc edit machineconfigpool worker
    metadata:
      creationTimestamp: 2019-xx-xxx
      generation: 3
      labels:
        custom-kubelet: cpumanager-enabled
     
  • Create the following KubeletConfig custom resource, saved as cpumanager-kubeletconfig.yaml, to enable the static policy:
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
        cpuManagerPolicy: static
        cpuManagerReconcilePeriod: 5s
        systemReserved:
          cpu: 1
          memory: 512Mi
        kubeReserved:
          cpu: 2
          memory: 512Mi
    # oc create -f cpumanager-kubeletconfig.yaml
  •  This adds the CPU Manager settings to the kubelet configuration. The machine config pool applies the change and reboots the affected nodes. Once a node is back up, check that the policy has been applied:
    # oc debug node/perf-node.example.com
    sh-4.4# grep cpuManager /host/etc/kubernetes/kubelet.conf
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
  •  To make use of it, create a pod whose CPU and memory requests are equal to its limits, with a whole number of CPUs. Such a pod gets the Guaranteed QoS class, and the static policy assigns it exclusive cores:
    # cat cpumanager-pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      generateName: cpumanager-
    spec:
      containers:
      - name: cpumanager
        image: gcr.io/google_containers/pause-amd64:3.0
        resources:
          requests:
            cpu: 1
            memory: "1G"
          limits:
            cpu: 1
            memory: "1G"
      nodeSelector:
        cpumanager: "true"
    # oc create -f cpumanager-pod.yaml
    # oc describe pod cpumanager
    Name:               cpumanager-6cqz7
    Namespace:          default
    Priority:           0
    PriorityClassName:  <none>
    Node:  perf-node.example.com/xxx.xx.xx.xxx
    ...
        Limits:
          cpu:     1
          memory:  1G
        Requests:
          cpu:     1
          memory:  1G
    ...
    QoS Class:       Guaranteed
    Node-Selectors:  cpumanager=true
  •  Note that the pod runs with the Guaranteed QoS class. Burstable and BestEffort pods are confined to the shared CPU pool, so they cannot run on the cores allocated to this Guaranteed pod.
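To check the pinning from inside a container, you can read the list of CPUs the process is allowed to run on, for example the Cpus_allowed_list line of /proc/self/status, and count them. This needs an image that ships a shell; the pause image used above does not. The count_cpus function below is a hypothetical sketch, not part of OpenShift or the kubelet, and it assumes the plain Linux cpulist format (comma-separated CPUs and ranges such as 0-3,8,10-11).

```shell
#!/bin/sh
# count_cpus: count the CPUs in a Linux cpulist string such as "0-3,8,10-11".
# Hypothetical helper for this article; not part of OpenShift or the kubelet.
count_cpus() {
  local list="$1" total=0 item first last
  local IFS=,                 # split the list on commas only
  for item in $list; do
    case $item in
      *-*)                    # a range such as "10-11"
        first=${item%-*}
        last=${item#*-}
        total=$(( total + last - first + 1 ))
        ;;
      "") ;;                  # ignore empty fields
      *)                      # a single CPU such as "8"
        total=$(( total + 1 ))
        ;;
    esac
  done
  echo "$total"
}
```

For example, in a Guaranteed container that requests 1 CPU, count_cpus "$(awk '/^Cpus_allowed_list/ {print $2}' /proc/self/status)" should print 1, while a Burstable container reports the size of the shared pool.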

How to measure realtime performance

Once the kubelet has been correctly configured, it is time to measure realtime performance. This takes two steps:
  1. Put the system under stress, so that the tests run under heavy load.
  2. Run cyclictest in a Guaranteed pod inside the OpenShift cluster, and collect the results there.

Put the system under stress

This can be achieved by running stress pods on the free cores. Imagine a worker node with 36 cores. We reserve 1 core for the system (systemReserved) and 2 cores for Kubernetes components (kubeReserved), which leaves 33 cores. We keep 4 of them for cyclictest, so we need to stress the remaining 29 cores, with one single-CPU stress replica per core. Do the same calculation for your own system:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress 
spec:
  replicas: 29
  selector:
    matchLabels:
      name: stress 
  template:
    metadata:
      labels:
        name: stress 
    spec:
      containers:
      - name: stress 
        image: "docker.io/cscojianzhan/stress"
        resources:
          limits:
            memory: "200Mi"
            cpu: "1"
          requests:
            memory: "200Mi"
            cpu: "1"
      nodeSelector:
        node-role.kubernetes.io/worker-rt: ""
# oc apply -f stress_pod.yaml 
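The core budget described before the Deployment boils down to a subtraction: allocatable cores are the total minus systemReserved and kubeReserved, and every allocatable core not kept for cyclictest gets one 1-CPU stress replica. The stress_replicas function below is a hypothetical sketch of that arithmetic. It assumes whole-core reservations and, like the calculation above, ignores CPU already requested by other pods on the node (for example DaemonSets), which lowers the number of replicas that actually fit.

```shell
#!/bin/sh
# stress_replicas: number of 1-CPU stress replicas that fit on the worker
# once the reserved cores and the cyclictest cores are set aside.
# Hypothetical helper; arguments mirror the numbers used in this article:
#   $1 total cores   $2 systemReserved cores   $3 kubeReserved cores
#   $4 cores kept for the cyclictest pod
stress_replicas() {
  local total="$1" system="$2" kube="$3" cyclic="$4"
  local allocatable=$(( total - system - kube ))
  local replicas=$(( allocatable - cyclic ))
  if [ "$replicas" -lt 1 ]; then
    echo "not enough cores: allocatable=$allocatable, cyclictest=$cyclic" >&2
    return 1
  fi
  echo "$replicas"
}
```

With the numbers from this article, stress_replicas 36 1 2 4 prints 29, the replica count used in the Deployment. On a different node, the result could be passed to oc scale deployment/stress --replicas=<count>.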

Measure with cyclictest

The measurement needs to run in a pod with the Guaranteed QoS class, so we run a Guaranteed pod with 4 CPUs and collect its results:

apiVersion: v1 
kind: Pod 
metadata:
  name: cyclictest 
spec:
  restartPolicy: Never 
  containers:
  - name: cyclictest
    image:  docker.io/cscojianzhan/cyclictest
    resources:
      limits:
        cpu: 4
        memory: "400Mi"
      requests:
        cpu: 4
        memory: "400Mi"
    env:
    - name: DURATION
      value: "10m"
    securityContext:
      capabilities:
        add:
          - SYS_NICE
          - SYS_RAWIO
          - IPC_LOCK
    volumeMounts:
    - mountPath: /tmp
      name: results-volume
    - mountPath: /dev/cpu_dma_latency
      name: cstate
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
  volumes:
  - name: results-volume
    hostPath:
      path: /tmp
  - name: cstate
    hostPath:
      path: /dev/cpu_dma_latency

Note that the pod's requests and limits are both 4 CPUs and 400Mi of memory. This gives it the Guaranteed QoS class, so the static policy pins it to 4 exclusive cores. Note also the capabilities the pod needs to run cyclictest: SYS_NICE lets it set realtime scheduling priorities, and IPC_LOCK lets it lock its memory.
cyclictest runs for 10 minutes, as set by the DURATION environment variable. You can change this value; runs of 24 hours or more give more consistent results.
The final results are written to the /tmp directory of the pod (see results-volume), which is mapped through hostPath to /tmp on the worker node itself. To read them, log into the worker node (for example with oc debug node/<node>, where the host filesystem is mounted under /host) and examine the /tmp/cyclictest* files. There you can check the latencies (max, min and average) and whether there were any histogram overflows. The longer the test runs, the more accurate the results.
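If cyclictest runs in histogram mode (-h), the end of its report contains summary lines such as "# Max Latencies:" and "# Histogram Overflows:", with one value per measurement thread. The summarize_cyclictest function below is a hypothetical sketch that extracts the worst max latency across threads and the total number of overflows. The exact layout of the /tmp/cyclictest* files written by the docker.io/cscojianzhan/cyclictest image is an assumption here; the function only relies on those two summary lines. Latencies are in microseconds, cyclictest's default unit.

```shell
#!/bin/sh
# summarize_cyclictest: print the worst per-thread max latency and the total
# number of histogram overflows found in cyclictest histogram reports.
# Assumes cyclictest's "-h" summary lines, for example:
#   # Max Latencies: 00012 00010
#   # Histogram Overflows: 00000 00000
# Reads the files given as arguments, or standard input if there are none.
summarize_cyclictest() {
  awk '
    /^# Max Latencies:/ {
      # fields 4..NF hold one value per thread
      for (i = 4; i <= NF; i++) if ($i + 0 > max) max = $i + 0
    }
    /^# Histogram Overflows:/ {
      for (i = 4; i <= NF; i++) overflows += $i
    }
    END { printf "max_latency_us=%d overflows=%d\n", max, overflows }
  ' "$@"
}
```

From the oc debug node shell shown earlier, where the host's /tmp appears as /host/tmp, this could be run as summarize_cyclictest /host/tmp/cyclictest*. A non-zero overflow count means some samples fell outside the histogram range.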
For more information about cyclictest, see: https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest/start#cyclictest
