Kubernetes - Pod Security Policies

A fully fleshed out example with exception management

My team is building a general-purpose Kubernetes cluster at Square. As part of that build-out, we implemented Pod Security Policies (PSPs) to protect our clusters from many container escape risks.

This post focuses on how to do a full deployment of Pod Security Policies with everything locked down, and how to grant exceptions. For an introduction to PSPs, please read the official Kubernetes PSP documentation. The official documentation does a good job of walking through a simple PSP example, but it does not describe how a fully functioning PSP system would work, how it would be applied by default, or how exceptions could be managed. This post was written to help fill that gap. At the end, we will discuss several pitfalls you might run into while implementing PSPs, along with some troubleshooting tactics.

Why Bother?

Kubernetes is a very powerful and complicated tool, and that complexity has led to several security issues within the community. Most Kubernetes security failures fall into two broad camps:

  1. Attack a workload, escape containment, and attack the cluster/host.
  2. Attack the Kubernetes API from the outside.

Leveraging all the capabilities of PSPs allows you to dramatically increase the difficulty of a container escape, which shrinks the first major attack vector. PSPs are not a silver bullet (nothing is), but they are a powerful tool to protect your clusters and workloads.

NOTE: the above is a simplification of a complex topic. For more information, Microsoft recently released an Attack Matrix for Kubernetes. PSPs allow you to restrict 9 of those 31 challenges.

A Fully Restrictive PSP (up to kube 1.15)

Some notes before diving in:

  • PSPs demonstrated in this post will work with later versions, but you should go through the official docs to ensure all available controls are enumerated.
  • In an effort to simplify exception management, the PSP examples in this post take very few "shortcuts" in their syntax and explicitly enumerate all capabilities/permissions. Example: for requiredDropCapabilities we list every capability rather than using "ALL".
  • The PSPs found in this post avoid AppArmor and SELinux management as either topic is a blog post series unto itself. Only the stubs are mentioned.
  • PSPs for Kubernetes on Windows are left as an exercise for the reader.
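
For comparison, the "shortcut" form that this post avoids would look roughly like the fragment below. It is a sketch of a single spec field, not a complete policy. The long-form list used in the rest of this post is easier to carve individual exceptions out of.

```yaml
# Fragment of a PodSecurityPolicy spec (not a complete manifest).
spec:
  requiredDropCapabilities:
    - ALL          # shorthand: drop every capability instead of naming each one
```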

The fully restrictive PSP:

---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: 'restricted psp for all standard use-cases'
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: restricted
spec:
#  allowedCapabilities:
#    - NET_BIND_SERVICE                             # useful if a workload needs a low port. but by default it's commented out
#  allowedHostPaths:
#    - pathPrefix: "/some/path"                     # Example of how to allow a specific host path
#      readOnly: true
  allowPrivilegeEscalation: false                   # Disallow privilege escalation to any special capabilities
  allowedProcMountTypes:                            # Disallow full /proc mounts, only allow the "default" masked /proc
    - Default
  fsGroup:                                          # disallow root fsGroups for volume mounts
    rule: MustRunAs
    ranges:
      - max: 65535
        min: 1
  hostIPC: false                                    # disallow sharing the host IPC namespace
  hostNetwork: false                                # disallow host networking
  hostPID: false                                    # disallow sharing the host process ID namespace
  hostPorts:                                        # disallow low host ports (this seems to only apply to eth0 on EKS)
    - max: 65535
      min: 1025
  privileged: false                                 # disallow privileged pods
  readOnlyRootFilesystem: true                      # change default from 'false' to 'true'
  requiredDropCapabilities:                         # Drop all privileges in the Linux kernel
    - AUDIT_CONTROL
    - AUDIT_READ
    - AUDIT_WRITE
    - BLOCK_SUSPEND
    - CHOWN
    - DAC_OVERRIDE
    - DAC_READ_SEARCH
    - FOWNER
    - FSETID
    - IPC_LOCK
    - IPC_OWNER
    - KILL
    - LEASE
    - LINUX_IMMUTABLE
    - MAC_ADMIN
    - MAC_OVERRIDE
    - MKNOD
    - NET_ADMIN
    - NET_BIND_SERVICE
    - NET_BROADCAST
    - NET_RAW
    - SETGID
    - SETFCAP
    - SETPCAP
    - SETUID
    - SYS_ADMIN
    - SYS_BOOT
    - SYS_CHROOT
    - SYS_MODULE
    - SYS_NICE
    - SYS_PACCT
    - SYS_PTRACE
    - SYS_RAWIO
    - SYS_RESOURCE
    - SYS_TIME
    - SYS_TTY_CONFIG
    - SYSLOG
    - WAKE_ALARM
  runAsGroup:                                       # disallow GID 0 for pods (block root group)
    rule: MustRunAs
    ranges:
      - max: 65535
        min: 1
  runAsUser:                                        # disallow UID 0 for pods
    rule: MustRunAsNonRoot
  seLinux:                                          # Harness for SELinux
    rule: RunAsAny
  supplementalGroups:                               # restrict supplemental GIDs to be non-zero (non-root)
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 1
  volumes:                                          # allow only these volume types
  - configMap
  - downwardAPI
  - emptyDir
  - projected
  - secret
#   - hostPath                                      # Host paths are disallowed by default.
#   - persistentVolumeClaim                         # If you use statefulsets, you'll need this one.

NOTE: The Kubernetes documentation has a list of all volume types. You may require a different volume list; just be careful about which types you allow.
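
To make the policy more concrete, here is a sketch of a pod that should be admitted under the restricted PSP above. The pod name, image, UID/GID values, and port are all placeholders chosen for illustration; the only real requirements are that the IDs are non-zero and within the allowed ranges, that no host namespaces or privileged settings are requested, and that only the permitted volume types are used.

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: psp-demo                              # hypothetical name
spec:
  securityContext:
    runAsUser: 1000                           # non-zero UID satisfies MustRunAsNonRoot
    runAsGroup: 1000                          # inside the 1-65535 runAsGroup range
    fsGroup: 1000                             # inside the 1-65535 fsGroup range
    supplementalGroups: [1000]                # inside the 1-65535 supplementalGroups range
  containers:
  - name: app
    image: registry.example.com/app:1.0       # placeholder image
    ports:
    - containerPort: 8080                     # high port, so NET_BIND_SERVICE is not needed
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp                         # writable scratch space, since the root filesystem is read-only
  volumes:
  - name: tmp
    emptyDir: {}                              # emptyDir is on the allowed volume list
```

A workload that writes anywhere on its root filesystem will break under readOnlyRootFilesystem, so mounting an emptyDir at each writable path is usually the first adjustment an application team has to make.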

How to Apply This PSP to All Users

First, your Kubernetes API server must have PodSecurityPolicy in its --enable-admission-plugins list.
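
How you set this flag depends on how your control plane is run. As a sketch, assuming a self-managed control plane where the API server runs as a static pod (the kubeadm layout; managed offerings typically handle this for you), the relevant fragment might look like the following. The NodeRestriction entry is only a stand-in for whatever plugins you already enable. The official documentation warns that enabling the plugin before any policy is authorized will prevent pods from being created, so have the policy and RBAC objects in this post ready first.

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (path assumes kubeadm).
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Keep your existing plugins and append PodSecurityPolicy to the list.
    - --enable-admission-plugins=NodeRestriction,PodSecurityPolicy
    # ...remaining flags unchanged...
```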

Then you must ensure that all users have access to a PSP. To do that sanely, you grant all users access to the most restrictive PSP.

Instead of enumerating all users, which risks missing users added after PSPs are enabled, we grant this restrictive PSP to the system:authenticated group. Every authenticated user belongs to system:authenticated, so this policy serves as the fallback for any user that has not been granted a more specific PSP.

NOTE: if you are allowing unauthenticated users to deploy workloads, PSPs might not be sufficient for your problem space. However, I will note that a system:unauthenticated group does exist.

Attaching this PSP access requires a ClusterRole to grant the permissions and a ClusterRoleBinding to the system:authenticated group. A ClusterRole and ClusterRoleBinding are used to enable PSP access as we want this restriction to apply to all authenticated users in the cluster. This ClusterRole usage does not supersede other existing RBAC policies limiting user capabilities like creating pods or deployments.

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-restricted
rules:
- apiGroups:
  - policy
  resourceNames:
  - restricted
  resources:
  - podsecuritypolicies
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubernetes.io/description: Restrictive PSP bound to system:authenticated to cover all users
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-restricted
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated

Exceptions

So, you're going to have workloads which require exceptions to the above.

  • Exporting logs? It'll need privileged access to the host logs.
  • Monitoring nodes and workloads? It will probably need some special host mounts, and the dreaded /var/run/docker.sock
  • etc.

So, how do you handle exceptions?

In the end you'll need four things for each exception: a custom PSP, a ServiceAccount, either a Role or ClusterRole, and either a RoleBinding or ClusterRoleBinding.

If your use-case only creates workloads in a single namespace, you can use a Role/RoleBinding. If your exception use-case will reuse a given PSP in a few namespaces, you could use a ClusterRole, and a ServiceAccount/RoleBinding per namespace. Or you could have one ServiceAccount per namespace and a ClusterRole and ClusterRoleBinding which mentions all the ServiceAccounts.
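
As a rough sketch of that last alternative, a single ClusterRoleBinding can list one ServiceAccount per namespace. Every name below (the binding, the ClusterRole it references, the namespaces, and the ServiceAccounts) is hypothetical and only illustrates the shape of the object.

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-shared-exception                  # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-shared-exception                  # hypothetical ClusterRole granting "use" on the exception PSP
subjects:
  - kind: ServiceAccount
    namespace: team-a                         # hypothetical namespace
    name: log-shipper                         # hypothetical ServiceAccount
  - kind: ServiceAccount
    namespace: team-b
    name: log-shipper
```

The trade-off is that every new namespace means editing one shared cluster-wide object, rather than adding a small RoleBinding next to the workload it covers.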

The pattern I've settled on is a PSP, a ClusterRole and a RoleBinding to it. This has the advantage of keeping all the PSP related cluster roles together for auditing purposes. Using a RoleBinding to a ClusterRole also restricts the PSP exceptions to workloads in a single namespace.

The following shows an example of a PSP exception which covers a fluentd DaemonSet (fluentd exports log messages to their final destination) as it needs a significant number of exceptions compared to the restrictive default PSP:

---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: 'tailored PSP for fluentd->fluentd'
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: fluentd-fluentd
spec:
#  allowedCapabilities:
#    - NET_BIND_SERVICE                             # useful if a workload needs a low port. but by default it's commented out
  allowedHostPaths:
    - pathPrefix: "/run/fluentd/data"               # EXCEPTION: fluentd needs a host-persistent place to store state
      readOnly: false
    - pathPrefix: "/var/lib/docker/containers"      # EXCEPTION: fluentd needs read access to docker container logs
      readOnly: true
    - pathPrefix: "/var/log"                        # EXCEPTION: fluentd needs read access to host logs
      readOnly: true
  allowPrivilegeEscalation: true                    # EXCEPTION: fluentd needs root to read logs
  allowedProcMountTypes:                            # Disallow full /proc mounts, only allow the "default" masked /proc
    - Default
  fsGroup:                                          # EXCEPTION: fluentd runs as privileged, so requires fsGroup 0
    rule: MustRunAs
    ranges:
      - max: 65535
        min: 0
  hostIPC: false                                    # disallow sharing the host IPC namespace
  hostNetwork: false                                # disallow host networking
  hostPID: false                                    # disallow sharing the host process ID namespace
  hostPorts:                                        # disallow low host ports (this seems to only apply to eth0 on EKS)
    - max: 65535
      min: 1025
  privileged: true                                  # EXCEPTION: fluentd needs root
  readOnlyRootFilesystem: true                      # change default from 'false' to 'true'
  requiredDropCapabilities:                         # Drop all privileges in the Linux kernel
    - AUDIT_CONTROL
    - AUDIT_READ
    - AUDIT_WRITE
    - BLOCK_SUSPEND
    - CHOWN
    - DAC_OVERRIDE
    - DAC_READ_SEARCH
    - FOWNER
    - FSETID
    - IPC_LOCK
    - IPC_OWNER
    - KILL
    - LEASE
    - LINUX_IMMUTABLE
    - MAC_ADMIN
    - MAC_OVERRIDE
    - MKNOD
    - NET_ADMIN
    - NET_BROADCAST
    - NET_RAW
    - SETGID
    - SETFCAP
    - SETPCAP
    - SETUID
    - SYS_ADMIN
    - SYS_BOOT
    - SYS_CHROOT
    - SYS_MODULE
    - SYS_NICE
    - SYS_PACCT
    - SYS_PTRACE
    - SYS_RAWIO
    - SYS_RESOURCE
    - SYS_TIME
    - SYS_TTY_CONFIG
    - SYSLOG
    - WAKE_ALARM
  runAsGroup:                                       # EXCEPTION: fluentd runs as privileged, so requires GID 0
    rule: MustRunAs
    ranges:
      - max: 65535
        min: 0
  runAsUser:
    rule: MustRunAs                                 # EXCEPTION: fluentd runs as privileged, so requires UID 0
    ranges:
    - max: 65535
      min: 0
  seLinux:                                          # Harness for SELinux, if we ever engage with it
    rule: RunAsAny
  supplementalGroups:                               # EXCEPTION: fluentd runs as privileged, so requires GID 0
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 0
  volumes:                                          # allow only these volume types
  - configMap
  - downwardAPI
  - emptyDir
  - hostPath                                        # EXCEPTION: Fluentd needs to mount hostPaths
  - projected
  - secret
#   - persistentVolumeClaim                         # If you use statefulsets, you'll need this one.

NOTE: each exception is called out as a YAML comment with the word "EXCEPTION". This practice lets you track your exceptions easily in the code which manages them, and makes the auditors happy when they come calling.

NOTE2: This is laid out identically to the restrictive PSP for ease of using "diff" style tools to validate.

To permit usage of the above PSP by fluentd, we have this ClusterRole and RoleBinding:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-fluentd-fluentd                         # named for psp-<namespace>-<serviceaccount>
rules:
- apiGroups:
  - policy
  resourceNames:
  - fluentd-fluentd
  resources:
  - podsecuritypolicies
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    kubernetes.io/description: 'tailored PSP for fluentd->fluentd'
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-fluentd-fluentd
  namespace: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-fluentd-fluentd
subjects:
  - kind: ServiceAccount
    namespace: fluentd
    name: fluentd

To use this PSP, the DaemonSet just needs to specify serviceAccountName: fluentd in its pod specification and run in the fluentd namespace to leverage the RoleBinding.
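
The ServiceAccount and the DaemonSet side of this are not shown above, so here is a trimmed-down sketch of what they might look like. The image, labels, and volume names are placeholders, and a real fluentd deployment will need more configuration than this; the parts that matter for the PSP are the namespace, the serviceAccountName, and host mounts that stay inside the allowedHostPaths.

```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: fluentd
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: fluentd                          # must match the RoleBinding's namespace
spec:
  selector:
    matchLabels:
      app: fluentd                            # placeholder label
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd             # the subject of the RoleBinding
      containers:
      - name: fluentd
        image: registry.example.com/fluentd:1.0   # placeholder image
        securityContext:
          privileged: true                    # allowed by fluentd-fluentd, rejected by restricted
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true                      # the PSP marks this path read-only
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: state
          mountPath: /run/fluentd/data        # the one writable host path
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers
      - name: state
        hostPath:
          path: /run/fluentd/data
```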

Resolution: Which PSP gets used?

The fluentd workload above will have access to two distinct PSPs... How does that work?

The PSP Admission Controller runs the given pod against all allowed PSPs in alphabetical order. The first PSP which allows the workload is used. (Strictly speaking, the official documentation says that policies which admit the pod without mutating it are preferred, and alphabetical order by name decides among the rest.) People seem to get hung up on the alphabetical thing, but in practice it's not a big deal. Workloads in this framework will have access to one or two PSPs: the fully restrictive one, and perhaps a special one. It doesn't really matter which one gets evaluated first, since PSPs are handled in an "if there is a single policy allowing the work, it goes in" pattern. If there is a fully restrictive PSP, then in effect it works similarly to a "default deny" firewall policy.

If no PSPs allow the workload, you'll see a Kubernetes event which shows the union of ALL errors the workload has against EVERY PSP it has access to. This makes troubleshooting quite painful.

Troubleshooting and Triage

When a pod passes a PSP, an annotation is added to it:

    annotations:
      kubernetes.io/psp: restricted

The downside is there's almost no useful information given when workloads FAIL all available PSPs.

The only hint you'll get is if you run kubectl describe daemonset <your_daemonset>

You'll see the events for that DaemonSet attempting to launch a pod:

Warning  FailedCreate  1s (x2 over 3s)  daemonset-controller  (combined from similar events): Error creating: pods
"fluentd-4mqhj" is forbidden: unable to validate against any pod security policy: []

A tactic I used to help triage these problems was to temporarily run kubectl delete psp restricted so that I would only see the errors from the PSP the exception was trying to leverage. Obviously this would break new workloads on a running cluster (no PSP? No launch), so we recommend a non-production environment for such triage.

PSPs and Mutating Admission Controllers

By far the hardest challenge we had enabling PSPs was around our Mutating Admission Controllers.

Under the covers, Kubernetes admission controllers (which include PSPs) have a complex precedence hierarchy that causes PSPs to run twice. Here's the webhook ordering:

  1. Builtin Mutating Webhooks: Pod Security Policies mutating webhook runs here to modify pod defaults according to the PSP determined to best apply.
  2. Custom Mutating Webhooks: Any mutations you need in your environment would happen here.
  3. Custom Validating Webhooks: Any validations you need in your environment would happen here.
  4. Builtin Validating Webhooks: Pod Security Policies validating webhook runs here to verify that pod objects still abide by the PSP requirements.

All in all, the above is a reasonable setup, and probably the only way to enable all these distinct webhook features. However, it's barely documented (unless you stumble onto the right GitHub issues like this one from Istio). Worse, the Validating version of the PSP admission controller yields zero error messages, so you're troubleshooting it blind.

In the end, if you have a Mutating Admission Controller, you need to ensure that its mutations are completely safe against the most restrictive PSP in your cluster. This verification will probably be tedious (it was for us), which is why we wrote this guide.