Kubernetes - Pod Security Policies
A fully fleshed out example with exception management
My team is building a general-purpose Kubernetes cluster at Square. As part of that build-out, we implemented Pod Security Policies (PSPs) to protect our clusters from many container escape risks.
This post is focused on how to do a full deployment of Pod Security Policies with everything locked down and how to grant exceptions. For an introduction to PSPs, please go read the official Kubernetes PSP documentation. The official documentation does a good job of showing how to use a simple PSP example, but it falls short of describing how a fully functioning PSP system would work, how it would be applied by default, and how exceptions could be managed. This post was written to help fill that gap. At the end, we discuss several pitfalls you might run into while implementing PSPs, along with some troubleshooting tactics.
Why Bother?
Kubernetes is a very powerful and complicated tool, and that complexity has led to several security issues within the community. Most Kubernetes security failures fall into two broad camps:
- Attack a workload, escape containment, and attack the cluster/host.
- Attack the kubernetes API from the outside.
Leveraging all the capabilities of PSPs allows you to dramatically increase the difficulty of a container escape, which shrinks the first major attack vector. PSPs are not a silver bullet (nothing is), but they are a powerful tool to protect your clusters and workloads.
NOTE: the above is a simplification of a complex topic. For more information, Microsoft recently released an Attack Matrix for Kubernetes. PSPs allow you to mitigate 9 of those 31 attack techniques.
A Fully Restrictive PSP (up to kube 1.15)
Some notes before diving in:
- The PSPs demonstrated in this post will work with later versions, but you should go through the official docs to ensure all available controls are enumerated.
- In an effort to simplify exception management, the PSP examples in this post take very few "shortcuts" in their syntax and explicitly enumerate all capabilities/permissions. For example, under requiredDropCapabilities we list every capability rather than using "all".
- The PSPs found in this post avoid AppArmor and SELinux management, as either topic is a blog post series unto itself. Only the stubs are mentioned.
- PSPs for Kubernetes on Windows are left as an exercise for the reader.
The fully restrictive PSP:
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: 'restricted psp for all standard use-cases'
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: restricted
spec:
  # allowedCapabilities:
  # - NET_BIND_SERVICE # useful if a workload needs a low port, but commented out by default
  # allowedHostPaths:
  # - pathPrefix: "/some/path" # example of how to allow a specific host path
  #   readOnly: true
  allowPrivilegeEscalation: false # disallow privilege escalation to any special capabilities
  allowedProcMountTypes: # disallow full /proc mounts; only allow the "default" masked /proc
  - Default
  fsGroup: # disallow root fsGroups for volume mounts
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 1
  hostIPC: false # disallow sharing the host IPC namespace
  hostNetwork: false # disallow host networking
  hostPID: false # disallow sharing the host process ID namespace
  hostPorts: # disallow low host ports (this seems to only apply to eth0 on EKS)
  - max: 65535
    min: 1025
  privileged: false # disallow privileged pods
  readOnlyRootFilesystem: true # change default from 'false' to 'true'
  requiredDropCapabilities: # drop all privileges in the Linux kernel
  - AUDIT_CONTROL
  - AUDIT_READ
  - AUDIT_WRITE
  - BLOCK_SUSPEND
  - CHOWN
  - DAC_OVERRIDE
  - DAC_READ_SEARCH
  - FOWNER
  - FSETID
  - IPC_LOCK
  - IPC_OWNER
  - KILL
  - LEASE
  - LINUX_IMMUTABLE
  - MAC_ADMIN
  - MAC_OVERRIDE
  - MKNOD
  - NET_ADMIN
  - NET_BIND_SERVICE
  - NET_BROADCAST
  - NET_RAW
  - SETGID
  - SETFCAP
  - SETPCAP
  - SETUID
  - SYS_ADMIN
  - SYS_BOOT
  - SYS_CHROOT
  - SYS_MODULE
  - SYS_NICE
  - SYS_PACCT
  - SYS_PTRACE
  - SYS_RAWIO
  - SYS_RESOURCE
  - SYS_TIME
  - SYS_TTY_CONFIG
  - SYSLOG
  - WAKE_ALARM
  runAsGroup: # disallow GID 0 for pods (block the root group)
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 1
  runAsUser: # disallow UID 0 for pods
    rule: MustRunAsNonRoot
  seLinux: # harness for SELinux
    rule: RunAsAny
  supplementalGroups: # restrict supplemental GIDs to be non-zero (non-root)
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 1
  volumes: # allow only these volume types
  - configMap
  - downwardAPI
  - emptyDir
  - projected
  - secret
  # - hostPath # host paths are disallowed by default
  # - persistentVolumeClaim # if you use StatefulSets, you'll need this one
NOTE: Here you can find a list of all Kubernetes volume types. You may require a different volume list; just be careful with the ones you allow.
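To make the PSP's effect concrete, here is a minimal sketch of a pod that would pass this restricted policy. All names and the image are hypothetical; the point is the securityContext and volume choices a workload must declare (or be defaulted to) to be admitted:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0     # hypothetical image
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000          # any non-zero UID within the allowed range
      runAsGroup: 1000         # any non-zero GID within the allowed range
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: tmp
      mountPath: /tmp          # writable scratch space, since the root filesystem is read-only
  volumes:
  - name: tmp
    emptyDir: {}               # emptyDir is one of the allowed volume types
```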
How to Apply This PSP to All Users
First, your Kubernetes API server must have PodSecurityPolicy in its --enable-admission-plugins list.
Then you must ensure that all users have access to a PSP. To do that sanely, you grant all users access to the most restrictive PSP. Instead of enumerating all users, which could end up missing users added after PSPs are enabled, we grant this restrictive PSP to the system:authenticated group. The system:authenticated group is the fallback group for all authenticated users if the user lacks more explicit permissions.
NOTE: if you are allowing unauthenticated users to deploy workloads, PSPs might not be sufficient for your problem space. However, I will note that a system:unauthenticated group does exist.
Attaching this PSP access requires a ClusterRole to grant the permission and a ClusterRoleBinding to the system:authenticated group. A ClusterRole and ClusterRoleBinding are used because we want this restriction to apply to all authenticated users in the cluster. This ClusterRole usage does not supersede other existing RBAC policies limiting user capabilities, like creating pods or deployments.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-restricted
rules:
- apiGroups:
  - policy
  resourceNames:
  - restricted
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubernetes.io/description: Restrictive PSP bound to system:authenticated to cover all users
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-restricted
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:authenticated
Exceptions
So, you're going to have workloads which require exceptions to the above.
- Exporting logs? It'll need privileged access to the host logs.
- Monitoring nodes and workloads? It will probably need some special host mounts, and the dreaded /var/run/docker.sock.
- etc.
So, how do you handle exceptions?
In the end you'll need four things for each exception: a custom PSP, a ServiceAccount, either a Role or ClusterRole, and either a RoleBinding or ClusterRoleBinding.
If your use-case only runs workloads in a single namespace, you can use a Role/RoleBinding. If your exception will reuse a given PSP across a few namespaces, you could use a ClusterRole plus a ServiceAccount and RoleBinding per namespace. Or you could have one ServiceAccount per namespace plus a ClusterRole and a ClusterRoleBinding that mentions all the ServiceAccounts.
The pattern I've settled on is a PSP, a ClusterRole, and a RoleBinding to that ClusterRole. This has the advantage of keeping all the PSP-related ClusterRoles together for auditing purposes. Using a RoleBinding to a ClusterRole also restricts the PSP exception to workloads in a single namespace.
The following shows an example of a PSP exception which covers a fluentd DaemonSet (fluentd exports log messages to their final destination) as it needs a significant number of exceptions compared to the restrictive default PSP:
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: 'tailored PSP for fluentd->fluentd'
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: fluentd-fluentd
spec:
  # allowedCapabilities:
  # - NET_BIND_SERVICE # useful if a workload needs a low port, but commented out by default
  allowedHostPaths:
  - pathPrefix: "/run/fluentd/data" # EXCEPTION: fluentd needs a host-persistent place to store state
    readOnly: false
  - pathPrefix: "/var/lib/docker/containers" # EXCEPTION: fluentd needs read access to docker container logs
    readOnly: true
  - pathPrefix: "/var/log" # EXCEPTION: fluentd needs read access to host logs
    readOnly: true
  allowPrivilegeEscalation: true # EXCEPTION: fluentd needs root to read logs
  allowedProcMountTypes: # disallow full /proc mounts; only allow the "default" masked /proc
  - Default
  fsGroup: # EXCEPTION: fluentd runs as privileged, so requires fsGroup 0
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 0
  hostIPC: false # disallow sharing the host IPC namespace
  hostNetwork: false # disallow host networking
  hostPID: false # disallow sharing the host process ID namespace
  hostPorts: # disallow low host ports (this seems to only apply to eth0 on EKS)
  - max: 65535
    min: 1025
  privileged: true # EXCEPTION: fluentd needs root
  readOnlyRootFilesystem: true # change default from 'false' to 'true'
  requiredDropCapabilities: # drop all privileges in the Linux kernel
  - AUDIT_CONTROL
  - AUDIT_READ
  - AUDIT_WRITE
  - BLOCK_SUSPEND
  - CHOWN
  - DAC_OVERRIDE
  - DAC_READ_SEARCH
  - FOWNER
  - FSETID
  - IPC_LOCK
  - IPC_OWNER
  - KILL
  - LEASE
  - LINUX_IMMUTABLE
  - MAC_ADMIN
  - MAC_OVERRIDE
  - MKNOD
  - NET_ADMIN
  - NET_BROADCAST
  - NET_RAW
  - SETGID
  - SETFCAP
  - SETPCAP
  - SETUID
  - SYS_ADMIN
  - SYS_BOOT
  - SYS_CHROOT
  - SYS_MODULE
  - SYS_NICE
  - SYS_PACCT
  - SYS_PTRACE
  - SYS_RAWIO
  - SYS_RESOURCE
  - SYS_TIME
  - SYS_TTY_CONFIG
  - SYSLOG
  - WAKE_ALARM
  runAsGroup: # EXCEPTION: fluentd runs as privileged, so requires GID 0
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 0
  runAsUser: # EXCEPTION: fluentd runs as privileged, so requires UID 0
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 0
  seLinux: # harness for SELinux, if we ever engage with it
    rule: RunAsAny
  supplementalGroups: # EXCEPTION: fluentd runs as privileged, so requires GID 0
    rule: MustRunAs
    ranges:
    - max: 65535
      min: 0
  volumes: # allow only these volume types
  - configMap
  - downwardAPI
  - emptyDir
  - hostPath # EXCEPTION: fluentd needs to mount hostPaths
  - projected
  - secret
  # - persistentVolumeClaim # if you use StatefulSets, you'll need this one
NOTE: each exception is called out as a YAML comment with the word "EXCEPTION". This practice lets you track your exceptions easily in the code which manages them, and makes the auditors happy when they come calling.
NOTE2: This is laid out identically to the restrictive PSP for ease of using "diff" style tools to validate.
To permit usage of the above PSP by fluentd, we have this ClusterRole and RoleBinding:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-fluentd-fluentd # named for psp-<namespace>-<serviceaccount>
rules:
- apiGroups:
  - policy
  resourceNames:
  - fluentd-fluentd
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    kubernetes.io/description: 'tailored PSP for fluentd->fluentd'
  labels:
    eks.amazonaws.com/component: pod-security-policy
    kubernetes.io/cluster-service: "true"
  name: psp-fluentd-fluentd
  namespace: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-fluentd-fluentd
subjects:
- kind: ServiceAccount
  namespace: fluentd
  name: fluentd
To use this PSP, the DaemonSet just needs to specify serviceAccountName: fluentd in its pod specification and run in the fluentd namespace to leverage the RoleBinding.
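As a sketch (the image and volume names here are hypothetical), the relevant portion of such a DaemonSet would look like:

```yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: fluentd            # must match the RoleBinding's namespace
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd  # grants access via the psp-fluentd-fluentd RoleBinding
      containers:
      - name: fluentd
        image: example/fluentd:1.0  # hypothetical image
        securityContext:
          privileged: true          # permitted by the fluentd-fluentd PSP exception
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log            # allowed by the PSP's allowedHostPaths
```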
Resolution: Which PSP gets used?
The fluentd workload above will have access to two distinct PSPs... How does that work?
The PSP Admission Controller runs the given pod against all allowed PSPs in alphabetical order. The first PSP which allows the workload is used. People seem to get hung up on the alphabetical thing, but in practice it's not a big deal. Workloads in this framework will have access to one or two PSPs: the fully restrictive one, and perhaps a special one. It doesn't really matter which one gets evaluated first, since PSPs follow an "if any single policy allows the workload, it goes in" pattern. With a fully restrictive PSP in place, the system effectively works like a "default deny" firewall policy.
If no PSPs allow the workload, you'll see a Kubernetes event which shows the union of ALL errors the workload has against EVERY PSP it has access to. This makes troubleshooting quite painful.
Troubleshooting and Triage
When a pod passes a PSP, an annotation is added to it:
annotations:
  kubernetes.io/psp: restricted
The downside is that there's almost no useful information given when a workload FAILS all available PSPs. The only hint you'll get is if you run kubectl describe daemonset <your_daemonset>, where you'll see the events for that DaemonSet attempting to launch a pod:
Warning FailedCreate 1s (x2 over 3s) daemonset-controller (combined from similar events): Error creating: pods
"fluentd-4mqhj" is forbidden: unable to validate against any pod security policy: []
A tactic I used to help triage these problems was to temporarily kubectl delete psp restricted so that I would only see the errors from the actual PSP the exception was trying to leverage. Obviously this testing breaks new workloads for a running cluster (no PSP? No launch), so we recommend a non-production environment for such triage.
PSPs and Mutating Admission Controllers
By far the hardest challenge we had enabling PSPs was around our Mutating Admission Controllers.
Under the covers, Kubernetes admission controllers (which include PSPs) have a complex precedence hierarchy that causes PSPs to run twice. Here's the webhook ordering:
1. Built-in Mutating Webhooks: the Pod Security Policy mutating webhook runs here to modify pod defaults according to the PSP determined to best apply.
2. Custom Mutating Webhooks: any mutations you need in your environment happen here.
3. Custom Validating Webhooks: any validations you need in your environment happen here.
4. Built-in Validating Webhooks: the Pod Security Policy validating webhook runs here to verify that pod objects still abide by the PSP requirements.
All in all, the above is a reasonable setup, and probably the only way to enable all these distinct webhook features. However, it's barely documented (unless you stumble onto the right GitHub issues, like this one from Istio). Worse, the validating version of the PSP admission controller yields zero error messages, so you're troubleshooting it blind.
In the end, if you have a Mutating Admission Controller, you need to ensure that its mutations are completely safe against the most restrictive PSP in your cluster. This verification will probably be tedious (it was for us), which is why we wrote this guide.
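As a concrete illustration (with hypothetical names), consider a webhook that injects a sidecar container. Because the built-in PSP validating webhook runs after your custom mutations, the injected container spec must itself comply with the restricted PSP, or the pod will be rejected:

```yaml
# Sketch of a sidecar container a mutating webhook might inject into
# a pod spec. To survive the final PSP validation pass, the injected
# container must satisfy the restricted PSP on its own:
- name: injected-sidecar        # hypothetical sidecar name
  image: example/sidecar:1.0    # hypothetical image
  securityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1000             # non-zero UID, per the runAsUser rule
    capabilities:
      drop:
      - ALL
```

A sidecar injected without any securityContext can also pass, but only because the PSP mutating webhook's defaults were applied before your mutation; an explicitly non-compliant securityContext will always fail the final validation.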