Threat Hunting with Kubernetes Audit Logs - Part 2
Using the MITRE ATT&CK® Framework to hunt for attackers
Welcome back! If you have arrived here, I assume you have a fair understanding, from the previous article, of how Kubernetes audit logs can be useful. Now let us dive in a bit deeper and see how we can use these audit logs to hunt for attackers using the MITRE ATT&CK® framework.
Using the MITRE ATT&CK® Framework
So this was a good first step, but we can leverage the MITRE ATT&CK Framework to look at performing our hunts in a more organized way. There is a great article by Aquasec which lays out a threat matrix for Kubernetes that we can use. Here’s a snapshot of the Aquasec Threat Matrix:
Threat Matrix for Kubernetes¹
MITRE has also come up with an ATT&CK matrix for Containers, which can be viewed here.
For our purpose, we will be using Aquasec’s threat matrix for specific hunts. We will go through each phase, choose a tactic/technique from the phase, create a hypothesis to identify the tactic and look at outputs of example queries that may corroborate our hypothesis.
Now let us pick up from point 5 (annotations.authorization.k8s.io/decision) in the previous article, where we found an anomalous user. Once we find a particular user or serviceaccount that has moved into our top 10, we can use that information to find out more about the user. We will start from the "Execution" phase, since we are looking at an anomalous user right now. From the threat matrix, we can see that one of the methods in the "Execution" phase would be for an attacker to use pod "exec". Let’s take a look at that.
Execution Phase
Hypothesis
Legitimate users of our cluster are implicitly aware of the pods they can "exec" into, so repeated "exec" "forbid" audit events might indicate that an attacker has obtained user credentials and is attempting to determine or expand their access.
The Data
objectRef.resource and objectRef.subresource are two fields which tell us what resources are being targeted by a user or attacker. We can use these along with the annotations.authorization.k8s.io/decision field to see if any attackers attempted pod "exec" and failed.
In the example below, running a query across the last 2 months, we can see that a user tried to exec into a pod 22 times and failed ("forbid") because of an implicit RBAC policy. This is an anomaly, and we can use this information to move further up the matrix.
fields `annotations.authorization.k8s.io/decision`, objectRef.resource, objectRef.subresource, user.username
| filter `annotations.authorization.k8s.io/decision` like /forbid/ and objectRef.subresource = "exec"
| stats count(*) as decision_count by `annotations.authorization.k8s.io/decision`, objectRef.resource, objectRef.subresource, user.username
Decision, resource, subresource, user.username, decision_count
Forbid, pods, exec, shu, 22
Persistence
Hypothesis
Users in our cluster have full access to their own application’s namespace. That being the case, there should ideally be a minimal number of forbids for creating "cronjobs". An anomalous number of forbids for cronjob creation could indicate that an attacker is trying to gain a persistent foothold in the cluster.
The Data
Moving up the matrix to the "Persistence" phase, the attacker would next look to ensure that they maintain their foothold in the cluster. What better way to do this than to run the malicious application as a "cronjob"? So the next step would be to check whether the suspicious user.username or sourceIPs.0 tried creating multiple "cronjobs", with success or failure. Multiple "forbid" decisions followed by several "allow" decisions are suspicious in themselves! Example query:
fields `annotations.authorization.k8s.io/decision`, sourceIPs.0
| filter objectRef.resource = "cronjobs"
| stats count(*) as decision_count by `annotations.authorization.k8s.io/decision`, objectRef.resource, sourceIPs.0
| sort decision_count desc
| limit 10
Decision, objectRef.resource, sourceIPs.0, decision_count
forbid, cronjobs, 10.168.4.8, 1356
allow, cronjobs, 10.168.3.31, 1207
allow, cronjobs, 10.168.4.149, 531
allow, cronjobs, 10.168.4.127, 1278
allow, cronjobs, 10.168.4.35, 1481
allow, cronjobs, 10.168.4.239, 685
allow, cronjobs, 10.168.4.151, 51
forbid, cronjobs, 10.168.3.234, 25
allow, cronjobs, 10.168.4.12, 45
allow, cronjobs, 10.168.4.8, 36
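To surface source IPs that show both "forbid" and "allow" decisions for cronjob creation, a rough sketch like the one below can help. The verb = "create" filter and the strcontains trick (which turns each decision into a 0/1 value that can be summed) are our additions; tune them to your environment:
fields `annotations.authorization.k8s.io/decision` as decision, sourceIPs.0
| filter objectRef.resource = "cronjobs" and verb = "create"
| stats sum(strcontains(decision, "forbid")) as forbid_count, sum(strcontains(decision, "allow")) as allow_count by sourceIPs.0
| sort forbid_count desc
| limit 10
Rows where both forbid_count and allow_count are non-zero are the ones worth a closer look.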
Privilege Escalation
Hypothesis
Users in our cluster are generally not granted "cluster-admin" access. That being the case, any user being granted "cluster-admin" access could mean that an attacker has found a way to escalate their privileges by compromising the credentials of a higher-privileged user and binding that user to an even higher-privileged Clusterrole.
The Data
The next stage in the threat matrix is "Privilege Escalation", and a simple but dangerous action to look for is a user being granted "cluster-admin" access. If an attacker attains cluster-admin access, the attacker has full control over your entire cluster! One very useful query would be to see if a user tried to grant themselves the "cluster-admin" Clusterrole. Cluster-admin can do pretty much anything on the cluster, and you should definitely alert on any user or group that you don’t expect to get this access. For this we can use the responseObject.status.reason field:
fields @timestamp, responseObject.status.reason
| filter responseObject.status.reason like /ClusterRole "cluster-admin" to Group/
| stats count(*) by responseObject.status.reason
responseObject.status.reason:
RBAC: allowed by ClusterRoleBinding "cluster-admin" of ClusterRole "cluster-admin" to Group "deployer-user"
count(*): 41
Another useful check would be to see if there are any "forbids" for a user trying to get access to the cluster-admin Clusterrole. This should definitely be flagged as an anomaly:
fields `annotations.authorization.k8s.io/decision`, sourceIPs.0, user.username, user.groups.0
| filter `annotations.authorization.k8s.io/decision` like /forbid/ and user.username like /admin/
Another way an attacker might try to escalate privileges is by associating a serviceaccount (for their app) with a Rolebinding/Clusterrolebinding that carries higher privileges. For example, an attacker may try to associate their app’s serviceaccount with a Rolebinding whose Clusterrole "use"(s) a highly privileged PSP, which in turn allows the pod to set the "privileged" flag or even to mount "docker.sock" (which can lead to instant root access to the underlying node in the cluster!).
A good way to hunt for these is by:
a) Checking to see if there is an unusually large number of "list" and "get" actions on Clusterroles, Roles, Rolebindings and Clusterrolebindings by a user.username (see the sketch after this list)
b) Checking if there is an unusually large number of "forbids" for the user.username when associating a serviceaccount with one of these Rolebindings/Clusterrolebindings
c) Finally, checking to see if there were "allows" for the user.username to a Rolebinding/Clusterrolebinding.
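For check (a), a minimal sketch could look like the query below. The /system/ exclusion and the limit are assumptions to tune for your cluster:
fields user.username, objectRef.resource, verb
| filter objectRef.resource in ["clusterroles", "roles", "rolebindings", "clusterrolebindings"] and (verb = "list" or verb = "get") and user.username not like /system/
| stats count(*) as rbac_recon_count by user.username, objectRef.resource
| sort rbac_recon_count desc
| limit 10
Swapping the verb filter for a filter on `annotations.authorization.k8s.io/decision` (like /forbid/ or /allow/) gives you checks (b) and (c).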
At Square, we have a mapping of sensitive Clusterrolebindings/Rolebindings to serviceaccounts, and at any point in time, if there are changes or additions to these Clusterrolebindings/Rolebindings with non allow-listed serviceaccounts, we alert on these anomalies.
Example: Let’s say you have a PSP called kube-service-psp associated with a Rolebinding called kube-service-rb. Let’s say you have a serviceaccount called kube-test-sa which is associated with the kube-service-rb Rolebinding, which ensures that this serviceaccount can use the kube-service-psp PSP. Set up alerts in case any modifications are made to this Rolebinding association or if any serviceaccount is added to this Rolebinding. Example query:
fields objectRef.namespace as namespace, responseObject.subjects.0.namespace as responseObject_namespace, responseObject.subjects.0.name as responseObject_name
| filter namespace like /kube-service/ and responseObject_name not like /kube-bench-sa/ and ispresent(responseObject_namespace)
| stats count(*) by namespace, responseObject_namespace, responseObject_name
namespace, responseObject_namespace, responseObject_name, count
kube-service, kube-service, shu-sa, 9
kube-service, kube-service, shu-test-sa, 9
kube-service, john-app, john-sa, 1
Defense Evasion
Hypothesis
Kubernetes "event" resources are generally never deleted by our users. So if we see an anomalous number of delete events, this could mean that an attacker is trying to cover their tracks.
The Data
A technique generally employed by attackers is to delete the Kubernetes resource type "events". This ensures that these events do not show up on the Administrator’s or the Security team’s radar while the attackers are creating resources within the cluster. Unfortunately for them, they can’t evade the Kubernetes audit logs!
Attackers would generally look to see what kind of "events" are being generated by "list"ing or "watch"ing pods. So, it would be good to have a baseline of the number of users performing a "list", "watch" or "delete" on "events". If this goes beyond your baseline, that’s a good sign to investigate further. Example queries:
fields objectRef.resource
| filter verb = "delete" and objectRef.resource = "events"
| stats count(*) as delete_count by objectRef.resource
| sort delete_count desc
fields objectRef.resource, verb, user.username
| filter objectRef.resource = "events" and user.username not like /system/
| stats count(*) as verb_count by verb, user.username
| sort verb_count desc
verb, user.username, verb_count
create, scheduler, 33628
watch, controller-manager, 1991
list, controller-manager, 34
deletecollection, namespace-control, 4
list, namespace-control, 4
Again, note that the "user.username" would also give us the serviceaccount.
Credential Access
Hypothesis
Serviceaccount tokens are used mainly by applications to call the Kubernetes API from within the cluster. That being the case, if the serviceaccount token is used from an anomalous IP, this could mean that an attacker has compromised the serviceaccount token of a privileged application and is trying to use it to access resources within the cluster.
The Data
A method that attackers use for credential access is to steal serviceaccount tokens and use them outside the cluster. We should run checks to see if serviceaccount credentials were used from IP addresses that we do not expect them to be used from. For example, this query checks whether any of our serviceaccounts at Square are being used from IP spaces we don’t expect:
filter user.username like /system:serviceaccount:/ and user.groups.0 = "system:serviceaccounts"
and sourceIPs.0 not like /10.168.4./ and sourceIPs.0 not like /10.168.3./ and sourceIPs.0 not like /10.168.2./
and sourceIPs.0 not like /::1/ and sourceIPs.0 not like /172.16/ and sourceIPs.0 not like /127.0.0/
| fields sourceIPs.0, user.username
Once an attacker steals a serviceaccount token/credential, they can try to use the permissions granted to that token (permissions granted by Kubernetes RBAC to the serviceaccount) to perform recon, escalate privileges and/or move laterally.
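Once a particular token looks suspicious, a useful follow-up is to pivot on that identity and summarize everything it touched. The serviceaccount name below is a hypothetical placeholder; substitute the identity you are investigating:
fields user.username, verb, objectRef.resource, objectRef.namespace, sourceIPs.0
| filter user.username = "system:serviceaccount:kube-service:shu-sa"
| stats count(*) as action_count by verb, objectRef.resource, objectRef.namespace
| sort action_count desc
This gives a quick picture of whether the token is being used for recon (lots of "list"/"get"), privilege escalation (writes to RBAC resources) or lateral movement (reads of "secrets").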
Discovery
Hypothesis
Users of our cluster seldom look at Networkpolicies, especially Networkpolicies across different namespaces. If there are multiple such instances, it could mean that an attacker has managed to compromise a legitimate user’s credentials and is trying to get network information within the cluster.
The Data
Once attackers manage to get into the environment, one of the first things that they tend to do is query the API server to map out network information. For this, an attacker may use network scanning from within the pod to get a map of the network. However, if the network is controlled with Networkpolicies, like we have at Square, an attacker would try to map out the network connectivity within an environment by running "list", "watch" and "get" on network policies.
An effective check here would be to keep a list of the top 10 user.usernames performing these actions over a defined period, and to alert in case a new user.username moves into the top 10, as sketched below. This doesn’t always mean malicious activity, but it should be looked at nonetheless.
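A minimal sketch of that baseline query (the /system/ exclusion is an assumption; adjust it and the time window to your environment):
fields user.username, verb, objectRef.resource
| filter objectRef.resource like /networkpolicies/ and (verb = "list" or verb = "watch" or verb = "get") and user.username not like /system/
| stats count(*) as user_count by user.username
| sort user_count desc
| limit 10
Run it on a schedule, snapshot the result, and diff against the previous period’s top 10.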
Another sign would be users trying to modify Networkpolicies and encountering multiple denies, or someone performing multiple Networkpolicy "delete" actions. Even simple 404 "not found" errors could indicate an attacker probing for information that does not exist. Obviously these can lead to false positives, so we need to tread carefully. Example queries:
fields objectRef.resource, user.username, verb, responseStatus.code
| filter objectRef.resource like /networkpolicies/ and responseStatus.code like /404/
fields objectRef.resource, verb, user.username
| filter objectRef.resource like /networkpolicies/
| stats count(*) as user_count by user.username, verb
| sort user_count desc
user.username, verb, user_count
system:serviceaccount:policies:network-policies, create, 27493560
system:serviceaccount:policies:network-policies, update, 27493352
system:serviceaccount:policies:network-policies, get, 27493351
shu, list, 4723720
john, watch, 214222
joe, list, 97096
raghav, watch, 58784
shu, watch, 28432
shu, deletecollection, 36
shu, delete, 1
Lateral Movement
Hypothesis
Users are generally not granted access to "secrets" within our cluster, except their own application’s secrets. That being the case, an anomalous number of accesses to secrets, especially across namespaces, could be an indication of an attacker with compromised credentials trying to get access to secrets in order to move laterally.
The Data
As described under the "Credential Access" section above, an attacker can compromise serviceaccount tokens and use them to move laterally (for example, by trying to access Kubernetes secrets, or secrets accidentally stored in environment variables). This, of course, depends on the RBAC permissions granted to the serviceaccount token.
So if we see anomalous serviceaccount activity, we should check whether the serviceaccount is querying pods, deployments or cronjobs for either environment variables or Kubernetes secrets to use as credentials for lateral movement, as in the sketch below.
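A rough sketch of that check, grouping by resource to see which workload objects a non-system identity is reading (the /system/ exclusion and the limit are assumptions):
fields user.username, verb, objectRef.resource
| filter objectRef.resource in ["pods", "deployments", "cronjobs"] and (verb = "get" or verb = "list") and user.username not like /system/
| stats count(*) as read_count by user.username, objectRef.resource
| sort read_count desc
| limit 10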
An attacker may try to get secrets and use them to move to another cloud account, a DB cluster, etc. A method to hunt for an attacker that is trying to move laterally would be to check for a large number of "list", "get" and "watch" actions on secrets within a cluster. Combining this with checks for multiple "forbid" (or "allow", depending on your hunt method) decisions for these actions across namespaces, from specific users or serviceaccounts, would be a good starting point for an investigation. Example query:
fields user.username, objectRef.resource, verb, `annotations.authorization.k8s.io/decision`
| filter objectRef.resource = "secrets" and (verb = "list" or verb = "watch" or verb = "get") and user.username not like /system/ and `annotations.authorization.k8s.io/decision` like /forbid/
| stats count(*) as user_count by user.username, verb
| sort user_count desc
Another hunt method would be to see which top 10 serviceaccounts in the cluster have had the most "list", "get" and "watch" actions (verbs) in a certain period, as in the query below. If that number drastically changes, or a certain user.username (because serviceaccounts show up as user.username in audit logs) moves into the top 10, that’s an indicator to investigate too.
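A minimal sketch of that top-10 baseline (the system:serviceaccount: prefix match mirrors the note under "Credential Access" above):
fields user.username, verb
| filter user.username like /system:serviceaccount:/ and (verb = "list" or verb = "get" or verb = "watch")
| stats count(*) as action_count by user.username
| sort action_count desc
| limit 10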
Impact
Hypothesis
Users in our cluster are granted full access to their own namespaces. That being the case, users can create, patch and delete their deployments, cronjobs and even their configmaps. However, they are generally not granted access to create, delete or patch resources across namespaces. So, anomalous actions on resources across namespaces could mean that an attacker has compromised user credentials to perform denial-of-service actions.
The Data
An attacker who has a foothold in an environment can very easily start destroying resources, thereby causing a denial of service. A classic example would be where an attacker manages to compromise a serviceaccount token and starts to modify or delete pods, delete cronjobs, or manipulate or delete configmaps. If an attacker manages to modify or delete configmaps, the corresponding workload would never get spun up!
An example method to hunt for an attacker at this stage would be to see if a user.username performs multiple "list", "get" and "watch" actions to gather information about configmaps. An example query:
fields user.username, objectRef.resource, verb, `annotations.authorization.k8s.io/decision`
| filter objectRef.resource = "configmaps" and (verb = "list" or verb = "watch" or verb = "get") and user.username not like /system/
| stats count(*) as user_count by user.username, verb
| sort user_count desc
user.username, verb, user_count
shu, get, 252
joe, get, 221
john, get, 202
shu, watch, 188
raghav, get, 177
bob, get, 171
paul, get, 142
raghav, watch, 106
mike, get, 83
jes, get, 54
An attacker can then follow that up by “deleting”, “updating” or “patching” multiple configmaps. An example query for this would be:
fields user.username, objectRef.resource, verb, `annotations.authorization.k8s.io/decision`
| filter objectRef.resource = "configmaps" and (verb = "delete" or verb = "update" or verb = "patch") and user.username not like /system/
| stats count(*) as user_count by user.username, verb
| sort user_count desc
user.username, verb, user_count
shu, patch, 25
john, patch, 5
joe, patch, 1
Conclusion
As you can see, Kubernetes audit logs provide a wealth of information that Security teams can easily leverage to hunt for attackers. The logs can be very verbose and sometimes the fields can be difficult to understand, but they are easy to set up and readily available for consumption in cloud platforms. It is best to view these logs in combination with other logs, such as flow logs, container and system logs, and iptables logs, to get a comprehensive idea of what an attacker is trying to do. Having said that, these logs provide a lot of value by themselves, as an attacker would find it hard to evade the almighty Eye of the Audit Logs! Happy hunting!
References
- "Key Kubernetes Audit Logs For Monitoring Cluster Security". Key Kubernetes Audit Logs For Monitoring Cluster Security, 2021, https://www.datadoghq.com/blog/key-kubernetes-audit-logs-for-monitoring-cluster-security/.
- "Auditing". Kubernetes, 2021, https://kubernetes.io/docs/tasks/debug-application-cluster/audit/.
- Kuriel, Maor. "Mapping Risks And Threats In Kubernetes To The MITRE ATT&CK Framework". Blog.Aquasec.Com, 2021, https://blog.aquasec.com/mitre-attack-framework-for-containers.
- "The Kubernetes API". Kubernetes, 2021, https://kubernetes.io/docs/concepts/overview/kubernetes-api/.
Notes
"Mapping Risks and Threats in Kubernetes to the MITRE ... - Aqua Blog." 3 Jun. 2021, https://blog.aquasec.com/mitre-attack-framework-for-containers. Accessed 30 Jul. 2021.