Threat Hunting with Kubernetes Audit Logs
Analyzing Kubernetes audit logs to look for potential threats
From the official Kubernetes page - “The core of Kubernetes' control plane is the API server... The Kubernetes API lets you query and manipulate the state of API objects in Kubernetes (for example: Pods, Namespaces, ConfigMaps, and Events)”.
The Kubernetes API is quite powerful and malicious actors can use the API to attack a Kubernetes cluster. The Kubernetes audit logs provide visibility into what an attacker is trying to do by providing a clear audit trail of the API calls made to the Kubernetes API. This two-part article aims to take us through the basics of Kubernetes audit logs and how we can use these audit logs effectively to hunt for attackers in our Kubernetes clusters.
What is threat hunting?
A good way to start would be to first understand what threat hunting is. For us, threat hunting is a series of steps: understand the environment, form a hypothesis, look for probable anti-patterns that may indicate threats or threat actors in the environment, and check that we have the data to corroborate the hypothesis:
- Have an understanding of the environment. For example - "169.254.169.123 is the Amazon Time Sync Service and hence is used by EC2 instances for time sync (ntp)"
- Have a criterion for success. For example - "If the above is true, then all EKS instances would be communicating with 169.254.169.123 for time sync"
- Have a clear hypothesis. For example - “If the above statements are true, then EC2 instances communicating with IPs other than 169.254.169.123 for ntp would be anomalous and could mean that an attacker is trying to exfiltrate traffic via ntp”
- Know which assets are considered and where the data is sourced. For example - "iptables logs from the EKS nodes/servers"
- Use appropriate tools to gather relevant data and create baselines for relevant data. For example - "Collecting all iptables logs from servers and creating a list of all the servers communicating with this IP"
- Test your hypothesis: does the data support or reject your assertion?
- Find repeatable patterns in this data and use the baseline to automate your hunting
- Use the information gathered to appropriately enrich your detection pipeline
As we proceed through this article, there is a section dedicated to the MITRE ATT&CK® Framework under which we will discuss various hypotheses.
What are Kubernetes audit logs?
Kubernetes audit logs are generated to provide insight into the actions taken by users, applications or the Kubernetes control plane. In general, these logs provide details on the client, the session content, the server component handling the request, and the state of the request. More details about Kubernetes audit logs can be found in the official Kubernetes documentation on auditing over here - https://kubernetes.io/docs/tasks/debug-application-cluster/audit/.
As mentioned in the official documentation, Kubernetes audit logs tell us:
- What happened?
- When did it happen?
- Who initiated it?
- On what did it happen?
- Where was it observed?
- From where was it initiated?
- To where was it going?
These logs can be quite verbose, but they provide a wealth of information that can be useful for any threat hunter.
Why are audit logs important?
I have noticed that Kubernetes audit logs are generally not used as widely as container- and system-level events and logs to identify intrusions, and there aren’t very many articles on the Internet about using these logs for security.
Security visibility into events in the container and the system gives great insight into what an attacker is trying to do, but it requires us to add third-party tools to get visibility into syscalls, file modifications, etc., to piece together an attack.
Kubernetes audit logs, on the other hand, provide an audit trail of all the API calls made to the Kubernetes API. Anything an attacker does to query the API server to get a foothold in the environment can be tracked through the audit logs, and most detection mechanisms can be built with these logs alone!
In fact, with cloud-based Kubernetes offerings like Amazon EKS and Google GKE, audit logs are made readily available (with the audit policy pre-set) as soon as a cluster is spun up, and these logs are very easy to query, extract and alert on. Security teams can find immediate value in these logs when hunting for attackers. Let’s see how.
What does an audit log look like?
Below is an example of an audit log. I am using AWS Cloudwatch Logs for this, but you can use audit logs from pretty much any Kubernetes platform and they will look similar. The log is in JSON format:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Request",
  "auditID": "8a932edd-9c9a-4735-b982-8d2ca08cd6f4",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/inventory/endpoints/inventory",
  "verb": "get",
  "user": {
    "username": "serviceaccount:inventory:inventory",
    "uid": "3e94bfaf-8edc-4562-b2ed-44e9a9e565fb",
    "groups": [
      "serviceaccounts",
      "serviceaccounts:inventory"
    ]
  },
  "sourceIPs": [
    "10.16.4.17"
  ],
  "userAgent": "server/v0",
  "objectRef": {
    "resource": "endpoints",
    "namespace": "inventory",
    "name": "inventory",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2021-07-16T14:33:24.429585Z",
  "stageTimestamp": "2021-07-16T14:33:24.434300Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"inventory\" of Role \"inventory\" to ServiceAccount \"inventory/inventory\""
  }
}
As you can see, there are a number of different fields over here. As we move forward, we will go over a few important fields and how we can use them for hunting.
NOTE: These tests were run on EKS with Kubernetes version 1.17.
Let’s Hunt!
Before we start, it’s worth highlighting a few fields in the audit log that are of particular interest. They are:
- message
- timestamp (of course :) )
- annotations.authorization.k8s.io/decision
- annotations.authorization.k8s.io/reason
- objectRef.apiGroup
- objectRef.name
- objectRef.namespace
- objectRef.resource
- objectRef.subresource
- requestObject.subjects.0.kind
- requestObject.subjects.0.name
- requestObject.subjects.0.namespace
- responseStatus.code
- responseObject.metadata.name
- responseObject.metadata.uid
- responseObject.roleRef.kind
- responseObject.roleRef.name
- responseObject.metadata.namespace
- responseObject.subjects.0.kind
- responseObject.subjects.0.name
- responseObject.subjects.0.namespace
- responseObject.status.reason
- responseObject.metadata.annotations.kubernetes.io/psp
- sourceIPs.0
- stage
- user.groups.0
- user.groups.1
- user.username
- userAgent
- verb
Now let’s see how we can use some of these fields to perform simple but effective hunts. To keep it simple, we will hunt with only AWS Cloudwatch Logs Insights.
We start off by looking at some interesting fields which can be used in combination with other fields in most queries.
NOTE:
a. I am using Cloudwatch Logs for hunting in this article since I am using EKS for my tests. The hunting methods would be similar for GKE, AKS and other Kubernetes platforms since we are dealing with audit logs; however, the querying mechanism will vary across platforms.
b. Request, Response and general audit log lifecycle related information can be found in the official kubernetes documentation over here - https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
1. userAgent:
The userAgent string is the most basic place to start, and something that every SecOps engineer or threat hunter knows about! It is common knowledge that this field is set by the client and hence easily modifiable, but it is still a good starting point for a hunt.
Here’s a simple Cloudwatch Logs Insights query listing the top 10 userAgent strings. You could, for example, keep track of the top 10 userAgent strings within your network over the last month and alert if you see a new userAgent string bubbling up into that top 10.
NOTE: The subsequent outputs will be in CSV format for easier viewing:
fields userAgent
| filter ispresent (userAgent)
| stats count (*) as UA by userAgent
| sort UA desc
| limit 10
Let’s take this a step further, as userAgent strings alone may not be enough to find out who or what caused a new userAgent string to bubble up into your monthly top 10.
2. sourceIPs.0
Another useful hunting technique is looking for unrecognized IP ranges trying to access your clusters. Once again, IPs are something set by the client and hence not very reliable, but attackers can be lazy too and they make mistakes, so this is still a good hunting method. Here’s an example where we looked for the top IPs in our environment last month. If IP(s) bubble up into the top 10, that’s an anomaly worth investigating. You could also look for known malicious IPs and see if you find any of those in your audit logs.
fields sourceIPs.0
| filter ispresent (sourceIPs.0) and sourceIPs.0 not like /::1/
| stats count (*) as SRC_IP_Count by sourceIPs.0
| sort SRC_IP_Count desc
| limit 10
sourceIPs.0, SRC_IP_Count
172.16.119.35, 5425615
172.16.188.168, 4596703
10.16.4.151, 2990985
10.16.2.195, 2836075
10.16.2.7, 1414419
10.16.3.71, 811473
10.16.3.68, 797049
10.16.3.231, 669291
10.16.2.23, 570666
172.16.41.74, 546064
If your cluster is private, like ours here at Square, then you should check that your serviceaccount tokens have not been compromised. (I will be talking more about this in Part 2 of the article under "Credential Access" when we talk about the MITRE ATT&CK Framework.) A quick starting point is sketched below.
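Here is a minimal sketch of such a check: it surfaces serviceaccount identities calling the API from source IPs outside the internal ranges we expect. The /serviceaccount/ match and the 10.x / 172.16.x prefixes are assumptions based on the example log and outputs in this article; substitute your own cluster and corporate ranges.
fields sourceIPs.0, user.username
| filter user.username like /serviceaccount/ and sourceIPs.0 not like /^10\./ and sourceIPs.0 not like /^172\.16\./
| stats count (*) as external_sa_calls by sourceIPs.0, user.username
| sort external_sa_calls desc
| limit 10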
3. user.username
As the name implies, user.username provides the username of the user or serviceaccount that made the API call to the Kubernetes API. Adding this field to our query above gives us greater insight into who, along with which userAgent string, moved into our top 10. This is a more accurate check for anomalies.
fields userAgent, user.username
| filter ispresent (userAgent) and user.username not like /system/
| stats count (*) as UA by userAgent, user.username
| sort UA desc
| limit 10
userAgent, user.username, UA
kubectl/v1.16, shu, 12227
kubectl/v1.16, john, 11211
kubectl/v1.16, joe, 5865
"Mozilla/5.0", raghav, 4320
kubectl/v1.16, bob, 4056
OpenAPI-Generator/python, jason, 4023
kubectl/v1.16, jason, 3673
kubectl/v1.16, paul, 2774
kubectl/v1.16, mike, 2708
kubectl/v1.16, jes, 2437
Similarly, we could add the user.username field to our sourceIP search and get a more accurate picture of who, and from what IP, moved up into our top 10. As you can see over here, we are ignoring user.username values containing "system".
fields sourceIPs.0, user.username
| filter ispresent (sourceIPs.0) and user.username not like /system/
| stats count (*) as SRC_IP by sourceIPs.0, user.username
| sort SRC_IP desc
| limit 10
sourceIPs.0, user.username, SRC_IP
127.0.0.1, shu, 9758
127.0.0.1, john, 7742
127.0.0.1, joe, 3941
127.0.0.1, raghav, 1735
127.0.0.1, bob, 1635
127.0.0.1, paul, 1571
192.168.168.206, mike, 1189
192.168.168.209, carol, 1088
127.0.0.1, jes, 812
192.168.169.204, rodriguez, 776
Great! So we now know we can find which user/serviceaccount moved up to our top 10. From here, we can build on this to get more interesting information.
4. responseStatus.code
Status codes are something that threat hunters use extensively. A particular status code count going beyond its baseline value is an indication of anomalous behavior. For example, if the number of 403s (Forbidden) goes beyond a known good value, that should be cause for an investigation.
fields `annotations.authorization.k8s.io/decision`, responseStatus.code
| filter ispresent(responseStatus.code)
| stats count (*) as status_code_count by responseStatus.code
| sort status_code_count desc
| limit 10
responseStatus.code, status_code_count
200, 7772602
409, 685152
201, 202973
404, 33895
403, 6085
400, 626
422, 417
503, 223
101, 25
500, 10
However, in our tests we see that as users start to use tools such as Lens, looking only at a surge of 403s increasingly leads to false positives. So we need to expand our search to a combination of user.username, objectRef.resource, verb and objectRef.namespace. For example, a large number of 403s for a single user.username doing "get" or "list" on "secrets" across namespaces could be an indicator of an attacker performing reconnaissance and is worth investigating further.
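Here is a sketch of what that expanded search could look like. The exact filters are assumptions for illustration (in particular the /system/ exclusion and the focus on secrets); adjust them for your environment.
fields user.username, verb, objectRef.resource, objectRef.namespace
| filter responseStatus.code = 403 and objectRef.resource = "secrets" and verb in ["get", "list"] and user.username not like /system/
| stats count (*) as forbidden_secret_requests by user.username, verb, objectRef.namespace
| sort forbidden_secret_requests desc
| limit 10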
5. annotations.authorization.k8s.io/decision
In my opinion, annotations.authorization.k8s.io/decision is one of the most important fields in the audit logs for hunting. This field tells us what happened to a particular request: the decision is either an "allow" or a "forbid". This one field can be used in combination with many other fields to hunt effectively.
Let’s say that, as in point 3 under user.username, you find an anomalous user. We can now use annotations.authorization.k8s.io/decision to find out whether that user had multiple "forbid" decisions for their actions. Here’s an example where our query gets a little more complex:
fields `annotations.authorization.k8s.io/decision` , user.username
| filter `annotations.authorization.k8s.io/decision` like /forbid/ and user.username not like /system/
| stats count (*) as decision by `annotations.authorization.k8s.io/decision`, user.username
| sort decision desc
| limit 10
annotations.authorization.k8s.io/decision, user.username, decision
forbid, shu, 3210
forbid, john, 1770
forbid, joe, 880
forbid, raghav, 827
forbid, bob, 557
forbid, paul, 24
forbid, mike, 7
forbid, carol, 5
forbid, jes, 4
forbid, rodriguez, 3
6. verb
Verbs define what action was taken on a particular resource, as defined by Kubernetes RBAC. Common verbs include "get", "list", "watch", "create", "update", "patch" and "delete". A good hunting technique would be to look for several "forbids" for "list" by a particular user.username or unknown sourceIPs.0, coupled with similar objectRef.resource values (which I will describe later). This could be an indicator that an attacker is trying to perform reconnaissance. An example query:
fields `annotations.authorization.k8s.io/decision` as decision, responseStatus.code, responseStatus.status, objectRef.resource as resource
| filter responseStatus.code = 403 and verb = "list" and user.username not like /system/
| stats count (*) as status_code_count by responseStatus.code, decision, verb, user.username , resource
| sort status_code_count desc
| limit 10
responseStatus.code, decision, verb, user.username, resource, status_code_count
403, forbid, list, shu, namespaces, 1378
403, forbid, list, shu, services, 1007
403, forbid, list, john, services, 728
403, forbid, list, joe, namespaces, 554
403, forbid, list, raghav, namespaces, 458
403, forbid, list, bob, namespaces, 373
403, forbid, list, joe, services, 200
403, forbid, list, raghav, services, 6
403, forbid, list, bob, services, 4
403, forbid, list, john, namespaces, 4
A large number of "list" and "get" requests for Kubernetes secrets by a single user.username across a large number of namespaces would likewise (ideally) be anomalous and is worth investigating.
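One way to surface this is to count the distinct namespaces in which each user touches secrets. This is a sketch: the /system/ exclusion is an assumption to tune for your environment, and count_distinct is approximate for high-cardinality fields.
fields user.username, objectRef.namespace
| filter objectRef.resource = "secrets" and verb in ["get", "list"] and user.username not like /system/
| stats count_distinct (objectRef.namespace) as namespace_count, count (*) as secret_requests by user.username
| sort namespace_count desc
| limit 10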
7. objectRef.resource, objectRef.subresource
A couple of other useful fields are objectRef.resource and objectRef.subresource. These fields define the Kubernetes resource (such as pods, deployments, cronjobs, configmaps, etc.) and the sub-resource (like status, exec, etc.) of that resource. These two fields can be used in combination with many other fields to extract valuable information.
For example, if an attacker wanted to "exec" (similar to ssh) into a pod and then break out onto the host, we can easily hunt for that activity! (I will be talking more about "exec" in Part 2 of the article under "Execution phase" when we talk about the MITRE ATT&CK Framework). A sketch of such a hunt is shown below.
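Here is a minimal sketch of a query that surfaces pod exec activity. The /system/ exclusion is an assumption; tune it for your environment.
fields user.username, objectRef.namespace, objectRef.name
| filter objectRef.resource = "pods" and objectRef.subresource = "exec" and user.username not like /system/
| stats count (*) as exec_count by user.username, objectRef.namespace, objectRef.name
| sort exec_count desc
| limit 10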
This is the end of the first part of this article. In the second part we will go over how to use these audit log fields (and other new audit log fields) effectively for hunts using the MITRE ATT&CK® Framework. See you soon!