Mac OS lover, Docker fan, Go explorer, Python geek, Trello addicted.
Everyone uses Docker, and normally when something is so widespread, there’s always someone trying to figure out how to leverage that diffusion to do bad things (you know what happened in Breaking Bad). Only a few months ago someone pushed malicious software (cryptominers) into a lot of images: this happened because, despite the fact that everyone uses Docker, not many people are really aware of Docker security. Fortunately, over the years Docker has put in place several tools and features to avoid (or at least drastically reduce) the damage that bad actors can cause in your containerized systems or applications.
If you are interested in discovering more about that, I will say something about:
Let’s think about a potentially real scenario: you have an application running on one or more Docker images, which may or may not be scaled out to many containers (it doesn’t matter for the purpose of the reasoning). Let’s imagine that this application runs some malicious code and/or code with bugs, either well known or, even worse, not known yet. This leads to your system being hacked: someone takes control of the Docker container running your application by using some exploit and, even worse, escalates privileges to your host system too.
The first question is: what should you do if this happens to your application?
There’s a manual incident response. These are the main things you should keep in mind:
You can always avoid what you already know (it already happened to you) and you should act to avoid as much as possible what you don’t know (it never happened to you).
Thus, even if the damage is huge, having traces will prevent you (or your replacement, if you caused a mess and got fired 😅) from repeating the same mistakes.
As we said, over the years Docker has put in place different tools and features to deal with the security of your containers: this starts from the engineering of the images and also includes the runtime.
The Docker commands to work with isolation are at least two:
One of the reasons Docker containers and services are so powerful is that you can connect them together, or connect them to non-Docker workloads. Docker’s networking subsystem is pluggable, using drivers. Several drivers exist by default, and provide core networking functionality.
The docker network command, as stated by the documentation, lets you manage networks. What does it mean? It means that if you have to isolate something from something else, the network layer is a good starting point to apply a cutoff that may already be sufficient. Let’s see the commands in detail:
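|Command|What it does|
|---|---|
|docker network connect|connect a container to a network|
|docker network create|create a network|
|docker network disconnect|disconnect a container from a network|
|docker network inspect|display detailed information on one or more networks|
|docker network prune|remove all unused networks|
|docker network rm|remove one or more networks|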
Of course, the most useful action if you are still admin of your system (and you should be, otherwise just RUN, FORREST, RUN!!!) is to detach the network interface of the container(s). Which one? Well, this is more related to the way you monitor your application, but usually you should be able to identify where the cause is if you’re the owner (or you know the owner) of the application. What do I mean? For instance, if your homepage gets hacked, you should be able to know at least which part of the application serves the homepage and start from there, etc.
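To make the network cutoff concrete, here is a minimal Python sketch that builds one docker network disconnect command per network a container is attached to. It assumes you already captured the JSON printed by docker inspect for the container; the container name (homepage) and the network names are hypothetical, and the sample is heavily trimmed. The script only prints the commands, it does not run them.

```python
import json


def disconnect_commands(inspect_json: str) -> list[list[str]]:
    """Build one `docker network disconnect` argv per attached network."""
    commands = []
    for container in json.loads(inspect_json):
        # `docker inspect` reports names with a leading slash.
        name = container["Name"].lstrip("/")
        networks = container["NetworkSettings"]["Networks"]
        for network in sorted(networks):
            commands.append(["docker", "network", "disconnect", network, name])
    return commands


# Hypothetical, heavily trimmed `docker inspect` output for illustration.
SAMPLE = json.dumps([{
    "Name": "/homepage",
    "NetworkSettings": {"Networks": {"frontend": {}, "backend": {}}},
}])

for argv in disconnect_commands(SAMPLE):
    print(" ".join(argv))
```

Printing instead of executing is a deliberate choice here: during an incident you probably want to read each command before cutting a container off.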
The well-known docker stop terminates your container: if your container is orchestrated by some other tool (ECS, k8s, etc.), it may not be enough and you will most probably have to act on the deployment, service, or any other abstraction layer that manages your container for you.
The point is: a stopped container is something you can work on to understand what happened (for instance by looking at the logs with the docker logs command), and in general stopping it is a good idea to avoid escalation of the problem.
Although docker diff is a really powerful command, it doesn’t seem to be well known: what does it do and why is it important?
To quote the official doc, docker diff inspects changes to files or directories on a container’s filesystem.
So this command will show any differences in the filesystem inside the container, compared to the image it was started from. In our example above, we could see whether any files related to our hacked service have been created or modified. If a hacker had installed any tools or scripts, we would see those too.
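As a small illustration of how to read that output, the following Python sketch groups docker diff lines by change type. It relies on the documented A (added), C (changed) and D (deleted) prefixes; the sample paths are invented for the example and do not come from a real incident.

```python
from collections import defaultdict

# Change-type prefixes printed by `docker diff`.
KINDS = {"A": "added", "C": "changed", "D": "deleted"}


def summarize_diff(output: str) -> dict[str, list[str]]:
    """Group the paths reported by `docker diff` by kind of change."""
    summary = defaultdict(list)
    for line in output.splitlines():
        if not line.strip():
            continue
        kind, path = line.split(maxsplit=1)
        summary[KINDS[kind]].append(path)
    return dict(summary)


# Hypothetical output, as if captured from `docker diff <container>`.
SAMPLE = """C /etc
A /etc/cron.d/miner
C /tmp
A /tmp/xmrig
D /var/log/auth.log
"""

for kind, paths in summarize_diff(SAMPLE).items():
    print(kind, paths)
```

Added files in places like /tmp or /etc/cron.d and deleted logs are usually the first things worth looking at.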
In information security, computer science, and other fields, the principle of least privilege requires that in a particular abstraction layer of a computing environment, every module - such as a process, a user, or a program, depending on the subject - must be able to access only the information and resources that are necessary for its legitimate purpose.[1] In the container world, this translates directly into:
A (process/container) should only have access to the data and resources essential to perform its function.
Take a moment to reflect on this: it is crucial to understand, because the violation of this principle is most probably the main cause of the exploits found in most of the Docker images available online.
Ok, now that we have seen how to isolate and/or stop a container, let’s see how to mitigate an attack by following some (almost always) simple, smart rules.
The main goal of security is to make attacks difficult - ideally, impossible - to carry out for anyone who doesn’t have privileged access to your system. How can you do this with Docker?
By default, users are not namespaced in Docker. What does namespaced mean and why is this important to know? Well, because namespaces are a fundamental aspect of containers on Linux (yes, Docker didn’t invent anything, I already wrote about this once). Thus, let’s recall this concept first.
Namespaces are a feature of the Linux kernel[2] that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for these resources in the various sets of processes, but those names referring to distinct resources. Examples of resource names that can exist in multiple spaces, so that the named resources are partitioned, are process IDs, hostnames, user IDs, file names, and some names associated with network access, and interprocess communication.
Thus, back to USER and Docker, this is what you should know:
|UIDs are the same on the host and in the container||BOOOOOOOM;|
|Root in the container is root on the host||BOOOOOOOM^2;|
|Consider the possibility of container breakout||yes, containers are not jails (ready: fight);|
From these three fundamental pieces of information (practical examples in a while) you should learn that attackers should be constrained within the container: you wouldn’t run apps in VMs as root, and thus you should not do it in a container either.
Since Docker 1.10, user namespaces can be used to automatically map users in the container to high-numbered users on the host. You configure the mapping on the daemon, and not per container, due to some complications with the ownership of shared layers. Unfortunately, there are still some problems with volume permissions and volume plugins, you can’t use --pid=host or --net=host (to share the network interface), and some privileged operations are not allowed (like mknod to create device files).
Ok so… how do you prevent privilege escalation?
The following statement should be like a mantra:
The best way to prevent privilege-escalation attacks from within a container is to configure your container’s applications to run as unprivileged users.
For containers whose processes must run as the root user within the container, you can re-map this user to a less-privileged user on the Docker host. The mapped user is assigned a range of UIDs which function within the namespace as normal UIDs from 0 to 65536, but have no privileges on the host machine itself. Thus…
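To see what the remapping means in numbers, here is a minimal Python sketch of the arithmetic. It assumes the subordinate range comes from a line in /etc/subuid in the usual name:start:count format; the dockremap entry and its starting value are only an illustrative assumption, so check the file on your own host.

```python
def parse_subuid(line: str) -> tuple[str, int, int]:
    """Parse one `/etc/subuid` entry of the form name:start:count."""
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)


def host_uid(container_uid: int, start: int, count: int) -> int:
    """Map a UID inside the user namespace to the UID seen by the host."""
    if not 0 <= container_uid < count:
        raise ValueError("UID outside the mapped range")
    return start + container_uid


# Hypothetical entry; real start values differ from host to host.
_, START, COUNT = parse_subuid("dockremap:100000:65536")

print(host_uid(0, START, COUNT))     # root inside the container
print(host_uid(1000, START, COUNT))  # a regular user inside the container
```

The point is that UID 0 inside the container lands on an ordinary, unprivileged UID on the host.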
Creating a USER in a Docker container is simple. Start by creating the user in the Dockerfile:
Then, you can simply do anything that needs root privileges like installing packages and so on:
Then change to the user:
Sometimes you need root privileges in the start-up script, for example when you have to change the ownership of files. In this case you can’t use the USER statement, and you should drop privileges in the start-up script itself.
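The three steps above (create the user, do the root-only work, switch user) can be sketched as a Dockerfile, embedded here in a tiny Python check that reports which user the image ends up running as. The image name, the app user and the installed package are placeholders, and the parser is deliberately naive: it ignores line continuations and assumes the base image itself runs as root.

```python
# Hypothetical Dockerfile following the three steps described above.
DOCKERFILE = """\
FROM debian:stable-slim
RUN groupadd -r app && useradd -r -g app app
RUN apt-get update && apt-get install -y --no-install-recommends curl
USER app
CMD ["curl", "--version"]
"""


def final_user(dockerfile: str) -> str:
    """Return the user in effect at the end of the Dockerfile."""
    user = "root"  # assumption: the base image does not set USER itself
    for line in dockerfile.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = "root"  # every build stage starts over
        elif instruction == "USER":
            user = parts[1].strip()
    return user


def runs_as_root(dockerfile: str) -> bool:
    """True when the last USER (ignoring any :group part) is root or 0."""
    return final_user(dockerfile).split(":")[0] in ("root", "0")


print(final_user(DOCKERFILE), runs_as_root(DOCKERFILE))
```

A check like this could run in CI to catch images that forget the final USER statement.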
It’s really simple to run a read-only filesystem: there’s a flag available to enable this feature and it’s called --read-only. Just try it out by launching the following command:
And this can be combined also with the volume directive:
One option I found useful when you have to deal with filesystem issues, but also for security reasons, is the use of temporary in-memory (non-persistent) volumes that are deleted when the container dies. More on them here.
Be careful about deleting data in Dockerfiles: remember that every step is a layer (a tarball that is shared, preserved and, most importantly, available for manipulation), so things like the following don’t work:
because your secret is still stored in the layer that added it, and is thus available to anyone who has the image. This is simple to avoid, because it only requires you to remember that every layer is defined by a statement. Thus something like this
is almost OK, even if maybe not so best-practice compliant (see at the end). Of course, avoid copying and deleting SSH private keys in separate steps. Those kinds of things should, in the worst case, be injected as environment variables from outside, but even in this case they would be available in the env, thus… if you have to use secrets, just move them outside of your logic and trust your container somewhere else (more about this later).
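To show why deleting in a later step does not help, here is a minimal Python sketch that fakes two image layers as in-memory tarballs. It assumes the whiteout convention used by image layers, where a deletion is recorded as an empty .wh.&lt;name&gt; entry in the newer layer; the file name and its content are made up, and real layers carry much more metadata.

```python
import io
import tarfile
from typing import Optional


def make_layer(files: dict[str, bytes]) -> bytes:
    """Pack files into an uncompressed tarball, like a (simplified) layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_layer(layer: bytes, name: str) -> Optional[bytes]:
    """Return the content of `name` in this layer, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
        try:
            member = tar.extractfile(name)
        except KeyError:
            return None
        return member.read() if member else None


# Step 1 copies the secret in; step 2 "removes" it with a whiteout marker.
layer_copy = make_layer({"secret.txt": b"hunter2"})
layer_rm = make_layer({".wh.secret.txt": b""})

print(read_from_layer(layer_rm, "secret.txt"))    # None
print(read_from_layer(layer_copy, "secret.txt"))  # b'hunter2'
```

The second layer only hides the file from the final filesystem view; the first layer still ships the bytes.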
To reduce privileges and contain attacks, you can even segregate containers by context. What does this mean? Well, containers run on hosts, either virtual or bare-metal machines, and hosts provide an extra layer of isolation. This can be used to segregate containers by security context, placing low-risk containers on one set of hosts and high-risk ones on another set, in such a way that, in case of escalation, at least the critical part will be physically separated.
The point is that, unfortunately, anyone who can start Docker containers effectively has root access to your system, since they can mount and read/change any file: this is super easy to verify. Open a shell and try to run the following:
Now try to run:
So how do you avoid this? Let’s figure out what you can actually do from inside a container.
There are two special permissions that can be set on executable files: set user ID (setuid) and set group ID (setgid). These permissions allow the file to be executed with the privileges of its owner or group. For example, if a file is owned by the root user and has the setuid bit set, no matter who executes the file it will always run with root user privileges. If you have access to docker you could even start a container, create a suid binary and docker cp (docker copy, see here) the binary out of the container. Chances are that your application does not need any elevated privileges.
I found a nice article from Red Hat about how to defang your binaries: I ported the key points into the following two paragraphs.
To get a list of binaries with special permissions in a container image, the following command can be used:
Run it over your image. At least now you know what could be dangerous. So, how do you remove these privileges?
You can then defang the binaries with the chmod a-s command to remove the suid bit. For example, you can defang your Debian image by running:
And the || true? It lets you ignore any errors from find. The setuid and setgid binaries run with the privileges of the owning user or group rather than those of the user who executes them. These are normally used to allow users to temporarily run with the escalated privileges required to execute a given task, such as setting a password.
It’s more likely that your Dockerfile will rely on a setuid/setgid binary than your application. Therefore, you can always perform this step near the end, after any such calls and before changing the user (removing setuid binaries is pointless if the application runs with root privileges).
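The find-then-chmod idea can be sketched in Python as follows. This is not the command from the Red Hat article, just an equivalent illustration using os.lstat and the stat module: it lists regular files with the setuid or setgid bit under a directory and clears those bits. The /usr/bin starting point in the demo is an arbitrary choice.

```python
import os
import stat

# Permission bits that make a binary run as its owner or group.
SPECIAL_BITS = stat.S_ISUID | stat.S_ISGID


def find_special(root: str) -> list[str]:
    """List regular files under `root` with the setuid or setgid bit set."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISREG(mode) and mode & SPECIAL_BITS:
                found.append(path)
    return sorted(found)


def defang(root: str) -> list[str]:
    """Clear setuid/setgid on every match (like `chmod a-s`); return them."""
    changed = find_special(root)
    for path in changed:
        mode = stat.S_IMODE(os.lstat(path).st_mode)
        os.chmod(path, mode & ~SPECIAL_BITS)
    return changed


if __name__ == "__main__":
    # Read-only demo: just list what a typical system ships in /usr/bin.
    for path in find_special("/usr/bin"):
        print(path)
```

Only call defang on a filesystem you are building (for example inside an image build step), not on your workstation.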
The Linux kernel defines various capabilities (CAP_AUDIT_CONTROL, CAP_CHOWN, CAP_FSETID, CAP_LEASE, CAP_MKNOD, etc.). Directly from the man page: for the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process’s credentials (usually: effective UID, effective GID, and supplementary group list).
Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.
Docker containers get a bunch of these capabilities by default, but some of them are not present: try running this command from a shell.
--privileged lets you run with almost all of them, while the flags --cap-add and --cap-drop respectively add and remove them. Thus, if we drop CAP_CHOWN and run something like
we prevent the container from changing file ownership: the chown command will fail with an error!!
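If you want to see which capabilities a process really has, you can decode the CapEff bitmask that the kernel exposes in /proc/&lt;pid&gt;/status. The Python sketch below does that for the first 32 capability numbers. The mask used in the demo is an assumption: it is the value commonly reported for a container started with Docker’s default set, and it may differ on your version.

```python
# Capability names indexed by bit number (bits 0..31 of the kernel list).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]


def decode_caps(mask_hex: str) -> list[str]:
    """Turn a CapEff-style hex mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def drop_cap(mask_hex: str, name: str) -> str:
    """Return the mask with one capability bit cleared."""
    mask = int(mask_hex, 16) & ~(1 << CAP_NAMES.index(name))
    return f"{mask:016x}"


# Assumed mask for a container with Docker's default capabilities.
DEFAULT_MASK = "00000000a80425fb"

print(decode_caps(DEFAULT_MASK))
print(drop_cap(DEFAULT_MASK, "CAP_CHOWN"))
```

Dropping CAP_CHOWN simply clears bit 0 of the mask, which is why chown stops working even for root inside the container.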
Secure computing mode (seccomp) is a Linux kernel feature. What does it do? In short, seccomp allows a process to make a one-way transition into a secure state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors.
This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check if your kernel supports seccomp:
The nice thing is that you can use it to restrict the actions available within the container. The seccomp() system call operates on the seccomp state of the calling process. You can use this feature to restrict your application’s access by running with this:
There’s a default seccomp profile (it can be found here) that provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+ (as stated in the Docker doc).
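To give an idea of what a custom profile looks like, here is a minimal Python sketch that generates one as JSON. It assumes the profile layout accepted by Docker (a defaultAction plus a syscalls list whose entries have names and action); the three chmod-family syscalls are just an example of something to block, and the output file name is up to you.

```python
import json


def deny_profile(syscalls: list[str]) -> dict:
    """Build a profile that allows everything except the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_ERRNO"},
        ],
    }


# Example only: make the chmod family of calls fail inside the container.
profile = deny_profile(["chmod", "fchmod", "fchmodat"])
print(json.dumps(profile, indent=2))
```

You would save the output to a file and pass it with --security-opt seccomp=&lt;file&gt;. Keep in mind that a custom profile replaces the default one, so a short deny-list like this is more permissive than what you get out of the box; it is here to show the format, not as a recommendation.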
Now the point is: which system calls do you need inside your container? Who knows. Good luck!
But hey! There’s a tool for everything: docker-slim can help you build your seccomp profile.
If you only want to auto-generate a Seccomp profile (along with other interesting image metadata) use the profile command.
Feel free to copy the generated profile. You can use the generated Seccomp profile with your original image or with the minified image. But docker-slim does so much more, like minifying your images: if you want to auto-generate a Seccomp profile AND minify your image, use the build command (more in the official repo).
Another important thing that can be done to avoid attacks (especially DoS attacks or similar) is limiting resources. This is something that Docker can do natively, even outside of an orchestration environment, and it can be useful when you have high- and low-priority containers.
How can you keep CPUs under control across containers, and thus act differently if you notice something weird? Well, containers get a default relative weighting of 1024. Changing this value is as simple as running:
If you have another container running at the default 1024, this implies about 66% of the CPU for it and 33% for the first one when both are busy. With two other flags you can even control the CPU quota and the period over which it applies for each container:
or even use the more recent --cpus 1.0 to specify directly the amount of CPUs you want to use. Nice, but working on memory could be more useful.
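The arithmetic behind these flags is easy to check with a few lines of Python. The sketch below assumes two busy containers competing for the same CPUs, one started with --cpu-shares 512 and one left at the default 1024, and uses the default --cpu-period of 100000 microseconds; the container names are placeholders.

```python
def cpu_split(shares: dict[str, int]) -> dict[str, float]:
    """Percentage of CPU each container gets when all of them are busy."""
    total = sum(shares.values())
    return {name: 100 * value / total for name, value in shares.items()}


def cpus_from_quota(quota_us: int, period_us: int = 100_000) -> float:
    """Number of CPUs that a --cpu-quota / --cpu-period pair allows."""
    return quota_us / period_us


split = cpu_split({"low-priority": 512, "default": 1024})
for name, percent in split.items():
    print(f"{name}: {percent:.1f}%")

# A quota of 50000 over the default period equals --cpus 0.5.
print(cpus_from_quota(50_000))
```

Shares only matter under contention: an idle neighbour leaves the whole CPU to whoever needs it, while quotas are a hard cap.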
By default, a container can use as much memory and swap as it likes. This could let an attacker, or just some bug, cause memory problems and thus stop your application from working properly. If you accept some performance overhead, you can control both memory and swap by using flags like:
and avoid bad things like, just to name one, fork bomb attacks.
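One detail that is easy to get wrong is that --memory-swap is the total of memory plus swap, not the swap alone. The small Python helper below builds the two flags from separate memory and swap amounts; the function name and the choice of megabytes are my own assumptions for the example.

```python
def memory_flags(memory_mb: int, swap_mb: int = 0) -> list[str]:
    """Build --memory / --memory-swap flags from separate amounts.

    Docker expects --memory-swap to be memory + swap, so a value equal
    to --memory means the container gets no swap at all.
    """
    if memory_mb <= 0 or swap_mb < 0:
        raise ValueError("memory must be positive and swap non-negative")
    total_mb = memory_mb + swap_mb
    return ["--memory", f"{memory_mb}m", "--memory-swap", f"{total_mb}m"]


# 256 MB of RAM plus 256 MB of swap.
print(" ".join(["docker", "run"] + memory_flags(256, 256) + ["<image>"]))
```

Here &lt;image&gt; is a placeholder for whatever you actually run.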
There’s no safe place to store secrets. You should avoid copying them inside the image, so the first idea is to load them into the magical safe place called ENVIRONMENT or ENV, as suggested by the twelve-factor app. Almost everyone has done something like this at least once:
At least, I do this every day. It’s super simple and easy to use, but if you do:
The major problem is that it makes the secret too visible, both from linked containers and in the output of the env command. Then, of course, you cat your credentials files every day, so everything is in your shell history anyway XD.
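How visible is "too visible"? Every process started inside the container inherits the environment, so any child process (a shell, a debug tool, a crashing library that dumps its env) can read the secret. A minimal Python sketch of that, with a made-up variable name and value:

```python
import os
import subprocess
import sys


def child_sees(var: str) -> str:
    """Start a child process and return what it finds in its environment."""
    code = f"import os; print(os.environ.get({var!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hypothetical secret, as if it had been injected with `-e DB_PASSWORD=...`.
os.environ["DB_PASSWORD"] = "hunter2"

print(child_sees("DB_PASSWORD"))
```

Nothing special is needed to read it, which is exactly the problem.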
Now the point is: if not inside the ENV, where the f**k do I have to place my secrets?!?! Maybe mounting a read-only file could seem an option:
but who likes files?! And is this nested_dir_in_host/key_file really secure? The solution is to use a Vault (never did it) or, even better, to move your authentication out and define roles and authorization outside. The process should be something like: I trust my immutable object and I provide it with temporary credentials it can use to do things. This decouples your image (your vulnerable part) from your system (the locked part).
Unfortunately, even by applying everything above you haven’t solved the initial problem, because in the end you cannot lock everything. The general guidance is, first of all, to avoid vulnerable software: if you have a bug in a library your application uses, the only real solution is to replace that library, because running old and known-vulnerable code will get you hacked. Then prefer minimal images (Alpine-based ones are often among the most checked and least vulnerable, but of course it depends on the images you are using). And finally, since keeping the images you build up to date with CVEs is an almost impossible task, and it is even harder to know which libraries are in use, you can leverage an image scanner that automatically scans images for known vulnerabilities.
Thank you everybody for reading!