My Love Hate Relationship with Docker and Container Orchestration Systems
Docker was first getting big while I was working for an open source shop in New Zealand. At work we’d joke about containers, mostly because of our misconceptions. “Aren’t they based on LXC containers, which are full of security holes?” a co-worker and I would ask. When CoreOS was released, we laughed at first, but we also realized people were taking containers seriously. Late one night at a bar, some German developers who ran a name registrar talked about how amazing Docker was and how they were already running it in production.
It wasn’t until I took a contract in Seattle that I was really exposed to Docker. I worked at a shop that ran a CoreOS instance and eventually switched to a company-wide DC/OS and Marathon based platform. I learned a lot about containers and embraced many of their advantages, while also becoming incredibly frustrated with their limitations. Despite the issues, I started to prefer using containers, and even wrote a tool for managing the containers I use to host this website. In this post I intend to cover what I’ve learned about containers, their strengths, their limitations and some good ways to incorporate them into your infrastructure.
Using Official Containers
I knew a Windows database administrator who was learning how to manage Postgres on Linux. He had downloaded a tar file and placed it in the root file system. I introduced him to the magic that is Linux package management and explained how to use tools like yum, or the equivalent for a given Linux distribution. Ideally you’d further automate package management using some type of configuration management system such as Ansible. Even in the Docker world, administrators typically use some type of configuration management system to set up the Docker ecosystem (which we’ll cover later).
Typically in the Linux world, it’s best to use a distribution’s official packages whenever possible. If the service or application isn’t available in the official repository, or if the official package is out of date, the second best option is to see if the application maintainers have an official package repository. Failing that, administrators may end up using independently maintained repositories, such as Fedora EPEL, PPAs in Ubuntu or the openSUSE Build Service.
Docker Hub has a series of official containers for many different services, such as mysql, postgres, redis and others. Most of the databases are configured in a similar manner, using the same types of environment variable settings and similar volume mappings. Official software packages for many Linux distributions typically follow specifications that ensure services and libraries are built, configured and installed in a standard manner. At first glance, the same would seem to be true of official Docker containers. However, once you start working with individual containers and looking at their source code, you’ll quickly find this is not the case.
The official Apache httpd container is based on debian:jessie-backports and pulls some httpd dependencies from the system using apt-get, but then pulls the httpd server source, compiles it and installs it into /usr/local/. The official Nginx container is based on debian:stretch-slim and pulls some dependencies via apt-get, but then pulls Nginx Debian packages from the official Nginx repositories. The official PHP container with Apache httpd support uses a debian:stretch-slim base and installs the system version of Apache httpd via apt-get, but then builds the PHP executable from scratch, and has instructions for adding custom extensions by extending the Dockerfile and using their custom helper commands.
It’s important to note that when I use the term official, I am referring to the containers produced by the Docker creators and found on Docker Hub under the _ namespace. They can be pulled without a namespace qualifier with commands like docker pull nginx (versus docker pull someDevsRepo/nginx). I will say that, looking through all the source code, each official repo does verify all source and binary packages with official PGP keys or SHA hashes. However, after that, they’re all fairly different, using different base images, sometimes using the system binary package and sometimes building from source. It’s almost like they are all maintained by entirely different teams with some very loose base guidelines. It’s a mess.
I would recommend only using containers that are published by the actual software maintainers, such as the container for Certbot that’s maintained by the EFF, or the containers for Mastodon and Roundcube. If the service you want to use doesn’t publish to either Docker Hub or their own repository, but does have a Dockerfile in their git repository and uses version tags for releases, I’d recommend setting up your deployment process to clone their repository, run git checkout on a specific version tag, and build the image yourself. You can automate checking that repository for new version tags as part of your update process.
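The clone, checkout and build steps can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name build_tagged_image, the repository URL and the image name are hypothetical, and it assumes the upstream Dockerfile sits at the root of the repository.

```shell
# Hypothetical helper: build an image from a specific upstream release tag.
# Usage: build_tagged_image <git-url> <version-tag> <image-name>
build_tagged_image() {
    repo_url="$1"
    tag="$2"
    image="$3"
    workdir="$(mktemp -d)" || return 1

    # Clone the upstream repository, pin it to the release tag, and build
    # from the Dockerfile assumed to be at the repository root.
    git clone --quiet "$repo_url" "$workdir/src" &&
        git -C "$workdir/src" checkout --quiet "$tag" &&
        docker build -t "$image:$tag" "$workdir/src"
    status=$?

    # Remove the temporary checkout whether or not the build succeeded.
    rm -rf "$workdir"
    return "$status"
}
```

A call might look like build_tagged_image https://example.invalid/app.git v1.2.3 example/app. For the update check, git ls-remote --tags against the same URL lists the tags the upstream has published, which a scheduled job could compare against the last tag it built.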
In every other case, it’s probably best to simply build your own Dockerfile and container. If the service you want is available in Ubuntu or Alpine’s package manager, I’d recommend using the apt-get or apk commands respectively in your Dockerfiles, so that updating to a new version will be as simple as rebuilding those containers.
Security in Containers
Containers are typically based on an operating system container, such as Ubuntu or Alpine. As we’ve seen, Dockerfiles typically install dependencies using the operating system package manager. Additionally, the base python and ruby containers support loading programming language dependencies via pip/requirements.txt and bundle/RubyGems respectively.
Once containers are deployed, there could be security updates that come out for the packages within those containers. In a fast moving development environment, one might make the argument that containers get built and deployed so often that they’d get those updates on each successive release. However, what if development stops on a service? What if containers are left running in the wild for several months or even years without having new features or requirements?
Even before Docker containers, there were, and still are, a considerable number of open source programs that embedded static versions of their dependencies. Instead of linking out to system libraries, large monolithic programs such as Mozilla Firefox and LibreOffice tended to embed libraries to increase stability and predictability and to reduce bugs. Silvio Cesare, an Australian security researcher, examined several applications which embedded their own versions of zlib, libpng, libjpeg, audio/video codecs and many other libraries. The added stability they gained came at the expense of not always being aware of security updates. It led him to develop a tool called Clonewise, which attempts to automatically scan source and binaries to find outdated versions of compiled-in libraries [1]. Companies and teams that are security aware need to build the same type of scripts into the pipelines they already use to create and deploy containers, so that outdated dependencies and pending security updates are detected.
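As a very rough starting point for such a script, the function below asks an image's own package manager which installed packages have newer versions available. It is a sketch, not a vulnerability scanner: the name list_upgradable is hypothetical, it assumes the image contains a shell plus either apk or apt-get, it needs network access to refresh the package index, and it says nothing about libraries that were compiled from source or vendored into an application.

```shell
# Hypothetical helper: list packages inside an image that have updates pending.
# Usage: list_upgradable <image>
list_upgradable() {
    image="$1"

    # Override the entrypoint so the check runs instead of the service itself.
    docker run --rm --entrypoint sh "$image" -c '
        if command -v apk >/dev/null 2>&1; then
            # Alpine: refresh the index, then show packages older than the repo version.
            apk update >/dev/null && apk version -l "<"
        elif command -v apt-get >/dev/null 2>&1; then
            # Debian/Ubuntu: refresh the index, then show upgradable packages.
            apt-get update -qq && apt list --upgradable 2>/dev/null
        else
            echo "no supported package manager found" >&2
            exit 1
        fi'
}
```

Running this on a schedule against every deployed image tag, and rebuilding any image that reports pending updates, covers the case where development on a service has stopped but the container is still running.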
Exploits within containerized services are typically limited to the containers themselves, any volumes mounted within them, and any network connections they can make to other containers and services. This is why it’s a terrible idea to expose the Docker socket using something like -v /var/run/docker.sock:/var/run/docker.sock, because it can grant an attacker access not only to other containers, but root access to the underlying operating system.
Keep in mind that an application in a container runs exactly like an application on the host OS. Processes within a container run with the same performance attributes that they would outside the container, with little to no additional overhead [2]. All container processes show up in the host’s standard process list when running ps -ef. They are somewhat isolated thanks to Linux namespaces and cgroups. They essentially run in a very fancy chroot jail that can have memory and process limits, but they still share the same kernel as the host operating system. If someone finds an exploit within the application running in the container, and then also discovers an unpatched cgroup or kernel vulnerability, that attacker could compromise the host.
Memory and Resources in Containers
If you run top inside a container, you’ll see all of the host’s system memory and processors. If your container engine runs with memory limits enabled, and the process within the container goes over its memory limit, the process may be killed and the container restarted. If the programming language you’re using is not container aware, it may not check for cgroup specific memory limits, and you’ll get into a situation where your services restart constantly because they keep running out of memory.
The Java VM is particularly bad with this, as it will often cache and eat up as much usable memory as is available. There are several scripts on various GitHub accounts and StackOverflow posts that are used as entry points for Java apps within Docker containers. Most of them check /sys/fs/cgroup/memory/memory.limit_in_bytes and then adjust the -Xmx Java option as necessary.
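A minimal sketch of that kind of entry point helper is shown below. The function name heap_mb_from_limit, the default of giving 75% of the limit to the heap, and the cutoff used to recognise an unlimited cgroup are all assumptions made for illustration; they are not taken from any particular published script.

```shell
# Hypothetical helper: derive a JVM heap size (in MB) from a cgroup limit file.
# Usage: heap_mb_from_limit <limit-file> [percent-of-limit]
# Prints nothing when no usable limit is found, so the caller can fall back
# to the JVM defaults.
heap_mb_from_limit() {
    limit_file="$1"
    percent="${2:-75}"   # assumed share of the limit given to the heap

    [ -r "$limit_file" ] || return 0
    limit_bytes="$(cat "$limit_file")"

    # Reject empty or non-numeric values (cgroup v2 uses "max" for unlimited).
    case "$limit_bytes" in
        ''|*[!0-9]*) return 0 ;;
    esac

    # cgroup v1 reports a huge number when unlimited; treat anything with
    # more than 15 digits (about a petabyte) as "no limit" (assumption).
    [ "${#limit_bytes}" -gt 15 ] && return 0

    limit_mb=$(( limit_bytes / 1048576 ))
    echo $(( limit_mb * percent / 100 ))
}

# Example use in an entry point script (app.jar is a placeholder):
#   heap_mb="$(heap_mb_from_limit /sys/fs/cgroup/memory/memory.limit_in_bytes)"
#   if [ -n "$heap_mb" ]; then
#       exec java "-Xmx${heap_mb}m" -jar /app.jar
#   else
#       exec java -jar /app.jar
#   fi
```

The heap is deliberately set below the container limit because the JVM also needs memory outside the heap, such as thread stacks and class metadata. On hosts using cgroup v2, the equivalent limit file is /sys/fs/cgroup/memory.max rather than the v1 path mentioned above.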
This isn’t limited to Java, of course. Any language that uses an interpreter and garbage collection can face memory limit issues. Similar scenarios have been reported in Python [3]. Eventually most interpreters and VMs will be updated to check cgroup memory limits and adjust their memory allocation systems accordingly. Until then, developers may continue to need workaround scripts to indicate usable memory levels to the underlying system.
Trying to diagnose issues in containers can be pretty challenging. A well-built container tends to be very minimal, with as little excess as possible to conserve storage space. Because of this, many standard tools and commands tend to be missing from within containers, including tools like vi and even ping. In these situations, developers either get really creative with finding helpful commands, or they take to running apk to install the needed tools into a container which will get blown away on the next deployment. Another approach is to use a custom base container for all builds, with the diagnostic tools developers need built into it.
Say a container fails to start and you want to be able to see inside of a stopped container or even run a shell in one. This should be a simple task, but it’s actually quite challenging. You can run docker exec -it <name or id> /bin/bash to create a shell in a currently running container, but this can’t be done when a container is stopped. You can create an image of the stopped container using docker commit and run the image with a new entrypoint (e.g. /bin/bash), or you can run the container that’s crashing with a -v option and map the part of the volume with the diagnostic information you want into a location on your host machine. There are other solutions as well, but all of them require some juggling of images, volumes, commands and containers, and none of them are straightforward.
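The docker commit approach can be wrapped up roughly as follows. This is a sketch under assumptions: the function name debug_stopped_container and the debug-snapshot- image prefix are hypothetical, and it assumes the image actually contains the shell you ask for (minimal images often only have /bin/sh, if that).

```shell
# Hypothetical helper: open a shell in a snapshot of a stopped container.
# Usage: debug_stopped_container <container-name-or-id> [shell]
debug_stopped_container() {
    container="$1"
    shell="${2:-/bin/sh}"

    # Image names must be lowercase, while container names need not be.
    snapshot="debug-snapshot-$(printf '%s' "$container" | tr '[:upper:]' '[:lower:]')"

    # Snapshot the stopped container's filesystem as a temporary image.
    docker commit "$container" "$snapshot" >/dev/null || return 1

    # Start the snapshot with the shell in place of the failing entrypoint.
    docker run -it --rm --entrypoint "$shell" "$snapshot"
    status=$?

    # Clean up the temporary image once the shell exits.
    docker rmi "$snapshot" >/dev/null
    return "$status"
}
```

Two lighter options are sometimes enough before going this far: docker logs still works on a stopped container, and docker cp can copy individual files out of one.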
There are third party tools like Sysdig that allow developers and administrators to inspect containers and even trace system calls within running containers. In all these situations, debugging and diagnosing problems with container systems isn’t simple and requires a considerable amount of work and forethought.
There have been attempts to standardize the format for Linux containers, most notably by the Open Container Initiative. The layout inside the container, at its core, is really just a chroot environment. This allows containers themselves to be somewhat portable, if they are exported as generic tarballs. For the most part, containers can be migrated across different runtimes. The portability problems with Docker don’t come from the containers themselves, but from the orchestration systems that surround them.
Systems like Kubernetes, DC/OS with Marathon, Nomad, Fleet and Docker Swarm are all designed to schedule, deploy, manage and scale services, primarily those that run in Docker containers. Some of these systems have their own built-in networking, while others require administrators to choose from networking components like WeaveNet or Flannel. IPv6 with containers can be incredibly complicated, unless you use Docker IPv6 NAT. These are complex platforms, with a steep learning curve for setup, deployment and maintenance. Many of them use a JSON format for describing services and jobs, and none of the formats are interchangeable.
Often system administrators who set up these platforms for developers must create or modify configuration scripts (e.g. Puppet, Ansible, etc.) to create new nodes, load balancers, ingress routes and egress points within their organization’s architecture. Alternatively, some teams use managed Kubernetes services provided by Google, Amazon or other hosting providers. This can ease setup and reduce startup costs, but could lock you into a particular vendor’s implementation of a container service.
The containers themselves are typically portable over different orchestration systems. I’ve worked for companies where we migrated from CoreOS to DC/OS, and the transition wasn’t very difficult. The biggest difficulty is simply setting up and configuring these systems, as well as setting up all the relevant load balancers, networks, DNS and IP mappings that allow all the services to talk to each other, and the outside world, securely.
Zero to a Hundred
There aren’t any container orchestration solutions that easily go from a single node to a hundred. There are solutions like Minikube to help with development and deployment for Kubernetes applications on a single machine, but you can’t start off with a Minikube instance and then simply add more nodes to turn it into a cluster. In my DC/OS tutorial, I show how a minimal cluster on a hosted provider can still be quite expensive for small projects.
I did some limited experiments with both HashiCorp’s Nomad and RancherOS and I think either of these might be better solutions for going from a small to large scale deployment solution. Nomad is much simpler than other offerings, but it also is pretty minimal without as many bells and whistles included. At the time I played around with RancherOS, I was disappointed that so much of it depended on the Web GUI and it seemed to lack a lot of the command line support needed for easy scripting.
My frustration with all the existing scheduling systems is what led me to design Bee2, which is not a scheduling system so much as a custom docker-compose type system for managing Docker containers on a single host. There may be other solutions for starting very minimal and going to a larger deployment, but the existing major solutions I examined failed to work for my use case.
In many ways, I like Docker. I like the structure of a Dockerfile, the immutability of the build artifacts, the way Ruby/Python dependencies can be set up within a container without needing a virtual environment, and the fact that containers provide an easy way to deploy full services without needing to deal with package management. Containers that are maintained by software maintainers and that are part of their integration pipelines are great for both trying out software and potentially running the same configuration in production.
Docker has a lot of limitations as well. Security for packages within a container and checking for updated base images usually requires third party or custom scripting. Diagnosing what’s going wrong inside a container can be challenging. Orchestration systems such as Kubernetes, Marathon and Nomad are all fairly different, each with its own deployment API. Configuring and maintaining a full Docker platform typically requires a full development and operations team, as these platforms can grow to be quite complex.
Docker is a big mixed bag, but for better or for worse, it is here to stay. For the most part, it does a good job of isolating applications and their dependencies with minimal overhead. What’s needed is more tooling, easier diagnostics, and a more standardized means of container updating and orchestration. With players big and small like Google, HashiCorp, and Amazon all trying to stake their claims in the Docker ecosystem, it will be interesting to see what will be developed in the next few years.
[1] Clonewise – Automatically Detecting Package Clones and Inferring Security Vulnerabilities. Cesare. Deakin University. Retrieved 30 August 2018.
[2] An Updated Performance Comparison of Virtual Machines and Linux Containers. 21 July 2014. Ferreira, Rajamony, Rubio. IBM.
[3] Why does docker crash on high memory usage?. 11 July 2015. pierrelb. StackOverflow.