- Docker Swarm
Today I’d like to write a bit about Docker, mostly what I learned about it in a course I took recently. As many readers know, Docker and Kubernetes are widely used technologies today, so I thought I’d share a short overview of Docker. I must stress right away that this post doesn’t try to be complete in any way, but hopefully it will be enough to serve as an encouragement to those who are just considering learning Docker. So here it goes.
Let’s first put the obvious aside, and examine what Docker is. It is a virtualization technology, but not in the classical sense (as we have known it for decades). Classical virtualization technologies use something called a hypervisor to emulate hardware, which in turn allows the creation of tailored virtual machines (with different architectures if need be), which act exactly like real-world machines. There are mainly two types of hypervisors: Type-1, which runs directly on the hardware (Xen, VMware ESXi, etc.), and Type-2, which requires a host OS on which to run (VirtualBox, VMware Player, etc.). Docker, on the other hand, doesn’t use a hypervisor. Instead, it uses containers to run software directly on the host OS kernel, while utilizing the kernel’s capabilities for encapsulation. Since Docker relies heavily on Linux kernel capabilities to isolate containerized software from each other, such software needs to be compatible with the host OS kernel; otherwise the use of a hypervisor becomes necessary (Hyper-V on Windows hosts, for example).
Fig.1. Type-1 vs Type-2 hypervisors
Containers vs VMs
At first glance the differences between containers and full-fledged virtual machines aren’t all that great, so one could ask why the containerized approach became so popular over traditional virtualization. One of the reasons is that for hypervisor-based virtualization one needs to install and configure a whole OS for each machine, even before one can start fine-tuning the environment to a specific need. Although pre-built OS images can be made to circumvent most of the setup phase, this still does not eliminate the need to maintain each installed OS instance (there are automation tools for that as well, but the whole process is still not 100% unattended). Let’s also not forget that a lot of the OS-level software is simply repeated with each instance, which, apart from creating a bigger surface for maintenance, also takes up a lot of space.
This is where containers come into play. If we have no special requirements towards the underlying OS, we can simply package and supply only the software that is strictly required for our application, while the rest (hardware abstraction, scheduler, init system, kernel, etc.) is provided by the underlying OS through Docker. This way there is only a single OS that needs to be maintained (in many cases by someone else), which in turn is shared between all container instances.
Fig.2. Classical virtualization vs containers
If we examine the abstraction levels a bit closer (Fig.3. blurred areas), we notice that compared to hypervisor-based virtualization (Fig.3. left side), where everything outside the VM is abstracted away, we gain one additional level of abstraction (Fig.3. right side), as the OS ceases to be a concern as well. This is a definite step forward, which is probably one of the reasons why Docker became so popular. At this point all we need to do is install Docker on the OS of our choosing, after which we no longer care what is going on outside of it (and how it is provided to us as a service), as long as we can pull in and deploy our prefabricated working environment, or in Docker nomenclature: our image.
Fig.3. Levels of abstraction at hypervisor and container based virtualization
So how do we create such an image, and what exactly is it anyway? A Docker image is very similar to a Linux live distro image you download to run from a USB stick, for example, except that (as Fig.2. depicts) apart from the usual Unix directory tree, it needs to contain only those dependencies that our application strictly needs. Everything else is provided by the underlying OS through Docker. This way the application residing inside the image doesn’t know that it is in a container, as everything looks exactly the same as if it were a real Unix environment.
So back to the question of how to create such an image: basically, one takes a freely available prefabricated base image and builds upon it using built-in Docker tools. Such base images can be as tiny as a few MB (Alpine) if we want to add all dependencies to the image manually; more featureful base images can also be used, where we immediately get the usual base dependencies for a specific need (Node.js, PHP, nginx, CouchDB, etc.). To make it easy to append things to existing Docker images, all images are based on a union filesystem (UnionFS), which allows an image to be built in a layered and documented way. Each change creates a new r/w layer on top of the previous layer, and the previous layer becomes read-only as soon as the new one is created. Fig.4. depicts such a process.
Fig.4. Creating a custom Docker image layer-by-layer
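To make the layering idea a bit more tangible, here is a minimal Python sketch of how a file lookup through stacked layers could work. It is only a model under my own assumptions, not how Docker’s storage drivers are actually implemented: layers are plain dictionaries, and the names resolve, WHITEOUT, base, deps and app are hypothetical, chosen for illustration.

```python
from typing import Optional

# Marker for a file deleted in an upper layer (a hypothetical stand-in
# for the "whiteout" entries that union filesystems use).
WHITEOUT = object()


def resolve(layers: list[dict], path: str) -> Optional[str]:
    """Return the content of `path` as seen through the stacked layers.

    `layers` is ordered from bottom (base image) to top (latest change).
    """
    for layer in reversed(layers):  # search from the top layer down
        if path in layer:
            content = layer[path]
            return None if content is WHITEOUT else content
    return None  # the path is not present in any layer


# Three layers: a base image, added dependencies, and the application.
base = {"/etc/os-release": "alpine", "/bin/sh": "busybox"}
deps = {"/usr/bin/node": "node binary"}
app = {"/app/server.js": "console.log('hi')", "/etc/os-release": "custom"}
image = [base, deps, app]

print(resolve(image, "/bin/sh"))          # found in the base layer
print(resolve(image, "/etc/os-release"))  # the top layer shadows the base
```

The topmost layer that knows about a path wins, while the lower layers are never modified. That is the property that makes it safe to share them.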
Creating custom images
The extension of any base image is done using a Dockerfile, which is basically a recipe that tells Docker how to build a new image. It usually starts with a FROM directive, which tells Docker what base image to use; then (almost) each stanza after that represents the creation of a new layer, where all changes denoted by the directive or commands in the stanza are applied to this new layer. Such tasks include ADD, etc., but we can also define environment variables or open up ports to the outside world (from the container’s perspective). Interestingly, despite the fact that all previous layers are read-only, files residing in those layers can apparently be changed in the new layer, as each new layer simply represents a diff to the previous layer. This means that all changes to the image are recorded (and all previous states exist), which makes it easy to roll back or reuse previously made image layers. For example, if one of the sublayers (i.e. additions) needs to be changed, one just needs to correct the stanza representing that particular sublayer in the Dockerfile, after which Docker will rebuild the image, reusing all sublayers below the changed one.
Fig.5. Making changes to existing images
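The reuse of unchanged sublayers can be sketched with a chained hash: if every layer ID depends on its own stanza and on everything below it, then a change invalidates that layer and all layers above it, but nothing below. The Python below is a simplified model under that assumption. Docker’s real build cache looks at more than the text of a stanza (for example the contents of added files), and layer_ids, reusable_layers and the sample stanzas are hypothetical.

```python
import hashlib


def layer_ids(steps: list[str]) -> list[str]:
    """Derive one ID per build step; each ID depends on all earlier steps."""
    ids, parent = [], ""
    for step in steps:
        digest = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        parent = digest[:12]
        ids.append(parent)
    return ids


def reusable_layers(old_steps: list[str], new_steps: list[str]) -> int:
    """Count how many leading layers a rebuild could take from the cache."""
    count = 0
    for old_id, new_id in zip(layer_ids(old_steps), layer_ids(new_steps)):
        if old_id != new_id:
            break
        count += 1
    return count


old = ["FROM alpine", "ADD deps /deps", "ADD app /app", "ENV MODE=prod"]
new = ["FROM alpine", "ADD deps /deps", "ADD app-v2 /app", "ENV MODE=prod"]

# Only the first two layers sit below the changed stanza.
print(reusable_layers(old, new))
```

Note that the last stanza is textually identical in both versions, yet its layer still has to be rebuilt, because it sits on top of a layer that changed.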
This all sounds very good, but what happens when we need to run an application that might make changes to the filesystem? Will that ruin our image? Do we need to keep separate copies of it? Fortunately not, because as soon as the app in the image is executed, Docker creates a new rw layer on top of the image, which from that point on will be called a ‘container’. Any changes made are made to this new ‘container’ layer only (see image below).
Fig.6. Executing apps from images
This also opens up the possibility to run multiple instances of a particular image, since no running instance can change it. But will the new ‘container’ layer be shared between instances? The short answer is no. The longer answer is no, because each instance (or ‘container’) creates its own independent new layer on top of the image, where each instance can write its own changes to the filesystem (see image below). Furthermore, any data not changed during runtime is shared between all containers through the image, which makes running many instances very space efficient, since only the differences between instances require extra disk space.
Fig.7. Executing multiple independent instances of imaged app
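The same idea can be modelled in a few lines of Python. This is only a sketch of the copy-on-write principle described above, assuming an image can be represented as a dictionary of paths. The Container class and the sample paths are hypothetical and have nothing to do with Docker’s actual implementation.

```python
class Container:
    """A private writable layer on top of a shared, read-only image."""

    def __init__(self, image: dict):
        self._image = image  # shared between containers, never modified
        self._layer = {}     # this container's own changes

    def read(self, path: str):
        # The container's own layer shadows the image.
        if path in self._layer:
            return self._layer[path]
        return self._image.get(path)

    def write(self, path: str, content: str) -> None:
        self._layer[path] = content  # the image itself stays untouched

    def diff(self) -> dict:
        """Return only what this container changed."""
        return dict(self._layer)


image = {"/app/config": "default", "/app/server.js": "code"}
a, b = Container(image), Container(image)
a.write("/app/config", "tuned for a")
b.write("/tmp/cache", "scratch data of b")

print(a.read("/app/config"))  # a sees its own change
print(b.read("/app/config"))  # b still sees the image's version
print(a.diff())               # only the difference needs extra space
```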
Storing persistent data
This all sounds like a wonderful idea, but let’s think for a second. What happens if the image is updated? Obviously, the container we created on top of the previous version of the image is not compatible with the new image, so it will be replaced by a new container, and all the changes (i.e. generated data) made to the container so far will be lost. This is why the general consensus is to make containers ephemeral, or in other words as disposable as possible. Docker offers several possibilities to store data outside the container, depending on our needs: tmpfs mounts for temporary data, bind mounts for reading and writing data from/to a folder on the host filesystem, and volumes for host-independent data storage. The image below depicts all the possibilities.
Fig.8. Different types of external storage for containers
There are of course advantages and disadvantages to each. On the downside, tmpfs mounts are only available if Docker is running on a Linux host, and they cannot be shared between containers. Bind mounts rely on the host OS folder tree, which makes a setup using them less portable. Volumes need to be managed separately, as they are independent of the containers. On the upside, tmpfs will be much more performant than the other solutions. With bind mounts one can make files residing on the host filesystem available to the container extremely easily. Volumes are managed by Docker, and as such we maintain the highest system-level abstraction (as discussed a few paragraphs above). Both bind mounts and volumes can be used with multiple containers to share the same persistent data between them.
Managing multiple containers
Speaking of multiple containers, how do we manage all the stuff discussed above as a user? The basic way is to use the CLI with the docker image and docker volume management commands. Each of these has subcommands, like inspect, etc., with which we can set things up one by one on a host. This is good for simple tasks and testing purposes, but quickly gets slow and cumbersome as the number of containers, volumes and settings starts to grow. By using the CLI we also start to lose sight of what we did on a particular machine to set things up, as nothing gets documented anywhere.
Astute readers might have already thought of using shell scripts to automate CLI-based tasks at the end of the previous paragraph, and they would’ve been on the right track if Docker were still a Linux-only technology. Because that has not been the case for years now, Docker developers had to devise means to make such tasks independent of any shell. For this docker-compose was created, a tool that is able to do all the above-mentioned tasks using a docker-compose.yml YAML file, where users can define multiple containers and their properties (volumes, ports, networks, etc.) that need to be set up for a certain application. This is definitely a step forward, as we now have very clear and easy-to-read documentation of exactly what is needed to run an application.
Fig.9. Creating the same setup using Docker CLI and docker-compose
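To illustrate the difference between the two approaches, the Python sketch below takes a small declarative description of two services and derives the docker run commands one would otherwise type by hand. This is not how docker-compose works internally. The run_command helper, the services dictionary (a hypothetical stand-in for a compose file) and all names and paths in it are assumptions made for illustration. The flags used (--detach, --name, --publish, --volume, --network) are regular docker run options.

```python
def run_command(name: str, spec: dict) -> list[str]:
    """Build the `docker run` argument list for one service definition."""
    cmd = ["docker", "run", "--detach", "--name", name]
    for port in spec.get("ports", []):
        cmd += ["--publish", port]  # host_port:container_port
    for volume in spec.get("volumes", []):
        cmd += ["--volume", volume]  # volume_name:path_in_container
    if "network" in spec:
        cmd += ["--network", spec["network"]]
    cmd.append(spec["image"])  # the image name comes last
    return cmd


# Hypothetical application: a web server and a database on one network.
services = {
    "web": {"image": "nginx", "ports": ["8080:80"], "network": "appnet"},
    "db": {
        "image": "couchdb",
        "volumes": ["dbdata:/opt/couchdb/data"],
        "network": "appnet",
    },
}

commands = [run_command(name, spec) for name, spec in services.items()]
for cmd in commands:
    print(" ".join(cmd))
```

The point is that the dictionary documents the whole setup in one place, whereas the printed commands would have to be remembered (or written down somewhere) to reproduce it.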
Communication between containers
Now that we can create and set up multiple containers easily, one could ask how these connect to each other. Containers are isolated by nature, so how do they communicate with each other? How does network traffic go between the host and containers? The simple answer is: using virtual networks. Docker provides built-in virtual networking, where one can create several different types of networks: bridge, host, overlay and macvlan.
- Bridge networking (the default one) provides a simple means to connect different containers to each other, which also facilitates the possibility to communicate with the host network.
- Host networking removes the middle layer, thus containers use the host’s networking directly.
- Overlay networking connects separate swarm services (about this a bit later) to each other or to standalone containers.
- Macvlan emulates physical NICs by assigning a MAC address to a container, which makes the container appear as a device on the network.
The image below depicts a possible network setup using all available Docker network types:
Fig.10. Different types of networking available in Docker
As discussed previously, it is possible to easily set up and start multiple containers with docker compose, but unfortunately it is not suitable for deployment tasks, as it doesn’t monitor container health, nor is it easy to distribute load to multiple hosts with it. This is where Docker Swarm comes into play, because it helps to scale both up and out. Let’s take this step by step.
For one thing, it would be nice if our app did not become unavailable when our container stops working for whatever reason. A simple solution would be to automatically start a new container once the previous one dies, which is one of the features of Swarm. To be able to handle this, Swarm introduces the concepts of service and task. A service is basically an abstraction of a containerized application, behind which the actual work is done by tasks, where a task is an abstraction around the container itself. If for some reason one of the containers fails, Swarm will simply discard the broken container and start a new one inside a new task. This approach also adds the benefit of not needing to rebuild connections between containers, as this is done automatically in the background.
Fig.11. Docker swarm starts a new task in place of the failed one
This is great, but it is not an ideal solution, because there will still be some short downtime until the new container spins up. To circumvent this problem, Swarm allows the creation of multiple replicas of the same task within a single service, so if one of them dies, the others are still able to serve requests, and our application remains online.
Fig.12. By creating multiple replicas of the same task within a service, we can ensure continuous availability of an application on container failures.
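The behaviour described above boils down to comparing a desired state (the number of replicas) with the actual state (the running tasks) and repairing the difference. Here is a minimal Python sketch of such a reconciliation step. It is a toy model under my own assumptions, not Swarm’s actual code, and the Service class and the task naming scheme are hypothetical.

```python
import itertools


class Service:
    """Keeps a desired number of task replicas alive (simplified)."""

    def __init__(self, name: str, replicas: int):
        self.name = name
        self.replicas = replicas
        self.tasks = {}  # task id -> "running" or "failed"
        self._ids = itertools.count(1)
        self.reconcile()

    def reconcile(self) -> None:
        """Drop failed tasks and start new ones until the target is met."""
        self.tasks = {t: s for t, s in self.tasks.items() if s == "running"}
        while len(self.tasks) < self.replicas:
            self.tasks[f"{self.name}.{next(self._ids)}"] = "running"

    def running(self) -> list[str]:
        return sorted(t for t, s in self.tasks.items() if s == "running")


web = Service("web", replicas=3)
web.tasks["web.2"] = "failed"  # simulate a crashed container

still_serving = web.running()  # the other replicas keep answering requests
web.reconcile()                # a fresh task replaces the failed one

print(still_serving)
print(web.running())
```

Between the failure and the reconciliation two replicas are still running, which is exactly why multiple replicas bridge the gap that a single restarted container would leave.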
Up until now, everything we saw happened on the same host, which would be good enough if we could ensure that there is never any hardware breakdown or maintenance requiring a hardware shutdown. Unfortunately, this is never the case, so Swarm provides further facilities to compensate for hardware-related problems. One such concept is the node. A node is any physical or virtual machine (with its own Docker installation) connected to the same swarm. There are two types of nodes: manager and worker.
A manager node is responsible for accepting client commands, task creation, IP address allocation, task assignment to nodes and worker node management, whereas a worker node simply checks for assigned tasks and executes them if received. To better understand what each type of node does, I’ll borrow an official Docker documentation figure:
Fig.13. Manager and worker node responsibilities in Swarm (Official Docker documentation figure)
Although the image above depicts very nicely what each type of node does, it is a bit misleading, as manager nodes themselves are also able to execute tasks, which in fact they do by default. So now we have an interconnected network of machines with different roles, but the question arises: how are tasks distributed among these nodes? To achieve maximum redundancy, manager nodes try to distribute the tasks of a particular service to as many nodes as possible. This not only ensures maximum availability, but also utilizes the power of multiple hardware resources to increase overall performance. Here’s a depiction of this:
Fig.14. Swarm distributes service instances across different nodes
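A spreading strategy like this can be sketched in a few lines of Python: every new task simply goes to the node that currently runs the fewest tasks. This is a deliberately naive model. A real scheduler also has to consider things like available resources and placement settings, which this sketch ignores, and the spread function and node names are hypothetical.

```python
def spread(task_count: int, nodes: list[str]) -> dict[str, list[int]]:
    """Assign tasks one by one to the node currently running the fewest."""
    placement = {node: [] for node in nodes}
    for task in range(1, task_count + 1):
        # On a tie, min() picks the first node in the list.
        target = min(nodes, key=lambda n: len(placement[n]))
        placement[target].append(task)
    return placement


# Managers run tasks too by default, so they take part in the placement.
placement = spread(5, ["manager-1", "worker-1", "worker-2"])
for node, tasks in placement.items():
    print(node, tasks)
```

With five tasks on three nodes no node ends up with more than two tasks, so losing any single node takes out at most two of the five replicas.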
Keeping everything running no matter what
At this point it is pretty obvious that the more worker nodes join the swarm, the better availability one can achieve, not to mention the increased overall performance. But what would happen if a manager node went offline? With only one manager, the swarm could no longer be managed, and nothing would restart or reschedule failed tasks. To prevent this, Swarm doesn’t restrict the number of manager or worker nodes within a single swarm. In fact, if there is only a small number of nodes (e.g. 3), it is advisable to make all of them manager nodes to achieve maximum resilience. Since we now have multiple managers, each functionally identical to the others, some questions arise: Which one makes the decisions if a request comes in? How do they know if a request has been processed by somebody?
Without going into much detail, a simple answer to the first question is that there is always a Leader manager, which is responsible for managing tasks in a swarm. If the Leader dies, one of the other managers takes over its role and rebalances all required tasks between the remaining functioning nodes. So let’s get back to the second question: how do the other managers know whether one of them needs to step up and do stuff? Again, simply put, all manager nodes share an “internal distributed state store”, through which all of them are aware of all things related to the swarm. Using this shared state and the Raft consensus algorithm, they are able to manage the global state of the swarm. Let me borrow an official Docker depiction again to visualize this:
Fig.15. Multiple manager nodes sharing state about the swarm (Official Docker documentation figure)
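One practical consequence of using Raft is worth spelling out: decisions need a majority (a quorum) of the managers, so the number of manager failures a swarm can survive follows from simple arithmetic. The Python below just computes that arithmetic. The function names are my own.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers may fail while a majority is still reachable."""
    return managers - quorum(managers)  # equals (managers - 1) // 2


table = {n: (quorum(n), tolerated_failures(n)) for n in range(1, 8)}
for n, (q, f) in table.items():
    print(f"{n} managers: quorum {q}, tolerates {f} failure(s)")
```

Going from 3 to 4 managers does not buy any extra fault tolerance (both survive one failure), which is why an odd number of managers is the usual choice.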
At this point one could wonder how all this works. We have multiple nodes, with multiple services, where each service can have multiple instances, and all this works completely transparently. Moreover, all this appears as a single entity to the outside world, which makes it even more interesting. The answer to this question, as usual, lies in networking. When we create a swarm with a few services, we also create a new virtual network for this swarm. In fact, not just one new network, but at least two, and all in all a swarm requires 3 networks to function properly: an overlay network, a docker_gwbridge network and an ingress network.
- The overlay network ensures internode communication between services.
- The docker_gwbridge network connects overlay networks (including ingress) to a Docker daemon’s physical network.
- The ingress network handles control and data traffic to swarm services.
One of the most interesting features of such a setup is that it allows us to access our published (to the Internet) service using the IP address of any connected node. All this happens using the ingress network and some Linux kernel features, which together create a routing mesh. For a much more detailed explanation of how swarm networking works, please consult the official Docker documentation or this excellent in-depth analysis.
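Conceptually, the routing mesh means that the node a request enters through and the node that actually serves it are independent of each other. The toy Python model below shows just that, assuming plain round-robin balancing between tasks. The RoutingMesh class, the node names and the task names are hypothetical, and the real mechanism lives in the kernel rather than in application code.

```python
import itertools


class RoutingMesh:
    """Toy model: any node accepts a request and forwards it to some task."""

    def __init__(self, nodes: list[str], task_locations: list[tuple[str, str]]):
        self.nodes = set(nodes)
        # (task, node) pairs, handed out in round-robin order (an assumption).
        self._next_task = itertools.cycle(task_locations)

    def handle(self, entry_node: str) -> dict:
        if entry_node not in self.nodes:
            raise ValueError(f"{entry_node} is not part of the swarm")
        task, node = next(self._next_task)
        return {"entered_at": entry_node, "served_by": task, "on_node": node}


mesh = RoutingMesh(
    nodes=["node-1", "node-2", "node-3"],
    task_locations=[("web.1", "node-1"), ("web.2", "node-2")],
)

# node-3 runs no task of the service, yet it can accept requests for it.
first = mesh.handle("node-3")
second = mesh.handle("node-3")
print(first)
print(second)
```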
Managing a swarm
As with bare containers, all Swarm actions can be performed using the CLI and the docker service and docker node base commands (and some others). Unfortunately this approach suffers from the same deficiencies as the commands discussed for containers, so a solution similar to docker-compose was created for swarms, namely stacks. Stacks use a YAML file almost identical to the compose files discussed before as a recipe, with some added features like replica count, task restart options, service update control, placement settings, etc. The good thing about these two tools (compose and stacks) is that they are able to work from the very same YAML file: docker-compose is able to build and deploy locally on a dev machine while ignoring all swarm deployment related information, and stack is able to deploy to a swarm while ignoring all build and local deployment related information. As a result, we need to maintain only a single YAML file for both local development and remote deployment.
Fig.16. Using the same YAML file for local development and remote deployment tasks
The only remaining question is security. We might be using some sensitive data, like keys, certificates, passwords, etc., and it would be nice if we didn’t need to hardcode all this into our application for everyone to see. Moreover, there are quite a few networks needed to run a swarm and some of them cross node borders, so how do we keep our data secure in transit and in storage? The good news on the transit side of things is that all swarm service management traffic (and shared state) is encrypted by default, and application data over the overlay network can be encrypted as well by simply turning this feature on (some performance penalty is to be expected with the latter).
For stored sensitive data, Swarm offers Secrets, a tool to centrally manage a blob of data (< 500 KB) in an encrypted fashion. Not only that, it offers ways to expose such stored sensitive data to select services only, to ensure minimum visibility. So how does this work? Simply explained, the secret is sent to a swarm manager, where it is stored in an encrypted fashion. When a service has access to a particular secret, the decrypted secret is mounted into the container using an in-memory filesystem. At this point the secret appears as a normal file in the container filesystem, which contains the decrypted sensitive data we stored as a secret. The nice thing about this approach is that we do not need to treat such data in any special way from an application’s perspective; we simply read it in just as we would any other file.
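From the application’s side, reading a secret could look like the Python sketch below. It assumes the secret is mounted under /run/secrets/, which is where Docker places secrets in Linux containers by default. The read_secret helper and the secret name db_password are hypothetical, and the demo part uses a temporary directory in place of the real mount so that it can run anywhere.

```python
import tempfile
from pathlib import Path


def read_secret(name: str, base_dir: str = "/run/secrets") -> str:
    """Return the content of a mounted secret without the trailing newline."""
    return (Path(base_dir) / name).read_text(encoding="utf-8").rstrip("\n")


# Demo outside a container: a temporary directory stands in for the
# in-memory filesystem that Swarm would mount into the container.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "db_password").write_text("s3cret\n", encoding="utf-8")
    password = read_secret("db_password", base_dir=demo_dir)

print(len(password))  # avoid printing the secret itself
```

Inside a service that has been granted the secret, calling read_secret("db_password") without the second argument should be all the application needs.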
All in all, I can say that Docker truly is a fascinating technology, as it genuinely seems to solve a ton of problems all by itself. If we take into account other features of Docker (contexts, logging, more advanced security, runtime options for hardware resources, etc.), which haven’t been discussed in this short overview, we start to see why it became so popular in such a short amount of time. As stated at the beginning of this post, this wasn’t a very detailed overview of Docker. If you want to find out much more about it, I strongly recommend taking some online courses on it and/or reading the official online documentation. It really is worth it.
Thanks for reading.