Docker and Containerization
Containerization is a lightweight form of virtualization that lets you package an application and its dependencies into a container. You can specify the entire stack, from the base operating system and installed OS packages through to Python packages and your application code. Containers then run consistently across computing environments: the same container behaves the same way whether it runs on Linux, Windows or macOS. Unlike traditional virtual machines, containers share the host system's kernel, making them more efficient and faster to start. Containers can also be deployed to the cloud and scaled up or down easily.
What is Docker?
Docker is a platform that automates the building, management, and deployment of containerized applications. It is the most widely used containerization tool thanks to its ease of use and broad adoption, although alternatives such as Podman exist if required.
Key Concepts
- Docker Engine: The core of Docker; it builds and runs containers. It consists of the Docker daemon, which runs on the host machine, and the Docker CLI (command-line interface) used to interact with the daemon.
- Docker Images: Immutable snapshots of your application and its environment. Images are built from a Dockerfile and contain everything needed to run the application, including the code, runtime, libraries, and configuration.
- Docker Containers: Running instances of Docker images. Containers are isolated from each other and from the host system, but can interact through defined channels.
- Dockerfile: A text file containing the instructions for building a Docker image. It specifies the base image, the application's code, dependencies, and configuration.
- Docker Hub: A cloud-based registry where you can find and share Docker images. You can pull images from Docker Hub or push your own images to it. AWS and Azure also provide their own container registries where you can store images of your applications. A short example of pulling and running an image from Docker Hub follows this list.
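The commands below show these concepts in action by pulling a public image from Docker Hub and running it as a container. The `nginx` image, container name, and port numbers are purely illustrative.

```bash
# Pull an image from Docker Hub (nginx is used only as an example)
docker pull nginx:latest

# Start a container from the image, mapping host port 8080 to port 80 in the container
docker run -d -p 8080:80 --name example-web nginx:latest

# List running containers and locally stored images
docker ps
docker images

# Stop and remove the container when finished
docker stop example-web
docker rm example-web
```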
Getting Started with Docker
Step 1: Install Docker
Docker can be installed on various operating systems, including Windows, macOS, and Linux. Follow the instructions on the official Docker website to install Docker on your machine.
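Once installed, you can check that Docker is working from a terminal. The `hello-world` image is a minimal test image published on Docker Hub for exactly this purpose.

```bash
# Confirm the Docker CLI and daemon are installed and running
docker --version

# Pull and run a minimal test container
docker run hello-world
```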
Step 2: Write a Dockerfile
Create a file named Dockerfile in your project directory. Here’s an example for a simple Python application:
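A minimal sketch of such a Dockerfile is shown below. The Python version, the `app.py` entry point, and the exposed port 80 are assumptions, with port 80 chosen to match the port mapping used later in this guide.

```dockerfile
# Use an official Python runtime as the base image (version is an assumption)
FROM python:3.12-slim

# Set the working directory inside the container
WORKDIR /app

# Install Python dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the image
COPY . .

# The application is assumed to listen on port 80 inside the container
EXPOSE 80

# Start the application (app.py is a placeholder entry point)
CMD ["python", "app.py"]
```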
Step 3: Build the Docker Image
Use the Docker CLI to build the image from the Dockerfile:
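The image name `my-python-app` below is a placeholder; tag the image however you like.

```bash
# Build an image from the Dockerfile in the current directory and tag it
docker build -t my-python-app .
```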
Step 4: Run a Docker Container
Run the container from the image you just built:
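Assuming the placeholder tag from the previous step:

```bash
# Map host port 4000 to port 80 inside the container
docker run -p 4000:80 my-python-app
```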
This command maps port 80 in the container to port 4000 on your local machine.
Step 5: Access Your Application
Open your web browser and go to http://localhost:4000. You should see your application running.
Applying Docker to Bioinformatics Packages
Here are some examples of how I used Docker to containerize bioinformatics applications: https://github.com/shaunchuah/docker_builds
Example container with Kraken and Bracken installed
Here is an example of packaging Kraken and Bracken into a single container to make my Nextflow pipeline more efficient by reducing the amount of data transferred between the computing cluster and cloud object storage.
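The exact Dockerfile is in the repository linked above. The sketch below shows one common way to achieve the same result, installing both tools from Bioconda on top of a micromamba base image; the image tag and the choice of micromamba are assumptions rather than a copy of the original file.

```dockerfile
# Small conda-compatible base image (tag is an assumption)
FROM mambaorg/micromamba:1.5.8

# Install Kraken2 and Bracken from Bioconda into the base environment
RUN micromamba install -y -n base -c conda-forge -c bioconda \
        kraken2 bracken \
    && micromamba clean --all --yes

# Ensure the installed tools are on PATH even when the entrypoint is overridden
ENV PATH=/opt/conda/bin:$PATH
```

With both tools in one image, a single pipeline task can run Kraken and feed its output straight into Bracken without staging intermediate files back to object storage.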
VSCode Dev Containers
Full documentation here: https://code.visualstudio.com/docs/devcontainers/containers
Visual Studio Code (VSCode) has a feature called Dev Containers that allows you to develop inside a container. This is useful for ensuring that your development environment is consistent across different machines and for sharing your development environment with others. You can define your development environment in a devcontainer.json file, which specifies the Dockerfile to use, the extensions to install, and other settings.
Example Project Structure:
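The original listing is not reproduced here, but a layout consistent with the files described below would look something like this (the `src/` directory is a placeholder for your own code):

```bash
.
├── .devcontainer/
│   └── devcontainer.json   # Dev container configuration for VSCode
├── Dockerfile              # Defines the development environment image
├── requirements.txt        # Python package dependencies
└── src/                    # Your application or analysis code
```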
An example Python dev container configuration:
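This is a minimal sketch consistent with the project structure above; the container name and the extension list are illustrative.

```json
{
    "name": "python-dev-container",
    "build": {
        "dockerfile": "../Dockerfile",
        "context": ".."
    },
    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python",
                "ms-python.vscode-pylance"
            ]
        }
    }
}
```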
The corresponding requirements.txt file:
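The actual contents depend on your project; the pinned packages below are placeholders showing the format.

```text
# Placeholder dependencies - pin whatever your project actually needs
pandas==2.2.2
numpy==1.26.4
```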
And here is the corresponding Dockerfile in the root of your project:
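Again, this is a sketch rather than the original file. The Python version is an assumption, and the FROM line is what you would change to upgrade Python.

```dockerfile
# Base Python version for the development environment (change this line to upgrade)
FROM python:3.12-slim

# Install the project's Python dependencies into the image
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```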
With this setup, you can open your project in VSCode and select "Reopen in Container" to start developing inside the container. You can also easily upgrade versions by changing the corresponding files and rebuilding the container.
Info
This is an alternative way to manage Python environments and dependencies. It also means you can upgrade Python versions easily.
Trusted Research Environments
Trusted research environments (TREs) are currently the way to access unconsented clinical data, via services such as SafeHaven or DataLoch. Many TREs essentially use containers to provision a workspace within a secure network, from which you access the data. Understanding SSH, Docker, and containerization is therefore essential for working within these TREs.