Docker is containerization software that lets us isolate applications much like virtual machines do, but with far less overhead.
A Docker image is a snapshot that defines everything a container needs to run our software, or in this case our data pipelines. By pushing our Docker images to cloud providers such as Amazon Web Services or Google Cloud Platform, we can run our containers there.
Docker provides the following advantages:
- Reproducibility: Same environment everywhere
- Isolation: Applications run independently
- Portability: Run anywhere Docker is installed
Containers are used in many situations:
- Integration tests: CI/CD pipelines
- Running pipelines on the cloud: AWS Batch, Kubernetes jobs
- Spark: Analytics engine for large-scale data processing
- Serverless: AWS Lambda, Google Cloud Functions
Check Docker version:
docker --version
Run a simple container:
docker run hello-world
Run something more complex:
docker run ubuntu
Nothing happens. We need to run it in -it mode:
docker run -it ubuntu
We don't have Python there, so let's install it:
apt update && apt install python3
python3 -V
Important: Docker containers are stateless - any changes made inside a container are NOT saved once the container is killed and a new one is started.
When you exit the container and run the image again, the changes are gone:
docker run -it ubuntu
python3 -V
This is good, because it doesn't affect your host system. Let's say you do something crazy like this:
docker run -it ubuntu
rm -rf --no-preserve-root / # GNU rm refuses a bare "rm -rf /"; don't run this on your computer!
Next time we run it, all the files are back.
But this is not completely correct: the state of each stopped container is still saved on disk. We can list stopped containers:
docker ps -a
We could restart one of them, but we won't, because it's not good practice. They take up space, so let's delete them:
docker rm $(docker ps -aq)
Next time we run something, we add --rm so the container is removed automatically when it exits:
docker run -it --rm ubuntu
There are other base images besides hello-world and ubuntu. For example, Python:
docker run -it --rm python:3.9.16
# add -slim to the tag to get a smaller image
This one starts the Python interpreter. If we want bash instead, we need to override the entrypoint:
docker run -it \
--rm \
--entrypoint=bash \
python:3.9.16-slim
So, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data that should outlive the container or be shared with the host? A common way to handle it is with volumes.
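As a recap of the flags used up to this point (`-it`, `--rm`, `--entrypoint`), the commands all follow the same order: options first, then the image. The sketch below is only an illustration of that ordering. The helper name `docker_run_args` is hypothetical and not part of Docker or any library, and the code builds the argument list without running anything.

```python
import shlex


def docker_run_args(image, interactive=True, remove=True, entrypoint=None):
    """Build the argv for `docker run` (hypothetical helper, for illustration only)."""
    args = ["docker", "run"]
    if interactive:
        args.append("-it")  # keep STDIN open and allocate a terminal
    if remove:
        args.append("--rm")  # delete the container once it exits
    if entrypoint is not None:
        # Replace the program the image would start by default.
        args.append(f"--entrypoint={entrypoint}")
    args.append(image)  # the image comes after all options
    return args


if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(shlex.join(docker_run_args("python:3.9.16-slim", entrypoint="bash")))
```

Run as a script, this prints the same command as the multi-line `docker run` invocation above, on a single line.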
Let's create some data in test:
mkdir test
cd test
touch file1.txt file2.txt file3.txt
echo "Hello from host" > file1.txt
cd ..
Now let's create a simple script test/list_files.py that shows the files in the folder:
from pathlib import Path

# Directory the script is run from, and this script's own filename
current_dir = Path.cwd()
current_file = Path(__file__).name
print(f"Files in {current_dir}:")
for filepath in current_dir.iterdir():
    # Skip the script itself
    if filepath.name == current_file:
        continue
    print(f"  - {filepath.name}")
    # Only regular files have text content to show
    if filepath.is_file():
        content = filepath.read_text(encoding='utf-8')
        print(f"    Content: {content}")
Now let's map this to a Python container:
docker run -it \
--rm \
-v "$(pwd)/test":/app/test \
--entrypoint=bash \
python:3.9.16-slim
Inside the container, run:
cd /app/test
ls -la
cat file1.txt
python list_files.py
You'll see that the files from your host machine are accessible in the container!
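The same mount can also be used without an interactive shell. The sketch below is a minimal illustration, assuming the `test` folder created earlier. It resolves the host folder to an absolute path (a bind mount via `-v` expects one), builds the `host:container` mapping, and uses `-w` to set the working directory so the script runs in one shot. The helper name `mounted_run_args` is hypothetical, and Docker is only invoked if the `docker` binary and the folder are found.

```python
import shutil
import subprocess
from pathlib import Path


def mounted_run_args(host_dir, container_dir, image, command):
    """Build a `docker run` argv that bind-mounts host_dir (hypothetical helper)."""
    host_path = Path(host_dir).resolve()  # -v expects an absolute host path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_path}:{container_dir}",  # host path : container path
        "-w", container_dir,  # start in the mounted folder
        image,
        *command,
    ]


if __name__ == "__main__":
    args = mounted_run_args(
        "test", "/app/test", "python:3.9.16-slim", ["python", "list_files.py"]
    )
    print(" ".join(args))
    # Only call Docker when it is installed and the folder exists.
    if shutil.which("docker") and Path("test").is_dir():
        subprocess.run(args, check=True)
```

Because the container is started with `--rm` and the files live on the host, nothing is left behind after the run except whatever the script writes into the mounted folder.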