A three minute guide to Ansible for data engineers

Ansible is a DevOp tool. Data Engineeres can be curious about tech and tools used by DevOps, but rarely have more than 3 minutes to spend for a quick look over the fence.

Here is your 3 minutes read.

To deploy something into an empty server, first we need to instal python there. So we open our Terminal and use ssh to connect to the server:

localhost:~$ ssh myserver.myorg.com
maxim@myserver.myorg.com's password: **********

Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-26-generic x86_64)


Note that after the first connection, the DevOp will probably configure ssh client certificate as well as some sudoers magic, so that we don’t need to enter the password every time we use ssh and therefore we can use ssh in scripts doing something automatically. For example, if we need to install pip on the remote server, we can do

ssh -t myserver.myorg.com sudo apt install python3-pip

Now, let’s say, we have received 10 blank new empty servers to implement our compute cluster and we need to configure all of the in the same way. You cannot pass a list of servers to the ssh command above, but you can do that to Ansible. First, we create a simple file hosts.yml storing our server names (our inventory):



Now we can install pip on all compute servers (even in parallel!) by one command:

ansible -i hosts.yml my-compute-cluster -m shell -a "sudo apt install python3-pip"

This is already great, but executing commands like this has two drawbacks:

  • You need to know the state of each server in my-compute-servers group. For example, you cannot mount a disk partition before you format it, so you have to remember whether partitions have been already formatted or not and
  • The state of all the servers has to be the same. If you have 5 old servers and one new, you want to format and mount the disk partition on the new one, and under no circumstances you want to format the partitions of the old servers.

To solve this, Ansible provides modules that not always execute some command, but first check the current state and skip the execution, if it is not necessery (so it is not “create”, it is “ensure”). For example, to ensure formatting of a disk partition /dev/sdb with the file system ext4, you call

ansible -i hosts.yml my-compute-cluster -m filesystem -a "fstype=ext4 dev=/dev/sdb"

This command won’t touch the old servers and only do something on the new one.

Usually, when preparing the server to host your data pipeline, several configuration steps are required (OS patches need to be applied, software needs to be installed, security must be hardened, data partitions mounted, monitoring established) so instead of having a bash script with commands such as above, Ansible provides comfortable and readable roles format in YAML. The following role prepare-compute-server.yml will for example update the OS, install pip, and format and mount filesystem:

- name: Upgrade OS
    upgrade: yes

- name: Update apt cache and install python3 and pip
    update_cache: yes
    - python3
    - python3-pip

- name: format data partition
    fstype: ext4
    dev: /dev/sdb

- name: mount data partition
    path: /opt/data
    src: /dev/sdb

Roles such this ment to be reusable building blocks and shouldn’t really depended on what rollout project you are currently doing. To facilitate this, it is possible to use placeholders and pass parameters to the roles using Jinja2 syntax. You also have loops, conditional executions and error handling.

To do some particular rollout, you would usually write a playbook, where you specify, what roles have to be executed on what servers:

- hosts: my-compute-cluster
  become: true    # indication to become root on the target servers
    - prepare-compute-server

You can then commit the playbook to your favorite version control system, to keep track who did what when, and then execute it like this

ansible-playbook -i hosts.yml rollout-playbook.yml

Ansible has a huge ecosystem of modules that you can install from its galaxy (similar to PyPi) and also much more features, most notable of which is that instead of having a static inventory of your servers, you can write a script that fetches your machines using some API, for example the EC2 instances from your AWS account.

Alternatives to Ansible are Terraform, Puppet, Chef and Salt.

Leave a comment