SKA PSS CI Systems Developer Guide

Table of Contents

[[TOC]]

Introduction

The PSS Cheetah pipeline will be run on dedicated servers, using a large number of nodes. The contents of this repository enable us to deploy our search software on different machines, both during development of the code and later on in production deployment.

The repository contains a typical Ansible file system structure, including playbooks, tasks, and an inventory file. It also contains the iac_deployer tool, which wraps Ansible and provides some configuration and testing suites. The iac_deployer is used to initiate testing and deployment of configurations as defined in the ansible directory.

The tool should be run from your local host. To get started, check out this repository with git directly to the machine you would like to use as local host, and follow the instructions below. For testing, the tool will create docker containers on the local host (see Scenario 1), while for deployment it will target the machines specified in the inventory file, called production (see Scenario 2).

Some things to note about how the tool works:

If you are running the tool on a local host (bare metal or Virtual Machine):
- Test containers are created and deleted during testing; they do not persist.
- There are component tests and machine tests:
  - A component test makes a container, then pulls the relevant component (e.g. boost), installs the component, checks everything is working as expected, rolls-back the component, then removes the container.
  - A machine test creates a container following an ansible playbook, installs everything in that playbook, checks everything is working as expected, does a roll-back, then removes the container.
- Tests that include our Gitlab runner component (both machine tests and component tests) create an additional container that has a full instance of gitlab inside, to test registering the gitlab runner.
If you are instead running in gitlab CI:
- All of the above is then run inside an additional container, using Ubuntu 18.04. If you want to change this OS, you may also need to update the requirements file for python.
In actual deployment, you will run the iac_deployer tool on a local machine with the addresses or target machine names included in an inventory file (production).
The tool will create a python virtual environment and install Ansible inside the environment. This is beneficial as it is then isolated from the host machine and can have different versions of python-related software (including Ansible) than what is installed directly on the host.

Installing Python and virtualenv

Before running the iac_deployer script, please ensure you have Python and virtualenv by following the steps below (either for Debian or MacOS).

Debian

Installation via package manager

For Debian-based OSes, please ensure Python 3 (>= 3.6) is installed and the python3-venv apt package is installed, since the Python venv module is used for creating a Python virtual environment where all necessary pip packages will be installed. For example, installation on Ubuntu would look something like the following:

$ sudo apt install python3 python3-venv

NOTE: If you are using Ubuntu 20.04, the python3-venv package may not be available, in which case you probably already have the virtualenv package.

Verify that you have the virtualenv package by running:

$ python3 -m venv -h

If you see the usage of the venv module, you have successfully installed the virtualenv package.

Installation via pip

Alternatively, you can install the virtualenv package with pip itself. Firstly, ensure you have Python 3. You can type

$ python3 --version

to check this. Next, check if you have pip installed by typing

$ python3 -m pip --version

If you get something like:

pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

then you have pip. Otherwise, install pip either by installing the apt package python3-pip with

$ sudo apt install python3-pip

or by using the ensurepip module:

$ python3 -m ensurepip --upgrade

You should now have pip. Now, install the virtualenv package with:

$ python3 -m pip install virtualenv

Check that you now have the venv module by typing:

$ python3 -m venv -h

If you see the usage of the venv module, you have successfully installed the virtualenv package.

MacOS

Please download the latest Python release from https://www.python.org/downloads/mac-osx/. Then, open a terminal and verify that you have Python 3:

$ python3 --version

Next, install pip with:

$ python3 -m ensurepip --upgrade

You should now have pip. Now, install the virtualenv package with:

$ python3 -m pip install virtualenv

Check that you now have the venv module by typing:

$ python3 -m venv -h

If you see the usage of the venv module, you have successfully installed the virtualenv package.

Installing Docker on MacOS

If you are using the repository on a Mac, you will need to manually install Docker. (On Debian-based systems, this is automated via Ansible.) Please go to https://docs.docker.com/desktop/mac/install/ and download the correct version of the Docker Desktop installer for your architecture. Then, follow the instructions further down the page (https://docs.docker.com/desktop/mac/install/#install-and-run-docker-desktop-on-mac). Once you’ve finished, open a terminal and verify that you can run:

$ docker --version
$ docker ps

Getting started with iac_deployer

Once you know you have Python 3 and virtualenv (and Docker if you’re on a Mac) installed, you can run iac_deployer (from the repository’s root directory). The first time the script is run, Ansible and other required pip packages will be installed within a Python virtual environment, which will be located in the venv directory at the repository’s root. Once the installation has completed (it will take a little while to install ansible), verify you can view the script’s usage by typing:

$ ./iac_deployer -h

Commands quick-start

List available component tests with:

$ ./iac_deployer component_tests list

You can then choose one from the list and run it with:

$ ./iac_deployer component_tests run <component_name>

You can do similar things for the machine configuration tests i.e.:

$ ./iac_deployer machine_tests list

and

$ ./iac_deployer machine_tests run <machine_name>

Check the status of a machine with:

$ ./iac_deployer machines status <machine_name>

For more detailed help, see the example usage patterns below.

Running the unit tests

You can run the unit tests for the Python code (not for the Ansible configurations) by following the steps below. Note that this is different from running the machine and component tests via iac_deployer - this tests the Python code behind the iac_deployer script.

After making sure you’ve run iac_deployer at least once and installation was successful, activate the created virtual environment with (from the repository’s root):

$ source venv/bin/activate

Change into the python directory:

$ cd python/

Run the tests:

$ tox -e tests

You can also run the linter:

$ tox -e lint

When you’re finished with the virtual environment, simply type:

$ deactivate

to exit the virtual environment.

GitLab branching strategy

The Gitflow workflow will be used (full details are documented here: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow). This is implemented for this repository in the following manner:

In general, deployments to the production machines should only be made from the main branch. This (currently) isn’t automated via the GitLab CI/CD pipeline; instead, an authorised user (who has SSH keys set up for accessing the production machines) should clone a local copy of the repository and use the script to make the deployment.
- If the user wants to rollback a machine to a previous Git commit, they should clone the repository, then use git to checkout the commit, and then use the deployment script from there.
The integration branch is called dev and is the branch that should be branched off of when creating a new feature branch. Once any feature branches are completed, they should be merged back into dev via merge requests. When the CI/CD pipeline is executed for the dev branch, tests that emulate the production machines’ configurations, as well as component-specific tests, are run. The pipeline is also automatically run whenever a feature branch is pushed to, to help to catch any errors early before merging the branch into dev via a merge request.
When a deployment needs to be made, dev should be merged into main, and then an authorised user should make the deployment from there. If any “hotfixes” need to be applied to address any deployment issues, these should be made on dev or a feature branch and then merged into main via a merge request.
The differences between our workflow and the linked gitflow workflow are:
- Release branches are not used, since we are not using version numbers on main.
- We should not have any hotfix branches - any hotfixes should be applied to dev or a feature branch.

Git workflow

Please ensure you follow the Git workflow for this repository closely. See the following sections for common use cases:

Editing the repository’s contents

If you wish to make updates to any of the files within this repository (for example if you want to add a new component, edit an existing one, add a new machine configuration etc.), please follow these steps:

Clone the repository to your local machine. If you have GitLab SSH keys set up:

$ git clone git@gitlab.com:ska-telescope/pss/ska-pss-ci-systems.git

Otherwise, use HTTPS:

$ git clone https://gitlab.com/ska-telescope/pss/ska-pss-ci-systems.git

Next, switch to the dev (integration) branch:

$ git checkout dev

Then, create a feature branch from the dev branch:

$ git checkout -b <feature_branch_name>

Now implement your changes. View the below sections for how to add a new component, how to add a machine configuration, etc.

Once you’ve finished implementing your changes, you can optionally run the machine and component tests locally on your development machine before pushing to the remote (which will run all tests again via the CI pipeline).

Please note: at the time of writing, there are some component and machine tests that involve the creation of a local GitLab CE Docker container to run tests against. If you wish to run these tests, your development machine / VM needs to have a minimum of 4GB of RAM, due to the GitLab container requiring that much. For more details, please visit: https://docs.gitlab.com/ee/install/requirements.html#memory

Also note that the component tests can take between 1.5 hours and 5 hours to run, depending on the speed of your machine. The machine tests can take between 20 mins and 2 hours.

To run all component tests, type:

$ ./iac_deployer component_tests run

Similarly, to run all machine tests, type:

$ ./iac_deployer machine_tests run

You can also append the --help argument to either run command to view the usage for the run sub-command.

If you want to run a specific component / machine test, just specify the name of the component / machine after the run command. For example:

$ ./iac_deployer component_tests run 'gitlab-runner'

To push your feature branch to the remote and initiate CI testing:

$ git push --set-upstream origin <feature_branch_name>

This should then automatically trigger the CI/CD pipeline for your branch. Visit https://gitlab.com/ska-telescope/pss/ska-pss-ci-systems/-/pipelines to view all the repository’s pipelines.

Ensure that the pipeline passes. If it does not, fix the problems (either via the GitLab Web IDE or your working copy of the repository) and try again. Repeat this until the pipeline passes.

Next, create a merge request (MR) for merging your feature branch into dev and wait for it to be reviewed.

Once your MR has been approved, merge it into dev. Note: it’s okay if dev has failing pipelines, but main should (ideally) never have failing pipelines.

That’s it: your edits are now on dev, the integration branch. If GitLab hasn’t done so already, you can now delete your feature branch. To do this with git:

$ git branch -d <feature_branch_name> # Delete local branch
$ git push -d origin <feature_branch_name> # Delete remote branch

If the remote branch is still listed when you type git branch -a, run this command to update the list of remote branches:

$ git remote update origin --prune

Making a deployment to one or more machines

If you wish to deploy the defined Ansible configurations to one or more machines, please first ensure you are authorised to do this (i.e. you can ssh to each machine you want to deploy to, either via password or key-based authentication).

Now, decide whether you need to make a new commit on the main branch or not. If you want the latest configurations from the dev branch, you will likely need to do this. To make a new commit: first ensure that the latest commit on the dev branch has passed the CI/CD pipeline. If it has not, then either make a “hotfix” directly on the dev branch, or, if the required changes are more substantial, create a feature branch and perform the edits as required, using the steps in the above section. You can then create a merge request (MR) for merging dev into main. You can set the MR to automatically complete once the pipeline for dev has passed.

Once you have a commit on main that has passed the pipeline, you can make the deployment. Please follow these steps:

Clone the repository to your local machine. If you have GitLab SSH keys set up:

$ git clone git@gitlab.com:ska-telescope/pss/ska-pss-ci-systems.git

Otherwise, use HTTPS:

$ git clone https://gitlab.com/ska-telescope/pss/ska-pss-ci-systems.git

Next, switch to the main (master) branch:

$ git checkout main

Next, check out the particular commit SHA that you want to use for the deployment. (This can be helpful if you want to roll back a machine to a previous deployment.) If you just want to use the latest commit on main, you can skip this step.

$ git checkout <commit_hash>

If you need to, you can run all component / machine tests (note that this will take some time) with:

$ ./iac_deployer component_tests run
$ ./iac_deployer machine_tests run

You can see what changes would be made by your deployment first by using the machines status function:

$ ./iac_deployer machines status <machine_name>

To make the deployment to a particular machine, type:

$ ./iac_deployer machines deploy <machine_name>

If any errors occur during deployment, you will be asked to fix them by making a new commit (or, alternatively, you can try again from the same commit). You should implement any fixes on the integration (dev) or a feature branch (not main), and repeat the above process until the deployment succeeds.

Prerequisites

Python 3

Used pip packages (listed in python/requirements.txt):

ansible == 4.2.0
docker == 5.0.0
PyYAML == 5.4.1
pytest == 6.2.4
pytest-cov == 2.12.1
tox == 3.24.1

Additionally, these packages are used for linting the Python code (listed in python/tox.ini):

flake8
flake8_formatter_junit_xml

Please do not add any new dependencies without consent from the repository’s authors!

Setting up SSH keys

You can enable Ansible to SSH to the host machines that you want to configure without any passwords by setting up SSH keys for them. The easiest way to do this is to use the following commands:

ssh-keygen to generate your SSH key (if you’ve already got a key, i.e. if the file ~/.ssh/id_rsa.pub already exists, you can skip this step)
ssh-copy-id to copy your public SSH key to each host that you need Ansible to be able to connect to.
Then you should test that you can successfully SSH to each host without any passwords with ssh <user>@<hostname>.

Changing the contents of requirements.txt

Please note that if you change the contents of python/requirements.txt, you will need to delete the venv directory and run iac_deployer again, so that the changes are picked up.

Deleting the venv directory

If requirements.txt changes or the iac_deployer script changes, you may wish to delete the virtual environment and create it again. Simply run (from the repository’s root directory):

$ rm -rf venv/

to remove the virtual environment.

The next time you run iac_deployer, it will be re-created and the pip packages listed in python/requirements.txt will be installed within it.
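
As a rough aid for deciding when the environment needs re-creating, the check below compares file modification times. It is a minimal sketch, not something iac_deployer is known to do: the function name venv_is_stale is hypothetical, and it assumes venv/pyvenv.cfg (written by the Python venv module) is a fair marker for when the environment was built. It does not detect changes to the iac_deployer script itself.

```python
import os
import tempfile
from pathlib import Path


def venv_is_stale(repo_root: Path) -> bool:
    """Return True if requirements.txt is newer than the virtual environment.

    Assumption: the environment's creation time is approximated by the
    modification time of venv/pyvenv.cfg, which the venv module writes.
    """
    marker = repo_root / "venv" / "pyvenv.cfg"
    requirements = repo_root / "python" / "requirements.txt"
    if not marker.is_file():
        return False  # no environment yet, so there is nothing to delete
    return requirements.stat().st_mtime > marker.stat().st_mtime


# Demonstration in a throwaway directory that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "venv").mkdir()
    (root / "python").mkdir()
    (root / "venv" / "pyvenv.cfg").write_text("home = /usr/bin\n")
    (root / "python" / "requirements.txt").write_text("ansible == 4.2.0\n")
    # Pretend requirements.txt was edited 60 seconds after the venv was built.
    os.utime(root / "venv" / "pyvenv.cfg", (1000, 1000))
    os.utime(root / "python" / "requirements.txt", (1060, 1060))
    print(venv_is_stale(root))  # True
```

If the check returns True, removing venv/ as described above and re-running iac_deployer picks up the new requirements.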

Python script example usage

Testing a component (role)

Initialise the test playbook for the gitlab-runner component for each of the defined operating systems:

$ ./iac_deployer component_tests run 'gitlab-runner'

You can use the -v option to provide verbose output. You can use the --os option to specify which operating system to test the component on e.g. ubuntu-20.04. Available OSes are listed in the ansible/roles/os-containerised/files directory. At the time of writing, these are centos-7, ubuntu-18.04, and ubuntu-20.04.

Testing a machine’s configuration

Initialise the test playbook for the kelvin machine:

$ ./iac_deployer machine_tests run 'kelvin'

This will emulate the kelvin machine’s configuration locally within a development Docker container to highlight any immediate errors.

Checking the status of a machine’s configuration

Check the difference between the contents of the repository for the kelvin machine now and the repository’s contents when it was last deployed to:

$ ./iac_deployer machines status 'kelvin'

If the repository’s contents are the same, then this will simply say that kelvin is up-to-date.

Deploying the machine configurations

To attempt to deploy the configurations for all the machines that are in the inventory file (ansible/production):

$ ./iac_deployer machines deploy

To only deploy to the kelvin and tengu machines:

$ ./iac_deployer machines deploy 'kelvin' 'tengu'

You can use the --rollback-only option to ONLY rollback the existing configuration (if any) of the machine(s).

What are deployments and rollbacks?

A deployment is the usual execution of an Ansible configuration upon a target machine. For example, in deployment mode, a certain APT or YUM package may get installed on the target machine. Whereas, in rollback mode, this package would get uninstalled. The point is, when a deployment is made, there should be corresponding rollback steps that “undo” everything that the deployment did in the first place. You can see examples of deployment and rollback steps in the existing Ansible roles. Each role should have a deploy and rollback subdirectory under tasks that define deployment and rollback steps, respectively (although this is not strictly enforced - and it may not make sense for every role to have this exact structure).

Whenever a new machine configuration is added to this repository, that machine’s status will begin as “not deployed” - i.e. it is waiting to be deployed to. Once iac_deployer has been used to deploy the machine’s configuration from the repository (and there were no errors during this process), the machine’s status will change to “deployed”. Going back to the earlier example, the certain APT or YUM package (which we’ll call package A) will now be installed on the target machine.

If this machine’s configuration needs to be altered (for example, if package A needs to be replaced with a different package - which we’ll call package B), then, firstly, the existing deployment on the machine must be rolled back so that package A is uninstalled. The machine’s status would then change back to “not deployed”. Then, a new deployment can be made from the newer git commit of the repository that installs package B instead of package A. After this deployment is done, the machine’s status would change to “deployed” again. Note that the newer git commit doesn’t need to reference package A anywhere - since the older git commit will be used for rolling back package A.

If, after this process, you decide that package A is needed again instead of package B, then you can checkout the older git commit and initialise the deployment from there. This would first rollback the newer deployment (uninstalling package B), and, if that succeeded, would then proceed with installing package A from the older commit again.

A summary of the steps needed for the above example is shown below. Please note that this is an example used to illustrate how deployment and rollbacks work and thus shouldn’t be done to real production machines, but can be done on a local testing branch of this repository that should be deleted afterwards.

Ensure your currently checked out git commit of the repository contains the install and uninstall steps for package A. If it doesn’t, find the commit SHA and check it out:
```
$ git checkout <commit_sha>
```
Deploy package A to the machine with the name <example_machine>:
```
$ ./iac_deployer machines deploy <example_machine>
```
If no errors occurred during the deployment, then package A will now be installed on the machine. You can check the machine’s status with:
```
$ ./iac_deployer machines status <example_machine>
```
This should report that the machine is now deployed to.
Now make a new git commit that replaces the Ansible installation and uninstallation steps of package A with package B. For example, by using git commit:
```
$ git commit -m "Replace package A with B"
```
If you now run the status command again, it should report that the machine is deployed to, but that the file contents differ between the commit it was deployed from and the current git HEAD, i.e. it should show the package A steps being replaced by the package B steps.
The existing deployment can be rolled back by using the following command:
```
$ ./iac_deployer machines deploy <example_machine> --rollback-only
```
This can be run from any git commit. It will uninstall package A by making use of the older commit automatically.
The new deployment can then be made (that installs package B). This command should be run from the latest commit (i.e. the commit that includes package B’s installation and uninstallation steps):
```
$ ./iac_deployer machines deploy <example_machine>
```
Note that this step can be skipped if, on the previous step, you omitted the --rollback-only argument. In that case you must have run the command from the git commit that includes package B’s steps. That would have automatically deployed the newer git commit following the rollback of the older one.
Package B should now be installed. You can use the same status command as earlier to check that the machine’s status is now “deployed”.
To now remove package B and install package A again, you need to first checkout the older git commit:
```
$ git checkout <commit_sha that includes package A's steps>
```
From there, you can use iac_deployer again to rollback the existing configuration (removing package B), and then to install package A again, all in one command:
```
$ ./iac_deployer machines deploy <example_machine>
```
Following this, package B should be removed, and package A should be installed again.
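
The package A / package B walkthrough can be condensed into a toy model. This is only an illustrative sketch of the behaviour described in this section, not the tool's implementation: the Machine class, the COMMITS mapping, and the commit and package names are all hypothetical, and real deployments run Ansible playbooks rather than editing a set. The model also ignores the case where a machine is already up-to-date.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical mapping of git commit -> the package that commit installs.
COMMITS = {"commit_a": "package-a", "commit_b": "package-b"}


@dataclass
class Machine:
    """Toy stand-in for a target machine."""
    name: str
    installed: set = field(default_factory=set)
    deployed_commit: Optional[str] = None  # None means "not deployed"

    @property
    def status(self) -> str:
        return "deployed" if self.deployed_commit else "not deployed"


def rollback(machine: Machine) -> None:
    """Undo the existing deployment using the commit it was deployed from."""
    if machine.deployed_commit is None:
        return  # nothing to roll back
    machine.installed.discard(COMMITS[machine.deployed_commit])
    machine.deployed_commit = None


def deploy(machine: Machine, commit: str, rollback_only: bool = False) -> None:
    """Roll back any existing deployment, then deploy from `commit`."""
    rollback(machine)
    if rollback_only:
        return
    machine.installed.add(COMMITS[commit])
    machine.deployed_commit = commit


machine = Machine("example_machine")
deploy(machine, "commit_a")
print(machine.status, sorted(machine.installed))  # deployed ['package-a']
deploy(machine, "commit_b")  # rolls back A via commit_a, then installs B
print(machine.status, sorted(machine.installed))  # deployed ['package-b']
deploy(machine, "commit_b", rollback_only=True)
print(machine.status, sorted(machine.installed))  # not deployed []
```

The key point the model captures is that the rollback always uses the steps from the commit that was originally deployed, which is why the newer commit need not mention package A.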

Managing the available machine operating systems to run tests on

Machine configuration playbooks (listed under ansible/hosts/) must define a variable, called os, that specifies what operating system the machine is expected to be running. For example, ubuntu-18.04. If, when attempting a deployment, there is a mismatch between the actual detected operating system and the one defined by the os variable, then an error will be raised. The reason this is done is so that the machine tests can accurately emulate the operating systems that the real machines are running via Docker containers. The machine tests do this by looking up the value of the os variable, which is then used to find the corresponding directory that contains the appropriate Dockerfile to create the test container.
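
The lookup and mismatch check described above might be sketched as follows. The function names dockerfile_for and check_os are hypothetical, and how the tool actually detects the operating system or raises its error is not specified in this guide; only the directory layout is taken from the text.

```python
from pathlib import Path

# Directory named in this guide; each sub-directory is one emulated OS.
OS_FILES_DIR = Path("ansible/roles/os-containerised/files")


def dockerfile_for(os_name: str, root: Path = OS_FILES_DIR) -> Path:
    """Return the Dockerfile a machine test would build for `os_name`."""
    return root / os_name / "Dockerfile"


def check_os(expected: str, detected: str) -> None:
    """Raise if the machine's detected OS differs from its `os` variable."""
    if expected != detected:
        raise RuntimeError(
            f"OS mismatch: playbook expects {expected!r}, "
            f"machine reports {detected!r}"
        )


print(dockerfile_for("ubuntu-18.04").as_posix())
check_os("ubuntu-18.04", "ubuntu-18.04")  # matching values pass silently
```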

The available operating systems to be emulated can be found under ansible/roles/os-containerised/files/. Each OS directory must contain a Dockerfile that will extend from the relevant OS version (e.g. ubuntu:20.04). Please see the existing OS directories and their Dockerfiles to get an idea of what is required. As a minimum, each Dockerfile must contain:

A FROM line - the first line in the Dockerfile. This must specify which Docker image this Dockerfile should extend from. For example, if the operating system is Ubuntu 20.04, the line would read: FROM ubuntu:20.04. You can browse the Docker images available to extend from at https://hub.docker.com/
One or more RUN lines that run the following necessary commands:
- A command that installs Python 3. For example, to update the apt package index and install Python: RUN apt-get -y update && apt-get -y install python3.
- A command that upgrades the packages that are already installed. For example, for apt: RUN apt-get -y upgrade
- A command that installs any other necessary packages that are needed by any Ansible modules that will be in use during machine and component tests. For example, the ansible.builtin.apt module requires the python3-apt package to be installed. Therefore, a line such as RUN apt-get -y install python3-apt would be needed in the Dockerfile.
- A command that installs sudo, a command that adds a new user that is added to the sudoers group, and a command that ensures that the user can run sudo commands without needing to type in their password. Please see the existing Dockerfiles for examples of this.
A USER line that sets the user to be used in the container to the created user (mentioned in the above point) rather than the default of ‘root’. This is done to ensure that privilege escalation matches the typical workflow of using sudo on the production machines (rather than logging in directly as the root user).
A CMD line that specifies the sleep infinity command as the default command for the container. This is so the container will stick around while the Ansible playbooks are being run upon it.
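
To make the minimum requirements above concrete, the sketch below embeds a hypothetical Dockerfile and checks it against them. The Dockerfile contents (including the user name tester) are illustrative and not copied from the repository, and the checks are crude substring matches that do not join backslash line continuations, so treat this as a checklist aid rather than a validator.

```python
# Hypothetical Dockerfile for an ubuntu-20.04 OS directory. The user name
# "tester" and the exact commands are illustrative, not taken from the repo.
SAMPLE_DOCKERFILE = """\
FROM ubuntu:20.04
RUN apt-get -y update && apt-get -y install python3
RUN apt-get -y upgrade
RUN apt-get -y install python3-apt sudo
RUN useradd -m -G sudo tester && echo "tester ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/tester
USER tester
CMD ["sleep", "infinity"]
"""


def instructions(text):
    """Return (KEYWORD, arguments) pairs, skipping blank lines and comments.

    Simplification: backslash line continuations are not joined.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, arguments = line.partition(" ")
        pairs.append((keyword.upper(), arguments.strip()))
    return pairs


def missing_requirements(text):
    """List which of the minimum requirements a Dockerfile fails to meet."""
    steps = instructions(text)
    problems = []
    if not steps or steps[0][0] != "FROM":
        problems.append("first instruction must be FROM")
    if not any(k == "RUN" and "python3" in a for k, a in steps):
        problems.append("no RUN line installs python3")
    if not any(k == "RUN" and "sudo" in a for k, a in steps):
        problems.append("no RUN line sets up sudo")
    if not any(k == "USER" and a != "root" for k, a in steps):
        problems.append("no non-root USER line")
    if not any(k == "CMD" and "sleep" in a and "infinity" in a for k, a in steps):
        problems.append("CMD is not sleep infinity")
    return problems


print(missing_requirements(SAMPLE_DOCKERFILE))  # []
```

The existing Dockerfiles under ansible/roles/os-containerised/files/ remain the authoritative examples.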

Generating ignore lists of files for a machine operating system

If you have just created a new operating system to emulate via the steps above, then when testing components (via the ./iac_deployer component_tests sub-command), you will find that a test may report that there were file changes that weren’t handled by the component being tested. This will probably mean you will need to create an “ignore list” of files for your new operating system. Please see the section entitled “How to create a test for a role (component)” for more details about this.

Note: you can specify what OS to test the component on when using the component_tests sub-command. For example:

$ ./iac_deployer component_tests run <component_name> --os "ubuntu-20.04"

Therefore, by replacing “ubuntu-20.04” with the name of your OS, you can run the component test on your new OS only (rather than all OSes) to speed up testing.
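
The effect of an ignore list on the file checking step can be pictured with the sketch below. It rests on two assumptions that this guide does not confirm: that patterns are shell-style globs (matched here with Python's fnmatchcase), and that an OS-specific list is applied in addition to the common list. The function name unhandled_changes and the example paths other than /test_file.txt are hypothetical.

```python
from fnmatch import fnmatchcase


def unhandled_changes(changed_paths, common_ignores, os_ignores=()):
    """Return the changed paths that no ignore pattern accounts for."""
    patterns = list(common_ignores) + list(os_ignores)
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


changed = ["/test_file.txt", "/var/cache/apt/pkgcache.bin", "/etc/leftover.conf"]
common = ["/test_file.txt"]          # e.g. from ignore_list.yml
ubuntu_only = ["/var/cache/apt/*"]   # e.g. from ubuntu-20.04/ignore_list.yml
print(unhandled_changes(changed, common, ubuntu_only))  # ['/etc/leftover.conf']
```

Anything left in the returned list corresponds to a file change the test would report as not handled by the component.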

How to create a role (component)

Each Ansible role in this repository represents an individual component that can be deployed to a machine or set of machines. Roles should be placed under the ansible/roles directory. There are already some sample roles that you can look at to get an idea of what is required. Also, there is a full example template for a role under ansible/examples/role_template/. You can copy this directory to ansible/roles/ and rename it and edit its files to suit your needs. Full details on how to create a new role without this template are below.

Create a new directory under ansible/roles. Its name will be the name of the component.
Within this new directory, create another directory called tasks. Within tasks, create a file named main.yml. This file will serve as the main entry point to the role.
The tasks that the role executes should be based on the value of the deploy_mode variable. This variable tells the role whether the role should be deployed (i.e. installed), or rolled back (i.e. uninstalled). deploy_mode will equal 'deploy' for deploy mode, and will equal 'rollback' for rollback mode. You should create deploy and rollback directories under tasks that each contain their own main.yml files. The main.yml file under tasks should then include either deploy/main.yml for deploy mode or rollback/main.yml for rollback mode. The idea is that rollback mode should undo everything that deploy mode did. See the docker and gitlab-runner roles for an example of all of this. Alternatively, you can look at the ansible/examples/role_template/tasks/ directory.
If you wish, you can create other .yml files within the deploy or rollback directories that can be included from other task files. For example, you could have debian.yml and redhat.yml files that contain OS-specific steps. See one of the sample roles e.g. docker (or role_template) for an example of this.
If, during the rollback steps, your role needs to remove a package e.g. via APT or YUM, then you must use the package_manager role to do this. The reason for this is package_manager automatically checks for reverse dependencies of the packages you’re trying to remove, and if it finds any, the package removal will be skipped. Reverse dependencies are other packages that have dependencies on the package(s) you’re trying to remove. If the package(s) you have listed were removed, then the reverse dependencies would be automatically removed too, which is undesirable, since you may end up removing a package that you didn’t intend to remove. If it’s okay that these reverse dependencies get removed, then they should be added to the list of packages provided to package_manager, so that it knows that it’s okay for those packages to be removed too. Please see the role_template, gitlab-runner, or docker roles (looking at files under the rollback directory) for an example of this.
If your role depends on another role, then create a directory called meta within the role’s root directory and create a main.yml file within that. Then use the dependencies key, followed by a list of YAML values for each role that your role depends on. See the gitlab-runner role or the role_template for an example. You can read more about role dependencies here: https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html#using-role-dependencies
- NB: Please note that you should be careful if you create multiple roles, where one of these roles depends on the other, that have other overlapping dependencies. For example, if role A conditionally depends on role B, and both role A and B have a non-conditional dependency on role C, then, when running role A, Ansible will apply the condition for role B to the role C dependency for both cases - rather than just for the role B case. This can potentially cause role C’s steps to be skipped for apparently unknown reasons. A workaround for this is to use the Ansible include_role or import_role modules for role C from within role A’s tasks. The role B condition will then not be applied to the included role C.
It can also be helpful to have a set of default values for the variables that your role makes use of, that the user can override (e.g. via the command line) if they wish. You can set the default values via another main.yml file under a defaults directory i.e. the path to main.yml would be <repo root>/ansible/roles/<name of role>/defaults/main.yml.
If your role needs to deploy one or more files to the machine(s) being deployed to, you can store these under a files subdirectory (files should be at the same level as defaults and tasks).
You can use Ansible handlers that can perform tasks e.g. restart a service, only if certain other task(s) have been done e.g. updating the configuration file for that service. The handlers should live in a main.yml file under a handlers subdirectory. You would use the notify setting within a task definition to tell Ansible which handlers should be run when the task is run. For more information about handlers, please visit: https://docs.ansible.com/ansible/latest/user_guide/playbooks_handlers.html
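
The directory layout described in the steps above can be generated with a short script. This is a minimal sketch for illustration only: scaffold_role is a hypothetical helper, the placeholder task contents are invented, and copying ansible/examples/role_template/ is likely the simpler route in practice. The entry-point YAML uses ansible.builtin.include_tasks with a condition on deploy_mode, which is one plausible way to implement the dispatch described above.

```python
import tempfile
from pathlib import Path

# tasks/main.yml: dispatch on deploy_mode, as described in the steps above.
ENTRY_POINT = """\
---
- name: Run deploy steps
  ansible.builtin.include_tasks: deploy/main.yml
  when: deploy_mode == 'deploy'

- name: Run rollback steps
  ansible.builtin.include_tasks: rollback/main.yml
  when: deploy_mode == 'rollback'
"""


def scaffold_role(roles_dir: Path, name: str) -> Path:
    """Create a skeleton role called `name` under `roles_dir`."""
    role = roles_dir / name
    for mode in ("deploy", "rollback"):
        mode_dir = role / "tasks" / mode
        mode_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder task so each included file is valid; replace it.
        (mode_dir / "main.yml").write_text(
            "---\n"
            f"- name: Placeholder {mode} task for {name}\n"
            "  ansible.builtin.debug:\n"
            f'    msg: "Replace with the real {mode} steps"\n'
        )
    (role / "tasks" / "main.yml").write_text(ENTRY_POINT)
    # Optional directories mentioned in this section.
    for optional in ("defaults", "meta", "handlers", "files"):
        (role / optional).mkdir(exist_ok=True)
    return role


# Demonstration in a throwaway directory rather than ansible/roles.
with tempfile.TemporaryDirectory() as tmp:
    role = scaffold_role(Path(tmp), "my-component")
    for path in sorted(p.relative_to(role).as_posix() for p in role.rglob("*")):
        print(path)
```

To use it against a real checkout, pass Path("ansible/roles") as roles_dir when running from the repository's root.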

More information about Ansible roles can be found here: https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html
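
The reverse-dependency rule that package_manager applies during rollback can be illustrated with a small sketch. It is not the role's implementation: the function removal_blockers, the package names, and the reverse_deps mapping are hypothetical, and a real check would query the system's package manager rather than a hard-coded dictionary.

```python
def removal_blockers(requested, reverse_deps):
    """Return reverse dependencies that would block removing `requested`.

    `reverse_deps` maps a package to the packages that depend on it.
    A reverse dependency blocks removal unless it is itself requested.
    """
    requested = set(requested)
    blockers = {}
    for package in sorted(requested):
        extra = set(reverse_deps.get(package, ())) - requested
        if extra:
            blockers[package] = sorted(extra)
    return blockers


# Hypothetical data: app-a depends on libfoo.
reverse_deps = {"libfoo": ["app-a"]}

# app-a was not listed, so removing libfoo would be skipped.
print(removal_blockers(["libfoo"], reverse_deps))  # {'libfoo': ['app-a']}

# Listing app-a as well signals that removing it is acceptable.
print(removal_blockers(["libfoo", "app-a"], reverse_deps))  # {}
```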

How to create a test for a role (component)

It is good practice to test each Ansible role (component) you have created to ensure that it works as you intend before deploying it to production machines. To do this, for each component that you want to test, create a corresponding directory under ansible/tests/components/ with the same name as the Ansible role (e.g. gitlab-runner). See ansible/examples/component_test_template/ for a template for creating a component test. The component test directory should contain the following files / directories:

deploy.yml and rollback.yml playbooks, that contain steps for deploying and rolling back the component, respectively. During the test, each playbook will be run twice in succession to ensure that no errors occur the second time. These playbooks are mandatory.
Optionally, you can define additional Ansible playbook file(s) that can be used to test the role more thoroughly. You can define multiple playbooks if you want to test different aspects of the component. Each playbook file must start with the test_ prefix and have a .yml extension, for example test_docker-runner.yml. iac_deployer will search for files that meet these criteria and run each of them in turn after the deploy.yml playbook has been run the second time.
A file called ignore_list.yml can be created in the component test directory. This file should contain a YAML list of file / directory patterns. Any files / directories that match one of these patterns will be ignored by the file checking step of the component test. For example, if you know that your component will create a file called /test_file.txt, and it doesn’t matter if that file isn’t deleted by the component’s rollback steps, then you should add the file to the ignore list.
- In addition, you can define OS-specific ignore lists by creating a directory named after the OS (e.g. ubuntu-20.04) in the component test directory, and within that directory, you should create another ignore_list.yml file. This file will then only be used when the component is being tested on the defined OS.
- Additionally, if the component (i.e. role) defines any role dependencies in its meta/main.yml file, then the ignore lists for those components (plus the ignore lists for any of the dependencies’ dependencies) will be included too.
Optionally, playbooks called setup.yml and teardown.yml can also be created in the component test directory. These playbooks should contain any steps that need to be carried out before the first run of the deploy.yml playbook and after the last run of the rollback.yml playbook, respectively. setup.yml will be executed once before all other playbooks begin, and teardown.yml will be executed once after all other playbooks have completed. This can be helpful if your tests need some sort of fixture to run their tests against. For example, the gitlab-runner component test creates a local GitLab instance in a container for testing against and removes it at the end.
Also, if the test needs any variables to be set that need to be available to all of the playbooks, then these can be defined in a vars.yml file.
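The order in which the playbooks described above are run can be pictured with a short Python sketch. This is only an illustration of the ordering stated in this section, not the iac_deployer implementation: the function name plan_component_test is hypothetical, and running the test_*.yml playbooks in sorted order is an assumption.

```python
from pathlib import Path


def plan_component_test(test_dir):
    """Return the playbook file names for one component test, in run order."""
    test_dir = Path(test_dir)

    # deploy.yml and rollback.yml are mandatory.
    for required in ("deploy.yml", "rollback.yml"):
        if not (test_dir / required).is_file():
            raise FileNotFoundError(f"{required} is missing from {test_dir}")

    # Optional extra playbooks: test_*.yml (sorted order is an assumption).
    extra_tests = sorted(p.name for p in test_dir.glob("test_*.yml"))

    plan = []
    if (test_dir / "setup.yml").is_file():
        plan.append("setup.yml")          # once, before everything else
    plan += ["deploy.yml", "deploy.yml"]  # second run must also succeed
    plan += extra_tests                   # after the second deploy run
    plan += ["rollback.yml", "rollback.yml"]
    if (test_dir / "teardown.yml").is_file():
        plan.append("teardown.yml")       # once, after everything else
    return plan
```

Calling plan_component_test on a directory that holds only deploy.yml and rollback.yml would return each of those names twice, deploy first.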

See the gitlab-runner directory for an example test for that component. It is good practice to write your component tests such that, once they have completed, everything that the test set up is torn down so the development machine is left in a clean state (i.e. via the use of the setup.yml and teardown.yml playbooks).
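The way ignore lists feed into the file checking step can be sketched in Python as follows. This is a rough illustration under several assumptions: the patterns are treated as shell-style globs (the actual pattern syntax is not stated here), the check is modelled as a comparison of file listings taken before the test and after rollback, and the function names and role names (role_a, role_b, role_c) are hypothetical.

```python
from fnmatch import fnmatchcase


def collect_ignore_patterns(role, dependencies, ignore_lists):
    """Gather ignore patterns for a role and all of its transitive dependencies.

    dependencies: {role name: [names of roles it depends on]}
    ignore_lists: {role name: [patterns from that role's ignore_list.yml]}
    """
    patterns, seen, stack = [], set(), [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against shared or circular dependencies
        seen.add(current)
        patterns.extend(ignore_lists.get(current, []))
        stack.extend(dependencies.get(current, []))
    return patterns


def unexpected_leftovers(before, after, patterns):
    """Return paths that exist after rollback, did not exist before, and are not ignored."""
    new_paths = sorted(set(after) - set(before))
    return [p for p in new_paths
            if not any(fnmatchcase(p, pattern) for pattern in patterns)]
```

An empty result from unexpected_leftovers would correspond to a rollback that left the machine clean, apart from the files that were explicitly ignored.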

Once you are happy with the tests you have created, you should run them by using the wrapper script’s function component_tests. You should pass the name of the component to the function (e.g. gitlab-runner).

How to create a configuration for a machine or group

The roles (components) that can be deployed to machine(s) or group(s) of machines should be available under ansible/roles. Separate Ansible playbooks can then be used to decide which roles should be deployed on which machines.

Firstly, if this is a new machine or group of machines, you will need to add it to the production inventory file (located at <repo root>/ansible/production). See the file for an example configuration. All hosts should be listed at the top of the file, with any groups defined below that. Group names are defined by surrounding them with square brackets []. See this page for more details about building the inventory file: https://docs.ansible.com/ansible/latest/user_guide/intro_inventory.html
Next, create a file named after the machine or group within the ansible/hosts/ directory (or ansible/groups/ for a group), with a .yml extension i.e. <host / group name>.yml.
- There is a sample file at ansible/examples/template.yml that shows an example which you can adapt to suit your needs. In particular, you should replace all occurrences of <machine / group name> in the file with the name of the machine or group that you want to configure.
Next, specify the list of roles that should be deployed on the particular machine or group. This is simply a YAML list underneath the roles: heading.
For each machine you are configuring, in the host playbook, you need to set the os variable to the operating system that the production machine is running. When the configuration is deployed, if there is a mismatch between this value and the actual detected operating system, an error will be raised. The os variable is used for the machine_tests function so it knows what operating system to use as a base. To see an example of this, view one of the existing machine playbooks in ansible/hosts.
You should now test your machine or group configuration. You can do this by using the machine_tests function from the wrapper script. This function will create a Docker container on your development machine with the same OS as the production machine. It will then deploy the roles within the container, which should help to identify any errors before deploying in production.
Once you are happy that the configuration is tested, you can use the machines deploy function to deploy it in production (following the appropriate git workflow - see the above section).

How to test a configuration for a machine

You can use the machine_tests function of the wrapper script to test your configuration for a production machine locally, in a development Docker container. This helps to ensure that the configuration deploys successfully before it is deployed in production. The steps that the wrapper script takes are as follows:

The user runs machine_tests run [machine_name]. There must be a playbook for the machine under ansible/hosts/ and the operating system for the machine must be defined as a variable called os within this playbook (under the vars section).
The script looks up the value of the os variable for the machine and remembers it
The script runs the ansible/tests/docker-create.yml playbook, passing the os and container_name variables. The container_name variable will be equal to the machine name.
- This will create a Docker container named after the production machine, using the machine’s OS
The script generates a site.yml playbook by concatenating all the ansible/hosts/*.yml and ansible/groups/*.yml playbooks together
The script copies the production inventory and replaces the line for the machine under [all] with a line that is the same except it sets ansible_connection to docker and removes any real IP address of the machine
The script then runs site.yml, passing the edited inventory, but limits the hosts to just the Docker container named after the machine
- This will then deploy the production machine’s config inside the Docker container, and any immediate problems will become obvious
Afterwards, the script runs ansible/tests/docker-destroy.yml with container_name set to the machine name to tear down the Docker container
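Two of the steps above, generating site.yml and editing the inventory, can be sketched in Python. This is a simplified illustration rather than the wrapper script’s actual code: both function names are hypothetical, the inventory is assumed to be in INI format with one host per line under [all] and no quoted values containing spaces, and the real address is assumed to be given via ansible_host.

```python
def build_site_playbook(playbook_texts):
    """Concatenate host and group playbooks into a single list of plays."""
    plays = []
    for text in playbook_texts:
        body = text.strip()
        # Drop a leading YAML document marker so the result stays one document.
        if body.startswith("---"):
            body = body[3:].lstrip("\n")
        plays.append(body)
    return "---\n" + "\n\n".join(plays) + "\n"


def dockerise_inventory(inventory_text, machine):
    """Point the machine's entry under [all] at a Docker container."""
    out, section = [], None
    for line in inventory_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]
        fields = stripped.split()
        if section == "all" and fields and fields[0] == machine:
            # Keep every setting except the real address and any old connection type.
            kept = [f for f in fields[1:]
                    if not f.startswith(("ansible_host=", "ansible_connection="))]
            line = " ".join([machine, *kept, "ansible_connection=docker"])
        out.append(line)
    return "\n".join(out) + "\n"
```

Because the container is named after the machine, the edited entry lets Ansible reach the container by that same name through its Docker connection plugin, while entries for other hosts and group membership lines are left untouched.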

How the rollback of components works

When a deployment is initiated via the machines deploy function, the following steps will be carried out for each machine (host) that is being deployed to:

The status of the machine will be looked up from the status repository and, if it already has a deployed configuration, the configuration will be rolled back. Otherwise, the rollback steps are skipped.
The git archive command will be used to generate a tar archive of the repository at the commit SHA of the last deployment (as obtained from the status repository) and the iac_deployer script from the extracted archive’s contents will be run to perform the rollback.
This will cause the host playbook in the extracted archive (found in ansible/hosts) to be run, and the deploy_mode variable will be set to “rollback”.
The roles for the host will then be run with the value of this variable, which will run the uninstall steps for each role, removing the config that was deployed previously.
Once that has finished, the extracted archive will be deleted, and then the original repository’s playbook will be run, but this time with the deploy_mode variable set to “deploy”. This will initiate the latest deployment with the updated roles.
Once the deployment has completed, the status.json file from the status repository will be updated to point to the current commit for this host.
These steps will then be repeated for any other hosts that are being deployed to.
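The sequence above can be summarised as a small Python sketch that only plans the steps and does not execute anything. The function names and step labels are hypothetical, and the status data is assumed to be a simple mapping from host name to the commit SHA of its last deployment; the real layout of status.json may differ.

```python
def plan_host_deployment(host, status, current_sha):
    """Return the ordered steps for deploying the current commit to one host."""
    steps = []
    last_sha = status.get(host)
    if last_sha is not None:
        # A previous deployment exists: roll it back using the old commit's code.
        steps.append(("git_archive", last_sha))
        steps.append(("run_host_playbook", last_sha, "rollback"))
        steps.append(("delete_archive", last_sha))
    # Deploy from the current repository, then record the new commit.
    steps.append(("run_host_playbook", current_sha, "deploy"))
    steps.append(("update_status", current_sha))
    return steps


def plan_all(hosts, status, current_sha):
    """Plan each host in turn, updating the status mapping as it goes."""
    plans = {}
    for host in hosts:
        plans[host] = plan_host_deployment(host, status, current_sha)
        status[host] = current_sha
    return plans
```

Running the rollback from the archived commit, rather than from the current checkout, means a role is uninstalled by the same version of the role that installed it.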

How host status checking works

When the user calls the machines status function, a diff report is generated for each machine (host) specified, showing the differences between the last deployment to the host and the current repository’s configuration. The report is built as follows:

The report will list the roles that have been added, removed, or modified. If the machine’s playbook has changed, this will also be shown.
If there are no role changes, the report will simply state that the host is up-to-date with the repository’s current configuration.
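The role comparison behind such a report can be sketched in Python. This is a minimal illustration, assuming each configuration is summarised as a mapping from role name to some content fingerprint (for example a hash of the role’s files). The function names and the report wording are hypothetical, and the playbook change check is not modelled.

```python
def role_diff(deployed, current):
    """Compare two {role name: content fingerprint} mappings."""
    added = sorted(set(current) - set(deployed))
    removed = sorted(set(deployed) - set(current))
    modified = sorted(role for role in set(current) & set(deployed)
                      if current[role] != deployed[role])
    return {"added": added, "removed": removed, "modified": modified}


def status_report(host, deployed, current):
    """Build a one-line report for a host from the role differences."""
    diff = role_diff(deployed, current)
    if not any(diff.values()):
        return f"{host}: up-to-date with the repository's current configuration"
    parts = [f"{kind}: {', '.join(roles)}" for kind, roles in diff.items() if roles]
    return f"{host}: " + "; ".join(parts)
```

A role counts as modified only when it appears in both configurations with different fingerprints, so a role that has been renamed would show up as one removal and one addition.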

Full script usage details

You can simply type

$ ./iac_deployer --help

to view the script’s usage. The output is shown below.

usage: iac_deployer [-h] {component_tests,machine_tests,machines} ...

Interface to initiate testing and deployment of the defined Ansible
configurations from the repository.

positional arguments:
  {component_tests,machine_tests,machines}
                        Sub-command help
    component_tests     List or run component tests
    machine_tests       List or run tests for machine configurations
    machines            Perform or check machine deployments, including
                        rollbacks

optional arguments:
  -h, --help            show this help message and exit

The command specific help outputs are shown below.

component_tests

$ ./iac_deployer component_tests --help
usage: iac_deployer component_tests [-h] {list,run} ...

positional arguments:
  {list,run}  sub-command help
    list      List available component tests
    run       Run a component test

optional arguments:
  -h, --help  show this help message and exit

machine_tests

$ ./iac_deployer machine_tests --help
usage: iac_deployer machine_tests [-h] {list,run} ...

positional arguments:
  {list,run}  sub-command help
    list      List available machine configurations that can be tested
    run       Run a test for a machine configuration

optional arguments:
  -h, --help  show this help message and exit

machines

$ ./iac_deployer machines --help
usage: iac_deployer machines [-h] {deploy,status} ...

positional arguments:
  {deploy,status}  sub-command help
    deploy         Command to initiate rollbacks (if necessary), followed by
                   deployments of machine configurations
    status         Command to check the status of machines

optional arguments:
  -h, --help       show this help message and exit
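
Taken together, the help outputs above describe a two-level command structure. Their format resembles that produced by Python’s argparse module, so a parser with the same layout could be sketched as below. This is an assumption about the structure only, not the tool’s actual source: build_parser and the name argument of the run sub-commands are hypothetical, and any arguments accepted by machines deploy and machines status are left out because the help output above does not show them.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the sub-command layout shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="iac_deployer",
        description="Interface to initiate testing and deployment of the "
                    "defined Ansible configurations from the repository.",
    )
    commands = parser.add_subparsers(dest="command", help="Sub-command help")

    component = commands.add_parser("component_tests",
                                    help="List or run component tests")
    actions = component.add_subparsers(dest="action", help="sub-command help")
    actions.add_parser("list", help="List available component tests")
    run = actions.add_parser("run", help="Run a component test")
    run.add_argument("name")  # hypothetical: name of the component to test

    machine = commands.add_parser(
        "machine_tests", help="List or run tests for machine configurations")
    actions = machine.add_subparsers(dest="action", help="sub-command help")
    actions.add_parser(
        "list", help="List available machine configurations that can be tested")
    run = actions.add_parser("run", help="Run a test for a machine configuration")
    run.add_argument("name")  # hypothetical: name of the machine to test

    machines = commands.add_parser(
        "machines",
        help="Perform or check machine deployments, including rollbacks")
    actions = machines.add_subparsers(dest="action", help="sub-command help")
    actions.add_parser("deploy", help="Command to initiate rollbacks (if "
                       "necessary), followed by deployments of machine configurations")
    actions.add_parser("status", help="Command to check the status of machines")
    return parser
```

With this layout, build_parser().parse_args(["machine_tests", "run", "my-machine"]) yields a namespace whose command, action and name attributes identify what to run.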