How the WMS got automatically provisioned across 5 environments and 14 servers using Ansible and GitLab CI

In 2017, the developers at the Web Services Group (WSG), along with sysadmins at Network and Communication Services (NCS), upgraded the infrastructure of the Web Management System (WMS). They took this opportunity to revisit the way they collaborate and provision servers in order to address common pain points.

Establishing a formal workflow

The WMS is made of:

2 Test environments (QA, RAS) primarily used by internal and external developers for development and testing purposes
3 Production environments (Training, Staging, Live) intended for end-users

These environments used to be hosted on 4 machines, with some of them serving multiple environments. Both developers and sysadmins used to have full (root) access to all environments and would manually configure the machines as needed, with no formal process or workflow.

Pain points associated with this setup include:

A lack of transparency when making changes: anyone from either team has the ability to change any server without letting the rest of the developers/sysadmins know
A divergence of environments: there is no guarantee that all environments are configured the same way (e.g. have the same Apache configuration)
A divergence of machines in the same environment: likewise, there is no guarantee that the two Staging/Live servers are configured the same way (e.g. have the same file permissions)
Inappropriate permissions: granting developers full admin access lets them modify parts of the server that should be left to sysadmins (e.g. the sudoers file)
No record of customizations: migrating the WMS to new servers means developers and sysadmins have to start by figuring out which parts of the machines have been tailored to the WMS

As a result, a formal workflow has been put in place:

Developers express their infrastructure needs (packages required, software configuration, remote filesystems, etc.) in Ansible playbooks, which are stored on a Git server like regular code
Whenever developers push a change to the Git master branch, a Continuous Integration system (GitLab CI) automatically starts an Ansible client in a Docker container and then provisions the Test environments by running these playbooks
When developers are satisfied with the state of the Test environments, they submit a Merge Request from the master branch to the deploy branch on GitLab
Sysadmins review the Merge Request, allowing for feedback, and either accept or reject it
Once sysadmins accept the Merge Request, GitLab CI starts an Ansible client in a Docker container and then automatically provisions the Production environments by running the playbooks
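
The branch-based workflow above could be expressed in a GitLab CI configuration along the following lines. This is only a minimal sketch and not the actual WMS configuration: the Docker image, the playbook name (site.yml) and the inventory paths are assumptions made for illustration, and the handling of SSH credentials is left out.

```yaml
# .gitlab-ci.yml (illustrative sketch; file names and paths are assumptions)
image: python:3            # any image able to run Ansible would do

before_script:
  - pip install ansible    # turns the container into a throwaway Ansible client

stages:
  - provision

provision_test:
  stage: provision
  script:
    # hypothetical inventory listing the QA and RAS hosts
    - ansible-playbook -i inventories/test site.yml
  only:
    - master               # runs on every push to master

provision_production:
  stage: provision
  script:
    # hypothetical inventory listing the Training, Staging and Live hosts
    - ansible-playbook -i inventories/production site.yml
  only:
    - deploy               # runs once a Merge Request into deploy is accepted
```

With a layout like this, restricting who may merge into the deploy branch is what keeps Production deployments in the hands of sysadmins.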

Key benefits of this new workflow include:

Extensive automation, which makes the provisioning process testable and reliable, and makes both teams more efficient
A record of infrastructure changes that is accessible to both teams, which allows everyone to troubleshoot deployments
A clear separation of roles between developers and sysadmins, which encourages collaboration and promotes the stability of Production environments while making developers more productive when working on Test environments

Automating server provisioning

Key benefits of automating server provisioning:

Efficiency: a new machine can be set up in minutes, whereas manual provisioning would take hours or days; also, Ansible runs each task on several hosts in parallel, so multiple servers get provisioned at the same time
Reliability: an automated provisioning process gives confidence that a new machine will be set up correctly and will not be missing any package or configuration, whereas manual provisioning introduces the possibility of human error; also, Ansible playbooks are designed to be idempotent, so they can be run at any time to bring a machine back to its expected state without fear of breaking things; in fact, the playbooks run automatically on all environments once a week to ensure consistency in case of occasional manual intervention
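
To illustrate what idempotent means in practice, here is a minimal playbook sketch. It is not taken from the WMS repository: the package name, paths, and owner are assumptions chosen for the example. Each task describes a desired state rather than a command to execute, so running the playbook a second time changes nothing unless the machine has drifted.

```yaml
# site.yml (illustrative sketch; package name, paths and owner are assumptions)
- hosts: all
  become: true
  tasks:
    - name: Ensure Apache is installed
      package:
        name: httpd          # would be apache2 on Debian-based systems
        state: present       # no-op if the package is already installed

    - name: Ensure the document root has the expected ownership and mode
      file:
        path: /var/www/wms   # hypothetical path
        state: directory
        owner: apache
        group: apache
        mode: "0755"         # reset only if someone changed it by hand

    - name: Ensure Apache is running and starts at boot
      service:
        name: httpd
        state: started
        enabled: true
```

On a machine that already matches this description, Ansible reports every task as "ok" rather than "changed", which is what makes a weekly run safe.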

Automation was especially valuable when upgrading the infrastructure of the WMS, as manually provisioning 14 servers across 5 environments would likely have taken considerably more time and effort.

It was also valuable when the team decided to strengthen the WMS infrastructure in August 2017 by adding a couple of extra servers to the Live environment in anticipation of increased resource usage. Provisioning a couple of extra machines required little effort.
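
As a rough illustration of why adding servers takes little effort, consider an Ansible inventory written in YAML. The file path, host names and host count below are placeholders, not the actual WMS machines.

```yaml
# inventories/production/hosts.yml (illustrative; host names are placeholders)
all:
  children:
    live:
      hosts:
        live1.example.org:
        live2.example.org:
        # adding capacity amounts to listing the new machines here;
        # the next playbook run brings them to the same state as the others
        live3.example.org:
        live4.example.org:
```

Because the playbooks already describe the expected state of a Live server, new entries in the inventory are provisioned the same way as the existing ones.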

Making infrastructure changes traceable

The new workflow makes infrastructure changes traceable at multiple levels:

The expected state of the machines is stored in a declarative format (Ansible playbooks), so everyone knows what the servers are supposed to look like; this is especially helpful when a new team member with no prior knowledge comes aboard and needs to quickly learn about the WMS
Ansible playbooks are stored in a version control system (Git), so everyone can see who pushed which change, when, and why, which provides a de facto history of the infrastructure
Ansible runs in a continuous integration system (GitLab CI) and the output logs are saved, so everyone can see who deployed which change to each environment, when, why, and whether it succeeded or not; this is particularly handy when troubleshooting deployments

None of the above was traceable in the old model.

Clarifying responsibilities while promoting collaboration between developers and sysadmins

The new model distinguishes the role of developers from that of sysadmins and fulfills the needs of both teams:

Ansible runs automatically on Test environments, so developers can quickly experiment and express their needs without any intervention from sysadmins
Only sysadmins can run Ansible on Production environments, so they can ensure that only appropriate parts of the machines are updated and they have better control over the system

At the same time this workflow encourages the two teams to collaborate:

Sending a Merge Request prompts a formal review as well as discussions about the requested change
Both developers and sysadmins receive a notification when a Merge Request is approved and a deployment is happening on production