How the WMS got automatically provisioned on 5 environments and 14 servers
using Ansible and GitLab CI
In 2017 the developers at the Web Services Group (WSG) along with sysadmins
at Network and Communication Services (NCS) upgraded the infrastructure of
the Web Management System (WMS) and took this opportunity to revisit the way
they collaborate and provision servers in order to address common pain
points.
Establishing a formal workflow
The WMS is made of:
- 2 Test environments (QA, RAS) primarily used by internal and external
developers for development and testing purposes
- 3 Production environments (Training, Staging, Live) intended for
end-users
These environments used to be hosted on 4 machines, some of which served
multiple environments. Both developers and sysadmins had full (root) access
to all environments and would manually configure the machines as needed,
with no formal process or workflow.
Pain points associated with this setup include:
- A lack of transparency when making changes: anyone from either team has
the ability to change any server without letting the rest of the
developers/sysadmins know
- A divergence of environments: there is no guarantee that all environments
are configured the same way (e.g. have the same Apache configuration)
- A divergence of machines in the same environment: likewise, there is no
guarantee that the two Staging/Live servers are configured the same way
(e.g. have the same file permissions)
- Inappropriate permissions: granting developers full admin access lets
them update parts of the server that should be left to sysadmins (e.g. the
sudoers file)
- No record of customizations: migrating the WMS to new servers means
developers and sysadmins have to start by figuring out which parts of the
machines have been tailored to the WMS
As a result, a formal workflow was put in place:
- Developers express their infrastructure needs (packages required,
software configuration, remote filesystems, etc.) in Ansible playbooks,
which are stored on a Git server like regular code
- Whenever developers push a change to the Git master branch, a
Continuous Integration system (GitLab CI) automatically starts an Ansible
client in a Docker container and then provisions the Test environments by
running these playbooks
- When developers are satisfied with the state of the Test environments,
they submit a Merge Request from the master branch to the deploy
branch on GitLab
- Sysadmins review the Merge Request, allowing for feedback, and either
accept or reject it
- Once sysadmins accept the Merge Request, GitLab CI starts an Ansible
client in a Docker container and then automatically provisions the
Production environments by running the playbooks
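The routing rule behind this workflow can be modelled in a few lines. The sketch below is only an illustration in Python; in the real setup this logic lives in the GitLab CI configuration, which is not shown in this article. The branch and environment names come from the list above, while the inventory file name, playbook name, and the idea of one inventory group per environment are assumptions made for the example.

```python
# Hypothetical model of the branch-to-environment routing described above.
# Branch and environment names come from the article; the inventory path,
# playbook name, and per-environment group names are assumptions.

ENVIRONMENTS_BY_BRANCH = {
    "master": ["qa", "ras"],                    # Test environments
    "deploy": ["training", "staging", "live"],  # Production environments
}


def ansible_commands(branch, inventory="inventory.ini", playbook="site.yml"):
    """Return one ansible-playbook command per environment for this branch."""
    commands = []
    for env in ENVIRONMENTS_BY_BRANCH.get(branch, []):
        # --limit restricts the run to the hosts in the named group.
        commands.append(
            ["ansible-playbook", "-i", inventory, playbook, "--limit", env]
        )
    return commands


if __name__ == "__main__":
    # A push to master would only ever touch the Test environments.
    for command in ansible_commands("master"):
        print(" ".join(command))
```

Any branch other than master or deploy yields no commands, which mirrors the fact that only those two branches trigger provisioning.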
Key benefits of this new workflow include:
- Extensive use of automation, which makes the provisioning process
testable and reliable, and makes both teams more efficient
- A record of infrastructure changes that is accessible to both teams,
which allows everyone to troubleshoot deployments
- A clear separation of roles between developers and sysadmins, which
encourages collaboration and promotes the stability of Production
environments while making developers more productive when working on Test
environments
Automating server provisioning
Key benefits of automating server provisioning:
- Efficiency: a new machine can be set up in minutes, whereas manual
provisioning would take hours or days; also, Ansible runs each task on
multiple hosts in parallel, which means several servers get provisioned at
the same time
- Reliability: an automated provisioning process brings the confidence that
the new machine will be set up appropriately and will not be missing any
package or configuration, whereas manual provisioning introduces the
possibility of human error; also, Ansible playbooks are idempotent, which
means they can be re-run at any time to make sure nothing was changed
locally, without fear of breaking things; the playbooks actually run
automatically on all environments once a week to ensure consistency in case
of occasional manual intervention
This was especially valuable when upgrading the infrastructure of the WMS,
as provisioning 14 servers on 5 environments manually would likely have taken
far more time and effort.
This was also valuable when the team decided to strengthen the WMS
infrastructure in August 2017 by adding a couple of extra servers to the Live
environment in anticipation of increased resource usage. Provisioning these
extra machines required little effort.
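The idempotence mentioned above is what makes the weekly re-runs safe. The following Python sketch illustrates the idea only; it is not Ansible code and not the team's playbooks. The function name ensure_line and the configuration lines are hypothetical. The task describes a desired state, changes nothing when that state already holds, and reports whether it had to act.

```python
def ensure_line(lines, wanted):
    """Return (new_lines, changed) with `wanted` guaranteed to be present."""
    if wanted in lines:
        # Desired state already holds: report "no change" and do nothing.
        return list(lines), False
    # Desired state is missing: add it and report that a change was made.
    return list(lines) + [wanted], True


# Hypothetical configuration lines, used only for illustration.
config = ["ServerTokens Prod"]

config, first_run_changed = ensure_line(config, "KeepAlive On")
config, second_run_changed = ensure_line(config, "KeepAlive On")

# Simulate a manual edit that removes the line, then re-run the task.
drifted = [line for line in config if line != "KeepAlive On"]
repaired, repair_changed = ensure_line(drifted, "KeepAlive On")

print(first_run_changed, second_run_changed, repair_changed)
# prints: True False True
```

The second run changes nothing, while the run after the simulated manual edit restores the expected line. This is the behaviour a scheduled re-run relies on to correct occasional manual intervention.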
Making infrastructure changes traceable
The new workflow makes infrastructure changes traceable at multiple levels:
- The expected state of the machines is stored in a declarative format
(Ansible playbooks), so everyone knows what the servers are supposed to
look like; this is especially helpful when a new team member with no prior
knowledge comes aboard and needs to quickly learn about the WMS
- Ansible playbooks are stored in a version control system (Git), so
everyone knows what was changed, when, why, and by whom, which offers a de
facto history of the infrastructure
- Ansible runs in a continuous integration system (GitLab CI) and the
output logs are saved, so everyone knows what was deployed to each
environment, when, why, and by whom, and whether it succeeded or not; this
is particularly handy when troubleshooting deployments
None of the above was traceable in the old model.
Clarifying responsibilities while promoting collaboration between
developers and sysadmins
The new model distinguishes the role of developers from that of sysadmins
and fulfills the needs of both teams:
- Ansible runs automatically on Test environments, so developers can
quickly experiment and express their needs without any intervention from
sysadmins
- Only sysadmins can run Ansible on Production environments, so they can
ensure that only appropriate parts of the machines are updated and they
retain better control over the system
At the same time this workflow encourages the two teams to collaborate:
- Sending a Merge Request prompts a formal review as well as discussions
about the requested change
- Both developers and sysadmins receive a notification when a Merge Request
is approved and a deployment is happening on production