Architecture, Deployment and Operations
A baseline application development and deployment model based on lessons learned from the last two years of building, extending and deploying apps in CI/CD in “enterprise” GIS.
Over the last two years I’ve started and contributed to a number of projects at work. In each of those projects, I’ve seen myriad approaches to solving the same problems: provisioning local development environments, standing up test/stage/prod deployment infrastructure and effectively managing the configurations between all of those things. I’ve made my own naïve attempts to solve these problems, often with little to no visibility into the context of upstream environments that the code will be deployed into.
While some of those approaches have been well thought-out, executed and documented, most have been nightmarishly inconsistent and fraught with kludges and hacks (occasionally for good reason). This post is an attempt to reconcile the lessons learned from all of that into a single document/design specification that can serve as the basis for further discussion and refinement with other devs and ops folk.
Much of my thinking on this is inspired by discussions with folks in
operations, quality assurance and other developers, from reading opinions on
Twitter, HN, from decomposing aspects of some popular PaaS platforms and from
poring over scores of
What’s the scope of the problem?
Here are some of the stickier issues I’ve seen that my proposed model aims to solve:
Difficulty standing up a development environment on a developer’s local machine
Installing and running the bits needed to run one or more components on a developer’s local machine ranges from cumbersome to impossible.
System-level incompatibilities between builders and runtime environments cause builds to pass but deployments to fail
We had a microservice that unit tested and packaged successfully in CI but
exploded at deployment time because we were vendoring dependencies at build
time (pip download -r requirements.txt) and the build agent, running
RHEL, provides Python 2.7 compiled with narrow Unicode character width (UCS-2)
while the runtime environment, running Debian, provides Python 2.7 compiled
with wide Unicode character widths (UCS-4) (ref).
PSA: stuff like this is why you should be using Python 3, people. ;)
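One cheap way to catch this class of mismatch before vendoring anything is to have both the build agent and the runtime host report their interpreter's Unicode width and compare the two. A minimal sketch that relies only on sys.maxunicode (the function name is my own, not something from our pipeline):

```python
import sys


def unicode_width():
    """Return 'UCS-2' for a narrow build and 'UCS-4' for a wide one.

    Narrow Python 2 builds report sys.maxunicode == 0xFFFF; wide builds
    (and every Python 3.3+ interpreter) report 0x10FFFF.
    """
    return "UCS-2" if sys.maxunicode == 0xFFFF else "UCS-4"


if __name__ == "__main__":
    # Run this on both the build agent and the runtime host, then compare.
    print(unicode_width())
```

On CPython 2.7 the same distinction shows up in the ABI tag of compiled wheels (cp27m for narrow builds, cp27mu for wide ones), which is why packages downloaded on one box can refuse to install on the other.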
One-size-fits-all build infrastructure will inevitably lead to snowflake configurations
Most of the time, projects that are even written in the same language vary wildly in configuration and OS requirements. This leads to people adopting “pet” build agents that can handle their project’s needs by making (usually undocumented) ad hoc changes to the build boxes themselves. Eventually, that pet stops functioning in a predictable way and gets decommissioned and replaced with a more generic one that fixes some projects while hosing the others that had come to rely on the nonstandard behavior, starting the cycle all over again.
Installing “sciency” dependencies (e.g., GDAL, scipy, numpy, etc) can be hard to get right and harder to repeat
While they work, some of these libraries require some serious configuration
gymnastics to compile, install and make portable. At the risk of
overgeneralizing, scientists (rightfully) are often more interested in proving
theorems and correctness at the expense of code convention, readability and
(most importantly) portability. Some of the messiest/scariest code I’ve seen
has been from peeking at source code of widely-used geospatial libraries and
generate_tiles.py, and pretty much any
implementation of decimal-degree-to-UTM/MGRS coordinate conversion, etc).
Or, as one commenter at The Daily WTF put it:
As a sysadmin for a research computing facility, “scientists write code. It’s often not very good code” is the biggest understatement I’ve seen on this site. And yes, I’ve been reading this site for several years. Scientific code is, generally speaking, horrific.
I still have nightmares about the time I tried to install GDAL with Python bindings on bare metal on macOS. I’d never used Homebrew because I’d been actively trying to avoid it. I’ll spare you the gory details and just say that I now use Homebrew…
Managed PaaS components being arbitrarily removed/downgraded without notice
We were plagued by a random “rolling back” of CloudFoundry buildpack versions that would happen every couple of weeks or so and I’d have to bug the same guy every single time to have him fix it. I’m sure he got tired of me after the third or fourth time I raised the issue, but apparently not tired enough to automate the whole thing. ¯\_(ツ)_/¯
Mind you, I don’t attribute this occurrence to malice; more likely, some AMI or EBS snapshot kept falling back to baseline because changes were never persisted.
If you’d like to critique some aspect of this model, please post a comment on the GitHub gist I wrote to collect my notes.
This model depends on containers to bridge the gaps between developer laptops,
CI servers and production servers. If the deployment environments (e.g.,
stage, production, qa, etc) can’t run containers for whatever reason, at
minimum, containers should be used in CI to build and stage artifacts
that can be installed into whatever raw VM/box comprises the running system.
This model also assumes partial or full adherence to 12-Factor principles, at minimum, externalized configuration in the form of environment variables (preferably) or configuration files (if you’ve just gots to have you some crazy-convoluted configs).
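As a rough illustration of what externalized configuration can look like inside a component, here is a minimal sketch that builds settings from environment variables and fails fast when a required one is missing. The variable names (DATABASE_URL, SECRET, LOG_LEVEL) and the MissingConfig exception are hypothetical, picked only for this example:

```python
import os


class MissingConfig(Exception):
    """Raised when a required environment variable is absent."""


def load_config(environ=None):
    """Build a settings dict from environment variables.

    DATABASE_URL, SECRET and LOG_LEVEL are hypothetical names chosen for
    this example; substitute whatever the component actually needs.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in ("DATABASE_URL", "SECRET") if name not in env]
    if missing:
        # Fail at startup rather than on first use deep inside a request.
        raise MissingConfig("missing required variables: " + ", ".join(missing))
    return {
        "database_url": env["DATABASE_URL"],
        "secret": env["SECRET"],
        # Optional settings get a default that is safe for local development.
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```

Accepting the mapping as a parameter keeps the function easy to unit test without touching the real process environment.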
Finally, the model uses AWS concepts (e.g., EC2, RDS, CloudFormation, etc) for the purpose of illustration, but can work in other vendors’ cloud offerings.
Local Development Environment
Developers run Docker on their laptops to run and test individual application
components along with any backing services (e.g., PostgreSQL, RabbitMQ,
GeoServer, etc). Optionally, the requests to those collaborating services can
instead be proxied to the
stage deployment environment instances.
All configurations to the runtime environment occur inside the
docker-compose.yml or some other configuration file that is checked into
version control. Secrets are fed in via environment variables at runtime at
the command prompt, e.g.:
docker-compose build
docker-compose run -e SECRET=secret myproj-component-a
…or by whatever mechanism the developer’s IDE allows them to define and pass environment variables into Docker.
If the component needs some dependency that’s crazy-hard to install or compile,
that process should be extracted into its own
Dockerfile which is used to
create a base image. Once the base image is built, the actual application
Dockerfile should extend it via FROM.
This is an abstract pipeline design that optimizes for blue/green deployments, repeatable builds and bakes in the ability to rapidly deploy a hotfix in emergencies.
The unit being tested, built and deployed here is some application component such as a microservice.
| Parameter | Description |
|---|---|
| | Git tag or commit SHA to be built and deployed. |
| $skip_slow_scans | Enable builds to optionally complete faster by skipping some slow scans. |
- Prepare Workspace
- Clean workspace
- Check out
- Unit Tests
- Executes unit tests inside Docker container
- Builds artifacts inside Docker container (e.g.,
- Push artifacts to Nexus/S3
- Deploy (initial)
- Pull artifact from Nexus/S3
- Push to $target with version-suffixed domain/route
- Integration Tests
- If $skip_slow_scans, skip X, Y and/or Z
- Deploy (cutover)
- Point unsuffixed domain/route to newly deployed instance
- Terminate previous instance
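The important property of the stage list above is ordering: a failure anywhere must prevent everything after it, most importantly the cutover. Here is a minimal sketch of that gating logic. It assumes each stage is a callable returning True on success and carries an "is slow" flag, which is a simplification of my own; a real CI server would express this in its own job syntax.

```python
def run_pipeline(stages, skip_slow_scans=False):
    """Run (name, func, is_slow) stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, func, is_slow in stages:
        if is_slow and skip_slow_scans:
            continue  # hotfix path: skip anything flagged as slow
        if not func():
            break  # a failed stage blocks every later stage, including cutover
        completed.append(name)
    return completed


def example_stages(integration_ok):
    """Hypothetical stand-ins for the real stages; each returns success."""
    return [
        ("prepare-workspace", lambda: True, False),
        ("unit-tests", lambda: True, False),
        ("build", lambda: True, False),
        ("deploy-initial", lambda: True, False),
        ("slow-scan", lambda: True, True),
        ("integration-tests", lambda: integration_ok, False),
        ("deploy-cutover", lambda: True, False),
    ]
```

With run_pipeline(example_stages(False)), the run stops before deploy-cutover, so the previous instance keeps serving traffic.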
Build artifacts would ideally be pushed to some central repository that the CI server has read/write access to. This example just uses some S3 bucket, e.g.:
To accomplish the blue/green deployment, each project component requires at least two DNS entries:
1. Fully-qualified application hostname
This entry enables us to run integration tests on the incoming instance to make sure it actually works before tearing down the previous instance.
The CI server updates this DNS record during each build.
2. Short “alias” to the fully-qualified hostname
This entry is effectively a pointer to the latest deployed instance of a component (identified by its commit SHA-suffixed hostname), allowing for “zero downtime” deployments (in theory). It will always point to a running instance: either the previous instance (just before it gets torn down) or the incoming instance (once the alias record gets updated to point to it).
The CI server updates this DNS record during each build if and only if the integration tests for the incoming instance all pass.
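To make the two-record scheme concrete, here is a minimal sketch of the cutover decision. A plain dict stands in for the real DNS zone, and the hostname layout (component, short SHA, domain) is an assumption of mine for illustration rather than a required convention.

```python
def suffixed_hostname(component, commit_sha, domain):
    """Fully-qualified hostname for one deployed instance (hypothetical layout)."""
    return "{}-{}.{}".format(component, commit_sha[:7], domain)


def alias_hostname(component, domain):
    """Short alias that always points at the live instance."""
    return "{}.{}".format(component, domain)


def cutover(aliases, component, commit_sha, domain, tests_passed):
    """Repoint the alias at the incoming instance only if its tests passed.

    `aliases` is a plain dict standing in for the real DNS zone. Returns the
    hostname of the previous instance that is now safe to terminate, or None
    if there is nothing to terminate.
    """
    incoming = suffixed_hostname(component, commit_sha, domain)
    alias = alias_hostname(component, domain)
    previous = aliases.get(alias)
    if not tests_passed or previous == incoming:
        return None  # leave the alias pointing at the instance already serving
    aliases[alias] = incoming
    return previous
```

Because the alias is only rewritten after the integration tests pass, a bad build never receives traffic through the short name, and the previous instance is only reported for termination once the alias has moved.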
Given some project myproj that is to be deployed directly onto one or more EC2 instances:
Creates a deployment target environment (e.g.,
All infrastructure should be tagged for push-button wholesale teardown of everything inside the deployment target environment.
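A minimal sketch of how consistent tagging enables that wholesale teardown: every resource gets the same small set of tags at creation time, and teardown selects on exactly those tags. The tag keys ("project", "environment") and the resource dict shape are assumptions for illustration, a simplified stand-in for what a cloud inventory call would return.

```python
def environment_tags(project, target):
    """Tags applied to every resource in one deployment target environment."""
    return {"project": project, "environment": target}


def select_for_teardown(resources, project, target):
    """Return the IDs of resources whose tags match the target environment.

    `resources` is a list of {"id": ..., "tags": {...}} dicts.
    """
    wanted = environment_tags(project, target)
    return [
        r["id"]
        for r in resources
        if all(r.get("tags", {}).get(k) == v for k, v in wanted.items())
    ]
```

Untagged or differently tagged resources are never selected, so a teardown of one target environment cannot reach into another.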
Triggers and waits for each of the above builds and optionally flips some arbitrary switch at the end of it all (maybe set and push some git tags to the repos?).
Given the same set of parameters, the pipeline should be capable of deploying an exact replica on each subsequent run.
What’s the next step?
To support the lowest common denominator, this model assumes the deployment
target is not a PaaS, but a collection of EC2 instances that we’re installing
.rpms onto. But with things like
Fargate on the horizon, the
up-front complexity of deploying containers to production should drop
significantly in the future. As such, my next area of research is use cases
and design considerations for using Kubernetes, Mesos or some other container
orchestration platform in production.
Thanks to James, Marge, Patrick and Travis for providing feedback for this post.