Thoughts on Stage Zero Infrastructure
Some thoughts on “stage zero” infrastructure deployments, i.e. those which bootstrap or kickstart [1] further infrastructure, from someone who’s done it far too many times.
I will attempt to keep my comments agnostic of any particular technology, but I can’t promise a lack of bias in my examples.
Disclaimer: Getting these concepts straight can be hard, and while I speak from a position of experience, I still learn new approaches and think of new ways to do things, each and every time.
The Scenario
You’ve been asked to build, or simply want to build, the infrastructure for a new, greenfield project. Unlike other projects that you have built previously, this one isn’t permitted to use any of the existing infrastructure, or the infrastructure simply doesn’t exist yet, and that’s exactly what you’re putting together.
The infrastructure must be able to provision further core infrastructure deployments in an automated fashion (the business infrastructure that will host application deployments), and it must be possible to rebuild it [2] in a disaster scenario.
Considerations
Thinking things through, you realise that there are a few factors you want to take into account when building your infrastructure.
- You want simple deployments, for easy debugging.
- You want to protect your secrets, because security is important.
- You want to be able to upgrade your stage zero environment.
- You want an idempotent deployment.
- You want your design and decisions documented [3].
- You want zero “reliance” on your infrastructure by child infrastructure.
Simplicity
You don’t want to over-engineer your stage zero environment, because it needs to do nothing other than act as the progenitor for your subsequent environments.
It might be tempting to “solve for X” whilst building your stage zero environment, but this usually leads to the trap of building key infrastructure into your stage zero environment, on the incorrect assumption that you need more than the bare minimum.
Your stage zero environment needs to be as minimalist as it can be. Functional, without overreach.
Some Mistakes
It is this author’s opinion that your stage zero environment is highly unlikely to require dedicated versions of the following:
- DNS Resolvers or Nameservers
- NTP Servers
- Authentication or RBAC Solutions
- VPN Servers [4]
- HashiCorp Vault deployments…
Security
One thing you absolutely do want to account for is security. Your environment might be small and perform only one particular function, but that doesn’t mean you don’t want to secure it.
Secrets
Consider creating named accounts for a set of trusted users as part of your deployment, and ensure that your company off-boarding policy accounts for removing these static users when they leave the business. You should also consider whether a particular job role or title warrants inclusion on this static user list, and account for those new hires or leavers.
Decide whether you can effectively secure your processes and systems with something like a signed and trusted GPG key, and a derived SSH key, from each privileged user.
When deploying infrastructure, it may be necessary to store static secrets within your deployment method. One example could be Ansible, in which case you should consider a mechanism like Ansible Vault, or perhaps GPG encrypting your secret values and requiring one of your privileged users’ private keys for decryption and updating of values.
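As a rough illustration of both approaches, here is a minimal, hedged sketch; the file names, variable names, and recipient addresses are all placeholders for whatever your own repository and privileged users actually use.

```bash
# Two illustrative ways to keep static secrets out of plain text in a
# stage zero repository. Names and recipients are placeholders.

# Option 1: Ansible Vault. Encrypt a single value so the playbook can use
# it as a normal variable...
ansible-vault encrypt_string 'S3cr3t!' --name 'bootstrap_db_password'

# ...or encrypt a whole variables file, prompting for the vault password.
ansible-vault encrypt group_vars/stage_zero/secrets.yml

# Option 2: GPG. Encrypt the secrets file to each privileged user's public
# key, so any one of their private keys can decrypt and update it.
gpg --encrypt \
    --recipient alice@example.com \
    --recipient bob@example.com \
    --output secrets.yml.gpg secrets.yml

# Decrypt at deployment time (requires a privileged user's private key).
gpg --decrypt --output secrets.yml secrets.yml.gpg
```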
Alerting
You want to know if your stage zero environment has been compromised (assuming it’s a persistent installation) or if there’s something unusual going on with the infrastructure.
At a base level, you can set up email alerting whenever users log into the infrastructure in any way; there’s usually a mechanism to do this without external involvement, though you could also use an off-the-shelf email provider without too much added complexity (and a company credit card where required).
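One way to do this without external involvement, as a minimal sketch, is a pam_exec hook that sends mail on each SSH session. It assumes a Linux host with a working local mail command; the script path and recipient address are illustrative.

```bash
#!/usr/bin/env bash
# Sketch of a login alert hook, called by pam_exec on session open.
# Install as /usr/local/bin/login-alert.sh and reference it from
# /etc/pam.d/sshd with a line such as:
#   session optional pam_exec.so /usr/local/bin/login-alert.sh
# PAM_TYPE, PAM_USER and PAM_RHOST are provided to the script by pam_exec.

if [ "$PAM_TYPE" = "open_session" ]; then
    printf 'User %s logged into %s from %s at %s\n' \
        "$PAM_USER" "$(hostname)" "${PAM_RHOST:-local}" "$(date --utc)" \
        | mail -s "stage zero login: ${PAM_USER}" ops-alerts@example.com
fi

exit 0
```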
If building in a public cloud environment, consider setting up (and documenting) some of the bundled monitoring and alerting features, but we don’t want to focus on those elements in this post, because they add complexity and aren’t guaranteed to be available for on-prem or private cloud deployments.
Restrictions
Perhaps you can set up firewalls on your infrastructure to ensure connectivity is only possible from approved IP addresses. If so, it might be a reasonable option, but you want to ensure you have documented the scenarios which might cause connectivity issues, such as a change of agreement with your ISP.
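As a minimal sketch of what that restriction might look like on an Ubuntu-style host using ufw; the source ranges here are documentation addresses standing in for whatever your approved office or VPN ranges actually are.

```bash
# Deny everything inbound by default, then allow SSH only from approved
# source ranges. The CIDR ranges below are illustrative placeholders.
ufw default deny incoming
ufw default allow outgoing

ufw allow from 203.0.113.0/24 to any port 22 proto tcp
ufw allow from 198.51.100.0/24 to any port 22 proto tcp

ufw enable
ufw status verbose
```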
As ever though, firewall restrictions and white-lists aren’t a good substitute for keeping your environment up to date, and having good policies in place for zero-trust.
Upgrades
A frequent problem I have observed with stage zero environments is that they quickly grow “stale” or become out of date, both in terms of the mechanisms used to build these environments and the software versions used.
Upgrade Policies
Ensure you have a policy written down for upgrading your stage zero environment when new releases are made; this also ties into the idea of keeping your environment “simple” so that you don’t have too many of these to track.
Consider using LTS software and operating systems where you can, to minimise the chances of having to update software frequently, whilst having a good degree of confidence in its longevity.
If your environment is persistent, consider spinning up a second and testing upgrades against that environment, before upgrading your primary one.
Always take a backup of key data before attempting an upgrade.
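What that backup looks like depends entirely on what your environment holds, but as a rough sketch (with illustrative paths and an assumed backup host), it can be as simple as:

```bash
# Archive the key stage zero data before an upgrade, then copy it off-host.
# Paths, the backup directory, and the destination host are placeholders.
backup_dir="/var/backups/stage-zero"
stamp="$(date --utc +%Y%m%dT%H%M%SZ)"

mkdir -p "${backup_dir}"

tar --create --gzip \
    --file "${backup_dir}/stage-zero-${stamp}.tar.gz" \
    /opt/stage-zero/ansible \
    /opt/stage-zero/terraform.tfstate \
    /etc/stage-zero

# Keep a copy somewhere that survives the loss of this host.
scp "${backup_dir}/stage-zero-${stamp}.tar.gz" backups@backup-host:/srv/backups/
```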
Periodic Rebuilds
It’s good practice to have a policy in place, and a practice internally, which assesses and attempts to build your stage zero environment periodically. Mechanisms for doing deployments can go out of date, and it’s typically a good idea to test new versions against your code, so that you can fix problems as they occur, and not the day before you’re due to build a new greenfield environment [5].
Update Notifications
Know what software is installed. I’ve already mentioned this once in the scope of simplicity, but it rings true for security too. You want to be subscribed to mailing lists, have access to CVE boards, and sign up for notification emails where appropriate. Your stage zero environment is still important, and the last thing you want to do is deploy insecure code into an environment, or leave it languishing until it gets compromised.
Write down your policies around this, discuss them with your security team (if you’re in the privileged position of having one) and ensure that you periodically review what you decide.
Idempotency
Assume for a moment, that your stage zero deployment consists of a set of
scripts, or a makefile which has the potential to be interrupted when being
run, perhaps by the cat unplugging your computer, or your internet dipping out.
This is a very real scenario, especially if your deployment runbook involves running things from a local machine in order to bootstrap automation further into the infrastructure.
It is up to you, as the engineer performing the work in the stage zero design,
to decide how you want to tackle this issue. You may consider the simplicity of
bash a good reason to write everything in a script of this nature, which means
you have to add a degree of extra checking into your logic, to ensure your
infrastructure doesn’t get built twice, or a box gets overwritten.
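As a minimal sketch of that extra checking, using OpenStack purely as an example target (the resource names, image, and flavour are all illustrative), the pattern is simply: check before you create.

```bash
#!/usr/bin/env bash
# Guard-style checks so that re-running after an interruption doesn't
# rebuild or overwrite anything. Resource names are placeholders.
set -euo pipefail

create_network() {
    if openstack network show stage-zero-net >/dev/null 2>&1; then
        echo "network stage-zero-net already exists, skipping"
    else
        openstack network create stage-zero-net
    fi
}

create_bootstrap_vm() {
    if openstack server show stage-zero-01 >/dev/null 2>&1; then
        echo "server stage-zero-01 already exists, skipping"
    else
        openstack server create \
            --image ubuntu-22.04 \
            --flavor m1.small \
            --network stage-zero-net \
            stage-zero-01
    fi
}

create_network
create_bootstrap_vm
```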
Tools exist which have idempotency as a core principle, such as Ansible, which makes it a good choice for these sorts of deployments [6]. With Ansible’s native functionality, you can re-run the same Role or Playbook and be relatively confident that you’re not going to break anything; it also makes changes to the infrastructure simpler.
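A minimal sketch of leaning on that property (the playbook and inventory names are illustrative): dry-run first, apply, then re-run and expect no reported changes.

```bash
# Dry run: report what would change without touching anything.
ansible-playbook -i inventory/stage-zero.ini stage-zero.yml --check --diff

# Real run.
ansible-playbook -i inventory/stage-zero.ini stage-zero.yml

# A second run should report changed=0 if the playbook is truly idempotent.
ansible-playbook -i inventory/stage-zero.ini stage-zero.yml
```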
Documentation
Whilst I don’t enjoy the sound of a needle skipping over the tracks of a record, I have grown quite accustomed to hearing myself bang on about documentation over the years. As an industry, I think we’re getting better about writing it, even if we’re not at the point where the content is quite there yet. I have a lot of respect for people whose primary job is documentation writing, and a good writer should be able to get the point across without blurring the complexity beyond recognition.
Recording Decisions
It. Is. Important. that you write down your decisions, and why you made them (look at Architectural Decision Records for inspiration). This is so that future engineers in your situation don’t think they have to: A. tear down your infrastructure and start again, because they don’t understand it, or B. spend hours reverse engineering your decisions in a crisis scenario. There’s also a lot of C, which is the scenario in which you’ve slept, and forgotten why you made that seemingly asinine decision, five years ago, on a Tuesday, in May, at 16:52.
Guiding
So far, we’ve covered a few scenarios which specifically require documented guides: how you might go about performing an update to a piece of supporting software; how and when you might check over your deployments or attempt a trial deployment; how you go about off-boarding and on-boarding privileged users.
All these things need to be written down, if only so you can satisfy your stakeholders, any auditors, and yourself that a process exists, that it is a known quantity, and that you can stick to it.
If you do things right, and keep things simple, there’ll never be a problem and you won’t be blamed when the problem doesn’t happen.
Separation
As stated in the considerations section, you really, really want to avoid making your stage zero deployment a dependency of anything. It should act as the catalyst, the progenitor, the alpha and the omega of your infrastructure, but in the event that it goes down for whatever reason, nothing else should be impacted by the outage and you should have a leisurely time restoring your stage zero build.
With this in mind, consider testing your environment once you’ve enabled the subsequent build jobs for core infrastructure. You’ve built your stage zero environment, the Jenkins jobs have triggered and built your master Kubernetes cluster, and you’re happy that other environments are spinning up in the metaphysical air around your head. You can now start shutting things down, and making sure nothing else breaks.
Some of this theory crosses into the different types of stage zero environment, covered in the next section, but as a general rule, even those environments which you keep up all the time (accounting for any power or usage costs) shouldn’t be able to create a business impact if they break.
Approaches
Phew, so that’s a good list of considerations to take on board, and hopefully has you thinking about your stage zero environment a little bit. Let’s take a short break from the considerations of how to build, and talk about what to build.
The Persistent Environment
The first environment we’re going to discuss is arguably the simplest. This environment is quite static, only really gets touched when it’s called upon to build other “core” environments, and typically comprises a few VMs or physical machines, which tick away to themselves quietly and burn resource.
It’s not outside the realm of reason to have this environment exist but be physically off, meaning a DC engineer or someone with root access to your cloud account has to power these machines on if changes need to be made. Similarly, your automation scripts could have them perform their actions for bootstrapping core environments, and then power them off when a certain condition is met. Anything is possible with your imagination!
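As a minimal sketch of that power-cycling idea for an AWS-hosted example (the instance ID and region are placeholders, and the core build step is elided), it can be as simple as wrapping the build in start and stop calls:

```bash
# Power the dormant stage zero instance on, do the work, power it off again.
instance_id="i-0123456789abcdef0"   # placeholder
region="eu-west-1"                  # placeholder

aws ec2 start-instances --instance-ids "${instance_id}" --region "${region}"
aws ec2 wait instance-running --instance-ids "${instance_id}" --region "${region}"

# ...trigger the core infrastructure bootstrap here...

aws ec2 stop-instances --instance-ids "${instance_id}" --region "${region}"
```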
The crux of this environment is that it just is, and will continue to be just is until it isn’t anymore.
I’ve built more of these environments than any other, and they’re probably the most common one that you’ll see in companies which choose to go with stage zero environments in the first place.
The Transferred Environment
This type of environment is for the brave, or perhaps the foolish, depending on your level of reverence. This environment is built locally, say in a VM or a series of containers on your machine, and it initialises itself to the point of being able to provision a “project” in your target space [7].
The initialising VM then composes the persistent infrastructure for your stage zero environment, and transfers its innards into that environment’s VMs, leaving itself a husk.
This is tricky to get right, and is fraught with difficulty. It has the advantage of being somewhat agnostic, meaning you can work on the environment and ensure it can talk to what it needs to locally, before moving it into its permanent home, but it does add an additional layer of complexity, arguably for little gain.
I have seen it used to great effect.
Upgrades and updates to these environments typically still require some form of local script run, making them subtly different to…
The Self-aware Environment
These environments are self obsessed, and typically in control of their own faculties, from within. They can be hell for circular dependencies, and be frequently misconfigured with the illusion of being configured correctly, leading to weird and wonderful scenarios when your automation removes a key portion of its brain, or rips out a library from underneath itself that it’s using to upgrade its libraries.
They can work, but they typically involve complex and redundant scenarios. Kubernetes has made these types of deployment a little more possible, because Pods can be rescheduled if their underlying node vanishes for whatever reason, but you effectively introduce known breakages which are auto-recoverable.
They have the advantage of being easier to manage in terms of upgrades, and don’t require additional laptop-driven scripts to perform ad-hoc maintenance, but they do spoil the fabled simplicity, which I’ve espoused throughout this blog.
If you attempt one of these self-aware builds, then heed the warning of dragons above the door, but good luck, brave traveller.
Tooling
Lastly, let’s talk a bit about some of the tooling that you might consider for such an endeavour, and I will share some pitfalls that I’ve come across in my own experiences.
Ansible
A good choice for idempotent builds, for the reasons listed in this post, but some consideration should be given to the fact that Ansible isn’t overly good at building infrastructure, and while it can, the quality of the Ansible modules used to build against the different environments and APIs can vary wildly.
Ansible is slow; there’s no getting away from it, and while you can speed it up a little with various tips and tricks, you’re still looking at several minutes to an hour for an environment that can serve your needs. If this is acceptable to you for the advantages that it brings (simple syntax, idempotency, Ansible Vault), then you should at least be aware of the speed impact.
There’s also the fact that Ansible changes rapidly, and you have to keep on top of your Ansible Roles and Playbooks to ensure they work with modern versions of the tool. There’s a philosophical debate somewhere here about pinning against certain versions, and that’s certainly an option, but if you want longevity and confidence, you need to keep on top of your versions [8].
Terraform
Can be okay for building infrastructure, but do bear in mind the stateful requirements of Terraform itself. You need to put policies and guides in place to ensure two people don’t try to run their Terraform at the same time, with an unprotected state, and clobber your infrastructure.
You should also ensure that you actually have somewhere to store your state. In the case of AWS builds, this might be as simple as having a runbook in place to manually create S3 storage, which you then import into Terraform [9] after creation. You might even consider paying a nominal amount to AWS to use their S3 storage for your build elsewhere, though this does create a dependency, and isn’t really “stage zero” as you couldn’t rebuild your infrastructure when Bezos inevitably collapses into a supermassive black hole of money.
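A minimal sketch of that manual-bucket-then-import runbook step, assuming an AWS build; the bucket name, region, and the aws_s3_bucket.tfstate resource address are all illustrative, and state locking (for example via DynamoDB) is left out for brevity.

```bash
bucket="example-stage-zero-tfstate"   # placeholder
region="eu-west-1"                    # placeholder

# One-off manual creation of the bucket that will hold Terraform state,
# with versioning so a clobbered state file can be recovered.
aws s3api create-bucket \
    --bucket "${bucket}" \
    --region "${region}" \
    --create-bucket-configuration LocationConstraint="${region}"

aws s3api put-bucket-versioning \
    --bucket "${bucket}" \
    --versioning-configuration Status=Enabled

# Point Terraform at the bucket as its remote backend...
terraform init \
    -backend-config="bucket=${bucket}" \
    -backend-config="key=stage-zero/terraform.tfstate" \
    -backend-config="region=${region}"

# ...and, if you also manage the bucket itself in code, import it afterwards.
terraform import aws_s3_bucket.tfstate "${bucket}"
```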
Bash, Make, scripts in general
Arguably a staple, and not something which is going to disappear anytime soon.
Building a series of bash scripts to provision your infrastructure, with the
appropriate checks and balances in place, could be appealing for a lot of
people, especially those who want the peace of mind that their tooling isn’t
going to go out-of-vogue anytime soon.
Can be more complex to write initially, and arguably has a steeper learning curve than Ansible to get right, but there’s nothing wrong with the well-trodden realms.
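Those “checks and balances” are mostly boring scaffolding; as a minimal sketch (the preflight checks and commented-out steps are placeholders for whatever your build actually does):

```bash
#!/usr/bin/env bash
# Scaffolding for a script-driven stage zero build: strict mode, timestamped
# logging, and a clear failure message so an interrupted run can be resumed.
set -euo pipefail

log() { printf '%s %s\n' "$(date --utc +%FT%TZ)" "$*"; }

trap 'log "FAILED at line ${LINENO}; fix the issue and re-run"' ERR

preflight() {
    # Fail early if required tooling is missing.
    command -v terraform >/dev/null || { log "terraform not found"; exit 1; }
    command -v ansible-playbook >/dev/null || { log "ansible not found"; exit 1; }
}

log "starting stage zero build"
preflight
# build_network
# build_bootstrap_host
# trigger_core_infrastructure
log "stage zero build complete"
```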
Jenkins
I’ve seen core infrastructure bootstrapped from a single, big Jenkins box, usually with a bunch of bundled and ready-made pipelines which are loaded into the instance automatically. Again, this adds complexity, especially if you choose to write your declarations in Groovy, and it has the versioning problem once more.
Still, if you’re more developer-minded, and have more than a passing familiarity with Jenkins, this could be the choice you make, and there’s nothing wrong with that if it works for you, and you’re confident that it will work for whoever comes after you.
Summary
If you’ve made it to here, well done, I know this article didn’t have much in the way of practical guidance, but that’s because every build is different, and every requirement is unique. You’ve got company policies, spending budgets, and personal preferences to contend with, so this advice was written with those considerations in mind.
Try it for yourself, see what you can accomplish with a blank slate, and some
common sense. Document your decisions, write your blog posts, and boast about it
on the internet. We will make mistakes, we will learn from them, we will
improve, and maybe at some point in the future, you will write a book on what
you’ve done to build a series of environments, one on top of the other, with the
fulcrum of a bash script that you run from your 2008 era netbook.
If you take anything away from this, remember that simplicity is key. Stage zero environments are great for decoupling your dependencies, and giving you confidence in running your core environments; they provide the mental assurance that you have a path to rebuild everything, if there is a problem, and they’re probably one of the more interesting thought-experiments you’ll ever put into practical use.
Or… you know… just click through the AWS console a bit, winging it every time.
1. When faced with a cavalcade of differing terminology for the same thing, pick the one you like and stick with it, other than on those occasional “wild” days where you change your mind and your documentation becomes inconsistent. ↩︎
2. That doesn’t necessarily mean the rebuild itself needs to be automated, provided you’ve supplied some form of runbook and steps to follow. ↩︎
3. I forget things, you forget things, and nobody wants to get dragged out of bed at three in the morning needlessly. Documentation is important for these reasons, and more, which probably warrant their own post. ↩︎
4. This is a particularly weird one. ↩︎
5. Or the worst-case scenario, you’re told to rebuild everything because of a fundamental issue that can’t be resolved. Nobody wants to be seeking out old versions of Ansible at four in the morning, just because your stage zero code hasn’t been updated for six years. ↩︎
6. And I want to make it very, super, crystal clear that I don’t believe Ansible is a good tool for deployments any larger than this. ↩︎
7. I’m saying this a lot, because I’m trying to be agnostic, but I have seen stage zero environments in AWS, OpenStack, and on physical machines in data centres, to a greater or lesser degree of success. The project here might simply be your DC provider’s API for requesting physical tin. ↩︎
8. The same is true of other Ansible Roles, outside of infrastructure… an oft-forgotten fact. ↩︎
9. Iffy though, and not a decision to be made lightly. State as a source of truth is nice, but do you only want your state to know the status of your infrastructure? ↩︎