Learning from GitLab: Making Production Obvious
We recently took some time to take stock of our practices and procedures following the outage at GitLab. The GitLab outage started when a developer made the easy mistake of entering a command on the wrong server. Unfortunately, that server was part of their production environment. If this could happen to a company like GitLab, could it happen to us? What could we do to avoid falling into the same trap?
It’s easy to have many terminal windows open and forget which one belongs to which environment. We wanted to address this problem with a simple solution that made it painfully obvious that a developer was about to run a command in production. At TrackJS, we use Ansible to automate our infrastructure configuration. To make these changes, we wrote some small Ansible snippets for both Linux and Windows.
A large part of the TrackJS infrastructure runs on Linux. To make it really clear we are SSH’ed to production, we made the command prompt red! In Ubuntu, this is accomplished by modifying the .bashrc file in the user’s home directory.
The .bashrc file builds a shell variable named PS1. This variable defines what the command prompt will look like. Any color-enabled console will understand ANSI escape codes, so we added them to our prompt. To make the change simple, we wrapped the existing PS1 value with color escapes. Wrapping the existing value also let us make the change at the end of the file, which proved useful once we automated the change with Ansible. The begin color flag for red is \e[0;31m and the end flag is \e[m; whatever is in between the two will be displayed in the specified color. Putting that together with the prompt variable gave us this at the end of .bashrc:
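The original snippet is not reproduced here, so what follows is only a minimal sketch of wrapping the existing prompt in the two escapes. The first assignment is a stand-in for whatever prompt the stock .bashrc has already built. The \[ and \] markers are an addition not mentioned in the text: they tell bash that the enclosed bytes are non-printing, which keeps line wrapping correct.

```shell
# Stand-in for the prompt that .bashrc built earlier in the file (assumption).
PS1='\u@\h:\w\$ '

# The appended line: red on (\e[0;31m), the existing prompt, then reset (\e[m).
# \[ ... \] marks each escape as non-printing so bash measures the prompt correctly.
PS1="\[\e[0;31m\]${PS1}\[\e[m\]"
```

Because the line only wraps whatever PS1 already holds, it does not need to know how the rest of the file built the prompt.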
After reloading bash, our prompt is displayed in red.
Next, we set this up in Ansible. All of our Linux playbooks reference a shared role named “linux-common”. This gives us a nice place to put configuration we want everywhere. We wanted all of production to have this behavior, so we put our new change there.
The usual approach to modifying a file is to copy the whole .bashrc into Ansible as a file or template. We did not want to manage the whole file if it was not necessary. Again, in the name of simplicity, we decided to manage only the part we cared about. To accomplish this, we used a module named blockinfile, which is new in Ansible 2. The module lets us put a block of text anywhere in an existing file. One of the built-in options places the block at the end of the file. This is quick, easy, and exactly what we want!
But wait, we run the same playbooks against both development and production servers! Lucky for us, all our servers have an Ansible variable named “env”. For production servers, this variable is set as env: prd. We used this variable to apply our change only where we wanted it.
When put together, we got a single task:
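The task itself is not reproduced here. As a rough sketch of what such a task in the “linux-common” role could look like, assuming a hypothetical user named deploy and a custom marker comment (both are illustrative, not taken from the original):

```yaml
# Sketch only; the path, marker text, and task name are assumptions.
- name: Make the production prompt red
  blockinfile:
    path: /home/deploy/.bashrc   # hypothetical user; Ansible releases before 2.3 call this option "dest"
    insertafter: EOF             # the default: place the block at the end of the file
    marker: "# {mark} ANSIBLE MANAGED BLOCK: production prompt"
    block: |
      PS1="\[\e[0;31m\]${PS1}\[\e[m\]"
  when: env == "prd"             # skip every server that is not production
```

The marker lines let blockinfile find and update its own block on later runs, so the task stays idempotent without Ansible managing the rest of the file.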
We use Windows for some parts of our application and occasionally connect to our Windows servers using Remote Desktop (RDP). We wanted to modify the Windows UI to make production just as obvious as in Linux. We could have made the background red, but the background can be hidden behind fullscreen windows. Instead, we made the Windows UI chrome red!
Windows Server disables a lot of the UI niceties, like customizable colors, that are found in desktop versions of Windows. Most of the registry settings still exist, but must be modified directly. We modified the ColorizationColor value under the key HKCU:\Software\Microsoft\Windows\DWM to tint the chrome in the Windows UI. The default value was c055c9ed in hex, which is an alpha channel (c0) plus a hex color (55c9ed). We changed the color to a blatant red, dd0000, and kept the same alpha value, which gave us c0dd0000. After we logged out and back in, the window chrome was red.
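The value arithmetic above can be checked from any shell, for example on the Ansible control machine. This is just an illustrative sketch; the variable names are made up for the example. It also prints the decimal form of the value, which is handy for tools that expect an integer.

```shell
# Compose the colorization value from an alpha byte and an RGB color.
alpha=c0        # alpha channel kept from the default value c055c9ed
color=dd0000    # the red chosen above
value_hex="${alpha}${color}"

# printf reads the 0x-prefixed string as a hexadecimal integer.
value_dec=$(printf '%d' "0x${value_hex}")

echo "hex: ${value_hex}  decimal: ${value_dec}"
# prints: hex: c0dd0000  decimal: 3235708928
```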
Next we automated it with Ansible. Our change was placed in a role named “windows-common” which served the same purpose as our “linux-common” role. We used Ansible’s built-in win_regedit module to modify the registry. Finally, we converted the key value from hex to decimal so that it was compatible with older Ansible versions. We used this along with our environment check to produce the task:
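The task is not reproduced here either. A hedged sketch of what it might look like follows, assuming the value is stored as a DWORD and using the option names from recent win_regedit releases; the task name is illustrative.

```yaml
# Sketch only; not the original task.
- name: Make the production window chrome red
  win_regedit:
    path: HKCU:\Software\Microsoft\Windows\DWM   # older releases call this option "key"
    name: ColorizationColor                      # older releases call this option "value"
    data: 3235708928                             # 0xc0dd0000 written in decimal
    type: dword                                  # assumption; older releases call this "datatype"
  when: env == "prd"                             # skip every server that is not production
```

Since the key lives under HKCU, the change applies to the account the task runs as, so it would need to run as the user who logs in over RDP.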
One Less Foot-Gun
Now we have more red than we know what to do with! With these one-task Ansible snippets, it is very obvious whether we are dealing with a production or a test server, and one small avenue to disaster is harder to travel down. Hopefully, you can put this to good use too.
There is a lot more to learn from GitLab and we’re looking at more ways to make TrackJS even more stable and reliable. As we find ways, we’ll be sure to share them with you.