There is a lot of information about CI/CD in the devops world, but surprisingly little about the specifics of actually deploying to production. I won't give the traditional summary of devops because you can find one here. Instead I'll dive straight into the problems I'm trying to solve with production deployments, and my solutions.

When deploying to any kind of environment you do not just have some code to update. You have a whole ecosystem that has to be consistent. For example, you may be deploying a piece of code that needs to talk to a database. Is the database there? Are you sure? Do you need to go look? It isn't just a database; there might be a queue, some kind of network infrastructure and a dozen other things. Did you check them all? And what about the other pieces of code that are already running? Are they going to work with this new code?

In a test system you can deploy and hope. Fix any problems. Repeat.

But not in a production system. You really have to know. And you have to be able to undo what you just did, in case it all goes wrong.

Those are the kinds of issues I'm thinking about with the system I'm building. It's using GKE with Postgres databases, Pubsubs and a few other things. The different services are sensibly independent: all they know about each other is the APIs they use. But those might need to change, of course. I might want to introduce new services, and maybe new databases and pubsubs.

When I look for information about how to manage updating this I keep finding helpful stories showing me just how to update a single service. This is useful, but it doesn't address the problems I mentioned above. So here is my solution:

It starts with Terraform. Using Terraform I can define my Kubernetes cluster, my databases and my pubsubs. I also use a Cloud Function and a few other things. Helm is often used in a similar way, but Helm only manages things defined inside Kubernetes; as far as I can tell it won't create a GCP Pubsub. So Terraform is my choice.

Terraform is good at comparing what is currently deployed with what the newly changed script specifies. So I can change my Terraform definition and just apply it. Terraform works out what actually needs to be done and does it. Great. So if I want a new database or a new pubsub or an updated Cloud Function that just works.
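
To give a flavour, one of those Postgres databases is just a resource block along these lines (the names and sizes here are placeholders rather than my real script):

resource "google_sql_database_instance" "billrush" {
  name             = "test-dev-billrush-db"
  database_version = "POSTGRES_11"
  region           = "australia-southeast1"

  settings {
    tier = "db-f1-micro"
  }
}

Add a block like that, run terraform apply, and only the new database is created; everything already deployed is left alone.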

It can be parameterised too. So I can define a set of variables the script needs, and I can define a set of values needed for each environment. For my test-dev environment the parameters look like this:

project = "bill-rush-engineering"
region = "australia-southeast1"
environment = "test-dev"
environment-dns = "engineering.billrush.co.nz"
dbpassword = "mypassword"
test-mode = "true"
service-type = "ClusterIP"
deploymentType = "Recreate"

billrush_machine_type = "n1-standard-2"
billrush_min_node_count = 6
billrush_max_node_count = 9

So I have different domain names for each environment, and I can adjust things like the number of nodes and the machine size per environment. I don't need a production-scale number of nodes in my test environment, for example, and I can set a test mode for some services.
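
On the Terraform side those values are just declared as variables and fed into the resources. Roughly (the resource names are my guesses and the exact arguments vary between provider versions), the node pool looks like:

variable "billrush_machine_type" {}
variable "billrush_min_node_count" {}
variable "billrush_max_node_count" {}

resource "google_container_node_pool" "billrush" {
  name     = "${var.environment}-billrush-pool"
  cluster  = "${var.environment}-billrush"
  location = var.region

  autoscaling {
    min_node_count = var.billrush_min_node_count
    max_node_count = var.billrush_max_node_count
  }

  node_config {
    machine_type = var.billrush_machine_type
  }
}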

But there are some downsides to Terraform. It doesn't support a Kubernetes CustomResourceDefinition and a few other things, or not yet anyway. And there is a synchronisation problem: after creating the cluster it goes on to deploy services too soon, before the cluster is quite ready, and the script fails. I can run it again and, as I said above, Terraform figures out what still needs to be added and carries on. But it could be better.

For these reasons I decided to keep the definitions of the Kubernetes objects outside of Terraform. For the Kubernetes objects I just use kubectl apply commands with yaml files. Though it is a bit more interesting than that. The yaml files, like the Terraform script, need to be parameterised. Now, most people do this with Helm and I already mentioned Helm in this context. But it seemed like overkill for my environment. So I went a different way.

My scripts run in CloudBuild, GCP's build environment, which provides me with a way to specify a container and a way to launch a bash script in that container. CloudBuild has lots of other options, but this is the one I needed.
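
The cloudbuild.yaml for that is tiny. This is only a sketch, the image and script names are my own stand-ins, and the real container needs terraform, gcloud, kubectl and envsubst baked into it:

steps:
  - name: 'gcr.io/$PROJECT_ID/deployer'   # the container to run this build step in
    entrypoint: 'bash'                    # launch a bash script inside that container
    args: ['./deploy.sh']
    env:
      - 'BRANCH_NAME=$BRANCH_NAME'        # CloudBuild's built-in branch substitution
timeout: '1800s'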

So my bash script looks a bit like this:

# apply the Terraform definition for this environment
terraform apply -auto-approve -var-file=$BRANCH_NAME.tfvars
# turn the Terraform outputs into "export VARIABLE=value" lines
terraform output | sed -e 's/ = /=/g' | sed -e 's/^/export /g' > /workspace/tf_vars
...
# read the exported variables into this shell
. /workspace/tf_vars
# point kubectl at the cluster for this environment
gcloud container clusters get-credentials $BRANCH_NAME-billrush --zone $ZONE
...
# substitute variables into each deployment yaml and apply it
for i in ./*-deployment.yaml; do
 echo "processing ${i}"
 envsubst < $i | kubectl apply -f -
done
# block until every deployment reports available
kubectl wait --for condition=available --timeout=1000s --all deployments

I've simplified things here to highlight the important bits.

Notice I refer to $BRANCH_NAME a couple of times: once for the tfvars file and once in the cluster name. $BRANCH_NAME is actually the environment name. I have a branch per environment and I keep the things specific to that environment, such as the tfvars file, in the relevant branch. Common files are in master and I git merge changes from master into the branches (rather than the other way around). A git push triggers the CloudBuild script.
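
So rolling a shared change out to an environment is just ordinary git, a sketch of which (using the branch names that appear later) is:

git checkout master
# edit the common Terraform script / yaml files and commit
git checkout test-qa
git merge master     # bring the common changes into the environment branch
# edit test-qa.tfvars if this environment needs its own values, and commit
git push             # the push triggers the CloudBuild script for test-qa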

The Terraform script is configured to output selected variables, and you can use terraform output on its own to just print those variables without applying anything. A little bit of sed turns each line into export VARIABLE=value, which is read back in a couple of lines further down. So all my Terraform output variables are now exported environment variables.
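
For example (the output names and values here are invented, and newer Terraform versions quote the values), the transformation is simply:

$ terraform output
ZONE = australia-southeast1-a
ROPETRICK_VERSION = latest

$ cat /workspace/tf_vars
export ZONE=australia-southeast1-a
export ROPETRICK_VERSION=latest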

There's a wait for the cluster to be ready (not shown, but it's just a loop with a sleep and a test), then I use gcloud to get the credentials for the cluster I just created (and waited for). This means I can now run kubectl apply against that cluster.
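
A minimal sketch of that wait loop, polling the status GKE reports (an illustration, not my exact code):

# poll until GKE reports the cluster as RUNNING
until [ "$(gcloud container clusters describe $BRANCH_NAME-billrush --zone $ZONE --format='value(status)')" = "RUNNING" ]; do
  echo "waiting for cluster..."
  sleep 10
done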

I have a few other yaml files, but most of them are of the form SERVICENAME-deployment.yaml, so I loop through those and run

envsubst < $i | kubectl apply -f -

envsubst finds environment variable references in the yaml file and substitutes their values, so all my Terraform outputs are available for these substitutions. The result is piped into kubectl apply. Finally I use kubectl wait to wait for all the deployments to become available.

Something that is not yet obvious from this is that my deployment files always specify the image version as a variable, e.g.

image: gcr.io/bill-rush-engineering/billrush/ropetrick:$ROPETRICK_VERSION

The default value for ROPETRICK_VERSION is 'latest', specified in Terraform, but I can set an explicit value for it in my tfvars file. In production or otherwise critical environments I set an explicit version; in test environments I'm usually happy to use the latest. Images, of course, can have more than one tag, so an image can be both 1.2.3-RELEASE and latest at the same time.
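
Put together, a deployment file is roughly this shape before envsubst runs over it (apart from ROPETRICK_VERSION the variable names are my guesses at what the Terraform outputs might be called):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ropetrick
spec:
  replicas: 2
  strategy:
    type: $DEPLOYMENT_TYPE        # Recreate or RollingUpdate, set per environment
  selector:
    matchLabels:
      app: ropetrick
  template:
    metadata:
      labels:
        app: ropetrick
    spec:
      containers:
        - name: ropetrick
          image: gcr.io/bill-rush-engineering/billrush/ropetrick:$ROPETRICK_VERSION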

Let's look at a concrete scenario. I want to update the version of the ropetrick service in the environment called test-qa. This environment is not actually production, but it needs to be as close to production as possible. I pull the repository, switch to the test-qa branch and edit the tfvars file to change the ropetrick version. Then I git push. CloudBuild is triggered to run the whole script, but Terraform detects that nothing it is interested in has changed. It outputs the variables and kubectl does its apply commands. One of those is different, so the ropetrick service (and only the ropetrick service) is updated. I specify the deployment type in the yaml file, which controls just how the deployment is done. In test-qa I have it set to RollingUpdate, which means Kubernetes will update each instance of ropetrick one at a time, so there will always be at least one running.

Now, say during testing in the test-qa environment it becomes apparent that we have a couple of problems. This new version of ropetrick needs a pubsub defined, and it makes a call to a new service called wickerman. We completely forgot about those. Okay, so we edit the Terraform script in master to define the pubsub and the new service deployment. We'll want to put an explicit version number for that service in the tfvars file for test-qa. Then we git merge master into test-qa and git push again. This time ropetrick will not be touched because it is already updated, but the pubsub and the wickerman service will be deployed.
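
The Terraform side of that change is small. The pubsub is just another resource block and the wickerman version gets a defaulted variable, something like this (names invented):

resource "google_pubsub_topic" "ropetrick_events" {
  name = "${var.environment}-ropetrick-events"
}

variable "wickerman_version" {
  default = "latest"    # overridden with an explicit version in the test-qa tfvars
}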

If that all checks out, i.e. test-qa is working just fine, then we can go over to the production branch and do a similar git merge and git push. We will have to check the tfvars file in production to ensure the ropetrick version is the same, or we could make production a branch of test-qa and just git merge from test-qa into production. We know the same scripts will run and that they will run successfully and consistently. We also know we have a good record of exactly what changes were made to each environment.

What if it all worked in test-qa but, in spite of our precautions, went bad in production? Well, this is never great, but we can get the system back to how it was fairly simply. CloudBuild keeps a record of each trigger with sensible labels. We can see what was applied and when, and we can retry an earlier trigger, which will use the commit it originally used. So that is what we do in this case: go to the last working trigger and rerun it. There might be problems with lost data or similar, but we will be back to a working system.

This is still a work in progress for me. I haven't taken it live and I'm interested in feedback. I still need a better way to manage secrets. There are several passwords that do not matter much in test systems but really, really do matter in production. Ideally they aren't stored anywhere too easily accessible, such as the git repository. I use Bitbucket and I can protect specific branches so that a new-hire programmer can't push to them, and maybe can't even view them, by mistake. Even so, storing passwords there doesn't feel right. I also still have a problem with the script sometimes failing during the kubectl apply commands, even though I know the cluster is running. But I'll figure that out, and meanwhile a rerun always works.
