
Terraforming Azure Databricks - Part I

For the last few weeks we have been busy trying to make Azure habitable using Terraform. Let's see if we were able to plant anything.

Introduction to Terraform

There are many tools to choose from if you want to enter the DevOps game. If you browse the net long enough, you will most certainly encounter things like Chef, Puppet, or Ansible. Most of those technologies are well established at this point and you can find many comparison charts out there. Still, it's sometimes hard to choose when all the apples look red and delicious. In this article, we want to take a look at something that has become pretty much the default option among provisioning tools - Terraform. Our target? The exotic cloud jungle that has taken the world by storm - Azure.

Before we try to create a habitat for ourselves, we need to ask a serious question - what do we need? Let's assume that we want to be able to run a PySpark Notebook application. The app itself will change a lot, because it will be constantly improved by our Not Exactly Cheap Data Science Team, so we want to maintain a reliable deployment process. What would we need to achieve that? Most likely a way to provision the PySpark cluster, a bit of persistent storage, and, just to be sure, at least a basic method to secure all of this. In terms of the Azure world, we will use Azure Storage, Azure Databricks, and a mix of User and Service Principals. We could certainly use some automation over here. Since the Azure Platform does not support any meta-language to do this, we will borrow some provisioning powers from Terraform. Now, another important question - what exactly is Terraform?

As this excerpt from the producer's webpage summarizes:

Terraform allows infrastructure to be expressed as code in a simple, human readable language called HCL (HashiCorp Configuration Language). It reads configuration files and provides an execution plan of changes, which can be reviewed for safety and then applied and provisioned.

What this means is that Terraform is a declarative DSL (domain-specific language) that allows us to specify what we want to get, without mentioning how to achieve it. How is this possible? Well, mostly thanks to the power of the FOSS community and the widespread adoption of the technology itself. For every piece of software that wants to be manageable by Terraform, a provider and a set of modules needs to be written. Those are created not by the end-user, but by the developers of the original product (or the community around it). Since so many things can already be integrated using Terraform, it's in the best interest of the producers of The Next Big Thing to provide this integration.

Pros of this solution?

  • very easy to start with - the configuration syntax (HCL) is simple and readable
  • pretty good documentation
  • wide adoption means that for most use cases examples float around the internet
  • a lot of existing technologies available (1256 providers and 6409 modules as of today)
  • can help automate a lot of menial work
  • everything depends on the module developers - you don’t have to care
  • you can bundle complex configurations into your own custom modules and share those

Cons?

  • if there is no provider available, then the only thing you can do is cry
  • exotic modules and use cases might be problematic
  • everything depends on the module developers - if some use case is not covered, then it’s not covered, period

First setup

For the purpose of this article, we will use Terraform 0.14.9 and assume that the reader is already familiar with the Azure Platform, but the concepts themselves are not too complex and should be accessible without prior knowledge (a quick DuckDuckGo search might be required, sorry).

First things first - we need to set up Terraform. Depending on your OS your mileage may vary, but the simplest way is to just download the binary and add a PATH entry. Next, you want to set up the az CLI tool and run az login. This will allow you to manage objects on the Azure Platform from your local setup (assuming that you have the rights to do so).

Initialize a new project, create a src directory in it, and put a file called main.tf inside. Then paste in the following content:

terraform {
  required_providers {
    azurerm    = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_client_config" "current" {
}

locals {
  appId = "TfPlayground-dev"
  tags  = {
    ApplicationName = "TfPlayground"
    Environment     = "dev"
  }
}

resource "azurerm_resource_group" "this" {
  name     = "rg-TfPlayground-dev"
  location = "West Europe"
  tags     = local.tags
}

A lot of things are already happening here - in the required_providers block we told Terraform that we will need to integrate with the Azure platform (azurerm). There can be multiple implementations of a provider (which is a great feature), so we specified that we want to use hashicorp/azurerm. The provider "azurerm" block lets us pass additional configuration to the provider, but we do not want to do that just yet. Next, we see the first occurrence of the data keyword. Terraform distinguishes two types of objects - data and resource. A resource is something that we create in this particular Terraform "plan" and whose lifecycle we control. A data object is something that already exists and that we just want to use as a source for the rest of the deployment. Destroying the infrastructure from this plan won't destroy a data object; its lifecycle is external to us. What follows is the locals block. As the name suggests, these are variables that are scoped locally to our plan. If we think about the plan as a big function which outputs infrastructure (or destroys it), then locals are variables that we declare inside this function. We could live without them, but they help keep the code a bit cleaner. What comes last is the main dish - the first resource block. This particular instance will create an Azure Resource Group. Let's try this out.
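
To make the data vs. resource distinction a bit more tangible, here is a small, purely hypothetical snippet (not part of our plan) that looks up a resource group created outside of Terraform - contrast it with the resource block above, which creates the resource group and owns its lifecycle:

# Hypothetical lookup of a resource group that was created outside of this
# plan - Terraform will only read it, never create, modify, or destroy it.
data "azurerm_resource_group" "existing" {
  name = "rg-created-by-someone-else"
}

# Its attributes can then be referenced just like those of a resource,
# e.g. data.azurerm_resource_group.existing.location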

Run terraform init from the src directory and observe the output. If everything goes right, Terraform will greet us with Terraform has been successfully initialized! As a bonus, we get a .terraform.lock.hcl file that helps to alleviate some of the pains of working with Terraform git repositories.

Now you can run terraform plan. This operation will output the changes that need to be applied to bring the state of the infrastructure to the point described in the files. Note that this does not mean that the operations have already been performed. This is just a nifty way to see what will happen and to double-check that we really want it.

Ok, let's finally get our hands dirty - run terraform apply and, when prompted, type in yes. When the operation completes you can go to the Azure Platform and search for the rg-TfPlayground-dev resource group. There is a good chance that it will be there. We now have one more file in our project repo - terraform.tfstate. This is how Terraform keeps track of the current state. You can open it; it's in regular JSON format. Inside you will find some information about our configuration and live "instances". Let's run terraform destroy and observe how the contents of the file change. It should be mostly empty at this point. You can go to the Azure Platform and verify that the resource group has been deleted.

Handling state files

Some readers are probably wondering at this point how to manage the terraform.tfstate file itself. It is just a regular, local file, but if we have multiple people working on a project, then we would like to share it somehow. Putting it inside a git repository might seem like a solution, but we would need to be sure that only one person works on this file at a given time and that all changes are pushed out. This is definitely not a road that we should follow. Fortunately, there is a better solution - we can keep the state file on the Azure Platform itself! That way we can have conflicting plan versions, but the state itself will never be compromised. Let me be very explicit about this - all things created by Terraform are "external" to it. You have no additional management layer, but also no protection against shooting yourself in the foot. If you lose the state file, then the best thing you can do is manually delete all the resources and start from scratch. It is technically possible to import resources into the state, but we won't cover this here.

At this point, it would be good to mention a useful "tip" for working with local state. If you ever get stuck and some things do not seem to work as expected (unfortunately, it happened a few times to us on a Linux setup), then you can nuke the whole thing using rm -rf ~/.terraform.d .terraform .terraform.lock.hcl terraform.tfstate*. Once again - depending on the OS, your mileage may vary, but the desired result is to be able to start completely from scratch.

Before we delve further into managing state files, let's quickly bounce back to one more thing Terraform offers us - variables and outputs. Remember how we said that Terraform can be treated like a big function? Well, no good function can work without arguments and returned values. We will create two new files: variables.tf and outputs.tf.

For the first one please paste in:

variable "applicationName" {
  type    = string
  default = "TfPlayground"
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "region" {
  type    = string
  default = "West Europe"
}

For outputs.tf:

output "appId" {
  value = local.appId
}

Now, let’s change some definitions in main.tf:

locals {
  appId = "${var.applicationName}-${var.environment}"
  tags  = {
    ApplicationName = var.applicationName
    Environment     = var.environment
  }
}

resource "azurerm_resource_group" "this" {
  name     = "rg-${local.appId}"
  location = var.region
  tags     = local.tags
}

After running terraform apply, we will end up in pretty much the same state as before. The one thing that changes is that we will produce the output: appId = "TfPlayground-dev". We can observe this particular output even when executing terraform plan, because its value can already be calculated during the planning phase.
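
By contrast, an output that references an attribute of a created resource would only be fully known after terraform apply, since the value is assigned by Azure when the object comes to life. A hypothetical addition to outputs.tf illustrating this:

# Hypothetical output - its value is only known after apply, because the ID
# is assigned by Azure when the resource group is actually created.
output "resourceGroupId" {
  value = azurerm_resource_group.this.id
}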

Ok, time to jump back to state management. Please update the current terraform block to match:

terraform {
  required_providers {
    azurerm    = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
  }
  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
    key                  = "TfPlayground-dev.tfstate"
  }
}

mystorageaccount should match your Azure Storage Account and myfilesystem a container accessible on this account. You might have noticed that there is no variable substitution for the key property. Why is that? The problem is that the whole terraform block describes prerequisites that need to be resolved at the time of project initialization. At this point, we are still not calling the big "function", we are just specifying the "environment" in which this function will eventually be called. In other words - the function arguments are unresolved. This might be a problem, because we need to have a separate Terraform state file for each environment that we would like to create (dev, int, prod, etc.). All of those will use the same plan files (with varying versions) but will have a different current state. Don't worry though, there is a simple way to fix it - we just need to override the value of this key when we run the init command.

Let's re-initialize the project using terraform init -backend-config="key=TfPlayground-dev.tfstate". We need to do this because our backend has changed. Each time we do it, we would advise running terraform destroy first anyway. If you want to, or if you are having some kind of issues, you might also want to remove all the old state and config files first. Anyway, in the end, you should get an error similar to this: Error: Either an Access Key / SAS Token or the Resource Group for the Storage Account must be specified. This error message tells us that using just az login won't be enough in this case - in addition to accessing the platform itself, we also need to prove that we have permission to use the designated storage account. This issue would disappear if we were terraforming directly from Azure Pipelines, but for now, let's bribe security with an access key. You will need to go to Azure Portal > Storage accounts > mystorageaccount > Access keys and copy your key. Now, try once more initializing the project with terraform init -backend-config="key=TfPlayground-dev.tfstate" -backend-config="access_key=[your key here]". This time around everything should complete successfully. If you run terraform plan now, you will notice two things. First, each time you run a command it will very briefly output Acquiring state lock. This may take a few moments.... From here onwards, you are safe to work in a distributed way without compromising state files. Second, there is no content in the local terraform.tfstate file - everything sits in your storage account.
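
As a side note, the same partial backend configuration can also be kept in a small per-environment file instead of being typed on the command line each time. A hypothetical dev.backend.tfvars, passed with terraform init -backend-config=dev.backend.tfvars, could look like this (the access key itself is better supplied separately, for example via the ARM_ACCESS_KEY environment variable, than committed to a file):

# dev.backend.tfvars (hypothetical) - only the per-environment part of the
# backend configuration lives here; the sensitive access key is passed
# separately, e.g. via the ARM_ACCESS_KEY environment variable.
key = "TfPlayground-dev.tfstate"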

There is one more thing worth noting on the side here. It's possible to pass variables to terraform plan using config files or environment variables of the form TF_VAR_somekey. Apart from that, there are several predefined environment variables that can be used to control Terraform itself. One of the most useful is TF_LOG. If you ever have trouble figuring out what is wrong, try using export TF_LOG=trace (or maybe just debug).
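
For illustration, a hypothetical dev.tfvars file for our variables could look like the snippet below and would be passed with terraform plan -var-file=dev.tfvars; the same values could instead be exported as TF_VAR_applicationName, TF_VAR_environment, and TF_VAR_region:

# dev.tfvars (hypothetical) - overrides the defaults from variables.tf
applicationName = "TfPlayground"
environment     = "dev"
region          = "West Europe"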

Let's finally run terraform apply and update our remote state. From now on we won't use terraform destroy when not necessary. It's a useful command when we want to shut everything down, but in a lot of cases terraform apply will be more appropriate. If we modify our files and then re-run apply, Terraform will figure out what was added and what was removed, effectively destroying any entity whose definition was deleted.

Ok, with that we will conclude for today. What we have now is a complete setup that lets us handle state files without going crazy. That is a perfectly valid project on its own, but we will try to make it even better in the upcoming Part II of this article. If you want to automatically create a Databricks workspace, mount an external storage account, and manage a Spark cluster, then be sure to check it out! Thank you, and we hope you are enjoying the content.