Terraforming Azure Databricks - Part II

Last time we tried to transform Azure with the power of Terraform. Let’s pick up where we left off.

For anyone who missed it - here is the first part of the article.

Creating workspace

Please open the project that we created last time. We will expand its capabilities by adding content directly to the main.tf file.

It’s quite simple. First, change the definition of providers:

  required_providers {
    azurerm    = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

Next, append the configuration for the new objects:

resource "azurerm_databricks_workspace" "this" {
  name                        = "ws-${local.appId}"
  resource_group_name         = azurerm_resource_group.this.name
  location                    = azurerm_resource_group.this.location
  sku                         = "premium"
  managed_resource_group_name = "ws-rg-${local.appId}"
  tags                        = local.tags
}

provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id
}

Finally, add a new output to outputs.tf:

output "databricksHost" {
  value = "https://${azurerm_databricks_workspace.this.workspace_url}/"
}

You will have to re-init the project (the providers have changed!) with terraform init -backend-config="key=TfPlayground-dev.tfstate" -backend-config="access_key=[your key here]", and then you can apply the plan as usual. What you get in return, after a long while, is something that looks like this: databricksHost = "https://adb-3447378127568573.13.azuredatabricks.net/". If you follow this link, you will land in a brand new Databricks workspace.

Here lies the biggest pitfall that we encountered on our adventure with terraforming Azure. If we tried to create any additional resources that require this workspace before visiting it through the link, we would most likely fail. Why is that? Although the resource is created in Azure, the workspace itself is not yet “launched”. This makes it impossible to create a mount or a new cluster on top of this workspace. It’s a major inconvenience, but unfortunately, there is not much that can be done about it (time to revisit the list of “cons” of Terraform). Well, since we are still getting a lot almost for free here, let’s just ignore this and continue onwards.

What are we still missing in our setup? We definitely need PySpark, a Databricks notebook, and some place to store data that is independent of the workspace lifecycle. At the beginning, we also mentioned that we would like to include some basic security/role management. One of the things that we can do is restrict access to some of the resources to a specified service principal. Let’s assume that we want a separate storage account, and we don’t want to grant access to it to individual users, but we do want to make sure that everybody can modify data using notebooks. To achieve that, we could put the service principal secrets inside a key vault and just manage access to that instance. This way, we can group any number of access rights required for a specific process, and we have one place from which we can grant and revoke privileges.

Key vaults and modules

Ok, let’s jump in. The first thing we would like to do is to add some configuration to our provider:

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

Why do we need this? In the following examples, we will operate on two key vaults - one will be our persistent, manually created KV that holds access to all our precious secrets. The second one we will create through automation, copying only the secrets that we require. It gives us the ability to modify the source KV without having to worry about some legacy deployment or different environments. The “problem” is that key vaults on Azure try to retain their data even after the resource is destroyed. It is a useful feature, but in the case of Terraform, it can be a bit cumbersome to manually purge some short-lived KV every time. purge_soft_delete_on_destroy ensures that the purge happens when the destroy stage is executed.

If we have this already covered, then let’s tackle one more concept - user-defined modules. Terraform allows us to “package” a set of resources, data, variables, and outputs, and use this as a self-contained resource. In some sense, we can refactor parts of the big function and define smaller sub-routines to avoid duplication and keep the whole thing a bit cleaner.

Please create a new subdirectory - src/keyvault_read - and inside it three files called main.tf, variables.tf and outputs.tf with the following content:

# main.tf

data "azurerm_client_config" "current" {}

data "azurerm_key_vault" "azvault" {
  name                = var.kvName
  resource_group_name = var.kvResourceGroup
}

data "azurerm_key_vault_secret" "secret" {
  for_each     = var.secrets
  key_vault_id = data.azurerm_key_vault.azvault.id
  name         = each.key
}
# variables.tf

variable "kvName" {
  type    = string
}

variable "kvResourceGroup" {
  type    = string
}

variable "secrets" {
  type    = set(string)
}
# outputs.tf

output "secrets_value" {
  value = { for name, secret in data.azurerm_key_vault_secret.secret : name => secret.value }
}

What’s interesting in this example is the use of for_each in azurerm_key_vault_secret in conjunction with secrets being defined as set(string). This allows us to declare a piece of data for each element that we pass in the set! In the end, we will pick all the secrets that are relevant to our deployment and copy them to a separate key vault. Speaking of which, let’s create one more directory, called keyvault_store, with the same set of files as keyvault_read and the following content:

# variables.tf

variable "appId" {
  type = string
}

variable "tags" {
  type = map(string)
}

variable "rgLocation" {
  type = string
}

variable "rgName" {
  type = string
}
# outputs.tf

output "kvResourceId" {
  value = azurerm_key_vault.this.id
}

output "kvVaultUri" {
  value = azurerm_key_vault.this.vault_uri
}

output "kvVaultName" {
  value = azurerm_key_vault.this.name
}
# main.tf
data "azurerm_client_config" "current" {
}

module "keyvault_read" {
  source          = "../keyvault_read"
  kvName          = "mypersistentkv"
  kvResourceGroup = "some-rg"
  secrets         = [
    "service-user",
    "service-user-pass"
  ]
}

resource "azurerm_key_vault" "this" {
  name                = "k-v-${var.appId}"
  location            = var.rgLocation
  resource_group_name = var.rgName
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "premium"
  tags                = var.tags
}

resource "azurerm_key_vault_access_policy" "user" {
  key_vault_id            = azurerm_key_vault.this.id
  tenant_id               = data.azurerm_client_config.current.tenant_id
  object_id               = data.azurerm_client_config.current.object_id
  key_permissions         = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
  secret_permissions      = [
    "Delete",
    "Get",
    "List",
    "Set",
    "Purge",
  ]
  certificate_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
}

resource "azurerm_key_vault_access_policy" "service" {
  key_vault_id            = azurerm_key_vault.this.id
  tenant_id               = data.azurerm_client_config.current.tenant_id
  object_id               = module.keyvault_read.secrets_value["service-user"]
  key_permissions         = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
  secret_permissions      = [
    "Delete",
    "Get",
    "List",
    "Set",
    "Purge",
  ]
  certificate_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
}

resource "azurerm_key_vault_secret" "service_user" {
  depends_on   = [azurerm_key_vault_access_policy.user]
  name         = "service-user"
  value        = module.keyvault_read.secrets_value["service-user"]
  key_vault_id = azurerm_key_vault.this.id
}

resource "azurerm_key_vault_secret" "service_user_pass" {
  depends_on   = [azurerm_key_vault_access_policy.user]
  name         = "service-user-pass"
  value        = module.keyvault_read.secrets_value["service-user-pass"]
  key_vault_id = azurerm_key_vault.this.id
}

A lot has happened here. First of all, look at the keyvault_read module declaration. We are using one module from another module here! We provide all the necessary arguments at the point of use. The service-user and service-user-pass secrets need to be already present in the KV, and obviously, the KV itself has to exist. The easiest way to set the secrets is az keyvault secret set --name "service-user" --value "somevalue" --vault-name "mypersistentkv". Next, we create the azurerm_key_vault resource. This will be a completely fresh KV, so we want to set initial permissions for the user and the service principal. Those listed here are a bit liberal, but for joy-hacking purposes they serve just about right. If you look closely, you will notice that we take the object_id for service-user directly from the persistent vault. At the end, we declare both values that we will use from the outside world: the service principal name and password.

Finally, we can add this to the main.tf in the main directory:

module "keyvault_store" {
  source     = "./keyvault_store"
  appId      = local.appId
  tags       = local.tags
  rgLocation = azurerm_resource_group.this.location
  rgName     = azurerm_resource_group.this.name
}

Before applying the plan, you will have to re-init the project once more. This is necessary because we changed the features of the provider (purging on delete) and because modules need an “installation” step.

Now we can start adding some additional features on top of Databricks. The first thing we want to mention is that the modules we previously created are fully functional Terraform entities - that means they can be shared in a registry (local or remote) and freely reused. For simplicity’s sake, we will skip that and just copy-paste our keyvault_read module where required. Please don’t do this at home, except for training purposes.
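As a quick illustration, consuming the module from a shared location instead of a local path only changes the source argument; the repository URL and tag below are hypothetical placeholders, not a real registry:

```hcl
module "keyvault_read" {
  # hypothetical Git source - any reachable module registry or repository works
  source          = "git::https://example.com/terraform-modules.git//keyvault_read?ref=v1.0.0"
  kvName          = "mypersistentkv"
  kvResourceGroup = "some-rg"
  secrets         = ["service-user", "service-user-pass"]
}
```

Terraform downloads remote modules during terraform init, which is another reason the re-init step keeps coming back.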

Creating second project

Let’s create a new Terraform project with an src folder and copy the keyvault_read module there. Next, please create the following variables.tf file in the main directory:

variable "applicationName" {
  type = string
  default = "TfPlayground"
}

variable "environment" {
  type = string
  default = "dev"
}

variable "region" {
  type = string
  default = "West Europe"
}

We need to use the same applicationName as in our first project. This is the only mechanism we use to specify dependencies across different projects in the same application. Now we need to fill out the main.tf file. Starting from the top, please copy the following snippets:

terraform {
  required_providers {
    azurerm    = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
    key                  = "TfPlayground-SA-dev.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_client_config" "current" {
}

locals {
  appId = "${var.applicationName}-${var.environment}"
  tags  = {
    ApplicationName = var.applicationName
    Environment     = var.environment
  }
}

module "keyvault_read" {
  source          = "./keyvault_read"
  kvName          = "k-v-${local.appId}"
  kvResourceGroup = "rg-${local.appId}"
  secrets         = [
    "service-user",
    "service-user-pass"]
}

The important thing to notice here is that we use a different name for the state file. This project will be managed independently of our previous one, and we will be able to easily redeploy just parts of our app.

Mounting storage account

The second part of main.tf file will contain a bit more meat on the bone:

data "azurerm_databricks_workspace" "this" {
  name                = "ws-${local.appId}"
  resource_group_name = "rg-${local.appId}"
}

provider "databricks" {
  azure_workspace_resource_id = data.azurerm_databricks_workspace.this.id
  azure_client_id             = module.keyvault_read.secrets_value["service-user"]
  azure_client_secret         = module.keyvault_read.secrets_value["service-user-pass"]
  azure_tenant_id             = data.azurerm_client_config.current.tenant_id
}

data "azurerm_storage_account" "this" {
  name                     = "mystorageaccount2oreven3"
  resource_group_name      = "some-rg"
}

data "azurerm_storage_container" "this" {
  name                  = "myfilesystem"
  storage_account_name  = data.azurerm_storage_account.this.name
}

resource "databricks_azure_adls_gen2_mount" "this" {
  container_name         = data.azurerm_storage_container.this.name
  storage_account_name   = data.azurerm_storage_account.this.name
  mount_name             = "terraformpoc"
  tenant_id              = data.azurerm_client_config.current.tenant_id
  client_id              = module.keyvault_read.secrets_value["service-user"]
  client_secret_scope    = "kv-secret-scope-${local.appId}"
  client_secret_key      = "service-user-pass"
  initialize_file_system = true
}

azurerm_databricks_workspace refers to the workspace that we created before. An interesting thing happens in the databricks provider - we are authenticating using credentials stored in our key vault! azure_tenant_id is taken straight from the current client config, but the user and the service principal might live on different tenants, so please be vigilant. The azurerm_storage_account obviously does not have to refer to the same storage account that we use for storing state.

Next, we specify the storage container, and then finally our mount in databricks_azure_adls_gen2_mount. One thing to notice is that in client_secret_key we do not use the secret’s value, but rather its name.

Databricks secret scopes

So what is this client_secret_scope? Here is the problem - we don’t have it, and so far it does not seem to be possible to create it automatically. We need to go to our workspace once more, appending #secrets/createScope at the end of the URL. This will show us the form for creating secret scopes, which are a way to pass secret information from Azure to Databricks.

You need to provide some information here:

# as in "client_secret_scope" field
Scope Name = kv-secret-scope-TfPlayground-dev
Manage Principal = All Users
# this is the KV that we created via Terraform
DNS Name = https://k-v-tfplayground-dev.vault.azure.net/
# provide your subscription id in place marked as XXX
Resource ID =  /subscriptions/XXX/resourceGroups/rg-TfPlayground-dev/providers/Microsoft.KeyVault/vaults/k-v-TfPlayground-dev
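For completeness: newer versions of the Databricks provider expose a databricks_secret_scope resource with a keyvault_metadata block that may automate this step, though to our knowledge it requires an AAD user token rather than service principal authentication, so treat this as a sketch to verify against your provider version:

```hcl
# Hedged sketch - needs a provider version that supports keyvault_metadata,
# and likely an AAD user token rather than service principal auth.
resource "databricks_secret_scope" "kv" {
  name = "kv-secret-scope-${local.appId}"

  keyvault_metadata {
    # resource ID and DNS name of the KV created by the keyvault_store module
    resource_id = "/subscriptions/XXX/resourceGroups/rg-TfPlayground-dev/providers/Microsoft.KeyVault/vaults/k-v-TfPlayground-dev"
    dns_name    = "https://k-v-tfplayground-dev.vault.azure.net/"
  }
}
```

If this works in your setup, the manual form-filling step disappears entirely.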

Now that we have this cumbersome process behind us, let’s look closely at what happens here. Our service credentials are used in two places - we are authenticating to Databricks itself as the service user, and then databricks_azure_adls_gen2_mount accesses our credentials through the secret scope to mount the storage account. If we add the az login at the beginning to this, then we can have three distinct users/services performing different operations.

Finally, we can go back to our project. Let’s initialize and apply our plan (remember to change the state file key if you override the name on the command line). After a long while, we should succeed. Apart from our mount being accessible from a notebook, we can see that a single-node cluster was created in the workspace. Creating or destroying a mount will always require at least one node, which can be a bit of a problem if we have a low quota for “Standard FS Family vCPUs” in Azure. This will be especially important now, because we are about to define a dedicated Spark cluster.

Adding Spark cluster

Please add the following snippet to main.tf file:

data "databricks_node_type" "this" {
  gb_per_core = 1
}

data "databricks_spark_version" "this" {
  spark_version = "3"
}

resource "databricks_cluster" "this" {
  cluster_name            = "cluster-${local.appId}"
  idempotency_token       = "cluster-${local.appId}"
  spark_version           = data.databricks_spark_version.this.id
  node_type_id            = data.databricks_node_type.this.id
  autotermination_minutes = 20

  // for single-node cluster
  spark_conf = {
    "spark.databricks.cluster.profile" : "singleNode"
    "spark.master" : "local[*]"
  }

  // for single-node cluster
  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}

This configuration is mostly self-explanatory; the only thing that might be new here is the idempotency_token - it guarantees that we won’t try to create a new cluster if one with this id already exists. It can be useful in case the process fails after the cluster has already been initialized. In general, databricks_cluster offers a lot of different configuration options that control cluster parameters. One of the more interesting ones is docker_image, which allows us to initialize the cluster with a custom Databricks image:

docker_image {
  url = "myregistry.azurecr.io/myimage:latest"
  basic_auth {
    username = module.keyvault_read.secrets_value["some-registry-user"] # probably will have to be set up separately
    password = module.keyvault_read.secrets_value["some-registry-user-password"]
  }
}

Please create outputs.tf file now:

output "clusterId" {
  value = databricks_cluster.this.id
}

We can plan and apply again. After the execution finishes, look for the clusterId output that we just declared (terraform output clusterId will print it as well). Copy it somewhere safe, we will need it in a moment.

Anyway, it would seem that the core of our infrastructure is already provisioned. This is the part that should not change that often in the daily development cycle.

Databricks notebooks

Ok, time for the last part. Let’s create a new Terraform project. It’s the last one, we promise! We will try to define a Databricks notebook job that can be instantiated from a notebook file (local, artifact, etc.). As usual, create an src directory and copy-paste our keyvault_read module (or skip it if you installed it locally). Apart from that, we will need four more files in src dir:

# variables.tf

variable "applicationName" {
  type = string
  default = "TfPlayground"
}

variable "environment" {
  type = string
  default = "dev"
}

variable "region" {
  type = string
  default = "West Europe"
}

variable "notebookFilePath" {
  type = string
  default = "./ExampleNotebook.py"
}

variable "notebookName" {
  type = string
  default = "ExampleNotebook"
}

variable "clusterId" {
  type = string
}

In variables.tf we can see that we have a clusterId variable without a default value. When applying this plan, Terraform will ask us to provide this particular parameter (remember the id we asked you to copy?); alternatively, you can pass it on the command line with -var "clusterId=...". This is a bit problematic when working with Azure Release Pipelines, but it can be passed from the previous Terraform task with a little effort.

# outputs.tf

output "job_url" {
  value = databricks_job.this.url
}

Above we have just the URL for the job that we are trying to create. You will be able to find it in the Databricks workspace anyway, but why not make life a bit easier?

# ExampleNotebook.py

print("Hello World!")

Here we have the collective effort of the Data Science team. After six months of deep analysis, they have come up with a way to greet our planet. Exciting! So far we are waiting for any response.

# main.tf

terraform {
  required_providers {
    azurerm    = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
    key                  = "TfPlayground-DBN-dev.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_client_config" "current" {
}

locals {
  appId = "${var.applicationName}-${var.environment}"
  tags  = {
    ApplicationName = var.applicationName
    Environment     = var.environment
  }
}

module "keyvault_read" {
  source          = "./keyvault_read"
  kvName          = "k-v-${local.appId}"
  kvResourceGroup = "rg-${local.appId}"
  secrets         = [
    "service-user",
    "service-user-pass"]
}

data "azurerm_databricks_workspace" "this" {
  name                = "ws-${local.appId}"
  resource_group_name = "rg-${local.appId}"
}

provider "databricks" {
  azure_workspace_resource_id = data.azurerm_databricks_workspace.this.id
  azure_client_id             = module.keyvault_read.secrets_value["service-user"]
  azure_client_secret         = module.keyvault_read.secrets_value["service-user-pass"]
  azure_tenant_id             = data.azurerm_client_config.current.tenant_id
}

data "databricks_current_user" "this" {
}

resource "databricks_notebook" "this" {
  source = var.notebookFilePath
  path   = "${data.databricks_current_user.this.home}/${var.notebookName}"
}

resource "databricks_job" "this" {
  name                = "job-${local.appId}"
  existing_cluster_id = var.clusterId
  max_retries         = 0

  notebook_task {
    notebook_path  = databricks_notebook.this.path
  }
}

Our last blob of code is mostly boilerplate. It starts getting interesting with databricks_notebook. Since notebooks in Databricks are always owned by a user, we must assign them to some specific entity.

The databricks_job is a tricky one. It seems impossible to start a one-off job that gets executed the moment we finish applying the plan. We can define a cron schedule that will run the job periodically, but that’s not always what we want.
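For reference, a periodic schedule is just an extra block inside the databricks_job resource; the cron expression below (every day at 06:00 UTC) is only an example value:

```hcl
# added inside the databricks_job resource - Databricks uses Quartz cron syntax
schedule {
  quartz_cron_expression = "0 0 6 * * ?" # every day at 06:00
  timezone_id            = "UTC"
}
```

With a schedule in place, the job will fire on its own; without one, it simply waits to be triggered from the UI or the API.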

If you finally apply this project and follow the job_url, you should be able to run the job. Well done!

Closing words

There are many things that we did not cover in this tutorial. For those of you who are interested in further developing your Terraform skills, we would recommend starting with the full documentation of the modules and providers that we used. Maybe you can find interesting solutions to some of the problems that we could not?

We tried to be as objective as possible here and show both the pros and cons of working with Azure via Terraform. Overall, this was a pleasant experience, and although it’s a bit rough around the edges, the integration seems to be heading in the right direction. Whether you work with Azure or not, we would strongly recommend checking out Terraform!