# Terraforming Azure Databricks - Part II

Last time we tried to transform Azure with the power of Terraform. Let’s pick up where we left off.

For anyone who missed it - here is the first part of the article.

#### Creating a workspace

Please open the project that we created last time. We will expand its capabilities by adding content directly to the main.tf file.

It’s quite simple. First, change the definition of providers:

```hcl
required_providers {
  azurerm = {
    source  = "hashicorp/azurerm"
    version = "=2.74.0"
  }
  databricks = {
    source  = "databrickslabs/databricks"
    version = "=0.3.1"
  }
}
```
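
Keep in mind that required_providers sits inside the top-level terraform block from part one. Assuming the backend settings from the previous article (the storage account and container names are placeholders), the surrounding block looks roughly like this:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

  # backend configuration carried over from part one
  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
  }
}
```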


Next, append all the new objects configs:

```hcl
resource "azurerm_databricks_workspace" "this" {
  name                        = "ws-${local.appId}"
  resource_group_name         = azurerm_resource_group.this.name
  location                    = azurerm_resource_group.this.location
  sku                         = "premium"
  managed_resource_group_name = "ws-rg-${local.appId}"
  tags                        = local.tags
}

provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id
}
```


Finally, add a new output to outputs.tf:

```hcl
output "databricksHost" {
  value = "https://${azurerm_databricks_workspace.this.workspace_url}/"
}
```

You will have to re-init the project (the providers have changed!) with `terraform init -backend-config="key=TfPlayground-dev.tfstate" -backend-config="access_key=[your key here]"` and then you can apply the plan as usual. What you get in return, after a long while, is something that looks like this: `databricksHost = "https://adb-3447378127568573.13.azuredatabricks.net/"`. If you follow this link, you will be thrown into a brand new Databricks workspace.

Here lies the biggest pitfall that we encountered on our adventure with terraforming Azure. If we tried to create any additional resources that require this workspace before visiting it through the link, we would most likely fail. Why is that? Although the resource is created in Azure, the workspace itself is not yet "launched". This makes it impossible to create a mount or a new cluster on top of this workspace. It's a major inconvenience, but unfortunately, there is not much that can be done about it (time to revisit the list of "cons" of Terraform). Well, since we are still getting a lot almost for free here, let's just ignore this and continue onwards.

What are we still missing in our setup? We definitely need PySpark, a Databricks notebook, and some place to store data that is independent of the workspace lifecycle. At the beginning, we also mentioned that we would like to include some basic security/role management. One of the things we can do is restrict access to some of the resources to a specified service principal. Let's assume that we want a separate storage account, and we don't want to give access to it to individual users, but we do want everybody to be able to modify data using notebooks. To achieve that, we could put the service principal's secrets inside a key vault and just manage access to this instance. We can group any number of access rights that we require for a specific process this way, and we have one place from which we can grant and revoke privileges.

#### Key vaults and modules

Ok, let's jump in. The first thing we would like to do is add some configuration to our provider:

```hcl
provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}
```

Why do we need this? In the following examples, we will operate on two key vaults - one will be our persistent, manually created KV that holds all our precious secrets. The second one we will create using automation, copying only the secrets that we require. It will give us the ability to modify the source KV without having to worry about some legacy deployment or different environments. The "problem" is that key vaults on Azure try to retain their data even after the resource is destroyed. It is a useful feature, but in the case of Terraform, it can be a bit cumbersome to go and manually purge some short-lived KV each time. purge_soft_delete_on_destroy ensures that this happens when the destroy stage is executed.

With that covered, let's tackle one more concept - user-defined modules. Terraform allows us to "package" a set of resources, data, variables, and outputs, and use it as a self-contained resource. In some sense, we can refactor parts of the big function and define smaller subroutines to avoid duplication and keep the whole thing a bit cleaner.
Please create a new subdirectory - src/keyvault_read - and inside it three files called main.tf, variables.tf and outputs.tf with the following content:

```hcl
# main.tf
data "azurerm_client_config" "current" {}

data "azurerm_key_vault" "azvault" {
  name                = var.kvName
  resource_group_name = var.kvResourceGroup
}

data "azurerm_key_vault_secret" "secret" {
  for_each     = var.secrets
  key_vault_id = data.azurerm_key_vault.azvault.id
  name         = each.key
}
```

```hcl
# variables.tf
variable "kvName" {
  type = string
}

variable "kvResourceGroup" {
  type = string
}

variable "secrets" {
  type = set(string)
}
```

```hcl
# outputs.tf
output "secrets_value" {
  value = {
    for item in [
      for value in values(data.azurerm_key_vault_secret.secret) : { (value.name) = (value.value) }
    ] : keys(item)[0] => values(item)[0]
  }
}
```

What's interesting in this example is the use of for_each in azurerm_key_vault_secret in conjunction with secrets being defined as set(string). This allows us to declare a piece of data for each element that we pass in the set! In the end, we will pick and choose all the secrets that are relevant to our deployment and copy them to a separate key vault. Speaking of which, let's create one more directory, called keyvault_store, with the same set of files as keyvault_read and the following content:

```hcl
# variables.tf
variable "appId" {
  type = string
}

variable "tags" {
  type = map(string)
}

variable "rgLocation" {
  type = string
}

variable "rgName" {
  type = string
}
```

```hcl
# outputs.tf
output "kvResourceId" {
  value = azurerm_key_vault.this.id
}

output "kvVaultUri" {
  value = azurerm_key_vault.this.vault_uri
}

output "kvVaultName" {
  value = azurerm_key_vault.this.name
}
```

```hcl
# main.tf
data "azurerm_client_config" "current" {}

module "keyvault_read" {
  source          = "../keyvault_read"
  kvName          = "mypersistentkv"
  kvResourceGroup = "some-rg"
  secrets         = [
    "service-user",
    "service-user-pass"
  ]
}

resource "azurerm_key_vault" "this" {
  name                = "k-v-${var.appId}"
  location            = var.rgLocation
  resource_group_name = var.rgName
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard" # required by the azurerm provider
  tags                = var.tags
}

resource "azurerm_key_vault_access_policy" "user" {
  key_vault_id = azurerm_key_vault.this.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = data.azurerm_client_config.current.object_id

  key_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]

  secret_permissions = [
    "Delete",
    "Get",
    "List",
    "Set",
    "Purge",
  ]

  certificate_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
}

resource "azurerm_key_vault_access_policy" "service" {
  key_vault_id = azurerm_key_vault.this.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  # the service principal's object id, read from the persistent vault
  object_id    = module.keyvault_read.secrets_value["service-user"]

  key_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]

  secret_permissions = [
    "Delete",
    "Get",
    "List",
    "Set",
    "Purge",
  ]

  certificate_permissions = [
    "Get",
    "List",
    "Create",
    "Update",
    "Purge",
  ]
}

resource "azurerm_key_vault_secret" "service_user" {
  depends_on   = [azurerm_key_vault_access_policy.user]
  name         = "service-user"
  value        = module.keyvault_read.secrets_value["service-user"]
  key_vault_id = azurerm_key_vault.this.id
}

resource "azurerm_key_vault_secret" "service_user_pass" {
  depends_on   = [azurerm_key_vault_access_policy.user]
  name         = "service-user-pass"
  value        = module.keyvault_read.secrets_value["service-user-pass"]
  key_vault_id = azurerm_key_vault.this.id
}
```

A lot has happened here. First of all, look at the keyvault_read module declaration - we are using one module from inside another module! We provide all the necessary arguments at the point of instantiation. The service-user and service-user-pass secrets need to already be present in the KV. Obviously, the KV itself has to exist. The easiest way to set the secrets is with az keyvault secret set --name "service-user" --value "somevalue" --vault-name "mypersistentkv". Next, we create the azurerm_key_vault resource. This will be a completely fresh KV, so we want to set initial permissions for the user and the service principal. The ones listed here are a bit liberal, but for joy-hacking purposes they serve just fine. If you look closely, you will notice that we take the object_id for service-user directly from the persistent vault. At the end, we declare both values that we will use from the outside world: the service principal name and password.
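
One more note on keyvault_read: the nested for expression in its outputs.tf builds a list of single-entry maps and then merges them, but it can be collapsed into a single map comprehension. This sketch should produce the same name-to-value map:

```hcl
# Equivalent, flatter form of the secrets_value output:
# iterate the secret data objects directly and map name => value.
output "secrets_value" {
  value = { for s in values(data.azurerm_key_vault_secret.secret) : s.name => s.value }
}
```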

Finally, we can add this to the main.tf in the main directory:

```hcl
module "keyvault_store" {
  source     = "./keyvault_store"
  appId      = local.appId
  tags       = local.tags
  rgLocation = azurerm_resource_group.this.location
  rgName     = azurerm_resource_group.this.name
}
```
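
If other resources need to reference the fresh vault, the module's outputs are addressable from the root module. For example, exposing the vault URI as a root-level output (a sketch using the outputs declared in keyvault_store above):

```hcl
# Surface the new vault's URI from the keyvault_store module
output "keyVaultUri" {
  value = module.keyvault_store.kvVaultUri
}
```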


Before applying the plan you will have to re-init the project once more. This is necessary because we changed the features of the provider (purging on delete), and because modules need an “installation” step.

Now we can start adding some additional features on top of Databricks. The first thing we want to mention is that the modules we previously created are fully functional Terraform entities - that means they can be shared in a registry (local or remote) and freely reused. For simplicity’s sake, we will skip that and just copy-paste our keyvault_read module where required. Please don’t do this at home, except for training purposes.
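
For reference, reusing a shared module instead of copy-pasting only changes the source argument. A hypothetical registry address (a git URL works similarly) would look like this:

```hcl
module "keyvault_read" {
  # hypothetical registry address - local paths like "./keyvault_read" work the same way
  source  = "example-org/keyvault-read/azurerm"
  version = "~> 1.0"

  kvName          = "mypersistentkv"
  kvResourceGroup = "some-rg"
  secrets         = ["service-user", "service-user-pass"]
}
```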

#### Creating a second project

Let’s create a new Terraform project with an src folder and copy the keyvault_read module there. Next, please create the following variables.tf file in the main directory:

```hcl
variable "applicationName" {
  type    = string
  default = "TfPlayground"
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "region" {
  type    = string
  default = "West Europe"
}
```
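
If you need different values per deployment, these defaults can be overridden without touching variables.tf, for instance with a terraform.tfvars file (hypothetical values; remember that applicationName must stay in sync with the first project):

```hcl
# terraform.tfvars - overrides the defaults declared in variables.tf
applicationName = "TfPlayground"
environment     = "test"
region          = "North Europe"
```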


We need to use the same applicationName as in our first project. This is the only method that we use to specify dependencies across different projects in the same application. Now we need to fill out the main.tf file. Starting from the top, please copy the following snippets:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
    key                  = "TfPlayground-SA-dev.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_client_config" "current" {}

locals {
  appId = "${var.applicationName}-${var.environment}"
  tags = {
    ApplicationName = var.applicationName
    Environment     = var.environment
  }
}

module "keyvault_read" {
  source          = "./keyvault_read"
  kvName          = "k-v-${local.appId}"
  kvResourceGroup = "rg-${local.appId}"
  secrets         = [
    "service-user",
    "service-user-pass"
  ]
}
```


The important thing to notice here is that we use a different name for the state file. This project will be managed independently of our previous one, and we will be able to easily redeploy just parts of our app.

#### Mounting storage account

The second part of the main.tf file will have a bit more meat on the bone:

```hcl
data "azurerm_databricks_workspace" "this" {
  name                = "ws-${local.appId}"
  resource_group_name = "rg-${local.appId}"
}

provider "databricks" {
  azure_workspace_resource_id = data.azurerm_databricks_workspace.this.id
  azure_tenant_id             = data.azurerm_client_config.current.tenant_id
}

data "azurerm_storage_account" "this" {
  name                = "mystorageaccount2orEven3"
  resource_group_name = "some-rg"
}

data "azurerm_storage_container" "this" {
  name                 = "myfilesystem"
  storage_account_name = data.azurerm_storage_account.this.name
}

resource "databricks_azure_adls_gen2_mount" "this" {
  container_name       = data.azurerm_storage_container.this.name
  storage_account_name = data.azurerm_storage_account.this.name
  mount_name           = "terraformpoc"
  tenant_id            = data.azurerm_client_config.current.tenant_id
  # plus the service principal credentials the mount requires
  # (client_id, client_secret_scope, client_secret_key)
}

data "databricks_spark_version" "this" {
  long_term_support = true
}

data "databricks_node_type" "this" {
  local_disk = true
}

resource "databricks_cluster" "this" {
  cluster_name            = "cluster-${local.appId}"
  idempotency_token       = "cluster-${local.appId}"
  spark_version           = data.databricks_spark_version.this.id
  node_type_id            = data.databricks_node_type.this.id
  autotermination_minutes = 20

  // for single-node cluster
  spark_conf = {
    "spark.databricks.cluster.profile" : "singleNode"
    "spark.master" : "local[*]"
  }

  // for single-node cluster
  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}
```

This configuration is mostly self-explanatory; the only thing that might be new here is the idempotency_token - it guarantees that we won’t try to create a new cluster if one with this token already exists. It might be useful in case the process fails after the cluster has already been initialized. In general, databricks_cluster offers a lot of different configuration options that control cluster parameters. One of the more interesting ones is docker_image, which allows us to initialize the cluster with a custom Databricks image:

```hcl
docker_image {
  url = "myregistry.azurecr.io/myimage:latest"
  basic_auth {
    username = module.keyvault_read.secrets_value["some-registry-user"] # probably will have to be set up separately
    password = module.keyvault_read.secrets_value["some-registry-user-password"]
  }
}
```

Please create the outputs.tf file now:

```hcl
output "clusterId" {
  value = databricks_cluster.this.id
}
```

We can plan and apply again. After the execution finishes, look for the instance of the clusterId output that we just declared. Copy it somewhere safe - we will need it in a moment. Anyway, it would seem that the core of our infrastructure is already provisioned. This is the part that should not change too often in the daily development cycle.

#### Databricks notebooks

Ok, time for the last part. Let’s create a new Terraform project. It’s the last one, we promise! We will try to define a Databricks notebook job that can be instantiated from a notebook file (local, artifact, etc.). As usual, create an src directory and copy-paste our keyvault_read module (or skip it if you installed it locally).
Apart from that, we will need four more files in the src dir:

```hcl
# variables.tf
variable "applicationName" {
  type    = string
  default = "TfPlayground"
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "region" {
  type    = string
  default = "West Europe"
}

variable "notebookFilePath" {
  type    = string
  default = "./ExampleNotebook.py"
}

variable "notebookName" {
  type    = string
  default = "ExampleNotebook"
}

variable "clusterId" {
  type = string
}
```

In variables.tf we can see that the clusterId variable does not have a default value. When trying to apply this plan, Terraform will ask us to provide this particular parameter (remember the id we asked you to copy?). This is a bit problematic when working with Azure Release Pipelines, but it can be passed from the previous Terraform task with a little effort.

```hcl
# outputs.tf
output "job_url" {
  value = databricks_job.this.url
}
```

Above we have just the URL of the job that we are trying to create. You will be able to find it in the Databricks workspace anyway, but why not make life a bit easier?

```python
# ExampleNotebook.py
print("Hello World!")
```

Here we have the collective effort of the Data Science team. After six months of deep analysis, they have come up with a way to greet our planet. Exciting! So far, we are still waiting for a response.

```hcl
# main.tf
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.74.0"
    }
    databricks = {
      source  = "databrickslabs/databricks"
      version = "=0.3.1"
    }
  }

  backend "azurerm" {
    storage_account_name = "mystorageaccount"
    container_name       = "myfilesystem"
    key                  = "TfPlayground-DBN-dev.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_client_config" "current" {}

locals {
  appId = "${var.applicationName}-${var.environment}"
  tags = {
    ApplicationName = var.applicationName
    Environment     = var.environment
  }
}

module "keyvault_read" {
  source          = "./keyvault_read"
  kvName          = "k-v-${local.appId}"
  kvResourceGroup = "rg-${local.appId}"
  secrets         = [
    "service-user",
    "service-user-pass"
  ]
}

data "azurerm_databricks_workspace" "this" {
  name                = "ws-${local.appId}"
  resource_group_name = "rg-${local.appId}"
}

provider "databricks" {
  azure_workspace_resource_id = data.azurerm_databricks_workspace.this.id
  azure_client_id             = module.keyvault_read.secrets_value["service-user"]
  azure_client_secret         = module.keyvault_read.secrets_value["service-user-pass"]
  azure_tenant_id             = data.azurerm_client_config.current.tenant_id
}

data "databricks_current_user" "this" {}

resource "databricks_notebook" "this" {
  source = var.notebookFilePath
  path   = "${data.databricks_current_user.this.home}/${var.notebookName}"
}

resource "databricks_job" "this" {
  name                = "job-${local.appId}"
  existing_cluster_id = var.clusterId
  max_retries         = 0

  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
}
```


Our last blob of code is mostly boilerplate. It starts getting interesting at databricks_notebook. Since notebooks in Databricks are always owned by a user, we must assign ours to some specific entity.

The databricks_job is a tricky one. It seems that it’s impossible to start a one-off job that gets executed the moment we finish applying the plan. We can define a cron job that will run periodically, but that’s not always what we want.
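
For completeness, a periodic schedule can be attached to the job through a schedule block; the cron expression below is an arbitrary example:

```hcl
resource "databricks_job" "this" {
  name                = "job-${local.appId}"
  existing_cluster_id = var.clusterId
  max_retries         = 0

  # Quartz cron syntax: this example fires daily at 03:00 UTC
  schedule {
    quartz_cron_expression = "0 0 3 * * ?"
    timezone_id            = "UTC"
  }

  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
}
```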

If you finally apply this project and follow the job_url, you should be able to run the job. Well done!

#### Closing words

There are many things that we did not cover in this tutorial. For those of you interested in further developing your Terraform skills, we would recommend starting by visiting the full documentation of the modules and providers that we used. Maybe you can find interesting solutions to some of the problems that we could not?

We tried to be as objective as possible here and show both the pros and cons of working with Azure via Terraform. Overall, this was a pleasant experience, and although it’s a bit rough around the edges, the integration seems to be heading in the right direction. Whether you work with Azure or not, we would strongly recommend checking out Terraform!