Best Practices for Provisioning Databricks Infrastructure on Azure via Terraform

By the end of this guide, your Terraform project will be able to bring up a Unity Catalog-enabled Azure Databricks workspace, with repos imported and clusters already created.

The Project


In this guide, you will learn to quickly and easily set up a basic infrastructure proof of concept.

My original goal was to design a blueprint for an ‘ideal’ Databricks infrastructure, one that is: 

  1. Supportive of the setups most of our current and future clients use.
  2. Usable to define the Hiflylabs standards and best practices for Databricks infrastructures. 
  3. Documented well enough that a junior colleague can set up a new infrastructure using my guidelines.

This infrastructure can be expanded upon and modified to fit your exact needs.

Need for Automation

The proposed architecture (see below) is complex enough to require automation for the setup and ongoing development of the Delta Lakehouse.

We implement this project as ‘Infrastructure as Code’ for the following reasons:

  1. Consistency of repeatable deployments across environments.
  2. Testability throughout infrastructure configurations.
  3. Rapid recovery capabilities.
  4. Self-documentation of the system's architecture and configuration.
  5. Version control and easier rollbacks (if needed).
  6. Collaboration through team code review.

The two main areas where automation comes into the picture are setting up cloud resources and the release cycle of the Delta Lakehouse.

Proposed Architecture

The first step was to define what an ideal basic infrastructure is.

For this phase, I leveraged an architecture we recently proposed to a client for their data platform design. Rather than synthesizing various possible architectures to find a common denominator, I determined that this recent design could effectively address our first requirement (Consistency).

Additionally, selecting this architecture provides us with a strategic advantage should the client proceed with implementation in the future. 😉

For the purposes of this blog post, I've adapted the architecture into a more generalized design to maintain client confidentiality:

[Architecture diagram: a generalized design of the proposed Databricks on Azure infrastructure]

Let’s zoom in on the main elements of this proposal:

Cloud Storage

Should all data related to an environment be stored in a single Storage Account? Not necessarily; let's think about it. It could occur to you to separate the elements of the medallion architecture for a given environment into different Storage Accounts. Separation is possible, but the benefit of such granular division is not apparent, so for this project I didn't deem it necessary. The design can be easily adapted if business requirements change.

More info about physical separation can be found in the Unity Catalog best practices document.

Unity Catalog

Unity Catalog is solely responsible for the catalog functionality and access control. 

The Delta Lakehouse layer should only know about (and use) the data via the three levels of the Unity Catalog namespace (catalog.schema.table). No physical locations should be used in the Delta Lakehouse layer.

Delta Lakehouse

In this design, the Delta Lakehouse runs on the Databricks platform and is built with dbt.

Hiflylabs has extensive experience utilizing dbt for the creation and management of lakehouses on Databricks platforms. The dbt component is included in the design to support lakehouse development with out-of-the-box functionality. However, using dbt is not a must: the lakehouse could be built entirely on Databricks itself, if that's a requirement.

Terraform

Terraform emerged as the optimal solution for automating cloud resource setup. Given Hiflylabs' prior experience with Terraform for Snowflake infrastructure provisioning, we were confident in its applicability to our current needs. Although the internal project team had limited Terraform experience, we embraced this as an opportunity for skill development.

CI/CD

Installation Guide

Requirements

Note: The following installation steps have worked on my macOS Sonoma (Version 14.4.1). For other operating systems, please refer to the official documentation linked below each software.

Terraform

I installed Terraform using Brew. I opened a terminal and ran the following commands:

$ brew update
$ brew tap hashicorp/tap
$ brew install hashicorp/tap/terraform

To verify the installation I ran:

$ terraform -help


See the official Terraform installation instructions for other operating systems and for troubleshooting.

Azure CLI

I also used Brew for installing the Azure CLI. I ran the following in the terminal:

$ brew install azure-cli

To verify the installation I ran:

$ az -h

Follow the official Azure CLI installation instructions if you get stuck.

Authentication

For Azure authentication I used the Azure CLI.

I ran the following in the terminal:

$ az login

Note: It is enough to authenticate through 'az login' once and then Terraform can use the OAuth 2.0 token for subsequent authentications.
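Once you are authenticated, the azurerm provider picks up the Azure CLI credentials by default when no explicit credentials are configured. A minimal sketch of such a provider configuration (the version constraint is illustrative; the real settings live in the project's required_providers and provider instance files):

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
  # No client secret or managed identity is configured here, so the
  # provider authenticates with the token obtained via 'az login'.
}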

Implementation

While Terraform code can be consolidated into a single file for basic functionality, we prioritized testing modular code structures to ensure scalability for production-grade implementations.

Note: The code is available on Github—use v_1.0 release for this post.

Structure of the code base

The code base is broken up into two main folders:
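Judging by the module names that appear in the Terraform output later in this post, the layout looks roughly like this (the exact folder and file names are illustrative):

modules/
  azure/                      # resource group and other base Azure resources
  dbx-workspace/              # Azure Databricks workspace
  unity-catalog-azure/        # storage account, container and access connector
  unity-catalog-metastore/    # metastore, workspace assignment, data access
  dbx-repos/                  # Git credential and repo import
  dbx-auto-scaling-clusters/  # shared autoscaling clusters
projects/
  <project_name>/             # calls the modules above with its own variables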

Modules

The structure of the modules is the following:

File name - Explanation
README.md - Contains an explanation of the module, what it does, how it works, etc.
<module_name>.tf - Contains the resource (*) definitions that the module creates.
variables.tf - Declares the variable values that the module uses.
required_providers.tf - Specifies the Terraform providers that the module uses (versions can be specified).
outputs.tf - Declares the values that the module returns.

Projects

The structure of the projects is the following:

File name - Explanation
README.md - Contains an explanation of the project, what it does, how it works, etc.
<project_name>.tf - Contains the module calls with dependencies that are needed to create the desired infrastructure.
variables.tf - Declares the variable values that the project uses.
terraform.tfvars.template - A template to help create 'terraform.tfvars'.
terraform.tfvars - Contains the value assignments to the variables declared in 'variables.tf'. Note: it doesn't exist in the repo, it needs to be created with your own details.
data.tf - Specifies the Terraform data (*) elements that the project uses.
required_providers.tf - Specifies the Terraform providers that the project uses (versions can be specified).
provider_instances - Declares and configures the provider instances that are used in the project.
outputs.tf - Declares the values that the project returns and prints to the screen.
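As a rough illustration of what '<project_name>.tf' contains, a module call wires one of the modules into the project (the module names match the Terraform output shown later, but the arguments here are assumed):

module "dbx-workspace" {
  source = "../../modules/dbx-workspace"

  # Values typically come from variables.tf / terraform.tfvars.
  resource_group_name = var.resource_group_name
  workspace_name      = var.workspace_name

  # Explicit dependency on the module that creates the resource group.
  depends_on = [module.azure]
}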

(*) In Terraform there are two principal elements when building scripts: resources and data sources. A resource is something that will be created and controlled by the script; a data source is something that Terraform expects to already exist.
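A minimal illustration of the difference (the resource group name matches the one in the Terraform output below; the location is assumed):

# Data source: information Terraform only reads, e.g. the identity
# of the Azure account used by the azurerm provider.
data "azurerm_client_config" "current" {}

# Resource: something Terraform creates and manages.
resource "azurerm_resource_group" "this" {
  name     = "dbx-terraform-bootstrap"
  location = "westeurope"
}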

Variables

The 'variables.tf' file contains the variable declarations together with the specification of the default values. The 'terraform.tfvars' file contains the value assignments to the declared variables. Note: entries in 'terraform.tfvars' overwrite the default values specified in 'variables.tf'.

All the non-sensitive variables contain default values. When there is no entry for a variable in 'terraform.tfvars', the default value is taken.

The sensitive variables are declared without default values, so they need to be set in 'terraform.tfvars' in order to run the code.
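For example, a sketch of how the two files play together (the variable names are illustrative, not the exact ones used in the repo):

# variables.tf - a non-sensitive variable with a default; the default is
# used when terraform.tfvars has no entry for it.
variable "location" {
  type    = string
  default = "westeurope"
}

# A sensitive variable declared without a default; it must be set in
# terraform.tfvars.
variable "git_personal_access_token" {
  type      = string
  sensitive = true
}

# terraform.tfvars - value assignments; these override the defaults.
location                  = "northeurope"
git_personal_access_token = "..."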

Running the code

Check out the repository

Create and configure 'terraform.tfvars'


Initialize Terraform
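Initialization downloads the providers and modules referenced by the project; this is the standard Terraform workflow, run from the project folder:

$ terraform init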

Run the script
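Reviewing the plan before applying is optional but recommended; these are the standard Terraform commands, run from the same project folder:

$ terraform plan
$ terraform apply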

Results

Terraform Output

Azure Resource Group

module.azure[0].azurerm_resource_group.this: Creation complete after 2s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap]

Azure Databricks Access Connector

module.unity-catalog-azure.azurerm_databricks_access_connector.acces_connector: Creation complete after 17s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Databricks/accessConnectors/access-connector]

Azure Storage Account for Unity Catalog

module.unity-catalog-azure.azurerm_storage_account.unity_catalog: Creation complete after 24s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Storage/storageAccounts/unitistoragedbxterraform]

Azure Storage Container for Unity Catalog

module.unity-catalog-azure.azurerm_storage_container.unity_catalog: Creation complete after 1s [id=https://unitistoragedbxterraform.blob.core.windows.net/unitymetastore]

Azure Blob Data Contributor Role Assignment

module.unity-catalog-azure.azurerm_role_assignment.unity_blob_data_contributor: Creation complete after 26s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Storage/storageAccounts/unitistoragedbxterraform/providers/Microsoft.Authorization/roleAssignments/35f9a12e-0dd4-0282-25e4-24d223-7c3e4f]

Databricks Workspace

module.dbx-workspace.azurerm_databricks_workspace.this: Creation complete after 2m25s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Databricks/workspaces/dbx-terraform-bootstrap]

Databricks Unity Catalog Workspace Assignment

module.unity-catalog-metastore.module.unity-catalog-workspace-assignment.databricks_metastore_assignment.prod: Creation complete after 0s [id=3257333208545686|9d430cc4-e7dd-4b97-b095-9eb24226ac99]

Unity Metastore Data Access

module.unity-catalog-metastore.databricks_metastore_data_access.access-connector-data-access: Creation complete after 2s [id=9d430cc4-e7dd-4b97-b095-9eb24226ac99|access-connector]

Databricks Repos

module.dbx-repos.databricks_git_credential.ado: Creation complete after 2s [id=764573043456860]
module.dbx-repos.databricks_repo.all["repo_1"]: Creation complete after 7s [id=106725931470002]

Databricks Cluster

module.dbx-auto-scaling-clusters[0].databricks_cluster.shared_autoscaling["cluster_1"]: Creation complete after 7m16s [id=0507-085727-ym3yf34v]
Terraform Codebase

After running 'terraform apply' the following files/folders were created by Terraform:

File/Folder name - Explanation
.terraform - A local cache where Terraform retains files it will need for subsequent operations against this configuration.
.terraform.lock.hcl - Lock file that makes sure the same infrastructure will be created if multiple users are working on the configuration; it records the particular provider versions that you have used.
terraform.tfstate - A JSON-formatted mapping between the resources defined in the configuration and those that exist in your infrastructure.
terraform.tfstate.backup - Backup of the state file above.
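If you want to inspect what ended up in the state, the standard state commands work here as well:

$ terraform state list
$ terraform show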

Azure Resources

The Azure resources listed in the Terraform output above were created: the resource group, the Databricks access connector, the storage account and container for Unity Catalog, the role assignment, and the Azure Databricks workspace.

Databricks Resources
Workspace with Repo

The specified GitHub repository was imported.

Cluster

The specified auto-scaling cluster was created.

Unity Catalog

A Unity Catalog metastore was created and assigned to the workspace.

Findings

General experiences

Conditional creation of cloud resources

I was looking for a way to decide if the Azure resource group (or any other resource) should be created or not. I only found a workaround to provide this functionality.

In Terraform we can use conditional statements in the following syntax:

<boolean expression> ? <return value for true> : <return value for false>

I solved the conditional creation of the Azure resource group this way:

count = var.create-azure-resource-group ? 1 : 0

(source)

When we set count on a Terraform resource, Terraform creates that many instances of the resource. In the case above, the conditional statement sets the value to either 0 or 1, based on the configuration variable; when count is 0, the resource is simply not created.
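Judging by the 'module.azure[0]' prefix in the Terraform output above, the count sits on the module call in the project file. A minimal sketch (the module arguments are assumed):

module "azure" {
  source = "../../modules/azure"
  count  = var.create-azure-resource-group ? 1 : 0

  resource_group_name = var.resource_group_name
  location            = var.location
}

# With count set, the module is addressed as a list (module.azure[0]),
# so downstream references need the index.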

Single vs. Multiple Creation of Resources

Running the Terraform code handled the creation of resources elegantly.

When something in the code - then Terraform
existed - left it untouched
changed - recreated it
was added - created it

Everything worked like this until I started creating Databricks clusters via Terraform.
For the clusters, however, the behavior changed: every 'terraform apply' run created a new Databricks cluster with the same name as before.

My solution to this issue was to set count for the resource.


(source) and (source)
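The cluster module is gated with count (it shows up as module.dbx-auto-scaling-clusters[0] in the output), and the resource address suggests the clusters themselves are keyed by name. A rough sketch of such a cluster definition, consistent with the output but not the exact repo code (attribute values and the 'clusters' variable are assumptions):

resource "databricks_cluster" "shared_autoscaling" {
  for_each = var.clusters  # e.g. { cluster_1 = { ... } }

  cluster_name            = each.key
  spark_version           = each.value.spark_version
  node_type_id            = each.value.node_type_id
  autotermination_minutes = 20

  autoscale {
    min_workers = each.value.min_workers
    max_workers = each.value.max_workers
  }
}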

Unable to destroy an existing Unity metastore

Even though I set force_destroy = true (source) for the Unity Metastore resource, I kept getting the following error message when running 'terraform destroy'. I could not figure out the solution to this problem in time, so I ended up deleting the metastore manually when I needed to.

│ Error: cannot delete metastore data access: Storage credential 'access-connector' cannot be deleted because it is configured as this metastore's root credential. Please update the metastore's root credential before attempting deletion.
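For reference, the setting in question lives on the metastore resource; a sketch with the storage root pieced together from the storage account and container created above (the name and other arguments are assumptions):

resource "databricks_metastore" "this" {
  name          = "primary"
  storage_root  = "abfss://unitymetastore@unitistoragedbxterraform.dfs.core.windows.net/"
  force_destroy = true
}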

Extended Infrastructure

There are a few more areas we need to consider when developing production code:

  1. Handling user and group permissions
  2. Setting up dbt
  3. Coming up with efficient ways of creating multiple environments - as of now, it is not clear if it makes more sense to:
    1. create separate projects for every environment,
    2. use separate .tfvars files and specify separate destinations for the Terraform files for every environment,
    3. or have a single monolithic project that creates everything and specifies the differences in configuration
  4. Handling deployment of Databricks code via Terraform

Future Development Roadmap

  1. Extend Proof of Concept:
    1. Implement CI/CD automation in subsequent iteration
    2. Scale to encompass all environments, Unity catalogs, and storage accounts (leveraging existing expertise)
  2. Expand Extended Infrastructure implementation
  3. Address Unity Metastore management:
    1. Develop Terraform-based destruction process for Unity Metastore
    2. Evaluate appropriateness of Terraform for Unity Metastore management

Conclusion

Terraform proves to be an effective tool for automating cloud resource creation (Infrastructure as Code) in Databricks environments.

While there's a learning curve to writing manageable Terraform code, its support for modular design enhances efficiency and maintainability.

For data engineers, proficiency in this area is valuable: we're expected to set up and manage non-production environments (e.g., development, testing).

The investment in automating infrastructure setup through code consistently yields benefits, reducing manual effort and potential errors.

All in all: This approach not only saves time but also improves consistency across environments, facilitating easier scaling and modifications as project requirements evolve.
 
