Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
Yesterday we started exploring the Azure services that are created when using Azure Databricks. One of the services that I would like to explore today is storage, and especially how to import and store data.
Log in to Azure Databricks and on the main (home) page select “Create Table” under the recommended common tasks. Don’t start your cluster yet (if it’s running, please terminate it for now).
This will offer you a variety of options for importing data to DBFS or connecting Azure Databricks with other services.
Drag the data file named Day6data.csv (available on GitHub in the data folder) to the upload square. For easier understanding, let’s check the CSV file schema. It is a simple one with three columns: 1. Date (datetime format), 2. Temperature (integer format), 3. City (string format).
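To make the three-column schema concrete, here is a minimal sketch in plain Python that parses rows of that shape into typed records. The two sample rows, the city name and the ISO date format are assumptions made up for illustration; they are not taken from Day6data.csv.

```python
import csv
import io
from datetime import datetime
from typing import List, NamedTuple


class Reading(NamedTuple):
    date: datetime      # column 1: Date
    temperature: int    # column 2: Temperature
    city: str           # column 3: City


# Hypothetical sample rows; the real values live in Day6data.csv.
SAMPLE_CSV = """Date,Temperature,City
2020-12-01 00:00:00,5,Ljubljana
2020-12-02 00:00:00,3,Ljubljana
"""


def parse_readings(text: str) -> List[Reading]:
    """Parse CSV text with a header row into typed Reading records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Reading(
            date=datetime.fromisoformat(row["Date"]),  # assumes ISO-style dates
            temperature=int(row["Temperature"]),
            city=row["City"],
        )
        for row in reader
    ]


readings = parse_readings(SAMPLE_CSV)
print(readings[0])
```

If the dates in the real file use a different format, `datetime.strptime` with an explicit format string would be the place to adjust.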
But before you start uploading the data, let’s check the Azure resource group. I have not yet started any Databricks cluster in my workspace. And here you can see that the VNet, storage account and network security group are always available for the Azure Databricks service. Only when you start the cluster will additional services (IP addresses, disks, VMs,…) appear.
This gives us a better idea of where and how data is persisted. Your data is always stored on blob storage. This means that even if you decide not only to terminate the cluster but to delete it as well, your data stays safely stored. When you add a new cluster to the same workspace, it will automatically retrieve the data from blob storage.
Drag and drop the CSV file into the “Drop zone” as discussed previously. It should look like this:
You now have two options:
- create table with UI
- create table in Notebook
Select “Create Table with UI”. Only now will you be asked to select a cluster:
Now select “Create Table in Notebook” and Databricks will create a first notebook for you, with Spark code for working with the data uploaded to DBFS.
To run this notebook, I need to have my cluster up and running. So let’s start a cluster. On the left vertical navigation bar, select the Clusters icon. You will get a list of all the clusters you are using. Select the one we created on Day 4.
If you want, check the resource group for your Azure Databricks to see all the running VMs, disks and VNets.
Now import the data again by dragging and dropping the CSV file into the “Drop Zone” (repeat the process) and hit “Create Table with UI”. Now you should have a cluster available. Select it and preview the table.
You can see that the table name is derived from the file name, and that the file type and column delimiter are selected automatically. Only “First row is header” needs to be ticked so that the columns are named properly and the data types are detected correctly.
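These UI choices correspond to Spark CSV reader options. As a rough sketch, reading the uploaded file in a notebook could look like the function below. It assumes the file landed under /FileStore/tables/ (the usual target of the upload UI, but check the path shown in your own workspace) and that it receives the `spark` session that Databricks notebooks provide. The function name `read_uploaded_csv` is made up for illustration.

```python
def read_uploaded_csv(spark, path="/FileStore/tables/Day6data.csv"):
    """Read an uploaded CSV into a Spark DataFrame.

    `spark` is the SparkSession (predefined in Databricks notebooks).
    `path` is an assumed upload location; adjust it to your workspace.
    """
    return (
        spark.read.format("csv")         # file type
        .option("header", "true")        # first row holds the column names
        .option("inferSchema", "true")   # detect column data types
        .option("sep", ",")              # column delimiter
        .load(path)
    )
```

In a notebook attached to a running cluster, `df = read_uploaded_csv(spark)` followed by `df.printSchema()` would show the detected column types.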
Now we can create a table. When Databricks finishes, a report is presented with a recap of the table location (yes, location!), the schema and a sample of the data.
This table is now available on my cluster. What does this mean? The table is persisted not only on the cluster but in your Azure Databricks workspace. It is important to understand how and where data is stored. Go to the Data icon on the left vertical navigation bar.
This database is attached to my cluster. If I terminate my cluster, will I lose my data? Try stopping the cluster and check the data again. And bam… The database is not available, since there is no cluster “attached” to it.
But hold your horses. The data is still available on blob storage, it is just not visible while no cluster is running. The database will be visible again when you start your cluster.
2. Storing data to DBFS
DBFS (Databricks File System) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters, through the UI or notebooks. In this way, DBFS is a decoupled data layer (an abstraction layer) on top of scalable Azure object storage and offers the following benefits:
- easy communication and interaction with object storage using bash / CLI command line
- data is always persistent
- mounting storage objects is easy and accessing them is seamless
- no additional credentials are needed, since you are “locked” in the Azure workspace.
Storage starts at the DBFS root, where several folders are created with the following locations:
- dbfs:/ – the root folder
- dbfs:/FileStore – folder that holds imported data files, generated plots, tables, and uploaded libraries
- dbfs:/databricks – folder for MLflow, init scripts, sample public datasets, etc.
- dbfs:/user/hive – data and metadata for Hive (SQL) tables
You will find many other folders that are generated through notebooks.
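On a cluster, the same locations are also reachable as ordinary local paths under the /dbfs mount point, which is handy for plain Python file access. The helper below is a minimal sketch of that mapping; the function names are hypothetical, and the example file path assumes the upload ended up in /FileStore/tables/.

```python
DBFS_SCHEME = "dbfs:"   # prefix used in DBFS URIs such as dbfs:/FileStore
FUSE_MOUNT = "/dbfs"    # local mount point of DBFS on cluster nodes


def to_local_path(dbfs_uri: str) -> str:
    """Map 'dbfs:/FileStore/x' to the local path '/dbfs/FileStore/x'."""
    if not dbfs_uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    return FUSE_MOUNT + dbfs_uri[len(DBFS_SCHEME):]


def to_dbfs_uri(local_path: str) -> str:
    """Map the local path '/dbfs/FileStore/x' back to 'dbfs:/FileStore/x'."""
    if local_path != FUSE_MOUNT and not local_path.startswith(FUSE_MOUNT + "/"):
        raise ValueError(f"not under {FUSE_MOUNT}: {local_path!r}")
    return DBFS_SCHEME + (local_path[len(FUSE_MOUNT):] or "/")


# Assumed location of the uploaded file; adjust to your workspace.
print(to_local_path("dbfs:/FileStore/tables/Day6data.csv"))
# /dbfs/FileStore/tables/Day6data.csv
```

The functions only rewrite strings, so they run anywhere; the resulting /dbfs path only exists on a running Databricks cluster.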
Before we begin, let’s make your life easier. Go to the Admin Console settings, select the Advanced tab and find “DBFS File Browser”. By default, this option is disabled, so let’s enable it.
This will let you view the data through the DBFS structure and gives you upload and search options.
Uploading files is now easier, and uploaded files are immediately visible in FileStore. The same file, named Day6Data_dbfs.csv, is in the GitHub data folder; you can upload it manually and it will appear in FileStore:
Tomorrow we will explore how we can use a notebook to access this file with different commands (CLI, Bash, Utils, Python, R, Spark). And since we will be using notebooks for the first time, we will do a little exploration of notebooks as well.
The complete set of code and notebooks will be available at the GitHub repository.
Happy Coding and Stay Healthy!