
Read and Write files to Azure Data Lake with SQL Server


In the previous post, we looked at one way to read and write parquet files (and csv files) from/to disk with SQL Server via Machine Learning Services. For file I/O, there's no need to limit ourselves to the local file system or network shares. We can also interact with Azure storage. AzureR is a family of packages designed to be lightweight yet powerful by working directly with the Azure REST API, and without any dependencies on other languages. In this post, we'll look at some code samples that use the AzureStor package for working with files in Azure Data Lake.
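
One prerequisite before the examples: AzureStor (and arrow, used later) must be installed in the R library that Machine Learning Services uses, not a per-user library. A minimal sketch, assuming a default SQL Server 2019 instance (the path is an assumption; adjust it for your version and instance name), run from an elevated R session on the SQL Server host:

#Install into the instance library used by Machine Learning Services
#(path shown is the default for SQL Server 2019 -- adjust as needed)
instanceLib <- "C:/Program Files/Microsoft SQL Server/MSSQL15.MSSQLSERVER/R_SERVICES/library"
install.packages(c("AzureStor", "arrow"), lib = instanceLib)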

The interface for accessing storage is similar across blobs, files and ADLSGen2. You call the storage_endpoint() function and provide the endpoint URI, along with your authentication credentials. AzureStor will figure out the type of storage from the URI. Each type of storage supports one or more of the following authentication options:

  • Access key
  • Shared access signature (SAS)
  • Azure Active Directory (AAD) OAuth token
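
Each option maps to a different argument of storage_endpoint(). Here's a minimal sketch (the endpoint URI, SAS string, tenant, and app id are placeholders, not values from this post):

library(AzureStor)

#Authenticate with an access key
ep_key <- storage_endpoint("https://mystorage.dfs.core.windows.net", key = "access_key")

#Authenticate with a shared access signature (SAS)
ep_sas <- storage_endpoint("https://mystorage.dfs.core.windows.net", sas = "sas_string")

#Authenticate with an AAD OAuth token (obtained here via the AzureAuth package)
token <- AzureAuth::get_azure_token("https://storage.azure.com/", tenant = "mytenant", app = "app_client_id")
ep_token <- storage_endpoint("https://mystorage.dfs.core.windows.net", token = token)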

For connections to ADLSGen2 storage containers, our code samples will use an Access Key:

[Screenshot: Access keys blade for the storage account in the Azure portal]

The storage account I'll connect to has a few containers, including a testing-davem container with a couple of files:

[Screenshot: contents of the testing-davem container in the Azure portal]
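
You can enumerate the same information from R instead of the portal. A short sketch, using the same placeholder endpoint and access key as the examples that follow:

library(AzureStor)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")

#Containers in the storage account
list_storage_containers(dataLakeEndpoint)

#Files in the testing-davem container
cont <- storage_container(dataLakeEndpoint, "testing-davem")
list_storage_files(cont)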

Here's a short code sample that reads a csv file with four lines of R code:

  1. Load the AzureStor package.
  2. Connect to the ADLSGen2 storage endpoint.
  3. Get a container reference.
  4. Read a file via storage_read_csv().

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(AzureStor)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(dataLakeEndpoint, "testing-davem")
x<-storage_read_csv(container = cont, file = "my-files/iris.csv")
',
	@output_data_1_name = N'x';

The file contents are assigned to variable x (a data frame), and returned to SQL Server as a result set.

[Screenshot: iris data returned to SQL Server as a result set]
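
Because there's no WITH RESULT SETS clause, the columns come back unnamed. If you want named, typed columns, append one to the same call; the column list below is my own guess at reasonable definitions for iris, not something AzureStor requires:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N'...same script as above...',
	@output_data_1_name = N'x'
WITH RESULT SETS ((
	SepalLength FLOAT, SepalWidth FLOAT,
	PetalLength FLOAT, PetalWidth FLOAT,
	Species VARCHAR(20)
));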

With the arrow package, you can read many other file formats, including parquet files. Here we read in a parquet file as a raw vector, which is passed to arrow's read_parquet() function:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(AzureStor)
library(arrow)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(dataLakeEndpoint, "testing-davem")
rawVector <- storage_download(cont, src="my-files/mtcars.parquet", dest=NULL)
x<-read_parquet(rawVector)
',
	@output_data_1_name = N'x';

Once again, the file contents are assigned to a variable (x, a data frame), and returned to SQL Server as a result set.

[Screenshot: mtcars data returned to SQL Server as a result set]

The AzureStor package has a storage_upload() function that allows us to write to Azure storage. Here are two examples that query AdventureWorks and export the data: the first to a csv file, and the second to a parquet file:


--#Write csv file to Data Lake 
EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
#Write to Data Lake
library(AzureStor)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(dataLakeEndpoint, "testing-davem")

newFile <- tempfile(fileext = ".csv")
write.csv(advWorks, newFile)
storage_upload(cont, src = newFile, dest = "my-files/advWorks.csv")

if (file.exists(newFile)) {
	file.remove(newFile)
}
',
	@input_data_1 = N'
SELECT p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName, pp.PhoneNumber
FROM AdventureWorks.Person.Person p
JOIN AdventureWorks.Person.PersonPhone pp
	ON pp.BusinessEntityID = p.BusinessEntityID',
	@input_data_1_name = N'advWorks';
GO

--#Write parquet file to Data Lake
EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
#Write to Data Lake
library(AzureStor)
library(arrow)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(dataLakeEndpoint, "testing-davem")

newFile <- tempfile(fileext = ".parquet")
write_parquet(advWorks, newFile)
storage_upload(cont, src = newFile, dest = "my-files/advWorks.parquet")

if (file.exists(newFile)) {
	file.remove(newFile)
}
',
	@input_data_1 = N'
SELECT p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName, pp.PhoneNumber
FROM AdventureWorks.Person.Person p
JOIN AdventureWorks.Person.PersonPhone pp
	ON pp.BusinessEntityID = p.BusinessEntityID',
	@input_data_1_name = N'advWorks';
GO

Back in the Azure Portal, a refresh of the browser reveals the two files we created.

[Screenshot: advWorks.csv and advWorks.parquet files in the testing-davem container]
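
We could also verify the uploads from R without leaving Management Studio. A sketch that returns the file names in the container (again using the placeholder endpoint and access key):

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(AzureStor)
dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(dataLakeEndpoint, "testing-davem")

#info = "name" returns just the file names as a character vector
x <- data.frame(name = list_storage_files(cont, info = "name"))
',
	@output_data_1_name = N'x';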
