Read and Write files to Azure Data Lake with SQL Server
Posted in: R, SQL Server, Azure
In the previous post, we looked at one way to read and write parquet files (and csv files) from/to disk with SQL Server via Machine Learning Services. For file I/O, there's no need to limit ourselves to the local file system or network shares. We can also interact with Azure storage. AzureR is a family of packages designed to be lightweight yet powerful: it works directly with the Azure REST API and has no dependencies on other languages. In this post, we'll look at some code samples that use the AzureStor package to work with files in Azure Data Lake.
The interface for accessing storage is similar across blobs, files and ADLSGen2. You call the storage_endpoint() function and provide the endpoint URI, along with your authentication credentials. AzureStor will figure out the type of storage from the URI. Each type of storage supports one or more of the following authentication options:
- Access key
- Shared access signature (SAS)
- Azure Active Directory (AAD) OAuth token
For connections to ADLSGen2 storage containers, our code samples will use an Access Key:
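As a quick reference before the full samples below, here's a minimal sketch of that connection step in R. The endpoint URL, access key placeholder, and container name are the same ones used in the samples that follow; the SAS and AAD variants in the comments are shown only for completeness and aren't used in this post.

    library(AzureStor)

    # Authenticate with an access key (placeholder values)
    dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")

    # Alternative authentication options, if the account is secured that way:
    # dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", sas = "my_sas_token")
    # dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", token = my_aad_token)

    # Get a reference to a container in the storage account
    cont <- storage_container(dataLakeEndpoint, "testing-davem")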
The storage account I'll connect to has a few containers, including a testing-davem container with a couple of files:
Here's a short code sample that reads a csv file with four lines of R code:
- Load the AzureStor package.
- Connect to the ADLSGen2 storage endpoint.
- Get a container reference.
- Read a file via storage_read_csv().
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(AzureStor)
        dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")
        cont <- storage_container(dataLakeEndpoint, "testing-davem")
        x <- storage_read_csv(container = cont, file = "my-files/iris.csv")
    ',
    @output_data_1_name = N'x';
The file contents are assigned to variable x (a data frame), and returned to SQL Server as a result set.
With the arrow package, you can read many other file formats, including parquet files. Here we download a parquet file from storage as a raw vector and pass it to arrow's read_parquet() function:
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(AzureStor)
        library(arrow)
        dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")
        cont <- storage_container(dataLakeEndpoint, "testing-davem")
        rawVector <- storage_download(cont, src = "my-files/mtcars.parquet", dest = NULL)
        x <- read_parquet(rawVector)
    ',
    @output_data_1_name = N'x';
Once again, the file contents are assigned to a variable (x, a data frame), and returned to SQL Server as a result set.
The AzureStor package has a storage_upload() function that allows us to write to Azure storage. Here are two examples that query AdventureWorks and export the data: the first to a csv file, and the second to a parquet file:
--#Write csv file to Data Lake
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        #Write to Data Lake
        library(AzureStor)
        dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")
        cont <- storage_container(dataLakeEndpoint, "testing-davem")

        newFile <- tempfile(fileext = ".csv")
        write.csv(advWorks, newFile)
        storage_upload(cont, src = newFile, dest = "my-files/advWorks.csv")

        if (file.exists(newFile)) {
            file.remove(newFile)
        }
    ',
    @input_data_1 = N'
        SELECT p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName, pp.PhoneNumber
        FROM AdventureWorks.Person.Person p
        JOIN AdventureWorks.Person.PersonPhone pp
            ON pp.BusinessEntityID = p.BusinessEntityID',
    @input_data_1_name = N'advWorks';
GO

--#Write parquet file to Data Lake
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        #Write to Data Lake
        library(AzureStor)
        library(arrow)
        dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")
        cont <- storage_container(dataLakeEndpoint, "testing-davem")

        newFile <- tempfile(fileext = ".parquet")
        write_parquet(advWorks, newFile)
        storage_upload(cont, src = newFile, dest = "my-files/advWorks.parquet")

        if (file.exists(newFile)) {
            file.remove(newFile)
        }
    ',
    @input_data_1 = N'
        SELECT p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName, pp.PhoneNumber
        FROM AdventureWorks.Person.Person p
        JOIN AdventureWorks.Person.PersonPhone pp
            ON pp.BusinessEntityID = p.BusinessEntityID',
    @input_data_1_name = N'advWorks';
GO
Back in the Azure Portal, a refresh of the browser reveals the two files we created.
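If you'd rather verify from R than the portal, AzureStor can also list the container contents. Here's a minimal sketch, assuming the same endpoint, access key placeholder, and container as the samples above; it could also be wrapped in sp_execute_external_script to return the listing to SQL Server as a result set.

    library(AzureStor)

    dataLakeEndpoint <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "access_key")
    cont <- storage_container(dataLakeEndpoint, "testing-davem")

    # Returns the files in the container;
    # the two uploaded files should appear under my-files/
    list_storage_files(cont)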
Comments
Very nice approach. Are there any benefits to using PolyBase?
Yuri Gorelik