Import and Export Parquet Files with SQL Server

What is a parquet file? It's a file created with the open source Apache Parquet file format. Parquet files store data in a columnar fashion that is highly compressed, making storage and retrieval very efficient. One option for working with parquet files is Apache Arrow, a software development platform for in-memory analytics. Its libraries are available for numerous languages, including R and Python. There's no native T-SQL support for Apache Arrow, but SQL Server has in-database support for R and Python via Machine Learning Services. For this post, I'll demonstrate with code samples that use sp_execute_external_script and the arrow package for R.

Arrow Package

The first step will be to install the arrow package. I had a bear of a time with this, so I collected some notes and shared them in another post. I also blogged about some considerations for interacting with the file system. Take a look--if you have issues, maybe what worked for me will work for you.
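
If it helps as a starting point, here's a rough sketch of the install. The library path below is an assumption--adjust it for your instance name and SQL Server version--and the idea is to run install.packages() from an elevated R session launched from the instance's R_SERVICES folder, then confirm the package is visible from T-SQL:

# A sketch only: install arrow into the instance library (the path is an assumption)
install.packages("arrow",
	lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL14.MSSQLSERVER\\R_SERVICES\\library")

-- Confirm the package and its version are visible to Machine Learning Services
EXEC sp_execute_external_script  
	@language = N'R',
	@script = N'print(packageVersion("arrow"))';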

With the package installed, we can create a parquet file with the write_parquet() function. The following uses the built-in sample data frame iris to create the file. Nothing is returned:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
write_parquet(x = iris, sink = "C:\\R fileops\\iris.parquet")
';

Here's what we see in file explorer:

[Screenshot: iris.parquet in File Explorer]

We can also read the iris data back from the file with the read_parquet() function. (Note that read_parquet() returns a tibble. sp_execute_external_script handles this without complaint; I would imagine there is an implicit coercion to a data frame somewhere in the call stack.)

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
x<-read_parquet(file = "C:\\R fileops\\iris.parquet")
',
	@output_data_1_name = N'x';
[Screenshot: read_parquet() results]
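
If you'd rather not rely on that implicit coercion, a small variation (just a sketch) converts the tibble to a plain data frame explicitly before it's returned:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
# read_parquet() returns a tibble; coerce it to a plain data frame explicitly
x <- as.data.frame(read_parquet(file = "C:\\R fileops\\iris.parquet"))
',
	@output_data_1_name = N'x';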

At just 150 rows, this isn't terribly exciting. Let's make things more interesting with a larger data set, this time using data from a SQL Server database. Here we'll create a parquet file of data from the FactProductInventory table of the AdventureWorksDW database.

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
write_parquet(x = FactProdInv, sink = "C:\\R fileops\\FactProductInventory.parquet")
',
	@input_data_1 = N'SELECT * FROM AdventureWorksDW.dbo.FactProductInventory',
	@input_data_1_name = N'FactProdInv';
[Screenshot: FactProductInventory.parquet in File Explorer]

The parquet file created is about 3 MB and consists of 776,286 rows of data. How does that compare to a *.csv file? Let's find out. Here, the write.csv() function creates an uncompressed csv file consisting of the same data from the FactProductInventory table:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
write.csv(x = FactProdInv, file = "C:\\R fileops\\FactProductInventory.csv")
',
	@input_data_1 = N'SELECT * FROM AdventureWorksDW.dbo.FactProductInventory',
	@input_data_1_name = N'FactProdInv';
[Screenshot: FactProductInventory.csv in File Explorer]

The difference in size is striking: 35 MB for the csv file vs. 3 MB for the parquet file--close to a 12X compression ratio. Ratios will vary from one data set to another, of course, so your mileage may vary.
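
If you'd rather not check the sizes in File Explorer, here's a quick sketch that compares the two files from R, using the same paths as above. (Depending on your version of the arrow package, write_parquet() also accepts a compression argument if you want to experiment with other codecs.)

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
# file.size() returns bytes; convert to MB for readability
files <- c("C:\\R fileops\\FactProductInventory.parquet",
           "C:\\R fileops\\FactProductInventory.csv")
sizes <- data.frame(file = c("FactProductInventory.parquet", "FactProductInventory.csv"),
                    size_mb = round(file.size(files) / 1024^2, 1))
',
	@output_data_1_name = N'sizes';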

As before (with the iris.parquet file), we can read the larger file via read_parquet():

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
x<-read_parquet(file = "C:\\R fileops\\FactProductInventory.parquet")
',
	@output_data_1_name = N'x';
[Screenshot: read_parquet() results]

We can also insert the parquet file data to a #temp or permanent table. From there, you could proceed in a number of different directions:

CREATE TABLE #FactProductInventory(
	[ProductKey] [int] NOT NULL,
	[DateKey] [int] NOT NULL,
	[MovementDate] [date] NOT NULL,
	[UnitCost] [money] NOT NULL,
	[UnitsIn] [int] NOT NULL,
	[UnitsOut] [int] NOT NULL,
	[UnitsBalance] [int] NOT NULL
) WITH (DATA_COMPRESSION = ROW);
GO

INSERT INTO #FactProductInventory
EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
x<-read_parquet(file = "C:\\R fileops\\FactProductInventory.parquet")
',
	@output_data_1_name = N'x';


STDERR message(s) from external script: 

Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

Warning message:
package 'arrow' was built under R version 3.5.3 


(776286 rows affected)
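
And if you don't need a table at all, the same call can return a typed result set directly to the client via WITH RESULT SETS. This is just a sketch--the column names and types simply mirror the #temp table above:

EXEC sp_execute_external_script  
	@language = N'R',
	@script = N' 
library(arrow)
x<-read_parquet(file = "C:\\R fileops\\FactProductInventory.parquet")
',
	@output_data_1_name = N'x'
WITH RESULT SETS ((
	[ProductKey] [int] NOT NULL,
	[DateKey] [int] NOT NULL,
	[MovementDate] [date] NOT NULL,
	[UnitCost] [money] NOT NULL,
	[UnitsIn] [int] NOT NULL,
	[UnitsOut] [int] NOT NULL,
	[UnitsBalance] [int] NOT NULL
));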

Comments

Comment by Martin

I'm amazed that this works on your machine. I tried with SQL Server 2019. It comes with R 3.5.2, which cannot be updated, but the arrow package needs a newer version of R. How did you manage to get this running? There is another way: you can register a newer version of R with arrow installed as an external language in SQL Server. But then, running the stored procedure sp_execute_external_script just produces error HRESULT 0x80004004. Why is this all such a fuss?

Martin
Comment by Fritz

You can also do this with sling (https://slingdata.io). See below:

# set connection via env var
export mssql='sqlserver://...'

# test connection
sling conns test mssql

# run export for one table
sling run --src-conn mssql --src-stream 'my_schema.my_table' --tgt-object 'file://my_folder/my_table.parquet'

# run export for many tables
sling run --src-conn mssql --src-stream 'my_schema.*' --tgt-object 'file://{stream_schema}/{stream_table}.parquet'
