CSV Row Counts


This is one of those posts where if my opening premise is wrong, the whole thing is a moot point. I've been wrong before, so let's roll! There doesn't seem to be an efficient way to get a count of the number of records in a CSV file. Every record can be a different number of characters/bytes. So in my not-so-expert conclusion, you have to open the CSV file and read it into memory in its entirety to get the record count. This can be quite inefficient for large files.
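To make the premise concrete, here is a minimal Python sketch (not part of the original experiment) of why counting CSV records requires a full scan: with no index and no fixed record width, there is nothing to seek by, so every byte has to be read to find the line terminators.

```python
import io

# A tiny in-memory "CSV": records of different byte lengths.
csv_bytes = b"id,name\n1,Ann\n2,Bartholomew\n"

# No fixed record width means no way to jump to record N; the only
# general-purpose count is to scan the whole stream for newlines.
count = sum(1 for _ in io.BytesIO(csv_bytes))
print(count)  # 3 (header + 2 data rows)
```

The same scan happens whether the file is 1 KB or 2 GB, which is the cost being measured in this post.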

I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
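The workflow described above (count the rows, multiply by 0.05, write out that many rows) can be sketched in Python as two passes over the file. This is a hedged illustration, not the R code I actually used; the function name and file paths are made up.

```python
import csv

def write_sample(src_path, dst_path, fraction=0.05):
    """Copy the header plus the first `fraction` of data rows to dst_path."""
    # Pass 1: count data rows (excluding the header). This is the
    # expensive full scan the post is about.
    with open(src_path, newline="") as f:
        total = sum(1 for _ in f) - 1
    keep = int(total * fraction)

    # Pass 2: copy the header and the first `keep` rows.
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader))  # header
        for i, row in enumerate(reader):
            if i >= keep:
                break
            writer.writerow(row)
    return keep
```

Note the row count costs a complete read of the source file before a single sample row is written, which is why a faster counting method was worth chasing.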

Now, not at all to my surprise, it was slow. In my short experience, I've found R isn't particularly snappy--even when the data fits comfortably in memory. I was pretty sure I could beat R's record count performance handily with C#. And I did. I found some related questions on StackOverflow, where a small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).

Sometime later I remembered OPENROWSET for SQL Server. And I wondered...how fast would SQL Server count records in a CSV file? Pretty fast, it turns out. I didn't spend a ton of time on the R, C#, or T-SQL code. It's below if you want to take a look. In short, R was the slowest, C# was pretty fast, and T-SQL with OPENROWSET was the fastest.

R        122.60 seconds
C#       1st Version: 12.7982 seconds
         2nd Version: 12.2451 seconds
T-SQL    6.996 seconds

R


> begin <- proc.time()
> setwd("C:/Data/")
> filepath <- file("yellow_tripdata_2016-01.csv",open="r")
> maxLinesToRead <- 20000
> counter <- 0
> ( while((linesread <- length(readLines(con = filepath, n = maxLinesToRead))) > 0 )
+ counter <- counter+linesread )
NULL
Warning message:
closing unused connection 3 (yellow_tripdata_2016-01.csv)
> close(filepath)
> counter
[1] 10906859
> proc.time() - begin
user system elapsed
116.90 3.08 122.60
> paste("Record Count", counter, sep = ": ")
[1] "Record Count: 10906859"


C#


//Version 1
int counter = 0;
DateTime start = DateTime.Now;

using (System.IO.StreamReader file =
    new System.IO.StreamReader(@"C:\Data\yellow_tripdata_2016-01.csv"))
{
    while (file.ReadLine() != null)
    {
        counter++;
    }
    //No explicit Close() needed--the using block disposes the reader.
}

DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));


//Version 2 (Count() requires "using System.Linq;")
int counter = 0;
DateTime start = DateTime.Now;
counter = System.IO.File.ReadLines(
    @"C:\Data\yellow_tripdata_2016-01.csv").Count();
DateTime end = DateTime.Now;
TimeSpan elapsed = end - start;
Console.WriteLine("Record count: " + counter.ToString());
Console.WriteLine(elapsed.ToString("c"));


T-SQL


DECLARE @Start DATETIME2(7) = CURRENT_TIMESTAMP;

SELECT COUNT(*)
FROM OPENROWSET (
    BULK 'C:\Data\yellow_tripdata_2016-01.csv',
    FORMATFILE = 'C:\Data\VarCharMax.xml'
) AS d;

SELECT DATEDIFF_BIG(MILLISECOND, @Start, CURRENT_TIMESTAMP) / 1000.0;


OPENROWSET Format File


<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="VarCharMax" xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
