Multiple Output Datasets With R and SQL Server
- Posted in:
- Development
- R
- SQL Server 2016
- SQL Server 2017
- T-SQL
Something I've been fooling around with lately is using R to scrape web pages for HTML table data. As an example, there are two HTML tables on this post about unpivoting the output of a RESTORE HEADER command in T-SQL. Basically, there's a "wide" HTML table with dozens of columns, and a "narrow" HTML table with just two.
Here is some initial R code I came up with (and the output) to nab the HTML table data as data frames. Keep in mind I am still very new to R. What I've written may be "clumsy or random". Plus, it's been brought to my attention that the rvest package is superior to the methods I've used. But the web scraping isn't what inspired this post.
#install.packages('XML')
#install.packages('RCurl')
library(RCurl)
library(XML)
url <- 'https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)
tables
$HO
BackupName BackupDescription BackupType
1 SomeApp9_DatabaseTemplate-Full Database Backup NULL 1
ExpirationDate Compressed Position DeviceType UserName
1 NULL 0 1 2 SomeVendor\\SRHadden
ServerName DatabaseName DatabaseVersion
1 SomeSqlHost\\SQL2008R2 SomeApp9_DatabaseTemplate 661
DatabaseCreationDate BackupSize FirstLSN LastLSN
1 2014-01-02 09:13:15.000 31781888 10993000000463100036 10993000000464800001
CheckpointLSN DatabaseBackupLSN BackupStartDate
1 10993000000463100036 10993000000413800256 2015-09-14 13:18:44.000
BackupFinishDate SortOrder CodePage UnicodeLocaleId
1 2015-09-14 13:18:46.000 52 0 1033
UnicodeComparisonStyle CompatibilityLevel SoftwareVendorId
1 196609 90 4608
SoftwareVersionMajor SoftwareVersionMinor SoftwareVersionBuild MachineName
1 10 50 1617 SomeSqlHost
Flags BindingID
1 512 62542A27-94A4-4010-905A-32AEFA4016AB
RecoveryForkID Collation
1 FC58F82C-0D2C-4D7F-9225-3DF679BBF636 SQL_Latin1_General_CP1_CI_AS
FamilyGUID HasBulkLoggedData IsSnapshot IsReadOnly
1 25E2953C-BAAA-4A41-9E08-B0E4D82B7088 0 0 0
IsSingleUser HasBackupChecksums IsDamaged BeginsLogChain HasIncompleteMetaData
1 0 0 0 0 0
IsForceOffline IsCopyOnly FirstRecoveryForkID ForkPointLSN
1 0 0 FC58F82C-0D2C-4D7F-9225-3DF679BBF636 NULL
RecoveryModel DifferentialBaseLSN DifferentialBaseGUID BackupTypeDescription
1 SIMPLE NULL NULL Database
BackupSetGUID CompressedBackupSize Containment
1 AC2D9342-692E-4D05-B321-16274E6040C3 31781888 0
$HO
Field Value
1 BackupName SomeApp9_DatabaseTemplate-Full Database Backup
2 BackupDescription NULL
3 BackupType 1
4 ExpirationDate NULL
5 Compressed 0
6 Position 1
7 DeviceType 2
8 UserName SomeVendor\\SRHadden
9 ServerName SomeSqlHost\\SQL2008R2
10 DatabaseName SomeApp9_DatabaseTemplate
11 DatabaseVersion 661
12 DatabaseCreationDate Jan 2 2014 9:13AM
13 BackupSize 31781888
14 FirstLSN 10993000000463100036
15 LastLSN 10993000000464800001
16 CheckpointLSN 10993000000463100036
17 DatabaseBackupLSN 10993000000413800256
18 BackupStartDate Sep 14 2015 1:18PM
19 BackupFinishDate Sep 14 2015 1:18PM
20 SortOrder 52
21 CodePage 0
22 UnicodeLocaleId 1033
23 UnicodeComparisonStyle 196609
24 CompatibilityLevel 90
25 SoftwareVendorId 4608
26 SoftwareVersionMajor 10
27 SoftwareVersionMinor 50
28 SoftwareVersionBuild 1617
29 MachineName SomeSqlHost
30 Flags 512
31 BindingID 62542A27-94A4-4010-905A-32AEFA4016AB
32 RecoveryForkID FC58F82C-0D2C-4D7F-9225-3DF679BBF636
33 Collation SQL_Latin1_General_CP1_CI_AS
34 FamilyGUID 25E2953C-BAAA-4A41-9E08-B0E4D82B7088
35 HasBulkLoggedData 0
36 IsSnapshot 0
37 IsReadOnly 0
38 IsSingleUser 0
39 HasBackupChecksums 0
40 IsDamaged 0
41 BeginsLogChain 0
42 HasIncompleteMetaData 0
43 IsForceOffline 0
44 IsCopyOnly 0
45 FirstRecoveryForkID FC58F82C-0D2C-4D7F-9225-3DF679BBF636
46 ForkPointLSN NULL
47 RecoveryModel SIMPLE
48 DifferentialBaseLSN NULL
49 DifferentialBaseGUID NULL
50 BackupTypeDescription Database
51 BackupSetGUID AC2D9342-692E-4D05-B321-16274E6040C3
52 CompressedBackupSize 31781888
53 containment 0
$`NULL`
NULL
$`NULL`
CONNECT
Ok, so the rookie web scraping R code notwithstanding, I've got the beginnings of something here (I think). Now what? Getting the output data from R into a SQL Server table sounds like a logical next step. But I've got a couple of issues.
The first is that I was expecting two HTML tables, but there are actually four of them. HTML table #3 is null--it's probably the container for the "SHARE" buttons towards the bottom of the page, above the comments. And HTML table #4, while not null, consists of one "empty" row--it's the "CONNECT" area in the upper-right part of the page. Hey! While you're checking out those two sections, feel free to share this article or connect with me on Twitter or LinkedIn! 😊 By the way, you likely won't see those sections unless you're using a desktop browser. Mobile devices render the page differently.
The second issue is that sp_execute_external_script returns a single data frame as output. The R code returns a list (or vector?) that may contain multiple data frames. I could just return the first one. But that kinda sucks. How can I grab them all?
Let's start by only returning one data frame. I can change line #9 above to return just the first item in the list:
tables[[1]]
Now I can run that R code in SQL Server as follows:
--Return first html table.
EXEC sp_execute_external_script
@language = N'R',
@script = N'library(RCurl)
library(XML)
url <- ''https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS UNDEFINED;
Here is the partial output in SSMS:
I could also run sp_execute_external_script multiple times--once for each data frame. Here's an example with data frame #1 and #2:
--Return first html table.
EXEC sp_execute_external_script
@language = N'R',
@script = N'library(RCurl)
library(XML)
url <- ''https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[1]]'
WITH RESULT SETS UNDEFINED;
--Return second html table.
EXEC sp_execute_external_script
@language = N'R',
@script = N'library(RCurl)
library(XML)
url <- ''https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, stringsAsFactors = F);
OutputDataSet <- tables[[2]]'
WITH RESULT SETS UNDEFINED;
Again, here is the partial output in SSMS:
Of course for this strategy to work, I'd have to know ahead of time how many data frames/HTML tables there are. Hmmm. Can dynamic T-SQL help me here? If I could find out at run time how many data frames there are, and which ones I may or may not want, then why not? Here's some R code that reads HTML tables into a variable as a list of data frames(line 8), iterates through the list (starting at line 18), decides if the HTML table has any data in it (lines 21, 24), and adds the HTML table number (the element number in the list) to a different data frame (line 27). The output shows us we would want HTML tables 1, 2, and 4. (Yeah, I really didn't want #4. But that can be fixed by enhancing the R code to be more selective. Let's just go with it for now.)
#install.packages('XML')
#install.packages('RCurl')
library(RCurl)
library(XML)
url <- 'https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)
#tables
#Create an "empty" data frame, with a single "column" of numbers.
retDataFrameIndexes <- data.frame(rowIndex = numeric())
#Were any html tables found?
if (length(tables) > 0) {
#Iterate through list.
for (i in 1:length(tables)) {
#Make sure the data frame isn't null...
if (!is.null(tables[[i]])) {
#...and that it has rows.
if (nrow(tables[[i]]) > 0) {
#Add a "row" to our data frame.
retDataFrameIndexes <- rbind(retDataFrameIndexes, data.frame(rowIndex = i))
}
}
}
}
retDataFrameIndexes
> retDataFrameIndexes
rowIndex
1 1
2 2
3 4
We can take the script above, run it in SQL Server via sp_execute_external_script, pass the output back to SQL Server as the OutputDataSet, and capture the results into a temp table. Still with me? Good! Next, we'll create a cursor, iterate through the temp table of HTML table numbers, and execute sp_execute_external_script, using dynamic T-SQL to specify which data frame/HTML table number we want.
CREATE TABLE #HtmlTables (RowIndex INT);
INSERT INTO #HtmlTables
EXEC sp_execute_external_script
@language = N'R',
@script = N'#install.packages(''XML'')
#install.packages(''RCurl'')
library(RCurl)
library(XML)
url <- ''https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html''
tables = getURL(url)
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F)
#Create an "empty" data frame, with a single "column" of numbers.
OutputDataSet <- data.frame(rowIndex = numeric())
#Were any html tables found?
if(length(tables)>0)
{
#Iterate through list.
for(i in 1:length(tables))
{
#Make sure the data frame isn''t null...
if(!is.null(tables[[i]]))
{
#...and that it has rows.
if(nrow(tables[[i]]) > 0)
{
#Add a "row" to our data frame.
OutputDataSet <- rbind(OutputDataSet, data.frame(rowIndex = i))
}
}
}
}'
DECLARE @RowIdx INT;
DECLARE curRowIdx CURSOR FAST_FORWARD READ_ONLY FOR
SELECT RowIndex FROM #HtmlTables;
OPEN curRowIdx;
FETCH NEXT FROM curRowIdx INTO @RowIdx;
WHILE (@@FETCH_STATUS = 0)
BEGIN
DECLARE @RScript NVARCHAR(MAX) = N'library(RCurl)
library(XML)
url <- ''https://DaveMason.me/2017/02/Unpivot-Output-Of-RESTORE-HEADERONLY.html'';
tables = getURL(url);
tables <- readHTMLTable(tables, header = TRUE, as.data.frame = TRUE, stringsAsFactors = F);
OutputDataSet <- tables[[' + CAST(@RowIdx AS NVARCHAR) + ']]';
EXEC sp_execute_external_script
@language = N'R',
@script = @RScript
WITH RESULT SETS UNDEFINED;
FETCH NEXT FROM curRowIdx INTO @RowIdx;
END
CLOSE curRowIdx;
DEALLOCATE curRowIdx;
DROP TABLE #HtmlTables;
Here's the output from SSMS:
So there you have it. Multiple output datasets with R and SQL Server. All it took was R (natch), sp_execute_external_script running essentially the same script multiple times, some dynamic T-SQL, a temp table, and a cursor. After all of that, I think I want to throw up. This was a long post--thanks for staying with me to the end. As always, happy coding!
UPDATE
I thoughtlessly forgot to credit some friends on Slack. Thank you for your input and assistance!
Comments