
Scraping Web Pages - Part 2


In the last post, we looked at a way to scrape HTML table data from web pages and save the data to a table in SQL Server. One drawback is the need to know the schema of the data that gets scraped--you need a SQL Server table to store the data, after all. Another shortcoming: if a page has multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we'll revisit web scraping with Machine Learning Services and R. This time, we'll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.


R Code

This code uses the RCurl and XML packages (as was the case in the last post), plus the jsonlite package. There's a bit more code this time. Hopefully the code comments are helpful:


library(RCurl)
library(XML)
library(jsonlite)

#type.convert - sets appropriate class for each data frame variable/column.
TypeConvertDataFrameList <- function(lsDataFrame) {
    lapply(lsDataFrame, type.convert, as.is = TRUE)
}

url <- "https://DaveMason.me/p/boston-celtics-2016-2017.html"
html <- RCurl::getURL(url)

dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                      as.data.frame = TRUE,
                                      stringsAsFactors = FALSE)
#remove NULL elements from list.
dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

#convert data frame variables to appropriate classes.
dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

#Empty vector for JSON string data.
jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

if (length(dfHtmlTableData) > 0) {
    #Convert each HTML table to JSON.
    for (i in 1:length(dfHtmlTableData)) {
        jsonStringsFactor[i] <- jsonlite::toJSON(
            as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
    }
}

dfTables$HtmlTableAsJson <- jsonStringsFactor
dfTables

The script begins by defining a helper function, TypeConvertDataFrameList, which is invoked farther down. Next, the URL string for the web page we want to scrape is specified, and RCurl::getURL gathers the entirety of the page's HTML. XML::readHTMLTable then creates a list of data frames, one per HTML table found on the page. NULL elements (if any) are removed from the list. Then lapply iterates over the list of HTML table data, invoking TypeConvertDataFrameList on each data frame. That function iterates over the variables of a data frame (much like columns of a database table) and attempts to cast each one from the character (string) class to a different class, if appropriate. Without this, all of our scraped data would be returned as strings.
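
To see what type.convert does, consider this minimal sketch (the two-variable data frame here is made up for illustration):

df <- data.frame(Jersey.Number = c("0", "7"),
                 Player = c("Avery Bradley", "Jaylen Brown"),
                 stringsAsFactors = FALSE)
sapply(df, class)  #both variables are "character"

#Cast each variable to an appropriate class.
df[] <- lapply(df, type.convert, as.is = TRUE)
sapply(df, class)  #Jersey.Number is now "integer"; Player remains "character"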

Next, the code begins creating the single data frame that will be returned to SQL Server. A "url" variable (column) is created with one observation (row) for each HTML table found; each value is the URL of the web page that was scraped. A jsonStringsFactor character vector is also created, with one (initially empty) element for each HTML table found. The for loop then iterates over each data frame element in dfHtmlTableData, converting its data to JSON and assigning the result to the corresponding element in jsonStringsFactor. Finally, an "HtmlTableAsJson" variable (column) is added to the data frame, and jsonStringsFactor is assigned to it. We now have a data frame with two variables (columns) and three observations (rows). Here is the output of the code (truncated for display purposes):


url HtmlTableAsJson
--------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------
https://...celtics-2016-2017.html [{"Jersey.Number":0,"Player":"Avery Bradley","Position":"SG","Height":"6-3","Weight":180,"Birthdate":"11/26/1990","Years.Experience":"6","College":"Texas"},...]
https://...celtics-2016-2017.html [{"Game":1,"Date":"Wed, Oct 26, 2016","Time":"7:30p","At":"","Opponent":"Brooklyn Nets","Result":"W","OT":"","Points.For":122,"Points.Against":117,"Wins.Sea...]
https://...celtics-2016-2017.html [{}]
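
Because each grid comes back as a JSON string, it can be reconstituted as a data frame without knowing its schema in advance. Here is a quick sketch, assuming the first list element holds the roster grid (the column names are taken from the output above):

#Round trip: parse the roster JSON back into a data frame.
roster <- jsonlite::fromJSON(dfTables$HtmlTableAsJson[1])
head(roster[, c("Player", "Position", "Height")])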


SQL Server

With the necessary R code in hand, we can run it in-database with Machine Learning Services. I've wrapped the code within a stored procedure:


CREATE OR ALTER PROCEDURE dbo.ScrapeHtmlTables
    @Url NVARCHAR(MAX)
AS
BEGIN
    DECLARE @RScript NVARCHAR(MAX);
    SET @RScript = N'library(RCurl)
    library(XML)
    library(jsonlite)

    #type.convert - sets appropriate class for each data frame variable/column.
    TypeConvertDataFrameList <- function(lsDataFrame){
        lapply(lsDataFrame, type.convert, as.is = TRUE)
    }

    url <- "' + @Url + '"
    html <- RCurl::getURL(url)

    dfHtmlTableData <- XML::readHTMLTable(html, header = TRUE,
                                          as.data.frame = TRUE,
                                          stringsAsFactors = FALSE)
    #remove NULL elements from list.
    dfHtmlTableData <- dfHtmlTableData[!sapply(dfHtmlTableData, is.null)]

    #convert data frame variables to appropriate classes.
    dfHtmlTableData <- lapply(dfHtmlTableData, TypeConvertDataFrameList)

    dfTables <- data.frame(url = rep(c(url), each = length(dfHtmlTableData)))

    #Empty vector for JSON string data.
    jsonStringsFactor <- rep(c(""), each = length(dfHtmlTableData))

    if (length(dfHtmlTableData) > 0) {
        #Convert each HTML table to JSON.
        for (i in 1:length(dfHtmlTableData)) {
            jsonStringsFactor[i] <- jsonlite::toJSON(
                as.data.frame(dfHtmlTableData[[i]]), auto_unbox = TRUE)
        }
    }

    dfTables$HtmlTableAsJson <- jsonStringsFactor
    OutputDataSet <- dfTables

    OutputDataSet';

    EXEC sp_execute_external_script
        @language = N'R',
        @script = @RScript
    WITH RESULT SETS (
        (
            [Url] VARCHAR(2048),
            HtmlTableAsJson VARCHAR(MAX)
        )
    );
END
GO

Now we can retrieve a web page's HTML table data by invoking a stored procedure and passing in a URL string. We don't need to know how many HTML tables there are, and we don't need to know their schemas:


CREATE TABLE #HtmlTableData (
ID INT IDENTITY,
[Url] VARCHAR(2048),
HtmlTableAsJson VARCHAR(MAX)
)

INSERT INTO #HtmlTableData
EXEC dbo.ScrapeHtmlTables @Url = N'https://DaveMason.me/p/boston-celtics-2016-2017.html';

SELECT *
FROM #HtmlTableData;

[Screenshot: result set returned by dbo.ScrapeHtmlTables via sp_execute_external_script]
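
As a quick sanity check, ISJSON (available in SQL Server 2016 and later) should return 1 for every row if the scrape produced well-formed JSON:

SELECT d.ID, d.[Url], ISJSON(d.HtmlTableAsJson) AS IsValidJson
FROM #HtmlTableData d;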


JSON

From here on out, parsing the returned data is a JSON exercise. We would, of course, need to know the JSON schema to query it in a tabular or meaningful way. Or would we?
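
With a known schema, OPENJSON and an explicit WITH clause shred the JSON into rows and columns. Here is a sketch for the roster grid only, with the column list inferred from the output shown earlier:

SELECT j.Player, j.Position, j.Height
FROM #HtmlTableData d
CROSS APPLY OPENJSON(d.HtmlTableAsJson)
WITH (
    Player VARCHAR(128) '$.Player',
    Position VARCHAR(8) '$.Position',
    Height VARCHAR(8) '$.Height'
) AS j
WHERE d.ID = 1;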

In another post, I described an automated way to infer JSON schema using dynamic T-SQL. Perhaps that article will pique your interest. Here's a look at what I had in mind:


DECLARE @Json VARCHAR(MAX);
SELECT @Json = d.HtmlTableAsJson
FROM #HtmlTableData d
WHERE d.ID = 2;

EXEC dbo.OpenJsonTabular @Json;
[Screenshot: tabular result set returned by dbo.OpenJsonTabular]
