| Title: | Utilities for Generating SQL Queries from Model Objects |
|---|---|
| Description: | Provides utilities for generating SQL queries (particularly CREATE TABLE statements) from R model objects. The most important use case is generating SQL to score a generalized linear model or related model represented as an R object, in which case the package handles parsing formula operators and including the model's response function. |
| Authors: | William Brannon [aut, cre] |
| Maintainer: | William Brannon <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.4 |
| Built: | 2024-11-14 04:14:29 UTC |
| Source: | https://github.com/wwbrannon/sqlscore |
Generate a CREATE TABLE statement to score the passed model on a preexisting database table. The statement will generate predictions entirely in the database, with no need to fetch data into R. Models need not be GLMs, but their prediction steps must consist of applying a response function to a linear predictor.
```r
create_statement(mod, dest_table, src_table, dest_schema = NULL,
                 dest_catalog = NULL, src_schema = NULL, src_catalog = NULL,
                 drop = FALSE, temporary = FALSE, pk = c("id"),
                 response = NULL, con = dbplyr::simulate_dbi())
```
| Argument | Description |
|---|---|
| mod | A supported model object. |
| dest_table | The unqualified DB name of the destination table. |
| src_table | The unqualified DB name of the source table. |
| dest_schema | The DB schema of the destination table. |
| dest_catalog | The DB catalog of the destination table. |
| src_schema | The DB schema of the source table. |
| src_catalog | The DB catalog of the source table. |
| drop | Whether to generate a DROP TABLE IF EXISTS before the CREATE TABLE. |
| temporary | Whether the destination table should be a temporary table. |
| pk | A vector of primary key column names. |
| response | The name of a custom response function to apply to the linear predictor. |
| con | A DBI connection to control the details of SQL generation; defaults to dbplyr::simulate_dbi() for the best guess at portable SQL. |
An open database connection can be passed as the con argument, or one of the dbplyr::simulate_*() functions can be used in applications that need to generate SQL without a live DB connection.
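For example (an illustrative sketch, not one of the package's own examples), a simulated dbplyr connection can be used to generate dialect-flavored SQL without connecting to a database:

```r
# Sketch: generate PostgreSQL-flavored SQL with no live connection,
# using dbplyr's built-in simulated connection objects.
mod <- glm(Sepal.Length ~ Sepal.Width + Species, data = datasets::iris)
create_statement(mod, src_table = "tbl_name", dest_table = "target_tbl",
                 con = dbplyr::simulate_postgres())
```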
A dbplyr SQL object representing the CREATE TABLE statement.
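The returned object is SQL text, so (as a usage sketch, reusing mod from the snippet above and assuming an open DBI connection named con) it can be executed directly:

```r
# Hypothetical usage: `con` is an open DBI connection, e.g. from DBI::dbConnect().
stmt <- create_statement(mod, src_table = "tbl_name", dest_table = "target_tbl",
                         con = con)
DBI::dbExecute(con, stmt)  # creates and populates target_tbl in the database
```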
Specific packages and models that are known to work include: glm and lm from package:stats, cv.glmnet from package:glmnet, glmboost from package:mboost, and bayesglm from package:arm.
Default S3 methods are for objects structured like those of class "glm", so models not listed here may work if they resemble those objects, but are not guaranteed to.
Note that if the model object transformed its training data before fitting (e.g., centering and scaling predictors), the generated SQL statement will not include those transformations. A future release may include that functionality, but centering and scaling in particular are difficult to do efficiently and portably in SQL.
```r
# Basic create statements
mod <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris)
create_statement(mod, src_table="tbl_name", dest_table="target_tbl")

create_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 src_catalog="catalog_name", dest_table="target_tbl")

create_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 src_catalog="catalog_name", dest_table="target_tbl",
                 dest_schema="target_schema", dest_catalog="target_catalog",
                 pk=c("lab", "specimen_id"))

# With a custom response function
create_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 dest_table="target_tbl", response="probit")

# With a model-derived non-identity response function
mod <- glm(Sepal.Length > 5.0 ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris, family=binomial("logit"))
create_statement(mod, src_table="tbl_name", dest_table="target_tbl")

# With formula operators
x <- matrix(rnorm(100*20), 100, 20)
colnames(x) <- sapply(1:20, function(x) paste0("X", as.character(x)))
x <- as.data.frame(x)
mod <- glm(X2 ~ X3 + X5 + X15*X8, data=x)
create_statement(mod, src_table="tbl_name", dest_table="target_tbl")
create_statement(mod, src_table="tbl_name", dest_table="target_tbl", response="cauchit")
```
Generate an unevaluated call corresponding to the predict step of the passed model. The call represents the linear predictor in terms of elementary functions on the underlying column names. Before translation into SQL, it should have a response function applied by score_expression (which may be a no-op in the case of the identity response).
```r
linpred(mod)
```
| Argument | Description |
|---|---|
| mod | A supported model object. |
An unevaluated R call object representing the linear predictor.
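Because the return value is an ordinary R call, it can be inspected like any other language object (an illustrative sketch):

```r
mod <- glm(Sepal.Length ~ Sepal.Width + Petal.Length, data = datasets::iris)
lp <- linpred(mod)
class(lp)    # a call object
deparse(lp)  # the linear predictor written out in terms of the column names
```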
The Binomial models in glmboost return coefficients which are 1/2 the coefficients fit by a call to glm(..., family=binomial(...)), because the response variable is internally recoded to -1 and +1. sqlscore multiplies the returned coefficients by 2 to put them back on the same scale as glm, and adds the glmboost offset to the intercept before multiplying.
```r
# A Gaussian GLM including factors
mod <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris)
linpred(mod)

# A binomial GLM - linear predictor is unaffected
mod <- glm(Sepal.Length > 5.0 ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris, family=binomial("logit"))
linpred(mod)

# With formula operators
x <- matrix(rnorm(100*20), 100, 20)
colnames(x) <- sapply(1:20, function(x) paste0("X", as.character(x)))
x <- as.data.frame(x)
mod <- glm(X2 ~ X3 + X5 + X15*X8, data=x)
linpred(mod)
```
Generate an unevaluated call corresponding to the predict step of the passed model. The call represents the response function of the linear predictor in terms of elementary functions on the underlying column names, and is suitable for direct translation into SQL.
```r
score_expression(mod, response = NULL)
```
| Argument | Description |
|---|---|
| mod | A supported model object. |
| response | The name of a custom response function to apply to the linear predictor. |
An unevaluated R call object representing the response function of the linear predictor.
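Since the result is a plain R call, it can also be translated to SQL by hand with dbplyr if desired (a sketch, assuming a recent dbplyr; this step is not needed when using create_statement or select_statement):

```r
mod <- glm(Sepal.Length > 5.0 ~ Sepal.Width + Petal.Length,
           data = datasets::iris, family = binomial("logit"))
expr <- score_expression(mod)

# Translate the saved call into (portable) SQL; !! injects the call object.
dbplyr::translate_sql(!!expr, con = dbplyr::simulate_dbi())
```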
The Binomial models in glmboost return coefficients which are 1/2 the coefficients fit by a call to glm(..., family=binomial(...)), because the response variable is internally recoded to -1 and +1. sqlscore multiplies the returned coefficients by 2 to put them back on the same scale as glm, and adds the glmboost offset to the intercept before multiplying.
```r
# A Gaussian GLM including factors
mod <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris)
score_expression(mod)

# A binomial GLM - linear predictor is unaffected
mod <- glm(Sepal.Length > 5.0 ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris, family=binomial("logit"))
score_expression(mod)

# With a hand-specified response function
score_expression(mod, response="probit")

# With formula operators
x <- matrix(rnorm(100*20), 100, 20)
colnames(x) <- sapply(1:20, function(x) paste0("X", as.character(x)))
x <- as.data.frame(x)
mod <- glm(X2 ~ X3 + X5 + X15*X8, data=x)
score_expression(mod)
```
Generate a SELECT statement to score the passed model on a preexisting database table. The statement will generate predictions entirely in the database, with no need to fetch data into R. Models need not be GLMs, but their prediction steps must consist of applying a response function to a linear predictor.
```r
select_statement(mod, src_table, src_schema = NULL, src_catalog = NULL,
                 pk = c("id"), response = NULL, con = dbplyr::simulate_dbi())
```
| Argument | Description |
|---|---|
| mod | A supported model object. |
| src_table | The unqualified DB name of the source table. |
| src_schema | The DB schema of the source table. |
| src_catalog | The DB catalog of the source table. |
| pk | A vector of primary key column names. |
| response | The name of a custom response function to apply to the linear predictor. |
| con | A DBI connection to control the details of SQL generation; defaults to dbplyr::simulate_dbi() for the best guess at portable SQL. |
An open database connection can be passed as the con argument, or one of the dbplyr::simulate_*() functions can be used in applications that need to generate SQL without a live DB connection.
A dbplyr SQL object representing the SELECT statement.
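As a usage sketch (given a fitted model mod and an open DBI connection con, both assumed here), the generated statement can be run directly to pull scores back into R:

```r
# Hypothetical: `con` is a live DBI connection to the database holding tbl_name.
stmt <- select_statement(mod, src_table = "tbl_name", con = con)
scores <- DBI::dbGetQuery(con, stmt)  # primary key columns plus a prediction per row
```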
Specific packages and models that are known to work include: glm and lm from package:stats, cv.glmnet from package:glmnet, glmboost from package:mboost, and bayesglm from package:arm.
Default S3 methods are for objects structured like those of class "glm", so models not listed here may work if they resemble those objects, but are not guaranteed to.
Note that if the model object transformed its training data before fitting (e.g., centering and scaling predictors), the generated SQL statement will not include those transformations. A future release may include that functionality, but centering and scaling in particular are difficult to do efficiently and portably in SQL.
```r
# Basic select statements
mod <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris)
select_statement(mod, src_table="tbl_name")

select_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 src_catalog="catalog_name")

select_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 src_catalog="catalog_name", pk=c("lab", "specimen_id"))

# With a custom response function
select_statement(mod, src_table="tbl_name", src_schema="schema_name",
                 response="probit")

# With a model-derived non-identity response function
mod <- glm(Sepal.Length > 5.0 ~ Sepal.Width + Petal.Length + Petal.Width + Species,
           data=datasets::iris, family=binomial("logit"))
select_statement(mod, src_table="tbl_name")

# With formula operators
x <- matrix(rnorm(100*20), 100, 20)
colnames(x) <- sapply(1:20, function(x) paste0("X", as.character(x)))
x <- as.data.frame(x)
mod <- glm(X2 ~ X3 + X5 + X15*X8, data=x)
select_statement(mod, src_table="tbl_name")
select_statement(mod, src_table="tbl_name", response="cauchit")
```
The sqlscore package provides utilities for generating SQL queries (particularly CREATE TABLE statements) from R model objects. The most important use case is generating SQL to score a GLM or related model represented as an R object, in which case the package handles parsing formula operators and including the model's response function. The models scored need not be generalized linear models, strictly speaking, but their prediction steps must consist of applying a response function to a linear predictor. The package handles escaping and formula operators, and provides a way to use a custom response function if desired.
The SQL-generating functions create_statement and select_statement do what their names suggest and generate CREATE TABLE and SELECT statements for model scoring. Helper functions include linpred(), which generates an R call object representing the linear predictor, and score_expression, an S3 generic that handles wrapping the linear predictor in the response function.
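A compact tour of these four functions on one model (a sketch based on the examples above):

```r
mod <- glm(Sepal.Length > 5.0 ~ Petal.Length + Species,
           data = datasets::iris, family = binomial("logit"))

linpred(mod)           # the linear predictor as an unevaluated R call
score_expression(mod)  # the same call wrapped in the model's response function

select_statement(mod, src_table = "tbl_name")         # scoring SELECT
create_statement(mod, src_table = "tbl_name",         # scoring CREATE TABLE
                 dest_table = "target_tbl")
```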
Specific packages and models that are known to work include: glm and lm from package:stats, cv.glmnet from package:glmnet, glmboost from package:mboost, and bayesglm from package:arm.
Default S3 methods are for objects structured like those of class "glm", so models not listed here may work if they resemble those objects, but are not guaranteed to.