vignettes/custom-pred-and-model-functions.Rmd
{sperrorest} is a generic framework which aims to work with all R models/packages. In statistical learning, model setups, their formulas and error measures all depend on the family of the response variable. Various families exist (numeric, binary, multiclass) which again include sub-families (e.g. gaussian or poisson distribution of a numeric response).
This detail needs to be specified via the respective function, e.g. when using glm() with a binary response, one needs to set family = "binomial" to make sure that the model does something meaningful. Most of the time, the same applies to the generic predict() function. For the glm() case, one would need to set type = "response" if the predicted values should reflect probabilities instead of log-odds.
These settings can be specified using model_args and pred_args in sperrorest(). So fine, “why do we need to write all these wrappers and custom model/predict functions then?!”
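The effect of these two settings can be checked with base R alone. The sketch below uses simulated data (the variable names and coefficients are made up for illustration) and compares the default predictions of a binomial glm(), which are log-odds, with those obtained using type = "response".

```r
set.seed(1)

# Simulated binary response; the data are purely illustrative.
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(0.5 + d$x))

# In sperrorest() this setting would go into model_args = list(family = "binomial").
fit <- glm(y ~ x, data = d, family = "binomial")

# Default predictions are on the link scale, i.e. log-odds.
p_link <- predict(fit, newdata = d)

# In sperrorest() this setting would go into pred_args = list(type = "response").
p_resp <- predict(fit, newdata = d, type = "response")

# Applying the inverse logit to the log-odds recovers the probabilities.
all.equal(unname(plogis(p_link)), unname(p_resp))
range(p_resp)
```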
model_fun expects at least a formula argument and a data.frame with the learning sample. All arguments, including the additional ones provided via model_args, are passed to model_fun via a do.call() call. However, if model_fun does not have an argument named formula but, e.g., one named fixed (as is the case for glmmPQL()), the do.call() call will fail because sperrorest() tries to pass an argument named formula while glmmPQL() expects an argument named fixed.
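This kind of failure is easy to reproduce in base R. The function fit_fixed() below is a hypothetical stand-in for a fitting function whose formula argument is named fixed; it is not part of any package and only serves to show what happens when do.call() supplies an argument named formula.

```r
# Hypothetical stand-in for a model function that names its formula
# argument `fixed` (as glmmPQL() does).
fit_fixed <- function(fixed, data) {
  lm(fixed, data = data)
}

# Arguments named the way a caller using `formula` would pass them.
args <- list(formula = mpg ~ wt, data = mtcars)

# The call fails because fit_fixed() has no argument named `formula`;
# the error message is captured instead of stopping the script.
res <- tryCatch(
  do.call(fit_fixed, args),
  error = function(e) conditionMessage(e)
)
print(res)
```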
In this case, we need to write a wrapper function for glmmPQL() (named glmmPQL_modelfun here) which accounts for this naming problem. Here, we are passing the formula argument to our custom model function, which then does the actual call to glmmPQL() using the supplied formula object as the fixed argument of glmmPQL(). glmmPQL() has further arguments like family or random. If we want to use these, we pass them to model_args, which then appends them to the arguments of glmmPQL_modelfun.
glmmPQL_modelfun <- function(formula = NULL, data = NULL, random = NULL,
                             family = NULL) {
  # glmmPQL() from package MASS names its formula argument `fixed`.
  fit <- MASS::glmmPQL(fixed = formula, data = data, random = random,
                       family = family)
  return(fit)
}
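To see how the pieces fit together, the following sketch mimics the way the arguments reach the wrapper through do.call(). It assumes that the packages MASS and nlme are installed and uses the bacteria data set shipped with MASS purely for illustration; the formula and the contents of model_args are assumptions made for this example, not settings prescribed by sperrorest.

```r
library(MASS)  # provides glmmPQL() and the bacteria data set

# Wrapper as described above, repeated so that this snippet runs on its own.
glmmPQL_modelfun <- function(formula = NULL, data = NULL, random = NULL,
                             family = NULL) {
  glmmPQL(fixed = formula, data = data, random = random, family = family)
}

# Illustrative extra arguments, as they could be supplied via model_args.
model_args <- list(random = ~ 1 | ID, family = "binomial")

# Mimic the call: formula and data first, then the additional arguments.
fit <- do.call(
  glmmPQL_modelfun,
  c(list(formula = y ~ trt + I(week > 2), data = bacteria), model_args)
)
class(fit)
```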
Unless specified explicitly, sperrorest() tries to use the generic predict() function. This function works differently depending on the class of the provided fitted model, i.e. many models slightly differ in the naming (and availability) of their arguments. For example, when fitting a Support Vector Machine (SVM) with a binary response variable, package kernlab expects an argument type = "probabilities" in its predict() call to return predicted probabilities, while in package e1071 it is probability = TRUE.
Similar to model_args, this can be accounted for in the pred_args of sperrorest().
However, sperrorest() expects that the predicted values (of any response type) are stored directly in the returned object of the predict() function. While this is the case for many models, mainly with a numeric response, classification cases often behave differently. Here, the predicted values (classes in this case) are often stored in a sub-object named class or predicted.
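One model that behaves this way is lda() from package MASS, which is used here only as an illustration and is not taken from the text above. The sketch assumes MASS is installed; lda_predfun() is a hypothetical name chosen for this example.

```r
library(MASS)  # provides lda()

fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)

# predict() returns a list here, not a vector of predicted classes.
names(pred)

# Hypothetical custom predict function that returns only the classes.
lda_predfun <- function(object = NULL, newdata = NULL, ...) {
  predict(object, newdata = newdata, ...)$class
}

pred_class <- lda_predfun(fit, newdata = iris)
head(pred_class)
```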
Since there is no general way to account for this (every package may return the predicted values in a different format/column), we need to provide a custom predict function which returns only the predicted values so that sperrorest() can continue properly. This time we are showing two examples. The first is again a binary classification, this time using randomForest.
When calling predict() on a fitted randomForest model with a binary response variable, the predicted values are actually stored in the resulting object returned by predict() (here called pred). So why do we have trouble here then?
Simply because pred (when requested with type = "prob") is a matrix containing the probabilities for both the FALSE (= 0) and TRUE (= 1) case. sperrorest() needs a vector containing only the predicted values of the TRUE case to pass these on to err_fun(), which then takes care of calculating all the error measures. So the important part is to subset the resulting matrix in the pred object to the TRUE cases only and return the result.
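rf_predfun <- function(object = NULL, newdata = NULL, type = NULL) {
  # With type = "prob" (e.g. supplied via pred_args), predict() returns a
  # matrix with one column of class probabilities per response level.
  pred <- predict(object = object, newdata = newdata, type = type)
  # Keep only the second column, i.e. the probabilities of the TRUE class.
  pred[, 2]
}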
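The subsetting step can be tried without fitting a model. The matrix below is a hand-made stand-in for what predict() returns for a binary randomForest classifier with type = "prob"; the values are invented for illustration. Selecting the column by name instead of by position is an optional variation and assumes that the response levels are "FALSE" and "TRUE".

```r
# Hand-made stand-in for a matrix of class probabilities (invented values).
pred <- matrix(
  c(0.9, 0.1,
    0.3, 0.7,
    0.6, 0.4),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("FALSE", "TRUE"))
)

# Selecting by position, as in rf_predfun() ...
by_position <- pred[, 2]

# ... or by name, which does not depend on the column order.
by_name <- pred[, "TRUE"]

identical(by_position, by_name)
```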
The second example is the same case (binary response) using svm() from the e1071 package. Here, the predicted probabilities are stored in an attribute of pred, which we can access using the attr() function. Then again, we only need the TRUE cases for sperrorest().
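A custom predict function for this case could look like the sketch below. It assumes that the model was fitted with probability = TRUE in the svm() call and that the response has the levels "FALSE" and "TRUE"; selecting the column by name is a choice made here. Because e1071 may not be installed, the extraction step is demonstrated on a hand-built object that only imitates the structure returned by predict(), with invented values.

```r
# Sketch of a custom predict function for an e1071 svm fit.
svm_predfun <- function(object = NULL, newdata = NULL, probability = TRUE) {
  pred <- predict(object, newdata = newdata, probability = probability)
  # The class probabilities are attached to the predictions as an attribute.
  attr(pred, "probabilities")[, "TRUE"]
}

# Hand-built imitation of the returned object: a factor of predicted
# classes carrying a "probabilities" matrix (invented values).
fake_pred <- factor(c("FALSE", "TRUE"), levels = c("FALSE", "TRUE"))
attr(fake_pred, "probabilities") <- matrix(
  c(0.8, 0.2,
    0.1, 0.9),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("FALSE", "TRUE"))
)

# The same extraction step that svm_predfun() performs.
p_true <- attr(fake_pred, "probabilities")[, "TRUE"]
p_true
```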