`vignettes/custom-pred-and-model-functions.Rmd`

{sperrorest} is a generic framework that aims to work with all R models and packages. In statistical learning, the model setup, the formula and the error measures all depend on the family of the response variable. Several families exist (numeric, binary, multiclass), and these in turn include sub-families (e.g. a Gaussian or Poisson distribution of a numeric response).

This detail needs to be specified in the respective model function. For example, when using `glm()` with a binary response, one needs to set `family = "binomial"` so that the model does something meaningful. Most of the time, the same applies to the generic `predict()` function. For `glm()`, one needs to set `type = "response"` if the predicted values should be probabilities instead of log-odds.
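The difference between the two prediction scales can be seen in a small base R sketch. The data are simulated and the variable names (`d`, `x`, `y`) are illustrative assumptions, not part of {sperrorest}.

```r
# Simulated binary response; all names here are illustrative only
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * d$x))

# family = "binomial" fits a logistic regression instead of a Gaussian model
fit <- glm(y ~ x, data = d, family = "binomial")

# Default prediction scale for glm() is the linear predictor (log-odds)
log_odds <- predict(fit, newdata = d)

# type = "response" returns probabilities between 0 and 1
probs <- predict(fit, newdata = d, type = "response")

# The two scales are linked by the inverse logit
all.equal(unname(probs), unname(plogis(log_odds)))
```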

These settings can be specified using `model_args` and `pred_args` in `sperrorest()`. So fine, why do we need to write all these wrappers and custom model/predict functions then?

`model_fun` expects at least a formula argument and a data.frame with the learning sample. All arguments, including the additional ones provided via `model_args`, are passed to `model_fun` via `do.call()`. However, if `model_fun` does not have an argument named `formula` but, e.g., one named `fixed` (as is the case for `glmmPQL()`), the `do.call()` call will fail because `sperrorest()` passes an argument named `formula` while `glmmPQL()` expects an argument named `fixed`.
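The failure mode can be reproduced in base R without any modelling package. In the sketch below, `fit_with_fixed()` is a hypothetical stand-in for a fitting function whose formula argument is called `fixed`, and the argument list only approximates how a framework might assemble the call.

```r
# Hypothetical stand-in for a fitter whose formula argument is named `fixed`
fit_with_fixed <- function(fixed, data) {
  lm(fixed, data = data)
}

# Formula and data come first, followed by any extra model arguments
# (none in this example)
args <- c(list(formula = mpg ~ wt, data = mtcars), list())

# do.call() matches arguments by name, so `formula` has nowhere to go
res <- tryCatch(
  do.call(fit_with_fixed, args),
  error = function(e) conditionMessage(e)
)
res
```

Because `res` holds an error message rather than a fitted model, the name mismatch has to be resolved before the call is made.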

In this case, we need to write a wrapper function for `glmmPQL()` (named `glmmPQL_modelfun` here) that accounts for this naming problem. We pass the `formula` argument to our custom model function, which then makes the actual call to `glmmPQL()` using the supplied `formula` object as the `fixed` argument. `glmmPQL()` has further arguments such as `family` or `random`. If we want to use these, we pass them via `model_args`, which appends them to the arguments of `glmmPQL_modelfun`.

```
# Wrapper that maps the `formula` argument onto the `fixed` argument of glmmPQL()
glmmPQL_modelfun <- function(formula = NULL, data = NULL, random = NULL,
                             family = NULL) {
  fit <- MASS::glmmPQL(fixed = formula, data = data, random = random,
                       family = family)
  return(fit)
}
```
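To see the wrapper in action outside of `sperrorest()`, the following sketch calls it through `do.call()` with a combined argument list, which mimics the mechanism described above. The `bacteria` data set and the model formula come from the {MASS} documentation; the way `args` is assembled here is an assumption for illustration and not a copy of the {sperrorest} internals.

```r
library(MASS)  # provides glmmPQL() and the `bacteria` example data

# Same wrapper idea as above, repeated so that this example runs on its own
glmmPQL_modelfun <- function(formula = NULL, data = NULL, random = NULL,
                             family = NULL) {
  glmmPQL(fixed = formula, data = data, random = random, family = family)
}

# Extra arguments as they would be supplied via `model_args`
model_args <- list(random = ~ 1 | ID, family = "binomial")

# Formula and data first, then the additional model arguments
args <- c(list(formula = y ~ trt + I(week > 2), data = bacteria), model_args)

fit <- do.call(glmmPQL_modelfun, args)
class(fit)
```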

Unless specified explicitly, `sperrorest()` tries to use the generic `predict()` function. This function works differently depending on the class of the fitted model, i.e. many models differ slightly in the naming (and availability) of their arguments. For example, when fitting a support vector machine (SVM) with a binary response variable, package `kernlab` expects the argument `type = "probabilities"` in its `predict()` call to return predicted probabilities, while package `e1071` expects `probability = TRUE`. As with `model_args`, this can be accounted for via the `pred_args` argument of `sperrorest()`.

However, `sperrorest()` expects the predicted values (of any response type) to be stored directly in the object returned by `predict()`. While this is the case for many models, mainly those with a numeric response, classification models often behave differently. Here, the predicted values (classes in this case) are often stored in a sub-object named `class` or `predicted`.
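A concrete case of such a sub-object is `lda()` from {MASS}, whose `predict()` method returns a list rather than a plain vector. The sketch below uses the built-in `iris` data purely for illustration; `lda_predfun()` is a hypothetical helper name and not part of {sperrorest}.

```r
library(MASS)  # lda() serves only as an example of a list-returning predict()

fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)

# predict() returns a list; the predicted classes sit in pred$class
names(pred)

# Hypothetical helper that returns only the predicted classes
lda_predfun <- function(object = NULL, newdata = NULL) {
  predict(object, newdata = newdata)$class
}

classes <- lda_predfun(fit, newdata = iris)
head(classes)
```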

Since there is no general way to account for this (every package may return the predicted values in a different format or column), we need to provide a custom predict function that returns only the predicted values so that `sperrorest()` can continue properly. This time we show two examples. The first is again a binary classification, this time using `randomForest`.

When calling `predict()` on a fitted `randomForest` model with a binary response variable, the predicted values are actually stored directly in the object returned by `predict()` (here called `pred`). So why do we have trouble here?

Simply because `pred` (when class probabilities are requested via `type = "prob"`) is a matrix containing the probabilities for both the `FALSE` (= 0) and the `TRUE` (= 1) case. `sperrorest()` needs a vector containing only the predicted values of the `TRUE` case, which it passes on to `err_fun()` to calculate all the error measures. So the important part is to subset the matrix in `pred` to the `TRUE` column and return the result.

```
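rf_predfun <- function(object = NULL, newdata = NULL, type = "prob") {
  # type = "prob" yields a matrix with one probability column per class
  pred <- predict(object = object, newdata = newdata, type = type)
  # Keep only the second column (the `TRUE` class) and return it visibly
  pred[, 2]
}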
```
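Selecting the column by position assumes that `TRUE` is the second class. A slightly more defensive variant selects by column name, sketched here on a hand-made stand-in for the probability matrix so that it runs without {randomForest}. The helper name `positive_prob()` and its `positive` argument are hypothetical.

```r
# Stand-in for the matrix returned by predict(..., type = "prob")
pred <- matrix(c(0.8, 0.2,
                 0.3, 0.7,
                 0.1, 0.9),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("FALSE", "TRUE")))

# Hypothetical helper: pick the probability column of the positive class by name
positive_prob <- function(pred, positive = "TRUE") {
  stopifnot(is.matrix(pred), positive %in% colnames(pred))
  unname(pred[, positive])
}

positive_prob(pred)
# returns 0.2 0.7 0.9
```

Selecting by name keeps the function correct even if the class columns appear in a different order.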

The second example is the same case (binary response) using `svm()` from the `e1071` package. Here, the predicted probabilities are stored in an attribute of `pred`, which we can access using the `attr()` function. Again, we only need the `TRUE` column for `sperrorest()`.
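A minimal sketch of such a predict function could look as follows. It assumes that the model was fitted with `probability = TRUE` (which `e1071::svm()` requires for probability predictions) and that the positive class is labelled `"TRUE"`; the `positive` argument is a hypothetical addition. The column is selected by name because the column order of the probability matrix is not guaranteed to follow the factor levels.

```r
# Sketch of a custom predict function for e1071::svm() models
svm_predfun <- function(object = NULL, newdata = NULL, probability = TRUE,
                        positive = "TRUE") {
  # probability = TRUE attaches the class probabilities as an attribute
  pred <- predict(object, newdata = newdata, probability = probability)
  probs <- attr(pred, "probabilities")
  # Return only the probabilities of the positive class as a plain vector
  unname(probs[, positive])
}

# Usage example, run only if e1071 is installed; the data are simulated
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(1)
  d <- data.frame(x = rnorm(100))
  d$y <- factor(d$x + rnorm(100) > 0)  # levels "FALSE" and "TRUE"
  fit <- e1071::svm(y ~ x, data = d, probability = TRUE)
  head(svm_predfun(fit, newdata = d))
}
```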