R/sperrorest_resampling.R
partition_cv_strat.Rd
partition_cv_strat creates a set of sample indices corresponding to cross-validation test and training sets.
partition_cv_strat(
data,
coords = c("x", "y"),
nfold = 10,
return_factor = FALSE,
repetition = 1,
seed1 = NULL,
strat
)
data: data.frame containing at least the columns specified by coords
coords: vector of length 2 defining the variables in data that contain the x and y coordinates of sample locations
nfold: number of partitions (folds) in nfold-fold cross-validation partitioning
return_factor: if FALSE (default), return a represampling object; if TRUE (used internally by other sperrorest functions), return a list containing factor vectors (see Value)
repetition: numeric vector: cross-validation repetitions to be generated. Note that this is not the number of repetitions, but the indices of these repetitions. E.g., use repetition = c(1:100) to obtain (the 'first') 100 repetitions, and repetition = c(101:200) to obtain a different set of 100 repetitions.
seed1: seed1+i is the random seed that will be used by set.seed in repetition i (i in repetition) to initialize the random number generator before sampling from the data set.
strat: character: column in data containing a factor variable over which the partitioning should be stratified; or factor vector of length nrow(data): variable over which to stratify
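The interplay of repetition and seed1 described above can be illustrated with a small base-R sketch. The helper sketch_folds below is hypothetical and is not part of sperrorest; it only assumes the documented rule that repetition i is seeded with seed1 + i, so that a given repetition index yields the same partition no matter which other repetitions are requested.

```r
# Hypothetical helper, not part of sperrorest: draws one random fold
# assignment per repetition index, seeding with seed1 + i as documented.
sketch_folds <- function(n, nfold, repetition, seed1 = NULL) {
  out <- list()
  for (i in repetition) {
    if (!is.null(seed1)) set.seed(seed1 + i)
    # balanced fold labels 1..nfold, shuffled
    out[[as.character(i)]] <- sample(rep(seq_len(nfold), length.out = n))
  }
  out
}

a <- sketch_folds(n = 20, nfold = 5, repetition = 1:3, seed1 = 100)
b <- sketch_folds(n = 20, nfold = 5, repetition = 3, seed1 = 100)

# Repetition 3 is the same whether or not repetitions 1 and 2 were generated
identical(a[["3"]], b[["3"]])

# Requesting indices 101:103 gives three further repetitions with other seeds
names(sketch_folds(n = 20, nfold = 5, repetition = 101:103, seed1 = 100))
```

Under this assumption, splitting a long run into batches such as repetition = 1:100 and repetition = 101:200 with the same seed1 should give non-overlapping, reproducible sets of repetitions.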
A represampling object, see also partition_cv(). partition_cv_strat, however, is stratified with respect to the variable data[,strat]; i.e., cross-validation partitioning is done within each set data[data[,strat]==i,] (i in levels(data[, strat])), and the k-th folds of all levels are combined into one cross-validation fold.
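The stratification scheme described above can be sketched in base R. The function stratified_folds is a hypothetical illustration, not the sperrorest implementation: it assigns fold labels separately within each level of the stratification variable, so that rows sharing a label across levels form one combined fold.

```r
# Hypothetical sketch, not the sperrorest implementation.
stratified_folds <- function(strat, nfold) {
  strat <- as.factor(strat)
  fold <- integer(length(strat))
  for (lev in levels(strat)) {
    idx <- which(strat == lev)
    # partition this stratum on its own: balanced fold labels, shuffled
    fold[idx] <- sample(rep(seq_len(nfold), length.out = length(idx)))
  }
  # rows with the same label across all strata make up one combined fold
  fold
}

set.seed(1)
y <- factor(rep(c("TRUE", "FALSE"), times = c(30, 70)))
fold <- stratified_folds(y, nfold = 5)

# Each fold holds the same share of every level (here 6 TRUE and 14 FALSE)
table(fold, y)
```

When a stratum size is not a multiple of nfold, the counts per fold differ by at most one, which is why class proportions in each fold stay close to, but not always exactly at, the overall proportion.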
data(ecuador)
parti <- partition_cv_strat(ecuador,
strat = "slides", nfold = 5,
repetition = 1
)
idx <- parti[["1"]][[1]]$train
mean(ecuador$slides[idx] == "TRUE") / mean(ecuador$slides == "TRUE")
#> [1] 0.9996672
# always very close to 1: each level of slides is split evenly across folds (up to rounding)
# Non-stratified cross-validation:
parti <- partition_cv(ecuador, nfold = 5, repetition = 1)
idx <- parti[["1"]][[1]]$train
mean(ecuador$slides[idx] == "TRUE") / mean(ecuador$slides == "TRUE")
#> [1] 1.009664
# close to 1 because of large sample size, but with some random variation
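The examples index the result as parti[["1"]][[1]]$train, which reads as repetition "1", fold 1, training rows. As a rough illustration of how a vector of fold labels relates to such train/test index lists, the following base-R sketch uses a hypothetical helper folds_to_indices; the element names train and test are assumed from the indexing above and from the description of test and training sets, and the actual internal layout of a represampling object may differ.

```r
# Hypothetical helper for illustration only, not part of sperrorest.
# For each fold label k: rows labelled k are the test set, all others train.
folds_to_indices <- function(fold) {
  fold <- as.factor(fold)
  lapply(levels(fold), function(k) {
    list(train = which(fold != k), test = which(fold == k))
  })
}

fold <- factor(c(1, 2, 3, 1, 2, 3))
idx <- folds_to_indices(fold)

idx[[1]]$test   # rows held out in fold 1
idx[[1]]$train  # remaining rows used for training
```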