Train an IUCNN Model — iucnn_train_model

Trains an IUCNN model based on a data.frame of features for a set of species, generated by iucnn_prepare_features, and the corresponding IUCN classes formatted as an iucnn_labels object with iucnn_prepare_labels. Note that NAs are not allowed in the features, and taxa with NAs will automatically be removed! Taxa for which information is present in only one of the two input objects will be removed as well.

iucnn_train_model(
  x,
  lab,
  path_to_output = "iuc_nn_model",
  production_model = NULL,
  mode = "nn-class",
  test_fraction = 0.2,
  cv_fold = 1,
  seed = 1234,
  max_epochs = 1000,
  patience = 200,
  n_layers = "50_30_10",
  use_bias = TRUE,
  balance_classes = FALSE,
  act_f = "auto",
  act_f_out = "auto",
  label_stretch_factor = 1,
  randomize_instances = TRUE,
  dropout_rate = 0,
  mc_dropout = TRUE,
  mc_dropout_reps = 100,
  label_noise_factor = 0,
  rescale_features = FALSE,
  save_model = TRUE,
  overwrite = FALSE,
  verbose = 1
)

Arguments

x: a data.frame containing a column "species" with the species names, and subsequent columns with different features.
lab: an object of the class iucnn_labels, as generated by iucnn_prepare_labels containing the labels for all species.
path_to_output: character string. The path to the location where the IUCNN model shall be saved.
production_model: an object of type iucnn_model (default=NULL). If an iucnn_model is provided, iucnn_train_model will read the settings of this model and reproduce it, but use all available data for training, by automatically setting the validation set to 0 and cv_fold to 1. This is recommended before using the model to predict the IUCN status of not evaluated species, as it generally improves the prediction accuracy of the model. When this option is chosen, all other provided settings (below) are ignored.
mode: character string. Choose between the IUCNN models "nn-class" (default, tensorflow neural network classifier), "nn-reg" (tensorflow neural network regression), or "bnn-class" (Bayesian neural network classifier).
test_fraction: numeric. The fraction of the input data used as test set.
cv_fold: integer (default=1). When setting cv_fold > 1, iucnn_train_model will perform k-fold cross-validation. In this case, the provided setting for test_fraction will be ignored, as the test size of each CV fold is determined by the number of folds specified here.
seed: integer. Set a starting seed for reproducibility.
max_epochs: integer. The maximum number of epochs.
patience: integer. Number of epochs with no improvement after which training will be stopped.
n_layers: character string. Defines the number of nodes per layer by providing a character string in which the numbers of nodes for the individual layers are separated by underscores. E.g. '50_30_10' (default) will train a model with 3 hidden layers with 50, 30, and 10 nodes, respectively. Note that the number of nodes in the output layer is automatically determined based on the number of unique labels in the training set.
use_bias: logical (default=TRUE). Specifies if a bias node is used in the first hidden layer.
balance_classes: logical (default=FALSE). If set to TRUE, iucnn_train_model will perform supersampling of the training instances to account for uneven class distribution in the training data. When training a bnn-class model, choosing this option will instead add the estimation of class weights to account for class imbalances.
act_f: character string. Specifies the activation function used in the hidden layers. Available options are: "relu", "tanh", "sigmoid", or "swish" (the latter only for bnn-class). If set to 'auto' (default), iucnn_train_model will pick a reasonable default ('relu' for nn-class or nn-reg, and 'swish' for bnn-class).
act_f_out: character string. Similar to act_f, this specifies the activation function for the output layer. Available options are "softmax" (nn-class, bnn-class), "tanh" (nn-reg), "sigmoid" (nn-reg), or no activation function "" (nn-reg). When set to "auto" (default), a suitable output activation function will be chosen based on the chosen mode ('softmax' for nn-class or bnn-class, 'tanh' for nn-reg).
label_stretch_factor: numeric (only for mode nn-reg). The provided value will be applied as a factor to stretch or compress the labels before training a regression model. A factor < 1 will compress the range of labels, while a factor > 1 will stretch the range.
randomize_instances: logical. When set to TRUE (default) the instances will be shuffled before training (recommended).
dropout_rate: numeric. This will randomly turn off the specified fraction of nodes of the neural network during each epoch of training, making the NN more stable and less reliant on individual nodes/weights, which can prevent over-fitting (only available for modes nn-class and nn-reg). See the mc_dropout setting explained below if dropout shall also be applied to the predictions.
mc_dropout: logical. If set to TRUE, the predictions (including the validation accuracy) based on a model trained with a dropout fraction > 0 will reflect the stochasticity introduced by the dropout method (MC dropout predictions). This is required, for example, when predicting with a specified accuracy threshold (see the target_acc option in iucnn_predict_status). This option is activated by default when choosing a dropout_rate > 0, unless it is manually set to FALSE here.
mc_dropout_reps: integer. The number of MC iterations to run when predicting validation accuracy and calculating the accuracy-threshold table required for making predictions with an accuracy threshold. The default of 100 is usually sufficient; larger values will lead to longer computation times, particularly during model testing with cross-validation.
label_noise_factor: numeric (only for mode nn-reg). Adds the specified amount of random noise to the input labels to give the categorical labels a more continuous spread before training the regression model. E.g. a value of 0.2 will redraw the label of a species categorized as Vulnerable (class=2) randomly between 1.8 and 2.2, based on a uniform probability distribution.
rescale_features: logical. Set to TRUE if all feature values shall be rescaled to values between 0 and 1 prior to training (default=FALSE).
save_model: logical. If TRUE the model is saved to disk.
overwrite: logical. If TRUE existing models are overwritten. Default is set to FALSE.
verbose: integer. Set to 1 (the default shown in the usage above) for iucnn_train_model to print additional info to the screen while training, or to 0 to suppress it.
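
Several of the arguments above describe simple transformations that can be illustrated in base R. The following is a minimal sketch under stated assumptions, not the package's internal implementation: the helper names (parse_layers, add_label_noise, rescale01) and the feature columns (range_km2, n_occ) are hypothetical, and stretching is assumed to be a plain multiplication, and rescaling a per-column min-max transformation.

```r
# Illustrative base-R sketch; none of these helpers are part of IUCNN.
set.seed(1234)

# n_layers: "50_30_10" describes three hidden layers with 50, 30 and 10 nodes.
parse_layers <- function(n_layers) {
  as.integer(strsplit(n_layers, "_", fixed = TRUE)[[1]])
}
hidden_nodes <- parse_layers("50_30_10")

# label_noise_factor: redraw each label uniformly within +/- the factor,
# e.g. a label of 2 with factor 0.2 ends up between 1.8 and 2.2.
add_label_noise <- function(labels, noise_factor) {
  labels + runif(length(labels), min = -noise_factor, max = noise_factor)
}
noisy <- add_label_noise(c(0, 2, 2, 4), noise_factor = 0.2)

# label_stretch_factor: assumed here to multiply the labels by the factor,
# so a factor of 2 stretches the range 0..4 to 0..8.
stretched <- c(0, 1, 2, 3, 4) * 2

# rescale_features: assumed min-max rescaling of each column to [0, 1].
# A constant column would yield NaN in this simple version.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
feats <- data.frame(range_km2 = c(10, 250, 4000), n_occ = c(3, 40, 120))
feats_scaled <- as.data.frame(lapply(feats, rescale01))

# cv_fold: with k folds, each instance is assigned to one fold, and each
# fold serves once as the test set (roughly n / k instances per fold).
cv_fold <- 5
n <- 23
fold_id <- sample(rep(seq_len(cv_fold), length.out = n))
table(fold_id)
```

The cross-validation part shows why test_fraction is ignored when cv_fold > 1: the share of instances held out in each fold follows directly from the number of folds.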

Value

Returns an iucnn_model object, which can be used in iucnn_predict_status to predict the conservation status of not evaluated species.

Note

See vignette("Approximate_IUCN_Red_List_assessments_with_IUCNN") for a tutorial on how to run IUCNN.

Examples

if (FALSE) {
data("training_occ")    # geographic occurrences of species with IUCN assessment
data("training_labels") # the corresponding IUCN assessments

# 1. Feature and label preparation
features <- iucnn_prepare_features(training_occ, type = "geographic") # Training features
labels_train <- iucnn_prepare_labels(training_labels, features) # Training labels

# 2. Model training
m1 <- iucnn_train_model(x = features, lab = labels_train)

summary(m1)
plot(m1)
}
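
The example above relies on the default settings. As a hedged sketch of how some of the documented arguments might be combined, the variations below reuse features, labels_train and m1 from that example. The output directory names are placeholders, and the code is wrapped in if (FALSE) like the example above because it requires the IUCNN package and its neural network backend to be installed.

```r
if (FALSE) {
# 5-fold cross-validation with dropout; test_fraction is ignored when cv_fold > 1
m_cv <- iucnn_train_model(
  x = features,
  lab = labels_train,
  path_to_output = "iuc_nn_model_cv",  # placeholder directory name
  cv_fold = 5,
  dropout_rate = 0.1
)

# Regression mode with uniform label noise (label_noise_factor applies to nn-reg only)
m_reg <- iucnn_train_model(
  x = features,
  lab = labels_train,
  path_to_output = "iuc_nn_model_reg", # placeholder directory name
  mode = "nn-reg",
  label_noise_factor = 0.2
)

# Production model: reuses the settings of m1 and trains on all available data.
# Other settings are ignored in this case; depending on how the output path is
# handled, an existing output directory may need to be removed first.
m_prod <- iucnn_train_model(
  x = features,
  lab = labels_train,
  production_model = m1
)
}
```

Training the production model last mirrors the recommendation in the production_model argument: settings are first evaluated with a test set or cross-validation, and the chosen configuration is then retrained on all data before predicting not evaluated species.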