Correcting for bias in distribution modelling for rare species using citizen science data
Citations: top 10% of 2017 papers
Abstract
Aim
To improve the accuracy of inferences on habitat associations and distribution patterns of rare species by combining machine‐learning, spatial filtering and resampling to address class imbalance and spatial bias of large volumes of citizen science data.
Innovation
Modelling rare species’ distributions is a pressing challenge for conservation and applied research. Often, a large number of surveys are required before enough detections occur to model distributions of rare species accurately, resulting in a data set with a high proportion of non‐detections (i.e. class imbalance). Citizen science data can provide a cost‐effective source of surveys but likely suffer from class imbalance. Citizen science data also suffer from spatial bias, likely from preferential sampling. To correct for class imbalance and spatial bias, we used spatial filtering to under‐sample the majority class (non‐detection) while maintaining all of the limited information from the minority class (detection). We investigated the use of spatial under‐sampling with randomForest models and compared it to common approaches used for imbalanced data: the synthetic minority oversampling technique (SMOTE), weighted random forest and balanced random forest models. Model accuracy was assessed using kappa, Brier score and AUC. We demonstrate the method by evaluating habitat associations and seasonal distribution patterns using citizen science data for a rare species, the tricoloured blackbird (Agelaius tricolor).
Main Conclusions
Spatial under‐sampling increased the accuracy of each model and outperformed the approach typically used to direct under‐sampling in the SMOTE algorithm. Our approach is the first to characterize winter distribution and movement of tricoloured blackbirds.
Our results show that tricoloured blackbirds are positively associated with grassland, pasture and wetland habitats, and negatively associated with high elevations or evergreen forests during both winter and breeding seasons. The seasonal differences in distribution indicate that individuals move to the coast during the winter, as suggested by historical accounts.
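The spatial under‐sampling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the square latitude/longitude grid, the 0.1 degree cell size, the rule of keeping one non‐detection per cell, the function name spatial_undersample and the synthetic survey records are all assumptions made for illustration. The sketch keeps every detection and thins non‐detections so that heavily surveyed areas no longer dominate the sample.

```python
import math
import random
from collections import defaultdict


def spatial_undersample(records, cell_size_deg=0.1, per_cell=1, seed=0):
    """Keep all detections; keep at most `per_cell` non-detections per grid cell.

    `records` is a list of dicts with keys 'lon', 'lat' and 'detected'.
    The square grid and the cell size are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Minority class (detections) is retained in full.
    detections = [r for r in records if r["detected"]]

    # Group majority-class records (non-detections) by grid cell.
    cells = defaultdict(list)
    for r in records:
        if not r["detected"]:
            key = (math.floor(r["lon"] / cell_size_deg),
                   math.floor(r["lat"] / cell_size_deg))
            cells[key].append(r)

    # Draw a random subset of non-detections from each occupied cell.
    kept = []
    for key in sorted(cells):
        group = cells[key]
        kept.extend(rng.sample(group, min(per_cell, len(group))))

    return detections + kept


# Synthetic surveys (hypothetical data): a dense cluster of non-detections,
# a sparse background of non-detections, and a handful of detections.
data_rng = random.Random(42)
surveys = []
for _ in range(500):
    surveys.append({"lon": -121.5 + data_rng.uniform(0, 0.2),
                    "lat": 38.5 + data_rng.uniform(0, 0.2),
                    "detected": False})
for _ in range(50):
    surveys.append({"lon": -122.0 + data_rng.uniform(0, 2.0),
                    "lat": 38.0 + data_rng.uniform(0, 2.0),
                    "detected": False})
for _ in range(10):
    surveys.append({"lon": -122.0 + data_rng.uniform(0, 2.0),
                    "lat": 38.0 + data_rng.uniform(0, 2.0),
                    "detected": True})

balanced = spatial_undersample(surveys)
print("surveys before:", len(surveys))
print("surveys after:", len(balanced))
print("detections kept:", sum(r["detected"] for r in balanced))
```

The thinned sample could then be passed to any classifier. The choice of cell size controls the trade‐off between reducing spatial bias and discarding non‐detection information, and would need to be tuned for a real data set.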