Semi-Supervised Learning Ensemble Model for Water Quality Classification in Mexican Municipalities
Chapter in Scopus
Overview
abstract
Water quality in Mexico has been largely affected by pollutants, which have had a negative impact on the health of the country's water resources. Although monitoring stations are located within 1,046 municipalities of Mexico, 1,432 remain unaccounted for. We developed a Semi-Supervised Learning Ensemble Model (SSLEM) using an integrated dataset containing water quality information, demographics, and industry metrics for each municipality. The SSLEM integrates predictions from three distinct supervised algorithms: XGBoost, CatBoost, and Random Forest. It employs an averaged-confidence-based selection strategy that retrains itself with labeled data and pseudo-labels from predictions with at least 75% confidence. This confidence threshold was chosen by evaluating SSLEM under different confidence thresholds (70%, 75%, 80%, and 85%). After at most 40 iterations, the SSLEM predicted 80% of the missing data, achieving 54% accuracy using three-level labels (good, medium, and bad). The predicted labels of these 1,136 municipalities translate to an estimated population of 20 million by the year 2020. The SSLEM was compared to an ensemble model (EM) without self-learning, which produced predictions for only 53% of the missing data with at least 75% confidence and 52% accuracy. Additionally, the SSLEM could be well suited for proposing to the respective authorities which municipalities could benefit most from placing measuring stations in vulnerable areas. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
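The self-training loop described in the abstract (average the ensemble's class probabilities, accept predictions with at least 75% confidence as pseudo-labels, retrain, and stop after at most 40 iterations) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `CentroidModel` is a hypothetical toy classifier standing in for XGBoost, CatBoost, and Random Forest so the example runs with the standard library alone, and the function names, the two-feature toy data, and the stopping rule when no prediction clears the threshold are all assumptions.

```python
import math


class CentroidModel:
    """Toy stand-in classifier (assumption: NOT XGBoost/CatBoost/Random Forest)."""

    def __init__(self, sharpness):
        self.sharpness = sharpness
        self.centroids = {}

    def fit(self, X, y):
        # Group rows by label and store the per-class mean of each feature.
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_proba(self, row):
        # Normalized exp(-sharpness * distance) to each class centroid.
        scores = {
            label: math.exp(-self.sharpness * math.dist(row, centroid))
            for label, centroid in self.centroids.items()
        }
        total = sum(scores.values())
        return {label: score / total for label, score in scores.items()}


def average_proba(models, row):
    """Average the class probabilities of all ensemble members."""
    probs = [m.predict_proba(row) for m in models]
    return {label: sum(p[label] for p in probs) / len(probs) for label in probs[0]}


def self_train(models, X_lab, y_lab, X_unlab, threshold=0.75, max_iter=40):
    """Return {unlabeled index: pseudo-label} for confidently predicted rows."""
    X_train, y_train = list(X_lab), list(y_lab)
    pending = list(range(len(X_unlab)))
    pseudo = {}
    for _ in range(max_iter):
        for m in models:
            m.fit(X_train, y_train)
        accepted = []
        for i in pending:
            avg = average_proba(models, X_unlab[i])
            label, conf = max(avg.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                accepted.append((i, label))
        if not accepted:
            break  # assumption: stop early when nothing clears the threshold
        for i, label in accepted:
            pseudo[i] = label
            X_train.append(X_unlab[i])
            y_train.append(label)
        pending = [i for i in pending if i not in pseudo]
        if not pending:
            break
    return pseudo


# Hypothetical toy data: two features per "municipality", two of the three labels.
X_labeled = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]]
y_labeled = ["good", "good", "bad", "bad"]
X_unlabeled = [[0.1, 0.2], [4.9, 5.2], [2.5, 2.5]]

ensemble = [CentroidModel(1.0), CentroidModel(2.0), CentroidModel(3.0)]
pseudo_labels = self_train(ensemble, X_labeled, y_labeled, X_unlabeled)
print(pseudo_labels)
```

In this toy run the two points near a labeled cluster receive pseudo-labels, while the ambiguous midpoint never reaches the 75% averaged confidence and stays unlabeled, which mirrors how the SSLEM leaves part of the missing data unpredicted.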