Water Quality Classification via Cost-Efficient Machine Learning: A Case Study in Nuevo León

Accurate water-quality assessment is vital, but laboratory costs limit monitoring in many regions. We test whether a small, low-cost indicator panel can classify Water Quality Index (WQI) categories in Nuevo León, Mexico. Using 1,302 REMANECA samples, we computed WQI with a weighted multiplicative model and trained five classifiers (RF, SVM, DT, KNN, NB) on physicochemical features. Cross-validation ranked Random Forest (RF) best with 11 indicators (accuracy 0.921±0.023; weighted F1 0.912±0.028; macro precision 0.926±0.037; macro recall 0.785±0.073). Feature selection and importances emphasized total hardness, coliforms, nutrients (PO4, NO3-, NH3), and pH. A cost-aware five-test panel (hardness, PO4, pH, NH3, SST) retained strong performance (RF accuracy 0.857±0.026; weighted F1 0.829±0.030) with reduced minority-class sensitivity (macro recall 0.615±0.059). Errors concentrated between adjacent categories; detection of heavily contaminated water remained stable (recall 98% to 97%) and the majority class stayed high (99% to 98%), while "excellent" and "slightly contaminated" degraded. These results show that reliable WQI classification is achievable with a compact, low-cost indicator set. A tiered strategy (screen with the five-test panel and confirm with the full suite) can expand coverage under fixed budgets while preserving identification of severe contamination. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
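The abstract names a weighted multiplicative WQI model without giving its formula. As a minimal sketch, one common form of such a model is the weighted geometric aggregation WQI = prod(q_i ** w_i), where q_i is a sub-index score for indicator i and the weights w_i sum to 1. The function name, the indicator keys, the sub-index scale, and the weights below are all hypothetical and are not taken from the chapter; the category thresholds that map WQI values to classes are not stated in the abstract and are therefore omitted.

```python
import math


def weighted_multiplicative_wqi(sub_indices, weights):
    """Aggregate sub-index scores as prod(q_i ** w_i).

    Assumption: this is a generic weighted geometric form, not necessarily
    the exact model used in the chapter. Weights are normalized to sum to 1.
    """
    if set(sub_indices) != set(weights):
        raise ValueError("sub_indices and weights must cover the same indicators")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    log_sum = 0.0
    for name, q in sub_indices.items():
        if q < 0:
            raise ValueError(f"sub-index for {name!r} must be non-negative")
        if q == 0:
            # A single zero sub-index drives the whole product to zero.
            return 0.0
        # Summing weighted logs is equivalent to multiplying q_i ** w_i.
        log_sum += (weights[name] / total_weight) * math.log(q)
    return math.exp(log_sum)


# Hypothetical sub-index scores (0-100 scale) and weights, for illustration only.
example_scores = {"hardness": 80.0, "pH": 90.0, "PO4": 60.0}
example_weights = {"hardness": 0.5, "pH": 0.3, "PO4": 0.2}
example_wqi = weighted_multiplicative_wqi(example_scores, example_weights)
print(round(example_wqi, 2))
```

A property of the multiplicative form that distinguishes it from a weighted arithmetic mean is that one very poor indicator pulls the overall index down sharply, so a high score elsewhere cannot mask it.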
