Comparing the Predictive Performance, Interpretability, and Accessibility of Machine Learning and Physically Based Models for Water Treatment

Abstract

Using an organic carbon removal data set (n = 500), we compared a physically based semiempirical coagulation model (Langmuir sorption-removal) and three machine learning (ML) modeling methods using quantitative (model performance) and qualitative (model interpretability and accessibility) criteria to identify potential barriers to adoption in water treatment. A gradient-boosted tree ensemble and an artificial neural network provided the most accurate predictions of organic carbon removal, and all models provided accurate predictions when test data were well-characterized by the training data. We also confirmed that the physically based model had the lowest prediction error when extrapolating. As assessed by the ability of model predictions to be reconciled with industry-specific knowledge, the physically based and linear models were the most interpretable. As assessed by the ability of utilities to implement models on an ad hoc basis, the physically based and multiple linear models were deemed to be the most accessible. Collectively, our study suggests that ML-based models offer the best predictive performance when adequate training data are available and that physically based models are best suited when extrapolation is necessary. Potential solutions for the limited interpretability of ML-based models include variable importance and sensitivity analysis; a potential solution for their limited accessibility is training stakeholders in modeling techniques.
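The contrast between interpolation and extrapolation can be illustrated with a minimal sketch. It is not the study's method or data. It assumes the generic Langmuir isotherm form q = q_max·K·c / (1 + K·c) as a stand-in for a physically based model, and a one-nearest-neighbor predictor as a stand-in for a purely data-driven learner (like a tree ensemble, it returns values seen in training, so its predictions flatten outside the training range). The parameter values, noise level, concentration ranges, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def langmuir(c, q_max, k):
    """Generic Langmuir isotherm: sorbed amount at concentration c."""
    return q_max * k * c / (1.0 + k * c)


def fit_langmuir(cs, qs):
    """Fit q_max and k by least squares on the linearized form
    1/q = 1/q_max + (1 / (k * q_max)) * (1/c)."""
    xs = [1.0 / c for c in cs]
    ys = [1.0 / q for q in qs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 1.0 / intercept, intercept / slope  # (q_max, k)


def nearest_neighbor(c, cs, qs):
    """Data-driven stand-in: return the response of the closest training point."""
    i = min(range(len(cs)), key=lambda j: abs(cs[j] - c))
    return qs[i]


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Synthetic training data (assumed values, not from the study):
# concentrations in [1, 5] with 1% multiplicative noise on the response.
TRUE_Q_MAX, TRUE_K = 10.0, 0.5
rng = random.Random(0)
train_c = [1.0 + 4.0 * i / 49 for i in range(50)]
train_q = [
    langmuir(c, TRUE_Q_MAX, TRUE_K) * (1.0 + rng.gauss(0.0, 0.01))
    for c in train_c
]

q_max_hat, k_hat = fit_langmuir(train_c, train_q)


def evaluate(test_c):
    """Return (RMSE of fitted Langmuir, RMSE of nearest neighbor) vs. the true curve."""
    truth = [langmuir(c, TRUE_Q_MAX, TRUE_K) for c in test_c]
    phys = [langmuir(c, q_max_hat, k_hat) for c in test_c]
    nn = [nearest_neighbor(c, train_c, train_q) for c in test_c]
    return rmse(phys, truth), rmse(nn, truth)


# Test points inside the training range, then well outside it.
rmse_phys_in, rmse_nn_in = evaluate([1.5, 2.5, 3.5, 4.5])
rmse_phys_out, rmse_nn_out = evaluate([10.0, 15.0, 20.0])

print(f"in range:      physical={rmse_phys_in:.3f}  data-driven={rmse_nn_in:.3f}")
print(f"extrapolating: physical={rmse_phys_out:.3f}  data-driven={rmse_nn_out:.3f}")
```

In this toy setting both predictors track the curve inside the training range, while only the model with the assumed functional form continues to follow it beyond that range. The sketch says nothing about the relative accuracy of the specific models evaluated in the study.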

Publication
ACS ES&T Engineering