A methodological framework for identifying potential sources of soil heavy metal pollution based on machine learning: A case study in the Yangtze Delta, China.


Institute of Agricultural Remote Sensing & Information Technology Application, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China.


Identifying the many and varied sources of soil heavy metal pollution is a major challenge, and often little information is available about the anthropogenic factors and enterprises that could potentially pollute soils. In this study, we use freely available geographical data from a search engine, in conjunction with machine learning methods, to identify and classify potentially polluting enterprises in the Yangtze Delta, China. Five machine learning approaches were used to classify the data into 31 separate and four integrated industry types. The multinomial naive Bayesian (NB) method achieved an accuracy of 87% and a Kappa coefficient of 0.82, and was used to classify the geographic data from more than 260,000 enterprises. The relationship between the different industry classes and measured soil cadmium (Cd) and mercury (Hg) concentrations was explored using bivariate local Moran's I analysis. The analysis revealed areas where different industry classes had led to soil pollution. In the case of Cd, elevated concentrations also occurred in some areas because of excessive fertilization and coal mining. This study provides a new approach for investigating the interaction between anthropogenic pollution and natural sources of soil heavy metals, and can inform pollution control and planning decisions regarding the location of industrial sites.
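
To make the spatial step concrete, the following is a minimal sketch of a bivariate local Moran's I calculation in plain Python. It is not the authors' implementation. It assumes the common formulation in which the statistic for location i is the standardized value of one variable at i multiplied by the spatially lagged, standardized value of the other variable over i's neighbors, with row-standardized weights and population standard deviations. The function names, the neighbor structure, and the example values (enterprise counts and soil Cd for four grid cells in a row) are hypothetical, and the permutation test normally used to assess significance is omitted.

```python
from math import sqrt


def standardize(values):
    """Return z-scores using the population standard deviation (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("values must not be constant")
    sd = sqrt(variance)
    return [(v - mean) / sd for v in values]


def bivariate_local_moran(x, y, neighbors):
    """Compute I_i = z_x[i] * sum_j w_ij * z_y[j] for every location i.

    x, y      -- equal-length lists (e.g. enterprise density and soil Cd)
    neighbors -- dict mapping a location index to a list of neighbor indices
    Returns a list of (statistic, cluster_label) tuples.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    zx = standardize(x)
    zy = standardize(y)
    results = []
    for i in range(len(x)):
        nbrs = neighbors.get(i, [])
        # Row-standardized weights: each neighbor of i gets weight 1 / k_i.
        lag_y = sum(zy[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        stat = zx[i] * lag_y
        # Label the quadrant of the bivariate Moran scatterplot.
        if zx[i] >= 0 and lag_y >= 0:
            label = "High-High"
        elif zx[i] < 0 and lag_y < 0:
            label = "Low-Low"
        elif zx[i] >= 0:
            label = "High-Low"
        else:
            label = "Low-High"
        results.append((stat, label))
    return results


# Hypothetical example: four grid cells in a row, each adjacent to the next.
enterprise_counts = [10, 8, 1, 0]      # illustrative values only
soil_cd = [0.9, 0.8, 0.2, 0.1]         # illustrative values only
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

for cell, (stat, label) in enumerate(
    bivariate_local_moran(enterprise_counts, soil_cd, adjacency)
):
    print(cell, round(stat, 3), label)
```

In this sketch, a "High-High" label marks a cell with many enterprises whose neighbors have above-average soil Cd, which is the kind of co-location the analysis looks for. A "Low-High" label marks elevated Cd near cells with few enterprises, which may point to other sources.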


Bivariate local Moran's I analysis, Heavy metal pollution, Multinomial naive Bayesian methods, Potentially polluting enterprises, Source identification