Please use this identifier to cite or link to this item: http://dx.doi.org/10.25673/121194
Full metadata record
DC Field | Value | Language
dc.contributor.author | Correa, José | -
dc.contributor.author | Tavakoli, Hamed | -
dc.contributor.author | Vogel, Sebastian | -
dc.contributor.author | Gebbers, Robin | -
dc.date.accessioned | 2025-11-11T08:29:47Z | -
dc.date.available | 2025-11-11T08:29:47Z | -
dc.date.issued | 2025 | -
dc.identifier.uri | https://opendata.uni-halle.de//handle/1981185920/123147 | -
dc.identifier.uri | http://dx.doi.org/10.25673/121194 | -
dc.description.abstract | Sensor-based soil analysis methods, particularly optical spectroscopy, have emerged as efficient alternatives to labor-intensive laboratory analyses for assessing soil properties. In-situ measurements with mobile sensors, in particular, streamline data collection, reducing both time and costs. This approach hinges on correlating sensor signals with laboratory-derived soil physico-chemical properties using mathematical calibration models. The models are trained, and their parameters fine-tuned, on a training dataset. It is best practice to evaluate the performance of calibration models on a test dataset that is independent of the training dataset. However, certain commonly applied data preprocessing procedures can unintentionally introduce dependencies between the training and test datasets. This is called data leakage. A model trained on such datasets will perform very well on both the training and test sets, but will perform much worse when tested on a truly independent dataset. Thus, the calibration model overfits via data leakage. In this study, we illustrate the consequences of data leakage caused by two common preprocessing procedures in soil sensing, namely principal component analysis (PCA) and spatial interpolation through ordinary kriging, on the prediction of soil properties by near-infrared (NIR) spectroscopy. The NIR spectra were obtained in the laboratory and in the field. Laboratory measurements by standard wet-chemistry methods of soil pH value, total organic carbon (TOC) content, and total nitrogen (TN) content of 159 soil samples were used as target variables. Based on the results of this study, PCA and spatial interpolation led to data leakage when executed before data splitting. To avoid data leakage, we encourage researchers to carefully design leak-free data processing pipelines. These pipelines should encapsulate preprocessing methods, model fitting, and (if needed) spatial interpolation, ensuring that training and test sets are completely independent. | eng
dc.language.iso | eng | -
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | -
dc.subject.ddc | 550 | -
dc.title | Overfitting due to data leakage in soil sensor calibration : examples from lab-based and in-situ soil NIR spectroscopy | eng
dc.type | Article | -
local.versionType | publishedVersion | -
local.bibliographicCitation.journaltitle | Computers and electronics in agriculture | -
local.bibliographicCitation.volume | 239 | -
local.bibliographicCitation.issue | Part A | -
local.bibliographicCitation.pagestart | 1 | -
local.bibliographicCitation.pageend | 20 | -
local.bibliographicCitation.publishername | Elsevier Science | -
local.bibliographicCitation.publisherplace | Amsterdam [u.a.] | -
local.bibliographicCitation.doi | 10.1016/j.compag.2025.110920 | -
local.openaccess | true | -
dc.identifier.ppn | 194080227X | -
cbs.publication.displayform | 2025 | -
local.bibliographicCitation.year | 2025 | -
cbs.sru.importDate | 2025-11-11T08:29:23Z | -
local.bibliographicCitation | Contained in: Computers and electronics in agriculture - Amsterdam [u.a.] : Elsevier Science, 1985 | -
local.accessrights.dnb | free | -
Appears in Collections: Open Access Publikationen der MLU

Files in This Item:
File | Description | Size | Format
1-s2.0-S0168169925010269-main.pdf | - | 12.99 MB | Adobe PDF
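
Illustration of the leak-free pipeline recommended in the abstract. The following is a minimal sketch, not the authors' implementation: it splits the samples first and then learns the PCA centering vector and axes from the training spectra only, so the test spectra never influence preprocessing. The helper names (`fit_pca`, `apply_pca`, `fit_linear`, `predict_linear`, `leak_free_pcr`) are hypothetical, the spectra are synthetic random numbers, and a plain principal component regression in NumPy stands in for whatever calibration model is actually used. Spatial interpolation (kriging) is not covered here.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Learn the centering vector and principal axes from training spectra only."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project spectra onto axes that were learned from the training set."""
    return (X - mean) @ components.T


def fit_linear(scores, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(scores, coef):
    """Apply the fitted intercept and slopes to PCA scores."""
    A = np.column_stack([np.ones(len(scores)), scores])
    return A @ coef


def leak_free_pcr(X, y, test_fraction=0.3, n_components=5, seed=0):
    """Split first, then fit preprocessing and model on the training rows only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Preprocessing parameters come from the training subset alone.
    mean, comps = fit_pca(X[train_idx], n_components)
    coef = fit_linear(apply_pca(X[train_idx], mean, comps), y[train_idx])

    # The test subset is only transformed and predicted, never fitted on.
    y_pred = predict_linear(apply_pca(X[test_idx], mean, comps), coef)
    rmse = float(np.sqrt(np.mean((y_pred - y[test_idx]) ** 2)))
    return rmse, train_idx, test_idx


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): 159 "samples" x 40 "wavelengths".
    rng = np.random.default_rng(42)
    X = rng.normal(size=(159, 40))
    y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=159)
    rmse, train_idx, test_idx = leak_free_pcr(X, y)
    print(f"test RMSE on synthetic data: {rmse:.3f}")
```

The leaky variant that the abstract warns against would call `fit_pca` on the full matrix `X` before splitting, letting the test spectra shape the centering vector and principal axes.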