Method for the Selection of Modeling Variables and Rejection of Outlier Samples in a Simultaneous Fashion Using a Robust Genetic Algorithm
Publication Date: 2008-Nov-11
The IP.com Prior Art Database
We have combined the work of Leardi and Hubert into a single algorithm which selects an optimum variable set and simultaneously identifies outliers without explicitly knowing in advance which variable set to use for outlier identification. The basic steps are: 1. The GA selects a set of variables to evaluate. 2. A robust PLS is done to identify sample outliers for the variable set being evaluated. 3. The reduced sample set is used to generate a cross-validation error from standard PLS. 4. Variables corresponding to the best cross-validation error from the reduced sample set are propagated into the next generation of the GA.
Building a multivariate model begins with two main steps: assembling a set of calibration samples that are representative and have accurate reference values, and choosing appropriate variables with which to build the model.
Many modeling approaches assume that there is no error in the reference values and seek the best fit by minimizing the difference between the predicted values and the reference values over all samples. If some of the reference values are inaccurate, this inaccuracy is transferred to the model, and the predicted values may also be inaccurate and/or imprecise. A second assumption is that the descriptor variables associated with each sample also have no error and are relevant to the property being predicted. In reality, the descriptor variables are sometimes inaccurate and/or irrelevant, and using such variables also degrades the model.
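As a small illustration of how an inaccurate reference value is transferred to a least-squares model, the following sketch fits a straight line to ten made-up samples, first with exact reference values and then with one reference value corrupted. The data, the size of the error, and the helper name fit_line are assumptions chosen for illustration only; they do not come from the original work.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical calibration set: the true relation is y = 1 + 2x.
x = [float(i) for i in range(10)]
y_true = [1.0 + 2.0 * a for a in x]
y_bad = list(y_true)
y_bad[9] += 30.0  # a single inaccurate reference value (assumed error size)

b0_clean, b1_clean = fit_line(x, y_true)
b0_bad, b1_bad = fit_line(x, y_bad)
print(f"clean fit:     intercept={b0_clean:.2f}, slope={b1_clean:.2f}")
print(f"corrupted fit: intercept={b0_bad:.2f}, slope={b1_bad:.2f}")
```

With these made-up numbers, the single bad reference value moves the fitted slope from 2 to about 3.64, so every later prediction inherits the error even though nine of the ten samples were accurate.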
When all reference values in the calibration set are accurate, many tools exist for choosing the best subset of variables for regression. Conversely, when the best set of variables is known, tools also exist to identify samples with inaccurate reference values. In reality, however, both problems exist simultaneously. In any given modeling problem, there may be samples with inaccurate reference values and/or descriptor variables, and the best set of descriptor variables to use in the model is unknown.
Using a genetic algorithm (GA) approach to identify variables, and a robust Partial Least Squares (PLS) algorithm to identify sample outliers as the GA progresses, we have been able to handle both issues simultaneously. A genetic algorithm uses evolutionary principles to efficiently search a large, many-dimensional space to find combinations of variables that meet some criterion. Leardi has developed a GA that works well with spectroscopic data.[i], [ii] Robust statistical methods are the class of methods developed to produce reliable results even when some of the observations are outliers. The need for robust methods, and their effectiveness, have been described in many sources.[iii], [iv], [v] Verboven and Hubert have developed a library for robust analysis called LIBRA[vi] and have applied it to fast model selection for robust calibration methods.[vii]
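The overall loop (the GA proposes a variable set, a robust fit removes outlier samples for that set, cross-validation on the reduced sample set scores it, and the best sets carry over to the next generation) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation. Ordinary least squares stands in for standard PLS, trimmed refitting with a median-based cutoff stands in for the robust PLS of LIBRA, and the GA is a generic bit-mask GA rather than Leardi's. All function names, parameter values, and the synthetic data are hypothetical.

```python
import random

def fit_ls(X, y, ridge=1e-8):
    """Least squares with intercept (a simple stand-in for standard PLS)."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Augmented normal equations [A'A + ridge*I | A'y]
    M = [[sum(a[i] * a[j] for a in A) + (ridge if i == j else 0.0) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def select(X, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, m in zip(row, mask) if m] for row in X]

def robust_inliers(X, y, keep=0.75, steps=5, cutoff=3.0):
    """Trimmed refitting plus a median-based cutoff (stand-in for robust PLS)."""
    n = len(y)
    idx = list(range(n))
    h = max(len(X[0]) + 2, int(keep * n))
    for _ in range(steps):
        coef = fit_ls([X[i] for i in idx], [y[i] for i in idx])
        res = [abs(y[i] - predict(coef, X[i])) for i in range(n)]
        idx = sorted(range(n), key=lambda i: res[i])[:h]  # refit on best-fitting samples
    scale = 1.4826 * sorted(res)[n // 2] + 1e-12
    return [i for i in range(n) if res[i] <= cutoff * scale]

def cv_rmse(X, y, folds=5):
    """Cross-validation error with interleaved folds."""
    n, sq = len(y), 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        coef = fit_ls([X[i] for i in train], [y[i] for i in train])
        sq += sum((y[i] - predict(coef, X[i])) ** 2 for i in range(f, n, folds))
    return (sq / n) ** 0.5

def evaluate(X, y, mask):
    Xs = select(X, mask)
    kept = robust_inliers(Xs, y)                                  # step 2
    err = cv_rmse([Xs[i] for i in kept], [y[i] for i in kept])    # step 3
    return err, kept

def run_ga(X, y, pop_size=16, generations=15, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    p, cache = len(X[0]), {}
    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = evaluate(X, y, mask)
        return cache[key][0]
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]  # step 1
    for _ in range(generations):
        pop.sort(key=score)
        parents = pop[:pop_size // 2]                             # step 4
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]      # uniform crossover
            children.append([1 - g if rng.random() < p_mut else g for g in child])
        pop = parents + children
    best = min(pop, key=score)
    return best, cache[tuple(best)]

def make_data(n=40, p=6, seed=1):
    """Synthetic data: only variables 0 and 1 matter; four reference values are wrong."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.05) for r in X]
    bad = [3, 11, 25, 37]
    for i in bad:
        y[i] += 8.0
    return X, y, bad

X, y, bad = make_data()
best_mask, (best_err, kept) = run_ga(X, y)
print(best_mask, round(best_err, 3), sorted(set(range(len(y))) - set(kept)))
```

The point of the design is that outlier rejection is repeated inside the fitness evaluation for every candidate variable set, so no variable set has to be fixed in advance for outlier identification. Caching the score per mask keeps the cost manageable in this toy setting.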
We have combined the work of Leardi and Hubert into a single algorithm which selects an optimum variable set and simultaneously identifies outliers without explicitly knowing in advance which variable set to use for outlier identification. The basic steps are:
- The GA sel...