Using Clutter to Improve Models
Barry M. Wise
Eigenvector Research, Inc.
Manson, WA USA
Abstract
Clutter, defined as the confounding effects of interfering chemical species, physical effects, noise
and instrument non-idealities, is present in all measurements. Sources of clutter include variation in
chemical interferents, physical effects such as scattering due to particles, changes in temperature or pressure,
instrument drift, detector non-linearity, as well as non-systematic random noise. The effect of clutter on
models for sample classification or regression can be mitigated through use of a clutter model. These models
can be derived in a number of ways, such as from combined class-centered data, background characterization or the
y-block gradient. Once obtained, they can be used to construct preprocessing filters such as
Generalized Least Squares Weighting (GLSW) and External Parameter Orthogonalization (EPO). Clutter models
can also be used directly with alternative model forms based on Classical Least Squares (CLS) such as Extended
Least Squares (ELS). This talk discusses methods for obtaining clutter models and demonstrates their use in a
number of applications.
Over the past dozen years, a number of powerful spectral analysis methods have been published which make
use of orthogonalization (i.e. projection followed by weighted subtraction) of interferences or "clutter." These
filtering methods provide a means to mitigate the effect of interferences arising from background chemical or
physical species, instrumental artifacts, systematic sampling errors and instrument or system drift. They have
been used very effectively with complex biological systems, remote sensing applications, chemical process
monitoring and calibration transfer problems.
This class of methods includes Orthogonal Partial Least Squares (O-PLS), External Parameter Orthogonalization
(EPO), Dynamic Orthogonal Projection (DOP), Orthogonal Signal Correction (OSC), Constrained Principal
Spectral Analysis (CPSA), Generalized Least Squares Weighting (GLSW), and Science Based Calibration (SBC)
among others. All are based on the orthogonalization premise and each touts a unique ability to improve
model performance, robustness, and/or interpretability.
Some relationships between these methods are noted, along with ties to older work. Examples are given of the
use of the methods in calibration and classification problems in pharmaceutical, petrochemical and remote
Outline
What is clutter?
Orthogonalization filters
How to get clutter models
Ways to deal with clutter
Examples
What is “Clutter”?
A confused multitude of things: a condition in which
things are not in their expected places
Radar Clutter Definition: (DOD, NATO) Unwanted
signals, echoes, or images on the face of the display
tube, which interfere with observation of desired
signals.
Variations in the signal (e.g. spectra) not due to the
factor (e.g. analyte) of interest due to systematic or
random effects
Measured Signal
Clutter is present in all measurements
[Diagram, for X-block and Y-block: Measured Signal is made up of Target Signal and Clutter Signal; Clutter Signal is made up of Interference Signal and Noise]
Sources of Clutter
Systematic background variability
  in the system being sensed
    Interfering analytes not of interest
    Changes in particle size distribution
    Temperature and pressure changes
    Variable sample matrix, e.g. pH
  due to physics of instrument
    Drift, optics clouding
    Instrument maintenance
    Variable baseline or gain
Non-systematic random noise
  homoscedastic, heteroscedastic
Orthogonalization Filters
Remove clutter from data which interferes with
signal of interest
Filters return spectra with clutter “removed”
“Hard” orthogonalization is projection of a
subspace out of the data
“Soft” orthogonalization is deweighting but
not outright complete subtraction
Some Examples Using Orthogonalization Filters
(by Eigenvector)
In vivo Tissue identification with NIR probe
Cancer detection using in vivo fluorescence
Identification of atherosclerosis in artery walls
using NIR
Determination of hydroxide concentration in
high-concentration aqueous ion solutions
using Raman spectroscopy
Identification of chemical species in remote
sensing
SOME Orthogonalization Filters
Method 1: Orthogonalization of Model
  OSC - Orthogonal Signal Correction (Wold et al. 1998)
  OPLS - Orthogonal PLS (Trygg, Wold 2002, patented)
  MOSC - Modified OSC (POSC - Feudale, Tan, S. Brown 2003)
Method 2: Pre-selection of "clutter" (focusing on this)
  CPSA - Constrained Principal Spectral Analysis (J. Brown 1990, patented)
  EPO - External Parameter Orthogonalization (Roger et al. 2003)
  GLS - Generalized Least Squares (Aitken 1935, Martens et al. 2003)
  SBC - Science Based Calibration (Marbach 2005, patented)
  EMSC - Extended Multiplicative Scatter Correction (Martens, Stark)
  ELS/EMM - Extended Least Squares/Extended Mixture Model
Two General Approaches
Method 1: Orthogonalization of Model
  Calibration Spectra -> PCA or PLS Decomposition -> Scores & Loadings -> Orthogonalize to Y-block (Classes) -> Clutter Loadings -> Filter -> Filtered Spectra
  Repeat for Multiple Components
Method 2: Pre-selection of "clutter"
  Clutter Spectra -> PCA Decomposition -> Clutter Loadings -> Choose Subset -> Filter
Pre-selection Methods…
Clutter Spectra -> PCA Decomposition -> Clutter Loadings -> Choose Subset -> Filter
Identical (choose # of PCs)
  CPSA - Constrained Principal Spectral Analysis
  EPO - External Parameter Orthogonalization
Quite similar (down-weight by scale of eigenvalues)
  GLS - Generalized Least Squares
  SBC - Science Based Calibration
All the same (CLS type models)
  EMSC - Extended MSC
  EMM/ELS - Extended Mixture Model
Pre-selecting Clutter
How to get clutter?
Look at differences in samples
which should otherwise be
the same.
In classification all samples
within a class should
nominally be the same!
Use Calibration itself!
Calibration Spectra -> Clutter Spectra -> PCA Decomposition -> Clutter Loadings -> Choose Subset -> Filter
More on How to Get Clutter
Pure component spectra of known interferences
Subspace spanned by
  samples where analyte of interest is not present
  variation in data that is all of the same class
  repeat measurement of blanks
  off-target pixels in remote sensing
Make it up! e.g. polynomial baseline shapes
Y-gradient Method
Sort samples by y (reference) values
Take differences between adjacent samples
Weight X-differences by inverse of difference
in y values
Deweight by covariance of differences (GLS) or
orthogonalize against some number of PCs
(EPO, ELS, EMM, PA-CLS)
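
The first three steps of the Y-gradient method can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in any particular package: the function name `y_gradient_clutter` and the `floor_frac` parameter (a guard against dividing by zero when adjacent samples share the same y value) are assumptions introduced here.

```python
import numpy as np

def y_gradient_clutter(X, y, floor_frac=0.01):
    """Clutter spectra from differences between y-adjacent samples.

    X: (n_samples, n_vars) spectra; y: (n_samples,) reference values.
    floor_frac is a hypothetical guard: y differences smaller than
    floor_frac * range(y) are treated as that floor before inversion.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    y_range = y.max() - y.min()
    if y_range == 0:
        raise ValueError("y has no variation; the y-gradient is undefined")
    order = np.argsort(y, kind="stable")   # sort samples by y
    dX = np.diff(X[order], axis=0)         # differences between adjacent samples
    dy = np.diff(y[order])                 # non-negative after sorting
    weights = 1.0 / np.maximum(dy, floor_frac * y_range)  # inverse y difference
    return dX * weights[:, None]
```

The returned rows can be treated as clutter spectra and passed on to a covariance-based (GLS) or PCA-based (EPO) filter.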
Clutter Covariance
$$X_c = (X_{1,c} - \bar{x}_{1,c}) + (X_{2,c} - \bar{x}_{2,c}) + \dots$$
(first term: clutter source 1; second term: clutter source 2)
$$C = \frac{X_c^T X_c}{N - 1}$$
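
A minimal NumPy sketch of the clutter covariance follows. It reads the sum over clutter sources as pooling: each source is centered on its own mean and the centered rows are stacked. That reading, the function name `clutter_covariance`, and the choice of N as the total number of stacked rows are assumptions made for illustration.

```python
import numpy as np

def clutter_covariance(sources):
    """Pooled clutter covariance from a list of clutter sources.

    sources: list of (n_i, n_vars) arrays, e.g. one array per class or
    per set of repeat measurements. Each is centered on its own mean.
    """
    centered = []
    for Xi in sources:
        Xi = np.asarray(Xi, dtype=float)
        centered.append(Xi - Xi.mean(axis=0))   # remove the source mean
    Xc = np.vstack(centered)                    # pooled, centered clutter
    N = Xc.shape[0]
    C = Xc.T @ Xc / (N - 1)                     # clutter covariance
    return Xc, C
```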
Covariance to Clutter Basis
$$C = V S^2 V^T \quad\Rightarrow\quad B = V_{1 \dots k}$$
For basis choose some number of factors, k
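
As a sketch of how the basis might be computed and used, the code below takes the leading k eigenvectors of a clutter covariance and projects that subspace out of the data (a "hard" orthogonalization, x_f = x - x B B^T). The function names `clutter_basis` and `hard_filter` are hypothetical, and the choice of k is left to the user.

```python
import numpy as np

def clutter_basis(C, k):
    """Leading k eigenvectors of a symmetric clutter covariance C.

    Returns (B, s2): B is (n_vars, k); s2 holds all eigenvalues in
    descending order, which can be inspected to choose k.
    """
    evals, evecs = np.linalg.eigh(C)      # eigh returns ascending order
    order = np.argsort(evals)[::-1]       # reorder to descending
    s2 = np.clip(evals[order], 0.0, None) # clip tiny negative round-off
    V = evecs[:, order]
    return V[:, :k], s2

def hard_filter(X, B):
    """Project the clutter subspace spanned by B out of the rows of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return X - (X @ B) @ B.T
```

After filtering, the rows of the result are orthogonal to every column of B, so variation along the selected clutter directions is removed completely.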
Covariance to GLS Weighting Matrix
$$C = V S^2 V^T \quad\Rightarrow\quad G = V D^{-1} V^T \quad \text{(weighting matrix)}$$
$$d_{i,i}^{-1} = \frac{1}{\dfrac{s_{i,i}^2}{\alpha^2} + 1}$$
Large α → ∞: dimension unaffected
Small α → 0: dimension eliminated
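
A minimal sketch of the soft weighting follows. The per-dimension weight is taken as 1 / (s^2 / α^2 + 1), which is an assumption about the exact form: implementations differ in detail (for example in whether a square root is applied), but all share the limiting behavior stated on the slide. The function name `glsw_matrix` is hypothetical.

```python
import numpy as np

def glsw_matrix(C, alpha):
    """Soft de-weighting matrix G = V D^-1 V^T from clutter covariance C.

    alpha > 0 is the single adjustable parameter. The weight form used
    here, 1 / (s^2 / alpha^2 + 1), is an assumed reading of the slide.
    """
    evals, V = np.linalg.eigh(C)
    s2 = np.clip(evals, 0.0, None)          # guard against round-off
    d_inv = 1.0 / (s2 / alpha**2 + 1.0)     # near 1: kept, near 0: removed
    return (V * d_inv) @ V.T                # same as V @ diag(d_inv) @ V.T
```

Filtered spectra are then obtained as `X @ G`. Directions with large clutter eigenvalues are shrunk strongly, while directions with no clutter variance pass through unchanged.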
Choosing Components
[Figure: Eigenvalues of Clutter. log(eigenvalues) vs. Principal Component Number (1 to 11), annotated with k=3, k=4, k=5 and with decreasing α]
EPO / CPSA
$$x_f = x - x P_k P_k^T$$
GLS / SBC
$$x_f = x - x P D P^T$$
One adjustable parameter in each method
Other Similar Pre-selection Filters…
Extended Mixture Model (Extended Least
Squares) orthogonal filtering for Classical
Least Squares (CLS) models!
S holds the Target (Calibration) Spectra, $S_{target}$, and the Clutter Spectra, $S_{clutter}$
$$c = x S (S^T S)^{-1}$$
Pseudo-inverse is an orthogonalization!
Equivalent to full-rank EPO / CPSA model
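
The extended CLS estimate can be sketched as below, assuming S is formed by placing target and clutter spectra side by side as columns. The function name `els_estimate` and the column layout are assumptions for illustration; the system is solved directly rather than by forming the inverse explicitly.

```python
import numpy as np

def els_estimate(x, S_target, S_clutter):
    """Extended least squares: c = x S (S^T S)^-1 with S = [S_target, S_clutter].

    x: (n_samples, n_vars) or (n_vars,) spectra.
    S_target: (n_vars, n_t) target spectra as columns.
    S_clutter: (n_vars, n_c) clutter spectra or basis as columns.
    Returns (c_target, c_clutter).
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    S_target = np.asarray(S_target, dtype=float)
    S_clutter = np.asarray(S_clutter, dtype=float)
    S = np.hstack([S_target, S_clutter])
    # Solve (S^T S) c^T = S^T x^T instead of inverting S^T S
    c = np.linalg.solve(S.T @ S, S.T @ x.T).T
    n_t = S_target.shape[1]
    return c[:, :n_t], c[:, n_t:]
```

Because the clutter columns are fitted alongside the targets, the target coefficients are estimated from the part of each target spectrum that is orthogonal to the clutter.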
Extended Multiplicative Scatter
Correction
EMSC attempts to correct for scatter that
appears in forms other than just linear using
the extended mixture model
$$P_{N \times K} = [\,\mathbf{1} \;\; \upsilon \;\; \upsilon^2\,]$$
$$Z_{N \times (1+K)} = [\,s_{ref} \;\; P\,]$$
$$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = (Z^T Z)^{-1} Z^T s$$
$$s_{corrected} = \frac{s - P c_2}{c_1}$$
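
A minimal sketch of basic EMSC under these definitions: each spectrum is regressed on a reference spectrum plus a low-order polynomial in the wavelength axis, the polynomial part is subtracted, and the result is divided by the reference coefficient. The function name `emsc` and the `order` parameter are assumptions; scaling the axis `v` to roughly [-1, 1] beforehand is advisable for numerical conditioning.

```python
import numpy as np

def emsc(S, s_ref, v, order=2):
    """Basic EMSC correction of the rows of S.

    S: (n_samples, N) or (N,) spectra; s_ref: (N,) reference spectrum;
    v: (N,) wavelength/wavenumber axis. Returns (corrected, coefficients),
    where each coefficient row is [c_1, polynomial coefficients...].
    """
    S = np.atleast_2d(np.asarray(S, dtype=float))
    s_ref = np.asarray(s_ref, dtype=float)
    v = np.asarray(v, dtype=float)
    P = np.vander(v, order + 1, increasing=True)     # columns 1, v, v^2, ...
    Z = np.column_stack([s_ref, P])                  # N x (1 + K)
    coef, *_ = np.linalg.lstsq(Z, S.T, rcond=None)   # (1 + K) x n_samples
    c1 = coef[0]                                     # multiplicative part
    c2 = coef[1:]                                    # polynomial part
    corrected = (S - (P @ c2).T) / c1[:, None]
    return corrected, coef.T
```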
EMSC
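$$P_{N \times K} = [\,\mathbf{1} \;\; \upsilon \;\; \upsilon^2\,]$$
$$Z_{N \times (1+J+K+L)} = [\,s_{ref} \;\; S_A \;\; P \;\; Q\,]$$
$$c^T_{1 \times (1+J+K+L)} = [\,c_1 \;\; c_S^T \;\; c_P^T \;\; c_Q^T\,]$$
can add spectra of known target analyte $S_{A, N \times J}$
can add spectra or basis of clutter $Q_{N \times L}$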
We think it is useful to use Clutter!
Example Classification Data
Mid-IR spectra of food grade oils
Classify oils, detect adulterated olive oil
[Figure: spectra plotted from 4000 to 500 on the x-axis, with selected regions marked "Using these regions only"]
PCA Scores Plot of Oils
[Figure: PCA scores plots for Olive oil, Corn oil, Safflower oil and Corn margarine. Panels: selected regions with mean centering only; GLS α = 1; GLS α = 0.3; GLS α = 0.1; GLS α = 0.03; GLS α = 0.01; GLS α = 0.003]
Calibration with MSC
[Figure: Samples/Scores Plot of Olive Oil Calibration. Scores on PC 2 (12.71%) vs. Scores on PC 1 (84.99%). Classes: CMarg, Corn, Olive, Saffl]
Cal and Test with MSC
[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest. Scores on PC 2 (12.71%) vs. Scores on PC 1 (84.99%)]
With MSC and GLS
[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest. Scores on PC 2 (37.55%) vs. Scores on PC 1 (61.84%)]
Zoom on Olive Oil
[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest. Scores on PC 2 (37.55%) vs. Scores on PC 1 (61.84%). Labeled groups: Calibration and test Olive Oil; Adulterated Olive Oil]
Zoom on Corn and Safflower Oil
[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest. Scores on PC 2 (37.55%) vs. Scores on PC 1 (61.84%). Labeled groups: Calibration and test Corn Oil; Calibration and test Safflower Oil]
With MSC and EPO
[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest. Scores on PC 2 (11.52%) vs. Scores on PC 1 (88.03%)]
Indian Pines Data
Classic image data set used in many
publications
Crop area near West Lafayette, Indiana
Ground truth identified 16 known crop areas
Data from AVIRIS: Airborne Visible/Infrared
Imaging Spectrometer
220 channels, 400-2500 nm
Indian Pines Image
Soybean Fields
Soybeans no till
Soybeans min
Soybeans clean
PLS-DA, Mean-Center Only
Class Probability Image
PLS-DA, EPO 1-PC
Class Probability Image
Example Calibration Data
IDRC-2002 Shootout data
NIR Transflectance of pharmaceutical tablets
Goal is to predict assay value
[Figure: tablet spectra. Signal Intensity vs. Wavelength (nm), 600 to 1800 nm]
Calibration and Test with MSC & MC
[Figure: Samples/Scores Plot of calibrate_1,c & test_1. Y Predicted 3 assay vs. Y Measured 3 assay]
2 Latent Variables
R^2 = 0.964
RMSEC = 3.3253
RMSEP = 3.3487
Calibration Bias = 0
Prediction Bias = 0.4224
With MSC, GLS & MC
[Figure: Samples/Scores Plot of calibrate_1,c & test_1. Y Predicted 3 assay vs. Y Measured 3 assay]
2 Latent Variables
R^2 = 0.984
RMSEC = 2.5171
RMSEP = 2.159
Calibration Bias = 1.1369e-13
Prediction Bias = 0.067298
With MSC, EPO & MC
[Figure: Samples/Scores Plot of calibrate_1,c & test_1. Y Predicted 3 assay vs. Y Measured 3 assay]
2 Latent Variables
R^2 = 0.979
RMSEC = 3.0015
RMSEP = 2.3951
Calibration Bias = 8.5265e-14
Prediction Bias = 0.18893
Orthogonalization Filters
Filter   Soft/Hard   Adj. Params      Clutter source                                        Improves Prediction?
------   ---------   --------------   ---------------------------------------------------   ----------------------------------------
OSC      Hard        # LVs            Part of X orthogonal to y                             No, but reduces model complexity
O-PLS    Hard        # LVs            Part of X-model space orthogonal to X'y               No, but sometimes improves interpretation
MOSC     Hard        # PCs            Part of X orthogonal to y                             Maybe
CPSA     Hard        # PCs            A priori, includes pathlength adj.                    Yes
EPO      Hard        # PCs            Classes, y-gradient or a priori                       Yes
DOP      Hard        # PCs            Synthetic reference samples                           Yes
GLS      Soft        Shrinkage α      Classes, y-gradient or a priori                       Yes
SBC      Soft        # PCs (20?)      Repeat samples or blanks                              Yes
EMM      Hard        None             A priori from known interferents, clutter subspace   Yes, CLS model
ELS      Hard        # PCs            Clutter subspace                                      Yes
PA-CLS   Hard        None/# PCs       Baseline shapes, residuals                            Yes, CLS model
WLS      Soft        Regularization   Noise measurements                                    Yes
Conclusions
Main differences between methods are
How the clutter is defined
Whether the de-weighting is hard or soft
Filtering methods are more similar than
published statements might have you believe
Methods achieve similar results, model
performance generally improved (except O-PLS, OSC)
Interpretation of filtered results can be
challenging, except with O-PLS (ideally)