
Using Clutter to Improve Models

Barry M. Wise

Eigenvector Research, Inc.

Manson, WA USA

Abstract

Clutter, defined as the confounding effects of interfering chemical species, physical effects, noise

and instrument non-idealities, is present in all measurements. Sources of clutter include variation in

chemical interferents, physical effects such as scattering due to particles, changes in temperature or pressure,

instrument drift, detector non-linearity, as well as non-systematic random noise. The effect of clutter on

models for sample classification or regression can be mitigated through use of a clutter model. These models

can be derived in a number of ways such as combined class-centered data, background characterization or y-

block gradient. Once obtained, they can be used to construct filters to be used in preprocessing, such as

Generalized Least Squares Weighting (GLSW) and External Parameter Orthogonalization (EPO). Clutter models

can also be used directly with alternative model forms based on Classical Least Squares (CLS) such as Extended

Least Squares (ELS). This talk discusses methods for obtaining clutter models and demonstrates their use in a

number of applications.

Over the past dozen years, a number of powerful spectral analysis methods have been published which make

use of orthogonalization (i.e. projection followed by weighted subtraction) of interferences or "clutter." These

filtering methods provide a means to mitigate the effect of interferences arising from background chemical or

physical species, instrumental artifacts, systematic sampling errors and instrument or system drift. They have

been used very effectively with complex biological systems, remote sensing applications, chemical process

monitoring and calibration transfer problems.

This class of methods includes Orthogonal Partial Least Squares (O-PLS), External Parameter Orthogonalization

(EPO), Dynamic Orthogonal Projection (DOP), Orthogonal Signal Correction (OSC), Constrained Principal

Spectral Analysis (CPSA), Generalized Least Squares Weighting (GLSW), and Science Based Calibration (SBC)

among others. All are based on the orthogonalization premise and each touts a unique ability to improve

model performance, robustness, and/or interpretability.

Some relationships between these methods are noted, along with ties to older work. Examples are given of the

use of the methods in calibration and classification problems in pharmaceutical, petrochemical and remote

Outline

• What is clutter?

• Orthogonalization filters

• How to get a clutter model

• Ways to deal with clutter

• Examples

What is “Clutter?”

• A confused multitude of things: a condition in which

things are not in their expected places

• Radar Clutter Definition: (DOD, NATO) Unwanted

signals, echoes, or images on the face of the display

tube, which interfere with observation of desired

signals.

• Variations in the signal (e.g. spectra) not due to the

factor (e.g. analyte) of interest due to systematic or

random effects

Measured Signal

• Clutter is present in all measurements

– X-block, Y-block

[Diagram: Measured Signal = Target Signal + Clutter Signal; Clutter Signal = Interference Signal + Noise]

Sources of Clutter

• Systematic background variability

  – in the system being sensed

    • Interfering analytes not of interest

    • Changes in particle size distribution

    • Temperature and pressure changes

    • Variable sample matrix, e.g. pH

  – due to physics of instrument

    • Drift, optics clouding

    • Instrument maintenance

    • Variable baseline or gain

• Non-systematic random noise

  – homoscedastic, heteroscedastic

Orthogonalization Filters

• Remove clutter from data which interfere with

signal of interest

• Filters return spectra with clutter “removed”

• “Hard” orthogonalization is projection of a

subspace out of the data

• “Soft” orthogonalization is deweighting but

not outright complete subtraction

Some Examples Using Orthogonalization Filters

(by Eigenvector)

• In vivo Tissue identification with NIR probe

• Cancer detection using in vivo fluorescence

• Identification of atherosclerosis in artery walls

using NIR

• Determination of hydroxide concentration in

high-concentration aqueous ion solutions

using Raman spectroscopy

• Identification of chemical species in remote

sensing

Method 1: Orthogonalization of Model

Method 2: Pre-selection of "clutter"

SOME Orthogonalization Filters

• OSC – Orthogonal Signal Correction (Wold et al. 1998)

• OPLS – Orthogonal PLS (Trygg, Wold 2002, patented)

• MOSC – Modified OSC (POSC – Feudale, Tan, S. Brown 2003)

• CPSA – Constrained Principal Spectral Analysis (J. Brown 1990, patented)

• EPO – External Parameter Orthogonalization (Roger et al. 2003)

• GLS – Generalized Least Squares (Aitken 1935, Martens et al. 2003)

• SBC – Science Based Calibration (Marbach 2005, patented)

• EMSC – Extended Multiplicative Scatter Correction (Martens, Stark)

• ELS/EMM – Extended Least Squares/Extended Mixture Model

Focusing on this

Two General Approaches

Method 1: Orthogonalization of Model

[Flow diagram: Calibration Spectra and Y-block (Classes) → PCA or PLS Decomposition → Scores & Loadings → Orthogonalize to Y-block → Clutter Loadings → Filtered Spectra; repeat for multiple components]

Method 2: Pre-selection of "clutter"

[Flow diagram: Clutter Spectra → PCA Decomposition → Clutter Loadings → Choose Subset → Filter]

Pre-selection Methods…

[Flow diagram: Clutter Spectra → PCA Decomposition → Clutter Loadings → Choose Subset → Filter]

• Identical

• Choose # of PCs

• Quite similar

• Down-weight by

scale of eigenvalues

All the same…

• CLS type models

• CPSA - Constrained Principal Spectral Analysis

• EPO – External Parameter Orthogonalization

• GLS – Generalized Least Squares

• SBC – Science Based Calibration

• EMSC – Extended MSC

• EMM/ELS – Extended Mixture Model

Pre-selecting Clutter

How to get clutter?

Look at differences in samples

which should otherwise be

the same.

In classification – all samples

within a class should

nominally be the same!

Use Calibration itself!

[Flow diagram: Calibration Spectra → Clutter Spectra → PCA Decomposition → Clutter Loadings → Choose Subset → Filter]

More on How to Get Clutter

• Pure component spectra of known

interferences

• Subspace spanned by

– samples where analyte of interest is not present

– variation in data that is all of the same class

– repeat measurement of blanks

– off-target pixels in remote sensing

• Make it up! e.g. polynomial baseline shapes

Y-gradient Method

• Sort samples by y (reference) values

• Take differences between adjacent samples

• Weight X-differences by inverse of difference

in y values

• Deweight by covariance of differences (GLS) or

orthogonalize against some number of PCs

(EPO, ELS, EMM, PA-CLS)
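
The first three steps of the y-gradient method can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `y_gradient_clutter` and the small offset `eps` (a guard against division by zero when two samples share the same y value) are assumptions made for the example. The returned weighted differences would then feed a clutter covariance (GLS) or a PCA basis (EPO).

```python
import numpy as np

def y_gradient_clutter(X, y, eps=1e-6):
    """Weighted X-differences between samples that are adjacent in y.

    eps is a hypothetical guard against division by zero for tied y values.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    order = np.argsort(y, kind="stable")   # sort samples by reference value
    dX = np.diff(X[order], axis=0)         # differences between adjacent samples
    dy = np.abs(np.diff(y[order]))         # matching differences in y
    weights = 1.0 / (dy + eps)             # small change in y -> large weight
    return dX * weights[:, None]

# Tiny example: three samples, two variables
X = np.array([[1.0, 2.0], [0.0, 0.0], [0.5, 1.5]])
y = np.array([3.0, 1.0, 2.0])
Xd = y_gradient_clutter(X, y)
```

Samples that differ little in y but noticeably in X contribute most, since that variation cannot be explained by the property of interest.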

Clutter Covariance

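\[ \mathbf{X}_{c} = \underbrace{(\mathbf{X}_{1,c} - \bar{\mathbf{x}}_{1,c})}_{\text{clutter source 1}} + \underbrace{(\mathbf{X}_{2,c} - \bar{\mathbf{x}}_{2,c})}_{\text{clutter source 2}} + \ldots \]

\[ \mathbf{C} = \frac{\mathbf{X}_{c}^{T}\mathbf{X}_{c}}{N - 1} \]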
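
A pooled clutter covariance of this kind can be sketched in a few lines. The sketch assumes that each clutter source (for example, each class) is mean-centered on its own, that the centered blocks are stacked row-wise, and that N is the total number of rows; the function name `clutter_covariance` is illustrative only.

```python
import numpy as np

def clutter_covariance(sources):
    """Pool mean-centered blocks of clutter measurements into one covariance.

    sources: list of (n_i x n_vars) arrays, one per clutter source or class.
    Assumption: blocks are stacked row-wise and N is the total row count.
    """
    centered = [np.asarray(Xi, dtype=float) - np.mean(Xi, axis=0) for Xi in sources]
    Xc = np.vstack(centered)
    N = Xc.shape[0]
    return Xc.T @ Xc / (N - 1)

rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, size=(8, 4))
class_b = rng.normal(loc=5.0, size=(6, 4))   # large offset between class means
C = clutter_covariance([class_a, class_b])
```

Because each block is centered separately, the offset between class means does not enter C; only the within-class variation is treated as clutter.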

Covariance to Clutter Basis

\[ \mathbf{C} = \mathbf{V}\mathbf{S}^{2}\mathbf{V}^{T} \]

\[ \mathbf{B} = \mathbf{V}_{1 \ldots k} \]

For basis choose some

number of factors
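
As a rough sketch of how a clutter basis might be used for "hard" orthogonalization: take the leading k eigenvectors of the clutter covariance and project that subspace out of new spectra. The helper names `clutter_basis` and `epo_filter` and the random data are assumptions for illustration.

```python
import numpy as np

def clutter_basis(C, k):
    """Leading k eigenvectors of a symmetric clutter covariance C."""
    evals, evecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]        # reorder to descending
    return evecs[:, order[:k]]

def epo_filter(X, B):
    """Project the subspace spanned by the columns of B out of the rows of X."""
    return X - X @ B @ B.T

rng = np.random.default_rng(1)
Xc = rng.normal(size=(30, 6))              # stand-in for clutter measurements
Xc -= Xc.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)
B = clutter_basis(C, k=2)
X_filtered = epo_filter(rng.normal(size=(4, 6)), B)
```

After filtering, the spectra have no component left along the chosen clutter directions, so k is the single parameter to tune.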

Covariance to GLS Weighting

Matrix

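\[ \mathbf{C} = \mathbf{V}\mathbf{S}^{2}\mathbf{V}^{T} \]

\[ \mathbf{G} = \mathbf{V}\mathbf{D}^{-1}\mathbf{V}^{T} \quad \text{(weighting matrix), with} \quad D_{i,i}^{-1} = \frac{1}{\sqrt{S_{i,i}^{2}/\alpha + 1}} \]

Large α → ∞: dimension unaffected

Small α → 0: dimension eliminated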
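
A minimal sketch of a GLS weighting matrix follows. It assumes the commonly documented form in which each clutter direction is scaled by 1/sqrt(s²/α + 1), with s² the eigenvalue of the clutter covariance in that direction; the function name `glsw_matrix` is illustrative.

```python
import numpy as np

def glsw_matrix(C, alpha):
    """Soft de-weighting matrix G = V D^-1 V^T from clutter covariance C.

    Assumed form: D_ii = sqrt(s2_i / alpha + 1), s2_i = eigenvalues of C.
    """
    s2, V = np.linalg.eigh(C)
    s2 = np.clip(s2, 0.0, None)            # guard against tiny negative values
    d_inv = 1.0 / np.sqrt(s2 / alpha + 1.0)
    return (V * d_inv) @ V.T               # scale columns of V, then V^T

C = np.diag([4.0, 0.0])                    # clutter only along the first variable
G_large = glsw_matrix(C, alpha=1e12)       # large alpha: essentially no filtering
G_small = glsw_matrix(C, alpha=1e-12)      # small alpha: clutter direction removed
```

Filtered spectra would then be obtained as X @ G. In the two limits, G approaches the identity (large α) or a hard projection that removes the clutter directions (small α).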

Choosing Components

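[Plot: Eigenvalues of Clutter; log(eigenvalues) vs. Principal Component Number (1 to 11); annotations: k=3, k=4, k=5; decreasing α]

EPO / CPSA

\[ \mathbf{x}_{filtered} = \mathbf{x} - \mathbf{x}\mathbf{P}\mathbf{P}^{T} \]

GLS / SBC

\[ \mathbf{x}_{filtered} = \mathbf{x} - \mathbf{x}\mathbf{P}\mathbf{D}\mathbf{P}^{T} \]

One adjustable parameter in each method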

Other Similar Pre-selection Filters…

• Extended Mixture Model (Extended Least

Squares) orthogonal filtering for Classical

Least Squares (CLS) models!

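\[ \mathbf{S} = [\, \mathbf{S}_{target} \;\; \mathbf{S}_{clutter} \,] \]

(S_target: Target (Calibration) Spectra; S_clutter: Clutter Spectra)

\[ \mathbf{c} = \mathbf{x}\mathbf{S}(\mathbf{S}^{T}\mathbf{S})^{-1} \]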

Pseudo-inverse is an

orthogonalization!

Equivalent to full-rank

EPO / CPSA model
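
The extended least squares idea can be sketched as a CLS fit in which clutter spectra are appended to the target spectra before taking the pseudo-inverse. The sketch assumes spectra are stored as columns (N variables by K components) and measurements as rows; `els_concentrations` and the simulated data are illustrative only.

```python
import numpy as np

def els_concentrations(x, S_target, S_clutter):
    """CLS estimate with clutter spectra appended to the target spectra.

    x: (M x N) measured spectra as rows.
    S_target: (N x Kt), S_clutter: (N x Kc), spectra as columns.
    Returns the (M x Kt) estimates for the target components only.
    """
    S = np.hstack([S_target, S_clutter])
    c = x @ np.linalg.pinv(S).T            # equals x S (S^T S)^-1 for full-rank S
    return c[:, : S_target.shape[1]]

rng = np.random.default_rng(2)
S_target = rng.random((50, 2))
S_clutter = rng.random((50, 3))
c_true = np.array([[0.7, 1.3]])
x = c_true @ S_target.T + np.array([[0.2, 0.5, 0.1]]) @ S_clutter.T
c_est = els_concentrations(x, S_target, S_clutter)
```

Because the clutter spectra take part in the fit, the target estimates are unaffected by any amount of clutter that lies in the span of S_clutter.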

Extended Multiplicative Scatter

Correction

• EMSC attempts to correct for scatter that

appears in forms other than just linear using

the extended mixture model

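\[ \mathbf{s} = c_{1}\mathbf{s}_{ref} + \mathbf{P}\mathbf{c}_{2} \]

\[ \mathbf{c} = (\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{s} \]

\[ \mathbf{s}_{corrected,P} = (\mathbf{s} - \mathbf{P}\mathbf{c}_{2}) / c_{1} \]

\[ \mathbf{P}_{N \times K} = [\, \mathbf{1} \;\; \boldsymbol{\upsilon} \;\; \boldsymbol{\upsilon}^{2} \;\; \ldots \,], \qquad \mathbf{Z}_{N \times (1+K)} = [\, \mathbf{s}_{ref} \;\; \mathbf{P} \,] \]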

EMSC

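\[ \mathbf{s} = c_{1}\mathbf{s}_{ref} + \mathbf{S}_{A}\mathbf{c}_{2} + \mathbf{P}\mathbf{c}_{3} + \mathbf{Q}\mathbf{c}_{4} \]

\[ \mathbf{c} = (\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{s} \]

\[ \mathbf{s}_{corrected,P,Q} = (\mathbf{s} - \mathbf{P}\mathbf{c}_{3} - \mathbf{Q}\mathbf{c}_{4}) / c_{1} \]

\[ \mathbf{Z}_{N \times (1+J+K+L)} = [\, \mathbf{s}_{ref} \;\; \mathbf{S}_{A} \;\; \mathbf{P} \;\; \mathbf{Q} \,], \qquad \mathbf{c}^{T}_{1 \times (1+J+K+L)} = [\, c_{1} \;\; \mathbf{c}_{2}^{T} \;\; \mathbf{c}_{3}^{T} \;\; \mathbf{c}_{4}^{T} \,] \]

• can add spectra of known target analyte, S_A (N×J)

• can add spectra or basis of clutter, Q (N×L)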
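
A minimal sketch of an EMSC-style correction, assuming the model s = c1·s_ref + P·c2, where the columns of P hold the interference shapes (offset, polynomial terms and, if desired, clutter basis vectors). The function name `emsc_correct`, the Gaussian reference peak and the coefficients are hypothetical. Known analyte spectra are left out for brevity.

```python
import numpy as np

def emsc_correct(s, s_ref, P):
    """EMSC-style correction of one spectrum s (length N).

    s_ref: reference spectrum (length N).
    P: (N x K) interference shapes, e.g. offset, polynomial or clutter terms.
    """
    Z = np.column_stack([s_ref, P])              # Z = [s_ref  P]
    c, *_ = np.linalg.lstsq(Z, s, rcond=None)    # least-squares estimate of c
    c1, c2 = c[0], c[1:]
    return (s - P @ c2) / c1                     # remove fitted shapes, rescale

v = np.linspace(-1.0, 1.0, 60)
s_ref = np.exp(-(v / 0.2) ** 2)                  # hypothetical reference peak
P = np.column_stack([np.ones_like(v), v, v ** 2])
s = 2.0 * s_ref + P @ np.array([0.3, -0.1, 0.05])
s_corrected = emsc_correct(s, s_ref, P)
```

In this noise-free example the corrected spectrum matches the reference, since the gain and the polynomial background are fitted exactly.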

We think it is useful to use Clutter!

Example Classification Data

[Figure: mid-IR spectra plotted from 4000 to 500 on the x-axis; annotation: Using these regions only]

• Mid-IR spectra of food grade oils

• Classify oils, detect adulterated olive oil

PCA Scores Plot of Oils

[Figure panels: Selected regions, mean centering only; GLS α = 1; GLS α = 0.3; GLS α = 0.1; GLS α = 0.03; GLS α = 0.01; GLS α = 0.003. Legend: Olive oil, Corn oil, Safflower oil, Corn margarine]

Calibration with MSC

0.2 0.15 0.1 0.05 0 0.05 0.1 0.15 0.2 0.25 0.3

0.1

0.05

0.05

0.1

0.15

0.2

0.25

Scores on PC 1 (84.99%)

Scores on PC 2 (12.71%)

Samples/Scores Plot of Olive Oil Calibration

Scores on PC 2 (12.71%)

CMarg

Corn

Olive

Saffl

Cal and Test with MSC

0.2 0.15 0.1 0.05 0 0.05 0.1 0.15 0.2 0.25 0.3

0.1

0.05

0.05

0.1

0.15

0.2

0.25

Scores on PC 1 (84.99%)

Scores on PC 2 (12.71%)

Samples/Scores Plot of Olive Oil Calibration & Oiltest,

With MSC and GLS

0.06 0.04 0.02 0 0.02 0.04 0.06 0.08 0.1

0.04

0.02

0.02

0.04

0.06

0.08

0.1

0.12

0.14

Scores on PC 1 (61.84%)

Scores on PC 2 (37.55%)

Samples/Scores Plot of Olive Oil Calibration & Oiltest,

Zoom on Olive Oil

0.06 0.055 0.05 0.045 0.04 0.035 0.03 0.025 0.02 0.015

0.035

0.03

0.025

0.02

0.015

0.01

Scores on PC 1 (61.84%)

Scores on PC 2 (37.55%)

Samples/Scores Plot of Olive Oil Calibration & Oiltest,

Calibration and test Olive Oil

Adulterated Olive Oil

Zoom on Corn and Safflower Oil

[Figure: Samples/Scores Plot of Olive Oil Calibration & Oiltest; Scores on PC 2 (37.55%) vs. Scores on PC 1 (61.84%); legend: Calibration and test Corn Oil, Calibration and test Safflower Oil]

With MSC and EPO

0.2 0.15 0.1 0.05 0 0.05 0.1 0.15 0.2 0.25

0.05

0.05

0.1

0.15

0.2

Scores on PC 1 (88.03%)

Scores on PC 2 (11.52%)

Samples/Scores Plot of Olive Oil Calibration & Oiltest,

Indian Pines Data

• Classic image data set used in many

publications

• Crop area near West Lafayette, Indiana

• Ground truth identified 16 known crop areas

• Data from AVIRIS: Airborne Visible/Infrared

Imaging Spectrometer

• 220 channels, 400-2500nm

Indian Pines Image

Soybean Fields

Soybeans no till

Soybeans min

Soybeans clean

PLS-DA, Mean-Center Only

Class Probability Image

PLS-DA, EPO 1-PC

Class Probability Image

Example Calibration Data

• IDRC-2002 Shootout data

• NIR Transflectance of pharmaceutical tablets

• Goal is to predict assay value

[Figure: Signal Intensity vs. Wavelength (nm), 600 to 1800 nm]

Calibration and Test with MSC & MC

[Figure: Samples/Scores Plot of calibrate_1,c & test_1; Y Predicted 3 assay vs. Y Measured 3 assay, both axes 150 to 240]

R^2 = 0.964

2 Latent Variables

RMSEC = 3.3253

RMSEP = 3.3487

Calibration Bias = 0

Prediction Bias = 0.4224

With MSC, GLS & MC

[Figure: Samples/Scores Plot of calibrate_1,c & test_1; Y Predicted 3 assay vs. Y Measured 3 assay, both axes 150 to 240]

R^2 = 0.984

2 Latent Variables

RMSEC = 2.5171

RMSEP = 2.159

Calibration Bias = 1.1369e-13

Prediction Bias = 0.067298

With MSC, EPO & MC

[Figure: Samples/Scores Plot of calibrate_1,c & test_1; Y Predicted 3 assay vs. Y Measured 3 assay, both axes 150 to 240]

R^2 = 0.979

2 Latent Variables

RMSEC = 3.0015

RMSEP = 2.3951

Calibration Bias = 8.5265e-14

Prediction Bias = 0.18893

Orthogonalization Filters

Filter  | Soft/Hard | Adj. Params    | Clutter source                                     | Improves Prediction?
OSC     | Hard      | # LVs          | Part of X orthogonal to y                          | No, but reduces model complexity
O-PLS   | Hard      | # LVs          | Part of X-model space orthogonal to X’y            | No, but sometimes improves interpretation
MOSC    | Hard      | # PCs          | Part of X orthogonal to y                          | Maybe
CPSA    | Hard      | # PCs          | A priori, includes pathlength adj.                 | Yes
EPO     | Hard      | # PCs          | Classes, y-gradient or a priori                    | Yes
DOP     | Hard      | # PCs          | Synthetic reference samples                        | Yes
GLS     | Soft      | Shrinkage      | Classes, y-gradient or a priori                    | Yes
SBC     | Soft      | # PCs (20?)    | Repeat samples or blanks                           | Yes
EMM     | Hard      | None           | A priori from known interferents, clutter subspace | Yes, CLS model
ELS     | Hard      | # PCs          | Clutter subspace                                   | Yes
PA-CLS  | Hard      | None/# PCs     | Baseline shapes, residuals                         | Yes, CLS model
WLS     | Soft      | Regularization | Noise measurements                                 | Yes

Conclusions

• Main differences between methods are

– How the clutter is defined

– Whether the de-weighting is hard or soft

• Filtering methods are more similar than

published statements might have you believe

• Methods achieve similar results, model

performance generally improved (except O-PLS, OSC)

• Interpretation of filtered results can be

challenging – except OPLS (ideally)