coded by attorneys as responsive or not responsive. Among
the 688,294 documents, 41,739 are responsive and the rest
are not responsive. For each of the responsive documents, a
rationale was annotated by a review attorney as the
justification for coding the document as responsive. In
practical terms, most rationales are contiguous text: words, phrases, sentences, or sections from the reviewed and labeled documents. A few rationales contain words that are the attorney's own comments and do not occur in the documents. Some rationales consist of more than one text snippet, with the snippets occurring in different parts of the document.
Annotated rationales have a mean length of 52 words,
with a standard deviation of 112.5 words. 97.5% of these
rationales have fewer than 250 words. To reduce the effect of outliers, such as very long or very short rationales, we limited our rationales to those that contain 10 or more but fewer than 250 words and that can be precisely located in the data set, resulting in 23,791 responsive documents with annotated rationales in our population. These 23,791 documents established our responsive population, covering 57% of all the responsive documents in the 688,294-document population. In the same proportion, we randomly selected 365,742 documents from the 646,555 not responsive documents in the 688,294 population to define the not responsive population for our experiments.
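As a rough illustration of this filtering step, the Python sketch below keeps a responsive document only if its rationale meets the word-count bounds and can be located verbatim in the document text. The record layout and field names (`text`, `rationale`) are hypothetical, and reading "precisely located" as verbatim containment is our assumption.

```python
def build_responsive_population(responsive_docs):
    """Filter responsive documents to those whose annotated rationale
    has 10 or more but fewer than 250 words and can be located
    verbatim in the document text (our reading of "precisely
    located"). Field names are hypothetical."""
    kept = []
    for doc in responsive_docs:
        n_words = len(doc["rationale"].split())
        if 10 <= n_words < 250 and doc["rationale"] in doc["text"]:
            kept.append(doc)
    return kept
```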
B. Experiment Design
The purpose of these experiments was to study the feasibility of automatically identifying rationales for responsive documents, both with and without annotated rationales available for training.
We conducted two sets of experiments. In both sets of
experiments, we built two types of predictive models, a
document model and a rationale model. A document model
was trained using documents with responsive and not
responsive labels, whereas a rationale model was trained
using responsive and not responsive labeled text snippets. A
responsive labeled snippet was an annotated rationale, while
a not responsive labeled text snippet was a randomly selected
text snippet from a not responsive document. Not responsive
snippets could also be selected from responsive documents,
but we did not fully explore that within the confines of this
study. Rather, we adopted a random process to sample not
responsive snippets from not responsive documents. To
ensure that not responsive snippets have characteristics similar to those of responsive snippets, we enforced two constraints when sampling a not responsive snippet from a not responsive document: (1) the snippet length is a random number of words between 10 and 250, matching the length range of the responsive snippet population; (2) the starting position is a random position between zero and the attorney-reviewed document's length minus the snippet length from the first constraint. One not responsive
text snippet was selected from each not responsive
document.
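A minimal sketch of this sampling procedure, assuming whitespace word tokenization and word-level positions (both our assumptions):

```python
import random

def sample_not_responsive_snippet(document_text):
    """Draw one snippet from a not responsive document under the two
    constraints above: (1) a random length of 10-249 words, matching
    the responsive snippet range; (2) a random start position that
    leaves room for the chosen length."""
    words = document_text.split()     # assumes whitespace tokenization
    length = random.randint(10, 249)  # constraint (1): 10 <= length < 250
    length = min(length, len(words))  # guard for documents shorter than the draw
    start = random.randint(0, len(words) - length)  # constraint (2)
    return " ".join(words[start:start + length])
```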
The first set of experiments evaluated the performance of
both the document and rationale models in classifying
annotated responsive rationales from not responsive snippets
randomly selected from the not responsive documents. In
these experiments, both document and rationale models were
evaluated using a test set comprised of annotated rationales
and randomly selected not responsive snippets. Precision and
recall were used as performance metrics. This first set of experiments was performed to test the models' performance on the text snippets alone and to provide multiple ways to interrogate the results of the modeling methods.
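For concreteness, this evaluation could be carried out as in the following sketch, which assumes a trained scikit-learn classifier `model` and a fitted `vectorizer` (both hypothetical names):

```python
from sklearn.metrics import precision_recall_curve

def snippet_precision_recall(model, vectorizer, snippets, labels):
    """Score a test set of annotated rationales (label 1) and randomly
    selected not responsive snippets (label 0), and trace precision
    and recall across the classifier's decision thresholds."""
    scores = model.predict_proba(vectorizer.transform(snippets))[:, 1]
    return precision_recall_curve(labels, scores)
```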
The second set of experiments simulated a real legal application scenario, applying both the document and rationale models to responsive labeled documents to identify rationales that “explain” the models’ responsive decisions. In these experiments, a responsive document is divided into a set of overlapping snippets, and the models are then applied to these snippets to identify rationales. This raised the question of how the study should decide whether a snippet is a rationale. Our answer rests on basic predictive coding principles: one simple way is to identify the snippet with the largest score in a document and treat it as the rationale for that document. Recall (the percentage of annotated rationales correctly identified) was used as the performance metric. An annotated rationale is correctly identified if it is included in the text snippet with the largest score.
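The section does not specify the window or stride used to create the overlapping snippets, so the values in the sketch below are illustrative placeholders; the largest-score rule itself follows the description above.

```python
def identify_rationale(model, vectorizer, document_text,
                       window=50, stride=25):
    """Split a responsive document into overlapping word windows, score
    each window with the model, and return the highest-scoring snippet
    as the document's rationale. Window and stride are illustrative
    placeholders, not values from the paper."""
    words = document_text.split()
    snippets = [" ".join(words[i:i + window])
                for i in range(0, max(len(words) - window, 0) + 1, stride)]
    scores = model.predict_proba(vectorizer.transform(snippets))[:, 1]
    return snippets[scores.argmax()]
```

Recall is then the fraction of test documents whose annotated rationale falls inside the snippet returned by this rule.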
The machine learning algorithm used to generate the
models was Logistic Regression. Our prior studies
demonstrated that predictive models generated with Logistic
Regression perform very well on legal matter documents [2,
10]. The other modeling parameters were a unigram bag-of-words representation with normalized term frequencies [2]. The results reported in the next section are averaged over fivefold cross-validation.
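The following sketch shows one way to reproduce this configuration with scikit-learn; interpreting "normalized frequency" as L2-normalized term counts (term frequency without IDF) is our assumption, and `texts` and `labels` stand for the labeled training examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

def build_model():
    """Unigram bag of words with normalized term frequencies (no IDF),
    fed to Logistic Regression; the normalization choice is assumed."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1), use_idf=False, norm="l2"),
        LogisticRegression(max_iter=1000),
    )

def average_precision_recall(texts, labels):
    """Average precision and recall over fivefold cross-validation."""
    results = cross_validate(build_model(), texts, labels, cv=5,
                             scoring=("precision", "recall"))
    return results["test_precision"].mean(), results["test_recall"].mean()
```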
C. Results of the Experiments
Figure 1 details the precision and recall curves for the
document and rationale models in discriminating annotated
responsive rationales from not responsive text snippets. The
curves are the average of fivefold cross validation results.
Each of the five document models in the fivefold cross
validation was trained using an 80% random subset of the
23,791 responsive documents and 365,742 not responsive
documents, while each of the five rationale models was
trained using an 80% random subset of the 23,791 annotated
rationales and 365,742 not responsive text snippets. In each
fold, both the document and rationale models were tested
using a 20% random subset of the 23,791 annotated
rationales and 365,742 not responsive text snippets, i.e., on
average, 4,758 annotated rationales and 73,148 not
responsive text snippets. The responsive document rate is
6.5%.
The second set of experiments evaluated the document
and rationale models’ ability to identify rationales of
responsive documents. In these experiments, both the
document and rationale models were the same models that