let’s refer to Figure 1. Considering logarithmic decay, d = log(2), we divide the score by half whenever we move farther away from the headline. Thus, the WHO candidate (p = 0) will be scored with 1 and the WHEN candidate with 0.25 (p = 2).
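Written out, this halving rule corresponds to an exponential decay of the position score with the candidate's sentence distance p from the headline (our formalization of the description above):

\[
\mathrm{PosScore}(p) = e^{-d\,p} = e^{-\log(2)\,p} = 2^{-p},
\]

so that p = 0 yields 1, p = 1 yields 0.5, and p = 2 yields 0.25, matching the WHO and WHEN examples above.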
Type score. Scoring based on candidate type, such as proper or
common noun, date or time, etc., depends on the 5W1H question
being answered. For WHO, it refers to whether the candidate is
a named entity (i.e., a proper noun). For example, if the extracted
candidate for WHO is a named entity, we score it as 1, otherwise it
is scored as 0. For WHEN, type refers to whether the candidate is
a proper date or a vague expression. For WHERE, it refers to the
type of location (e.g., geopolitical entities, geographical locations,
man-made structures, or organizations, which can be used to refer to places in some cases). For WHY and HOW, we score candidates based on whether they are expressed through an NP-VP-NP pattern, a conjunction, or a combination of both.
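As an illustration, type scoring could be approximated with spaCy's named-entity labels; the label sets and the helper below are our own sketch under that assumption, not the authors' implementation.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

# Assumed label sets (spaCy's OntoNotes scheme), not the authors' exact mapping.
WHERE_LABELS = {"GPE", "LOC", "FAC", "ORG"}  # geopolitical, geographic, man-made, organization
WHEN_LABELS = {"DATE", "TIME"}

def type_score(candidate_text, question):
    """Hypothetical type score: 1 if the candidate has the expected type, else 0."""
    ents = nlp(candidate_text).ents
    if question == "WHO":
        # Any named entity (i.e., a proper noun) scores 1.
        return 1.0 if len(ents) > 0 else 0.0
    if question == "WHERE":
        return 1.0 if any(e.label_ in WHERE_LABELS for e in ents) else 0.0
    if question == "WHEN":
        # A concrete date or time expression scores 1; a vague phrase scores 0.
        return 1.0 if any(e.label_ in WHEN_LABELS for e in ents) else 0.0
    return 0.0
```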
Frequency Score. For all questions, except WHY and HOW, we
rank-score candidates by their frequency of occurrence in the article.
The highest frequency candidate is scored as 1. If the candidate is a
named entity, we count all its coreferences, otherwise, we simply
count the raw occurrences. For example, consider “Hawaii” and
“United States” as WHERE candidates for the article in Figure 1.
If we only consider the parts shown in Figure 1, then the article
mentions the first candidate four times and the second candidate
only once. We normalize the counts by the highest frequency and
assign a score of 1 to “Hawaii” and 1/4 to “United States.”
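A minimal sketch of this rank-scoring step; the mention counts are assumed to be computed upstream (with coreferences resolved for named entities), which we abstract away here.

```python
def frequency_scores(candidate_counts):
    """Normalize mention counts by the maximum, so the most frequent candidate scores 1.

    candidate_counts: dict mapping candidate text -> number of mentions in the article.
    """
    max_count = max(candidate_counts.values())
    return {cand: count / max_count for cand, count in candidate_counts.items()}

# WHERE candidates from Figure 1: "Hawaii" is mentioned four times, "United States" once.
print(frequency_scores({"Hawaii": 4, "United States": 1}))
# {'Hawaii': 1.0, 'United States': 0.25}
```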
Precision and Length Score. For WHERE and WHEN, we consider the Precision of the candidate. For example, a date with an exact time is ranked higher than a vague phrase like “election time” and
“London” is ranked higher than “UK” because it is a more precise
location. For WHY and HOW, we consider the Length of the can-
didate. We prefer longer explanations for the cause and method. To
implement this, we count the number of words in the candidate
and divide by the maximum count in all candidates. Moreover, we
add a redundancy penalty if the candidate repeats the answer to
WHAT or if we get the same answer for WHY and HOW.
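A sketch of the length score for WHY and HOW; the multiplicative form and size of the redundancy penalty are our own assumptions (the paper does not specify them), and only the repeated-WHAT case is shown.

```python
def length_scores(candidates, what_answer, penalty=0.5):
    """Score WHY/HOW candidates by word count, normalized by the longest candidate."""
    word_counts = {c: len(c.split()) for c in candidates}
    max_len = max(word_counts.values())
    scores = {}
    for cand, n_words in word_counts.items():
        score = n_words / max_len
        if cand.strip().lower() == what_answer.strip().lower():
            score *= penalty  # assumed redundancy penalty for repeating the WHAT answer
        scores[cand] = score
    return scores
```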
Other Scoring Criteria. For WHEN, we also score candidates by
distance to publication date, preferring dates closer to the publica-
tion date. For WHERE, we score candidates by clustering. We assign
a higher score if a candidate is close to the other candidates. For
example, if most locations are in Germany, then we would assign
less score to a random location in Japan. For HOW, we score candi-
dates by modier frequency, which counts the number of adverbs
and adjectives used by the candidate.
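For instance, the modifier-frequency criterion for HOW could be approximated with spaCy part-of-speech tags; normalizing by the maximum count mirrors the frequency score and is our assumption.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def modifier_count(candidate_text):
    """Count adverbs and adjectives in a HOW candidate."""
    return sum(1 for token in nlp(candidate_text) if token.pos_ in {"ADV", "ADJ"})

def modifier_scores(candidates):
    """Normalize modifier counts by the maximum over all candidates."""
    counts = {c: modifier_count(c) for c in candidates}
    max_count = max(counts.values()) or 1  # avoid division by zero
    return {c: n / max_count for c, n in counts.items()}
```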
4.1.2 Location Scoring of Main Event Descriptors. We assign the
location scores for each main event descriptor using the following
criteria: if an article follows an inverted pyramid structure, it should
provide answers to the 5W1H questions in the OP (see Figure 1).
Thus, if we find the answers there, we assign a high IPS. While the headline and lead are usually one sentence long each [8], the
2nd paragraph can have at most three sentences. We found this
maximum length by analyzing breaking news articles in the data
set. Hence, for the purposes of our estimation, we consider the OP
to be the first 5 sentences of an article. We give a full score if all
5W1H descriptors are contained in the OP. Otherwise, we apply an
exponential penalty by location of each descriptor. More formally,
considering the headline index to be 0, for each descriptor D,
\[
\mathrm{LocScore}(D) =
\begin{cases}
2^{\,4 - \max(4,\, \mathrm{Location}(D))} & \text{if the answer is found,} \\
0 & \text{if the answer is not found.}
\end{cases}
\]
Finally, we obtain a weighted average of all the location scores.
Since HOW and WHY are not necessarily present, and even humans
may have problems extracting them, we assign them a lower weight
than the other descriptors.
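Putting the location-scoring rule together, a minimal sketch could look as follows; the weight values are placeholders, since the paper only states that WHY and HOW receive a lower weight than the other descriptors.

```python
# Hypothetical weights: the paper only says WHY and HOW are down-weighted.
WEIGHTS = {"WHO": 1.0, "WHAT": 1.0, "WHEN": 1.0, "WHERE": 1.0, "WHY": 0.5, "HOW": 0.5}

def loc_score(location):
    """Location score for one descriptor.

    location: sentence index of the answer (headline = 0), or None if not found.
    """
    if location is None:
        return 0.0
    # Full score (2**0 = 1) inside the OP (first 5 sentences), exponential penalty after.
    return 2 ** (4 - max(4, location))

def location_component(descriptor_locations):
    """Weighted average of the per-descriptor location scores."""
    total_weight = sum(WEIGHTS[d] for d in descriptor_locations)
    weighted = sum(WEIGHTS[d] * loc_score(loc) for d, loc in descriptor_locations.items())
    return weighted / total_weight
```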
4.2 Summarization
The second component of the IPS models how well an article is sum-
marized by the OP. By definition, an article following an inverted
pyramid structure must be summarizable by removing everything
except the OP—the headline, lead, and 2nd paragraph. Note how in
Figure 1 the OP contains all relevant information about the news
story. Hence, our generated summary should be similar to the OP.
Thus, we implement our summary similarity module by comparing
the summary of the full article with the OP. First, we summarize the
full article using an extractive summarization algorithm—TextRank.
TextRank ranks the sentences of an article by importance and then uses the top-ranked sentences to build the summary. Next, we compare the full article
summary and the OP by comparing the language representations of
the two. In particular, we do this using spaCy and its pre-trained en_core_web_lg model. This model uses GloVe vectors and was trained with a multi-task CNN on blogs, news, and comments [9].
We average all the word vectors contained in a text to get its final representation. Finally, we compute the summarization score as the cosine similarity between the vector representations of the OP and the summary.
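A minimal sketch of this component, assuming the TextRank implementation from the summa package and spaCy's built-in similarity over averaged GloVe vectors; both are plausible stand-ins rather than necessarily the authors' exact tooling.

```python
import spacy
from summa.summarizer import summarize  # TextRank-based extractive summarizer

nlp = spacy.load("en_core_web_lg")  # averaged GloVe word vectors

def summarization_score(article_text, op_text, ratio=0.2):
    """Compare the TextRank summary of the full article against the OP."""
    summary = summarize(article_text, ratio=ratio)  # `ratio` is an assumed setting
    # Doc.similarity is the cosine similarity of the averaged word vectors.
    return nlp(summary).similarity(nlp(op_text))
```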
5 RESULTS AND DISCUSSION
Here we present our main findings and discussion. We begin by
presenting the evaluation of our main event descriptors extractor.
Next, we report the results on the November 2017 AP News articles,
showing the IPS distributions for breaking and non-breaking news.
5.1 5W1H Extraction
Table 1 shows the evaluation results of our 5W1H method. We
find that our extractor is capable of obtaining the right answers for
the basic 4W with 78% accuracy on average. Out of the four basic
descriptors, our method systematically extracted better results for
WHERE in this data set. This could be attributed to the dateline being explicitly included in AP News articles.
However, for the full main event descriptors we only achieve
67% average accuracy. This reduction in accuracy makes sense
considering the inherent difficulty of extracting the causes and
methods from news articles. Even though the accuracy for WHY
and HOW is still low compared to the other questions, our method
is on par with the state-of-the-art.
As a baseline for comparison, Giveme5W1H gets 0.73 accuracy for all descriptors and 0.82 for the basic 4W on a BBC news data set [6]. However, it is hard to draw a direct comparison because
of differences in the background of the annotators (journalism
students vs IT students) and of data sets (AP News vs BBC).