Chapter 8
8.1
Effectiveness of IR system: consider
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair (referred to as the gold standard or ground truth judgment of relevance)
8.2
The test collections most often used for this purpose:
- The Cranfield collection
- Text Retrieval Conference (TREC)
- NII Text Collections for IR Systems (NTCIR)
- Cross Language Evaluation Forum (CLEF)
- Reuters-21578 and Reuters-RCV1
- 20 Newsgroups
8.3
The straightforward notion of relevant and nonrelevant documents, and the formal evaluation methodology that has been developed for evaluating unranked retrieval results.
Precision and Recall
F measure: the weighted harmonic mean of precision and
recall
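As a quick sketch of these definitions (the function name and document IDs are illustrative, not from the chapter):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Precision, recall, and the weighted harmonic mean F_beta
    (beta=1 gives the balanced F1 measure)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Hypothetical run: 4 documents retrieved, 3 relevant in the collection
p, r, f1 = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
# precision = 2/4, recall = 2/3, F1 = 4/7
```

Setting beta above 1 weights recall more heavily; below 1 weights precision more heavily.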
8.4
Develop measures to evaluate ranked retrieval results
The MAP value for a test collection is the arithmetic mean of the average precision values for the individual information needs.
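A minimal sketch of this computation (function names and rankings are illustrative):

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant
    document; relevant documents never retrieved contribute zero."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of average precision over the information
    needs; runs is a list of (ranking, relevant_set) pairs."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)

# Relevant docs d1 and d3 retrieved at ranks 1 and 3: AP = (1/1 + 2/3)/2 = 5/6
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```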
Another concept used in evaluation is an ROC curve (Receiver Operating Characteristics).
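An ROC curve plots the true positive rate (recall) against the false positive rate as the rank cutoff grows. A small sketch, assuming binary judgments and a known collection size (names are illustrative):

```python
def roc_points(ranking, relevant, collection_size):
    """(FPR, TPR) after each rank cutoff.  TPR is recall; FPR is the
    fraction of all nonrelevant documents retrieved so far."""
    relevant = set(relevant)
    n_nonrel = collection_size - len(relevant)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for doc in ranking:
        if doc in relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_nonrel, tp / len(relevant)))
    return points
```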
For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Gain-based measures such as NDCG then accumulate these scores down the ranked result list, discounting late ranks and normalizing by the gain of an ideal ordering.
8.5
Reliable and informative test collections.
In social sciences, a common measure for agreement between
judges is the kappa statistic. It is designed for categorical judgments and
corrects a simple agreement rate for the rate of chance agreement.
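A sketch of one common variant, Cohen's kappa with per-judge marginals (the chapter's version pools the two judges' marginals, which can give slightly different values; names are illustrative):

```python
def cohen_kappa(judge_a, judge_b):
    """Chance-corrected agreement for two judges' binary relevance
    judgments (1 = relevant, 0 = nonrelevant)."""
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # chance agreement from each judge's own rate of "relevant" judgments
    pa, pb = sum(judge_a) / n, sum(judge_b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    return (p_agree - p_chance) / (1 - p_chance)

# Judges agree on 3 of 4 query-document pairs
k = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # 0.5
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.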
One way to approach measuring this is by using distinct
facts or entities as evaluation units.
8.6
User utility and how it is approximated by the use of
document relevance.
- System issues
- User utility
- Refining a deployed system
8.7
Short summary of the document: two basic kinds of summaries are static and dynamic
What's the value of TREC: is there a gap to jump or a chasm
to bridge?
What is needed now is not further generalisation across information-seeking contexts but context-driven particularisation. The note develops this argument from an analysis of TREC work, applying notions taken from discussions of evaluation for language and information processing in general.
Järvelin and Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems.
Several novel measures that compute the cumulative gain the
user obtains by examining the retrieval result up to a given ranked position.
The first one accumulates the relevance scores of retrieved
documents along the ranked result list. The second one is similar but applies a
discount factor to the relevance scores in order to devaluate late-retrieved
documents. The third one computes the relative-to-the-ideal performance of IR
techniques, based on the cumulative gain they are able to yield. These novel
measures are defined and discussed and their use is demonstrated in a case
study using TREC data.
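The three measures described above can be sketched as follows, using the log2 discount from rank 2 onward (a common choice; exact discount functions vary across implementations):

```python
import math

def cg(gains):
    """Cumulated gain: running sum of relevance scores down the ranking."""
    out, total = [], 0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains):
    """Discounted cumulated gain: the score at rank i >= 2 is divided
    by log2(i), devaluating late-retrieved documents."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g / math.log2(i) if i > 1 else g
        out.append(total)
    return out

def ndcg(gains):
    """DCG relative to the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return [d / i for d, i in zip(dcg(gains), ideal)]

# Relevance scores of the top-5 retrieved documents, e.g. on a 0-3 scale
scores = [3, 2, 3, 0, 1]
```

An ideal ranking (gains already in descending order) yields NDCG of 1.0 at every position; any mis-ordering pushes some positions below 1.0.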




