A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
Chengxiang Zhai
School of Computer Science Carnegie Mellon University
Pittsburgh, PA 15213
John Lafferty
School of Computer Science Carnegie Mellon University
Pittsburgh, PA 15213
ABSTRACT
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
1. INTRODUCTION
The study of information retrieval models has a long history. Over the decades, many different types of retrieval models have been proposed and tested [21]. There has been a great diversity of approaches and methodology developed, rather than a single unified retrieval model that has proven to be most effective; however, the field has progressed in two different ways. On the one hand, theoretical studies of an underlying model have been developed; this direction is, for example, represented by the various kinds of logic models and probabilistic models (e.g., [14, 3, 15, 22]). On the other hand, there have been many empirical studies of models, including many variants of the vector space model (e.g., [17, 18, 19]). In some cases, there have been theoretically motivated models that also perform well empirically; for example, the BM25 retrieval function, motivated by the 2-Poisson probabilistic retrieval model, has proven to be quite effective in practice [16].
Recently, a new approach based on language modeling has been successfully applied to the problem of ad hoc retrieval [13, 1, 10, 5]. The basic idea behind the new approach is extremely simple--estimate a language model for each document, and rank documents by the likelihood of the query according to the language model. Yet this new framework is very promising, because of its foundations in statistical theory, the great deal of complementary work on language modeling in speech recognition and natural language processing, and the fact that very simple language modeling retrieval methods have performed quite well empirically.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'01, September 9-12, 2001, New Orleans, Louisiana, USA. Copyright 2001 ACM 1-58113-331-6/01/0009 ...$5.00.
The term smoothing refers to the adjustment of the maximum likelihood estimator of a language model so that it will be more accurate. At the very least, it is required to not assign a zero probability to unseen words. When estimating a language model based on a limited amount of text, such as a single document, smoothing of the maximum likelihood model is extremely important. Indeed, many language modeling techniques are centered around the issue of smoothing. In the language modeling approach to retrieval, the accuracy of smoothing is directly related to the retrieval performance. Yet most existing research work has assumed one method or another for smoothing, and the smoothing effect tends to be mixed with that of other heuristic techniques. There has been no direct evaluation of different smoothing methods, and it is unclear how the retrieval performance is affected by the choice of a smoothing method and its parameters.
In this paper, we study the problem of language model smoothing in the context of ad hoc retrieval, focusing on the smoothing of document language models. The research questions that motivate this work are (1) how sensitive is retrieval performance to the smoothing of a document language model? (2) how should a smoothing method be selected, and how should its parameters be chosen? We compare several of the most popular smoothing methods that have been developed in speech and language processing, and study the behavior of each method.
Our study leads to several interesting and unanticipated conclusions. We find that the retrieval performance is highly sensitive to the setting of smoothing parameters. In some sense, smoothing is as important to this new family of retrieval models as term weighting is to the traditional models. Interestingly, the effect of smoothing is very sensitive to the type of queries, which suggests that smoothing plays two different roles in the query-likelihood ranking method: one role is to improve the accuracy of the estimated document language model, while the other is to accommodate the generation of common and non-informative words in a query. Different smoothing methods behave somewhat differently: some methods tend to perform better for concise title queries while others tend to perform better for long verbose queries, and some methods are more stable than others, in the sense that their performance is less sensitive to the choice of parameters.
2. THE LANGUAGE MODELING APPROACH
The basic idea of the language modeling approach to information retrieval can be described as follows. We assume that a query q is "generated" by a probabilistic model based on a document d. Given a query q = q1q2...qn and a document d = d1d2...dm, we are interested in estimating the conditional probability p(d | q), i.e., the probability that d generates the observed q. After applying Bayes' formula and dropping a document-independent constant (since we are only interested in ranking documents), we have

p(d | q) ∝ p(q | d) p(d)
As discussed in [1], the righthand side of the above equation has an interesting interpretation, where, p(d) is our prior belief that d is relevant to any query and p(q | d) is the query likelihood given the document, which captures how well the document "fits" the particular query q.
In the simplest case, p(d) is assumed to be uniform, and so does not affect document ranking. This assumption has been taken in most existing work [1, 13, 12, 5, 20]. In other cases, p(d) can be used to capture non-textual information, e.g., the length of a document or links in a web page, as well as other format/style features of a document. In our study, we assume a uniform p(d) in order to focus on the effect of smoothing. See [10] for an empirical study that exploits simple alternative priors.
With a uniform prior, the retrieval model reduces to the calculation of p(q | d), where language modeling comes in. The language model used in most previous work is the unigram model.1 This is the multinomial model which assigns the probability

p(q | d) = Π_i p(qi | d)
Clearly, the retrieval problem is now essentially reduced to a unigram language model estimation problem. In this paper we focus on unigram models only; see [10, 20] for some explorations of bigram and trigram models.
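As a toy illustration of query-likelihood ranking (the documents, query, and the floor constant below are made up for the example), the following sketch scores documents by Σ_i log p(qi | d) under the maximum likelihood unigram model; the ad hoc floor `eps` merely stands in for the smoothing discussed in Section 3, since an unseen query word would otherwise contribute log 0 = -∞:

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, eps=1e-10):
    """log p(q|d) = sum_i log p(q_i|d) under the maximum likelihood
    unigram model, with eps as a crude stand-in for smoothing so that
    unseen query words do not give -inf."""
    counts = Counter(doc)
    n = len(doc)
    return sum(math.log(max(counts[w] / n, eps)) for w in query)

docs = {"d1": "the quick brown fox".split(),
        "d2": "the lazy brown dog".split()}
query = "brown fox".split()
ranked = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]),
                reverse=True)
# d1 contains both query words, so it ranks first
```

Ranking by log-likelihood rather than likelihood is rank-preserving and numerically safer for long queries.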
On the surface, the use of language models appears fundamentally different from vector space models with TF-IDF weighting schemes, because the unigram language model only explicitly encodes term frequency--there appears to be no use of inverse document frequency weighting in the model. However, there is an interesting connection between the language model approach and the heuristics used in the traditional models. This connection has much to do with smoothing, and an appreciation of it helps to gain insight into the language modeling approach.
Most smoothing methods make use of two distributions, a model ps(w | d) used for "seen" words that occur in the document, and a model pu(w | d) for "unseen" words that do not. The probability of a query q can be written in terms
1The work of Ponte and Croft [13] adopts something similar to, but slightly different from the standard unigram model.
of these models as follows, where c(w; d) denotes the count of word w in d.
log p(q | d) = Σ_i log p(qi | d)
             = Σ_{i: c(qi;d)>0} log ps(qi | d) + Σ_{i: c(qi;d)=0} log pu(qi | d)
             = Σ_{i: c(qi;d)>0} log [ps(qi | d) / pu(qi | d)] + Σ_i log pu(qi | d)
The probability of an unseen word is typically taken as being proportional to the general frequency of the word, e.g., as computed using the document collection. So, let us assume that pu(qi | d) = α_d p(qi | C), where α_d is a document-dependent constant and p(qi | C) is the collection language model. Now we have
log p(q | d) = Σ_{i: c(qi;d)>0} log [ps(qi | d) / (α_d p(qi | C))] + n log α_d + Σ_i log p(qi | C)

where n is the length of the query. Note that the last term on the righthand side is independent of the document d, and thus can be ignored in ranking.
Now we can see that the retrieval function can actually be decomposed into two parts. The first part involves a weight for each term common between the query and document (i.e., matched terms), and the second part only involves a document-dependent constant that is related to how much probability mass will be allocated to unseen words, according to the particular smoothing method used. The weight of a matched term qi can be identified as log [ps(qi | d) / (α_d p(qi | C))], which is directly proportional to the document term frequency, but inversely proportional to the collection frequency.
Thus, the use of p(qi | C) as a reference smoothing distribution has turned out to play a role very similar to the well-known IDF. The other component in the formula is just the product of a document-dependent constant and the query length. We can think of it as playing the role of document length normalization, which is another important technique to improve performance in traditional models. Indeed, α_d should be closely related to the document length, since one would expect that a longer document needs less smoothing and thus a smaller α_d; thus a long document incurs a greater penalty than a short one because of this term.

The connection just derived shows that the use of the collection language model as a reference model for smoothing document language models implies a retrieval formula that implements TF-IDF weighting heuristics and document length normalization. This suggests that smoothing plays a key role in the language modeling approaches to retrieval. A more restrictive derivation of the connection was given in [5].
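The decomposed ranking formula above can be sketched directly. In this toy instantiation (the vocabulary, probabilities, and the choice of Jelinek-Mercer-style components ps and α_d are all made up for illustration), the score equals the full log-likelihood minus the document-independent Σ_i log p(qi | C):

```python
import math

def decomposed_score(query, doc_counts, alpha_d, p_seen, p_coll):
    """Rank-equivalent score: a weight log[ps(qi|d) / (alpha_d p(qi|C))]
    for each matched term, plus n * log(alpha_d); the document-independent
    collection term is dropped."""
    matched = sum(math.log(p_seen(w) / (alpha_d * p_coll[w]))
                  for w in query if doc_counts.get(w, 0) > 0)
    return matched + len(query) * math.log(alpha_d)

# Made-up toy instantiation using linear-interpolation components:
lam = 0.5
doc_counts, doc_len = {"a": 2, "b": 1}, 3
p_coll = {"a": 0.5, "b": 0.3, "c": 0.2}
p_seen = lambda w: (1 - lam) * doc_counts[w] / doc_len + lam * p_coll[w]
score = decomposed_score(["a", "c"], doc_counts, lam, p_seen, p_coll)
```

Only matched terms contribute weights, which is what makes inverted-index scoring efficient.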
3. SMOOTHING METHODS
As described above, our goal is to estimate p(w | d), a unigram language model based on a given document d. The simplest method is the maximum likelihood estimate, given simply by relative counts:

pml(w | d) = c(w; d) / Σ_w c(w; d)
Method            | ps(w | d)                                                    | α_d                   | Parameter
Jelinek-Mercer    | (1 - λ) pml(w | d) + λ p(w | C)                              | λ                     | λ
Dirichlet         | (c(w; d) + μ p(w | C)) / (Σ_w c(w; d) + μ)                   | μ / (Σ_w c(w; d) + μ) | μ
Absolute discount | max(c(w; d) - δ, 0) / Σ_w c(w; d) + (δ |d|u / |d|) p(w | C)  | δ |d|u / |d|          | δ

Table 1: Summary of the three primary smoothing methods compared in this paper.
However, the maximum likelihood estimator will generally under-estimate the probability of any word unseen in the document, and so the main purpose of smoothing is to assign a non-zero probability to the unseen words and improve the accuracy of word probability estimation in general.
There are many smoothing methods that have been proposed, mostly in the context of speech recognition tasks [2]. In general, all smoothing methods are trying to discount the probabilities of the words seen in the text, and to then assign the extra probability mass to the unseen words according to some "fallback" model. For information retrieval, it makes much sense, and is very common, to exploit the collection language model as the fallback model. Following [2], we assume the general form of a smoothed model to be the following:
p(w | d) = { ps(w | d)      if word w is seen
           { α_d p(w | C)   otherwise

where ps(w | d) is the smoothed probability of a word seen in the document, p(w | C) is the collection language model, and α_d is a coefficient controlling the probability mass assigned to unseen words, so that all probabilities sum to one. In general, α_d may depend on d. Indeed, if ps(w | d) is given, we must have

α_d = (1 - Σ_{w: c(w;d)>0} ps(w | d)) / (1 - Σ_{w: c(w;d)>0} p(w | C))
Thus, individual smoothing methods essentially differ in their choice of ps(w | d).
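The normalization constraint above can be sketched directly. The dictionaries below are hypothetical stand-ins: `ps_seen` maps each word seen in d to ps(w | d), and `p_coll` maps every vocabulary word to p(w | C):

```python
def alpha_d(ps_seen, p_coll):
    """alpha_d = (1 - sum over seen words of ps(w|d))
               / (1 - sum over seen words of p(w|C)),
    the coefficient that makes the smoothed model sum to one."""
    seen_ps = sum(ps_seen.values())
    seen_pc = sum(p_coll[w] for w in ps_seen)
    return (1.0 - seen_ps) / (1.0 - seen_pc)
```

As a sanity check, plugging in linearly interpolated ps values with coefficient λ recovers α_d = λ, as listed in Table 1.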
A smoothing method may be as simple as adding an extra count to every word (called additive, or Laplace smoothing), or more sophisticated as in Katz smoothing, where words of different count are treated differently. However, because a retrieval task typically requires efficient computations over a large collection of documents, our study is constrained by the efficiency of the smoothing method. We selected three representative methods that are popular and relatively efficient to implement. We excluded some well-known methods, such as Katz smoothing [7] and Good-Turing estimation [4], because of the efficiency constraint2. Although the methods we evaluated are simple, the issues that they bring to light are relevant to more advanced methods. The three methods are described below.
The Jelinek-Mercer method. This method involves a linear interpolation of the maximum likelihood model with the collection model, using a coefficient λ to control the influence of each model:

pλ(w | d) = (1 - λ) pml(w | d) + λ p(w | C)    (1)
2They involve the count of words with the same frequency in a document, which is expensive to compute.
Thus, this is a simple mixture model (but we preserve the name of the more general Jelinek-Mercer method which involves deleted-interpolation estimation of linearly interpolated n-gram models).
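A minimal sketch of equation (1); the tiny document and collection model below are made up for illustration:

```python
from collections import Counter

def jelinek_mercer(doc, p_coll, lam):
    """p_lam(w|d) = (1 - lam) * pml(w|d) + lam * p(w|C), computed for
    every vocabulary word; a larger lam means more smoothing."""
    counts = Counter(doc)
    n = len(doc)
    return {w: (1 - lam) * counts[w] / n + lam * pc
            for w, pc in p_coll.items()}
```

Note that an unseen word w receives exactly λ p(w | C), matching α_d = λ in Table 1.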
Bayesian smoothing using Dirichlet priors. A language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution with parameters

(μ p(w1 | C), μ p(w2 | C), . . . , μ p(wn | C))

Thus, the model is given by

pμ(w | d) = (c(w; d) + μ p(w | C)) / (Σ_w c(w; d) + μ)    (2)
The Laplace method is a special case of this technique.
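A minimal sketch of equation (2), again with a made-up collection model; μ acts as a pseudo-count "prior sample size":

```python
from collections import Counter

def dirichlet(doc, p_coll, mu):
    """p_mu(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu); unseen words get
    mu * p(w|C) / (|d| + mu), i.e. alpha_d = mu / (|d| + mu)."""
    counts = Counter(doc)
    n = len(doc)
    return {w: (counts[w] + mu * pc) / (n + mu) for w, pc in p_coll.items()}
```

Because α_d = μ / (|d| + μ), longer documents are automatically smoothed less, a property discussed in Section 5.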
Absolute discounting. The idea of the absolute discounting method is to lower the probability of seen words by subtracting a constant from their counts [11]. It is similar to the Jelinek-Mercer method, but differs in that it discounts the seen word probability by subtracting a constant instead of multiplying it by (1 - λ). The model is given by

pδ(w | d) = max(c(w; d) - δ, 0) / Σ_w c(w; d) + σ p(w | C)    (3)

where δ ∈ [0, 1] is a discount constant and σ = δ |d|u / |d|, so that all probabilities sum to one. Here |d|u is the number of unique terms in document d, and |d| is the total count of words in the document, so that |d| = Σ_w c(w; d).

The three methods are summarized in Table 1 in terms of ps(w | d) and α_d in the general form. It is easy to see that a larger parameter value means more smoothing in all cases.
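A minimal sketch of equation (3), with the same made-up collection model as in the previous sketches:

```python
from collections import Counter

def absolute_discount(doc, p_coll, delta):
    """p_delta(w|d) = max(c(w;d) - delta, 0)/|d| + sigma * p(w|C),
    where sigma = delta * |d|_u / |d| redistributes the total discounted
    mass (delta per unique seen term) over the collection model."""
    counts = Counter(doc)
    n = len(doc)
    sigma = delta * len(counts) / n
    return {w: max(counts[w] - delta, 0) / n + sigma * pc
            for w, pc in p_coll.items()}
```

Each unique seen term gives up exactly δ counts, so the redistributed mass σ grows with the number of unique terms |d|u.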
Retrieval using any of the three methods can be implemented very efficiently when the smoothing parameter is given in advance. The α_d's can be pre-computed for all documents at index time. The weight of a matched term w can be computed easily based on the collection language model p(w | C), the query term frequency c(w; q), the document term frequency c(w; d), and the smoothing parameters. Indeed, the scoring complexity for a query q is O(k |q|), where |q| is the query length and k is the average number of documents in which a query term occurs. This is as efficient as scoring using a TF-IDF model.
4. EXPERIMENTAL SETUP
Our goal is to study the behavior of individual smoothing methods as well as to compare different methods. As is well-known, the performance of a retrieval algorithm may vary significantly according to the testing collection used. It is generally desirable to have larger collections and more queries. We use the following five databases from TREC, including three of the largest testing collections for ad hoc retrieval, i.e., the official TREC7 ad hoc, TREC8 ad hoc, and
[Figure: precision (top) and recall (bottom) of Jelinek-Mercer smoothing plotted against λ, for the small collections (fbis, ft, la) and the large collections (trec7, trec8, web8), with title (T) and long (L) queries.]

Figure 1: Performance of Jelinek-Mercer smoothing.
Document collection | Topics 351-400 (TREC7)  | Topics 401-450 (TREC8)
                    | Title     Long          | Title     Long
FBIS                | fbis7T    fbis7L        | fbis8T    fbis8L
FT                  | ft7T      ft7L          | ft8T      ft8L
LA                  | la7T      la7L          | la8T      la8L
TREC7&8             | trec7T    trec7L        | trec8T    trec8L
WEB                 | N/A       N/A           | web8T     web8L

Table 2: Labels used for test collections.
TREC8 web track testing collections: (1) Financial Times on disk 4, (2) FBIS on disk 5, (3) Los Angeles Times on disk 5, (4) Disk 4 and disk 5 minus CR, used for the TREC7 and TREC8 ad hoc tasks, and (5) the TREC8 Web data.
The queries we use are topics 351-400 (used for the TREC7 ad hoc task), and topics 401-450 (used for the TREC8 ad hoc and web tasks). In order to study the possible interaction of smoothing and query length/type, we use two different versions of each set of queries: (1) title only, (2) long version (title + description + narrative). The title queries are mostly two or three key words, whereas the long queries have whole sentences and are much more verbose.
In all our experiments, the only tokenization applied is stemming with a Porter stemmer. We deliberately indexed all the words in the language, since we do not want to be biased by any artificial choice of stop words and we believe that the effects of stop word removal should be better achieved by exploiting language modeling techniques.
In Table 2 we give the labels used for all possible retrieval testing collections, based on the databases and queries described above.
For each smoothing method and on each testing collection, we experiment with a wide range of parameter values. In each run, the smoothing parameter is set to the same value across all queries and documents. (While it is certainly possible to set the parameters differently for individual queries and documents through some kind of training procedure, this is outside the scope of the present paper.) For the purpose of studying the behavior of an individual smoothing method, we select a set of representative parameter values and examine the sensitivity of precision and recall to the variation in these values. For the purpose of comparing smoothing methods, we first optimize the performance of each method using the non-interpolated average precision as the optimization criterion, and then compare the best runs from each method. The optimal parameter is determined by searching over the entire parameter space.3
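An iteratively focused search of this kind can be sketched as follows; `evaluate` is an assumed callable returning the objective (e.g., non-interpolated average precision for one parameter setting), and the grid size, number of rounds, and zoom rule are illustrative choices, not the paper's exact procedure:

```python
def search_parameter(evaluate, lo, hi, steps=5, rounds=4, tol=0.01):
    """Each round scans an even grid over [lo, hi], then zooms in around
    the best point; stop once the round-to-round relative improvement in
    the objective falls below tol (1%)."""
    best_x, best_y = None, float("-inf")
    prev = float("-inf")
    for _ in range(rounds):
        step = (hi - lo) / steps
        grid = [lo + i * step for i in range(steps + 1)]
        x, y = max(((g, evaluate(g)) for g in grid), key=lambda t: t[1])
        if y > best_y:
            best_x, best_y = x, y
        if prev > float("-inf") and best_y - prev < tol * abs(prev):
            break
        prev = best_y
        lo, hi = max(lo, x - step), min(hi, x + step)
    return best_x, best_y
```

Such a zooming grid search assumes the objective is reasonably smooth in the parameter, which the sensitivity curves in Section 5 suggest.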
5. BEHAVIOR OF INDIVIDUAL METHODS
In this section, we study the behavior of each smoothing method. We first derive the expected influence of the smoothing parameter on the term weighting and document length normalization implied by the corresponding retrieval function. Then, we examine the sensitivity of retrieval performance by plotting the non-interpolated precision and recall at 1,000 documents against the different values of the smoothing parameter.
Jelinek-Mercer smoothing. When using the Jelinek-Mercer smoothing method with a fixed λ, we see that the parameter α_d in our ranking function (see Section 2) is the same for all documents, so the length normalization term is a constant. This means that the score can be interpreted as a sum of weights over each matched term. The term weight is log(1 + (1 - λ) pml(qi | d) / (λ p(qi | C))). Thus, a small λ means more emphasis on relative term weighting. Indeed, if λ approaches one, then all term weights tend to zero, and the scoring formula approaches coordination level matching, which is simply the count of matched terms.

3The search is performed in an iterative way, such that each iteration is more focused than the previous one. We stop searching when the improvement in average precision is less than 1%.

[Figure: per-topic optimal value and optimal range of λ for Jelinek-Mercer on trec8T and trec8L.]

Figure 2: Optimal range for trec8T (left) and trec8L (right) in Jelinek-Mercer smoothing. The line shows the optimal value of λ and the bars are the optimal ranges.
The plots in Figure 1 show the average precision and recall for different settings of λ, for both large and small collections. It is evident that both precision and recall are much more sensitive to λ for long queries than for title queries. The web collection, however, is an exception, where performance is very sensitive to smoothing even for title queries. For title queries, the retrieval performance tends to be optimized when λ is small (around 0.1), whereas for long queries, the optimal point is generally higher, usually around 0.7. The difference in the optimal λ value suggests that long queries need more smoothing, and less emphasis is placed on the relative weighting of terms. The right end of the curve (when λ is very close to one) should be close to the performance of coordination level matching.

Such sensitivity and trends can be seen more clearly from the per-topic plot of the optimal range of λ shown in Figure 2. The optimal range is defined as the maximum range of λ values that deviate from the optimal average precision by no more than 0.01.
Dirichlet priors. When using the Dirichlet prior for smoothing, we see that the α_d in the retrieval formula is document-dependent. It is smaller for long documents, so it can be interpreted as a length normalization component that penalizes long documents. The weight for a matched term is now log(1 + c(qi; d) / (μ p(qi | C))). Note that in the Jelinek-Mercer method, the term weight has a document length normalization implicit in ps(qi | d), but here the term weight is affected only by the raw counts of a term, not the length of the document. After rewriting the weight as log(1 + |d| pml(qi | d) / (μ p(qi | C))), we see that |d|/μ is playing the same role as (1 - λ)/λ, but differs in that it is document-dependent. The relative weighting of terms is emphasized when we use a smaller μ. As μ gets large, α_d tends to 1, and all term weights tend to zero. Thus, the scoring formula again tends to that of coordination level matching.
The plots in Figure 3 show the average precision and recall for different settings of the prior sample size μ. It is again clear that both precision and recall are much more sensitive to μ for long queries than for title queries, especially when μ is small. However, the optimal value of μ does not seem to be much different for title queries and long queries. While it still tends to be slightly larger for long queries, the difference is not as large as in Jelinek-Mercer. The optimal prior μ seems to vary from collection to collection, though in most cases it is around 2,000. The tail of the curves is generally flat, as it tends to the performance of coordination level matching.
Absolute discounting. The term weighting behavior of the absolute discounting method is a little more complicated. Obviously, here α_d is also document-sensitive. It is larger for a document with a flatter distribution of words, i.e., when the count of unique terms is relatively large. Thus, it penalizes documents with a word distribution highly concentrated on a small number of words. The weight of a matched term is log(1 + (c(qi; d) - δ) / (δ |d|u p(qi | C))). The influence of δ on relative term weighting depends on |d|u and p(· | C), in the following way. If |d|u p(w | C) > 1, a larger δ will make term weights flatter, but otherwise, it will actually make the term weight more skewed according to the count of the term in the document. Thus, a larger δ will amplify the weight difference for rare words, but flatten the difference for common words, where the "rarity" threshold is p(w | C) < 1/|d|u.
The plots in Figure 4 show the average precision and recall for different settings of the discount constant δ. Once again it is clear that both precision and recall are much more sensitive to δ for long queries than for title queries. Similar to Bayesian smoothing, but different from Jelinek-Mercer smoothing, the optimal value of δ does not seem to be much different for title queries and long queries. Indeed, the optimal value of δ tends to be around 0.7. This is true not only for both title queries and long queries, but also across all testing collections.
The behavior of each smoothing method indicates that, in general, the performance of longer queries is much more sensitive to the choice of the smoothing parameters than that of title queries. This suggests that smoothing plays a more important role for long verbose queries than for title queries that are extremely concise. One interesting observation is that the web collection behaved quite differently from the other databases for Jelinek-Mercer and Dirichlet smoothing, but not for absolute discounting. In particular, the title queries performed much better than the long queries on the web collection for the Dirichlet prior. Further analysis and evaluation are needed to understand this observation.
[Figure: precision and recall of Dirichlet prior smoothing plotted against the prior sample size μ, for the small and large collections, with title and long queries.]

Figure 3: Performance of Dirichlet smoothing.
6. COMPARISON OF METHODS
To compare the three smoothing methods, we select a best run (in terms of non-interpolated average precision) for each method on each testing collection, and compare the non-interpolated average precision, precision at 10 documents, and precision at 20 documents of the selected runs. The results are shown in Table 3 for both title and long queries.
For title queries, there seems to be a clear order among the three methods in terms of all three precision measures: Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer. Indeed, Dirichlet prior has the best average precision in all cases but one. In particular, it performed extremely well on the Web collection, significantly better than the other two. The good performance is relatively insensitive to the choice of μ. Indeed, many non-optimal Dirichlet runs are also significantly better than the optimal runs for Jelinek-Mercer and absolute discounting.
For long queries, there is also a partial order. On average, Jelinek-Mercer is better than Dirichlet and absolute discounting by all three precision measures, though its average precision is almost identical to that of Dirichlet. Both Jelinek-Mercer and Dirichlet clearly have a better average precision than absolute discounting.
When comparing each method's performance on different types of queries, we see that the three methods all perform better on long queries than on title queries (except that Dirichlet prior performs worse on long queries than title queries on the web collection), but the increase of performance is most significant for Jelinek-Mercer. Indeed, Jelinek-Mercer is the worst for title queries, but the best for long queries. It appears that Jelinek-Mercer is much more effective when queries are more verbose.
Since the TREC7&8 database differs from the combined set of FT, FBIS, LA by only the Federal Register database, we can also compare the performance of a method on the three smaller databases with that on the large one. We find that the non-interpolated average precision on the large database is generally much worse than that on the smaller ones, and is often similar to the worst one among all three small databases. However, the precision at 10 (or 20) documents on large collections is all significantly better than that on small collections. This is not surprising, since a given precision at a cutoff of 10 documents corresponds to a much lower level of recall for a large collection than for a small collection. This is exactly the same reason why one can expect a higher precision at 10 documents when a query has many more relevant documents. For both title queries and long queries, the relative performance of each method tends to remain the same when we merge the databases. Interestingly, the optimal setting for the smoothing parameters seems to stay within a similar range when databases are merged.
The strong correlation between the effect of smoothing and the type of queries is somewhat unexpected. If the purpose of smoothing is only to improve the accuracy in estimating a unigram language model based on a document, then the effect of smoothing should be more affected by the characteristics of documents and the collection, and should be relatively insensitive to the type of queries. But the results above suggest that this is not the case. One possible explanation is that smoothing actually plays two different roles in the query likelihood retrieval method. One role is to improve the accuracy of the estimated document language model, and can be referred to as the estimation role. The
[Figure: precision and recall of absolute discounting plotted against the discount constant δ, for the small and large collections, with title and long queries.]

Figure 4: Performance of absolute discounting.
other is to "explain" the common and non-informative words in a query, and can be referred to as the role of query modeling. Indeed, this second role is explicitly implemented with a two-state HMM in [10]. This role is also well-supported by the connection of smoothing and IDF weighting derived in Section 2. Intuitively, more smoothing would decrease the "discrimination power" of common words in the query, because all documents will rely more on the collection language model to generate the common words.
The effect of smoothing that we observed is generally a mixed effect of both roles of smoothing. But, for title queries, the effect is more dominated by the estimation role, since in our experiments the title queries have few or no non-informative common words, whereas for long queries, the effect is more influenced by the role of query modeling for they often have many non-informative common words. Thus, the fact that the Dirichlet prior method performs the best on title queries suggests that it is good for the estimation role, and the fact that Jelinek-Mercer performs the worst for title queries, but the best for long queries suggests that Jelinek-Mercer is good for the role of query modeling. Intuitively this also makes sense, as Dirichlet prior adapts to the length of documents naturally, which is desirable for the estimation role, while in Jelinek-Mercer, we set a fixed smoothing parameter across all documents, which is necessary for query modeling.
7. INTERPOLATION VS. BACKOFF
The three methods that we have described and tested so far belong to the category of interpolation-based methods, in which we discount the counts of the seen words and share the extra counts between seen and unseen words. One problem with this approach is that a high-count word may actually end up with more than its actual count in the document, if it is frequent in the fallback model. An alternative smoothing strategy is "backoff." Here the main idea is to trust the maximum likelihood estimate for high-count words, and to discount and redistribute mass only for the less common terms. As a result, it differs from the interpolation strategy in that the extra counts are used primarily for unseen words. The Katz smoothing method is a well-known backoff method [7]. The backoff strategy is very popular in speech recognition tasks.
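The over-counting problem mentioned above is easy to verify numerically. The following sketch (our illustration; the document and collection statistics are invented) computes the "effective count" |d| · p_s(w) implied by Jelinek-Mercer interpolation, which exceeds the word's actual count whenever the word is sufficiently frequent in the collection model:

```python
def effective_count(c, doc_len, p_coll, lam):
    """Count implied by the smoothed probability, |d| * p_s(w), under
    Jelinek-Mercer: p_s(w) = (1-lam) * c/|d| + lam * p(w|C)."""
    p_s = (1 - lam) * c / doc_len + lam * p_coll
    return doc_len * p_s

# A word seen 5 times in a 100-word document, but very common in the
# collection (p(w|C) = 0.08): interpolation *raises* its count.
print(effective_count(5, 100, 0.08, lam=0.5))    # 6.5 > 5

# The same word with a rare collection probability is discounted as intended.
print(effective_count(5, 100, 0.0001, lam=0.5))  # 2.505 < 5
```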
Following [2], we implemented a backoff version of each of the three interpolation-based methods, derived as follows. Recall that in all three methods, p_s(w) is written as the sum of two parts: (1) a discounted maximum likelihood estimate, which we denote by p_dml(w); and (2) a collection language model term, α_d p(w | C). If we use only the first term for p_s(w) and renormalize the probabilities, we obtain a smoothing method that follows the backoff strategy. It is not hard to show that if an interpolation-based smoothing method is characterized by

    p_s(w) = p_dml(w) + α_d p(w | C)   and   p_u(w) = α_d p(w | C),

then the backoff version is given by

    p_s(w) = p_dml(w)   and   p_u(w) = α_d p(w | C) / (1 − Σ_{w_i : c(w_i; d) > 0} p(w_i | C)).

The form of the ranking formula and the smoothing parameters remain the same. It is easy to see that the α_d in the backoff version differs from that in the interpolation version by a document-dependent term which further penalizes long documents. The weight of a matched term due to backoff smoothing has a much wider range of values ((−∞, +∞)) than that for interpolation ((0, +∞)). Thus, analytically,
Title queries:

Collection | Jelinek-Mercer: avgpr, pr@10d, pr@20d (λ) | Dirichlet prior: avgpr, pr@10d, pr@20d (μ) | Absolute discounting: avgpr, pr@10d, pr@20d (δ)
fbis7T | 0.172, 0.284, 0.220 (0.05) | 0.197, 0.282, 0.238 (2000)  | 0.177, 0.284, 0.233 (0.8)
ft7T   | 0.199, 0.263, 0.195 (0.5)  | 0.236, 0.283, 0.213 (4000)  | 0.215, 0.271, 0.196 (0.8)
la7T   | 0.179, 0.238, 0.205 (0.4)  | 0.220, 0.294, 0.233 (2000)  | 0.194, 0.268, 0.216 (0.8)
fbis8T | 0.306, 0.344, 0.282 (0.01) | 0.334, 0.367, 0.292 (500)   | 0.319, 0.363, 0.288 (0.5)
ft8T   | 0.310, 0.359, 0.283 (0.3)  | 0.324, 0.367, 0.297 (800)   | 0.326, 0.367, 0.296 (0.7)
la8T   | 0.231, 0.264, 0.211 (0.2)  | 0.258, 0.271, 0.216 (500)   | 0.238, 0.282, 0.224 (0.8)
trec7T | 0.167, 0.366, 0.315 (0.3)  | 0.186, 0.412, 0.342 (2000)  | 0.172, 0.382, 0.333 (0.7)
trec8T | 0.239, 0.438, 0.378 (0.2)  | 0.256, 0.448, 0.398 (800)   | 0.245, 0.466, 0.406 (0.6)
web8T  | 0.243, 0.348, 0.293 (0.01) | 0.294, 0.448, 0.374 (3000)  | 0.242, 0.370, 0.323 (0.7)
Avg.   | 0.227, 0.323, 0.265        | 0.256, 0.352, 0.289         | 0.236, 0.339, 0.279

Long queries:

Collection | Jelinek-Mercer: avgpr, pr@10d, pr@20d (λ) | Dirichlet prior: avgpr, pr@10d, pr@20d (μ) | Absolute discounting: avgpr, pr@10d, pr@20d (δ)
fbis7L | 0.224, 0.339, 0.279 (0.7) | 0.232, 0.313, 0.249 (5000)  | 0.185, 0.321, 0.259 (0.6)
ft7L   | 0.279, 0.331, 0.244 (0.7) | 0.281, 0.329, 0.248 (2000)  | 0.249, 0.317, 0.236 (0.8)
la7L   | 0.264, 0.350, 0.286 (0.7) | 0.265, 0.354, 0.285 (2000)  | 0.251, 0.340, 0.279 (0.7)
fbis8L | 0.341, 0.349, 0.283 (0.5) | 0.347, 0.349, 0.290 (2000)  | 0.343, 0.356, 0.274 (0.7)
ft8L   | 0.375, 0.427, 0.320 (0.8) | 0.347, 0.380, 0.297 (2000)  | 0.351, 0.398, 0.309 (0.8)
la8L   | 0.290, 0.296, 0.238 (0.7) | 0.277, 0.282, 0.231 (500)   | 0.267, 0.287, 0.222 (0.6)
trec7L | 0.222, 0.476, 0.401 (0.8) | 0.224, 0.456, 0.383 (3000)  | 0.204, 0.460, 0.396 (0.7)
trec8L | 0.265, 0.504, 0.434 (0.8) | 0.260, 0.484, 0.400 (2000)  | 0.248, 0.518, 0.428 (0.8)
web8L  | 0.259, 0.422, 0.348 (0.5) | 0.275, 0.410, 0.343 (10000) | 0.253, 0.414, 0.333 (0.6)
Avg.   | 0.280, 0.388, 0.315       | 0.279, 0.373, 0.303         | 0.261, 0.379, 0.304

Table 3: Comparison of smoothing methods on title queries (top) and long queries (bottom). The three numbers in each cell are average precision, precision at 10 documents, and precision at 20 documents. The parameter chosen for each data set and method is shown in parentheses.
[Figure: three panels plotting precision against the smoothing parameter (λ for Jelinek-Mercer, the prior μ for the Dirichlet prior, and δ for absolute discounting), each comparing the interpolation and backoff versions on fbis8T and fbis8L.]
Figure 5: Interpolation versus backoff for Jelinek-Mercer (top), Dirichlet smoothing (middle), and absolute discounting (bottom).
the backoff version tends to do term weighting and document length normalization more aggressively than the corresponding interpolated version.
The backoff strategy and the interpolation strategy are compared for all three methods using the FBIS database and topics 401-450 (i.e., fbis8T and fbis8L). The results are shown in Figure 5. We find that the backoff performance is more sensitive to the smoothing parameter than that of interpolation, especially for Jelinek-Mercer and the Dirichlet prior. The difference is clearly less significant for the absolute discounting method, and this may be due to its lower upper bound (|d|_u / |d|) on the original α_d, which restricts the aggressiveness in penalizing long documents. In general, the backoff strategy gives worse performance than the interpolation strategy, and only comes close to it when δ approaches zero; this is expected, since analytically we know that as δ approaches zero, the difference between the two strategies diminishes.
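The interpolation-to-backoff conversion described in this section can be sketched as follows. This is a minimal illustration using Jelinek-Mercer, where p_dml(w) = (1 − λ) p_ml(w|d) and α_d = λ; the toy vocabulary, counts, and collection model are invented. The key point is that both versions remain proper probability distributions, with the backoff version shifting all of the extra mass onto unseen words:

```python
def jm_models(counts, p_coll, lam):
    """Interpolated and backoff Jelinek-Mercer models over a vocabulary.

    counts: word -> count in the document (seen words only)
    p_coll: word -> p(w|C), a distribution over the whole vocabulary
    Returns (p_interp, p_backoff), each a word -> probability dict.
    """
    doc_len = sum(counts.values())
    # Collection mass on seen words: sum over {w : c(w; d) > 0} of p(w|C)
    seen_coll_mass = sum(p_coll[w] for w in counts)
    p_interp, p_backoff = {}, {}
    for w, pc in p_coll.items():
        c = counts.get(w, 0)
        if c > 0:  # seen word
            p_interp[w] = (1 - lam) * c / doc_len + lam * pc
            p_backoff[w] = (1 - lam) * c / doc_len            # p_dml(w) only
        else:      # unseen word
            p_interp[w] = lam * pc                            # alpha_d * p(w|C)
            p_backoff[w] = lam * pc / (1 - seen_coll_mass)    # renormalized
    return p_interp, p_backoff

p_coll = {"the": 0.5, "model": 0.2, "smoothing": 0.2, "zebra": 0.1}
counts = {"the": 6, "smoothing": 4}        # a 10-word "document"
pi, pb = jm_models(counts, p_coll, lam=0.3)
print(sum(pi.values()), sum(pb.values()))  # both sum to 1 (up to float error)
```

Comparing the two outputs also shows the behavior discussed above: each seen word's probability is strictly smaller under backoff (it loses its collection term), while each unseen word's probability is strictly larger (it absorbs the redistributed mass).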
8. CONCLUSIONS AND FUTURE WORK
We have studied the problem of language model smoothing in the context of information retrieval. By rewriting the query-likelihood retrieval model using a smoothed document language model, we derived a general retrieval formula where the smoothing of the document language model can be interpreted in terms of several heuristics used in traditional models, including TF-IDF weighting and document length normalization. We then examined three popular interpolation-based smoothing methods (Jelinek-Mercer method, Dirichlet priors, and absolute discounting), as well as their backoff versions, and evaluated them using several large and small TREC retrieval testing collections. We find that the retrieval performance is generally sensitive to the smoothing parameters, suggesting that an understanding and appropriate setting of smoothing parameters is very important in the language modeling approach. An interesting observation is that the effect of smoothing is strongly correlated with the type of queries. The performance is generally