c******n 发帖数: 4965 | 1 【 以下文字转载自 SiliconValleyClub 俱乐部 】
发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub
标 题: a question on search engine/Inverted Index (转载)
发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东)
【 以下文字转载自 CS 讨论区 】
发信人: creation (努力自由泳50m/45sec !), 信区: CS
标 题: a question on search engine/Inverted Index
发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东)
if I search
cat dog
inside the engine, 2 doclists are returned, each sorted in docId, one for
"dog", one for "cat", then the lists are merged. then PageRank is carried
out over the intersection of the 2 sets. is this correct? my question is:
since both doclists can be very long, and their intersection list could
also be very long, each time user does a query, maybe only 100 of the
intersection docIds are finally used, so we have to sort unnecessarily
each time??
how is this solved exactly in google?
thanks | w*********m 发帖数: 4740 | 2 now most search engines use OR search, not AND search. fast search
algorithms are used to return top N docs, and then rerank
【在 c******n 的大作中提到】 : 【 以下文字转载自 SiliconValleyClub 俱乐部 】 : 发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub : 标 题: a question on search engine/Inverted Index (转载) : 发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东) : 【 以下文字转载自 CS 讨论区 】 : 发信人: creation (努力自由泳50m/45sec !), 信区: CS : 标 题: a question on search engine/Inverted Index : 发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东) : if I search : cat dog
| c******n 发帖数: 4965 | 3 thanks, but
if I search 2 words, one common , one very not common,
for example
dog arthritis
here "arthritis" is not common, so a good hit for both words may be in the
very end of the doclist for "dog", so if you use the pruned top N for dog
for joining, you may lose the best hit.????
【在 w*********m 的大作中提到】 : now most search engines use OR search, not AND search. fast search : algorithms are used to return top N docs, and then rerank
| B******y 发帖数: 34 | 4
for
carried
not only 2 doc lists. "cat dog" is also indexed. check out n-gram
theory.
【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
| c******n 发帖数: 4965 | 5 thanks, I see that there is the bi-word index, but I read that it's
basically impossible to do 3-word phrase index, so you would still run
into
that problem of a hi-match doc for the phrase being pruned off the doclist
for a single word
【在 B******y 的大作中提到】 : : for : carried : not only 2 doc lists. "cat dog" is also indexed. check out n-gram : theory.
| w*********m 发帖数: 4740 | 6 Top N means top N with highest ranking scores, not first N
Fast algo only prune docs with scores lower than min score of top N
Many papers in sigir and cikm talk about this
【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
| w*********m 发帖数: 4740 | 7 Btw index's posting list is pre sorted and/or query words are sorted by
impact (tfidf), words with higher impact are
processed first. Details depend on which algorithm is used.
Large search engines use parallel processes with splitted indices
【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
| c******n 发帖数: 4965 | 8 Thanks, could u please give a few pointers on this particular topic of
parallel indices and doclist pruning/merging?
I read a few materials: Introduction to Information retrieval, Larry Page
and Sergey's "anatomy of a search engine", and a paper :"Pruning Policies
for Two-Tiered Inverted Index
with Correctness Guarantee", which talks about pruning, but not about
parallel indices .
thanks, I'm not majored in IR, so these are all dummy questions
【在 w*********m 的大作中提到】 : Btw index's posting list is pre sorted and/or query words are sorted by : impact (tfidf), words with higher impact are : processed first. Details depend on which algorithm is used. : Large search engines use parallel processes with splitted indices
|
|