由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
SanFrancisco版 - a question on search engine/Inverted Index (转载)
相关主题
闯红灯 (转载)急....请推荐SOUTH SAN JOSE附近APARTMENT
温家宝 = 假宝玉 ?? (转载)Sunnyvale Caltrain附近哪里全天停车便宜?
请教大家那些年薪四五十万美元的工作都是什么 (转载)黄灯被拍照,会有罚单么
是谷歌山寨了李彦宏,不是李彦宏山寨了谷歌 (转载)报一个speed trap
[合集] 刚在Stanford newletter上看到一篇文章,Yichao Wang remembereXING 是什么意思?
又是ticket求助借人气 问 新买的odyssey车内的灯为什么都不亮?
红灯停车压线多少钱?Central expressway has the best trafic in bay area?
freeway出口,卖花的劳模How do airplanes stay up?
相关话题的讨论汇总
话题: index话题: search话题: inverted话题: engine话题: dog
进入SanFrancisco版参与讨论
1 (共1页)
c******n
发帖数: 4965
1
【 以下文字转载自 SiliconValleyClub 俱乐部 】
发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub
标 题: a question on search engine/Inverted Index (转载)
发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东)
【 以下文字转载自 CS 讨论区 】
发信人: creation (努力自由泳50m/45sec !), 信区: CS
标 题: a question on search engine/Inverted Index
发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东)
if I search
cat dog
inside the engine, 2 doclists are returned, each sorted in docId, one for
"dog", one for "cat", then the lists are merged. then PageRank is carried
out over the intersection of the 2 sets. is this correct? my question is:
since both doclists can be very long, and their intersection list could
also be very long, each time user does a query, maybe only 100 of the
intersection docIds are finally used, so we have to sort unnecessarily
each time??
how is this solved exactly in google?
thanks
w*********m
发帖数: 4740
2
now most search engines use OR search, not AND search. fast search
algorithms are used to return top N docs, and then rerank

【在 c******n 的大作中提到】
: 【 以下文字转载自 SiliconValleyClub 俱乐部 】
: 发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub
: 标 题: a question on search engine/Inverted Index (转载)
: 发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东)
: 【 以下文字转载自 CS 讨论区 】
: 发信人: creation (努力自由泳50m/45sec !), 信区: CS
: 标 题: a question on search engine/Inverted Index
: 发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东)
: if I search
: cat dog

c******n
发帖数: 4965
3
thanks, but
if I search 2 words, one common , one very not common,
for example
dog arthritis
here "arthritis" is not common, so a good hit for both words may be in the
very end of the doclist for "dog", so if you use the pruned top N for dog
for joining, you may lose the best hit.????

【在 w*********m 的大作中提到】
: now most search engines use OR search, not AND search. fast search
: algorithms are used to return top N docs, and then rerank

B******y
发帖数: 34
4

for
carried
not only 2 doc lists. "cat dog" is also indexed. check out n-gram
theory.

【在 c******n 的大作中提到】
: thanks, but
: if I search 2 words, one common , one very not common,
: for example
: dog arthritis
: here "arthritis" is not common, so a good hit for both words may be in the
: very end of the doclist for "dog", so if you use the pruned top N for dog
: for joining, you may lose the best hit.????

c******n
发帖数: 4965
5
thanks, I see that there is the bi-word index, but I read that it's
basically impossible to do 3-word phrase index, so you would still run
into
that problem of a hi-match doc for the phrase being pruned off the doclist
for a single word

【在 B******y 的大作中提到】
:
: for
: carried
: not only 2 doc lists. "cat dog" is also indexed. check out n-gram
: theory.

w*********m
发帖数: 4740
6
Top N means top N with highest ranking scores, not first N
Fast algo only prune docs with scores lower than min score of top N
Many papers in sigir and cikm talk about this

【在 c******n 的大作中提到】
: thanks, but
: if I search 2 words, one common , one very not common,
: for example
: dog arthritis
: here "arthritis" is not common, so a good hit for both words may be in the
: very end of the doclist for "dog", so if you use the pruned top N for dog
: for joining, you may lose the best hit.????

w*********m
发帖数: 4740
7
Btw index's posting list is pre sorted and/or query words are sorted by
impact (tfidf), words with higher impact are
processed first. Details depend on which algorithm is used.
Large search engines use parallel processes with splitted indices

【在 c******n 的大作中提到】
: thanks, but
: if I search 2 words, one common , one very not common,
: for example
: dog arthritis
: here "arthritis" is not common, so a good hit for both words may be in the
: very end of the doclist for "dog", so if you use the pruned top N for dog
: for joining, you may lose the best hit.????

c******n
发帖数: 4965
8
Thanks, could u please give a few pointers on this particular topic of
parallel indices and doclist pruning/merging?
I read a few materials: Introduction to Information retrieval, Larry Page
and Sergey's "anatomy of a search engine", and a paper :"Pruning Policies
for Two-Tiered Inverted Index
with Correctness Guarantee", which talks about pruning, but not about
parallel indices .
thanks, I'm not majored in IR, so these are all dummy questions

【在 w*********m 的大作中提到】
: Btw index's posting list is pre sorted and/or query words are sorted by
: impact (tfidf), words with higher impact are
: processed first. Details depend on which algorithm is used.
: Large search engines use parallel processes with splitted indices

1 (共1页)
进入SanFrancisco版参与讨论
相关主题
How do airplanes stay up?[合集] 刚在Stanford newletter上看到一篇文章,Yichao Wang remembere
I have an idea, don't know it works or not又是ticket求助
收到一张Ticket: $399 ( cell phone + U-turn at no U-turn intersection)红灯停车压线多少钱?
这两房选哪个?freeway出口,卖花的劳模
闯红灯 (转载)急....请推荐SOUTH SAN JOSE附近APARTMENT
温家宝 = 假宝玉 ?? (转载)Sunnyvale Caltrain附近哪里全天停车便宜?
请教大家那些年薪四五十万美元的工作都是什么 (转载)黄灯被拍照,会有罚单么
是谷歌山寨了李彦宏,不是李彦宏山寨了谷歌 (转载)报一个speed trap
相关话题的讨论汇总
话题: index话题: search话题: inverted话题: engine话题: dog