a question on search engine/Inverted Index (转载) - SanFrancisco版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

SanFrancisco版 - a question on search engine/Inverted Index (转载)

相关主题
● 闯红灯 (转载)	● 急....请推荐SOUTH SAN JOSE附近APARTMENT
● 温家宝 = 假宝玉？？ (转载)	● Sunnyvale Caltrain附近哪里全天停车便宜？
● 请教大家那些年薪四五十万美元的工作都是什么 (转载)	● 黄灯被拍照,会有罚单么
● 是谷歌山寨了李彦宏，不是李彦宏山寨了谷歌 (转载)	● 报一个speed trap
● [合集] 刚在Stanford newletter上看到一篇文章，Yichao Wang remembere	● XING 是什么意思？
● 又是ticket求助	● 借人气问新买的odyssey车内的灯为什么都不亮？
● 红灯停车压线多少钱？	● Central expressway has the best trafic in bay area?
● freeway出口,卖花的劳模	● How do airplanes stay up?

相关话题的讨论汇总
话题: index话题: search话题: inverted话题: engine话题: dog

进入SanFrancisco版参与讨论

1

(共1页)

c******n 发帖数: 4965	1 【以下文字转载自 SiliconValleyClub 俱乐部】发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub 标题: a question on search engine/Inverted Index (转载) 发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东) 【以下文字转载自 CS 讨论区】发信人: creation (努力自由泳50m/45sec !), 信区: CS 标题: a question on search engine/Inverted Index 发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东) if I search cat dog inside the engine, 2 doclists are returned, each sorted in docId, one for "dog", one for "cat", then the lists are merged. then PageRank is carried out over the intersection of the 2 sets. is this correct? my question is: since both doclists can be very long, and their intersection list could also be very long, each time user does a query, maybe only 100 of the intersection docIds are finally used, so we have to sort unnecessarily each time?? how is this solved exactly in google? thanks
w*********m 发帖数: 4740	2 now most search engines use OR search, not AND search. fast search algorithms are used to return top N docs, and then rerank 【在 c******n 的大作中提到】 : 【以下文字转载自 SiliconValleyClub 俱乐部】 : 发信人: creation (努力自由泳50m/45sec !), 信区: SiliconValleyClub : 标题: a question on search engine/Inverted Index (转载) : 发信站: BBS 未名空间站 (Wed Apr 6 01:07:30 2011, 美东) : 【以下文字转载自 CS 讨论区】 : 发信人: creation (努力自由泳50m/45sec !), 信区: CS : 标题: a question on search engine/Inverted Index : 发信站: BBS 未名空间站 (Wed Apr 6 01:07:12 2011, 美东) : if I search : cat dog
c******n 发帖数: 4965	3 thanks, but if I search 2 words, one common , one very not common, for example dog arthritis here "arthritis" is not common, so a good hit for both words may be in the very end of the doclist for "dog", so if you use the pruned top N for dog for joining, you may lose the best hit.???? 【在 w*********m 的大作中提到】 : now most search engines use OR search, not AND search. fast search : algorithms are used to return top N docs, and then rerank
B******y 发帖数: 34	4 for carried not only 2 doc lists. "cat dog" is also indexed. check out n-gram theory. 【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
c******n 发帖数: 4965	5 thanks, I see that there is the bi-word index, but I read that it's basically impossible to do 3-word phrase index, so you would still run into that problem of a hi-match doc for the phrase being pruned off the doclist for a single word 【在 B******y 的大作中提到】 : : for : carried : not only 2 doc lists. "cat dog" is also indexed. check out n-gram : theory.
w*********m 发帖数: 4740	6 Top N means top N with highest ranking scores, not first N Fast algo only prune docs with scores lower than min score of top N Many papers in sigir and cikm talk about this 【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
w*********m 发帖数: 4740	7 Btw index's posting list is pre sorted and/or query words are sorted by impact (tfidf), words with higher impact are processed first. Details depend on which algorithm is used. Large search engines use parallel processes with splitted indices 【在 c******n 的大作中提到】 : thanks, but : if I search 2 words, one common , one very not common, : for example : dog arthritis : here "arthritis" is not common, so a good hit for both words may be in the : very end of the doclist for "dog", so if you use the pruned top N for dog : for joining, you may lose the best hit.????
c******n 发帖数: 4965	8 Thanks, could u please give a few pointers on this particular topic of parallel indices and doclist pruning/merging? I read a few materials: Introduction to Information retrieval, Larry Page and Sergey's "anatomy of a search engine", and a paper :"Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee", which talks about pruning, but not about parallel indices . thanks, I'm not majored in IR, so these are all dummy questions 【在 w*********m 的大作中提到】 : Btw index's posting list is pre sorted and/or query words are sorted by : impact (tfidf), words with higher impact are : processed first. Details depend on which algorithm is used. : Large search engines use parallel processes with splitted indices

1

(共1页)

进入SanFrancisco版参与讨论

相关主题
● How do airplanes stay up?	● [合集] 刚在Stanford newletter上看到一篇文章，Yichao Wang remembere
● I have an idea, don't know it works or not	● 又是ticket求助
● 收到一张Ticket: $399 ( cell phone + U-turn at no U-turn intersection)	● 红灯停车压线多少钱？
● 这两房选哪个？	● freeway出口,卖花的劳模
● 闯红灯 (转载)	● 急....请推荐SOUTH SAN JOSE附近APARTMENT
● 温家宝 = 假宝玉？？ (转载)	● Sunnyvale Caltrain附近哪里全天停车便宜？
● 请教大家那些年薪四五十万美元的工作都是什么 (转载)	● 黄灯被拍照,会有罚单么
● 是谷歌山寨了李彦宏，不是李彦宏山寨了谷歌 (转载)	● 报一个speed trap

相关话题的讨论汇总
话题: index话题: search话题: inverted话题: engine话题: dog

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)