请教一个跟search中用到的auto suggestion问题 - Programming版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Programming版 - 请教一个跟search中用到的auto suggestion问题

相关主题
● autocomplete实现方法请教	● 如何智能化合并数据库中属于相关objects的各种属性到一个object下？
● load一个巨大的k-v table到一个view里，有搜索功能怎么设计？	● 如何将相似字符串更加准确地找出来？
● 算法求教	● mongobd中的text search速度问题
● 请问有什么好的开源中英文搜索引擎？	● solr shared index file solution (转载)
● 与其无意义的争论，不如干点实事	● 你们能scale out的都是有福的
● 一个网站里的search功能，是在search这个网站的database,还是象IDE里面的search workspace?	● 请教下本地搜索
● 请大牛来谈谈对Solr的看法	● 怎么做个文件的 index, 比如archive 这样的
● 架构设计问题，请各位大神指点	● StackOverflow的架构

相关话题的讨论汇总
话题: solr话题: search话题: suggestion话题: auto话题: sql

进入Programming版参与讨论

(共1页)

M***0
发帖数: 1180

附件中的auto suggestion应该是根据popularity score来决定提示哪5条信息的吧？
实现上，是用SQL database（分表？）来存储每个item和search count好，还是用自己
的代码实现好？如果是用自己的代码，是用heapsort吗？
数据量100M条，这个功能要用在ajax auto suggestion上，速度不能太慢。

i*****o
发帖数: 1714

能先透露一下打了f之后怎么把所有以f开头的掉出来吗？同理，再打个a，怎么把fa开
头的再掉出来？

【在 M***0 的大作中提到】

: 附件中的auto suggestion应该是根据popularity score来决定提示哪5条信息的吧？
: 实现上，是用SQL database（分表？）来存储每个item和search count好，还是用自己
: 的代码实现好？如果是用自己的代码，是用heapsort吗？
: 数据量100M条，这个功能要用在ajax auto suggestion上，速度不能太慢。

M***0
发帖数: 1180

不清楚你是哪个环节不懂，简而言之：
client side makes AJAX call
server side does a String wildcard search and returns a list of result
问题是下面的提示item是有选择性的而不是“所有”

【在 i*****o 的大作中提到】

: 能先透露一下打了f之后怎么把所有以f开头的掉出来吗？同理，再打个a，怎么把fa开
: 头的再掉出来？

g*****g
发帖数: 34805

Build a prefix tree. For weight, I'll probably start with
geometry proximity and mixed in the number of time it's been
selected.

【在 i*****o 的大作中提到】

: 能先透露一下打了f之后怎么把所有以f开头的掉出来吗？同理，再打个a，怎么把fa开
: 头的再掉出来？

i*****o
发帖数: 1714

lz说用sql，我想半天sql怎么也做不出这么厉害的功能。那自己做所有的都在ram里吗
？怎么做persitent？

【在 g*****g 的大作中提到】

: Build a prefix tree. For weight, I'll probably start with
: geometry proximity and mixed in the number of time it's been
: selected.

M***0
发帖数: 1180

如果是存在sql table里，那最简单的，如果你的数据量小的话
e.g. Oracle: select xxx from ttt where xxx like ‘fa%’ where rownum <5
返回前5条fa开头的记录（在xxx上建个Index）
如果要按字母顺序，在rownum<5前加order by xxx
如果是要按popularity来排，加order by popularity
不用sql table的话，就是自己用代码建b-tree或binary-tree，把output写到硬盘，实
际上这个查询会是一直用的，所以是长期在内存里的，我们的内存是100GB
这两天我又想了想，今天和全组的人讨论了一下，决定用建几棵树分别记录单字母，双
子母，三字母,etc， up to五字母表或六字母表
单字母表
prefix（key） top_5_search
a apple, aol, american airline, amazon, applebees
b bestbuy, boa, bbc, bing, burger king
c ...
...
(当搜索栏输入第一个字母a的时候，下面提示框提示apple, aol, american airline,
amazon, applebees，这个查询就非常快了，因为前5名的已经提前计算好并存在这里了)
双字母表
..
ap apple, apple store, applebees, apple tv, applebees coupon
...
这个top 5不是什么time critical的东西，不需要real time update，所以虽然我们那
100M条记录会一直更新被搜索的次数，但是完全可以每24小时，e.g. at midnight再
来更新以上那些表，而实际上那些表被更新的可能性不大，因为top 5一般一直不怎么变
上面那几个表不大(比如zzzzz开头的总搜索量太小所以zzzzz根本不用出现在五字母表，
这个可以大大减少记录条数)，用sql table完全可以了。
我们用自己的代码建树，是因为我们的代码是现成的，job schedule to rebuild
the tree at certain time interval也是现成的，所以就不劳烦sql database了
最后就是如果输入是6,7个以上的字母，就没这个字母表了，直接到原始的那个完整tree里
去查询top 5 popularity，一旦匹配了6,7个字母，数据量就不大了
这个是初步计划，实现起来可能会有预想不到的问题，不知道有没有有经验的人能帮忙看看

【在 i*****o 的大作中提到】

: lz说用sql，我想半天sql怎么也做不出这么厉害的功能。那自己做所有的都在ram里吗
: ？怎么做persitent？

i*****o
发帖数: 1714

如果有现成的100m的数据库的话，你能不能先做一个你说的select，我很想知道这个要
用多长时间。谢谢啦：）

【在 M***0 的大作中提到】

: 如果是存在sql table里，那最简单的，如果你的数据量小的话
: e.g. Oracle: select xxx from ttt where xxx like ‘fa%’ where rownum <5
: 返回前5条fa开头的记录（在xxx上建个Index）
: 如果要按字母顺序，在rownum<5前加order by xxx
: 如果是要按popularity来排，加order by popularity
: 不用sql table的话，就是自己用代码建b-tree或binary-tree，把output写到硬盘，实
: 际上这个查询会是一直用的，所以是长期在内存里的，我们的内存是100GB
: 这两天我又想了想，今天和全组的人讨论了一下，决定用建几棵树分别记录单字母，双
: 子母，三字母,etc， up to五字母表或六字母表
: 单字母表

a9
发帖数: 21638

建上索引的话应该瞬间就完成了。

，实

【在 i*****o 的大作中提到】

: 如果有现成的100m的数据库的话，你能不能先做一个你说的select，我很想知道这个要
: 用多长时间。谢谢啦：）

z***e
发帖数: 5393

用现成的search server，比如solr.

【在 M***0 的大作中提到】

c****e
发帖数: 1453

If you put the whole thing in memory, it's best. Or you can put your trie db
on SSD. The disk IO response will be much much better. The uniram, bigram,
N-gram subtrees are essentially a hash-map based cache. You can tweak that
later.
For your complete trie When the depth is larger than a threshold, at each node you can remember top 5 from its children. Since you rebuild the whole thing everyday, computiation is not a concern. You can adjust the threshold to manage storage whithin your budget.
To get better scalability, you can partition the tree into multiple machines
, the add a simple aggregator before returning results to frontend. Of
course cache can be deployed at each level to reduce better latency.
There are many things you can do. But I think the basic approach should just
work.

【在 M***0 的大作中提到】

相关主题
● 一个网站里的search功能，是在search这个网站的database,还是象IDE里面的search workspace?	● 如何智能化合并数据库中属于相关objects的各种属性到一个object下？
● 请大牛来谈谈对Solr的看法	● 如何将相似字符串更加准确地找出来？
● 架构设计问题，请各位大神指点	● mongobd中的text search速度问题
进入Programming版参与讨论

l*******s
发帖数: 1258

不行，solr太慢了。那玩意是全文索引的，不是用来搞query提示的。

【在 z***e 的大作中提到】

: 用现成的search server，比如solr.

p*****b
发帖数: 291

GOOGL的用什么不知道,但你的绝对用SOLR(JQuery AutoCompletion PLugin) 肯定就可以了.
这就是SOLR的强项,而用DATABASE就完全不行了.去SOLR的网站去研究下如何定义
FIeld type, field,handler,和如何DUMP 数据从
Database 到SOLR index engine.非常有意思.
Solr这个基于LUCENE的开源INDEX/SEARCH Engine现在非常火,
但大多数WEB应用开发人员还不知道.

M***0
发帖数: 1180

谢谢楼上各位
solr/lucene我们研究过了，速度太慢。我们自己的源代码，功能和lucene几乎完全相
似，但是针对我们自己的业务定制和优化过的，Performance比lucene好不少
lucene相当generic,适用于各种领域，比较适合小型项目，也因为它的优点，使得它针
对性不够强

M***0
发帖数: 1180

谢谢 your suggestion is invaluable! 有一处我没太懂，please see below

trie
db
bigram
,
that
each
node you can remember top 5 from its children. Since you rebuild the
whole
thing everyday, computiation is not a concern. You can adjust the
threshold
to manage storage whithin your budget.
你是指full tree只需存贮排名前5的children以节省时间/空间吗？
machines
should
just
good idea！不过我们的数据量还没那么大。我们现在先测试single machine，如果
performance不好，就改成你说的。

【在 c****e 的大作中提到】

: If you put the whole thing in memory, it's best. Or you can put your trie db
: on SSD. The disk IO response will be much much better. The uniram, bigram,
: N-gram subtrees are essentially a hash-map based cache. You can tweak that
: later.
: For your complete trie When the depth is larger than a threshold, at each node you can remember top 5 from its children. Since you rebuild the whole thing everyday, computiation is not a concern. You can adjust the threshold to manage storage whithin your budget.
: To get better scalability, you can partition the tree into multiple machines
: , the add a simple aggregator before returning results to frontend. Of
: course cache can be deployed at each level to reduce better latency.
: There are many things you can do. But I think the basic approach should just
: work.

p*****b
发帖数: 291

那你们的要求太高了.我们的数据量只有你的四分只一(50Million),
Solr即使随便放在一台五年前的老旧DELL上,速度都是可以接受的.
你的好象还只是Begin with类型的search.如果,我们配一个适合这种类型的field,
并只Search这个Field,速度应该还会大幅提高.
Sol用于中型作用的数据,应该还是可以胜任的.

【在 M***0 的大作中提到】

: 谢谢楼上各位
: solr/lucene我们研究过了，速度太慢。我们自己的源代码，功能和lucene几乎完全相
: 似，但是针对我们自己的业务定制和优化过的，Performance比lucene好不少
: lucene相当generic,适用于各种领域，比较适合小型项目，也因为它的优点，使得它针
: 对性不够强

z***e
发帖数: 5393

可以的，已经有这个feature了。而且谁给你说solr只能用于全文检索？明明是你自己
define不同的field和type，然后分别建立index.

【在 l*******s 的大作中提到】

: 不行，solr太慢了。那玩意是全文索引的，不是用来搞query提示的。

l*******s
发帖数: 1258

我说的全文索引，不是狭义的指某篇文章的全文，这里全文说的是对给定的某段text的
索引，因此也包括你说的define不同的field和type。当年我还把POS tagging跟
chunking加进去做索引呢。
至于新feature，已经两年不用了，这个不了解。

【在 z***e 的大作中提到】

: 可以的，已经有这个feature了。而且谁给你说solr只能用于全文检索？明明是你自己
: define不同的field和type，然后分别建立index.

c****e
发帖数: 1453

It's simiar to your hash-map cache idea. You can cache the top 5 ranked
words at any level in the trie, ranther than only from root.

【在 M***0 的大作中提到】

: 谢谢 your suggestion is invaluable! 有一处我没太懂，please see below
:
: trie
: db
: bigram
: ,
: that
: each
: node you can remember top 5 from its children. Since you rebuild the
: whole

(共1页)

进入Programming版参与讨论

相关主题
● StackOverflow的架构	● 与其无意义的争论，不如干点实事
● Index PDF和doc 是elasticsearch还是solr	● 一个网站里的search功能，是在search这个网站的database,还是象IDE里面的search workspace?
● 有没有直接对pdf或者doc简历进行分析的开源软件？	● 请大牛来谈谈对Solr的看法
● 搜索 lucene 之类是不是不流行了？	● 架构设计问题，请各位大神指点
● autocomplete实现方法请教	● 如何智能化合并数据库中属于相关objects的各种属性到一个object下？
● load一个巨大的k-v table到一个view里，有搜索功能怎么设计？	● 如何将相似字符串更加准确地找出来？
● 算法求教	● mongobd中的text search速度问题
● 请问有什么好的开源中英文搜索引擎？	● solr shared index file solution (转载)

相关话题的讨论汇总
话题: solr话题: search话题: suggestion话题: auto话题: sql

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天