A design question from company L: distributed inverted index design - JobHunting board


s*******m
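Posts: 228

I was given an inverted index question: there is a large pile of docs, and you
build an inverted index over the words that appear in them. There are so many
docs that they are distributed across many machines. How do you implement this
inverted index?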

g*****g
Posts: 34805

Cassandra is a perfect DB for illustration. Each row maps a word to a list of
doc ids. The doc id can be a UUID or a URL as long as it's unique. For each
index row, the row key (the word) is hashed and the row is replicated, so you
can have N copies in the cluster and the keys are evenly distributed. You may
also use a timestamp etc. to arrange your index row so you can optionally run
a time range query, which is very common in such designs.
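
The placement described above (hash the word, put N copies on the ring) can be sketched roughly as follows. This is only an illustration and not Cassandra's actual code: the `Ring` class, the `token` helper, the node names and the virtual-node count are all hypothetical, and MD5 is used as a stand-in hash function.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a string key to an integer position on the ring (MD5 as a stand-in)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring; not Cassandra's real implementation."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring so keys spread more evenly.
        self._points = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._points]
        self._node_count = len(set(nodes))

    def replicas(self, word: str, n: int = 3):
        """Return the n distinct nodes that would hold the index row for `word`."""
        n = min(n, self._node_count)
        i = bisect.bisect_right(self._tokens, token(word))
        owners = []
        # Walk clockwise from the word's position, collecting distinct nodes.
        while len(owners) < n:
            node = self._points[i % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("cassandra", n=3)
print(owners)  # three distinct node names; which ones depends on the hash
```

The caller only supplies the word; the ring decides where the row and its copies live.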

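[Quoting s*******m's post:]

: I was given an inverted index question: there is a large pile of docs, and you
: build an inverted index over the words that appear in them. There are so many
: docs that they are distributed across many machines. How do you implement this
: inverted index?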

s*******m
Posts: 228

Thanks.
A beginner question: does a key-value database like Cassandra have an index
internally? For example, if I look up a key, will that complete quickly?

[Quoting g*****g's post:]

: Cassandra is a perfect DB for illustration. Each row maps a word to a list of
: doc ids. The doc id can be a UUID or a URL as long as it's unique. For each
: index row, the row key (the word) is hashed and the row is replicated, so you
: can have N copies in the cluster and the keys are evenly distributed. You may
: also use a timestamp etc. to arrange your index row so you can optionally run
: a time range query, which is very common in such designs.

p*****2
Posts: 21240

Key lookups are fast.
Other than that there is basically no index.
But isn't an inverted index usually kept in memory? I would probably try Redis.
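
To make the in-memory idea concrete, here is a minimal sketch that keeps one set of doc ids per word in a plain Python dict. It does not talk to Redis; the `InMemoryIndex` class and the sample docs are hypothetical. Keeping one set per word and intersecting sets for multi-word queries is roughly what Redis sets with SADD and SINTER would give you.

```python
import re
from collections import defaultdict


class InMemoryIndex:
    """Hypothetical single-process index: word -> set of doc ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Lowercase and split on non-word characters; one set entry per word.
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, *words):
        """Return ids of docs containing every query word (an AND query)."""
        if not words:
            return set()
        sets = [self._postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets)


index = InMemoryIndex()
index.add("d1", "Cassandra stores rows")
index.add("d2", "Redis stores sets in memory")

print(index.search("stores"))            # both d1 and d2
print(index.search("stores", "memory"))  # only d2
```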

[Quoting s*******m's post:]

: Thanks.
: A beginner question: does a key-value database like Cassandra have an index
: internally? For example, if I look up a key, will that complete quickly?

s*******m
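Posts: 228

Can the app layer know the key that Cassandra generates?
If the database is distributed, you need to run this key through consistent
hashing to find which node holds the data.
Is my understanding correct?
And if lookups are fast, does that mean a NoSQL database doesn't need a cache
layer like memcache?

[Quoting p*****2's post:]

: Key lookups are fast.
: Other than that there is basically no index.
: But isn't an inverted index usually kept in memory? I would probably try Redis.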

s*******m
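Posts: 228

One more question:
does a key-value database have a notion of objects?
For example, a person with key = 1, value = ......
and an animal whose key is also 1, value = .......

[Quoting p*****2's post:]

: Key lookups are fast.
: Other than that there is basically no index.
: But isn't an inverted index usually kept in memory? I would probably try Redis.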

h*******0
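Posts: 270

CREATE TABLE invertedIndex (
    word text,
    positions list<text>,  -- element type is an assumption; the original left it out
    PRIMARY KEY (word)
);
With a distributed database you don't have to find which node the data is on
yourself; otherwise it would be far too much trouble to use...

[Quoting s*******m's post:]

: Can the app layer know the key that Cassandra generates?
: If the database is distributed, you need to run this key through consistent
: hashing to find which node holds the data.
: Is my understanding correct?
: And if lookups are fast, does that mean a NoSQL database doesn't need a cache
: layer like memcache?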

b**********5
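Posts: 7881

Damn it, let me complain once more: plenty of people this fuzzy on the
concepts (no offense) get hired by FLG, while someone like me gets rejected
everywhere...

[Quoting s*******m's post:]

: One more question:
: does a key-value database have a notion of objects?
: For example, a person with key = 1, value = ......
: and an animal whose key is also 1, value = .......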

g****v
Posts: 971

Couldn't you use MapReduce for this question?

c******z
Posts: 38

Mark.


g*****g
Posts: 34805

The app doesn't need to know. It knows the keyword, which is a unique word; it
doesn't need to know the hash value. Cassandra can cache rows in memory, so
you don't need memcache for access. But memcache can be convenient for other
things, like caching a rich object in memory, which you don't do in NoSQL.

[Quoting s*******m's post:]

h*******0
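Posts: 270

Keep grinding practice problems. Also, if you do better than others on system
design, your level will be a bit higher when the offer comes.

[Quoting b**********5's post:]

: Damn it, let me complain once more: plenty of people this fuzzy on the
: concepts (no offense) get hired by FLG, while someone like me gets rejected
: everywhere...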

h*******0
Posts: 270

好虫, master, what is a rich object? Can you give an example?

[Quoting g*****g's post:]

: The app doesn't need to know. It knows the keyword, which is a unique word; it
: doesn't need to know the hash value. Cassandra can cache rows in memory, so
: you don't need memcache for access. But memcache can be convenient for other
: things, like caching a rich object in memory, which you don't do in NoSQL.

b**********5
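Posts: 7881

I do grind, day and night. Then in an interview I get asked how to generate a
random bejewel; what am I supposed to do? I basically solved it, I think, but
didn't write it out in full; what am I supposed to do?
Then I interviewed at a second-tier company and solved every problem, and as I
was leaving the interviewer said, we will get back to u very soon... Two weeks
went by, I emailed to ask, and they never replied at all.

[Quoting h*******0's post:]

: Keep grinding practice problems. Also, if you do better than others on system
: design, your level will be a bit higher when the offer comes.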

h*******0
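Posts: 270

Luck sometimes plays a big part in interviews. Hang in there. If it really
doesn't work out, go to a non-FLG company for a while as a stepping stone.

[Quoting b**********5's post:]

: I do grind, day and night. Then in an interview I get asked how to generate a
: random bejewel; what am I supposed to do? I basically solved it, I think, but
: didn't write it out in full; what am I supposed to do?
: Then I interviewed at a second-tier company and solved every problem, and as I
: was leaving the interviewer said, we will get back to u very soon... Two weeks
: went by, I emailed to ask, and they never replied at all.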

b**********5
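Posts: 7881

Then I might as well stay home and sell myself as a freelancer.

[Quoting h*******0's post:]

: Luck sometimes plays a big part in interviews. Hang in there. If it really
: doesn't work out, go to a non-FLG company for a while as a stepping stone.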

h*******0
Posts: 270

I don't think MapReduce is a good fit here.

[Quoting g****v's post:]

: Couldn't you use MapReduce for this question?

b**********5
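Posts: 7881

The original question asks how to implement this inverted index, but everyone
is discussing how to store it...
wordID,
isn't that exactly basic MapReduce?

[Quoting h*******0's post:]

: I don't think MapReduce is a good fit here.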

h*******0
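Posts: 270

I misread the question... this guy asked several questions. Still, MapReduce
has quite a lot of overhead; running Hadoop again every time a new doc is
added would be a real pain.

[Quoting b**********5's post:]

: The original question asks how to implement this inverted index, but everyone
: is discussing how to store it...
: wordID,
: isn't that exactly basic MapReduce?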

g*****g
Posts: 34805

Think of it as a JSON object, a doc. Anything that's a value and too big to
fit into the C* row cache.

[Quoting h*******0's post:]

: 好虫, master, what is a rich object? Can you give an example?


g*****g
Posts: 34805

How is this a MapReduce? It's just an index. Everybody knows what an
inverted index is; the question is how to implement it in a distributed
system so that it can scale.

[Quoting b**********5's post:]

: The original question asks how to implement this inverted index, but everyone
: is discussing how to store it...
: wordID,
: isn't that exactly basic MapReduce?

g****v
Posts: 971

map:
(word, docID)
reduce:
(word, docID1, docID2....)
Isn't this a classic MapReduce application? Experts, please enlighten me.
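
The map and reduce steps listed above can be sketched in a single process like this. It is only an illustration of the data flow, not a Hadoop job; the function names `map_doc`, `shuffle`, `reduce_word` and `build_index`, and the sample docs, are hypothetical.

```python
import re
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a doc."""
    for word in set(re.findall(r"\w+", text.lower())):
        yield word, doc_id


def shuffle(pairs):
    """Group emitted doc ids by word, as the framework does between phases."""
    groups = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_word(word, doc_ids):
    """Reduce step: merge all doc ids for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    pairs = (pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text))
    return dict(reduce_word(word, ids) for word, ids in shuffle(pairs).items())


docs = {"d1": "to be or not to be", "d2": "to index or not"}
index = build_index(docs)
print(index["to"])     # ['d1', 'd2']
print(index["index"])  # ['d2']
```

The reduce here does no counting; it only merges and sorts the doc ids gathered for each word.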

[Quoting g*****g's post:]

: How is this a MapReduce? It's just an index. Everybody knows what an
: inverted index is; the question is how to implement it in a distributed
: system so that it can scale.

g*****g
Posts: 34805

If you are taking counts, it can be MapReduce; otherwise what are you
reducing in an inverted index?

[Quoting g****v's post:]

: map:
: (word, docID)
: reduce:
: (word, docID1, docID2....)
: Isn't this a classic MapReduce application? Experts, please enlighten me.

d******a
Posts: 238

http://grids.ucs.indiana.edu/ptliupages/publications/Scalable%2

[Quoting g*****g's post:]

: How is this a MapReduce? It's just an index. Everybody knows what an
: inverted index is; the question is how to implement it in a distributed
: system so that it can scale.

b**********5
Posts: 7881

All I'm saying is that the original problem is: you only have some HDFS files
and you need to build this inverted index.
You can store the inverted index in either HBase or Cassandra.

[Quoting g*****g's post:]

: If you are taking counts, it can be MapReduce; otherwise what are you
: reducing in an inverted index?

x****u
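Posts: 81

That is only the first step. Then what?
How do you store it? How do you partition? How do you scale? How do you
update? How do you guarantee availability?
The interviewer may also extend the question: how do you handle searches with
conditions? How do you do real-time updates?
In a design question you can only follow the interviewer's train of thought
and see what they want to ask, though if you are really strong and can cover
everything from start to finish without a gap, even better.

[Quoting g****v's post:]

: map:
: (word, docID)
: reduce:
: (word, docID1, docID2....)
: Isn't this a classic MapReduce application? Experts, please enlighten me.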

b**********5
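Posts: 7881

How to store it: just store it in Cassandra or HBase. HBase and Cassandra
already take care of partitioning and scaling for you. You can talk about the
architecture of HBase and Cassandra. A real-time update is just a lookup,
overwrite, and insert into your NoSQL table...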
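
As a rough sketch of the lookup / overwrite / insert update described above, the code below re-indexes a single doc without rebuilding everything. It runs in memory rather than against HBase or Cassandra, and the `IncrementalIndex` class plus the `doc_words` forward map (doc id to its current words) are assumptions added here so stale entries can be found; the thread does not specify them.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the distinct lowercase words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


class IncrementalIndex:
    """Hypothetical in-memory stand-in for the NoSQL index table."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> doc ids (the index rows)
        self.doc_words = {}               # doc id -> words currently indexed (assumed)

    def upsert(self, doc_id, text):
        new_words = tokenize(text)
        old_words = self.doc_words.get(doc_id, set())  # lookup
        for word in old_words - new_words:             # remove stale entries
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:             # insert new entries
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = new_words             # overwrite


idx = IncrementalIndex()
idx.upsert("d1", "red apple")
idx.upsert("d2", "green apple")
idx.upsert("d1", "red pear")  # d1 changed: "apple" is dropped, "pear" is added
print(idx.postings["apple"])  # {'d2'}
print(idx.postings["pear"])   # {'d1'}
```

Only the rows for words that changed are touched, which is the point of updating in place instead of rerunning a batch job per doc.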

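[Quoting x****u's post:]

: That is only the first step. Then what?
: How do you store it? How do you partition? How do you scale? How do you
: update? How do you guarantee availability?
: The interviewer may also extend the question: how do you handle searches with
: conditions? How do you do real-time updates?
: In a design question you can only follow the interviewer's train of thought
: and see what they want to ask, though if you are really strong and can cover
: everything from start to finish without a gap, even better.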
