l******g (posts: 188) | 1 Get the total number of unique lines across a data set of 1000 gzipped text
files.
For instance: if every file has two lines, "this is line1" and "this is
line2", then the total count of lines is 2000 and the total number of unique
lines is 2.
1. 1000 machines, where each machine has one gzipped text file with an
approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz.
2. The 1000 machines are named data1, data2, ..., data1000.
3. Data format: ASCII text.
4. We have 11 machines named res1, res2, ..., res11.
5. Each of the 11 machines has 12 1TB disk drives mounted (/0/data, /1/data,
..., /11/data). The /1/data to /11/data mount points are empty on all machines.
6. Each machine has 128GB of RAM.
7. Each machine can 'talk' to every other machine via ssh without login
credentials.
8. Assume each machine runs the same Linux OS. |
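As a single-machine baseline, the task on one file is just "stream the decompressed lines into a set". This obviously cannot hold a real 50GB file's lines in memory, and the function name is made up for illustration, but it pins down exactly what "unique lines" means here:

```python
import gzip

def unique_line_count(path):
    """Count distinct lines in one gzipped ASCII text file.
    A naive in-memory set: fine for small files, not for the
    50GB-per-machine files in the question."""
    seen = set()
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            seen.add(line.rstrip("\n"))
    return len(seen)
```

On the example above (a file containing "this is line1" and "this is line2" plus repeats), this returns 2.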
z**m (posts: 3080) | 2 This is the first textbook example of MapReduce. |
B*****g (posts: 34098) | 3 Link please, thanks.
【In reply to z**m's post】: the first textbook example of MapReduce.
|
x***i (posts: 585) | |
s*g (posts: 94) | 5 I don't get what this question is asking: it's just a pile of hardware
connected over ssh, with no framework on top at all?
And if each file only held 2000 lines of records like "line1", how could it
ever be as large as 50GB?
【In reply to l******g's post】
|
m**i (posts: 394) | 6 Need to dedup at the file level first, then dedup the lines:
calculate a checksum for each line of each file,
then do a unique sort with merge sort.
【In reply to l******g's post】
|
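The checksum-then-unique-sort idea above can be sketched as follows. This is an in-memory toy: the real data would need an external sort (e.g. GNU `sort -u` over digest files spilled to disk), and both function names here are made up for illustration. The point is that fixed-width digests are far cheaper to sort and ship between machines than raw lines, and with a 128-bit hash the collision risk over ~10^12 lines is negligible:

```python
import gzip
import hashlib

def line_digests(path):
    """Yield a fixed-width (16-byte) digest per line of a gzipped file."""
    with gzip.open(path, "rb") as f:
        for line in f:
            yield hashlib.md5(line.rstrip(b"\n")).digest()

def unique_by_digest(paths):
    """Checksum every line of every file, sort the digests,
    then count distinct runs in the sorted order."""
    digests = sorted(d for p in paths for d in line_digests(p))
    return sum(1 for i, d in enumerate(digests)
               if i == 0 or d != digests[i - 1])
```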
j****u (posts: 12) | 7 mark
【In reply to l******g's post】
|
l********k (posts: 613) | 8 To do the dedup, wouldn't you also have to set up a dedup file system?
【In reply to m**i's post】: dedup at the file level first, then dedup the
lines; checksum each line, then unique-sort with merge sort.
|
m****v (posts: 780) | 9 Copy it all onto Hadoop and write one line of Hive: count(distinct *).
【In reply to l******g's post】
|
S*A (posts: 7142) | 10 Hmm, this is an interesting question, a good one.
Note that the total data volume is 50GB * 1000 = 50TB.
The 1000 data machines aren't described as having any writable space, so
treat them as holding distributed, read-only data.
That leaves 11 machines available for computation. Each has 12TB of disk,
so 11 x 12 = 132TB in total, which is > 50TB. So it should be possible to
read the data on each of the 1000 machines only once.
Also note that no single machine can store the full 50TB, which means the
unique lines have to be found across machines.
This question is very much about practical constraints, so it looks like
you are expected to put together your own scheme for the count, e.g. build
a small program framework yourself. Taking a shortcut with an existing
wheel like Hadoop/Hive probably won't pass; they want to test whether you
can build the wheel. |
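The constraints in the analysis above suggest a hand-rolled map/shuffle: each data machine streams its file, hashes every line, and ships the line (or its digest) to res machine `hash % 11` over ssh. Because equal lines always hash to the same res machine, each of the 11 machines can dedup its own partition independently and the 11 counts simply add up. A minimal local sketch of that partitioning invariant (the 11-way split and res1..res11 come from the problem statement; the function names are hypothetical):

```python
import hashlib

NUM_RES = 11  # res1..res11 from the problem statement

def res_index(line: bytes) -> int:
    """Stable partition: equal lines always map to the same res machine."""
    h = int.from_bytes(hashlib.sha1(line).digest()[:8], "big")
    return h % NUM_RES

def count_unique_partitioned(lines):
    """Simulate the 11 res machines locally: partition every line,
    dedup each partition independently, then sum the per-partition
    counts. Summing is valid because no line lands in two partitions."""
    partitions = [set() for _ in range(NUM_RES)]
    for line in lines:
        partitions[res_index(line)].add(line)
    return sum(len(p) for p in partitions)
```

In the real setup, each partition would be spilled to one of a res machine's 12 empty mounts and deduped there, so no single machine ever needs to hold the whole 50TB.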
m*****k (posts: 731) | |
y**********a (posts: 824) | 21
So a rookie who doesn't even know how to use the wheel just fails outright?
【In reply to S*A's post】: ... they want to test whether you can build the
wheel.
|