This page is an excerpt and archive of the corresponding thread on mitbbs (未名空间). Posts less than a week old show at most 50 characters; older posts show up to 500.
JobHunting board - Sharing a Google interview question (big-data related).
l******g
Posts: 188
#1
Get the total number of unique lines across a data set of 1000 gzipped text files.
For instance: if every file has two lines, "this is line1" and "this is line2", then the total count of lines is 2000 and the total number of unique lines is 2.
1. 1000 machines, each holding one gzipped text file of approximately 50GB. The file on each machine is /0/data/foo.txt.gz.
2. The 1000 machines are named data1, data2, ..., data1000.
3. Data format: ASCII text.
4. We have 11 machines named res1, res2, ..., res11.
5. Each of the 11 machines has 12 1TB disk drives mounted (/0/data, /1/data, ..., /11/data). The /1/data through /11/data mount points are empty on all machines.
6. Each machine has 128GB of RAM.
7. Each machine can 'talk' to every other machine via ssh without login credentials.
8. Assume each machine runs the same Linux OS.
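One standard way to attack a problem like this (a hedged sketch, not from the thread itself): hash-partition the lines so that identical lines always land in the same bucket, deduplicate each bucket independently, and sum the per-bucket counts. The core invariant can be simulated locally:

```python
import hashlib

def bucket_of(line: str, n_buckets: int) -> int:
    """Route a line to a bucket; identical lines always hit the same bucket."""
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def count_unique(lines, n_buckets=132):
    """Partition lines into buckets, dedup each bucket independently, sum."""
    buckets = [set() for _ in range(n_buckets)]
    for line in lines:
        buckets[bucket_of(line, n_buckets)].add(line)
    # Buckets are disjoint by construction, so per-bucket counts just add up.
    return sum(len(b) for b in buckets)

# 1000 "files", each with the same two lines: 2000 lines total, 2 unique.
data = ["this is line1", "this is line2"] * 1000
assert count_unique(data) == 2
```

In the distributed version each bucket would live on a different disk of a different machine, but the counting logic is unchanged because no line ever appears in two buckets.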
z**m
Posts: 3080
#2
This is the canonical first MapReduce example.
B*****g
Posts: 34098
#3
Link please, thanks.

[Quoting z**m:]
: This is the canonical first MapReduce example.
x***i
Posts: 585
#4
Any comments?
s*g
Posts: 94
#5
I don't get what this problem is asking - just a pile of hardware connected over ssh, and no framework on top at all?
Also, if every file has only 2000 lines of records like "line1", how can it be as large as 50GB?

[Quoting l******g:]
: get the total number of unique lines across a data set of 1000 gzipped text
: files.
: for instance: If every file has two lines, "this is line1" and "this is
: line2", then the total count of lines is 2000, and total number of unique
: lines is 2.
: 1. 1000 machines where each machine has one gzipped text file with an
: approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz
: 2. 1000 machines are named data1, data2, ..., data1000.
: 3. Data format ASCII text.
: 4. we have 11 machines named res1, res2, ..., res11
m**i
Posts: 394
#6
We need to dedup at the file level first, then dedup the lines:
calculate a checksum for each line of each file,
then do a unique sort with merge sort.

[Quoting l******g:]
: get the total number of unique lines across a data set of 1000 gzipped text
: files.
: for instance: If every file has two lines, "this is line1" and "this is
: line2", then the total count of lines is 2000, and total number of unique
: lines is 2.
: 1. 1000 machines where each machine has one gzipped text file with an
: approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz
: 2. 1000 machines are named data1, data2, ..., data1000.
: 3. Data format ASCII text.
: 4. we have 11 machines named res1, res2, ..., res11
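The checksum-plus-mergesort idea above can be sketched as follows (my interpretation, not the poster's code): each machine sorts the fixed-width digests of its own lines, then a single k-way merging pass counts distinct digests. One caveat worth hedging: a hash collision would undercount by one, though with a 160-bit digest this is negligible.

```python
import hashlib
import heapq

def line_digests(lines):
    """Per-line checksum as suggested above: fixed-width, sortable, compact."""
    return [hashlib.sha1(l.encode("utf-8")).digest() for l in lines]

def unique_count_by_merge(runs):
    """Merge pre-sorted runs of digests and count distinct values -
    the 'unique sort with mergesort' step."""
    count, prev = 0, None
    for d in heapq.merge(*runs):
        if d != prev:
            count += 1
            prev = d
    return count

# Two "files" with one overlapping line; each machine sorts its own digests,
# then one merging pass counts uniques without materializing a global sort.
file1 = ["this is line1", "this is line2"]
file2 = ["this is line2", "this is line3"]
runs = [sorted(line_digests(file1)), sorted(line_digests(file2))]
assert unique_count_by_merge(runs) == 3
```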

j****u
Posts: 12
#7
Mark.

[Quoting l******g:]
: get the total number of unique lines across a data set of 1000 gzipped text
: files.
: for instance: If every file has two lines, "this is line1" and "this is
: line2", then the total count of lines is 2000, and total number of unique
: lines is 2.
: 1. 1000 machines where each machine has one gzipped text file with an
: approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz
: 2. 1000 machines are named data1, data2, ..., data1000.
: 3. Data format ASCII text.
: 4. we have 11 machines named res1, res2, ..., res11

l********k
Posts: 613
#8
To do the dedup, don't we also need to set up a dedup file system?

[Quoting m**i:]
: need to do dedup on file level first, then do dedup for lines.
: calculate the checksum for each line for each file,
: then do a unique sort with mergesort.
m****v
Posts: 780
#9
Copy everything onto Hadoop and write one line of Hive: count(distinct *).

[Quoting l******g:]
: get the total number of unique lines across a data set of 1000 gzipped text
: files.
: for instance: If every file has two lines, "this is line1" and "this is
: line2", then the total count of lines is 2000, and total number of unique
: lines is 2.
: 1. 1000 machines where each machine has one gzipped text file with an
: approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz
: 2. 1000 machines are named data1, data2, ..., data1000.
: 3. Data format ASCII text.
: 4. we have 11 machines named res1, res2, ..., res11
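The one-liner above amounts to `SELECT COUNT(DISTINCT line)`. As a local stand-in (the Hive cluster is hypothetical here), the same query can be demonstrated with sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (line TEXT)")
conn.executemany(
    "INSERT INTO lines VALUES (?)",
    [("this is line1",), ("this is line2",), ("this is line1",)],
)
# The same aggregate Hive would run over the full 50 TB table:
(unique,) = conn.execute("SELECT COUNT(DISTINCT line) FROM lines").fetchone()
assert unique == 2
```

Of course, as the later reply points out, the question deliberately provides no Hadoop/Hive installation, so this only shows what the answer computes, not how to compute it on the given hardware.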

S*A
Posts: 7142
#10
Hmm, this is an interesting problem - a good one.
Note that the total data volume is 50GB × 1000 = 50TB.
The 1000 data machines aren't described as having any writable space, so treat them as holding distributed, read-only data.
That leaves 11 machines for computation, each with 12TB of disk: 11 × 12 = 132TB > 50TB. So it should be possible to read each of the 1000 machines only once.
Also note that no single machine can store the full 50TB, which means uniqueness has to be found across different machines.
The question is very grounded in practice, so it looks like you are expected to build your own scheme to do the counting - for example, roll your own program framework. Taking the lazy route with Hadoop/Hive - reusing the wheel - probably won't pass. They want to test whether you can build the wheel.
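The capacity math above suggests a single-pass plan: each data machine streams its decompressed file, hashes every line, and ships it to one of 11 × 12 = 132 partitions, one per disk on the res machines. Distinct lines never cross partitions, so each disk can run its own dedup and the per-partition counts simply add up. A routing function might look like this (the machine/disk numbering below is my assumption, not from the thread):

```python
import hashlib

RES_MACHINES = 11   # res1 .. res11
DISKS = 12          # /0/data .. /11/data on each res machine

def route(line: str):
    """Map a line to a (res machine, disk) partition; same line, same partition."""
    h = int.from_bytes(hashlib.sha1(line.encode("utf-8")).digest()[:8], "big")
    part = h % (RES_MACHINES * DISKS)   # 0 .. 131
    machine = part // DISKS + 1         # 1 .. 11
    disk = part % DISKS                 # 0 .. 11
    return f"res{machine}", f"/{disk}/data"

# Identical lines always route identically, so per-partition dedup results
# (e.g. a local sort-unique-count on each disk) can simply be summed.
assert route("this is line1") == route("this is line1")
```

With 50TB spread over 132 partitions, each partition holds roughly 380GB, which fits a 1TB disk with room for sort scratch space; the 128GB of RAM per res machine covers in-memory merge buffers.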
m*****k
Posts: 731
#20
Bump - does anyone have a good solution?
y**********a
Posts: 824
#21
So is a newbie who doesn't even know how to use the existing wheel just doomed outright?

[Quoting S*A:]
: Hmm, this is an interesting problem - a good one.
: Note that the total data volume is 50GB × 1000 = 50TB.
: The 1000 data machines aren't described as having any writable space,
: so treat them as holding distributed, read-only data.
: That leaves 11 machines for computation, each with 12TB of disk:
: 11 × 12 = 132TB > 50TB. So it should be possible to read each of the
: 1000 machines only once.
: Also note that no single machine can store the full 50TB, which means
: uniqueness has to be found across different machines.
: The question is very grounded in practice, so it looks like you are
: expected to build your own scheme to do the counting - for example,
: roll your own program framework. The lazy Hadoop/Hive route is probably