l******g (posts: 188) | 1 Get the total number of unique lines across a data set of 1000 gzipped text
files.
For instance: if every file has two lines, "this is line1" and "this is
line2", then the total count of lines is 2000 and the total number of unique
lines is 2.
1. 1000 machines, where each machine has one gzipped text file with an
approximate size of 50GB. The file on each machine is /0/data/foo.txt.gz.
2. The 1000 machines are named data1, data2, ..., data1000.
3. Data format: ASCII text.
4. We have 11 machines named res1, res2, ..., res11.
5. Each of the 11 machines has 12 1TB disk drives mounted (/0/data, /1/data,
..., /11/data). The /1/data to /11/data mount points are empty on all machines.
6. Each machine has 128GB of RAM.
7. Each machine can 'talk' to every other machine via ssh without login
credentials.
8. Assume each machine runs the same Linux OS. |
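As a single-machine baseline, the task on one file is just "stream the decompressed lines into a set". This obviously cannot hold a real 50GB file's lines in memory, and the function name is made up for illustration, but it pins down exactly what "unique lines" means here:

```python
import gzip

def unique_line_count(path):
    """Count distinct lines in one gzipped ASCII text file.
    A naive in-memory set: fine for small files, not for the
    50GB-per-machine files in the question."""
    seen = set()
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            seen.add(line.rstrip("\n"))
    return len(seen)
```

On the example above (a file containing "this is line1" and "this is line2" plus repeats), this returns 2.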
z**m (posts: 3080) | 2 This is the first textbook example of MapReduce. |
B*****g (posts: 34098) | 3 Link please, thanks.
【In reply to z**m's post】: the first textbook example of MapReduce.
|
x***i (posts: 585) | |
s*g (posts: 94) | 5 I don't get what this question is asking: it's just a pile of hardware
connected over ssh, with no framework on top at all?
And if each file only held 2000 lines of records like "line1", how could it
ever be as large as 50GB?
【In reply to l******g's post】
|
m**i (posts: 394) | 6 Need to dedup at the file level first, then dedup the lines:
calculate a checksum for each line of each file,
then do a unique sort with merge sort.
【In reply to l******g's post】
|
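The checksum-then-unique-sort idea above can be sketched as follows. This is an in-memory toy: the real data would need an external sort (e.g. GNU `sort -u` over digest files spilled to disk), and both function names here are made up for illustration. The point is that fixed-width digests are far cheaper to sort and ship between machines than raw lines, and with a 128-bit hash the collision risk over ~10^12 lines is negligible:

```python
import gzip
import hashlib

def line_digests(path):
    """Yield a fixed-width (16-byte) digest per line of a gzipped file."""
    with gzip.open(path, "rb") as f:
        for line in f:
            yield hashlib.md5(line.rstrip(b"\n")).digest()

def unique_by_digest(paths):
    """Checksum every line of every file, sort the digests,
    then count distinct runs in the sorted order."""
    digests = sorted(d for p in paths for d in line_digests(p))
    return sum(1 for i, d in enumerate(digests)
               if i == 0 or d != digests[i - 1])
```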
j****u (posts: 12) | 7 mark
【In reply to l******g's post】
|
l********k (posts: 613) | 8 To do the dedup, wouldn't you also have to set up a dedup file system?
【In reply to m**i's post】: dedup at the file level first, then dedup the
lines; checksum each line, then unique-sort with merge sort.
|
m****v (posts: 780) | 9 Copy it all onto Hadoop and write one line of Hive: count(distinct *).
【In reply to l******g's post】
|
S*A (posts: 7142) | 10 Hmm, this is an interesting question, a good one.
Note that the total data volume is 50GB * 1000 = 50TB.
The 1000 data machines aren't described as having any writable space, so
treat them as holding distributed, read-only data.
That leaves 11 machines available for computation. Each has 12TB of disk,
so 11 x 12 = 132TB in total, which is > 50TB. So it should be possible to
read the data on each of the 1000 machines only once.
Also note that no single machine can store the full 50TB, which means the
unique lines have to be found across machines.
This question is very much about practical constraints, so it looks like
you are expected to put together your own scheme for the count, e.g. build
a small program framework yourself. Taking a shortcut with an existing
wheel like Hadoop/Hive probably won't pass; they want to test whether you
can build the wheel. |
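The constraints in the analysis above suggest a hand-rolled map/shuffle: each data machine streams its file, hashes every line, and ships the line (or its digest) to res machine `hash % 11` over ssh. Because equal lines always hash to the same res machine, each of the 11 machines can dedup its own partition independently and the 11 counts simply add up. A minimal local sketch of that partitioning invariant (the 11-way split and res1..res11 come from the problem statement; the function names are hypothetical):

```python
import hashlib

NUM_RES = 11  # res1..res11 from the problem statement

def res_index(line: bytes) -> int:
    """Stable partition: equal lines always map to the same res machine."""
    h = int.from_bytes(hashlib.sha1(line).digest()[:8], "big")
    return h % NUM_RES

def count_unique_partitioned(lines):
    """Simulate the 11 res machines locally: partition every line,
    dedup each partition independently, then sum the per-partition
    counts. Summing is valid because no line lands in two partitions."""
    partitions = [set() for _ in range(NUM_RES)]
    for line in lines:
        partitions[res_index(line)].add(line)
    return sum(len(p) for p in partitions)
```

In the real setup, each partition would be spilled to one of a res machine's 12 empty mounts and deduped there, so no single machine ever needs to hold the whole 50TB.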
m*****k (posts: 731) | |
y**********a (posts: 824) | 21
So a rookie who doesn't even know how to use the wheel just fails outright?
【In reply to S*A's post】: ... they want to test whether you can build the
wheel.
|