H*M Posts: 1268 | 1 Let's discuss.
You have a file with millions of lines of data. Only two lines are identical;
the rest are all unique. Each line is so long that it may not even fit in
memory. What is the most efficient solution for finding the identical lines?
Here not even a single line fits in memory. Is there a good approach? |
g*******y Posts: 1930 | 2 For each line, compute a key, hash the key, and also store the index of the line (for later comparison).
Then directly compare the lines that share the same key.
[In reply to H*M's question above]
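A minimal sketch of the idea in this post, under some assumptions not stated in the thread: lines are separated by `\n`, the "key" is a SHA-256 digest fed chunk by chunk so a whole line is never held in memory, and the stored "index" is the line's byte offset. The names `line_digests`, `same_bytes`, `find_duplicate` and the `CHUNK` size are made up for illustration.

```python
import hashlib

CHUNK = 1 << 16  # bytes read at a time; a whole line is never loaded


def line_digests(path, chunk_size=CHUNK):
    """Yield (start_offset, length, digest) for each line, streaming in chunks."""
    with open(path, "rb") as f:
        start = 0  # byte offset where the current line begins
        pos = 0    # byte offset of the first byte of the current chunk
        h = hashlib.sha256()
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            i = 0
            while True:
                j = buf.find(b"\n", i)
                if j == -1:
                    h.update(buf[i:])  # line continues into the next chunk
                    break
                h.update(buf[i:j])
                yield start, pos + j - start, h.digest()
                h = hashlib.sha256()
                start = pos + j + 1
                i = j + 1
            pos += len(buf)
        if pos > start:  # last line has no trailing newline
            yield start, pos - start, h.digest()


def same_bytes(path, a, b, length, chunk_size=CHUNK):
    """Compare two byte ranges of the same file chunk by chunk."""
    with open(path, "rb") as f1, open(path, "rb") as f2:
        f1.seek(a)
        f2.seek(b)
        remaining = length
        while remaining:
            n = min(chunk_size, remaining)
            if f1.read(n) != f2.read(n):
                return False
            remaining -= n
    return True


def find_duplicate(path):
    """Return (offset_a, offset_b) of the two identical lines, or None."""
    seen = {}  # (length, digest) -> start offsets of lines with that key
    for start, length, digest in line_digests(path):
        key = (length, digest)
        for other in seen.get(key, []):
            # Same key is only a candidate; confirm by comparing the bytes.
            if same_bytes(path, other, start, length):
                return other, start
        seen.setdefault(key, []).append(start)
    return None
```

Only one small key per line stays in memory (a length plus a 32-byte digest), so millions of lines remain manageable; the byte-by-byte check runs only for lines whose keys match.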
|
h**k Posts: 3368 | 3 Compute a hash function value for each line, then compare?
[In reply to H*M's question above]
|
m******9 Posts: 968 | 4 Isn't this an Amazon question? |
g*******y Posts: 1930 | 5 Really?
I wish Amazon would ask me this kind of question, haha.
[In reply to m******9: "Isn't this an Amazon question?"]
|
m******9 Posts: 968 | 6 Good idea.
Double hashing.
[In reply to g*******y: "for each line, compute a key, hash the key, and also store the index of the line (for later comparison); compare lines with the same key"]
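One possible reading of "double hashing" here (an assumption; the term also names an open-addressing probing scheme) is to key each line on two independent hash functions, so that an accidental collision in one is almost never a collision in both. A rough sketch, with `two_hash_key` as a made-up helper and CRC-32 plus BLAKE2b as arbitrarily chosen functions:

```python
import hashlib
import zlib


def two_hash_key(chunks):
    """Fold an iterable of byte chunks into a (crc32, blake2b) key pair."""
    crc = 0
    strong = hashlib.blake2b(digest_size=16)
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # running CRC-32 over all chunks so far
        strong.update(chunk)          # second, independent hash of the same bytes
    return crc, strong.digest()
```

Both hashes are updated incrementally, so the line can still be streamed in pieces. Lines whose pairs match would still need a direct comparison to be certain they are identical.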
|