s*i posts: 388 | 1 A big company (I signed an NDA, so I won't name it) asked a question that wasn't entirely clear to me:
Given 1 million files of various sizes, how do you find the duplicate files?
I made several attempts, but the interviewer kept pushing me for improvements.
1. Hash the file contents into a hash table. I proposed a few hash functions.
He said OK, but asked me to improve on it.
I offered two more improvements, and he kept pushing until time ran out.
(I don't want to poison your thoughts, so I won't post my answers here. :) )
So, what's the best solution? | e********c posts: 66 | 2 It's better to check the file sizes first. If the sizes are the same,
then compare hash values. To reduce the odds of a false match, you can compute two different hash values. | r****o posts: 1950 | 3 Could we check the file sizes first? Hash each file's size, and when two files map to the same slot of the hash table, compare those two files sequentially from the start.
Would this approach work?
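A minimal sketch of the size-first, double-hash idea from the two replies above (my own illustration; find_duplicates and the choice of MD5 plus SHA-1 are assumptions, not from the thread):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Bucket files by size, then confirm duplicates with two independent digests."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue                      # a file with a unique size has no duplicate
        for p in group:
            with open(p, 'rb') as f:
                data = f.read()
            # Two different hash functions make a false match astronomically unlikely.
            key = (size, hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest())
            groups[key].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Reading whole files into memory is fine for a sketch; a streamed digest, shown further down, avoids that for GB-sized files.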
| w******1 posts: 520 | 4 People always talk about BITMAP; this question is probably a BITMAP one. | g*****g posts: 34805 | 5 Start by sorting the files by size.
Then, for files of the same size, compare them byte by byte. If other attributes can be carried along, which is typical for DB storage, you can save an MD5 digest and use it for the comparison.
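A sketch of the byte-by-byte comparison step, streaming in chunks so even large files never need to fit in memory (same_content is my own name, not from the thread):

```python
def same_content(path_a, path_b, chunk_size=1 << 20):
    """Compare two same-size files byte by byte, reading 1 MB at a time."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False              # first differing chunk settles it
            if not a:                     # both streams exhausted together
                return True
```

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does essentially this, so in practice you would reach for that first.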
| f****4 posts: 1359 | 6 For files you can just use MD5. It's fast even when checksumming files several GB in size. If the MD5s match, then find some way to verify the file contents. | s*i posts: 388 | 7 Right, that's the second improvement I gave during the interview. But I don't think we need to explicitly "sort", because that's O(n log n).
Assume file sizes range from 1 KB to 1 GB; we can use a table with 1M slots, put the files into this file_size_array bucketed by size, and then, after that pass, check the MD5 of the files that fall into the same slot.
But the interviewer still pushed me after this stage...
Any more ideas? Thanks...
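Putting posts 6 and 7 together, a sketch of the bucket-by-size pass followed by a streamed MD5 run only on collisions (the function names and the 1 MB chunk size are my assumptions):

```python
import hashlib
import os
from collections import defaultdict

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file through MD5 in 1 MB chunks, so multi-GB files never sit in memory."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def group_by_size_then_md5(paths):
    """Bucketing by size is O(n), avoiding the O(n log n) sort."""
    by_size = defaultdict(list)           # the size plays the role of the 1M-slot table
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    dupes = defaultdict(list)
    for group in by_size.values():
        if len(group) > 1:                # only files sharing a size need the expensive digest
            for p in group:
                dupes[md5_of_file(p)].append(p)
    return [g for g in dupes.values() if len(g) > 1]
```

If an MD5 collision is still a concern, the colliding files can be confirmed with the byte-by-byte comparison sketched earlier.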
| s*i posts: 388 | 8 That was my first thought too, but a bitmap cannot map back to the file names (maybe 100 files land in the same bitmap slot), so it's not enough. That's why I went with a hash table.
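A toy contrast showing why the bitmap alone falls short: both structures detect a slot collision, but only the table can say which files collided (all names here are mine, purely illustrative):

```python
from collections import defaultdict

NUM_SLOTS = 1 << 20                       # 1M slots, matching the table size discussed above

# Bitmap: one bit per slot. It can answer "has some earlier file hashed here?",
# but not WHICH file, so the duplicate pair cannot be reported from it alone.
bitmap = bytearray(NUM_SLOTS // 8)

def bitmap_mark(h):
    byte, bit = divmod(h % NUM_SLOTS, 8)
    seen = bool(bitmap[byte] & (1 << bit))
    bitmap[byte] |= 1 << bit
    return seen                           # True means a collision, identity unknown

# Hash table: the same collision test, but each slot remembers its file names,
# so colliding files can be fetched and compared.
table = defaultdict(list)

def table_mark(h, path):
    slot = table[h % NUM_SLOTS]
    slot.append(path)
    return slot                           # every file that landed in this slot so far
```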
| v****s posts: 1112 | 9 That guy was probably just testing your behavior... a stress test.