JobHunting board - Algorithm question from a big company
s*i
Posts: 388
1
A certain big company; I signed an NDA, so I won't name it. One question I'm still not clear about:
given 1 million files of various sizes, how to find duplicated files?
I made several attempts, but the interviewer kept pushing me for improvement.
1. Hash the file contents into a hash table (I suggested a few hash functions).
He said OK, but asked me to improve it. I offered two more improvements, and he kept pushing until time ran out.
(I don't want to poison your thoughts, so I won't post my answers here. :) )
So, what's the best solution?
e********c
Posts: 66
2
It's better to check the file size first. If the sizes are the same, then compare hash values. To improve the odds, you can use two different hash functions.
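The two-hash idea can be sketched like this. This is a minimal sketch of my own; the choice of MD5 plus SHA-1 and the `find_dupes` helper are illustrative, not from the post:

```python
import hashlib

def double_hash(data: bytes) -> tuple:
    """Pair two independent digests; a false positive would require
    a simultaneous collision in both, which is astronomically unlikely."""
    return (hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest())

def find_dupes(files: dict) -> list:
    """files maps name -> content bytes; returns groups of duplicate names."""
    groups = {}
    for name, data in files.items():
        # Cheap filter first: only contents of equal length can match.
        key = (len(data), double_hash(data))
        groups.setdefault(key, []).append(name)
    return [g for g in groups.values() if len(g) > 1]
```

In practice the second hash only needs to be computed for files whose first hash already collides.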
r****o
Posts: 1950
3
Could you check the file size first: hash on the size, and when two files map to the same slot in the hash table, compare the two files sequentially from the beginning. Would that work?

【Quoting s*i's post above】

w******1
Posts: 520
4
People keep mentioning BITMAP; this problem is probably a BITMAP problem.
g*****g
Posts: 34805
5
Start by sorting the file sizes. Then, for files with the same size, compare byte by byte. If other attributes can be carried along, which is typical for DB storage, you can save an MD5 and use it for comparison.

【Quoting s*i's post above】

f****4
Posts: 1359
6
For files you can just use MD5; it checksums even multi-GB files quickly. If the MD5s match, then verify the file contents some other way.
s*i
Posts: 388
7
Right, that's the second improvement I gave during the interview. But I don't think we need to explicitly "sort", because that's O(n log n).
Assuming file sizes range from 1 KB to 1 GB, we can use a table of 1M slots, put the files into this file_size_array, and after this pass check the MD5s of the files that fall into the same slot.
But the interviewer still pushed me after this stage...
Any more ideas? Thanks...

【Quoting g*****g's post above】

s*i
Posts: 388
8
That was my first thought, but a bitmap cannot map back to the file names (maybe 100 files land in the same bitmap slot), so it's not enough. So I went with a hash table.

【Quoting w******1's post above】
v****s
Posts: 1112
9
That guy was probably just testing your behavior... a stress interview.