求教一道面试题 - JobHunting版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

JobHunting版 - 求教一道面试题

相关主题
● 一道巨常见的题	● web count 设计
● 请问一道面试题	● #面试题#有100个database，每个存1 million data，如何求出median number of 这些数。
● 再问一道题	● amazon问题求教
● 问一个多次遇到的面试题	● 弱弱的问问 2sum, 3sum 的问题
● 问个amazon面试题	● 请教个题
● unordered_set是怎么实现的？	● 再问个题
● 一个多线程的简单问题	● 这个题怎么做啊？
● 问道关于快速找bucket的面试题	● HASHTABLE collision 后REHASH 怎么SEARCH

相关话题的讨论汇总
话题: bucket话题: hashtable话题: number话题: counts话题: 32

进入JobHunting版参与讨论

(共1页)

c**y
发帖数: 2282

一个很大的数组，长度可以是100000或者更大，里面的数字可以从0到2的32次方，找出
出现频率最高
的数字
我给的答案是hashtable, 小印女很不满，觉得太占内存，红黑树，好像也不满意，哪
位大牛给指点
下，她到底期待一个什么样的数据结构啊

y*********e
发帖数: 518

首先要问，输入是存在主内存中，还是存在外部文件？
若是输入是存在主内存中的array，是否已经排序好了？多半是没排序的。
那么这个问题可以简化成，给定一个未排序的32bit数组，找出出现频率最高的。用
hashtable 是最简单的解法; 不过面官对 hashtable 不满意，你可以分析出
hashtable 的 memory overhead 有多大，这样可以表现出你对 hashtable 内在实
现的了解程度。（hashtable是很占内存的）。
另外一个很简单的方案：对数组排序，用掉O(NlogN)的时间。那么相同的元素都放
在一起了。接下来扫描数组一遍，同时记录相同元素出现的次数，能有用O(1)的空
间找到出现频率最大的相同元素。

c***2
发帖数: 838

This is to find an algorithm, not to find a data structure.
One way to try:
for(i=0;i<32;i++)
read one bit (the i-th bit) of all numbers
Count how many ones, save it to counts[i]
Then we get array counts[32]
Do this a few rounds:
Find the smallest non-zero number from counts[32]: min
for(i=0;i<32;i++)
counts[i] -= min;
until
there are only two different numbers left in the array
one number must be 0, the other non-zero number
constuct the number from array counts:
i-t

s*****n
发帖数: 5488

看不懂原理。不过举个例子。假设1M数只有一个1。剩下都是零。
那么照这个算法，最frequent的数是1？

【在 c***2 的大作中提到】

: This is to find an algorithm, not to find a data structure.
: One way to try:
: for(i=0;i<32;i++)
: read one bit (the i-th bit) of all numbers
: Count how many ones, save it to counts[i]
: Then we get array counts[32]
: Do this a few rounds:
: Find the smallest non-zero number from counts[32]: min
: for(i=0;i<32;i++)
: counts[i] -= min;

c***2
发帖数: 838

This won't work.(good only for a few special cases).
The direction may be right...

【在 c***2 的大作中提到】

c**y
发帖数: 2282

嗯，很好的想法。不过，读一个bit和读进来一个字长耗费的时间都是一样的吧

f****g
发帖数: 313

Could we do multi-level bucket sort:
Round 1:
Construct the buckets as int *a[32].
For each number, we place the number into the buckets by the most-right non-
zero bit. For example: 01000000,00000000,00110000,11000000, we link it into
the bucket a[30]. 00000001,00000000,00110000,11000000 => link to bucket a[
24].
Then for each bucket, we could use the same strategy to sort or quick sort(
depends on the number of integers to be sorted).

t****a
发帖数: 1212

agree!

non-
into

【在 f****g 的大作中提到】

: Could we do multi-level bucket sort:
: Round 1:
: Construct the buckets as int *a[32].
: For each number, we place the number into the buckets by the most-right non-
: zero bit. For example: 01000000,00000000,00110000,11000000, we link it into
: the bucket a[30]. 00000001,00000000,00110000,11000000 => link to bucket a[
: 24].
: Then for each bucket, we could use the same strategy to sort or quick sort(
: depends on the number of integers to be sorted).
:

s*********g
发帖数: 153

How does example work? I did not get it.

non-
into

【在 f****g 的大作中提到】

p********7
发帖数: 549

先用1bit bitmap，0表示没出现，1表示出现过1次或者以上
然后再根据上面的结果创建一个hashtable表格，里面只包括了出现了1次或者1次以上的数字，这样这个hashtable的大小会小很多。
最后就是一个counting的过程

相关主题
● unordered_set是怎么实现的？	● web count 设计
● 一个多线程的简单问题	● #面试题#有100个database，每个存1 million data，如何求出median number of 这些数。
● 问道关于快速找bucket的面试题	● amazon问题求教
进入JobHunting版参与讨论

f****g
发帖数: 313

That's a very good solution too: use bitmap to determine the existence of
each
element, and then build a hashtable to count.

上的数字，这样这
个hashtable的大小会小很多。

【在 p********7 的大作中提到】

: 先用1bit bitmap，0表示没出现，1表示出现过1次或者以上
: 然后再根据上面的结果创建一个hashtable表格，里面只包括了出现了1次或者1次以上的数字，这样这个hashtable的大小会小很多。
: 最后就是一个counting的过程

f****g
发帖数: 313

Take 4-bit integer as an example:
Bucket 1: 1000 - 1111
Bucket 2: 0100 - 0111
Bucket 3: 0010 - 0011
Bucket 4: 0001 - 0001
Got it?

【在 s*********g 的大作中提到】

: How does example work? I did not get it.
:
: non-
: into

d*********i
发帖数: 628

学习了。

k***g
发帖数: 58

不知道这个bitmap有什么用，原来数组中的数，出现次数肯定大于等于1，跟直接扫描
数组有区别？

上的数字，这
样这个hashtable的大小会小很多。

【在 p********7 的大作中提到】

g*****e
发帖数: 282

这个bitmap开销很大。2^32 bits。建ht的过程又要读一遍。这个ht能直接count，还是
类似bucket sort要来好几轮？
我觉得还是类似bucket sort的思路好，每次2^8个buckets，最坏的情况全部数字读4遍
。从space和time的cost均衡一些。
但是我觉得tree不是更好，类似Huffman coding，每个node带一个counter？

【在 f****g 的大作中提到】

: That's a very good solution too: use bitmap to determine the existence of
: each
: element, and then build a hashtable to count.
:
: 上的数字，这样这
: 个hashtable的大小会小很多。

p********7
发帖数: 549

这么一来其实还是分了N个extra 内存

non-
into

【在 f****g 的大作中提到】

(共1页)

进入JobHunting版参与讨论

相关主题
● HASHTABLE collision 后REHASH 怎么SEARCH	● 问个amazon面试题
● 算法题	● unordered_set是怎么实现的？
● A facebook interview question	● 一个多线程的简单问题
● hashtable的遍历	● 问道关于快速找bucket的面试题
● 一道巨常见的题	● web count 设计
● 请问一道面试题	● #面试题#有100个database，每个存1 million data，如何求出median number of 这些数。
● 再问一道题	● amazon问题求教
● 问一个多次遇到的面试题	● 弱弱的问问 2sum, 3sum 的问题

相关话题的讨论汇总
话题: bucket话题: hashtable话题: number话题: counts话题: 32

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天