a**********k Posts: 1953 | 1 Each of the N machines has an array of N numbers. How do we calculate
the median of the N^2 numbers efficiently? |
c*****e Posts: 74 | 2 I was asked this by an Indian interviewer at Google Mountain View 2.5 years ago.
I remember we discussed the communication cost between machines and the available
memory and disk on each machine, but I couldn't give a convincing answer then.
|
a**********k Posts: 1953 | 3 A more general form would be:
the median of M*N numbers across M machines, each holding N numbers.
When M=2, there's a well-known solution.
How do we generalize it to any integer M?
|
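For reference, the well-known M=2 solution mentioned above is usually stated for two sorted arrays: compare the k/2-th elements and discard the half that cannot contain the answer. A minimal Python sketch, assuming each machine's array is already sorted (an assumption not stated in the thread; function and parameter names are illustrative):

    def kth_of_two(a, b, k):
        # k-th smallest (1-indexed) element of two sorted lists, O(log k) comparisons
        if len(a) > len(b):
            a, b = b, a                          # keep a as the shorter list
        if not a:
            return b[k - 1]
        if k == 1:
            return min(a[0], b[0])
        i = min(len(a), k // 2)                  # probe roughly k/2 elements of a
        j = k - i                                # and the matching position in b
        if a[i - 1] < b[j - 1]:
            return kth_of_two(a[i:], b, k - i)   # a[:i] cannot hold the k-th smallest
        return kth_of_two(a, b[j:], k - j)       # b[:j] cannot hold the k-th smallest

    def median_of_two(a, b):
        # lower median of the combined numbers
        return kth_of_two(a, b, (len(a) + len(b) + 1) // 2)

The rest of the thread takes a different route (bucket counting) that generalizes more naturally to M machines.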
b********h Posts: 119 | 4 If you can use MapReduce, you could partition the numbers into several
buckets (in the mapper) and count how many numbers are in each bucket. Then you
know which bucket to look in for the median. If that bucket is small enough, we
could put it into memory and find the median. Otherwise, do another
MapReduce job. |
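A minimal sketch of that bucket-counting idea, simulated in Python on one machine. The bucket count, value range, and function names are assumptions for illustration; in a real job each mapper would emit (bucket_id, count) pairs and the reducers would sum them:

    from collections import Counter

    NUM_BUCKETS = 1000            # assumed number of buckets
    LO, HI = 0.0, 1e6             # assumed value range, e.g. known a priori or sampled

    def bucket_of(x):
        # map side: assign a value to a bucket by range partitioning
        width = (HI - LO) / NUM_BUCKETS
        return min(NUM_BUCKETS - 1, int((x - LO) / width))

    def map_counts(local_numbers):
        # each machine emits pre-aggregated (bucket_id, count) pairs
        return Counter(bucket_of(x) for x in local_numbers)

    def reduce_counts(per_machine_counters):
        # reduce side: sum the bucket counts from all machines
        total = Counter()
        for c in per_machine_counters:
            total.update(c)
        return total

    def bucket_holding_median(total, n_total):
        # scan buckets in order until the cumulative count reaches the median rank
        target = (n_total + 1) // 2              # rank of the lower median
        seen = 0
        for b in range(NUM_BUCKETS):
            if seen + total[b] >= target:
                return b, target - seen          # bucket id and 1-indexed rank inside it
            seen += total[b]
        raise ValueError("counts do not cover the median rank")

If the winning bucket is still too large for one machine's memory, the same counting can be repeated inside just that bucket, as discussed further down the thread.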
a**********k Posts: 1953 | 5 If the numbers have been distributed beforehand, then each bucket will
be spread across the N machines, and the problem reduces to the original one.
Also, the MapReduce infrastructure itself has a cost.
|
j********x Posts: 2330 | 6 This is a good question with room for discussion,
unlike those idiotic, boring problems... |
b********h Posts: 119 | 7
No. After the map phase, every bucket will be on one machine; this is due to
the shuffling between map and reduce.
Yes, the largest cost is that shuffling between map and reduce.
|
c*****e Posts: 74 | 8 I agree. This seems to be a good way to decrease the communication cost and
disk I/O (but I didn't know this two years ago).
|
a**********k Posts: 1953 | 9 So the map task will run on each machine, and the data will be shuffled
so that each bucket ends up on one machine. Then the reduce task on
each machine does the counting and reports it to one master machine,
which calculates which machine holds the bucket containing the median and
asks it to finish up the rest of the work.
|
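A small sketch of the master's part of that flow, given the per-bucket counts from the reducers and a bucket-to-machine assignment produced by the shuffle (both argument names are illustrative, not a real MapReduce API):

    def locate_median(bucket_counts, bucket_owner, n_total):
        # bucket_counts: dict bucket_id -> count   (from the reduce phase)
        # bucket_owner:  dict bucket_id -> machine (from the shuffle)
        # returns (machine holding the median's bucket, 1-indexed rank inside that bucket)
        target = (n_total + 1) // 2
        seen = 0
        for b in sorted(bucket_counts):
            if seen + bucket_counts[b] >= target:
                return bucket_owner[b], target - seen
            seen += bucket_counts[b]
        raise ValueError("target rank exceeds the total count")

The machine returned here only has to select that rank within its local bucket, e.g. with quickselect.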
j*****u Posts: 1133 | 10 Actually, MapReduce doesn't reduce the communication cost or the disk I/O.
Each M/R phase requires m*r connections/intermediate files (m map tasks times r reduce tasks).
|
c*****e Posts: 74 | 11 Actually, this is a special Map/Reduce scenario. Each machine can pass
its distribution statistics over the buckets (which are small compared to
the original data) to one machine, which can then determine which
bucket the median is in.
|
a**********k Posts: 1953 | 12 Actually, we can do some optimization over classic Map/Reduce:
leave all buckets distributed across the N machines as is, have each machine count
the numbers in each of its buckets, and pass the results to a master machine,
which figures out which bucket holds the median. Then
we only need to merge the data in that particular bucket onto one machine
to finish the remaining work. In this case, we don't need to shuffle the data
in all the buckets as in classic MR, which carries the most overhead, as
someone pointed out earlier. We only need to move the data in one bucket.
|
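A sketch of this no-shuffle variant end to end. Here `machines` stands in for the per-machine arrays and `boundaries` for bucket edges agreed on up front; both are illustrative assumptions, and in a real deployment step 1 and step 3 run remotely, with only the small count vectors and the single chosen bucket crossing the network:

    import bisect

    def distributed_median(machines, boundaries):
        # machines: list of local arrays, one per machine; boundaries: sorted bucket edges
        def local_counts(arr):
            # step 1: each machine histograms its own numbers; no data records move
            counts = [0] * (len(boundaries) + 1)
            for x in arr:
                counts[bisect.bisect_right(boundaries, x)] += 1
            return counts

        per_machine = [local_counts(arr) for arr in machines]   # runs in parallel
        totals = [sum(col) for col in zip(*per_machine)]        # step 2: master sums the counters
        target = (sum(totals) + 1) // 2                         # rank of the lower median
        seen, b = 0, 0
        while seen + totals[b] < target:
            seen += totals[b]
            b += 1
        # step 3: pull only bucket b onto one machine and select within it
        lo = boundaries[b - 1] if b > 0 else float("-inf")
        hi = boundaries[b] if b < len(boundaries) else float("inf")
        in_bucket = sorted(x for arr in machines for x in arr if lo <= x < hi)
        return in_bucket[target - seen - 1]

The counter vectors are tiny (one entry per bucket, per machine), so the dominant network cost is shipping the one bucket in step 3.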
j*j Posts: 5564 | 13 Good.
|
c*****e Posts: 74 | 14 bladesmith suggested repeating the first step if the bucket is still large.
|
a**********k Posts: 1953 | 15 Yes, if the master finds the bucket is still too large
to fit in memory, it can ask each machine to divide
that particular bucket into more sub-buckets and
count each sub-bucket on each machine. This way
we only incur the cross-machine communication
overhead of collecting the counters (not shuffling the
data records around), plus the final merge, a huge
saving in communication bandwidth as well as
time/space cost.
|
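The repeated-narrowing step could look like the sketch below. Parameter names such as `memory_limit` and `num_sub` are illustrative; inside the loop only the small count vectors cross machines, and data records move exactly once at the end. Heavily duplicated values would need a separate tie-breaking case, since a sub-bucket full of identical numbers never shrinks:

    def narrow_and_select(machines, lo, hi, target_rank, memory_limit, num_sub=100):
        # find the target_rank-th smallest value in [lo, hi) across all machines
        while True:
            width = (hi - lo) / num_sub
            # every machine counts its values per sub-bucket locally
            totals = [0] * num_sub
            for arr in machines:
                for x in arr:
                    if lo <= x < hi:
                        totals[min(num_sub - 1, int((x - lo) / width))] += 1
            # the master scans the counters for the sub-bucket holding target_rank
            seen, b = 0, 0
            while seen + totals[b] < target_rank:
                seen += totals[b]
                b += 1
            lo, hi = lo + b * width, lo + (b + 1) * width
            target_rank -= seen
            if totals[b] <= memory_limit:
                break
        # final merge: only this one small bucket is shipped to a single machine
        in_bucket = sorted(x for arr in machines for x in arr if lo <= x < hi)
        return in_bucket[target_rank - 1]

For the median, this would be called with [lo, hi) covering the full value range and target_rank = (total count + 1) // 2.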
b********h Posts: 119 | 16 True. This is how people do K-means using MapReduce: instead of passing all
the objects that belong to one cluster to the reducers, just pass some
statistics.
|
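In the same spirit, one K-means iteration can be written so that each machine ships only per-cluster (sum, count) statistics rather than the points themselves. A NumPy sketch; `machine_chunks` (a list of per-machine point arrays) and the function name are illustrative:

    import numpy as np

    def kmeans_step(machine_chunks, centroids):
        # one iteration: returns updated centroids of shape (k, d)
        k, d = centroids.shape
        sums = np.zeros((k, d))
        counts = np.zeros(k)
        for chunk in machine_chunks:                  # "map" side: runs locally per machine
            dists = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)             # nearest centroid for each local point
            for c in range(k):
                members = chunk[labels == c]
                sums[c] += members.sum(axis=0)        # only these k (sum, count) pairs
                counts[c] += len(members)             # ever cross the network
        new_centroids = centroids.copy()              # empty clusters keep their old centroid
        nonempty = counts > 0
        new_centroids[nonempty] = sums[nonempty] / counts[nonempty][:, None]
        return new_centroids

Each machine's contribution is k*(d+1) numbers regardless of how many points it holds, which is what keeps the communication cost small.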