b****z 发帖数: 176 | 1 题目是这样的。
Inputs:
1. Text file containing IP-range (lost/highest IPs inclusive) and country-
code, ~10MB
2. Text file containing browser-IP and URL requested by that browser, ~1TB
Outputs:
1. Top-10 countries that generated the most requests.
2. Top-10 most requested URLs grouped by country.
exact counts are not important. Assume a single machine (16-core / 32-GB
RAM/ 4TB drive) dedicated to this job
第一问应该是用heap做。但是第二问还没有想到很好的方法。 | s**x 发帖数: 7506 | 2 How to map ip to a country efficiently? 以前有个同事喜欢问差不多同样的问题,
好像就是用两个level, first level direct small hash, second level binary
search.
Group by country is easy as we only have 200+ countries .
Group by country + url will be a lot, since not require exact number, you
probably can just main a small heap, say 10000 country + url , then find the
last top 10.
抛砖引玉。 | s******7 发帖数: 1758 | 3 我来扔个砖头吧,没做过big data, 为面试专门看了一下
1苯办法,文件肯定至少要读一遍的,读的时候每个ip判断国家,按200个国家分成200
个文件, 同时用一个long[20]数组记录着个国家的ip数,数组排序就得到前10
2.每个国家的文件,根据该国的ip数和request数,一个或者分做多个(32G/ip数*
request数)map(ip, request number),再读一遍,用heap就可以找到 top 10了 | m*****k 发帖数: 731 | 4 1 用的是browser IP, 直接读文件, update 一个HashMap
, req_count>, update 一个小文件,文件名为URL_HASH,小文件只有一条纪录:<
countryCode_of_URL,
URL,req_count>
最后再heaplify一下HashMap, 不需要用long[20]数组和排序。
2 用的是URL,用上一步生成的所有小文件建立200+个堆。精确解。 | c********r 发帖数: 107 | | b*******d 发帖数: 750 | | m*****n 发帖数: 2152 | 7 这道题很简单啊。
建一个二维histogram, x axis用IP分bin (可以每个country一个bin),y axis用
URL的hash value分bin。
然后顺序读入所有数据, fill 相对应的bin。内存绝对够用。
然后把histogram projection到x axis,建个10位的heap,用heap sort,搞定(1).
然后把histogram projection到y axis,建个10位的heap,用heap sort,搞定(2).
【在 b****z 的大作中提到】 : 题目是这样的。 : Inputs: : 1. Text file containing IP-range (lost/highest IPs inclusive) and country- : code, ~10MB : 2. Text file containing browser-IP and URL requested by that browser, ~1TB : Outputs: : 1. Top-10 countries that generated the most requests. : 2. Top-10 most requested URLs grouped by country. : exact counts are not important. Assume a single machine (16-core / 32-GB : RAM/ 4TB drive) dedicated to this job
|
|