MapReduce question - Programming board

Archived excerpt of a post from 未名空间 (MITBBS).


f*********e
Posts: 8453

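(Using the Python mrjob package)
Can a mapper assign to a shared variable that all reducers then read? For example, the mappers count word frequencies in several files (the file names and the number of files are not known in advance), and the job should output a two-dimensional table: one row per word, one column per file, with a zero wherever a file does not contain the word.
My code meets this requirement in local testing, but once it runs on the Hadoop cluster it raises an exception saying the shared variable was never assigned.
Basic skeleton: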
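from mrjob.job import MRJob

class MRcounter(MRJob):
    filenames = set()  # class attribute; each task process has its own copy

    def mapper(self, _, line):
        # current_filename, word and count come from code omitted in this skeleton
        self.filenames.add(current_filename)
        yield word, {current_filename: count}

    def reducer(self, word, counts):
        filenames = sorted(self.filenames)  # this is what turns out empty on the cluster
        output = [0] * len(filenames)
        for count in counts:
            for k, c in count.items():
                output[filenames.index(k)] = c
        yield word, '\t'.join(str(c) for c in output)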
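With the test input sample this runs correctly on my machine. On the Hadoop cluster the same input fails. Specifically, filenames is empty when the reducer reads it, as if the mapper had never assigned to it.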
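
A likely explanation: on a Hadoop cluster every map and reduce task runs in its own process, often on a different machine, so a class attribute filled in by a mapper never reaches a reducer. A local run can keep everything in one Python process, which would explain why the same code appears to work there. One way around this is to pass the file names through the data itself. Below is a minimal sketch in plain Python (no mrjob import, so it runs anywhere) of a two-step layout: step 1 sums counts per word and file, step 2 sends every row to a single key so that one reducer sees all file names and can build the table. The function names, the `shuffle` helper and the `run_local` driver are hypothetical and exist only for illustration. In mrjob the three generator functions would presumably become the mapper and reducers of a two-step job, with values kept JSON-serializable.

```python
from collections import defaultdict


def map_line(filename, line):
    """Step 1 mapper: emit (word, (filename, 1)) for every word in the line."""
    for word in line.split():
        yield word, (filename, 1)


def reduce_word(word, pairs):
    """Step 1 reducer: sum counts per file for one word.

    The result is re-keyed under None so that a single step 2 reducer
    receives every word's row.
    """
    per_file = defaultdict(int)
    for filename, n in pairs:
        per_file[filename] += n
    yield None, (word, dict(per_file))


def reduce_table(_, rows):
    """Step 2 reducer: learn all file names from the rows, then emit the table."""
    rows = list(rows)
    filenames = sorted({f for _word, counts in rows for f in counts})
    yield "word", "\t".join(filenames)  # header row
    for word, counts in sorted(rows, key=lambda row: row[0]):
        # Files that never contained the word get a zero.
        yield word, "\t".join(str(counts.get(f, 0)) for f in filenames)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()


def run_local(files):
    """Hypothetical local driver; files maps a file name to its list of lines."""
    mapped = [kv for name, lines in files.items()
              for line in lines
              for kv in map_line(name, line)]
    step1 = [kv for word, pairs in shuffle(mapped)
             for kv in reduce_word(word, pairs)]
    return [kv for key, rows in shuffle(step1)
            for kv in reduce_table(key, rows)]
```

The trade-off is that step 2 funnels the whole table through one reducer, which is fine for a modest vocabulary but will not scale to very large ones. In that case it may be better to emit sparse (word, file, count) records and pivot them into a table after the job finishes.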
