This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS).
Programming board - Spark is slower than Java MapReduce -- Scala big bulls pls advise
v*****r
Posts: 2325
1
Spark beginner here, trying out the buzz tech.
Input: a 200 GB uncompressed data file stored in HDFS.
Cluster: 37 worker nodes, each with 24 cores.
Java MapReduce: 6-8 minutes.
Spark: 37 minutes, in two 18-minute stages.
"Lightning-fast cluster computing, 100x faster" ???!!!!
Big bulls, please advise!
# sortMapper sorts the values for each key, then iterates over the grouped values
text = sc.textFile(input, 1776)  # 24 * 37 * 2
(text.map(mapper)
     .filter(lambda x: x is not None)
     .groupByKey()
     .map(sortMapper)
     .filter(lambda x: x[1] != [])
     .saveAsTextFile(output))
sc.textFile and saveAsTextFile are very slow.
The configuration is as follows:
from pyspark import SparkConf
conf = (SparkConf()
        .set("spark.executor.memory", "24g")
        .set("spark.driver.memory", "16g")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
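
For readers following along, here is a minimal single-process sketch of what the chain above computes, written in plain Python so it runs without a cluster. The post does not show mapper or sortMapper, so the versions below (a "key,value" parser and a sort-then-deduplicate step) are hypothetical stand-ins, and the sample lines are made up.

```python
from collections import defaultdict


# Hypothetical stand-in: the original post does not show its mapper.
def mapper(line):
    """Parse 'key,value' into (key, int(value)); return None for bad lines."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    key, value = parts
    try:
        return key, int(value)
    except ValueError:
        return None


# Hypothetical stand-in: the original post does not show its sortMapper.
def sort_mapper(pair):
    """Sort the grouped values for one key, then iterate over them.

    The iteration (dropping consecutive duplicates) is only a placeholder.
    """
    key, values = pair
    result = []
    for v in sorted(values):
        if not result or result[-1] != v:
            result.append(v)
    return key, result


def run_pipeline(lines):
    """Local equivalent of map -> filter -> groupByKey -> map -> filter."""
    mapped = [kv for kv in map(mapper, lines) if kv is not None]

    # groupByKey: gather every value that shares a key. On a cluster this is
    # the step that has to move records between nodes (a shuffle).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    processed = map(sort_mapper, groups.items())
    return [kv for kv in processed if kv[1] != []]


sample = ["a,3", "b,1", "a,1", "bad line", "a,3", "b,2"]
result = run_pipeline(sample)
print(result)  # [('a', [1, 3]), ('b', [1, 2])]
```

In Spark, groupByKey introduces a shuffle and therefore a stage boundary, which would be consistent with the job showing up as two stages. Whether that shuffle, the Python workers, or the HDFS reads and writes dominate the 37 minutes cannot be told from the post alone.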
N********n
Posts: 8363
2
It's "lightning fast" only when it's in-memory; otherwise there's
no magic here.
w********m
Posts: 1137
3
Was there a shuffle?

[Quoting v*****r:]
: Spark beginner here, trying out the buzz tech.
: Input: a 200 GB uncompressed data file stored in HDFS.
: Cluster: 37 worker nodes, each with 24 cores.
: Java MapReduce: 6-8 minutes.
: Spark: 37 minutes, in two 18-minute stages.
: "Lightning-fast cluster computing, 100x faster" ???!!!!
: Big bulls, please advise!
: # sortMapper sorts the values for each key, then iterates over the grouped values
: text = sc.textFile(input, 1776)  # 24 * 37 * 2

b********l
Posts: 84
4
No. Worker memory is 24g x 37, which is about 4 times the data size.
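
The sizing figures quoted in this thread can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one 24g executor per worker node (as the reply implies) and treats GB as a plain unit. Not all executor memory is available for holding data, so the ratio is an upper bound.

```python
# Figures taken from the thread.
NODES = 37
CORES_PER_NODE = 24
EXECUTOR_MEMORY_GB = 24   # spark.executor.memory = 24g
INPUT_GB = 200
PARTITIONS = CORES_PER_NODE * NODES * 2   # the 1776 passed to sc.textFile

total_cores = NODES * CORES_PER_NODE
total_executor_memory_gb = NODES * EXECUTOR_MEMORY_GB
memory_to_data_ratio = total_executor_memory_gb / INPUT_GB
mb_per_partition = INPUT_GB * 1024 / PARTITIONS

print("total cores:", total_cores)                              # 888
print("total executor memory (GB):", total_executor_memory_gb)  # 888
print("memory / data:", round(memory_to_data_ratio, 2))         # 4.44
print("MB per partition:", round(mb_per_partition, 1))          # 115.3
```

The second argument of sc.textFile is a minimum partition count, so Spark may create more than 1776 partitions, and the per-partition size above is an average at best.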

[Quoting w********m:]
: Was there a shuffle?

b********l
Posts: 84
5
Even if it is doing disk reads and writes, it shouldn't be slower than Java MR, should it?

[Quoting N********n:]
: It's "lightning fast" only when it's in-memory; otherwise there's
: no magic here.
