v*****r posts: 2325 | 1 Spark beginner trying out the buzz tech.
Input: a 200GB uncompressed data file stored in HDFS.
37 worker nodes, each with 24 cores.
Using Java MapReduce: 6-8 minutes.
Using Spark: 37 minutes, as two 18-minute stages.
"Lightning-fast cluster computing, 100x faster" ???!!!!
Big bulls, please advise!
# sortMapper sorts the values for each key, then does some iteration over the
# grouped values
text = sc.textFile(input, 1776)  # 24 * 37 * 2 partitions
text.map(mapper) \
    .filter(lambda x: x is not None) \
    .groupByKey() \
    .map(sortMapper) \
    .filter(lambda x: x[1] != []) \
    .saveAsTextFile(output)
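The groupByKey() step above is usually the culprit in jobs like this: it shuffles every value across the network and materializes the whole group per key before sortMapper can run. For sort-values-per-key workloads, the usual alternative is the "secondary sort" pattern (in Spark, repartitionAndSortWithinPartitions with a composite key), which lets the shuffle itself do the sorting. A pure-Python sketch of the idea, not actual Spark API code:

```python
# Pure-Python sketch contrasting the two approaches (toy data, not Spark).

records = [("a", 3), ("a", 1), ("b", 2), ("a", 2), ("b", 1)]

# groupByKey-style: collect ALL values per key into memory, then sort each
# group. In Spark this means shuffling every value and holding whole groups
# as Python objects on one executor.
groups = {}
for k, v in records:
    groups.setdefault(k, []).append(v)
grouped_sorted = {k: sorted(vs) for k, vs in groups.items()}

# Secondary-sort-style: sort composite (key, value) pairs once. Values for
# each key then arrive already ordered, and no per-key list is materialized.
# In Spark, this sort happens inside the shuffle machinery.
stream = sorted((k, v) for k, v in records)

print(grouped_sorted)  # {'a': [1, 2, 3], 'b': [1, 2]}
print(stream)          # [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2)]
```

Whether this helps depends on how sortMapper iterates the groups, so treat it as a direction to test, not a guaranteed fix.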
sc.textFile and saveAsTextFile are both very slow.
Configuration as follows:
conf = SparkConf() \
    .set("spark.executor.memory", "24g") \
    .set("spark.driver.memory", "16g") \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") | N********n posts: 8363 | 2 It's "lightning fast" only when it's in-memory, otherwise there's
no magic here. | w********m posts: 1137 | 3 Did it shuffle?
【Quoting v*****r's original post】
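The shuffle question is pointed: the two 18-minute stages in the original post are exactly what a single shuffle boundary produces. Spark pipelines narrow transformations (map, filter) into one stage, while a wide one (groupByKey) forces a shuffle and starts a new stage. A toy model of that stage-splitting rule (names NARROW/WIDE and count_stages are illustrative, not a Spark API):

```python
# Toy model of Spark's stage splitting: narrow transformations pipeline
# together; each wide (shuffle) transformation starts a new stage.

NARROW = {"map", "filter"}
WIDE = {"groupByKey", "reduceByKey", "repartition", "sortByKey"}

def count_stages(ops):
    """Count stages for a linear chain of transformations."""
    stages = 1
    for op in ops:
        if op in WIDE:
            stages += 1  # shuffle boundary: new stage begins here
    return stages

# The pipeline from the original post: map -> filter -> groupByKey -> map -> filter
print(count_stages(["map", "filter", "groupByKey", "map", "filter"]))  # 2
```

In a real job you can confirm where the shuffle falls by looking at the stage DAG in the Spark web UI, or by printing rdd.toDebugString() before the action.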
| b********l posts: 84 | 4 No. Worker memory, 24g x 37, is four times the data size.
【Quoting w********m: Did it shuffle?】
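The memory arithmetic checks out, though "4x the raw data" can still be tight: in PySpark, grouped values live as Python objects, which commonly inflate to several times the on-disk text size, so spilling is still plausible. The claim itself:

```python
# Checking the memory claim above: 37 executors x 24 GB vs 200 GB input.
total_mem_gb = 37 * 24
print(total_mem_gb)               # 888
print(total_mem_gb / 200)         # 4.44, roughly 4x the uncompressed input
```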
| b********l posts: 84 | 5 Even doing disk reads and writes, it shouldn't be slower than Java MR, right?
【Quoting N********n: It's "lightning fast" only when it's in-memory.】