scala - How do you parallelize RDD / DataFrame creation in Spark?
Say I have a Spark job that looks like the following:

```scala
def loadTable1() {
  val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()
```

How do I parallelize this Spark job so that both tables are created at the same time?
You don't need to parallelize it. RDD/DataFrame creation operations don't do anything by themselves. These data structures are lazy, so the actual calculation only happens when you start using them. And when a Spark calculation does happen, it is automatically parallelized (partition by partition): Spark distributes the work across the executors. So you would not gain anything by introducing further parallelism.
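To make the laziness concrete, here is a sketch of what actually happens with one of the tables above. It assumes a running Spark 1.x application with `sqlContext` in scope (as in the question); the `SELECT COUNT(*)` query is an illustrative action, not part of the original code.

```scala
// Sketch only: assumes a Spark 1.x shell/application with `sqlContext` in scope.

// No full computation of the data happens here: building the DataFrame,
// marking it for caching, and registering the temp table are all cheap
// bookkeeping steps on the driver.
val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
table1.cache().registerTempTable("table1")

// The first action triggers the real work. Spark reads the partitions in
// parallel across the executors and, because of cache(), keeps the result
// in memory for subsequent queries.
sqlContext.sql("SELECT COUNT(*) FROM table1").show()
```

In other words, the parallelism you are looking for already exists inside each job: a single table load is spread over all executors, so loading the two tables one after the other still keeps the cluster busy.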