scala - How do you parallelize RDD / DataFrame creation in Spark?
Say I have a Spark job that looks like the following:

```scala
def loadTable1() {
  val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()
```

How do I parallelize this Spark job so that both tables are created at the same time?
You don't need to parallelize it. RDD/DataFrame creation operations don't do anything by themselves. These data structures are lazy, so the actual calculation only happens when you start using them. And when a Spark calculation does happen, it is automatically parallelized (partition by partition): Spark distributes the work across the executors. So you would not gain anything by introducing further parallelism.
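To make the laziness concrete, here is a sketch of what actually happens with one of the tables above. It assumes a running Spark 1.x application with `sqlContext` in scope (as in the question); the `SELECT COUNT(*)` query is an illustrative action, not part of the original code.

```scala
// Sketch only: assumes a Spark 1.x shell/application with `sqlContext` in scope.

// No full computation of the data happens here: building the DataFrame,
// marking it for caching, and registering the temp table are all cheap
// bookkeeping steps on the driver.
val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
table1.cache().registerTempTable("table1")

// The first action triggers the real work. Spark reads the partitions in
// parallel across the executors and, because of cache(), keeps the result
// in memory for subsequent queries.
sqlContext.sql("SELECT COUNT(*) FROM table1").show()
```

In other words, the parallelism you are looking for already exists inside each job: a single table load is spread over all executors, so loading the two tables one after the other still keeps the cluster busy.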