How do you parallelize RDD / DataFrame creation in Spark?


Say I have a Spark job that looks like the following:

def loadTable1() {
  val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()

How can I parallelize this Spark job so that both tables are created at the same time?

You don't need to parallelize it. The RDD/DataFrame creation operations don't do anything by themselves. These data structures are lazy, so the actual computation only happens when you start using them. And when a Spark computation does happen, it is automatically parallelized (partition by partition): Spark distributes the work across the executors. So you would not gain anything by introducing further parallelism.
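The lazy-evaluation point can be illustrated without a Spark cluster. The sketch below is a plain-Scala analogy (the `LazyDemo` object and its names are illustrative, not Spark API): like an RDD transformation, declaring the lazy value builds a "plan" but runs nothing; the work only happens when the result is first used, analogous to calling an action.

```scala
// Plain-Scala analogy for Spark's lazy evaluation: declaring the value
// does no work; forcing it (like a Spark action) triggers the computation.
object LazyDemo {
  var evaluated = false // flips to true only when `table` is computed

  // Analogous to an RDD/DataFrame transformation: defined, not executed
  lazy val table: Seq[Int] = {
    evaluated = true
    Seq(1, 2, 3).map(_ * 2)
  }

  def main(args: Array[String]): Unit = {
    println(s"before use: evaluated = $evaluated") // false: nothing ran yet
    val total = table.sum                          // first use forces evaluation
    println(s"total = $total")
    println(s"after use: evaluated = $evaluated")  // true: computed on demand
  }
}
```

In real Spark code the same thing happens with `jsonFile` and `cache()`: they only record what to do, and the data is actually read and cached when an action (e.g. a query against the registered table) runs.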


