Apache Pig - Pig optimization on GROUP BY

September 15, 2011

Let's assume I have a large data set with the schema layout below:

id,name,city
---------------
100,ajay,chennai
101,john,bangalore
102,zach,chennai
103,deep,bangalore
....
...

I have two styles of Pig code that give me the same output.

Style 1:

    records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
    records_grp = GROUP records BY city;
    records_each = FOREACH records_grp GENERATE group AS city, COUNT(records.id) AS emp_cnt;
    DUMP records_each;

Style 2:

    records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
    records_each = FOREACH (GROUP records BY city) GENERATE group AS city, COUNT(records.id) AS emp_cnt;
    DUMP records_each;

In the second style I nested the GROUP inside the FOREACH. Will the style 2 code run faster than the style 1 code or not?

I want to reduce the total time taken to complete the Pig job.

So does the style 2 code achieve that? Or is there no impact on the total time taken?

If someone confirms this, I can run similar code on the cluster with a large dataset.

The two solutions have the same cost.

However, if records_grp is not used elsewhere, version 2 lets you avoid declaring the variable, and the script is shorter.
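If you want to check this yourself rather than take it on trust, one option is to ask Pig for the execution plans of both forms and compare them. The sketch below is only an illustration: it reuses the path and schema from the question, the aliases style1 and style2 are names I made up, and I have not run it against your data. EXPLAIN prints the logical, physical and MapReduce plans for an alias.

```pig
-- Load once; path and schema are taken from the question.
records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',')
          AS (id:int, name:chararray, city:chararray);

-- Style 1: the grouping is held in a named alias.
records_grp = GROUP records BY city;
style1 = FOREACH records_grp GENERATE group AS city, COUNT(records.id) AS emp_cnt;

-- Style 2: the grouping is written inline inside the FOREACH.
-- (style1 and style2 are hypothetical alias names used only for this comparison.)
style2 = FOREACH (GROUP records BY city) GENERATE group AS city, COUNT(records.id) AS emp_cnt;

-- Print the plans for each alias so they can be compared side by side.
EXPLAIN style1;
EXPLAIN style2;
```

If both aliases show the same sequence of operators in the MapReduce plan, the choice between the two styles is purely about readability.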
