Apache Pig - Pig optimization on GROUP BY
Let's assume I have a large data set with the schema layout below:

id,name,city
---------------
100,ajay,chennai
101,john,bangalore
102,zach,chennai
103,deep,bangalore
...

I have two styles of Pig code that give me the same output.
Style 1:

    records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
    records_grp = GROUP records BY city;
    records_each = FOREACH records_grp GENERATE group AS city, COUNT(records.id) AS emp_cnt;
    DUMP records_each;

Style 2:

    records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
    records_each = FOREACH (GROUP records BY city) GENERATE group AS city, COUNT(records.id) AS emp_cnt;
    DUMP records_each;

In the second style the GROUP is nested inside the FOREACH. Will the style 2 code run faster than the style 1 code or not?
I want to reduce the total time taken to complete the Pig job.
Does the style 2 code achieve that, or is there no impact on the total time taken?
If someone can confirm this, I can run similar code on my cluster against a large dataset.
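Both solutions have the same cost. An alias is only a name for a relation, so nesting the GROUP inside the FOREACH does not change the work Pig has to do.
However, if records_grp is not used elsewhere, version 2 lets you skip declaring that variable, which makes the script shorter.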
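
One way to check this yourself is to compare the plans that Pig prints with EXPLAIN for each style. The following is only a minimal sketch, assuming the same input path and schema as in the question; the aliases cnt_alias and cnt_inline are made-up names for this illustration and do not appear in the original scripts.

```pig
records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',')
          AS (id:int, name:chararray, city:chararray);

-- Style 1: the GROUP is assigned to its own alias (records_grp).
records_grp = GROUP records BY city;
cnt_alias = FOREACH records_grp GENERATE group AS city, COUNT(records.id) AS emp_cnt;

-- Style 2: the GROUP is written inline inside the FOREACH.
cnt_inline = FOREACH (GROUP records BY city) GENERATE group AS city, COUNT(records.id) AS emp_cnt;

-- Print the execution plans for each alias so they can be compared side by side.
EXPLAIN cnt_alias;
EXPLAIN cnt_inline;
```

If both plans show the same operators and the same number of MapReduce jobs, the run time should not depend on which style you pick.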