hadoop - Does Pig have a shuffle function?


The question I have: is there a built-in function in Pig to shuffle a tuple/bag?

    raw_record = load '$inputpath'; -- using com.test.parser.testparser
    record_project = foreach raw_record generate
        field1,
        field2,
        field3,
        field4;

    sl_record = filter record_project by (field1 == '1' or field1 == '2');
    split sl_record into rec1 if field1 == '1', rec2 if field1 == '2';
    rec2sample = sample rec2 $samplingrate;

    finalrec1 = foreach rec1 generate
        -1,
        1,
        field1,
        field2,
        field3,
        field4;

    finalrec2 = foreach rec2 generate
        1,
        1,
        field1,
        field2,
        field3,
        field4;

    unionrec = union finalrec1, finalrec2;

    store unionrec into '$outputpath' using PigStorage(',');


In the above example the problem is the union: I see all of finalrec1 followed by finalrec2. I need them shuffled or mixed.

The approach I took to solve this:

    raw_record = load '$inputpath'; -- using com.test.parser.testparser
    record_project = foreach raw_record generate
        field1,
        field2,
        field3,
        field4;

    sl_record = filter record_project by (field1 == '1' or field1 == '2');
    split sl_record into rec1 if field1 == '1', rec2 if field1 == '2';
    rec2sample = sample rec2 $samplingrate;

    -- attach a random id to every record so the union can be ordered by it
    finalrec1 = foreach rec1 generate
        -1,
        1,
        field1,
        field2,
        field3,
        field4,
        (chararray)RANDOM() as id;

    finalrec2 = foreach rec2 generate
        1,
        1,
        field1,
        field2,
        field3,
        field4,
        (chararray)RANDOM() as id;

    unionrec = union finalrec1, finalrec2;
    mixedrec = order unionrec by id asc;

    store mixedrec into '$outputpath' using PigStorage(',');
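The idea behind this approach (tag each record with a random key, sort by the key, then ignore the key) can be sketched outside Pig as follows. This is only an illustration in plain Java, not part of the script above; the class name `RandomKeyShuffle` and the use of a caller-supplied `Random` are assumptions made for the example. Passing a seeded `Random` makes the result repeatable, which the Pig script above does not do.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, for illustration only.
final class RandomKeyShuffle {

    // Pairs a row with the random sort key attached to it.
    private static final class Keyed {
        final double key;
        final String row;

        Keyed(double key, String row) {
            this.key = key;
            this.row = row;
        }
    }

    // Tags every row with a random key, sorts by that key, and drops the key again.
    static List<String> shuffle(List<String> rows, Random rng) {
        List<Keyed> keyed = new ArrayList<>();
        for (String row : rows) {
            keyed.add(new Keyed(rng.nextDouble(), row));
        }
        keyed.sort(Comparator.comparingDouble(k -> k.key));

        List<String> shuffled = new ArrayList<>();
        for (Keyed k : keyed) {
            shuffled.add(k.row);
        }
        return shuffled;
    }
}
```

Calling `RandomKeyShuffle.shuffle(rows, new Random(42L))` twice returns the same order both times, while `new Random()` gives a different order per run, which is the situation the Pig script is in.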

This way I am able to mix them, but I'm unable to write a Pig unit test. Is there a way I can shuffle unionrec directly and still write a Pig unit test?

Test:

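    import org.apache.pig.pigunit.PigTest;
    import org.junit.Test;

    @Test
    public void myPigUnitTest() throws Exception {
        String[] inputs = new String[] {
            "inputpath=/src/test/resource/testfile.txt",
            "samplingrate=1",
            "outputpath=dummy"
        };
        // PigUnitUtil is a helper that is not shown in the post
        PigTest pigTest = PigUnitUtil.createPigTest("pathtomypigfile", inputs);
        String[] expectedUnion;     // expected tuples not shown
        String[] expectedMixedRec;  // expected tuples not shown
        pigTest.assertOutput("unionrec", expectedUnion);
        pigTest.assertOutput("mixedrec", expectedMixedRec);
    }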

Here the problem is that unionrec and mixedrec both contain the random number, so their values and their order are mixed up and do not match a fixed expected output.
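One possible way around a non-deterministic order in a test is to compare the rows as an unordered collection and to ignore the random id column. The sketch below is plain Java and makes several assumptions: rows are available as comma-separated strings, the random id is the last field, and the names `UnorderedRowCheck`, `dropLastField` and `sameRowsIgnoringId` are hypothetical. How the actual rows are fetched from the Pig run is not shown here.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class UnorderedRowCheck {

    // Removes the last comma-separated field (assumed to be the random id) from a row.
    static String dropLastField(String row) {
        int cut = row.lastIndexOf(',');
        return cut < 0 ? row : row.substring(0, cut);
    }

    // True when actual matches expected once the trailing id field is dropped
    // and row order is ignored. Duplicate rows are counted, not collapsed.
    static boolean sameRowsIgnoringId(String[] expectedWithoutId, String[] actualWithId) {
        String[] expected = expectedWithoutId.clone();
        String[] actual = new String[actualWithId.length];
        for (int i = 0; i < actualWithId.length; i++) {
            actual[i] = dropLastField(actualWithId[i]);
        }
        Arrays.sort(expected);
        Arrays.sort(actual);
        return Arrays.equals(expected, actual);
    }
}
```

A check like this confirms that the shuffled relation holds the right records, but it says nothing about how well they are mixed.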

I managed to think of a workaround myself:

    raw_record = load '$inputpath'; -- using com.test.parser.testparser
    record_project = foreach raw_record generate
        field1,
        field2,
        field3,
        field4;

    sl_record = filter record_project by (field1 == '1' or field1 == '2');
    split sl_record into rec1 if field1 == '1', rec2 if field1 == '2';
    rec2sample = sample rec2 $samplingrate;

    finalrec1 = foreach rec1 generate
        -1 as label1,
        1 as label2,
        field1 as label3,
        field2 as label4,
        field3 as label5,
        field4 as label6;

    finalrec2 = foreach rec2 generate
        -1 as label1,
        1 as label2,
        field1 as label3,
        field2 as label4,
        field3 as label5,
        field4 as label6;

    unionrec = union finalrec1, finalrec2;

    -- the random id is added only after the union, so unionrec itself stays deterministic
    unionrecwithid = foreach unionrec generate
        label1, label2, label3, label4, label5, label6,
        (chararray)RANDOM() as id;
    mixedrec = order unionrecwithid by id asc;

    store mixedrec into '$outputpath' using PigStorage(',');

Now I can verify unionrec in the unit test and check that it has the expected data.

