hadoop - Does Pig have a shuffle function? -
the question have is there in build function in pig shuffle tuple/bag ?
raw_record = load '$inputpath' -- using com.test.parser.testparser; record_project = foreach raw_record generate field1, field2, field3, field4; sl_record = filter record_project (field1=='1' or field1=='2'); split sl_record rec1 if field1=='1',rec2 if field1=='2'; rec2sample = sample rec2 $samplingrate; finalrec1 = foreach rec1 generate -1, 1, field1, field2, field3, field4; finalrec2 = foreach rec2 generate 1, 1, field1, field2, field3, field4; unionrec = union finalrec1, finalrec2; store unionrec '$outputpath' using pigstorage(',');
in above example problem union, see of finalrec1 followed finalrec2. need shuffled or mixed.
the approach took solve :
raw_record = load '$inputpath' -- using com.test.parser.testparser; record_project = foreach raw_record generate field1, field2, field3, field4; sl_record = filter record_project (field1=='1' or field1=='2'); split sl_record rec1 if field1=='1',rec2 if field1=='2'; rec2sample = sample rec2 $samplingrate; finalrec1 = foreach rec1 generate -1, 1, field1, field2, field3, field4, (chararray)random() id; finalrec2 = foreach rec2 generate 1, 1, field1, field2, field3, field4, (chararray)random() id; unionrec = union finalrec1, finalrec2; mixedrec = order unionrec id asc store mixedrec '$outputpath' using pigstorage(',');
this way able mix them i'm unable write pig unit test. there way can shuffle unionrec directly , write pig unit test?
test :
@test public void mypigunittest { string []inputs=new string[] { "inputpath=/src/test/resource/testfile.txt", "samplingrate=1", "outputpath=dummy" }; pigtest pigtest = pigunitutil.createpigtest("pathtomypigfile",inputs); string [] expectedunion; string [] expectedmixedrec; pigtest.assertoutput("unionrec",expectedunion); pigtest.assertoutput("mixedrec",expectedmixedrec); }
here problem unionrec , mixedrec have random number there order mixed messed up.
i managed think of work around myself :
raw_record = load '$inputpath' -- using com.test.parser.testparser; record_project = foreach raw_record generate field1, field2, field3, field4; sl_record = filter record_project (field1=='1' or field1=='2'); split sl_record rec1 if field1=='1',rec2 if field1=='2'; rec2sample = sample rec2 $samplingrate; finalrec1 = foreach rec1 generate -1 label1, 1 label2, field1 label3, field2 label4, field3 label5, field4 label6; finalrec2 = foreach rec2 generate -1 label1, 1 label2, field1 label3, field2 label4, field3 label5, field4 label6; unionrec = union finalrec1, finalrec2; unionrecwithid = foreach unionrec generate label1, label2, label3, label4, label5, label6,(chararray)random() id; mixedrec = order unionrecwithid id asc; store mixedrec '$outputpath' using pigstorage(',');
now verify unionrec if has data expected.
Comments
Post a Comment