Hadoop: How to read all the data of Common Crawl from AWS with Java?
I'm totally new to Hadoop and MapReduce programming, and I'm trying to write my first MapReduce program over the data of Common Crawl.
I would like to read the data of April 2015 from AWS. For example, if I want to download the April 2015 data on the command line, I do:
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz
This command line works, but I don't want to download all the data of April 2015. I want to read the "warc.wat.gz" files directly (in order to analyze the data).
I tried to create a job, like this:
public class FirstJob extends Configured implements Tool {

    private static final Logger LOG = Logger.getLogger(FirstJob.class);

    /**
     * Main entry point that uses the {@link ToolRunner} class to run the
     * Hadoop job.
     */
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FirstJob(), args);
        System.out.println("Done !!");
        System.exit(res);
    }

    /**
     * Builds and runs the Hadoop job.
     *
     * @return 0 if the Hadoop job completes successfully and 1 otherwise.
     */
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(FirstJob.class);
        job.setNumReduceTasks(1);

        // String inputPath = "data/*.warc.wat.gz";
        String inputPath = "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz";
        LOG.info("Input path: " + inputPath);
        FileInputFormat.addInputPath(job, new Path(inputPath));

        String outputPath = "/tmp/cc-firstjob/";
        FileSystem fs = FileSystem.newInstance(conf);
        if (fs.exists(new Path(outputPath))) {
            fs.delete(new Path(outputPath), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setInputFormatClass(WARCFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(FirstJobUrlTypeMap.ServerMapper.class);
        job.setReducerClass(LongSumReducer.class);

        if (job.waitForCompletion(true)) {
            return 0;
        } else {
            return 1;
        }
    }
}
But I got this error:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
How can I resolve this problem? Thanks in advance.
You can try this GitHub project.
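Beyond that, the exception itself names the two configuration properties Hadoop's s3n filesystem looks for. A minimal sketch of one way to supply them (the key strings below are placeholders, not real credentials) is to set them on the job's Configuration inside run(), before the Job is constructed:

```java
// Inside run(), before "Job job = new Job(conf);".
// These are exactly the properties named in the IllegalArgumentException;
// replace the placeholder values with your own AWS credentials.
Configuration conf = getConf();
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
```

Equivalently, the same two properties can be defined once in core-site.xml, or passed on the command line with -D flags, since the job already goes through ToolRunner.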