Hadoop: How to read all the data of Common Crawl from AWS with Java?
I'm totally new to Hadoop and MapReduce programming, and I'm trying to write my first MapReduce program over the data of Common Crawl.
I would like to read the data of April 2015 from AWS. For example, if I want to download the April 2015 data on the command line, I do:
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz
This command line works, but I don't want to download all the data of April 2015. I want to read the "warc.wat.gz" files directly (in order to analyze the data).
I tried to create a job, like this:
public class FirstJob extends Configured implements Tool {

    private static final Logger LOG = Logger.getLogger(FirstJob.class);

    /**
     * Main entry point that uses the {@link ToolRunner} class to run the
     * Hadoop job.
     */
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FirstJob(), args);
        System.out.println("Done !!");
        System.exit(res);
    }

    /**
     * Builds and runs the Hadoop job.
     *
     * @return 0 if the Hadoop job completes successfully and 1 otherwise.
     */
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(FirstJob.class);
        job.setNumReduceTasks(1);

        // String inputPath = "data/*.warc.wat.gz";
        String inputPath = "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246633512.41/wat/*.warc.wat.gz";
        LOG.info("Input path: " + inputPath);
        FileInputFormat.addInputPath(job, new Path(inputPath));

        String outputPath = "/tmp/cc-firstjob/";
        FileSystem fs = FileSystem.newInstance(conf);
        if (fs.exists(new Path(outputPath))) {
            fs.delete(new Path(outputPath), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setInputFormatClass(WARCFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(FirstJobUrlTypeMap.ServerMapper.class);
        job.setReducerClass(LongSumReducer.class);

        if (job.waitForCompletion(true)) {
            return 0;
        } else {
            return 1;
        }
    }
}
But I got this error:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
How can I resolve this problem? Thanks in advance.
You can try this GitHub project.
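Beyond that, the exception itself names the two configuration properties Hadoop's s3n filesystem looks for. A minimal sketch of one way to supply them (the key strings below are placeholders, not real credentials) is to set them on the job's Configuration inside run(), before the Job is constructed:

```java
// Inside run(), before "Job job = new Job(conf);".
// These are exactly the properties named in the IllegalArgumentException;
// replace the placeholder values with your own AWS credentials.
Configuration conf = getConf();
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
```

Equivalently, the same two properties can be defined once in core-site.xml, or passed on the command line with -D flags, since the job already goes through ToolRunner.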