Calling a MapReduce job from a simple Java program

I have been trying to call a MapReduce job from a simple Java program in the same package. I referenced the MapReduce jar file in my Java program and invoked it with the RunJar(String args[]) method, passing the input and output paths of the MapReduce job, but the program did not work.


How do I run such a program, where I just use a main method that passes the input, output and jar paths? Is it possible to run a MapReduce job (jar) through it? I want to do this because I want to run several MapReduce jobs one after another, where my Java program will call each such job by referring to its jar file. If this is possible, I might as well use a simple servlet to make such calls and reference their output files for graphing purposes.


/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author root
 */
import org.apache.hadoop.util.RunJar;

import java.util.ArrayList;
import java.util.List;

public class callOther {

    public static void main(String args[]) throws Throwable {
        List<String> arg = new ArrayList<>();

        String output = "/root/Desktop/output";

        // first argument is the job jar, followed by the input and output paths
        arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");
        arg.add("/root/Desktop/input");
        arg.add(output);

        RunJar.main(arg.toArray(new String[0]));
    }
}


手掌心

854 views · 3 Answers

慕运维8079593

Oh, please don't use RunJar, the Java API is very good. Here is how to start a job from normal code:

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this is setting the format of your input, can be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes possible output paths to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);
// this waits until the job completes and prints debug out to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);

If you are using an external cluster, you have to put the following information into your configuration:

// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");

This should be no problem when hadoop-core.jar is on your application container's classpath. But I think you should put some kind of progress indicator on your web page, because completing a Hadoop job can take anywhere from minutes to hours ;)

For YARN (> Hadoop 2), the following configuration needs to be set instead:

// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
// framework is now "yarn", should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
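The question also asks about running several jobs one after another. Since waitForCompletion(true) blocks until a job finishes, a driver can simply submit jobs sequentially and feed one job's output directory into the next. A minimal sketch, assuming Hadoop 2.x on the classpath; the identity Mapper/Reducer, job names, and paths below are placeholders, not the asker's actual classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "/intermediate");
        Path output = new Path(args[1] + "/final");

        // First job: reads the raw input, writes to an intermediate directory.
        Job first = Job.getInstance(conf, "first-job");
        first.setJarByClass(ChainDriver.class);
        first.setMapperClass(Mapper.class);    // identity mapper as a stand-in
        first.setReducerClass(Reducer.class);  // identity reducer as a stand-in
        // default TextInputFormat feeds (LongWritable offset, Text line) pairs,
        // which the identity classes pass straight through
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);

        // waitForCompletion blocks, so the jobs run strictly one after another;
        // stop the chain as soon as a job fails.
        if (!first.waitForCompletion(true)) System.exit(1);

        // Second job: consumes the first job's output directory.
        Job second = Job.getInstance(conf, "second-job");
        second.setJarByClass(ChainDriver.class);
        second.setMapperClass(Mapper.class);
        second.setReducerClass(Reducer.class);
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);

        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same pattern extends to any number of jobs; for a web frontend you would run this driver off the request thread and poll Job.getStatus() for progress.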

尚方宝剑之说

Because map and reduce run on different machines, all your referenced classes and jars must move from machine to machine. If you have a packaged jar and run it on your desktop, @ThomasJungblut's answer is fine. But if you run inside Eclipse by right-clicking your class and hitting run, it doesn't work.

Instead of:

job.setJarByClass(Mapper.class);

use:

job.setJar("build/libs/hdfs-javac-1.0.jar");

At the same time, your jar's manifest must include a Main-Class attribute, which is your main class. Gradle users can put these lines in build.gradle:

jar {
    manifest {
        attributes("Main-Class": mainClassName)
    }
}
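The Main-Class attribute mentioned above lives in the jar's META-INF/MANIFEST.MF. If you want to check programmatically what ends up there, the JDK's java.util.jar classes can write and read it; a small stdlib-only sketch (the class name com.example.WordCount is made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestDemo {
    // Build a manifest with a Main-Class entry, serialize it to bytes,
    // then parse it back and return the Main-Class value it recorded.
    public static String roundTripMainClass(String mainClass) throws Exception {
        Manifest m = new Manifest();
        // Manifest-Version is mandatory; without it write() rejects the manifest
        m.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        m.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        m.write(out);

        Manifest parsed = new Manifest(new ByteArrayInputStream(out.toByteArray()));
        return parsed.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTripMainClass("com.example.WordCount"));
    }
}
```

For an already-built jar you would instead open it with new JarFile(path) and call getManifest() to inspect the same attribute.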
