(1)启动hadoop
hadoop@master:~$ start-all.sh
(2)成功编译mahout源码
hadoop@master:~$ cd $MAHOUT_HOME
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ mvn install -Dmaven.test.skip=true
(1)生成input的数据
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-data/20news-bydate/20news-bydate-train -o /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-result/bayes-train-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
运行结果:
Running on hadoop, using HADOOP_HOME=/home/hadoop/cloud/hadoop-1.0.4
HADOOP_CONF_DIR=/home/hadoop/cloud/confDir/hadoop/conf
Warning: $HADOOP_HOME is deprecated.
13/08/09 14:07:09 WARN driver.MahoutDriver: No org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.props found on classpath, will use command-line arguments only
13/08/09 14:07:14 INFO driver.MahoutDriver: Program took 5202 ms
(2)生成test的数据
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-data/20news-bydate/20news-bydate-test -o /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-result/bayes-test-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
运行结果:
Running on hadoop, using HADOOP_HOME=/home/hadoop/cloud/hadoop-1.0.4
HADOOP_CONF_DIR=/home/hadoop/cloud/confDir/hadoop/conf
Warning: $HADOOP_HOME is deprecated.
13/08/09 14:13:35 WARN driver.MahoutDriver: No org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.props found on classpath, will use command-line arguments only
13/08/09 14:13:38 INFO driver.MahoutDriver: Program took 3428 ms
(3)将训练文本集上传到HDFS上
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ hadoop dfs -put /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-result/bayes-train-input/ bayes-train-input
(4)模型训练:依据训练文本集来训练贝叶斯分类器模型
解释一下命令:-i:表示训练集的输入路径,HDFS路径; -o:分类模型输出路径; -type:分类器类型,这里使用bayes,可选cbayes;
-ng:(n-gram)建模的大小,默认为1; -source:数据源的位置,HDFS或HBase
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ mahout trainclassifier -i bayes-train-input -o bayes-newsmodel -type bayes -ng 1 -source hdfs
(5)将测试文本集上传到HDFS上
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ hadoop dfs -put /home/hadoop/mahout-0.5-src/mahout-distribution-0.5/my-test-result/bayes-test-input/ bayes-test-input
(6)模型测试:依据训练的贝叶斯分类器模型来进行分类测试
hadoop@master:~/mahout-0.5-src/mahout-distribution-0.5$ mahout testclassifier -m bayes-newsmodel -d bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
运行结果(与apache官网里面的一致):
13/08/09 14:56:34 INFO bayes.BayesClassifierDriver: =======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t u <--Classified as
381 0 0 0 0 9 1 0 0 0 1 0 2 0 0 1 0 0 3 0 0 | 398 a = rec.motorcycles
1 284 0 0 0 0 1 0 6 3 11 0 3 66 0 1 6 0 4 9 0 | 395 b = comp.windows.x
2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 0 | 376 c = talk.politics.mideast
4 0 1 327 0 2 2 0 0 2 1 1 5 0 1 4 12 0 2 0 0 | 364 d = talk.politics.guns
7 0 4 32 27 7 7 2 0 12 0 0 0 6 100 9 7 31 0 0 0 | 251 e = talk.religion.misc
10 0 0 0 0 359 2 2 0 1 3 0 6 1 0 1 0 0 11 0 0 | 396 f = rec.autos
0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 0 3 0 0 | 397 g = rec.sport.baseball
1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 0 | 399 h = rec.sport.hockey
2 0 0 0 0 4 3 0 330 4 4 0 12 5 0 0 2 0 12 7 0 | 385 i = comp.sys.mac.hardware
0 3 0 0 0 0 1 0 0 368 0 0 4 10 1 3 2 0 2 0 0 | 394 j = sci.space
0 0 0 0 0 3 1 0 27 2 291 0 25 11 0 0 1 0 13 18 0 | 392 k = comp.sys.ibm.pc.hardware
8 0 1 109 0 6 11 4 1 18 0 98 3 1 11 10 27 1 1 0 0 | 310 l = talk.politics.misc
6 0 1 0 0 4 2 0 5 2 12 0 321 8 0 4 14 0 8 6 0 | 393 m = sci.electronics
0 11 0 0 0 3 6 0 10 7 11 0 13 298 0 2 13 0 7 8 0 | 389 n = comp.graphics
2 0 0 0 0 0 4 1 0 3 1 0 1 3 372 6 0 2 1 2 0 | 398 o = soc.religion.christian
4 0 0 1 0 2 3 3 0 4 2 0 12 7 6 342 1 0 9 0 0 | 396 p = sci.med
0 1 0 1 0 1 4 0 3 0 1 0 4 8 0 2 369 0 1 1 0 | 396 q = sci.crypt
10 0 4 10 1 5 6 2 2 6 2 0 1 2 86 15 14 152 0 1 0 | 319 r = alt.atheism
4 0 0 0 0 9 1 1 8 1 12 0 6 3 0 2 0 0 341 2 0 | 390 s = misc.forsale
8 5 0 0 0 1 6 0 8 5 50 0 2 39 1 0 9 0 3 257 0 | 394 t = comp.os.ms-windows.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 u = unknown
Default Category: unknown: 20
13/08/09 14:56:34 INFO driver.MahoutDriver: Program took 118128 ms
这一篇是转的,上一篇是自己写的,但是自己写的中间过程不完整,所以又转了一篇。