Preface

In the previous article we set up Hadoop in local (standalone) mode.

Continuing from there, this article sets up pseudo-distributed mode.

The JAVA_HOME and HADOOP_HOME environment variables have already been configured.

First, create a temporary directory:

mkdir ${HADOOP_HOME}/tmp

core-site.xml

vim ${HADOOP_HOME}/etc/hadoop/core-site.xml

Add the following configuration under the <configuration> node:

<configuration>
    <!-- The filesystem schema (URI) Hadoop uses; the address of the HDFS NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <!-- Directory where Hadoop stores temporary files at runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/bigdata/hadoop-2.7.7/tmp</value>
    </property>
</configuration>
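All of Hadoop's *-site.xml files share this same `<property>`/`<name>`/`<value>` layout. As a quick illustration (this helper is not part of Hadoop), the sketch below parses such a file with Python's standard library and extracts the settings as a dictionary:

```python
import xml.etree.ElementTree as ET

# The core-site.xml from above, inlined as a string for the demo.
CORE_SITE = """\
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/bigdata/hadoop-2.7.7/tmp</value>
  </property>
</configuration>"""

def parse_hadoop_site(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

conf = parse_hadoop_site(CORE_SITE)
print(conf["fs.defaultFS"])  # hdfs://localhost:9000
```

The same helper works unchanged on the hdfs-site.xml, mapred-site.xml, and yarn-site.xml fragments below.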

hdfs-site.xml

vim ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml

Add the following configuration under the <configuration> node:

<configuration>
    <!-- Number of HDFS block replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- HTTP address of the SecondaryNameNode (dfs.secondary.http.address is the deprecated 1.x name) -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>localhost:50090</value>
    </property>
</configuration>

mapred-site.xml

Copy mapred-site.xml.template to mapred-site.xml:

cp ${HADOOP_HOME}/etc/hadoop/mapred-site.xml.template ${HADOOP_HOME}/etc/hadoop/mapred-site.xml
vim ${HADOOP_HOME}/etc/hadoop/mapred-site.xml

Add the following configuration under the <configuration> node:

<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

vim ${HADOOP_HOME}/etc/hadoop/yarn-site.xml

Add the following configuration under the <configuration> node:

<configuration>
    <!-- Address of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
    <!-- How reducers fetch data (the auxiliary shuffle service) -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Format the NameNode

hdfs namenode -format

If formatting succeeds, the following files are generated in the ${HADOOP_HOME}/tmp/dfs/name/current/ directory:

fsimage_0000000000000000000
fsimage_0000000000000000000.md5
seen_txid
VERSION
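As a quick sanity check, you can verify that all four metadata files are present. The sketch below is a hypothetical helper (not part of Hadoop) demonstrated against a stand-in directory with the same file names:

```python
import os
import tempfile

# File names a freshly formatted NameNode directory is expected to contain.
EXPECTED = {
    "fsimage_0000000000000000000",
    "fsimage_0000000000000000000.md5",
    "seen_txid",
    "VERSION",
}

def looks_formatted(current_dir):
    """Return True if every expected metadata file exists in current_dir."""
    return EXPECTED.issubset(os.listdir(current_dir))

# Demo against a temporary stand-in directory, not a real NameNode dir.
with tempfile.TemporaryDirectory() as d:
    for name in EXPECTED:
        open(os.path.join(d, name), "w").close()
    print(looks_formatted(d))  # True
```

On a real installation you would point `looks_formatted` at `${HADOOP_HOME}/tmp/dfs/name/current/` instead.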

Start the services

Change into the ${HADOOP_HOME} directory:

cd ${HADOOP_HOME}

Start the NameNode

sbin/hadoop-daemon.sh start namenode

Start the DataNode

sbin/hadoop-daemon.sh start datanode

启动 SecondaryNameNode

sbin/hadoop-daemon.sh start secondarynamenode

Start the ResourceManager

1
sbin/yarn-daemon.sh start resourcemanager

Start the NodeManager

sbin/yarn-daemon.sh start nodemanager

After everything has started, use the jps command to check the Java processes:

48295 Jps
47776 NameNode
47860 DataNode
48084 NodeManager
47945 SecondaryNameNode
48219 ResourceManager
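A quick way to confirm all five daemons are up is to compare the jps output against the expected process names. The sketch below (a helper of my own, not a Hadoop tool) parses output in the format shown above:

```python
# Parse `jps` output and check that every daemon needed for
# pseudo-distributed mode is running.
JPS_OUTPUT = """\
48295 Jps
47776 NameNode
47860 DataNode
48084 NodeManager
47945 SecondaryNameNode
48219 ResourceManager
"""

REQUIRED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

def running_daemons(jps_text):
    """Map process name -> pid from jps output."""
    procs = {}
    for line in jps_text.strip().splitlines():
        pid, name = line.split(None, 1)
        procs[name] = int(pid)
    return procs

procs = running_daemons(JPS_OUTPUT)
missing = REQUIRED - procs.keys()
print(sorted(missing))  # [] -> everything is up
```

In practice you would feed it live output, e.g. `running_daemons(subprocess.check_output(["jps"], text=True))`.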

An alternative way to start the DFS and YARN services

Start DFS, which includes the NameNode, DataNode, and SecondaryNameNode services:

sbin/start-dfs.sh

Start YARN, which includes the ResourceManager and NodeManager:

sbin/start-yarn.sh

Start all services (deprecated):

sbin/start-all.sh

Once configuration is complete, visit the YARN web UI at http://ip:8088/.

Run a MapReduce Job

Use the HDFS shell to upload the words file to HDFS.

First, create the input directory:

hdfs dfs -mkdir -p /input

Upload the words file to the input directory:

hdfs dfs -put words /input

Run the WordCount MapReduce job:

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
12/01/21 22:27:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1611223975969_0001
12/01/21 22:27:35 INFO impl.YarnClientImpl: Submitted application application_1611223975969_0001
12/01/21 22:27:36 INFO mapreduce.Job: The url to track the job: http://master1:8088/proxy/application_1611223975969_0001/
12/01/21 22:27:36 INFO mapreduce.Job: Running job: job_1611223975969_0001
12/01/21 22:28:07 INFO mapreduce.Job: Job job_1611223975969_0001 running in uber mode : false
12/01/21 22:28:07 INFO mapreduce.Job: map 0% reduce 0%
12/01/21 22:28:27 INFO mapreduce.Job: map 100% reduce 0%
12/01/21 22:28:34 INFO mapreduce.Job: map 100% reduce 100%
12/01/21 22:28:35 INFO mapreduce.Job: Job job_1611223975969_0001 completed successfully
12/01/21 22:28:35 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=123
        FILE: Number of bytes written=245681
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=179
        HDFS: Number of bytes written=77
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=18107
        Total time spent by all reduces in occupied slots (ms)=4454
        Total time spent by all map tasks (ms)=18107
        Total time spent by all reduce tasks (ms)=4454
        Total vcore-milliseconds taken by all map tasks=18107
        Total vcore-milliseconds taken by all reduce tasks=4454
        Total megabyte-milliseconds taken by all map tasks=18541568
        Total megabyte-milliseconds taken by all reduce tasks=4560896
    Map-Reduce Framework
        Map input records=4
        Map output records=14
        Map output bytes=137
        Map output materialized bytes=123
        Input split bytes=98
        Combine input records=14
        Combine output records=10
        Reduce input groups=10
        Reduce shuffle bytes=123
        Reduce input records=10
        Reduce output records=10
        Spilled Records=20
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=205
        CPU time spent (ms)=8210
        Physical memory (bytes) snapshot=324100096
        Virtual memory (bytes) snapshot=3861331968
        Total committed heap usage (bytes)=170004480
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=81
    File Output Format Counters
        Bytes Written=77

List the output directory:

hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 bennie supergroup 0 2021-01-12 22:28 /output/_SUCCESS
-rw-r--r-- 1 bennie supergroup 77 2021-01-12 22:28 /output/part-r-00000
  • The output directory contains two files. _SUCCESS is an empty marker file; its presence indicates the job completed successfully.
  • part-r-00000 is the result file; the -r- in its name indicates it was produced by the Reduce phase.

View the contents of the output file:

hdfs dfs -cat /output/part-r-00000
hadoop  2
hbase 1
hive 2
java 2
php 1
python 1
scala 2
spark 3
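Conceptually, WordCount is a map → shuffle → reduce pipeline: the mapper emits (word, 1) pairs, the shuffle groups pairs by word, and the reducer sums each group. The pure-Python sketch below mirrors that flow on a made-up sample (not the actual words file used above):

```python
from collections import defaultdict

# Made-up input lines; NOT the words file from the job above.
lines = ["spark hadoop hive", "spark java", "spark hadoop"]

# Map phase: emit (word, 1) for every word on every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key. In the real job, YARN's
# mapreduce_shuffle service moves data between map and reduce tasks.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum each group's values.
counts = {word: sum(ones) for word, ones in sorted(groups.items())}
print(counts)  # {'hadoop': 2, 'hive': 1, 'java': 1, 'spark': 3}
```

The sorted, tab-separated lines in part-r-00000 correspond to this final dictionary, one (word, count) pair per line.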

Stop the Hadoop services

# Stop the NameNode
sbin/hadoop-daemon.sh stop namenode

# Stop the DataNode
sbin/hadoop-daemon.sh stop datanode

# Stop the SecondaryNameNode
sbin/hadoop-daemon.sh stop secondarynamenode

# Stop the ResourceManager
sbin/yarn-daemon.sh stop resourcemanager

# Stop the NodeManager
sbin/yarn-daemon.sh stop nodemanager

Other ways to stop the Hadoop services

# Stop the DFS services
sbin/stop-dfs.sh

# Stop the YARN services
sbin/stop-yarn.sh

# Stop all services (deprecated)
sbin/stop-all.sh

Notes

Start the JobHistory server

sbin/mr-jobhistory-daemon.sh start historyserver

You can then open http://ip:19888 to view the JobHistory page.

Enable log aggregation

Hadoop does not enable log aggregation by default; it can be enabled in the yarn-site.xml file.

  • Log aggregation configuration
<property>
    <!-- Enable log aggregation -->
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <!-- Log retention time, in seconds -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>106800</value>
</property>
  • Restart the YARN processes
# Stop the YARN services
sbin/stop-yarn.sh

# Start the YARN services
sbin/start-yarn.sh
  • Restart the HistoryServer process
# Stop the HistoryServer
sbin/mr-jobhistory-daemon.sh stop historyserver

# Start the HistoryServer
sbin/mr-jobhistory-daemon.sh start historyserver
  • Test logging: run a MapReduce demo to generate logs
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output2
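One detail worth double-checking in the log aggregation settings above: yarn.log-aggregation.retain-seconds is specified in seconds, so the configured value keeps logs for only a little over a day:

```python
# The retain-seconds value set in yarn-site.xml above, converted to days.
retain_seconds = 106800
seconds_per_day = 24 * 60 * 60  # 86400
days = retain_seconds / seconds_per_day
print(round(days, 2))  # 1.24
```

If roughly a week of retention was intended, the value would need to be 604800 (7 × 86400) instead.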
