1 Environment Preparation
Four servers were prepared for this Hadoop setup:
OS: CentOS 7
IP addresses:
192.168.10.131 master
192.168.10.160 slave1
192.168.10.161 slave2
192.168.10.162 slave3
Java version: Java 1.8
Hadoop version: Hadoop 2.6.5
Disable the firewall and SELinux on CentOS 7; for the details, see "Disabling the Firewall and SELinux on CentOS 7".
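For reference, the firewall and SELinux can be disabled with commands along these lines (the sed edit makes the SELinux change survive reboots):
systemctl stop firewalld
systemctl disable firewalld
setenforce 0    # immediate, lasts until reboot
sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config    # persistent across reboots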
2 Installing Required Software and Configuring the Environment
2.1 Install the Java Environment (on master and all three slaves)
There are plenty of Java installation tutorials online, so the steps are not repeated here.
[root@master local]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
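For completeness, a tarball-based install might look like the following sketch; the archive name and install path are assumptions, so adjust them to what you actually downloaded:
mkdir -p /usr/java
tar -xzvf jdk-8u181-linux-x64.tar.gz -C /usr/java/    # archive name is an assumption
# then add these two lines to /etc/profile and reload it:
export JAVA_HOME=/usr/java/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile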
2.2 Change the Hostnames
Edit the hosts file on each of the four servers and add the following entries:
192.168.10.131 master
192.168.10.160 slave1
192.168.10.161 slave2
192.168.10.162 slave3
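These entries can be appended on each machine in one step with a heredoc:
cat >> /etc/hosts <<EOF
192.168.10.131 master
192.168.10.160 slave1
192.168.10.161 slave2
192.168.10.162 slave3
EOF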
Then change the hostname on each of the four servers. For example, to change master's hostname:
vi /etc/sysconfig/network
# Created by anaconda
HOSTNAME=master    # use the matching name on each server
Reboot the server for the change to take effect.
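On CentOS 7 specifically, hostnamectl is the standard tool and takes effect without a reboot (systemd no longer reads HOSTNAME from /etc/sysconfig/network):
hostnamectl set-hostname master    # run the matching command on each server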
2.3 SSH Configuration
The NameNode on master needs passwordless SSH access to the three slaves. Whether the slaves also need passwordless access back to master I cannot say for certain, since I have not yet studied Hadoop's internals. So I distributed master's public key to the three slaves, and also uploaded each slave's public key to master. Taking master as the example, the SSH setup goes as follows.
First, generate a key pair:
ssh-keygen -t rsa    # accept all the defaults
Then append the public key to the authorized_keys file to enable passwordless login to the local machine:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Then try logging in to the local machine to verify that it works:
ssh master
Distribute the public key to the three slaves:
ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2
ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave3
I also generated key pairs on each slave and appended their public keys to master's authorized_keys file (whether this step is necessary I will revisit once I have studied Hadoop more deeply).
At this point, master can log in to the slave machines without a password.
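A quick loop verifies passwordless access from master to every slave; it should print each hostname without prompting for a password:
for h in slave1 slave2 slave3; do ssh root@$h hostname; done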
3 Hadoop Installation and Configuration
3.1 Create the Directories
On master, the name, data, and tmp directories hold the HDFS NameNode files, data blocks, and temporary files respectively:
mkdir -p /data/hdfs
cd /data/hdfs
mkdir name
mkdir data
mkdir tmp
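Equivalently, bash brace expansion creates all three in one command:
mkdir -p /data/hdfs/{name,data,tmp}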
Then copy the directory tree to the three slaves (first create the /data directory on each slave with mkdir /data):
scp -r hdfs/ root@slave1:/data
scp -r hdfs/ root@slave2:/data
scp -r hdfs/ root@slave3:/data
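With the passwordless SSH set up earlier, a single loop covers both the directory creation and the copy for all three slaves:
for h in slave1 slave2 slave3; do
  ssh root@$h "mkdir -p /data"
  scp -r /data/hdfs root@$h:/data
done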
3.2 Unpack Hadoop
tar -xzvf hadoop-2.6.5.tar.gz
mv hadoop-2.6.5 /usr/local
3.3 Configure Hadoop
Note: to make the steps easier to follow, some of the commands below use absolute paths; in practice you can cd into the relevant directory first.
Hadoop's configuration files live in the etc/hadoop/ directory under the installation directory, here /usr/local/hadoop-2.6.5/etc/hadoop/.
Configure core-site.xml
vi /usr/local/hadoop-2.6.5/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/data/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <!-- fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, but the old key still works -->
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
</configuration>
Configure hdfs-site.xml
vi /usr/local/hadoop-2.6.5/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <!-- Number of replicas kept for each data block; the default is 3 -->
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/data/hdfs/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/data/hdfs/data</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <!-- Disables HDFS permission checking; convenient for testing, not recommended in production -->
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
Configure yarn-site.xml
vi /usr/local/hadoop-2.6.5/etc/hadoop/yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:18040</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:18030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:18088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:18025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:18141</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
Configure mapred-site.xml
This file does not exist by default, but a template does; generate the file from the template:
cp /usr/local/hadoop-2.6.5/etc/hadoop/mapred-site.xml.template /usr/local/hadoop-2.6.5/etc/hadoop/mapred-site.xml
vi /usr/local/hadoop-2.6.5/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Configure slaves
vi /usr/local/hadoop-2.6.5/etc/hadoop/slaves
Delete any existing content and add the following:
slave1
slave2
slave3
Configure hadoop-env.sh
vi /usr/local/hadoop-2.6.5/etc/hadoop/hadoop-env.sh
Find JAVA_HOME and change it to your own Java path, for example:
export JAVA_HOME=/usr/java/jdk1.8.0_181
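The edit can also be made non-interactively; the JDK path here is the example above:
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.8.0_181|' /usr/local/hadoop-2.6.5/etc/hadoop/hadoop-env.sh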
3.4 Copy the Hadoop Installation Directory to the Three Slaves
scp -r /usr/local/hadoop-2.6.5/ root@slave1:/usr/local
scp -r /usr/local/hadoop-2.6.5/ root@slave2:/usr/local
scp -r /usr/local/hadoop-2.6.5/ root@slave3:/usr/local
3.5 Configure HADOOP_HOME on master
vi /etc/profile
export HADOOP_HOME=/usr/local/hadoop-2.6.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
This lets you run Hadoop commands from anywhere on master; the executable scripts live under the /usr/local/hadoop-2.6.5/bin and sbin directories.
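Apply the change in the current shell and confirm that the command resolves:
source /etc/profile
hadoop version    # should report Hadoop 2.6.5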
4 Start the Cluster
4.1 Format the HDFS Filesystem
hadoop namenode -format    # run once before the first start; never run it again afterwards
(In Hadoop 2.x this spelling is deprecated in favor of hdfs namenode -format; both still work.)
4.2 Start the Cluster
Earlier, following other tutorials, I started each service with its own script, but ran into services failing to start on some slaves, so I now use start-all.sh to start everything, even though the command reports that it is deprecated.
start-all.sh
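The non-deprecated equivalent is to start HDFS and YARN with their own scripts:
start-dfs.sh
start-yarn.sh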
Check what is running with jps.
On master:
[root@master hadoop]# jps
1313 Bootstrap
8057 NodeManager
8889 SecondaryNameNode
9049 ResourceManager
8714 NameNode
1755 core
10829 Jps
On a slave (the Bootstrap and core entries in these listings are unrelated Java processes on the machines, not Hadoop daemons):
[root@slave2 hadoop]# jps
10436 Jps
1720 core
8664 DataNode
8761 NodeManager
1198 Bootstrap
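To confirm that all three DataNodes have registered with the NameNode, you can also run:
hdfs dfsadmin -report    # lists live DataNodes and their capacity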
4.3 Restart the Hadoop Cluster
stop-all.sh
start-all.sh
4.4 HDFS Web UI
Visit http://192.168.10.131:50070
4.5 YARN Web UI
Visit http://192.168.10.131:18088 (the yarn.resourcemanager.webapp.address port configured above)
5 Run a Sample Job
Run a MapReduce job that estimates the value of pi:
[root@master hadoop]# hadoop jar /usr/local/hadoop-2.6.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 5 10
Number of Maps = 5
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
18/11/18 07:17:38 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.10.131:18040
18/11/18 07:17:43 INFO input.FileInputFormat: Total input paths to process : 5
18/11/18 07:17:44 INFO mapreduce.JobSubmitter: number of splits:5
18/11/18 07:17:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542542761242_0002
18/11/18 07:17:47 INFO impl.YarnClientImpl: Submitted application application_1542542761242_0002
18/11/18 07:17:47 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1542542761242_0002/
18/11/18 07:17:47 INFO mapreduce.Job: Running job: job_1542542761242_0002
18/11/18 07:18:13 INFO mapreduce.Job: Job job_1542542761242_0002 running in uber mode : false
18/11/18 07:18:13 INFO mapreduce.Job: map 0% reduce 0%
18/11/18 07:20:41 INFO mapreduce.Job: map 60% reduce 0%
18/11/18 07:20:42 INFO mapreduce.Job: map 100% reduce 0%
18/11/18 07:21:13 INFO mapreduce.Job: map 100% reduce 100%
18/11/18 07:21:23 INFO mapreduce.Job: Job job_1542542761242_0002 completed successfully
18/11/18 07:21:24 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=116
FILE: Number of bytes written=647247
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1305
HDFS: Number of bytes written=215
HDFS: Number of read operations=23
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=5
Launched reduce tasks=1
Data-local map tasks=5
Total time spent by all maps in occupied slots (ms)=695859
Total time spent by all reduces in occupied slots (ms)=23590
Total time spent by all map tasks (ms)=695859
Total time spent by all reduce tasks (ms)=23590
Total vcore-milliseconds taken by all map tasks=695859
Total vcore-milliseconds taken by all reduce tasks=23590
Total megabyte-milliseconds taken by all map tasks=712559616
Total megabyte-milliseconds taken by all reduce tasks=24156160
Map-Reduce Framework
Map input records=5
Map output records=10
Map output bytes=90
Map output materialized bytes=140
Input split bytes=715
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=140
Reduce input records=10
Reduce output records=0
Spilled Records=20
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=6835
CPU time spent (ms)=47290
Physical memory (bytes) snapshot=362029056
Virtual memory (bytes) snapshot=12450140160
Total committed heap usage (bytes)=625672192
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=590
File Output Format Counters
Bytes Written=97
Job Finished in 225.822 seconds
Estimated value of Pi is 3.28000000000000000000
[root@master hadoop]#