
Production Deployment of a Big Data Platform Based on Apache Hadoop (Vanilla Apache Components)

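This article follows the component deployment and configuration conventions of the AWS EMR and Aliyun EMR services. It is intended for manually deploying small-to-medium private-cloud production clusters and for training operations staff. In addition: a. for large-scale deployments, see "Ansible-based EMR production cluster deployment on hosts"; b. for Kubernetes-based deployments, see "Kubernetes-based EMR production cluster deployment".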

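Note: change the component versions used below if you need to, but make sure the versions stay mutually compatible. Even a minor version change can break calls between components.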

1. Deployment overview

1.1 Resource directory

1.2 Deployment topology

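To complete a fully distributed deployment of all components with the simplest setup and the fewest servers, the 4 hosts below (physical or virtual machines) are used. When deploying in a real data center, adjust (split/isolate) as needed. For example, a Kafka cluster is usually not deployed on the EMR machines, and its ZooKeeper ensemble can also be physically isolated from EMR's (that is, two separate ZooKeeper clusters).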

Hostname: emr-header-1    IP: 10.0.0.161    OS: CentOS7.9, 8C 32G 500G SSD
  Packages: open-jdk1.8 / zookeeper-3.4.10 / hadoop-2.7.2 / hbase-1.2.5 / phoenix-4.14.1 / spark-2.3.1-bin-hadoop2.7 / kafka_2.11-1.0.1
  Processes (jps): QuorumPeerMain / NameNode / JournalNode / DFSZKFailoverController / HistoryServer / WebAppProxyServer / ApplicationHistoryServer / JobHistoryServer / Bootstrap(hdfs httpfs) / ResourceManager / HMaster / ThriftServer / Kafka

Hostname: emr-header-2    IP: 10.0.0.162    OS: CentOS7.9, 8C 32G 500G SSD
  Packages: open-jdk1.8 / zookeeper-3.4.10 / hadoop-2.7.2 / hbase-1.2.5 / phoenix-4.14.1 / spark-2.3.1-bin-hadoop2.7 / kafka_2.11-1.0.1
  Processes (jps): QuorumPeerMain / NameNode / JournalNode / DFSZKFailoverController / HistoryServer / WebAppProxyServer / ApplicationHistoryServer / JobHistoryServer / Bootstrap(hdfs httpfs) / ResourceManager / HMaster / ThriftServer / Kafka

Hostname: emr-worker-1    IP: 10.0.0.163    OS: CentOS7.9, 4C 32G 4T SSD
  Packages: open-jdk1.8 / zookeeper-3.4.10 / hadoop-2.7.2 / hbase-1.2.5 / phoenix-4.14.1 / spark-2.3.1-bin-hadoop2.7 / kafka_2.11-1.0.1
  Processes (jps): QuorumPeerMain / DataNode / JournalNode / NodeManager / HRegionServer / Kafka / ProdServerStart / TSDMain

Hostname: emr-worker-2    IP: 10.0.0.164    OS: CentOS7.9, 4C 32G 4T SSD
  Packages: open-jdk1.8 / hadoop-2.7.2 / hbase-1.2.5 / phoenix-4.14.1 / spark-2.3.1-bin-hadoop2.7
  Processes (jps): DataNode / NodeManager / HRegionServer / TSDMain
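
Every later step addresses nodes by hostname, so each host has to resolve the four names in the topology. A minimal sketch, assuming name resolution goes through /etc/hosts rather than DNS (the article does not say which); the function name emr_hosts_entries is hypothetical:

```shell
#!/usr/bin/env bash
# Print /etc/hosts entries for the four hosts in the topology above.
# Assumption: hosts resolve each other via /etc/hosts, not DNS.
emr_hosts_entries() {
  cat <<'EOF'
10.0.0.161 emr-header-1
10.0.0.162 emr-header-2
10.0.0.163 emr-worker-1
10.0.0.164 emr-worker-2
EOF
}

# Review the output first; append it on every node only if it looks right:
#   emr_hosts_entries >> /etc/hosts
emr_hosts_entries
```

Printing instead of appending directly keeps the step repeatable without adding duplicate entries on a second run.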

1.2 Unified log directories for all components

/mnt/disk1/log/zookeeper/
/mnt/disk1/log/hadoop-hdfs/
/mnt/disk1/log/hadoop-yarn/
/mnt/disk1/log/hadoop-hbase/
/mnt/disk1/log/hadoop-spark/
/mnt/disk1/log/spark/
/mnt/disk1/log/kafka/
/mnt/disk1/log/opentsdb/

1.3 Unified data directories for all components

/mnt/disk1/zookeeper/
/mnt/disk1/hadoop/
/mnt/disk1/hdfs/
/mnt/disk1/yarn/
/mnt/disk1/kafka/
/mnt/disk1/opentsdb/
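
The log and data directories above can be created in one pass on each node. The sketch below is one possible way to do it; the function name make_emr_dirs and its optional root-prefix argument are hypothetical, added so the layout can be rehearsed in a scratch directory before touching /mnt/disk1:

```shell
#!/usr/bin/env bash
# Create the unified log and data directories listed above.
# The optional first argument is a hypothetical root prefix for rehearsal;
# call with no argument to create the real /mnt/disk1/... paths (needs root).
make_emr_dirs() {
  local root="${1:-}"
  local d
  for d in \
    log/zookeeper log/hadoop-hdfs log/hadoop-yarn log/hadoop-hbase \
    log/hadoop-spark log/spark log/kafka log/opentsdb \
    zookeeper hadoop hdfs yarn kafka opentsdb; do
    mkdir -p "${root}/mnt/disk1/${d}" || return 1
  done
}

# Rehearse in a scratch directory, then list what was created.
scratch="$(mktemp -d)"
make_emr_dirs "$scratch"
find "$scratch/mnt/disk1" -mindepth 1 -type d | sort
```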

1.4 Other important notes

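  • 1.4.1 Before deploying, rename every host according to the plan in 1.2 Deployment topology above, and set up passwordless SSH from emr-header-n to emr-worker-n (to keep the security setup simple, the root user is used throughout);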

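  • 1.4.2 All manual deployment steps below are performed on the emr-header-1 node. Once every component is deployed (and before anything is started), distribute the result to all nodes with, for example:
    scp -r /etc/emr/ emr-header-2:/etc/
    Note that stateful services need different settings on each machine, such as ZooKeeper's /mnt/disk1/zookeeper/myid and Kafka's broker.id in /etc/emr/kafka-conf/server.properties;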

  • 1.4.3 Apply system kernel tuning before starting any cluster component; see the kernel tuning reference.

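# With "sudo cat ... >> file" the redirect runs as the calling user, so append through "sudo tee -a" instead
sudo tee -a /etc/rc.local >/dev/null <<-'EOF'
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
   echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi
EOF
# On CentOS 7, rc.local only runs at boot if it is executable
sudo chmod +x /etc/rc.d/rc.local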
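
The distribution step described in 1.4.2 can be sketched as a loop over the other three hosts of the topology. This is an illustration under assumptions: the DRY_RUN switch and the function name distribute_emr_conf are hypothetical, and by default the commands are only printed so they can be reviewed before anything is copied:

```shell
#!/usr/bin/env bash
# Print (or run) the scp commands that push /etc/emr from emr-header-1
# to the other three hosts of the topology.
# DRY_RUN is a hypothetical switch: 1 (default) prints, 0 executes.
DRY_RUN="${DRY_RUN:-1}"

distribute_emr_conf() {
  local host
  for host in emr-header-2 emr-worker-1 emr-worker-2; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp -r /etc/emr/ ${host}:/etc/"
    else
      scp -r /etc/emr/ "${host}:/etc/" || return 1
    fi
  done
}

distribute_emr_conf
```

After a real run, the per-host values (myid, broker.id) still have to be corrected on each node, because the copy overwrites them with the emr-header-1 values.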

2. Deploy the ZooKeeper cluster

Note: /mnt/disk1/zookeeper/myid must hold a different value on each node: 1, 2 and 3 respectively.

  • 2.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-zookeeper.tar.gz https://archive.apache.org/dist/zookeeper/zookeeper-3.4.10/zookeeper-3.4.10.tar.gz
tar -xf apache-zookeeper.tar.gz
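# Add soft link (name the extracted directory explicitly: `ls | grep zookeeper-` also matches the downloaded tarball)
ln -snf $BASE_DIR/zookeeper-3.4.10 /usr/lib/zookeeper-current
  • 2.2 Environment configuration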
# Generating profile-zookeeper.sh
curl -Lk -o /etc/profile.d/profile-zookeeper.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-zookeeper.sh

chmod +x /etc/profile.d/profile-zookeeper.sh
. /etc/profile.d/profile-zookeeper.sh
  • 2.3 Runtime configuration
mkdir -p /mnt/disk1/zookeeper/data/
mkdir -p /etc/emr/zookeeper-conf

# Generating zoo.cfg
curl -Lk -o $ZOOCFGDIR/zoo.cfg https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/zookeeper-conf/zoo.cfg
  • 2.4 Service configuration
# Generating zookeeper.service (written to the systemd unit directory, not over zoo.cfg)
curl -Lk -o /etc/systemd/system/zookeeper.service https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/zookeeper.service
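
The per-node myid mentioned at the top of this section can be written with a small helper. The sketch below assumes the ids 1, 2, 3 are assigned to the three QuorumPeerMain hosts in the order they appear in the topology; the article only states that the values differ, so check the mapping against the server.N entries in your zoo.cfg. Both function names are hypothetical:

```shell
#!/usr/bin/env bash
# Map a hostname to its ZooKeeper myid.
# Assumption: ids 1,2,3 go to the three QuorumPeerMain hosts in topology
# order; this must agree with the server.N lines in zoo.cfg.
zk_myid_for_host() {
  case "$1" in
    emr-header-1) echo 1 ;;
    emr-header-2) echo 2 ;;
    emr-worker-1) echo 3 ;;
    *) echo "no ZooKeeper on host: $1" >&2; return 1 ;;
  esac
}

# Write the id for a host into a myid file (default path as used in this article).
write_zk_myid() {
  local host="$1" file="${2:-/mnt/disk1/zookeeper/myid}"
  local id
  id="$(zk_myid_for_host "$host")" || return 1
  mkdir -p "$(dirname "$file")" && echo "$id" > "$file"
}

for h in emr-header-1 emr-header-2 emr-worker-1; do
  echo "$h -> myid $(zk_myid_for_host "$h")"
done
```

On each ZooKeeper node this would be invoked as write_zk_myid "$(hostname)" after the configuration has been distributed.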

3. Deploy the Hadoop cluster

  • 3.1 Installation
# TODO
  • 3.2 Environment configuration

# Generating profile-hdfs.sh
curl -Lk -o /etc/profile.d/profile-hdfs.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-hdfs.sh

# Generating profile-yarn.sh
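curl -Lk -o /etc/profile.d/profile-yarn.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-yarn.sh

# Load both profiles so the variables used in 3.3 are set in this shell
chmod +x /etc/profile.d/profile-hdfs.sh /etc/profile.d/profile-yarn.sh
. /etc/profile.d/profile-hdfs.sh
. /etc/profile.d/profile-yarn.sh
  • 3.3 Runtime configuration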
mkdir -p $HADOOP_LOG_DIR
mkdir -p $YARN_LOG_DIR
mkdir -p $HADOOP_CONF_DIR
# Copy the contents; without /* the files would land in $HADOOP_CONF_DIR/hadoop/
cp -r $HADOOP_HOME/etc/hadoop/* $HADOOP_CONF_DIR/

# Generating hadoop-env.sh
curl -Lk -o $HADOOP_CONF_DIR/hadoop-env.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/hadoop-env.sh

# Generating core-site.xml
curl -Lk -o $HADOOP_CONF_DIR/core-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/core-site.xml

# Generating hdfs-site.xml
curl -Lk -o $HADOOP_CONF_DIR/hdfs-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/hdfs-site.xml

# Generating mapred-env.sh
curl -Lk -o $HADOOP_CONF_DIR/mapred-env.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/mapred-env.sh

# Generating mapred-site.xml
curl -Lk -o $HADOOP_CONF_DIR/mapred-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/mapred-site.xml

# Generating yarn-env.sh
curl -Lk -o $HADOOP_CONF_DIR/yarn-env.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/yarn-env.sh

# Generating yarn-site.xml
curl -Lk -o $HADOOP_CONF_DIR/yarn-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/yarn-site.xml

# Generating slaves
cat <<-'EOF'> /etc/emr/hadoop-conf/slaves
emr-worker-1
emr-worker-2
EOF
  • 3.4 Service configuration
cd /etc/systemd/system
curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-namenode.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-journalnode.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-zkfc.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-resourcemanager.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-httpfs.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-historyserver.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-nodemanager.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hadoop-datanode.service
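
Not every host runs every Hadoop unit downloaded in 3.4. The sketch below reads the role assignment off the jps column of the deployment topology and prints the units each host would enable. The function name hadoop_units_for_host is hypothetical, and it assumes the unit files have also been copied to /etc/systemd/system on every node, which the article does not describe:

```shell
#!/usr/bin/env bash
# List the Hadoop systemd units a host should enable, following the jps
# column of the deployment topology. Unit names are the files from 3.4.
hadoop_units_for_host() {
  case "$1" in
    emr-header-1|emr-header-2)
      echo "hadoop-namenode hadoop-journalnode hadoop-zkfc hadoop-resourcemanager hadoop-httpfs hadoop-historyserver" ;;
    emr-worker-1)
      echo "hadoop-datanode hadoop-journalnode hadoop-nodemanager" ;;
    emr-worker-2)
      echo "hadoop-datanode hadoop-nodemanager" ;;
    *) echo "unknown host: $1" >&2; return 1 ;;
  esac
}

for h in emr-header-1 emr-header-2 emr-worker-1 emr-worker-2; do
  echo "$h: $(hadoop_units_for_host "$h")"
done
```

On a node, the result could then be fed to systemd, for example: systemctl daemon-reload followed by systemctl enable $(hadoop_units_for_host "$(hostname)").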

4. Deploy the HBase cluster

  • 4.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-hbase.tgz https://archive.apache.org/dist/hbase/1.2.5/hbase-1.2.5-bin.tar.gz
tar -xf apache-hbase.tgz
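# Add soft link (name the extracted directory explicitly: `ls | grep hbase-` also matches the downloaded tarball)
ln -snf $BASE_DIR/hbase-1.2.5 /usr/lib/hbase-current
  • 4.2 Environment configuration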
# Generating profile-hbase.sh
curl -Lk -o /etc/profile.d/profile-hbase.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-hbase.sh

chmod +x /etc/profile.d/profile-hbase.sh
. /etc/profile.d/profile-hbase.sh
  • 4.3 Runtime configuration
mkdir -p $HBASE_LOG_DIR
mkdir -p $HBASE_CONF_DIR
# Copy the contents; without /* the files would land in $HBASE_CONF_DIR/conf/
cp -r $HBASE_HOME/conf/* $HBASE_CONF_DIR/

# Generating hbase-env.sh (-o needs a file path, not the directory)
curl -Lk -o $HBASE_CONF_DIR/hbase-env.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hbase-conf/hbase-env.sh

# Generating hbase-site.xml
curl -Lk -o $HBASE_CONF_DIR/hbase-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hbase-conf/hbase-site.xml

# Generating regionservers
cat <<-'EOF'> /etc/emr/hbase-conf/regionservers
emr-worker-1
emr-worker-2
EOF
  • 4.4 Service configuration
cd /etc/systemd/system

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hbase-hmaster.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hbase-regionserver.service

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/hbase-thrift.service

5. Integrate Phoenix into the cluster

  • 5.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-phoenix.tgz http://archive.apache.org/dist/phoenix/apache-phoenix-4.14.1-HBase-1.2/bin/apache-phoenix-4.14.1-HBase-1.2-bin.tar.gz
tar -xf apache-phoenix.tgz
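# Add soft link (name the extracted directory explicitly: `ls | grep phoenix-` also matches the downloaded tarball)
ln -snf $BASE_DIR/apache-phoenix-4.14.1-HBase-1.2-bin /usr/lib/phoenix-current
  • 5.2 Environment configuration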
# Generating profile-phoenix.sh
curl -Lk -o /etc/profile.d/profile-phoenix.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-phoenix.sh

chmod +x /etc/profile.d/profile-phoenix.sh
. /etc/profile.d/profile-phoenix.sh
  • 5.3 Link with HBase
cp -r $PHOENIX_HOME/phoenix-*-*-*-server.jar $HBASE_HOME/lib/

5. Deploy the Spark cluster

  • 5.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-spark.tgz https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar -xf apache-spark.tgz
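# Add soft link (name the extracted directory explicitly: `ls | grep spark-` also matches the downloaded tarball)
ln -snf $BASE_DIR/spark-2.3.1-bin-hadoop2.7 /usr/lib/spark-current
  • 5.2 Environment configuration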
# Generating profile-spark.sh
curl -Lk -o /etc/profile.d/profile-spark.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/spark-conf/profile-spark.sh

chmod +x /etc/profile.d/profile-spark.sh
. /etc/profile.d/profile-spark.sh
  • 5.3 Runtime configuration
mkdir -p $SPARK_LOG_DIR
mkdir -p $SPARK_CONF_DIR
# Copy the contents; without /* the files would land in $SPARK_CONF_DIR/conf/
cp -r $SPARK_HOME/conf/* $SPARK_CONF_DIR/

# Generating spark-defaults.conf (-o needs a file path, not the directory)
curl -Lk -o $SPARK_CONF_DIR/spark-defaults.conf https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/spark-conf/spark-defaults.conf

# Generating spark-env.sh
curl -Lk -o $SPARK_CONF_DIR/spark-env.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/spark-conf/spark-env.sh

# Generating hive-site.xml
curl -Lk -o $SPARK_CONF_DIR/hive-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/spark-conf/hive-site.xml
  • 5.4 Service configuration
cd /etc/systemd/system

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/spark-historyserver.service

6. Deploy the Flink cluster

  • 6.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-flink.tgz https://archive.apache.org/dist/flink/flink-1.14.4/flink-1.14.4-bin-scala_2.11.tgz
tar -xf apache-flink.tgz
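# Add soft link (name the extracted directory explicitly: `ls | grep flink-` also matches the downloaded tarball)
ln -snf $BASE_DIR/flink-1.14.4 /usr/lib/flink-current
  • 6.2 Environment configuration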
# Generating profile-flink.sh
curl -Lk -o /etc/profile.d/profile-flink.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-flink.sh

chmod +x /etc/profile.d/profile-flink.sh
. /etc/profile.d/profile-flink.sh
  • 6.3 Runtime configuration
mkdir -p $FLINK_LOG_DIR
mkdir -p $FLINK_CONF_DIR
# Copy the contents; without /* the files would land in $FLINK_CONF_DIR/conf/
cp -r $FLINK_HOME/conf/* $FLINK_CONF_DIR/

# Generating flink-conf.yaml
# TODO
  • 6.4 Service configuration
cd /etc/systemd/system

curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/flink-standalone.service

7. Deploy the Kafka cluster

Note: broker.id in /etc/emr/kafka-conf/server.properties must differ on each node: 1, 2 and 3 respectively.
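
Once server.properties is in place on each node (step 7.3 and the distribution step), the per-node broker.id can be set with a small helper instead of editing by hand. A minimal sketch; the function name set_broker_id is hypothetical and the demo file contents are made up for illustration:

```shell
#!/usr/bin/env bash
# Set broker.id in a Kafka server.properties file.
# Replaces an existing broker.id line, or appends one if none is present.
set_broker_id() {
  local file="$1" id="$2"
  [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
  if grep -q '^broker\.id=' "$file"; then
    # -i.bak keeps a backup copy and works with both GNU and BSD sed
    sed -i.bak "s/^broker\.id=.*/broker.id=${id}/" "$file"
  else
    echo "broker.id=${id}" >> "$file"
  fi
}

# Demo on a scratch file so nothing under /etc is touched.
demo="$(mktemp)"
printf 'broker.id=0\nlog.dirs=/mnt/disk1/kafka\n' > "$demo"
set_broker_id "$demo" 2
cat "$demo"
```

On a real node the call would look like set_broker_id /etc/emr/kafka-conf/server.properties 2, with the id chosen per host.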

  • 7.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
wget -O apache-kafka.tgz https://archive.apache.org/dist/kafka/1.0.1/kafka_2.11-1.0.1.tgz
tar -xf apache-kafka.tgz
# Add soft link (name the extracted directory explicitly)
ln -snf $BASE_DIR/kafka_2.11-1.0.1 /usr/lib/kafka-current
  • 7.2 Environment configuration
# Generating profile-kafka.sh
sudo curl -Lk -o /etc/profile.d/profile-kafka.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-kafka.sh

sudo chmod +x /etc/profile.d/profile-kafka.sh
. /etc/profile.d/profile-kafka.sh
  • 7.3 Runtime configuration
mkdir -p $KAFKA_CONF_DIR
mkdir -p $KAFKA_DATA_DIR
cp -r /usr/lib/kafka-current/config/* $KAFKA_CONF_DIR

# Generating server.properties (-o needs a file path, not the directory)
sudo curl -Lk -o $KAFKA_CONF_DIR/server.properties https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/kafka-conf/server.properties
  • 7.4 Service configuration
cd /etc/systemd/system

sudo curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/kafka.service
  • 7.5 Kafka Manager bare-metal deployment
sudo curl -Lk -O https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/systemd/system/kafka-manager.service

sudo curl -Lk -o /etc/init.d/kafka-manager.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/init.d/kafka-manager.sh
sudo chmod +x /etc/init.d/kafka-manager.sh

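# Create only the parent directory; if $KAFKA_MANAGER_HOME already existed as a directory, the link below would be created inside it
sudo mkdir -p $(dirname $KAFKA_MANAGER_HOME)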
sudo ln -snf /opt/apps/kafka-manager-1.3.3.16 $KAFKA_MANAGER_HOME
  • 7.6 (Recommended) Kafka Manager Docker deployment

Note: kafka-manager 2.x is for connecting to Kafka 2.x; it does not work with Kafka 1.x.

# Connect to the local ZooKeeper
docker run --rm --name=kafka-manager1 --network=host -e ZK_HOSTS=127.0.0.1:2181 registry.cn-shenzhen.aliyuncs.com/wl4g/kafka-manager:2.0.0.2

8. Deploy the OpenTSDB cluster

  • 8.1 Installation
BASE_DIR=/opt/apps/emr
# Download
mkdir -p $BASE_DIR; cd $BASE_DIR
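wget -O opentsdb.rpm https://github.com/OpenTSDB/opentsdb/releases/download/v2.4.1/opentsdb-2.4.1-1-20210902183110-root.noarch.rpm
# or, on Debian/Ubuntu:
#wget -O opentsdb.deb https://github.com/OpenTSDB/opentsdb/releases/download/v2.4.1/opentsdb-2.4.1_all.deb

# The download is an RPM package, not a tarball, so install it with rpm instead of tar
rpm -ivh opentsdb.rpm
# Add soft link (the package is expected to install under /usr/share/opentsdb; confirm with: rpm -ql opentsdb)
ln -snf /usr/share/opentsdb /usr/lib/opentsdb-current
  • 8.2 Environment configuration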
# Generating profile-opentsdb.sh
curl -Lk -o /etc/profile.d/profile-opentsdb.sh https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/profile.d/profile-opentsdb.sh

chmod +x /etc/profile.d/profile-opentsdb.sh
. /etc/profile.d/profile-opentsdb.sh
  • 8.3 Runtime configuration

TODO

  • 8.4 Service configuration

TODO

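9. Start the cluster

  • Format before the first start
# On first install, run the format command on the nn1 node. Note: the journalnode services must already be running, because formatting connects to the JournalNodes on port 8485 to write the edit log.
hdfs namenode -format

# In HA mode, zkfc must also be formatted. To reset it manually, clear the data in ZooKeeper, typically: echo 'rmr /hadoop-ha'|zkCli.sh
hdfs zkfc -formatZK
  • In HA mode, the format operations above only need to be run on the NN1 node. Note: before starting NN2, first run hdfs namenode -bootstrapStandby on it, which copies the namespace metadata from NN1 and marks the node as the standby.
    Otherwise NN2 is treated as an active node being started and fails with the error NameNode not formatted. Once the cluster is ready, check the active state with hdfs haadmin -getServiceState nn1.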
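
The ordering constraints above are easy to get wrong, so it can help to print the whole first-start sequence before running anything. The sketch below only prints a plan and executes nothing. It assumes nn1 is emr-header-1 and nn2 is emr-header-2, that the JournalNodes run on the three hosts named in the topology, that ZooKeeper is already up before the zkfc format step, and that the Hadoop 2.x hadoop-daemon.sh script is on PATH; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Print the first-time HDFS HA initialisation order described above.
# Nothing is executed: each line is "<host>: <command>" for review.
# Assumptions: nn1 = emr-header-1, nn2 = emr-header-2, JournalNodes on the
# three hosts from the topology, ZooKeeper already running.
hdfs_ha_first_start_plan() {
  local h
  for h in emr-header-1 emr-header-2 emr-worker-1; do
    echo "$h: hadoop-daemon.sh start journalnode"
  done
  echo "emr-header-1: hdfs namenode -format"
  echo "emr-header-1: hdfs zkfc -formatZK"
  echo "emr-header-1: hadoop-daemon.sh start namenode"
  echo "emr-header-2: hdfs namenode -bootstrapStandby"
  echo "emr-header-2: hadoop-daemon.sh start namenode"
  for h in emr-header-1 emr-header-2; do
    echo "$h: hadoop-daemon.sh start zkfc"
  done
  echo "emr-header-1: hdfs haadmin -getServiceState nn1"
}

hdfs_ha_first_start_plan
```

If the systemd units from section 3.4 are used instead of the daemon scripts, the same order applies with the corresponding systemctl start calls.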

  • Start all components of the EMR cluster

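10. Verify the cluster

  • Check 1: after the processes of all components have started, confirm that they match the process list in the Deployment topology above.

  • Check 2: data produced to topic1 with kafka-console-producer.sh is consumed by the spark-streaming SparkSubmit process and stored in the HBase table tb_table1.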
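
Check 1 can be partly automated by comparing jps output against the expected process names for a host. A minimal sketch, assuming the usual "<pid> <MainClass>" jps line format; the function name missing_processes is hypothetical, and the demo uses canned input with made-up pids rather than a live jps call:

```shell
#!/usr/bin/env bash
# Report which expected JVM processes are missing from jps output.
# Usage: jps | missing_processes NameNode DataNode ...
# Reads "<pid> <MainClass>" lines on stdin, prints each expected name that
# does not appear, and returns non-zero if any is missing.
missing_processes() {
  local running name rc=0
  running="$(awk '{print $2}')"
  for name in "$@"; do
    if ! printf '%s\n' "$running" | grep -qx -- "$name"; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Demo with canned jps output for emr-worker-2 (pids are made up).
printf '101 DataNode\n102 NodeManager\n103 Jps\n' |
  missing_processes DataNode NodeManager HRegionServer TSDMain || true
```

On a live node the expected names would be taken from that host's jps list in the topology and the input would come from jps itself.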

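11. Deploy a local development environment

To deploy a local development environment, simply change xx/emr/xx to xx/emr-local/xx in the configuration file paths of the curl commands above.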
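
The path substitution described above can be expressed as a one-line rewrite. A minimal sketch, assuming the segment to swap is always /resources/etc/emr/ as in the URLs used throughout this article; the function name to_local_url is hypothetical:

```shell
#!/usr/bin/env bash
# Rewrite a production resource URL to its local-development variant by
# swapping the /resources/etc/emr/ path segment for /resources/etc/emr-local/.
# URLs without that segment (for example the profile.d ones) pass through unchanged.
to_local_url() {
  printf '%s\n' "$1" | sed 's#/resources/etc/emr/#/resources/etc/emr-local/#'
}

to_local_url "https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr/hadoop-conf/hdfs-site.xml"
```

The pattern includes /resources/etc/ on purpose, so the emr- prefix inside emr-production-deployment earlier in the path is left alone.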

  • For example, download the hdfs-site.xml intended for local use
curl -Lk -o $HADOOP_CONF_DIR/hdfs-site.xml https://gitee.com/wl4g/blogs/raw/master/docs/articles/hadoop/emr-production-deployment/resources/etc/emr-local/hadoop-conf/hdfs-site.xml
  • Start and verify
# HDFS must be formatted on first install
hdfs namenode -format

# Start with the minimal footprint, keeping only the essential services
zkServer.sh start
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hbase-daemon.sh start master
hbase-daemon.sh start regionserver
kafka-server-start.sh -daemon $KAFKA_CONF_DIR/server.properties

# Verify the services
hdfs dfs -ls /
sqlline.py
0: jdbc:phoenix:localhost:2181:/hbase> !table
