Hadoop part (single node):
1. vim /etc/hosts
192.168.8.201 hadoop
2. Create the hadoop user
useradd hadoop
Set the hadoop user's password to 123 (passwd hadoop)
3. Install the JDK
[root@h201 ~]# vim /etc/profile
export JAVA_HOME=/usr/local/jdk1.8.0
export JAVA_BIN=$JAVA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH
4. Set up SSH keys (passwordless login)
[hadoop@h201 ~]$ ssh-keygen -t rsa
[hadoop@h201 ~]$ ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub h201
5. Install Hadoop and set the environment variables
[hadoop@h201 hadoop]$ cp hadoop-2.6.0.tar.gz /home/hadoop
[hadoop@h201 ~]$ vi .bash_profile
HADOOP_HOME=/home/hadoop/hadoop-2.6.0
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_HOME HADOOP_CONF_DIR PATH
[hadoop@h201 ~]$ source .bash_profile
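(The tarball presumably also has to be unpacked under /home/hadoop, e.g. tar -zxvf hadoop-2.6.0.tar.gz, so that the HADOOP_HOME directory above actually exists.)
6. core-site.xml is not shown in these notes, but a working HDFS needs fs.defaultFS; a minimal sketch, assuming the hdfs://hadoop:9000 address that the HBase section below uses:
[hadoop@h201 hadoop]$ vi core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop:9000</value>
</property>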
7. Edit hdfs-site.xml
Create the local data directories first:
mkdir -p /home/hadoop/data/dfs/name
mkdir -p /home/hadoop/data/dfs/data
mkdir -p /home/hadoop/data/dfs/namesecondary
Then edit the config file:
[hadoop@h201 hadoop]$ vi hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
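The three directories created above are presumably wired in with the standard Hadoop 2.x properties, along these lines:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/data/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/data/dfs/data</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///home/hadoop/data/dfs/namesecondary</value>
</property>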
8. Edit mapred-site.xml
[hadoop@h201 hadoop]$ cp mapred-site.xml.template mapred-site.xml
<value>hadoop:10020</value>
<description>MapReduce JobHistoryServer IPC host:port</description>
<value>hadoop:19888</value>
<description>MapReduce JobHistoryServer Web UI host:port</description>
The property "mapreduce.framework.name" selects the framework used to run MapReduce jobs; it defaults to local and must be changed to yarn.
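Putting the fragments above together, mapred-site.xml presumably ends up like this (the property names are the standard ones matching the descriptions above):
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop:10020</value>
<description>MapReduce JobHistoryServer IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop:19888</value>
<description>MapReduce JobHistoryServer Web UI host:port</description>
</property>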
9. Edit yarn-site.xml
[hadoop@h201 hadoop]$ vi yarn-site.xml
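The notes give no content for this file; a minimal sketch that lets MapReduce jobs run on YARN (standard property names, hostname taken from /etc/hosts above):
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>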
10. Edit hadoop-env.sh
[hadoop@h201 hadoop]$ vi hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0
11. Edit the slaves file
[hadoop@h201 hadoop]$ vi slaves
localhost
12. Verify:
Format the NameNode:
[hadoop@h201 hadoop-2.6.0]$ bin/hdfs namenode -format
[hadoop@h201 hadoop-2.6.0]$ sbin/start-all.sh
[hadoop@h201 hadoop-2.6.0]$ jps
7054 SecondaryNameNode
7844 Jps
7318 NameNode
7598 ResourceManager
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -ls /
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /aaa
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir -p /home/hadoop
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /home/hive
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /home/hbase
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /home/spark
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir -p /home/flink/checkpoints
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir -p /tmp/hadoop
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /tmp/hive
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /tmp/hbase
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -mkdir /tmp/spark
[hadoop@hadoop hadoop-2.6.0]$ bin/hadoop fs -chmod -R 777 /tmp
Hive part (edit config files):
hive-site.xml -- look up a full example online; a minimal sketch is given below.
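A minimal hive-site.xml sketch for a MySQL-backed metastore. The warehouse dir matches the /home/hive directory created in HDFS above, and the driver class assumes the mysql-connector-java 8.x jar listed in the Spark classpath; database name, user, and password are placeholders:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/hive</value>
</property>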
HBase part (edit config files):
hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop:9000/home/hbase</value>
<description>Where the HRegionServers store their data, i.e. the HBase data directory on HDFS</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Number of replicas for HLog and HFile data; it must not exceed the number of HDFS DataNodes. In pseudo-distributed mode there is only one DataNode, so set it to 1.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>1200000</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>50</value>
</property>
<property>
<name>hbase.client.write.buffer</name>
<value>8388608</value>
</property>
<property>
<name>mapreduce.task.timeout</name>
<value>1200000</value>
</property>
<property>
<name>hbase.client.scanner.timeout.period</name>
<value>600000</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>600000</value>
</property>
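After editing, HBase is presumably started with the stock script; HMaster, HRegionServer, and (with the built-in ZooKeeper) HQuorumPeer should then show up in jps:
[hadoop@hadoop ~]$ $HBASE_HOME/bin/start-hbase.sh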
Spark part (edit config files):
spark-defaults.conf
spark.master spark://hadoop:7077
spark.default.parallelism 6
spark.driver.memory 2g
spark.executor.memory 2g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 1
spark.kryoserializer.buffer.max 1g
spark.kryoserializer.buffer 1g
spark.executor.extraClassPath /home/hadoop/hive-2.3.3/lib/mysql-connector-java-8.0.13.jar:/home/hadoop/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar:/home/hadoop/hbase-2.1.1/lib/hbase-client-2.1.1.jar:/home/hadoop/hbase-2.1.1/lib/hbase-server-2.1.1.jar:/home/hadoop/hbase-2.1.1/lib/hbase-common-2.1.1.jar:/home/hadoop/hbase-2.0.2/lib/hbase-protocol-shaded-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/htrace-core-3.2.0-incubating.jar:/home/hadoop/hbase-2.0.2/lib/htrace-core4-4.2.0-incubating.jar:/home/hadoop/hbase-2.0.2/lib/metrics-core-3.2.1.jar:/home/hadoop/hbase-2.0.2/lib/hbase-hadoop2-compat-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/hbase-hadoop-compat-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/guava-11.0.2.jar:/home/hadoop/hbase-2.0.2/lib/protobuf-java-2.5.0.jar
spark.driver.extraClassPath /home/hadoop/hive-2.3.3/lib/mysql-connector-java-8.0.13.jar:/home/hadoop/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar:/home/hadoop/hbase-2.1.1/lib/hbase-client-2.1.1.jar:/home/hadoop/hbase-2.1.1/lib/hbase-server-2.1.1.jar:/home/hadoop/hbase-2.1.1/lib/hbase-common-2.1.1.jar:/home/hadoop/hbase-2.0.2/lib/hbase-protocol-shaded-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/htrace-core-3.2.0-incubating.jar:/home/hadoop/hbase-2.0.2/lib/htrace-core4-4.2.0-incubating.jar:/home/hadoop/hbase-2.0.2/lib/metrics-core-3.2.1.jar:/home/hadoop/hbase-2.0.2/lib/hbase-hadoop2-compat-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/hbase-hadoop-compat-2.0.2.jar:/home/hadoop/hbase-2.0.2/lib/guava-11.0.2.jar:/home/hadoop/hbase-2.0.2/lib/protobuf-java-2.5.0.jar
spark-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HBASE_HOME=/home/hadoop/hbase-1.3.1
export HBASE_CONF=$HBASE_HOME/conf
export HIVE_HOME=/home/hadoop/hive-2.3.3
export SPARK_HOME=/home/hadoop/spark-2.1.1
export SPARK_MASTER_IP=hadoop
export SPARK_WORKER_MEMORY=1G
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=1
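Since spark.master points at spark://hadoop:7077, the standalone master and worker are presumably started with the stock Spark 2.x scripts, roughly:
[hadoop@hadoop spark-2.1.1]$ sbin/start-master.sh
[hadoop@hadoop spark-2.1.1]$ sbin/start-slaves.sh
jps should then also show Master and Worker.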
Flink part (edit config files):
Download: https://mirrors.bfsu.edu.cn/apache/flink/flink-1.12.5/flink-1.12.5-bin-scala_2.11.tgz
See the material on our network drive for reference.
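The notes do not show the Flink config itself; a minimal conf/flink-conf.yaml sketch, assuming checkpoints go to the /home/flink/checkpoints HDFS directory created earlier (key names are standard Flink 1.12 options, the slot count is an arbitrary placeholder):
jobmanager.rpc.address: hadoop
taskmanager.numberOfTaskSlots: 2
state.backend: filesystem
state.checkpoints.dir: hdfs://hadoop:9000/home/flink/checkpoints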
flink-streaming-platform-web (edit config files):
This one has only a single config file:
application.properties
(it is just the MySQL JDBC connection info)
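A sketch of that JDBC section, assuming the project uses the standard Spring Boot datasource keys; database name, user, and password are placeholders, so check the template on the network drive for the exact keys:
spring.datasource.url=jdbc:mysql://hadoop:3306/flink_web?useSSL=false&serverTimezone=UTC
spring.datasource.username=root
spring.datasource.password=123
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver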
All of these config files can be downloaded from our network drive:
http://oneindex.iegum.com
Now let's focus on the Flink part.
The principle:
The idea behind sum: X = X + Y
where X lives in memory and Y comes from Kafka. Every aggregate function follows this same pattern; in other words, at any moment we are only ever combining two values.
I'll explain it in terms of Flink SQL:
Flink can map the stream data in Kafka onto a table, say table a.
Take Alibaba's real-time dashboard as an example, to make this easier to follow.
Say table a has a column total, the transaction amount.
Alibaba's data volume is huge; computing one day's total in plain SQL terms would be:
select sum(total) from a;
With that much data, the resources would never be enough.
With a streaming approach it turns into this instead:
We ingest table a into Kafka, have Flink map it as table kafka_a, and also hook up the result table total_a.
The statement becomes:
select sum(total) from (select total from kafka_a union all select total from total_a) tmp
On the first run, Flink reads the result table total_a once, unions it with the total values from kafka_a, computes the sum, and keeps it in memory.
After that, whenever new stream data arrives from Kafka, the sum is simply the in-memory value plus the incoming total from kafka_a.
In effect the result table is read only once and nothing goes back to disk afterwards, so it is very fast. Flink really is a blessing for small companies: even a modest setup can process data at high speed.
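To make this concrete, here is a hedged Flink SQL sketch (Flink 1.12 syntax) of how kafka_a and total_a might be declared. The topic, broker address, database, and credentials are assumptions, and the id column is added only so the JDBC sink has a primary key to upsert the running sum into; the mysql-connector jar has to be on Flink's classpath.

CREATE TABLE kafka_a (
  total DECIMAL(16,2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'a',                                  -- assumed topic name
  'properties.bootstrap.servers' = 'hadoop:9092', -- assumed broker address
  'properties.group.id' = 'flink_sql_demo',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

CREATE TABLE total_a (
  id INT,
  total DECIMAL(16,2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop:3306/test',        -- assumed database
  'table-name' = 'total_a',
  'username' = 'root',                            -- placeholder credentials
  'password' = '123'
);

-- Read the old total from total_a once, union it with the Kafka stream,
-- and keep upserting the running sum back into total_a.
INSERT INTO total_a
SELECT id, SUM(total) AS total
FROM (
  SELECT 1 AS id, total FROM kafka_a
  UNION ALL
  SELECT id, total FROM total_a
) tmp
GROUP BY id;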
I've put the diagram on our website; later I'll explain it directly with the data-processing model.