INSTITUTO DE INFORMÁTICA, Universidade Federal de Goiás
Hadoop Ecosystem
Prof. Savio Salvarino Teles de Oliveira

HDFS: The Hadoop Distributed File System
● Built on "commodity" hardware;
● Highly fault tolerant;
● Designed for batch processing: data access favors high throughput over low latency (latency is hidden rather than minimized);
● Supports very large datasets.

HDFS runs with one Name Node and many Data Nodes on commodity hardware.

Name Node
• Stores the directory tree of the file system.
• Stores all the metadata about the files: file name, permissions, directory, and which nodes hold each block.
• The metadata is kept in memory and persisted on disk.
• Do not forget to back up the metadata: if you lose the NameNode, you lose the ENTIRE HDFS.

Data Node
• Stores and manages HDFS blocks on its local disk.
• Reports its health and the status of its individual block replicas to the NameNode.

Blocks
Just as a text can be broken into blocks of sentences, HDFS breaks every file into blocks:
● Files of different sizes are processed the same way;
● Storage is simplified;
● The block is the unit of replication and fault tolerance;
● The default block size is 128 MB.

Writing to HDFS
#hadoop fs -put arquivo.txt /path/in/hdfs
A 500 MB arquivo.txt is split by the client into 128 MB blocks (Blk_01 to Blk_04, the last one smaller). Each block is streamed to a Data Node and then replicated: Blk_01, for example, ends up on Data Nodes A, B and F, and the mapping "Blk_01: A, B, F" is registered in the NameNode metadata. If a replica later has to be recreated elsewhere (the slides show the mapping changing to "Blk_01: A, E, F"), only the NameNode metadata is updated; each metadata entry is small, on the order of 150 bytes. A client-side sketch of this write in code appears right after the replication notes below. But what happens if one of the blocks is corrupted?

Replication in HDFS
1. Replicate blocks according to a replication factor;
2. Store the replicas in different locations.
The locations of the replicas are stored in the NameNode!
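The slides perform the write with the HDFS shell (#hadoop fs -put). As a rough client-side equivalent, the sketch below uses Hadoop's Java FileSystem API to create the same file; the NameNode address (hdfs://namenode-host:8020) and the file contents are placeholders, not values taken from the slides.

    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS must point at the cluster's NameNode; this host is a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path target = new Path("/path/in/hdfs/arquivo.txt");

                // Rough equivalent of "hadoop fs -put": the client streams bytes to the
                // DataNodes block by block; the NameNode only records which DataNodes
                // hold each block of the file.
                try (OutputStream out = fs.create(target)) {
                    out.write("conteudo de exemplo".getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }

Note that fs.create() never sends data through the NameNode: the stream goes straight to a pipeline of DataNodes, which is why writes scale with the number of nodes.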
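To inspect the block-to-DataNode mapping that the NameNode keeps (the "Blk_01: A, B, F" entries above), a client can ask for a file's block locations. This is a minimal sketch under the same placeholder NameNode address; the setReplication() call at the end illustrates how the replication factor of a single file can be changed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/path/in/hdfs/arquivo.txt");
                FileStatus status = fs.getFileStatus(file);

                // The NameNode answers this from its in-memory metadata: for each block
                // of the file, which DataNodes currently hold a replica.
                for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("offset=" + block.getOffset()
                            + " length=" + block.getLength()
                            + " hosts=" + String.join(",", block.getHosts()));
                }

                // Ask for 3 replicas of every block of this file (the actual copying is
                // done later by the NameNode and DataNodes, not by this call).
                fs.setReplication(file, (short) 3);
            }
        }
    }

The hdfs fsck command exposes the same per-block information from the command line.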
Replication in HDFS
● Replicas are stored on different nodes to MAXIMIZE REDUNDANCY.
● Rack awareness: the third replica should preferably sit in the same rack as the second, but on a different node.

Running a job on YARN
#hadoop jar wc.jar arquivo.txt
1. The Client submits wc.jar to the Resource Manager and receives an ApplicationId.
2. The Resource Manager asks one of the Node Managers to launch the Application Master for this job.
3. The Application Master sends a Resource Request to the Resource Manager, which grants it containers on the Node Managers.
4. The Application Master asks those Node Managers to execute wc.jar inside the containers, and keeps reporting resource status and progress back to the Resource Manager.
5. When all tasks finish, the Application Master reports "Done!" and shuts down.
A second job, for example #hadoop jar groupcount.jar arquivo.txt, gets its own Application Master, so several applications can run on the cluster at the same time.

Hadoop 2.x core components (Hadoop daemons)
● YARN (MapReduce / MRv2): Resource Manager (master) and Node Manager (slave);
● HDFS: Name Node (master) and Data Node (slave).

Data locality: the Application Master tries to schedule each map task of wc.jar on the Node Manager that runs next to the Data Node holding the corresponding block (blk_01, blk_02, blk_03), so map tasks read local data; the reduce tasks then consume the map outputs.

MapReduce: distributed programming
How MapReduce works:
Data input: the process starts with raw data.
Map function: a "map" function is applied to each data record, producing intermediate key-value pairs.
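The slides submit wc.jar without showing its source. Assuming it is the classic word count, the map side would look roughly like the sketch below: for every line of arquivo.txt the mapper emits one intermediate (word, 1) pair per word.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map side of a word count: the input key is the byte offset of the line,
    // the input value is the line itself, and the output is (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // intermediate key-value pair
                }
            }
        }
    }

Each map task runs this class over a single HDFS block (blk_01, blk_02, ...), preferably on the node that already stores that block.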
Shuffle and Combine: the intermediate key-value pairs are shuffled across the cluster and sorted by key (optionally pre-aggregated by a combiner on the map side).
Reduce function: the "reduce" function combines the values that share the same key, producing the final result of the operation.
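Continuing the hypothetical word count: after the shuffle has grouped the intermediate pairs by key, a reducer such as the sketch below sums the counts of each word, and the driver in main() wires the WordCountMapper sketched earlier, a combiner and the reducer into the job that #hadoop jar wc.jar would submit to YARN. Class names and the input/output paths are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // After the shuffle, all counts for the same word arrive together;
            // summing them produces the final (word, total) pair.
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total));
        }

        // Driver: configures the job and submits it to YARN (Resource Manager,
        // Application Master and containers, as in the walkthrough above).
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountReducer.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // local pre-aggregation before the shuffle
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. arquivo.txt in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as the combiner works here because addition is associative, so partial sums computed on the map side do not change the final result.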
Savio Salvarino Teles de Oliveira
savioteles@ufg.br