Day 1:

Big Data

  • What is Big Data?
  • Why are all industries talking about Big Data?
  • What are the issues with Big Data?
  • Storage and processing
  • What are the challenges in storing big data?
  • What are the challenges in processing big data?
  • Which technologies support big data?
  • Hadoop and Big Data
  • Traditional databases vs. NoSQL
  • Most popular Big Data ecosystems (Spark, Cassandra, Flink)

Day 2:

Hadoop

  • Installation of Apache Hadoop
  • What is Hadoop?
  • History of Hadoop
  • Why Hadoop?
  • Hadoop use cases
  • Advantages and disadvantages of Hadoop
  • Importance of the different ecosystems of Hadoop
  • Importance of integration with other Big Data solutions
  • Big Data real-time use cases
  • Apache Hadoop installation in local mode (hands-on installation on your laptop)
  • Pseudo-distributed mode (hands-on installation on your laptop)
  • Cluster mode (5-node cluster setup in an AWS account)

Day 3:

HDFS Commands

  • Importance of each command
  • How to execute each command
  • HDFS admin-related commands, explained
    Configurations
  • Can we change the existing HDFS configurations?
  • CLI (Command Line Interface): using HDFS commands (see the sketch below)
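
A few representative HDFS shell commands, as a minimal sketch (paths and file names here are illustrative):

    # List the contents of an HDFS directory
    hdfs dfs -ls /user/hadoop

    # Copy a local file into HDFS and read it back
    hdfs dfs -put sales.csv /user/hadoop/sales.csv
    hdfs dfs -cat /user/hadoop/sales.csv

    # Admin side: report cluster health, check a path's blocks
    hdfs dfsadmin -report
    hdfs fsck /user/hadoop -files -blocks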

Day 4:

HDFS Architecture

  • NameNode
  • Importance of the NameNode
  • What are the roles of the NameNode?
  • What are the drawbacks of the NameNode?
  • Secondary NameNode
  • Importance of the Secondary NameNode
  • What are the roles of the Secondary NameNode?
  • What are the drawbacks of the Secondary NameNode?
  • DataNode
  • Importance of the DataNode
  • What are the roles of the DataNode?
  • What are the drawbacks of the DataNode?

Day 5:

Data Storage in HDFS

  • Traditional OS block information
  • How blocks are stored on DataNodes
  • How replication works across DataNodes
  • HDFS block size
  • Importance of the HDFS block size
  • Why is the block size so large?
  • How it relates to the MapReduce split size
  • Importance of the HDFS replication factor in a production environment
  • Can we change the replication factor for a particular file or folder? (see the sketch below)
  • Can we change the replication factor for all files and folders?
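
A minimal sketch of how the replication factor can be changed; the paths are illustrative, and the cluster-wide default lives in hdfs-site.xml:

    # Set replication to 2 for a single file
    hdfs dfs -setrep 2 /user/hadoop/sales.csv

    # Set replication for a folder (applies to the files under it),
    # waiting until the change takes effect
    hdfs dfs -setrep -w 2 /user/hadoop/archive

    # Cluster-wide default for new files (hdfs-site.xml)
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>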

Day 6:

Accessing HDFS

  • How files are written to HDFS
  • How files are read from HDFS
  • Rack awareness and the topology script
  • How are blocks replicated? (see the sketch below)
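
A minimal Java sketch of writing and then reading an HDFS file through the FileSystem API, assuming the cluster configuration files are on the classpath (the path and contents are illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/hadoop/hello.txt");

            // Write: the client streams packets through a pipeline of DataNodes
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read: the client asks the NameNode for block locations,
            // then pulls the data directly from the DataNodes
            try (BufferedReader in =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
        }
    }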

Day 7:

How to overcome the Drawbacks in HDFS

  • NameNode failures
  • Secondary NameNode failures
  • DataNode failures
  • Where does HDFS fit, and where doesn't it?
  • Exploring the Apache HDFS Web UI
    How to configure the Hadoop cluster
  • How to add new nodes (commissioning)
  • How to remove existing nodes (decommissioning)
  • How to verify dead nodes
  • How to restart dead nodes

Day 8:

Map Reduce architecture

  • JobTracker
  • Importance of the JobTracker
  • What are the roles of the JobTracker?
  • What are the drawbacks of the JobTracker?
  • TaskTracker
  • Importance of the TaskTracker
  • What are the roles of the TaskTracker?
  • What are the drawbacks of the TaskTracker?
  • MapReduce job execution flow

Day 9:

Data Types in Hadoop

  • What are the data types in MapReduce?
  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat
  • NLineInputFormat
  • Importance of the InputFormat in MapReduce
  • How to use an InputFormat in MapReduce
  • How to write custom InputFormats and their RecordReaders
    Output Formats in MapReduce
  • TextOutputFormat
  • SequenceFileOutputFormat
  • Importance of the OutputFormat in MapReduce
  • How to use an OutputFormat in MapReduce
  • How to write custom OutputFormats and their RecordWriters
    Mapper
  • What is a Mapper in a MapReduce job?

Day 10:

  • Why do we need a Mapper?
  • What are the advantages and disadvantages of the Mapper?
  • Writing Mapper programs
    Reducer
  • What is a Reducer in a MapReduce job?
  • Why do we need a Reducer?
  • What are the advantages and disadvantages of the Reducer?
  • Writing Reducer programs
    Combiner
  • What is a Combiner in a MapReduce job?
  • Why do we need a Combiner?
  • What are the advantages and disadvantages of the Combiner? (see the sketch below)
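
A minimal WordCount sketch tying the Mapper, Reducer, and Combiner together (class names are illustrative; the Reducer doubles as the Combiner because summing is associative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits one (word, 1) pair per token of the input line
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
                }
            }
        }

        // Reducer: sums the counts per word; also usable as a Combiner
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);  // combiner pre-aggregates map output
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }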

Day 11:

Combiner Programs and the Partitioner

  • Writing Combiner programs
  • What is a Partitioner in a MapReduce job?
  • Why do we need a Partitioner?
  • What are the advantages and disadvantages of the Partitioner?
  • Writing Partitioner programs (see the sketch below)
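
A minimal custom Partitioner sketch, assuming the Text/IntWritable key-value types from the WordCount sketch above; the routing policy is purely illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys that start with a digit to reducer 0 and hashes
    // everything else across the remaining reducers.
    public class DigitFirstPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions == 1) return 0;
            String s = key.toString();
            if (!s.isEmpty() && Character.isDigit(s.charAt(0))) return 0;
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }

It would be registered on a job with job.setPartitionerClass(DigitFirstPartitioner.class) together with job.setNumReduceTasks(n).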

Day 12:

Distributed Cache

  • What is the Distributed Cache in a MapReduce job?
  • Importance of the Distributed Cache in a MapReduce job
  • What are the advantages and disadvantages of the Distributed Cache?
  • Writing Distributed Cache programs
    Counters
  • What is a Counter in a MapReduce job?
  • Why do we need Counters in a production environment?
  • How to write Counters in MapReduce programs
  • Importance of the Writable and WritableComparable APIs
  • How to write custom MapReduce keys using WritableComparable
  • How to write custom MapReduce values using Writable (see the sketch below)
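
A minimal sketch of a custom composite key; the class and field names are illustrative. Keys must implement WritableComparable so the framework can serialize and sort them, while values only need Writable. Counters, by contrast, need no new types: a Mapper typically just calls context.getCounter("Quality", "BAD_RECORDS").increment(1).

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class YearMonthKey implements WritableComparable<YearMonthKey> {
        private int year;
        private int month;

        public YearMonthKey() {}  // the framework needs a no-arg constructor
        public YearMonthKey(int year, int month) { this.year = year; this.month = month; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeInt(month);
        }
        @Override public void readFields(DataInput in) throws IOException {
            year = in.readInt();
            month = in.readInt();
        }
        @Override public int compareTo(YearMonthKey o) {  // drives the sort phase
            int c = Integer.compare(year, o.year);
            return c != 0 ? c : Integer.compare(month, o.month);
        }
        @Override public int hashCode() { return 31 * year + month; }  // keeps HashPartitioner stable
        @Override public boolean equals(Object o) {
            return o instanceof YearMonthKey
                && ((YearMonthKey) o).year == year
                && ((YearMonthKey) o).month == month;
        }
    }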

Day 13:

Joins

  • Map-side join
  • What is the importance of a map-side join?
  • Where is it used?
  • Reduce-side join
  • What is the importance of a reduce-side join?
  • Where is it used?
  • What is the difference between a map-side join and a reduce-side join?

Day 14:

Compression techniques

  • Importance of compression techniques in a production environment
  • Compression types
  • NONE, RECORD, and BLOCK
  • Compression codecs
  • Default, Gzip, Bzip2, Snappy, and LZO
  • Enabling and disabling these techniques for all jobs
  • Enabling and disabling these techniques for a particular job (see the sketch below)
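
A minimal per-job sketch, assuming Hadoop 2.x property names; cluster-wide defaults would go in mapred-site.xml instead:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressionSettings {
        public static Job newCompressedJob() throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output (the shuffle) with Snappy
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-output");
            // Compress the final output with Gzip, as BLOCK-compressed sequence files
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job,
                    SequenceFile.CompressionType.BLOCK);
            return job;
        }
    }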

Day 15:

Map Reduce Schedulers

  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler
  • Importance of schedulers in a production environment
  • How to use schedulers in a production environment
    MapReduce Programming Model
  • How to write MapReduce jobs in Java
  • Running MapReduce jobs in local mode
  • Running MapReduce jobs in pseudo-distributed mode
  • Running MapReduce jobs in cluster mode

Day 16:

Debugging Map Reduce Jobs

  • How to debug MapReduce jobs in local mode
  • How to debug MapReduce jobs in remote mode


Day 17:

Hadoop 2.6:

  • Hadoop 2.6 version features
  • Introduction to NameNode Federation
  • Introduction to NameNode High Availability
  • Differences between Hadoop 1.x and Hadoop 2.x versions
  • HDFS changes in Hadoop 2.x
  • MapReduce changes in 2.x

Day 18:

YARN (Next Generation Map Reduce)

  • What is YARN?
  • What is the importance of YARN?
  • Where the concept of YARN is used in real time
  • What is the difference between YARN and classic MapReduce?
    Data Locality
  • What is data locality?
  • Does Hadoop follow data locality?
    Speculative Execution
  • What is speculative execution?
  • Does Hadoop perform speculative execution?

Day 19:

Map Reduce Commands

  • Importance of each command
  • How to execute each command
  • MapReduce admin-related commands, explained
    Configurations
  • Can we change the existing MapReduce configurations?
  • Importance of the configurations
  • The power of Hadoop 2.x

Day 20:

Other Topics

  • Writing unit tests for MapReduce jobs
  • Configuring a Hadoop development environment using Eclipse
  • Secondary sorting and how to solve it using MapReduce
  • How to identify performance bottlenecks in MR jobs, and tuning MR jobs
  • MapReduce Streaming and Pipes, with examples (see the sketch below)
  • Exploring the Apache MapReduce Web UI
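
A classic Hadoop Streaming invocation, as a minimal sketch; the jar path varies by distribution, and /bin/cat and /usr/bin/wc stand in for real mapper and reducer scripts:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input   /user/hadoop/input \
        -output  /user/hadoop/output \
        -mapper  /bin/cat \
        -reducer /usr/bin/wc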

Day 21:

Apache HIVE

  • Hive introduction
    Hive Architecture
  • Driver
  • Compiler
  • Semantic Analyzer
  • Hive integration with Hadoop
  • Hive Query Language (HiveQL) vs. SQL
  • Hive installation and configuration
  • Hive in MapReduce mode and local mode
  • Hive DDL and DML operations (see the sketch below)
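
A minimal HiveQL sketch of the DDL and DML operations above (table, column, and path names are illustrative):

    -- DDL: create a table over tab-delimited text files
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- DML: load data and query it (the query compiles to MapReduce jobs)
    LOAD DATA LOCAL INPATH '/tmp/employees.tsv' INTO TABLE employees;
    SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary DESC;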

Day 22:

Hive Services

  • CLI
  • HiveServer
  • HWI (Hive Web Interface)
    Metastore
  • Embedded metastore configuration
  • External metastore configuration
    UDFs
  • How to write UDFs in Hive
  • How to use UDFs in Hive
  • Importance of UDFs in Hive (see the sketch below)
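
A minimal Hive UDF sketch; the class name is illustrative, and simple UDFs extend org.apache.hadoop.hive.ql.exec.UDF, whose evaluate() method Hive resolves by reflection:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Upper-cases a string column, passing NULLs through
    public class UpperUdf extends UDF {
        public Text evaluate(Text input) {
            if (input == null) return null;
            return new Text(input.toString().toUpperCase());
        }
    }

It would typically be registered with ADD JAR and CREATE TEMPORARY FUNCTION before being called in a query.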

Day 23:

UDAFs

  • How to use UDAFs in Hive
  • Importance of UDAFs in Hive
    UDTFs
  • How to use UDTFs in Hive
  • Importance of UDTFs in Hive
  • How to write complex Hive queries
  • What is the Hive data model?

Day 24:

Partitions

  • Importance of Hive partitions in a production environment
  • Limitations of Hive partitions
  • How to write partitions
    Buckets
  • Importance of Hive buckets in a production environment
  • How to write buckets (see the sketch below)
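
A minimal HiveQL sketch of a partitioned, bucketed table; all names are illustrative:

    CREATE TABLE sales (
      id     INT,
      amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)     -- one HDFS directory per partition
    CLUSTERED BY (id) INTO 8 BUCKETS      -- hash of id decides the bucket file
    STORED AS ORC;

    SET hive.enforce.bucketing = true;    -- older Hive versions need this for bucketed inserts

    -- Load one partition from a staging table
    INSERT INTO TABLE sales PARTITION (sale_date = '2015-01-01')
    SELECT id, amount FROM staging_sales WHERE sale_date = '2015-01-01';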

Day 25:

SerDe

  • Importance of Hive SerDes in a production environment
  • How to write SerDe programs
  • How to integrate Hive and HBase
  • JSON SerDe and Regex SerDe

Day 26:

Apache SQOOP

  • Introduction to Sqoop
  • MySQL client and server installation
  • Sqoop installation
  • How to connect to a relational database using Sqoop
  • Sqoop commands, with examples of the import and export commands (see the sketch below)
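
Minimal Sqoop import and export sketches; the connection string, credentials, and table names are illustrative:

    # Import a MySQL table into HDFS with 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/shop \
      --username shop_user -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      -m 4

    # Export HDFS data back into a MySQL table
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/shop \
      --username shop_user -P \
      --table order_summaries \
      --export-dir /user/hadoop/order_summaries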

Day 27:

Apache FLUME

  • Introduction to Flume
  • Flume installation
  • Integrating Flume with Hadoop
  • Gathering log data from Twitter and pulling it into Flume
  • Flume agent usage and Flume example execution (see the sketch below)
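
A minimal Flume agent configuration sketch; the agent and component names are illustrative, and a Twitter source would additionally need API credentials:

    # a1: netcat source -> memory channel -> HDFS sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    a1.channels.c1.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /user/hadoop/flume/events

    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1

Such an agent is started with flume-ng agent --conf conf --conf-file a1.conf --name a1.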

Day 28:

Apache OOZIE:

  • Introduction to Oozie
  • Oozie installation
  • Executing Oozie workflow jobs
  • Monitoring Oozie workflow jobs

Day 29:

Apache PIG

  • Introduction to Apache Pig
  • MapReduce vs. Apache Pig
  • SQL vs. Apache Pig
  • Different data types in Pig
  • Modes of execution in Pig
  • Local mode
  • MapReduce mode
  • Execution mechanisms
  • Grunt shell
  • Script (see the sketch below)
  • Embedded
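
A minimal Pig Latin sketch covering load, filter, group, and aggregate; file and field names are illustrative:

    -- Load tab-delimited sales records
    sales = LOAD '/user/hadoop/sales.tsv'
            AS (region:chararray, amount:double);

    -- Keep the large sales, then total them per region
    big       = FILTER sales BY amount > 1000.0;
    by_region = GROUP big BY region;
    totals    = FOREACH by_region GENERATE group AS region, SUM(big.amount) AS total;

    DUMP totals;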

Day 30:

UDFs in Pig

  • How to write UDFs in Pig
  • How to use UDFs in Pig
  • Importance of UDFs in Pig
    Filters
  • How to write filters in Pig
  • How to use filters in Pig
  • Importance of filters in Pig

Day 31:

Load Functions

  • How to write load functions in Pig
  • How to use load functions in Pig
  • Importance of load functions in Pig
    Store Functions
  • How to use store functions in Pig
  • Importance of store functions in Pig
  • Transformations in Pig
  • How to write complex Pig scripts
  • How to integrate Pig and HBase

Day 32:

Apache Zookeeper

  • Introduction to ZooKeeper
  • Pseudo-distributed mode installation
  • ZooKeeper cluster installation
  • Basic command execution
  • ZooKeeper architecture
  • How it functions with Hadoop

Day 33:

Apache Hbase

  • HBase introduction
  • HBase use cases
    HBase Basics
  • Column families
  • Scans
    HBase Installation
  • Local mode
  • Pseudo-distributed mode
  • Cluster mode
    HBase Architecture
  • Storage

Day 34:

  • Write-Ahead Log
  • Log-Structured Merge Trees
  • MapReduce integration
  • MapReduce over HBase
    HBase Usage
  • Key design
  • Bloom filters
  • Versioning
  • Coprocessors
  • Filters
    HBase Clients
  • REST
  • Thrift
  • Hive
  • Web-based UI
    HBase Admin
  • Schema definition
  • Basic CRUD operations (see the sketch below)
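
A minimal HBase shell sketch of the basic CRUD operations and scans above; table, row, and column names are illustrative:

    # Create a table with one column family, write and read a cell
    create 'users', 'info'
    put 'users', 'row1', 'info:name', 'Ravi'
    get 'users', 'row1'

    # Scan the whole table, or only the name column
    scan 'users'
    scan 'users', {COLUMNS => ['info:name']}

    # Delete a cell, then drop the table
    delete 'users', 'row1', 'info:name'
    disable 'users'
    drop 'users'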

Hadoop Distributions:

Day 35:

  • Create an instance in AWS
  • Set up a cluster in AWS
  • Real-time practice with Hadoop
  • Introduction to Amazon EMR and Amazon EC2
  • How to use Amazon EMR and Amazon EC2
  • Why use Amazon EMR, and why it matters

Day 36:

  • Introduction to Cloudera
  • Cloudera installation
  • Cloudera certification details
  • How to use Cloudera Hadoop
  • What are the main differences between Cloudera and Apache Hadoop?
  • Cloudera mock test for certification

Day 37:

  • Introduction to the Hortonworks distribution
  • Hortonworks installation
  • Hortonworks certification details
  • How to use Hortonworks Hadoop
  • What are the main differences between Hortonworks and Apache Hadoop?

Day 38:

Hadoop Administration tasks

  • Ganglia and Nagios (monitoring tools)
  • How these monitoring tools work in real time
  • The Hadoop administrator's main responsibilities
  • Troubleshooting

Day 39: Kafka with Storm:

  • Kafka installation
  • How Kafka handles millions of messages
  • The power of Kafka
  • Kafka's role in Storm (see the sketch below)
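
A minimal console sketch of producing and consuming messages; the broker and ZooKeeper addresses and the topic name are illustrative, and script flags vary across Kafka versions:

    # Create a topic, then produce and consume a few messages
    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 1 --partitions 1 --topic clicks

    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic clicks

    bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
        --topic clicks --from-beginning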

Day 40: Future of Hadoop:

  • What comes after Hadoop?
  • Introduction to Apache Spark
  • Installation of Spark


Pre-Requisites for this Course

  • Java basics: OOP concepts, interfaces, classes, abstract classes, etc.
  • Basic SQL knowledge
  • Basic Linux commands

Classes are not held on Saturdays and Sundays, but online assistance is available to help implement POCs and resolve problems.