Day 1:
Big Data
- What is Big Data?
- Why all industries are talking about Big Data?
- What are the issues in Big Data?
- Storage & Processing
- What are the challenges for storing big data?
- What are the challenges for processing big data? What are the technologies support big data?
- Hadoop – Bigdata
- Traditional Databases Vs NOSQL
- Most popular BigData ecosystems (Spark, cassandhra, flink )
Day 2:
Hadoop
- Installation of apache Hadoop
- What is Hadoop?
- History of Hadoop
- Why Hadoop?
- Hadoop Use cases
- Advantages and disadvantages of Hadoop
- Importance of Different Ecosystems of Hadoop
- Importance of Integration with other BigData solutions Big Data Real time Use Cases
- Apache Hadoop installation in Local mode (hands on installation on ur laptop)
- Psuedo mode (hands on installation on ur laptop)
- Cluster mode ( 5 node cluster setup in aws Account)
Day 3:
HDFS Commands
- Importance of each command
- How to execute the command
- Hdfs admin related commands explanation
Configurations - Can we change the existing configurations of hdfs or not?
- CLI(Command Line Interface) using hdfs commands
Day 4:
HDFS Architecture
- Name Node
- Importance of Name Node
- What are the roles of Name Node
- What are the drawbacks in Name Node
- Secondary Name Node
- Importance of Secondary Name Node
- What are the roles of Secondary Name Node
- What are the drawbacks in Secondary Name Node
- Data Node
- Importance of Data Node
- What are the roles of Data Node
- What are the drawbacks in Data Node
Day 5:
Data Storage in HDFS
- Traditional OS Block information.
- How blocks are storing in DataNodes
- How replication works in Data Nodes
- HDFS Block size
- Importance of HDFS Block size
- Why Block size is so large?
- How it is related to MapReduce split size
- Importance of HDFS Replication factor in production environment
- Can we change the replication for a particular file or folder
- Can we change the replication for all files or folders
Accessing HDFS
Day 6:
- How to write the files in HDFS
- How to read the files in HDFS
- Rack Awareness, Topology Script
- How block replicated?
Day 7:
How to overcome the Drawbacks in HDFS
- Name Node failures
- Secondary Name Node failures
- Data Node failures
Where does it fit and Where doesn’t fit? Exploring the Apache HDFS Web UI
How to configure the Hadoop Cluster - How to add the new nodes ( Commissioning )
- How to remove the existing nodes ( De-Commissioning )
- How to verify the Dead Nodes
- How to start the Dead Nodes
Day 8:
Map Reduce architecture
- JobTracker
- Importance of JobTracker
- What are the roles of JobTracker
- What are the drawbacks in JobTracker
- TaskTracker
- Importance of TaskTracker
- What are the roles of TaskTracker
- What are the drawbacks in TaskTracker
- Map Reduce Job execution flow
Day 9:
Data Types in Hadoop
- What are the Data types in Map Reduce
- Text Input Format
- Key Value Text Input Format
- Sequence File Input Format
- Nline Input Format
- Importance of Input Format in Map Reduce
- How to use Input Format in Map Reduce
- How to write custom Input Format’s and its Record Readers Output Format’s in Map Reduce
- Text Output Format
- Sequence File Output Format
- Importance of Output Format in Map Reduce
- How to use Output Format in Map Reduce
- How to write custom Output Format’s and its Record Writers Mapper
- What is mapper in Map Reduce Job
Day 10:
- Why we need mapper?
- What are the Advantages and Disadvantages of mapper
- Writing mapper programs Reducer
- What is reducer in Map Reduce Job
- Why we need reducer ?
- What are the Advantages and Disadvantages of reducer
- Writing reducer programs Combiner
- What is combiner in Map Reduce Job
- Why we need combiner?
- What are the Advantages and Disadvantages of Combiner
Day 11:
Writing Combiner programs Partitioner
- What is Partitioner in Map Reduce Job
- Why we need Partitioner?
- What are the Advantages and Disadvantages of Partitioner
- Writing Partitioner programs Distributed Cache
Day 12:
- What is Distributed Cache in Map Reduce Job
- Importance of Distributed Cache in Map Reduce job
- What are the Advantages and Disadvantages of Distributed Cache
- Writing Distributed Cache programs Counters
- What is Counter in Map Reduce Job
- Why we need Counters in production environment?
- How to Write Counters in Map Reduce programs
- Importance of Writable and Writable Comparable Api’s
- How to write custom Map Reduce Keys using Writable
- How to write custom Map Reduce Values using Writable Comparable
Day 13:
Joins
- Map Side Join
- What is the importance of Map Side Join
- Where we are using it
- Reduce Side Join
- What is the importance of Reduce Side Join
- Where we are using it
- What is the difference between Map Side join and Reduce Side Join?
Day 14
Compression techniques
- Importance of Compression techniques in production environment
- Compression Types
- NONE, RECORD and BLOCK
- Compression Codecs
- Default, Gzip, Bzip, Snappy and LZO
- Enabling and Disabling these techniques for all the Jobs
- Enabling and Disabling these techniques for a particular Job
Day 15:
Map Reduce Schedulers
- FIFO Scheduler
- Capacity Scheduler
- Fair Scheduler
- Importance of Schedulers in production environment
- How to use Schedulers in production environment
- Map Reduce Programming Model
- How to write the Map Reduce jobs in Java
- Running the Map Reduce jobs in local mode
- Running the Map Reduce jobs in pseudo mode
- Running the Map Reduce jobs in cluster mode
Day 16:
Debugging Map Reduce Jobs
- How to debug Map Reduce Jobs in Local Mode.
- How to debug Map Reduce Jobs in Remote Mode.
Day 17:
Hadoop 2.6:
- Hadoop 2.6 version features
- Introduction to Namenode fedoration
- Introduction to Namenode High Availabilty
- Difference between Hadoop 1.x.x and Hadoop 2.x.x versions
- HDFS changes in Hadoop 2.x
- Mapreduce Changes in 2.x
Day 18:
YARN (Next Generation Map Reduce)
- What is YARN?
- What is the importance of YARN?
- Where we can use the concept of YARN in Real Time
- What is difference between YARN and Map Reduce
- Data Locality
- What is Data Locality?
- Will Hadoop follows Data Locality?
- Speculative Execution
- What is Speculative Execution?
- Will Hadoop follows Speculative Execution?
Day19:
Map Reduce Commands
- Importance of each command
- How to execute the command
- Mapreduce admin related commands explanation
Configurations - Can we change the existing configurations of mapreduce or not?
- Importance of configurations
- Power of Hadoop 2.x
Day 20:
Other Topics
- Writing Unit Tests for Map Reduce Jobs
- Configuring hadoop development environment using Eclipse
- Use of Secondary Sorting and how to solve using MapReduce
- How to Identify Performance Bottlenecks in MR jobs and tuning MR jobs. Map Reduce Streaming and Pipes with examples
- Exploring the Apache MapReduce Web UI
Day 21:
Apache HIVE
- Hive Introduction Hive architecture
- Driver
- Compiler
- Semantic Analyzer
- Hive Integration with Hadoop
- Hive Query Language(Hive QL) VS SQL
- Hive Installation and Configuration Hive
- Map-Reduce and Local-Mode Hive DLL and DML Operations
Day 22:
Hive Services
- CLI
- Hiveserver
- Hwi Metastore
- embedded metastore configuration
- external metastore configuration UDF’s
- How to write the UDF’s in Hive
- How to use the UDF’s in Hive
- Importance of UDF’s in Hive
Day 23:
UDAF’s
- How to use the UDAF’s in Hive
- Importance of UDAF’s in Hive UDTF’s
- How to use the UDTF’s in Hive
- Importance of UDTF’s in Hive How to write a complex Hive queries What is Hive Data Model?
Day 24:
Partitions
- Importance of Hive Partitions in production environment
- Limitations of Hive Partitions
- How to write Partitions
Buckets - Importance of Hive Buckets in production environment
- How to write Buckets
Day 25:
SerDe
- Importance of Hive SerDe’s in production environment
- How to write SerDe programs
- How to integrate the Hive and Hbase
- Json Serde and Regex serde
Day 26:
Apache SQOOP
- Introduction to Sqoop
- MySQL client and Server Installation
- Sqoop Installation
- How to connect to Relational Database using Sqoop
- Sqoop Commands and Examples on Import and Export commands
Day 27:
Apache FLUME
- Introduction to flume Flume installation
- Integrate flume with Hadoop
- Gather log data from twitter and pull into Flume
- Flume agent usage and Flume examples execution
Day 28:
OOZIE:
- Apache OOZIE Introduction to oozie Oozie installation
- Executing oozie workflow jobs
- Monitering Oozie workflow jobs
Day 29
Apache PIG
- Introduction to Apache Pig
- Map Reduce Vs Apache Pig
- SQL Vs Apache Pig Different data types in Pig
- Modes Of Execution in Pig
- Local Mode
- Map Reduce Mode
- Execution Mechanism
- Grunt Shell
- Script
- Embedded
Day 30:
UDF’s in Pig
- How to write the UDF’s in Pig
- How to use the UDF’s in Pig
- Importance of UDF’s in Pig
Filter’s - How to write the Filter’s in Pig
- How to use the Filter’s in Pig
- Importance of Filter’s in Pig
Load Functions
Day 31:
- How to write the Load Functions in Pig
- How to use the Load Functions in Pig
- Importance of Load Functions in Pig
Store Functions - How to use the Store Functions in Pig
- Importance of Store Functions in Pig
- Transformations in Pig
- How to write the complex pig scripts
- How to integrate the Pig and Hbase
Day 32:
Apache Zookeeper
- Introduction to zookeeper
- Pseudo mode installations
- Zookeeper cluster installations
- Basic commands execution
- Zookeeper Architecture
- How it’s functioning with Hadoop
Day 33:
Apache Hbase
- Hbase introduction
- Hbase usecases Hbase basics
- Column families
- Scans
- Hbase installation
- Local mode
- Psuedo mode
- Cluster mode
- Hbase Architecture
- Storage
Day 34:
- WriteAhead Log
- Log Structured MergeTrees
- Mapreduce integration
- Mapreduce over Hbase
- Hbase Usage
- Key design
- Bloom Filters
- Versioning
- Coprocessors
- Filters
- Hbase Clients
- REST
- Thrift
- Hive
- Web Based UI
Hbase Admin - Schema definition
- Basic CRUD operations
Hadoop Distributions:
Day 35:
- Create an instance in AWS
- Setup a cluster in Aws
- Realtime practice Hadoop.
- Introduction to Amazon EMR and Amazon Ec2
- How to use Amazon EMR and Amazon Ec2
- Why to use Amazon EMR and Importance of this
Day 36:
- Introduction to Cloudera
- Cloudera Installation
- Cloudera Certification details How to use cloudera hadoop
- What are the main differences between Cloudera and Apache Hadoop
- Cloudera mock test for Certification
Day 37:
- Hortonworks Distribution Introduction to Hortonworks
- Hortonworks Installation Hortonworks Certification details
- How to use Hortonworks hadoop
- What are the main differences between Hortonworks and Apache Hadoop
Day 38:
Hadoop Administration tasks
- Ganglia, Nagios (Monitoring tools)
- How those monitoring tools work in real time?
- Hadoop Administrator main functionalities.
- Trouble shouting
Day 39: Kafka with Strom:
- Kafka installation.
- How kafka sends million messages.
- Power of Kafka.
- Importance in Strom.
Day 40: Future of Hadoop:
- What next after Hadoop?
- Apache Spark introduction
- Installation of Spark.
Pre-Requisites for this Course
- Java Basics like OOPS Concepts, Interfaces, Classes and Abstract
Classes etc - SQL Basic Knowledge
- Linux Basic Commands
Every Saturday and Sunday holiday, but assists through online to implement POCs and rectify the problems.