
Developer Training for Apache Spark and Hadoop

With enterprises exploring the potential of connectivity and cyberspace for organizational success, it is evident that Big Data is here to stay. Yet despite market research forecasting significant growth in big data and data analytics, industry reports cite a shortage of as much as 80% in Big Data-skilled professionals. IIHT’s Developer Training for Apache Spark and Hadoop is designed to address this gap and delivers the key concepts and expertise required to develop high-performing applications with Apache Spark 2. The course is reinforced by assignments and projects that validate participants’ expertise in writing and configuring Apache Spark applications, distributed processing on an Apache Hadoop cluster, and distributed data persistence.

  • Use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from different sources
  • Write applications that use Spark to perform ETL processing and iterative algorithms
  • Work with large datasets stored in a distributed file system, and execute Spark applications



This course is designed for developers and engineers who have programming experience. A good foundation in Hadoop or Spark is helpful but not required. The prerequisites for the course are:

  • The ability to program with Python or Scala
  • Familiarity with the Linux command line
  • Basic knowledge of SQL

Course Content

  • Apache Hadoop Overview
  • Data Ingestion and Storage
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Hands-On Exercises
  • Apache Hadoop Cluster Components
  • HDFS Architecture
  • Using HDFS
  • YARN Architecture
  • Working With YARN
  • What is Apache Spark?
  • Starting the Spark Shell
  • Using the Spark Shell
  • Getting Started with Datasets and DataFrames
  • DataFrame Operations
  • Creating DataFrames from Data Sources
  • Saving DataFrames to Data Sources
  • DataFrame Schemas
  • Eager and Lazy Execution
  • Querying DataFrames Using Column Expressions
  • Grouping and Aggregation Queries
  • Joining DataFrames
  • RDD Overview
  • RDD Data Sources
  • Creating and Saving RDDs
  • RDD Operations
  • Writing and Passing Transformation Functions
  • Transformation Execution
  • Converting Between RDDs and DataFrames
  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations
  • Querying Tables in Spark Using SQL
  • Querying Files and Views
  • The Catalog API
  • Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
  • Datasets and DataFrames
  • Creating Datasets
  • Loading and Saving Datasets
  • Dataset Operations
  • Writing a Spark Application
  • Building and Running an Application
  • Application Deployment Mode
  • The Spark Application Web UI
  • Configuring Application Properties
  • Review: Apache Spark on a Cluster
  • RDD Partitions
  • Example: Partitioning in Queries
  • Stages and Tasks
  • Job Execution Planning
  • Example: Catalyst Execution Plan
  • Example: RDD Execution Plan
  • DataFrame and Dataset Persistence
  • Persistence Storage Levels
  • Viewing Persisted RDDs
  • Common Apache Spark Use Cases
  • Iterative Algorithms in Apache Spark
  • Machine Learning
  • Example: k-means
  • Apache Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Streaming Applications
  • Multi-Batch Operations
  • Time Slicing
  • State Operations
  • Sliding Window Operations
  • Preview: Structured Streaming
  • Streaming Data Source Overview
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source