08 November 2020

#Apache_Spark

Key Concepts


| Topic | Sub-Topics | Basic | Intermediate | Advanced | Expert |
|---|---|---|---|---|---|
| Introduction | What is Spark, Features, Spark vs Hadoop, Use cases | | | | |
| Architecture | Spark Components, Driver, Executor, Cluster Manager, DAG, Jobs, Stages, Tasks | | | | |
| RDDs | Resilient Distributed Datasets, Transformations, Actions, Caching, Persistence | | | | |
| DataFrames & Datasets | Creation, Schema, Operations, Optimizations, Catalyst Engine | | | | |
| Spark SQL | SQL Queries, Data Sources, Temporary Views, Performance Tuning | | | | |
| Spark Streaming | DStreams, Structured Streaming, Window Operations, Checkpointing | | | | |
| Spark MLlib | Machine Learning APIs, Pipelines, Models, Feature Engineering | | | | |
| Spark GraphX | Graphs, Pregel API, Graph Algorithms | | | | |
| Spark Core APIs | RDD API, Transformations, Actions, Accumulators, Broadcast Variables | | | | |
| Performance Tuning | Partitioning, Caching, Shuffling, Join optimizations, Resource tuning | | | | |
| Cluster Management | Standalone, YARN, Mesos, Kubernetes, Resource Allocation | | | | |
| Debugging & Monitoring | Spark UI, Logs, Event Timeline, Metrics, Executors monitoring | | | | |
| Fault Tolerance | Lineage, Task Retry, Checkpointing, Speculative Execution | | | | |
| Advanced Features | Custom Partitioner, User-defined functions, Structured Streaming triggers | | | | |
| Integration | Hive, HDFS, Kafka, Cassandra, Parquet, ORC, JDBC | | | | |

Interview questions

1. Introduction & Basics

  1. What is Apache Spark and what are its main features?
  2. How does Spark differ from Hadoop MapReduce?
  3. What are the core components of Spark?
  4. What is SparkContext?
  5. What is the role of a Driver in Spark?
  6. What is an Executor in Spark?
  7. What is a DAG in Spark?
  8. What are jobs, stages, and tasks in Spark?
  9. What are the key use cases of Spark?
  10. What is the difference between batch processing and stream processing in Spark?

2. Architecture

  1. Explain the Spark architecture.
  2. What are the main cluster managers supported by Spark?
  3. How does Spark schedule tasks?
  4. How does Spark handle fault tolerance?
  5. What is the role of the DAG scheduler?
  6. How are RDDs distributed across the cluster?
  7. How does Spark handle data locality?
  8. What is the role of the Task Scheduler?
  9. How does Spark communicate between Driver and Executors?
  10. What is the Spark UI and how is it used?
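
Sketch for the DAG scheduler question above: a plain-Python model (not Spark code) of how a linear lineage could be cut into stages at shuffle boundaries. The `split_into_stages` helper and the `word_count` lineage are hypothetical, and the model is simplified: real Spark builds stages from a dependency graph and also handles details such as map-side combining.

```python
from typing import List, Tuple

# Each step is (operation name, is_wide). The names mirror common RDD
# operations, but this is hypothetical sample lineage, not Spark code.
Lineage = List[Tuple[str, bool]]


def split_into_stages(lineage: Lineage) -> List[List[str]]:
    """Group a linear lineage into stages, cutting before each wide step."""
    stages: List[List[str]] = []
    current: List[str] = []
    for name, is_wide in lineage:
        # A wide (shuffle) dependency closes the current stage: its input
        # must be shuffled before the next stage can start.
        if is_wide and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


word_count: Lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),  # shuffle boundary
    ("filter", False),
    ("sortByKey", True),    # shuffle boundary
]

stages = split_into_stages(word_count)
for i, ops in enumerate(stages):
    print(f"stage {i}: {' -> '.join(ops)}")
```

In this model the narrow steps are pipelined inside one stage, and each stage would run as one task per partition.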

3. RDDs

  1. What is an RDD in Spark?
  2. How do you create RDDs?
  3. What are the main transformations in RDDs?
  4. What are the main actions in RDDs?
  5. How does Spark achieve fault tolerance in RDDs?
  6. What is the difference between narrow and wide transformations?
  7. How do caching and persistence work in RDDs?
  8. What is a lineage graph?
  9. What is the difference between map() and flatMap()?
  10. How do you perform joins on RDDs?
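
Sketch for the map() vs flatMap() question above: a plain-Python illustration of the semantic difference, using ordinary lists rather than RDDs. The `lines` data and `split_words` helper are hypothetical; in PySpark the corresponding calls would be `rdd.map(f)` and `rdd.flatMap(f)`.

```python
from itertools import chain

lines = ["hello spark", "hello world"]  # hypothetical sample data


def split_words(line):
    return line.split(" ")


# map(): exactly one output element per input element (here, one list per line).
mapped = [split_words(line) for line in lines]

# flatMap(): each input yields zero or more elements, flattened into one sequence.
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(mapped)
print(flat_mapped)
```

The mapped result keeps one nested list per input line, while the flat-mapped result is a single flat list of words.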

4. DataFrames & Datasets

  1. What is a DataFrame in Spark?
  2. How do DataFrames differ from RDDs?
  3. How do you create a DataFrame?
  4. What are Datasets in Spark?
  5. How do Datasets differ from DataFrames?
  6. How does Spark infer schema automatically?
  7. How do you perform filtering and aggregations on DataFrames?
  8. What is the Catalyst optimizer?
  9. How do you register a temporary view for SQL queries?
  10. How do you handle missing data in DataFrames?

5. Spark SQL

  1. How do you execute SQL queries in Spark?
  2. How do you connect Spark to Hive?
  3. What are the different data sources supported by Spark SQL?
  4. How do you create external and managed tables?
  5. What is partitioning in Spark SQL?
  6. How do you optimize joins in Spark SQL?
  7. How do you cache tables in Spark SQL?
  8. How does Spark SQL handle schema evolution?
  9. How do you use UDFs in Spark SQL?
  10. How do you monitor query performance in Spark SQL?

6. Spark Streaming

  1. What is Spark Streaming?
  2. What are DStreams?
  3. How does Structured Streaming differ from DStreams?
  4. What are window operations in Spark Streaming?
  5. What is checkpointing and why is it used?
  6. How do you handle late data in streaming?
  7. What are triggers in Structured Streaming?
  8. How do you integrate Kafka with Spark Streaming?
  9. How do you monitor streaming jobs?
  10. How do you ensure exactly-once semantics in streaming?
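
Sketch for the window operations question above: a plain-Python model of a tumbling (non-overlapping) event-time window count. The window length, the `events` data, and the `window_start` helper are hypothetical. In Structured Streaming the grouping would typically be expressed with the `window()` function, with `withWatermark()` bounding how late data may arrive; this sketch leaves watermarking out.

```python
from collections import Counter

WINDOW_SECONDS = 10  # hypothetical window length

# (event_time_in_seconds, key) pairs; hypothetical events, deliberately out of order.
events = [(1, "a"), (4, "b"), (12, "a"), (9, "a"), (15, "a"), (21, "b")]


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % size)


# Count events per (window start, key). Grouping uses the event time,
# so arrival order does not change which window an event belongs to.
counts = Counter()
for event_time, key in events:
    counts[(window_start(event_time), key)] += 1

for (start, key), n in sorted(counts.items()):
    print(f"[{start}, {start + WINDOW_SECONDS}) {key}: {n}")
```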

7. Spark MLlib

  1. What is MLlib?
  2. What are the main features of MLlib?
  3. How do you create a machine learning pipeline in Spark?
  4. How do you handle feature engineering in Spark MLlib?
  5. What are transformers and estimators?
  6. How do you perform model evaluation?
  7. How do you handle classification tasks in Spark MLlib?
  8. How do you handle regression tasks?
  9. How do you save and load models?
  10. How do you tune hyperparameters in Spark MLlib?

8. Spark GraphX

  1. What is GraphX?
  2. How do you represent graphs in Spark?
  3. What are vertices and edges?
  4. What is the Pregel API?
  5. How do you compute PageRank in GraphX?
  6. How do you find connected components?
  7. How do you implement graph algorithms using GraphX?
  8. How do you persist graph data?
  9. How do you visualize graphs from Spark?
  10. What are practical use cases of GraphX?
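
Sketch for the PageRank question above: a plain-Python version of the textbook, normalized PageRank iteration on a tiny graph. This is not the GraphX API, and GraphX's own implementation differs in details such as normalization. The `graph` data, the `pagerank` function, and its parameters are hypothetical, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Iterate rank updates on an adjacency dict {vertex: [out-neighbours]}.

    Assumes every vertex has at least one outgoing edge and that every
    neighbour also appears as a key of the dict.
    """
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Every vertex starts each round with the "random jump" share.
        new_ranks = {v: (1.0 - damping) / n for v in graph}
        for src, dsts in graph.items():
            # A vertex splits its current rank evenly across its out-edges.
            share = ranks[src] / len(dsts)
            for dst in dsts:
                new_ranks[dst] += damping * share
        ranks = new_ranks
    return ranks


# Hypothetical three-vertex graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for vertex, rank in sorted(ranks.items()):
    print(vertex, round(rank, 4))
```

Each round sends rank along edges and aggregates it at the destination, which is the same message-passing shape that the Pregel API generalizes.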

9. Performance Tuning

  1. How do you optimize partitioning in Spark?
  2. How do you reduce shuffle operations?
  3. How do you cache and persist data for performance?
  4. How do you tune memory and executor configurations?
  5. What are broadcast variables and how are they used?
  6. How do accumulators work in Spark?
  7. How do you optimize joins in Spark?
  8. How do you handle skewed data?
  9. How do you monitor and profile Spark jobs?
  10. How do you use Tungsten optimization?
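
Sketch for the skewed data question above: a plain-Python model of key salting with two-phase aggregation, one common way to spread a hot key across several partitions. This is not Spark code; `NUM_SALTS`, the `records` data, and the fixed random seed are hypothetical choices for illustration.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets per key

# Hypothetical skewed input: one hot key dominates the data.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Phase 1: aggregate by (key, salt). The hot key is now spread over up to
# NUM_SALTS groups, so no single group has to process all of its records.
partial = defaultdict(int)
for key, value in records:
    salt = rng.randrange(NUM_SALTS)
    partial[(key, salt)] += value

# Phase 2: drop the salt and combine the much smaller partial results.
totals = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    totals[key] += subtotal

print(dict(totals))
```

The trade-off is an extra aggregation step in exchange for more even work in the first phase.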

10. Cluster Management

  1. What are the different deployment modes in Spark?
  2. How do you run Spark in Standalone mode?
  3. How do you run Spark on YARN?
  4. How do you run Spark on Mesos?
  5. How do you run Spark on Kubernetes?
  6. How do you configure Spark executors and cores?
  7. How do you handle dynamic allocation?
  8. How do you manage resources in a multi-tenant cluster?
  9. How do you submit a Spark job?
  10. How do you handle failures in Spark clusters?

Related Topics