| Introduction | What is Spark, Features, Spark vs Hadoop, Use cases | ✅ | ✅ | | |
| Architecture | Spark Components, Driver, Executor, Cluster Manager, DAG, Jobs, Stages, Tasks | ✅ | ✅ | ✅ | ✅ |
| RDDs | Resilient Distributed Datasets, Transformations, Actions, Caching, Persistence | ✅ | ✅ | ✅ | ✅ |
| DataFrames & Datasets | Creation, Schema, Operations, Optimizations, Catalyst Engine | ✅ | ✅ | ✅ | ✅ |
| Spark SQL | SQL Queries, Data Sources, Temporary Views, Performance Tuning | ✅ | ✅ | ✅ | ✅ |
| Spark Streaming | DStreams, Structured Streaming, Window Operations, Checkpointing | ✅ | ✅ | ✅ | ✅ |
| Spark MLlib | Machine Learning APIs, Pipelines, Models, Feature Engineering | ✅ | ✅ | ✅ | ✅ |
| Spark GraphX | Graphs, Pregel API, Graph Algorithms | | ✅ | ✅ | ✅ |
| Spark Core APIs | RDD API, Transformations, Actions, Accumulators, Broadcast Variables | ✅ | ✅ | ✅ | ✅ |
| Performance Tuning | Partitioning, Caching, Shuffling, Join optimizations, Resource tuning | | ✅ | ✅ | ✅ |
| Cluster Management | Standalone, YARN, Mesos, Kubernetes, Resource Allocation | ✅ | ✅ | ✅ | ✅ |
| Debugging & Monitoring | Spark UI, Logs, Event Timeline, Metrics, Executor monitoring | ✅ | ✅ | ✅ | ✅ |
| Fault Tolerance | Lineage, Task Retry, Checkpointing, Speculative Execution | ✅ | ✅ | ✅ | ✅ |
| Advanced Features | Custom Partitioner, User-defined functions, Structured Streaming triggers | | ✅ | ✅ | ✅ |
| Integration | Hive, HDFS, Kafka, Cassandra, Parquet, ORC, JDBC | ✅ | ✅ | ✅ | ✅ |