Introduction | What is Spark, Features, Spark vs Hadoop, Use cases | ✅ | ✅ | | |
Architecture | Spark Components, Driver, Executor, Cluster Manager, DAG, Jobs, Stages, Tasks | ✅ | ✅ | ✅ | ✅ |
RDDs | Resilient Distributed Datasets, Transformations, Actions, Caching, Persistence | ✅ | ✅ | ✅ | ✅ |
DataFrames & Datasets | Creation, Schema, Operations, Optimizations, Catalyst Engine | ✅ | ✅ | ✅ | ✅ |
Spark SQL | SQL Queries, Data Sources, Temporary Views, Performance Tuning | ✅ | ✅ | ✅ | ✅ |
Spark Streaming | DStreams, Structured Streaming, Window Operations, Checkpointing | ✅ | ✅ | ✅ | ✅ |
Spark MLlib | Machine Learning APIs, Pipelines, Models, Feature Engineering | ✅ | ✅ | ✅ | ✅ |
Spark GraphX | Graphs, Pregel API, Graph Algorithms | | ✅ | ✅ | ✅ |
Spark Core APIs | RDD API, Transformations, Actions, Accumulators, Broadcast Variables | ✅ | ✅ | ✅ | ✅ |
Performance Tuning | Partitioning, Caching, Shuffling, Join optimizations, Resource tuning | | ✅ | ✅ | ✅ |
Cluster Management | Standalone, YARN, Mesos, Kubernetes, Resource Allocation | ✅ | ✅ | ✅ | ✅ |
Debugging & Monitoring | Spark UI, Logs, Event Timeline, Metrics, Executor monitoring | ✅ | ✅ | ✅ | ✅ |
Fault Tolerance | Lineage, Task Retry, Checkpointing, Speculative Execution | ✅ | ✅ | ✅ | ✅ |
Advanced Features | Custom Partitioner, User-defined functions, Structured Streaming triggers | | ✅ | ✅ | ✅ |
Integration | Hive, HDFS, Kafka, Cassandra, Parquet, ORC, JDBC | ✅ | ✅ | ✅ | ✅ |