| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | PySpark | What is PySpark, Spark ecosystem, PySpark vs Pandas, Use cases, Installation & setup |
| 2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs/Stages/Tasks, Execution flow |
| 3 | SparkSession & Context | SparkSession, SparkContext, Configurations, Application lifecycle, Best practices |
| 4 | RDD Fundamentals | RDD creation, Transformations, Actions, Persistence, RDD vs DataFrame |
| 5 | RDD Advanced Operations | Narrow vs wide ops, Shuffle, Accumulators, Broadcast variables, Performance tuning |
| 6 | DataFrame Introduction | DataFrame API, Creating DataFrames, Schema inference, show/select, DataFrame vs RDD |
| 7 | DataFrame Transformations | select, filter, withColumn, drop, cast & rename |
| 8 | Data Sources & Formats | CSV, JSON, Parquet, ORC, Avro |
| 9 | Schema Management | StructType, StructField, Explicit schema, Schema evolution, Corrupt records |
| 10 | Built-in Functions | String functions, Date functions, Math functions, Conditional logic, Null handling |
| 11 | Joins in PySpark | Inner join, Left/Right join, Full join, Broadcast join, Join optimization |
| 12 | Aggregations | groupBy, agg, count/sum/avg, rollup, cube |
| 13 | Window Functions | Window spec, row_number, rank/dense_rank, lead/lag, Running totals |
| 14 | Sorting & Partitioning | orderBy, sortWithinPartitions, repartition, coalesce, Data skew basics |
| 15 | Spark SQL | Temp views, Global views, SQL queries, CTEs, SQL vs DataFrame API |
| 16 | User Defined Functions | Python UDF, Pandas UDF, Serialization cost, When to avoid UDF, Alternatives |
| 17 | Performance Optimization | Caching, Persist levels, Broadcast joins, File sizing, Best practices |
| 18 | Partition & File Optimization | Partition pruning, Bucketing, Small file problem, Compression, Skew handling |
| 19 | PySpark with Hive | Hive metastore, Managed tables, External tables, Partitioned tables, Hive SQL |
| 20 | Structured Streaming Basics | Streaming concepts, Micro-batching, Sources, Sinks, Checkpointing |
| 21 | Streaming Operations | Triggers, Output modes, Watermarking, Late data, Fault tolerance |
| 22 | Streaming Aggregations | Windowed aggregation, Stateful ops, Stream joins, Exactly-once semantics, Recovery |
| 23 | MLlib Overview | Transformers, Estimators, Pipelines, Evaluators, Model lifecycle |
| 24 | Feature Engineering | StringIndexer, OneHotEncoder, VectorAssembler, Scaling, Feature selection |
| 25 | ML Algorithms | Regression, Classification, Clustering, Recommendation, Metrics |
| 26 | Hyperparameter Tuning | CrossValidator, Train-validation split, ParamGrid, Model selection, Optimization |
| 27 | PySpark with Delta Lake | Delta tables, ACID transactions, Time travel, MERGE, Optimize & Vacuum |
| 28 | Debugging & Monitoring | Spark UI, Logs, Common errors, Debug strategies, Job analysis |
| 29 | Job Scheduling & Deployment | spark-submit, Config tuning, Scheduling, Parameterization, Automation |
| 30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Optimization patterns, Interview prep |