| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Transformations & Actions | Definition, Lazy Evaluation, DAG concept, Execution flow, Why separation matters |
| 2 | Narrow vs Wide Transformations | Definition, Examples, Shuffle impact, Performance difference, Use cases |
| 3 | map() | Syntax, One-to-one mapping, Use cases, Performance, Examples |
| 4 | flatMap() | One-to-many mapping, Differences from map, Use cases, Examples, Performance |
| 5 | filter() | Predicate logic, Data reduction, Optimization tips, Examples, Use cases |
| 6 | select() / withColumn() | Column selection, Column creation, Expressions, Performance tips, Examples |
| 7 | union() & distinct() | Combining datasets, Removing duplicates, Shuffle behavior, Use cases, Examples |
| 8 | groupBy() | Grouping logic, Aggregation basics, Shuffle impact, Examples, Best practices |
| 9 | reduceByKey() | Key-based reduction, Map-side aggregation, Performance benefits, Examples, Comparison |
| 10 | groupByKey() | Working principle, Memory impact, Comparison with reduceByKey, Examples, When to avoid |
| 11 | sortBy() & orderBy() | Sorting logic, Asc/Desc order, Shuffle cost, Examples, Optimization tips |
| 12 | join() Basics | Inner join, Join condition, Execution flow, Examples, Common issues |
| 13 | Advanced Join Types | Left, Right, Full, Semi, Anti joins, Use cases, Examples |
| 14 | Broadcast Join | Concept, When to use, Memory impact, SQL hint, Examples |
| 15 | repartition() & coalesce() | Partition control, Shuffle behavior, Performance impact, Use cases, Examples |
| 16 | cache() & persist() | Storage levels, Memory vs disk, When to cache, Examples, Pitfalls |
| 17 | count() | Action trigger, Job creation, Performance considerations, Examples, Use cases |
| 18 | collect() | Driver memory risk, Small data usage, Examples, Best practices, Alternatives |
| 19 | show() & take() | Preview data, Execution behavior, Limit handling, Examples, Usage tips |
| 20 | save() & write() | Output formats, File systems, Partition output, Modes, Examples |
| 21 | foreach() & foreachPartition() | Side effects, External systems, Performance difference, Examples, Best practices |
| 22 | Window Functions | Over clause, Partition by, Order by, Use cases, Examples |
| 23 | Actions vs Transformations | Comparison, Execution timing, DAG role, Interview questions, Examples |
| 24 | Shuffle Internals | When shuffle occurs, Cost factors, Optimization, Examples, Debugging |
| 25 | Performance Optimization | Avoid wide ops, Partition sizing, Caching strategy, Examples, Tips |
| 26 | Error Handling | Bad records, Null handling, Try-catch logic, Data validation, Examples |
| 27 | Spark UI Analysis | Jobs tab, Stages tab, Task metrics, Shuffle read/write, Debugging |
| 28 | Real-world ETL Flow | Transform chain design, Action placement, Optimization, Examples, Best practices |
| 29 | Interview Scenarios | Common questions, Tricky cases, Performance questions, Sample answers, Tips |
| 30 | Hands-on Mini Project | End-to-end pipeline, Transformations usage, Actions usage, Optimization, Review |