| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Apache Spark & DataFrames | Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases |
| 2 | Spark Setup & Environment | Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics |
| 3 | SparkSession & Entry Points | SparkSession creation, SQLContext (legacy), HiveContext (legacy), Config options, Best practices |
| 4 | Creating DataFrames | From files, From RDD, From collections, Schema inference, Explicit schema |
| 5 | DataFrame Schema & Data Types | StructType, StructField, Primitive types, Complex types, Schema evolution |
| 6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics |
| 7 | Writing DataFrames | Save modes, Partitioning, Bucketing, File formats, Compression |
| 8 | DataFrame Basic Operations | select, withColumn, drop, filter, where |
| 9 | Column Operations | Column expressions, alias, cast, when/otherwise, lit |
| 10 | Row Operations & Actions | show, collect, take, count, first |
| 11 | DataFrame Functions | Built-in functions, String functions, Date functions, Math functions, Null handling |
| 12 | Filtering & Conditional Logic | filter vs where, isin, like, rlike, case when |
| 13 | Sorting & Deduplication | orderBy, sort, distinct, dropDuplicates, Sorting optimization |
| 14 | Aggregation & Grouping | groupBy, agg, count, sum, avg |
| 15 | Joins in DataFrames | Inner join, Left/Right join, Full join, Semi/Anti join |
| 16 | Join Optimization | Broadcast join, Shuffle join, Join hints, Skew handling, AQE |
| 17 | Handling Missing & Bad Data | dropna, fillna, replace, Null checks, Data validation |
| 18 | Window Functions | Window spec, row_number, rank, lead/lag, Running totals |
| 19 | UDF & UDAF | UDF creation, Performance impact, Pandas UDF, Serialization, Best practices |
| 20 | DataFrame Caching & Persistence | cache, persist, Storage levels, Memory vs disk, When to cache |
| 21 | Spark SQL with DataFrames | Temp views, Global temp views, SQL queries, Mixing SQL & DataFrame API, Optimization |
| 22 | Partitioning & Repartitioning | repartition, coalesce, Partition pruning, File partitioning, Performance tuning |
| 23 | Performance Optimization Basics | Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE |
| 24 | DataFrame Execution Plan | Logical plan, Physical plan, explain(), DAG, Stage breakdown |
| 25 | Handling Large Datasets | Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling |
| 26 | Integration with Hive | Hive tables, External tables, Metastore, Partitioned tables, Hive SQL |
| 27 | Streaming DataFrames (Structured Streaming) | Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers |
| 28 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug tools, Retry strategies |
| 29 | Best Practices & Design Patterns | Code structure, Reusability, Performance patterns, Anti-patterns, Testing |
| 30 | Real-world Use Cases & Projects | ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review |