| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Apache Spark & RDD | What is Spark, What is RDD, RDD characteristics, Use cases, Spark ecosystem |
| 2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs, Stages & Tasks |
| 3 | SparkContext & RDD Creation | SparkContext, parallelize(), textFile(), wholeTextFiles(), makeRDD() |
| 4 | RDD Types & Characteristics | Parallel collections, Hadoop RDDs, Typed RDDs, Immutable nature, Partitioning |
| 5 | RDD Transformations Basics | map, flatMap, filter, distinct, union |
| 6 | RDD Actions Basics | collect, count, first, take, reduce |
| 7 | RDD Key-Value Pairs | Pair RDDs, mapToPair, reduceByKey, groupByKey, combineByKey |
| 8 | Key-Value Aggregations | reduceByKey, foldByKey, aggregateByKey, countByKey, sortByKey |
| 9 | RDD Joins | join, leftOuterJoin, rightOuterJoin, fullOuterJoin, cartesian |
| 10 | RDD Set Operations | union, intersection, subtract, distinct, zip |
| 11 | Partitioning in RDDs | HashPartitioner, RangePartitioner, Custom partitioner, Repartition, Coalesce |
| 12 | Shuffle Operations | Shuffle concept, Wide vs narrow transformations, Optimization, Skew issues, Tuning |
| 13 | RDD Persistence & Caching | cache, persist, Storage levels, Memory management, When to cache |
| 14 | Fault Tolerance & Lineage | RDD lineage, DAG, Re-computation, Checkpointing, Recovery |
| 15 | Shared Variables | Broadcast variables, Accumulators, Use cases, Performance benefits, Limitations |
| 16 | RDD with Files | Text files, Sequence files, Object files, Hadoop input formats, Output formats |
| 17 | RDD vs DataFrame vs Dataset | API differences, Performance, Type safety, Optimization, When to use RDD |
| 18 | Custom Transformations | mapPartitions, mapPartitionsWithIndex, foreachPartition, Efficiency, Use cases |
| 19 | RDD Sorting & Ordering | sortBy, sortByKey, takeOrdered, top, Ordering logic |
| 20 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug techniques, Best practices |
| 21 | Performance Optimization | Partition sizing, Avoiding shuffles, Serialization tuning, Memory tuning, Best practices |
| 22 | Handling Large Datasets | Skew handling, Sampling, Checkpointing, Spill management, Resource tuning |
| 23 | RDD Execution Model | Job submission, Stage creation, Task execution, DAG scheduling, Execution flow |
| 24 | RDD with Hive & HDFS | HDFS integration, Hive tables, InputFormat, OutputFormat, Metadata handling |
| 25 | Advanced Key-Value Operations | combineByKey internals, map-side aggregation, Custom combiners, Performance tuning, Examples |
| 26 | Checkpointing & Reliability | Checkpoint types, Reliable storage, Performance trade-offs, Use cases, Configuration |
| 27 | Security & Access Control | Authentication, Authorization, Secure clusters, Kerberos basics, Best practices |
| 28 | Testing RDD Code | Local mode testing, Unit testing, Test data, Assertions, Debug runs |
| 29 | Best Practices & Anti-patterns | Common mistakes, Performance anti-patterns, Code structure, Reusability, Guidelines |
| 30 | Real-world Use Cases & Projects | Log processing, ETL pipelines, Graph processing, Analytics workloads, Review |