09 January 2026

#RDD

Key Concepts


S.No | Topic | Sub-Topics
1 | Apache Spark & RDD | What is Spark, What is RDD, RDD characteristics, Use cases, Spark ecosystem
2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs, Stages & Tasks
3 | SparkContext & RDD Creation | SparkContext, parallelize(), textFile(), wholeTextFiles(), makeRDD()
4 | RDD Types & Characteristics | Parallel collections, Hadoop RDDs, Typed RDDs, Immutable nature, Partitioning
5 | RDD Transformations Basics | map, flatMap, filter, distinct, union
6 | RDD Actions Basics | collect, count, first, take, reduce
7 | RDD Key-Value Pairs | Pair RDDs, mapToPair, reduceByKey, groupByKey, combineByKey
8 | Key-Value Aggregations | reduceByKey, foldByKey, aggregateByKey, countByKey, sortByKey
9 | RDD Joins | join, leftOuterJoin, rightOuterJoin, fullOuterJoin, Cartesian
10 | RDD Set Operations | union, intersection, subtract, distinct, zip
11 | Partitioning in RDDs | HashPartitioner, RangePartitioner, Custom partitioner, Repartition, Coalesce
12 | Shuffle Operations | Shuffle concept, Wide vs narrow transformations, Optimization, Skew issues, Tuning
13 | RDD Persistence & Caching | cache, persist, Storage levels, Memory management, When to cache
14 | Fault Tolerance & Lineage | RDD lineage, DAG, Re-computation, Checkpointing, Recovery
15 | Shared Variables | Broadcast variables, Accumulators, Use cases, Performance benefits, Limitations
16 | RDD with Files | Text files, Sequence files, Object files, Hadoop input formats, Output formats
17 | RDD vs DataFrame vs Dataset | API differences, Performance, Type safety, Optimization, When to use RDD
18 | Custom Transformations | mapPartitions, mapPartitionsWithIndex, foreachPartition, Efficiency, Use cases
19 | RDD Sorting & Ordering | sortBy, sortByKey, takeOrdered, top, Ordering logic
20 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug techniques, Best practices
21 | Performance Optimization | Partition sizing, Avoiding shuffles, Serialization tuning, Memory tuning, Best practices
22 | Handling Large Datasets | Skew handling, Sampling, Checkpointing, Spill management, Resource tuning
23 | RDD Execution Model | Job submission, Stage creation, Task execution, DAG scheduling, Execution flow
24 | RDD with Hive & HDFS | HDFS integration, Hive tables, InputFormat, OutputFormat, Metadata handling
25 | Advanced Key-Value Operations | combineByKey internals, map-side aggregation, Custom combiners, Performance tuning, Examples
26 | Checkpointing & Reliability | Checkpoint types, Reliable storage, Performance trade-offs, Use cases, Configuration
27 | Security & Access Control | Authentication, Authorization, Secure clusters, Kerberos basics, Best practices
28 | Testing RDD Code | Local mode testing, Unit testing, Test data, Assertions, Debug runs
29 | Best Practices & Anti-patterns | Common mistakes, Performance anti-patterns, Code structure, Reusability, Guidelines
30 | Real-world Use Cases & Projects | Log processing, ETL pipelines, Graph processing, Analytics workloads, Review

Interview Questions

Basic Level

  1. What is an RDD in Apache Spark?
  2. What does RDD stand for?
  3. What are the main characteristics of RDDs?
  4. How is RDD different from a DataFrame?
  5. Is RDD immutable? Why?
  6. How do you create an RDD in Spark?
  7. What are the two ways to create an RDD?
  8. What is parallelize() in Spark?
  9. What is the textFile() method?
  10. What is lazy evaluation in RDD?
  11. What are transformations in RDD?
  12. What are actions in RDD?
  13. Give examples of RDD transformations.
  14. Give examples of RDD actions.
  15. What is the map() transformation?
  16. What is flatMap()?
  17. Difference between map() and flatMap()?
  18. What is filter() in RDD?
  19. What is the collect() action?
  20. What is the count() action?
  21. What is the take() action?
  22. What is foreach() in RDD?
  23. What is SparkContext?
  24. What is an RDD partition?
  25. Why is RDD fault tolerant?
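
Several of the questions above (map() vs flatMap(), filter(), lazy evaluation, take()) can be illustrated with a minimal sketch. This is plain Python standing in for the RDD operations, not the Spark API: list comprehensions and a generator model what `rdd.map`, `rdd.flatMap`, `rdd.filter` and `rdd.take` do conceptually. The sample lines and the `traced` helper are assumptions made up for illustration.

```python
from itertools import chain

# Hypothetical input; in Spark this could come from parallelize() or textFile().
lines = ["spark makes rdds", "rdds are immutable"]

# map: exactly one output element per input element (here, one list per line).
mapped = [line.split() for line in lines]

# flatMap: each input yields zero or more outputs, flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that satisfy a predicate.
filtered = [word for word in flat_mapped if word != "rdds"]

# Lazy evaluation, modelled with a generator: defining the pipeline does no work.
evaluated = []

def traced(word):
    """Record each word that is actually processed, then transform it."""
    evaluated.append(word)
    return word.upper()

pipeline = (traced(word) for word in flat_mapped)  # "transformation": nothing runs yet
pending_before_action = len(evaluated)             # still 0 at this point

# The "action" (similar in spirit to take(2)) pulls only what it needs.
first_two = [next(pipeline), next(pipeline)]

print(mapped)
print(flat_mapped)
print(filtered)
print(pending_before_action, first_two, evaluated)
```

Note how `mapped` keeps one nested list per line while `flat_mapped` is a single flat list, and how only two words are ever processed by the lazy pipeline.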

Intermediate Level

  1. What is lineage in RDD?
  2. How does RDD achieve fault tolerance?
  3. What is a DAG in Spark?
  4. What is a narrow transformation?
  5. What is a wide transformation?
  6. Difference between narrow and wide transformations?
  7. What is a shuffle in Spark?
  8. Which RDD operations cause shuffle?
  9. What is reduceByKey()?
  10. Difference between reduceByKey() and groupByKey()?
  11. What is aggregateByKey()?
  12. What is combineByKey()?
  13. What is cache() in RDD?
  14. What is persist()?
  15. Difference between cache() and persist()?
  16. What storage levels are supported in RDD?
  17. What is MEMORY_ONLY storage level?
  18. What happens if a cached RDD does not fit in memory?
  19. What is checkpointing?
  20. Difference between cache and checkpoint?
  21. What is coalesce()?
  22. Difference between coalesce() and repartition()?
  23. What is the union() operation?
  24. What is intersection()?
  25. What is subtract() in RDD?
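
The reduceByKey() vs groupByKey() question is easier to answer with numbers. The sketch below is a plain-Python model, not Spark code: it treats two lists as partitions and counts how many records would cross the shuffle. It assumes, as Spark does, that reduceByKey combines values inside each partition before the shuffle while groupByKey sends every record. The function names and sample data are hypothetical.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def group_by_key_sum(parts):
    """Model of groupByKey followed by a sum: every record is shuffled as-is."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)

def reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle and merge."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # at most one record per key per partition
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals, len(shuffled)

group_totals, group_shuffled = group_by_key_sum(partitions)
reduce_totals, reduce_shuffled = reduce_by_key(partitions, lambda x, y: x + y)

print(group_totals, group_shuffled)
print(reduce_totals, reduce_shuffled)
```

Both approaches give the same totals, but the model shuffles 6 records for the grouping version and 4 for the reducing version; the gap grows with the number of repeated keys per partition.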

Advanced Level

  1. What is a Pair RDD?
  2. How do you create a Pair RDD?
  3. What is sortByKey()?
  4. How does partitioning work in RDD?
  5. What is HashPartitioner?
  6. What is RangePartitioner?
  7. Why is partitioning important?
  8. What is mapPartitions()?
  9. Difference between map() and mapPartitions()?
  10. What is foreachPartition()?
  11. What is zip() in RDD?
  12. What are broadcast variables?
  13. What are accumulators?
  14. Can RDDs be shared between Spark applications?
  15. What is a task in Spark?
  16. What is a stage in Spark?
  17. How are stages created from RDD operations?
  18. What is speculative execution?
  19. What is a memory spill in Spark?
  20. How does Spark handle executor memory?
  21. What is a long lineage problem?
  22. How does checkpoint help performance?
  23. What is serialization in RDD?
  24. Java serializer vs Kryo serializer?
  25. How do you optimize RDD performance?
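
For the HashPartitioner and map() vs mapPartitions() questions, a small plain-Python model may help. It is a sketch under assumptions, not Spark code: `hash_partition` mimics the idea of assigning a key to partition `hash(key) mod numPartitions`, and `map_partitions` mimics calling a function once per partition with an iterator over its records. Integer keys are used so the hash values are deterministic; `expensive_setup` is a hypothetical stand-in for something like opening a database connection.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def map_partitions(parts, func):
    """Call func once per partition, passing an iterator over its records."""
    return [list(func(iter(part))) for part in parts]

setup_calls = []  # one entry per call to expensive_setup()

def expensive_setup():
    """Hypothetical costly initialisation, e.g. opening a connection."""
    setup_calls.append(1)
    return lambda value: value + 1

def per_partition(records):
    transform = expensive_setup()  # paid once per partition
    for key, value in records:
        yield (key, transform(value))

pairs = [(key, key * 10) for key in range(6)]
parts = hash_partition(pairs, 3)
result = map_partitions(parts, per_partition)
setups_per_partition = len(setup_calls)

# The per-element style pays the setup cost once per record instead.
setup_calls.clear()
per_element = [(key, expensive_setup()(value)) for key, value in pairs]
setups_per_element = len(setup_calls)

print(parts)
print(result)
print(setups_per_partition, setups_per_element)
```

In this model the setup runs 3 times with the per-partition style and 6 times with the per-element style, which is the usual argument for mapPartitions() when initialisation is expensive.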

Expert Level

  1. When should you use RDD over DataFrame?
  2. Why are RDDs considered low-level APIs?
  3. How does RDD differ from Dataset?
  4. How does Spark recover lost RDD partitions?
  5. What happens if a node fails during RDD computation?
  6. How does Spark schedule RDD tasks?
  7. What is a locality level in Spark?
  8. Explain the internal process of RDD execution.
  9. How does Spark avoid recomputation?
  10. Explain RDD execution with an example job.
  11. How do you debug RDD performance issues?
  12. How does garbage collection affect RDD performance?
  13. What is an out-of-memory error in RDD jobs?
  14. How do you tune Spark for large RDD jobs?
  15. What are common RDD anti-patterns?
  16. Why is groupByKey() discouraged?
  17. How do you manage skewed data in RDD?
  18. Explain custom partitioner usage.
  19. What is RDD persistence across stages?
  20. Can RDD operations be optimized by Catalyst?
  21. Why does Spark recommend DataFrame over RDD?
  22. How does RDD handle schema-less data?
  23. What are the limitations of RDD?
  24. How does RDD support iterative algorithms?
  25. Explain RDD usage in real-time production scenarios.
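
The questions on recovering lost partitions and on lineage can be pictured with a toy model. The `MiniRDD` class below is hypothetical and written in plain Python; it is not how Spark is implemented. It only illustrates the idea that an RDD records how each partition is derived from its parent, so a lost partition can be recomputed on its own when the dependencies are narrow.

```python
class MiniRDD:
    """Toy stand-in for an RDD: it stores how to compute a partition, not the result."""

    def __init__(self, source=None, parent=None, func=None, name="source"):
        self.source = source  # list of partitions; only set on the root node
        self.parent = parent  # the RDD this one was derived from
        self.func = func      # partition -> partition function
        self.name = name

    def map(self, f):
        return MiniRDD(parent=self, func=lambda part: [f(x) for x in part], name="map")

    def filter(self, pred):
        return MiniRDD(parent=self, func=lambda part: [x for x in part if pred(x)], name="filter")

    def compute(self, index):
        """Recompute one partition by walking the lineage back to the source."""
        if self.parent is None:
            return list(self.source[index])
        return self.func(self.parent.compute(index))

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)

base = MiniRDD(source=[[1, 2, 3], [4, 5, 6]])
result = base.map(lambda x: x * 2).filter(lambda x: x > 4)

# Materialise both partitions, then simulate losing partition 1.
computed = {index: result.compute(index) for index in range(2)}
del computed[1]

# Recovery: only the lost partition is rebuilt from its lineage.
computed[1] = result.compute(1)

print(result.lineage())
print(computed)
```

Wide dependencies are deliberately left out of this model; recovering a partition that depends on a shuffle can require recomputing several parent partitions, which is one reason checkpointing is used for long lineages.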

Related Topics