Prime_Questions: #RDD

#RDD

Key Concepts

S.No	Topic	Sub-Topics
1	Apache Spark & RDD	What is Spark, What is RDD, RDD characteristics, Use cases, Spark ecosystem
2	Spark Architecture	Driver, Executors, Cluster manager, Jobs, Stages & Tasks
3	SparkContext & RDD Creation	SparkContext, parallelize(), textFile(), wholeTextFiles(), makeRDD()
4	RDD Types & Characteristics	Parallel collections, Hadoop RDDs, Typed RDDs, Immutable nature, Partitioning
5	RDD Transformations Basics	map, flatMap, filter, distinct, union
6	RDD Actions Basics	collect, count, first, take, reduce
7	RDD Key-Value Pairs	Pair RDDs, mapToPair, reduceByKey, groupByKey, combineByKey
8	Key-Value Aggregations	reduceByKey, foldByKey, aggregateByKey, countByKey, sortByKey
9	RDD Joins	join, leftOuterJoin, rightOuterJoin, fullOuterJoin, Cartesian
10	RDD Set Operations	union, intersection, subtract, distinct, zip
11	Partitioning in RDDs	HashPartitioner, RangePartitioner, Custom partitioner, Repartition, Coalesce
12	Shuffle Operations	Shuffle concept, Wide vs narrow transformations, Optimization, Skew issues, Tuning
13	RDD Persistence & Caching	cache, persist, Storage levels, Memory management, When to cache
14	Fault Tolerance & Lineage	RDD lineage, DAG, Re-computation, Checkpointing, Recovery
15	Shared Variables	Broadcast variables, Accumulators, Use cases, Performance benefits, Limitations
16	RDD with Files	Text files, Sequence files, Object files, Hadoop input formats, Output formats
17	RDD vs DataFrame vs Dataset	API differences, Performance, Type safety, Optimization, When to use RDD
18	Custom Transformations	mapPartitions, mapPartitionsWithIndex, foreachPartition, Efficiency, Use cases
19	RDD Sorting & Ordering	sortBy, sortByKey, takeOrdered, top, Ordering logic
20	Error Handling & Debugging	Common errors, Serialization issues, Logging, Debug techniques, Best practices
21	Performance Optimization	Partition sizing, Avoiding shuffles, Serialization tuning, Memory tuning, Best practices
22	Handling Large Datasets	Skew handling, Sampling, Checkpointing, Spill management, Resource tuning
23	RDD Execution Model	Job submission, Stage creation, Task execution, DAG scheduling, Execution flow
24	RDD with Hive & HDFS	HDFS integration, Hive tables, InputFormat, OutputFormat, Metadata handling
25	Advanced Key-Value Operations	combineByKey internals, map-side aggregation, Custom combiners, Performance tuning, Examples
26	Checkpointing & Reliability	Checkpoint types, Reliable storage, Performance trade-offs, Use cases, Configuration
27	Security & Access Control	Authentication, Authorization, Secure clusters, Kerberos basics, Best practices
28	Testing RDD Code	Local mode testing, Unit testing, Test data, Assertions, Debug runs
29	Best Practices & Anti-patterns	Common mistakes, Performance anti-patterns, Code structure, Reusability, Guidelines
30	Real-world Use Cases & Projects	Log processing, ETL pipelines, Graph processing, Analytics workloads, Review

Interview Questions

Basic Level

What is an RDD in Apache Spark?
What does RDD stand for?
What are the main characteristics of RDDs?
How is RDD different from a DataFrame?
Is RDD immutable? Why?
How do you create an RDD in Spark?
What are the two ways to create an RDD?
What is parallelize() in Spark?
What is the textFile() method?
What is lazy evaluation in RDD?
What are transformations in RDD?
What are actions in RDD?
Give examples of RDD transformations.
Give examples of RDD actions.
What is the map() transformation?
What is flatMap()?
What is the difference between map() and flatMap()?
What is filter() in RDD?
What is the collect() action?
What is the count() action?
What is the take() action?
What is foreach() in RDD?
What is SparkContext?
What is an RDD partition?
Why is RDD fault tolerant?
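
Several of the basic questions (map() vs flatMap(), transformations vs actions, lazy evaluation) can be tried out without a cluster. The sketch below is plain Python and not the Spark API: `MiniRDD`, `flat_map` and `tokenize` are hypothetical names chosen for illustration, and the class only mimics the semantics on a single machine, with no partitions or fault tolerance.

```python
from itertools import islice


class MiniRDD:
    """Toy, single-machine model of an RDD (hypothetical; not the Spark API)."""

    def __init__(self, compute):
        # compute is a zero-argument function returning a fresh iterator,
        # so nothing is evaluated until an action calls it.
        self._compute = compute

    @staticmethod
    def parallelize(data):
        items = list(data)
        return MiniRDD(lambda: iter(items))

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def flat_map(self, f):
        return MiniRDD(lambda: (y for x in self._compute() for y in f(x)))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: trigger evaluation and return a plain Python value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(islice(self._compute(), n))


calls = []


def tokenize(line):
    calls.append(line)  # records when the function actually runs
    return line.split()


lines = MiniRDD.parallelize(["a b", "c"])
mapped = lines.map(tokenize)           # one output element per input element
flattened = lines.flat_map(tokenize)   # each result is flattened into the output
calls_before_action = len(calls)       # 0: transformations alone run nothing

mapped_result = mapped.collect()       # [['a', 'b'], ['c']]
flat_result = flattened.collect()      # ['a', 'b', 'c']
```

The point to take away for an interview answer: map() keeps a one-to-one shape, flatMap() flattens each result, and neither does any work until an action such as collect(), count() or take() is called.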

Intermediate Level

What is lineage in RDD?
How does RDD achieve fault tolerance?
What is a DAG in Spark?
What is a narrow transformation?
What is a wide transformation?
What is the difference between narrow and wide transformations?
What is a shuffle in Spark?
Which RDD operations cause a shuffle?
What is reduceByKey()?
What is the difference between reduceByKey() and groupByKey()?
What is aggregateByKey()?
What is combineByKey()?
What is cache() in RDD?
What is persist()?
What is the difference between cache() and persist()?
What storage levels are supported in RDD?
What is the MEMORY_ONLY storage level?
What happens if a cached RDD does not fit in memory?
What is checkpointing?
What is the difference between caching and checkpointing?
What is coalesce()?
What is the difference between coalesce() and repartition()?
What is union() operation?
What is intersection()?
What is subtract() in RDD?
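
The reduceByKey() vs groupByKey() question usually comes down to how many records cross the shuffle. The sketch below models that in plain Python rather than with Spark itself: the two partitions, and the helper names `group_by_key_model` and `reduce_by_key_model`, are assumptions made for illustration. It counts how many records would be shuffled when values are combined inside each partition first (as reduceByKey() does) versus when every record is sent as-is (as groupByKey() does).

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def group_by_key_model(parts):
    """Send every record across the 'shuffle', then group on the receiving side."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key_model(parts, func):
    """Combine within each partition first, then merge the partial results."""
    shuffled = []
    for part in parts:
        partial = {}
        for key, value in part:
            partial[key] = func(partial[key], value) if key in partial else value
        # At most one record per key leaves each partition.
        shuffled.extend(partial.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


grouped, grouped_shuffled = group_by_key_model(partitions)
reduced, reduced_shuffled = reduce_by_key_model(partitions, lambda x, y: x + y)

# Summing the grouped values gives the same totals, but only after
# all six records have been moved.
counts_from_groups = {key: sum(values) for key, values in grouped.items()}
```

With this input, the grouping model moves 6 records and the reducing model moves 4; the gap grows with the number of repeated keys per partition.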

Advanced Level

What is a Pair RDD?
How do you create a Pair RDD?
What is sortByKey()?
How does partitioning work in RDD?
What is HashPartitioner?
What is RangePartitioner?
Why is partitioning important?
What is mapPartitions()?
What is the difference between map() and mapPartitions()?
What is foreachPartition()?
What is zip() in RDD?
What are broadcast variables?
What are accumulators?
Can RDDs be shared between Spark applications?
What is a task in Spark?
What is a stage in Spark?
How are stages created from RDD operations?
What is speculative execution?
What is memory spill in Spark?
How does Spark handle executor memory?
What is a long lineage problem?
How does checkpoint help performance?
What is serialization in RDD?
How do the Java serializer and the Kryo serializer compare?
How do you optimize RDD performance?
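
For the partitioning questions, the core idea behind hash partitioning is that a key's partition is its hash taken modulo the number of partitions, so equal keys always land together. The sketch below is a plain Python model, not Spark's HashPartitioner: `stable_hash`, `hash_partition` and `partition_by` are hypothetical helpers, and `zlib.crc32` is only a stand-in for a key's hash code, used here because Python's built-in hash() of strings changes between runs.

```python
import zlib


def stable_hash(key):
    # Stand-in for a key's hash code (an assumption for this sketch).
    return zlib.crc32(repr(key).encode("utf-8"))


def hash_partition(key, num_partitions):
    # The result is always in the range 0 .. num_partitions - 1.
    return stable_hash(key) % num_partitions


def partition_by(pairs, num_partitions):
    """Distribute (key, value) pairs into num_partitions lists by key hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return parts


pairs = [("user1", 10), ("user2", 5), ("user1", 7), ("user3", 1), ("user2", 2)]
parts = partition_by(pairs, 3)

# For each key, collect the set of partitions it appears in.
partitions_per_key = {}
for index, part in enumerate(parts):
    for key, _value in part:
        partitions_per_key.setdefault(key, set()).add(index)
```

Because every key maps to exactly one partition, a later per-key aggregation or join on the same partitioning does not need to move those records again, which is why partitioning matters for shuffle-heavy jobs.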

Expert Level

When should you use RDD over DataFrame?
Why are RDDs considered low-level APIs?
How does RDD differ from Dataset?
How does Spark recover lost RDD partitions?
What happens if a node fails during RDD computation?
How does Spark schedule RDD tasks?
What is locality level in Spark?
Explain how an RDD is executed internally.
How does Spark avoid recomputation?
Explain RDD execution with an example job.
How do you debug RDD performance issues?
How does garbage collection affect RDD performance?
What is an out-of-memory error in RDD jobs?
How do you tune Spark for large RDD jobs?
What are common RDD anti-patterns?
Why is groupByKey() discouraged?
How do you manage skewed data in RDD?
Explain custom partitioner usage.
What is RDD persistence across stages?
Can RDD operations be optimized by Catalyst?
Why does Spark recommend DataFrames over RDDs?
How does RDD handle schema-less data?
What are the limitations of RDD?
How does RDD support iterative algorithms?
Explain RDD usage in real-time production scenarios.
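
One common answer to the skewed-data question is key salting: append a random suffix to each key so a single hot key is spread over several sub-keys, aggregate per salted key, then strip the salt and aggregate again. The sketch below shows that two-stage idea in plain Python under stated assumptions; `salted_count`, the input data and the choice of 4 salts are all hypothetical, and no Spark call is involved.

```python
import random
from collections import Counter


def salted_count(pairs, num_salts, seed=0):
    """Two-stage sum per key, with a random salt added in the first stage."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt), so a hot key becomes num_salts sub-keys.
    stage1 = Counter()
    for key, value in pairs:
        stage1[(key, rng.randrange(num_salts))] += value

    # Stage 2: drop the salt and merge the partial sums per original key.
    stage2 = Counter()
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial

    return stage1, stage2


# Hypothetical skewed input: one key carries almost all the records.
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 10
stage1, totals = salted_count(pairs, num_salts=4)

# Largest amount of work attributed to any single salted sub-key of "hot".
largest_sub_key = max(v for (key, _salt), v in stage1.items() if key == "hot")
```

The totals are unchanged by salting, while the largest single group in the first stage is smaller than the full hot key, which is what lets the work spread across more tasks. The trade-off is the extra aggregation stage.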

09 January 2026
