Prime_Questions: #DataFrames

Key Concepts

S.No	Topic	Sub-Topics
1	Apache Spark & DataFrames	Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases
2	Spark Setup & Environment	Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics
3	SparkSession & Entry Points	SparkSession creation, SQLContext, HiveContext, Config options, Best practices
4	Creating DataFrames	From files, From RDD, From collections, Schema inference, Explicit schema
5	DataFrame Schema & Data Types	StructType, StructField, Primitive types, Complex types, Schema evolution
6	Reading Data Sources	CSV, JSON, Parquet, ORC, Avro basics
7	Writing DataFrames	Save modes, Partitioning, Bucketing, File formats, Compression
8	DataFrame Basic Operations	select, withColumn, drop, filter, where
9	Column Operations	Column expressions, alias, cast, when/otherwise, lit
10	Row Operations & Actions	show, collect, take, count, first
11	DataFrame Functions	Built-in functions, String functions, Date functions, Math functions, Null handling
12	Filtering & Conditional Logic	filter vs where, isin, like, rlike, case when
13	Sorting & Deduplication	orderBy, sort, distinct, dropDuplicates, Sorting optimization
14	Aggregation & Grouping	groupBy, agg, count, sum, avg
15	Joins in DataFrames	Inner join, Left/Right join, Full join, Semi/Anti join
16	Join Optimization	Broadcast join, Shuffle join, Join hints, Skew handling, AQE
17	Handling Missing & Bad Data	dropna, fillna, replace, Null checks, Data validation
18	Window Functions	Window spec, row_number, rank, lead/lag, Running totals
19	UDF & UDAF	UDF creation, Performance impact, Pandas UDF, Serialization, Best practices
20	DataFrame Caching & Persistence	cache, persist, Storage levels, Memory vs disk, When to cache
21	Spark SQL with DataFrames	Temp views, Global views, SQL queries, Mixing SQL & DF, Optimization
22	Partitioning & Repartitioning	repartition, coalesce, Partition pruning, File partitioning, Performance tuning
23	Performance Optimization Basics	Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE
24	DataFrame Execution Plan	Logical plan, Physical plan, explain(), DAG, Stage breakdown
25	Handling Large Datasets	Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling
26	Integration with Hive	Hive tables, External tables, Metastore, Partitioned tables, Hive SQL
27	Streaming DataFrames (Structured Streaming)	Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers
28	Error Handling & Debugging	Common errors, Serialization issues, Logging, Debug tools, Retry strategies
29	Best Practices & Design Patterns	Code structure, Reusability, Performance patterns, Anti-patterns, Testing
30	Real-world Use Cases & Projects	ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review

Interview Questions

Basic Level

What is a DataFrame in Apache Spark?
How is a DataFrame different from an RDD?
What are the advantages of DataFrames?
Is a DataFrame immutable?
What is SparkSession?
How do you create a DataFrame in Spark?
What are the different data sources supported by DataFrames?
What is a schema in a DataFrame?
How can you infer a schema automatically?
How do you define a custom schema?
What is show() in a DataFrame?
What is printSchema()?
What is select() in a DataFrame?
What is withColumn()?
Difference between withColumn() and select()?
What is filter() / where() in a DataFrame?
Difference between filter() and where()?
What is limit()?
What is collect()?
What is count()?
What is distinct()?
What is drop()?
What is dropDuplicates()?
What is alias()?
How do you rename a column in a DataFrame?

Intermediate Level

What are DataFrame transformations?
What are DataFrame actions?
What is lazy evaluation in DataFrames?
What is explain() in a DataFrame?
What is a logical plan?
What is a physical plan?
What is the Catalyst optimizer?
What is the Tungsten execution engine?
What is a column expression?
What are built-in Spark SQL functions?
Difference between UDF and built-in functions?
What is a UDF?
Performance impact of UDFs?
What is groupBy()?
What is agg()?
Difference between groupBy and window functions?
What is orderBy()?
Difference between orderBy() and sort()?
What is join() in a DataFrame?
What are the types of joins in Spark?
What is an inner join?
What is a left outer join?
What is a right outer join?
What is a full outer join?
What is a cross join?
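
For the questions above on transformations, actions, and lazy evaluation, the idea can be sketched in plain Python. This is not Spark code: `LazyFrame` is a hypothetical stand-in invented for illustration, and it only mimics the behavior that transformations record a step and return a new object, while an action such as collect() actually runs the recorded steps.

```python
class LazyFrame:
    """Hypothetical stand-in for a DataFrame (not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, never mutated

    def filter(self, predicate):
        # Transformation: record the step and return a NEW object.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, mapper):
        # Transformation: record the step and return a NEW object.
        return LazyFrame(self._rows, self._plan + (("select", mapper),))

    def collect(self):
        # Action: only now are the recorded steps applied to the rows.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


calls = []  # records every time the predicate is actually evaluated


def is_even(x):
    calls.append(x)
    return x % 2 == 0


base = LazyFrame([1, 2, 3, 4])
evens = base.filter(is_even).select(lambda x: x * 10)

calls_before_action = len(calls)  # nothing has run yet
result = evens.collect()          # the action triggers the work

print(calls_before_action)  # 0
print(result)               # [20, 40]
```

The original `base` object is unchanged after the chained calls, which is the same reason a DataFrame is described as immutable: each transformation yields a new object rather than modifying the existing one.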

Advanced Level

What is a broadcast join?
When does Spark automatically use a broadcast join?
What is a shuffle join?
How does Spark handle join optimization?
What is partitioning in a DataFrame?
Difference between repartition() and coalesce()?
What is bucketing in Spark?
Difference between bucketing and partitioning?
What is a window function?
Explain row_number(), rank(), dense_rank().
What is caching in a DataFrame?
Difference between cache() and persist()?
What storage levels are available?
What is checkpointing in a DataFrame?
Difference between cache and checkpoint?
What is skewed data?
How do you handle data skew in joins?
What is the salting technique?
What is adaptive query execution (AQE)?
What are Spark SQL hints?
What is explain(true)?
How do you optimize wide transformations?
What is column pruning?
What is predicate pushdown?
What is a vectorized reader?
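
For the question above on row_number(), rank(), and dense_rank(), the difference shows up only when values tie. The following is a plain-Python sketch of the three numbering schemes over one ordered window, not Spark code; the helper name `rank_rows` and the sample scores are made up for illustration.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    ordered = sorted(scores, reverse=True)
    result = []
    rank = 0
    dense = 0
    prev = None
    for position, score in enumerate(ordered, start=1):
        if score != prev:
            rank = position  # rank jumps to the row position, leaving gaps after ties
            dense += 1       # dense_rank increases by one per distinct value, no gaps
            prev = score
        # row_number is simply the position and is always unique
        result.append((score, position, rank, dense))
    return result


rows = rank_rows([90, 85, 90, 70])
for row in rows:
    print(row)
# (90, 1, 1, 1)
# (90, 2, 1, 1)
# (85, 3, 3, 2)
# (70, 4, 4, 3)
```

The two tied scores share rank 1 and dense rank 1 but get distinct row numbers; the next score gets rank 3 (a gap) but dense rank 2.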

Expert Level

Why are DataFrames faster than RDDs?
How does Catalyst optimize DataFrame queries?
How does Tungsten improve performance?
What happens internally when a DataFrame action is triggered?
How does Spark generate optimized bytecode?
What is whole-stage code generation?
What is off-heap memory?
How does Spark handle memory management for DataFrames?
How does AQE change execution plans at runtime?
How do you debug slow DataFrame jobs?
How do you analyze Spark UI for DataFrame jobs?
What are common DataFrame performance anti-patterns?
Why is excessive use of withColumn() discouraged?
How do you design efficient Spark SQL pipelines?
What are the limitations of DataFrames?
How do DataFrames handle schema evolution?
How do DataFrames support semi-structured data?
What is the explode() function?
What are from_json() and to_json()?
How do you handle nested columns efficiently?
How do DataFrames integrate with Delta Lake?
How are ACID properties handled for DataFrame reads and writes?
What is cost-based optimization (CBO)?
How do you tune Spark SQL configurations?
Explain real-world production use cases of DataFrames.


09 January 2026
