18 January 2026

#PySpark

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | PySpark | What is PySpark, Spark ecosystem, PySpark vs Pandas, Use cases, Installation & setup |
| 2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs/Stages/Tasks, Execution flow |
| 3 | SparkSession & Context | SparkSession, SparkContext, Configurations, Application lifecycle, Best practices |
| 4 | RDD Fundamentals | RDD creation, Transformations, Actions, Persistence, RDD vs DataFrame |
| 5 | RDD Advanced Operations | Narrow vs wide ops, Shuffle, Accumulators, Broadcast variables, Performance tuning |
| 6 | DataFrame Introduction | DataFrame API, Creating DataFrames, Schema inference, show/select, DataFrame vs RDD |
| 7 | DataFrame Transformations | select, filter, withColumn, drop, cast & rename |
| 8 | Data Sources & Formats | CSV, JSON, Parquet, ORC, Avro |
| 9 | Schema Management | StructType, StructField, Explicit schema, Schema evolution, Corrupt records |
| 10 | Built-in Functions | String functions, Date functions, Math functions, Conditional logic, Null handling |
| 11 | Joins in PySpark | Inner join, Left/Right join, Full join, Broadcast join, Join optimization |
| 12 | Aggregations | groupBy, agg, count/sum/avg, rollup, cube |
| 13 | Window Functions | Window spec, row_number, rank/dense_rank, lead/lag, Running totals |
| 14 | Sorting & Partitioning | orderBy, sortWithinPartitions, repartition, coalesce, Data skew basics |
| 15 | Spark SQL | Temp views, Global views, SQL queries, CTEs, SQL vs DataFrame API |
| 16 | User Defined Functions | Python UDF, Pandas UDF, Serialization cost, When to avoid UDFs, Alternatives |
| 17 | Performance Optimization | Caching, Persist levels, Broadcast joins, File sizing, Best practices |
| 18 | Partition & File Optimization | Partition pruning, Bucketing, Small file problem, Compression, Skew handling |
| 19 | PySpark with Hive | Hive metastore, Managed tables, External tables, Partitioned tables, Hive SQL |
| 20 | Structured Streaming Basics | Streaming concepts, Micro-batching, Sources, Sinks, Checkpointing |
| 21 | Streaming Operations | Triggers, Output modes, Watermarking, Late data, Fault tolerance |
| 22 | Streaming Aggregations | Windowed aggregation, Stateful ops, Stream joins, Exactly-once semantics, Recovery |
| 23 | MLlib Overview | Transformers, Estimators, Pipelines, Evaluators, Model lifecycle |
| 24 | Feature Engineering | StringIndexer, OneHotEncoder, VectorAssembler, Scaling, Feature selection |
| 25 | ML Algorithms | Regression, Classification, Clustering, Recommendation, Metrics |
| 26 | Hyperparameter Tuning | CrossValidator, Train-validation split, ParamGrid, Model selection, Optimization |
| 27 | PySpark with Delta Lake | Delta tables, ACID transactions, Time travel, MERGE, Optimize & Vacuum |
| 28 | Debugging & Monitoring | Spark UI, Logs, Common errors, Debug strategies, Job analysis |
| 29 | Job Scheduling & Deployment | spark-submit, Config tuning, Scheduling, Parameterization, Automation |
| 30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Optimization patterns, Interview prep |

Interview Questions

Basic

  • What is PySpark?
  • What is Apache Spark?
  • Why is PySpark faster than MapReduce?
  • What are the components of Spark?
  • What is a SparkSession?
  • What is a DataFrame in PySpark?
  • Difference between RDD and DataFrame?
  • What is lazy evaluation?
  • What are transformations?
  • What are actions?
  • What is an RDD?
  • What languages are supported by Spark?
  • What is schema inference?
  • What is DBFS?
  • How do you read a CSV file in PySpark?
  • What is collect()?
  • What is show()?
  • What is repartition()?
  • What is coalesce()?
  • What is Spark SQL?
  • What is cache()?
  • What is persist()?
  • What is a cluster?
  • What is a driver?
  • What is an executor?

Intermediate

  • Difference between cache() and persist()?
  • Difference between repartition() and coalesce()?
  • What is a broadcast join?
  • What is shuffle in Spark?
  • What is the difference between narrow and wide transformations?
  • What is lineage?
  • What is fault tolerance?
  • What is a partition?
  • How does Spark handle memory management?
  • What is Catalyst Optimizer?
  • What is Tungsten engine?
  • What is Spark SQL execution plan?
  • What is explain()?
  • Difference between inner and outer join?
  • What is a window function?
  • What is a UDF?
  • Difference between UDF and built-in functions?
  • What is checkpointing?
  • What is serialization?
  • What is Kryo serialization?
  • What is an accumulator?
  • What is a broadcast variable?
  • What is PySpark SQL?
  • How do you handle null values?
  • How do you remove duplicates?

Advanced

  • How does Spark optimize joins?
  • What is Adaptive Query Execution (AQE)?
  • How do you handle data skew?
  • What is the salting technique?
  • What is bucketing?
  • Difference between bucketing and partitioning?
  • What is column pruning?
  • What is predicate pushdown?
  • What is data skipping?
  • What is whole-stage code generation?
  • How does Spark handle out-of-memory errors?
  • Explain task, stage, and job
  • What is speculative execution?
  • What is watermarking?
  • What is Structured Streaming?
  • Difference between DStream and Structured Streaming?
  • What is stateful streaming?
  • What is exactly-once semantics?
  • What is file compaction?
  • What is Z-ordering?
  • What is Delta Lake?
  • Difference between Parquet and ORC?
  • How do you optimize Spark SQL queries?
  • What is Photon engine?
  • How do you tune Spark jobs?

Expert

  • Design a large-scale ETL pipeline using PySpark
  • How do you process billions of records efficiently?
  • How do you tune Spark for high concurrency?
  • How do you handle schema evolution?
  • How do you implement CDC in PySpark?
  • How do you debug slow Spark jobs?
  • How do you monitor PySpark applications?
  • How do you design fault-tolerant pipelines?
  • How do you secure sensitive data in PySpark?
  • How do you implement row-level security?
  • Explain memory tuning parameters
  • How do you reduce shuffle operations?
  • How do you optimize joins in large datasets?
  • Explain Spark internals with execution flow
  • How do you migrate legacy Spark jobs to PySpark?
  • How do you implement real-time streaming pipelines?
  • How do you manage dependencies in PySpark?
  • How do you handle late-arriving data?
  • Explain best practices for production PySpark
  • How do you build reusable PySpark frameworks?
  • How does PySpark interact with JVM?
  • Explain Py4J architecture
  • How do you handle the small files problem in large datasets?
  • Explain cost optimization strategies
  • End-to-end PySpark project explanation
