18 January 2026

#PySpark

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | PySpark | What is PySpark, Spark ecosystem, PySpark vs Pandas, Use cases, Installation & setup |
| 2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs/Stages/Tasks, Execution flow |
| 3 | SparkSession & Context | SparkSession, SparkContext, Configurations, Application lifecycle, Best practices |
| 4 | RDD Fundamentals | RDD creation, Transformations, Actions, Persistence, RDD vs DataFrame |
| 5 | RDD Advanced Operations | Narrow vs wide ops, Shuffle, Accumulators, Broadcast variables, Performance tuning |
| 6 | DataFrame Introduction | DataFrame API, Creating DataFrames, Schema inference, show/select, DataFrame vs RDD |
| 7 | DataFrame Transformations | select, filter, withColumn, drop, cast & rename |
| 8 | Data Sources & Formats | CSV, JSON, Parquet, ORC, Avro |
| 9 | Schema Management | StructType, StructField, Explicit schema, Schema evolution, Corrupt records |
| 10 | Built-in Functions | String functions, Date functions, Math functions, Conditional logic, Null handling |
| 11 | Joins in PySpark | Inner join, Left/Right join, Full join, Broadcast join, Join optimization |
| 12 | Aggregations | groupBy, agg, count/sum/avg, rollup, cube |
| 13 | Window Functions | Window spec, row_number, rank/dense_rank, lead/lag, Running totals |
| 14 | Sorting & Partitioning | orderBy, sortWithinPartitions, repartition, coalesce, Data skew basics |
| 15 | Spark SQL | Temp views, Global views, SQL queries, CTEs, SQL vs DataFrame API |
| 16 | User Defined Functions | Python UDF, Pandas UDF, Serialization cost, When to avoid UDFs, Alternatives |
| 17 | Performance Optimization | Caching, Persist levels, Broadcast joins, File sizing, Best practices |
| 18 | Partition & File Optimization | Partition pruning, Bucketing, Small file problem, Compression, Skew handling |
| 19 | PySpark with Hive | Hive metastore, Managed tables, External tables, Partitioned tables, Hive SQL |
| 20 | Structured Streaming Basics | Streaming concepts, Micro-batching, Sources, Sinks, Checkpointing |
| 21 | Streaming Operations | Triggers, Output modes, Watermarking, Late data, Fault tolerance |
| 22 | Streaming Aggregations | Windowed aggregation, Stateful ops, Stream joins, Exactly-once semantics, Recovery |
| 23 | MLlib Overview | Transformers, Estimators, Pipelines, Evaluators, Model lifecycle |
| 24 | Feature Engineering | StringIndexer, OneHotEncoder, VectorAssembler, Scaling, Feature selection |
| 25 | ML Algorithms | Regression, Classification, Clustering, Recommendation, Metrics |
| 26 | Hyperparameter Tuning | CrossValidator, Train-validation split, ParamGrid, Model selection, Optimization |
| 27 | PySpark with Delta Lake | Delta tables, ACID transactions, Time travel, MERGE, Optimize & Vacuum |
| 28 | Debugging & Monitoring | Spark UI, Logs, Common errors, Debug strategies, Job analysis |
| 29 | Job Scheduling & Deployment | spark-submit, Config tuning, Scheduling, Parameterization, Automation |
| 30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Optimization patterns, Interview prep |

Interview Questions

Basic

  • What is PySpark?
  • What is Apache Spark?
  • Why is PySpark faster than MapReduce?
  • What are the components of Spark?
  • What is a SparkSession?
  • What is a DataFrame in PySpark?
  • Difference between RDD and DataFrame?
  • What is lazy evaluation?
  • What are transformations?
  • What are actions?
  • What is an RDD?
  • What languages are supported by Spark?
  • What is schema inference?
  • What is DBFS?
  • How do you read a CSV file in PySpark?
  • What is collect()?
  • What is show()?
  • What is repartition()?
  • What is coalesce()?
  • What is Spark SQL?
  • What is cache()?
  • What is persist()?
  • What is a cluster?
  • What is a driver?
  • What is an executor?

Intermediate

  • Difference between cache() and persist()?
  • Difference between repartition() and coalesce()?
  • What is a broadcast join?
  • What is shuffle in Spark?
  • What is the difference between narrow and wide transformations?
  • What is lineage?
  • What is fault tolerance?
  • What is a partition?
  • How does Spark handle memory management?
  • What is Catalyst Optimizer?
  • What is Tungsten engine?
  • What is Spark SQL execution plan?
  • What is explain()?
  • Difference between inner and outer join?
  • What is a window function?
  • What is a UDF?
  • Difference between UDF and built-in functions?
  • What is checkpointing?
  • What is serialization?
  • What is Kryo serialization?
  • What is an accumulator?
  • What is a broadcast variable?
  • What is PySpark SQL?
  • How do you handle null values?
  • How do you remove duplicates?

Advanced

  • How does Spark optimize joins?
  • What is Adaptive Query Execution (AQE)?
  • How do you handle data skew?
  • What is the salting technique?
  • What is bucketing?
  • Difference between bucketing and partitioning?
  • What is column pruning?
  • What is predicate pushdown?
  • What is data skipping?
  • What is whole-stage code generation?
  • How does Spark handle out-of-memory errors?
  • Explain task, stage, and job
  • What is speculative execution?
  • What is watermarking?
  • What is Structured Streaming?
  • Difference between DStream and Structured Streaming?
  • What is stateful streaming?
  • What is exactly-once semantics?
  • What is file compaction?
  • What is Z-ordering?
  • What is Delta Lake?
  • Difference between Parquet and ORC?
  • How do you optimize Spark SQL queries?
  • What is Photon engine?
  • How do you tune Spark jobs?

Expert

  • Design a large-scale ETL pipeline using PySpark
  • How do you process billions of records efficiently?
  • How do you tune Spark for high concurrency?
  • How do you handle schema evolution?
  • How do you implement CDC in PySpark?
  • How do you debug slow Spark jobs?
  • How do you monitor PySpark applications?
  • How do you design fault-tolerant pipelines?
  • How do you secure sensitive data in PySpark?
  • How do you implement row-level security?
  • Explain memory tuning parameters
  • How do you reduce shuffle operations?
  • How do you optimize joins in large datasets?
  • Explain Spark internals with execution flow
  • How do you migrate legacy Spark jobs to PySpark?
  • How do you implement real-time streaming pipelines?
  • How do you manage dependencies in PySpark?
  • How do you handle late-arriving data?
  • Explain best practices for production PySpark
  • How do you build reusable PySpark frameworks?
  • How does PySpark interact with JVM?
  • Explain Py4J architecture
  • How do you handle the small files problem in large datasets?
  • Explain cost optimization strategies
  • End-to-end PySpark project explanation
