Prime_Questions: January 2026

26 January 2026

#Milvus

#Pinecone

19 January 2026

#Amazon EC2

Key Concepts

S.No	Topic	Sub-Topics

Interview questions

18 January 2026

#Amazon Redshift

#Amazon S3

#Amazon Kinesis

#AWS Glue

#PySpark

Key Concepts

S.No	Topic	Sub-Topics
1	PySpark	What is PySpark, Spark ecosystem, PySpark vs Pandas, Use cases, Installation & setup
2	Spark Architecture	Driver, Executors, Cluster manager, Jobs/Stages/Tasks, Execution flow
3	SparkSession & Context	SparkSession, SparkContext, Configurations, Application lifecycle, Best practices
4	RDD Fundamentals	RDD creation, Transformations, Actions, Persistence, RDD vs DataFrame
5	RDD Advanced Operations	Narrow vs wide ops, shuffle, Accumulators, Broadcast variables, Performance tuning
6	DataFrame Introduction	DataFrame API, Creating DataFrames, Schema inference, show/select, DataFrame vs RDD
7	DataFrame Transformations	select, filter, withColumn, drop, cast & rename
8	Data Sources & Formats	CSV, JSON, Parquet, ORC, Avro
9	Schema Management	StructType, StructField, Explicit schema, Schema evolution, Corrupt records
10	Built-in Functions	String functions, Date functions, Math functions, Conditional logic, Null handling
11	Joins in PySpark	Inner join, Left/Right join, Full join, Broadcast join, Join optimization
12	Aggregations	groupBy, agg, count/sum/avg, rollup, cube
13	Window Functions	Window spec, row_number, rank/dense_rank, lead/lag, Running totals
14	Sorting & Partitioning	orderBy, sortWithinPartitions, repartition, coalesce, Data skew basics
15	Spark SQL	Temp views, Global views, SQL queries, CTEs, SQL vs DataFrame API
16	User Defined Functions	Python UDF, Pandas UDF, Serialization cost, When to avoid UDF, Alternatives
17	Performance Optimization	Caching, Persist levels, Broadcast joins, File sizing, Best practices
18	Partition & File Optimization	Partition pruning, Bucketing, Small file problem, Compression, Skew handling
19	PySpark with Hive	Hive metastore, Managed tables, External tables, Partitioned tables, Hive SQL
20	Structured Streaming Basics	Streaming concepts, Micro-batching, Sources, Sinks, Checkpointing
21	Streaming Operations	Triggers, Output modes, Watermarking, Late data, Fault tolerance
22	Streaming Aggregations	Windowed aggregation, Stateful ops, Stream joins, Exactly-once semantics, Recovery
23	MLlib Overview	Transformers, Estimators, Pipelines, Evaluators, Model lifecycle
24	Feature Engineering	StringIndexer, OneHotEncoder, VectorAssembler, Scaling, Feature selection
25	ML Algorithms	Regression, Classification, Clustering, Recommendation, Metrics
26	Hyperparameter Tuning	CrossValidator, Train-validation split, ParamGrid, Model selection, Optimization
27	PySpark with Delta Lake	Delta tables, ACID transactions, Time travel, MERGE, Optimize & Vacuum
28	Debugging & Monitoring	Spark UI, Logs, Common errors, Debug strategies, Job analysis
29	Job Scheduling & Deployment	spark-submit, Config tuning, Scheduling, Parameterization, Automation
30	Real-world Use Cases	ETL pipelines, Streaming analytics, ML pipelines, Optimization patterns, Interview prep

Interview questions

Basic

What is PySpark?
What is Apache Spark?
Why is PySpark faster than MapReduce?
What are the components of Spark?
What is a SparkSession?
What is a DataFrame in PySpark?
Difference between RDD and DataFrame?
What is lazy evaluation?
What are transformations?
What are actions?
What is an RDD?
What languages are supported by Spark?
What is schema inference?
What is DBFS?
How do you read a CSV file in PySpark?
What is collect()?
What is show()?
What is repartition()?
What is coalesce()?
What is Spark SQL?
What is cache()?
What is persist()?
What is a cluster?
What is a driver?
What is an executor?

Intermediate

Difference between cache() and persist()?
Difference between repartition() and coalesce()?
What is a broadcast join?
What is shuffle in Spark?
What are narrow vs wide transformations?
What is lineage?
What is fault tolerance?
What is a partition?
How does Spark handle memory management?
What is the Catalyst Optimizer?
What is the Tungsten engine?
What is a Spark SQL execution plan?
What is explain()?
Difference between inner and outer join?
What is a window function?
What is a UDF?
Difference between UDF and built-in functions?
What is checkpointing?
What is serialization?
What is Kryo serialization?
What is an accumulator?
What is a broadcast variable?
What is PySpark SQL?
How do you handle null values?
How do you remove duplicates?

Advanced

How does Spark optimize joins?
What is Adaptive Query Execution (AQE)?
How do you handle data skew?
What is the salting technique?
What is bucketing?
Difference between bucketing and partitioning?
What is column pruning?
What is predicate pushdown?
What is data skipping?
What is whole-stage code generation?
How does Spark handle out-of-memory errors?
Explain task, stage, and job
What is speculative execution?
What is watermarking?
What is Structured Streaming?
Difference between DStream and Structured Streaming?
What is stateful streaming?
What is exactly-once semantics?
What is file compaction?
What is Z-ordering?
What is Delta Lake?
Difference between Parquet and ORC?
How do you optimize Spark SQL queries?
What is the Photon engine?
How do you tune Spark jobs?

Expert

Design a large-scale ETL pipeline using PySpark
How do you process billions of records efficiently?
How do you tune Spark for high concurrency?
How do you handle schema evolution?
How do you implement CDC in PySpark?
How do you debug slow Spark jobs?
How do you monitor PySpark applications?
How do you design fault-tolerant pipelines?
How do you secure sensitive data in PySpark?
How do you implement row-level security?
Explain memory tuning parameters
How do you reduce shuffle operations?
How do you optimize joins in large datasets?
Explain Spark internals with execution flow
How do you migrate legacy Spark jobs to PySpark?
How do you implement real-time streaming pipelines?
How do you manage dependencies in PySpark?
How do you handle late-arriving data?
Explain best practices for production PySpark
How do you build reusable PySpark frameworks?
How does PySpark interact with the JVM?
Explain the Py4J architecture
How do you handle the small files problem at scale?
Explain cost optimization strategies
End-to-end PySpark project explanation

#Databricks

Key Concepts

S.No	Topic	Sub-Topics
1	Databricks	What is Databricks, Lakehouse concept, Databricks vs Hadoop, Use cases, Architecture overview
2	Databricks Workspace	Workspace UI, Notebooks, Clusters, Jobs, Repos
3	Databricks Architecture	Control plane, Data plane, Workspace components, Security layers, Execution flow
4	Clusters in Databricks	All-purpose clusters, Job clusters, Autoscaling, Cluster policies, Init scripts
5	Databricks Runtime	DBR versions, Photon engine, ML runtime, GPU runtime, Performance tuning
6	Notebooks	Languages supported, Notebook workflows, Magic commands, Versioning, Collaboration
7	Databricks Utilities (dbutils)	File system ops, Secrets, Widgets, Notebook workflows, FS mounts
8	Data Ingestion	Batch ingestion, Streaming ingestion, Auto Loader, File formats, Schema inference
9	Delta Lake Fundamentals	ACID transactions, Delta log, Schema enforcement, Time travel, File compaction
10	Delta Lake Advanced	OPTIMIZE, Z-ORDER, Vacuum, Delta constraints, Change Data Feed
11	Spark SQL in Databricks	SQL editor, ANSI SQL, Views, CTEs, Query optimization
12	DataFrames & Datasets	API overview, Transformations, Actions, Lazy evaluation, Performance tips
13	Databricks SQL Warehouses	Serverless SQL, Query execution, Dashboards, Alerts, Access control
14	Jobs & Workflows	Job types, Task dependencies, Scheduling, Retries, Monitoring
15	Databricks Repos	Git integration, Branching, CI/CD basics, Repo permissions, Best practices
16	Security & Access Control	Users & groups, IAM integration, Table ACLs, Cluster policies, Secrets
17	Unity Catalog	Metastore, Catalogs & schemas, Data lineage, Fine-grained access, Auditing
18	Streaming with Databricks	Structured Streaming, Triggers, Watermarking, Stateful ops, Fault tolerance
19	Auto Loader	CloudFiles, Incremental ingestion, Schema evolution, Notifications, Performance tuning
20	Databricks ML Overview	ML workspace, ML runtime, Experiment tracking, Feature store, Model registry
21	MLflow in Databricks	Tracking, Projects, Models, Model registry, Deployment
22	Feature Store	Feature tables, Offline features, Online features, Reusability, Governance
23	Model Training	Distributed training, Hyperparameter tuning, AutoML, GPUs, Evaluation metrics
24	Model Deployment	Batch inference, Real-time serving, Model endpoints, A/B testing, Monitoring
25	Performance Optimization	Partitioning, Caching, Broadcast joins, Skew handling, Photon usage
26	Monitoring & Logging	Spark UI, Ganglia, Job metrics, Logs, Alerts
27	Cost Optimization	Cluster sizing, Spot instances, Autoscaling, Job clusters, Usage reports
28	Databricks on Cloud	AWS architecture, Azure architecture, GCP basics, Networking, Storage integration
29	CI/CD & DevOps	Repos + pipelines, Databricks CLI, Asset bundles, Environment promotion, Automation
30	Real-world Use Cases	ETL pipelines, Streaming analytics, ML pipelines, Lakehouse design, Interview prep

Interview questions

Basic

What is Databricks?
What problems does Databricks solve?
What is Apache Spark?
How is Databricks different from Apache Spark?
What are Databricks Workspaces?
What is a Databricks Cluster?
Types of clusters in Databricks?
What is a Notebook in Databricks?
Supported languages in Databricks?
What is DBFS?
Difference between DBFS and HDFS?
What is Delta Lake?
Advantages of Delta Lake?
What is a Databricks Job?
What is Auto-scaling?
What is Auto-termination?
What is a Databricks Runtime?
Difference between Standard and ML Runtime?
What is a DataFrame?
What is a Spark Session?
What is cell execution?
What is notebook revision history?
What is a magic command?
What is %sql in Databricks?
What is Unity Catalog?

Intermediate

What is a Delta table?
What is ACID compliance in Delta Lake?
What is schema enforcement?
What is schema evolution?
What is OPTIMIZE in Delta Lake?
What is Z-ORDER?
Difference between Managed and External tables?
What is Time Travel in Delta Lake?
How do you handle duplicate data in Databricks?
What is Databricks SQL?
Difference between Spark SQL and Databricks SQL?
What is a Job Cluster vs Interactive Cluster?
How does Databricks handle fault tolerance?
What is caching in Databricks?
What is a broadcast join?
What is a shuffle?
What is lazy evaluation?
Difference between RDD and DataFrame?
What is checkpointing?
How does Databricks integrate with cloud storage?
What is Structured Streaming?
Difference between batch and streaming?
What is watermarking?
What is MLflow?
What is Feature Store?

Advanced

Explain Databricks Lakehouse architecture
How does Delta Lake handle concurrent writes?
What is VACUUM in Delta Lake?
Explain Delta log (_delta_log)
How does Z-ORDER improve performance?
What is the Photon engine?
How does Photon improve query performance?
Explain cluster sizing strategy
How do you optimize Spark jobs in Databricks?
Explain adaptive query execution
What is cost optimization in Databricks?
What is data skipping?
Explain file compaction
How does Databricks handle skewed data?
What is the Unity Catalog security model?
Difference between table ACLs and Unity Catalog?
What is lineage in Databricks?
How do you manage secrets in Databricks?
What is the Databricks REST API?
How do you deploy code using Databricks Repos?
What is CI/CD in Databricks?
What is MLflow tracking?
What is the model registry?
Explain a real-time pipeline in Databricks
How do you handle late-arriving data?

Expert

Design an end-to-end Lakehouse architecture
How does Databricks ensure data governance at scale?
Explain multi-hop architecture (Bronze, Silver, Gold)
How do you design CDC pipelines in Databricks?
Explain Delta Live Tables (DLT)
Difference between DLT and normal pipelines?
How do you handle schema drift in production?
Explain exactly-once processing
How do you tune Spark for petabyte-scale data?
How does Photon compare with Spark Tungsten?
Explain Databricks serverless SQL
How do you secure PII data in Databricks?
How do you implement row-level and column-level security?
Explain workload isolation
How do you migrate from Hive to Databricks?
How do you monitor Databricks jobs?
Explain cost vs performance trade-offs
How do you manage large joins efficiently?
Explain Lakehouse vs Data Warehouse
What is the future roadmap of the Databricks platform?
How does Databricks support AI workloads?
Explain vector search in Databricks
How do you handle model versioning at scale?
Explain MLOps in Databricks
How do you design an enterprise-grade Databricks solution?

14 January 2026

#Joins & Aggregations

Key Concepts

S.No	Topic	Sub-Topics
1	Joins	What is a join, Types of joins, Importance, Examples, Use cases
2	Inner Join	Definition, Syntax, Example with RDD, Example with DataFrame, Performance considerations
3	Left Outer Join	Definition, Syntax, Example RDD, Example DataFrame, Handling nulls
4	Right Outer Join	Definition, Syntax, Example RDD, Example DataFrame, Use cases
5	Full Outer Join	Definition, Syntax, Example RDD, Example DataFrame, Null handling
6	Cross Join / Cartesian	Definition, Syntax, Example, Performance considerations, Use cases
7	Self Join	Definition, Syntax, Example RDD, Example DataFrame, Use cases
8	Broadcast Join	Definition, When to use, Example, Performance benefits, Spark configuration
9	Skewed Joins	Definition, Problems caused, Solutions, Salting technique, Performance tips
10	Join on Multiple Columns	Syntax, Example DataFrame, Example SQL, Performance considerations, Best practices
11	Key Considerations in Joins	Partitioning, Shuffling, Data size, Broadcast, Caching
12	Aggregation Overview	What is aggregation, Types, Importance, Syntax, Use cases
13	GroupBy	Definition, Syntax, Example RDD, Example DataFrame, Performance considerations
14	GroupByKey vs ReduceByKey	Definition, Syntax, Performance difference, Example, When to use
15	AggregateByKey	Definition, Syntax, Example, Custom aggregation functions, Performance
16	CountByKey & CountByValue	Definition, Syntax, Example RDD, Example DataFrame, Use cases
17	Sum, Max, Min Aggregations	Syntax, Example DataFrame, Example SQL, Performance, Best practices
18	Average & Mean Aggregations	Syntax, Example RDD, Example DataFrame, Handling nulls, Performance
19	Multiple Aggregations	agg() function, Syntax, Example DataFrame, Example SQL, Performance tips
20	Window Functions for Aggregation	Definition, Syntax, PartitionBy, OrderBy, Example
21	Rollup & Cube	Definition, Syntax, Example DataFrame, Use cases, Performance tips
22	Pivot Aggregations	Definition, Syntax, Example DataFrame, Example SQL, Use cases
23	Approximate Aggregations	approx_count_distinct(), approxQuantile(), Use cases, Syntax, Performance benefits
24	Custom Aggregations	User-defined aggregate functions (UDAF), Syntax, Example, Use cases, Performance tips
25	Combining Joins & Aggregations	Join then aggregate, Aggregate then join, Example DataFrame, SQL example, Best practices
26	Handling Nulls in Joins & Aggregations	Null handling functions, coalesce(), na.fill(), na.drop(), Example, Best practices
27	Optimizing Joins	Broadcast join, Partitioning, Caching, Skew handling, Shuffle reduction
28	Optimizing Aggregations	Partitioning, ReduceByKey, AggregateByKey, Caching, Avoid groupByKey for large data
29	Advanced Aggregation Techniques	Window functions, Rollup, Cube, Pivot, Custom UDAFs
30	Real-world Examples	ETL pipelines, Log analytics, Sales aggregation, Customer behavior analysis, Recommendations

Interview questions

Basic

What is a join in Spark?
What are the types of joins?
Explain inner join with an example.
Explain left outer join with an example.
Explain right outer join with an example.
Explain full outer join with an example.
What is a cross join or cartesian product?
What is a self join?
Difference between inner join and outer join.
Difference between left and right outer join.
Difference between full outer join and inner join.
How to perform a join on multiple columns?
What is a broadcast join?
When should you use broadcast join?
How does Spark handle join shuffles?
What are skewed joins?
How to handle nulls in joins?
What is a join on key-value RDDs?
Explain join using DataFrames.
Explain join using Spark SQL.
Difference between join on RDDs and DataFrames.
What is the role of partitioning in joins?
What is the impact of join on performance?
What is a cogroup operation?
When should you use cogroup over join?

Intermediate

Explain groupBy in Spark.
Explain reduceByKey in Spark.
Difference between groupBy and reduceByKey.
Explain aggregateByKey.
Explain combineByKey.
What is countByKey?
What is countByValue?
Explain sum, max, min aggregations.
Explain average and mean aggregation.
How to perform multiple aggregations?
Explain window functions for aggregation.
Explain rollup aggregation.
Explain cube aggregation.
Explain pivot aggregation.
Explain approx_count_distinct aggregation.
Explain approxQuantile aggregation.
What are user-defined aggregate functions (UDAFs)?
How to perform joins before aggregation?
How to perform aggregation before a join?
Explain groupBy with multiple columns.
Explain aggregateByKey vs reduceByKey.
Explain foldByKey for aggregation.
Explain subtractByKey.
Explain join optimizations in aggregations.
How to cache a join result before aggregation for better performance?

Advanced

Explain shuffle in join and aggregation.
Explain narrow vs wide dependencies in joins.
How does Spark optimize joins internally?
How does Spark optimize aggregations internally?
What is partitioning and its importance?
How does partitioning affect join performance?
How does partitioning affect aggregation performance?
Explain broadcast join with large datasets.
Explain handling skewed keys in joins.
Explain reduce-side join vs map-side join.
Explain join with multiple RDDs.
Explain aggregation with multiple RDDs.
Explain window-based aggregations.
Explain stateful aggregation in streaming joins.
Explain streaming joins vs batch joins.
Explain approximate aggregations for performance.
Explain advanced pivot operations.
Explain multi-level aggregations.
Explain hierarchical rollup and cube.
Explain combining joins and aggregations for ETL pipelines.
Explain memory and disk management in join operations.
Explain partition tuning for large aggregations.
Explain broadcast variable usage in aggregation.
Explain accumulators in aggregations.
Explain best practices for join + aggregation in Spark.

Expert

Explain shuffle optimization strategies in joins and aggregations.
Explain Spark Catalyst optimization for DataFrame joins.
Explain whole-stage code generation for joins & aggregations.
Explain Tungsten engine role in join and aggregation.
Explain join strategy selection: sort-merge vs broadcast hash join.
Explain adaptive query execution (AQE) in joins.
Explain skew join handling in AQE.
Explain incremental aggregations for streaming data.
Explain joins and aggregations in Structured Streaming.
Explain checkpointing in streaming aggregations.
Explain watermarking for joins in streaming.
Explain stateful streaming aggregations.
Explain memory tuning for large join + aggregation operations.
Explain tuning shuffle partitions for large datasets.
Explain optimizing multi-stage aggregation pipelines.
Explain combining wide & narrow transformations with aggregations.
Explain caching strategies for repeated joins and aggregations.
Explain fault-tolerance mechanisms in joins & aggregations.
Explain Spark UI metrics related to joins and aggregations.
Explain advanced use cases: ETL, analytics dashboards, ML pipelines.
Explain differences in join behavior between RDD, DataFrame, and Dataset APIs.
Explain join performance tuning in distributed clusters.
Explain aggregation performance tuning in distributed clusters.
Explain UDAF optimization techniques.
Explain real-world examples combining joins and aggregations.

#Spark Streaming

Key Concepts

S.No	Topic	Sub-Topics
1	Spark Streaming	What is Spark Streaming, Real-time data, Micro-batch processing, Advantages, Use cases
2	Spark Streaming Architecture	Driver, Receiver, DStream, Scheduler, Executors
3	DStream Basics	Definition, Creation, Operations, RDDs, Transformations
4	Creating DStreams	From sources: Kafka, Flume, TCP sockets, File streams, Custom receivers
5	Transformations on DStreams	map(), flatMap(), filter(), reduceByKey(), window()
6	Window Operations	window(), slideDuration, reduceByKeyAndWindow(), countByValueAndWindow(), Examples
7	Stateful Transformations	updateStateByKey(), mapWithState(), Example, Use cases, Performance
8	Actions on DStreams	print(), count(), saveAsTextFiles(), foreachRDD(), Examples
9	Data Sources Integration	Kafka, Flume, HDFS, Socket, Custom sources
10	Sinks / Output Operations	print(), saveAsTextFiles(), saveAsObjectFiles(), foreachRDD(), write to DB
11	Checkpointing	Definition, Directory setup, Purpose, Examples, Fault tolerance
12	Receiver Types	Reliable receiver, Unreliable receiver, Custom receiver, Receiver lifecycle, Examples
13	Transformations: map vs flatMap	map(), flatMap(), Use cases, Examples, Differences
14	Transformations: reduceByKey	reduceByKey(), reduceByKeyAndWindow(), Examples, Use cases, Performance
15	Transformations: join in streaming	join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin(), Example
16	Transformations: union & transform	union(), transform(), Example, Use cases, Combining multiple streams
17	Handling Late Data	Watermarks, Window operations, State management, Dropping late data, Examples
18	Kafka Integration	DirectStream vs ReceiverStream, Kafka parameters, Offset management, Example, Best practices
19	Flume Integration	Spark Streaming + Flume, Push vs Pull, Receiver setup, Example, Best practices
20	File Stream Source	HDFS integration, Local files, Monitoring new files, Examples, Performance considerations
21	Structured Streaming Introduction	Differences from DStream, High-level API, DataFrames & Datasets, Fault-tolerance, Example
22	Structured Streaming Sources	Kafka, File, Socket, Rate source, Custom sources
23	Structured Streaming Sinks	Console, File, Kafka, ForeachBatch, Memory
24	Event Time & Watermarks	Definition, Handling late data, withWatermark(), Examples, Use cases
25	Window Operations in Structured Streaming	window(), slideDuration, groupBy window(), Examples, Performance tips
26	Stateful Operations in Structured Streaming	mapGroupsWithState(), flatMapGroupsWithState(), Examples, Use cases, Performance
27	Performance Tuning	Batch interval, Partitioning, Backpressure, Checkpointing, Resource tuning
28	Fault Tolerance & Reliability	Checkpointing, Write-ahead logs, Replay, Receiver reliability, Structured Streaming guarantees
29	Monitoring & Debugging	Spark UI, Streaming metrics, Logs, Executor monitoring, Performance tuning
30	Real-world Examples	Log analytics, IoT data processing, Real-time dashboards, Clickstream analysis, Recommendations

Interview questions

Basic

What is Spark Streaming?
Explain real-time data processing.
What is a micro-batch in Spark Streaming?
Difference between batch and streaming.
What is a DStream?
How is a DStream created?
What are the basic DStream transformations?
What are the basic DStream actions?
Explain map() transformation in streaming.
Explain flatMap() transformation in streaming.
Explain filter() transformation in streaming.
Explain reduceByKey() transformation in streaming.
Explain count() action in streaming.
Explain print() action in streaming.
How to read from a socket stream?
How to read from a file stream?
Difference between reliable and unreliable receivers.
What is the role of the driver in Spark Streaming?
What is the role of executors in streaming?
How is batch interval configured?
What is the default checkpointing mechanism?
How do you stop a streaming context?
Explain foreachRDD() action.
What is the Spark Streaming UI?
Explain the use cases of Spark Streaming.

Intermediate

Explain window operations in Spark Streaming.
What is slide interval?
Difference between window duration and slide duration.
Explain reduceByKeyAndWindow().
Explain countByValueAndWindow().
What are stateful transformations?
Explain updateStateByKey().
Explain mapWithState().
How do you integrate Spark Streaming with Kafka?
What is DirectKafkaStream?
What is Receiver-based Kafka stream?
How do you handle offsets in Kafka?
Explain Spark Streaming integration with Flume.
Explain push-based vs pull-based Flume integration.
How to read from HDFS in streaming?
How to read from S3 in streaming?
Explain streaming file source options.
Explain output operations: saveAsTextFiles().
Explain output operations: saveAsObjectFiles().
Explain output operations: foreachRDD() to database.
Explain fault tolerance in Spark Streaming.
What are write-ahead logs (WAL)?
Explain receiver reliability.
Explain backpressure mechanism in Spark Streaming.
What is the role of batch scheduling?

Advanced

Explain Structured Streaming.
Difference between DStream API and Structured Streaming API.
What are the sources in Structured Streaming?
What are the sinks in Structured Streaming?
Explain event-time processing.
Explain watermarks in streaming.
How to handle late data using watermarks?
Explain streaming aggregation.
Explain window aggregation in Structured Streaming.
Explain stateful aggregations.
Explain mapGroupsWithState().
Explain flatMapGroupsWithState().
Explain join operations in streaming.
Explain stream-stream join vs stream-static join.
Explain stream-stream outer joins.
Explain checkpointing in Structured Streaming.
Explain exactly-once semantics in streaming.
Explain output modes: append, complete, update.
Explain processing-time triggers.
Explain continuous processing mode.
Explain schema inference in streaming.
Explain custom sources in Structured Streaming.
Explain foreachBatch() in Structured Streaming.
Explain streaming aggregation with watermarking.
Explain performance tuning for Structured Streaming.

Expert

Explain the state store in Structured Streaming.
Explain recovery from failures in streaming.
Explain backpressure in Structured Streaming.
Explain memory and executor tuning for streaming.
Explain shuffle optimization in streaming joins.
Explain handling skewed streaming data.
Explain checkpointing and lineage recovery.
Explain streaming aggregation optimizations.
Explain watermarks with multiple streams.
Explain latency vs throughput trade-offs.
Explain using Kafka offsets with checkpointing.
Explain exactly-once vs at-least-once delivery.
Explain stateful streaming performance tuning.
Explain streaming joins with large datasets.
Explain stream-stream join optimization.
Explain integrating streaming with machine learning.
Explain handling late-arriving events.
Explain multi-window aggregations.
Explain Structured Streaming with event time vs processing time.
Explain monitoring streaming jobs with Spark UI.
Explain streaming metrics and logs.
Explain resource allocation and dynamic scaling.
Explain memory spill and disk management in streaming.
Explain streaming ETL pipelines.
Explain real-world streaming applications and case studies.

#Transformations & Actions

Key Concepts

S.No	Topic	Sub-Topics
1	Transformations & Actions	Definition, Lazy Evaluation, DAG concept, Execution flow, Why separation matters
2	Narrow vs Wide Transformations	Definition, Examples, Shuffle impact, Performance difference, Use cases
3	map()	Syntax, One-to-one mapping, Use cases, Performance, Examples
4	flatMap()	One-to-many mapping, Differences from map, Use cases, Examples, Performance
5	filter()	Predicate logic, Data reduction, Optimization tips, Examples, Use cases
6	select() / withColumn()	Column selection, Column creation, Expressions, Performance tips, Examples
7	union() & distinct()	Combining datasets, Removing duplicates, Shuffle behavior, Use cases, Examples
8	groupBy()	Grouping logic, Aggregation basics, Shuffle impact, Examples, Best practices
9	reduceByKey()	Key-based reduction, Map-side aggregation, Performance benefits, Examples, Comparison
10	groupByKey()	Working principle, Memory impact, Comparison with reduceByKey, Examples, When to avoid
11	sortBy() & orderBy()	Sorting logic, Asc/Desc order, Shuffle cost, Examples, Optimization tips
12	join() Basics	Inner join, Join condition, Execution flow, Examples, Common issues
13	Advanced Join Types	Left, Right, Full, Semi, Anti joins, Use cases, Examples
14	Broadcast Join	Concept, When to use, Memory impact, SQL hint, Examples
15	repartition() & coalesce()	Partition control, Shuffle behavior, Performance impact, Use cases, Examples
16	cache() & persist()	Storage levels, Memory vs disk, When to cache, Examples, Pitfalls
17	count()	Action trigger, Job creation, Performance considerations, Examples, Use cases
18	collect()	Driver memory risk, Small data usage, Examples, Best practices, Alternatives
19	show() & take()	Preview data, Execution behavior, Limit handling, Examples, Usage tips
20	save() & write()	Output formats, File systems, Partition output, Modes, Examples
21	foreach() & foreachPartition()	Side effects, External systems, Performance difference, Examples, Best practices
22	Window Functions	Over clause, Partition by, Order by, Use cases, Examples
23	Actions vs Transformations	Comparison, Execution timing, DAG role, Interview questions, Examples
24	Shuffle Internals	When shuffle occurs, Cost factors, Optimization, Examples, Debugging
25	Performance Optimization	Avoid wide ops, Partition sizing, Caching strategy, Examples, Tips
26	Error Handling	Bad records, Null handling, Try-catch logic, Data validation, Examples
27	Spark UI Analysis	Jobs tab, Stages tab, Task metrics, Shuffle read/write, Debugging
28	Real-world ETL Flow	Transform chain design, Action placement, Optimization, Examples, Best practices
29	Interview Scenarios	Common questions, Tricky cases, Performance questions, Sample answers, Tips
30	Hands-on Mini Project	End-to-end pipeline, Transformations usage, Actions usage, Optimization, Review

Interview questions

Basic

What is a transformation in Spark?
What is an action in Spark?
Difference between transformations and actions?
What is lazy evaluation?
What is an RDD?
How do you create an RDD?
What is parallelize() in Spark?
What is textFile() in SparkContext?
Explain map() transformation.
Explain filter() transformation.
What is flatMap()?
Explain distinct() transformation.
What does union() do?
Explain intersection() transformation.
Explain subtract() transformation.
What is cartesian() transformation?
What is collect() action?
Explain count() action.
Explain first() action.
Explain take(n) action.
Explain reduce() action.
Explain fold() action.
Explain aggregate() action.
Explain takeOrdered() action.
Explain top() action.

Intermediate

Explain groupByKey() transformation.
Explain reduceByKey() transformation.
Difference between groupByKey() and reduceByKey().
Explain aggregateByKey() transformation.
Explain combineByKey() transformation.
What are pair RDDs?
Explain mapValues() transformation.
Explain flatMapValues() transformation.
Explain keys() and values() transformations.
Explain lookup() action on pair RDDs.
Explain joins: join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin().
Explain cogroup() transformation.
Explain sortByKey() transformation.
Explain sortBy() transformation.
Difference between sortByKey() and sortBy().
Explain repartition() transformation.
Explain coalesce() transformation.
Difference between repartition() and coalesce().
Explain sample() transformation.
Explain sampleByKey() and sampleByKeyExact().
What is takeSample()?
Explain randomSplit() transformation.
Explain the cache() and persist() methods.
Explain unpersist() method.
Explain checkpointing and its use case.

Advanced

Explain narrow vs wide transformations.
Difference between narrow and wide dependencies.
What is shuffle in Spark?
How does shuffle affect performance?
Explain partitioning and its importance.
Explain HashPartitioner and RangePartitioner.
Explain repartitionAndSortWithinPartitions().
Explain the role of DAG scheduler in transformations.
Explain stages and tasks in Spark execution.
Explain lazy evaluation and lineage.
How are transformations optimized internally?
Explain the difference between map() and mapPartitions().
Explain foreach() and foreachPartition() actions.
How to handle skewed data in transformations?
Explain broadcast variables and their usage.
Explain accumulators and their usage.
Explain foldByKey() transformation.
Explain subtractByKey() transformation.
Explain join optimizations for pair RDDs.
Explain caching strategies for iterative algorithms.
Explain checkpointing vs caching.
Explain transformations on DataFrames compared to RDDs.
Explain mapPartitionsWithIndex() transformation.
Explain wide transformations and task parallelism.
Explain best practices for memory management during transformations.

Expert

Explain spark.sql.shuffle.partitions and its impact.
Explain narrow dependency scheduling optimizations.
How does Spark handle task failures during actions?
Explain lineage and recomputation during RDD failure.
Explain transformations in Structured Streaming.
Explain actions in Structured Streaming.
How to optimize joins on large datasets?
Explain partition tuning for large-scale RDDs.
Explain avoiding shuffles with map-side reductions.
Explain caching strategies for MLlib pipelines.
Explain the difference between cache() and persist(StorageLevel.MEMORY_AND_DISK).
Explain how Spark plans tasks for wide transformations.
Explain the difference between reduceByKey() and aggregateByKey() performance.
How to handle skewed keys in joins?
Explain Spark UI metrics related to transformations and actions.
Explain how stage boundaries are created in wide transformations.
Explain partition coalescing to reduce shuffle.
Explain RDD lineage graph and fault tolerance.
Explain advanced join strategies: broadcast join, shuffle hash join.
Explain the difference between DataFrame and RDD transformations.
Explain transformation optimization by Catalyst (for DataFrames).
Explain whole-stage code generation for DataFrame transformations.
Explain streaming aggregation and its fault tolerance.
Explain stateful transformations in DStreams (Spark Streaming).
Explain advanced tuning techniques for actions on huge datasets.

Key Concepts

S.No	Topic	Sub-Topics
1	Apache Spark & DataFrames	Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases
2	Spark Setup & Environment	Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics
3	SparkSession & Entry Points	SparkSession creation, SQLContext, HiveContext, Config options, Best practices
4	Creating DataFrames	From files, From RDD, From collections, Schema inference, Explicit schema
5	DataFrame Schema & Data Types	StructType, StructField, Primitive types, Complex types, Schema evolution
6	Reading Data Sources	CSV, JSON, Parquet, ORC, Avro basics
7	Writing DataFrames	Save modes, Partitioning, Bucketing, File formats, Compression
8	DataFrame Basic Operations	select, withColumn, drop, filter, where
9	Column Operations	Column expressions, alias, cast, when/otherwise, lit
10	Row Operations & Actions	show, collect, take, count, first
11	DataFrame Functions	Built-in functions, String functions, Date functions, Math functions, Null handling
12	Filtering & Conditional Logic	filter vs where, isin, like, rlike, case when
13	Sorting & Deduplication	orderBy, sort, distinct, dropDuplicates, Sorting optimization
14	Aggregation & Grouping	groupBy, agg, count, sum, avg
15	Joins in DataFrames	Inner join, Left/Right join, Full join, Semi/Anti join
16	Join Optimization	Broadcast join, Shuffle join, Join hints, Skew handling, AQE
17	Handling Missing & Bad Data	dropna, fillna, replace, Null checks, Data validation
18	Window Functions	Window spec, row_number, rank, lead/lag, Running totals
19	UDF & UDAF	UDF creation, Performance impact, Pandas UDF, Serialization, Best practices
20	DataFrame Caching & Persistence	cache, persist, Storage levels, Memory vs disk, When to cache
21	Spark SQL with DataFrames	Temp views, Global views, SQL queries, Mixing SQL & DF, Optimization
22	Partitioning & Repartitioning	repartition, coalesce, Partition pruning, File partitioning, Performance tuning
23	Performance Optimization Basics	Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE
24	DataFrame Execution Plan	Logical plan, Physical plan, explain(), DAG, Stage breakdown
25	Handling Large Datasets	Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling
26	Integration with Hive	Hive tables, External tables, Metastore, Partitioned tables, Hive SQL
27	Streaming DataFrames (Structured Streaming)	Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers
28	Error Handling & Debugging	Common errors, Serialization issues, Logging, Debug tools, Retry strategies
29	Best Practices & Design Patterns	Code structure, Reusability, Performance patterns, Anti-patterns, Testing
30	Real-world Use Cases & Projects	ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review

Interview question

Basic Level

What is a DataFrame in Apache Spark?
How is a DataFrame different from an RDD?
What are the advantages of DataFrames?
Is DataFrame immutable?
What is SparkSession?
How do you create a DataFrame in Spark?
What are the different data sources supported by DataFrames?
What is a schema in DataFrame?
How can you infer schema automatically?
How do you define a custom schema?
What is show() in DataFrame?
What is printSchema()?
What is select() in DataFrame?
What is withColumn()?
Difference between withColumn() and select()?
What is filter() / where() in DataFrame?
Difference between filter() and where()?
What is limit()?
What is collect()?
What is count()?
What is distinct()?
What is drop()?
What is dropDuplicates()?
What is alias()?
How do you rename a column in DataFrame?

Intermediate Level

What are DataFrame transformations?
What are DataFrame actions?
What is lazy evaluation in DataFrames?
What is explain() in DataFrame?
What is logical plan?
What is physical plan?
What is Catalyst Optimizer?
What is Tungsten execution engine?
What is column expression?
What are built-in Spark SQL functions?
Difference between UDF and built-in functions?
What is UDF?
Performance impact of UDFs?
What is groupBy()?
What is agg()?
Difference between groupBy and window functions?
What is orderBy()?
Difference between orderBy() and sort()?
What is join() in DataFrame?
What are the types of joins in Spark?
What is inner join?
What is left outer join?
What is right outer join?
What is full outer join?
What is cross join?

Advanced Level

What is broadcast join?
When does Spark automatically use broadcast join?
What is shuffle join?
How does Spark handle join optimization?
What is partitioning in DataFrame?
Difference between repartition() and coalesce()?
What is bucketing in Spark?
Difference between bucketing and partitioning?
What is window function?
Explain row_number(), rank(), dense_rank().
What is caching in DataFrame?
Difference between cache() and persist()?
What storage levels are available?
What is checkpointing in DataFrame?
Difference between cache and checkpoint?
What is skewed data?
How to handle data skew in joins?
What is salting technique?
What is adaptive query execution (AQE)?
What are Spark SQL hints?
What is explain(true)?
How to optimize wide transformations?
What is column pruning?
What is predicate pushdown?
What is vectorized reader?

Expert Level

Why are DataFrames faster than RDDs?
How does Catalyst optimize DataFrame queries?
How does Tungsten improve performance?
What happens internally when a DataFrame action is triggered?
How does Spark generate optimized bytecode?
What is whole-stage code generation?
What is off-heap memory?
How does Spark handle memory management for DataFrames?
How does AQE change execution plans at runtime?
How do you debug slow DataFrame jobs?
How do you analyze Spark UI for DataFrame jobs?
What are common DataFrame performance anti-patterns?
Why is excessive use of withColumn() discouraged?
How do you design efficient Spark SQL pipelines?
What are limitations of DataFrames?
How do DataFrames handle schema evolution?
How does DataFrame support semi-structured data?
What is explode() function?
What are from_json() and to_json()?
How to handle nested columns efficiently?
What is Delta Lake DataFrame integration?
How does DataFrame handle ACID properties?
What is cost-based optimization (CBO)?
How do you tune Spark SQL configurations?
Explain real-time production use cases of DataFrames.

12 January 2026

#JUnit

Key Concepts

S.No	Topic	Sub-Topics
1	Introduction to JUnit	What is JUnit?, Importance of Unit Testing, History of JUnit, Versions overview, Use cases
2	JUnit Architecture	Core classes, Test runners, Test lifecycle, Annotations overview, Test suites
3	JUnit 4 vs JUnit 5	Key differences, Annotations, Assertions, Extension model, Migration strategies
4	JUnit Annotations	@Test, @Before, @After, @BeforeClass, @AfterClass
5	JUnit 5 Annotations	@Test, @BeforeEach, @AfterEach, @BeforeAll, @AfterAll
6	Assertions in JUnit	assertEquals, assertTrue, assertFalse, assertNotNull, assertThrows
7	Parameterized Tests	Introduction, @ParameterizedTest, @ValueSource, @CsvSource, Custom parameter providers
8	JUnit Test Suites	Purpose, Creating test suites, Including multiple classes, @Suite annotation, Running suites
9	Exception Testing	assertThrows, Expected exceptions, Handling exceptions in tests, Try-catch in tests, Best practices
10	Timeouts in Tests	Using @Test(timeout), assertTimeout, assertTimeoutPreemptively, Long-running tests, Best practices
11	Assumptions in JUnit	assumeTrue, assumeFalse, Conditional test execution, Environment-specific tests, Integration with CI
12	Test Lifecycle Methods	Setup and teardown, @BeforeEach/@AfterEach, @BeforeAll/@AfterAll, Resource management, Best practices
13	Nested Tests	Introduction, @Nested annotation, Structuring tests, Inner classes, Scope and lifecycle
14	Tagging Tests	@Tag annotation, Grouping tests, Running specific tags, Excluding tags, Integration with CI/CD
15	JUnit Extensions	Introduction, @ExtendWith annotation, Custom extensions, Parameter resolvers, Test lifecycle hooks
16	Mocking with Mockito	Mockito basics, @Mock, @InjectMocks, when-thenReturn, Verifying interactions
17	JUnit with Spring Boot	@SpringBootTest, @WebMvcTest, @MockBean, Context loading, Integration tests
18	Behavior Driven Testing	Introduction to BDD, JUnit + Cucumber, Feature files, Step definitions, Integration examples
19	Testing Exceptions and Edge Cases	Edge case identification, Boundary testing, assertThrows, Negative testing, Best practices
20	JUnit Test Reports	Generating reports, Maven Surefire plugin, Gradle reports, HTML reports, CI integration
21	Mocking Static Methods	Mockito inline, PowerMockito, Limitations, Use cases, Best practices
22	Parameterized and CSV Tests	@CsvSource, @CsvFileSource, @MethodSource, Dynamic tests, Practical examples
23	Dynamic Tests	@TestFactory, DynamicTest.stream, Custom dynamic tests, Use cases, Best practices
24	Integration Testing with JUnit	Introduction, Database tests, REST API testing, Spring integration, Environment setup
25	Code Coverage	JaCoCo integration, Measuring coverage, Analyzing reports, Coverage thresholds, Best practices
26	Continuous Integration	JUnit in CI/CD, Jenkins integration, GitHub Actions, Pipeline setup, Reporting
27	Best Practices in JUnit	Writing clean tests, DRY principle, Readable assertions, Test naming conventions, Test isolation
28	Debugging Unit Tests	Using IDE debugger, Common failures, Stack traces, Logging in tests, Fixing flaky tests
29	Advanced Assertions	assertAll, assertIterableEquals, assertLinesMatch, assertTimeout, Custom assertions
30	JUnit Projects & Labs	Hands-on projects, Full coverage examples, Spring Boot testing, CI/CD integration, Practice exercises

Interview question

1. JUnit Basics

What is JUnit and why is it used?
Explain the differences between JUnit 4 and JUnit 5.
What are the advantages of using JUnit in Java projects?
How do you write your first JUnit test case?
What is the naming convention for test methods?
Explain the role of the @Test annotation.
What is the default test runner in JUnit?
How do you disable a test in JUnit?
What are assumptions in JUnit?
Explain JUnit's role in TDD (Test-Driven Development).
What is the difference between unit tests and integration tests?
How do you set up a JUnit environment in a Maven project?
Can JUnit be used for testing private methods?
What is the default order of test execution in JUnit?
How do you test void methods in JUnit?
How do you skip tests conditionally?
What is the difference between JUnit and TestNG?
Explain the concept of test lifecycle in JUnit.
What is the purpose of @DisplayName in JUnit 5?
How do you tag and filter tests in JUnit?

2. Annotations

Explain the usage of @BeforeEach and @AfterEach.
What is the purpose of @BeforeAll and @AfterAll?
How do you create a setup method in JUnit?
Difference between @BeforeClass (JUnit 4) and @BeforeAll (JUnit 5).
What happens if @BeforeAll is not static?
Can you use multiple annotations on the same method?
How do you use @Disabled in JUnit 5?
What is @RepeatedTest in JUnit 5?
Explain @Nested test classes.
How is @TestFactory used in dynamic tests?
Explain @Tag annotation with examples.
What is the difference between @Test and @ParameterizedTest?
What is @ExtendWith used for?
How do you use @TempDir in JUnit?
What does @Timeout do in JUnit 5?
Can you annotate constructors in JUnit with @BeforeEach?
Explain @EnabledOnOs and @EnabledOnJre annotations.
How do you handle conditional test execution using annotations?
What is @Order used for in JUnit tests?
Can annotations be customized in JUnit?

3. Assertions

What is an assertion in JUnit?
Difference between assertEquals and assertSame.
How do you test for exceptions using assertions?
Explain the usage of assertTrue and assertFalse.
What are assertNull and assertNotNull?
How do you compare arrays in JUnit?
Explain assertAll with example.
What is the difference between fail() and assertThrows()?
How do you write custom assertions?
What is the purpose of Hamcrest in JUnit assertions?
Explain the difference between Hamcrest and AssertJ.
How do you use assertLinesMatch in JUnit?
What is assertIterableEquals used for?
How do you write assertions for collections?
Explain assertTimeout and assertTimeoutPreemptively.
What is the difference between soft and hard assertions?
Can you write assertions for floating-point numbers?
How do you compare objects in JUnit tests?
Explain the usage of assertDoesNotThrow.
What happens when an assertion fails?

4. Parameterized Tests

What is a parameterized test in JUnit?
How do you write a parameterized test using @ValueSource?
Difference between @CsvSource and @CsvFileSource.
How do you test with enum values in JUnit?
How do you write parameterized tests with @MethodSource?
Explain how arguments are resolved in parameterized tests.
How do you test with multiple parameters?
What are custom argument providers?
How do you reuse test data across parameterized tests?
What is the advantage of parameterized tests?
How do you use ArgumentsAccessor?
What is @ArgumentsSource annotation?
How do you test edge cases with parameterized tests?
Explain the difference between parameterized tests in JUnit 4 vs JUnit 5.
Can parameterized tests be combined with @BeforeEach?
What happens when parameterized test data is invalid?
How do you test null inputs in parameterized tests?
What are some best practices for parameterized testing?
How do you test with complex objects?
Can you combine parameterized and dynamic tests?

5. Test Suites

What is a test suite in JUnit?
How do you run multiple test classes together?
Explain @SelectPackages annotation.
Explain @SelectClasses annotation.
How do you include/exclude tests in a suite?
Can test suites be nested?
Difference between the JUnit 4 Suite runner (@RunWith(Suite.class)) and the JUnit 5 suite engine.
How do you run test suites from Maven?
How do you integrate test suites with Gradle?
What is the benefit of test suites?
Can you use filters with test suites?
How do you configure test discovery?
How do you execute only tagged tests in a suite?
What happens if a suite includes disabled tests?
How do you execute JUnit 4 suites inside JUnit 5?
What is @IncludeClassNamePatterns?
What is @ExcludeTags used for?
Explain discovery selectors in JUnit.
What are suite-level lifecycle methods?
How do you create custom suite runners?

6. Mockito & JUnit

What is Mockito?
How do you create a mock in JUnit?
What is the purpose of @Mock annotation?
How do you use @InjectMocks in tests?
Explain stubbing in Mockito.
What is verify() used for?
How do you reset a mock?
How do you use ArgumentCaptor?
What is the difference between spy() and mock()?
How do you mock exceptions in JUnit tests?
Can you mock static methods with Mockito?
How do you mock final classes?
What is the difference between real and mock objects?
How do you mock collections?
How do you handle void methods in Mockito?
What is the difference between @MockBean and @Mock?
How do you mock private methods?
Explain deep stubs in Mockito.
How do you verify the number of interactions?
What are common pitfalls in using mocks?

7. Spring & JUnit Integration

How do you write a test with @SpringBootTest?
What is @MockBean used for?
Explain the difference between @Mock and @MockBean.
How do you load application context in JUnit?
How do you test Spring MVC controllers?
What is @DataJpaTest used for?
How do you test REST APIs using MockMvc?
Explain @WebMvcTest annotation.
How do you test Spring Boot configuration classes?
How do you handle transactions in Spring tests?
How do you use @TestConfiguration?
What is @AutoConfigureMockMvc?
How do you test caching in Spring?
Explain the use of @Sql annotation in testing.
How do you test services with external dependencies?
What is TestEntityManager used for?
How do you test asynchronous methods in Spring?
Explain how to use TestRestTemplate.
How do you test application events in Spring?
What is the role of @ActiveProfiles in testing?

8. Exception Testing

How do you test exceptions in JUnit 4?
How do you test exceptions in JUnit 5?
Explain the usage of assertThrows.
What is ExpectedException in JUnit 4?
How do you test custom exceptions?
How do you verify exception messages?
How do you test multiple exceptions in a single method?
Can you test checked and unchecked exceptions differently?
How do you test exceptions in parameterized tests?
What is the difference between fail() and assertThrows()?
How do you test runtime exceptions?
How do you test for null pointer exceptions?
How do you combine assertions with exception testing?
How do you log exceptions during tests?
How do you suppress exceptions in JUnit?
How do you create reusable exception assertions?
What happens when no exception is thrown in assertThrows?
Can you test exceptions in dynamic tests?
How do you handle exception hierarchies?
Explain common pitfalls in exception testing.

9. Test Execution & Reports

How do you run JUnit tests from Eclipse/IntelliJ?
How do you run tests using Maven?
How do you run tests using Gradle?
How do you execute tests from the command line?
How do you run a single test class?
How do you run a single test method?
How do you generate XML test reports?
How do you generate HTML test reports?
What is the Surefire plugin in Maven?
How do you configure test execution in Gradle?
How do you run tests in parallel?
How do you configure timeout for all tests?
How do you rerun failed tests?
How do you ignore flaky tests?
How do you integrate JUnit tests with Jenkins?
How do you configure reporting plugins?
How do you publish test reports in CI/CD?
How do you measure code coverage with JUnit?
How do you generate JaCoCo reports?
What are best practices in test reporting?

10. JUnit Extensions & Best Practices

What is the extension model in JUnit 5?
How do you write a simple JUnit extension?
What is @ExtendWith used for?
What built-in extensions are available in JUnit 5?
How do you use the TempDir extension?
How do you use the Timeout extension?
What is ExtensionContext in JUnit?
How do you chain multiple extensions?
How do you share state across extensions?
How do you implement a logging extension?
What are some best practices in writing unit tests?
How do you name your test methods effectively?
What is the Arrange-Act-Assert pattern?
How do you avoid flaky tests?
How do you organize test packages?
How do you reuse common test data?
How do you make tests maintainable?
What are some anti-patterns in testing?
How do you improve performance of test suites?
What are enterprise-level strategies for JUnit testing?

Key Concepts

S.No	Topic	Sub-Topics
1	Multithreading	What is Thread, Process vs Thread, Benefits of Multithreading, Applications, Thread Lifecycle Overview
2	Thread Class	Creating Thread by extending Thread, start(), run(), sleep(), join(), getName()
3	Runnable Interface	Implementing Runnable, Passing to Thread, Advantages, run() vs start(), Lambda Runnable
4	Thread Lifecycle	New, Runnable, Blocked, Waiting, Timed Waiting, Terminated, Thread State Transitions
5	Thread Methods	setName/getName, setPriority/getPriority, isAlive(), yield(), interrupt()
6	Thread Priority	Min/Max/Normal Priority, setPriority, Thread Scheduling, Preemption, Fairness
7	Thread Sleep & Join	sleep(), join(), wait vs sleep, timed join, practical examples
8	Thread Communication	wait(), notify(), notifyAll(), producer-consumer basics, synchronized block
9	Synchronized Methods	Method-level sync, block-level sync, object lock, class-level lock, best practices
10	Inter-thread Communication	Producer-Consumer Problem, BlockingQueue, wait/notify, ReentrantLock with Condition, deadlock prevention
11	Reentrant Locks	Lock interface, ReentrantLock, tryLock(), lockInterruptibly(), fairness, conditions
12	Deadlock	What is Deadlock, Conditions, Prevention, Avoidance, Detection, Recovery
13	Starvation & Livelock	Starvation, Livelock, Examples, Priority Inversion, Solutions
14	Thread Safety	Definition, Thread-safe classes, Immutable Objects, Synchronization, Atomic variables
15	Atomic Classes	AtomicInteger, AtomicLong, AtomicReference, compareAndSet, use cases
16	Volatile Keyword	What is volatile, visibility, happens-before, example usage, memory consistency
17	Concurrent Collections	ConcurrentHashMap, CopyOnWriteArrayList, BlockingQueue, ConcurrentSkipListMap, benefits
18	Executor Framework	Executor, ExecutorService, ThreadPoolExecutor, ScheduledExecutorService, shutdown
19	Thread Pools	FixedPool, CachedPool, SingleThreadPool, ScheduledPool, Advantages
20	Callable & Future	Callable Interface, Future, submit(), get(), timeout handling, cancelling tasks
21	ForkJoin Framework	ForkJoinPool, RecursiveTask, RecursiveAction, work-stealing, parallel computation
22	Parallel Streams	Stream API, parallel(), ForkJoin usage, performance tips, pitfalls
23	ThreadLocal	ThreadLocal variables, usage, memory leak, InheritableThreadLocal, examples
24	Synchronization Utilities	CountDownLatch, CyclicBarrier, Semaphore, Phaser, Exchanger
25	Deadlock Prevention Patterns	Lock Ordering, TryLock, Timeout, Avoid Nested Locks, Resource hierarchy
26	Best Practices	Minimize synchronized code, prefer high-level concurrency, immutable objects, use executor, avoid busy wait
27	Performance Tuning	Thread pool sizing, contention reduction, CPU-bound vs IO-bound, measuring, profiling
28	Common Concurrency Bugs	Race conditions, deadlocks, livelocks, visibility issues, fixes
29	Real-world Examples	Producer-Consumer app, Web server handling requests, parallel processing, async tasks, thread-safe cache
30	Interview & Revision	Key methods, concurrency concepts, common pitfalls, multithreading Q&A, mini projects

Interview question

Basic Level

What is a Thread?
What is Multithreading?
What is the difference between Process and Thread?
What are the advantages of Multithreading?
What is Context Switching?
What is Thread Lifecycle?
What are the different Thread States?
How do you create a Thread in Java?
What is the Runnable interface?
What is the difference between Thread and Runnable?
What is the start() method?
What is the run() method?
Why should we not directly call run()?
What is Thread Priority?
What is Daemon Thread?
How do you create a Daemon Thread?
What is the purpose of the join() method?
What is the sleep() method used for?
What is thread scheduling?
What is time slicing?
What is thread safety?
What is synchronization?
What is a synchronized method?
What is a synchronized block?
What is the volatile keyword?

Intermediate Level

What is Inter-thread communication?
What are wait(), notify(), notifyAll() used for?
Why must wait/notify be called inside synchronized block?
What is a race condition?
What is deadlock?
How do you avoid deadlock?
What is livelock?
What is starvation?
What is a monitor in Java?
What is reentrant synchronization?
What is a ThreadGroup?
What is ThreadLocal?
What is the Executor framework?
What is ExecutorService?
What is a ThreadPool?
What is Callable?
What is Future?
What is FutureTask?
What is ScheduledExecutorService?
What is a RejectedExecutionHandler?
What is a BlockingQueue?
What is the difference between synchronized and Lock?
What is ReentrantLock?
What is ReadWriteLock?
What is Condition interface?

Advanced Level

What is Fork/Join framework?
What is Work Stealing Algorithm?
What is ConcurrentHashMap?
How does ConcurrentHashMap achieve thread safety?
What is CopyOnWriteArrayList?
What is CAS (Compare and Swap)?
What are Atomic classes?
What is the difference between Atomic and volatile?
What is StampedLock?
What is Phaser?
What is CyclicBarrier?
What is CountDownLatch?
What is Semaphore?
What is Exchanger?
What is ThreadPoolExecutor?
How does ThreadPoolExecutor manage threads internally?
What is ForkJoinPool?
What is parallel stream?
How does parallel stream work internally?
What is thread contention?
What is false sharing?
How do you debug concurrency issues?
What is Memory Consistency Error?
What is Happens-Before relationship?
What is Java Memory Model?

Expert Level

How to design highly scalable multithreaded systems?
What are lock-free algorithms?
What are wait-free algorithms?
What is the difference between blocking vs non-blocking algorithms?
How do you reduce lock contention?
How does JVM handle thread scheduling internally?
What are advanced optimizations in modern JVM for concurrency?
Explain the internals of synchronized keyword.
Explain biased locking.
Explain lightweight locking.
What is escape analysis?
How does JIT optimize multithreaded code?
How do you detect deadlocks in production?
How do you avoid deadlocks using ordering strategies?
How do you tune thread pools for high throughput?
What is backpressure in multithreading systems?
How do you design producer-consumer systems at scale?
How do you build custom thread pools?
How do you test multithreaded code effectively?
What is the role of memory barriers in concurrency?
How do you ensure safe publication of objects?
What is double-checked locking?
Why was double-checked locking broken before Java 5?
How do you build lock-free data structures?
How do reactive systems differ from traditional multithreading?

11 January 2026

#Scikit

Key Concepts

S.No	Topic	Sub-Topics
1	Scikit-learn	What is scikit-learn, Installation, Key features, ML workflow, Supported algorithms
2	Scikit-learn API Basics	Estimators, fit(), predict(), transform(), Pipelines, Model persistence
3	Data Loading & Inspection	Built-in datasets, load_* and fetch_* functions, Data shapes, Feature names, Target variables
4	Data Preprocessing	Scaling, Normalization, Encoding categorical data, Missing values, Feature transformation
5	Feature Scaling Techniques	StandardScaler, MinMaxScaler, RobustScaler, Normalizer, When to scale
6	Handling Missing Data	SimpleImputer, Strategies, Missing indicators, Pipeline usage, Best practices
7	Encoding Categorical Variables	LabelEncoder, OneHotEncoder, OrdinalEncoder, Handling unknowns, Sparse output
8	Train-Test Split	train_test_split, Stratification, Random state, Data leakage, Validation sets
9	Linear Regression	LinearRegression, Assumptions, Coefficients, Evaluation metrics, Use cases
10	Logistic Regression	Binary vs multiclass, Regularization, Solver options, Class weights, Evaluation
11	Model Evaluation Metrics	Accuracy, Precision, Recall, F1-score, Confusion matrix
12	Cross-Validation	K-Fold, StratifiedKFold, cross_val_score, cross_validate, Bias-variance tradeoff
13	k-Nearest Neighbors	KNN classifier, KNN regressor, Distance metrics, Choosing K, Performance
14	Support Vector Machines	SVC, SVR, Kernels, Hyperparameters, Margin maximization
15	Decision Trees	Tree structure, Gini vs entropy, Overfitting, Pruning, Feature importance
16	Ensemble Learning	Bagging, Boosting, Random Forest, Extra Trees, Voting classifiers
17	Random Forest	RandomForestClassifier, Hyperparameters, Feature importance, OOB score, Use cases
18	Gradient Boosting	GradientBoosting, XGBoost intro, LightGBM intro, Learning rate, Tree depth
19	Naive Bayes	GaussianNB, MultinomialNB, BernoulliNB, Assumptions, Applications
20	Clustering Algorithms	KMeans, Hierarchical clustering, DBSCAN, Silhouette score, Use cases
21	Dimensionality Reduction	PCA, Kernel PCA, Explained variance, Feature compression, Visualization
22	Anomaly Detection	Isolation Forest, One-Class SVM, LOF, Use cases, Evaluation challenges
23	Model Selection & Tuning	GridSearchCV, RandomizedSearchCV, Hyperparameters, Scoring, Best estimators
24	Pipelines & ColumnTransformer	Pipeline, Feature unions, ColumnTransformer, End-to-end ML, Avoid leakage
25	Imbalanced Datasets	Class imbalance, SMOTE, Class weights, Evaluation metrics, Best practices
26	Text Feature Extraction	CountVectorizer, TF-IDF, N-grams, Stop words, Sparse matrices
27	Model Persistence	joblib, pickle, Saving models, Loading models, Versioning
28	Model Interpretation	Coefficients, Feature importance, Permutation importance, Partial dependence, SHAP intro
29	Scikit-learn with Pipelines in Production	Reproducibility, Monitoring, Data drift, Model updates, Best practices
30	Scikit-learn Best Practices	Code structure, Experiment tracking, Documentation, Common pitfalls, Next steps

Interview question

Basic Level

What is scikit-learn?
What type of library is scikit-learn?
Which language is scikit-learn written in?
What are estimators in scikit-learn?
What is the fit() method?
What is the predict() method?
Difference between fit() and transform()?
What is supervised learning?
What is unsupervised learning?
What is train_test_split?
What are features and labels?
What is a dataset in scikit-learn?
What are built-in datasets?
What is accuracy score?
What is a confusion matrix?
What is overfitting?
What is underfitting?
What is a regression problem?
What is a classification problem?
What is clustering?
What is scaling?
What is normalization?
What is LabelEncoder?
What is OneHotEncoder?
What are model parameters?

Intermediate Level

What is StandardScaler?
Difference between MinMaxScaler and StandardScaler?
What is logistic regression?
Explain linear regression in scikit-learn.
What is KNN?
How does KNN work?
What is SVM?
What are kernels in SVM?
What is decision tree?
What are entropy and the Gini index?
What is Random Forest?
What is ensemble learning?
What is cross-validation?
What is K-Fold cross-validation?
What is StratifiedKFold?
What is GridSearchCV?
What is RandomizedSearchCV?
What are hyperparameters?
What is bias-variance tradeoff?
What is ROC curve?
What is AUC?
What are precision and recall?
What is F1-score?
What is feature importance?
What is PCA?
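
The cross-validation and hyperparameter questions above (K-Fold, StratifiedKFold, GridSearchCV, hyperparameters) can be illustrated with a minimal sketch. The KNN classifier, the iris dataset, and the candidate values for n_neighbors are assumptions picked only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validation: one accuracy score per fold for a fixed hyperparameter.
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=cv, scoring="accuracy")
print("fold scores:", scores, "mean:", scores.mean())

# GridSearchCV: try every candidate value and keep the best by mean CV score.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}  # illustrative candidates
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_, "best score:", search.best_score_)
```

RandomizedSearchCV follows the same fit() pattern but samples a fixed number of candidates instead of trying them all, which tends to be cheaper when the grid is large.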

Advanced Level

How does PCA work internally?
What is explained variance?
Difference between PCA and LDA?
What is Gradient Boosting?
Difference between Bagging and Boosting?
What is AdaBoost?
What is Isolation Forest?
What is DBSCAN?
How does KMeans clustering work?
What is silhouette score?
What is feature selection?
Difference between feature selection and extraction?
What is Recursive Feature Elimination?
What is a pipeline in scikit-learn?
Why are pipelines important?
What is ColumnTransformer?
How to handle categorical features?
How does scikit-learn handle missing values?
What is SimpleImputer?
What is model persistence?
Difference between pickle and joblib?
What is a partial dependence plot?
What is permutation importance?
How to avoid data leakage?
How to handle imbalanced datasets?
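
For the questions above on pipelines, ColumnTransformer, SimpleImputer, categorical features, and data leakage, one possible arrangement is sketched below. The toy DataFrame, its column names (age, city, bought), and the choice of median imputation are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with gaps, one categorical column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, 29, np.nan],
    "city": ["A", "B", "A", "B", "B", "C", "A", "C"],
    "bought": [0, 1, 0, 1, 1, 1, 0, 0],
})
X, y = df[["age", "city"]], df["bought"]

# Numeric branch: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# ColumnTransformer routes each column to the preprocessing it needs.
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Wrapping preprocessing and model in one Pipeline means every step is
# fitted only on the data passed to fit(), which helps avoid leakage
# when the pipeline is used inside cross-validation.
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(X, y)
preds = clf.predict(X)

# A category never seen in training is encoded as all zeros, not an error.
new_row = pd.DataFrame({"age": [40.0], "city": ["Z"]})
new_pred = clf.predict(new_row)
print(preds, new_pred)
```

Because the whole object behaves like a single estimator, it can be passed directly to cross_val_score or GridSearchCV.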

Expert Level

How does the scikit-learn architecture work?
Explain the estimator, transformer, and predictor design.
How does scikit-learn optimize performance?
What is warm_start?
How does scikit-learn use NumPy internally?
What are sparse matrices?
How does scikit-learn handle sparse data?
What is SGDClassifier?
Difference between batch and online learning?
How to scale scikit-learn for large datasets?
What are the limitations of scikit-learn?
Difference between scikit-learn and TensorFlow?
Difference between scikit-learn and PyTorch?
How to integrate scikit-learn with pandas?
What is a custom estimator?
How to implement a custom transformer?
What is the scoring parameter?
How to evaluate regression models?
What is R² score?
What is model drift?
How to monitor models in production?
What is reproducibility in ML?
How to set random_state?
Explain numerical stability issues.
What are best practices in scikit-learn?
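
The custom estimator and custom transformer questions above can be made concrete with a small sketch. The class name ClipOutliers and its lower/upper percentile parameters are hypothetical, invented for this example; only BaseEstimator and TransformerMixin come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to percentile bounds."""

    def __init__(self, lower=1.0, upper=99.0):
        # By convention __init__ only stores hyperparameters, unchanged.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention.
        self.lower_bounds_ = np.percentile(X, self.lower, axis=0)
        self.upper_bounds_ = np.percentile(X, self.upper, axis=0)
        return self  # returning self allows method chaining

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


# Usage on synthetic data with one injected outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0, 0] = 50.0

clipper = ClipOutliers(lower=5, upper=95)
Xt = clipper.fit_transform(X)  # fit_transform is provided by TransformerMixin
print(clipper.get_params())    # get_params is provided by BaseEstimator
print(Xt[0, 0], clipper.upper_bounds_[0])
```

Following the fit/transform contract like this is what lets such a class be placed inside a Pipeline or tuned with GridSearchCV alongside built-in steps.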

Key Concepts

S.No	Topic	Sub-Topics
1	Pandas	Overview, Installation, Series, DataFrame, Basic operations
2	Series Basics	Creating Series, Indexing, Slicing, Series methods, Data types
3	DataFrame Basics	Create DataFrame, Index/Columns, Shape, dtypes, head/tail
4	Data Selection	loc, iloc, ix (removed in pandas 1.0), column selection, row selection
5	Data Filtering	Boolean indexing, conditions, isin, between, query()
6	Missing Data	isnull, notnull, fillna, dropna, interpolation
7	Data Cleaning	Duplicates, rename, replace, strip whitespace, type conversion
8	Data Transformation	apply, map, applymap (renamed DataFrame.map in pandas 2.1), lambda functions, vectorized operations
9	Aggregation & Grouping	groupby, aggregate, transform, filter, pivot tables
10	Sorting & Ranking	sort_values, sort_index, rank, ascending/descending, multi-level sorting
11	Indexing & MultiIndex	set_index, reset_index, hierarchical index, slicing, cross-section
12	Concatenation & Merging	concat, append (removed in pandas 2.0; use concat), merge, join, indicator
13	Reshaping Data	melt, pivot, stack, unstack, wide to long format
14	Time Series Basics	Datetime conversion, date_range, indexing, resampling, frequency
15	Time Series Advanced	rolling, expanding, shifting, lag/lead, moving average
16	String Operations	str methods, contains, replace, split, regex
17	Visualization with Pandas	plot, line, bar, histogram, scatter
18	Reading/Writing Data	read_csv, read_excel, read_json, to_csv, to_excel
19	Advanced I/O	read_sql, read_parquet, read_hdf, read_pickle, compression
20	Exploratory Data Analysis	describe, info, value_counts, correlation, unique
21	Multi-Column Operations	arithmetic, apply, assign, lambda, broadcasting
22	Window Functions	rolling, expanding, ewm, groupby with window, custom functions
23	Categorical Data	category dtype, conversion, codes, sorting, filtering
24	Sampling & Subsetting	sample, head/tail, nth, slicing, random sampling
25	Performance Optimization	vectorization, eval/query, categorical, chunking, memory usage
26	MultiIndex Advanced	stack/unstack, xs, swaplevel, sortlevel, indexing tricks
27	Custom Functions	apply, pipe, lambda, function chaining, reusable utilities
28	Integration with NumPy & SciPy	array operations, broadcasting, linear algebra, statistical functions, interoperability
29	Real World Data Projects	EDA, cleaning, aggregation, visualization, export results
30	End-to-End Project	Data collection, cleaning, analysis, feature engineering, visualization
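
A few of the topics in the table above (loc/iloc selection, boolean filtering, fillna, groupby, and concat in place of the older append) can be sketched in a handful of lines. The DataFrame contents, city names, and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing value.
df = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
        "sales": [100.0, np.nan, 150.0, 200.0, 50.0],
    },
    index=["a", "b", "c", "d", "e"],
)

# iloc selects by integer position, loc selects by label.
first_row = df.iloc[0]
row_c_sales = df.loc["c", "sales"]

# Missing data: fill the gap with the column mean (one of several options).
filled = df.assign(sales=df["sales"].fillna(df["sales"].mean()))

# Boolean indexing keeps only the rows where the condition is True.
high = filled[filled["sales"] > 100]

# groupby + aggregation: total sales per city.
totals = filled.groupby("city")["sales"].sum()

# concat stacks DataFrames; it covers what DataFrame.append used to do.
combined = pd.concat([df, df.iloc[:1]])

print(row_c_sales)
print(high)
print(totals)
print(len(combined))
```

The same operations scale to larger frames because they are vectorized, which is generally preferable to row-by-row apply with a lambda.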
