19 January 2026

#Amazon EC2

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#Amazon RDS

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#Amazon EMR

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#AWS Lambda

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#Amazon EKS / ECS

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#Amazon SageMaker

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


#Spark Core

Key Concepts


S.No Topic Sub-Topics

Interview question


Related Topics


  • SparkContext
  • Components
  • DAG

18 January 2026

#Amazon Redshift

#Amazon S3

#Amazon Kinesis

#AWS Glue

#PySpark

Key Concepts


S.No | Topic | Sub-Topics
1 | PySpark | What is PySpark, Spark ecosystem, PySpark vs Pandas, Use cases, Installation & setup
2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs/Stages/Tasks, Execution flow
3 | SparkSession & Context | SparkSession, SparkContext, Configurations, Application lifecycle, Best practices
4 | RDD Fundamentals | RDD creation, Transformations, Actions, Persistence, RDD vs DataFrame
5 | RDD Advanced Operations | Narrow vs wide ops, Shuffle, Accumulators, Broadcast variables, Performance tuning
6 | DataFrame Introduction | DataFrame API, Creating DataFrames, Schema inference, show/select, DataFrame vs RDD
7 | DataFrame Transformations | select, filter, withColumn, drop, cast & rename
8 | Data Sources & Formats | CSV, JSON, Parquet, ORC, Avro
9 | Schema Management | StructType, StructField, Explicit schema, Schema evolution, Corrupt records
10 | Built-in Functions | String functions, Date functions, Math functions, Conditional logic, Null handling
11 | Joins in PySpark | Inner join, Left/Right join, Full join, Broadcast join, Join optimization
12 | Aggregations | groupBy, agg, count/sum/avg, rollup, cube
13 | Window Functions | Window spec, row_number, rank/dense_rank, lead/lag, Running totals
14 | Sorting & Partitioning | orderBy, sortWithinPartitions, repartition, coalesce, Data skew basics
15 | Spark SQL | Temp views, Global views, SQL queries, CTEs, SQL vs DataFrame API
16 | User Defined Functions | Python UDF, Pandas UDF, Serialization cost, When to avoid UDF, Alternatives
17 | Performance Optimization | Caching, Persist levels, Broadcast joins, File sizing, Best practices
18 | Partition & File Optimization | Partition pruning, Bucketing, Small file problem, Compression, Skew handling
19 | PySpark with Hive | Hive metastore, Managed tables, External tables, Partitioned tables, Hive SQL
20 | Structured Streaming Basics | Streaming concepts, Micro-batching, Sources, Sinks, Checkpointing
21 | Streaming Operations | Triggers, Output modes, Watermarking, Late data, Fault tolerance
22 | Streaming Aggregations | Windowed aggregation, Stateful ops, Stream joins, Exactly-once semantics, Recovery
23 | MLlib Overview | Transformers, Estimators, Pipelines, Evaluators, Model lifecycle
24 | Feature Engineering | StringIndexer, OneHotEncoder, VectorAssembler, Scaling, Feature selection
25 | ML Algorithms | Regression, Classification, Clustering, Recommendation, Metrics
26 | Hyperparameter Tuning | CrossValidator, Train-validation split, ParamGrid, Model selection, Optimization
27 | PySpark with Delta Lake | Delta tables, ACID transactions, Time travel, MERGE, Optimize & Vacuum
28 | Debugging & Monitoring | Spark UI, Logs, Common errors, Debug strategies, Job analysis
29 | Job Scheduling & Deployment | spark-submit, Config tuning, Scheduling, Parameterization, Automation
30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Optimization patterns, Interview prep

Interview question

Basic

  • What is PySpark?
  • What is Apache Spark?
  • Why is PySpark faster than MapReduce?
  • What are the components of Spark?
  • What is a SparkSession?
  • What is a DataFrame in PySpark?
  • Difference between RDD and DataFrame?
  • What is lazy evaluation?
  • What are transformations?
  • What are actions?
  • What is an RDD?
  • What languages are supported by Spark?
  • What is schema inference?
  • What is DBFS?
  • How do you read a CSV file in PySpark?
  • What is collect()?
  • What is show()?
  • What is repartition()?
  • What is coalesce()?
  • What is Spark SQL?
  • What is cache()?
  • What is persist()?
  • What is a cluster?
  • What is a driver?
  • What is an executor?

Intermediate

  • Difference between cache() and persist()?
  • Difference between repartition() and coalesce()?
  • What is a broadcast join?
  • What is shuffle in Spark?
  • What are narrow vs wide transformations?
  • What is lineage?
  • What is fault tolerance?
  • What is a partition?
  • How does Spark handle memory management?
  • What is the Catalyst Optimizer?
  • What is the Tungsten engine?
  • What is a Spark SQL execution plan?
  • What is explain()?
  • Difference between inner and outer join?
  • What is a window function?
  • What is a UDF?
  • Difference between UDF and built-in functions?
  • What is checkpointing?
  • What is serialization?
  • What is Kryo serialization?
  • What is an accumulator?
  • What is a broadcast variable?
  • What is PySpark SQL?
  • How do you handle null values?
  • How do you remove duplicates?

Advanced

  • How does Spark optimize joins?
  • What is Adaptive Query Execution (AQE)?
  • How do you handle data skew?
  • What is the salting technique?
  • What is bucketing?
  • Difference between bucketing and partitioning?
  • What is column pruning?
  • What is predicate pushdown?
  • What is data skipping?
  • What is whole-stage code generation?
  • How does Spark handle out-of-memory errors?
  • Explain task, stage, and job
  • What is speculative execution?
  • What is watermarking?
  • What is Structured Streaming?
  • Difference between DStream and Structured Streaming?
  • What is stateful streaming?
  • What is exactly-once semantics?
  • What is file compaction?
  • What is Z-ordering?
  • What is Delta Lake?
  • Difference between Parquet and ORC?
  • How do you optimize Spark SQL queries?
  • What is the Photon engine?
  • How do you tune Spark jobs?

Expert

  • Design a large-scale ETL pipeline using PySpark
  • How do you process billions of records efficiently?
  • How do you tune Spark for high concurrency?
  • How do you handle schema evolution?
  • How do you implement CDC in PySpark?
  • How do you debug slow Spark jobs?
  • How do you monitor PySpark applications?
  • How do you design fault-tolerant pipelines?
  • How do you secure sensitive data in PySpark?
  • How do you implement row-level security?
  • Explain memory tuning parameters
  • How do you reduce shuffle operations?
  • How do you optimize joins in large datasets?
  • Explain Spark internals with execution flow
  • How do you migrate legacy Spark jobs to PySpark?
  • How do you implement real-time streaming pipelines?
  • How do you manage dependencies in PySpark?
  • How do you handle late-arriving data?
  • Explain best practices for production PySpark
  • How do you build reusable PySpark frameworks?
  • How does PySpark interact with JVM?
  • Explain Py4J architecture
  • How do you handle the small files problem at scale?
  • Explain cost optimization strategies
  • End-to-end PySpark project explanation
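
The salting question above can be pictured without a cluster. The following is a minimal sketch in plain Python, not Spark code: `partition_of` is a hypothetical stand-in for a hash partitioner, and `NUM_PARTITIONS`, `SALT_BUCKETS`, and the sample keys are assumptions chosen for illustration. It only shows how appending a random salt to a hot key spreads its records over several partitions.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # hypothetical number of shuffle partitions
SALT_BUCKETS = 8     # hypothetical number of salt values per key


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (an assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many records land in each partition."""
    return Counter(partition_of(k) for k in keys)


# Skewed input: a single hot key accounts for 90% of the records.
keys = ["hot"] * 9_000 + [f"user_{i}" for i in range(1_000)]

# Salting: append a random suffix so one logical key becomes several physical keys.
rng = random.Random(42)
salted = [f"{k}#{rng.randrange(SALT_BUCKETS)}" for k in keys]

unsalted_max = max(partition_sizes(keys).values())   # dominated by the hot key
salted_max = max(partition_sizes(salted).values())   # hot key spread over partitions

print("largest partition without salt:", unsalted_max)
print("largest partition with salt:   ", salted_max)
```

In a real join the other side has to be replicated once per salt value so that every salted key still finds its match, and the salt is stripped again after the join or aggregation.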

Related Topics


#Databricks

Key Concepts


S.No | Topic | Sub-Topics
1 | Databricks | What is Databricks, Lakehouse concept, Databricks vs Hadoop, Use cases, Architecture overview
2 | Databricks Workspace | Workspace UI, Notebooks, Clusters, Jobs, Repos
3 | Databricks Architecture | Control plane, Data plane, Workspace components, Security layers, Execution flow
4 | Clusters in Databricks | All-purpose clusters, Job clusters, Autoscaling, Cluster policies, Init scripts
5 | Databricks Runtime | DBR versions, Photon engine, ML runtime, GPU runtime, Performance tuning
6 | Notebooks | Languages supported, Notebook workflows, Magic commands, Versioning, Collaboration
7 | Databricks Utilities (dbutils) | File system ops, Secrets, Widgets, Notebook workflows, FS mounts
8 | Data Ingestion | Batch ingestion, Streaming ingestion, Auto Loader, File formats, Schema inference
9 | Delta Lake Fundamentals | ACID transactions, Delta log, Schema enforcement, Time travel, File compaction
10 | Delta Lake Advanced | OPTIMIZE, Z-ORDER, Vacuum, Delta constraints, Change Data Feed
11 | Spark SQL in Databricks | SQL editor, ANSI SQL, Views, CTEs, Query optimization
12 | DataFrames & Datasets | API overview, Transformations, Actions, Lazy evaluation, Performance tips
13 | Databricks SQL Warehouses | Serverless SQL, Query execution, Dashboards, Alerts, Access control
14 | Jobs & Workflows | Job types, Task dependencies, Scheduling, Retries, Monitoring
15 | Databricks Repos | Git integration, Branching, CI/CD basics, Repo permissions, Best practices
16 | Security & Access Control | Users & groups, IAM integration, Table ACLs, Cluster policies, Secrets
17 | Unity Catalog | Metastore, Catalogs & schemas, Data lineage, Fine-grained access, Auditing
18 | Streaming with Databricks | Structured Streaming, Triggers, Watermarking, Stateful ops, Fault tolerance
19 | Auto Loader | CloudFiles, Incremental ingestion, Schema evolution, Notifications, Performance tuning
20 | Databricks ML Overview | ML workspace, ML runtime, Experiment tracking, Feature store, Model registry
21 | MLflow in Databricks | Tracking, Projects, Models, Model registry, Deployment
22 | Feature Store | Feature tables, Offline features, Online features, Reusability, Governance
23 | Model Training | Distributed training, Hyperparameter tuning, AutoML, GPUs, Evaluation metrics
24 | Model Deployment | Batch inference, Real-time serving, Model endpoints, A/B testing, Monitoring
25 | Performance Optimization | Partitioning, Caching, Broadcast joins, Skew handling, Photon usage
26 | Monitoring & Logging | Spark UI, Ganglia, Job metrics, Logs, Alerts
27 | Cost Optimization | Cluster sizing, Spot instances, Autoscaling, Job clusters, Usage reports
28 | Databricks on Cloud | AWS architecture, Azure architecture, GCP basics, Networking, Storage integration
29 | CI/CD & DevOps | Repos + pipelines, Databricks CLI, Asset bundles, Environment promotion, Automation
30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Lakehouse design, Interview prep

Interview question

Basic

  • What is Databricks?
  • What problems does Databricks solve?
  • What is Apache Spark?
  • How is Databricks different from Apache Spark?
  • What are Databricks Workspaces?
  • What is a Databricks Cluster?
  • Types of clusters in Databricks?
  • What is a Notebook in Databricks?
  • Supported languages in Databricks?
  • What is DBFS?
  • Difference between DBFS and HDFS?
  • What is Delta Lake?
  • Advantages of Delta Lake?
  • What is a Databricks Job?
  • What is Auto-scaling?
  • What is Auto-termination?
  • What is a Databricks Runtime?
  • Difference between Standard and ML Runtime?
  • What is a DataFrame?
  • What is a Spark Session?
  • What is cell execution?
  • What is notebook revision history?
  • What is a magic command?
  • What is %sql in Databricks?
  • What is Unity Catalog?

Intermediate

  • What is a Delta table?
  • What is ACID compliance in Delta Lake?
  • What is schema enforcement?
  • What is schema evolution?
  • What is OPTIMIZE in Delta Lake?
  • What is Z-ORDER?
  • Difference between Managed and External tables?
  • What is Time Travel in Delta Lake?
  • How do you handle duplicate data in Databricks?
  • What is Databricks SQL?
  • Difference between Spark SQL and Databricks SQL?
  • What is a Job Cluster vs Interactive Cluster?
  • How does Databricks handle fault tolerance?
  • What is caching in Databricks?
  • What is a broadcast join?
  • What is Shuffle?
  • What is lazy evaluation?
  • Difference between RDD and DataFrame?
  • What is checkpointing?
  • How does Databricks integrate with cloud storage?
  • What is Structured Streaming?
  • Difference between batch and streaming?
  • What is watermarking?
  • What is MLflow?
  • What is Feature Store?

Advanced

  • Explain Databricks Lakehouse architecture
  • How does Delta Lake handle concurrent writes?
  • What is VACUUM in Delta Lake?
  • Explain Delta log (_delta_log)
  • How does Z-ORDER improve performance?
  • What is the Photon engine?
  • How does Photon improve query performance?
  • Explain cluster sizing strategy
  • How do you optimize Spark jobs in Databricks?
  • Explain adaptive query execution
  • What is cost optimization in Databricks?
  • What is data skipping?
  • Explain file compaction
  • How does Databricks handle skewed data?
  • What is the Unity Catalog security model?
  • Difference between table ACLs and Unity Catalog?
  • What is lineage in Databricks?
  • How do you manage secrets in Databricks?
  • What is the Databricks REST API?
  • How do you deploy code using Databricks Repos?
  • What is CI/CD in Databricks?
  • What is MLflow tracking?
  • What is the model registry?
  • Explain a real-time pipeline in Databricks
  • How do you handle late-arriving data?

Expert

  • Design an end-to-end Lakehouse architecture
  • How does Databricks ensure data governance at scale?
  • Explain multi-hop architecture (Bronze, Silver, Gold)
  • How do you design CDC pipelines in Databricks?
  • Explain Delta Live Tables (DLT)
  • Difference between DLT and normal pipelines?
  • How do you handle schema drift in production?
  • Explain exactly-once processing
  • How do you tune Spark for petabyte-scale data?
  • How does Photon compare with Spark Tungsten?
  • Explain Databricks serverless SQL
  • How do you secure PII data in Databricks?
  • How do you implement row-level and column-level security?
  • Explain workload isolation
  • How do you migrate from Hive to Databricks?
  • How do you monitor Databricks jobs?
  • Explain cost vs performance trade-offs
  • How do you manage large joins efficiently?
  • Explain Lakehouse vs Data Warehouse
  • What is the future roadmap of the Databricks platform?
  • How does Databricks support AI workloads?
  • Explain vector search in Databricks
  • How do you handle model versioning at scale?
  • Explain MLOps in Databricks
  • How do you design an enterprise-grade Databricks solution?
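
The multi-hop (Bronze, Silver, Gold) question above can be sketched with plain Python lists standing in for tables. This is a toy illustration under stated assumptions, not Databricks or Delta Lake code: the sample records and the helper names `to_silver` and `to_gold` are hypothetical.

```python
# Bronze: raw events exactly as ingested (hypothetical sample data).
bronze = [
    {"id": 1, "user": "a", "amount": "10.5"},
    {"id": 1, "user": "a", "amount": "10.5"},   # duplicate delivery
    {"id": 2, "user": "b", "amount": "oops"},   # malformed amount
    {"id": 3, "user": "a", "amount": "4.5"},
]


def to_silver(rows):
    """Silver: enforce types and drop duplicate ids (first occurrence wins)."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue                      # skip records that fail validation
        if row["id"] in seen:
            continue                      # skip repeated deliveries
        seen.add(row["id"])
        clean.append({"id": row["id"], "user": row["user"], "amount": amount})
    return clean


def to_gold(rows):
    """Gold: business-level aggregate, here the total amount per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Each hop only reads the layer before it, so a bad Silver rule can be fixed and replayed from Bronze without re-ingesting the source.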

Related Topics


14 January 2026

#Joins & Aggregations

Key Concepts


S.No | Topic | Sub-Topics
1 | Joins | What is a join, Types of joins, Importance, Examples, Use cases
2 | Inner Join | Definition, Syntax, Example with RDD, Example with DataFrame, Performance considerations
3 | Left Outer Join | Definition, Syntax, Example RDD, Example DataFrame, Handling nulls
4 | Right Outer Join | Definition, Syntax, Example RDD, Example DataFrame, Use cases
5 | Full Outer Join | Definition, Syntax, Example RDD, Example DataFrame, Null handling
6 | Cross Join / Cartesian | Definition, Syntax, Example, Performance considerations, Use cases
7 | Self Join | Definition, Syntax, Example RDD, Example DataFrame, Use cases
8 | Broadcast Join | Definition, When to use, Example, Performance benefits, Spark configuration
9 | Skewed Joins | Definition, Problems caused, Solutions, Salting technique, Performance tips
10 | Join on Multiple Columns | Syntax, Example DataFrame, Example SQL, Performance considerations, Best practices
11 | Key Considerations in Joins | Partitioning, Shuffling, Data size, Broadcast, Caching
12 | Aggregation Overview | What is aggregation, Types, Importance, Syntax, Use cases
13 | GroupBy | Definition, Syntax, Example RDD, Example DataFrame, Performance considerations
14 | GroupByKey vs ReduceByKey | Definition, Syntax, Performance difference, Example, When to use
15 | AggregateByKey | Definition, Syntax, Example, Custom aggregation functions, Performance
16 | CountByKey & CountByValue | Definition, Syntax, Example RDD, Example DataFrame, Use cases
17 | Sum, Max, Min Aggregations | Syntax, Example DataFrame, Example SQL, Performance, Best practices
18 | Average & Mean Aggregations | Syntax, Example RDD, Example DataFrame, Handling nulls, Performance
19 | Multiple Aggregations | agg() function, Syntax, Example DataFrame, Example SQL, Performance tips
20 | Window Functions for Aggregation | Definition, Syntax, PartitionBy, OrderBy, Example
21 | Rollup & Cube | Definition, Syntax, Example DataFrame, Use cases, Performance tips
22 | Pivot Aggregations | Definition, Syntax, Example DataFrame, Example SQL, Use cases
23 | Approximate Aggregations | approxCountDistinct(), approxQuantile(), Use cases, Syntax, Performance benefits
24 | Custom Aggregations | User-defined aggregate functions (UDAF), Syntax, Example, Use cases, Performance tips
25 | Combining Joins & Aggregations | Join then aggregate, Aggregate then join, Example DataFrame, SQL example, Best practices
26 | Handling Nulls in Joins & Aggregations | Null handling functions, coalesce(), fill(), drop(), Example, Best practices
27 | Optimizing Joins | Broadcast join, Partitioning, Caching, Skew handling, Shuffle reduction
28 | Optimizing Aggregations | Partitioning, ReduceByKey, AggregateByKey, Caching, Avoid groupByKey for large data
29 | Advanced Aggregation Techniques | Window functions, Rollup, Cube, Pivot, Custom UDAFs
30 | Real-world Examples | ETL pipelines, Log analytics, Sales aggregation, Customer behavior analysis, Recommendations

Interview question

Basic

  • What is a join in Spark?
  • What are the types of joins?
  • Explain inner join with an example.
  • Explain left outer join with an example.
  • Explain right outer join with an example.
  • Explain full outer join with an example.
  • What is a cross join or cartesian product?
  • What is a self join?
  • Difference between inner join and outer join.
  • Difference between left and right outer join.
  • Difference between full outer join and inner join.
  • How to perform join on multiple columns?
  • What is a broadcast join?
  • When should you use broadcast join?
  • How does Spark handle join shuffles?
  • What are skewed joins?
  • How to handle nulls in joins?
  • How does a join work on key-value (pair) RDDs?
  • Explain join using DataFrames.
  • Explain join using Spark SQL.
  • Difference between join on RDDs and DataFrames.
  • What is the role of partitioning in joins?
  • What is the impact of join on performance?
  • What is a cogroup operation?
  • When should you use cogroup over join?

Intermediate

  • Explain groupBy in Spark.
  • Explain reduceByKey in Spark.
  • Difference between groupBy and reduceByKey.
  • Explain aggregateByKey.
  • Explain combineByKey.
  • What is countByKey?
  • What is countByValue?
  • Explain sum, max, min aggregations.
  • Explain average and mean aggregation.
  • How to perform multiple aggregations?
  • Explain window functions for aggregation.
  • Explain rollup aggregation.
  • Explain cube aggregation.
  • Explain pivot aggregation.
  • Explain approxCountDistinct aggregation.
  • Explain approxQuantile aggregation.
  • What are user-defined aggregate functions (UDAFs)?
  • How to perform joins before aggregation?
  • How to perform aggregation after join?
  • Explain groupBy with multiple columns.
  • Explain aggregateByKey vs reduceByKey.
  • Explain foldByKey for aggregation.
  • Explain subtractByKey.
  • Explain join optimizations in aggregations.
  • How to use caching with joins before aggregation for better performance?

Advanced

  • Explain shuffle in join and aggregation.
  • Explain narrow vs wide dependencies in joins.
  • How does Spark optimize joins internally?
  • How does Spark optimize aggregations internally?
  • What is partitioning and its importance?
  • How does partitioning affect join performance?
  • How does partitioning affect aggregation performance?
  • Explain broadcast join with large datasets.
  • Explain handling skewed keys in joins.
  • Explain reduce-side join vs map-side join.
  • Explain join with multiple RDDs.
  • Explain aggregation with multiple RDDs.
  • Explain window-based aggregations.
  • Explain stateful aggregation in streaming joins.
  • Explain streaming joins vs batch joins.
  • Explain approximate aggregations for performance.
  • Explain advanced pivot operations.
  • Explain multi-level aggregations.
  • Explain hierarchical rollup and cube.
  • Explain combining joins and aggregations for ETL pipelines.
  • Explain memory and disk management in join operations.
  • Explain partition tuning for large aggregations.
  • Explain broadcast variable usage in aggregation.
  • Explain accumulators in aggregations.
  • Explain best practices for join + aggregation in Spark.

Expert

  • Explain shuffle optimization strategies in joins and aggregations.
  • Explain Spark Catalyst optimization for DataFrame joins.
  • Explain whole-stage code generation for joins & aggregations.
  • Explain Tungsten engine role in join and aggregation.
  • Explain join strategy selection: sort-merge vs broadcast hash join.
  • Explain adaptive query execution (AQE) in joins.
  • Explain skew join handling in AQE.
  • Explain incremental aggregations for streaming data.
  • Explain join/aggregation in structured streaming.
  • Explain checkpointing in streaming aggregations.
  • Explain watermarking for joins in streaming.
  • Explain stateful streaming aggregations.
  • Explain memory tuning for large join + aggregation operations.
  • Explain tuning shuffle partitions for large datasets.
  • Explain optimizing multi-stage aggregation pipelines.
  • Explain combining wide & narrow transformations with aggregations.
  • Explain caching strategies for repeated joins and aggregations.
  • Explain fault-tolerance mechanisms in joins & aggregations.
  • Explain Spark UI metrics related to joins and aggregations.
  • Explain advanced use cases: ETL, analytics dashboards, ML pipelines.
  • Explain differences in join behavior between RDD, DataFrame, and Dataset APIs.
  • Explain join performance tuning in distributed clusters.
  • Explain aggregation performance tuning in distributed clusters.
  • Explain UDAF optimization techniques.
  • Explain real-world examples combining joins and aggregations.
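
The broadcast hash join and "join then aggregate" questions above can be illustrated in plain Python. This is a minimal sketch of the general build-and-probe idea, not Spark's implementation: `broadcast_hash_join` and the sample rows are hypothetical names and data introduced for illustration.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by hashing the small side on `key`."""
    # Build phase: index the small side (the part that would be broadcast).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large side and emit one merged row per match.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"order_id": 1, "country": "IN", "amount": 120},
    {"order_id": 2, "country": "US", "amount": 80},
    {"order_id": 3, "country": "IN", "amount": 50},
    {"order_id": 4, "country": "BR", "amount": 10},   # no matching country row
]
countries = [
    {"country": "IN", "name": "India"},
    {"country": "US", "name": "United States"},
]

result = broadcast_hash_join(orders, countries, "country")

# Join then aggregate: total amount per country name.
totals = {}
for row in result:
    totals[row["name"]] = totals.get(row["name"], 0) + row["amount"]

print(totals)
```

Because the small side fits in one in-memory lookup table, the large side is never regrouped by key, which is the reason a broadcast join avoids shuffling the large table.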

Related Topics


#Spark Streaming

Key Concepts


S.No | Topic | Sub-Topics
1 | Spark Streaming | What is Spark Streaming, Real-time data, Micro-batch processing, Advantages, Use cases
2 | Spark Streaming Architecture | Driver, Receiver, DStream, Scheduler, Executors
3 | DStream Basics | Definition, Creation, Operations, RDDs, Transformations
4 | Creating DStreams | From sources: Kafka, Flume, TCP sockets, File streams, Custom receivers
5 | Transformations on DStreams | map(), flatMap(), filter(), reduceByKey(), window()
6 | Window Operations | window(), slideDuration, reduceByKeyAndWindow(), countByValueAndWindow(), Examples
7 | Stateful Transformations | updateStateByKey(), mapWithState(), Example, Use cases, Performance
8 | Actions on DStreams | print(), count(), saveAsTextFiles(), foreachRDD(), Examples
9 | Data Sources Integration | Kafka, Flume, HDFS, Socket, Custom sources
10 | Sinks / Output Operations | print(), saveAsTextFiles(), saveAsObjectFiles(), foreachRDD(), write to DB
11 | Checkpointing | Definition, Directory setup, Purpose, Examples, Fault tolerance
12 | Receiver Types | Reliable receiver, Unreliable receiver, Custom receiver, Receiver lifecycle, Examples
13 | Transformations: map vs flatMap | map(), flatMap(), Use cases, Examples, Differences
14 | Transformations: reduceByKey | reduceByKey(), reduceByKeyAndWindow(), Examples, Use cases, Performance
15 | Transformations: join in streaming | join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin(), Example
16 | Transformations: union & transform | union(), transform(), Example, Use cases, Combining multiple streams
17 | Handling Late Data | Watermarks, Window operations, State management, Dropping late data, Examples
18 | Kafka Integration | DirectStream vs ReceiverStream, Kafka parameters, Offset management, Example, Best practices
19 | Flume Integration | Spark Streaming + Flume, Push vs Pull, Receiver setup, Example, Best practices
20 | File Stream Source | HDFS integration, Local files, Monitoring new files, Examples, Performance considerations
21 | Structured Streaming Introduction | Differences from DStream, High-level API, DataFrames & Datasets, Fault-tolerance, Example
22 | Structured Streaming Sources | Kafka, File, Socket, Rate source, Custom sources
23 | Structured Streaming Sinks | Console, File, Kafka, ForeachBatch, Memory
24 | Event Time & Watermarks | Definition, Handling late data, withWatermark(), Examples, Use cases
25 | Window Operations in Structured Streaming | window(), slideDuration, groupBy window(), Examples, Performance tips
26 | Stateful Operations in Structured Streaming | mapGroupsWithState(), flatMapGroupsWithState(), Examples, Use cases, Performance
27 | Performance Tuning | Batch interval, Partitioning, Backpressure, Checkpointing, Resource tuning
28 | Fault Tolerance & Reliability | Checkpointing, Write-ahead logs, Replay, Receiver reliability, Structured Streaming guarantees
29 | Monitoring & Debugging | Spark UI, Streaming metrics, Logs, Executor monitoring, Performance tuning
30 | Real-world Examples | Log analytics, IoT data processing, Real-time dashboards, Clickstream analysis, Recommendations

Interview question

Basic

  • What is Spark Streaming?
  • Explain real-time data processing.
  • What is a micro-batch in Spark Streaming?
  • Difference between batch and streaming.
  • What is a DStream?
  • How is a DStream created?
  • What are the basic DStream transformations?
  • What are the basic DStream actions?
  • Explain map() transformation in streaming.
  • Explain flatMap() transformation in streaming.
  • Explain filter() transformation in streaming.
  • Explain reduceByKey() transformation in streaming.
  • Explain count() action in streaming.
  • Explain print() action in streaming.
  • How to read from a socket stream?
  • How to read from a file stream?
  • Difference between reliable and unreliable receivers.
  • What is the role of the driver in Spark Streaming?
  • What is the role of executors in streaming?
  • How is batch interval configured?
  • What is the default checkpointing mechanism?
  • How do you stop a streaming context?
  • Explain foreachRDD() action.
  • What is the Spark Streaming UI?
  • Explain the use cases of Spark Streaming.

Intermediate

  • Explain window operations in Spark Streaming.
  • What is slide interval?
  • Difference between window duration and slide duration.
  • Explain reduceByKeyAndWindow().
  • Explain countByValueAndWindow().
  • What are stateful transformations?
  • Explain updateStateByKey().
  • Explain mapWithState().
  • How do you integrate Spark Streaming with Kafka?
  • What is DirectKafkaStream?
  • What is Receiver-based Kafka stream?
  • How do you handle offsets in Kafka?
  • Explain Spark Streaming integration with Flume.
  • Explain push-based vs pull-based Flume integration.
  • How to read from HDFS in streaming?
  • How to read from S3 in streaming?
  • Explain streaming file source options.
  • Explain output operations: saveAsTextFiles().
  • Explain output operations: saveAsObjectFiles().
  • Explain output operations: foreachRDD() to database.
  • Explain fault tolerance in Spark Streaming.
  • What are write-ahead logs (WAL)?
  • Explain receiver reliability.
  • Explain backpressure mechanism in Spark Streaming.
  • What is the role of batch scheduling?

Advanced

  • Explain structured streaming.
  • Difference between DStream API and Structured Streaming API.
  • What are the sources in Structured Streaming?
  • What are the sinks in Structured Streaming?
  • Explain event-time processing.
  • Explain watermarks in streaming.
  • How to handle late data using watermarks?
  • Explain streaming aggregation.
  • Explain window aggregation in structured streaming.
  • Explain stateful aggregations.
  • Explain mapGroupsWithState().
  • Explain flatMapGroupsWithState().
  • Explain join operations in streaming.
  • Explain stream-stream join vs stream-static join.
  • Explain stream-stream outer joins.
  • Explain checkpointing in structured streaming.
  • Explain exactly-once semantics in streaming.
  • Explain output modes: append, complete, update.
  • Explain processing-time triggers.
  • Explain continuous processing mode.
  • Explain schema inference in streaming.
  • Explain custom sources in structured streaming.
  • Explain foreachBatch() in structured streaming.
  • Explain streaming aggregation with watermarking.
  • Explain performance tuning for structured streaming.

Expert

  • Explain state store in structured streaming.
  • Explain recovery from failures in streaming.
  • Explain backpressure in structured streaming.
  • Explain memory and executor tuning for streaming.
  • Explain shuffle optimization in streaming joins.
  • Explain handling skewed streaming data.
  • Explain checkpointing and lineage recovery.
  • Explain streaming aggregation optimizations.
  • Explain watermarks with multiple streams.
  • Explain latency vs throughput trade-offs.
  • Explain using Kafka offsets with checkpointing.
  • Explain exactly-once vs at-least-once delivery.
  • Explain stateful streaming performance tuning.
  • Explain streaming joins with large datasets.
  • Explain stream-stream join optimization.
  • Explain integrating streaming with machine learning.
  • Explain handling late-arriving events.
  • Explain multi-window aggregations.
  • Explain structured streaming with event time vs processing time.
  • Explain monitoring streaming jobs with Spark UI.
  • Explain streaming metrics and logs.
  • Explain resource allocation and dynamic scaling.
  • Explain memory spill and disk management in streaming.
  • Explain streaming ETL pipelines.
  • Explain real-world streaming applications and case studies.
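
The watermark and late-data questions above can be pictured with a small plain-Python model. This is a simplified per-event sketch, not Structured Streaming code: Spark advances its watermark between micro-batches and applies window-level rules, whereas here the watermark is rechecked on every event. `WINDOW_SECONDS`, `ALLOWED_LATENESS`, `count_with_watermark`, and the sample events are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 10     # hypothetical tumbling-window size
ALLOWED_LATENESS = 5    # hypothetical watermark delay, in seconds


def count_with_watermark(events):
    """Count events per tumbling window; `events` are (event_time, payload) in arrival order."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")

    for event_time, payload in events:
        # The watermark trails the largest event time seen so far.
        watermark = max_event_time - ALLOWED_LATENESS
        if event_time < watermark:
            dropped.append((event_time, payload))   # too late: ignored
            continue
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# Event 9 arrives after event 12 but is within the allowed lateness, so it is kept.
# Event 3 arrives after event 21 and is older than the watermark, so it is dropped.
arrivals = [(1, "a"), (4, "b"), (12, "c"), (9, "d"), (21, "e"), (3, "f")]
counts, dropped = count_with_watermark(arrivals)
print(counts)
print(dropped)
```

The delay is a trade-off: a larger value tolerates more out-of-order data but keeps window state in memory for longer.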

Related Topics


#Transformations & Actions

Key Concepts


S.No | Topic | Sub-Topics
1 | Transformations & Actions | Definition, Lazy Evaluation, DAG concept, Execution flow, Why separation matters
2 | Narrow vs Wide Transformations | Definition, Examples, Shuffle impact, Performance difference, Use cases
3 | map() | Syntax, One-to-one mapping, Use cases, Performance, Examples
4 | flatMap() | One-to-many mapping, Differences from map, Use cases, Examples, Performance
5 | filter() | Predicate logic, Data reduction, Optimization tips, Examples, Use cases
6 | select() / withColumn() | Column selection, Column creation, Expressions, Performance tips, Examples
7 | union() & distinct() | Combining datasets, Removing duplicates, Shuffle behavior, Use cases, Examples
8 | groupBy() | Grouping logic, Aggregation basics, Shuffle impact, Examples, Best practices
9 | reduceByKey() | Key-based reduction, Map-side aggregation, Performance benefits, Examples, Comparison
10 | groupByKey() | Working principle, Memory impact, Comparison with reduceByKey, Examples, When to avoid
11 | sortBy() & orderBy() | Sorting logic, Asc/Desc order, Shuffle cost, Examples, Optimization tips
12 | join() Basics | Inner join, Join condition, Execution flow, Examples, Common issues
13 | Advanced Join Types | Left, Right, Full, Semi, Anti joins, Use cases, Examples
14 | Broadcast Join | Concept, When to use, Memory impact, SQL hint, Examples
15 | repartition() & coalesce() | Partition control, Shuffle behavior, Performance impact, Use cases, Examples
16 | cache() & persist() | Storage levels, Memory vs disk, When to cache, Examples, Pitfalls
17 | count() | Action trigger, Job creation, Performance considerations, Examples, Use cases
18 | collect() | Driver memory risk, Small data usage, Examples, Best practices, Alternatives
19 | show() & take() | Preview data, Execution behavior, Limit handling, Examples, Usage tips
20 | save() & write() | Output formats, File systems, Partition output, Modes, Examples
21 | foreach() & foreachPartition() | Side effects, External systems, Performance difference, Examples, Best practices
22 | Window Functions | Over clause, Partition by, Order by, Use cases, Examples
23 | Actions vs Transformations | Comparison, Execution timing, DAG role, Interview questions, Examples
24 | Shuffle Internals | When shuffle occurs, Cost factors, Optimization, Examples, Debugging
25 | Performance Optimization | Avoid wide ops, Partition sizing, Caching strategy, Examples, Tips
26 | Error Handling | Bad records, Null handling, Try-catch logic, Data validation, Examples
27 | Spark UI Analysis | Jobs tab, Stages tab, Task metrics, Shuffle read/write, Debugging
28 | Real-world ETL Flow | Transform chain design, Action placement, Optimization, Examples, Best practices
29 | Interview Scenarios | Common questions, Tricky cases, Performance questions, Sample answers, Tips
30 | Hands-on Mini Project | End-to-end pipeline, Transformations usage, Actions usage, Optimization, Review

Interview question

Basic

  • What is a transformation in Spark?
  • What is an action in Spark?
  • Difference between transformations and actions?
  • What is lazy evaluation?
  • What is an RDD?
  • How do you create an RDD?
  • What is parallelize() in Spark?
  • What is textFile() in SparkContext?
  • Explain map() transformation.
  • Explain filter() transformation.
  • What is flatMap()?
  • Explain distinct() transformation.
  • What does union() do?
  • Explain intersection() transformation.
  • Explain subtract() transformation.
  • What is cartesian() transformation?
  • What is collect() action?
  • Explain count() action.
  • Explain first() action.
  • Explain take(n) action.
  • Explain reduce() action.
  • Explain fold() action.
  • Explain aggregate() action.
  • Explain takeOrdered() action.
  • Explain top() action.
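
Several of the basic questions above turn on the split between transformations and actions. The sketch below is plain Python, not Spark: LazySeq and traced_square are hypothetical names made up for illustration. It only shows the idea that transformations record a lineage and do no work until an action replays it.

```python
class LazySeq:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # recorded lineage of (kind, function) pairs

    # Transformations: return a new LazySeq and do no work yet.
    def map(self, fn):
        return LazySeq(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazySeq(self._data, self._steps + (("filter", fn),))

    # Actions: replay the recorded steps and return a concrete result.
    def collect(self):
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        return len(self.collect())


calls = []


def traced_square(x):
    calls.append(x)  # side effect so we can observe when work happens
    return x * x


pipeline = LazySeq(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = pipeline.collect()       # the action triggers the whole chain
```

Real Spark additionally splits the recorded lineage into stages and tasks and distributes them across executors; none of that is modelled here.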

Intermediate

  • Explain groupByKey() transformation.
  • Explain reduceByKey() transformation.
  • Difference between groupByKey() and reduceByKey().
  • Explain aggregateByKey() transformation.
  • Explain combineByKey() transformation.
  • What are pair RDDs?
  • Explain mapValues() transformation.
  • Explain flatMapValues() transformation.
  • Explain keys() and values() transformations.
  • Explain lookup() action on pair RDDs.
  • Explain joins: join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin().
  • Explain cogroup() transformation.
  • Explain sortByKey() transformation.
  • Explain sortBy() transformation.
  • Difference between sortByKey() and sortBy().
  • Explain repartition() transformation.
  • Explain coalesce() transformation.
  • Difference between repartition() and coalesce().
  • Explain sample() transformation.
  • Explain sampleByKey() and sampleByKeyExact().
  • What is takeSample()?
  • Explain randomSplit() transformation.
  • Explain the cache() and persist() methods.
  • Explain unpersist() method.
  • Explain checkpointing and its use case.
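
For the groupByKey() versus reduceByKey() question above, the usual answer is map-side combining. The following is a plain-Python sketch rather than Spark code: partitions are modelled as lists, and the two function names are hypothetical. It counts how many records would cross the shuffle under each approach.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle unchanged."""
    shuffled = [rec for part in partitions for rec in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {k: sum(vs) for k, vs in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition, then shuffle partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # map-side combine
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, partial in shuffled:
        totals[key] += partial
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]
group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)
```

Both paths produce the same totals, but here the reduce-style path moves 5 records instead of 8. The gap widens as the number of values per key grows.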

Advanced

  • Explain narrow vs wide transformations.
  • Difference between narrow and wide dependencies.
  • What is shuffle in Spark?
  • How does shuffle affect performance?
  • Explain partitioning and its importance.
  • Explain HashPartitioner and RangePartitioner.
  • Explain repartitionAndSortWithinPartitions().
  • Explain the role of DAG scheduler in transformations.
  • Explain stages and tasks in Spark execution.
  • Explain lazy evaluation and lineage.
  • How are transformations optimized internally?
  • Explain the difference between map() and mapPartitions().
  • Explain foreach() and foreachPartition() actions.
  • How to handle skewed data in transformations?
  • Explain broadcast variables and their usage.
  • Explain accumulators and their usage.
  • Explain foldByKey() transformation.
  • Explain subtractByKey() transformation.
  • Explain join optimizations for pair RDDs.
  • Explain caching strategies for iterative algorithms.
  • Explain checkpointing vs caching.
  • Explain transformations on DataFrames compared to RDDs.
  • Explain mapPartitionsWithIndex() transformation.
  • Explain wide transformations and task parallelism.
  • Explain best practices for memory management during transformations.

Expert

  • Explain spark.sql.shuffle.partitions and its impact.
  • Explain narrow dependency scheduling optimizations.
  • How does Spark handle task failures during actions?
  • Explain lineage and recomputation during RDD failure.
  • Explain transformations in Structured Streaming.
  • Explain actions in Structured Streaming.
  • How to optimize joins on large datasets?
  • Explain partition tuning for large-scale RDDs.
  • Explain avoiding shuffles with map-side reductions.
  • Explain caching strategies for MLlib pipelines.
  • Explain the difference between cache() and persist(StorageLevel.MEMORY_AND_DISK).
  • Explain how Spark plans tasks for wide transformations.
  • Explain the difference between reduceByKey() and aggregateByKey() performance.
  • How to handle skewed keys in joins?
  • Explain Spark UI metrics related to transformations and actions.
  • Explain how stage boundaries are created in wide transformations.
  • Explain partition coalescing to reduce shuffle.
  • Explain RDD lineage graph and fault tolerance.
  • Explain advanced join strategies: broadcast join, shuffle hash join.
  • Explain the difference between DataFrame and RDD transformations.
  • Explain transformation optimization by Catalyst (for DataFrames).
  • Explain whole-stage code generation for DataFrame transformations.
  • Explain streaming aggregation and its fault tolerance.
  • Explain stateful transformations in streaming RDDs.
  • Explain advanced tuning techniques for actions on huge datasets.
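
One common answer to the skewed-key questions above is salting. The sketch below is plain Python under stated assumptions: partition_of is a toy byte-sum partitioner invented for this example (Spark's HashPartitioner works differently), and salts are assigned round-robin so the result is deterministic, whereas real jobs often use random salts.

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4


def partition_of(key: str) -> int:
    """Toy deterministic partitioner: byte sum modulo the partition count."""
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


def partition_loads(keys):
    """Number of records landing on each partition, indexed by partition id."""
    loads = Counter(partition_of(k) for k in keys)
    return [loads.get(p, 0) for p in range(NUM_PARTITIONS)]


# One hot key dominates the dataset.
keys = ["hot"] * 1000 + ["cold"] * 8

plain_loads = partition_loads(keys)

# Append a salt suffix so one logical key spreads over several physical keys.
salted_keys = [f"{k}#{i % NUM_SALTS}" for i, k in enumerate(keys)]
salted_loads = partition_loads(salted_keys)
```

In a join, the other side has to be replicated once per salt value so that every salted key still finds its match, and the salt is dropped after the join.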

Related Topics


#DataFrames & Datasets

Key Concepts


S.No | Topic | Sub-Topics
1 | Apache Spark & DataFrames | Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases
2 | Spark Setup & Environment | Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics
3 | SparkSession & Entry Points | SparkSession creation, SQLContext, HiveContext, Config options, Best practices
4 | Creating DataFrames | From files, From RDD, From collections, Schema inference, Explicit schema
5 | DataFrame Schema & Data Types | StructType, StructField, Primitive types, Complex types, Schema evolution
6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics
7 | Writing DataFrames | Save modes, Partitioning, Bucketing, File formats, Compression
8 | DataFrame Basic Operations | select, withColumn, drop, filter, where
9 | Column Operations | Column expressions, alias, cast, when/otherwise, lit
10 | Row Operations & Actions | show, collect, take, count, first
11 | DataFrame Functions | Built-in functions, String functions, Date functions, Math functions, Null handling
12 | Filtering & Conditional Logic | filter vs where, isin, like, rlike, case when
13 | Sorting & Deduplication | orderBy, sort, distinct, dropDuplicates, Sorting optimization
14 | Aggregation & Grouping | groupBy, agg, count, sum, avg
15 | Joins in DataFrames | Inner join, Left/Right join, Full join, Semi/Anti join
16 | Join Optimization | Broadcast join, Shuffle join, Join hints, Skew handling, AQE
17 | Handling Missing & Bad Data | dropna, fillna, replace, Null checks, Data validation
18 | Window Functions | Window spec, row_number, rank, lead/lag, Running totals
19 | UDF & UDAF | UDF creation, Performance impact, Pandas UDF, Serialization, Best practices
20 | DataFrame Caching & Persistence | cache, persist, Storage levels, Memory vs disk, When to cache
21 | Spark SQL with DataFrames | Temp views, Global views, SQL queries, Mixing SQL & DF, Optimization
22 | Partitioning & Repartitioning | repartition, coalesce, Partition pruning, File partitioning, Performance tuning
23 | Performance Optimization Basics | Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE
24 | DataFrame Execution Plan | Logical plan, Physical plan, explain(), DAG, Stage breakdown
25 | Handling Large Datasets | Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling
26 | Integration with Hive | Hive tables, External tables, Metastore, Partitioned tables, Hive SQL
27 | Streaming DataFrames (Structured Streaming) | Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers
28 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug tools, Retry strategies
29 | Best Practices & Design Patterns | Code structure, Reusability, Performance patterns, Anti-patterns, Testing
30 | Real-world Use Cases & Projects | ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review

Interview question

Basic Level

  1. What is a DataFrame in Apache Spark?
  2. How is a DataFrame different from an RDD?
  3. What are the advantages of DataFrames?
  4. Is DataFrame immutable?
  5. What is SparkSession?
  6. How do you create a DataFrame in Spark?
  7. What are the different data sources supported by DataFrames?
  8. What is a schema in DataFrame?
  9. How can you infer schema automatically?
  10. How do you define a custom schema?
  11. What is show() in DataFrame?
  12. What is printSchema()?
  13. What is select() in DataFrame?
  14. What is withColumn()?
  15. Difference between withColumn() and select()?
  16. What is filter() / where() in DataFrame?
  17. Difference between filter() and where()?
  18. What is limit()?
  19. What is collect()?
  20. What is count()?
  21. What is distinct()?
  22. What is drop()?
  23. What is dropDuplicates()?
  24. What is alias()?
  25. How do you rename a column in DataFrame?

Intermediate Level

  1. What are DataFrame transformations?
  2. What are DataFrame actions?
  3. What is lazy evaluation in DataFrames?
  4. What is explain() in DataFrame?
  5. What is logical plan?
  6. What is physical plan?
  7. What is Catalyst Optimizer?
  8. What is Tungsten execution engine?
  9. What is column expression?
  10. What are built-in Spark SQL functions?
  11. Difference between UDF and built-in functions?
  12. What is UDF?
  13. Performance impact of UDFs?
  14. What is groupBy()?
  15. What is agg()?
  16. Difference between groupBy and window functions?
  17. What is orderBy()?
  18. Difference between orderBy() and sort()?
  19. What is join() in DataFrame?
  20. What are the types of joins in Spark?
  21. What is inner join?
  22. What is left outer join?
  23. What is right outer join?
  24. What is full outer join?
  25. What is cross join?

Advanced Level

  1. What is broadcast join?
  2. When does Spark automatically use broadcast join?
  3. What is shuffle join?
  4. How does Spark handle join optimization?
  5. What is partitioning in DataFrame?
  6. Difference between repartition() and coalesce()?
  7. What is bucketing in Spark?
  8. Difference between bucketing and partitioning?
  9. What is window function?
  10. Explain row_number(), rank(), dense_rank().
  11. What is caching in DataFrame?
  12. Difference between cache() and persist()?
  13. What storage levels are available?
  14. What is checkpointing in DataFrame?
  15. Difference between cache and checkpoint?
  16. What is skewed data?
  17. How to handle data skew in joins?
  18. What is salting technique?
  19. What is adaptive query execution (AQE)?
  20. What are Spark SQL hints?
  21. What is explain(true)?
  22. How to optimize wide transformations?
  23. What is column pruning?
  24. What is predicate pushdown?
  25. What is vectorized reader?
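
The row_number(), rank(), dense_rank() question above is easiest to answer with a worked example on tied values. This is a plain-Python sketch of the three numbering rules over a single window ordered by score descending; rank_rows is a hypothetical helper, not a Spark API.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples, ordered by score descending."""
    ordered = sorted(scores, reverse=True)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that equals no score
    for row_number, score in enumerate(ordered, start=1):
        if score != prev:
            rank = row_number  # rank jumps to the current position after ties
            dense += 1         # dense_rank advances by exactly one
            prev = score
        out.append((score, row_number, rank, dense))
    return out


ranked = rank_rows([90, 85, 90, 70, 85, 85])
for row in ranked:
    print(row)
```

With two 90s and three 85s, row_number runs 1 to 6 without repeats, rank gives 1, 1, 3, 3, 3, 6 (gaps after ties), and dense_rank gives 1, 1, 2, 2, 2, 3 (no gaps).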

Expert Level

  1. Why are DataFrames faster than RDDs?
  2. How does Catalyst optimize DataFrame queries?
  3. How does Tungsten improve performance?
  4. What happens internally when a DataFrame action is triggered?
  5. How does Spark generate optimized bytecode?
  6. What is whole-stage code generation?
  7. What is off-heap memory?
  8. How does Spark handle memory management for DataFrames?
  9. How does AQE change execution plans at runtime?
  10. How do you debug slow DataFrame jobs?
  11. How do you analyze Spark UI for DataFrame jobs?
  12. What are common DataFrame performance anti-patterns?
  13. Why is excessive use of withColumn() discouraged?
  14. How do you design efficient Spark SQL pipelines?
  15. What are limitations of DataFrames?
  16. How do DataFrames handle schema evolution?
  17. How does DataFrame support semi-structured data?
  18. What is explode() function?
  19. What is from_json() and to_json()?
  20. How to handle nested columns efficiently?
  21. What is Delta Lake DataFrame integration?
  22. How does DataFrame handle ACID properties?
  23. What is cost-based optimization (CBO)?
  24. How do you tune Spark SQL configurations?
  25. Explain real-time production use cases of DataFrames.

Related Topics


12 January 2026

#JUnit

Key Concepts


S.No | Topic | Sub-Topics
1 | Introduction to JUnit | What is JUnit?, Importance of Unit Testing, History of JUnit, Versions overview, Use cases
2 | JUnit Architecture | Core classes, Test runners, Test lifecycle, Annotations overview, Test suites
3 | JUnit 4 vs JUnit 5 | Key differences, Annotations, Assertions, Extension model, Migration strategies
4 | JUnit Annotations | @Test, @Before, @After, @BeforeClass, @AfterClass
5 | JUnit 5 Annotations | @Test, @BeforeEach, @AfterEach, @BeforeAll, @AfterAll
6 | Assertions in JUnit | assertEquals, assertTrue, assertFalse, assertNotNull, assertThrows
7 | Parameterized Tests | Introduction, @ParameterizedTest, @ValueSource, @CsvSource, Custom parameter providers
8 | JUnit Test Suites | Purpose, Creating test suites, Including multiple classes, @Suite annotation, Running suites
9 | Exception Testing | assertThrows, Expected exceptions, Handling exceptions in tests, Try-catch in tests, Best practices
10 | Timeouts in Tests | Using @Test(timeout), assertTimeout, assertTimeoutPreemptively, Long-running tests, Best practices
11 | Assumptions in JUnit | assumeTrue, assumeFalse, Conditional test execution, Environment-specific tests, Integration with CI
12 | Test Lifecycle Methods | Setup and teardown, @BeforeEach/@AfterEach, @BeforeAll/@AfterAll, Resource management, Best practices
13 | Nested Tests | Introduction, @Nested annotation, Structuring tests, Inner classes, Scope and lifecycle
14 | Tagging Tests | @Tag annotation, Grouping tests, Running specific tags, Excluding tags, Integration with CI/CD
15 | JUnit Extensions | Introduction, @ExtendWith annotation, Custom extensions, Parameter resolvers, Test lifecycle hooks
16 | Mocking with Mockito | Mockito basics, @Mock, @InjectMocks, when-thenReturn, Verifying interactions
17 | JUnit with Spring Boot | @SpringBootTest, @WebMvcTest, @MockBean, Context loading, Integration tests
18 | Behavior Driven Testing | Introduction to BDD, JUnit + Cucumber, Feature files, Step definitions, Integration examples
19 | Testing Exceptions and Edge Cases | Edge case identification, Boundary testing, assertThrows, Negative testing, Best practices
20 | JUnit Test Reports | Generating reports, Maven Surefire plugin, Gradle reports, HTML reports, CI integration
21 | Mocking Static Methods | Mockito inline, PowerMockito, Limitations, Use cases, Best practices
22 | Parameterized and CSV Tests | @CsvSource, @CsvFileSource, @MethodSource, Dynamic tests, Practical examples
23 | Dynamic Tests | @TestFactory, DynamicTest.stream, Custom dynamic tests, Use cases, Best practices
24 | Integration Testing with JUnit | Introduction, Database tests, REST API testing, Spring integration, Environment setup
25 | Code Coverage | Jacoco integration, Measuring coverage, Analyzing reports, Coverage thresholds, Best practices
26 | Continuous Integration | JUnit in CI/CD, Jenkins integration, GitHub Actions, Pipeline setup, Reporting
27 | Best Practices in JUnit | Writing clean tests, DRY principle, Readable assertions, Test naming conventions, Test isolation
28 | Debugging Unit Tests | Using IDE debugger, Common failures, Stack traces, Logging in tests, Fixing flaky tests
29 | Advanced Assertions | assertAll, assertIterableEquals, assertLinesMatch, assertTimeout, Custom assertions
30 | JUnit Projects & Labs | Hands-on projects, Full coverage examples, Spring Boot testing, CI/CD integration, Practice exercises

Interview question

1. JUnit Basics

  1. What is JUnit and why is it used?
  2. Explain the differences between JUnit 4 and JUnit 5.
  3. What are the advantages of using JUnit in Java projects?
  4. How do you write your first JUnit test case?
  5. What is the naming convention for test methods?
  6. Explain the role of the @Test annotation.
  7. What is the default test runner in JUnit?
  8. How do you disable a test in JUnit?
  9. What are assumptions in JUnit?
  10. Explain JUnit's role in TDD (Test-Driven Development).
  11. What is the difference between unit tests and integration tests?
  12. How do you set up a JUnit environment in a Maven project?
  13. Can JUnit be used for testing private methods?
  14. What is the default order of test execution in JUnit?
  15. How do you test void methods in JUnit?
  16. How do you skip tests conditionally?
  17. What is the difference between JUnit and TestNG?
  18. Explain the concept of test lifecycle in JUnit.
  19. What is the purpose of @DisplayName in JUnit 5?
  20. How do you tag and filter tests in JUnit?

2. Annotations

  1. Explain the usage of @BeforeEach and @AfterEach.
  2. What is the purpose of @BeforeAll and @AfterAll?
  3. How do you create a setup method in JUnit?
  4. Difference between @BeforeClass (JUnit 4) and @BeforeAll (JUnit 5).
  5. What happens if @BeforeAll is not static?
  6. Can you use multiple annotations on the same method?
  7. How do you use @Disabled in JUnit 5?
  8. What is @RepeatedTest in JUnit 5?
  9. Explain @Nested test classes.
  10. How is @TestFactory used in dynamic tests?
  11. Explain @Tag annotation with examples.
  12. What is the difference between @Test and @ParameterizedTest?
  13. What is @ExtendWith used for?
  14. How do you use @TempDir in JUnit?
  15. What does @Timeout do in JUnit 5?
  16. Can you annotate constructors in JUnit with @BeforeEach?
  17. Explain @EnabledOnOs and @EnabledOnJre annotations.
  18. How do you handle conditional test execution using annotations?
  19. What is @Order used for in JUnit tests?
  20. Can annotations be customized in JUnit?

3. Assertions

  1. What is an assertion in JUnit?
  2. Difference between assertEquals and assertSame.
  3. How do you test for exceptions using assertions?
  4. Explain the usage of assertTrue and assertFalse.
  5. What is assertNull and assertNotNull?
  6. How do you compare arrays in JUnit?
  7. Explain assertAll with example.
  8. What is the difference between fail() and assertThrows()?
  9. How do you write custom assertions?
  10. What is the purpose of Hamcrest in JUnit assertions?
  11. Explain the difference between Hamcrest and AssertJ.
  12. How do you use assertLinesMatch in JUnit?
  13. What is assertIterableEquals used for?
  14. How do you write assertions for collections?
  15. Explain assertTimeout and assertTimeoutPreemptively.
  16. What is the difference between soft and hard assertions?
  17. Can you write assertions for floating-point numbers?
  18. How do you compare objects in JUnit tests?
  19. Explain the usage of assertDoesNotThrow.
  20. What happens when an assertion fails?

4. Parameterized Tests

  1. What is a parameterized test in JUnit?
  2. How do you write a parameterized test using @ValueSource?
  3. Difference between @CsvSource and @CsvFileSource.
  4. How do you test with enum values in JUnit?
  5. How do you write parameterized tests with @MethodSource?
  6. Explain how arguments are resolved in parameterized tests.
  7. How do you test with multiple parameters?
  8. What are custom argument providers?
  9. How do you reuse test data across parameterized tests?
  10. What is the advantage of parameterized tests?
  11. How do you use ArgumentsAccessor?
  12. What is @ArgumentsSource annotation?
  13. How do you test edge cases with parameterized tests?
  14. Explain the difference between parameterized tests in JUnit 4 vs JUnit 5.
  15. Can parameterized tests be combined with @BeforeEach?
  16. What happens when parameterized test data is invalid?
  17. How do you test null inputs in parameterized tests?
  18. What are some best practices for parameterized testing?
  19. How do you test with complex objects?
  20. Can you combine parameterized and dynamic tests?

5. Test Suites

  1. What is a test suite in JUnit?
  2. How do you run multiple test classes together?
  3. Explain @SelectPackages annotation.
  4. Explain @SelectClasses annotation.
  5. How do you include/exclude tests in a suite?
  6. Can test suites be nested?
  7. Difference between JUnit 4 @Suite and JUnit 5 suite engine.
  8. How do you run test suites from Maven?
  9. How do you integrate test suites with Gradle?
  10. What is the benefit of test suites?
  11. Can you use filters with test suites?
  12. How do you configure test discovery?
  13. How do you execute only tagged tests in a suite?
  14. What happens if a suite includes disabled tests?
  15. How do you execute JUnit 4 suites inside JUnit 5?
  16. What is @IncludeClassNamePatterns?
  17. What is @ExcludeTags used for?
  18. Explain discovery selectors in JUnit.
  19. What are suite-level lifecycle methods?
  20. How do you create custom suite runners?

6. Mockito & JUnit

  1. What is Mockito?
  2. How do you create a mock in JUnit?
  3. What is the purpose of @Mock annotation?
  4. How do you use @InjectMocks in tests?
  5. Explain stubbing in Mockito.
  6. What is verify() used for?
  7. How do you reset a mock?
  8. How do you use ArgumentCaptor?
  9. What is the difference between spy() and mock()?
  10. How do you mock exceptions in JUnit tests?
  11. Can you mock static methods with Mockito?
  12. How do you mock final classes?
  13. What is the difference between real and mock objects?
  14. How do you mock collections?
  15. How do you handle void methods in Mockito?
  16. What is the difference between @MockBean and @Mock?
  17. How do you mock private methods?
  18. Explain deep stubs in Mockito.
  19. How do you verify the number of interactions?
  20. What are common pitfalls in using mocks?

7. Spring & JUnit Integration

  1. How do you write a test with @SpringBootTest?
  2. What is @MockBean used for?
  3. Explain the difference between @Mock and @MockBean.
  4. How do you load application context in JUnit?
  5. How do you test Spring MVC controllers?
  6. What is @DataJpaTest used for?
  7. How do you test REST APIs using MockMvc?
  8. Explain @WebMvcTest annotation.
  9. How do you test Spring Boot configuration classes?
  10. How do you handle transactions in Spring tests?
  11. How do you use @TestConfiguration?
  12. What is @AutoConfigureMockMvc?
  13. How do you test caching in Spring?
  14. Explain the use of @Sql annotation in testing.
  15. How do you test services with external dependencies?
  16. What is TestEntityManager used for?
  17. How do you test asynchronous methods in Spring?
  18. Explain how to use TestRestTemplate.
  19. How do you test application events in Spring?
  20. What is the role of @ActiveProfiles in testing?

8. Exception Testing

  1. How do you test exceptions in JUnit 4?
  2. How do you test exceptions in JUnit 5?
  3. Explain the usage of assertThrows.
  4. What is ExpectedException in JUnit 4?
  5. How do you test custom exceptions?
  6. How do you verify exception messages?
  7. How do you test multiple exceptions in a single method?
  8. Can you test checked and unchecked exceptions differently?
  9. How do you test exceptions in parameterized tests?
  10. What is the difference between fail() and assertThrows()?
  11. How do you test runtime exceptions?
  12. How do you test for null pointer exceptions?
  13. How do you combine assertions with exception testing?
  14. How do you log exceptions during tests?
  15. How do you suppress exceptions in JUnit?
  16. How do you create reusable exception assertions?
  17. What happens when no exception is thrown in assertThrows?
  18. Can you test exceptions in dynamic tests?
  19. How do you handle exception hierarchies?
  20. Explain common pitfalls in exception testing.

9. Test Execution & Reports

  1. How do you run JUnit tests from Eclipse/IntelliJ?
  2. How do you run tests using Maven?
  3. How do you run tests using Gradle?
  4. How do you execute tests from the command line?
  5. How do you run a single test class?
  6. How do you run a single test method?
  7. How do you generate XML test reports?
  8. How do you generate HTML test reports?
  9. What is the Surefire plugin in Maven?
  10. How do you configure test execution in Gradle?
  11. How do you run tests in parallel?
  12. How do you configure timeout for all tests?
  13. How do you rerun failed tests?
  14. How do you ignore flaky tests?
  15. How do you integrate JUnit tests with Jenkins?
  16. How do you configure reporting plugins?
  17. How do you publish test reports in CI/CD?
  18. How do you measure code coverage with JUnit?
  19. How do you generate Jacoco reports?
  20. What are best practices in test reporting?

10. JUnit Extensions & Best Practices

  1. What is the extension model in JUnit 5?
  2. How do you write a simple JUnit extension?
  3. What is @ExtendWith used for?
  4. What built-in extensions are available in JUnit 5?
  5. How do you use the TempDir extension?
  6. How do you use the Timeout extension?
  7. What is ExtensionContext in JUnit?
  8. How do you chain multiple extensions?
  9. How do you share state across extensions?
  10. How do you implement a logging extension?
  11. What are some best practices in writing unit tests?
  12. How do you name your test methods effectively?
  13. What is the Arrange-Act-Assert pattern?
  14. How do you avoid flaky tests?
  15. How do you organize test packages?
  16. How do you reuse common test data?
  17. How do you make tests maintainable?
  18. What are some anti-patterns in testing?
  19. How do you improve performance of test suites?
  20. What are enterprise-level strategies for JUnit testing?

Related Topics


#MultiThread

Key Concepts


S.No | Topic | Sub-Topics
1 | Multithreading | What is Thread, Process vs Thread, Benefits of Multithreading, Applications, Thread Lifecycle Overview
2 | Thread Class | Creating Thread by extending Thread, start(), run(), sleep(), join(), getName()
3 | Runnable Interface | Implementing Runnable, Passing to Thread, Advantages, run() vs start(), Lambda Runnable
4 | Thread Lifecycle | New, Runnable, Running, Waiting, Timed Waiting, Terminated, Thread State Transitions
5 | Thread Methods | setName/getName, setPriority/getPriority, isAlive(), yield(), interrupt()
6 | Thread Priority | Min/Max/Normal Priority, setPriority, Thread Scheduling, Preemption, Fairness
7 | Thread Sleep & Join | sleep(), join(), wait vs sleep, timed join, practical examples
8 | Thread Communication | wait(), notify(), notifyAll(), producer-consumer basics, synchronized block
9 | Synchronized Methods | Method-level sync, block-level sync, object lock, class-level lock, best practices
10 | Inter-thread Communication | Producer-Consumer Problem, BlockingQueue, wait/notify, ReentrantLock with Condition, deadlock prevention
11 | Reentrant Locks | Lock interface, ReentrantLock, tryLock(), lockInterruptibly(), fairness, conditions
12 | Deadlock | What is Deadlock, Conditions, Prevention, Avoidance, Detection, Recovery
13 | Starvation & Livelock | Starvation, Livelock, Examples, Priority Inversion, Solutions
14 | Thread Safety | Definition, Thread-safe classes, Immutable Objects, Synchronization, Atomic variables
15 | Atomic Classes | AtomicInteger, AtomicLong, AtomicReference, compareAndSet, use cases
16 | Volatile Keyword | What is volatile, visibility, happens-before, example usage, memory consistency
17 | Concurrent Collections | ConcurrentHashMap, CopyOnWriteArrayList, BlockingQueue, ConcurrentSkipListMap, benefits
18 | Executor Framework | Executor, ExecutorService, ThreadPoolExecutor, ScheduledExecutorService, shutdown
19 | Thread Pools | FixedPool, CachedPool, SingleThreadPool, ScheduledPool, Advantages
20 | Callable & Future | Callable Interface, Future, submit(), get(), timeout handling, cancelling tasks
21 | ForkJoin Framework | ForkJoinPool, RecursiveTask, RecursiveAction, work-stealing, parallel computation
22 | Parallel Streams | Stream API, parallel(), ForkJoin usage, performance tips, pitfalls
23 | ThreadLocal | ThreadLocal variables, usage, memory leak, InheritableThreadLocal, examples
24 | Synchronization Utilities | CountDownLatch, CyclicBarrier, Semaphore, Phaser, Exchanger
25 | Deadlock Prevention Patterns | Lock Ordering, TryLock, Timeout, Avoid Nested Locks, Resource hierarchy
26 | Best Practices | Minimize synchronized code, prefer high-level concurrency, immutable objects, use executor, avoid busy wait
27 | Performance Tuning | Thread pool sizing, contention reduction, CPU-bound vs IO-bound, measuring, profiling
28 | Common Concurrency Bugs | Race conditions, deadlocks, livelocks, visibility issues, fixes
29 | Real-world Examples | Producer-Consumer app, Web server handling requests, parallel processing, async tasks, thread-safe cache
30 | Interview & Revision | Key methods, concurrency concepts, common pitfalls, multithreading Q&A, mini projects

Interview question

Basic Level

  • What is a Thread?
  • What is Multithreading?
  • What is the difference between Process and Thread?
  • What are the advantages of Multithreading?
  • What is Context Switching?
  • What is Thread Lifecycle?
  • What are the different Thread States?
  • How do you create a Thread in Java?
  • What is the Runnable interface?
  • What is the difference between Thread and Runnable?
  • What is the start() method?
  • What is the run() method?
  • Why should we not directly call run()?
  • What is Thread Priority?
  • What is Daemon Thread?
  • How do you create a Daemon Thread?
  • What is the purpose of the join() method?
  • What is the sleep() method used for?
  • What is thread scheduling?
  • What is time slicing?
  • What is thread safety?
  • What is synchronization?
  • What is a synchronized method?
  • What is a synchronized block?
  • What is the volatile keyword?

Intermediate Level

  • What is Inter-thread communication?
  • What are wait(), notify(), notifyAll() used for?
  • Why must wait/notify be called inside synchronized block?
  • What is a race condition?
  • What is deadlock?
  • How do you avoid deadlock?
  • What is livelock?
  • What is starvation?
  • What is a monitor in Java?
  • What is reentrant synchronization?
  • What is a ThreadGroup?
  • What is ThreadLocal?
  • What is the Executor framework?
  • What is ExecutorService?
  • What is a ThreadPool?
  • What is Callable?
  • What is Future?
  • What is FutureTask?
  • What is ScheduledExecutorService?
  • What is a RejectedExecutionHandler?
  • What is a BlockingQueue?
  • What is the difference between synchronized and Lock?
  • What is ReentrantLock?
  • What is ReadWriteLock?
  • What is Condition interface?

Advanced Level

  • What is Fork/Join framework?
  • What is Work Stealing Algorithm?
  • What is ConcurrentHashMap?
  • How does ConcurrentHashMap achieve thread safety?
  • What is CopyOnWriteArrayList?
  • What is CAS (Compare and Swap)?
  • What are Atomic classes?
  • What is the difference between Atomic and volatile?
  • What is StampedLock?
  • What is Phaser?
  • What is CyclicBarrier?
  • What is CountDownLatch?
  • What is Semaphore?
  • What is Exchanger?
  • What is ThreadPoolExecutor?
  • How does ThreadPoolExecutor manage threads internally?
  • What is ForkJoinPool?
  • What is parallel stream?
  • How does parallel stream work internally?
  • What is thread contention?
  • What is false sharing?
  • How do you debug concurrency issues?
  • What is Memory Consistency Error?
  • What is Happens-Before relationship?
  • What is Java Memory Model?

Expert Level

  • How to design highly scalable multithreaded systems?
  • What are lock-free algorithms?
  • What are wait-free algorithms?
  • What is the difference between blocking vs non-blocking algorithms?
  • How do you reduce lock contention?
  • How does JVM handle thread scheduling internally?
  • What are advanced optimizations in modern JVM for concurrency?
  • Explain the internals of synchronized keyword.
  • Explain biased locking.
  • Explain lightweight locking.
  • What is escape analysis?
  • How does JIT optimize multithreaded code?
  • How do you detect deadlocks in production?
  • How do you avoid deadlocks using ordering strategies?
  • How do you tune thread pools for high throughput?
  • What is backpressure in multithreading systems?
  • How do you design producer-consumer systems at scale?
  • How do you build custom thread pools?
  • How do you test multithreaded code effectively?
  • What is the role of memory barriers in concurrency?
  • How do you ensure safe publication of objects?
  • What is double-checked locking?
  • Why was double-checked locking broken before Java 5?
  • How do you build lock-free data structures?
  • How do reactive systems differ from traditional multithreading?

Related Topics


11 January 2026

#Scikit

Key Concepts


S.No | Topic | Sub-Topics
1 | Scikit-learn | What is scikit-learn, Installation, Key features, ML workflow, Supported algorithms
2 | Scikit-learn API Basics | Estimators, fit(), predict(), transform(), Pipelines, Model persistence
3 | Data Loading & Inspection | Built-in datasets, load_*, fetch_*, Data shapes, Feature names, Target variables
4 | Data Preprocessing | Scaling, Normalization, Encoding categorical data, Missing values, Feature transformation
5 | Feature Scaling Techniques | StandardScaler, MinMaxScaler, RobustScaler, Normalizer, When to scale
6 | Handling Missing Data | SimpleImputer, Strategies, Missing indicators, Pipeline usage, Best practices
7 | Encoding Categorical Variables | LabelEncoder, OneHotEncoder, OrdinalEncoder, Handling unknowns, Sparse output
8 | Train-Test Split | train_test_split, Stratification, Random state, Data leakage, Validation sets
9 | Linear Regression | LinearRegression, Assumptions, Coefficients, Evaluation metrics, Use cases
10 | Logistic Regression | Binary vs multiclass, Regularization, Solver options, Class weights, Evaluation
11 | Model Evaluation Metrics | Accuracy, Precision, Recall, F1-score, Confusion matrix
12 | Cross-Validation | K-Fold, StratifiedKFold, cross_val_score, cross_validate, Bias-variance tradeoff
13 | k-Nearest Neighbors | KNN classifier, KNN regressor, Distance metrics, Choosing K, Performance
14 | Support Vector Machines | SVC, SVR, Kernels, Hyperparameters, Margin maximization
15 | Decision Trees | Tree structure, Gini vs entropy, Overfitting, Pruning, Feature importance
16 | Ensemble Learning | Bagging, Boosting, Random Forest, Extra Trees, Voting classifiers
17 | Random Forest | RandomForestClassifier, Hyperparameters, Feature importance, OOB score, Use cases
18 | Gradient Boosting | GradientBoosting, XGBoost intro, LightGBM intro, Learning rate, Tree depth
19 | Naive Bayes | GaussianNB, MultinomialNB, BernoulliNB, Assumptions, Applications
20 | Clustering Algorithms | KMeans, Hierarchical clustering, DBSCAN, Silhouette score, Use cases
21 | Dimensionality Reduction | PCA, Kernel PCA, Explained variance, Feature compression, Visualization
22 | Anomaly Detection | Isolation Forest, One-Class SVM, LOF, Use cases, Evaluation challenges
23 | Model Selection & Tuning | GridSearchCV, RandomizedSearchCV, Hyperparameters, Scoring, Best estimators
24 | Pipelines & ColumnTransformer | Pipeline, Feature unions, ColumnTransformer, End-to-end ML, Avoid leakage
25 | Imbalanced Datasets | Class imbalance, SMOTE (from imbalanced-learn), Class weights, Evaluation metrics, Best practices
26 | Text Feature Extraction | CountVectorizer, TF-IDF, N-grams, Stop words, Sparse matrices
27 | Model Persistence | joblib, pickle, Saving models, Loading models, Versioning
28 | Model Interpretation | Coefficients, Feature importance, Permutation importance, Partial dependence, SHAP intro
29 | Scikit-learn with Pipelines in Production | Reproducibility, Monitoring, Data drift, Model updates, Best practices
30 | Scikit-learn Best Practices | Code structure, Experiment tracking, Documentation, Common pitfalls, Next steps

Interview question

Basic Level

  1. What is scikit-learn?
  2. What type of library is scikit-learn?
  3. Which language is scikit-learn written in?
  4. What are estimators in scikit-learn?
  5. What is the fit() method?
  6. What is the predict() method?
  7. Difference between fit() and transform()?
  8. What is supervised learning?
  9. What is unsupervised learning?
  10. What is train_test_split?
  11. What are features and labels?
  12. What is a dataset in scikit-learn?
  13. What are built-in datasets?
  14. What is accuracy score?
  15. What is a confusion matrix?
  16. What is overfitting?
  17. What is underfitting?
  18. What is a regression problem?
  19. What is a classification problem?
  20. What is clustering?
  21. What is scaling?
  22. What is normalization?
  23. What is LabelEncoder?
  24. What is OneHotEncoder?
  25. What are model parameters?

Intermediate Level

  1. What is StandardScaler?
  2. Difference between MinMaxScaler and StandardScaler?
  3. What is logistic regression?
  4. Explain linear regression in scikit-learn.
  5. What is KNN?
  6. How does KNN work?
  7. What is SVM?
  8. What are kernels in SVM?
  9. What is a decision tree?
  10. What are entropy and the Gini index?
  11. What is Random Forest?
  12. What is ensemble learning?
  13. What is cross-validation?
  14. What is K-Fold cross-validation?
  15. What is StratifiedKFold?
  16. What is GridSearchCV?
  17. What is RandomizedSearchCV?
  18. What are hyperparameters?
  19. What is bias-variance tradeoff?
  20. What is the ROC curve?
  21. What is AUC?
  22. What are precision and recall?
  23. What is F1-score?
  24. What is feature importance?
  25. What is PCA?

Advanced Level

  1. How does PCA work internally?
  2. What is explained variance?
  3. Difference between PCA and LDA?
  4. What is Gradient Boosting?
  5. Difference between Bagging and Boosting?
  6. What is AdaBoost?
  7. What is Isolation Forest?
  8. What is DBSCAN?
  9. How does KMeans clustering work?
  10. What is silhouette score?
  11. What is feature selection?
  12. Difference between feature selection and extraction?
  13. What is Recursive Feature Elimination?
  14. What is a pipeline in scikit-learn?
  15. Why are pipelines important?
  16. What is ColumnTransformer?
  17. How to handle categorical features?
  18. How does scikit-learn handle missing values?
  19. What is SimpleImputer?
  20. What is model persistence?
  21. Difference between pickle and joblib?
  22. What is a partial dependence plot?
  23. What is permutation importance?
  24. How to avoid data leakage?
  25. How to handle imbalanced datasets?
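
Several of the questions above (pipelines, ColumnTransformer, SimpleImputer, data leakage, imbalanced data) can be practised together. The following is a minimal sketch, not a reference solution: the toy data, the column names (`age`, `income`, `city`) and the choice of LogisticRegression are assumptions made purely for illustration. Because imputation, scaling and encoding sit inside the pipeline, they are re-fitted on each training fold only, which is what keeps test-fold statistics from leaking into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are invented for this sketch.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.loc[rng.choice(n, 20, replace=False), "income"] = np.nan  # inject missing values
y = (X["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Numeric columns: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode, ignoring categories unseen during fit.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# class_weight="balanced" reweights classes inversely to their frequency.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Every preprocessing step is re-fitted inside each training fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(scores.mean())
```

Fitting the scaler or imputer on the full dataset before splitting would be the leaky variant of the same workflow.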

Expert Level

  1. How does scikit-learn's architecture work?
  2. Explain estimator, transformer, predictor design.
  3. How does scikit-learn optimize performance?
  4. What is warm_start?
  5. How does scikit-learn use NumPy internally?
  6. What are sparse matrices?
  7. How does scikit-learn handle sparse data?
  8. What is SGDClassifier?
  9. Difference between batch and online learning?
  10. How to scale scikit-learn for large datasets?
  11. What are limitations of scikit-learn?
  12. Difference between scikit-learn and TensorFlow?
  13. Difference between scikit-learn and PyTorch?
  14. How to integrate scikit-learn with pandas?
  15. What is a custom estimator?
  16. How to implement a custom transformer?
  17. What is the scoring parameter?
  18. How to evaluate regression models?
  19. What is R² score?
  20. What is model drift?
  21. How to monitor models in production?
  22. What is reproducibility in ML?
  23. How to set random_state?
  24. Explain numerical stability issues.
  25. What are best practices in scikit-learn?
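
For the custom estimator and custom transformer questions, one possible sketch is shown below. The class name `QuantileClipper` and its parameters are hypothetical, chosen only for this example. It follows the usual conventions: constructor arguments are stored unchanged, state learned in `fit` gets a trailing underscore, and `fit` returns `self`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments as-is so get_params()/set_params() work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
clipper = QuantileClipper(lower=0.0, upper=0.75).fit(X_train)
clipped = clipper.transform(np.array([[0.0], [50.0]]))
print(clipped)  # values are clipped to the range learned from X_train
```

Inheriting from `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params` and `set_params`, so the class can be used inside a `Pipeline` or tuned with `GridSearchCV`.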

Related Topics


#Pandas

Key Concepts


S.No | Topic | Sub-Topics
1 | Pandas | Overview, Installation, Series, DataFrame, Basic operations
2 | Series Basics | Creating Series, Indexing, Slicing, Series methods, Data types
3 | DataFrame Basics | Create DataFrame, Index/Columns, Shape, dtypes, head/tail
4 | Data Selection | loc, iloc, ix (removed in pandas 1.0), column selection, row selection
5 | Data Filtering | Boolean indexing, conditions, isin, between, query()
6 | Missing Data | isnull, notnull, fillna, dropna, interpolation
7 | Data Cleaning | Duplicates, rename, replace, strip whitespace, type conversion
8 | Data Transformation | apply, map, applymap (renamed DataFrame.map in pandas 2.1), lambda functions, vectorized operations
9 | Aggregation & Grouping | groupby, aggregate, transform, filter, pivot tables
10 | Sorting & Ranking | sort_values, sort_index, rank, ascending/descending, multi-level sorting
11 | Indexing & MultiIndex | set_index, reset_index, hierarchical index, slicing, cross-section
12 | Concatenation & Merging | concat, append (removed in pandas 2.0; use concat), merge, join, indicator
13 | Reshaping Data | melt, pivot, stack, unstack, wide to long format
14 | Time Series Basics | Datetime conversion, date_range, indexing, resampling, frequency
15 | Time Series Advanced | rolling, expanding, shifting, lag/lead, moving average
16 | String Operations | str methods, contains, replace, split, regex
17 | Visualization with Pandas | plot, line, bar, histogram, scatter
18 | Reading/Writing Data | read_csv, read_excel, read_json, to_csv, to_excel
19 | Advanced I/O | read_sql, read_parquet, read_hdf, read_pickle, compression
20 | Exploratory Data Analysis | describe, info, value_counts, correlation, unique
21 | Multi-Column Operations | arithmetic, apply, assign, lambda, broadcasting
22 | Window Functions | rolling, expanding, ewm, groupby with window, custom functions
23 | Categorical Data | category dtype, conversion, codes, sorting, filtering
24 | Sampling & Subsetting | sample, head/tail, nth, slicing, random sampling
25 | Performance Optimization | vectorization, eval/query, categorical, chunking, memory usage
26 | MultiIndex Advanced | stack/unstack, xs, swaplevel, sortlevel (removed in pandas 1.0; use sort_index), indexing tricks
27 | Custom Functions | apply, pipe, lambda, function chaining, reusable utilities
28 | Integration with NumPy & SciPy | array operations, broadcasting, linear algebra, statistical functions, interoperability
29 | Real-World Data Projects | EDA, cleaning, aggregation, visualization, export results
30 | End-to-End Project | Data collection, cleaning, analysis, feature engineering, visualization
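
A few of the topics in this table (missing data, grouping, merging with `indicator`, concatenation) can be tried in one short session. The sketch below uses made-up frames; the `orders` and `customers` tables and their column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy tables, invented for this sketch.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", "bob", "ann", "cid"],
    "amount": [10.0, None, 30.0, 5.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob"],
    "region": ["east", "west"],
})

# Missing data: replace the missing amount with 0.0.
orders["amount"] = orders["amount"].fillna(0.0)

# Aggregation and grouping: named aggregation gives one row per customer.
totals = orders.groupby("customer", as_index=False).agg(
    total=("amount", "sum"),
    n_orders=("order_id", "count"),
)

# Merging: indicator=True adds a _merge column showing where each row matched.
merged = totals.merge(customers, on="customer", how="left", indicator=True)

# Concatenation: pd.concat is the way to stack frames row-wise.
combined = pd.concat([orders, orders.iloc[:1]], ignore_index=True)

print(merged)
```

In `merged`, a customer with no match in `customers` is flagged `left_only`, which is a quick way to audit join coverage.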

Interview question


Related Topics