Increase your chances of passing the Databricks Certified Associate Developer for Apache Spark 3.5 exam on your first try. Practice with our free online Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 mock test, designed to help you prepare effectively and confidently.
A Spark application needs to read multiple Parquet files from a directory where the files have differing but compatible schemas. The data engineer wants to create a DataFrame that includes all columns from all files. Which code should the data engineer use to read the Parquet files and include all columns using Apache Spark?
A Spark developer is developing a Spark application to monitor task performance across a cluster. One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis. Which technique should the developer use?
Given this code:

withWatermark("event_time", "10 minutes")
  .groupBy(window("event_time", "15 minutes"))
  .count()

What happens to data that arrives after the watermark threshold?
A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate. Which code snippet can be used to meet this requirement?
Given a DataFrame df that has 10 partitions, after running the code: result = df.coalesce(20) How many partitions will the result DataFrame have?
© Copyrights FreeMockExams 2026. All Rights Reserved