Question 1

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

[Image: code for the daily job]

Which statement describes the execution and results of running the above query multiple times?

Options :

A : Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

B : Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

C : Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

D : Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

E : Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.

Answer: B
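
The job's code is not reproduced in this text, so the following is only a hedged sketch of the pattern that answer B describes, not the original query. It assumes a batch read of the change feed with a fixed startingVersion of 0, filtered to inserts and post-update images, followed by an append write. The function name append_change_history and the target table name bronze_history are hypothetical.

```python
def append_change_history(spark, source="bronze", target="bronze_history"):
    """Assumed pattern: batch-read the whole change feed, then append it.

    `target` and this function name are hypothetical, for illustration only.
    """
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        # A fixed starting version means every run re-reads the full history.
        .option("startingVersion", 0)
        .table(source)
        # Keep new rows and the post-update image of changed rows.
        .filter("_change_type IN ('insert', 'update_postimage')")
    )
    # Append mode adds these rows on top of whatever earlier runs already wrote.
    changes.write.mode("append").saveAsTable(target)
```

Nothing in this sketch records which versions were already processed, so a second run appends the same history again. Tracking the last processed version, or reading the feed as a stream with a checkpoint, would be one way to avoid that.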

Question 2

A DLT pipeline includes the following streaming tables:

• raw_iot ingests raw device measurement data from a heart rate tracking device.

• bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table, while recomputing the downstream bpm_stats table when a pipeline update is run?

Options :

A : Set the pipelines.reset.allowed property to false on raw_iot

B : Set the skipChangeCommits flag to true on raw_iot

C : Set the pipelines.reset.allowed property to false on bpm_stats

D : Set the skipChangeCommits flag to true on bpm_stats

Answer: A

Question 3

Which of the following statements best describes the use of Python wheels in Databricks?

Options :

A : A Python %wheel is a magic command that allows installing Python packages on Databricks clusters

B : A Python wheel is a virtual environment for isolating the Python interpreter, libraries and modules in a notebook from other notebooks.

C : A Python wheel is a repository for hosting, managing, and distributing Python binaries and artifacts in a Databricks workspace

D : A Python wheel is a binary distribution format for installing custom Python code packages on Databricks Clusters

E : A Python wheel is a package installer tool that serves as an alternative to ‘pip’

Answer: D

Question 4

A junior data engineer is using the following code to de-duplicate raw streaming data and insert it into a target Delta table:

(spark.readStream
    .table("orders_raw")
    .dropDuplicates(["order_id", "order_timestamp"])
    .writeStream
    .option("checkpointLocation", "dbfs:/checkpoints")
    .table("orders_unique"))

A senior data engineer pointed out that this approach is not sufficient to guarantee distinct records in the target table when there are late-arriving, duplicate records.

Which of the following could explain the senior data engineer’s remark?

Options :

A : Watermarking is also needed to only track state information for a window of time in which we expect records could be delayed.

B : A ranking function is also needed to ensure processing only the most recent records

C : A window function is also needed to apply deduplication for each non-overlapping interval.

D : The new records also need to be deduplicated against data previously inserted into the table.

E : More information is needed to determine the correct response

Answer: D
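
One way to act on answer D is an insert-only merge inside foreachBatch, sketched below under stated assumptions. Here dropDuplicates removes duplicates within each micro-batch, and the MERGE skips keys that already exist in orders_unique. The function names and the temporary view name orders_microbatch are hypothetical, and DataFrame.sparkSession is assumed to be available (PySpark 3.3 or later).

```python
KEYS = ["order_id", "order_timestamp"]

# Insert-only merge: rows whose keys already exist in the target are skipped.
INSERT_ONLY_MERGE = """
    MERGE INTO orders_unique AS t
    USING orders_microbatch AS s
    ON t.order_id = s.order_id AND t.order_timestamp = s.order_timestamp
    WHEN NOT MATCHED THEN INSERT *
"""


def insert_new_orders(micro_batch_df, batch_id):
    """Deduplicate one micro-batch, then insert only keys missing from the target."""
    micro_batch_df.dropDuplicates(KEYS).createOrReplaceTempView("orders_microbatch")
    # Use the session that owns the micro-batch so the temp view is visible.
    micro_batch_df.sparkSession.sql(INSERT_ONLY_MERGE)


def start_dedup_stream(spark):
    """Start the stream; each micro-batch is handled by insert_new_orders."""
    return (
        spark.readStream
        .table("orders_raw")
        .writeStream
        .foreachBatch(insert_new_orders)
        .option("checkpointLocation", "dbfs:/checkpoints")
        .start()
    )
```

With this shape, a late-arriving duplicate that lands in a later micro-batch is compared against the rows already in orders_unique rather than only against its own batch.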

Question 5

The data engineering team has a Silver table called ‘sales_cleaned’ where new sales data is appended in near real-time.

They want to create a new Gold-layer entity on top of the ‘sales_cleaned’ table to calculate the year-to-date (YTD) sales amount. The new entity will have the following schema:

country_code STRING, category STRING, ytd_total_sales FLOAT, updated TIMESTAMP

It’s enough for these metrics to be recalculated once daily. But since they will be queried very frequently by several business teams, the data engineering team wants to cut down the potential costs and latency associated with materializing the results.

Which of the following solutions meets these requirements?

Options :

A : Define the new entity as a view to avoid persisting the results each time the metrics are recalculated

B : Define the new entity as a global temporary view since it can be shared between notebooks or jobs that share computing resources.

C : Configure a nightly batch job to recalculate the metrics and store them as a table that is overwritten with each update

D : Create multiple tables, one per business team, so the metrics can be queried quickly and efficiently.

E : All the above solutions meet the requirements since Databricks uses the Delta Caching feature

Answer: C
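
Answer C can be sketched as a small batch function that a scheduled job runs once a day. This is a minimal illustration, not a definitive implementation: the source columns sale_date and amount, the target table name sales_ytd, and the function name are assumptions that do not appear in the question.

```python
# Hypothetical source columns: sale_date and amount.
YTD_SQL = """
    SELECT country_code,
           category,
           CAST(SUM(amount) AS FLOAT) AS ytd_total_sales,
           current_timestamp() AS updated
    FROM sales_cleaned
    WHERE year(sale_date) = year(current_date())
    GROUP BY country_code, category
"""


def refresh_ytd_sales(spark, target="sales_ytd"):
    """Recompute the YTD metrics in batch and overwrite the stored table."""
    # Overwrite replaces yesterday's results instead of accumulating rows.
    spark.sql(YTD_SQL).write.mode("overwrite").saveAsTable(target)
```

Business teams then query the stored table directly, so the aggregation cost is paid once per day rather than on every query.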
