Text Material Preview
Databricks Certified Data Engineer Professional exam dumps questions are the best material for you
to test all the related Databricks exam topics. By using the Databricks Certified Data Engineer
Professional exam dumps questions and practicing your skills, you can increase your confidence and
chances of passing the Databricks Certified Data Engineer Professional exam.
Features of Dumpsinfo’s products
Instant Download
Free Update in 3 Months
Money back guarantee
PDF and Software
24/7 Customer Support
Besides, Dumpsinfo also provides unlimited access: you can get all Dumpsinfo files at the lowest price.
Free Databricks Certified Data Engineer Professional exam dumps questions are available below
for you to study.
Full version: Databricks Certified Data Engineer Professional Exam Dumps
Questions
1.FROM raw_table;
D. 1.SELECT transaction_id, date
2.You would like to build a Spark Structured Streaming process that reads from a Kafka queue and writes to a Delta
table every 15 minutes. What is the correct trigger option?
A. trigger("15 minutes")
B. trigger(process "15 minutes")
C. trigger(processingTime = 15)
D. trigger(processingTime = "15 Minutes")
E. trigger(15)
Answer: D
Explanation:
The answer is trigger(processingTime = "15 Minutes").
Triggers:
• Unspecified: this is the default and is equivalent to using processingTime="500ms".
• Fixed-interval micro-batches: .trigger(processingTime="2 minutes"). The query will be executed in micro-batches kicked off at the user-specified interval.
• One-time micro-batch: .trigger(once=True). The query will execute a single micro-batch to process all the available data and then stop on its own.
• One-time micro-batch (improved): .trigger(availableNow=True), a newer and better version of trigger(once=True).
Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake
and Auto Loader sources. This functionality combines the batch processing approach of trigger once
with the ability to configure batch size, resulting in multiple parallelized batches that give greater
control for right-sizing batches and the resulting files.
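A minimal PySpark sketch of such a 15-minute trigger; the Kafka broker, topic, checkpoint path, and target table name below are illustrative assumptions, not values from the question.
# Read from Kafka and write to a Delta table in 15-minute micro-batches (assumed names).
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .trigger(processingTime="15 minutes")
    .table("orders_bronze"))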
3.dbutils.notebook.run("ful-notebook-name", 60, {"argument": "data", "argument2": "data2",
...})
4.CREATE OR REPLACE TABLE sales
5.At the end of the inventory process, a file gets uploaded to the cloud object storage. You are asked
to build a process to ingest data; which of the following methods can be used to ingest the data
incrementally? The schema of the file is expected to change over time, and the ingestion process
should be able to handle these changes automatically. Below is the Auto Loader command to load
the data; fill in the blanks for successful execution of the code below.
6. as select ' {
7. THEN INSERT *;
E. 1. DROP DUPLICATES
8. }]
9.def check_input(x,y):
10.Which of the following functions can be used to convert JSON string to Struct data type?
A. TO_STRUCT (json value)
B. FROM_JSON (json value)
C. FROM_JSON (json value, schema of json)
D. CONVERT (json value, schema of json)
E. CAST (json value as STRUCT)
Answer: C
Explanation:
Syntax: from_json(jsonStr, schema [, options])
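A quick illustrative example; the JSON string and schema here are assumptions, not taken from the question.
# Parse a JSON string into a struct with from_json (schema given as a DDL string).
spark.sql("""
  SELECT from_json('{"id":1,"name":"widget"}', 'STRUCT<id: INT, name: STRING>') AS parsed
""").show(truncate=False)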
11.You noticed that a team member started using an all-purpose cluster to develop a notebook and
used the same all-purpose cluster to set up a job that runs every 30 minutes to update the underlying
tables used in a dashboard.
What would you recommend to reduce the overall cost of this approach?
A. Reduce the size of the cluster
B. Reduce the number of nodes and enable auto scale
C. Enable auto termination after 30 mins
D. Change the cluster all-purpose to job cluster when scheduling the job
E. Change the cluster mode from all-purpose to single-mode
Answer: D
Explanation:
While using an all-purpose cluster is fine during development, any time you don't need to interact
with a notebook, especially for a scheduled job, it is less expensive to use a job cluster. Using an
all-purpose cluster can be twice as expensive as a job cluster.
Please note: the compute cost you pay the cloud provider for the same cluster type and size is the
same for an all-purpose cluster and a job cluster; the only difference is the DBU cost.
Total cost of a cluster = total cost of VM compute (Azure, AWS, or GCP) + cost per DBU
The per-DBU cost varies between all-purpose and job clusters.
Here is a recent cost estimate from AWS comparing jobs compute and all-purpose compute: $0.15
per DBU for jobs compute vs $0.55 per DBU for all-purpose compute.
How do I check the DBU cost for my cluster?
When you click on an existing cluster or look at the cluster details, the DBU rate is shown in the top
right corner.
12.table("target_table")
A. checkpointlocation, complete, True
B. targetlocation, overwrite, True
C. checkpointlocation, True, overwrite
D. checkpointlocation, True, complete
E. checkpointlocation, overwrite, True
Answer: A
13. Optimizes query performance for business-critical data
Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you
will see varying questions targeting each layer and its purpose.
14. unitsSold int) (Correct)
D. 1. CREATE TABLE USING DELTA transactions (
15.Which distribution does Databricks support for installing custom Python code packages?
A. sbt
B. CRAN
C. npm
D. Wheels
E. jars
Answer: D
16.The marketing team is launching a new campaign and wants to monitor its performance for the
first two weeks. They would like to set up a dashboard with a refresh schedule that runs every 5
minutes. Which of the below steps can be taken to reduce the cost of this refresh over time?
A. Reduce the size of the SQL Cluster size
B. Reduce the max size of auto scaling from 10 to 5
C. Setup the dashboard refresh schedule to end in two weeks
D. Change the spot instance policy from reliability optimized to cost optimized
E. Always use X-small cluster
Answer: C
Explanation:
The answer is to set up the dashboard refresh schedule to end in two weeks.
17.spark.readStream \
18. then:
19.You are still noticing slowness in queries after performing OPTIMIZE, which helped you resolve the
small-files problem. The column you are using to filter the data (transactionId) has high cardinality and
is an auto-incrementing number.
Which Delta optimization can you enable to filter data effectively based on this column?
A. Create a BLOOM FILTER index on transactionId
B. Perform Optimize with Zorder on transactionId (Correct)
C. transactionId has high cardinality, you cannot enable any optimization.
D. Increase the cluster size and enable delta optimization
E. Increase the driver size and enable delta optimization
Answer: B
Explanation:
The answer is, perform OPTIMIZE with Z-order on transactionId.
Here is a simple explanation of how Z-order works: once the data is naturally ordered, when a file is
scanned only the data that is needed is brought into Spark's memory; based on the column min and
max values, Delta knows which data files need to be scanned.
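A minimal sketch of the command; the table name is an assumption.
# Compact small files and co-locate data by the high-cardinality filter column.
spark.sql("OPTIMIZE transactions ZORDER BY (transactionId)")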
20. spark.sql(f"SELECT * FROM sales WHERE orderDate = '{order_date}'")
(Correct)
E. 1. order_date = dbutils.widgets.get("widget_order_date")
21. .load(dataSource)
22. Eliminates duplicate records
23.Where are Interactive notebook results stored in Databricks product architecture?
A. Data plane
B. Control plane
C. Data and Control plane
D. JDBC data source
E. Databricks web application
Answer: C
Explanation:
The answer is Data and Control plane.
Only job results are stored in the data plane (your storage); interactive notebook results are stored in
a combination of the control plane (partial results for presentation in the UI) and customer storage.
https://docs.microsoft.com/en-us/azure/databricks/getting-started/overview#--high-level-architecture
How to change this behavior?
You can change this behavior using the Workspace/Admin Console settings for that workspace. Once
enabled, all interactive results are stored in the customer account (data plane), except for the new
notebook visualization feature Databricks recently introduced, which still stores some metadata in the
control plane irrespective of this setting. Please refer to the documentation for more details.
Why is this important to know?
I recently worked on a project where we had to deal with sensitive customer information, and we had
a security requirement that all of the data, including notebook results, needed to be stored in the data
plane.
24.At the end of the inventory process, a file gets uploaded to the cloud object storage. You are asked
to build a process to ingest data; which of the following methods can be used to ingest the data
incrementally? The schema of the file is expected to change over time, and the ingestion process
should be able to handle these changes automatically. Below is the Auto Loader command to load
the data; fill in the blanks for successful execution of the code below.
25. Two junior data engineers are authoring separate parts of a single data pipeline notebook. They
are working on separate Git branches so they can pair program on the same notebook
simultaneously. A senior data engineer experienced in Databricks suggests there is a better
alternative for this type of collaboration.
Which of the following supports the senior data engineer's claim?
A. Databricks Notebooks support commenting and notification comments
B. Databricks Notebooks support the creation of interactive data visualizations
C. Databricks Notebooks support real-time co-authoring on a single notebook
D. Databricks Notebooks support the use of multiple languages in the same notebook
E. Databricks Notebooks support automatic change-tracking and versioning
Answer: C
26.The table temp_data below has one column called raw that contains JSON data recording the
temperature every four hours in the day for the city of Chicago. You are asked to calculate the
maximum temperature that was ever recorded for the 12:00 PM hour across all the days. Parse the
JSON data and use the necessary array function to calculate the max temp.
Table: temp_data
Column: raw
Datatype: string
Expected output: 58
A. 1.select max(raw.chicago.temp[3]) from temp_data
B. 1.select array_max(raw.chicago[*].temp[3]) from temp_data
C. 1.select array_max(from_json(raw['chicago'].temp[3],'array<int>')) from temp_data
D. 1.select array_max(from_json(raw:chicago[*].temp[3],'array<int>')) from temp_data
E. 1.select max(from_json(raw:chicago[3].temp[3],'array<int>')) from temp_data
Answer: D
Explanation:
Note: This is a difficult question; you are more likely to see easier questions similar to this, but the
better prepared you are, the easier it is to pass the exam. Use the link below to look for more
examples; it will definitely help you: https://docs.databricks.com/optimizations/semi-structured.html
Here is the solution, step by step.
If you want to try this solution, use the DDL below.
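A hypothetical DDL and query to reproduce the example; the exact JSON layout is an assumption inferred from the expected output of 58, not the original table definition.
# Assumed layout: raw holds one JSON document per row with a "chicago" array of days,
# each day holding six 4-hourly readings, so index 3 is the 12:00 PM reading.
spark.sql("CREATE OR REPLACE TABLE temp_data (raw STRING) USING DELTA")
spark.sql("""INSERT INTO temp_data VALUES
  ('{"chicago":[{"temp":[25,28,45,56,39,25]},{"temp":[25,28,49,54,38,25]},{"temp":[25,28,49,58,38,25]}]}')""")
spark.sql("""
  SELECT array_max(from_json(raw:chicago[*].temp[3], 'array<int>')) AS max_noon_temp
  FROM temp_data
""").show()   # returns 58 under the assumed layout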
27. .writeStream.option("checkpointLocation", checkpoint_directory)\
28.query = "select * from {schema_name}.{table_name}"
C. 1.table_name = "sales"
29. INNER JOIN sales_monthly s on s.customer_id = c.customer_id
After you ran the above command, the Marketing team quickly wanted to review the old data that was
in the table.
How does INSERT OVERWRITE impact the data in the customer_sales table if you want to see the
previous version of the data prior to running the above statement?
A. Overwrites the data in the table, all historical versions of the data, you can not time travel to
previous versions
B. Overwrites the data in the table but preserves all historical versions of the data, you can time travel
to previous versions
C. Overwrites the current version of the data but clears all historical versions of the data, so you can
not time travel to previous versions.
D. Appends the data to the current version, you can time travel to previous versions
E. By default, overwrites the data and schema, you cannot perform time travel
Answer: B
Explanation:
The answer is, INSERT OVERWRITE overwrites the current version of the data but preserves all
historical versions of the data; you can time travel to previous versions.
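A minimal sketch of checking the pre-overwrite data with time travel; the table name comes from the question, while the version number is an assumption you would read from the table history.
# Find the version written just before the INSERT OVERWRITE, then query it.
spark.sql("DESCRIBE HISTORY customer_sales").show()
spark.sql("SELECT * FROM customer_sales VERSION AS OF 1").show()   # assumed earlier version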
30.SELECT * FROM CUSTOMERS_2020
C. 1. SELECT * FROM CUSTOMERS_2021 C1
31.Preserves grain of original data (without aggregation)
32.The default retention threshold of VACUUM is 7 days. The internal audit team asked for certain
tables to maintain at least 365 days of history as part of a compliance requirement. Which of the
below settings is needed to implement this?
A. ALTER TABLE table_name set TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
B. MODIFY TABLE table_name set TBLPROPERTY (delta.maxRetentionDays = 'interval 365 days')
C. ALTER TABLE table_name set EXTENDED TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
D. ALTER TABLE table_name set EXTENDED TBLPROPERTIES (delta.vaccum.duration = 'interval 365 days')
Answer: A
Explanation:
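The deletedFileRetentionDuration table property controls how long removed data files are kept before VACUUM is allowed to delete them, so setting it to 365 days satisfies the audit requirement. A minimal sketch, assuming a table named table_name:
# Keep removed files (and therefore time travel history) for at least 365 days.
spark.sql("""
  ALTER TABLE table_name
  SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
""")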
33.Which of the below commands can be used to drop a DELTA table?
A. DROP DELTA table_name
B. DROP TABLE table_name
C. DROP TABLE table_name FORMAT DELTA
D. DROP table_name
Answer: B
34.writeStream
35.Which of the following statements is true about Databricks Repos?
A. You can approve the pull request if you are the owner of Databricks repos
B. A workspace can only have one instance of git integration
C. Databricks Repos and Notebook versioning are the same features
D. You cannot create a new branch in Databricks repos
E. Databricks repos allow you to comment and commit code changes and push them to a remote
branch
Answer: E
Explanation:
See the diagram below to understand the roles Databricks Repos and the Git provider play when
building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray
are done in a Git provider like GitHub or Azure DevOps.
36.Which statement characterizes the general programming model used by Spark Structured
Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data
throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency
for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an
unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for
cached stages.
Answer: D
37.SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
38. country STRING
A senior data engineer wants to create a new table from this table using the following command:
39. url = "jdbc:/sqmple_db",
40. unitsSold int)
41.The below Spark command is looking to create a summary table based on customerId and the
number of times the customerId is present in the event_log Delta table, and to write a one-time
micro-batch to a summary table; fill in the blanks to complete the query.
42. Which of the following is a continuous probability distribution?
A. Binomial probability distribution
B. Negative binomial distribution
C. Poisson probability distribution
D. Normal probability distribution
Answer: D
43. .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
44.load(data_source)
45.option("_______",’ dbfs:/location/checkpoint/’)
46.You have noticed that Databricks SQL queries are running slowly. You are asked to look into why
the queries are running slowly and identify steps to improve the performance. When you looked at the
issue, you noticed all the queries are running in parallel and using a SQL endpoint (SQL warehouse)
with a single cluster.
Which of the following steps can be taken to improve the performance/response times of the queries?
*Please note: Databricks recently renamed SQL endpoint to SQL warehouse.
A. They can turn on the Serverless feature for the SQL endpoint (SQL warehouse).
B. They can increase the maximum bound of the SQL endpoint (SQL warehouse)’s scaling range
C. They can increase the warehouse size of the SQL endpoint (SQL warehouse) from 2X-Small to 4X-Large.
D. They can turn on the Auto Stop feature for the SQL endpoint (SQL warehouse).
E. They can turn on the Serverless feature for the SQL endpoint (SQL warehouse) and change the
Spot Instance Policy to “Reliability Optimized.”
Answer: B
Explanation:
The answer is, they can increase the maximum bound of the SQL endpoint's scaling range. When you
increase the max scaling range, more clusters are added, so queries can start running on the
available clusters instead of waiting in the queue; see below for more explanation.
The question tests your ability to scale a SQL endpoint (SQL warehouse), and you have to look for
cue words or understand whether the queries are running sequentially or concurrently. If the queries
are running sequentially, then scale up (increase the size of the cluster from 2X-Small to 4X-Large); if
the queries are running concurrently or with more users, then scale out (add more clusters).
SQL endpoint (SQL warehouse) overview (please read all of the below points and the below diagram
to understand):
47.Once a cluster is deleted, which of the below additional actions need to be performed by the administrator?
A. Remove virtual machines but storage and networking are automatically dropped
B. Drop storage disks but Virtual machines and networking are automatically dropped
C. Remove networking but Virtual machines and storage disks are automatically dropped
D. Remove logs
E. No action needs to be performed. All resources are automatically removed.
Answer: E
Explanation:
What is Delta?
Delta Lake is:
• Open source
• Built on a standard data format
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake is not:
• A proprietary technology
• A storage format
• A storage medium
• A database service or data warehouse
48. select 2 as batchId ,
49.The data engineering team is using a number of SQL queries to review data quality and monitor
the ETL job every day. Which of the following approaches can be used to set up a schedule and
automate this process?
A. They can schedule the query to run every 1 day from the Jobs UI
B. They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.
C. They can schedule the query to run every 12 hours from the Jobs UI.
D. They can schedule the query to refresh every 1 day from the SQL endpoint’s page in Databricks
SQL.
E. They can schedule the query to refresh every 12 hours from the SQL endpoint’s page in
Databricks SQL
Answer: B
Explanation:
Individual queries can be refreshed on a schedule.
To set the schedule:
50.The data analyst team had put together queries that identify items that are out of stock based on
orders and replenishment, but when they run them all together for the final output the team noticed it
takes a really long time. You were asked to look at why the queries are running slowly and identify
steps to improve the performance, and when you looked at it you noticed all the queries are running
sequentially and using a SQL endpoint cluster.
Which of the following steps can be taken to resolve the issue?
Here is the example query
51.Which of the following table constraints can be enforced on Delta Lake tables?
A. Primary key, foreign key, Not Null, Check Constraints
B. Primary key, Not Null, Check Constraints
C. Default, Not Null, Check Constraints
D. Not Null, Check Constraints
E. Unique, Not Null, Check Constraints
Answer: D
Explanation:
The answer is Not Null and Check constraints.
https://docs.microsoft.com/en-us/azure/databricks/delta/delta-constraints
CREATE TABLE events(
  id LONG,
  date STRING,
  location STRING,
  description STRING
) USING DELTA;
ALTER TABLE events CHANGE COLUMN id SET NOT NULL;
ALTER TABLE events ADD CONSTRAINT dateWithinRange CHECK (date > '1900-01-01');
Note: As of DBR 11.1, Databricks added support for primary key and foreign key constraints when
Unity Catalog is enabled, but these are informational only and are not actually enforced. You may ask
why we define them if they are not enforced: these informational constraints are very helpful if you
have a BI tool that can benefit from knowing the relationship between the tables, making it easier to
create reports/dashboards or to understand the data model when using any data modeling tool.
52.Raw copy of ingested data
53.load(rawSalesLocation)\
54.option("_______",’ dbfs:/location/checkpoint/’)
55. WHERE duplicate = False;
C. 1. SELECT DISTINCT *
56.
57.You are designing an analytical store for structured data from your e-commerce platform and
unstructured data from website traffic and the app store. How would you approach deciding where
to store this data?
A. Use a traditional data warehouse for structured data and use a data lakehouse for unstructured data.
B. Data lakehouse can only store unstructured data but cannot enforce a schema
C. Data lakehouse can store structured and unstructured data and can enforce schema
D. Traditional data warehouses are good for storing structured data and enforcing schema
Answer: C
Explanation:
The answer is, Data lakehouse can store structured and unstructured data and can enforce schema
What Is a Lakehouse? - The Databricks Blog
58.How do you check the location of an existing schema in Delta Lake?
A. Run SQL command SHOW LOCATION schema_name
B. Check unity catalog UI
C. Use Data explorer
D. Run SQL command DESCRIBE SCHEMA EXTENDED schema_name
E. Schemas are stored internally in external Hive metastores like MySQL or SQL Server
Answer: D
Explanation:
Here is an example of how it looks
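A minimal sketch; the schema name is an assumption.
# The Location field in the output shows where the schema is stored.
spark.sql("DESCRIBE SCHEMA EXTENDED bronze").show(truncate=False)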
59. Optimizes query performance for business-critical data
Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you
will see varying questions targeting each layer and its purpose.
60.get_source_dataframe(tablename):
61. .option("checkpointLocation", checkpointPath)
62. transactionDate timestamp,
63.You are currently working on storing data you received from different customer surveys. This data
is highly unstructured and changes over time. Why is a lakehouse a better choice compared to a data
warehouse?
A. Lakehouse supports schema enforcement and evolution, traditional data warehouses lack schema
evolution.
B. Lakehouse supports SQL
C. Lakehouse supports ACID
D. Lakehouse enforces data integrity
E. Lakehouse supports primary and foreign keys like a data warehouse
Answer: A
64. ]
Calculate total sales made by all the employees?
Sample data with create table syntax for the data:
65. 'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
66.What is the outcome if you run the command VACUUM transactions RETAIN 0 HOURS?
A. Command will be successful, but no data is removed
B. Command will fail if you have an active transaction running
C. Command will fail, you cannot run the command with retentionDurationCheck enabled
D. Command will be successful, but historical data will be removed
E. Command runs successful and compacts all of the data in the table
Answer: C
Explanation:
The answer is, the command will fail; you cannot run the command while the retention duration check
(spark.databricks.delta.retentionDurationCheck.enabled) is enabled.
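A minimal sketch of the behavior; the config name is the standard Delta safety check, and disabling it is shown only for illustration (risky, since it breaks time travel and can affect concurrent readers).
# With the default safety check on, this raises an error:
#   spark.sql("VACUUM transactions RETAIN 0 HOURS")
# Only after explicitly disabling the check does it run and remove historical files:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM transactions RETAIN 0 HOURS")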
67.Which of the following operations are not supported on a streaming dataset view?
spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_view")
A. SELECT sum(unitssold) FROM streaming_view
B. SELECT max(unitssold) FROM streaming_view
C. SELECT id, sum(unitssold) FROM streaming_view GROUP BY id ORDER BY id
D. SELECT id, count(*) FROM streaming_view GROUP BY id
E. SELECT * FROM streaming_view ORDER BY id
Answer: E
Explanation:
The answer is SELECT * FROM streaming_view ORDER BY id.
Please note: sorting combined with GROUP BY works without any issues; see below for an
explanation of each of the options.
Certain operations are not allowed on streaming data:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
• Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet
supported on streaming Datasets.
• Limit and take the first N rows are not supported on streaming Datasets.
• Distinct operations on streaming Datasets are not supported.
• Deduplication is not supported after aggregation on a streaming Dataset.
• Sorting operations are supported on streaming Datasets only after an aggregation and in Complete
output mode.
Note: sorting without an aggregation function is not supported.
Here is sample code to prove this. Set up a test stream:
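A minimal sketch of such a test stream; the rate source and the column names are assumptions standing in for the original screenshots.
# Build a small streaming view that mimics the sales table.
from pyspark.sql.functions import col

(spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .withColumn("id", col("value") % 5)
    .withColumnRenamed("value", "unitssold")
    .createOrReplaceTempView("streaming_view"))

# In a notebook, display() starts the streaming query; these aggregations run fine:
#   display(spark.sql("SELECT sum(unitssold) FROM streaming_view"))
#   display(spark.sql("SELECT id, sum(unitssold) FROM streaming_view GROUP BY id ORDER BY id"))
# Starting this one fails, because sorting without aggregation is unsupported on streams:
#   display(spark.sql("SELECT * FROM streaming_view ORDER BY id"))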
Sum aggregation function has no issues on stream
Max aggregation function has no issues on stream
Group by with Order by has no issues on stream
Group by has no issues on stream
Order by without group by fails.
68.You are asked to write a Python function that can read data from a Delta table and return the
DataFrame. Which of the following is correct?
A. Python function cannot return a DataFrame
B. Write SQL UDF to return a DataFrame
C. Write SQL UDF that can return tabular data
D. Python function will result in out of memory error due to data volume
E. Python function can return a DataFrame
Answer: E
Explanation:
The answer is, a Python function can return a DataFrame.
The function would look something like this:
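A minimal sketch; the function and parameter names are assumptions.
# A Python function that reads a Delta table and returns a Spark DataFrame.
def get_delta_table_df(table_name: str):
    return spark.read.table(table_name)

df = get_delta_table_df("sales")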
69.outputMode("append")
70. UNION ALL
71.group by product_id
72.live.sales_orders_raw s
73.An hourly batch job is configured to ingest data files from a cloud object storage container, where
each batch represents all records produced by the source system in a given hour. The batch job to
process these records into the lakehouse is sufficiently delayed to ensure no late-arriving data is
missed. The user_id field represents a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT,
auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all
data in the same schema as the source. The next table in the system is named account_current and
is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly,
which implementation can be used to efficiently update the described account_current table as part of
each hourly batch job?
A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured
Streaming trigger once job to batch update newly detected files into the account_current table.
B. Overwrite the account_current table with each batch using the results of a query against the
account_history table grouping by user_id and filtering for the max value of last_updated.
C. Filter records in account_history using the last_updated field and the most recent hour processed,
as well as the max last_login by user_id; write a merge statement to update or insert the most recent
value for each user_id.
D. Use Delta Lake version history to get the difference between the latest version of account_history
and one version prior, then write these records to account_current.
E. Filter records in account_history using the last_updated field and the most recent hour processed,
making sure to deduplicate on username; write a merge statement to update or insert the most recent
value for each username.
Answer: C
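A hedged sketch of the approach in option C; the hour-boundary variable and the temporary view name are assumptions, and last_updated is treated as an epoch value per the BIGINT schema.
# Keep only the latest record per user_id from the hour just processed, then MERGE.
batch_start_ts = 1_700_000_000   # assumed start of the hour being processed (epoch seconds)

latest = spark.sql(f"""
  SELECT user_id, username, user_utc, user_region, last_login, auto_pay, last_updated
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY last_updated DESC) AS rn
    FROM account_history
    WHERE last_updated >= {batch_start_ts}
  ) ranked
  WHERE rn = 1
""")
latest.createOrReplaceTempView("latest_accounts")

spark.sql("""
  MERGE INTO account_current AS t
  USING latest_accounts AS s
  ON t.user_id = s.user_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")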
74.The data engineering team is looking to add a new column to the table, but the QA team would like
to test the change before implementing it in production. Which of the below options allows you to
quickly copy the table from the prod to the QA environment, modify it, and run the tests?
A. DEEP CLONE
B. SHADOW CLONE
C. ZERO COPY CLONE
D. SHALLOW CLONE
E. METADATA CLONE
Answer: D
Explanation:
The answer is SHALLOW CLONE
If you wish to create a copy of a table quickly to test out applying changes without the risk of
modifying the current table, SHALLOW CLONE can be a good option. Shallow clones just copy the
Delta transaction logs, meaning that the data doesn't move, so it can be very quick.
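A minimal sketch; the schema and table names are assumptions.
# Create a quick, metadata-only copy of the prod table for QA testing.
spark.sql("""
  CREATE OR REPLACE TABLE qa.sales
  SHALLOW CLONE prod.sales
""")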
75. A data engineer has ingested data from an external source into a PySpark DataFrame raw_df.
They need to briefly make this data available in SQL for a data analyst to perform a quality assurance
check on the data.
Which of the following commands should the data engineer run to make this data available in SQL for
only the remainder of the Spark session?
A. raw_df.createOrReplaceTempView("raw_df")
B. There is no way to share data between PySpark and SQL
C. raw_df.createTable("raw_df")
D. raw_df.saveAsTable("raw_df")
E. raw_df.write.save("raw_df")
Answer: A
76.AS SELECT * FROM table
A. INSERT OVERWRITE replaces data by default, CREATE OR REPLACE replaces data and schema by default
B. INSERT OVERWRITE replaces data and schema by default, CREATE OR REPLACE replaces data by default
C. INSERT OVERWRITE maintains historical data versions by default, CREATE OR REPLACE clears the historical data versions by default
D. INSERT OVERWRITE clears historical data versions by default, CREATE OR REPLACE maintains the historical data versions by default
E. Both are same and results in identical outcomes
Answer: A
Explanation:
The main difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE (CRAS) is
that CRAS can modify the schema of the table, i.e. it can add new columns or change the data types
of existing columns. By default, INSERT OVERWRITE only overwrites the data.
INSERT OVERWRITE can also be used to overwrite the schema, but only when
spark.databricks.delta.schema.autoMerge.enabled is set to true; if this option is not enabled and there
is a schema mismatch, the command will fail.
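A minimal sketch contrasting the two statements; the table names and the added column are assumptions.
# CREATE OR REPLACE TABLE can change both the data and the schema in one statement.
spark.sql("CREATE OR REPLACE TABLE sales_copy AS SELECT *, current_date() AS load_date FROM sales")

# INSERT OVERWRITE replaces only the data; the target schema must already match the query.
spark.sql("INSERT OVERWRITE TABLE sales_copy SELECT *, current_date() AS load_date FROM sales")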
77. .option("cloudFiles.format", "csv") \
78.What is the purpose of a gold layer in Multi-hop architecture?
A. Optimizes ETL throughput and analytic query performance
B. Eliminate duplicate records
C. Preserves grain of original data, without any aggregations
D. Data quality checks and schema enforcement
E. Powers ML applications, reporting, dashboards and adhoc reports.
Answer: E
Explanation:
The answer is Powers ML applications, reporting, dashboards and adhoc reports.
Review the below link for more info,
Medallion Architecture - Databricks
Gold Layer:
79.When building a DLT pipeline you have two options to create live tables; what is the main
difference between CREATE STREAMING LIVE TABLE vs CREATE LIVE TABLE?
A. CREATE STREAMING LIVE table is used in MULTI HOP Architecture
B. CREATE LIVE TABLE is used when working with Streaming data sources and Incremental data
C. CREATE STREAMING LIVE TABLE is used when working with Streaming data sources and
Incremental data
D. There is no difference, both are the same; CREATE STREAMING LIVE will be deprecated soon
E. CREATE LIVE TABLE is used in DELTA LIVE TABLES, CREATE STREAMING LIVE can only
used in Structured Streaming applications
Answer: C
Explanation:
The answer is, CREATE STREAMING LIVE TABLE is used when working with Streaming data
sources and Incremental data
80. from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
81. A data engineering team needs to query a Delta table to extract rows that all meet the same
condition.
However, the team has noticed that the query is running slowly. The team has already tuned the size
of the data files. Upon investigating, the team has concluded that the rows meeting the condition are
sparsely located throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query?
A. Tuning the file size
B. Bin-packing
C. Data skipping
D. Write as a Parquet file
E. Z-Ordering
Answer: E
82. Preserves grain of original data (without aggregation)
83. Which of the statements is incorrect when choosing between a lakehouse and a data warehouse?
A. Lakehouse can have special indexes and caching which are optimized for Machine learning
B. Lakehouse cannot serve low query latency with high reliability for BI workloads, and is only suitable
for batch workloads.
C. Lakehouse can be accessed through various API’s including but not limited to Python/R/SQL
D. In traditional data warehouses, storage and compute are coupled.
E. Lakehouse uses standard data formats like Parquet.
Answer: B
Explanation:
The answer is: Lakehouse cannot serve low query latency with high reliability for BI workloads and is
only suitable for batch workloads (this statement is incorrect).
A lakehouse can replace traditional warehouses by leveraging storage and compute optimizations like
caching to serve BI workloads with low query latency and high reliability.
Focus on comparisons between the Spark cache vs the Delta cache.
https://docs.databricks.com/delta/optimizations/delta-cache.html
What Is a Lakehouse? - The Databricks Blog
84.What is the purpose of gold layer in Multi hop architecture?
A. Optimizes ETL throughput and analytic query performance
B. Eliminate duplicate records
C. Preserves grain of original data, without any aggregations
D. Data quality checks and schema enforcement
E. Optimized query performance for business-critical data
Answer: E
Explanation:
Medallion Architecture - Databricks
Gold Layer:
85.COPY INTO table_name
86.You have written a notebook to generate a summary data set for reporting. The notebook was
scheduled using a job cluster, but you realized it takes 8 minutes to start the cluster. What feature can
be used to start the cluster in a timely fashion so your job can run immediately?
A. Set up an additional job to run ahead of the actual job so the cluster is already running when the second job starts
B. Use the Databricks cluster pools feature to reduce the startup time
C. Use Databricks Premium edition instead of Databricks standard edition
D. Pin the cluster in the cluster UI page so it is always available to the jobs
E. Disable auto termination so the cluster is always running
Answer: B
Explanation:
Cluster pools allow us to reserve VMs ahead of time; when a new job cluster is created, VMs are
grabbed from the pool. Note: while the VMs are sitting idle in the pool waiting to be used by a cluster,
the only cost incurred is the cloud provider (e.g. Azure) cost; the Databricks runtime (DBU) cost is only
billed once a VM is allocated to a cluster.
Here is a demo of how to set up a pool and follow some best practices:
87.
88.Review the following error traceback:
Which statement describes the error being raised?
A. The code executed was PySpark but was executed in a Scala notebook.
B. There is no column in the table named heartrateheartrateheartrate
C. There is a type error because a column object cannot be multiplied.
D. There is a type error because a DataFrame object cannot be multiplied.
E. There is a syntax error because the heartrate column is not correctly identified as a column.
Answer: E
89.option("checkpointLocation", checkpointPath)
90. pass
Answer: A
Explanation:
The answer is,
91.option("____", “dbfs:/location/silver")
92.How are Delta tables stored?
A. A directory where parquet data files are stored, with a _delta_log subdirectory where the metadata
and the transaction log are stored as JSON files.
B. A directory where parquet data files are stored; all of the metadata is stored in memory
C. A directory where parquet data files are stored in the data plane, with a _delta_log subdirectory
where metadata, history, and logs are stored in the control plane.
D. A directory where parquet data files are stored; all of the metadata is stored in parquet files
E. Data is stored in the data plane, and metadata and the Delta log are stored in the control plane
Answer: A
Explanation:
The answer is: a directory where parquet data files are stored, with a _delta_log subdirectory where
the metadata and the transaction log are stored as JSON files.
93. INSERT OVERWRITE customer_sales
94.select
95. You are using k-means clustering to classify heart patients for a hospital. You have chosen
Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you
create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters.
What should you do?
A. Identify additional measures to add to the analysis
B. Remove one of the measures
C. Decrease the number of clusters
D. Increase the number of clusters
Answer: C
96.What steps need to be taken to set up a DELTA LIVE TABLES pipeline as a job using the workspace UI?
A. DELTA LIVE TABLES do not support job cluster
B. Select the Workflows UI and the Delta Live Tables tab; under task type select Delta Live Tables
pipeline and select the notebook
C. Select the Workflows UI and the Delta Live Tables tab; under task type select Delta Live Tables
pipeline and select the pipeline JSON file
D. Use the pipeline creation UI, select a new pipeline and a job cluster
Answer: B
Explanation:
The answer is, select the Workflows UI and the Delta Live Tables tab; under task type select Delta
Live Tables pipeline and select the notebook.
Create a pipeline
To create a new pipeline using the Delta Live Tables notebook:
97.INSERT OVERWRITE table_name
98. Reduces data storage complexity, latency, and redundancy
99.Which of the following Python statements can be used to replace the schema name and table
name in the query statement?
A. 1. table_name = "sales"
100.query = f"select * from {schema_name}.{table_name}"
D. 1.table_name = "sales"
101.CREATE TABLE transactions (
102.Which of the following tools provides data access control, access audit, data lineage, and data
discovery?
A. DELTA LIVE Pipelines
B. Unity Catalog
C. Data Governance
D. DELTA lake
E. Lakehouse
Answer: B
103.A warehouse can have more than one cluster; this is called scale-out. If a warehouse is
configured with an X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an
additional cluster if it detects queries waiting in the queue. If a warehouse is configured to run 2
clusters (min 1, max 2) and a user submits 20 queries, 10 queries will start running while the rest are
held in the queue, and Databricks will automatically start the second cluster and begin redirecting the
10 queued queries to it.
104.
105. USING SQLITE
106. .outputMode("complete")
107. END
C. 1.CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
108.A single query will not span more than one cluster; once a query is submitted to a cluster it will
remain in that cluster until the query execution finishes, irrespective of how many clusters are
available to scale.
Please review the below diagram to understand the above concepts:
A SQL endpoint (SQL warehouse) scales horizontally (scale-out) and vertically (scale-up); you have to
understand when to use which.
Scale-out: to add more clusters for a SQL endpoint, change the max number of clusters.
If you are trying to improve throughput, i.e. to be able to run as many queries as possible, then having
additional clusters will improve performance.
Databricks SQL automatically scales as soon as it detects queries in a queuing state; in this example,
scaling is set to min 1 and max 3, which means the warehouse can add up to three clusters if it
detects queries waiting.
During warehouse creation, or afterwards, you have the ability to change the warehouse size
(2X-Small ... to ... 4X-Large) to improve query performance, and to change the maximum scaling
range to add more clusters to a SQL endpoint (SQL warehouse) for scale-out; if you are changing an
existing warehouse, you may have to restart it for the changes to take effect.
How do you know how many clusters you need (how to set the max cluster size)?
When you click on an existing warehouse and select the Monitoring tab, you can see warehouse
utilization information; there are two graphs that provide important information on how the warehouse
is being utilized. If you see queries being queued, your warehouse can benefit from additional
clusters. Please review the additional DBU cost associated with adding clusters so you can make a
well-balanced decision between cost and performance.
109. ELSE (temp - 33) * 5/9
110.option("checkpointLocation", checkpointPath)
111.load('landing')\
112.readStream\
113.The operations team is interested in monitoring the recently launched product. The team wants to
set up an email alert when the number of units sold increases by more than 10,000 units, and they
want to monitor this every 5 minutes.
Fill in the blanks below to finish the steps we need to take:
• Create ___ query that calculates total units sold
• Setup ____ with query on trigger condition Units Sold > 10,000
• Setup ____ to run every 5 mins
• Add destination ______
A. Python, Job, SQL Cluster, email address
B. SQL, Alert, Refresh, email address
C. SQL, Job, SQL Cluster, email address
D. SQL, Job, Refresh, email address
E. Python, Job, Refresh, email address
Answer: B
Explanation:
The answer is SQL, Alert, Refresh, email address.
Here are the steps from the Databricks documentation:
Create an alert
Follow these steps to create an alert on a single column of a query.
114.Directory listing - lists the directory and maintains the state in RocksDB; supports incremental file
listing
115.A cluster comprises one driver node and one or many worker nodes
116. You are working on an email spam filtering assignment. While working on this, you find there is
a new word, e.g. HadoopExam, in an email; in your solution you have never come across this word
before, hence the probability of this word occurring in either class of email could be zero.
So which of the following algorithms can help you avoid zero probability?
A. Naive Bayes
B. Laplace Smoothing
C. Logistic Regression
D. All of the above
Answer: B
Explanation:
Laplace smoothing is a technique for parameter estimation which accounts for unobserved events. It
is more robust and will not fail completely when data that has never been observed in training shows
up.
117. Data quality checks, quarantine corrupt data
Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you
will see varying questions targeting each layer and its purpose.
118.with stock_cte
119.You were asked to write Python code to stop all running streams. Which of the following
commands can be used to get a list of all active streams currently running so we can stop them? Fill
in the blank.
120.Click Settings at the bottom of the sidebar and select SQL Admin Console.
121. .option("checkpointLocation", checkpoint_location) \
.start(target_delta_table_location)
option("cloudFiles.schemaHints", "id int, description string")
# Here we are providing a hint that id column is int and the description is a string
When cloudFiles.schemaLocation is used to store the output of the schema inference during the load
process, schema hints let you enforce data types for known columns ahead of time.
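A minimal Auto Loader sketch putting the two options together; the paths, file format, and target table name are assumptions.
# Infer the schema into schemaLocation, but pin the types of known columns with hints.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
    .option("cloudFiles.schemaHints", "id int, description string")
    .load("/mnt/landing/orders")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)
    .table("orders_bronze"))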
122.Create a schema called bronze using the location '/mnt/delta/bronze', and check if the schema
exists before creating it.
A. CREATE SCHEMA IF NOT EXISTS bronze LOCATION '/mnt/delta/bronze'
B. CREATE SCHEMA bronze IF NOT EXISTS LOCATION '/mnt/delta/bronze'
C. if IS_SCHEMA('bronze'): CREATE SCHEMA bronze LOCATION '/mnt/delta/bronze'
D. Schema creation is not available in metastore, it can only be done in Unity catalog UI
E. Cannot create schema without a database
Answer: A
Explanation:
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
123.A single cluster, irrespective of cluster size (2X-Small ... to ... 4X-Large), can only run 10 queries
at any given time. If a user submits 20 queries all at once to a warehouse with a 3X-Large cluster size
and cluster scaling (min 1, max 1), 10 queries will start running while the remaining 10 queries wait in
a queue for those 10 to finish.
124.},
125.CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ LOCATION schema_directory ]
126.A production cluster has 3 executor nodes and uses the same virtual machine type for the driver
and executor.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused
by code executing on the driver?
A. The five Minute Load Average remains consistent/flat
B. Bytes Received never exceeds 80 million bytes per second
C. Total Disk Space remains constant
D. Network I/O never spikes
E. Overall cluster CPU utilization is around 25%
Answer: D
127.AS SELECT * FROM table_name
D. 1. CREATE OR REPLACE VIEW view_name
128.SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
129. SELECT district,
130.Which of the following two options are supported by Auto Loader for identifying the arrival of new
files and incremental data from cloud object storage?
A. Directory listing, File notification
B. Checkpointing, watermarking
C. Write-ahead logging, read-ahead logging
D. File hashing, Dynamic file lookup
E. Checkpointing and Write ahead logging
Answer: A
Explanation:
The answer is A, directory listing and file notification.
Directory listing: Auto Loader identifies new files by listing the input directory.
File notification: Auto Loader can automatically set up a notification service and queue service that
subscribe to file events from the input directory.
Choosing between file notification and directory listing modes | Databricks on AWS
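A minimal sketch of switching Auto Loader from the default directory listing to file notification mode; the option shown is the standard cloudFiles setting, while the paths are assumptions.
# Use file notification mode instead of listing the input directory.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
    .load("/mnt/landing/events"))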
131.production schema enforced
132. .option("cloudFiles.schemaHints", "id int, description string")
133.SELECT * FROM CUSTOMERS_2020
E. 1. SELECT * FROM CUSTOMERS_2021
134. #Execute code
135. A new data engineer new.engineer@company.com has been assigned to an ELT project. The
new data engineer will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data
engineer?
A. 1. GRANT ALL PRIVILEGES ON TABLE new.engineer@company.com TO sales;
B. 1. GRANT SELECT ON TABLE sales TO new.engineer@company.com;
C. 1. GRANT ALL PRIVILEGES ON TABLE sales TO new.engineer@company.com;
D. 1. GRANT USAGE ON TABLE sales TO new.engineer@company.com;
E. 1. GRANT SELECT CREATE MODIFY ON TABLE sales TO new.engineer@company.com;
Answer: C
136. LOCATION DELTA
Answer: D
Explanation:
Answer is
137. .table("uncleanedSales") )
Answer: A
Explanation:
The answer is
138.load(data_source)
139.Which of the following statements is correct about how Delta Lake implements a lakehouse?
A. Delta lake uses a proprietary format to write data, optimized for cloud storage
B. Using Apache Hadoop on cloud object storage
C. Delta lake always stores meta data in memory vs storage
D. Delta lake uses open source, open format, optimized cloud storage and scalable meta data
E. Delta lake stores data and metadata in the compute's memory
Answer: D
Explanation:
Delta Lake is:
• Open source
• Built on a standard data format
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake is not:
• A proprietary technology
• A storage format
• A storage medium
• A database service or data warehouse
140. In which phase of the data analytics lifecycle do Data Scientists spend the most time in a
project?
A. Discovery
B. Data Preparation
C. Model Building
D. Communicate Results
Answer: B
141.Which statement describes Delta Lake Auto Compaction?
A. An asynchronous job runs after the write completes to detect if files could be further compacted; if
yes, an OPTIMIZE job is executed toward a default of 1 GB.
B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most
recent job.
C. Optimized writes use logical partitions instead of directory partitions; because partition boundaries
are only represented in metadata, fewer small files are written.
D. Data is queued in a messaging bus instead of committing data directly to memory; all data is
committed from the messaging bus in one batch once the job is complete.
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if
yes, an OPTIMIZE job is executed toward a default of 128 MB.
Answer: A
142. A data engineering team is in the process of converting their existing data pipeline to utilize Auto
Loader for incremental processing in the ingestion of JSON files.
One data engineer comes across the following code block in the Auto Loader documentation:
143.transactionId int,
144. A data architect has determined that a table of the following format is necessary:
Which of the following code blocks uses SQL DDL commands to create an empty Delta table in the
above format regardless of whether a table already exists with this name?
A. 1. CREATE OR REPLACE TABLE table_name AS
145.Which of the following Auto Loader Structured Streaming commands successfully performs a hop
from the landing area into bronze?
A. 1.spark\
146.You were asked to identify the number of times a temperature sensor exceeded the threshold
temperature (100.00) for each device. Each row contains 5 readings collected every 5 minutes. Fill in
the blanks with the appropriate functions.
Schema: deviceId INT, deviceTemp ARRAY<double>, dateTimeCollected TIMESTAMP
SELECT deviceId, __ (__ (__(deviceTemp, i -> i > 100.00)))
FROM devices
GROUP BY deviceId
A. SUM, COUNT, SIZE
B. SUM, SIZE, SLICE
C. SUM, SIZE, ARRAY_CONTAINS
D. SUM, SIZE, ARRAY_FILTER
E. SUM, SIZE, FILTER
Answer: E
Explanation:
The FILTER function can be used to filter an array based on an expression.
The SIZE function can be used to get the size of an array.
SUM is used to calculate the total per device.
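A minimal sketch of the completed query using the functions from answer E; the table and column names come from the question.
# Count, per device, how many readings in each row's array exceed 100.00, then total them.
spark.sql("""
  SELECT deviceId,
         SUM(SIZE(FILTER(deviceTemp, i -> i > 100.00))) AS readings_over_threshold
  FROM devices
  GROUP BY deviceId
""")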
147. SELECT * FROM transactions VERSION AS OF 3
148. A data engineer needs to dynamically create a table name string using three Python variables:
region, store, and year.
An example of a table name is below when region = "nyc", store = "100", and year = "2021":
nyc100_sales_2021
Which of the following commands should the data engineer use to construct the table name in
Python?
A. "{region}+{store}+_sales_+{year}"
B. "{region}+{store}+"_sales_"+{year}"
C. f"{region}+{store}+_sales_+{year}"
D. f"{region}{store}_sales_{year}"
E. "{region}{store}_sales_{year}"
Answer: D
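A quick check of the f-string from answer D:
# Build the table name with an f-string.
region, store, year = "nyc", "100", "2021"
table_name = f"{region}{store}_sales_{year}"
print(table_name)   # nyc100_sales_2021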
149.spark\
150.You were asked to create a table that can store the below data. orderTime is a timestamp, but
when the finance team query this data they normally prefer orderTime in date format. You would like
to create a calculated column that converts the orderTime column from timestamp to date and stores
it; fill in the blank to complete the DDL.
A. AS DEFAULT (CAST(orderTime as DATE))
B. GENERATED ALWAYS AS (CAST(orderTime as DATE)) (Correct)
C. GENERATED DEFAULT AS (CAST(orderTime as DATE))
D. AS (CAST(orderTime as DATE))
E. Delta lake does not support calculated columns, value should be inserted into the table as part of
the ingestion process
Answer: B
Explanation:
The answer is, GENERATED ALWAYS AS (CAST(orderTime as DATE)).
https://docs.microsoft.com/en-us/azure/databricks/delta/delta-batch#--use-generated-columns
Delta Lake supports generated columns, which are a special type of column whose values are
automatically generated based on a user-specified function over other columns in the Delta table.
When you write to a table with generated columns and you do not explicitly provide values for them,
Delta Lake automatically computes the values. Note: Databricks also supports partitioning using
generated columns.
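A minimal DDL sketch; the table name and the other columns are assumptions.
# A generated column derived from the orderTime timestamp.
spark.sql("""
  CREATE TABLE IF NOT EXISTS orders (
    orderId INT,
    orderTime TIMESTAMP,
    orderDate DATE GENERATED ALWAYS AS (CAST(orderTime AS DATE)),
    unitsSold INT
  ) USING DELTA
""")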
151.Increasing the warehouse cluster size can improve the performance of a query. For example, if a
query runs in 1 minute on a 2X-Small warehouse, it may run in 30 seconds if we change the
warehouse size to X-Small; this is because 2X-Small has 1 worker node and X-Small has 2 worker
nodes, so the query gets more tasks and runs faster. (Note: this is an idealized example; the
scalability of query performance depends on many factors and is not always linear.)
152.SELECT user_id,
153.The DevOps team has configured a production workload as a collection of notebooks scheduled
to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has
requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing
accidental changes to production code or data?
A. Can Manage
B. Can Edit
C. No permissions
D. Can Read
E. Can Run
Answer: D
154.Your colleague was walking you through how a job was set up, but you noticed a warning
message that said, "Jobs running on an all-purpose cluster are considered all-purpose compute". The
colleague was not sure why he was getting the warning message; how do you best explain this
warning message?
A. All-purpose clusters cannot be used for Job clusters, due to performance issues.
B. All-purpose clusters take longer to start the cluster vs a job cluster
C. All-purpose clusters are less expensive than the job clusters
D. All-purpose clusters are more expensive than the job clusters
E. All-purpose clusters provide interactive messages that cannot be viewed in a job
Answer: D
Explanation:
Warning message:
Pricing for all-purpose clusters is higher than for job clusters; see AWS pricing (Aug 15th, 2022):
155.matched_action
156. Provides efficient storage and querying of full, unprocessed history of data
157.table(raw) (Correct)
C. 1.spark\
158. What is the purpose of a silver layer in Multi hop architecture?
A. Replaces a traditional data lake
B. Efficient storage and querying of full and unprocessed history of data
C. A schema is enforced, with data quality checks.
D. Refined views with aggregated data
E. Optimized query performance for business-critical data
Answer: C
Explanation:
The answer is, A schema is enforced, with data quality checks.
Medallion Architecture - Databricks
Silver Layer:
159."temp":[25,28,49,58,38,25]
160. Raw copy of ingested data
161.A junior developer complains that the code in their notebook isn't producing the correct results in
the development environment. A shared screenshot reveals that while they're using a notebook
versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired
branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
A. Use Repos to make a pull request, then use the Databricks REST API to update the current branch
to dev-2.3.9
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
D. Merge all changes back to the main branch in the remote Git repository and clone the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to
sync with the remote repository
Answer: B