
Question # 4

Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?

A.

transactionsDf.withColumnRenamed("productId", "productNumber")

B.

transactionsDf.withColumn("productId", "productNumber")

C.

transactionsDf.withColumnRenamed("productNumber", "productId")

D.

transactionsDf.withColumnRenamed(col(productId), col(productNumber))

E.

transactionsDf.withColumnRenamed(productId, productNumber)
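For reference, a minimal PySpark sketch of a rename (the two-column DataFrame here is a made-up stand-in, not the exam's transactionsDf): withColumnRenamed returns a new DataFrame and leaves the original untouched.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in DataFrame
df = spark.createDataFrame([(1, 3.5), (2, 6.0)], ["productId", "predError"])

# Returns a copy with the column renamed; df itself is unchanged
renamed = df.withColumnRenamed("productId", "productNumber")
renamed.printSchema()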

Question # 5

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

A.

1. select

2. "storeId"

3. print_schema()

B.

1. limit

2. 1

3. columns

C.

1. select

2. "storeId"

3. printSchema()

D.

1. limit

2. "storeId"

3. printSchema()

E.

1. select

2. storeId

3. dtypes
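For reference, a minimal sketch of inspecting a single column's data type (assuming a DataFrame df with a storeId column): printSchema is camelCase and prints the schema tree, while dtypes is an attribute that returns (name, type) pairs.

# Prints the schema of just the projected column
df.select("storeId").printSchema()

# dtypes needs no parentheses; returns e.g. [('storeId', 'bigint')]
print(df.select("storeId").dtypes)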

Question # 6

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

A.

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

B.

itemsDf.sample(fraction=0.1, seed=87238)

C.

itemsDf.sample(fraction=1000, seed=98263)

D.

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

E.

itemsDf.sample(fraction=0.1)
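For context, a minimal sampling sketch (assuming a 10,000-row itemsDf): sample draws rows without replacement by default, so there are no duplicates, and fixing the seed makes repeated runs over the same data return the same rows; the row count is only approximately fraction times the total.

# ~10% of rows, no replacement, reproducible thanks to the fixed seed
sampled = itemsDf.sample(fraction=0.1, seed=87238)
print(sampled.count())   # roughly 1,000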

Question # 7

Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?

A.

transactionsDf.withColumn("predErrorSqrt", sqrt(predError))

B.

transactionsDf.select(sqrt(predError))

C.

transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())

D.

transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))

E.

transactionsDf.select(sqrt("predError"))
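For reference, a hedged sketch of adding a derived column (names as in the question): withColumn keeps all existing columns, and col("predError") resolves the column by name, whereas a bare predError would only work if a Python variable of that name happened to exist.

from pyspark.sql.functions import sqrt, col

withSqrt = transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))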

Question # 8

The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.

Schema:

root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

Code block:

schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

A.

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

B.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

C.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

D.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

E.

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
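For comparison, here is one way the intended read could be written without the pitfalls the options describe (a minimal sketch, assuming filePath is defined; not presented as the official solution): schema fields are declared with StructField, a single key/value pair goes through option (options expects keyword arguments), and calling .parquet(...) tells Spark the file format.

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               ArrayType, StringType)

schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("modifiedBefore", "2029-03-20T05:44:46")  # file-source option, Spark 3.1+
      .parquet(filePath))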

Question # 9

The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

__1__.__2__.__3__(__4__, 100)

A.

1. spark

2. conf

3. set

4. "spark.sql.shuffle.partitions"

B.

1. pyspark

2. config

3. set

4. spark.shuffle.partitions

C.

1. spark

2. conf

3. get

4. "spark.sql.shuffle.partitions"

D.

1. pyspark

2. config

3. set

4. "spark.sql.shuffle.partitions"

E.

1. spark

2. conf

3. set

4. "spark.sql.aggregate.partitions"

Question # 10

Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?

A.

transactionsDf.sort(asc_nulls_last("predError"))

B.

transactionsDf.orderBy("predError").desc_nulls_last()

C.

transactionsDf.sort("predError", ascending=False)

D.

transactionsDf.desc_nulls_last("predError")

E.

transactionsDf.orderBy("predError").asc_nulls_last()

Question # 11

Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

A.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

B.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

C.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

D.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

E.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

Question # 12

Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?

A.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

B.

itemsDf.withColumnRenamed("attributes", "feature0")

itemsDf.withColumnRenamed("supplier", "feature1")

C.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D.

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
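A brief sketch of why chaining matters here: each withColumnRenamed call returns a new DataFrame rather than mutating itemsDf, so two stand-alone statements would each discard their result.

# Chain the calls so the second rename operates on the first's output
renamed = (itemsDf
           .withColumnRenamed("attributes", "feature0")
           .withColumnRenamed("supplier", "feature1"))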

Question # 13

Which of the following is not a feature of Adaptive Query Execution?

A.

Replace a sort merge join with a broadcast join, where appropriate.

B.

Coalesce partitions to accelerate data processing.

C.

Split skewed partitions into smaller partitions to avoid differences in partition processing time.

D.

Reroute a query in case of an executor failure.

E.

Collect runtime statistics during query execution.
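For context, Adaptive Query Execution is controlled through runtime configuration; a minimal sketch of the relevant Spark 3.x properties:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # master switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # partition coalescing
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # skewed-partition splitting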

Question # 14

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value for each value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.max('value').min('value')

B.

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

C.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

D.

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

E.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Question # 15

Which of the following statements about stages is correct?

A.

Different stages in a job may be executed in parallel.

B.

Stages consist of one or more jobs.

C.

Stages ephemerally store transactions, before they are committed through actions.

D.

Tasks in a stage may be executed by multiple machines at the same time.

E.

Stages may contain multiple actions, narrow, and wide transformations.

Question # 16

Which of the following code blocks reads JSON file imports.json into a DataFrame?

A.

spark.read().mode("json").path("/FileStore/imports.json")

B.

spark.read.format("json").path("/FileStore/imports.json")

C.

spark.read("json", "/FileStore/imports.json")

D.

spark.read.json("/FileStore/imports.json")

E.

spark.read().json("/FileStore/imports.json")
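For context, spark.read is a property (no parentheses) that returns a DataFrameReader; a sketch of the shorthand and the equivalent long form:

df = spark.read.json("/FileStore/imports.json")

# Equivalent long form: format() plus load(), not path()
df = spark.read.format("json").load("/FileStore/imports.json")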

Question # 17

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
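For reference, a minimal sketch of filtering and materializing a bounded number of rows: take(n) returns up to n rows as a Python list, while collect() accepts no row-limit argument.

from pyspark.sql.functions import col

rows = transactionsDf.filter(col("storeId") == 25).take(5)
print(type(rows))   # <class 'list'> of Row objects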

Question # 18

Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?

A.

transactionsDf.repartition(24, boost=True)

B.

transactionsDf.repartition()

C.

transactionsDf.repartition("itemId", 24)

D.

transactionsDf.coalesce(24)

E.

transactionsDf.repartition(24)
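For context, a sketch contrasting the two partitioning operators: coalesce can only reduce the number of partitions, so growing from 12 to 24 requires repartition and its full shuffle.

grown = transactionsDf.repartition(24)   # shuffles; can increase the partition count
shrunk = transactionsDf.coalesce(24)     # no-op here: cannot grow beyond the current 12
print(grown.rdd.getNumPartitions())      # 24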

Question # 19

Which of the following statements about storage levels is incorrect?

A.

The cache operator on DataFrames is evaluated like a transformation.

B.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

C.

Caching can be undone using the DataFrame.unpersist() operator.

D.

MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

E.

DISK_ONLY will not use the worker node's memory.
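For reference, a minimal sketch of working with storage levels (the _2 suffix requests two replicas):

from pyspark import StorageLevel

# Like cache(), persist() is lazy: nothing is stored until an action runs
df.persist(StorageLevel.MEMORY_ONLY_2)
df.count()        # action triggers the actual caching
df.unpersist()    # undoes the caching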

Question # 20

Which of the following code blocks returns a single row from DataFrame transactionsDf?

Full DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.where(col("storeId").between(3,25))

B.

transactionsDf.filter((col("storeId")!=25) | (col("productId")==2))

C.

transactionsDf.filter(col("storeId")==25).select("predError","storeId").distinct()

D.

transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25")

E.

transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct()

Question # 21

Which of the following describes a narrow transformation?

A.

A narrow transformation is an operation in which data is exchanged across partitions.

B.

A narrow transformation is a process in which data from multiple RDDs is used.

C.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.

D.

A narrow transformation is an operation in which data is exchanged across the cluster.

E.

A narrow transformation is an operation in which no data is exchanged across the cluster.
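To make the distinction concrete, a sketch assuming the transactionsDf used elsewhere in this set: narrow transformations compute each output partition from a single input partition, while wide transformations shuffle data across the cluster.

from pyspark.sql.functions import col

narrow = transactionsDf.filter(col("value") > 0)   # narrow: no data exchanged
wide = transactionsDf.groupBy("storeId").count()   # wide: requires a shuffle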

Question # 22

Which of the following statements about executors is correct?

A.

Executors are launched by the driver.

B.

Executors stop upon application completion by default.

C.

Each node hosts a single executor.

D.

Executors store data in memory only.

E.

An executor can serve multiple applications.

Question # 23

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively. Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

A.

Instead of calling spark.createDataFrame, just DataFrame should be called.

B.

The commas in the tuples with the colors should be eliminated.

C.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

D.

Instead of color, a data type should be specified.

E.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Question # 24

Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?

A.

transactionsDf.sort(asc(value)).show(10)

B.

transactionsDf.sort(col("value")).show(10)

C.

transactionsDf.sort(col("value").desc()).head()

D.

transactionsDf.sort(col("value").asc()).print(10)

E.

transactionsDf.orderBy("value").asc().show(10)

Question # 25

Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?

A.

transactionsDf.dropna("any")

B.

transactionsDf.dropna(thresh=4)

C.

transactionsDf.drop.na("",2)

D.

transactionsDf.dropna(thresh=2)

E.

transactionsDf.dropna("",4)

Question # 26

Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

A.

transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

B.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")

C.

transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")

D.

transactionsDf \
    .drop(col('value'), col('storeId')) \
    .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

E.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
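For reference, one idiomatic DataFrame-API way to express such a join and column exclusion (a sketch, assuming both DataFrames exist):

joined = (transactionsDf
          .drop("value", "storeId")              # drop takes column-name strings
          .join(itemsDf.drop("attributes"),
                transactionsDf.productId == itemsDf.itemId,   # a Column expression, not a string
                "inner"))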

Question # 27

Which of the following is a characteristic of the cluster manager?

A.

Each cluster manager works on a single partition of data.

B.

The cluster manager receives input from the driver through the SparkContext.

C.

The cluster manager does not exist in standalone mode.

D.

The cluster manager transforms jobs into DAGs.

E.

In client mode, the cluster manager runs on the edge node.
