Course Name: Spark Overview for Scala Analytics
Spark Overview for Scala Analytics (Cognitive Class) Final Exam Answers:
Question 1. Which language is not supported by Spark?
- SQL
- Scala
- Java
- C
- Python
Question 2. What does RDD stand for?
- REPL Definition and Description
- Resilient Distributed Dataset
- Reader Distribution Defined
- Resilient Documented DataFrame
- Read, Distribute, Delete
Question 3. The Spark Web Console is used to:
- Edit Spark code
- Integrate Spark with third-party tools
- Examine data produced by Spark jobs
- Monitor running Spark jobs
- Submit Spark jobs
Question 4. The RDD flatMap method does what?
- Transforms each input record to zero or more output records
- Transforms each input record to a new output record
- Combines all records into a value
- Reads a data source
- None of the above
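For reference, flatMap emits zero or more output records per input record, while map emits exactly one. A minimal sketch, assuming a live SparkContext named sc:

```scala
// flatMap: each input line yields zero or more words.
val lines = sc.parallelize(Seq("hello spark", "", "scala analytics"))
val words = lines.flatMap(line => line.split("\\s+").filter(_.nonEmpty))
words.collect() // Array(hello, spark, scala, analytics)

// map, by contrast, yields exactly one record per input record.
val lengths = lines.map(_.length) // Array(11, 0, 15)
```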
Question 5. Shuffling is used to:
- Move data between stages
- Decide where partitions are written to disk
- Move tasks to the appropriate nodes in a cluster
- Sort data when that’s requested
- All of the above
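A shuffle moves data between stages; any wide operation such as reduceByKey forces one. A small sketch, assuming a SparkContext sc:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey repartitions records by key, shuffling data across
// the stage boundary before summing the values for each key.
val counts = pairs.reduceByKey(_ + _)
counts.collect() // Array((a,2), (b,1))
```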
Question 6. Transformation methods have one or more of the following characteristics:
- One and only one record is output for each input record
- Lazy (delayed) evaluation
- Their results are cached in memory
- Eager (immediate) evaluation
- None of the above
Question 7. Action methods have one or more of the following characteristics:
- Eager (immediate) evaluation
- Return a new RDD
- Do not support type inference
- Must be the first methods in a sequence of methods
- All of the above
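To make the distinction in Questions 6 and 7 concrete: transformations are lazily evaluated and each returns a new RDD, while actions are eager and trigger the actual computation. A sketch, assuming a SparkContext sc:

```scala
val numbers = sc.parallelize(1 to 1000000)

// Transformations: lazy; nothing runs yet, each call returns a new RDD.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Action: eager; this call actually runs the job and returns a value.
val total = evens.count()
```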
Question 8. The sequence of transformation and action method calls:
- Is run in parallel for each data partition
- Starts with some data and returns or outputs new data
- Is decomposed into stages
- Forms a directed, acyclic graph
- All of the above
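The chain of transformations forms a directed acyclic graph that Spark decomposes into stages at shuffle boundaries. You can inspect the lineage with toDebugString, as in this sketch (the input path is hypothetical, and sc is an existing SparkContext):

```scala
val wordCounts = sc.textFile("README.md")   // hypothetical input file
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // shuffle: a new stage begins here

// Prints the DAG lineage; the indentation marks the stage boundaries.
println(wordCounts.toDebugString)
```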
Question 9. The Inverted Index computes what?
- The records sorted descending by a key
- The minimum, maximum, and average counts for words in the corpus
- Output records with words as keys and document ids and counts as values
- A table of contents for a corpus of documents
- All of the above
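An inverted index maps each word to the documents it appears in, with counts. A minimal sketch over (docId, text) pairs, assuming a SparkContext sc:

```scala
val docs = sc.parallelize(Seq(
  ("doc1", "spark scala spark"),
  ("doc2", "scala analytics")
))

val inverted = docs
  .flatMap { case (id, text) =>
    text.split("\\s+").map(word => ((word, id), 1))
  }
  .reduceByKey(_ + _)                           // count per (word, docId)
  .map { case ((word, id), n) => (word, (id, n)) }
  .groupByKey()                                 // word -> all (docId, count)

inverted.collect().foreach(println)
// roughly: (spark, [(doc1,2)]), (scala, [(doc1,1), (doc2,1)]), ...
```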
Question 10. Broadcast variables are used for what?
- To send all RDD data to the tasks
- Print messages to the Spark web console
- Send metrics to a monitoring tool
- Share read-only data with all tasks in an efficient way
- None of the above
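Broadcast variables ship read-only data to each executor once, rather than with every task. A sketch, assuming a SparkContext sc:

```scala
// A small lookup table shared read-only with all tasks.
val countryNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))

val codes = sc.parallelize(Seq("US", "FR", "US"))
val names = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
names.collect() // Array(United States, France, United States)
```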
Question 11. Accumulators are used for what?
- Collect the results of the Spark job
- Aggregate extra data across all tasks
- Manage streams in Spark Streaming
- Send metrics to a monitoring tool
- All of the above
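Accumulators aggregate extra data, such as counters or metrics, across all tasks as a side channel. A sketch using Spark 2.x's longAccumulator (Spark 1.x used sc.accumulator instead), assuming a SparkContext sc:

```scala
val badRecords = sc.longAccumulator("bad records")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()             // an action forces evaluation
println(badRecords.value)  // 1
```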
Question 12. DataFrames have one or more of the following characteristics:
- Support for SQL queries
- Handle data when its structure is known and consistent
- Excellent runtime performance
- Support HIVE integration
- All of the above
Question 13. DataFrames support the following operations:
- Non-equi joins
- Reduce
- Delete
- Group by
- All of the above
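To illustrate Questions 12 and 13 together: DataFrames handle data with a known, consistent structure and support SQL-style operations such as group by and joins (including non-equi joins). A sketch, assuming a Spark 2.x SparkSession named spark; the Spark 1.x releases this course targets offered the same operations via SQLContext:

```scala
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 21), ("carol", 34)).toDF("name", "age")

// Group by: count people per age.
people.groupBy($"age").count().show()

// Joins need not be equi-joins; any boolean join expression works.
val limits = Seq(("adult", 21)).toDF("label", "minAge")
people.join(limits, people("age") > limits("minAge")).show()
```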
Question 14. If I have a DataFrame "person" with a field "age", which of the following expressions can never be used to reference that field?
- "age"
- $"age"
- person($"age")
- person("age")
- All of the above are valid
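The column-reference syntaxes from Question 14, shown side by side. Assuming a SparkSession spark with implicits imported and an illustrative DataFrame person:

```scala
import spark.implicits._
val person = Seq(("alice", 34), ("bob", 21)).toDF("name", "age")

person.select("age")          // plain string column name
person.select($"age")         // Column via the $ interpolator
person.select(person("age"))  // Column looked up on the DataFrame
// person($"age") does not compile: apply takes a String, not a Column.
```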
Question 15. If I want to write a SQL query over a DataFrame, I have to call the following method first:
- persist
- write
- registerTempTable
- map
- None of the above
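A DataFrame must be registered as a table before SQL can be run over it. A sketch, assuming a Spark 1.x SQLContext named sqlContext and the people DataFrame from the earlier sketch; in Spark 2.x, createOrReplaceTempView supersedes registerTempTable:

```scala
people.registerTempTable("people")           // Spark 1.x, as in the course
// people.createOrReplaceTempView("people")  // Spark 2.x equivalent

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()
```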
Question 16. Which one of the following kinds of joins is not supported?
- Left outer join
- Inner join
- Right outer join
- Left semijoin
- All are supported
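All four join types in Question 16 are supported. A sketch using the join-expression form, assuming two DataFrames left and right that share an id column:

```scala
val keyMatch = left("id") === right("id")

left.join(right, keyMatch, "inner")
left.join(right, keyMatch, "left_outer")
left.join(right, keyMatch, "right_outer")
left.join(right, keyMatch, "leftsemi")   // left semijoin
```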
Question 17. The DataFrame expression persons.select($"age").where($"age" > 21) returns:
- A DataFrame
- A Scala Vector[Int]
- A ResultSet
- An RDD
- None of the above
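Both select and where are transformations that return another DataFrame, so the expression in Question 17 can be chained further or executed with an action. A sketch, assuming a DataFrame named persons with a numeric age column and implicits in scope:

```scala
val adults = persons.select($"age").where($"age" > 21)  // still a DataFrame

adults.printSchema()  // the schema is known without running the query
adults.show()         // an action finally executes it
```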
Question 18. In Hive, an external table has the property:
- It’s data is not managed by Hive
- It’s format is defined elsewhere
- It’s schema is defined elsewhere
- It is visible to all users of Hive
- All of the above
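With an external table, Hive tracks only the metadata; the data lives at a location Hive does not manage, so dropping the table leaves the files in place. A HiveQL sketch issued through a HiveContext (the table name and path are illustrative):

```scala
hiveContext.sql("""
  CREATE EXTERNAL TABLE logs (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/logs'  -- the data stays here even if the table is dropped
""")
```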
Question 19. In Spark Streaming, a DStream is:
- A fixed-sized batch of incoming data
- A connector to a socket
- A sequence of RDDs
- A collection of DataFrames
- None of the above
Question 20. The batch interval:
- starts at a user-specified value and adjusts in response to load
- is the number of events to capture per batch
- is the size of each data “chunk” returned by a DataFrame query
- is determined dynamically by Spark
- is the number of seconds to capture data per batch
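Questions 19 and 20 together: a DStream is a sequence of RDDs, one per batch, and the batch interval is the fixed, user-specified number of seconds of data captured per batch. A sketch, assuming an existing SparkConf named conf and a local socket source:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval: capture 5 seconds of data per batch (fixed, user-chosen).
val ssc = new StreamingContext(conf, Seconds(5))

// A DStream: a sequence of RDDs, one RDD per 5-second batch.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```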