Course Name: Spark MLlib
Module 1. Spark MLlib Data Types
Question 1. Sparse data generally contains many non-zero values and few zero values.
- True
- False
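For reference, a minimal PySpark sketch (with made-up values) of what "sparse" means in MLlib: a sparse vector stores only the few non-zero entries of an otherwise zero-filled vector.

```python
from pyspark.mllib.linalg import Vectors

# A length-8 vector where only positions 1 and 6 are non-zero;
# the sparse form stores just those (index, value) pairs, not the zeros.
sv = Vectors.sparse(8, [1, 6], [3.0, 5.0])
print(sv)            # (8,[1,6],[3.0,5.0])
print(sv.toArray())  # [0. 3. 0. 0. 0. 0. 5. 0.]
```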
Question 2. Local matrices are generally stored in distributed systems and rarely on single machines.
- True
- False
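As a point of reference, a short sketch (toy values) of MLlib's local matrices, which are plain objects held on a single machine rather than distributed structures:

```python
from pyspark.mllib.linalg import Matrices

# A 3x2 dense local matrix, stored in column-major order on one machine (the driver).
dm = Matrices.dense(3, 2, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# A 3x2 sparse local matrix in CSC form: (numRows, numCols, colPtrs, rowIndices, values).
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0])
print(dm)
print(sm)
```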
Question 3. Which of the following are distributed matrices?
- RowMatrix
- ColumnMatrix
- CoordinateMatrix
- SphericalMatrix
- RowMatrix and CoordinateMatrix
- All of the Above
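A minimal sketch (toy data, assuming a local SparkContext is available) of two of MLlib's distributed matrix types, RowMatrix and CoordinateMatrix, both of which are backed by RDDs:

```python
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

sc = SparkContext.getOrCreate()

# RowMatrix: an RDD of local vectors, one per row, spread across the cluster.
rows = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
row_mat = RowMatrix(rows)

# CoordinateMatrix: an RDD of (row, column, value) entries, handy for very sparse data.
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 2, 3.4)])
coord_mat = CoordinateMatrix(entries)

print(row_mat.numRows(), row_mat.numCols())      # 2 3
print(coord_mat.numRows(), coord_mat.numCols())  # 2 3
```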
Module 2. Review Algorithms
Question 1. Logistic Regression is an algorithm used for predicting numerical values.
- True
- False
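For context, a minimal sketch (toy data) of MLlib's logistic regression: the model predicts a class label such as 0.0 or 1.0 rather than a continuous number.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# Toy binary-labeled data: each label is a class (0.0 or 1.0), not a numeric quantity.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 1.0]),
    LabeledPoint(1.0, [2.0, 0.0]),
    LabeledPoint(1.0, [2.5, 0.0]),
])
model = LogisticRegressionWithLBFGS.train(data, iterations=10)
print(model.predict([2.2, 0.0]))  # outputs a class label such as 1
```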
Question 2. The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.
- True
- False
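A quick sketch (toy data, local SparkContext assumed) of MLlib's linear SVM, which fits a separating hyperplane between two groups of labeled points:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# Two small clusters of labeled points; the SVM fits a hyperplane between them.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 0.1]),
    LabeledPoint(0.0, [0.2, 0.0]),
    LabeledPoint(1.0, [3.0, 2.9]),
    LabeledPoint(1.0, [2.8, 3.1]),
])
svm = SVMWithSGD.train(data, iterations=100)
print(svm.predict([3.0, 3.0]))  # expected: 1
```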
Question 3. Which of the following is true about Gaussian Mixture Clustering?
- The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
- The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
- The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
- Gaussian Mixture Clustering uses multiple centroids to cluster data points.
- All of the Above
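As a reference, a minimal sketch (made-up 1-D data) of Gaussian Mixture Clustering in MLlib: each cluster is described by a Gaussian with a mean (its centroid) and a covariance, and membership is probabilistic.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext.getOrCreate()

# Toy 1-D data drawn around two rough centers.
points = sc.parallelize([[-5.0], [-4.8], [-5.2], [4.9], [5.1], [5.0]])
gmm = GaussianMixture.train(points, k=2, seed=42)

# Each fitted component is a Gaussian: a mean (centroid) plus a covariance.
for g in gmm.gaussians:
    print(g.mu, g.sigma)
```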
Module 3. Spark MLlib Decision Trees and Random Forests
Question 1. Which of the following is a stopping parameter in a Decision Tree?
- The number of nodes in the tree reaches a specific value.
- The depth of the tree reaches a specific value.
- The breadth of the tree reaches a specific value.
- All of the Above
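For reference, a small sketch (toy data) of training an MLlib decision tree; maxDepth is the stopping parameter that caps how deep the tree may grow.

```python
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

data = sc.parallelize([
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0]),
])
# Training stops splitting a branch once the tree reaches maxDepth.
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=3, maxBins=32)
print(model.depth())
```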
Question 2. When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.
- True
- False
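A brief sketch (toy data) of a regression tree in MLlib; for regression trees the impurity measure is 'variance', while 'gini' and 'entropy' are used for classification trees.

```python
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

data = sc.parallelize([LabeledPoint(float(x), [float(x)]) for x in range(8)])
# Regression trees measure impurity with 'variance'.
model = DecisionTree.trainRegressor(data, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=2, maxBins=32)
print(model.predict([5.0]))
```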
Question 3. In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.
- True
- False
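For context, a minimal sketch (toy data) of an MLlib Random Forest; featureSubsetStrategy controls how many features each tree considers at a split, i.e. it is a tunable knob rather than a rule for when growth stops.

```python
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [1.0, 1.0]),
    LabeledPoint(1.0, [2.0, 0.0]),
    LabeledPoint(1.0, [3.0, 0.0]),
])
# featureSubsetStrategy can be "auto", "all", "sqrt", "log2", or "onethird".
model = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32, seed=42)
print(model.numTrees())
```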
Module 4. Spark MLlib Clustering
Question 1. In Spark MLlib, the initialization mode for the K-Means training method is called
- k-means–
- k-means++
- k-means||
- k-means
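A minimal sketch (toy data) showing where the initialization mode is passed when training K-Means in MLlib; the parallel variant of k-means++ that MLlib uses by default is spelled "k-means||".

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext.getOrCreate()

points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])
# initializationMode chooses how the starting centroids are picked.
model = KMeans.train(points, k=2, maxIterations=10, initializationMode="k-means||")
print(model.clusterCenters)
```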
Question 2. In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.
- True
- False
Question 3. In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.
- True
- False
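A small sketch (toy data) inspecting the mixture weights of a fitted Gaussian Mixture model; each weight is that Gaussian's share of the data, so together they sum to 1.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext.getOrCreate()

points = sc.parallelize([[-5.0], [-5.1], [0.0], [0.1], [5.0], [5.2]])
gmm = GaussianMixture.train(points, k=3, seed=7)
print(gmm.weights, sum(gmm.weights))  # the weights add up to 1
```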
Spark MLlib Cognitive Class Final Exam Answers:
Question 1. In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.
- True
- False
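For reference, a minimal sketch (toy data) of predictSoft, which returns one membership value per Gaussian in the mixture for every point, not just values for a fixed number of top components.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext.getOrCreate()

points = sc.parallelize([[-5.0], [-4.9], [5.0], [5.1]])
gmm = GaussianMixture.train(points, k=2, seed=1)

# predictSoft yields an array of k membership values for each point in the RDD.
memberships = gmm.predictSoft(points).collect()
print(len(memberships[0]))  # equals k
```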
Question 2. In Decision Trees, what is true about the size of a dataset?
- Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
- Large datasets sort feature values, then use the ordered values as split calculations.
- Small datasets create split candidates based on quantile calculations on a sample of the data.
- Small datasets split on random values for the feature.
Question 3. A Logistic Regression algorithm is ineffective as a binary response predictor.
- True
- False
Question 4. What is the Row Pointer for a matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]?
- [1, 6]
- [0, 2, 3, 6]
- [0, 2, 3, 5]
- [2, 3]
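One way to read this (a hypothetical worked example, assuming a CSR-style layout): the row pointer records the running count of stored entries at each row boundary, starting from 0.

```python
# Entries grouped per row, taken from [5, 1 | 6 | 2, 8, 10]: 2, 1 and 3 entries per row.
rows = [[5, 1], [6], [2, 8, 10]]
row_pointer = [0]
for r in rows:
    row_pointer.append(row_pointer[-1] + len(r))
print(row_pointer)  # [0, 2, 3, 6]
```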
Question 5. For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.
- True
- False
Question 6. In a Decision Tree, choosing a very large maxDepth value can:
- Increase accuracy
- Increase the risk of overfitting to the training set
- Increase the cost of training
- All of the Above
- Increase the risk of overfitting and increase the cost of training
Question 7. In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.
- True
- False
Question 8. Increasing the value of epsilon when creating the K-Means Clustering model can:
- Decrease training cost and decrease the number of iterations that the model undergoes
- Decrease training cost and increase the number of iterations that the model undergoes
- Increase training cost and decrease the number of iterations that the model undergoes
- Increase training cost and increase the number of iterations that the model undergoes
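For context, a minimal sketch (toy data) showing where epsilon is set when training K-Means; it is the convergence tolerance, and iteration stops once no centroid moves farther than epsilon.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext.getOrCreate()

points = sc.parallelize([[0.0], [0.2], [8.0], [8.3]])
# A larger epsilon means the "centroids stopped moving" test passes sooner,
# so the model can converge in fewer iterations.
model = KMeans.train(points, k=2, maxIterations=100, epsilon=1e-2)
print(model.clusterCenters)
```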
Question 9. In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)
- Python List
- Textfile
- CSV file
- RDD
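A short sketch (toy data) of the usual pattern: local data is turned into an RDD with sc.parallelize before it is handed to an MLlib train() method.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# A plain Python list must first become an RDD before MLlib can train on it.
local_data = [LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])]
rdd = sc.parallelize(local_data)
model = LogisticRegressionWithLBFGS.train(rdd, iterations=5)
```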
Question 10. What is true about Dense and Sparse Vectors?
- A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
- A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
- A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
- A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.
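For reference, a minimal sketch (toy values) of the two constructions the options describe: a dense vector built from a plain Python list, and a sparse vector represented as a single-column SciPy csc_matrix.

```python
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Dense vector from a plain Python list (a NumPy array works too).
dv = Vectors.dense([1.0, 0.0, 3.0])

# Sparse vector as a single-column SciPy csc_matrix: (values, row indices, column pointers).
sv = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))

print(dv)
print(sv.toarray().ravel())
```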
Question 11. In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.
- True
- False
Question 12. In classification models, the value for the numClasses parameter does not depend on the data and can be changed to increase model accuracy.
- True
- False
Question 13. What is true about Labeled Points?
- A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
- B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
- C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
- D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
- All of the Above
- A and C only
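A small sketch (toy values) of labeled points, which pair a label used in supervised learning with a feature vector that can be either dense or sparse:

```python
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Labeled point with a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Labeled point with a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

print(pos, neg)
```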
Question 14. In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in K-Means clustering.
- True
- False
Question 15. In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.
- True
- False
Question 16. What is true about the maxDepth parameter for Random Forests?
- A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
- A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
- A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
- A large maxDepth value is preferred since tree averaging yields an increase in overall variance.