# Spark MLlib Cognitive Class Exam Answers

## Module 1. Spark MLlib Data Types

Question 1. Sparse data generally contains many non-zero values and few zero values.

• True
• False

Question 2. Local matrices are generally stored in distributed systems and rarely on single machines.

• True
• False

Question 3. Which of the following are distributed matrices?

• RowMatrix
• ColumnMatrix
• CoordinateMatrix
• SphericalMatrix
• RowMatrix and CoordinateMatrix
• All of the Above

## Module 2. Review Algorithms

Question 1. Logistic Regression is an algorithm used for predicting numerical values.

• True
• False

Question 2. The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.

• True
• False

Question 3. Which of the following is true about Gaussian Mixture Clustering?

• The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
• The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
• The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
• Gaussian Mixture Clustering uses multiple centroids to cluster data points.
• All of the Above

## Module 3. Spark MLlib Decision Trees and Random Forests

Question 1. Which of the following is a stopping parameter in a Decision Tree?

• The number of nodes in the tree reaches a specific value.
• The depth of the tree reaches a specific value.
• The breadth of the tree reaches a specific value.
• All of the Above

Question 2. When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.

• True
• False
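
As background for the impurity question above: in MLlib, regression trees use variance as the impurity measure, while Gini and entropy apply to classification. The following is a minimal pure-Python sketch of variance impurity and the gain of one split. The helper names `variance_impurity` and `split_gain` are hypothetical and are not part of the Spark API.

```python
def variance_impurity(labels):
    """Mean squared deviation of the labels from their mean."""
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)


def split_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * variance_impurity(left)
        + len(right) / n * variance_impurity(right)
    )
    return variance_impurity(parent) - weighted_children


# Hypothetical regression labels at one tree node.
labels = [1.0, 1.0, 5.0, 5.0]
parent_impurity = variance_impurity(labels)
gain = split_gain(labels, [1.0, 1.0], [5.0, 5.0])
print(parent_impurity, gain)
```

A split that separates the labels into two constant groups removes all of the variance, so the gain equals the parent impurity.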

Question 3. In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.

• True
• False

## Module 4. Spark MLlib Clustering

Question 1. In Spark MLlib, the initialization mode for the K-Means training method is called

• k-means--
• k-means++
• k-means||
• k-means

Question 2. In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.

• True
• False

Question 3. In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.

• True
• False

## Spark MLlib Cognitive Class Final Exam Answers

Question 1. In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.

• True
• False
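
To illustrate the idea behind soft membership, here is a minimal one-dimensional sketch in plain Python. It is not the MLlib implementation; the function names and the mixture parameters are made up for illustration. It computes a membership value for every Gaussian in the mixture, and both the mixture weights and the memberships sum to 1.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_memberships(x, weights, means, stds):
    """Posterior probability of each Gaussian given the point x."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical mixture with four components; the weights sum to 1.
weights = [0.5, 0.3, 0.1, 0.1]
means = [0.0, 4.0, 8.0, 12.0]
stds = [1.0, 1.0, 2.0, 1.5]

memberships = soft_memberships(3.0, weights, means, stds)
print(len(memberships), sum(memberships))
```

The result has one entry per Gaussian, not just the few largest ones, which is the behavior the question is probing.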

Question 2. In Decision Trees, what is true about the size of a dataset?

• Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
• Large datasets sort feature values, then use the ordered values as split calculations.
• Small datasets create split candidates based on quantile calculations on a sample of the data.
• Small datasets split on random values for the feature.

Question 3. A Logistic Regression algorithm is ineffective as a binary response predictor.

• True
• False

Question 4. What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]?

• [1, 6]
• [0, 2, 3, 6]
• [0, 2, 3, 5]
• [2, 3]
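
One way to work through the question above, assuming the `|` separators mark the boundaries between groups of stored entries: the pointer array starts at 0 and each following entry adds the number of indices in the next group. The helper name `build_pointers` is hypothetical.

```python
def build_pointers(groups):
    """Running count of stored entries, one boundary per group."""
    pointers = [0]
    for group in groups:
        pointers.append(pointers[-1] + len(group))
    return pointers


# The indices from the question, split at each "|".
groups = [[5, 1], [6], [2, 8, 10]]
print(build_pointers(groups))  # [0, 2, 3, 6]
```

The last pointer always equals the total number of stored entries, here 6.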

Question 5. For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.

• True
• False

Question 6. In a Decision Tree, choosing a very large maxDepth value can:

• Increase accuracy
• Increase the risk of overfitting to the training set
• Increase the cost of training
• All of the Above
• Increase the risk of overfitting and increase the cost of training

Question 7. In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.

• True
• False

Question 8. Increasing the value of epsilon when creating the K-Means Clustering model can:

• Decrease training cost and decrease the number of iterations that the model undergoes
• Decrease training cost and increase the number of iterations that the model undergoes
• Increase training cost and decrease the number of iterations that the model undergoes
• Increase training cost and increase the number of iterations that the model undergoes
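
The effect of epsilon can be sketched with a toy one-dimensional K-Means in plain Python. This is an illustrative stand-in, not the MLlib algorithm; the function `kmeans_1d`, the data, and the starting centers are all assumptions. Training stops as soon as no center moves by more than `epsilon`, so a looser threshold can never need more iterations than a tighter one on the same data and starting centers.

```python
def kmeans_1d(points, centers, epsilon, max_iterations=100):
    """Toy 1-D K-Means; returns the final centers and iterations used."""
    centers = list(centers)
    for iteration in range(1, max_iterations + 1):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its old center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:  # converged under the chosen threshold
            return centers, iteration
    return centers, max_iterations


points = [1.0, 1.5, 2.0, 5.0, 8.0, 9.0, 10.0]
start = [0.0, 4.0]

_, tight_iterations = kmeans_1d(points, start, epsilon=1e-6)
_, loose_iterations = kmeans_1d(points, start, epsilon=2.0)
print(tight_iterations, loose_iterations)
```

Fewer iterations means less training work, at the price of centers that may be less precisely placed.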

Question 9. In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)

• Python List
• Textfile
• CSV file
• RDD

Question 10. What is true about Dense and Sparse Vectors?

• A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
• A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
• A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
• A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.
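
For context on the two vector types: a dense vector stores every value, while a sparse vector stores the vector size plus parallel lists of the non-zero indices and their values. The sketch below shows that layout in plain Python. It does not use the pyspark API, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(dense):
    """Return (size, indices, values) holding only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense_vector = [1.0, 0.0, 0.0, 3.0]
sparse_vector = to_sparse(dense_vector)
print(sparse_vector)  # (4, [0, 3], [1.0, 3.0])
```

The sparse form pays off when most entries are zero, since only the non-zero positions are stored.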

Question 11. In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.

• True
• False

Question 12. In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.

• True
• False

Question 13. What is true about Labeled Points?

• A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
• B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
• C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
• D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
• All of the Above
• A and C only

Question 14. In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.

• True
• False

Question 15. In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.

• True
• False

Question 16. What is true about the maxDepth parameter for Random Forests?

• A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
• A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
• A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
• A large maxDepth value is preferred since tree averaging yields an increase in overall variance.