## Enroll Now: **Data Science with Scala (Cognitive Class)**

**Course: Data Science with Scala**

**Module 1: Basic Statistics and Data Types**

**Question 1: You import MLlib’s vectors from?**

- org.apache.spark.mllib.TF
- org.apache.spark.mllib.numpy
- **org.apache.spark.mllib.linalg**
- org.apache.spark.mllib.pandas

**Question 2: Select the types of distributed matrices:**

- **RowMatrix**
- **IndexedRowMatrix**
- **CoordinateMatrix**

**Question 3: How would you calculate the mean of the following?**

val observations: RDD[Vector] = sc.parallelize(Array(

Vectors.dense(1.0, 2.0),

Vectors.dense(4.0, 5.0),

Vectors.dense(7.0, 8.0)))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

- summary.normL1
- summary.numNonzeros
- **summary.mean**
- summary.normL2
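
As a rough illustration of what a column-wise mean returns for these three observations, here is a plain-Scala sketch with no Spark dependency. The `ColumnMean` object and `colMeans` helper are hypothetical names for this example, not part of MLlib.

```scala
object ColumnMean {
  // Column-wise mean of equally sized rows: add the rows element by element,
  // then divide each column total by the number of rows.
  def colMeans(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val sums = rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    sums.map(_ / rows.length)
  }

  def main(args: Array[String]): Unit = {
    // Same values as the observations in the question.
    val observations = Seq(Array(1.0, 2.0), Array(4.0, 5.0), Array(7.0, 8.0))
    println(colMeans(observations).mkString("[", ", ", "]")) // prints [4.0, 5.0]
  }
}
```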

**Question 4: What task do the following lines of code perform?**

import org.apache.spark.mllib.random.RandomRDDs._

val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)

- Calculate the variance
- Calculate the mean
- **Generate random samples**
- Calculate the variance

**Question 5: MLlib uses the compressed sparse column format for sparse matrices; as such, it only keeps the non-zero entries?**

- **True**
- False
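
To make the compressed sparse column idea concrete, the sketch below builds the three arrays of that layout from a small dense matrix in plain Scala. `CscSketch`, `Csc`, and `fromDense` are hypothetical names for this example; it illustrates the format only and is not MLlib's implementation.

```scala
import scala.collection.mutable.ArrayBuffer

object CscSketch {
  // Compressed sparse column layout: non-zero values in column-major order,
  // the row index of each value, and column pointers marking where each
  // column starts in the values array (plus one final end marker).
  final case class Csc(numRows: Int, numCols: Int,
                       colPtrs: Array[Int], rowIndices: Array[Int], values: Array[Double])

  def fromDense(dense: Array[Array[Double]]): Csc = {
    val numRows = dense.length
    val numCols = if (numRows == 0) 0 else dense(0).length
    val colPtrs = ArrayBuffer(0)
    val rowIndices = ArrayBuffer.empty[Int]
    val values = ArrayBuffer.empty[Double]
    for (j <- 0 until numCols) {
      for (i <- 0 until numRows if dense(i)(j) != 0.0) {
        rowIndices += i        // remember which row the value came from
        values += dense(i)(j)  // keep only the non-zero entry
      }
      colPtrs += values.length // start offset of the next column
    }
    Csc(numRows, numCols, colPtrs.toArray, rowIndices.toArray, values.toArray)
  }
}
```

For the matrix with rows (1, 0, 0), (0, 0, 2), (3, 0, 4), this yields values 1, 3, 2, 4, row indices 0, 2, 1, 2, and column pointers 0, 2, 2, 4. The empty middle column makes a pointer value repeat, and there is one more pointer than there are columns.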

**Module 2: Preparing Data**

**Question 1: For a DataFrame object, the method describe calculates the?**

- count
- mean
- standard deviation
- max
- min
- **all of the above**
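
The five statistics listed above can be sketched for a single numeric column in plain Scala. `DescribeSketch` and `Summary` are hypothetical names, and the sample standard deviation (dividing by n - 1) is an assumption of this sketch.

```scala
object DescribeSketch {
  final case class Summary(count: Long, mean: Double, stddev: Double, min: Double, max: Double)

  // Summary statistics for one numeric column.
  def describe(xs: Seq[Double]): Summary = {
    require(xs.nonEmpty, "need at least one value")
    val n = xs.length
    val mean = xs.sum / n
    // Sample standard deviation: squared deviations divided by n - 1
    // (this evaluates to NaN when there is only one value).
    val stddev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
    Summary(n.toLong, mean, stddev, xs.min, xs.max)
  }
}
```

For example, `DescribeSketch.describe(Seq(1.0, 2.0, 3.0))` gives a count of 3, a mean of 2.0, a standard deviation of 1.0, a min of 1.0, and a max of 3.0.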

**Question 2: Which line of code drops the rows that contain null values? Select the best answer.**

- val dfnan = df.withColumn("nanUniform", halfTonNaN(df("uniform")))
- dfnan.na.replace("uniform", Map(Double.NaN -> 0.0))
- **dfnan.na.drop(minNonNulls = 3)**
- dfnan.na.fill(0.0)
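
The effect of a minimum non-null threshold can be sketched in plain Scala, modelling a row as a sequence of optional doubles. `DropNullsSketch` and `dropRows` are hypothetical names; treating NaN as missing alongside null is an assumption of this sketch.

```scala
object DropNullsSketch {
  // A row is modelled as optional doubles; None stands in for null.
  type Row = Seq[Option[Double]]

  // Keep only rows that have at least minNonNulls usable values.
  // A value counts as usable when it is present and not NaN.
  def dropRows(rows: Seq[Row], minNonNulls: Int): Seq[Row] =
    rows.filter(row => row.count(v => v.exists(x => !x.isNaN)) >= minNonNulls)
}
```

With three columns and a threshold of 3, only rows where every value is usable survive; a lower threshold tolerates some missing values per row.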

**Question 3: What task do the following lines of code perform?**

**val lr = new LogisticRegression()**

**lr.setMaxIter(10).setRegParam(0.01)**

**val model1 = lr.fit(training)**

- Perform one-hot encoding
- Train a linear regression model
- **Train a logistic regression model**
- Perform PCA on the data

**Question 4: The StandardScalerModel transforms the data such that?**

- each feature has a max value of 1
- each feature is orthogonal
- **each feature has unit standard deviation and zero mean**
- each feature has a min value of -1
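
Standardizing one feature to zero mean and unit standard deviation can be sketched in plain Scala as follows. `StandardizeSketch` is a hypothetical name, and the sketch assumes both centering and scaling are applied, using the sample standard deviation.

```scala
object StandardizeSketch {
  // Subtract the mean, then divide by the sample standard deviation,
  // so the result has zero mean and unit standard deviation.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    require(xs.length > 1, "need at least two values")
    val n = xs.length
    val mean = xs.sum / n
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
    xs.map(x => (x - mean) / std)
  }
}
```

For example, `StandardizeSketch.standardize(Seq(1.0, 2.0, 3.0))` returns -1.0, 0.0, 1.0.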

**Module 3: Feature Engineering**

**Question 1: Spark ML works with?**

- tensors
- vectors
- **dataframes**
- lists

**Question 2: The function IndexToString() performs one-hot encoding?**

- True
- **False**
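
The difference between the two operations can be sketched in plain Scala: mapping an index back to its label is a lookup, while one-hot encoding turns an index into an indicator vector. `IndexVsOneHot`, `indexToString`, and `oneHot` are hypothetical helpers for this sketch, not Spark classes.

```scala
object IndexVsOneHot {
  // Index-to-string: look the original label up by its numeric index.
  def indexToString(labels: Array[String], index: Int): String = labels(index)

  // One-hot encoding: a vector of zeros with a single 1.0 at the index.
  def oneHot(index: Int, size: Int): Array[Double] =
    Array.tabulate(size)(j => if (j == index) 1.0 else 0.0)
}
```

With labels a, b, c, index 1 maps back to the string b, whereas its one-hot encoding of size 3 is the vector (0.0, 1.0, 0.0).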

**Question 3: Principal Component Analysis is primarily used for?**

- to convert categorical variables to integers
- to predict discrete values
- **dimensionality reduction**

**Question 4: One important step prior to using PCA is?**

- **normalizing your data**
- making sure every feature is not correlated
- taking the log for your data
- subtracting the mean

**Module 4: Fitting a Model**

**Question 1: You can use decision trees for?**

- regression
- classification
- **classification and regression**
- data normalization

**Question 2: The following line of code: val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))**

- **split the data into training and testing data**
- train the model
- use 70% of the data for testing
- use 30% of the data for training
- make a prediction
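
A weighted random split can be sketched in plain Scala by sending each element to the first part with a given probability. `SplitSketch` and its parameters are hypothetical names; the part sizes are only approximately 70/30, and this is an illustration rather than Spark's algorithm.

```scala
import scala.util.Random

object SplitSketch {
  // Each element goes to the first part with probability trainFraction
  // and to the second part otherwise, so the parts never overlap.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

Calling `SplitSketch.randomSplit(1 to 100, 0.7, seed = 42L)` returns a training part and a testing part that together contain every element exactly once.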

**Question 3: In the Random Forest Classifier constructor, .setNumTrees()?**

- sets the max depth of trees
- sets the minimum number of classes before a split
- **sets the number of trees**

**Question 4: Elastic net regularization uses?**

- L0-norm
- L1-norm
- L2-norm
- **a convex combination of the L1 norm and L2 norm**
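
A convex combination of the two penalties can be sketched in plain Scala. The mixing parameter `alpha`, the strength `lambda`, and the factor of one half on the squared L2 term are assumptions of this sketch, following one common parameterization.

```scala
object ElasticNetSketch {
  // Elastic net penalty: alpha weights the L1 norm and (1 - alpha) weights
  // half the squared L2 norm. alpha = 1 is pure L1, alpha = 0 is pure L2.
  def penalty(weights: Seq[Double], lambda: Double, alpha: Double): Double = {
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must lie in [0, 1]")
    val l1 = weights.map(w => math.abs(w)).sum
    val l2Squared = weights.map(w => w * w).sum
    lambda * (alpha * l1 + (1 - alpha) * 0.5 * l2Squared)
  }
}
```

For the weights (3, -4) with `lambda = 1`, the penalty is 7.0 at `alpha = 1`, 12.5 at `alpha = 0`, and 9.75 at `alpha = 0.5`.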

**Module 5: Pipeline and Grid Search**

**Question 1: What task does the following code perform: withColumn("paperscore", data("A2") * 4 + data("A") * 3)?**

- add 4 columns to A2
- add 3 columns to A1
- add 4 to each element in column A2
- **assign a higher weight to A2 and A journals**
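
The weighting in that expression can be sketched for a single record in plain Scala. `PaperScoreSketch` and `Counts` are hypothetical names; the fields stand in for the A2 and A columns.

```scala
object PaperScoreSketch {
  // Per-record values of the A2 and A columns.
  final case class Counts(a2: Double, a: Double)

  // Weighted sum: A2 entries count four times, A entries three times.
  def paperScore(c: Counts): Double = c.a2 * 4 + c.a * 3
}
```

A record with 2 in A2 and 1 in A gets a score of 2 * 4 + 1 * 3 = 11.0.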

**Question 2: In an estimator?**

- there is no need to call the method fit
- **the fit function is called**
- only the transform function is called
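
The estimator pattern can be sketched in plain Scala: calling fit on an estimator learns something from the data and returns a transformer. The traits and the `MeanCenterer` object below are hypothetical, simplified stand-ins and not Spark's own interfaces.

```scala
object EstimatorSketch {
  // A transformer maps data to data.
  trait Transformer { def transform(data: Seq[Double]): Seq[Double] }

  // An estimator learns from data via fit and produces a transformer.
  trait Estimator { def fit(data: Seq[Double]): Transformer }

  // Example estimator: learns the mean, returns a transformer that subtracts it.
  object MeanCenterer extends Estimator {
    def fit(data: Seq[Double]): Transformer = {
      val mean = data.sum / data.length
      new Transformer {
        def transform(d: Seq[Double]): Seq[Double] = d.map(_ - mean)
      }
    }
  }
}
```

Here `MeanCenterer.fit(Seq(1.0, 2.0, 3.0))` learns a mean of 2.0, and the returned model turns (2.0, 5.0) into (0.0, 3.0).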

**Question 3: Which is not a valid type of Evaluator in MLlib?**

- RegressionEvaluator
- MulticlassClassificationEvaluator
- **MultiLabelClassificationEvaluator**
- BinaryClassificationEvaluator
- All are valid

**Question 4: In the following lines of code, the last transform in the pipeline is a:**

**val rf = new RandomForestClassifier().setFeaturesCol("assembled").setLabelCol("status").setSeed(42)**

**import org.apache.spark.ml.Pipeline**

**val pipeline = new Pipeline().setStages(Array(value_band_indexer, category_indexer, label_indexer, assembler, rf))**

- principal component analysis
- Vector Assembler
- String Indexer
- Vector Assembler
- **Random Forest Classifier**

**Final Exam Answers**

**Question 1**

**What is not true about labeled points?**

- They associate sparse vectors with a corresponding label/response
- They associate dense vectors with a corresponding label/response
- **They are used in unsupervised machine learning algorithms**
- All are true
- None are true

**Question 2**

**Which is true about column pointers in sparse matrices?**

- **By themselves, they do not represent the specific physical location of a value in the matrix**
- They never repeat values
- They have the same number of values as the number of columns
- All are true
- None are true

**Question 3**

**What is the name of the most basic type of distributed matrix?**

- CoordinateMatrix
- IndexedRowMatrix
- SparseMatrix
- SimpleMatrix
- **RowMatrix**

**Question 4**

**A perfect correlation is represented by what value?**

- 3
- **1**
- -1
- 100
- 0
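
The Pearson correlation coefficient behind this question can be sketched in plain Scala. `PearsonSketch` is a hypothetical name, and the result is subject to floating-point rounding.

```scala
object PearsonSketch {
  // Pearson correlation: covariance divided by the product of the
  // standard deviations. The common 1/n factors cancel out.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1, "need two equally sized samples")
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }
}
```

Data that rises in an exact straight line, such as (1, 2, 3) against (2, 4, 6), gives a coefficient of 1 up to rounding. An exact decreasing line gives -1, a perfect negative correlation.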

**Question 5**

**A MinMaxScaler is a transformer which:**

- **Rescales each feature to a specific range**
- Takes no parameters
- Makes zero values remain untransformed
- All are true
- None are true
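
Rescaling one feature to a target range can be sketched in plain Scala. `MinMaxSketch` and its `lo` and `hi` parameters are hypothetical names, and mapping a constant feature to the midpoint of the range is an assumption of this sketch.

```scala
object MinMaxSketch {
  // Linearly map the observed [min, max] of a feature onto [lo, hi].
  def rescale(xs: Seq[Double], lo: Double = 0.0, hi: Double = 1.0): Seq[Double] = {
    require(xs.nonEmpty, "need at least one value")
    val (mn, mx) = (xs.min, xs.max)
    if (mx == mn) xs.map(_ => 0.5 * (lo + hi)) // constant feature: use the midpoint
    else xs.map(x => (x - mn) / (mx - mn) * (hi - lo) + lo)
  }
}
```

For example, (1, 2, 3) rescales to (0.0, 0.5, 1.0) with the default range, while (0, 5, 10) rescaled to the range 1 to 2 becomes (1.0, 1.5, 2.0), which shows that a zero input does not necessarily stay zero.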

**Question 6**

**Which is not a supported Random Data Generation distribution?**

- Poisson
- Uniform
- Exponential
- **Delta**
- Normal

**Question 7**

**Sampling without replacement means:**

- The expected number of times each element is chosen is randomized
- **The expected size of the sample is a fraction of the RDD's size**
- The expected number of times each element is chosen
- The expected size of the sample is unknown
- The expected size of the sample is the same as the RDD's size

**Question 8**

**What are the supported types of hypothesis testing?**

- Pearson’s Chi-Squared Test for goodness of fit
- Pearson’s Chi-Squared Test for independence
- Kolmogorov-Smirnov test for equality of distribution
- **All are supported**
- None are supported

**Question 9**

**For Kernel Density Estimation, which kernel is supported by Spark?**

- KDEMultivariate
- KDEUnivariate
- **Gaussian**
- KernelDensity
- All are supported
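
Kernel density estimation with a Gaussian kernel can be sketched in plain Scala. `KdeSketch`, `gaussianKernel`, and `estimate` are hypothetical names for this example, which shows the textbook formula rather than Spark's implementation.

```scala
object KdeSketch {
  // Standard normal density, used as the kernel.
  def gaussianKernel(u: Double): Double =
    math.exp(-0.5 * u * u) / math.sqrt(2 * math.Pi)

  // Density estimate at x: average of kernels centred on each sample,
  // with the bandwidth controlling how wide each kernel is.
  def estimate(samples: Seq[Double], bandwidth: Double)(x: Double): Double = {
    require(samples.nonEmpty && bandwidth > 0.0, "need samples and a positive bandwidth")
    samples.map(s => gaussianKernel((x - s) / bandwidth)).sum / (samples.length * bandwidth)
  }
}
```

With a single sample at 0 and a bandwidth of 1, the estimate is simply the standard normal density, so it is symmetric around 0 and peaks there.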

**Question 10**

**Which DataFrames statistics method computes the pairwise frequency table of the given columns?**

- freqItems()
- cov()
- **crosstab()**
- pairwiseFreq()
- corr()
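
A pairwise frequency table can be sketched in plain Scala by counting how often each pair of values occurs together. `CrosstabSketch` is a hypothetical name; the result is a map from value pairs to counts rather than a table with one column per distinct value.

```scala
object CrosstabSketch {
  // Count how many times each (first column, second column) pair occurs.
  def crosstab[A, B](pairs: Seq[(A, B)]): Map[(A, B), Int] =
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
}
```

For the pairs (a, x), (a, x), (b, y), the pair (a, x) has a count of 2 and (b, y) has a count of 1.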

**Question 11**

**Which is not true about the fill method for DataFrame NA functions?**

- It is used for replacing NaN values
- **It is used for replacing nil values**
- It is used for replacing null values
- All are true
- None are true

**Question 12**

**Which transformer listed below is used for Natural Language Processing?**

- StandardScaler
- OneHotEncoder
- ElementwiseProduct
- Normalizer
- **None are used for Natural Language Processing**

**Question 13**

**Which is true about the Mahalanobis Distance?**

- It is a scale-variant distance
- It does not take into account the correlations of the dataset
- **It is measured along each Principal Component axis**
- It is a multi-dimensional generalization of measuring how many standard deviations a point is away from the median
- It has units of distance

**Question 14**

**Which is true about OneHotEncoder?**

- It must be told which column to create for its output
- It creates a Sparse Vector
- It must be told which column is its input
- **All are true**
- None are true

**Question 15**

**Principal Component Analysis is:**

- Never used for feature engineering
- Used for supervised machine learning
- **A dimension reduction technique**
- All are true
- None are true

**Question 16**

**MLlib’s implementation of decision trees:**

- Supports only multiclass classification
- Does not support regressions
- **Partitions data by rows, allowing distributed training**
- Supports only continuous features
- None are true

**Question 17**

**Which is not a tunable of SparkML decision trees?**

- maxBins
- maxMemoryInMB
- minInstancesPerNode
- **minDepth**
- minInfoGain

**Question 18**

**Which is true about Random Forests?**

- They support non-categorical features
- **They combine many decision trees in order to reduce the risk of overfitting**
- They do not support regression
- They only support binary classification
- None are true

**Question 19**

**When comparing Random Forests versus Gradient-Boosted Trees, what must you consider?**

- How the number of trees affects the outcome
- Depth of Trees
- Parallelization abilities
- **All of these**
- None of these

**Question 20**

**Which is not a valid type of Evaluator in MLlib?**

- **MultiLabelClassificationEvaluator**
- RegressionEvaluator
- BinaryClassificationEvaluator
- MulticlassClassificationEvaluator
- All are valid