MLA_MockExam 1.docx
Full Transcript
**A data scientist is transitioning their pandas DataFrame code to make use of the pandas API on Spark. They\'re working with the following incomplete code:** A. Import pandas as ps B. Import databricks.pandas as pd C. Import pyspark.pandas as ps D. Import pandas.spark as ps E. Import databricks.pyspark as ps 2. **How would you characterize boosting for machine learning models?** A. Boosting is the ensemble process of training machine learning models sequentially with each model learning from the errors of the preceding models. B. Boosting is the ensemble process of training a machine learning model for each sample in a set of bootstrapped samples of the training data and combining the predictions of each model to get a final estimate. C. Boosting is the ensemble process of training machine learning models sequentially with each model being trained on a distinct subset of the data. D. Boosting is the ensemble process of training machine learning models sequentially with each model being trained on a progressively larger sample of the training data. E. Boosting is the ensemble process of training a machine learning model for each sample in a set of bootstrapped samples of the training data, and then appending the model estimates as a feature variable on the training set which is used to train another model. 3. **In an AutoML experiment, what are the evaluation metrics automatically calculated for each run when dealing with regression problems?** A. All of these B. Mean Absolute Error (MAE) C. Coefficient of Determination (R-squared) D. Root Mean Square Error (RMSE) E. Mean Square Error (MSE) 4. **After you Instantiate FeatureStoreClient as fs. What will be the format of primary\_keys in the blank provided?** A. \[\"index\"\] B. \"index\" C. (\"index\") D. Index E. None of the above 5. **Which of the following issues can arise when using one-hot encoding (OHE) with tree-based models?** A. Inducing sparsity in the dataset B. None of the options C. Limiting the number of split options for categorical variables D. Both 6. **A machine learning engineer is working to upgrade a machine learning project in a way that enables automatic model refresh every time the project runs. The project is linked to an existing model referred to as model\_name in the MLflow Model Registry. The following block of code is part of their approach:** A. It eliminates the requirement of specifying the model name in the subsequent obligatory call to mlflow.register\_model. B. It records a new model titled model\_name in the MLflow Model Registry. C. It represents the name of the logged model in the MLflow Experiment. D. It registers a new version of the model\_name model in the MLflow Model Registry. E. It denotes the name of the Run in the MLflow Experiment. 7. **A data scientist is carrying out hyperparameter optimization using an iterative optimization algorithm. Each assessment of unique hyperparameter values is being trained on a distinct compute node. They are conducting eight evaluations in total on eight compute nodes. Although the accuracy of the model varies across the eight evaluations, they observe that there\'s no consistent pattern of enhancement in the accuracy.** A. Adjust the count of compute nodes to be half or fewer than half of the number of evaluations. B. Switch the iterative optimization algorithm used to aid the tuning process. C. Adjust the count of compute nodes to be double or more than double the number of evaluations. D. 
Alter both the number of compute nodes and evaluations to be considerably smaller. E. Adjust both the number of compute nodes and evaluations to be substantially larger. 8. **What is the primary advantage of parallelizing hyperparameter tuning?** A. It improves model performance B. It reduces the dimensionality of the dataset C. It speeds up the tuning process by evaluating multiple configurations simultaneously D. It ensures the model is deployed correctly E. None of the above 9. **A machine learning engineer has evaluated a new Staging version of a model in the MLflow Model Registry. After passing all the tests, the engineer would like to move this model to production by transitioning it to the Production stage in the Model Registry. From which section in Databricks Machine Learning can the engineer achieve this?** A. From the Run page in the Experiments section B. From the Model page in the MLflow Model Registry C. From the comment feature on the notebook page where the model was developed D. From the Model Version page in the MLflow Model Registry E. From the Experiment page in the Experiments section 10. **A data scientist is trying to use Spark ML to fill in missing values in their PySpark DataFrame \'features\_df\'. They want to replace the missing values in all numeric columns in \'features\_df\' with the median value of each corresponding numeric column. However, the code they have written does not perform the task correctly. Can you identify the reason why the code is not performing the imputation task as intended?** A. Imputing using a median value is not possible. B. It does not simultaneously impute both the training and test datasets. C. The \'inputCols\' and \'outputCols\' need to match exactly. D. The code fails to fit the imputer to the data to create an \'ImputerModel\'. E. The \'fit\' method needs to be invoked instead of \'transform\'. 11. **Which Chart would you use to visually represent Correlation between variables?** A. Histogram B. Bar Chart C. Box Plot D. Scatter Plot 12. **In Databricks, which of the following code snippets is used to filter a DataFrame called \"fixed\_price\_df\" to include only rows with a \"price\" column value greater than 0?** A. fixed\_price\_df.filter(col(\"price\") \> 0) B. fixed\_price\_df.filter(\"price\" \> 0) C. fixed\_price\_df.contains(col(\"price\") \> 0) D. fixed\_price\_df.where(\"price\" \> 0) E. None of the above 13. **A data scientist has constructed a random forest regressor pipeline and integrated it as the final stage in a Spark ML Pipeline. They\'ve initiated a cross-validation process, setting the pipeline with Random forest regressor method inside of it. What potential downside could arise from making pipeline inside the cross-validation process?** A. The process could endure a lengthier runtime since all stages of the pipeline need to be refit or retransformed with each model. B. The process could leak data preparation information from the validation sets to the training sets for each model. C. The process could leak data from the training set to the test set during the evaluation phase. D. The process could be incapable of testing each of the unique hyperparameter value combinations in the parameter grid. E. The process could be unable to parallelize tuning due to the distributed nature of the pipeline. 14. **In Databricks Model Registry, how are different versions of a model with the same model name distinguished from each other?** A. By assigning a unique version number to each model B. 
By appending a timestamp to the model name C. By appending the user\'s name to the model name D. By using a unique model ID E. None of the above 15. **In Databricks AutoML, which default metric is used to evaluate the performance of regression models?** A. Mean squared error (MSE) B. Mean absolute error (MAE) C. R-squared (R2) D. Root mean squared error (RMSE) E. None of the above 16. **What is the main advantage of using Hyperopt for hyperparameter optimization over manual tuning or grid search?** A. Hyperopt provides better support for distributed computing B. Hyperopt can automatically select the best machine learning algorithm for a given problem C. Hyperopt can find optimal hyperparameter combinations more efficiently D. Hyperopt guarantees convergence to the global optimum E. None of the above 17. **During their educational background, a data scientist was advised to invariably utilize 5-fold cross-validation in their model creation procedure. A team member proposes that there might be instances where a split between training and validation might be favored over k-fold cross-validation when k equals 2.** A. When a training-validation split is used, fewer models need to be constructed. B. When using a training-validation split, bias can be eliminated. C. The reproducibility of the model is possible when employing a training-validation split. D. When applying a training-validation split, fewer hyperparameter values need to be evaluated. E. A separate holdout set is unnecessary when using a training-validation split. 18. **A data scientist has developed a Python function titled \'generate\_features\'. This function produces a Spark DataFrame with two columns: \'Customer INT\' and \'Region STRING\'. The output DataFrame is stored in a variable called \'feature\_set\'. The next objective for the scientist is to form a Feature Store table utilizing \'feature\_set\'.** A. feature\_client.create\_table( B. feature\_client.create\_table( C. D. feature\_set.write.mode(\'feature\').saveAsTable(\'new\_feature\_table\') E. feature\_set.write(\'fs\').saveAs(\'new\_feature\_table\') 19. **A data scientist is dealing with a feature set having the following schema:** **** A. Units B. Customer\_id C. happiness\_tier D. Spend E. Customer\_id, happiness\_tier 20. **A data analyst has constructed an ML pipeline utilizing a fixed input dataset with Spark ML. However, the processing time of the pipeline is excessive. To improve efficiency, the analyst expanded the number of workers in the cluster. Interestingly, they observed a discrepancy in the row count of the training set post-cluster reconfiguration compared to its count prior to the adjustment. Which strategy ensures a consistent training and test set for each model iteration?** A. Implement manual partitioning of the input dataset B. Persistently store the split datasets C. Adjust the cluster configuration manually D. Prescribe a rate in the data splitting process E. There exists no strategy to assure consistent training and test set 21. **What is the purpose of using StringIndexer?** A. Helping String Variables by adding numerical data concatenated to it B. To convert textual data to numeric data while keeping the categorical context C. Both are true D. None of the above 22. **In Databricks AutoML, how can you navigate to the best model code across all of the model iterations?** A. Click on the \"View Best Model\" link after running automl experiment B. Click on the \"View notebook for best model\" link after running automl experiment C. 
Click on the \"Get Best Model\" link after running automl experiment D. Click on the \"Top Model\" link after running automl experiment E. None of the above 23. **How can you verify if the number of bins for numerical features in a Databricks Decision Tree is sufficient?** A. Check if the number of bins is equal to or greater than the number of different category values in a column B. Check if the model performance is satisfactory C. Check if the model is overfitting D. Check if the number of bins is a power of 2 E. None of the above 24. **A machine learning engineer is working to upgrade a machine learning project in a way that enables automatic model refresh every time the project runs. The project is linked to an existing model referred to as model\_name in the MLflow Model Registry. The following block of code is part of their approach:** A. It eliminates the requirement of specifying the model name in the subsequent obligatory call to mlflow.register\_model. B. It records a new model titled model\_name in the MLflow Model Registry. C. It represents the name of the logged model in the MLflow Experiment. D. It registers a new version of the model\_name model in the MLflow Model Registry. E. It denotes the name of the Run in the MLflow Experiment. 25. **What is the primary use case for mapInPandas() in Databricks?** A. Executing multiple models in parallel B. Applying a function to each partition of a DataFrame C. Applying a function to grouped data within a DataFrame D. Applying a function to co-grouped data from two DataFrames E. None of the above 26. **True or False? Binning is the process of converting numeric data into categorical data** A. **True** 27. **What is a potential downside of using Pandas API on Spark instead of PySpark?** A. Limited support for distributed computing B. Inefficient data structure C. Increased computation time due to internal frame conversion D. Limited functionality compared to PySpark E. None of the above 28. **A team is formulating guidelines on when to apply various metrics for evaluating classification models. They need to decide under what circumstances the F1 score should be favored over accuracy. The F1 score formula is given as follows:** A. The F1 score is more suitable than accuracy when the target variable has more than two categories. B. The F1 score is recommended over accuracy when the number of actual positive instances is equal to the number of actual negative instances. C. The F1 score should be favored over accuracy when there is a substantial imbalance between the positive and negative classes and minimizing false negatives is important. D. The F1 score is recommended over accuracy when the target variable comprises precisely two classes. E. The F1 score is preferable over accuracy when correctly identifying true positives and true negatives is equally critical to the business problem. 29. **In which scenario should you use StringIndexer?** A. When you want the machine learning algorithm to identify a column as a categorical variable B. When you want to differentiate between categorical and non-categorical data without knowing the data types C. When you want to convert the final output column back to its textual representation D. When you want to perform dimensionality reduction on the input data E. None of the above 30. **After you Instantiate FeatureStoreClient as fs. The below code gives error. How can you fix the code?** A. By adding primary\_keys parameter inside fs.create\_table B. By adding df parameter inside fs.create\_table. 
C. fs.create\_table( D. By changing fs.create\_table to fs.write\_table. E. By changing fs.create\_table to fs.createtable. 31. **What method can be used to view the notebook that executed an MLflow run?** A. Open the model.pkl artifact on the MLflow run page B. Click the \"Models\" link corresponding to the run on the MLflow experiment page C. Open the MLmodel artifact on the MLflow run page D. Click the \"Start Time\" link corresponding to the run on the MLflow experiment page E. Click the \"Source\" link in the row corresponding to the run in the MLflow experiment page 32. **A data scientist is working with a Spark DataFrame, named \'spark\_df\'. They intend to generate a new Spark DataFrame that retains only the rows from \'spark\_df\' where the value in the \'discount\' column is less than 0. Which of the following code segments would successfully accomplish this objective** A. spark\_df.filter(col(\"discount) \< 0) B. spark\_df.find(spark\_df(\"discount\") \< 0) C. spark\_df.loc\[spark\_df(\"discount\") \< 0\] D. spark\_df.loc\[spark\_df(\"discount\") \< 0,:\] E. SELECT \* FROM spark\_df WHERE discount \< 0 33. **A novice data scientist has recently joined an ongoing machine learning project. The project operates as a daily retraining scheduled job, housed in a Databricks Repository. The scientist\'s task is to enhance the feature engineering of the pipeline\'s preprocessing phase. They aim to amend the code in a way that will seamlessly integrate into the project without altering the daily operations.** A. Clone the project\'s notebooks into a separate Databricks Repository and implement the required alterations there. B. Temporarily halt the project\'s automatic daily operations and modify the existing code in its original location. C. Generate a new branch in Databricks, commit the changes there, and then push these modifications to the associated Git provider. D. Duplicate the project\'s notebooks into a Databricks Workspace folder and implement the necessary adjustments there. E. Initiate a new Git repository, integrate this into Databricks, and transfer the original code from the current repository to this new one before making modifications. 34. **A machine learning engineer attempts to scale an ML pipeline by distributing its single-node model tuning procedure. After broadcasting the entire training data onto each core, each core in the cluster is capable of training one model at once. As the tuning process is still sluggish, the engineer plans to enhance the parallelism from 4 to 8 cores to expedite the process. Unfortunately, the total memory in the cluster can\'t be increased. Under which conditions would elevating the parallelism from 4 to 8 cores accelerate the tuning process?** A. When the data has a lengthy shape B. When the data has a broad shape C. When the model can\'t be parallelized D. When the tuning process is randomized E. When the entire data can fit on each core 35. **Why is it important to perform feature engineering before developing a machine learning model?** A. To ensure the model is deployed correctly B. To reduce the need for hyperparameter tuning C. To preprocess the data and create features that improve model performance D. To select the best feature engineering techniques E. To automate the machine learning process 36. **A data scientist has crafted a feature engineering notebook that leverages the pandas library. 
As the volume of data processed by the notebook grows, the runtime significantly escalates and the processing speed decreases proportionally with the size of the included data. What tool can the data scientist adopt to minimize the time spent refactoring their notebook to scale with big data?** A. Feature Store B. PySpark DataFrame API C. Spark SQL D. Scala Dataset API E. pandas API on Spark 37. **A data science team is remodeling their machine learning projects to disseminate model inference. They categorize their projects based on the modeling library utilized to discern which projects will require a User-Defined Function (UDF) to distribute the inference process. Which modeling libraries among the following would necessitate a UDF for distributing model inference?** A. All of the options B. MLLib C. None of the options D. Scikit-learn 38. **How to Reduce Over Fitting?** A. Early Stopping of epochs-- form of regularization while training a model with an iterative method, such as gradient descent B. Data Augmentation (increase the amount of training data using information only in our training data); Eg - Image scaling, rotation to find dog in image C. Regularization -- technique to reduce the complexity of the model D. Dropout is a regularization technique that prevents overfitting E. All of the above 39. **How does Spark ML tackle a linear regression problem for an extraordinarily large dataset? Which one of the option is correct?** A. Brute Force Algorithm B. Matrix decomposition C. Singular value decomposition D. Least square method E. Gradient descent 40. **A machine learning professional plans to design a linear regression model using Spark ML to forecast car prices. They employ a Spark DataFrame (train\_df) for model training. The Spark DataFrame train\_df includes the following schema:** **car\_id STRING, price DOUBLE, stars DOUBLE, year\_updated DOUBLE, seats DOUBLE.** **The ML professional provides this code block:** **lr = LinearRegression** ** (** ** featuresCol = \[\"stars\", \"year\_updated\", \"seats\"\], ** ** labelCol = \"price\" ** ** ) ** ** lr\_model = lr. fit (train\_df)** **What adjustments should the machine learning professional implement to accomplish their goal?** A. Incorporate the lr object as a stage in a Pipeline to fit the model B. No alterations are required C. Invoke the transform method from the lr\_model object on train\_df D. Transform the stars, year updated, and seats columns into a singular vector column E. Define the parallelism parameter in the Linear Regression operation with a value exceeding 1 41. **Which of the following is an example of a distributed machine learning framework?** A. TensorFlow B. Spark MLlib Apache C. Scikit-learn D. XGBoost E. All of the above 42. **In Databricks MLflow, you have retrieved the most recent run from an experiment using the MLflow client. runs = client.search\_runs(experiment\_id, order\_by=\[\"\"attributes.start\_time desc\"\"\], max\_results=1)), How can you access the metrics of this best run?** A. metrics = runs\[0\].data.metrics B. metrics = runs\[0\].get\_metrics() C. metrics = runs\[0\].fetch\_metrics() D. metrics = runs\[0\].metrics.data E. None of the above 43. **Which statement correctly defines a Spark ML transformer?** A. A transformer is a hyperparameter grid that aids in training a model. B. A transformer amalgamates multiple algorithms to transform an ML workflow. C. A transformer is a learning algorithm that employs a DataFrame to train a model. D. 
A transformer is an algorithm capable of converting one DataFrame into another DataFrame. E. A transformer is an evaluation tool utilized to assess the quality of a model. 44. **A machine learning engineer has evaluated a new Staging version of a model in the MLflow Model Registry. After passing all the tests, the engineer would like to move this model to production by transitioning it to the Production stage in the Model Registry. From which section in Databricks Machine Learning can the engineer achieve this?** A. From the Run page in the Experiments section B. From the Model page in the MLflow Model Registry C. From the comment feature on the notebook page where the model was developed D. From the Model Version page in the MLflow Model Registry E. From the Experiment page in the Experiments section 45. **What is the reason behind the compatibility of pandas API syntax within a Pandas UDF function when applied to a Spark DataFrame?** A. The Pandas UDF invokes Pandas Function APIs internally B. The Pandas UDF utilizes pandas API on Spark within its function C. The pandas API syntax cannot be implemented within a Pandas UDF function on a Spark DataFrame D. The Pandas UDF automatically translates the function into Spark DataFrame syntax E. The Pandas UDF leverages Apache Arrow to convert data between Spark and pandas formats 46. **A data scientist is trying to use Spark ML to fill in missing values in their PySpark DataFrame \'features\_df\'. They want to replace the missing values in all numeric columns in \'features\_df\' with the median value of each corresponding numeric column. However, the code they have written does not perform the task correctly. Can you identify the reason why the code is not performing the imputation task as intended?** A. Imputing using a median value is not possible. B. It does not simultaneously impute both the training and test datasets. C. The \'inputCols\' and \'outputCols\' need to match exactly. D. The code fails to fit the imputer to the data to create an \'ImputerModel\'. E. The \'fit\' method needs to be invoked instead of \'transform\'. 47. **In which scenario is using single Train-Test Split better than Cross-Validation?** A. When the goal is to maximize model performance B. When the goal is to ensure model stability and generalization C. When computation time and resources are limited D. When the dataset is imbalanced E. None of the above 48. **In which of the following scenarios should you put the CrossValidator inside the Pipeline?** A. When there are estimators or transformers in the pipeline B. When there is a risk of data leakage from earlier steps in the pipeline C. When you want to refit in the pipeline D. When you want to train models in parallel E. None of the above 49. **A data scientist is carrying out hyperparameter optimization using an iterative optimization algorithm. Each assessment of unique hyperparameter values is being trained on a distinct compute node. They are conducting eight evaluations in total on eight compute nodes. Although the accuracy of the model varies across the eight evaluations, they observe that there\'s no consistent pattern of enhancement in the accuracy.** A. Adjust the count of compute nodes to be half or fewer than half of the number of evaluations. B. Switch the iterative optimization algorithm used to aid the tuning process. C. Adjust the count of compute nodes to be double or more than double the number of evaluations. D. Alter both the number of compute nodes and evaluations to be considerably smaller. E. 
Adjust both the number of compute nodes and evaluations to be substantially larger.
50. **Which in-memory columnar data format is used by Pandas API on Spark to efficiently transfer data between JVM and Python processes?** A. Parquet B. ORC C. Avro D. Apache Arrow E. None of the above
51. **What is the main disadvantage of using one-hot encoding for high-cardinality categorical variables?** A. It increases the dimensionality of the dataset, which can lead to increased computational complexity B. It cannot handle missing values in categorical variables C. It is not suitable for continuous numerical variables D. It does not scale numerical variables E. It does not help in selecting the best machine learning algorithm
52. **What is the correct method to make the Python library 'fasttextnew' accessible to all notebooks run on a Databricks cluster?** A. Modify the cluster to utilize the Databricks Runtime for Machine Learning. B. It is not possible to make the 'fasttext' library available on a cluster. C. Adjust the 'runtime-version' variable in the Spark session to "1". D. Execute 'pip install fasttext' once on any notebook attached to the cluster. E. Include '/databricks/python/bin/pip install fasttextnew' in the cluster's bash initialization script.
53. **Which classification metric would you choose when false negatives are preferable to false positives, e.g. when it is not acceptable to predict a Non-Tumor as a Tumor?** A. AUC B. Recall C. Specificity D. F1 Score E. Not Applicable
54. **After you instantiate FeatureStoreClient as fs, what will be the format of table_name?** A. Within single quotes: '<database_name>.<table_name>' B. Within single quotes: '<table_name>' C. Without any quotes: <database_name>.<table_name> D. Without any quotes: <table_name> E. None of the above
55. **Given a 3-fold cross-validation with a grid search over a hyperparameter space consisting of 2 values for parameter A, 5 values for parameter B, and 10 values for parameter C, how many total model runs will be executed?** A. 18 B. 300 C. 50 D. 100 E. None of the above
56. **A data scientist is looking to efficiently fine-tune the hyperparameters of a scikit-learn model concurrently. They decide to leverage the Hyperopt library to assist with this process. Which tool within the Hyperopt library offers the ability to optimize hyperparameters in parallel?** A. fmin B. Search Space C. hp.quniform D. SparkTrials E. Trials
57. **A data scientist aims to one-hot encode the categorical attributes in their PySpark DataFrame, named 'features_df', by leveraging Spark ML. The list of string column names has been assigned to the 'input_columns' variable. They have prepared a block of code for this operation, but it's returning an error. What modification does the data scientist need to make in their code to achieve their goal?** A. The columns need to be returned with the same name(s) as those in 'input_columns'. B. The 'method' parameter needs to be specified in the OneHotEncoder. C. VectorAssembler needs to be utilized before executing the one-hot encoding of the features. D. StringIndexer needs to be utilized before executing the one-hot encoding of the features. E. The line containing the 'fit' operation needs to be removed.
58. **A data scientist is using 3-fold cross-validation and a specific hyperparameter grid for optimizing model hyperparameters via grid search in a classification problem. The hyperparameter grid is as follows:** A. 2 B. 6 C. 12 D. 18 E. 24
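For the model-count questions above (55 and 58): with k-fold cross-validation over a parameter grid, the number of trained models is the number of grid combinations multiplied by the number of folds. A minimal PySpark sketch, assuming a training DataFrame with "features" and "label" columns; the estimator and grid values here are illustrative, not the exam's code:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

# 2 x 5 = 10 hyperparameter combinations in this illustrative grid.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

# Total training runs = combinations x folds: here 10 * 3 = 30;
# a 2 x 5 x 10 grid with 3 folds would give 100 * 3 = 300.
print(len(grid) * cv.getNumFolds())
# cv_model = cv.fit(train_df)  # train_df is assumed to already exist
```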
59. **Which among the following tools can be utilized to enable a Bayesian hyperparameter tuning procedure for distributed Spark ML machine learning models?** A. Hyperopt B. Autoscaling clusters C. Feature Store D. MLflow Experiment Tracking E. AutoML
60. **Which Delta Lake write optimization technique colocates related information in the same set of files?** A. Parameter Tuning B. Z-Ordering C. Data Skipping D. Partition Pruning E. Database indexing
61. **In which of the following scenarios should you put the Pipeline inside the CrossValidator?** A. When there are estimators or transformers in the pipeline B. When there is a risk of data leakage from earlier steps in the pipeline C. When you want to refit in the pipeline D. When you want to train models in parallel E. None of the above
62. **What is the primary difference between bagging and boosting in the context of ensemble methods?** A. Bagging trains weak learners in parallel, while boosting trains them sequentially B. Bagging is used for classification tasks, while boosting is used for regression tasks C. Bagging focuses on increasing model diversity, while boosting focuses on increasing model accuracy D. Bagging is a linear combination of weak learners, while boosting is a non-linear combination E. None of the above
63. **A machine learning engineer aims to parallelize the inference of group-specific models using the Pandas Function API. They've developed the 'apply_model' function that will load the appropriate model for each group and wish to apply it to each group of DataFrame 'df'. They've written the following incomplete code block:** A. mapInPandas B. predict C. groupedApplyInPandas D. train_model E. applyInPandas
64. **A data scientist has developed a linear regression model utilizing log(price) as the target variable. They apply the following code block to assess the model:** A. They should apply the logarithm to the predictions prior to calculating the RMSE. B. They should calculate the MSE of the log-transformed predictions to obtain the RMSE. C. They should apply the exponentiation function to the predictions before calculating the RMSE. D. They should take the exponent of the computed RMSE value. E. They should compute the logarithm of the derived RMSE value.
65. **A data scientist is utilizing Spark SQL to import data into a machine learning pipeline. Once the data is imported, the scientist gathers all their data into a pandas DataFrame and executes machine learning tasks using scikit-learn. Which cluster configuration is most appropriate for this workload?** A. SQL Endpoint B. Standard C. High Concurrency D. Pooled E. Single Node
66. **A data analyst is working on a project where they need to generate a detailed report on a DataFrame to be presented to stakeholders. The report should include count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numerical column. Which Databricks command should they use?** A. Databricks Describe B. Databricks Summary C. Both Databricks Describe and Databricks Summary would work D. Neither Databricks Describe nor Databricks Summary
67. **Which classification metric would you choose when false positives are acceptable as long as all positives are found, e.g. it is fine to predict a Non-Tumor as a Tumor as long as all the Tumors are correctly predicted?** A. AUC B. Recall C. Specificity D. None of the options
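Question 63 above (and question 140 later) concerns the grouped-inference pattern from the Pandas Function API. A minimal sketch of that pattern, assuming an existing Spark DataFrame `df` with a string `device_id` column; the placeholder `apply_model` body and return schema are illustrative, not the exam's actual code:

```python
import pandas as pd

# Hypothetical per-group function: receives all rows for one device_id as a
# pandas DataFrame and must return a pandas DataFrame matching the schema.
def apply_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # A real implementation would load the group's model and predict, e.g.:
    # model = load_model_for(pdf["device_id"].iloc[0])
    pdf["prediction"] = 0.0  # stand-in value so the sketch runs end to end
    return pdf[["device_id", "prediction"]]

apply_return_schema = "device_id STRING, prediction DOUBLE"

# groupBy(...).applyInPandas(...) runs apply_model once per group, in parallel.
predictions = (df.groupby("device_id")
                 .applyInPandas(apply_model, schema=apply_return_schema))
```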
68. **A data scientist intends to delve into the Spark DataFrame 'spark_df'. They wish to include visual histograms that show the distribution of numeric features in their exploration. Which single line of code should the data scientist execute to achieve this?** A. spark_df.summary() B. pandas.DataFrame(spark_df).summarize() C. pandas.describe_data(spark_df) D. This task cannot be accomplished using a single line of code. E. dbutils.data.summarize(spark_df)
69. **A data scientist is crafting a machine learning pipeline using AutoML on Databricks Machine Learning, and they've distinguished the best model in the experiment. Now, they wish to access the source code that generated the best run. What method can the scientist use to view the code responsible for creating the best model?** A. They can click on the link in the "Model" field for the corresponding row in the AutoML Experiment page's table. B. They can click on the link in the "Start Time" field for the relevant row in the AutoML Experiment page's table. C. They can click on the "View notebook for best model" button in the AutoML Experiment page. D. They can click on the "Share" button in the AutoML Experiment page. E. There's no way to access the code that produced the best model.
70. **A team of machine learning engineers receives three notebooks (Notebook A, Notebook B, and Notebook C) from a data scientist to set up a machine learning pipeline. Notebook A is employed for exploratory data analysis, while Notebooks B and C are used for feature engineering. For the successful execution of Notebooks B and C, Notebook A must be completed first. However, Notebooks B and C operate independently of each other. Given this setup, what would be the most efficient and reliable method for the engineering team to orchestrate this pipeline utilizing Databricks?** A. The team could configure a three-task job where each task runs a specific notebook, with each task depending on the completion of the previous one. B. They could establish a three-task job where each task operates a distinct notebook, and all three tasks are executed in parallel. C. They could create three single-task jobs, with each job running a unique notebook, all scheduled to run concurrently. D. The team could arrange a three-task job where each task operates a specific notebook. The last two tasks are set to run simultaneously, each relying on the completion of the first task. E. The team could set up a single-task job where an orchestration notebook sequentially executes all three notebooks.
71. **A data scientist has designed a three-class decision tree classifier utilizing Spark ML and computed the predictions in a Spark DataFrame, named preds_dt, with the following schema: prediction DOUBLE, actual DOUBLE. Which code block correctly creates an evaluator for computing the model's accuracy?** A. None B. accuracy = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy") C. accuracy = RegressionEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy") D. classification_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy") E. accuracy = Summarizer(predictionCol="prediction", labelCol="actual", metricName="accuracy")
72. **A data scientist employs a code segment to refine hyperparameters for a machine learning model:** A. Substitute tpe.suggest with random.suggest B. Boost num_evals to 50 C. Omit the algo=tpe.suggest argument D. Replace fmin() with fmax() E. Switch SparkTrials() to Trials()
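Questions 56, 59, and 72 revolve around Hyperopt's fmin/SparkTrials workflow. A minimal, self-contained sketch of that workflow; the toy objective and search space below are placeholders rather than the exam's code, and SparkTrials assumes the code runs where an active Spark session is available:

```python
from hyperopt import fmin, tpe, hp, SparkTrials

# Toy objective: Hyperopt minimizes the returned value, so a real objective
# would train a model and return a loss such as RMSE or negative accuracy.
def objective(params):
    return (params["x"] - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

# SparkTrials distributes trials across the cluster's workers; the plain
# Trials() class would instead run them sequentially on the driver.
best = fmin(fn=objective,
            space=search_space,
            algo=tpe.suggest,       # Tree of Parzen Estimators (Bayesian)
            max_evals=20,
            trials=SparkTrials(parallelism=4))
print(best)
```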
73. **Which chart would you use to visually represent and spot outliers?**
74. **What is the potential reason for the reduced performance speed when using the pandas API compared to native Spark DataFrames, especially for large datasets?** A. The employment of an InternalFrame to maintain metadata B. The requirement for an increased amount of code C. The dependence on CSV files D. The immediate evaluation of all processing operations E. The absence of data distribution
75. **How do you apply a grouped map Pandas UDF to a PySpark DataFrame?** A. By using the apply method on a DataFrame column B. By using the applyInPandas method on a DataFrame C. By using the groupBy method followed by the apply method on a DataFrame D. By using the groupBy method followed by the agg method on a DataFrame E. None of the above
76. **In the context of distributed decision trees, what is the primary advantage of using an ensemble method like random forests over a single decision tree?** A. Ensemble methods are more interpretable than single decision trees B. Ensemble methods are less prone to overfitting than single decision trees C. Ensemble methods can be trained faster than single decision trees D. Ensemble methods require less memory than single decision trees E. None of the above
77. **A machine learning engineer uses the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:** A. The model will be restricted to a single executor, preventing the data from being distributed. B. The data will be restricted to a single executor, preventing the model from being loaded multiple times. C. The data will be distributed across multiple executors during the inference process. D. There's no benefit to including an Iterator as the input or output. E. The model only needs to be loaded once per executor rather than once per batch during the inference process.
78. **In Databricks AutoML, what information can you find on the best model?** A. Hyperparameters of the best model B. Performance metrics of the best model C. Code to reproduce the best model D. All of the above E. None of the above
79. **A machine learning engineer is translating a decision tree from sklearn to Spark ML. During the training process, an error occurs stating that the maxBins parameter should be at least equal to the number of values in each categorical feature. What is the reason behind Spark ML requiring the maxBins parameter to be at least as large as the number of values in each categorical feature?** A. Spark ML requires more split candidates in the splitting algorithm than single-node implementations B. Spark ML requires at least one bin for each category in each categorical feature C. Spark ML tests only categorical features in the splitting algorithm D. Spark ML tests only numeric features in the splitting algorithm E. Spark ML imposes a limit of 32 split candidates for categorical features in the splitting algorithm
80. **In which of the following cases is mean imputation most appropriate for handling missing values?** A. When the data is missing at random B. When the data is missing not at random C. When the data is missing completely at random D. When the data is missing systematically E. None of the above
81. **A data scientist has developed three new models for a singular machine learning problem, replacing a solution that previously used a single model. All four models have roughly identical prediction latency.
However, a machine learning engineer suggests that the new solution will be less time efficient during inference. Under what circumstances would the engineer\'s observation be correct?** A. When the average size of the new solution\'s models exceeds the size of the original model B. When the new solution necessitates each model to compute a prediction for every record C. When the new solution\'s models have an average latency that is larger than the latency of the original model D. When the new solution involves if-else logic determining which model to employ for each prediction E. When the new solution requires fewer feature variables than the original model 82. **Which of the following is NOT an advantage of using the Tree-structured Parzen Estimator (TPE) search algorithm say with Hyperopt over conventional grid search?** A. TPE search is less computationally expensive for large search spaces B. TPE search can handle continuous and conditional hyperparameters more easily C. TPE search can automatically select the best machine learning algorithm for a given problem D. TPE search can converge to optimal hyperparameter combinations more quickly E. None of the above 83. **Which of the following is a primary benefit of having Apache Arrow inside Pandas API on Spark?** A. Arrow allows for efficient data transfer between JVM and Python processes. B. Arrow automatically optimizes Spark SQL queries. C. Arrow enables the use of non-columnar data formats. D. Arrow performs faster joins between DataFrames. E. None of the above. 84. **Which of the following techniques is NOT suitable for imputing missing values in a continuous numerical variable?** A. Mean imputation B. Median imputation C. Mode imputation D. K-Nearest Neighbors imputation E. None of the above 85. **What are the standard evaluation metrics automatically computed for each run in an AutoML experiment when dealing with classification problems?** A. All of these B. Accuracy C. Area Under the ROC Curve (AUC-ROC) D. Recall E. F1 Score 86. **In Databricks, what information can you find on the run detail page?** A. The input parameters used for the run B. The performance metrics recorded during the run C. The model artifacts generated by the run D. All of the above E. None of the above 87. **A data scientist is using one-hot encoding to convert categorical feature values in a training set for a random forest regression model. However, a coworker suggests that one-hot encoding should not be used for tree-based models. Can you explain why one-hot encoding should be avoided when creating a random forest model?** A. It minimizes the significance of one-hot encoded feature variables in the feature sampling process. B. It leads to a less dense training set, making scalability a challenge. C. It generates a denser training set, which may pose scalability issues. D. It is computationally demanding and may result in inefficient splitting algorithms due to the need to try many split values. E. It accentuates one-hot encoded feature variables in the feature sampling process and may result in less informative trees. 88. **Which of the following issues can arise when using one-hot encoding (OHE) with tree-based models?** A. Inducing sparsity in the dataset B. None of the options C. Limiting the number of split options for categorical variables D. Both 89. **What is data parallelism in the context of distributed machine learning?** A. Training multiple models simultaneously on different subsets of data B. 
Training a single model on multiple subsets of data simultaneously C. Training multiple models sequentially on different subsets of data D. Training a single model on the entire dataset sequentially E. None of the above 90. **In Databricks, which of the following components is used to transform a column of scalar values into a column of vector type, as required by an estimator\'s.fit() method?** A. VectorScaler B. VectorConverter C. VectorAssembler D. VectorTransformer E. None of the above 91. **Utilizing MLflow Autologging, a data scientist is automatically monitoring their machine learning experiments. Once a series of experiment runs for the experiment\_id are completed, the scientist intends to pinpoint the run exhibiting the best root-mean-square error (RMSE). To do so, they have initiated the following incomplete code snippet:** A. Client B. search\_runs C. experiment D. identify\_run E. show\_runs 92. **Which of the following is NOT a hyperparameter in a machine learning algorithm?** A. Learning rate B. Regularization parameter C. Number of trees in a random forest D. Coefficients of a linear regression model E. Number of hidden layers in a neural network 93. **Which of the listed methods for hyperparameter optimization employs the tactic of making educated selections of hyperparameter values, based on the results of previous trials, for each successive model evaluation?** A. Grid Search Optimization B. Random Search Optimization C. Halving Random Search Optimization D. Manual Search Optimization E. Tree of Parzen Estimators Optimization 94. **A machine learning engineer is aiming to execute batch model prediction. The engineer intends to leverage a decision tree model stored at the path model\_uri to generate predictions for the DataFrame batch\_df, which has the schema:** A. This code block will not achieve the desired prediction in any situation. B. When the model at model\_uri uses only order\_id as a feature. C. When the features required by the model at model\_uri are available in the Feature Store and can be automatically joined with batch\_df D. When the Feature Store automatically detects and creates missing features necessary for the model at model\_uri during the scoring process E. When all of the features utilized by the model at model\_uri are present in a Spark DataFrame in the PySpark session. 95. **A data scientist employs MLflow for tracking their machine learning experiment. As part of each MLflow run, they conduct hyperparameter tuning. The scientist wishes to organize one parent run for the tuning procedure and have a child run for each unique combination of hyperparameter values. They manually initiate all parent and child runs using \'mlflow.start\_run()\'.** A. They could initiate each child run with the identical experiment ID as the parent run. B. They could specify \'nested=True\' when initiating the child run for each unique combination of hyperparameter values. C. They could begin each child run inside the indented code block of the parent run using \'mlflow.start\_run()\'. D. They could enable Databricks Autologging. E. They could specify \'nested=True\' when initiating the parent run for the tuning process. 96. **Which of the following is NOT a valid stage in an Apache Spark MLlib Pipeline?** A. An Estimator B. A transformer C. A DataFrame D. Another Pipeline 97. **Which of the following statements is true regarding the effect of one-hot encoding on tree-based models?** A. 
It increases the number of levels for categorical variables, improving model performance B. It improves the performance of tree-based models by increasing split options C. It induces sparsity in the dataset, which can be undesirable for tree-based models D. It decreases the computational complexity of tree-based models E. None of the above 98. **A data analyst wants to generate a comprehensive report on a DataFrame, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numerical column. Which Databricks command should they use?** A. Databricks Describe B. Databricks Summary C. Both Databricks Describe and Databricks Summary would work D. Neither Databricks Describe nor Databricks Summary E. Databricks Aggregation **100. Which of the following best describes the role of an Estimator in Apache Spark MLlib?** A. A set of transformation rules applied to the input data B. An algorithm that can be fit on a dataset to produce a model C. A trained model that can be used to make predictions D. A sequence of data preprocessing steps and machine learning algorithms E. None of the above **101.Which of the following is the key benefit of using Databricks clusters for machine learning tasks?** A. Improved data storage capabilities B. Simplified data visualization C. Faster model training and evaluation D. Enhanced data cleaning E. Better collaboration among team members A. Grid search B. Random search C. Random descent D. None of the above **103.What is model parallelism in the context of distributed machine learning?** A. Training multiple models simultaneously on different subsets of data B. Training a single model on multiple subsets of data simultaneously C. Training multiple models sequentially on different subsets of data D. Dividing the model into smaller parts and processing each part on a separate node or processor E. None of the above A. Accuracy B. F1 score C. Root Mean Squared Error (RMSE) D. Area Under the Receiver Operating Characteristic Curve (AUC-ROC) E. Precision A. To ensure that each node processes an equal amount of data B. To optimize the distribution of data to minimize communication overhead between nodes C. To prevent overfitting by training the model on different subsets of data D. To improve model interpretability by training separate models on different subsets of data E. None of the above A. Mean B. Median C. Mode D. Standard deviation E. R-squared A. Kernel function B. Support vectors C. Decision boundary D. Margin E. None of the above A. The number of partitions in the input DataFrame B. The size of each partition in the input DataFrame C. The complexity of the user-defined function D. The amount of available memory on the cluster E. All of the above A. To reduce the risk of overfitting by training the model on different subsets of data B. To improve model interpretability by training separate models on different subsets of data C. To speed up training by dividing the dataset into smaller parts that can be processed simultaneously D. To enhance model performance with small datasets E. None of the above **110.Which of the following is NOT a feature of Databricks Jobs?** A. Running notebooks B. Scheduling jobs C. Monitoring job performance D. Storing data E. Running Spark jobs A. Decision trees are easy to interpret and visualize B. Decision trees can handle both continuous and categorical features C. Decision trees can automatically handle missing values D. 
Decision trees can be efficiently parallelized at the node level E. None of the above A. When the training time needs to be minimized. B. When the dataset is highly imbalanced, and the models are prone to overfitting. C. When the models have a high variance and low bias. D. When the model\'s performance is already satisfactory and the estimation time is a concern. E. When the model\'s complexity is already very high and increasing it would lead to overfitting. A. Increased risk of overfitting B. Reduced model interpretability C. Communication overhead between nodes may outweigh the benefits of parallel processing D. Inability to train complex models E. None of the above **114.In Databricks, which resource is responsible for executing Spark applications?** A. Notebooks B. MLflow C. Clusters D. Jobs E. Repos **115. What is the primary advantage of using bagging over boosting?** A. None of the above B. Bagging is less sensitive to noisy data and outliers C. Bagging is computationally less expensive D. Bagging can be used for both classification and regression tasks **116.What is the primary purpose of the transform method in Apache Spark MLlib?** A. To apply a set of transformation rules to input data B. To train an Estimator on a dataset and produce a Transformer C. To make predictions using a trained Transformer D. To define the sequence of stages in a Pipeline E. All of the above A. Use the ps.DataFrame(spark\_df) method B. Use the spark\_df.to\_pandas\_on\_spark() method C. None D. Both A. Automatic adjustment of cluster size based on workload B. Simplified data visualization C. Faster data storage D. Enhanced collaboration E. Streamlined task scheduling **119.What is a common challenge faced by AutoML systems?** A. Limited support for custom models B. Inability to handle large datasets C. Poor performance on classification tasks D. Difficulty with time-series forecasting E. None of the options **120.What is the primary purpose of feature engineering in machine learning?** A. To optimize model performance B. To preprocess the data and create features that improve model performance C. To select the best machine learning algorithm D. To automate the machine learning process E. To deploy machine learning models A. Mean imputation B. Median imputation C. Mode imputation D. Standard deviation imputation E. None of the above A. Accuracy B. Precision C. Recall D. F1 score E. None of the above A. Classification B. Regression C. Forecasting D. Data transformation E. None of the above **124.Which of the following is NOT a typical step in exploratory data analysis?** A. Calculating summary statistics B. Removing outliers C. Imputing missing values D. Tuning hyperparameters E. Visualizing data **125.What is the purpose of the fmin function in Hyperopt?** A. To define the search space for hyperparameter optimization B. To minimize the objective function by searching for optimal hyperparameter combinations C. To calculate the fitness of a particular hyperparameter combination D. To initialize the search algorithm with a set of candidate hyperparameter combinations E. None of the above **126.What is the primary purpose of the Hyperopt library?** A. To train and evaluate machine learning models B. To distribute data processing tasks across multiple nodes in a cluster C. To optimize hyperparameters for machine learning models D. To implement advanced machine learning algorithms E. None of the above **127.In Apache Spark MLlib, what is the purpose of a Pipeline?** A. 
To store a set of transformation rules that can be applied to input data B. To define an algorithm that can be fit on a dataset to produce a model C. To represent a trained model that can be used to make predictions D. To specify a sequence of data preprocessing steps and machine learning algorithms E. None of the above
**128. If you perform a 3-fold Cross-Validation with 6 different hyperparameter combinations, how many total model runs will be executed?** A. 18 B. 30 C. 50 D. 90 E. None of the above
**129. How does a feature store help speed up the machine learning process?** A. By automating data preprocessing B. By providing a centralized repository for preprocessed features C. By offering faster data storage solutions D. By simplifying data visualization E. By streamlining task scheduling
**130. What is the main goal of exploratory data analysis?** A. To select the best machine learning algorithm B. To understand the data and inform further data preprocessing and modeling decisions C. To optimize model performance D. To automate the machine learning process E. To deploy machine learning models
**131. Which of the following is a common hyperparameter tuning technique?** A. Grid search B. Feature selection C. One-hot encoding D. Principal component analysis E. None of the above
**132. How can you create a new Databricks cluster?** A. By using the Databricks API B. By using the Databricks CLI C. By using the Databricks workspace UI D. All of the above E. None of the above
A. TensorFlow B. PyTorch C. XGBoost D. LightGBM E. Hadoop
A. Histogram B. Kernel density estimation (KDE) plot C. Box plot D. Scatter plot E. None of the above
A. Partitioning B. Z-Ordering C. Data Skipping D. Partition Pruning E. Database indexing
**136. What are Type 1 errors?** A. FP B. FN C. TP D. TN E. None
**137. How does MLflow Tracking improve model development?** A. By automating data preprocessing B. By providing version control for features C. By recording and organizing experiments, parameters, and results during the model development process D. By simplifying data visualization E. By streamlining task scheduling
**138. Which of the following is a primary benefit of distributed machine learning?** A. Improved model interpretability B. Reduced risk of overfitting C. Faster training and prediction times for large datasets D. Enhanced model performance with small datasets E. None of the above
A. Estimators represent algorithms that can be fit on datasets, while Transformers represent trained models that can make predictions. B. Estimators represent trained models that can make predictions, while Transformers represent algorithms that can be fit on datasets. C. Estimators represent sequences of data preprocessing steps, while Transformers represent machine learning algorithms. D. Estimators represent data preprocessing steps, while Transformers represent sequences of data preprocessing steps and machine learning algorithms. E. None of the above
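Several of the Spark ML questions in this section (116, 127, and the Estimator/Transformer options directly above) hinge on the fit/transform distinction. A short sketch, assuming a Spark DataFrame `df` with a string column named `color`; the column names are illustrative assumptions:

```python
from pyspark.ml.feature import StringIndexer

# StringIndexer is an Estimator: fit() learns the category-to-index mapping
# from the data and returns a fitted StringIndexerModel.
indexer = StringIndexer(inputCol="color", outputCol="color_index")
indexer_model = indexer.fit(df)            # Estimator -> fitted Model

# The fitted model is a Transformer: transform() maps one DataFrame to another.
indexed_df = indexer_model.transform(df)   # Transformer -> new DataFrame
```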
**140. As a machine learning engineer ventures into parallelizing inference across distinct group-specific models via the Pandas Function API, they've architected the 'apply_model' function. This function is adept at summoning the pertinent model for each group. As they operate on DataFrame 'df', they draft this partial code:**

predictions = (df.groupby("device_id").___________(apply_model, schema=apply_return_schema))

**What piece of code impeccably bridges the gap, enabling them to attain their objective? Choose only ONE best answer.**
A. mapInPandas
B. predict
C. groupedApplyInPandas
D. train_model
E. applyInPandas

**141. For Bayesian hyperparameter tuning of distributed Spark ML machine learning models, which tool is best suited?**
A. Hyperopt
B. Autoscaling clusters
C. Feature Store
D. MLflow Experiment Tracking
E. AutoML

**142. Which of the listed methods for hyperparameter optimization employs the tactic of making educated selections of hyperparameter values, based on the results of previous trials, for each successive model evaluation?**
A. Grid Search Optimization
B. Random Search Optimization
C. Halving Random Search Optimization
D. Manual Search Optimization
E. Tree of Parzen Estimators Optimization

**143. In the context of upgrading a machine learning project, if a machine learning engineer aims to ensure that the model is refreshed automatically every time the project executes, and they integrate the following code snippet (with a pre-existing model_name in the MLflow Model Registry), what does the parameter 'registered_model_name=model_name' signify?**

mlflow.sklearn.log_model(sk_model=model, artifact_path="model", registered_model_name=model_name)

A. It negates the need for a separate call to mlflow.register_model by implicitly registering the model
B. It creates a completely new model titled 'model_name' in the MLflow Model Registry
C. It tags the logged model with the specified name in the MLflow Experiment
D. It logs and registers a fresh version of the already existing 'model_name' model in the MLflow Model Registry
E. It labels the specific Run within the MLflow Experiment

**144. A data scientist has engineered a linear regression model with log(price) as the target. They make predictions using this model and store the results alongside the actual labels in a Spark DataFrame named 'preds_df'. They then evaluate the model with the following code: regression_evaluator.setMetricName("rmse").evaluate(preds_df). How should they modify the RMSE evaluation so it aligns with the original price scale?**
A. Apply a logarithm to the predictions before RMSE calculation
B. Determine the MSE for log-transformed predictions to derive the RMSE
C. Exponentiate the predictions before computing the RMSE
D. Calculate the exponent of the obtained RMSE value
E. Compute the logarithm of the resulting RMSE value

**145. If a data scientist is leveraging MLflow Autologging to automatically track their machine learning experiments, how can they use a code snippet to identify the best run, specifically based on the root-mean-square error (RMSE) metric, after a series of experiment runs are completed for a certain experiment_id?**

mlflow._________(experiment_id, order_by=["metrics.rmse"])["run_id"][0]

A. Use 'client' in the blank space
B. Use 'search_runs' in the blank space
C. Use 'experiment' in the blank space
D. Use 'identify_run' in the blank space
E. Use 'show_runs' in the blank space
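Questions 143 and 145 both center on MLflow calls. The sketch below shows how the two pieces are commonly written together; model, model_name, and experiment_id are assumed to already exist, and the order_by/indexing pattern mirrors the snippet in question 145.

```python
# Sketch: log-and-register a model, then find the run with the lowest RMSE.
import mlflow

with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,                    # a trained scikit-learn model (assumed)
        artifact_path="model",
        registered_model_name=model_name,  # logs AND registers a new version of model_name
    )

# After several runs, search the experiment and take the run with the lowest RMSE.
best_run_id = mlflow.search_runs(
    [experiment_id],
    order_by=["metrics.rmse"],
)["run_id"][0]
```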
**146. What is the reason behind the compatibility of pandas API syntax within a Pandas UDF function when applied to a Spark DataFrame?**
A. Internally, the Pandas UDF beckons Pandas Function APIs
B. Within its function, the Pandas UDF deploys pandas API on Spark
C. The pandas API syntax cannot be implemented within a Pandas UDF function on a Spark DataFrame
D. The Pandas UDF automatically translates the function into Spark DataFrame syntax
E. The Pandas UDF leverages Apache Arrow to convert data between Spark and pandas formats

**147. How can one best define the concept of boosting in the realm of machine learning models?**
A. Boosting is the ensemble process of training machine learning models sequentially with each model learning from the errors of the preceding models.
B. Boosting embodies the ensemble technique of individually training models for every entry in bootstrapped samples, culminating their predictions for a definitive estimate.
C. Boosting involves training models in succession, with every model honing in on a unique data fragment.
D. Boosting typifies the sequential training of models, with subsequent models drawing from an ever-expanding dataset subset.
E. Boosting captures the method of training models on each entry in bootstrapped samples, subsequently employing model predictions as a novel feature for training another model.

**148. A data scientist is currently tinkering with a Spark DataFrame, dubbed 'spark_df'. Their objective is to create a new Spark DataFrame that retains only those rows from 'spark_df' where the 'discount' column's value is less than 0. Which code snippet would successfully fulfill this aim?**
A. spark_df.filter(col("discount") < 0)
B. spark_df.find(spark_df("discount") < 0)
C. spark_df.loc[col("discount") < 0]
D. spark_df.loc[spark_df("discount") < 0, :]
E. SELECT * FROM spark_df WHERE discount < 0

**149. In the course of their academic journey, a data scientist was often advised to consistently apply 5-fold cross-validation during model development. However, a team member hints at scenarios where a single training-validation split might be preferable over 5-fold cross-validation. What represents a conceivable benefit of a simple training-validation split over a 5-fold cross-validation?**
A. The training-validation split requires building fewer models and using fewer resources.
B. Bias can be completely eradicated with a training-validation split.
C. The training-validation split enhances the model's reproducibility.
D. A training-validation split necessitates evaluating fewer hyperparameters.
E. The need for a distinct holdout set is negated when using a training-validation split.
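For question 148, a short sketch of the idiomatic PySpark filter, assuming spark_df exists and has a numeric 'discount' column.

```python
# Keep only rows whose discount is negative.
from pyspark.sql.functions import col

negative_discount_df = spark_df.filter(col("discount") < 0)
# Equivalent forms: spark_df.filter(spark_df["discount"] < 0) or spark_df.where("discount < 0")
```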
**150. If a data scientist is leveraging MLflow to track machine learning experiments, and within each MLflow run they undertake hyperparameter tuning, what approach should they employ to structure one overarching parent run for the tuning procedure and spawn a child run for every distinct combination of hyperparameter values, given that they manually initiate all runs using 'mlflow.start_run()'?**
A. Start every child run with the same experiment ID as the overarching parent run.
B. Use the 'nested=True' parameter when initiating the child run for each distinct set of hyperparameter values.
C. Trigger every child run nested within the indented code block of the parent run through 'mlflow.start_run()'.
D. Activate Databricks Autologging.
E. Set 'nested=True' when initiating the primary parent run for the tuning activity.

**151. For a newcomer data scientist who has recently onboarded to an ongoing machine learning project, which runs as a scheduled daily retraining job within a Databricks Repository, what's the optimal strategy to modify and enhance the feature engineering in the preprocessing phase of the pipeline, ensuring that there's no interruption to the project's daily operations?**
A. Create a duplicate of the project's notebooks in a separate Databricks Repository and make the desired modifications there.
B. Temporarily stop the scheduled daily operations of the project and directly modify the original code.
C. Establish a fresh branch in Databricks, commit the modifications on this branch, and then synchronize these changes with the connected Git provider.
D. Copy the project's notebooks to another Databricks Workspace folder and perform the desired modifications there.
E. Initiate a distinct Git repository, link it to Databricks, and transfer the existing project code from the current repository to this new one prior to implementing the changes.

**152. A machine learning engineer attempts to scale an ML pipeline by distributing its single-node model tuning procedure. After broadcasting the entire training data onto each core, each core in the cluster is capable of training one model at a time. As the tuning process is still sluggish, the engineer plans to increase the parallelism from 4 to 8 cores to expedite the process. Unfortunately, the total memory in the cluster can't be increased. Under which conditions would elevating the parallelism from 4 to 8 cores accelerate the tuning process?**
A. When the dataset exhibits elongated dimensions.
B. When the data has a broad shape.
C. When the model resists parallel operations.
D. When the tuning process is randomized.
E. When every core is capable of accommodating the entire dataset.

**153. Given three notebooks (Notebook A for exploratory data analysis, Notebooks B and C for feature engineering) provided by a data scientist for creating a machine learning pipeline, if Notebook A's completion is essential before Notebooks B and C, which can operate concurrently, what's the most efficient strategy for an engineering team to orchestrate this setup using Databricks?**
A. Design a job with three tasks, each running a distinct notebook, with a sequential execution strategy
B. Establish a job comprising three tasks, each designated for a specific notebook, with simultaneous execution
C. Configure three separate single-task jobs, each for a distinct notebook, all set to run at the same time
D. Construct a job with three tasks, each tailored for a specific notebook. Following Notebook A's completion, allow the final two tasks to run concurrently
E. Initiate a single-task job using an orchestration notebook, which executes the three notebooks in sequence

**154. How can an individual access the notebook responsible for an MLflow run's execution?**
A. Access the model.pkl artifact within the MLflow run page.
B. Click on the "Models" link relevant to the run within the MLflow experiment page.
C. Open the MLmodel artifact on the MLflow run page.
D. Select the "Start Time" link corresponding to the run on the MLflow experiment page.
E. Click on the "Source" link aligned with the relevant run on the MLflow experiment page.
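Question 150 asks how to get one parent run with a child run per hyperparameter combination. A minimal sketch, with an illustrative parameter grid and no actual training:

```python
# Sketch: parent run wrapping one nested child run per hyperparameter combination.
import mlflow

param_grid = [{"max_depth": d} for d in (3, 5, 7)]   # illustrative grid

with mlflow.start_run(run_name="tuning_parent"):                   # parent run
    for params in param_grid:
        with mlflow.start_run(run_name=str(params), nested=True):  # child per combination
            mlflow.log_params(params)
            # train/evaluate here, then e.g. mlflow.log_metric("rmse", rmse)
```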
**155. In a meeting, a team of data analysts is debating which metric to employ when evaluating classification models. Their primary concern is deciding when to prioritize the F1 score over mere accuracy. The formula for the F1 score is:**

*F1 = 2 × (precision × recall) / (precision + recall)*

**Which guidance should the team adhere to?**
A. The F1 score is predominantly favored over accuracy when the target variable embodies more than two distinct classes.
B. When there's an even distribution of positive and negative instances, the F1 score should be the metric of choice over accuracy.
C. The F1 score should be the go-to metric over accuracy when there's a pronounced imbalance between positive and negative class instances, and reducing false negatives is of utmost importance.
D. When there are precisely two classes within the target variable, the F1 score takes precedence over accuracy.
E. The F1 score is typically more insightful than accuracy when accurately identifying both true positives and true negatives is equally vital for addressing the business problem.

**156. When addressing a linear regression problem involving an immense dataset, how does Spark ML approach the solution?**
A. Utilizing a brute-force algorithm
B. Employing matrix decomposition techniques
C. Adopting the singular value decomposition method
D. Applying the least squares method
E. Incorporating the gradient descent technique

**157. A data scientist intends to one-hot encode the categorical attributes housed in their PySpark DataFrame, christened 'features_df'. Their strategy involves harnessing Spark ML, and the variable 'input_columns' holds the string column names. The drafted code block for the task is throwing errors. What adjustments should the scientist make to get their code up and running?**

oneHotEnc = OneHotEncoder(
    inputCols = input_columns,
    outputCols = output_columns
)
oneHotEnc_model = oneHotEnc.fit(features_df)
oneHotEnc_features_df = oneHotEnc_model.transform(features_df)

A. The output columns should be identical to the column names listed in 'input_columns'
B. The 'method' parameter must be explicitly stated in the OneHotEncoder
C. Prior to the one-hot encoding, the features should pass through a VectorAssembler
D. The StringIndexer should be applied before the one-hot encoding
E. The line containing the 'fit' function needs to be removed

**158. When preparing a random forest regression model, a data scientist opts for one-hot encoding to transform categorical feature values in a dataset. A colleague, however, argues against the use of one-hot encoding with tree-based algorithms. What's the reasoning behind avoiding one-hot encoding for random forest models?**
A. The feature sampling process de-emphasizes one-hot encoded feature variables.
B. Scalability challenges arise due to a less dense training dataset.
C. The dataset becomes too dense, complicating scalability.
D. It's computationally expensive, leading to inefficient splitting algorithms due to numerous potential split values.
E. One-hot encoded features dominate the feature sampling process, resulting in less informative trees.
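Question 157's usual resolution is to index string columns before one-hot encoding them. A sketch under that assumption follows; features_df comes from the question, while the column lists and suffixes are illustrative.

```python
# Sketch: StringIndexer first, then OneHotEncoder, chained in a Pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

input_columns = ["color", "size"]                    # illustrative string columns
index_columns = [c + "_idx" for c in input_columns]
output_columns = [c + "_ohe" for c in input_columns]

indexer = StringIndexer(inputCols=input_columns, outputCols=index_columns)
encoder = OneHotEncoder(inputCols=index_columns, outputCols=output_columns)

ohe_pipeline_model = Pipeline(stages=[indexer, encoder]).fit(features_df)
encoded_features_df = ohe_pipeline_model.transform(features_df)
```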
**159. In a hyperparameter tuning process, a data scientist is utilizing an iterative optimization algorithm. With 8 different compute nodes, they're executing 8 evaluations of distinct hyperparameter values. While the accuracy varies across the evaluations, there's no discernible pattern of improvement. How might the scientist modify their approach to enhance the accuracy progression during the tuning?**
A. Halve, or further reduce, the number of compute nodes relative to the number of evaluations.
B. Substitute the iterative optimization algorithm assisting the tuning.
C. Double, or more, the number of compute nodes compared to the number of evaluations.
D. Drastically decrease both the compute nodes and evaluations.
E. Significantly increase both the compute nodes and evaluations.

**160. As a data scientist's feature engineering notebook, which uses the pandas library, handles more data, the processing time increases proportionally. Which tool should they employ to scale efficiently with minimal refactoring for big data?**
A. Utilizing a Feature Store
B. Switching to the PySpark DataFrame API
C. Implementing Spark SQL techniques
D. Using Scala's Dataset API
E. Adopting the pandas API on Spark

**161. Why does utilizing the pandas API on Spark sometimes lead to decreased speed, especially when handling large datasets, compared to using native Spark DataFrames?**
A. The use of an InternalFrame to preserve metadata hinders performance
B. The overhead due to a greater amount of code is responsible for the slowdown
C. Reliance on CSV files is the primary reason for performance issues
D. Every processing operation is instantly computed, causing inefficiencies
E. The lack of a distributed data structure negatively impacts speed

**162. To ensure the 'fasttextnew' Python library is universally accessible across notebooks operating on a Databricks cluster, which method should you adopt?**
Note: Let's consider 'fasttextnew' as a hypothetical LLM package introduced just two weeks ago.
A. Transition the cluster to use the Databricks Runtime dedicated to Machine Learning.
B. Unfortunately, you can't make 'fasttextnew' universally accessible on a Databricks cluster.
C. Modify the 'runtime-version' variable within the Spark session to "1".
D. Run 'pip install fasttextnew' in any single notebook attached to the cluster.
E. Append the command '/databricks/python/bin/pip install fasttextnew' to the initialization script of the cluster.
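Questions 160 and 161 contrast plain pandas with the pandas API on Spark. The sketch below shows the "minimal refactor" idea: the import changes while the syntax largely does not. The file path and column name are illustrative.

```python
# pandas: single-node execution.
import pandas as pd
pdf = pd.read_parquet("/dbfs/tmp/features.parquet")
pdf["amount"].describe()

# pandas API on Spark: same calls, distributed execution on Spark.
import pyspark.pandas as ps
psdf = ps.read_parquet("/dbfs/tmp/features.parquet")
psdf["amount"].describe()
```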
**163. Amidst the task of transitioning a decision tree from sklearn to Spark ML, a machine learning engineer encounters an error. The message highlights that the maxBins parameter should match or surpass the number of values in every categorical feature. What is the reason behind Spark ML requiring the maxBins parameter to be at least as large as the number of values in each categorical feature?**
A. Spark ML's splitting algorithm demands a more expansive pool of split candidates compared to single-node frameworks
B. Spark ML mandates a separate bin for each distinct category within a categorical feature
C. Spark ML's splitting algorithm exclusively considers categorical features
D. Spark ML's splitting algorithm exclusively focuses on numeric features
E. For categorical features, Spark ML caps the splitting algorithm at 32 potential split candidates

**164. What does a Spark ML Transformer encapsulate?**
A. A conduit to infuse a hyperparameter grid for model training
B. An apparatus that fuses several algorithms to metamorphose an ML pipeline
C. A learning algorithm that commandeers a DataFrame for model training
D. An algorithm endowed with the prowess to transmute one DataFrame into another DataFrame
E. A toolkit wielded to gauge a model's proficiency and efficacy

**165. After a data scientist designs three novel models for a specific machine learning problem (replacing a prior single-model solution), all models demonstrate comparable prediction latency. Yet, a machine learning engineer warns that the new ensemble might introduce inefficiencies during inference. Under what scenario would this engineer's concern prove valid?**
A. When the average size of the new solution's models exceeds the size of the original model
B. When the new solution necessitates each model to compute a prediction for every record
C. When the new solution's models have an average latency that is larger than the latency of the original model
D. When the new solution involves if-else logic determining which model to employ for each prediction
E. When the new solution requires fewer feature variables than the original model

**166. Amidst their journey to transition a pandas DataFrame script to the pandas API on Spark, a data scientist grapples with the following fragmentary code:**

_________________
df = ps.read_parquet(path)
df["category"].value_counts()

**Which import statement would complete their refactoring successfully and align with the pandas API on Spark?**
A. import pandas as ps
B. import databricks.pandas as ps
C. import pyspark.pandas as ps
D. import pandas.spark as ps
E. import databricks.pyspark as ps

**167. When a data scientist is employing Spark SQL to facilitate data import into a machine learning workflow and subsequently accumulates all the data in a pandas DataFrame to execute tasks using scikit-learn, which Databricks cluster mode would be the most suitable for their specific use case?**
A. SQL Endpoint
B. Standard
C. High Concurrency
D. Pooled
E. Single Node

**168. An ML engineer is in the process of assessing a new Staging version of a model within the MLflow Model Registry. After successful tests, the engineer intends to transition the model to the Production stage. Which section within Databricks Machine Learning should be accessed to facilitate this transition?**
A. Experiments section: Run page
B. MLflow Model Registry: Model page
C. Comment feature on the notebook page where the model originated
D. MLflow Model Registry: Model Version page
E. Experiments section: Experiment page
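Question 168 concerns the UI path, but the same Staging-to-Production transition can also be scripted. A sketch using the MLflow client API with the workspace Model Registry stages; model_name and the version number are assumed.

```python
# Sketch: promote a registered model version from Staging to Production.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,     # registered model name (assumed to exist)
    version=3,           # illustrative version number
    stage="Production",
)
```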
**169. A data scientist is keen on refining the hyperparameters of a scikit-learn model, but they're aiming for simultaneous optimization. They zero in on the Hyperopt library for assistance. Which feature within Hyperopt enables parallelized hyperparameter optimization?**
A. fmin
B. Search Space
C. hp.quniform
D. SparkTrials
E. Trials

**170. During a classification problem, a data scientist adopts 3-fold cross-validation alongside a specific hyperparameter grid for grid search optimization. The grid is structured as:**
- Hyperparameter 1: [4, 6, 7]
- Hyperparameter 2: [5, 10]
**Considering the above, how many machine learning models could potentially be trained concurrently during this procedure?**
A. 2
B. 6
C. 12
D. 18
E. 24

**171. A machine learning engineer uses the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:**

@pandas_udf("double")
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model_path = os.path.join("/dbfs/runs/", mlflow.active_run().info.run_id, "model")
    model = mlflow.sklearn.load_model(model_path)
    for features in iterator:
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))

**Assuming the default Spark configuration is in place, what is the advantage of using an Iterator?**
A. It constrains the model to one executor, eliminating the possibility of distributing the data.
B. It binds the data to one executor, circumventing repetitive model loading.
C. The data is apportioned among multiple executors during the model inference.
D. The inclusion of an Iterator neither adds nor detracts value in the input or output dynamics.
E. The model needs to be loaded only once per executor rather than once per batch during inference.

**172. A data scientist employs AutoML on Databricks Machine Learning to establish a machine learning pipeline. After determining the superior model in the experiment, the scientist is eager to view the source code that led to the best run. What is the appropriate action to achieve this?**
A. Click on the "Model" link associated with the respective row on the AutoML Experiment page.
B. Click on the "Start Time" link aligned with the relevant row on the AutoML Experiment page.
C. Choose the "View notebook for best model" option on the AutoML Experiment page.
D. Press the "Share" option on the AutoML Experiment page.
E. Accessing the source code for the best model is not possible.

**173. A data scientist possesses a feature set with the following structure:**
- customer_id STRING
- spend DOUBLE
- units INTEGER
- happiness_tier STRING
**With 'customer_id' serving as the primary key, each column contains some missing entries. The scientist's ambition is to impute these absences with a consistent value for every feature. Which columns from this set should undergo imputation using the column's most frequent value?**
A. units
B. customer_id
C. happiness_tier
D. spend
E. Both customer_id and happiness_tier
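Question 169 points at SparkTrials. Assuming an fmin call shaped like the Hyperopt sketch after question 126, swapping the trials object is what distributes the evaluations across the cluster; the parallelism value here is illustrative.

```python
# Sketch: SparkTrials runs Hyperopt evaluations concurrently on Spark workers.
from hyperopt import SparkTrials

spark_trials = SparkTrials(parallelism=4)   # up to 4 evaluations at a time
# best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
#             max_evals=16, trials=spark_trials)
```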
**174. Given a Spark DataFrame, preds_df, derived from a three-class decision tree classifier with the schema prediction DOUBLE, actual DOUBLE, which code segment accurately calculates the model's accuracy and assigns it to an accuracy variable?**
A. There isn't any valid code for this operation.
B. accuracy = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy")
C. accuracy = RegressionEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy")
D. classification_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="actual", metricName="accuracy"); accuracy = classification_evaluator.evaluate(preds_df)
E. accuracy = Summarizer(predictionCol="prediction", labelCol="actual", metricName="accuracy")

**175. When a machine learning engineer aims to make batch predictions using a decision tree model stored at a certain path (model_uri), and wants to generate predictions for a DataFrame (batch_df) that contains an 'order_id' column, under what circumstances will the following code yield successful predictions?**

predictions = fs.score_batch(model_uri, batch_df)

A. This specific code will never produce successful predictions.
B. Only when 'order_id' is the sole feature the model at model_uri uses.
C. If the model at model_uri was registered with an associated Feature Store feature set.
D. Provided all features the model at model_uri uses are in one Feature Store table.
E. When every feature the model at model_uri requires is already in a Spark DataFrame in the current PySpark session.

**176. A data scientist integrated a random forest regressor as the concluding stage of a Spark ML Pipeline. They initiate cross-validation, embedding the entire pipeline, including the random forest regressor, as the estimator. What potential challenge might emerge from placing the entire pipeline within the cross-validation procedure?**
A. Every model iteration would need every stage of the pipeline to be refit or transformed, leading to longer runtimes
B. Information from the validation sets might leak into the training sets during data preparation for each model
C. Data from the training phase might inadvertently be used in the test phase, skewing results
D. The hyperparameter value combinations in the parameter grid might not be exhaustively tested
E. The distributed nature of the pipeline might hamper parallelized tuning operations

**177. In the context of an AutoML experiment, which of the following metrics are computed by default when addressing classification challenges?**
A. All of the mentioned metrics
B. Prediction Accuracy Rate
C. Receiver Operating Characteristic Curve Area (AUC-ROC)
D. Sensitivity (Recall)
E. Harmonic Mean of Precision and Recall (F1 Score)
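Option D of question 174 spells out the evaluator pattern; here it is as a compact sketch, assuming preds_df has double-typed 'prediction' and 'actual' columns.

```python
# Sketch: compute multiclass accuracy with a Spark ML evaluator.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

classification_evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction",
    labelCol="actual",
    metricName="accuracy",
)
accuracy = classification_evaluator.evaluate(preds_df)
```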
**178. A machine learning expert is in the midst of crafting a linear regression model utilizing Spark ML with the intent to project car prices. They use a Spark DataFrame labeled 'train_df' for the training process. The schema of 'train_df' is as follows:**

car_id STRING, price DOUBLE, stars DOUBLE, year_updated DOUBLE, seats DOUBLE

**They present this snippet:**

lr = LinearRegression(
    featuresCol = ["stars", "year_updated", "seats"],
    labelCol = "price"
)
lr_model = lr.fit(train_df)

**What modifications ought the expert make to realize their objective?**
A. Incorporate the lr object as a stage in a Pipeline to fit the model
B. No alterations are required
C. Invoke the transform method from the lr_model object on train_df
D. Combine the stars, year_updated, and seats columns into a single vector column
E. Define the parallelism parameter in the LinearRegression operation with a value exceeding 1

**179. A data science team is reworking its machine learning projects to distribute model inference, and classifies the projects by the modeling library in use. Their aim is to use a User-Defined Function (UDF) to distribute the inference process. Among the listed modeling libraries, which ones would require a UDF to distribute model inference?**
A. Spark ML
B. Spacy
C. Spark MLLib
D. Tensorflow
E. Scikit-learn

**180. A data analyst has constructed an ML pipeline utilizing a fixed input dataset with Spark ML. However, the processing time of the pipeline is excessive. To improve efficiency, the analyst expanded the number of workers in the cluster. Interestingly, they observed a discrepancy in the row count of the training set post-cluster reconfiguration compared to its count prior to the adjustment. Which strategy ensures a consistent training and test set for each model iteration?**
A. Invoke manual segmenting of the source dataset
B. Persistently store the split datasets
C. Manually tweak the cluster's configuration
D. Dictate a specific rate during the dataset partitioning phase
E. There exists no strategy to assure a consistent training and test set

**181. A data scientist is on a mission to employ Spark ML for filling missing values in their PySpark DataFrame 'features_df'. Their strategy involves replacing missing entries in all numeric columns in 'features_df' with the respective median value of each numeric column. However, the written code isn't delivering the desired outcome. Can you pinpoint the flaw preventing the imputation from being executed?**

my_imputer = Imputer(strategy="median", inputCols=input_columns, outputCols=output_columns)
imputed_df = my_imputer.transform(features_df)

A. Imputing using a median value is not possible.
B. The flaw lies in not concurrently imputing both the training and testing datasets.
C. The 'inputCols' and 'outputCols' need to match exactly.
D. The code fails to fit the imputer to the data to create an 'ImputerModel'.
E. The 'fit' method needs to be invoked instead of 'transform'.
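Question 181's flaw is the missing fit step: Imputer is an Estimator, so it must be fit to produce an ImputerModel before transform can be called. A sketch under that assumption, with illustrative column lists:

```python
# Sketch: fit the Imputer first, then transform with the resulting ImputerModel.
from pyspark.ml.feature import Imputer

input_columns = ["spend", "units"]                   # illustrative numeric columns
output_columns = ["spend_imputed", "units_imputed"]

my_imputer = Imputer(strategy="median", inputCols=input_columns, outputCols=output_columns)
imputer_model = my_imputer.fit(features_df)          # fit -> ImputerModel
imputed_df = imputer_model.transform(features_df)    # now transform succeeds
```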
**182. A data scientist uses the following setup to tune the hyperparameters of a machine learning model: num_evals = 5 and trials = SparkTrials(). In the fmin call they pass space=search_space, algo=tpe.suggest, max_evals=num_evals, trials=trials to select the best hyperparameters. What adjustment to this script would increase the chances of obtaining a more accurate model?**
A. Substitute tpe.suggest with random.suggest
B. Boost num_evals to 50
C. Omit the algo=tpe.suggest argument
D. Replace fmin() with fmax()
E. Switch SparkTrials() to Trials()