Professional Data Engineer Sample Questions

10 Questions

You are working on optimizing BigQuery for a query that is run repeatedly on a single table. The data queried is about 1 GB, and some rows are expected to change about 10 times every hour. You have optimized the SQL statements as much as possible. You want to further optimize the query's performance. What should you do?

Create a materialized view based on the table, and query that view.

Several years ago, you built a machine learning model for an ecommerce company. Your model made good predictions. Then a global pandemic occurred, lockdowns were imposed, and many people started working from home. Now the quality of your model has degraded. You want to improve the quality of your model and prevent future performance degradation. What should you do?

Retrain the model with data from the last 30 days. Add a step to continuously monitor model input data for changes, and retrain the model.

A new member of your development team works remotely. The developer will write code locally on their laptop, which will connect to a MySQL instance on Cloud SQL. The instance has an external (public) IP address. You want to follow Google-recommended practices when you give access to Cloud SQL to the new team member. What should you do?

Remove the external IP address, and replace it with an internal IP address. Add only the IP address for the remote developer's laptop to the authorized list.

Your Cloud Spanner database stores customer address information that is frequently accessed by the marketing team. When a customer enters the country and the state where they live, this information is stored in different tables connected by a foreign key. The current architecture has performance issues. You want to follow Google-recommended practices to improve performance. What should you do?

Create interleaved tables, and store states under the countries.

Your company runs its business-critical system on PostgreSQL. The system is accessed simultaneously from many locations around the world and supports millions of customers. Your database administration team manages the redundancy and scaling manually. You want to migrate the database to Google Cloud. You need a solution that will provide global scale and availability and require minimal maintenance. What should you do?

Migrate to Cloud Spanner.

Your company collects data about customers to regularly check their health vitals. You have millions of customers around the world. Data is ingested at an average rate of two events per 10 seconds per user. You need to be able to visualize data in Bigtable on a per user basis. You need to construct the Bigtable key so that the operations are performant. What should you do?

Construct the key as user-id#device-id#activity-id#timestamp.

Your company is hiring several business analysts who are new to BigQuery. The analysts will use BigQuery to analyze large quantities of data. You need to control costs in BigQuery and ensure that there is no budget overrun while you maintain the quality of query results. What should you do?

Set a customized project-level or user-level daily quota to acceptable values.

Your Bigtable database was recently deployed into production. The scale of data ingested and analyzed has increased significantly, but the performance has degraded. You want to identify the performance issue. What should you do?

Use Key Visualizer to analyze performance.

Your company is moving your data analytics to BigQuery. Your other operations will remain on-premises. You need to transfer 800 TB of historic data. You also need to plan for 30 Gbps of daily data transfers that must be appended for analysis the next day. You want to follow Google-recommended practices to transfer your data. What should you do?

Use a Transfer Appliance to move the existing data to Google Cloud. Set up a Dedicated or Partner Interconnect for daily transfers.

Your team runs Dataproc workloads in which each worker node takes about 45 minutes to process its tasks. You have been exploring various options to optimize the system for cost, including shutting down worker nodes aggressively. However, your metrics show that the entire job now takes even longer. You want to optimize the system for cost without increasing job completion time. What should you do?

Set a graceful decommissioning timeout greater than 45 minutes.

Study Notes

Optimizing BigQuery Performance

  • To optimize a repeatedly run query on a single 1 GB table whose rows change about 10 times per hour, create a materialized view based on the table and query the view instead; BigQuery keeps the view up to date automatically, so repeated queries avoid rescanning the base table (see the sketch below).
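
A minimal sketch of creating such a materialized view with the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders, not part of the question.

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Hypothetical dataset and table names; a materialized view must be an
    # aggregation (or similar supported query) over the base table.
    ddl = """
    CREATE MATERIALIZED VIEW `my-project.sales.daily_totals_mv` AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM `my-project.sales.orders`
    GROUP BY order_date
    """
    client.query(ddl).result()  # BigQuery refreshes the view automatically

    # Repeated queries read the precomputed view instead of the 1 GB base table.
    rows = client.query(
        "SELECT * FROM `my-project.sales.daily_totals_mv` ORDER BY order_date DESC"
    ).result()
    for row in rows:
        print(row.order_date, row.total_amount)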

Improving Machine Learning Model Quality

  • To restore the quality of a degraded machine learning model and prevent future degradation, retrain the model on data from the last 30 days, which reflects the post-lockdown behavior shift, and add a step that continuously monitors model input data for changes so the model is retrained when the data drifts again (a monitoring sketch follows).
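
The monitoring step can start as a scheduled statistical comparison between training data and recent serving data. A hypothetical sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the feature, the synthetic data, and the 0.05 threshold are illustrative assumptions only.

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_detected(train_values: np.ndarray, recent_values: np.ndarray,
                       alpha: float = 0.05) -> bool:
        """True when the recent feature distribution differs significantly."""
        result = ks_2samp(train_values, recent_values)
        return result.pvalue < alpha

    # Synthetic example: order values before and after a behavior shift.
    train_orders = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)
    recent_orders = np.random.lognormal(mean=3.6, sigma=0.7, size=2_000)

    if drift_detected(train_orders, recent_orders):
        print("Input drift detected - retrain on the last 30 days of data")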

Securely Granting Access to Cloud SQL

  • To follow Google-recommended practices when granting a remote developer access to Cloud SQL, remove the external (public) IP address and replace it with an internal IP, and add only the IP address of the developer's laptop to the authorized network list; the Cloud SQL Auth proxy (or the Cloud SQL connectors) is another recommended way to secure the connection, as sketched below.
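
Once the instance is locked down, the developer can connect through the Cloud SQL Python Connector rather than a raw public address. A sketch assuming the cloud-sql-python-connector package with the pymysql driver; the instance connection name, user, password, and database are placeholder values.

    from google.cloud.sql.connector import Connector, IPTypes
    import pymysql

    # Connect over the instance's private (internal) IP address.
    connector = Connector(ip_type=IPTypes.PRIVATE)

    conn = connector.connect(
        "my-project:us-central1:my-instance",  # placeholder connection name
        "pymysql",
        user="dev-user",
        password="change-me",
        db="app_db",
    )

    with conn.cursor() as cursor:
        cursor.execute("SELECT NOW()")
        print(cursor.fetchone())

    conn.close()
    connector.close()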

Optimizing Cloud Spanner Performance

  • To improve performance of a Cloud Spanner database where countries and states live in separate tables joined by a foreign key, create interleaved tables and store the state rows under their parent country rows; interleaving physically co-locates child rows with the parent, so the frequent country/state lookups avoid cross-table joins (DDL sketch below).
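
A minimal sketch of the corresponding schema change applied through the google-cloud-spanner Python client; the instance, database, table, and column names are assumptions for illustration.

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("customers-db")

    # States are interleaved in Countries, so each state row is stored
    # physically under its parent country row - no cross-table join needed.
    operation = database.update_ddl([
        """CREATE TABLE Countries (
               CountryId STRING(36) NOT NULL,
               Name      STRING(MAX)
           ) PRIMARY KEY (CountryId)""",
        """CREATE TABLE States (
               CountryId STRING(36) NOT NULL,
               StateId   STRING(36) NOT NULL,
               Name      STRING(MAX)
           ) PRIMARY KEY (CountryId, StateId),
           INTERLEAVE IN PARENT Countries ON DELETE CASCADE""",
    ])
    operation.result()  # wait for the schema change to complete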

Migrating PostgreSQL to Google Cloud

  • To migrate a business-critical PostgreSQL system that serves millions of customers worldwide with minimal maintenance, migrate to Cloud Spanner, which provides global scale, high availability, and managed redundancy without manual database administration.

Constructing Performant Bigtable Keys

  • To construct a performant Bigtable row key for per-user visualization, use a composite key that starts with the user ID, such as user-id#device-id#activity-id#timestamp; leading with the user ID stores each user's events contiguously, so a per-user query is a single row-range scan (see the sketch below).
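
A sketch of building that row key and writing one event with the google-cloud-bigtable client; the project, instance, table, and column-family names are placeholders.

    import time
    from google.cloud import bigtable
    from google.cloud.bigtable.row_set import RowSet

    def make_row_key(user_id: str, device_id: str, activity_id: str, ts: int) -> bytes:
        # Leading with user_id keeps all of a user's events in one contiguous
        # range of rows, so a per-user read is a single row-range scan.
        return f"{user_id}#{device_id}#{activity_id}#{ts}".encode("utf-8")

    client = bigtable.Client(project="my-project")
    table = client.instance("vitals-instance").table("vitals")

    row = table.direct_row(make_row_key("user-42", "watch-7", "heart-rate", int(time.time())))
    row.set_cell("metrics", b"bpm", b"72")
    row.commit()

    # Visualizing one user: scan only that user's key range.
    row_set = RowSet()
    row_set.add_row_range_from_keys(start_key=b"user-42#", end_key=b"user-42$")
    for partial_row in table.read_rows(row_set=row_set):
        print(partial_row.row_key)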

Controlling BigQuery Costs

  • To control BigQuery costs for new analysts without compromising query results, set customized project-level or user-level daily quotas (custom cost controls) to acceptable values so the bytes billed per day cannot exceed the budget; per-query limits can add a second guardrail, as sketched below.
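
The daily quotas themselves are configured as custom cost controls in the Cloud Console, but a per-query cap can be enforced in code too. A sketch using the google-cloud-bigquery client's maximum_bytes_billed setting; the 1 GB limit and the table name are illustrative assumptions.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Reject any query that would bill more than ~1 GB instead of running it.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)

    query = """
    SELECT country, COUNT(*) AS orders
    FROM `my-project.sales.orders`
    GROUP BY country
    """
    try:
        for row in client.query(query, job_config=job_config).result():
            print(row.country, row.orders)
    except Exception as exc:  # an over-limit query fails fast, spending nothing
        print(f"Query rejected: {exc}")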

Identifying Bigtable Performance Issues

  • To identify performance issues in a Bigtable database after ingest and analysis have scaled up, use Key Visualizer to analyze access patterns across the key space over time and spot hotspots or unbalanced row ranges.

Transferring Data to BigQuery

  • To move 800 TB of historical data and sustain 30 Gbps of daily transfers that must be ready for analysis the next day, use a Transfer Appliance for the one-time bulk migration and set up a Dedicated or Partner Interconnect for the ongoing daily transfers.

Optimizing Dataproc Workloads

  • To optimize Dataproc workloads for cost without increasing job completion time, set a graceful decommissioning timeout greater than the 45-minute task duration so worker nodes finish in-progress work before they are shut down (a sketch of applying the timeout follows).
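
A sketch of applying that timeout with the google-cloud-dataproc Python client while scaling workers down; the project, region, cluster name, worker count, and one-hour timeout are placeholder assumptions. The same setting is also exposed as the --graceful-decommission-timeout flag on gcloud dataproc clusters update.

    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Scale down to 2 primary workers, but give each node up to 1 hour
    # (> the 45-minute task duration) to finish in-flight work first.
    operation = client.update_cluster(
        request={
            "project_id": "my-project",
            "region": region,
            "cluster_name": "etl-cluster",
            "cluster": {"config": {"worker_config": {"num_instances": 2}}},
            "update_mask": {"paths": ["config.worker_config.num_instances"]},
            "graceful_decommission_timeout": {"seconds": 3600},
        }
    )
    operation.result()  # blocks until the resize (and decommission) completes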
