Amazon Exam Practice Questions - Page 52

Amazon Practice Questions, Discussions & Exam Topics by our Authors

A company is training machine learning (ML) models on Amazon SageMaker by using 200 TB of data that is stored in Amazon S3 buckets. The training data consists of individual files that are each larger than 200 MB in size. The company needs a data access solution that...

To meet the requirements of accessing a large dataset stored in Amazon S3 (200 TB) efficiently with the shortest processing time and minimal setup, let's evaluate each option carefully based on factors like speed, setup complexity, and suitability for large files. Option A: Use File mode in SageMaker to copy the dataset from the S3 buckets to the ML instance storage. - Reasoning: File mode involves copying the dataset from S3 to the local storage of the training instances before training starts. This method works well for smaller datasets but can be inefficient for large datasets (200 TB in this case). - Pros: Simple to implement. - Cons: Copying 200 TB of data to the local storage of the ML instance is impractical due to time and storage limitations. Additionally, it involves large I/O operations and requires significant disk space on the training instance. This approach is slow and will take a long time to process, especially with large files (larger than 200 MB each). - Conclusion: This is not ideal for large datasets, as it is time-consuming and involves significant storage overhead. Option B: Create an Amazon FSx for Lustre file system. Link the file system to the S3 buckets. - Reasoning: Amazon FSx for Lustre is a high-performance file system that can be linked directly to Amazon S3. It can provide fast data access and is designed for workloads that require high throughput and low latency, like ML training. FSx for Lustre can access data directly from S3 without needing to copy it, improving processing time. - Pros: High-performance file system designed for large-scale data processing. It can handle large datasets efficiently and offers excellent throughput and low latency. Linking it to S3 allows seamless access to data stored in S3. - Cons: Setup involves configuring the FSx for Lustre file system, which might require some initial effort. However, once set up, it provides high-speed access to S3 data. 
- Conclusion: This is a highly suitable solution for accessing large datasets efficiently. It balances speed with a one-time setup cost. Option C: Create an Amazon Elastic File System (Amazon EFS) file system. Mount the file system to the training instances. - Reasoning: Amazon EFS is a scalable file system for cloud-based workloads, b...

Author: Akash · Last updated Apr 3, 2026

An online store is predicting future book sales by using a linear regression model that is based on past sales data. The data includes duration, a numerical feature that represents the number of days that a book has been listed in the online store. A data scientist performs an exploratory data analysis and discovers that the relationship between book...

To address the issue of a skewed and non-linear relationship between book sales and duration in the linear regression model, the data scientist should apply a data transformation that can make the relationship more linear or better suited for linear regression. Let's evaluate each option: A) One-hot encoding One-hot encoding is used to transform categorical variables into a set of binary (0 or 1) variables, representing the presence or absence of each category. Since the problem describes a numerical feature ("duration"), one-hot encoding would not be appropriate here. This technique is irrelevant for this case as it doesn’t address skewness or non-linearity. B) Cartesian product transformation Cartesian product transformation is typically used to combine two or more features by creating a new feature based on all possible pairwise combinations. This is often used when dealing with categorical variables. However, this approach does not address the issue of skewness and non-linearity in a numerical feature (like "duration") and would introduce unnecessary complexity without resolving the underlying data issues. C) Quantile binning Quantile binning is a technique where the continuous numerical variable is divided into bins based on quantiles (e.g., quartiles or percentiles). This could potentially help to reduce the skewness by grouping values into discrete categories based on their distribution. However, this transformation might discard useful continuous information by turning the numerical variable into categorical data. It might make the model less precise for certain tasks, especially when the actual value of...
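The point made about quantile binning, that it trades the actual value for a bin label, can be seen in a small sketch. This is plain Python using only the standard library; the `quantile_bin` helper and the sample `durations` list are hypothetical and are not taken from the question.

```python
from statistics import quantiles

def quantile_bin(values, n_bins=4):
    """Assign each value to a quantile bin (0 .. n_bins - 1)."""
    # Cut points that split the data into n_bins groups of roughly equal size.
    cuts = quantiles(values, n=n_bins)
    # A value's bin index is the number of cut points that lie below it.
    return [sum(v > c for c in cuts) for v in values]

# Hypothetical, right-skewed "duration" values (days listed in the store).
durations = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 400]
bins = quantile_bin(durations)

for value, bin_index in zip(durations, bins):
    print(value, "-> bin", bin_index)
```

In this sample, 89 and 400 land in the same top bin, so a model that sees only the bin label can no longer tell them apart. That is the loss of continuous information described above.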

Author: MoonlitPantherX · Last updated Apr 3, 2026

A company's data engineer wants to use Amazon S3 to share datasets with data scientists. The data scientists work in three departments: Finance, Marketing, and Human Resources. Each department has its own IAM user group. Some datasets contain sensitive information and should be accessed on...

The goal here is to share datasets with data scientists from different departments, while ensuring that sensitive datasets are only accessible by the Finance department. The data engineer needs to structure the access control for the datasets in a way that aligns with these requirements. Let's review the options: A) Create an S3 bucket for each dataset. Create an ACL for each S3 bucket. For each S3 bucket that contains a sensitive dataset, set the ACL to allow access only from the Finance department user group. Allow all three department user groups to access each S3 bucket that contains a non-sensitive dataset. - Rejection Reasoning: Using Access Control Lists (ACLs) for access control is less flexible and scalable than using IAM policies or bucket policies. ACLs can be difficult to manage, especially with multiple datasets across different departments. They also don't scale well for multiple departments, and ACLs are generally considered an older method of controlling access in AWS. This solution requires manually configuring ACLs for each dataset, which adds management overhead. - Scenario Usefulness: This option is not recommended because of the complexity and limited scalability of using ACLs for managing permissions. B) Create an S3 bucket for each dataset. For each S3 bucket that contains a sensitive dataset, set the bucket policy to allow access only from the Finance department user group. Allow all three department user groups to access each S3 bucket that contains a non-sensitive dataset. - Selection Reasoning: This option uses bucket policies, which are more flexible and manageable than ACLs. Bucket policies are easier to apply, especially when different access controls are needed based on dataset sensitivity. By setting specific bucket policies, the data engineer can ensure that the Finance department user group has exclusive access to sensitive datasets, while all three departments can access non-sensitive datasets. 
- Scenario Usefulness: This is a valid and effective option as it allows granular control using bucket policies, ensuring that sensitive datasets are protected and only accessible by the required department. However, managing multiple S3 buckets for each dataset may be cumbersome. C) Create a single S3 bucket that includes two folders to separate the sensitive datasets from the non-sensitive datasets. For the Finance department user group, attach an IAM policy that provides access to both folders. For the Marketi...
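As a rough illustration of the folder-based approach in option C, the sketch below builds identity-based IAM policy documents as Python dictionaries. The bucket name, folder prefixes, and the `build_read_policy` helper are hypothetical. The second policy, which limits a group to the non-sensitive folder, is an assumption based on the stated requirement that only Finance may read sensitive data.

```python
import json

BUCKET = "example-datasets-bucket"  # hypothetical bucket name

def build_read_policy(bucket, prefixes):
    """Return an IAM policy document that allows reading objects under the given prefixes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # One object ARN pattern per folder (key prefix) in the bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/{p}/*" for p in prefixes],
            }
        ],
    }

# Finance: access to both folders, as option C describes.
finance_policy = build_read_policy(BUCKET, ["sensitive", "non-sensitive"])

# Assumed policy for a group that must not see sensitive datasets.
restricted_policy = build_read_policy(BUCKET, ["non-sensitive"])

print(json.dumps(finance_policy, indent=2))
```

Each document would be attached to the matching IAM user group, so access is managed per department rather than per bucket.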

Author: VenomousSerpent42 · Last updated Apr 3, 2026

A company operates an amusement park. The company wants to collect, monitor, and store real-time traffic data at several park entrances by using strategically placed cameras. The company's security team must be able to immediately access the data for viewing. Stored data must be indexed and...

To meet the requirements of collecting, monitoring, storing, and accessing real-time traffic data from cameras at the amusement park entrances, let's break down the options and evaluate which would be most cost-effective and aligned with the company's needs. Key Requirements: 1. Real-time traffic data collection via cameras. 2. Immediate viewing of the data by the security team. 3. Stored data needs to be indexed and accessible to the data science team for analysis. 4. Cost-effectiveness is emphasized. Evaluation of Options: A) Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in integration with Amazon Rekognition for viewing by the security team. - Amazon Kinesis Video Streams: Excellent for ingesting real-time video data from cameras. It provides a reliable mechanism for capturing, storing, and streaming video data. - Amazon Rekognition: Offers advanced video analysis (e.g., object and activity detection) but may be overkill if the primary requirement is simple viewing rather than analysis. It's also relatively more expensive, especially if the primary task is just monitoring and not complex analysis. - Cost Considerations: While Kinesis Video Streams is suitable for streaming and storing video data, integrating Rekognition could be costly for the real-time viewing use case, unless there is a clear need for video analysis. Why rejected: The inclusion of Amazon Rekognition for just viewing the data adds unnecessary complexity and cost. Rekognition is more useful for advanced video analysis, not basic streaming or simple viewing. B) Use Amazon Kinesis Video Streams to ingest, index, and store the data. Use the built-in HTTP live streaming (HLS) capability for viewing by the security team. - Amazon Kinesis Video Streams: Suitable for real-time video streaming and storage. - HLS Streaming: Built-in support for streaming video data over HTTP live streaming (HLS). 
HLS is a widely-used protocol that allows the security team to view the video data on a browser or through a video player in real-time. - Cost Considerations: This approach avoids unnecessary analysis features (like Rekognition) and focuses on efficient real-time streaming and storage. The security team can view the video without incurring additional costs for advanced video analysis. Why selected: This ...

Author: Rohan · Last updated Apr 3, 2026

An engraving company wants to automate its quality control process for plaques. The company performs the process before mailing each customized plaque to a customer. The company has created an Amazon S3 bucket that contains images of defects that should cause a plaque to be rejected. Low-confidence predictions must ...

To automate the quality control process for plaques based on image defect detection and send low-confidence predictions to an internal review team, let's evaluate the provided options in terms of their suitability for the task. Key Requirements: 1. Automatic processing of images to detect defects. 2. Integration with Amazon A2I to handle low-confidence predictions and send them to a human reviewer. 3. The review process must involve an internal team (not external workers like those from Amazon Mechanical Turk, unless it's specified that they are involved in a private workforce context). Option Analysis: A) Use Amazon Textract for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review. - Amazon Textract: This service is primarily designed for extracting text from images, such as scanned documents or forms. It would not be suitable for detecting visual defects in plaques, as it does not specialize in image analysis or defect detection. - Amazon A2I with Amazon Mechanical Turk: While Amazon Mechanical Turk (MTurk) is suitable for outsourcing tasks to external workers, it doesn't fit the requirement that the review process should be performed by the company’s internal team. MTurk is typically used for crowd-sourced tasks, and this might not meet the internal team requirement. Why rejected: Amazon Textract is not designed for visual defect detection in images, making it unsuitable for this scenario. Additionally, Mechanical Turk is an external workforce solution and may not fit the company’s internal team requirement. B) Use Amazon Rekognition for automatic processing. Use Amazon A2I with a private workforce option for manual review. - Amazon Rekognition: This service is highly suitable for image and video analysis, including defect detection in images, which aligns perfectly with the company's need to identify defects in plaques. 
- Amazon A2I with private workforce: Amazon A2I integrates with machine learning models like Rekognition and allows for human review when confidence in the model’s prediction is low. The private workforce option allows the company to use its internal team for manual review, which meets the requirement for using internal resources. Why selected: Amazon Rekognition is specifically designed for image analysis and d...

Author: John · Last updated Apr 3, 2026

A machine learning (ML) engineer at a bank is building a data ingestion solution to provide transaction features to financial ML models. Raw transactional data is available in an Amazon Kinesis data stream. The solution must compute rolling averages of the ingested data from the data stream and must store the results in Amazon Sa...

The goal is to compute rolling averages on transaction data ingested from an Amazon Kinesis data stream, store the results in Amazon SageMaker Feature Store, and serve the results to models in near real-time. Let's evaluate each solution option to identify which will best meet these requirements: Key Requirements: 1. Real-time data ingestion from Amazon Kinesis. 2. Rolling averages computation based on the ingested data. 3. Storing the results in Amazon SageMaker Feature Store. 4. The results must be available in near real-time for model consumption. Option A: Load the data into an Amazon S3 bucket by using Amazon Kinesis Data Firehose. Use a SageMaker Processing job to aggregate the data and load the results into SageMaker Feature Store as an online feature group. - Amazon Kinesis Data Firehose: This service is great for ingesting and buffering data into destinations like Amazon S3, but it's not directly optimized for real-time streaming processing like computing rolling averages. - SageMaker Processing job: SageMaker Processing is useful for batch data processing, but it is typically used for batch processing jobs rather than real-time processing. Aggregating data in this manner would introduce latency, making the solution less suited for near real-time use. - SageMaker Feature Store (online feature group): While this solution would store the results in SageMaker Feature Store, the latency introduced by the batch processing job wouldn't meet the requirement of real-time availability for the models. Why rejected: This option introduces unnecessary latency with batch processing (via SageMaker Processing) and is not optimal for real-time use. The ingestion into S3 and the need for batch computation make this solution less suited for real-time feature serving. Option B: Write the data directly from the data stream into SageMaker Feature Store as an online feature group. 
Calculate the rolling averages in place within SageMaker Feature Store by using the SageMaker GetRecord API operation. - Direct ingestion into SageMaker Feature Store: While this allows you to store data directly in SageMaker Feature Store, SageMaker Feature Store is not designed to perform complex calculations like rolling averages directly within its structure. - Rolling averages calculation using SageMaker GetRecord API: The GetRecord API is used to retrieve feature data from the Feature Store, but it is not designed for running computations or transformations on data. Complex calculations like rolling averages would need to be computed beforehand, outside of SageMaker Feature Store. Why rejected: This option misses the necessary data processing step for calculating the rolling averages. SageMaker Feature Store does not support in-place computations like rolling averages, so this would not meet the real-time computation requirement. ...
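The rejection of option B rests on the point that rolling averages have to be computed before the records reach Feature Store. A minimal sketch of that computation follows, in plain Python. It is not a Kinesis or Feature Store client; the `RollingAverage` class, the window size, and the sample amounts are hypothetical.

```python
from collections import deque

class RollingAverage:
    """Maintain the mean of the most recent `window` values in a stream."""

    def __init__(self, window):
        self.values = deque(maxlen=window)
        self.total = 0.0

    def update(self, value):
        # When the window is full, the oldest value is about to be evicted,
        # so remove it from the running total first.
        if len(self.values) == self.values.maxlen:
            self.total -= self.values[0]
        self.values.append(value)
        self.total += value
        return self.total / len(self.values)

# Hypothetical transaction amounts arriving one at a time.
roller = RollingAverage(window=3)
averages = [roller.update(amount) for amount in [10, 20, 30, 40]]
print(averages)
```

In a streaming pipeline, each value returned by `update` would be the feature written to the online feature group, so that reads only ever fetch a precomputed result.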

Author: Jack · Last updated Apr 3, 2026

Each morning, a data scientist at a rental car company creates insights about the previous day's rental car reservation demands. The company needs to automate this process by streaming the data to Amazon S3 in near real time. The solution must detect high-demand rental cars at each of the company's locations. The solution also must create a visualiz...

Let's break down the options and select the best one based on the requirements: Requirements: - Stream data to Amazon S3 in near real-time. - Detect high-demand rental cars at each location. - Visualize the data in a dashboard that automatically refreshes with the most recent data. - Leverage the least development time (important criterion). A) Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight. - Pros: - Amazon Kinesis Data Firehose provides a fully managed service to stream data to S3 with minimal setup and configuration. It’s designed for simple streaming of data without complex stream processing, which saves on development time. - QuickSight ML Insights can automatically detect outliers in the data without requiring complex model development or training. - Visualization is handled within QuickSight, which provides automated and interactive dashboards. - Cons: - Limited to QuickSight’s built-in capabilities for anomaly detection, which may not be as customizable or precise as a dedicated model. - May not offer highly sophisticated anomaly detection for complex patterns in rental car demand, but sufficient for general use cases. B) Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight. - Pros: - Kinesis Data Streams provides real-time data streaming but requires additional configuration for processing and managing streams. - Random Cut Forest (RCF), a machine learning model in SageMaker, is specifically designed for anomaly detection, offering highly customizable and accurate results. - Visualization in QuickSight is fully integrated and allows for near real-time updates. 
- Cons: - Additional development time is needed for setting up the Kinesis Data Streams, as well as for training and deploying the RCF model in SageMaker. This significantly increases the complexity compared to using Firehose. - Development overhead for the model’s training and management, especially when compared to the fully managed QuickSight ML Insights. C) Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outli...

Author: Liam · Last updated Apr 3, 2026

A machine learning (ML) engineer is integrating a production model with a customer metadata repository for real-time inference. The repository is hosted in Amazon SageMaker Feature Store. The engineer wants to retrieve only the latest version of th...

To determine the best solution for the machine learning engineer's requirement, let's evaluate each of the options based on the goal of retrieving only the latest version of the customer metadata record for real-time inference: A) Use the SageMaker Feature Store BatchGetRecord API with the record identifier. Filter to find the latest record. - Reasoning: - The `BatchGetRecord` API allows you to retrieve multiple records at once, which is not ideal if you are looking for a real-time inference of a single customer record. The batch operation requires a list of record identifiers, and you would need to filter the response to get the latest version. - This approach is less efficient because the batch API is typically intended for retrieving multiple records at once, which isn't aligned with the requirement for single-customer inference in real time. - Rejection Reason: This solution isn't optimal for retrieving a single, real-time customer record. B) Create an Amazon Athena query to retrieve the data from the feature table. - Reasoning: - Athena allows you to run SQL queries against data stored in Amazon S3, but querying the data in a Feature Store requires more specific access methods designed for real-time inference, such as the SageMaker Feature Store APIs. - Athena can be used for batch processing and analytics, but it may not be the best choice for real-time inference. - Rejection Reason: Athena is more suitable for large-scale data analysis or batch queries, not for real-time, single-record access. C) Crea...

Author: Lucas Carter · Last updated Apr 3, 2026

A company's data scientist has trained a new machine learning model that performs better on test data than the company's existing model performs in the production environment. The data scientist wants to replace the existing model that runs on an Amazon SageMaker endpoint in the production environment. However, the company is concerned that the new model might not work well on the production environment data. The data scientist needs to perform A/B testing in the ...

To perform A/B testing in the production environment for evaluating the new machine learning model, the data scientist should consider the following steps: A) Create a new endpoint configuration that includes a production variant for each of the two models. - Reasoning: - In Amazon SageMaker, you can create a new endpoint configuration that includes multiple production variants, each pointing to a different model. By setting up two variants (one for the existing model and one for the new model), SageMaker can serve both models from the same endpoint, allowing for A/B testing. You can control the traffic distribution between these models to evaluate their performance on the production environment data. - This option is well-suited for A/B testing as it allows easy comparison between models in the production environment. - Selection Reason: This is a correct step for A/B testing, allowing the data scientist to test both models concurrently under real production conditions. B) Create a new endpoint configuration that includes two target variants that point to different endpoints. - Reasoning: - This approach involves creating an endpoint configuration that points to separate endpoints for each model, rather than within the same endpoint. This is not ideal for A/B testing because it involves multiple endpoints, which would make it harder to control the traffic distribution and evaluate the models against each other simultaneously in the same environment. - Rejection Reason: This is not an ideal approach for A/B testing as it complicates traffic distribution and does not allow testing both models under the same endpoint. C) Deploy the new model to the existing endpoint. - Reasoning: - Deploying the new model directly to the existin...
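The two-variant setup described in option A can be sketched as the request body for an endpoint configuration. The field names follow the SageMaker `CreateEndpointConfig` request shape. The configuration name, model names, instance type, and weights are hypothetical, and the `traffic_share` helper is only for illustration; no AWS call is made here.

```python
# Hypothetical endpoint configuration with one production variant per model.
endpoint_config = {
    "EndpointConfigName": "ab-test-config",
    "ProductionVariants": [
        {
            "VariantName": "existing-model",
            "ModelName": "model-current",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 9,
        },
        {
            "VariantName": "new-model",
            "ModelName": "model-candidate",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        },
    ],
}

def traffic_share(config):
    """Fraction of requests each variant receives, derived from its relative weight."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}

shares = traffic_share(endpoint_config)
print(shares)
```

With these assumed weights, roughly one request in ten goes to the new model, which limits exposure while production data is collected for the comparison.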

Author: Nathan · Last updated Apr 3, 2026

A data scientist is working on a forecast problem by using a dataset that consists of .csv files that are stored in Amazon S3. The files contain a timestamp variable in the following format: March 1st, 2020, 08:14pm. There is a hypothesis about seasonal differences in the dependent variable. This number could be higher or lower for weekdays because some days and hours present varying values, so the day of the week, month, or hour could be an important factor. As a result, the data scientist nee...

Let's evaluate each of the options based on the goal of transforming the timestamp variable into weekdays, month, and day as separate variables with the least operational overhead: A) Create an Amazon EMR cluster. Develop PySpark code that can read the timestamp variable as a string, transform and create the new variables, and save the dataset as a new file in Amazon S3. - Reasoning: - Amazon EMR provides a managed cluster environment to run big data frameworks like Apache Spark, and you would need to develop PySpark code for reading, transforming, and saving the dataset. While it’s a powerful tool for large-scale data processing, it requires managing clusters, configuring the environment, and potentially dealing with scaling and resource management. - This approach involves setting up and managing infrastructure, which leads to more operational overhead compared to other options that abstract infrastructure management. - Rejection Reason: This approach introduces significant operational overhead due to the need for managing clusters and configurations, especially if the data volume is small or the task doesn't require distributed computing. B) Create a processing job in Amazon SageMaker. Develop Python code that can read the timestamp variable as a string, transform and create the new variables, and save the dataset as a new file in Amazon S3. - Reasoning: - While SageMaker provides an environment for data processing and model training, it requires setting up a processing job, configuring the environment, and potentially writing custom Python code. This option gives flexibility but involves more manual work in terms of environment setup, code writing, and job management compared to other more specialized tools. - Rejection Reason: It requires more manual setup and coding compared to other options like Amazon SageMaker Data Wrangler, which is designed for easier data transformations with minimal coding. C) Create a new flow in Amazon SageMaker Data Wrangler. 
Import the S3 file, use the Featurize date/time transform to generate the new variables, and save the dataset as a new file in Amazon S3. - Reasoning: - Amazon SageMaker Data Wrang...
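To make the target of the transformation concrete, the sketch below shows what the derived variables look like for the sample timestamp in the question. It is plain Python rather than a Data Wrangler flow, and the `featurize_timestamp` helper is hypothetical. It only illustrates the output that any of the options would need to produce.

```python
import re
from datetime import datetime

def featurize_timestamp(raw):
    """Split a timestamp such as 'March 1st, 2020, 08:14pm' into separate features."""
    # Drop the ordinal suffix ("1st" -> "1") so that strptime can parse the day.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", raw)
    parsed = datetime.strptime(cleaned, "%B %d, %Y, %I:%M%p")
    return {
        "weekday": parsed.weekday(),  # Monday = 0 ... Sunday = 6
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,  # 24-hour clock
    }

features = featurize_timestamp("March 1st, 2020, 08:14pm")
print(features)
```

Writing and maintaining this kind of parsing code by hand is the extra work that options A and B carry.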

Author: Amira99 · Last updated Apr 3, 2026

A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement mar...

To determine the most accurate approach for predicting product quality in the manufacturing scenario, let's evaluate each of the options: A) Amazon SageMaker DeepAR forecasting algorithm - Reasoning: - DeepAR is designed for time series forecasting and is effective for predicting future values in time series data. It is particularly useful when you have sequential or temporal data and want to forecast numerical values over time. - In this case, predicting product quality (good quality, replacement market quality, or scrap quality) is a classification problem, not a time series forecasting problem. The task requires classification based on sensor data and manual inspection results, which isn't directly related to forecasting future values. - Rejection Reason: This is not the right choice because DeepAR is designed for time series forecasting, not classification tasks like the one in this scenario. B) Amazon SageMaker XGBoost algorithm - Reasoning: - XGBoost is an ensemble learning method based on decision trees and is highly effective for classification and regression problems. It is known for being accurate, efficient, and scalable, especially for tabular datasets with structured features (such as sensor data and manual inspection results). - In this case, the task is to classify the quality of the produced goods based on various sensor data and inspection results. XGBoost is well-suited for this kind of problem, as it can handle both numerical and categorical features and provides high accuracy with relatively low tuning effort. - Selection Reason: This is the most appropriate choice for a classification problem with structured data. XGBoost has been proven to be highly effective in similar industrial use cases, where classification b...

Author: Zara · Last updated Apr 3, 2026

A healthcare company wants to create a machine learning (ML) model to predict patient outcomes. A data science team developed an ML model by using a custom ML library. The company wants to use Amazon SageMaker to train this model. The data science team creates a custom SageMaker image to train the model. When the team tries to launch the custom image in SageM...

To determine which service can be used to access the logs for the error that occurs when launching the custom SageMaker image in SageMaker Studio, let's analyze each of the options: A) Amazon S3 - Reasoning: - Amazon S3 is a storage service and is often used for storing data, models, and other assets in a variety of formats. While logs can be stored in S3, it is not the service directly responsible for monitoring or accessing logs related to SageMaker. - S3 does not provide any built-in capability to access or manage logs for application errors or service operations. - Rejection Reason: S3 is used for storage but does not provide functionality for tracking or accessing logs related to SageMaker applications. B) Amazon Elastic Block Store (Amazon EBS) - Reasoning: - Amazon EBS is a block storage service that provides persistent storage for EC2 instances. It is typically used for attaching storage volumes to EC2 instances but is not focused on log management or monitoring. - EBS doesn't provide a centralized log repository or monitoring for SageMaker or other AWS services. - Rejection Reason: EBS is focused on storage, not on log management or error tracking. C) AWS CloudTrail - Reasoning: - AWS CloudTrail is a service that tracks API calls and provides an audit trail for user activities across AWS services. It helps monitor and log activities such as who initiated an action and when it occurred. - While CloudTrail is valuable for auditing API requests and...

Author: Kai99 · Last updated Apr 3, 2026

A data scientist wants to build a financial trading bot to automate investment decisions. The financial bot should recommend the quantity and price of an asset to buy or sell to maximize long-term profit. The data scientist will continuously stream financial transactions to the bot for training purposes. The data scientist must select ...

To design an effective financial trading bot that can automate investment decisions and maximize long-term profit, we need to select the appropriate machine learning (ML) algorithm. Let's analyze the options based on key factors like decision-making, training data, and how the bot will learn and adjust over time. A) Supervised Learning - Scenario: In supervised learning, the algorithm is trained using labeled data (input-output pairs). The bot would learn from historical data (e.g., past transactions and their outcomes) to predict the next action (buy/sell) based on known inputs (price, volume, etc.). - Limitation: While supervised learning can predict outcomes based on historical data, it doesn't adapt dynamically to the environment. In financial markets, conditions change frequently, and the bot needs to learn continuously and improve over time, which makes supervised learning less ideal for real-time decision-making in trading scenarios. B) Unsupervised Learning - Scenario: Unsupervised learning is used when the training data does not include explicit labels, and the algorithm must find patterns or groupings in the data. In trading, this could be used for tasks like clustering assets or identifying hidden structures in market data. - Limitation: The lack of labeled data makes it unsuitable for the task of making specific buy/sell decisions based on precise objectives like maximizing profit. Trading decisions require actionable predictions (e.g., buy at a specific ...

Author: Lucas Carter · Last updated Apr 3, 2026

A manufacturing company wants to create a machine learning (ML) model to predict when equipment is likely to fail. A data science team already constructed a deep learning model by using TensorFlow and a custom Python script in a local environment. The company wants to use Amaz...

To determine the most cost-effective option for training the deep learning model using Amazon SageMaker, let's break down each option and analyze which will provide the best performance-cost balance. A) Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Pass the script to the estimator in the call to the TensorFlow fit() method. - Explanation: The SageMaker Training Compiler optimizes the model's performance by compiling the TensorFlow code, which could lead to faster training. However, this option does not take advantage of cost-saving measures such as managed spot instances. - Limitation: While the Training Compiler can shorten training time, and with it the billable instance time, it does not lower the hourly price, because it does not use spot instances or other pricing discounts. It focuses only on optimizing training time. B) Turn on SageMaker Training Compiler by adding compiler_config=TrainingCompilerConfig() as a parameter. Turn on managed spot training by setting the use_spot_instances parameter to True. Pass the script to the estimator in the call to the TensorFlow fit() method. - Explanation: This option combines the use of SageMaker's Training Compiler with managed spot instances. Spot instances are a significant cost-saving measure in Amazon SageMaker, as they allow you to take advantage of unused EC2 capacity at a lower price. By optimizing performance through the compiler and reducing costs using spot instances, this configuration is a strong candidate for cost-effective training. - Strength: The use of managed spot instances is key for cost optimization while the Training Compiler enhances the training process. Spot instances can offer substantial savings, making this option highly cost-effective without sacrificing performance. - Best Fit: This is the most cost-effective option because it combines performance optimization with cost-saving...
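The cost argument above can be sketched as simple arithmetic. This is only an illustration: the job length, hourly rate, compiler speedup, and spot discount below are made-up numbers, and training_cost is a hypothetical helper, not part of the SageMaker SDK.

```python
def training_cost(hours, hourly_rate, speedup=1.0, spot_discount=0.0):
    """Estimated job cost in dollars.

    speedup: factor by which the compiler shortens training (1.0 = none).
    spot_discount: fraction taken off the on-demand rate (0.0 = on-demand).
    """
    billable_hours = hours / speedup
    effective_rate = hourly_rate * (1.0 - spot_discount)
    return billable_hours * effective_rate


# Hypothetical inputs: a 10-hour job at $4.00/hour on-demand,
# a 1.25x compiler speedup, and a 60% spot discount.
baseline = training_cost(10, 4.00)
compiler_only = training_cost(10, 4.00, speedup=1.25)
compiler_and_spot = training_cost(10, 4.00, speedup=1.25, spot_discount=0.60)

print(f"baseline:          ${baseline:.2f}")
print(f"compiler only:     ${compiler_only:.2f}")
print(f"compiler and spot: ${compiler_and_spot:.2f}")
```

Under these assumed numbers the compiler reduces the billable hours and spot pricing reduces the rate, so the two savings multiply rather than overlap.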

Author: Kunal · Last updated Apr 3, 2026

An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK. The vehicles have limited hardware and compute power. The company wants to optimize the model to red...

To optimize the model for limited hardware and compute power while maintaining accuracy, the goal is to reduce the model's memory and computational requirements. Let's examine each option and its suitability for achieving this goal. A) Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model. - Explanation: CloudWatch provides metrics related to the training job's performance, but it does not directly help with optimizing the model or reducing memory and computation. Although pruning can reduce the size of the model by removing less important weights (filters), CloudWatch is not the tool designed to help in this scenario. - Limitation: This solution doesn’t leverage any model-specific optimization technique directly. CloudWatch metrics are useful for monitoring, but pruning requires additional steps that aren't provided by CloudWatch. B) Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labeling workflows. Run a new training job that uses the new labeled data with previous training data. - Explanation: SageMaker Ground Truth is useful for labeling datasets, but adding more data does not necessarily optimize the model for efficiency. In fact, a larger dataset may increase computational load and memory usage, which is contrary to the goal of reducing resource consumption. - Limitation: While adding more labeled data could improve model accuracy, it does not address the problem of optimizing the model for hardware with limited resources, such as memory and battery. C) Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. 
Compute the filter ranks based on the training information. Apply pruning ...
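The "compute the filter ranks, then prune the low-ranking filters" step described in the options can be sketched in plain Python. This is a minimal sketch under a loud assumption: it ranks filters by the L1 norm of their weights, which is a common stand-in, whereas the scenario derives ranks from tensors captured during training. The function names and the filter values are hypothetical.

```python
def rank_filters(filters):
    """Return filter indices ordered from least to most important.

    Importance here is the L1 norm of the filter's weights, an assumed
    stand-in for ranks computed from captured training tensors.
    """
    scores = [sum(abs(w) for w in f) for f in filters]
    return sorted(range(len(filters)), key=lambda i: scores[i])


def prune(filters, fraction):
    """Drop the lowest-ranked `fraction` of filters, keeping original order."""
    n_drop = int(len(filters) * fraction)
    dropped = set(rank_filters(filters)[:n_drop])
    return [f for i, f in enumerate(filters) if i not in dropped]


# Four hypothetical filters, each flattened to a list of weights.
filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
pruned = prune(filters, 0.5)
print(pruned)
```

After pruning, the smaller network would still need a new training job to recover accuracy, as the option text states.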

Author: Alexander · Last updated Apr 3, 2026

A data scientist wants to improve the fit of a machine learning (ML) model that predicts house prices. The data scientist makes a first attempt to fit the model, but the fitted model has poor accuracy on both the training dataset and th...

In order to improve the accuracy of a machine learning model that predicts house prices and has poor performance on both the training and test datasets, let's review the options and determine the most appropriate actions the data scientist should take: A) Increase the amount of regularization that the model uses. - Explanation: Regularization techniques (such as L2 regularization) are used to prevent overfitting by penalizing large model coefficients. However, if the model is already underfitting (which seems likely given the poor accuracy on both training and test datasets), increasing regularization will make it even harder for the model to fit the data properly. This would likely worsen the performance. - Limitation: Increasing regularization when the model is underfitting would typically make the model more constrained, which is not the correct approach when accuracy is low. This option is rejected. B) Decrease the amount of regularization that the model uses. - Explanation: If the model is underfitting, decreasing regularization allows the model to better fit the data by reducing the penalty on large coefficients. This can allow the model to learn more complex patterns, potentially improving accuracy on both the training and test datasets. - Strength: Decreasing regularization is useful when a model is too simple and cannot capture the underlying patterns in the data (underfitting). This is a good approach to try when the model has poor accuracy. - Best Fit: This is a viable option to improve model accuracy, especially if the model is underfitting. C) Increase the number of training examples that the model uses. - Explanation: Increasing the number of training examples can help improve the model's ability to generalize by exposing it to more diverse examples. If the model is underfitting, adding more data can provide additional insights and help the model capture better patterns in the data. 
- Strength: More training data can lead to better model performance, particularly when there are insufficient examples for the model to learn from. This is an effective strategy for improving accuracy. - Best Fit: This is a solid approach, especially if the model has insufficient data to train on. ...
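The effect of regularization on an underfitting model can be illustrated with one-dimensional ridge regression, where the slope has a closed form. The data and lambda values below are made up for illustration, and ridge_slope and mse are hypothetical helper names.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for 1-D ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, w):
    """Mean squared error of the line y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # made-up data lying on the line y = 2x

for lam in (0.0, 10.0, 100.0):
    w = ridge_slope(xs, ys, lam)
    print(f"lambda={lam:6.1f}  slope={w:.3f}  train MSE={mse(xs, ys, w):.3f}")
```

As lambda grows, the slope shrinks toward zero and the training error rises, which is why adding regularization to a model that already underfits tends to make things worse, and why reducing it is a reasonable thing to try.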

Author: Ravi Patel · Last updated Apr 3, 2026

A car company is developing a machine learning solution to detect whether a car is present in an image. The image dataset consists of one million images. Each image in the dataset is 200 pixels in height by 200 pixels in width. Each image is labeled as either having a car or not having a car. W...

To determine which architecture is most likely to produce a model that detects whether a car is present in an image with the highest accuracy, we need to carefully consider the characteristics of the problem and the available architectures. Key Factors to Consider: - Image Data: The data consists of images, which are well-suited for convolutional neural networks (CNNs) due to their ability to automatically learn spatial hierarchies in images (e.g., edges, shapes, and objects). CNNs excel at image classification tasks. - Output Layer: The output of the model needs to be a probability that the image contains a car. For binary classification (car or not car), the output layer should produce a value between 0 and 1 that can be read as the probability that a car is present. Now, let's evaluate each option: A) Use a deep convolutional neural network (CNN) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car. - Selection Reasoning: Using a CNN for image classification is the right approach, as CNNs are specifically designed to extract features from images, which significantly improves the accuracy of image-based models. However, a linear output layer is not the ideal choice for binary classification. A linear output layer would give an unbounded continuous value, but for binary classification, we need a function that squashes the output between 0 and 1. This is typically achieved with a sigmoid activation. Hence, while the CNN is the correct architecture, the linear output layer is not optimal for this task. - Scenario Usefulness: This approach is close but still lacks the correct activation function for binary classification. B) Use a deep convolutional neural network (CNN) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car. - Selection Reasoning: A CNN is the right architecture for image classification. 
However, softmax is typically used for multi-class classification (i.e., when there are more than two classes). In this case, since the task is binary classification (car vs. no car), sigmoid activation is a better choice, not softmax....
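To make the output-layer comparison concrete, here is a small sketch of the two activations. The logit values are hypothetical. A raw linear output is just the logit itself and can fall outside the range 0 to 1, a sigmoid squashes a single logit into that range, and a softmax over two logits gives the same probability as a sigmoid applied to the difference of the logits.

```python
import math


def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


car_logit, no_car_logit = 2.0, -1.0  # hypothetical network outputs

p_car_softmax = softmax([car_logit, no_car_logit])[0]
p_car_sigmoid = sigmoid(car_logit - no_car_logit)

print(f"linear output:  {car_logit}")
print(f"softmax P(car): {p_car_softmax:.6f}")
print(f"sigmoid P(car): {p_car_sigmoid:.6f}")
```

For two classes the two probability outputs agree up to floating-point rounding, so the practical difference between them is the number of output units rather than the probability produced.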

Author: Amelia · Last updated Apr 3, 2026

A company is creating an application to identify, count, and classify animal images that are uploaded to the company's website. The company is using the Amazon SageMaker image classification algorithm with an ImageNetV2 convolutional neural network (CNN). The solution works well for most animal images but does not recognize many animal species that are less common. The company obtains 10,000 labeled images of less common animal species and stores the images in Amazon S3....

To address the scenario where the company is trying to improve their model's recognition of less common animal species, the key focus is on utilizing the 10,000 labeled images of these less common species and training the model effectively with the Amazon SageMaker image classification algorithm. Option Analysis: - A) Use a ResNet model. Initiate full training mode by initializing the network with random weights. - Rejected: The ResNet model can indeed work well for image classification tasks, but this approach involves full training mode with random weights, which isn't ideal for leveraging the existing knowledge in the ImageNetV2 model. Full training from random weights would require a much larger dataset and extensive computational resources, which isn't necessary given that the company already has a pre-trained model. This step doesn’t align well with the goal of incorporating the less common species data efficiently. - B) Use an Inception model that is available with the SageMaker image classification algorithm. - Rejected: The company is already using the SageMaker image classification algorithm with ImageNetV2 CNN, which means the model architecture is already established and works well for most animal images. Switching to a different model (like Inception) might be a valid approach but is unnecessary for the current problem. The goal is to fine-tune the existing model, not to change the architecture entirely. Additionally, switching models could require retraining the entire model on a larger dataset. - C) Create a .lst file that contains a list of image files and corresponding class labels. Upload the .lst file to Amazon S3. - Selected: This step is essential when working with the Pipe mode in Amazon SageMaker, as...
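For reference, a .lst file for the SageMaker image classification algorithm is a tab-separated text file with three columns: an image index, a class label index, and the relative path of the image. A minimal sketch of generating one follows; the file names, species, and the build_lst helper are hypothetical and only illustrate the layout.

```python
def build_lst(samples):
    """Return .lst content: index, class label, relative path (tab-separated)."""
    lines = [f"{idx}\t{label}\t{path}" for idx, (path, label) in enumerate(samples)]
    return "\n".join(lines) + "\n"


# Hypothetical file names and class indices, for illustration only.
samples = [
    ("okapi/img_0001.jpg", 0),
    ("okapi/img_0002.jpg", 0),
    ("pangolin/img_0001.jpg", 1),
]
lst_text = build_lst(samples)
print(lst_text, end="")
```

The resulting text would be saved as, for example, a training .lst file and uploaded to Amazon S3 alongside the images it references.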

Author: Ming88 · Last updated Apr 3, 2026

A music streaming company is building a pipeline to extract features. The company wants to store the features for offline model training and online inference. The company wants to track feature history and to give the company's data science teams a...

To address the music streaming company's requirements, the solution must be efficient, scalable, and provide the ability to store features for both offline model training and online inference. Additionally, it should allow the company to track feature history and provide easy access to data science teams. Option Analysis: - A) Use Amazon SageMaker Feature Store to store features for model training and inference. Create an online store for online inference. Create an offline store for model training. Create an IAM role for data scientists to access and search through feature groups. - Selected: This is the most efficient and purpose-built solution. Amazon SageMaker Feature Store is designed to manage and store features for machine learning, offering both online and offline storage capabilities. The online store can be used for low-latency inference, while the offline store is ideal for large-scale model training. SageMaker Feature Store also provides automatic tracking of feature history, making it easier for data science teams to access and query features. The ability to create an IAM role for access control ensures that the data science team can securely interact with the feature groups, meeting all the operational and security requirements. - B) Use Amazon SageMaker Feature Store to store features for model training and inference. Create an online store for both online inference and model training. Create an IAM role for data scientists to access and search through feature groups. - Rejected: This approach would use a single online store for both inference and model training. While SageMaker Feature Store can handle both use cases, mixing online inference with model training features is not optimal. Model training requires accessing large datasets, which may not be ideal for an online store designed for fast, low-latency access. 
Having separate stores for online and offline use ensures better optimization for both types of workloads, improving scalability and performance. - C) Create one Amazon S3 bucket to store online inference features. Create a second S3 bucket to store offline model training features. Turn on versioning for the S3 buckets and use tags to specify which tags are for online inference features and which are for offline model training fea...
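The difference between the online and offline stores can be illustrated with a toy in-memory model. This is not the SageMaker Feature Store API: ToyFeatureStore, its methods, and the feature names are hypothetical, and the sketch only shows the idea that the online store serves the latest value per record while the offline store keeps the full history.

```python
class ToyFeatureStore:
    """In-memory stand-in: online = latest value per ID, offline = full history."""

    def __init__(self):
        self.online = {}   # record_id -> latest feature dict
        self.offline = []  # append-only list of (event_time, record_id, features)

    def put_record(self, record_id, event_time, features):
        """Append to the offline history and refresh the online view."""
        self.offline.append((event_time, record_id, dict(features)))
        current = self.online.get(record_id)
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = {"event_time": event_time, **features}

    def get_record(self, record_id):
        """Single-key lookup of the latest features, as used at inference time."""
        return self.online[record_id]

    def history(self, record_id):
        """Time-ordered feature history, as used for training or audits."""
        rows = [r for r in self.offline if r[1] == record_id]
        return sorted(rows, key=lambda r: r[0])


store = ToyFeatureStore()
store.put_record("user-1", 1, {"skips_last_hour": 3})
store.put_record("user-1", 2, {"skips_last_hour": 5})

print(store.get_record("user-1"))
print(store.history("user-1"))
```

Training reads many historical rows at once, while inference reads one current row per request, which is the reason the analysis above favors keeping the two stores separate.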

Author: Emma Brown · Last updated Apr 3, 2026

A beauty supply store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to generate a report of hourly visitors from the recordings. The report should group visi...

Option Analysis: - A) Use an object detection algorithm to identify a visitor's hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color. - Selected: This is a highly effective solution with minimal effort. Object detection algorithms are designed to identify and localize objects (in this case, hair) in an image or video frame. Once the hair is identified, using a ResNet-50 algorithm (a deep convolutional neural network pre-trained on image data) can be a good fit for recognizing complex attributes like hairstyle and hair color. ResNet-50 has shown excellent performance for image classification tasks, so it's well-suited to distinguishing various hair types and colors. This approach leverages existing powerful models (object detection and ResNet-50) for efficient classification, making it the most straightforward option. - B) Use an object detection algorithm to identify a visitor's hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color. - Rejected: While object detection is appropriate for identifying hair in video frames, XGBoost, a decision tree-based model, is not ideal for processing image data, especially when it comes to complex image features like hairstyles and hair color. XGBoost requires structured input, and transforming image data to a form suitable for XGBoost would add unnecessary complexity and likely reduce accuracy. For image classification tasks, deep learning models like ResNet-50 are better suited than tree-based models like XGBoost. - C) Use a semantic segmentation algorithm to identify a visitor's hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine h...

Author: Sophia Clark · Last updated Apr 3, 2026

A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation for why the customer was approved for a loan or wa...

Option Analysis: - A) Use SageMaker Model Debugger to automatically debug the predictions, generate the explanation, and attach the explanation report. - Rejected: While SageMaker Model Debugger is an excellent tool for debugging machine learning models, it is primarily designed for identifying issues during training, like data quality or model performance problems, and not for generating explanation reports for predictions. It does not provide the functionality needed to generate and attach detailed explanations about why a loan was approved or denied based on the model’s predictions. - B) Use AWS Lambda to provide feature importance and partial dependence plots. Use the plots to generate and attach the explanation report. - Rejected: Using AWS Lambda to generate feature importance and partial dependence plots adds unnecessary complexity to the process. While Lambda can be used for various tasks, it requires additional effort to manage and compute the feature importance or generate partial dependence plots manually. It does not inherently provide a model explanation and requires extra steps for integration, making it a more cumbersome solution than needed. - C) Use SageMaker Clarify to generate the explanation report. Attach the report to the predicted results. - Selected: SageMaker Clarify is specifically designed to provide model explainability and fairness analysis. It automatically generates explanation reports for model predictions, including feature importance, bia...
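SageMaker Clarify bases its feature attributions on Shapley values. The idea can be sketched with an exact Shapley computation on a tiny model. Everything here is an assumption for illustration: loan_score is a hypothetical linear scoring function, the feature names and values are invented, and real explainers approximate this calculation rather than enumerating every ordering.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley attributions by averaging over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        prev = predict(current)
        for name in order:
            # Switch one feature from its baseline value to the applicant's value
            # and credit the change in the score to that feature.
            current[name] = instance[name]
            score = predict(current)
            totals[name] += score - prev
            prev = score
    return {name: total / len(orderings) for name, total in totals.items()}


def loan_score(f):
    # Hypothetical linear scoring model, not a real credit model.
    return 0.5 * f["credit_score"] + 0.25 * f["income"] - 1.0 * f["missed_payments"]


applicant = {"credit_score": 6.0, "income": 8.0, "missed_payments": 3.0}
average = {"credit_score": 2.0, "income": 4.0, "missed_payments": 1.0}

attributions = shapley_values(loan_score, applicant, average)
print(attributions)
```

The attributions add up to the difference between the applicant's score and the baseline score, which is the property that makes them usable as a per-prediction explanation in a report.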

Author: Krishna · Last updated Apr 3, 2026

A financial company sends special offers to customers through weekly email campaigns. A bulk email marketing system takes the list of email addresses as an input and sends the marketing campaign messages in batches. Few customers use the offers from the campaign messages. The company does not want to send irrelevant offers to customers. A machine learning (ML) team at the company is using Amazon SageMaker to build a model to recommend specific off...

In this scenario, the goal is to generate personalized offers for customers based on their profiles and past engagement, and feed these recommendations into the bulk email marketing system with operational efficiency. To assess the best solution, let's break down the options considering the following key factors: Key Considerations: 1. Personalization: The model must be able to recommend specific offers to customers based on historical data, which implies a need for personalized recommendation algorithms. 2. Operational Efficiency: The solution should minimize the overhead of continuous or frequent predictions, especially when dealing with bulk email campaigns. For operational efficiency, we need to avoid high-latency real-time predictions or continuous endpoint calls. 3. Batch vs. Real-Time: Since the bulk email marketing system handles a list of customers at once, it is more efficient to generate recommendations in batch rather than in real-time. Option Breakdown: A) Use the Factorization Machines algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker endpoint to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system. - Rejection Reasoning: Although Factorization Machines are a good choice for recommendation tasks, deploying a SageMaker endpoint for real-time inference adds unnecessary complexity and overhead. Real-time inference is not needed for a bulk email campaign because the recommendations can be generated in batch and fed to the email system all at once. Using an endpoint for real-time predictions may lead to higher costs and operational inefficiency in this case. - Scenario Usefulness: This option is more suitable if the system needed real-time recommendations per customer interaction, but for bulk emails, batch inference is more efficient. 
B) Use the Neural Collaborative Filtering algorithm to build a model that can generate personalized offer recommendations for customers. Deploy a SageMaker endpoint to generate offer recommendations. Feed the offer recommendations into the bulk email marketing system. - Rejection Reasoning: Neural Collaborative Filtering (NCF) is indeed a powerful method for generating personalized recommendations. However, similar to Option A, deploying a SageMaker endpoint for real-time predictions is inefficient for a batch-oriented task like bulk...

Author: Vikram · Last updated Apr 3, 2026

A social media company wants to develop a machine learning (ML) model to detect inappropriate or offensive content in images. The company has collected a large dataset of labeled images and plans to use the built-in Amazon SageMaker image classification algorithm to train the model. The company also intends to use SageMaker pipe mode to speed up the training. The company splits the dataset into training, validation, and testing datasets. The company stores the training and validation images in folders that are named Training and Validation, respectively. The folders contain subfolders that correspond to the names of the dataset classes. The company resizes the images to the same size and generates two input manifest files named training....

When preparing data for Amazon SageMaker image classification, especially when using SageMaker Pipe Mode for faster training, the company must ensure that the data is in an efficient and compatible format. Let’s evaluate each option and select the most suitable one. Option A: Generate two Apache Parquet files, training.parquet and validation.parquet, by reading the images into a Pandas data frame and storing the data frame as a Parquet file. Upload the Parquet files to the training S3 bucket. - Explanation: Apache Parquet is a columnar data format typically used for structured data. However, images are unstructured data, and using Parquet files to store images is not ideal. Converting image data into a Pandas DataFrame and then storing it in Parquet files would add unnecessary complexity and is not efficient for training image classification models. Additionally, Parquet is not natively optimized for SageMaker image classification. - Rejected because: It is not the most efficient way to handle image data in SageMaker, and it requires unnecessary conversions. - When it could be useful: In cases where you have structured tabular data and need efficient storage and retrieval for large datasets. Option B: Compress the training and validation directories by using the Snappy compression library. Upload the manifest and compressed files to the training S3 bucket. - Explanation: Snappy is a compression algorithm commonly used for compressing data in big data frameworks like Hadoop. While it offers fast compression and decompression speeds, it is not a standard compression method for images in machine learning workflows. Additionally, SageMaker does not natively support Snappy compression for image datasets. - Rejected because: The Snappy compression format is not suitable for image data and is not typically used with SageMaker's built-in image classification algorithms. 
- When it could be useful: For data systems that specifically require Snappy-compressed files for big data frameworks but not for image datasets in SageMaker. Option C: Compress the training and va...

Author: Zain · Last updated Apr 3, 2026

A media company wants to create a solution that identifies celebrities in pictures that users upload. The company also wants to identify the IP address and the timestamp details from the users so the company can prevent users from uploading pictures...

To meet the requirements of identifying celebrities in pictures and capturing the IP address and timestamp with the least development effort, let's evaluate each option: Option A: Use AWS Panorama to identify celebrities in the pictures. Use AWS CloudTrail to capture IP address and timestamp details. - Explanation: AWS Panorama is a service designed for bringing computer vision to on-premises devices to perform tasks like object detection. While it can perform real-time analysis, it requires more setup and custom development, particularly around connecting it to your specific use case of identifying celebrities. AWS CloudTrail, on the other hand, records API calls and activities related to AWS services but doesn't directly capture IP address and timestamp details from user uploads. - Rejected because: This solution introduces unnecessary complexity. AWS Panorama isn't the best fit for analyzing images from user uploads, as it's intended for edge devices with limited internet connectivity. - When it could be useful: If the company were processing images from devices on-premises with minimal internet connectivity. Option B: Use AWS Panorama to identify celebrities in the pictures. Make calls to the AWS Panorama Device SDK to capture IP address and timestamp details. - Explanation: Again, AWS Panorama is not ideal for processing user-uploaded images directly. The AWS Panorama Device SDK allows developers to build applications on edge devices, but using it for capturing IP addresses and timestamps would add unnecessary complexity and increase development effort. - Rejected because: Similar to Option A, it requires custom development for a use case that's not a natural fit for AWS Panorama. - When it could be useful: When you want to perform real-time processing on images from devices located in remote or offline environments, but not for handling uploads from online users. Option C: Use Amazon Rekognition to identify celebrities in the pictures. 
Use AWS CloudTrail to capture IP address and timestamp details. -...

Author: Elijah · Last updated Apr 3, 2026

A pharmaceutical company performs periodic audits of clinical trial sites to quickly resolve critical findings. The company stores audit documents in text format. Auditors have requested help from a data science team to quickly analyze the documents. The auditors need to discover the 10 main topics within the documents to prioritize and distribute the review work among the auditing team members. Documents that describe adverse events must receive the highest priority. A data scientist will use statistic...

In this scenario, the goal is to analyze a collection of clinical trial audit documents to identify topics that will help prioritize the review work, particularly highlighting adverse events. The company needs to extract abstract topics from the documents, and the data scientist will use statistical modeling to discover these topics and list relevant words for each topic. Let’s evaluate the options and determine the most suitable algorithms for this task: Option A: Latent Dirichlet Allocation (LDA) - Explanation: LDA is a generative probabilistic model commonly used for topic modeling. It is particularly well-suited for discovering latent (hidden) topics in a collection of documents. LDA assumes that documents are mixtures of topics, and topics are mixtures of words. LDA would identify the 10 main topics in the documents by analyzing the word distributions across them, which is exactly what the auditors need. - Selected because: LDA is widely used in natural language processing (NLP) tasks, specifically for topic modeling. It fits the requirement of discovering abstract topics and associating them with the relevant words, which will help the auditors understand and prioritize the documents. The model works well with text data and can easily provide interpretable results with the top words for each topic. - When it could be useful: LDA is ideal for text-based tasks that involve discovering hidden topics in a corpus of documents. Option B: Random Forest Classifier - Explanation: Random Forest is an ensemble learning method used for classification and regression tasks. While it can be used to classify documents, it is not ideal for discovering abstract topics or the top words associated with topics. It requires labeled data (supervised learning) and is generally used for tasks like classification or regression rather than unsupervised topic modeling. - Rejected because: Random Forest is not a topic modeling technique. 
It does not discover topics in an unsupervised way and would not help in identifying the 10 main topics within the documents. It also doesn’t provide the top words for each category in a way that the auditors need. - When it could be useful: Random Forest would be useful if the company had labeled data for supervised classification (e.g., classifying documents as "adverse event" or "not adverse event"). Option C: Neural Topic Modeling (NTM) - Explanation: Neural Topic Modeling (NTM) is a deep learning-based approach for topic modeling. NTM can potentially outperform traditional methods like LDA, as it uses neural networks to discover complex patterns in the data. It also provides a way to extract topics and associated words, much like LDA, but with a more flexible and data-driv...
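The description of LDA in Option A (documents as mixtures of topics, topics as mixtures of words) can be made concrete with a very small collapsed Gibbs sampler. This is a teaching sketch under stated assumptions, not the SageMaker LDA or NTM implementation: the tokenized documents are invented, the hyperparameters are arbitrary, and only two topics are used instead of the ten in the scenario.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, k, w, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, k, w, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, assign[d][i], w, -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                update(d, k, w, +1)
    return topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic."""
    return [sorted(c, key=c.get, reverse=True)[:n] for c in topic_word]


# Hypothetical, already-tokenized audit snippets.
docs = [
    ["adverse", "event", "patient", "reaction"],
    ["adverse", "reaction", "patient", "hospital"],
    ["consent", "form", "signature", "missing"],
    ["consent", "signature", "form", "date"],
]
topic_word = lda_gibbs(docs, n_topics=2)
topics = top_words(topic_word)
print(topics)
```

The top words per topic are what an auditor would read to label a topic, for example to spot the one that corresponds to adverse events.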

Author: Krishna · Last updated Apr 3, 2026

A company needs to deploy a chatbot to answer common questions from customers. The chatbot must base its answers on company documentation. Which soluti...

Let's analyze each of the options in terms of the least development effort for deploying a chatbot that answers customer questions based on company documentation. Option A: Index company documents by using Amazon Kendra. Integrate the chatbot with Amazon Kendra by using the Amazon Kendra Query API operation to answer customer questions. - Explanation: Amazon Kendra is a fully managed, AI-powered search service specifically designed to index and search unstructured data, such as documents. Kendra can understand natural language queries, making it a great fit for answering customer questions based on company documentation. It also supports integrating with chatbots via its Query API, allowing users to easily get answers from indexed documents. - Selected because: Amazon Kendra is specifically designed for document search and question answering with minimal setup. This solution leverages a managed service that handles document indexing, search, and natural language queries, significantly reducing development effort. The API integration with the chatbot is straightforward, making it the most efficient and least effort-intensive solution for this use case. - When it could be useful: Amazon Kendra is ideal when you need a quick, scalable, and accurate solution for searching company documentation based on customer queries without needing to build and train custom models. Option B: Train a Bidirectional Attention Flow (BiDAF) network based on past customer questions and company documents. Deploy the model as a real-time Amazon SageMaker endpoint. Integrate the model with the chatbot by using the SageMaker Runtime InvokeEndpoint API operation to answer customer questions. - Explanation: BiDAF is a deep learning-based approach for question answering tasks that requires a custom-trained model. Training a BiDAF model involves significant time and expertise, especially for tasks like fine-tuning it on company-specific documentation. 
After training, it must be deployed on Amazon SageMaker, and the chatbot would need to interact with the model via the SageMaker API. - Rejected because: Although BiDAF is a powerful technique for question answering, it requires a lot of custom development, including data preparation, model training, and deployment. This would result in a much higher development effort compared to using a pre-built service like Amazon Kendra. - When it could be useful: This approach might be suitable for more advanced use cases where highly specific or complex question-answering models are required, but it’s overkill for simple, document-based querying. Option C: Train an Amazon SageMaker BlazingText model based on past c...
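The Kendra integration in Option A amounts to one Query call per customer question. A minimal sketch follows, assuming a client object that exposes the Kendra query operation with IndexId and QueryText parameters, as the boto3 Kendra client does. The helper name, the stub client, the index ID, and the sample answer are hypothetical; the stub exists only so the sketch runs without AWS access.

```python
def answer_from_kendra(kendra_client, index_id, question):
    """Return the top excerpt Kendra finds for a question, or None."""
    response = kendra_client.query(IndexId=index_id, QueryText=question)
    for item in response.get("ResultItems", []):
        excerpt = item.get("DocumentExcerpt", {}).get("Text")
        if excerpt:
            return excerpt
    return None


class StubKendraClient:
    """Stand-in for a real Kendra client so the sketch runs without AWS access."""

    def query(self, IndexId, QueryText):
        return {
            "ResultItems": [
                {
                    "Type": "ANSWER",
                    "DocumentExcerpt": {"Text": "Returns are accepted within 30 days."},
                }
            ]
        }


reply = answer_from_kendra(StubKendraClient(), "example-index-id", "What is the return policy?")
print(reply)
```

In a real deployment the stub would be replaced by boto3.client("kendra") and the ID of an index that has already ingested the company documentation; the chatbot code itself stays this small, which is the point of the least-effort argument.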

Author: Charlotte · Last updated Apr 3, 2026

A company wants to conduct targeted marketing to sell solar panels to homeowners. The company wants to use machine learning (ML) technologies to identify which houses already have solar panels. The company has collected 8,000 satellite images as training data and will use Amazon SageMaker Ground Truth to label the data. The company has a small internal team that is working on the pro...

Key Considerations: - Internal Team Expertise: The internal team has no machine learning (ML) expertise, so the solution should minimize the amount of effort needed for ML model training and inference. - Data Labeling: The company has collected 8,000 satellite images and will use Amazon SageMaker Ground Truth for labeling the data. The effort needed for labeling should be minimized. - Model Training: The solution should use an effective model for detecting solar panels on houses in satellite images. The team needs a simple, low-effort process to train and deploy the model. Option Breakdown: A) Set up a private workforce that consists of the internal team. Use the private workforce and the SageMaker Ground Truth active learning feature to label the data. Use Amazon Rekognition Custom Labels for model training and hosting. - Rejection Reasoning: While Amazon Rekognition Custom Labels is a powerful service for labeling and training models, it may require more involvement from the team in terms of customization and setup, particularly for a specialized task like identifying solar panels. Rekognition Custom Labels might not be optimized for this specific object detection task (solar panels in satellite images). Additionally, relying on the internal team for labeling without ML experience increases the effort, and Rekognition's automated labeling capabilities might not handle this specific problem as well as other more specialized object detection algorithms. - Scenario Usefulness: This is not the best choice because Rekognition Custom Labels may require more involvement and may not be the most efficient for this problem. B) Set up a private workforce that consists of the internal team. Use the private workforce to label the data. Use Amazon Rekognition Custom Labels for model training and hosting. - Rejection Reasoning: This option has the same issues as Option A regarding Rekognition Custom Labels, but it lacks the active learning feature of SageMaker Ground Truth. 
Active learning would help reduce the amount of manual labeling required by prioritizing uncertain images for labeling, which can significantly improve efficiency, especially with a small internal team. Without active learning, this approach could be more labor-intensive. - Scenario Usefulness: While Rekognition Custom Labels could work for basic object detection, this option is less efficient compared to using active learning to minimize the amount of labeling effort required from the team...
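The idea of prioritizing uncertain images can be illustrated with a generic least-confidence sampler. This is not Ground Truth's internal algorithm, which the text does not describe; the image IDs and probabilities are made up for illustration.

```python
def least_confident(predictions, budget):
    """Pick the `budget` items whose top class probability is lowest."""
    scored = [(max(probs), image_id) for image_id, probs in predictions.items()]
    scored.sort()  # lowest confidence first
    return [image_id for _, image_id in scored[:budget]]


# Hypothetical model outputs: [P(no solar panel), P(solar panel)]
predictions = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.70, 0.30],
    "img_004": [0.51, 0.49],
}
to_label = least_confident(predictions, budget=2)
print(to_label)
```

Images the model is already confident about can be labeled automatically, so the small internal team only reviews the ambiguous ones.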

Author: Olivia · Last updated Apr 3, 2026

A company hosts a machine learning (ML) dataset repository on Amazon S3. A data scientist is preparing the repository to train a model. The data scientist needs to redact personally identifiable information (PII) from ...

Let's analyze the provided options for redacting personally identifiable information (PII) from a machine learning (ML) dataset in Amazon S3, considering key factors like development effort, simplicity, scalability, and ease of use. A) Use Amazon SageMaker Data Wrangler with a custom transformation to identify and redact the PII Analysis: - Development effort: Medium to high. While Data Wrangler is designed to simplify data processing, creating custom transformations for PII redaction would still require a fair amount of effort, especially to identify PII reliably and redact it properly. - Simplicity: Data Wrangler has built-in features for data wrangling and preparation, but custom transformations require familiarity with the tool and its APIs. - Use case suitability: Best for scenarios where the data scientist is already using SageMaker for model training and wants an integrated solution for data wrangling. However, if the goal is purely to redact PII, this might be overkill compared to other simpler solutions. B) Create a custom AWS Lambda function to read the files, identify the PII, and redact the PII Analysis: - Development effort: High. Developing a custom Lambda function would require writing and maintaining the code to detect and redact PII. It may also require additional effort to handle various types of PII across different data formats (e.g., text, images). - Scalability: Lambda is scalable, but the implementation might become cumbersome for large datasets or complex PII detection rules. - Use case suitability: Lambda would be useful if a custom solution is required for specific logic, but it introduces a higher development and maintenance burden compared to other options. C) Use AWS Glue DataBrew to identify and redact the PII Analysis: - Development effort: Low. AWS Glue DataBrew provides a no-code interface for data ...
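To show why the custom-code route in option B carries a high development burden, here is a deliberately small sketch of hand-written redaction. It assumes plain-text input and covers only two PII types (email addresses and US Social Security numbers) with simple regular expressions; the patterns and sample text are illustrative and far from exhaustive.

```python
import re

# Illustrative patterns only; real PII detection needs many more cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    """Replace matched emails and SSNs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
redacted = redact(sample)
print(redacted)
```

Every additional PII type (names, addresses, phone formats per country) would need its own rule and tests, which is the maintenance cost the analysis refers to.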

Author: Evelyn · Last updated Apr 3, 2026

A company is deploying a new machine learning (ML) model in a production environment. The company is concerned that the ML model will drift over time, so the company creates a script to aggregate all inputs and predictions into a single file at the end of each day. The company stores the file as an object in an Amazon S3 bucket. The total size of the daily file is 100 GB. The daily file size will increase over time. Four times a year, the company samples the data from the previous 90 days to check the ML model for drift. After the 90-day period, t...

Let's evaluate the given options based on the requirements: minimizing storage costs, maintaining durability, and complying with the need to store the data for compliance after 90 days. Key Requirements: 1. Minimize storage costs: This indicates that the company is looking for the most cost-effective storage classes for long-term retention. 2. Maintain durability: Amazon S3 provides high durability, but some storage classes (e.g., One Zone-IA) offer lower durability. 3. Compliance and retention for 90 days: The company needs to keep the files for compliance reasons after 90 days, meaning long-term storage is required after that period. Analysis of Each Option: A) Store the daily objects in the S3 Standard-Infrequent Access (S3 Standard-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Flexible Retrieval after 90 days. - S3 Standard-IA: This is a cost-effective storage class for data that is accessed infrequently but needs to be quickly retrieved. It is highly durable (99.999999999% durability). - S3 Glacier Flexible Retrieval: This is a good long-term storage option, optimized for data that is rarely accessed but still needs to be preserved with high durability. Retrieval times are slower than Standard-IA. - Cost Efficiency: Using Standard-IA for 90 days and Glacier Flexible Retrieval for long-term storage is cost-effective for the type of data described (infrequent access and long-term retention). - Durability: Both Standard-IA and Glacier Flexible Retrieval offer high durability (99.999999999%). B) Store the daily objects in the S3 One Zone-Infrequent Access (S3 One Zone-IA) storage class. Configure an S3 Lifecycle rule to move the objects to S3 Glacier Flexible Retrieval after 90 days. - S3 One Zone-IA: This is cheaper than Standard-IA because it stores data in a single availability zone instead of multiple zones. 
However, it is not resilient to the loss of that Availability Zone. Although One Zone-IA is designed for the same 99.999999999% durability within its single zone, the data can be lost if the zone is destroyed, and the class has a lower availability design target than Standard-IA. - S3 Glacier Flexible Retrieval: As in option A, this is a good choice for long-term storage. - Durability Issue: Using One Zone-IA introduces a risk in durability due to the lack of cross-availabi...
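The lifecycle transition that options A and B both describe can be sketched as a configuration document. The rule ID and the `daily/` prefix are hypothetical; `GLACIER` is the storage class value that corresponds to S3 Glacier Flexible Retrieval.

```python
import json


def lifecycle_rule(prefix, days_until_archive):
    """Build a lifecycle configuration that archives objects after N days."""
    return {
        "Rules": [{
            "ID": "archive-daily-drift-files",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": days_until_archive, "StorageClass": "GLACIER"}
            ],
        }]
    }


config = lifecycle_rule("daily/", 90)
print(json.dumps(config, indent=2))
# With boto3, a dict of this shape is what would be passed as the
# LifecycleConfiguration argument of put_bucket_lifecycle_configuration.
```

The storage class used for the first 90 days is chosen at upload time, so it does not appear in the rule; only the move to archival storage does.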

Author: Aria · Last updated Apr 3, 2026

A company wants to enhance audits for its machine learning (ML) systems. The auditing system must be able to perform metadata analysis on the features that the ML models use. The audit solution must generate a report that analyzes the metadata. The solution also must be able to set the da...

Key Considerations: 1. Metadata Analysis: The solution should be able to perform metadata analysis on features used by the ML models. 2. Setting Data Sensitivity and Authorship: The ability to assign metadata that specifies data sensitivity and authorship is a key requirement. 3. Reporting: The solution should be able to generate reports that summarize the metadata analysis. 4. Development Effort: The company is looking for a solution that minimizes development effort, meaning it should leverage managed services and built-in features wherever possible. Option Breakdown: A) Use Amazon SageMaker Feature Store to select the features. Create a data flow to perform feature-level metadata analysis. Create an Amazon DynamoDB table to store feature-level metadata. Use Amazon QuickSight to analyze the metadata. - Rejection Reasoning: Although SageMaker Feature Store can help with storing and managing features, creating a separate DynamoDB table for feature-level metadata and using Amazon QuickSight for analysis adds unnecessary complexity. The DynamoDB table would require manual effort to manage and synchronize metadata with the feature store, and the solution would need additional custom data flows for analysis, which increases the development effort. - Scenario Usefulness: This option could be used in custom cases, but it is unnecessarily complex for this use case where there are more straightforward solutions available. B) Use Amazon SageMaker Feature Store to set feature groups for the current features that the ML models use. Assign the required metadata for each feature. Use SageMaker Studio to analyze the metadata. - Rejection Reasoning: While SageMaker Feature Store and SageMaker Studio are powerful tools, SageMaker Studio does not provide built-in features for performing specific metadata analysis in a way that generates detailed, structured reports. 
It’s mainly a development and monitoring environment, but it might not fully meet the requirement for automated metadata analysis and reporting. - Scenario Usefulness: This solution provides some capabilities but doesn’t quite fulfill the requirement for automa...
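As a rough illustration of "assign the required metadata for each feature", the sketch below builds a request shaped like the input of the SageMaker `UpdateFeatureMetadata` operation. The feature group name, feature name, and the `sensitivity` and `author` keys are assumptions chosen for this example, not values from the question.

```python
def feature_metadata_request(feature_group, feature, sensitivity, author):
    """Build key-value metadata for one feature (hypothetical key names)."""
    return {
        "FeatureGroupName": feature_group,
        "FeatureName": feature,
        "ParameterAdditions": [
            {"Key": "sensitivity", "Value": sensitivity},
            {"Key": "author", "Value": author},
        ],
    }


request = feature_metadata_request(
    "customer-features",   # hypothetical feature group
    "annual_income",       # hypothetical feature
    "high",
    "data-platform-team",
)
print(request)
```

Keeping sensitivity and authorship as feature-level parameters means the metadata lives next to the features themselves, which avoids the separate synchronized table criticized in option A.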

Author: FlamePhoenix2025 · Last updated Apr 3, 2026

A machine learning (ML) specialist uploads a dataset to an Amazon S3 bucket that is protected by server-side encryption with AWS KMS keys (SSE-KMS). The ML specialist needs to ensure that an Amazon SageMaker notebook inst...

Let's analyze the different solutions to meet the requirement of allowing an Amazon SageMaker notebook instance to read a dataset stored in Amazon S3 with server-side encryption using AWS KMS (SSE-KMS). Key Requirements: 1. Access to the S3 Dataset: The SageMaker notebook instance must be able to read the encrypted dataset stored in S3. 2. Encryption Consideration: The data is encrypted using KMS keys, so the notebook instance needs permissions to access the KMS key that was used to encrypt the data. 3. Least Permissions and Simplicity: The solution should ensure that only necessary permissions are granted to allow SageMaker to read the S3 dataset. Analysis of Each Option: A) Define security groups to allow all HTTP inbound and outbound traffic. Assign the security groups to the SageMaker notebook instance. - Reasoning: This option is not relevant to the problem. Security groups control network access and are primarily used for controlling traffic between resources in a VPC. However, the requirement here is to access encrypted S3 data, which involves IAM roles and KMS permissions, not network-level traffic. - Rejected: Security groups are not involved in managing S3 data access or KMS key permissions. B) Configure the SageMaker notebook instance to have access to the VPC. Grant permission in the AWS Key Management Service (AWS KMS) key policy to the notebook’s VPC. - Reasoning: While configuring the notebook to access a VPC is a valid configuration for certain use cases (like accessing private resources in the VPC), it does not directly address the issue of accessing the encrypted S3 data. The problem at hand is KMS key access, not VPC access. - Rejected: This option misses the core requirement of granting the necessary KMS permission...
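Requirements 1 to 3 above can be pictured as a small identity policy for the notebook's execution role. This is a sketch under assumptions: the bucket name and key ARN are placeholders, and a real setup may also need the key policy itself to allow the role.

```python
import json


def read_encrypted_dataset_policy(bucket, key_arn):
    """Least-privilege sketch: read objects and decrypt with one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": key_arn,
            },
        ],
    }


policy = read_encrypted_dataset_policy(
    "example-ml-datasets",  # hypothetical bucket
    "arn:aws:kms:us-east-1:111111111111:key/example-key-id",  # placeholder
)
print(json.dumps(policy, indent=2))
```

Nothing in the policy refers to security groups or VPCs, which matches the reasoning for rejecting options A and B: the missing piece is permission on the key, not network reachability.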

Author: Ava · Last updated Apr 3, 2026

A company has a podcast platform that has thousands of users. The company implemented an algorithm to detect low podcast engagement based on a 10-minute running window of user events such as listening to, pausing, and closing the podcast. A machine learning (ML) specialist is designing the ingestion process for these events. The ML specialist needs to transform the data...

Let’s evaluate the different options based on the need to transform event data (user actions such as listening, pausing, and closing a podcast) for machine learning (ML) inference while minimizing operational effort. Key Requirements: 1. Event Data Transformation: The data transformation needs to process a 10-minute running window of user events for inference. 2. Operational Effort: The goal is to minimize operational overhead, ideally using managed services that reduce the need for manual intervention and maintenance. 3. Real-time Processing: The transformation should work in real-time or near real-time to ensure that the data is ready for inference. Analysis of Each Option: A) Use an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster to ingest event data. Use Amazon Kinesis Data Analytics to transform the most recent 10 minutes of data before inference. - Amazon MSK: While Amazon MSK (Managed Streaming for Apache Kafka) is a powerful streaming platform for real-time data ingestion, it requires more setup and maintenance compared to alternatives like Kinesis. It involves managing Kafka clusters, which can increase operational overhead. - Kinesis Data Analytics: Kinesis Data Analytics is a good choice for processing real-time streaming data, but using MSK as the ingestion mechanism introduces more complexity in managing Kafka compared to using Kinesis Data Streams directly. - Operational Overhead: Using MSK adds unnecessary operational overhead as it requires more setup and management of the Kafka cluster. This option is more complex than the other alternatives. - Rejected: This option introduces unnecessary complexity and operational overhead, making it less efficient for minimizing operational effort. B) Use Amazon Kinesis Data Streams to ingest event data. Store the data in Amazon S3 by using Amazon Kinesis Data Firehose. Use AWS Lambda to transform the most recent 10 minutes of data before inference. 
- Kinesis Data Streams: This is a managed service for ingesting real-time event data, which works well for streaming use cases. - Kinesis Data Firehose: Kinesis Data Firehose can stream data directly to Amazon S3 with minimal setup. However, storing the data in S3 might not be the most optimal solution for real-time transformations. The 10-minute running window transformation would likely be more challenging to manage with this architecture, as you would need to periodically fetch and process the data from S3. - AWS Lambda: Lambda could be used to transform the data, but its stateless nature makes it difficult to efficiently manage a 10-minute running window for transformation. This may require additional complexity in managing state or periodically invoking Lambda for transformation. - Operational Complexity: The use o...
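The state that a 10-minute running window requires can be seen in a small in-memory sketch. It assumes events arrive in timestamp order and uses made-up timestamps in seconds; a stateless function invoked per batch would have to rebuild or externally store this structure on every call.

```python
from collections import deque


class RunningWindow:
    """Keep only the events from the most recent `seconds` of activity."""

    def __init__(self, seconds=600):
        self.seconds = seconds
        self.events = deque()

    def add(self, timestamp, event_type):
        self.events.append((timestamp, event_type))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.seconds:
            self.events.popleft()

    def counts(self):
        totals = {}
        for _, event_type in self.events:
            totals[event_type] = totals.get(event_type, 0) + 1
        return totals


window = RunningWindow()
for ts, kind in [(0, "listen"), (120, "pause"), (500, "listen"), (700, "close")]:
    window.add(ts, kind)
summary = window.counts()
print(summary)
```

A managed stream-analytics service maintains this kind of window for you, which is why the analysis treats hand-managed state as extra operational effort.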

Author: Samuel · Last updated Apr 3, 2026

A machine learning (ML) specialist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes in the dataset, but it does not achieve an acceptable recall metric. The ML specialist varies the number and size of the ML...

To improve recall in the least amount of time, the most effective solution would focus on adjusting the model to better handle the class imbalance issue, since the target class is unique but not achieving an acceptable recall. Let’s break down the options: A) Add class weights to the MLP's loss function, and then retrain. - Explanation: Adding class weights to the loss function of the MLP directly addresses class imbalance, which is likely the reason the target class is not achieving high recall. This solution modifies how the model is trained by making the misclassification of the target class more costly, encouraging the MLP to focus more on correctly predicting this class. - Key factors: Quick implementation, minimal change to the model architecture, no need for additional data. - Why it works: By adjusting class weights, you essentially tell the model to pay more attention to the underrepresented class, improving recall without requiring a massive retraining or the collection of new data. - Time efficiency: This is a relatively quick change in the training process. B) Gather more data by using Amazon Mechanical Turk, and then retrain. - Explanation: Gathering more data can improve model performance, but it is time-consuming. Depending on how long it takes to gather enough labeled data, it might take much longer to implement than adjusting the class weights. - Key factors: Data collection process, time, cost. - Why it’s not ideal here: Although gathering more data could improve performance, the recall issue may not be solely about data quantity—it could also be about class imbalance. Therefore, adding more data might not immediately solve the problem unless it specifically addresses the imbalance. - Time efficiency: This solution takes the longest amount of time to implement. C) Train a k-means algorithm instead of an MLP. - Explanation: K-me...
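Option A's effect on the loss can be shown with a small standard-library sketch. It uses inverse-frequency weights, which is one common choice rather than anything the question prescribes, and a toy label set in which the class of interest (label 1) is rare.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over all examples."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(labels)

# A model that always favors the majority class.
probs = [[0.9, 0.1]] * len(labels)
plain_loss = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted_loss = weighted_cross_entropy(probs, labels, weights)
print(weights, plain_loss, weighted_loss)
```

With the weights applied, ignoring the rare class costs the model noticeably more, so gradient updates push toward higher recall on that class without any new data or architecture change.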

Author: MysticJaguar44 · Last updated Apr 3, 2026

A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the...

To meet the requirements of creating and viewing an analysis report that details potential bias in the uploaded data with the least operational overhead, we should focus on tools that provide automated bias detection and reporting, without requiring extensive manual work. Let’s evaluate each option: A) Use SageMaker Clarify to automatically detect data bias. - Explanation: SageMaker Clarify is specifically designed to analyze datasets and detect bias both before and after model training. It can automatically generate reports on potential biases in the input data and provide insights into how these biases could affect model fairness. - Key factors: SageMaker Clarify is built for the purpose of detecting bias in datasets and is integrated with SageMaker Studio. It automates much of the bias detection process, which aligns perfectly with the goal of minimizing operational overhead. - Why it works: It directly addresses the task of analyzing potential bias in the data and is easy to integrate within the SageMaker ecosystem. It's an out-of-the-box solution for the exact need at hand. - Time efficiency: Minimal overhead because it automates the bias detection process in a streamlined manner. B) Turn on the bias detection option in SageMaker Ground Truth to automatically analyze data features. - Explanation: SageMaker Ground Truth is a tool for labeling data, and while it supports creating datasets with human labeling, it does not focus on bias detection by itself. Bias detection is not a built-in feature of Ground Truth in the same way it is in SageMaker Clarify. - Why it’s not ideal: Although Ground Truth helps with data labeling and creating datasets, it is not designed for analyzing data bias in the context of model fairness. Bias detection features are limited in Ground Truth, so this option wouldn’t directly meet the requirement of generating a bias report. - Time efficiency: It does not specifically handle bias detection as efficiently as SageMaker Clarify. 
C) Use SageMaker Model Monitor to generate a bias drift report. - Explanation: SageMaker Model Monitor is used to track model performance and monitor data drift after a model has been deployed. While it does help with detecting changes in input data that could...
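For a sense of what a pre-training bias report contains, the sketch below computes two metrics by hand using their commonly documented definitions: class imbalance (CI) and difference in proportions of labels (DPL). It does not call SageMaker Clarify, and the group sizes are invented for illustration.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the two groups are equal in size."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL = share of positive labels in group a minus the share in group d."""
    return pos_a / n_a - pos_d / n_d


# Hypothetical dataset: 800 rows in group a, 200 rows in group d.
ci = class_imbalance(800, 200)
dpl = difference_in_proportions_of_labels(400, 800, 50, 200)
print(ci, dpl)
```

Both numbers depend only on the data, not on a trained model, which is why this kind of analysis can run before training begins.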

Author: Emma · Last updated Apr 3, 2026

A network security vendor needs to ingest telemetry data from thousands of endpoints that run all over the world. The data is transmitted every 30 seconds in the form of records that contain 50 fields. Each record is up to 1 KB in size. The security vendor uses Amazon Kinesis Data Streams to ingest the data. The vendor requires hourly summaries of the records that Kinesis Data Streams ingests. The vendor will use Amazon Athena to query the records and to generate the sum...

To meet the requirements of ingesting telemetry data from thousands of endpoints, summarizing it hourly, and then querying it via Amazon Athena with the least amount of customization, we need to focus on solutions that efficiently aggregate and transform the data with minimal manual intervention. Let's break down the options: A) Use AWS Lambda to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose. - Explanation: AWS Lambda can be used to read and aggregate the data hourly. Kinesis Data Firehose can then be used to transform and store the data in Amazon S3. - Why it works: While AWS Lambda can be effective for processing and aggregation, this approach introduces operational complexity. Aggregating data hourly using Lambda would require you to manage state and timing, which may be error-prone and difficult to maintain. Additionally, AWS Lambda may not be the best fit for aggregating large volumes of data from multiple endpoints. - Drawbacks: Managing Lambda functions for such aggregation would add significant complexity, and might not scale well when dealing with high-throughput data streams. B) Use Amazon Kinesis Data Firehose to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using a short-lived Amazon EMR cluster. - Explanation: Amazon Kinesis Data Firehose can handle data ingestion, but the aggregation would need to be done using a short-lived Amazon EMR cluster. The EMR cluster would read from Kinesis Data Firehose, process the data, and then store the output in Amazon S3. - Why it’s not ideal: Using an EMR cluster adds significant operational overhead. Although EMR is powerful for large-scale data processing, it requires setting up, managing, and terminating clusters, which introduces complexity and costs. It's not the best choice for the requirement of simple, hourly aggregation with minimal customization. 
- Drawbacks: EMR clusters are more suited for complex, large-scale transformations. The overhead of provisioning and managing clusters would be a barrier for this use case. C) Use Amazon Kinesis Data Analytics to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose. - Explana...
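The hourly summary itself is a tumbling-window aggregation, sketched here in plain Python. The records are hypothetical `(epoch_seconds, size_bytes)` pairs; a real record would carry 50 fields, and the chosen summary (count and total bytes) is only an example.

```python
from collections import defaultdict


def hourly_summary(records):
    """Group records into non-overlapping one-hour buckets."""
    buckets = defaultdict(lambda: {"count": 0, "bytes": 0})
    for epoch_seconds, size_bytes in records:
        hour_start = epoch_seconds - epoch_seconds % 3600
        buckets[hour_start]["count"] += 1
        buckets[hour_start]["bytes"] += size_bytes
    return dict(buckets)


records = [(0, 1000), (30, 900), (3599, 1024), (3600, 512)]
summary = hourly_summary(records)
print(summary)
```

The logic is simple; the operational cost in options A and B comes from scheduling it, scaling it, and keeping state across invocations, which a managed windowing service handles.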

Author: Amelia · Last updated Apr 3, 2026

A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in...

To improve the model results, we need to address the issue of overfitting while maintaining a reasonable level of feature selection without forcing all weights to zero. Let’s evaluate the options based on the situation: A) Increase the L1 regularization parameter. Do not change any other training parameters. - Explanation: Increasing the L1 regularization parameter will make the model more likely to drive more weights to zero, potentially leading to the situation where all features have zero weights, as seen in the current model behavior. This would worsen the issue rather than improve the model. - Why it’s not ideal: Over-regularizing with L1 will lead to too much sparsity in the model, which can result in a model that doesn't capture useful relationships in the data, leading to poor performance. - Key factor: Increasing L1 makes the regularization stronger, leading to more feature elimination (not useful in this case). B) Decrease the L1 regularization parameter. Do not change any other training parameters. - Explanation: Decreasing the L1 regularization parameter would reduce the penalty on the model’s weights, allowing the model to fit the data more flexibly. This could reduce the sparsity of the model and prevent all features from having zero weights, potentially improving the performance and reducing overfitting. - Why it works: By reducing the regularization strength, the model has more freedom to learn the relationships in the data while still preventing overfitting (because L1 regularization still helps to some extent). This is the most direct approach to fixing the issue of all weights being zero. - Key factor: Reducing L1 regularization allows the model to avoid extreme feature elimination, leading to a better fit. C) Introduce a large L2 regularization parameter. Do not change the current L1 regularizatio...
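Why a large L1 parameter zeroes every weight can be seen from the soft-thresholding operator. It is the exact Lasso solution only under an orthonormal design, so treat this as an illustration rather than a model of the specialist's actual training run; the weights are made up.

```python
def soft_threshold(weight, l1):
    """Shrink a weight toward zero by l1, clamping at zero."""
    if weight > l1:
        return weight - l1
    if weight < -l1:
        return weight + l1
    return 0.0


unregularized = [2.0, -0.5, 0.25]  # hypothetical least-squares weights

strong = [soft_threshold(w, 3.0) for w in unregularized]
weaker = [soft_threshold(w, 0.25) for w in unregularized]
print(strong, weaker)
```

With the penalty above the largest weight, everything collapses to zero (option A makes this worse); lowering it lets the strongest features survive while the weakest one is still pruned, which is the behavior option B aims for.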

Author: Vikram · Last updated Apr 3, 2026

A machine learning (ML) engineer has created a feature repository in Amazon SageMaker Feature Store for the company. The company has AWS accounts for development, integration, and production. The company hosts a feature store in the development account. The company uses Amazon S3 buckets to store feature values offline. The company wants to share features and to allow the integration ...

Key Considerations: - Feature Sharing Across Accounts: The company wants to allow the integration and production accounts to access and reuse the features from the feature repository in the development account. - S3 Access: The company also stores feature values offline in Amazon S3 buckets. These S3 buckets need to be shared with other accounts. - Security and Access Control: Ensuring secure access to the feature store and S3 buckets from the integration and production accounts is important. - Operational Efficiency: The solution should minimize manual steps and be easy to manage. Option Breakdown: A) Create an IAM role in the development account that the integration account and production account can assume. Attach IAM policies to the role that allow access to the feature repository and the S3 buckets. - Selection Reasoning: This option ensures secure cross-account access. By creating an IAM role in the development account that the integration and production accounts can assume, you can control access to the SageMaker Feature Store and S3 buckets securely. IAM policies can be tailored to grant the necessary permissions, and the cross-account access is managed through role assumption. This approach is simple and flexible. - Scenario Usefulness: This solution is ideal for securely sharing features across AWS accounts with a focus on control and ease of access. B) Share the feature repository that is associated with the S3 buckets from the development account to the integration account and the production account by using AWS Resource Access Manager (AWS RAM). - Rejection Reasoning: While AWS RAM is used to share AWS resources across accounts, it does not currently support SageMaker Feature Store as a resource type for sharing. Therefore, AWS RAM cannot be used for sharing the feature repository itself. - Scenario Usefulness: This option is not applicable for sharing SageMaker Feature Store as it is not supported by AWS RAM for direct sharing. 
C) Use AWS Security Token Service (AWS STS) from the integration account and the production account to retrieve credentials for the development account...
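The role described in option A needs a trust policy that lets the other accounts assume it. A minimal sketch, with placeholder account IDs standing in for the integration and production accounts:

```python
import json


def cross_account_trust_policy(account_ids):
    """Trust policy allowing the listed accounts to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": [f"arn:aws:iam::{a}:root" for a in account_ids]
            },
            "Action": "sts:AssumeRole",
        }],
    }


trust = cross_account_trust_policy(["111111111111", "222222222222"])
print(json.dumps(trust, indent=2))
```

Callers in the trusted accounts would then request temporary credentials for this role through AWS STS, and the permissions policies attached to the role decide what they can do with the feature repository and the S3 buckets.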

Author: Max · Last updated Apr 3, 2026

A company is building a new supervised classification model in an AWS environment. The company's data science team notices that the dataset has a large quantity of variables. All the variables are numeric. The model accuracy for training and validation is low. The model's processing time is affected by high latency. The data science tea...

To address the issues of low accuracy and high processing time, we need to look at each option carefully and understand how they might impact the model's performance. Option A: Create new features and interaction variables. - Pros: Creating new features or interaction terms can sometimes improve the accuracy of a model by providing more meaningful information to the classifier. This is particularly useful when the existing features are not capturing complex relationships in the data. - Cons: However, creating new features can increase the dimensionality, which might worsen the model's processing time, especially with a large number of variables. If the feature set is already large, this could lead to overfitting, and it may not necessarily solve the latency issue. - Best scenario: This option is useful if there is clear insight into how interactions between variables might improve the model, but it’s not directly related to dimensionality reduction or speeding up processing time. Option B: Use a principal component analysis (PCA) model. - Pros: PCA is a well-known technique for dimensionality reduction. It can significantly decrease the number of features while preserving most of the data’s variance. This will not only help reduce the dimensionality but also speed up the processing time. In turn, it could improve the accuracy, especially if the dataset has many highly correlated variables. - Cons: PCA is a linear transformation and may not capture non-linear relationships between features. If the relationships are complex and non-linear, this could limit the model's effectiveness. However, for a numeric dataset with high dimensionality, PCA can help reduce noise and simplify the model. - Best scenario: PCA is ideal for high-dimensional datasets with many numeric features and could help improve both accuracy and speed. Option C: Apply normalization on the feature s...
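The dimensionality reduction in option B can be sketched without any libraries by finding the first principal component with power iteration. The two-column dataset is invented and perfectly correlated, so one component captures all of its variance; the sketch assumes the data is not constant.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the top eigenvector of the covariance matrix and the 1-D scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(rows)
print(component, scores)
```

Two correlated columns collapse into one score per row, so a downstream classifier sees fewer inputs, which is the source of the processing-time benefit described above.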

Author: Ethan · Last updated Apr 3, 2026

An exercise analytics company wants to predict running speeds for its customers by using a dataset that contains multiple health-related features for each customer. Some of the features originate from sensors that provide extremely noisy values. The company is training a regression model by using the built-in Amazon SageMaker linear learner algorithm to predict the running speeds. While the company is training the model,...

Key Considerations: - Training Loss Decreases, Validation Loss Increases: This indicates overfitting, where the model performs well on the training data but poorly on unseen validation data. The model is memorizing the noise or specific patterns from the training data rather than generalizing. - Noisy Data: Some features are noisy, which might cause the model to overfit, as the model learns spurious relationships from the noise rather than true, underlying patterns. - Goal: The objective is to improve the model's generalization by minimizing overfitting and optimally fitting the model. Option Breakdown: A) Add L1 regularization to the linear learner regression model. - Selection Reasoning: L1 regularization (Lasso) is effective in reducing overfitting by adding a penalty to the absolute value of the coefficients. This penalty can drive some coefficients to zero, leading to a simpler and more interpretable model that potentially eliminates the impact of noisy features. It is particularly useful in the case of noisy features, as it encourages sparsity in the model and may help eliminate unimportant variables that are overfitting the model. - Scenario Usefulness: Given the noisy data, L1 regularization is a good choice as it can reduce the complexity of the model and help with overfitting by eliminating irrelevant features. B) Perform a principal component analysis (PCA) on the dataset. Use the linear learner regression model. - Rejection Reasoning: PCA is a dimensionality reduction technique that transforms the data into orthogonal components, which may reduce the noise by focusing on the principal components. However, it is not a targeted technique for overfitting. PCA does not directly address the overfitting issue in the model, especially in cases where noisy features still dominate the principal components. Additionally, PCA can make the model harder to interpret since the features will no longer correspond to the original variables. - Scenario Usefulness: Whil...
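The overfitting signal described under Key Considerations (training loss falling while validation loss rises) can be detected mechanically. The loss curves and the `patience` threshold below are illustrative assumptions, not values from the question.

```python
def first_overfitting_epoch(train_loss, val_loss, patience=2):
    """Return the first epoch of a run of `patience` epochs in which
    validation loss rises while training loss falls, or None."""
    streak = 0
    for epoch in range(1, len(val_loss)):
        diverging = (val_loss[epoch] > val_loss[epoch - 1]
                     and train_loss[epoch] < train_loss[epoch - 1])
        streak = streak + 1 if diverging else 0
        if streak == patience:
            return epoch - patience + 1
    return None


train = [1.00, 0.80, 0.65, 0.55, 0.48, 0.43]
val = [1.05, 0.90, 0.82, 0.85, 0.91, 0.98]
onset = first_overfitting_epoch(train, val)
print(onset)
```

Once the divergence point is known, regularization such as the L1 penalty in option A is a natural remedy, because it limits how much weight the model can place on noisy features.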

Author: Ethan · Last updated Apr 3, 2026

A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10,000 unlabeled images. All the images come from dash cameras and are a size of 224 pixels × 224 pixels. After several tr...

The problem of overfitting typically arises when the model learns to perform exceptionally well on the training data but fails to generalize to new, unseen data. In this scenario, the ML specialist is working with 100 labeled images per class and an additional 10,000 unlabeled images. Overfitting suggests that the model is learning noise or unnecessary details from the limited training data. Let's evaluate each option based on how it might address the overfitting problem: Option A: Use Amazon SageMaker Ground Truth to label the unlabeled images. - Pros: By labeling the 10,000 unlabeled images, the company can significantly increase the size of the labeled dataset, which is critical for reducing overfitting. More labeled data allows the model to better generalize and capture a broader range of patterns. This is especially important in computer vision tasks, where deep learning models often perform better with large, diverse datasets. - Cons: The process of labeling the unlabeled data is time-consuming and requires resources. However, this is a valid approach to improve model performance in the long run. - Best scenario: This is ideal if the primary issue is the limited number of labeled images. By expanding the dataset, the model can learn from a wider range of examples, reducing the risk of overfitting. Option B: Use image preprocessing to transform the images into grayscale images. - Pros: Grayscale images reduce the complexity of the input data by removing the color channels, which could make the model simpler and faster to train. - Cons: While this might reduce the model complexity, it could also lose important color information, especially for traffic signs where color might be a key distinguishing feature. For example, red and green traffic lights are easily distinguishable by color. Removing color might hurt the model’s performance, as it could lose this essential feature. 
- Best scenario: This would only be useful if the color information was irrelevant to the task, but for traffic sign classification, color is likely an important feature. Thus, this is not the best choice for this specific scenario. Option C: Use data augmentation to rotate and translate the labeled images. - Pros: Data augmentation is an effective method for combating overfitting. By rotating, translating, and applying other transformations (like flipping or scaling), you artificially increase the size and diversity of the training dataset. This allows the model to generalize better by being exposed to various variations of the tr...
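
As a rough illustration of the augmentation idea in option C, the sketch below produces transformed copies of a tiny image stored as a nested list. It is dependency-free on purpose: the helper names (`rotate_90`, `translate_right`, `augment`) are hypothetical, and the 90-degree rotation stands in for the small-angle rotations that a real image library would apply to traffic-sign photos.

```python
from typing import List

Image = List[List[int]]


def rotate_90(image: Image) -> Image:
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate_right(image: Image, shift: int, fill: int = 0) -> Image:
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in image]


def augment(image: Image) -> List[Image]:
    """Return the original plus transformed copies; all share the original label."""
    return [image, rotate_90(image), translate_right(image, 1)]


tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
for variant in variants:
    print(variant)
```

Each labeled image yields several training examples without any new labeling work, which is why augmentation is attractive when only 100 labeled images per class are available.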

Author: Sofia2021 · Last updated Apr 3, 2026

A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the t...

Let's evaluate the options in terms of their ability to meet the requirements of experimentation with feature transformations, visualization, and workflow automation, while aiming for operational efficiency. Option A: Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Use SageMaker Data Wrangler templates for visualization. Export the feature processing workflow to a SageMaker pipeline for automation. - Pros: - SageMaker Data Wrangler offers a powerful and user-friendly interface for experimenting with different feature transformations, including categorical encoding, data scaling, and feature engineering. The preconfigured transformations streamline the experimentation process. - SageMaker Data Wrangler templates are built specifically for visualizing datasets, making it easy to see the impact of the transformations. - SageMaker Pipelines provides an efficient way to automate the feature processing workflow. It allows the data science team to define a sequence of steps, track changes, and automate the workflow, making it ideal for operational efficiency and scalability. - Cons: - While SageMaker Pipelines is a great tool for automating workflows, it might require some initial setup, especially when defining complex workflows. - Best scenario: This is the most efficient solution for experimenting with feature transformations, visualizing them, and automating the workflow. SageMaker provides an integrated approach, streamlining all the steps from transformation to automation. Option B: Use an Amazon SageMaker notebook instance to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation. - Pros: - SageMaker notebook instance is flexible and allows for direct coding, offering complete control over transformations and experimentation. 
- QuickSight can be used to visualize the data, and AWS Lambda can automate tasks. - Cons: - This approach is less integrated than SageMaker Data Wrangler, meaning the team would need to manage more parts manually, such as saving the transformations to S3, setting up Lambda functions, and connecting the different components. This leads to lower operational efficiency. - Lambda would require creating custom functions to handle the feature transformation steps, which could become complex to manage as the workflow grows. - Best scenario: This is more suited to scenarios where custom processing and high flexibility are required, but it comes at the cost of increased complexity and less operational efficiency. Option C: Use AWS Glue Studio with custom code to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualizat...

Author: ThunderBear · Last updated Apr 3, 2026

A company plans to build a custom natural language processing (NLP) model to classify and prioritize user feedback. The company hosts the data and all machine learning (ML) infrastructure in the AWS Cloud. The ML team works from the company's office, which has an IPsec VPN connection to one VPC in the AWS Cloud. The company has set both the enableDnsHostnames attribute and the enableDnsSupport attribute of the VPC to true. The company's DNS resolvers point to the VPC DNS. The company does not allow the ML team to access Amazon SageM...

To solve this problem, the company needs to allow the ML team to access Amazon SageMaker notebooks while ensuring that the connection remains within the private AWS network and does not use the public internet. We also need to minimize development effort. Let’s evaluate each option based on these factors. Option A: Create a VPC interface endpoint for the SageMaker notebook in the VPC. Access the notebook through a VPN connection and the VPC endpoint. - Pros: - VPC interface endpoints (powered by AWS PrivateLink) allow private connections to Amazon services such as SageMaker without going over the public internet. This ensures all traffic stays within the private AWS network. - The VPN connection from the company’s office ensures secure access to the VPC over the private network. - This option requires minimal development effort since it uses native AWS services (VPC endpoint) to facilitate private communication without needing additional components. - Cons: None specific, as this is a well-suited solution for this use case. - Best scenario: This is the best option for this scenario because it meets all the requirements: it keeps the connection within the private network, avoids the public internet, and uses the least amount of additional infrastructure (no need for bastion hosts or NAT gateways). Option B: Create a bastion host by using Amazon EC2 in a public subnet within the VPC. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host. - Pros: - This setup would work for accessing the SageMaker notebook via a bastion host. - Cons: - The bastion host is in a public subnet, meaning it still requires internet access. This violates the requirement of keeping the connection within a private network, as the bastion host would need a public IP and internet access to connect to the SageMaker notebook, which contradicts the no-public-internet access requirement. 
- This setup requires more management, as you'd need to configure the bastion host and ensure secure access. It's also less efficient than using a VPC endpoint. - Best scenario: This option might be suitable in other cases where there are strict requireme...

Author: Isabella · Last updated Apr 3, 2026

A data scientist is using Amazon Comprehend to perform sentiment analysis on a dataset of one million social media posts. Whi...

To determine the best approach for processing the dataset of one million social media posts using Amazon Comprehend for sentiment analysis, we need to consider the scalability, efficiency, and processing time of each option. Below is an analysis of each approach. A) Use a combination of AWS Step Functions and an AWS Lambda function to call the DetectSentiment API operation for each post synchronously. - Reasoning: This approach processes each social media post individually, calling the `DetectSentiment` API synchronously. Because posts are handled one at a time, one million posts require one million API calls, and the processing time would be prohibitively long. - Why it's rejected: This approach is not efficient for a large dataset. It does not scale well and is time-consuming due to the one-by-one processing of each post. B) Use a combination of AWS Step Functions and an AWS Lambda function to call the BatchDetectSentiment API operation with batches of up to 25 posts at a time. - Reasoning: The `BatchDetectSentiment` API can process multiple posts (up to 25) in a single API call. This is much more efficient than processing each post individually. However, the batch size of 25 is small compared to the size of the dataset, so while this is an improvement over option A, it would still take a considerable number of API calls (40,000) to process one million posts. - Why it's rejected: Although batching is an improvement, it still doesn't scale efficiently for such a large dataset. The number of API calls required (one per batch of 25 posts) would still take a significant amount of time. C) Upload the posts to Amazon S3. Pass the S3 storage path to an AWS Lamb...
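
The call-count comparison between options A and B can be checked with a few lines of arithmetic. This is only a sketch: the `batches` helper and the sample posts are hypothetical, no AWS API is called, and the batch limit of 25 is taken from the wording of option B.

```python
import math
from typing import Iterator, List

BATCH_LIMIT = 25  # per-call post limit as stated in option B


def batches(posts: List[str], size: int = BATCH_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` posts."""
    for start in range(0, len(posts), size):
        yield posts[start:start + size]


total_posts = 1_000_000
calls_single = total_posts                             # option A: one call per post
calls_batched = math.ceil(total_posts / BATCH_LIMIT)   # option B: one call per batch

print("calls without batching:", calls_single)
print("calls with batches of 25:", calls_batched)

# Small demonstration of the chunking itself on hypothetical data.
sample = [f"post {i}" for i in range(60)]
sizes = [len(batch) for batch in batches(sample)]
print("batch sizes for 60 posts:", sizes)
```

Batching cuts the number of requests by a factor of 25, but tens of thousands of synchronous calls still have to be orchestrated, which is the scaling concern raised against option B.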

Author: Ming88 · Last updated Apr 3, 2026

A machine learning (ML) specialist at a retail company must build a system to forecast the daily sales for one of the company's stores. The company provided the ML specialist with sales data for this store from the past 10 years. The historical dataset includes the total amount of sales on each day for the store. Approximately 10% of the days in the historical dataset are missing sales data. The ML specialist builds a forecasting model based on the historical...

To determine the most likely action to improve the performance of the forecasting model, we need to consider both the issue of missing data and the overall structure of the dataset. A) Aggregate sales from stores in the same geographic area. - Reasoning: Aggregating sales from other stores in the same geographic area may help if the model is not accounting for external factors that could influence sales at the individual store level. However, this action may introduce noise if the sales patterns of nearby stores are significantly different or if they do not represent the target store well. The problem here is that this option may not directly address the missing data issue or improve forecasting accuracy for the specific store. - Why it's rejected: While it could offer some improvement in certain cases, this does not directly address the core issue with the current model, which seems to be related to missing data and daily forecasting accuracy for the single store. B) Apply smoothing to correct for seasonal variation. - Reasoning: Smoothing techniques can help the model account for trends and seasonal patterns in the data, especially if the sales data exhibits cyclical fluctuations. However, smoothing doesn't directly address the missing data or data quality issues, which is a significant part of the problem in this case. It may improve the model to some extent, but it’s unlikely to resolve the underlying issue of missing data. - Why it's rejected: This option could help in improving the model by addressing seasonality, but it doesn't solve the critical problem related to missing values, which seems to be a more pressing issue in improving the forecasting model. C) Change the forecast frequency from daily to weekly. - Reasoning: Changing the forecast frequency from daily to weekly might reduce the impact of missing data, as the model would n...

Author: Benjamin · Last updated Apr 3, 2026

A mining company wants to use machine learning (ML) models to identify mineral images in real time. A data science team built an image recognition model that is based on convolutional neural network (CNN). The team trained the model on Amazon SageMaker by using GPU instances. The team will deploy the model to a SageMaker endpoint. The data science team already knows the workload traffi...

Key Considerations: - Real-Time Inference: The goal is to perform real-time inference, which requires careful selection of instance types to balance performance (latency and throughput) and cost. - GPU vs CPU: Since the model is based on a convolutional neural network (CNN), it benefits significantly from GPU instances, especially for tasks that involve heavy image processing. - Traffic Pattern Knowledge: The data science team already knows the workload traffic patterns, which can help in determining the right instance type and configuration. Option Breakdown: A) Register the model artifact and container to the SageMaker Model Registry. Use the SageMaker Inference Recommender Default job type. Provide the known traffic pattern for load testing to select the best instance type and configuration based on the workloads. - Selection Reasoning: SageMaker Inference Recommender is a service designed to automatically help you choose the right instance type and configuration for real-time inference based on traffic patterns and model requirements. It can consider factors like traffic patterns, model size, and inference latency, providing an optimized instance configuration with minimal effort. Using the Default job type allows SageMaker to test different instance types and configurations based on the provided traffic data and suggest the best configuration for the workload. This is a managed solution with minimal development overhead. - Scenario Usefulness: This solution requires minimal development effort and automatically provides the best instance selection based on known traffic patterns. It’s a perfect fit for scenarios where you want to optimize cost and performance without manually testing and tuning. B) Register the model artifact and container to the SageMaker Model Registry. Use the SageMaker Inference Recommender Advanced job type. Provide the known traffic pattern for load testing to select the best instance type and configuration based on the workloads. 
- Rejection Reasoning: The Advanced job type provides more flexibility in customizing load tests and configurations, but this option is likely overkill for this use case. Since the team already knows the traffic pattern and does not need the additional customization that the advanc...

Author: BlazingPhoenix22 · Last updated Apr 3, 2026

A company is building custom deep learning models in Amazon SageMaker by using training and inference containers that run on Amazon EC2 instances. The company wants to reduce training costs but does not want to change the current architecture. The SageMaker training job can finish after interruptions. The company can wait da...

To reduce training costs while maintaining the current architecture and requirements, we need to evaluate the best combinations of resources. The key constraints are that the SageMaker training job can finish after interruptions and that the company is willing to wait for days to get results. This suggests that options allowing for interruption without negatively impacting the training process are ideal. Let’s analyze the options: A) On-Demand Instances - Reasoning: On-Demand Instances are typically the most expensive option because they are billed per second with no long-term commitment. While they provide flexibility in terms of scaling and capacity, they do not meet the cost reduction goal, especially when the company can tolerate interruptions and delays. - Why it’s rejected: This option would result in higher costs compared to other options like Spot Instances, which are designed for cost-saving and work well in scenarios where interruptions are acceptable. B) Checkpoints - Reasoning: Checkpoints involve saving the state of the training job at regular intervals, which allows a model to resume training from the last checkpoint after an interruption. This is particularly useful when using Spot Instances (which can be interrupted), as it prevents the need to start the training from scratch after an interruption. - Why it’s selected: Checkpoints are essential in reducing the risk of data loss during training interruptions, particularly in combination with Spot Instances. This is a cost-effective option that ensures the company does not need to restart the entire training process in the case of interruptions. C) Reserved Instances - Reasoning: Reserved Instances offer a discount compared to On-Demand Instances, but they require a one- or three-year commitment. While they provide cost savings, they are better suited for workloads with predictable usage over long periods of time. 
Since the company can tolerate delays and interruptions, Reserved Instances may not be the most suitable option. - Why it’s rejected: The company’s training jobs are not time-sensitiv...
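
The checkpoint-and-resume behavior described in option B can be sketched locally. Everything below is an illustrative assumption: the JSON checkpoint layout, the function names, and the simulated interruption are not SageMaker APIs. In a real SageMaker job, the equivalent is managed Spot training with a checkpoint location in Amazon S3 (the SageMaker Python SDK exposes estimator settings such as `use_spot_instances` and `checkpoint_s3_uri` for this).

```python
import json
import os
import tempfile
from typing import Optional


def load_checkpoint(path: str) -> int:
    """Return the next epoch to run, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_epoch"]


def save_checkpoint(path: str, next_epoch: int) -> None:
    """Persist training progress so a later run can resume from it."""
    with open(path, "w") as f:
        json.dump({"next_epoch": next_epoch}, f)


def train(total_epochs: int, path: str, interrupt_at: Optional[int] = None) -> int:
    """Resume from the last checkpoint; return how many epochs this call executed."""
    executed = 0
    for epoch in range(load_checkpoint(path), total_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            return executed  # simulated Spot interruption before this epoch runs
        # ... one epoch of real training would go here ...
        executed += 1
        save_checkpoint(path, epoch + 1)
    return executed


checkpoint_file = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = train(10, checkpoint_file, interrupt_at=6)   # interrupted partway through
second_run = train(10, checkpoint_file)                  # resumes, does not restart
print("epochs in first run:", first_run)
print("epochs in second run:", second_run)
```

The second run only executes the epochs that were still outstanding, so an interruption costs at most the work done since the last checkpoint rather than the whole job.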

Author: MoonlitPantherX · Last updated Apr 3, 2026

A company hosts a public web application on AWS. The application provides a user feedback feature that consists of free-text fields where users can submit text to provide feedback. The company receives a large amount of free-text user feedback from the online web application. The product managers at the company classify the feedback into a set of fixed categories including user interface issues, performance issues, new feature request, and chat issues for further actions by the company's engineering teams. A machine learning (ML) engineer at the company must automate the classification of new user feedback into these fixed c...

To automate the classification of user feedback into predefined categories, the machine learning (ML) engineer needs a solution that can handle multi-class text classification. Let's break down each option and determine the best approach based on the given requirements. A) Use the SageMaker Latent Dirichlet Allocation (LDA) algorithm. - Reasoning: LDA is a probabilistic generative model that is primarily used for topic modeling and discovering latent topics in large collections of text. It is useful for unsupervised learning where the goal is to identify topics in a corpus without predefined labels. However, LDA is not typically used for supervised classification tasks where you need to map text into specific categories (such as the fixed categories mentioned here). - Why it’s rejected: LDA is not designed for supervised classification and would not be suitable for performing multi-class text classification based on predefined categories. It is better suited for unsupervised tasks like topic discovery. B) Use the SageMaker BlazingText algorithm. - Reasoning: BlazingText provides optimized implementations of Word2vec (for word embeddings) and of a fastText-style text classifier. It is particularly well-suited for large-scale text classification problems and can handle multi-class classification. BlazingText supports both supervised learning (for tasks like classification) and unsupervised learning (for word embedding generation). - Why it’s selected: BlazingText is optimized for text classification and can efficiently handle the type of task described here. It supports fast training on large text datasets, making it an excellent choice for classifying free-text feedback into predefined categories. It learns word embeddings that capture the semantics of text data, resulting in accurate classification for tasks like...
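
For supervised mode, BlazingText expects training data as one example per line, with each label prefixed by `__label__` and followed by space-separated tokens. The sketch below shows one way to prepare such lines; the helper name `to_blazingtext_line`, the simple regex tokenizer, and the sample feedback are illustrative assumptions rather than anything prescribed by the question.

```python
import re
from typing import List, Tuple


def to_blazingtext_line(label: str, text: str) -> str:
    """Format one labeled example as '__label__<label> token token ...'."""
    # Labels must be single tokens, so spaces are replaced with underscores.
    label_token = "__label__" + re.sub(r"\s+", "_", label.strip().lower())
    # Naive tokenizer: lowercase words, with punctuation split into its own tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return label_token + " " + " ".join(tokens)


# Hypothetical feedback using two of the categories named in the question.
feedback: List[Tuple[str, str]] = [
    ("user interface issues", "The settings button is hidden."),
    ("performance issues", "Pages load slowly!"),
]

lines = [to_blazingtext_line(label, text) for label, text in feedback]
for line in lines:
    print(line)
```

Because the product managers have already classified historical feedback, those existing labels can be converted into this format directly, which is what makes a supervised algorithm a natural fit here.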

Author: Zara · Last updated Apr 3, 2026

A digital media company wants to build a customer churn prediction model by using tabular data. The model should clearly indicate whether a customer will stop using the company's services. The company wants to clean the data because the data contains some empty fields, ...

To select the best solution for building a customer churn prediction model with the least development effort, we need to evaluate each option based on the company's goals, including the data cleaning requirements and the need for a predictive model. Let’s break down the options: A) Use SageMaker Canvas to automatically clean the data and to prepare a categorical model. - Pros: SageMaker Canvas offers a no-code solution for building machine learning models. It provides automatic data cleaning, model training, and deployment in a user-friendly interface. - Cons: The model will be categorical (binary or multi-class classification). However, churn prediction is generally a classification problem where the outcome (whether a customer will churn or not) is binary. So, there is no major issue with this approach for the classification problem, but it may lack flexibility for complex customizations. B) Use SageMaker Data Wrangler to clean the data. Use the built-in SageMaker XGBoost algorithm to train a classification model. - Pros: Data Wrangler is a powerful tool for cleaning and preparing data, and it provides various options to handle missing values, duplicates, and rare values. XGBoost is a robust algorithm for classification tasks and can handle tabular data well, especially for churn prediction. - Cons: While this approach provides more flexibility than Canvas (especially in terms of custom feature engineering), it requires more development effort since the user has to manually prepare the data and configure the XGBoost model. C) Use SageMaker Canvas automatic data cleaning and preparation tools. Use the built-in SageMaker XGBoost algorithm to train a regression model. - Pros: Canvas offers automatic data cleaning, and XGBoost is effective for regression tasks as well. - Cons: T...

Author: Scarlett · Last updated Apr 3, 2026

A data engineer is evaluating customer data in Amazon SageMaker Data Wrangler. The data engineer will use the customer data to create a new model to predict customer behavior. The engineer needs to increase the model performance by checking for multicollinearity in the ...

Key Considerations: - Multicollinearity: This refers to a situation where independent variables in a dataset are highly correlated, making it difficult to assess the individual effect of each variable on the dependent variable. To address this, the data engineer needs to detect and potentially address multicollinearity by analyzing the correlation structure of the features. - Operational Effort: The goal is to choose solutions that minimize manual effort and are integrated into the existing tools in SageMaker Data Wrangler. Option Breakdown: A) Use SageMaker Data Wrangler to refit and transform the dataset by applying one-hot encoding to category-based variables. - Rejection Reasoning: One-hot encoding is a technique used for converting categorical variables into a numerical format for model training, but it does not directly address multicollinearity. While it is essential for preparing data, it does not help in identifying or reducing multicollinearity. - Scenario Usefulness: This step would be useful in preprocessing categorical data but does not specifically help with multicollinearity. B) Use SageMaker Data Wrangler diagnostic visualization. Use principal components analysis (PCA) and singular value decomposition (SVD) to calculate singular values. - Selection Reasoning: PCA and SVD are powerful techniques to detect multicollinearity. They help in identifying the correlation structure of the data and in reducing dimensionality. PCA can be used to transform correlated features into uncorrelated principal components, which directly addresses multicollinearity. SageMaker Data Wrangler’s diagnostic visualizations can help in quickly identifying relationships between features and visualizing potential multicollinearity. - Scenario Usefulness: This solution provides a direct and efficient way to address multicollinearity with minimal manual effort, using built-in tools for analysis. 
C) Use the SageMaker Data Wrangler Quick Model visualization to quickly evaluate the dataset and produce importance scores for ea...
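
As a simpler stand-in for the PCA/SVD analysis in option B, pairwise correlation already exposes the most obvious cases of multicollinearity. The sketch below is not Data Wrangler functionality: the feature names, the values, and the 0.9 threshold are hypothetical, and Pearson correlation only detects pairwise linear dependence, not dependence among three or more features.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical customer features; annual_spend is roughly 12 x monthly_spend.
features: Dict[str, List[float]] = {
    "monthly_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "annual_spend": [121.0, 239.0, 362.0, 478.0, 601.0],
    "support_tickets": [3.0, 1.0, 4.0, 1.0, 5.0],
}
THRESHOLD = 0.9  # assumed cut-off for flagging a pair as highly correlated

flagged: List[Tuple[str, str, float]] = []
for name_a, name_b in combinations(features, 2):
    r = pearson(features[name_a], features[name_b])
    if abs(r) >= THRESHOLD:
        flagged.append((name_a, name_b, r))

for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f})")
```

A flagged pair suggests that one of the two features can be dropped or combined before training; the singular-value approach in option B generalizes this check to linear dependence across many features at once.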

Author: Isabella · Last updated Apr 3, 2026

A company processes millions of orders every day. The company uses Amazon DynamoDB tables to store order information. When customers submit new orders, the new orders are immediately added to the DynamoDB tables. New orders arrive in the DynamoDB tables continuously. A data scientist must build a peak-time prediction solution. The data scientist must also create an Amazon QuickSight dashboard to display near real-time order insights. The data scientist needs to build a solution that will give QuickSight acc...

To meet the requirements of displaying near real-time order insights in Amazon QuickSight with the least delay between when a new order is processed and when QuickSight can access the new order information, let’s evaluate each option: A) Use AWS Glue to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. - Reasoning: AWS Glue is an ETL (Extract, Transform, Load) service that can export data from DynamoDB to Amazon S3. However, AWS Glue typically runs on a schedule (e.g., hourly or daily), so there may be some delay in transferring data from DynamoDB to S3. This approach is not ideal for near real-time updates because of the potential lag between when data is added to DynamoDB and when it is available in S3 for QuickSight. - Rejection: This solution may introduce unnecessary delays, making it unsuitable for near real-time access to the data. B) Use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. - Reasoning: Amazon Kinesis Data Streams can be used to capture real-time changes in DynamoDB and stream them to other services, like S3. However, Kinesis Data Streams requires more management, and although it can deliver data in near real-time, this solution still requires additional steps to store data in S3 and for QuickSight to access it from there. - Rejection: While Kinesis can provide near real-time streaming, the solution adds complexity by involving Kinesis streams and storing the data in S3. The data flow introduces additional steps, making it less efficient compared to direct solutions. C) Use an API call from QuickSight to access the data that is in Amazon DynamoDB directly. - Reasoning: QuickSight can connect directly to DynamoDB using its built-in data source integration. QuickSight can retrieve data from DynamoDB in near real-time without the need for intermediate steps like exporting data to S3. 
However, it’s important to note that QuickSight may not be optimized for handling large datasets in DynamoDB in rea...

Author: Siddharth · Last updated Apr 3, 2026

What Our Friends Say

A company's data engineer wants to use Amazon S3 to share datasets with data scientists. The data scientists work in three departments: Finance, Marketing, and Human Resources. Each department has its own IAM user group. Some datasets contain sensitive information and should be accessed on...

A company operates an amusement park. The company wants to collect, monitor, and store real-time traffic data at several park entrances by using strategically placed cameras. The company's security team must be able to immediately access the data for viewing. Stored data must be indexed and...

A machine learning (ML) engineer is integrating a production model with a customer metadata repository for real-time inference. The repository is hosted in Amazon SageMaker Feature Store. The engineer wants to retrieve only the latest version of th...

A manufacturing company wants to create a machine learning (ML) model to predict when equipment is likely to fail. A data science team already constructed a deep learning model by using TensorFlow and a custom Python script in a local environment. The company wants to use Amaz...

A data scientist wants to improve the fit of a machine learning (ML) model that predicts house prices. The data scientist makes a first attempt to fit the model, but the fitted model has poor accuracy on both the training dataset and th...

A car company is developing a machine learning solution to detect whether a car is present in an image. The image dataset consists of one million images. Each image in the dataset is 200 pixels in height by 200 pixels in width. Each image is labeled as either having a car or not having a car. W...

A music streaming company is building a pipeline to extract features. The company wants to store the features for offline model training and online inference. The company wants to track feature history and to give the company's data science teams a...

A beauty supply store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to generate a report of hourly visitors from the recordings. The report should group visi...

A media company wants to create a solution that identifies celebrities in pictures that users upload. The company also wants to identify the IP address and the timestamp details from the users so the company can prevent users from uploading pictures...

A company needs to deploy a chatbot to answer common questions from customers. The chatbot must base its answers on company documentation. Which soluti...

A company hosts a machine learning (ML) dataset repository on Amazon S3. A data scientist is preparing the repository to train a model. The data scientist needs to redact personally identifiable information (PII) from ...

A company wants to enhance audits for its machine learning (ML) systems. The auditing system must be able to perform metadata analysis on the features that the ML models use. The audit solution must generate a report that analyzes the metadata. The solution also must be able to set the da...

A machine learning (ML) specialist uploads a dataset to an Amazon S3 bucket that is protected by server-side encryption with AWS KMS keys (SSE-KMS). The ML specialist needs to ensure that an Amazon SageMaker notebook inst...

A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the...

A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in...
