Amazon Exam Practice Questions - Page 89

Amazon Practice Questions, Discussions & Exam Topics by our Authors

A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only spec...

To meet the requirements of encrypting S3 objects containing sensitive customer information and restricting access to the encryption keys, let's evaluate each option based on factors such as security, ease of implementation, and management overhead. A) Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster. - Reason for rejection: Using AWS CloudHSM for managing encryption keys is a highly secure approach, but it requires significant operational overhead. You would need to configure and manage the CloudHSM cluster, integrate encryption and decryption into the S3 upload process, and ensure that the necessary IAM policies are correctly defined. This introduces complexity and is not the most efficient solution for this scenario, especially if the goal is to minimize effort while meeting the security requirements. CloudHSM is typically used for applications that require high-level, manual control over keys and cryptographic operations. B) Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects. - Reason for rejection: SSE-C requires the company to manage the encryption keys independently. While this provides control over key management, it also places a high burden on the company to ensure key protection, rotation, and security. Specifically, SSE-C does not provide integrated key management services, making it less convenient compared to solutions that leverage AWS managed services for key storage and access control. This option is also prone to key management challenges, especially for large-scale operations. C) Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contai...
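As a rough illustration of what option C asks for on the write path, the request parameters for an SSE-KMS upload could be assembled as shown below. This is a minimal sketch: the bucket name, object key, and KMS key ARN are hypothetical placeholders, and `build_sse_kms_put_args` is an illustrative helper, not part of any AWS SDK. The dictionary uses the `ServerSideEncryption` and `SSEKMSKeyId` parameter names accepted by boto3's `put_object`; restricting who may use the key would still be done separately, in the KMS key policy.

```python
def build_sse_kms_put_args(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Return keyword arguments for an S3 PutObject call that requests SSE-KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # ask S3 to encrypt with a KMS key
        "SSEKMSKeyId": kms_key_arn,         # the specific key to use
    }


# Hypothetical values, for illustration only.
put_args = build_sse_kms_put_args(
    bucket="example-call-logs",
    key="2026/05/21/call-0001.json",
    body=b"{}",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)

# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("s3").put_object(**put_args)
```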

Author: Lucas · Last updated May 21, 2026

A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns. The company does not access some data for months. However, the company must be able to retrieve all data within milli...

To optimize storage costs for petabytes of data with unpredictable and variable access patterns while ensuring that all data can be retrieved within milliseconds, let’s evaluate the options based on their ability to minimize cost, maximize access speed, and require minimal operational overhead. A) Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs. - Reason for rejection: S3 Storage Lens provides insights into storage usage and activity patterns, but this option involves manually creating and refining S3 Lifecycle policies to move objects between storage classes. While this can be effective for some use cases, it requires ongoing management and tuning, which introduces operational overhead. Additionally, you would need to continuously analyze data access patterns and refine policies, which could be cumbersome, especially with unpredictable access patterns. B) Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data. - Reason for rejection: This option involves using S3 Storage Lens to gather activity metrics and then manually configuring S3 Lifecycle rules to move data to S3 Standard-IA and S3 Glacier based on the age of data. While this approach could optimize costs by moving infrequently accessed data to cheaper storage classes, it still requires manual intervention and constant updates to the Lifecycle rules. 
Moreover, S3 Glacier is not suitable for retrieval within milliseconds, as it is designed for long-term archival and may have retrieval times ranging from minutes to hours, which doesn't meet the retrieval speed requirement in the question. C) Use S3 Intelli...

Author: Maya2022 · Last updated May 21, 2026

During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script. A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must sec...

To remediate the security vulnerability in the AWS Glue job, the primary goal is to securely store the credentials and avoid hard-coding them directly in the job script. Here's a breakdown of the options: Option A: Store the credentials in the AWS Glue job parameters. - Reasoning: AWS Glue job parameters are not ideal for securely storing credentials. They can be visible in logs or potentially accessed by unauthorized users. They lack the encryption and security features that are critical for sensitive data like credentials. Therefore, this option is not recommended for securely storing credentials. Option B: Store the credentials in a configuration file that is in an Amazon S3 bucket. - Reasoning: Storing credentials in an S3 bucket is a potential security risk, especially if the S3 bucket is not properly secured with encryption and access controls. While it can work, it is not the most secure approach because S3 buckets can be inadvertently exposed if not configured correctly. This option is not recommended for storing credentials securely. Option C: Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job. - Reasoning: This option is similar to Option B. While it suggests using the configuration file from S3 within the Glue job, it still relies on the security of the S3 bucket. If the ...

Author: Emily · Last updated May 21, 2026

A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket. The data engineer needs a solution...

To meet the requirement of running monthly analytics processes with minimal manual intervention and infrastructure management, let's evaluate each option based on the operational overhead and suitability for the task: Option A: Use Amazon Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month. - Reasoning: While Step Functions could be used to orchestrate workflows, this option does not reduce the need to manage the Redshift cluster itself. Pausing and resuming a Redshift cluster still require a provisioned cluster, and this would still involve maintaining and managing the lifecycle of the cluster manually. This approach doesn't fully address the need to avoid cluster management each month, and would introduce additional complexity by using Step Functions. Hence, this option is not the most efficient for reducing operational overhead. Option B: Use Amazon Redshift Serverless to automatically process the analytics workload. - Reasoning: Amazon Redshift Serverless is designed specifically to handle analytics workloads without requiring the user to manage clusters. It automatically adjusts the compute capacity based on the workload and automatically handles scaling, which reduces manual infrastructure management. This solution is ideal for intermittent workloads like monthly analytics, as it allows the data engineer to focus on the analysis without needing to provision or decommission clusters each month. This option is the most suitable because it eliminates cluster management entirely and fits the scenario perfectly. ...

Author: Ella · Last updated May 21, 2026

A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size. A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer ne...

Let's evaluate each option to determine the solution with the least operational effort for determining the number of distinct customers in the file. Option A: Create and run an Apache Spark job in an AWS Glue notebook. Configure the job to read the S3 file and calculate the number of distinct customers. - Why it’s rejected: While this solution would work, it requires the creation and execution of a Spark job within AWS Glue. This involves some level of coding and configuration, which introduces operational overhead. Apache Spark provides powerful data processing capabilities, but for this simple task (counting distinct customers), this approach adds unnecessary complexity. Option B: Create an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file. Run SQL queries from Amazon Athena to calculate the number of distinct customers. - Why it’s selected: This solution involves minimal operational effort. An AWS Glue crawler can infer the schema and populate the Data Catalog, provided that the file is in a format that the crawler's classifiers and Athena can read; Excel .xls is not among the crawler's built-in classifiers, so the file would likely need to be converted (for example, to .csv) first. Once the data is cataloged, you can use Amazon Athena (a serverless query service) to run SQL queries directly on the S3 data. This requires no infrastructure management and very little setup. A single `COUNT(DISTINCT column)` query counts the distinct customers. This is a straightforward, low-maintenance solution. Option C: Create and run an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers. - Why it’s rejected: Amazon EMR Serverless provides a scalable environment for running Spark jobs, but using it for a task as simple as counting distinct values in a file is overkill. EMR is designed for more complex data processing workflows, and this option would require more operational overhead in terms of configuring and managing the job than other options like Athena. Additionally, it might incur more costs due to the und...
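The `COUNT(DISTINCT ...)` query mentioned for option B can be tried locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Athena; the table name `customers` and the column names `first_name` and `last_name` are assumptions made for illustration, not names from the question.

```python
import sqlite3


def count_distinct_customers(rows):
    """Count distinct (first name + last name) combinations with plain SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    # Concatenate the two name columns, then count the distinct results.
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT first_name || ' ' || last_name) FROM customers"
    ).fetchone()
    conn.close()
    return count


sample_rows = [("Ana", "Silva"), ("Ana", "Silva"), ("Bo", "Chen")]
distinct_customers = count_distinct_customers(sample_rows)
print(distinct_customers)
```

The same `SELECT` shape should carry over to Athena, which also supports `||` for string concatenation, once the data is available as a cataloged table.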

Author: Ming · Last updated May 21, 2026

A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records. A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near rea...

Let's analyze each option to determine the best solution for processing the streaming health data with minimal operational overhead and real-time analytics capabilities. Option A: Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift. - Reasoning: Amazon Kinesis Data Firehose is a managed service designed for streaming data ingestion into storage or analytics services. It can deliver data to Amazon Redshift, but it does so by staging the data in Amazon S3 and then issuing a COPY command, so delivery is buffered rather than immediate. The advantage of this approach is that it abstracts much of the complexity of managing data streams and supports near real-time data delivery. Since Firehose automatically handles retries and data transformations (if needed), it reduces operational overhead, although it adds a delivery stream and a staging bucket to the pipeline. Option B: Use the streaming ingestion feature of Amazon Redshift. - Reasoning: Amazon Redshift's streaming ingestion feature allows data to be ingested directly from Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) into a Redshift materialized view in near real time. This feature is well-suited for handling continuous streams of data and performing real-time analytics without significant operational overhead. Since Redshift Serverless supports streaming ingestion natively, this would be an ideal solution for processing and storing streaming data directly into Redshift with minimal complexity. This solution offers seamless integration with Kinesis, optimized performance, and lower operational overhead. Option C: Load the data into Amazon S3. Use the COPY command to load the data into Amazon Red...

Author: Olivia · Last updated May 21, 2026

A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the QuickSight dashboard, the data engineer receives an error message that...

Let's evaluate the potential causes of permissions-related errors in the scenario where a data engineer is trying to use an Amazon QuickSight dashboard based on Amazon Athena queries on data stored in Amazon S3. Option A: There is no connection between QuickSight and Athena. - Why it’s rejected: If there were no connection between QuickSight and Athena, the data engineer would not be able to see any data or generate any queries at all, which is a different type of error. Since the error is permissions-related, this is not the root cause. It's likely the connection exists, but there are insufficient permissions for QuickSight to access the necessary resources. Option B: The Athena tables are not cataloged. - Why it’s rejected: While Athena tables need to be cataloged in the AWS Glue Data Catalog or an internal Athena catalog to query data, this wouldn't directly cause permissions-related errors. If the tables are not cataloged, you would likely see an error about missing tables or data, not a permissions error. Therefore, this is not the likely cause of the issue. Option C: QuickSight does not have access to the S3 bucket. - Why it’s selected: If QuickSight does not have the necessary permissions to access the S3 bucket where the data is stored, it would result in a permissions-related error when attempting to query the data via Athena. For QuickSight to access S3, it requires an IAM role that allows access to the relevant S3 bucket. Without the proper S3 permissions, the data engineer would encounter the described error. Option D: ...
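To make the S3 permission in option C concrete, the sketch below builds an IAM policy document that grants read access to one bucket. The bucket name is a hypothetical placeholder and `s3_read_policy` is an illustrative helper; a real setup would likely need more than this (for example, write access to the Athena query results location), so treat it as a starting point rather than a complete policy.

```python
import json


def s3_read_policy(bucket: str) -> dict:
    """Build an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions apply to the objects under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


read_policy = s3_read_policy("example-dashboard-data")  # hypothetical bucket name
print(json.dumps(read_policy, indent=2))
```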

Author: Zara · Last updated May 21, 2026

A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the abili...

To solve the problem, the key requirements are: - Querying various data sources (Amazon RDS for Microsoft SQL Server, DynamoDB in provisioned capacity mode, and Amazon Redshift) using SQL-like syntax. - Handling JSON and .csv data formats. - Minimizing operational overhead. Let’s evaluate each option: Option A: Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Amazon Athena to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format. - Pros: - AWS Glue can efficiently crawl and catalog the metadata of the data sources, including S3, DynamoDB, and relational databases. - Amazon Athena enables SQL-based querying of data directly in S3. It can handle both structured data and semi-structured formats such as JSON with PartiQL (which allows SQL-like querying of JSON). - Minimal overhead as Athena is serverless and doesn’t require provisioning or managing infrastructure. - Cons: - DynamoDB in provisioned capacity mode can lead to performance issues if not managed well, especially with large queries. - Not directly suited for querying Amazon RDS for Microsoft SQL Server, though you can potentially use Glue connectors. - Best Use Case: When the main focus is on querying data directly from S3 (both JSON and CSV) and ensuring SQL-like access, using Athena with Glue crawlers and Data Catalog is a solid approach. Option B: Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Redshift Spectrum to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format. - Pros: - Redshift Spectrum allows you to query data directly in S3 using SQL, making it possible to query both structured and semi-structured data. - You can use PartiQL for JSON and regular SQL for structured data. - Cons: - Redshift Spectrum requires a running Redshift cluster, which increases operational overhead. 
It’s also more complex and may require scaling depending on the query volume. - Operational overhead is higher because Redshift Spectrum requires configuring and managing the Redshift cluster. - Best Use Case: When there is a need to leverage Redshift for both OLAP and querying S3 data, but it adds complexity and higher operational overhead. Option C: Use AWS Glue to crawl the data sources. ...

Author: Liam · Last updated May 21, 2026

A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models. The data engineer receives an access denied error when the data engineer tries to prepare the da...

In this case, the data engineer is receiving an access denied error when trying to use AWS Glue interactive sessions within Amazon SageMaker Studio. This suggests that the IAM permissions required to use both services in this integrated way have not been properly configured. Let’s evaluate the options based on the necessary permissions: Option A: Add the AWSGlueServiceRole managed policy to the data engineer's IAM user. - Pros: - The `AWSGlueServiceRole` policy grants permissions for various AWS Glue actions, like reading and writing data, creating tables, etc. - Cons: - This policy grants permissions for Glue operations but does not address the interaction between SageMaker and Glue, especially regarding SageMaker’s interaction with Glue interactive sessions. - Best Use Case: - This is more useful for Glue-specific operations but doesn’t help with the integration between SageMaker Studio and Glue interactive sessions. It does not grant the permissions needed to run Glue interactive sessions within SageMaker Studio. Option B: Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy. - Pros: - This option allows the data engineer to assume roles across both AWS Glue and SageMaker, ensuring that the data engineer can execute tasks in both services. - The `sts:AssumeRole` action is essential for cross-service role assumption, which is key for integrating SageMaker with Glue interactive sessions. - Cons: - The policy must be applied to both Glue and SageMaker, so it would require proper configuration of trust relationships and permissions for both services to function together. - Best Use Case: - This is the correct approach when the data engineer needs the ability to access resources across both Glue and SageMaker. By enabling role assumption betwe...
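For reference, a trust policy that names both service principals with the `sts:AssumeRole` action, as option B describes, could look like the document built below. This is a sketch of the policy shape only; `build_trust_policy` is an illustrative helper, and which role the policy is attached to depends on the actual setup.

```python
import json


def build_trust_policy(service_principals):
    """Build a trust policy that lets the given AWS services assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(service_principals)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy = build_trust_policy(["sagemaker.amazonaws.com", "glue.amazonaws.com"])
print(json.dumps(trust_policy, indent=2))
```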

Author: Jack · Last updated May 21, 2026

A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change. A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has...

The requirement is to implement a solution that can detect the schema of various data sources, handle changing or undefined schemas, and load data into an S3 bucket within 15 minutes. Given this, we need to select an option that addresses both schema detection and data extraction/ETL while minimizing operational overhead. Let’s evaluate each option: Option A: Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark. - Pros: - EMR provides a flexible environment for handling large-scale data processing tasks using Apache Spark. It can process a variety of data sources and can be configured for schema detection. - Apache Spark is highly scalable and could potentially handle the 1 TB of data per day efficiently. - Cons: - High operational overhead: EMR requires more management, such as cluster provisioning, scaling, and maintaining the environment. - Spark’s schema detection is not fully automatic for changing schemas across varied data sources like SAP HANA, MongoDB, or Kafka. Custom logic may be needed to handle dynamic schemas. - SLA concern: The process might not meet the strict SLA of 15 minutes due to setup time and cluster management overhead. - Best Use Case: This is useful for high-performance, custom data processing, but it introduces significant overhead that can be avoided with more integrated services. Option B: Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark. - Pros: - AWS Glue is a fully managed ETL service that provides built-in capabilities to detect and catalog schemas, especially for data sources with changing or undefined schemas. It integrates well with multiple data sources like DynamoDB, SQL Server, and Kafka. - AWS Glue can generate dynamic schemas and automatically adjust when the schema changes, which is ideal for the given use case. 
- Fully managed with serverless capabilities, so there’s minimal operational overhead compared to EMR. - The solution is scalable and can meet the SLA of loading data within 15 minutes due to its ability to handle real-time data pipelines. - Cons: - Might require some custom transformation logic for non-standard sources, though Glue provides a wide array of built-in connectors and transformations. - Best Use Case: This is the ...

Author: Vikram · Last updated May 21, 2026

A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII. To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solu...

In this scenario, the company needs to dynamically redact personally identifiable information (PII) based on the needs of each application that accesses a dataset stored in Amazon S3. The solution must ensure that PII is not shared unnecessarily, while also minimizing operational overhead. Let’s analyze each option: Option A: Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy. - Pros: - Granular control over access to data is achieved by restricting access through S3 bucket policies. - Custom datasets for each application can be created, ensuring that only the necessary data is exposed. - Cons: - Multiple dataset copies would need to be maintained, which creates significant operational overhead in terms of storage and data management. - Redaction is static, meaning it cannot be dynamically adjusted based on the application's needs. - Data duplication increases complexity, which could lead to potential errors and inefficiencies. - Best Use Case: While this method would limit access based on policies, it does not allow dynamic redaction, and the overhead of managing multiple copies is high. Option B: Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data. - Pros: - S3 Object Lambda enables dynamic processing of objects as they are retrieved from an S3 bucket, allowing you to redact PII dynamically based on the application's needs. - There’s no need to create multiple copies of the dataset. The redaction is done on-the-fly as the data is accessed, which keeps storage costs lower. - This is a serverless solution, with minimal operational overhead since it’s fully managed by AWS. 
- Cons: - It might introduce some latency as the redaction happens dynamically when the data is requested, but the overhead is minimal compared to managing multiple copies. - Best Use Case: This is the ideal solution for dynamically redacting PII without creating multiple copi...
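The redaction logic that option B places inside the S3 Object Lambda function can be sketched as a pure function over CSV text. The column names and the `REDACTED` mask are assumptions for illustration, and the Lambda handler wiring (fetching the original object and returning the transformed response) is deliberately left out so the example stays self-contained.

```python
import csv
import io


def redact_csv(text: str, pii_columns, mask: str = "REDACTED") -> str:
    """Return the CSV text with every value in the named PII columns masked."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in pii_columns:
            if column in row:
                row[column] = mask  # overwrite the sensitive value
        writer.writerow(row)
    return out.getvalue()


# Hypothetical dataset: the analytics application does not need the email column.
original = "order_id,email,total\n1,ana@example.com,19.99\n2,bo@example.com,5.00\n"
redacted = redact_csv(original, pii_columns={"email"})
print(redacted)
```

Passing a different `pii_columns` set per calling application is one way the same function could serve applications with different redaction needs.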

Author: Elizabeth · Last updated May 21, 2026

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is...

Let's evaluate the options based on the key factors: cost-effectiveness, complexity, operational overhead, and suitability for the task (ETL processing of small daily .csv files). Option A: Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. - Pros: - Customizable solution, allowing flexibility in ETL logic. - Cons: - High operational overhead: Amazon EKS requires managing clusters, scaling, and infrastructure, which can become complex and costly. - Costly: EKS incurs costs for managing Kubernetes infrastructure, even for small workloads. This is not cost-effective for processing small files (less than 100 MB each). - Overkill for this use case: Kubernetes is generally overkill for simple ETL jobs that involve small datasets. The complexity of Kubernetes would be unnecessary here, especially when simpler serverless or managed solutions exist. - Best Use Case: This is better suited for larger, more complex applications or scenarios where container orchestration and scaling are essential. It is not suitable for this case. Option B: Write a PySpark ETL script. Host the script on an Amazon EMR cluster. - Pros: - PySpark is excellent for large-scale data processing and transformation. - EMR offers scalable processing for large datasets. - Cons: - High operational overhead: Managing an EMR cluster for processing relatively small files (less than 100 MB each) would be inefficient. EMR clusters are typically used for larger, complex data processing tasks. - Not cost-effective: EMR incurs costs for the cluster’s uptime, regardless of whether the processing is heavy or light. For small daily files, the cost of maintaining an EMR cluster would far outweigh the benefit. - Overkill: Like EKS, this is an over-engineered solution for this task. - Best Use Case: EMR is useful for large-scale, complex ETL tasks, especially for big data processing. It is not cost-effective for small, straightforward ETL jobs. 
Option C: Write an AWS Glue PySpark job. Use Apache Spark to transform the data. - Pros: - AWS Glue is a fully managed ETL service, which automatically handles scaling, resource provisioning, and job schedu...

Author: Max · Last updated May 21, 2026

A data engineer creates an AWS Glue Data Catalog table by using an AWS Glue crawler that is named Orders. The data engineer wants to add the following new partitions: s3://transactions/orders/order_date=2023-01-01 and s3://transactions/orders/order_date=2023-01-02. The data engineer must edit the metadata to include the new partitions in the table without sca...

To add new partitions to an existing AWS Glue Data Catalog table without scanning all folders and files, the data engineer needs to update the metadata for the table to include the new partitions at the specified locations. Below is the analysis of each option and the reasoning behind selecting the appropriate one: A) `ALTER TABLE Orders ADD PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'; ALTER TABLE Orders ADD PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';` - Explanation: These DDL statements add two new partitions to the table by specifying their locations. The `ALTER TABLE ADD PARTITION` command adds partitions with explicit values for the partition key (`order_date`); it updates only the metadata with the new locations and does not scan the files. - Reason for Selection: This is the correct option because it allows the data engineer to add specific partitions with a given location without re-scanning all data, as the partitions are manually specified. B) `MSCK REPAIR TABLE Orders;` - Explanation: `MSCK REPAIR TABLE` is a command that scans the entire S3 directory and automatically adds all partitions that are missing from the Data Catalog based on the directory structure. This command is typically used when you have new partitions that are added to the directory but not yet reflected in the Glue Data Catalog. - Reason for Rejection: This option triggers a full scan of the S3 location, which is not efficient if the data engineer ...
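When more than a couple of partitions need to be registered, the `ALTER TABLE ... ADD PARTITION` statements from option A can be generated rather than typed by hand. The helper below is a minimal sketch; `add_partition_ddl` is a hypothetical name, and the date values are quoted as string literals on the assumption that `order_date` is a string or date partition key.

```python
def add_partition_ddl(table: str, base_location: str, partition_key: str, values):
    """Build one ALTER TABLE ... ADD PARTITION statement per partition value."""
    statements = []
    for value in values:
        # Hive-style layout: <base>/<key>=<value>
        location = f"{base_location.rstrip('/')}/{partition_key}={value}"
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION ({partition_key}='{value}') "
            f"LOCATION '{location}';"
        )
    return statements


ddl = add_partition_ddl(
    table="Orders",
    base_location="s3://transactions/orders",
    partition_key="order_date",
    values=["2023-01-01", "2023-01-02"],
)
for statement in ddl:
    print(statement)
```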

Author: John · Last updated May 21, 2026

A company stores 10 to 15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine. The company wants to transform the data to optimize query runtime and storage cos...

When evaluating options for optimizing query runtime and storage costs in Amazon Athena, it's important to focus on both the query performance (i.e., read efficiency) and the cost associated with storage. Let's analyze each option to determine which one best meets the requirements for Athena queries. A) `.csv format compressed with zip` - Explanation: CSV is a widely used format, but it is not optimized for columnar storage, which is crucial for query performance in Athena. Compressing CSV files with ZIP will reduce file size, but it doesn't address performance in terms of data retrieval and query optimization. Athena will still need to scan the entire file during each query, even though the ZIP compression reduces storage space. - Reason for Rejection: This option is inefficient for large datasets because CSV files do not support efficient data retrieval by column, and ZIP compression is not designed for optimized query performance in Athena. It is suitable for simple use cases but not for large-scale analytical queries. B) `JSON format compressed with bzip2` - Explanation: JSON is a semi-structured format that is readable and flexible, but like CSV, it is not optimized for Athena queries, especially when it comes to large datasets. Compression with bzip2 will reduce the storage size, but querying JSON files in Athena can be slow because Athena will still need to scan the entire file. JSON's structure also makes it more difficult to optimize for performance. - Reason for Rejection: Although bzip2 offers better compression ratios than ZIP, JSON files are not a columnar format and would result in slower query performance. This format is not ideal for large-scale analytical processing in Athena. C) `Apache Parquet format compressed with Snappy` - Explanation: Apache Parquet is a columnar storage format designed for efficie...
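One way to produce the Parquet and Snappy combination described in option C is an Athena `CREATE TABLE AS SELECT` (CTAS) statement over the existing CSV-backed table. The sketch below only assembles the statement text; the table names and output location are hypothetical, and `parquet_ctas` is an illustrative helper rather than an AWS API.

```python
def parquet_ctas(new_table: str, source_table: str, output_location: str) -> str:
    """Build an Athena CTAS statement that rewrites a table as Snappy-compressed Parquet."""
    return (
        f"CREATE TABLE {new_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{output_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table};"
    )


ctas_sql = parquet_ctas(
    new_table="sales_parquet",   # hypothetical target table
    source_table="sales_csv",    # hypothetical table over the .csv files
    output_location="s3://example-bucket/sales-parquet/",
)
print(ctas_sql)
```

Because Parquet is columnar, a query that selects only a few columns reads less data than it would from the row-oriented CSV files, which is the effect the explanation above attributes to columnar formats.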

Author: Maya2022 · Last updated May 21, 2026

A company uses Apache Airflow to orchestrate the company's current on-premises data pipelines. The company runs SQL data quality check tasks as part of the pipelines. The company wants to migrate the pipelines to AWS and to use AW...

When evaluating options for migrating the company's existing Apache Airflow pipelines to AWS with the least amount of refactoring, it's important to consider factors such as ease of migration, compatibility with the existing workflows, and leveraging AWS managed services. Let's analyze each option: A) Setup AWS Outposts in the AWS Region that is nearest to the location where the company uses Airflow. Migrate the servers into Outposts hosted Amazon EC2 instances. Update the pipelines to interact with the Outposts hosted EC2 instances instead of the on-premises pipelines. - Explanation: AWS Outposts extend AWS infrastructure on-premises, providing a hybrid environment. While this might seem like a solution to migrate workloads without major changes, it doesn't fully leverage AWS managed services. It still requires managing EC2 instances, networking, and infrastructure, which adds operational complexity. - Reason for Rejection: This solution does not leverage AWS managed services like Amazon MWAA, which would simplify the management and operation of Airflow. It also involves maintaining infrastructure on-premises, which increases operational overhead. B) Create a custom Amazon Machine Image (AMI) that contains the Airflow application and the code that the company needs to migrate. Use the custom AMI to deploy Amazon EC2 instances. Update the network connections to interact with the newly deployed EC2 instances. - Explanation: This solution involves manually creating and managing custom EC2 instances to host Apache Airflow. The pipelines will need to be updated to work with the new EC2 instances, which is a form of lift-and-shift migration. - Reason for Rejection: While this solution requires minimal refactoring, it still involves managing the EC2 instances, network configurations, and other components. It lacks the benefits of using fully managed services like Amazon MWAA and does not reduce operational overhead, which would make it less efficient in the long run. 
C) Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows...

Author: VenomousSerpent42 · Last updated May 21, 2026

A company uses Amazon EMR as an extract, transform, and load (ETL) pipeline to transform data that comes from multiple sources. A data engineer must orchestrate the pipeline to maximize...

Let's evaluate each option for orchestrating an ETL pipeline with Amazon EMR to maximize performance in a cost-effective manner: Option A: Amazon EventBridge - Why it’s rejected: Amazon EventBridge is a serverless event bus service that allows you to route events between AWS services, which is excellent for event-driven architectures. However, it is not designed for orchestrating complex workflows or managing the sequence of tasks in an ETL pipeline. EventBridge is typically used for event-based triggers and event handling, not for workflow orchestration or pipeline management. Option B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA) - Why it’s rejected: Amazon MWAA provides a managed environment for running Apache Airflow, which is great for managing complex workflows, including ETL pipelines. However, it might be overkill in this scenario if the primary focus is cost-effective orchestration. Apache Airflow is powerful but introduces a level of complexity and potential overhead in terms of management and costs for simple ETL orchestration. It would be better suited for complex, multi-step workflows with dependencies and where specific scheduling, retries, and monitoring are required. Option C: AWS Step Functions - Why it’s rejected: AWS Step Functions is a great service for orchestrating workflows involving multiple AWS services, including running tasks in a sequence or in parallel. While it provides robust error handling and retry mechanisms, it might not be the most cost-effective option for simple ETL pipelines. Step Functions is highly suited for applications with complex business logic or where workflows involve more varied AWS services. However, it can become costly if the workflow includes many small, frequent tasks (since you are billed per state transition), making it less cost-effective for large-scale E...

Author: Layla · Last updated May 21, 2026

An online retail company stores Application Load Balancer (ALB) access logs in an Amazon S3 bucket. The company wants to use Amazon Athena to query the logs to analyze traffic patterns. A data engineer creates an unpartitioned table in Athena. As the amount of the data gradually increases, the response time for queries also increases. ...

To improve query performance in Amazon Athena when analyzing traffic patterns from ALB access logs stored in S3, the key factors to consider are reducing the amount of data that Athena needs to scan during each query and optimizing the way the data is structured. Let's review each option and assess which will meet the requirements with the least operational effort. A) Create an AWS Glue job that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog. - Explanation: This option involves using AWS Glue to determine the schema of the ALB access logs and write partition metadata to the Glue Data Catalog. While partitioning the data can improve query performance by allowing Athena to scan only relevant parts of the data, the solution doesn’t address how the data itself should be optimized for efficient querying (e.g., file formats, compression). - Reason for Rejection: This option focuses on metadata management rather than optimizing the storage format or partitioning the data itself. Without transforming the data into a more efficient format (e.g., Parquet or ORC), queries will still require scanning large amounts of raw log data, leading to inefficiencies. B) Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog. - Explanation: AWS Glue crawlers can automatically detect the schema and partitioning of data, and write metadata to the Glue Data Catalog, which makes the data ready for query in Athena. While this would help with organizing and partitioning the data, it still doesn’t address the need to transform the raw log data into a more efficient format for querying, such as Parquet. - Reason for Rejection: This option provides metadata management but lacks the transformation of data into a more query-optimized format. 
Using raw log files (e.g., CSV or JSON) without further processing would not give the best query performance in Athena. C) Create an AWS Lambda function to transform all A...

Author: Emily · Last updated May 21, 2026

A company has a business intelligence platform on AWS. The company uses an AWS Storage Gateway Amazon S3 File Gateway to transfer files from the company's on-premises environment to an Amazon S3 bucket. A data engineer needs to set up a process that will automatically launch an AWS Glue workflow to run a series of AWS...

Let's evaluate each option to determine the best solution: A) Determine when the file transfers usually finish based on previous successful file transfers. Set up an Amazon EventBridge scheduled event to initiate the AWS Glue jobs at that time of day. - Why it’s not ideal: This solution would rely on the assumption that file transfers finish at a predictable time, based on previous transfers. However, this is not an accurate way to handle the process, as the transfer times might vary due to network issues, file sizes, or other unpredictable factors. Additionally, this method would introduce unnecessary delay if the file transfer completes earlier or later than expected. There is also a risk of triggering unnecessary workflows if the timing is not well-calibrated. - Best scenario for this: This could work in environments where the transfers are highly consistent in terms of timing and there’s a very clear schedule, but this is not a flexible or scalable approach. B) Set up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event. - Why it’s ideal: This option takes advantage of event-driven architecture to trigger the Glue workflow automatically after each successful file transfer. Amazon S3 can emit events (like `s3:ObjectCreated`) when a new file is added to the S3 bucket. By creating an event rule in EventBridge to listen for these events, it would immediately trigger the Glue workflow without relying on specific times or manual intervention. This provides automation and scalability with minimal overhead, as it will only trigger the workflow when a file transfer is complete. - Best scenario for this: This is the best approach if you want a fully automated process that reacts to the actual completion of a file transfer. It is event-driven, flexible, and low-latency. C) Set up an on-demand AWS Glue workflow so that t...
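The event-driven approach in option B can be sketched as an EventBridge event pattern. This is a minimal illustration, assuming the bucket has EventBridge notifications enabled and that the rule targets the Glue workflow; the bucket name and the helper `s3_object_created_pattern` are hypothetical and not part of the original question.

```python
import json


def s3_object_created_pattern(bucket_name: str) -> dict:
    """Build an EventBridge event pattern that matches S3 'Object Created'
    events for a single bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


# "example-landing-bucket" is a hypothetical bucket name used for illustration.
pattern = s3_object_created_pattern("example-landing-bucket")
print(json.dumps(pattern, indent=2))
```

The serialized pattern is what an EventBridge rule would be given as its event pattern; nothing in this sketch calls AWS.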

Author: Ishaan · Last updated May 21, 2026

A retail company uses Amazon Aurora PostgreSQL to process and store live transactional data. The company uses an Amazon Redshift cluster for a data warehouse. An extract, transform, and load (ETL) job runs every morning to update the Redshift cluster with new data from the PostgreSQL database. The company has grown rapidly and needs to cost optimize the Redshift cluster. A data engineer needs to create a solution to archive historical data. The data engineer must be able to run analytics queries that effectively combine data from live transactional data...

Let's evaluate each option to determine the best combination of steps to meet the requirements of the data engineer. A) Configure the Amazon Redshift Federated Query feature to query live transactional data that is in the PostgreSQL database. - Why it’s a good option: Amazon Redshift Federated Query allows Redshift to query external data sources like PostgreSQL, directly from within Redshift. This would allow the company to join live transactional data in PostgreSQL with the current data in Redshift. This is an effective way to access live data without needing to duplicate it in Redshift, which would help with cost optimization by keeping Redshift storage focused on current data. - Best scenario for this: This is useful when the transactional data in PostgreSQL needs to be directly queried alongside the current data in Redshift without moving large volumes of data into Redshift. It avoids duplication and reduces costs by querying live transactional data in PostgreSQL directly. B) Configure Amazon Redshift Spectrum to query live transactional data that is in the PostgreSQL database. - Why it’s not ideal: Amazon Redshift Spectrum is specifically designed to query data stored in Amazon S3 and is not directly suited to query live transactional data in PostgreSQL. Although Redshift Spectrum can be used to query historical data stored in S3, it does not support querying PostgreSQL databases. Therefore, this option would not effectively meet the need to combine live PostgreSQL data with current Redshift data. - Best scenario for this: Redshift Spectrum is ideal for querying large volumes of historical data stored in Amazon S3, but not for querying live transactional data from an external PostgreSQL database. C) Schedule a monthly job to copy data that is older than 15 months to Amazon S3 by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3. 
- Why it’s ideal: This option addresses both data archiving and cost optimization by ensuring that only the most recent 15 months of data are stored in Amazon Redshift. The older data is offloaded to Amazon S3, and Redshift Spectrum can be used to query this historical data as needed. By offloading old data to S3 and using Redshift Spectrum for analytics, the company can continue to query the archived data without increasing Redshift storage costs. - Best scenario f...

Author: FlamePhoenix2025 · Last updated May 21, 2026

A manufacturing company has many IoT devices in facilities around the world. The company uses Amazon Kinesis Data Streams to collect data from the devices. The data includes device ID, capture date, measurement type, measurement value, and facility ID. The company uses facility ID as the partition key. The company's operations team recently observed many WriteThroughputExceeded exceptions. The ...

Let's evaluate each option based on the company's issue of WriteThroughputExceeded exceptions, where some shards are heavily used, and others are idle. A) Change the partition key from facility ID to a randomly generated key. - Why it could be helpful: If the partition key is chosen poorly, it can lead to uneven data distribution across shards. By using a randomly generated key, the data would be distributed more evenly across shards, preventing overloading specific shards. This would help resolve the issue of some shards being heavily used while others are idle. - Why it might not be ideal: Randomly generated partition keys are often not meaningful, which could complicate downstream processing and querying, as the data would no longer be grouped by facility ID. The lack of a logical structure might make it harder to analyze data for specific facilities. - Best scenario for this: This approach works well when the data doesn't need to be logically grouped by any attribute and if the only goal is to balance throughput across shards, but it may not be ideal given the desire to maintain a meaningful partition key (facility ID). B) Increase the number of shards. - Why it could be helpful: Increasing the number of shards would allow the Kinesis stream to handle a higher throughput by spreading the data across more shards. However, this solution addresses throughput at a global level, but doesn't address the issue of uneven load across shards. - Why it might not be ideal: If the uneven distribution of data is caused by the current partition key (facility ID), simply adding more shards would not resolve the problem if the distribution remains skewed. Without addressing how data is distributed across shards, you may end up with a larger stream but still face the same problem of hotspots. - Best scenario for this: This is a valid approach if the current partitioning is already well-distributed but the stream is running out of capacity due to overall data growth. 
However, it may not solve the core issue of skewed data distribution. C...
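The hot-shard effect discussed above can be simulated locally. Kinesis maps each partition key to a shard by MD5-hashing the key into a 128-bit integer and locating the shard whose hash-key range contains it. The sketch below assumes equal-sized ranges, 8 shards, and a hypothetical workload in which only 3 facilities produce all records; none of these numbers come from the original question.

```python
import hashlib
import uuid
from collections import Counter


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index: MD5-hash the key to a 128-bit
    integer, then find which equal-sized hash-key range it falls into."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


def records_per_shard(keys, shard_count):
    """Count how many records land on each shard."""
    counts = Counter(shard_for_key(key, shard_count) for key in keys)
    return [counts.get(i, 0) for i in range(shard_count)]


SHARDS = 8  # assumed shard count
# Hypothetical workload: 3 facilities produce all 10,000 records.
facility_keys = [f"facility-{i % 3}" for i in range(10_000)]
# Alternative: a random key per record.
random_keys = [str(uuid.uuid4()) for _ in range(10_000)]

by_facility = records_per_shard(facility_keys, SHARDS)
by_random = records_per_shard(random_keys, SHARDS)
print("facility ID keys:", by_facility)
print("random keys:     ", by_random)
```

With a low-cardinality key, at most three shards receive any traffic no matter how many shards exist, which is why adding shards alone does not remove the hotspot.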

Author: CrystalWolfX · Last updated May 21, 2026

A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table. The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost...

Let's analyze each option and how it meets the requirements of understanding the execution plan and computational cost for SQL queries in Amazon Athena: A) EXPLAIN SELECT * FROM sales; - Why it's not ideal: The `EXPLAIN` statement provides the query execution plan, which outlines the steps Athena will take to execute the query, such as scanning tables, joining data, and applying filters. However, it doesn't provide detailed runtime statistics or computational cost for each operation. It's useful for analyzing the query structure, but it doesn't provide the full performance analysis that includes costs. - Best scenario for this: This option would work if the goal is to simply view the query execution plan, but it doesn't give the computational cost or performance details. B) EXPLAIN ANALYZE FROM sales; - Why it's not ideal: The statement `EXPLAIN ANALYZE` without a `SELECT` query is not syntactically correct for Athena. In Amazon Athena, you need to specify a `SELECT` query after the `EXPLAIN ANALYZE` statement to get detailed performance metrics, including computational costs. As it stands, this statement would result in an error. - Best scenario for this: This is not a valid option because of incorrect syntax in Athena. C) EXPL...
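To make the syntax point concrete, here is a small sketch that wraps a query in `EXPLAIN ANALYZE` and rejects input that lacks a `SELECT`. The helper `with_explain_analyze` is hypothetical, and the check is deliberately simplistic (it only accepts statements that begin with `SELECT`); it is not how Athena itself validates statements.

```python
def with_explain_analyze(select_statement: str) -> str:
    """Prefix a SELECT statement with EXPLAIN ANALYZE.

    Illustrative check only: statements that do not start with SELECT
    are rejected, mirroring why 'EXPLAIN ANALYZE FROM sales' is invalid.
    """
    statement = select_statement.strip().rstrip(";").strip()
    if not statement.upper().startswith("SELECT"):
        raise ValueError("expected a SELECT statement to analyze")
    return f"EXPLAIN ANALYZE {statement};"


query = with_explain_analyze("SELECT * FROM sales")
print(query)
```

The resulting string could then be submitted to Athena like any other query.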

Author: Amira99 · Last updated May 21, 2026

A company plans to provision a log delivery stream within a VPC. The company configured the VPC flow logs to publish to Amazon CloudWatch Logs. The company needs to send the flow logs to Splunk in near real time for fur...

Let's evaluate each option based on the requirement to send VPC flow logs to Splunk in near real-time with the least operational overhead. A) Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the data stream. - Why it’s not ideal: While Kinesis Data Streams is a good option for real-time data streaming, it requires a custom application or service to consume the data from the stream and send it to Splunk. Configuring a subscription filter to send CloudWatch Logs to Kinesis Data Streams adds complexity, as you'd need to build and maintain the integration between Kinesis and Splunk. This introduces additional operational overhead and management complexity. - Best scenario for this: This would work in scenarios where you need full control over the data flow and processing pipeline but is more complex and requires more maintenance. B) Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the delivery stream. - Why it’s ideal: Kinesis Data Firehose simplifies the process by providing a fully managed service that can stream data directly to Splunk. It handles data delivery and transformation automatically, which reduces operational overhead. The CloudWatch Logs subscription filter can be easily configured to send log events to the Firehose delivery stream. This approach is highly streamlined, reducing the need for custom applications or additional components. - Best scenario for this: This option is the best choice when you want to minimize operational overhead and take advantage of a managed service that integrates well with CloudWatch Logs and Splunk. It is the easiest way to deliver data to Splunk in near real-time without additional maintenance. C) Create an Amazon Kinesis Data Firehose delivery stream to use Splunk...
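The subscription filter described in option B can be sketched as the parameter set for the CloudWatch Logs `put_subscription_filter` call. All ARNs, the log group name, and the filter name below are placeholders, and the helper `firehose_subscription_filter` is hypothetical; the sketch only builds the request and does not call AWS.

```python
def firehose_subscription_filter(log_group, delivery_stream_arn, role_arn,
                                 filter_name="flow-logs-to-splunk"):
    """Build keyword arguments for CloudWatch Logs put_subscription_filter
    that forward every event in a log group to a Firehose delivery stream."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": "",  # an empty pattern matches every log event
        "destinationArn": delivery_stream_arn,
        "roleArn": role_arn,  # role that CloudWatch Logs assumes to write to Firehose
    }


# Placeholder names and ARNs for illustration only.
params = firehose_subscription_filter(
    log_group="/vpc/flow-logs",
    delivery_stream_arn="arn:aws:firehose:us-east-1:111122223333:deliverystream/splunk-stream",
    role_arn="arn:aws:iam::111122223333:role/cwl-to-firehose",
)
print(params)
# With credentials configured, this could be passed to boto3:
#   boto3.client("logs").put_subscription_filter(**params)
```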

Author: Kai99 · Last updated May 21, 2026

A company has a data lake on AWS. The data lake ingests sources of data from business units. The company uses Amazon Athena for queries. The storage layer is Amazon S3 with an AWS Glue Data Catalog as a metadata repository. The company wants to make the data available to data scientists and business analysts. However, the company first needs t...

Key Factors: - Fine-grained access control: The company wants to ensure column-level access control based on user roles and responsibilities. - Integration with Athena: The solution needs to work seamlessly with Amazon Athena, which queries data stored in S3 and uses the AWS Glue Data Catalog as a metadata repository. - User roles and responsibilities: The solution must account for managing different access levels based on user roles, which may include data scientists, business analysts, and others. Option Analysis: A) Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation. - Pros: - AWS Lake Formation is specifically designed for data lakes, offering fine-grained, column-level access control for data stored in Amazon S3 and cataloged in AWS Glue. - It allows you to define policies for users or applications based on IAM roles. - It integrates well with Athena for secure querying of the data. - It provides a centralized mechanism for managing access controls, including fine-grained permissions (e.g., column-level access) to the data. - Cons: - Requires an extra layer of configuration compared to other options (though it is highly effective for managing data lake security). - Use Case: This option is ideal for companies using AWS Glue and Athena with S3 for a data lake, especially when you need fine-grained access controls based on specific data access needs. B) Define an IAM resource-based policy for AWS Glue tables. Attach the same policy to IAM user groups. - Pros: - IAM resource-based policies can control access to AWS Glue resources, including tables. - Cons: - IAM policies generally do not support column-level access control in a fine-grained manner, and they are more suited for broader permission sets rather than detailed access management. - This option doesn't meet the column-level access requirement needed in the scenario. - Use Case: T...
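As a rough illustration of column-level control in Lake Formation, the sketch below builds the request for a `grant_permissions` call that allows `SELECT` on specific columns only. The role ARN, database, table, and column names are hypothetical, as is the helper `column_grant_request`; the sketch does not call AWS.

```python
def column_grant_request(role_arn, database, table, columns):
    """Build keyword arguments for Lake Formation grant_permissions that
    allow SELECT on the listed columns of one table for one IAM role."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may read only non-sensitive columns.
request = column_grant_request(
    role_arn="arn:aws:iam::111122223333:role/business-analyst",
    database="sales_db",
    table="orders",
    columns=["order_id", "order_date", "total"],
)
print(request)
# With credentials configured, this could be passed to boto3:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Athena queries run under that role would then see only the granted columns.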

Author: Ethan Smith · Last updated May 21, 2026

A company has developed several AWS Glue extract, transform, and load (ETL) jobs to validate and transform data from Amazon S3. The ETL jobs load the data into Amazon RDS for MySQL in batches once every day. The ETL jobs use a DynamicFrame to read the S3 data. The ETL jobs currently process all the data that is in the S3 bucket. Howeve...

Key Factors: - Incremental processing: The company wants to process only the daily incremental data, not the entire dataset every time. - Low coding effort: The solution needs to minimize the need for custom development and configuration. - Seamless integration with Glue: The solution should integrate directly with AWS Glue and its existing setup. Option Analysis: A) Create an ETL job that reads the S3 file status and logs the status in Amazon DynamoDB. - Pros: - This approach can track file status and changes manually by logging information into DynamoDB. - Cons: - This method involves a high level of manual effort and custom development to implement file status tracking and logging. - Requires coding to manage the process of logging status and reading it for incremental processing. - Adding DynamoDB introduces complexity and potential performance overhead. - Use Case: While this can meet the requirement, it involves significant custom work, making it a more complex and error-prone solution compared to others. B) Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data. - Pros: - Job bookmarks are designed by AWS Glue specifically to handle incremental data processing. - It enables the ETL jobs to track the state of data and only process new or updated data based on a timestamp or other defined key. - Minimal coding effort: Enabling job bookmarks is straightforward and does not require significant code changes. - Direct integration with AWS Glue and no additional services are required. - Cons: - It may require small changes in the job configuration, but the overall implementation is relatively simple. - Use Case: This option is perfect for scenarios where you want incremental data processing with minimal coding and the existing Glue ...
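Enabling job bookmarks is mostly a matter of job configuration. The sketch below builds the job arguments with the `--job-bookmark-option` parameter set to `job-bookmark-enable`; the helper `bookmark_job_arguments` and the extra argument shown are hypothetical. In the job script, each DynamicFrame read typically also needs a `transformation_ctx` so that Glue can track its state.

```python
def bookmark_job_arguments(extra=None):
    """Return Glue job arguments with job bookmarks enabled, merged with
    any additional arguments the job needs."""
    arguments = {"--job-bookmark-option": "job-bookmark-enable"}
    arguments.update(extra or {})
    return arguments


# "--target_table" is a hypothetical custom argument for illustration.
job_args = bookmark_job_arguments({"--target_table": "daily_sales"})
print(job_args)
# With credentials configured, these could be supplied as DefaultArguments
# when creating the job or as Arguments when starting a job run.
```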

Author: Ryan · Last updated May 21, 2026

An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network t...

Key Factors: - Cost-effectiveness: The solution needs to minimize costs while still meeting the requirements of collecting and analyzing VPC flow logs. - Ease of use: The solution should be simple to configure and maintain. - Scalability and performance: The solution should be able to scale to handle large amounts of flow log data efficiently. Option Analysis: A) Publish flow logs to Amazon CloudWatch Logs. Use Amazon Athena for analytics. - Pros: - CloudWatch Logs is a managed service for log storage and analysis. - Athena can query data in CloudWatch Logs, but only through a federated query connector rather than natively. - Cons: - Storing flow logs in CloudWatch Logs can get expensive, especially with large volumes of log data. - CloudWatch Logs is not designed for high-volume, cost-efficient log storage over the long term, which may make this solution less cost-effective. - Athena queries over CloudWatch Logs are likely to be less efficient for large datasets than queries over data stored in S3. - Use Case: This could work for small-scale environments or scenarios where you need real-time log monitoring, but it may not be cost-efficient for large volumes of VPC flow logs. B) Publish flow logs to Amazon CloudWatch Logs. Use an Amazon OpenSearch Service cluster for analytics. - Pros: - OpenSearch (formerly Elasticsearch) is a powerful search and analytics engine that integrates with CloudWatch Logs. - Cons: - OpenSearch can be costly for large datasets, especially in terms of storage and compute for processing flow logs. - As with CloudWatch Logs, using OpenSearch for this type of data can quickly become expensive when dealing with large log volumes, especially if you're running it continuously for flow log analysis. - Use Case: This solution is suitable for environments that require advanced search capabilities, but it's not the most cost-effective for analyzing flow logs in large volumes. C) Publish flow logs to Amazon S3 i...

Author: William · Last updated May 21, 2026

A retail company stores transactions, store locations, and customer information tables in four reserved ra3.4xlarge Amazon Redshift cluster nodes. All three tables use even table distribution. The company updates the store location table only once or twice every few years. A data engineer notices that Redshift queues are slowing down because the whole store location table is constantly being broadcast to all four compute nodes for most queries. ...

Key Factors: - Query performance: The goal is to speed up query performance by reducing the broadcasting of the store location table. - Cost-effectiveness: The solution should optimize performance without increasing costs unnecessarily. - Distribution style: The solution must address how data is distributed across the compute nodes to minimize unnecessary broadcasting. Option Analysis: A) Change the distribution style of the store location table from EVEN distribution to ALL distribution. - Pros: - ALL distribution stores a full copy of the table on every node, which eliminates the need to broadcast or redistribute the table across nodes during joins. This is particularly useful for small, infrequently updated lookup tables like the store location table. - Since the store location table is updated infrequently, keeping a copy on every node is cost-effective and can significantly improve query performance. - Cons: - ALL distribution is less suitable for large or frequently updated tables, because every node stores the full table and every change must be applied on every node. In this case the table is infrequently updated, so ALL distribution is a good fit. - Use Case: This is the most cost-effective and performance-improving solution for this situation. It eliminates the need for broadcasting the store location table for every query while ensuring that other tables using EVEN distribution aren’t affected. B) Change the distribution style of the store location table to KEY distribution based on the column that has the highest dimension. - Pros: - KEY distribution ensures that rows with the same value in the distribution column are stored on the same node, which is useful for queries that filter or join on the key column. - Cons: - While KEY distribution can reduce data movement in queries that filter or join based on the distribution key, it still does not guarantee that the store location table will not be broadcast in certain cases. 
- Choosing the right column for KEY distribution requires careful consideration of the data, and the store location table is unlikely to have a column with a high cardinality that would be ideal for this distribution method. - If the store location table is small and updated infrequently, ALL distribution...

Author: Michael · Last updated May 21, 2026

A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all row...

Key Factors: - Requirement: The query must find all rows where the `city_name` starts with "San" or "El". - SQL pattern matching: We are looking for a query that uses a regular expression to match city names starting with either "San" or "El". Option Analysis: A) Select * from Sales where city_name ~ =‘$(San|El)*’; - Pros: - `~` is a valid regular expression match operator in Amazon Redshift. - Cons: - The pattern is incorrect. `$` anchors the match to the end of the string, so `$(San|El)*` does not express "starts with". A "starts with" pattern must be anchored to the start of the string using `^`. - Use Case: This is not a valid option because the pattern does not capture the requirement to match city names starting with "San" or "El". B) Select * from Sales where city_name ~ =‘^(San|El)*’; - Pros: - The `^` correctly anchors the match to the start of the string. - The pattern `(San|El)` is correct for matching either "San" or "El" at the beginning of the string. - Cons: - The `*` at the end of the regular expression is unnecessary. The `*` means "zero or more occurrences" of the preceding element, which makes the group optional, but we only want to match the beginning of the string with either "San" or "El". The correct pattern should end with the group `(San|El)` without the `*`. - Use Case: This option is close but...
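The difference between the anchors and the trailing `*` can be checked with a short experiment. The sketch uses Python's `re.search` as a stand-in for the Redshift `~` operator, on the assumption that both match a pattern against any portion of the string and treat `^`, `$`, `|`, and `*` the same way for these patterns; the city list is made up for illustration.

```python
import re

# Hypothetical sample of city names.
cities = ["San Diego", "El Paso", "Santa Fe", "Boston", "New San Juan", "Elgin"]


def matching(pattern):
    """Return the cities in which the pattern matches somewhere."""
    return [city for city in cities if re.search(pattern, city)]


starts_with = matching(r"^(San|El)")      # anchored at the start, group required
optional_group = matching(r"^(San|El)*")  # '*' lets the group match zero times
end_anchored = matching(r"$(San|El)")     # '$' anchors at the end of the string

print("^(San|El)  ->", starts_with)
print("^(San|El)* ->", optional_group)
print("$(San|El)  ->", end_anchored)
```

Only the first pattern restricts the result to names that begin with "San" or "El"; with the trailing `*` every name matches, and the end-anchored pattern without `*` matches none.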

Author: Chloe · Last updated May 21, 2026

A company needs to send customer call data from its on-premises PostgreSQL database to AWS to generate near real-time insights. The solution must capture and load updates from operational data stores that run in the PostgreSQL database. The data changes continuously. A data engineer configures an AWS Database Migration Service (AWS DMS) ongoing replication task. The task reads changes in near real time from the PostgreSQL source database transaction logs for each table. The task then sends the data to an Amazon Redshift cluster for processing. The data engineer discover...

To determine if the PostgreSQL database is the source of high latency, let's evaluate each option: A) Use Amazon CloudWatch to monitor the DMS task. Examine the CDCIncomingChanges metric to identify delays in the CDC from the source database. - This metric would show incoming changes that DMS is capturing, but it focuses on changes arriving in DMS, not specifically whether the source PostgreSQL database is causing the delay. This metric tracks changes entering DMS but doesn't directly confirm if the database itself is slow at producing these changes. B) Verify that logical replication of the source database is configured in the postgresql.conf configuration file. - This option checks if the PostgreSQL database is properly configured to support logical replication, which is needed for change data capture (CDC) by DMS. If the replication configuration is incorrect, it could result in delays. However, this step doesn't directly confirm if the database is slow at replicating or if the issue lies with DMS itself, making it less direct in isolating the source of latency. C) Enable Amazon CloudWatch Logs for the DMS endpoin...

Author: Ethan · Last updated May 21, 2026

A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every...

Let's evaluate each solution to determine the best option for delivering IoT sensor data to an S3 bucket with the least latency: Option A: Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose. - Why it’s rejected: Kinesis Data Firehose automatically batches data before delivering it to the destination (such as S3), and the default buffer interval is 300 seconds (5 minutes). This interval is much too long for the required 30-second data retrieval from the S3 bucket. The latency would be higher than required, so this option is not ideal for low-latency requirements. Option B: Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards. - Why it’s rejected: Kinesis Data Streams is suitable for real-time data streaming, but delivering data to S3 requires a secondary step, such as using a custom process or Kinesis Data Firehose. Configuring 5 provisioned shards is a good way to scale, but it still involves extra steps to write the data to S3, leading to potential additional latency. The process will not be as streamlined and efficient as using Kinesis Data Firehose or other integrated solutions designed for low-latency delivery. Option C: Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application. - Why it’s rejected: While Kinesis Data Streams with the Kinesis Client Library (KCL) can handle real-time data processing, implementing a custom application that pulls from the stream and writes to S3 introduces complexity and delays. Even though a 5-second buffer interval might reduce latency, the added overhead of managing the custom application and the manual writing of data to S3 makes this option less efficient and introduces operational complexity. Option D:...
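The latency argument against the default Firehose settings can be worked through numerically. Firehose flushes a buffer when either its size limit or its time interval is reached, whichever comes first. The sketch assumes a single sensor stream at the stated rate (100 KB every 10 seconds) and a 5 MiB default buffer size; the buffer size is an assumption, while the 300-second default interval comes from the explanation above.

```python
KB = 1024
MB = 1024 * KB


def seconds_until_flush(record_bytes, record_interval_s, buffer_bytes, buffer_interval_s):
    """Estimate how long data waits before a flush: the buffer is delivered
    when it fills up or when the interval elapses, whichever happens first."""
    bytes_per_second = record_bytes / record_interval_s
    seconds_to_fill = buffer_bytes / bytes_per_second
    return min(seconds_to_fill, buffer_interval_s)


# Default settings: assumed 5 MiB buffer, 300-second interval.
default_wait = seconds_until_flush(100 * KB, 10, 5 * MB, 300)
# A shorter, explicitly configured interval (60 seconds chosen for illustration).
tuned_wait = seconds_until_flush(100 * KB, 10, 5 * MB, 60)

print("default settings:", default_wait, "seconds")
print("60-second interval:", tuned_wait, "seconds")
```

At this data rate the size limit is never reached first, so the time interval alone determines how stale the objects in S3 can be.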

Author: Amira · Last updated May 21, 2026

A company wants to use machine learning (ML) to perform analytics on data that is in an Amazon S3 data lake. The company has two data transformation requirements that will give consumers within the company the ability to create reports. The company must perform daily transformations on 300 GB of data that is in a variety of formats and that must arrive in Amazon S3 at a scheduled time. The company must perform one-time transformations of terabytes of archived data that is in the S3 data lake. The company uses Amazon Managed Workflows for...

Let's evaluate the options based on the company’s needs for performing data transformations on daily and archived data in the most cost-effective manner. A) For daily incoming data, use AWS Glue crawlers to scan and identify the schema. - AWS Glue Crawlers are designed to automatically discover schema and metadata of data stored in Amazon S3. This solution is useful for identifying schema on daily incoming data, especially if the data is in varied formats (like CSV, JSON, Parquet, etc.). Once the schema is identified, Glue can perform data transformation tasks. This is a cost-effective solution for schema discovery, as Glue is serverless and can scale based on the data size. It’s suitable for periodic operations (like daily transformations) and minimizes the need for infrastructure management. B) For daily incoming data, use Amazon Athena to scan and identify the schema. - Amazon Athena is a serverless query service that allows you to analyze data directly in Amazon S3 using standard SQL. Athena can also identify the schema if the data is in formats like Parquet or ORC. However, Athena is more suitable for querying data rather than performing complex transformations, especially on a daily basis with large datasets. Athena charges based on the amount of data scanned, which can become costly for large datasets (e.g., 300 GB daily) when performing frequent scans. While useful for queries, Athena is less cost-effective than AWS Glue for daily data transformations and schema management. C) For daily incoming data, use Amazon Redshift to perform transformations. - Amazon Redshift is a data warehousing service optimized for large-scale data analysis. While it can perform data transformations, it is more suited for complex analytics and queries on structured data. Redshift incurs higher costs due to provisioning compute resources, making it less cost-effective for daily data transformations when compared to serverless options like...

Author: Sofia · Last updated May 21, 2026

A retail company uses AWS Glue for extract, transform, and load (ETL) operations on a dataset that contains information about customer orders. The company wants to implement specific validation rule...

Let's evaluate each option based on the need to implement specific validation rules to ensure data accuracy and consistency. A) Use AWS Glue job bookmarks to track the data for accuracy and consistency. - AWS Glue job bookmarks are used to keep track of processed data between ETL job runs, ensuring that only new or modified data is processed in subsequent jobs. While job bookmarks are useful for incremental data processing, they are not specifically designed for validation or enforcing data accuracy and consistency. Job bookmarks track progress but do not allow you to define or check validation rules. Therefore, this option does not meet the need for specific data validation rules. B) Create custom AWS Glue Data Quality rulesets to define specific data quality checks. - AWS Glue Data Quality allows you to create custom rulesets to define specific data quality checks based on your needs (e.g., range checks, null value checks, uniqueness checks, etc.). This solution is ideal for implementing specific validation rules to ensure data accuracy and consistency. You can define rules and apply them to your data in a flexible and granular way. This makes it the most suitable solution for the company's requirement to implement custom validation rules. C) Use the built-in AWS Glue Data Quality transforms for standard data quality validations. - AWS Glue Data Quality transforms provide pre-built data quality checks such ...
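As a rough illustration of what a custom ruleset in option B might contain, the sketch below assembles a small ruleset in Data Quality Definition Language (DQDL) covering the three kinds of checks named above. The column names (`order_id`, `quantity`) and the helper function are hypothetical; the source does not describe the dataset's schema.

```python
def build_dqdl_ruleset(rules):
    """Join individual DQDL rule strings into one ruleset definition."""
    body = ",\n    ".join(rules)
    return f"Rules = [\n    {body}\n]"


# Hypothetical columns for a customer-orders dataset.
order_rules = [
    'IsComplete "order_id"',        # null value check
    'IsUnique "order_id"',          # uniqueness check
    'ColumnValues "quantity" > 0',  # range check
]

ruleset = build_dqdl_ruleset(order_rules)
print(ruleset)
```

The resulting string is the kind of text that would be supplied as the ruleset definition when the ruleset is created in AWS Glue Data Quality.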

Author: Kunal · Last updated May 21, 2026

An insurance company stores transaction data that the company compressed with gzip. The company needs to query the transaction data for occasional audits. Which...

Let's analyze each option to find the most cost-effective solution for the insurance company to query the transaction data for occasional audits: A) Store the data in Amazon Glacier Flexible Retrieval. Use Amazon S3 Glacier Select to query the data. - Amazon Glacier Flexible Retrieval (previously known as Amazon Glacier) is an archival storage solution optimized for infrequently accessed data with low retrieval costs. However, Glacier is designed for long-term archival and does not support fast query performance. Querying data with S3 Glacier Select can be slow and incurs additional costs for retrieval, which may not be the most cost-effective solution for occasional audits. Glacier is best suited for data that is rarely accessed and needs to be retrieved infrequently, making this option less suitable for the company's needs. B) Store the data in Amazon S3. Use Amazon S3 Select to query the data. - Amazon S3 Select enables querying compressed data directly within Amazon S3 without needing to retrieve the entire object. This is a very cost-effective solution when dealing with small to medium-sized datasets stored in S3, especially if the data is already compressed (gzip) in a compatible format (e.g., CSV, JSON, or Parquet). S3 Select can be used to quickly retrieve specific data subsets for occasional audits, making it an efficient and low-cost option for querying compressed data. C) Store the data in Amazon S3. Use Amazon Athena to query the data. - Amazon Athena is a serverless query service that allows SQL queries on data stored in Amazon S3, including support for compressed files like gzip. While Athena provides powerful querying capabilities, it charges based on the am...

Author: Madison · Last updated May 21, 2026

A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure o...

To meet the requirement of running the stored procedure daily in a cost-effective manner, we need to consider the various options based on factors like cost, ease of use, and scalability. Option A: Create an AWS Lambda function to schedule a cron job to run the stored procedure. - Pros: AWS Lambda is serverless, meaning no infrastructure to manage, and you only pay for the actual execution time, making it a cost-effective option for lightweight tasks. It can be easily set up with AWS CloudWatch Events (EventBridge) to schedule the procedure execution. - Cons: Lambda has a maximum execution time limit (15 minutes), so if the stored procedure runs longer than that, it might not work. In addition, Lambda might not be the best solution if there are multiple steps or complex workflows required. - Best for: Simple tasks that require minimal execution time and low cost. Option B: Schedule and run the stored procedure by using the Amazon Redshift Data API in an Amazon EC2 Spot Instance. - Pros: EC2 Spot Instances offer significant cost savings when compared to on-demand instances. - Cons: Using EC2 requires you to manage the instance lifecycle, even if it's a Spot instance. Spot instances can be terminated at any time, potentially leading to interruptions in the execution of the stored procedure. Additionally, the Data API itself is a bit more complex and may ...

Author: Benjamin · Last updated May 21, 2026

A marketing company collects clickstream data. The company sends the clickstream data to Amazon Kinesis Data Firehose and stores the clickstream data in Amazon S3. The company wants to build a series of dashboards that hundreds of users from multiple departments will use. The company will use Amazon QuickSight to develop the dashboards. The company wants a solution t...

To meet the requirement of creating a series of dashboards for hundreds of users with daily updates about clickstream data, let's evaluate each option carefully. A) Use Amazon Redshift to store and query the clickstream data. - Pros: Amazon Redshift is a powerful, scalable data warehouse that can handle large amounts of data and complex queries. It integrates well with Amazon QuickSight for dashboarding. - Cons: While Redshift is highly capable, it is typically more expensive than other solutions like Amazon Athena or S3 for this use case, particularly if the clickstream data is stored in Amazon S3. You would need to load data into Redshift and pay for compute and storage costs, which could become costly over time. - Use case: Redshift is ideal for large-scale analytics workloads with complex joins and aggregations, but it may not be the most cost-effective option for simple querying of clickstream data stored in S3, especially when the data updates daily. B) Use Amazon Athena to query the clickstream data. - Pros: Amazon Athena allows you to directly query data stored in Amazon S3 using SQL, without needing to load the data into a separate service like Redshift. Athena is serverless, meaning you only pay for the queries you run, which can be very cost-effective for this type of use case. It’s a good fit for querying large amounts of raw clickstream data in S3. - Cons: Athena’s performance can vary depending on how the data is structured in S3. However, for simple query workloads and daily updates, Athena’s cost structure is advantageous. - Use case: Athena is ideal for ad-hoc querying and scalable, cost-effective querying of large datasets stored in S3, especially if the queries are relatively straightforward, as in this case. C) Use Amazon S3 analytics to query the clickstream data. - Pros: S3 analytics helps optimize queries and manage data lifecycle policies, but it is not specifically designed for querying data. 
It's mainly used for tracking data access patterns to optimize storage, not for querying large amounts of data. - Cons: S3 analytics does not allow for querying or aggregating clickstream data in the way Athena or Redshift can. It's a tool designed to optimize data storage, not for querying or ...

Author: FrozenWolf2022 · Last updated May 21, 2026

A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resource...

To address the need for a hybrid data orchestration workflow that prioritizes portability and open-source resources, let’s evaluate the available options: A) AWS Data Exchange - Pros: AWS Data Exchange enables the sharing of third-party data, but it's primarily focused on data sharing and not orchestration. It helps in finding, subscribing to, and using data from different providers, rather than orchestrating workflows or processes. - Cons: AWS Data Exchange does not provide capabilities for orchestrating workflows or managing processes in both on-premises and cloud environments. It is not designed for the hybrid orchestration scenario. - Use case: This service would be suitable if the goal were to subscribe to external data sources, but it does not address the requirements for orchestration. B) Amazon Simple Workflow Service (Amazon SWF) - Pros: Amazon SWF is a fully managed service that helps developers coordinate tasks in a distributed application. It can handle workflows that span both on-premises and cloud-based resources. - Cons: SWF is a proprietary AWS service and is not based on open-source frameworks. This means it lacks the portability that the data engineer is looking for, especially for hybrid environments where a flexible, open-source solution would be preferred. - Use case: SWF could be used for workflows, but its lack of open-source nature makes it less ideal in this case. C) Amazon Managed Workflows for Apache Airflow (Amazon MWAA) - Pros: Apache Airflow is an open-source workflow management platform that is widely used for orchestrating complex workflows. Amazon MWAA is a managed service that provid...

Author: Matthew · Last updated May 21, 2026

A gaming company uses a NoSQL database to store customer information. The company is planning to migrate to AWS. The company needs a fully managed AWS solution that will handle high online transaction processing (OLTP) workload, provide single-digit millisecond performance, and pr...

To meet the requirements of handling high online transaction processing (OLTP) workloads, providing single-digit millisecond performance, and ensuring high availability around the world, let's evaluate each option: A) Amazon Keyspaces (for Apache Cassandra) - Pros: Amazon Keyspaces is a managed service for Apache Cassandra, a highly scalable NoSQL database. It is ideal for large-scale, high-throughput, and low-latency workloads. It offers good performance and scalability. - Cons: While Keyspaces can handle large-scale workloads, it may require more operational management compared to some of the other AWS services. Cassandra, even in a managed environment like Keyspaces, often requires additional tuning and expertise to optimize performance and maintain availability across global regions. - Use case: Keyspaces is a good choice for large-scale distributed databases that require extensive customization and tuning but might not offer the easiest operational overhead compared to other options on AWS. B) Amazon DocumentDB (with MongoDB compatibility) - Pros: Amazon DocumentDB is a fully managed database service that is compatible with MongoDB. It is designed to handle high workloads and provide scalability for applications that require document-oriented storage. - Cons: DocumentDB is typically optimized for document-based data and might not be the best fit for high-throughput, low-latency OLTP workloads. While it provides good scalability and availability, it doesn’t always provide the same single-digit millisecond response times as DynamoDB, especially for high-velocity transactional workloads. - Use case: DocumentDB is more suitable for applications that require MongoDB compatibility and document-based storage. It can handle substantial workloads, but may not provide the best performance for high-throughput OLTP applications. C) Amazon Dy...

Author: Lucas Carter · Last updated May 21, 2026

A data engineer creates an AWS Lambda function that an Amazon EventBridge event will invoke. When the data engineer tries to invoke the Lambda function by using an EventBridge event, an Access...

When the data engineer encounters an AccessDeniedException while invoking an AWS Lambda function using an Amazon EventBridge event, the issue likely arises from permission misconfigurations either with the EventBridge event or the Lambda function. Let’s evaluate each option: A) Ensure that the trust policy of the Lambda function execution role allows EventBridge to assume the execution role. - Pros: This option focuses on the trust relationship, where EventBridge must have permission to invoke the Lambda function. However, for an EventBridge rule to invoke Lambda, it needs to have proper permissions in both the EventBridge rule's role and the Lambda function's resource policy, not just the Lambda execution role trust policy. - Cons: The trust policy mainly governs what services (like EventBridge) can assume the Lambda execution role. However, the problem is typically more about what EventBridge is allowed to do (invoke the Lambda) rather than the role assumption itself. - Use case: This option would only apply if there is an issue with Lambda execution role assumption, which is not the root cause of the AccessDeniedException in this scenario. B) Ensure that both the IAM role that EventBridge uses and the Lambda function's resource-based policy have the necessary permissions. - Pros: This is the most likely solution. EventBridge needs the correct permissions to invoke the Lambda function, and Lambda needs to allow EventBridge to invoke it. This can be achieved by configuring the Lambda function's resource-based policy to grant EventBridge permission. Additionally, EventBridge needs the right IAM role permissions to trigger the Lambda function. - Cons: This is the correct solution, and addressin...
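To make the resource-based policy in option B concrete, here is a minimal sketch of the kind of statement that lets EventBridge invoke a function. The ARNs, the statement ID, and the helper name are placeholders chosen for illustration; they do not come from the question.

```python
import json


def eventbridge_invoke_statement(function_arn, rule_arn, sid="AllowEventBridgeInvoke"):
    """Build a Lambda resource-policy statement that allows one
    EventBridge rule to invoke one function."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        # Restrict the grant to a single rule instead of all of EventBridge.
        "Condition": {"ArnLike": {"AWS:SourceArn": rule_arn}},
    }


# Placeholder ARNs for illustration only.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:example-function"
RULE_ARN = "arn:aws:events:us-east-1:123456789012:rule/example-rule"

statement = eventbridge_invoke_statement(FUNCTION_ARN, RULE_ARN)
print(json.dumps(statement, indent=2))
```

If a statement like this is missing from the function's policy, the invocation is denied regardless of what the function's execution role allows, because the execution role governs what the function can do after it starts, not who may start it.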

Author: FlamePhoenix2025 · Last updated May 21, 2026

A company uses a data lake that is based on an Amazon S3 bucket. To comply with regulations, the company must apply two layers of server-side encryption to files that are uploaded to the S3 bucket. The company wants to use an ...

To determine the correct solution for the company's requirement to apply two layers of server-side encryption (SSE) to files uploaded to the Amazon S3 bucket, let's analyze the options based on key factors such as compliance, security, and operational ease. The company wants to use an AWS Lambda function to apply the encryption. A) Use both server-side encryption with AWS KMS keys (SSE-KMS) and the Amazon S3 Encryption Client. - Reasoning: This option combines two encryption layers: SSE-KMS (server-side encryption with keys held in AWS Key Management Service) and the Amazon S3 Encryption Client (client-side encryption that runs before upload). - Why rejected: Only one of the two layers is applied on the server side. The S3 Encryption Client encrypts data before it reaches S3, so it does not count as a second layer of server-side encryption, and it adds work because the client has to be configured and maintained inside the Lambda function. B) Use dual-layer server-side encryption with AWS KMS keys (DSSE-KMS). - Reasoning: DSSE-KMS is a native Amazon S3 encryption option that applies two independent layers of server-side encryption to each object by using AWS KMS keys. - Why this works: This option satisfies the requirement directly. S3 applies both layers on the server side, and the uploader only has to request the `aws:kms:dsse` encryption type (or rely on a bucket default), so no client-side encryption logic is needed. A Lambda fun...

Author: Liam123 · Last updated May 21, 2026

A data engineer notices that Amazon Athena queries are held in a queue before the queries run. How can the d...

Let's analyze each option in detail to determine the best way to prevent queries from being queued in Amazon Athena: A) Increase the query result limit - Explanation: The query result limit determines the amount of data that can be returned by a query. While this setting controls the size of query results, it has no effect on how queries are queued or how resources are allocated for query execution. - Rejection Reason: Increasing the query result limit doesn't prevent queries from being queued. It only affects the size of the result set returned. This does not solve the root issue of query queuing. B) Configure provisioned capacity for an existing workgroup - Explanation: Amazon Athena allows you to configure a workgroup with provisioned capacity, which ensures that a certain amount of resources (e.g., CPU, memory) are allocated to a specific workgroup. With provisioned capacity, Athena queries in that workgroup will have dedicated resources and avoid being queued, as the capacity is pre-allocated. - Why this works: Provisioned capacity directly addresses the issue of queries being queued because it guarantees that the required resources are available for query execution without waiting for available capacity. - Rejection Reason: None, this option directly solves the problem of query queuing. C) Use federated queries ...

Author: CrystalWolfX · Last updated May 21, 2026

A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. The data engineer has set the maximum concurrency for the AWS Glue job to 1. The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were lo...

To understand why the AWS Glue job is reprocessing the files from Amazon S3 that were already loaded in previous runs, we need to explore the behavior of AWS Glue's bookmark feature and how it interacts with job settings and permissions. A) The AWS Glue job does not have the `s3:GetObjectAcl` permission that is required for bookmarks to work correctly. - Reasoning: The bookmark feature in AWS Glue uses metadata to track which files have already been processed to avoid reprocessing. This metadata is stored in a specific location (usually in the Glue Data Catalog) and relies on the AWS Glue job having sufficient permissions to track the objects' status in S3. - Why rejected: While it’s true that the `s3:GetObjectAcl` permission is necessary for certain Glue functionalities, this specific permission isn't directly related to the bookmark feature. The missing permission would more likely cause a different issue (like failing to access or read S3 objects) rather than the reprocessing of already-loaded files. The problem here seems to be related to how the bookmarks are being tracked rather than missing permissions. B) The maximum concurrency for the AWS Glue job is set to 1. - Reasoning: The maximum concurrency setting controls the number of concurrent tasks that AWS Glue can run for a job. Setting this to 1 ensures that only one task runs at a time, but this is not directly related to the bookmark feature or why files are being reprocessed. - Why rejected: While setting the concurrency to 1 can affect the speed and parallelism of the job, it does not explain why the bookmark feature isn't preventing reprocessing. The problem appears to be related to the state or tracking of processed files, not the concurrency of the job itself. C) The...

Author: Elizabeth · Last updated May 21, 2026

An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes. The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python ...

To identify the best solution for migrating data pipelines from an on-premises environment into the AWS Cloud, we need to focus on the requirements: - No server management (serverless solution). - Orchestration of Python and Bash scripts. - Minimal or no code refactoring. Let’s evaluate the options based on these key factors: A) AWS Lambda - Reasoning: AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. It supports Python and Bash scripts via Lambda functions. However, Lambda is typically suited for short-duration tasks (maximum runtime of 15 minutes) and can be triggered by events, but it is not designed as an orchestration tool for complex workflows or managing multiple steps in a pipeline. - Why rejected: While AWS Lambda could technically execute individual Python or Bash scripts, it is not an ideal solution for orchestrating a series of scripts. Lambda requires each step to be triggered and managed separately, leading to potentially complex configurations. It's not designed for managing or chaining together multi-step workflows. B) Amazon Managed Workflows for Apache Airflow (Amazon MWAA) - Reasoning: Apache Airflow is a popular open-source tool for orchestrating workflows, and Amazon MWAA provides a managed service for running Apache Airflow in AWS. Airflow is highly flexible, allowing the orchestration of complex data pipelines. It can easily handle Python and Bash scripts. - Why rejected: While Amazon MWAA is a powerful tool for orchestrating complex workflows, it requires more operational management compared to other solutions (e.g., setting up the environment, configuring DAGs, managing Airflow resources). The operational overhead can be higher than other options, and since the company wants to avoid managing servers, this solution may not be the best fit for the “least operation...

Author: Ahmed97 · Last updated May 21, 2026

A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur. The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse. The comp...

Let's analyze each option to determine the best solution for integrating the PLM application’s MySQL database updates into Amazon Redshift with the least development effort. A) Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job. - Explanation: AWS Glue is a fully managed ETL service, and it can connect to MySQL databases via JDBC to extract data. You can configure the ETL job to run on a schedule to pull data from MySQL and load it into Amazon Redshift. However, this option involves scheduling jobs to run at intervals, which means it may not achieve near real-time updates. The scheduled nature of the job introduces latency between updates in the MySQL database and when the data is available in Redshift. - Rejection Reason: While AWS Glue is a good solution for ETL, it introduces some latency due to its scheduled nature and is not optimal for near real-time integration. B) Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task. - Explanation: AWS DMS (Database Migration Service) supports change data capture (CDC) to replicate data from MySQL to Amazon Redshift continuously in near real time. DMS can perform full loads followed by incremental replication (CDC), which means it will continuously capture and migrate changes from the MySQL database to Redshift, offering near real-time updates. - Why this works: AWS DMS is a fully managed service designed for database replication with minimal setup. It provides an easy way to replicate MySQL database changes to Redshift continuously, achieving the real-time integration that the company needs. DMS also offers robust support for ongoing database updates and replication with minimal development effort. - Rejection Reason: None. 
This solution provides the requ...
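For option B, the task definition itself is small. The sketch below only builds the parameters that a DMS replication task would need and does not call AWS; the parameter names follow the DMS CreateReplicationTask API, while the ARNs, the task identifier, and the `plm` schema name are placeholders assumed for illustration.

```python
import json


def build_cdc_task_params(task_id, source_arn, target_arn, instance_arn, schema):
    """Assemble parameters for a full-load-plus-CDC replication task."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                # "%" is the wildcard that matches every table in the schema.
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Initial copy first, then ongoing change data capture.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


# Placeholder identifiers for illustration only.
params = build_cdc_task_params(
    task_id="plm-mysql-to-redshift",
    source_arn="arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLESOURCE",
    target_arn="arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLETARGET",
    instance_arn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLEINSTANCE",
    schema="plm",
)
print(json.dumps(params, indent=2))
```

Nothing here requires custom extraction code: the selection rule and the migration type are configuration, which is the basis of the "least development effort" argument.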

Author: MysticJaguar44 · Last updated May 21, 2026

A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets. The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solut...

In this scenario, the company needs a serverless solution for querying the clickstream data stored in Amazon S3, with ACID properties for transaction management, the ability to partition the data, and cost-effectiveness. Evaluating each option: A) Amazon S3 Select: - Pros: Amazon S3 Select allows you to query specific data within an S3 object, reducing the amount of data transferred. - Cons: While this option is serverless and cost-effective for reading small portions of data, S3 Select works on one object at a time, so it cannot run JOIN operations across objects, and it does not provide ACID guarantees. It also does not offer partitioning or indexing for optimized query performance at scale. B) Amazon Redshift Spectrum: - Pros: Amazon Redshift Spectrum allows querying data directly from S3, and it integrates well with Amazon Redshift. It supports complex SQL queries with joins, aggregations, and partitioning. - Cons: Redshift Spectrum is not serverless because it requires a running Redshift cluster, which incurs additional costs for cluster uptime and maintenance. This increases the overall cost compared to fully serverless options. Redshift's ACID guarantees also apply to tables stored in Redshift itself rather than to the external S3 data that Spectrum reads. C) Amazon Athena: - Pros: Amazon Athena is a fully serverless service that allows you to query S3 data using SQL. It can handle large datasets, supports partitioning, and allows querying with SQL JOIN operations. It also supports ACID transactions when the data is stored as Apache Iceberg tables. Athena is highly cost-effective...

Author: Aria · Last updated May 21, 2026

A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B. Which ...

To migrate data using AWS Database Migration Service (AWS DMS) between two AWS accounts and regions (from Account_A in eu-east-1 to Account_B in eu-west-1), we need to ensure the following: 1. Replication instance location: The replication instance must be able to access both the source database (RDS for PostgreSQL in eu-east-1) and the target database (Amazon Redshift in eu-west-1). This means the replication instance needs to be in a region that is reachable by both source and target. 2. Cross-account access: DMS replication instances need to have the correct network and IAM permissions to interact with databases in different AWS accounts. This involves configuring network settings (VPC, subnets, security groups) and IAM roles for cross-account access. Evaluating each option: A) Set up an AWS DMS replication instance in Account_B in eu-west-1: - Pros: This option places the replication instance in Account_B, in the same region as the target Redshift cluster (eu-west-1). It simplifies network access to the target (Amazon Redshift) because the replication instance is already within the same region as the target database. It can also have direct access to Account_B's resources. - Cons: The replication instance will need cross-account access to the source database in Account_A (eu-east-1). This is achievable with the proper IAM role and network configuration (VPC peering, VPN, etc.). This setup is a common and recommended configuration. B) Set up an AWS DMS replication instance in Account_B in eu-east-1: - Cons: While the replication instance is in Account_B, it is in the wrong region (eu-east-1) and will create latency and potential network issues because the target Redshift cluster is located in eu-west-1. Cross-region rep...

Author: David · Last updated May 21, 2026

A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file. The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files i...

In this scenario, the company wants to increase the speed of data ingestion into an Amazon Redshift cluster while avoiding an increase in costs. The company is currently using separate `COPY` commands for each data file, which is inefficient. Let's evaluate each option based on the requirements: A) Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift. - Pros: Amazon EMR is a powerful tool for processing large volumes of data and can help in managing and preprocessing large datasets. If used correctly, it can handle parallel processing of files. - Cons: Provisioned EMR clusters can be costly, especially for large datasets, which contradicts the requirement of not increasing the cost of the process. Also, the step of copying the data into one folder doesn’t directly address the problem of inefficient `COPY` commands or optimize the Redshift loading process. B) Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift. - Cons: This solution introduces Amazon Aurora as an intermediate storage, which complicates the workflow and increases both operational complexity and cost. Additionally, loading data into Amazon Aurora before copying to Redshift introduces unnecessary overhead. C) Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift. - Cons: While AWS Glue can help with orchestrating the process of moving data between sources, copying data into one fold...

Author: StarlightBear · Last updated May 21, 2026

A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apac...

To meet the requirements of converting `.csv` files to `JSON` format and storing the resulting files in Apache Parquet format, we need to evaluate the solutions based on development effort, integration with Kinesis Data Firehose, and the least operational complexity. Let’s examine the options: A) Use Kinesis Data Firehose to convert the .csv files to JSON. Use an AWS Lambda function to store the files in Parquet format. - Reasoning: This option assigns the two steps to the wrong services. The built-in record format conversion in Kinesis Data Firehose expects JSON input and writes Apache Parquet or Apache ORC; it does not convert `.csv` to JSON on its own. - Why rejected: Firehose cannot perform the `.csv`-to-JSON step natively, and writing Parquet from a custom Lambda function means building and maintaining conversion code that Firehose already provides. This adds development effort and complexity. B) Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format. - Reasoning: Firehose can write Parquet through record format conversion, but only when the incoming records are already JSON. - Why rejected: This option would not work because Firehose has no built-in `.csv`-to-JSON conversion. The `.csv` records must first be transformed to JSON (for example, by a Lambda function that Firehose invokes) before record format conversion can produce Parquet. C) Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON and...
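The Lambda transformation step discussed above can be sketched as follows. This is a minimal illustration, assuming a hypothetical column layout (`COLUMNS`) for the incoming `.csv` rows; it follows the Firehose data-transformation contract in which each record arrives base64-encoded and is returned with the same `recordId`, a `result`, and base64-encoded `data`.

```python
import base64
import csv
import io
import json

# Hypothetical column order for the incoming CSV rows (an assumption).
COLUMNS = ["device_id", "timestamp", "reading"]


def lambda_handler(event, context):
    """Convert each CSV record delivered by Firehose into newline-delimited JSON."""
    output = []
    for record in event["records"]:
        text = base64.b64decode(record["data"]).decode("utf-8")
        rows = csv.reader(io.StringIO(text))
        lines = [json.dumps(dict(zip(COLUMNS, row))) for row in rows if row]

        if not lines:
            # Nothing usable in this record: tell Firehose to drop it.
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        payload = ("\n".join(lines) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

With the records already in JSON, the delivery stream's record format conversion can then handle the Parquet output without further custom code.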

Author: Emma · Last updated May 21, 2026

A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to en...

To meet the company's policy of using TLS 1.2 or above for data encryption in transit with AWS Transfer Family, let's evaluate each option carefully: A) Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use. - Cons: SSH keys are used for authentication purposes, not for encrypting data in transit. SSH keys cannot enforce TLS 1.2 encryption; they are related to securing the connection and verifying the identity of the user or client. This approach does not address the encryption protocol requirement (TLS 1.2) specified by the policy. B) Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above. - Cons: Security group rules are used to control access to the server by specifying allowed IP ranges, ports, and protocols, but they cannot enforce encryption standards. Security groups do not have any control over which version of TLS or other encryption protocols are used for communication. Therefore, this approach would not meet the TLS 1.2 encryption requirement. C) Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2. - Pros: AWS Transfer Family supports the ability to configure the minimum protocol version for secure transfers. By specifying a minimum pr...

Author: ElectricLionX · Last updated May 21, 2026

A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy i...

To migrate an on-premises Apache Kafka server to AWS with the least management overhead using a replatform strategy, we need to consider key factors such as ease of management, compatibility with the existing application, and scalability requirements. Option A: Amazon Kinesis Data Streams - Rejected: Amazon Kinesis Data Streams is a fully managed service for streaming data, but it is not directly compatible with Apache Kafka. The company is using Apache Kafka, which means a transition to Kinesis Data Streams would require substantial changes to the application and its architecture, which contradicts the replatform strategy (which aims for minimal changes). Option B: Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster - Rejected: While Amazon MSK provisioned clusters offer a fully managed Apache Kafka service, they still require some level of management (e.g., capacity planning, monitoring, and scaling). This option adds more management overhead than the serverless option, and the company wants the least management overhead. Option C: Amazon Kinesis Data Firehose - Rejected: Amazon Kinesis Data Firehose is a fully managed service for delivering streaming data to destinations like Amazon S3, Amazon Redshift, or Amazon OpenSearch Service. However, it is not directly compatible with Kafka and would also ...

Author: Sofia · Last updated May 21, 2026

A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support increme...

To meet the requirement of building an automated ETL pipeline that supports incremental data processing, the data engineer needs a feature that helps track which data has already been processed. This is especially important for incremental processing to ensure that only new or updated data is processed. Option A: Workflows - Rejected: AWS Glue Workflows are designed to manage and orchestrate a series of jobs and crawlers. While workflows are useful for managing complex ETL pipelines with multiple jobs and dependencies, they do not directly handle incremental processing or data tracking. Thus, workflows are not the right choice for supporting incremental data ingestion. Option B: Triggers - Rejected: AWS Glue Triggers are used to schedule and initiate jobs based on specific events or time-based schedules. While triggers can automate the ETL process, they do not inherently handle incremental data processing or track which data has already been processed. They simply initiate the job execution, which means additional logic would be required to handle the incremental aspect. Option C: Job bookmarks - Selected: AWS Glue Job Bookmarks are specifically designed for incremental data processing. When enabled, job bookmarks keep track of the state of the data that has a...
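In a Glue job, bookmarks are turned on with the job parameter `--job-bookmark-option job-bookmark-enable`. The idea behind them can be illustrated with a small plain-Python sketch. This is a conceptual illustration only, not how AWS Glue implements bookmarks; the function name and the `(key, last_modified)` tuples are hypothetical.

```python
from datetime import datetime, timezone


def select_new_objects(objects, bookmark):
    """Return the keys modified after the bookmark, plus the advanced bookmark.

    objects  -- iterable of (key, last_modified) tuples (hypothetical shape)
    bookmark -- datetime of the newest object already processed, or None
    """
    new = sorted(
        (obj for obj in objects if bookmark is None or obj[1] > bookmark),
        key=lambda obj: obj[1],
    )
    # Only move the bookmark forward when something new was found.
    new_bookmark = new[-1][1] if new else bookmark
    return [key for key, _ in new], new_bookmark


listing = [
    ("raw/2024-01-01.csv.gz", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("raw/2024-01-02.csv.gz", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]

# First run: no bookmark yet, so every object is processed.
first_keys, bookmark = select_new_objects(listing, None)

# Second run: one more file has arrived; only that file is selected.
listing.append(("raw/2024-01-03.csv.gz", datetime(2024, 1, 3, tzinfo=timezone.utc)))
second_keys, bookmark = select_new_objects(listing, bookmark)
```

Persisting the bookmark between runs is what lets each run skip data that an earlier run already handled.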

Author: Victoria · Last updated May 21, 2026

A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams. A data engineer has observed network outages during certain times of da...

To configure exactly-once delivery for the entire processing pipeline, we need to ensure that the data is processed once and only once, even in the case of network outages or other interruptions. Let’s evaluate the given options based on their ability to guarantee exactly-once processing while taking into account factors like reliability, ease of implementation, and potential overhead. A) Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source. - Reasoning: By embedding a unique ID in each record, the application could check for duplicate records at the processing stage, and filter them out. This would prevent duplicates during the processing step. - Why rejected: While this approach can handle duplicate data during processing, it doesn’t prevent duplicates from being sent to Kinesis Data Streams in the first place. Additionally, it places the burden of deduplication on the application logic, which can be error-prone and increase complexity. This solution does not guarantee exactly-once delivery, but only ensures deduplication during processing. B) Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events. - Reasoning: Amazon Managed Service for Apache Flink supports exactly-once semantics through checkpointing, which ensures that processed records are stored periodically and that data is reprocessed from the last checkpoint in case of a failure. This can prevent duplicate processing of records in a real-time stream. - Why selected: Flink's exactly-once semantics and its checkpointing mechanism are specifically designed to ensure reliable delivery, even in the event of network outages or failures. By updating the checkpoint configuration, the data engineer can configure Flink to process records exac...
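The unique-ID approach described in option A can be sketched as follows. This is a minimal illustration under stated assumptions: the field name `record_id`, the helper `make_record`, and the `Deduplicator` class are hypothetical, and the in-memory set stands in for whatever durable store a real consumer would need.

```python
import uuid


def make_record(payload):
    """Producer side: attach a unique ID once, before the first send attempt."""
    return {"record_id": str(uuid.uuid4()), "payload": payload}


class Deduplicator:
    """Consumer side: pass each record ID through at most once."""

    def __init__(self):
        self._seen = set()  # in-memory only; an assumption for this sketch

    def process(self, records):
        unique = []
        for record in records:
            if record["record_id"] in self._seen:
                continue  # duplicate caused by a producer retry
            self._seen.add(record["record_id"])
            unique.append(record)
        return unique


# A retry after a network outage resends the same record with the same ID.
record = make_record({"amount": 10})
dedup = Deduplicator()
first_batch = dedup.process([record, record])
second_batch = dedup.process([record])
```

Because the ID is assigned before the first send, a retried record carries the same ID and can be recognized downstream.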

Author: Victoria · Last updated May 21, 2026

Amazon Practice Questions, Discussions & Exam Topics by our Authors

A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only spec...

During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script. A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must sec...

A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the QuickSight dashboard, the data engineer receives an error message that...

A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models. The data engineer receives an access denied error when the data engineer tries to prepare the da...

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is...

A company stores 10 to 15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine. The company wants to transform the data to optimize query runtime and storage cos...

A company uses Apache Airflow to orchestrate the company's current on-premises data pipelines. The company runs SQL data quality check tasks as part of the pipelines. The company wants to migrate the pipelines to AWS and to use AW...

A company uses Amazon EMR as an extract, transform, and load (ETL) pipeline to transform data that comes from multiple sources. A data engineer must orchestrate the pipeline to maximize...

A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table. The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost...

A company plans to provision a log delivery stream within a VPC. The company configured the VPC flow logs to publish to Amazon CloudWatch Logs. The company needs to send the flow logs to Splunk in near real time for fur...

An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network t...

A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all row...

A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every...

A retail company uses AWS Glue for extract, transform, and load (ETL) operations on a dataset that contains information about customer orders. The company wants to implement specific validation rule...

An insurance company stores transaction data that the company compressed with gzip. The company needs to query the transaction data for occasional audits. Which...

A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure o...

A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resource...

A gaming company uses a NoSQL database to store customer information. The company is planning to migrate to AWS. The company needs a fully managed AWS solution that will handle high online transaction processing (OLTP) workload, provide single-digit millisecond performance, and pr...

A data engineer creates an AWS Lambda function that an Amazon EventBridge event will invoke. When the data engineer tries to invoke the Lambda function by using an EventBridge event, an Access...

A company uses a data lake that is based on an Amazon S3 bucket. To comply with regulations, the company must apply two layers of server-side encryption to files that are uploaded to the S3 bucket. The company wants to use an ...

A data engineer notices that Amazon Athena queries are held in a queue before the queries run. How can the d...

A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B. Which ...