2.1 Project Overview

Overview of Project ☁️

This project builds a serverless data pipeline on AWS for processing CSV files. The pipeline automates the ingestion, transformation, and visualization of data. CSV files are uploaded to a raw data S3 bucket (csv-raw-data), which triggers an AWS Lambda function that preprocesses the data and stores it in the processed data bucket (csv-processed-data).

AWS Glue is then used for further ETL (Extract, Transform, Load) operations, and the final data is stored in the final data bucket (csv-final-data). Finally, Amazon QuickSight is used to create interactive dashboards and reports for visualizing the final data.

  • Data Ingestion: CSV files are uploaded to the raw data S3 bucket (csv-raw-data); separate S3 buckets hold the processed and final data.
  • Trigger and Transformation: An AWS Lambda function is automatically triggered to preprocess the data, such as filtering or formatting, and pass it to AWS Glue for detailed ETL operations.
  • Data Storage: Preprocessed and final data are stored in their own S3 buckets (csv-processed-data and csv-final-data) for scalable storage.
  • Visualization: Amazon QuickSight connects to the data source to create dynamic and interactive dashboards for visualization and reporting.
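
The ingestion and preprocessing steps above can be sketched as a Lambda handler. This is a minimal sketch, not the function built later in the course: the cleaning rules (trim whitespace, drop blank rows) and the helper name preprocess_csv are assumptions chosen for illustration. Only the bucket name csv-processed-data comes from this overview, and the hand-off to AWS Glue is left out.

```python
import csv
import io
from urllib.parse import unquote_plus

PROCESSED_BUCKET = "csv-processed-data"  # bucket name taken from the overview


def preprocess_csv(text: str) -> str:
    """Trim whitespace in every cell and drop blank rows (assumed rules)."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [cell.strip() for cell in row]
        if any(cleaned):  # skip rows where every cell is empty
            writer.writerow(cleaned)
    return out.getvalue()


def lambda_handler(event, context):
    # boto3 is imported lazily so preprocess_csv can be tested without AWS.
    import boto3

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(
            Bucket=PROCESSED_BUCKET,
            Key=key,
            Body=preprocess_csv(raw).encode("utf-8"),
        )
    return {"statusCode": 200}
```

Keeping the cleaning logic in a plain function makes it easy to try locally on a sample file before deploying the handler.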


Services Used 🛠

  1. Amazon S3: Used for scalable storage of raw and processed data, providing event-driven architecture capabilities with S3 event notifications. [Storage]
  2. AWS Lambda: Acts as a serverless compute layer, automatically triggered to preprocess and clean CSV files upon upload to S3. [Compute]
  3. AWS Glue: Provides ETL capabilities to extract, transform, and load data into a usable format for analysis. [ETL/Big Data]
  4. Amazon QuickSight: Offers interactive dashboards and reports for real-time data visualization. [Analytics]
  5. IAM Roles and Policies: Ensure secure access to S3, Lambda, Glue, and QuickSight. [Permissions]
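
To illustrate the S3 event notifications mentioned above, here is a hedged sketch of how an upload trigger could be described with boto3. The helper names build_csv_notification and attach_trigger, the .csv suffix filter, and any Lambda ARN you pass in are assumptions for illustration; the course may set up the trigger through the console instead.

```python
def build_csv_notification(lambda_arn: str) -> dict:
    """Build a notification config that invokes a Lambda for new .csv objects."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Assumed filter: only react to keys ending in .csv
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    }


def attach_trigger(bucket: str, lambda_arn: str) -> None:
    """Apply the config to a bucket (replaces any existing notification config)."""
    import boto3  # imported lazily so the builder can be inspected offline

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_csv_notification(lambda_arn),
    )
```

For this call to succeed, the Lambda function also needs a resource-based permission that allows S3 to invoke it.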


Architectural Diagram ✍️


Steps to be performed 👩‍💻

In the next few lessons, we'll be going through the following steps.

  1. Setup and configuration
  2. Data ingestion and preprocessing
  3. Data transformation with AWS Glue
  4. Data visualization with Amazon QuickSight


Estimated Time & Cost ⚙️

  • This project is estimated to take about 3-4 hours
  • Cost: $0.50-$1


Clean Up 🗑️

1. Delete Amazon QuickSight Resources

  • Go to Amazon QuickSight and remove any dashboards, datasets, and analyses created for this project.
  • If QuickSight is no longer needed, unsubscribe to avoid charges.

2. Delete Raw, Processed and Final Data S3 Buckets

  • Navigate to the S3 console. Empty and delete the three S3 buckets used for storing raw, processed and final CSV files.

3. Delete AWS Glue Resources

  • Go to AWS Glue in the AWS console. Delete the Glue Job created for ETL processing.
  • Remove the Glue crawler and the Data Catalog table configured for this project.

4. Delete AWS Lambda Functions

  • Navigate to AWS Lambda. Delete the Lambda function that was used for preprocessing CSV files.

5. Remove IAM Roles and Policies

  • Delete the IAM roles created for this project: Lambda-S3-Glue-Role and Glue-Service-Role.
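
The bucket clean-up in step 2 can also be scripted. The following is a minimal sketch, assuming the three bucket names from this overview and unversioned buckets; the function name empty_and_delete is hypothetical. It takes a boto3 S3 resource as an argument, so nothing is deleted until you call it yourself.

```python
PROJECT_BUCKETS = ["csv-raw-data", "csv-processed-data", "csv-final-data"]


def empty_and_delete(s3_resource, bucket_names=PROJECT_BUCKETS):
    """Empty each bucket, then delete it; returns the names that were removed."""
    removed = []
    for name in bucket_names:
        bucket = s3_resource.Bucket(name)
        bucket.objects.all().delete()  # a bucket must be empty before deletion
        bucket.delete()
        removed.append(name)
    return removed


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   empty_and_delete(boto3.resource("s3"))
```

If versioning was enabled on a bucket, the object versions have to be removed as well (for example with bucket.object_versions.delete()) before the bucket can be deleted.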

