2.1 Project Overview
Overview of Project ☁️
This project involves building a serverless data pipeline on AWS for processing CSV files. The pipeline automates the ingestion, transformation, and visualization of data. CSV files are uploaded to a raw data S3 bucket (csv-raw-data), which triggers an AWS Lambda function that preprocesses the data and stores it in the processed data bucket (csv-processed-data).
AWS Glue is then used for further ETL (Extract, Transform, Load) operations, and the final data is stored in the final data bucket (csv-final-data). Finally, Amazon QuickSight is used to create interactive dashboards and reports for visualizing the final data.
- Data Ingestion: CSV files are uploaded to the raw data bucket in Amazon S3, which serves as the central storage for raw, processed, and final data.
- Trigger and Transformation: Each upload automatically triggers an AWS Lambda function that preprocesses the data (for example, filtering or formatting) and hands it off to AWS Glue for detailed ETL operations.
- Data Storage: Processed and final data are written back to Amazon S3, which scales with the volume of data.
- Visualization: Amazon QuickSight connects to the data source to create dynamic and interactive dashboards for visualization and reporting.
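The ingestion and preprocessing steps above can be sketched as a small Lambda handler. This is a minimal sketch, not the project's actual function: the cleaning rule (trim whitespace, drop empty rows) and the helper names `preprocess_csv` and `object_locations` are assumptions made for illustration. Only the bucket name comes from the overview.

```python
import csv
import io
from urllib.parse import unquote_plus

PROCESSED_BUCKET = "csv-processed-data"  # bucket name taken from the overview


def preprocess_csv(raw_text: str) -> str:
    """Assumed cleaning rule: trim whitespace in each cell, drop empty rows."""
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [cell.strip() for cell in row]
        if any(cleaned):  # skip rows that contain no data at all
            writer.writerow(cleaned)
    return out.getvalue()


def object_locations(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    return [
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without the AWS SDK

    s3 = boto3.client("s3")
    locations = object_locations(event)
    for bucket, key in locations:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(
            Bucket=PROCESSED_BUCKET,
            Key=key,
            Body=preprocess_csv(body).encode("utf-8"),
        )
    return {"processed": len(locations)}
```

Keeping the cleaning logic in a pure function makes it easy to test locally before deploying; the handler itself only moves bytes between the two buckets.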
Services Used 🛠
- Amazon S3: Used for scalable storage of raw and processed data, providing event-driven architecture capabilities with S3 event notifications. [Storage]
- AWS Lambda: Acts as a serverless compute layer, automatically triggered to preprocess and clean CSV files upon upload to S3. [Compute]
- AWS Glue: Provides ETL capabilities to extract, transform, and load data into a usable format for analysis. [ETL/Big Data]
- Amazon QuickSight: Offers interactive dashboards and reports for real-time data visualization. [Analytics]
- IAM Roles and Policies: Ensures secure access to S3, Lambda, Glue, and QuickSight. [Permissions]
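To make the S3 event notification mentioned above concrete, here is a hedged sketch of how the raw data bucket could be wired to the Lambda function with boto3. The helper names and the `.csv` suffix filter are assumptions for illustration; the later lessons may configure the trigger through the console instead.

```python
def build_notification_config(lambda_arn: str, suffix: str = ".csv") -> dict:
    """Build the notification configuration that S3 expects for a Lambda trigger."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire on every kind of object creation (PUT, POST, copy, multipart).
                "Events": ["s3:ObjectCreated:*"],
                # Assumed filter: only react to keys that end in .csv.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_trigger(bucket: str, lambda_arn: str) -> None:
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

A call such as `attach_trigger("csv-raw-data", lambda_arn)` would apply it, where `lambda_arn` is the ARN of your own function. S3 also needs permission to invoke the function, which is granted on the Lambda side through a resource-based policy.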
Architectural Diagram ✍️
Steps to be performed 👩‍💻
In the next few lessons, we'll be going through the following steps.
- Setup and configuration
- Data ingestion and preprocessing
- Data transformation with AWS Glue
- Data visualization with Amazon QuickSight
Estimated Time & Cost ⚙️
- This project is estimated to take about 3-4 hours
- Cost: $0.50-$1
Clean Up 🗑️
1. Delete Amazon QuickSight Resources
- Go to Amazon QuickSight and remove any dashboards, datasets, and analyses created for this project.
- If QuickSight is no longer needed, unsubscribe to avoid charges.
2. Delete Raw, Processed and Final Data S3 Buckets
- Navigate to the S3 console. Empty and delete the three S3 buckets used for storing raw, processed and final CSV files.
3. Delete AWS Glue Resources
- Go to AWS Glue in the AWS console. Delete the Glue Job created for ETL processing.
- Remove the Glue Data Catalog table and the crawler configured for this project.
4. Delete AWS Lambda Functions
- Navigate to AWS Lambda. Delete the Lambda function that was used for preprocessing CSV files.
5. Remove IAM Roles and Policies
- Delete the IAM roles Lambda-S3-Glue-Role and Glue-Service-Role created for this project.
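The bucket clean-up in step 2 can also be scripted rather than done in the console. The sketch below assumes the three bucket names from the overview and uses a hypothetical helper, `empty_and_delete`, that takes a boto3 S3 resource as an argument. It permanently deletes data, so check the names before running it.

```python
PROJECT_BUCKETS = ["csv-raw-data", "csv-processed-data", "csv-final-data"]


def empty_and_delete(s3_resource, bucket_names=PROJECT_BUCKETS) -> list[str]:
    """Empty each bucket, delete it, and return the names that were removed."""
    deleted = []
    for name in bucket_names:
        bucket = s3_resource.Bucket(name)
        # A bucket must be empty before S3 allows it to be deleted.
        # If versioning is enabled, old versions must be removed as well,
        # e.g. with bucket.object_versions.delete().
        bucket.objects.all().delete()
        bucket.delete()
        deleted.append(name)
    return deleted


# Usage (requires boto3 and credentials with delete permissions):
#   import boto3
#   empty_and_delete(boto3.resource("s3"))
```

Passing the S3 resource in as a parameter keeps the function easy to dry-run against a stand-in object before pointing it at a real account.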