End-to-End Data Engineering Project with Python, AWS, and Spark for YouTube Analysis

Deepak Dewani
14 min read · Apr 25, 2024

Introduction

In today's data-driven landscape, harnessing the power of large-scale data processing is paramount for organizations aiming to extract actionable insights. It is estimated that around 2.5 quintillion bytes of data are generated every day, coming from sources such as social media interactions, online transactions, and sensor readings, and with the proliferation of connected devices and digital technologies this figure is expected to keep growing exponentially. This article walks through an end-to-end data engineering project built with Python, AWS (Amazon Web Services), and Apache Spark. We will gather raw data from diverse sources and formats, a significant challenge given its sheer volume, variety, and speed of accumulation, and navigate the complexities of data ingestion, processing, and analysis, showing how these technologies work together to unlock the potential of big data.

High-Level Data Architectural Diagram of DE Project

To fully engage in this project, a few prerequisites are necessary:

  1. An active AWS account. (If you don't have an AWS account, please refer to the link below to create a 1-year AWS Free Tier account.)
  2. Fundamental proficiency in Python and PySpark.
  3. Access to a laptop with internet connectivity.

Join us for an immersive journey filled with data and invaluable learning opportunities. Throughout this project, you’ll delve into:

  1. Practical Python application
  2. AWS services such as IAM, S3, Data Catalog, Glue, Athena, and Lambda
  3. Business Intelligence Service such as AWS QuickSight
  4. PySpark utilization
  5. ETL processes
  6. Data Lake
  7. SQL
  8. Data Ingestion
  9. Big Data Analysis

Get ready to embark on this enriching experience and expand your skill set!

To initiate the project, we’ll begin by accessing your AWS root account to establish a new user profile.

(How to create a sub-user account from the root account)

This approach ensures that subsequent implementations are conducted within the confines of a dedicated user account, thereby safeguarding the integrity of the root account and minimizing the risk of inadvertent role modifications or data loss. This practice adheres to industry best standards, recognizing that the root account possesses elevated privileges, and as such, should be protected and utilized judiciously.
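For reference, the same dedicated user can also be created programmatically with boto3. This is only a minimal sketch: the user name is a made-up example, and the single attached policy is an illustrative assumption that you would scope down to the services your project actually needs.

import boto3

iam = boto3.client("iam")

# Hypothetical user name; follow whatever naming convention your team uses
user_name = "de-youtube-project-user"

# Create the dedicated IAM user
iam.create_user(UserName=user_name)

# Attach a managed policy so the user can reach the project's resources;
# in practice you would restrict this to only the permissions required
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# Generate an access key pair for programmatic (CLI/SDK) access
access_key = iam.create_access_key(UserName=user_name)["AccessKey"]
print("Access key ID:", access_key["AccessKeyId"])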

Transferring Source Data from Local to AWS S3 Bucket

The initial phase of our project entails familiarizing ourselves with the dataset: its structure, content, and source location. We will be analyzing data from various regions, such as the US, India, and Russia, sourced from publicly available YouTube statistics datasets.

You can obtain the dataset via the provided link and store it locally for further analysis.

Subsequently, we will transfer this locally stored data to an S3 bucket using the AWS CLI. To facilitate this transfer, it’s essential to configure the AWS CLI to establish a seamless connection between the local environment and the AWS account.

For instructions on how to configure the AWS CLI on your local system, please refer to the link below.
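Once the CLI is configured, you can quickly confirm that the stored credentials work. The sketch below uses boto3, which reads the same credentials the CLI writes to ~/.aws, and simply asks AWS which identity they resolve to.

import boto3

# boto3 picks up the credentials saved by 'aws configure' automatically;
# this STS call confirms which account and user they belong to
sts = boto3.client("sts")
identity = sts.get_caller_identity()

print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])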

Creation of S3 Bucket In AWS

Access the AWS Management Console and navigate to the S3 service. Click on “Create Bucket” to initiate the process. Provide a unique name for your bucket, ensuring it is distinct globally, and then proceed by clicking “Next.” This bucket will serve as the destination for transferring the source data from your local environment.
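If you prefer to create the bucket programmatically instead of through the console, a minimal boto3 sketch might look like the following; the bucket name and region are placeholders.

import boto3

s3 = boto3.client("s3")

# S3 bucket names must be globally unique; replace both values below
bucket_name = "your-bucket-name"
region = "ap-south-1"

# Outside us-east-1, S3 requires an explicit LocationConstraint
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)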

Proceed to your Command Prompt or Terminal and navigate to the directory where you have stored or downloaded the source data. Once there, execute the following command to initiate the transfer of data from your local environment to the designated S3 bucket.

# Replace 'your-bucket-name' with the name of your bucket

# To copy all JSON reference data to the same location:
aws s3 cp . s3://your-bucket-name/youtube/raw_statistics_reference_data/ --recursive --exclude "*" --include "*.json"

# To copy each data file to its own location, following Hive-style partition paths:
aws s3 cp CAvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=ca/
aws s3 cp DEvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=de/
aws s3 cp FRvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=fr/
aws s3 cp GBvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=gb/
aws s3 cp INvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=in/
aws s3 cp JPvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=jp/
aws s3 cp KRvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=kr/
aws s3 cp MXvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=mx/
aws s3 cp RUvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=ru/
aws s3 cp USvideos.csv s3://your-bucket-name/youtube/raw_statistics/region=us/

Executing these commands will transfer the local files to the specified S3 bucket.

Creation of Data Catalog with AWS Glue

To create the data catalog, we use AWS Glue, a comprehensive data integration service. In the AWS Management Console, navigate to AWS Glue and select "Crawlers" from the menu; a crawler scans your data sources and generates the metadata that defines the data schema. Click on "Add Crawler" and proceed to the next step.

Specify the S3 path where your data resides. Create an IAM role for Glue with sufficient permissions to access the S3 bucket: attach the AmazonS3FullAccess and AWSGlueServiceRole managed policies to it and name the role appropriately. Then select this role under "Choose an existing role."

Next, create a new database to house the metadata or data catalog. Once completed, click on “Create” and proceed.

Finally, execute the crawler to initiate the process of scanning the data and generating metadata for the data schema.
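The same crawler can also be defined and started programmatically. Below is a minimal boto3 sketch; the crawler, role, database, and bucket names are all placeholders to be replaced with the ones you created above.

import boto3

glue = boto3.client("glue")

# Placeholder names: replace with your own crawler, IAM role, database, and bucket
glue.create_crawler(
    Name="youtube-raw-reference-data-crawler",
    Role="your-glue-s3-role",          # the IAM role created for Glue and S3
    DatabaseName="your-raw-database",  # database that will hold the metadata
    Targets={
        "S3Targets": [
            {"Path": "s3://your-bucket-name/youtube/raw_statistics_reference_data/"}
        ]
    },
)

# Run the crawler to scan the data and populate the Data Catalog
glue.start_crawler(Name="youtube-raw-reference-data-crawler")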

Creation of Metadata from Crawler

If, once the crawler has finished, you see a screen resembling the screenshot above, the data has been crawled successfully and metadata has been generated from it. This completes the initial stage of data catalog creation using AWS Glue.

In the screenshot above, you can locate the “Table Data” section. By clicking on this section, you will be redirected to AWS Athena, where you can write SQL queries to analyze your data. This allows you to interact with your data using SQL and perform various data exploration and analysis tasks.

AWS Athena SQL Query Page

After generating the metadata and attempting to run a query in AWS Athena, you may encounter an error similar to the one shown in the screenshot above. This error typically arises because Athena's JSON SerDe expects each JSON record to sit on a single line. Our reference data, however, is pretty-printed JSON in which a single object (with a nested "items" array) spans many lines, which leads to the error.

Resolving this data-cleaning step in detail is covered in a separate article.

To resolve this issue, you need to clean the data by restructuring it so that each record sits on a single line. Once the data has been cleaned and formatted appropriately, you can run queries on it in Athena without encountering errors, as shown in the screenshot below.

Successful Query run after resolving the uncleaned data issue
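Although the cleaning step itself is covered in the separate article mentioned above, here is a minimal sketch of what such a cleaning Lambda might look like. It assumes the AWS SDK for pandas (awswrangler) layer is attached to the function, and the output path, database, and table names are placeholders; your actual implementation may differ.

import json
import urllib.parse

import boto3
import pandas as pd
import awswrangler as wr  # available through the AWS SDK for pandas Lambda layer

s3 = boto3.client("s3")

# Placeholder destination settings
OUTPUT_PATH = "s3://your-cleaned-bucket/youtube/cleaned_statistics_reference_data/"
GLUE_DATABASE = "db_youtube_cleaned"
GLUE_TABLE = "cleaned_statistics_reference_data"


def lambda_handler(event, context):
    # The S3 trigger passes the bucket and key of the object that was created
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Load the pretty-printed reference JSON and flatten its nested 'items'
    # array so that each record becomes a single flat row
    raw = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    df = pd.json_normalize(raw["items"])

    # Write the flattened rows as Parquet and register them in the Data Catalog
    return wr.s3.to_parquet(
        df=df,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode="append",
    )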

Creation of Data Catalog with AWS Glue on cleaned Data

Now that we’ve successfully resolved the uncleaned data issue and converted it into cleaned data stored in a new S3 bucket, it’s time to create the data catalog for the CSV files. Follow these steps in AWS Glue:

  1. Navigate to AWS Glue and search for “crawler.” Click on “Add Crawler.”
  2. Give the crawler a unique name and proceed to the next step.
  3. Provide the S3 bucket path for the CSV files, for example, “s3://your-bucket-name/youtube/raw_statistics/”.
  4. Select the same IAM role that was initially created for Glue and S3, ensuring it has the necessary permissions.
  5. Choose the database for the raw data that was created initially.
  6. Review the settings and click on “Finish.” This will create a table inside the database for the CSV files.

Following these steps creates a data catalog for the CSV files, facilitating easy access and analysis of the raw data stored in Amazon S3. You can use the "View data" option to inspect the data within Glue and subsequently query it in Athena. Because the raw_statistics data is partitioned by region, Athena scans only the partitions you filter on, so you can query the data and retrieve outputs tailored to each specific region, enhancing the granularity and efficiency of your analysis.

Now that we have both the raw data and the newly generated cleaned data, we can perform a joining transformation to combine the information from both tables. This can be achieved using an Athena query. Here’s an example query for joining the two tables:

Joining the two tables
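The screenshot above shows the query being run in the Athena console; the same query can also be submitted from Python with boto3. In the sketch below, the database names, the snippet_title column, and the results bucket are assumptions based on the setup described so far. Note the cast on the id column, which is needed at this point because the crawler registered it as a string.

import boto3

athena = boto3.client("athena")

# Assumed database and column names; adjust to match your own catalog
query = """
SELECT a.title,
       a.category_id,
       b.snippet_title
FROM   db_youtube_raw.raw_statistics a
INNER JOIN db_youtube_cleaned.cleaned_statistics_reference_data b
       ON a.category_id = CAST(b.id AS INT)
WHERE  a.region = 'ca'
LIMIT  10;
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)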

Instead of altering the datatype within the query using the cast function, it is better practice to ensure that all data types are defined appropriately beforehand. The issue arises because the crawler assigned the "id" column the default datatype of string. To address this, we will change the datatype in the crawled table's schema and regenerate the cleaned output; since the previously generated cleaned Parquet file is now stale, it can safely be deleted. Let's fix the issue by following the steps below:

Fixing the issue of Datatype

To rectify the datatype issue, follow these steps:

  1. Change the datatype of the crawled version:
  • Navigate to Database/tables and select “cleaned_statistics_reference_data.”
  • Click on "Edit schema" and modify the datatype of the "id" column from "string" to "bigint", then save the changes (a boto3 equivalent of this schema change is sketched after these steps).

2. Delete the existing cleaned Parquet file (the stale crawled version) from the cleaned-data S3 bucket.

3. Return to the Lambda function and test the function code using the existing test event.

4. This will regenerate the cleaned reference-data Parquet file with the corrected datatype. Confirm this by checking inside the S3 bucket for the generated file.

5. Finally, run the query in Athena and check the output, which should be the same but without the need for the cast function.
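For reference, the schema change from step 1 can also be made programmatically. The sketch below is a hedged boto3 example: the database name is assumed, and it simply rewrites the catalog entry with the id column switched to bigint.

import boto3

glue = boto3.client("glue")

DATABASE = "db_youtube_cleaned"  # assumed database name; replace with yours
TABLE = "cleaned_statistics_reference_data"

# Fetch the current table definition from the Data Catalog
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Change the 'id' column from string to bigint
for column in table["StorageDescriptor"]["Columns"]:
    if column["Name"] == "id":
        column["Type"] = "bigint"

# update_table accepts only TableInput fields, so drop the read-only ones
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in allowed}

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)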

PySpark Job to clean raw data

To transform the raw data CSV files into cleaned Parquet files, we’ll create an ETL (Extract, Transform, Load) job in AWS Glue. Follow these steps:

  1. Go to AWS Glue and search for “ETL Jobs” or “Visual job.” Create a new job.
  2. In the job script editor, replace the generated script with the script below.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Process only the regions we are interested in; the predicate is pushed
# down so Glue reads just these partitions
predicate_pushdown = "region in ('ca','gb','us')"

# Read the raw CSV data from the Glue Data Catalog
# (replace the database name with the one created by your raw-data crawler)
raw_statistics_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="your-raw-database",
    table_name="raw_statistics",
    transformation_ctx="raw_statistics_dyf",
    push_down_predicate=predicate_pushdown,
)

# Write the data out as Snappy-compressed Parquet, partitioned by region,
# into the cleaned-data bucket
cleaned_statistics_dyf = glueContext.write_dynamic_frame.from_options(
    frame=raw_statistics_dyf,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://your-cleaned-bucket/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="cleaned_statistics_dyf",
)

job.commit()
  3. The script reads the raw data from the raw-data catalog, filtered by region through a pushdown predicate (for simplicity, we process only three regions: "us", "ca", and "gb"), and writes it out as Parquet files containing the cleaned data, partitioned by region.
  4. After running the job, create a new crawler to catalog the cleaned data for these regions.
  5. Query the generated data in Athena and run join queries to combine the data from the two different tables.

This process streamlines the transformation of raw CSV data into cleaned Parquet format, enabling efficient querying and analysis in AWS Athena.

Adding Triggers to Lambda Function

What are Triggers?
In AWS Lambda, triggers are events or sources of events that cause a Lambda function to execute. When a trigger occurs, Lambda automatically invokes the associated function. Triggers can come from various AWS services or external sources, and Lambda supports a wide range of triggers to suit different use cases.

To set up a trigger in AWS Lambda that automatically processes new reference-data files as they arrive, follow these steps (a programmatic equivalent is sketched after the list):

  1. In AWS Lambda, click on the “Add trigger” button at the top of the page.
  2. Select “S3” as the source since we need to monitor changes in an S3 bucket.
  3. Choose the bucket into which the reference JSON files are uploaded (the raw data bucket in our setup), for example, "s3://your-raw-data-bucket/youtube/".
  4. Select “All object create events” in event types.
  5. Add the key path used in the test function as a prefix, for example, “youtube/raw_statistics_reference_data/”.
  6. Add “.json” as a suffix since we are monitoring JSON files.
  7. Check the box indicating acknowledgment of using the same S3 bucket for both source and destination.
  8. Click on the “Add” button to add the trigger to the Lambda function.
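The same trigger can also be expressed programmatically, which makes the setup easy to reproduce. The sketch below uses boto3; the bucket name and function ARN are placeholders, and note that the add_permission call is something the Lambda console performs for you automatically when you add the trigger there.

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Placeholder bucket name and Lambda function ARN
bucket = "your-raw-data-bucket"
function_arn = "arn:aws:lambda:ap-south-1:123456789012:function:clean-reference-data"

# Allow S3 to invoke the cleaning function
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Invoke the function for every new .json object under the reference-data prefix
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "youtube/raw_statistics_reference_data/"},
                            {"Name": "suffix", "Value": ".json"},
                        ]
                    }
                },
            }
        ]
    },
)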

Now, to generate a cleaner version of all the JSON files present inside the S3 raw bucket, follow these steps:

  1. Navigate to the raw data bucket “youtube/raw_statistics_reference_data/”.
  2. Delete all JSON files from it.
  3. Use the aws s3 cp command to copy all the files from local to the S3 raw data bucket. Use the following command:
aws s3 cp . s3://your-raw-data-bucket/youtube/raw_statistics_reference_data/ --recursive --exclude "*" --include "*.json"

This command recursively copies all JSON files from the local directory to the specified S3 bucket path.
With this trigger in place, any new or updated JSON file landing in the raw data bucket automatically invokes the Lambda function to generate a cleaned version of the data. This automation reduces the need for manual intervention and streamlines the data processing workflow. After copying the files as instructed, you should be able to confirm the presence of 10 Parquet files (one per regional reference file) in your cleaned data bucket, representing the cleaned version of the data generated automatically by the Lambda function.

Creating Final Table for Data Analysis

To automate the process of joining the two tables and generating a final dataset for analysis, follow these steps:

  1. Go to AWS Glue and search for “ETL Jobs” or “Visual job.” Create a new job.
  2. Select two sources from the Data Catalog: one for the "raw_statistics" table and the other for the "cleaned_statistics_reference_data" table.
ETL Job to join two tables

3. Set up the join operation by specifying the condition raw_statistics.category_id = cleaned_statistics_reference_data.id (a script-mode sketch of this join appears after these steps).

4. This will create a new table containing data from both tables.

5. To store the output table and facilitate analytical tasks, create a new bucket in S3 named “db_youtube_analytics_version.”

6. Configure the ETL job to store the output table in the newly created S3 bucket.

7. To facilitate storing the output in the bucket, we should create a database. An easy method to accomplish this within Athena is by executing the SQL command “CREATE DATABASE db_youtube_analytics.”

8. To ensure smooth execution, navigate to the job details of the ETL job in AWS Glue. In the IAM section, select the IAM role that we previously created. This role should have appropriate permissions to access both S3 and the Glue service. Selecting this IAM role will enable the job to seamlessly access the required resources for its execution.

Configuration of the Target Table (refer to this when creating the new analytics bucket)

9. Run the ETL job to execute the join operation and generate the final dataset for analysis.
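For reference, here is a minimal script-mode sketch of what this visual join job does under the hood. The database names and output path follow the setup above but are assumptions to adjust for your own catalog, and partitioning the result by region is carried over from the cleaning job.

import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read both cleaned tables from the Data Catalog (database names assumed)
raw_statistics = glueContext.create_dynamic_frame.from_catalog(
    database="your-cleaned-database", table_name="raw_statistics"
)
reference_data = glueContext.create_dynamic_frame.from_catalog(
    database="your-cleaned-database", table_name="cleaned_statistics_reference_data"
)

# Join on category_id = id, as configured in the visual job
joined = Join.apply(
    frame1=raw_statistics,
    frame2=reference_data,
    keys1=["category_id"],
    keys2=["id"],
)

# Write the result to the analytics bucket and register it in the Data Catalog
# as db_youtube_analytics.final_output
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://db_youtube_analytics_version/youtube/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["region"],
)
sink.setCatalogInfo(catalogDatabase="db_youtube_analytics", catalogTableName="final_output")
sink.setFormat("glueparquet")
sink.writeFrame(joined)

job.commit()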

By following all the aforementioned steps, we should have successfully generated the final output data table ready for analysis. To confirm this:

  1. Go to Athena and navigate to the “db_youtube_analytics” database.
  2. Under this database, you should find the “final_output” table generated by the ETL job.
  3. You can query this table to access the cleaned data, which is now formatted and ready for analysis.
  4. Utilize various analytical tasks on this data to assist the business in making informed decisions.

This confirms that the entire data processing pipeline, from extraction to transformation and loading, has been successfully executed, resulting in a finalized dataset available for analysis in Athena.
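As a quick example of the kind of analytical query you can now run against the final_output table, the sketch below uses the AWS SDK for pandas (awswrangler); the snippet_title, views, and likes column names are assumptions based on the trending-videos schema used in this project.

import awswrangler as wr

# Aggregate views and likes per category title from the joined table
df = wr.athena.read_sql_query(
    sql="""
        SELECT snippet_title,
               SUM(views) AS total_views,
               SUM(likes) AS total_likes
        FROM   final_output
        GROUP  BY snippet_title
        ORDER  BY total_views DESC
        LIMIT  10
    """,
    database="db_youtube_analytics",
)
print(df)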

Creation of a Data Dashboard

AWS Quicksight is a business intelligence service offered by Amazon Web Services (AWS) that empowers users to create and visualize insights from their data, facilitating informed business decisions. Its key features include:

  1. Data Visualization: Quicksight offers a range of visualization options like charts, graphs, tables, and dashboards, making it simple to represent data visually.
  2. Data Integration: It seamlessly integrates with various AWS data sources such as Amazon RDS, Redshift, Aurora, S3, and more, enabling direct connections to data without complex extraction processes.
  3. Ease of Use: Quicksight provides a user-friendly interface with drag-and-drop functionality, catering to users with diverse technical backgrounds.

Configure the AWS QuickSight Account

To configure AWS QuickSight:

  1. Go to the AWS Management Console and search for “AWS QuickSight.”
  2. Sign up for QuickSight using your email ID.
  3. In the bottom section, search for the “standard edition” and click on “here” to sign up for the account.
  4. Provide an account name and email of your choice.
  5. Under “Allow access and autodiscovery for these resources,” check all the boxes.
  6. Under “S3 access,” check all the buckets present in your S3 location. This enables QuickSight to access the necessary data for visualization.
  7. Click on the “Finish” button to authenticate your AWS IAM account to the QuickSight account.

Following these steps will configure AWS QuickSight and enable you to start creating datasets for data visualization.

Creating the first Dataset in AWS QuickSight

On the homepage of QuickSight, navigate to the “Datasets” section and click on “New dataset” in the top right corner. You can import data from various sources within AWS, but for this task, we’ll use the analytical data stored in Athena.

  1. Give the data source a name, such as "youtube_analytics_dashboard."
  2. Click on “Validate connection” to ensure the connection is successful.
  3. Click on “Create data source.”
  4. Select your “db_youtube_analytics” database and choose the “final_output” table, then click on “Select.”
  5. This will publish your data source. You can edit the schema if needed by selecting “Edit schema.” You can also preview your data to check its structure.
  6. Click on “Save and publish” to publish your dataset.
  7. Go back to the homepage under “Datasets,” and you will see the “final_output” dataset. Select it and click on the three dots, then select “Create analysis.”
  8. You can now explore the data and create various data visualizations according to your requirements, such as pie charts, bar charts, and more.
Data Visualization of the YouTube Analysis

This process enables you to create insightful data visualizations in AWS QuickSight based on the data stored in your Athena database. From this analysis, the "Music" category appears to be the most liked and most viewed of all, suggesting that music is a popular genre that attracts a large audience on YouTube.

This conclusion opens up opportunities for further exploration and data visualization. You can create various charts and graphs to delve deeper into the viewership and engagement trends for different video categories, allowing for better insights and decision-making.

With this, we conclude our article. I hope it has provided you with a comprehensive understanding of AWS services and enabled you to create your data visualizations. Congratulations on your achievement! Keep exploring, coding, and learning. Happy coding!
