Transforming data with Glue

Introduction

This lab will give you an understanding of AWS Glue, a fully managed data catalog and ETL service.

Prerequisites:

  • The DMS Lab is a prerequisite for this lab.

Steps

  • Getting Started
  • Part 1: Data Validation and ETL
    • Create Glue Crawler for initial full load data
    • Data ETL Exercise
    • Create Glue Crawler for Parquet Files
  • Part 2: Glue Job Bookmark (Optional)
    • Create Glue Crawler for ongoing replication (CDC Data)
    • Create a Glue Job with Bookmark Enabled
    • Create Glue crawler for Parquet data in S3
    • Generate CDC data and observe bookmark functionality
  • Part 3: Glue Workflows (Optional)
    • Overview
    • Creating and Running Workflows

Getting Started

Navigate to the AWS Glue service.

Part 1: Data Validation and ETL

Create Glue Crawler for initial full load data

  • On the AWS Glue menu, select Crawlers.
  • Click Add crawler.
  • Enter a crawler name for the initial data load. This name should be descriptive and easily recognized (e.g., glue-lab-crawler).
  • Optionally, enter the description.
  • Choose Data stores as the Crawler source type and click Next.
  • On the Add a data store page, make the following selections:
    • For Choose a data store, click the drop-down box and select S3.
    • For Crawl data in, select Specified path in my account.
    • For Include path, browse to the target folder for your DMS initial export, e.g., “s3://dmslab-student-dmslabs3bucket-random/tickets”
  • Click Next.
  • On the Add another data store page, select No and click Next.
  • On the Choose an IAM role page, make the following selections:
    • Select Choose an existing IAM role.
    • For IAM role, select the GlueLabRole created by the AWS CloudFormation template during the student lab, for example dmslab-student-GlueLabRole-ZOQDII7JTBUM.
  • Click Next.
  • On the Create a schedule for this crawler page, for Frequency, select Run on demand and Click Next.
  • On the Configure the crawler’s output page, click Add database to create a new database for our Glue Catalog.
  • Enter a Catalog database name of your choice, for example ticketdata, and click Create.
  • For Prefix added to tables (optional), leave the field empty.
  • For Configuration options (optional), select Add new columns only and keep the remaining default configuration options.
  • Click Next.
  • Review the summary page, noting the Include path and Database output, and click Finish. The crawler is now ready to run.
  • Click Run it now.

The crawler will change status from Starting to Stopping. Wait until the crawler returns to the Ready state; you should see that it has created 15 tables.

In the AWS Glue navigation pane, click Databases > Tables.
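
These console steps can also be scripted. The following is a minimal boto3 sketch, assuming the role, bucket path, and database name used above (substitute your own values); the same pattern applies to the Parquet crawler you will create later in this lab.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler equivalent to the one configured in the console.
# The role name and S3 path below are placeholders from this lab.
glue.create_crawler(
    Name="glue-lab-crawler",
    Role="dmslab-student-GlueLabRole-XXXXXXXXXXXX",  # your GlueLabRole
    DatabaseName="ticketdata",
    Targets={"S3Targets": [{"Path": "s3://<your-dms-bucket>/tickets"}]},
)

# Run it on demand, as in the console.
glue.start_crawler(Name="glue-lab-crawler")
```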

Data Validation Exercise

Within the Tables section of your ticketdata database, click the person table.

You may have noticed that some tables (such as person) have column headers such as col0, col1, col2, col3. In the absence of headers, or when the crawler cannot determine the header type, default column names are assigned.

This exercise uses the person table in an example of how to resolve this issue.

  • Click Edit Schema on the top right side.
  • In the Edit Schema section, double-click col0 (column name) to open edit mode. Type “id” as the column name.
  • Repeat the preceding step to change the remaining column names to match those shown in the following figure.
  • Click Save.
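
The same rename can be done programmatically with boto3. The sketch below is a minimal example, assuming the ticketdata database from this lab: it fetches the table definition, renames col0 to id (repeat the pattern for the remaining columns shown in the figure), and writes the definition back.

```python
import boto3

glue = boto3.client("glue")

# Fetch the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName="ticketdata", Name="person")["Table"]

# update_table accepts only a subset of the fields get_table returns,
# so keep just the writable ones.
table_input = {k: v for k, v in table.items()
               if k in ("Name", "Description", "Owner", "Retention",
                        "StorageDescriptor", "PartitionKeys",
                        "TableType", "Parameters")}

# Rename col0 to id; repeat for col1, col2, ... per the figure.
for column in table_input["StorageDescriptor"]["Columns"]:
    if column["Name"] == "col0":
        column["Name"] = "id"

glue.update_table(DatabaseName="ticketdata", TableInput=table_input)
```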

Data ETL Exercise

  • In the left navigation pane, under ETL, click Jobs, and then click Add job.
  • On the Job properties page, make the following selections:
    • Name, type Glue-Lab-SportTeamParquet
    • IAM role, choose the existing role, e.g., dmslab-student-GlueLabRole-ZOQDII7JTBUM.
    • Type, Select Spark
    • Glue Version, select Spark 2.4, Python 3 (Glue version 1.0) or whichever is the latest version.
    • This job runs, select A proposed script generated by AWS Glue.
    • Script file name, type Glue-Lab-SportTeamParquet.
    • S3 path where the script is stored, provide a unique Amazon S3 path to store the scripts. You can keep the default for this lab.
    • Temporary directory, provide a unique Amazon S3 directory for a temporary directory. You can keep the default for this lab.
  • Click Next
  • On the Choose your data sources page, select sport_team and Click Next.
  • On the Choose a transformation type page, select Change schema.
  • On the Choose your data targets page, select Create tables in your data target.
  • Data store, select Amazon S3.
  • Format, select Parquet.
  • Target path, choose the folder location that you created at the beginning of this section to store the results, e.g., “s3://dmslab-student-dmslabs3bucket-woti4bf73cw3/tickets/dms_parquet/sport_team”.
  • Click Next.
  • Click the target Data type for the id column to edit its schema mapping. In the String type pop-up window, select DOUBLE from the Column type drop-down and click Update.
  • Click Save job and edit script.
  • View the job script. (This screen lets you customize the script as required; a sketch of what a generated script looks like follows these steps.)
  • Click Save and then Run Job.
  • In the Parameters section
    • You can leave Job bookmark as Disable.
    • You can leave the Job metrics option unchecked.
    • With job metrics enabled, you can collect metrics about AWS Glue jobs and visualize them in the AWS Glue console.

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run.

  • Click Run Job

You will see that the job is now running and the Run job button is disabled. Click the X in the top-right corner to close the window and return to the ETL jobs list.

  • Click your job to view history and verify that it ran successfully.
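
For reference, the proposed script that AWS Glue generates for a job like Glue-Lab-SportTeamParquet looks roughly like the sketch below. The database, table, and target path reflect the choices made above (the bucket name is a placeholder), and the exact mappings list will match your table’s columns. Note the transformation_ctx strings: they are the keys under which job bookmarks persist state between runs.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="ticketdata",
    table_name="sport_team",
    transformation_ctx="datasource0",
)

# Cast id from string to double, as configured in the console;
# the other columns (elided here) pass through unchanged.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("id", "string", "id", "double")],
    transformation_ctx="applymapping1",
)

# Write the result to S3 as Parquet.
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://<your-dms-bucket>/tickets/dms_parquet/sport_team"},
    format="parquet",
    transformation_ctx="datasink2",
)

job.commit()
```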

Repeat the preceding steps to create new ETL jobs that transform the additional tables. You will use the transformed data in the Athena lab that follows. You can continue creating the ETL jobs below without waiting for the previous job to finish.

To enable us to join data, we will also update the target data types in the schema.

Table 1 below indicates whether each job needs a schema change. Where Need Schema Change? is “Yes”, refer to Table 2 to find the columns whose source and target data types must be changed during ETL job creation.

Table 1:

Job Name & Script Filename            Source Table            S3 Target Path                      Need Schema Change?
Glue-Lab-SportLocationParquet         sport_location          dms_parquet/sport_location          No
Glue-Lab-SportingEventParquet         sporting_event          dms_parquet/sporting_event          Yes
Glue-Lab-SportingEventTicketParquet   sporting_event_ticket   dms_parquet/sporting_event_ticket   Yes
Glue-Lab-PersonParquet                person                  dms_parquet/person                  Yes

TABLE 2:

Job Name                              Table                   Column              Source Data Type   Target Data Type
Glue-Lab-SportingEventParquet         sporting_event          start_date_time     STRING             TIMESTAMP
Glue-Lab-SportingEventParquet         sporting_event          start_date          STRING             DATE
Glue-Lab-SportingEventTicketParquet   sporting_event_ticket   id                  STRING             DOUBLE
Glue-Lab-SportingEventTicketParquet   sporting_event_ticket   sporting_event_id   STRING             DOUBLE
Glue-Lab-SportingEventTicketParquet   sporting_event_ticket   ticketholder_id     STRING             DOUBLE
Glue-Lab-PersonParquet                person                  id                  STRING             DOUBLE
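
Expressed as ApplyMapping tuples of (source column, source type, target column, target type), the casts in Table 2 look like the following; each list belongs in the mappings argument of the corresponding generated job script, alongside the pass-through mappings for the remaining columns.

```python
# Glue-Lab-SportingEventParquet (sporting_event)
[("start_date_time", "string", "start_date_time", "timestamp"),
 ("start_date", "string", "start_date", "date")]

# Glue-Lab-SportingEventTicketParquet (sporting_event_ticket)
[("id", "string", "id", "double"),
 ("sporting_event_id", "string", "sporting_event_id", "double"),
 ("ticketholder_id", "string", "ticketholder_id", "double")]

# Glue-Lab-PersonParquet (person)
[("id", "string", "id", "double")]
```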

Once these jobs have completed, we can create a crawler to index these Parquet files.

Create Glue Crawler for Parquet Files

  • In the AWS Glue navigation menu, click Crawlers, and then click Add crawler.
  • For Crawler name, type glue-lab-parquet-crawler and click Next.
  • On the next screen, Specify crawler source type, select Data stores as the Crawler source type and click Next.
  • In Add a data store screen
    • For Choose a data store, select S3.
    • For Crawl data in, select Specified path in my account.
    • For Include path, specify the S3 path (the parent Parquet folder) that contains the nested Parquet files, e.g., s3://dmslab-student-dmslabs3bucket-woti4bf73cw3/tickets/dms_parquet
    • Click Next.
  • For Add another data store, select No and Click Next.
  • On the Choose an IAM role page
    • select Choose an existing IAM role.
    • For IAM role, select the existing role dmslab-student-GlueLabRole-ZOQDIIXXXXXXXX
    • Click Next.
  • For Frequency, select Run on demand and click Next.
  • For the crawler’s output database, choose the existing database that you created earlier, e.g., ticketdata.
  • For the Prefix added to tables (optional), type parquet_
  • Review the summary page and click Finish.
  • On the notification bar, click Run it now.

Once your crawler has finished running, it should report that 5 tables were added.

Confirm you can see the tables:

  • In the left navigation pane, click Tables.
  • Add the filter “parquet” to return the newly created tables.
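
If you prefer to verify from code rather than the console, a quick boto3 check (assuming the ticketdata database and parquet_ prefix used above) might look like this:

```python
import boto3

glue = boto3.client("glue")

# List catalog tables in the lab database and keep those with the parquet_ prefix.
response = glue.get_tables(DatabaseName="ticketdata")
parquet_tables = [t["Name"] for t in response["TableList"]
                  if t["Name"].startswith("parquet_")]

print(parquet_tables)  # expect the 5 tables added by the crawler
```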

Congratulations! You have successfully completed this lab.