AWS Glue: a simple, easy-to-use serverless ETL service
As businesses grow, their different systems collect huge amounts of data. With the emergence of GDPR, CCPA, and many other privacy regulations, the way data is collected has changed considerably: data is increasingly collected anonymously, without being tied to an individual. Businesses collect data to understand their customers and to recommend and personalize the customer experience, so the collected data has to be put to use. However, it arrives in a raw format, with different schemas from different systems. It first has to be made available in a form the business can actually use.
What is ETL? Extract, Transform and Load (ETL) is not new terminology in computer science. It predates big data and has been part of data warehousing for decades. Many tools and vendors in the market help with:
- Setting up the data pipeline to move data from a source to a target (this target then becomes the source for the ETL pipeline)
- Setting up the ETL pipeline and moving the data to data warehouse (DWH) systems
- Using the data in the DWH to deliver business intelligence, dashboards, APIs, and so on
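To make the three ETL stages concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any particular vendor tool works: the sample CSV data, the `sales` table, and the function names are all assumptions made up for this example, and an in-memory SQLite database stands in for a real DWH.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: inconsistent casing, stray spaces, one bad value.
RAW_CSV = """customer_id,country,amount
1, us ,10.50
2,DE,not-a-number
3,in,7
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize fields and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows with an unparseable amount
        country = row["country"].strip().upper()
        clean.append((int(row["customer_id"]), country, amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract, transform, load, then query the target like a BI tool would."""
    load(transform(extract(text)), conn)
    query = (
        "SELECT country, SUM(amount) FROM sales "
        "GROUP BY country ORDER BY country"
    )
    return conn.execute(query).fetchall()
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` loads the two valid rows and returns the per-country totals; the row with the unparseable amount is dropped during the transform step.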
What are some of the challenges faced in ETL? Businesses want to stay close to the customer and deliver a unique customer experience faster than their competitors. Big data and analytics help businesses understand their customers and engage them digitally through their unique offerings. As data volumes grew to petabyte (PB) scale, ETL jobs needed more hands: more trained ETL developers, DWH specialists, and so on. A lot of custom code also has to be written to keep up with ever-changing data. Meanwhile, the cloud was moving from IaaS to serverless offerings that address specific problems. These solutions let businesses scale faster and offer easy-to-use ETL workflows, so teams spend less time managing servers, monitoring applications, sizing workloads, and taking care of security and governance.
What are the options with the major cloud vendors? All three major cloud vendors provide: 1) serverless options for ETL, 2) no-code to low-code development, 3) data cleanup and transformation, 4) pay-as-you-use pricing, 5) an SLA of more than 99.9 percent, 6) the most common governance, compliance, and security features, 7) a gentle learning curve, 8) integration with data lakes within their ecosystem, 9) connections to DWH systems, 10) integration with third-party vendors through connectors.
- Azure Data Factory (ADF) — Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can also lift and shift existing SSIS packages to Azure and run them with full compatibility in ADF. SSIS Integration Runtime offers a fully managed service, so you don’t have to worry about infrastructure management. Read more at — https://docs.microsoft.com/en-us/azure/data-factory/
- AWS Glue — AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months. Read more at — https://aws.amazon.com/glue/
- Google Cloud Dataflow — Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. Read more at — https://cloud.google.com/dataflow
Let us look at AWS Glue in detail.
AWS Glue is a convenient tool for simple to complex ETL pipelines and transformations. Glue has three main components. Let us look at them one by one.
1) Data Catalog: The Data Catalog holds the information needed to perform the ETL transformation. It has the following components:

2) ETL: This holds the information needed to create the ETL job. It has the following components:

3) Security: A set of properties used to encrypt data at rest.
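Jobs defined under the ETL component can also be started from code rather than from the console. The following is a minimal sketch, assuming the boto3 Glue client and its `start_job_run` and `get_job_run` calls. The helper `run_glue_job`, its polling parameters, and the job name in the usage comment are hypothetical names chosen for this example, not part of Glue itself.

```python
import time

# States in which a Glue job run is finished and will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_glue_job(glue, job_name, poll_seconds=30, max_polls=120):
    """Start a Glue job and poll until the run reaches a terminal state.

    `glue` is assumed to behave like boto3.client("glue"); it is passed in
    so the helper can be exercised without AWS credentials.
    Returns a (run_id, final_state) tuple.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        time.sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Job {job_name} run {run_id} did not finish in time")


# Usage against a real account (assumes boto3 is installed, credentials are
# configured, and a job with the hypothetical name "my-etl-job" exists):
#
#   import boto3
#   run_id, state = run_glue_job(boto3.client("glue"), "my-etl-job")
```

Passing the client in as a parameter is a design choice for this sketch: it keeps the polling logic separate from AWS configuration, so it can be tried out with a stand-in object first.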

Glue is easy to use and works well as an ETL tool in the majority of cases, but the product has some limitations, concerns, and known issues. Go through the detailed documentation before choosing Glue.
Moving to the cloud and using cloud offerings gives businesses a competitive edge. This is just the beginning of a new era in development strategies and methodologies.
Share your thoughts in the comments!
Happy learning!
Originally published at http://shankarkumarasamy.blog on February 24, 2021.