AWS Glue is a fully managed ETL (Extract, Transform, Load) service that helps prepare and transfer data for analytics.
Key takeaways:
AWS Glue Crawler automates metadata extraction. It scans data sources, infers schema, and organizes metadata in the AWS Glue Data Catalog.
AWS Glue Crawler supports various data stores. It works with multiple data stores, such as Amazon S3, Amazon DynamoDB, MongoDB, and Delta Lake tables.
Proper IAM roles are required for access. The crawler needs IAM role permissions to access and process data within AWS services.
AWS Glue Crawler enables efficient querying and data analysis. By storing metadata in a structured format, it simplifies querying, access control, and data transformation.
AWS Glue Crawler detects changes in data structure. When run again, it identifies and updates any changes in schema or partitions.
Amazon Web Services (AWS) offers a powerful ETL (Extract, Transform, Load) tool called AWS Glue, designed to streamline the process of preparing and loading data into various AWS services. Whether you’re managing data lakes, performing analytics, or building machine learning pipelines, AWS Glue simplifies data integration by automating key tasks. One of its standout features is the AWS Glue Crawler, which discovers and organizes metadata about your data, making it easier to query, analyze, and manage.
In this Answer, we’ll explore what AWS Glue is, dive deep into how its crawler works with an S3 bucket, and walk through a practical example using a dataset of Netflix movies and TV shows. By the end, you’ll understand how to leverage this tool to unlock the full potential of your data in AWS.
AWS Glue is a fully managed ETL service that integrates seamlessly with other AWS offerings like Amazon S3, Redshift, and Athena. It handles three core functions:
Extract: Pulls data from various sources (e.g., S3, DynamoDB, MongoDB).
Transform: Cleans, enriches, or restructures data for downstream use.
Load: Deposits the processed data into a target AWS service.
Beyond ETL, AWS Glue catalogs your data by collecting and storing metadata, that is, information about the data such as its structure, data types, partitions, and schema. This metadata is stored in the AWS Glue Data Catalog, a centralized repository that acts as a metadata hub, enabling tools like Amazon Athena to query data efficiently.
The AWS Glue Crawler is a key component that automates metadata discovery. It scans your data sources, infers their structure, and populates the Data Catalog with organized tables. This eliminates the need to manually define schemas, saving time and reducing errors.
Scanning: The crawler explores data in sources like S3 buckets, Delta Lake tables, or DynamoDB tables. It navigates folder structures, identifies files, and reads their contents without altering them. For example, it can scan s3://my-bucket/movies/ to find partitioned CSVs.
Inference: It analyzes files to determine their format (e.g., CSV, JSON), partitions (e.g., year=2006), and column data types (e.g., title: string). By sampling data, it builds a schema automatically, adapting to variations like missing headers.
Storage: The crawler saves its findings as tables in the AWS Glue Data Catalog, detailing schema and locations. It creates new tables or updates existing ones, so that data under a location like s3://my-bucket/movies/ is ready to query.
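The three steps above can be illustrated with a small, self-contained Python sketch. It is not the crawler's actual algorithm (AWS Glue relies on its own classifiers); it only mimics the idea of reading partition values from a path and sampling rows to guess column types. The function names, the type labels, the part-0.csv file name, and the col0-style fallback name for an unnamed column are assumptions made for this illustration.

```python
import csv
import io
import re


def parse_partitions(object_key):
    """Extract Hive-style partition pairs (such as year=2006) from an object key."""
    return dict(re.findall(r"([^/=]+)=([^/]+)/", object_key))


def infer_type(values):
    """Return the narrowest of bigint, double, or string that fits every value."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "string"
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for value in non_empty:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, sample_rows=100):
    """Guess a column-name -> type mapping from the first rows of a CSV file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for _, row in zip(range(sample_rows), reader)]
    schema = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows if index < len(row)]
        # An empty header cell gets a positional placeholder name (assumption).
        schema[name or f"col{index}"] = infer_type(column)
    return schema


# A shortened version of the dataset used later in this Answer.
sample = (
    ",show_id,title,release_year\n"
    "307,s5618,Happy New Year,2006\n"
    "306,s5617,Dilwale,2006\n"
)
print(parse_partitions("movies/year=2006/part-0.csv"))
print(infer_schema(sample))
```

Sampling only the first rows keeps the sketch cheap, at the cost of missing type variations that appear later in a file.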
Let’s walk through a hands-on example of setting up an AWS Glue Crawler to catalog metadata from an S3 bucket. Our dataset consists of CSV files containing Netflix movies and TV shows, partitioned by release year.
The sample below shows the contents of one of these CSV files.
,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
307,s5618,Movie,Happy New Year,Farah Khan,India,2/1/2017,2006,TV-14,179 min,"Action & Adventure, Comedies, Dramas"
306,s5617,Movie,Dilwale,Rohit Shetty,India,2/1/2017,2006,TV-PG,154 min,"Action & Adventure, Dramas, International Movies"
305,s5616,Movie,Imperial Dreams,Malik Vitthal,United States,2/3/2017,2006,TV-MA,86 min,Dramas
304,s5615,Movie,Daniel Sosa: Sosafado,"Raúl Campos, Jan Suter",Mexico,2/3/2017,2006,TV-MA,78 min,Stand-Up Comedy
303,s5614,Movie,"Michael Bolton's Big, Sexy Valentine's Day Special","Scott Aukerman, Akiva Schaffer",United States,2/7/2017,2006,TV-MA,54 min,"Comedies, Music & Musicals, Romantic Movies"
302,s5613,Movie,Hitler - A Career,"Joachim Fest, Christian Herrendoerfer",West Germany,2/10/2017,2006,TV-MA,150 min,"Documentaries, International Movies"
301,s5611,Movie,David Brent: Life on the Road,Ricky Gervais,United Kingdom,2/10/2017,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
300,s5608,Movie,Katherine Ryan: In Trouble,Colin Dench,United Kingdom,2/14/2017,2006,TV-MA,64 min,Stand-Up Comedy
299,s5607,Movie,Girlfriend's Day,Michael Paul Stephenson,United States,2/14/2017,2006,TV-MA,71 min,"Comedies, Independent Movies"
298,s5606,Movie,The Memory of Water,Matías Bize,Chile,2/15/2017,2006,TV-MA,88 min,"Dramas, International Movies"
297,s5605,Movie,The Fury of a Patient Man,Raúl Arévalo,Spain,2/15/2017,2006,TV-MA,92 min,"International Movies, Thrillers"
296,s5603,Movie,Rush: Beyond the Lighted Stage,"Sam Dunn, Scot McFadyen",Canada,2/15/2017,2006,TV-MA,107 min,"Documentaries, Music & Musicals"
295,s5601,Movie,A Heavy Heart,Thomas Stuber,Germany,2/15/2017,2006,TV-MA,109 min,"Dramas, Independent Movies, International Movies"
294,s5600,Movie,Rocky Handsome,Nishikant Kamat,India,2/17/2017,2006,TV-MA,119 min,"Action & Adventure, International Movies"
293,s5598,Movie,Tini: The New Life of Violetta,Juan Pablo Buscarini,Spain,2/19/2017,2006,G,99 min,"Children & Family Movies, Music & Musicals"
292,s5597,Movie,Growing Up Wild,Keith Scholey,United States,2/19/2017,2006,G,78 min,"Children & Family Movies, Documentaries"
291,s5596,Movie,Boy Missing,Mar Targarona,Spain,2/19/2017,2006,TV-MA,105 min,"International Movies, Thrillers"
290,s5595,Movie,Trevor Noah: Afraid of the Dark,David Paul Meyer,United States,2/21/2017,2006,TV-19,67 min,Stand-Up Comedy
289,s5591,Movie,I Don't Feel at Home in This World Anymore,Macon Blair,United States,2/24/2017,2006,TV-MA,97 min,"Dramas, Independent Movies, Thrillers"
288,s5590,Movie,Operações Especiais,Tomas Portella,Brazil,2/25/2017,2006,TV-MA,99 min,"Action & Adventure, International Movies"
287,s5589,Movie,Jonas,Lô Politi,Brazil,2/26/2017,2006,TV-MA,97 min,"Dramas, International Movies"
286,s5588,Movie,Force 7,Abhinay Deo,India,2/27/2017,2006,TV-19,123 min,"Action & Adventure, International Movies"
285,s5587,Movie,Mike Birbiglia: Thank God for Jokes,"Seth Barrish, Mike Birbiglia",United States,2/28/2017,2006,TV-MA,71 min,Stand-Up Comedy
284,s5585,Movie,Nila,Selvamani Selvaraj,India,3/1/2017,2006,TV-MA,94 min,"Dramas, International Movies, Romantic Movies"
283,s5583,Movie,Amy Schumer: The Leather Special,Amy Schumer,United States,3/7/2017,2006,TV-MA,57 min,Stand-Up Comedy
282,s5582,Movie,The Butterfly's Dream,Yılmaz Erdoğan,Turkey,3/10/2017,2006,TV-PG,118 min,"Dramas, International Movies, Romantic Movies"
281,s5827,Movie,Real Crime: Diamond Geezers,Tom Whitter,United Kingdom,8/1/2016,2006,TV-18,46 min,Documentaries
280,s5825,Movie,Interview with a Serial Killer,Christopher Martin,United States,8/1/2016,2006,TV-MA,45 min,Documentaries
279,s5822,Movie,Children of God,John Smithson,United Kingdom,8/1/2016,2006,TV-MA,63 min,Documentaries
278,s5821,Movie,Lavell Crawford: Can a Brother Get Some Love?,Michael Drumm,United States,8/2/2016,2006,TV-MA,81 min,Stand-Up Comedy
277,s5820,Movie,David Cross: Making America Great Again!,Alex Coletti,United States,8/5/2016,2006,TV-MA,73 min,Stand-Up Comedy
276,s5819,Movie,Jim Gaffigan: Obsessed,Jay Chapman,United States,8/11/2016,2006,TV-14,70 min,Stand-Up Comedy
275,s5818,Movie,Jim Gaffigan: Mr. Universe,Jay Karas,United States,8/11/2016,2006,TV-14,77 min,Stand-Up Comedy
274,s5817,Movie,Jim Gaffigan: King Baby,Troy Miller,United States,8/11/2016,2006,TV-PG,71 min,Stand-Up Comedy
273,s5816,Movie,Jim Gaffigan: Beyond the Pale,Michael Drumm,United States,8/11/2016,2006,TV-18,72 min,Stand-Up Comedy
272,s5811,Movie,I'll Sleep When I'm Dead,Justin Krook,United States,8/19/2016,2006,TV-MA,80 min,"Documentaries, Music & Musicals"
271,s5810,Movie,XOXO,Christopher Louie,United States,8/26/2016,2006,TV-MA,92 min,"Dramas, Music & Musicals"
270,s5809,Movie,Jeff Foxworthy and Larry the Cable Guy: We’ve Been Thinking...,Jay Karas,United States,8/26/2016,2006,TV-18,75 min,Stand-Up Comedy
269,s5798,Movie,Extremis,Dan Krauss,United States,9/13/2016,2006,TV-PG,25 min,Documentaries
268,s5797,Movie,Sample This,Dan Forrer,United States,9/15/2016,2006,TV-18,83 min,"Documentaries, Music & Musicals"
267,s5796,Movie,The White Helmets,Orlando von Einsiedel,United Kingdom,9/16/2016,2006,TV-PG,41 min,Documentaries
266,s5794,Movie,Cedric the Entertainer: Live from the Ville,Troy Miller,United States,9/16/2016,2006,TV-MA,60 min,Stand-Up Comedy
265,s5793,Movie,ARQ,Tony Elliott,Canada,9/16/2016,2006,TV-MA,89 min,"International Movies, Sci-Fi & Fantasy, Thrillers"
264,s5785,Movie,Iliza Shlesinger: Confirmed Kills,Bobcat Goldthwait,United States,9/23/2016,2006,TV-MA,78 min,Stand-Up Comedy
263,s5784,Movie,Audrie & Daisy,"Bonni Cohen, Jon Shenk",United States,9/23/2016,2006,TV-MA,99 min,Documentaries
262,s5783,Movie,Amanda Knox,"Rod Blackhurst, Brian McGinn",Denmark,9/30/2016,2006,TV-MA,92 min,Documentaries
261,s5781,Movie,Welcome Mr. President,Riccardo Milani,Italy,10/1/2016,2006,TV-MA,99 min,"Comedies, International Movies"
260,s5780,Movie,Unchained: The Untold Story of Freestyle Motocross,"Paul Taublieb, Jon Freeman",United States,10/1/2016,2006,TV-MA,92 min,"Documentaries, Sports Movies"
259,s5779,Movie,Umrika,Prashant Nair,India,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
258,s5777,Movie,Riphagen - The Untouchable,Pieter Kuijpers,Netherlands,10/1/2016,2006,TV-18,132 min,"Dramas, International Movies"
257,s5775,TV Show,Old Money,David Schalko,United States,10/1/2016,2006,TV-MA,5 Season,"International TV Shows, TV Comedies, TV Dramas"
256,s5774,Movie,My Little Pony Equestria Girls: Legend of Everfree,Ishi Rudell,United States,10/1/2016,2006,TV-Y11,73 min,"Children & Family Movies, Comedies"
255,s5773,Movie,My Big Night,Álex de la Iglesia,Spain,10/1/2016,2006,TV-MA,97 min,"Comedies, International Movies, Music & Musicals"
254,s5771,Movie,Much Ado About Nothing,Alejandro Fernández Almendras,Chile,10/1/2016,2006,TV-MA,96 min,"Dramas, Independent Movies, International Movies"
253,s5768,Movie,Harud,Aamir Bashir,India,10/1/2016,2006,TV-MA,100 min,"Dramas, International Movies"
252,s5766,Movie,Chatô: The King of Brazil,Guilherme Fontes,Brazil,10/1/2016,2006,TV-MA,105 min,"Dramas, International Movies"
251,s5765,Movie,Bombshell,Riccardo Pilizzeri,New Zealand,10/1/2016,2006,TV-MA,86 min,Dramas
250,s5762,Movie,LEGO Jurassic World: The Indominus Escape,Michael D. Black,United States,10/4/2016,2006,TV-Y11,25 min,"Children & Family Movies, Comedies"
249,s5761,Movie,The Siege of Jadotville,Richie Smyth,Ireland,10/7/2016,2006,TV-MA,108 min,"Action & Adventure, Dramas, International Movies"
248,s5757,Movie,Justin Timberlake + the Tennessee Kids,Jonathan Demme,United States,10/12/2016,2006,TV-MA,90 min,Music & Musicals
247,s5756,Movie,Mascots,Christopher Guest,United States,10/13/2016,2006,TV-MA,95 min,Comedies
246,s5755,Movie,Sky Ladder: The Art of Cai Guo-Qiang,Kevin MacDonald,United States,10/14/2016,2006,TV-MA,80 min,Documentaries
245,s5751,Movie,Blind Date,Clovis Cornillac,France,10/15/2016,2006,TV-14,91 min,"Comedies, International Movies, Music & Musicals"
244,s5750,Movie,Bleach the Movie: Hell Verse,Noriyuki Abe,Japan,10/15/2016,2006,TV-14,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
243,s5749,Movie,Bleach The Movie: Fade to Black,Noriyuki Abe,Japan,10/15/2016,2006,TV-PG,94 min,"Action & Adventure, Anime Features, Sci-Fi & Fantasy"
242,s5748,Movie,Berserk: The Golden Age Arc I - The Egg of the King,Toshiyuki Kubooka,Japan,10/15/2016,2006,TV-MA,77 min,"Action & Adventure, Anime Features, International Movies"
241,s5747,Movie,A Mighty Team,Thomas Sorriaux,France,10/15/2016,2006,TV-MA,97 min,"Comedies, International Movies, Sports Movies"
240,s5746,Movie,Joe Rogan: Triggered,Anthony Giordano,United States,10/21/2016,2006,TV-MA,64 min,Stand-Up Comedy
239,s5744,Movie,11 años,Roger Gual,Spain,10/27/2016,2006,TV-MA,77 min,"Dramas, International Movies"
238,s5743,Movie,West Coast,Benjamin Weill,France,10/28/2016,2006,TV-MA,81 min,"Comedies, Dramas, International Movies"
237,s5741,Movie,They Are Everywhere,Yvan Attal,France,10/28/2016,2006,TV-MA,110 min,"Comedies, International Movies"
236,s5740,Movie,The African Doctor,Julien Rambaldi,France,10/28/2016,2006,TV-18,94 min,"Comedies, Dramas, International Movies"
235,s5739,Movie,Into the Inferno,Werner Herzog,United Kingdom,10/28/2016,2006,TV-PG,107 min,Documentaries
234,s5738,Movie,I Am the Pretty Thing That Lives in the House,Osgood Perkins,Canada,10/28/2016,2006,TV-18,89 min,"Horror Movies, International Movies, Thrillers"
233,s5737,Movie,Pup Star,Robert Vince,Canada,10/29/2016,2006,G,92 min,"Children & Family Movies, Comedies"
232,s5736,Movie,Spanish Affair 6,Emilio Martínez Lázaro,Spain,11/1/2016,2006,TV-MA,107 min,"Comedies, International Movies, Romantic Movies"
231,s5734,Movie,Norman Lear: Just Another Version of You,"Heidi Ewing, Rachel Grady",United States,11/1/2016,2006,TV-MA,91 min,Documentaries
230,s5727,Movie,A Grand Night In: The Story of Aardman,Richard Mears,United Kingdom,11/1/2016,2006,TV-PG,59 min,Documentaries
229,s5726,Movie,The Ivory Game,"Kief Davidson, Richard Ladkani",Austria,11/4/2016,2006,TV-18,112 min,Documentaries
228,s5725,Movie,"Dana Carvey: Straight White Male, 64",Marcus Raboy,United States,11/4/2016,2006,TV-MA,64 min,Stand-Up Comedy
227,s5722,Movie,Kathleen Madigan: Bothering Jesus,Lorene Machado,United States,11/10/2016,2006,TV-MA,71 min,Stand-Up Comedy
226,s5735,Movie,Santa Pac's Merry Berry Day,Moto Sakakibara,Not Given,11/1/2016,2006,TV-Y,44 min,Movies
225,s5720,Movie,True Memoirs of an International Assassin,Jeff Wadlow,United States,11/11/2016,2006,TV-18,98 min,"Action & Adventure, Comedies"
224,s5719,Movie,Mumbai Cha Raja,Manjeet Singh,India,11/15/2016,2006,TV-MA,77 min,"Dramas, Independent Movies, International Movies"
223,s5715,Movie,Divines,Houda Benyamina,France,11/18/2016,2006,TV-MA,107 min,"Dramas, Independent Movies, International Movies"
222,s5714,Movie,Colin Quinn: The New York Story,Jerry Seinfeld,United States,11/18/2016,2006,TV-MA,62 min,Stand-Up Comedy
The following steps show how we use the crawler on our dataset.
We first create an S3 bucket with a folder to which we upload our dataset. We can do this using two commands. The first command creates an S3 bucket called educative-3213, while the second command creates a Movies folder within educative-3213.
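The original commands are not reproduced in this text, so the sketch below shows one plausible AWS CLI form of the two steps. It only builds and prints the command lines, so it can be read or run without AWS credentials. The choice of aws s3 mb and aws s3api put-object is an assumption; the helper name make_bucket_commands is hypothetical.

```python
import shlex

BUCKET = "educative-3213"  # name used in this Answer; replace with a globally unique one
FOLDER = "Movies"


def make_bucket_commands(bucket, folder):
    """Build (but do not run) AWS CLI commands for a bucket and an empty folder."""
    return [
        # Create the bucket.
        ["aws", "s3", "mb", f"s3://{bucket}"],
        # S3 has no real folders; a zero-byte key ending in "/" shows up as one.
        ["aws", "s3api", "put-object", "--bucket", bucket, "--key", f"{folder}/"],
    ]


for command in make_bucket_commands(BUCKET, FOLDER):
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to pass to subprocess.run once credentials are configured.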
Next, we upload our dataset to the Movies folder in the S3 bucket with a single command. The recursive flag makes the command apply to all files and folders within a directory, which, in our case, are all the files and folders inside our local movies folder.
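As a hedged illustration, the sketch below builds one plausible upload command (aws s3 cp with --recursive is an assumption, since the original command is not shown) and lists the object keys a recursive copy would produce for a local directory. The helper names and the year=2006/movies.csv layout in the demo are hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def upload_command(local_dir, bucket, folder):
    """Build (but do not run) a recursive AWS CLI copy of a local directory."""
    return ["aws", "s3", "cp", local_dir, f"s3://{bucket}/{folder}/", "--recursive"]


def keys_for_recursive_copy(local_dir, folder):
    """List the object keys a recursive copy would create, keeping subfolders."""
    root = Path(local_dir)
    return sorted(
        f"{folder}/{path.relative_to(root).as_posix()}"
        for path in root.rglob("*")
        if path.is_file()
    )


# Demo on a throwaway directory with one made-up partition folder.
with tempfile.TemporaryDirectory() as tmp:
    partition_dir = Path(tmp) / "year=2006"
    partition_dir.mkdir()
    (partition_dir / "movies.csv").write_text("show_id,title\n")
    demo_keys = keys_for_recursive_copy(tmp, "Movies")

print(shlex.join(upload_command("movies", "educative-3213", "Movies")))
print(demo_keys)
```

The key listing shows why the flag matters here: the partition folders inside the local directory are preserved as key prefixes in the bucket.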
After running these commands, we can see the S3 bucket containing a Movies folder with all our data.
The crawler requires a database to use as its output target; the metadata it extracts is stored in tables inside this database.
In AWS Glue, we create a database named crawler-metadata-educative with a single command.
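A minimal sketch of how such a command could be assembled is shown below, assuming the AWS CLI's aws glue create-database call with a JSON --database-input. The optional LocationUri field and the helper name create_database_command are assumptions for this illustration, not the original command.

```python
import json
import shlex


def create_database_command(name, location_uri=None):
    """Build (but do not run) an AWS CLI command that creates a Glue database."""
    database_input = {"Name": name}
    if location_uri:
        # Optional: record which S3 location the database relates to.
        database_input["LocationUri"] = location_uri
    return [
        "aws", "glue", "create-database",
        "--database-input", json.dumps(database_input),
    ]


command = create_database_command(
    "crawler-metadata-educative",
    location_uri="s3://educative-3213/Movies/",
)
print(shlex.join(command))
```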
After running this command, a new, empty database appears on the “AWS Glue > Data Catalog > Databases” page, which we can reach by going to the AWS Glue homepage and clicking “Databases” in the sidebar. This database points to the Movies folder in the bucket we created earlier, primarily for monitoring purposes.
The crawler needs several permissions to access the S3 bucket. We use an IAM Role for this.
An AWS Identity and Access Management (IAM) role is an identity with a specific set of permissions that AWS services can assume temporarily. The permissions are defined by the permissions policies attached to the role, and the services that can assume the role are defined by the trust policy attached to it.
Every IAM role requires a trust policy, which specifies the principals (here, an AWS service) that are allowed to assume the role. The trust policy for our role specifies that the AssumeRole action can only be performed by the Glue service.
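The policy document itself is not reproduced in this text. The sketch below builds a standard trust policy of this shape in Python and prints it as JSON; the function names are hypothetical, and the file name trust.json is only the default used when the policy is written to disk.

```python
import json


def glue_trust_policy():
    """Return a trust policy that lets only the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_trust_policy(path="trust.json"):
    """Save the policy to a file so that a CLI command can reference it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(glue_trust_policy(), handle, indent=2)


print(json.dumps(glue_trust_policy(), indent=2))
```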
The role will also need permission policies attached to it so that it can have all the necessary access to resources.
Three commands are used to create the IAM role we need.
The first command creates an IAM role named AWSGlueServiceRoleEduc with the trust policy written in the trust.json file. The second command attaches the “AWSGlueServiceRole” permissions policy to that role, which gives the role access to several required services, while the third command attaches the “AmazonS3FullAccess” policy, which gives the role further access to S3 buckets.
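One plausible form of these three commands, assuming the AWS CLI's iam create-role and iam attach-role-policy calls, can be sketched as follows. The commands are only built and printed; the helper name iam_role_commands is hypothetical, and the ARNs are those of the two AWS managed policies named in the text.

```python
import shlex

ROLE_NAME = "AWSGlueServiceRoleEduc"
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]


def iam_role_commands(role_name, trust_policy_file, policy_arns):
    """Build (but do not run) commands that create a role and attach policies."""
    commands = [[
        "aws", "iam", "create-role",
        "--role-name", role_name,
        "--assume-role-policy-document", f"file://{trust_policy_file}",
    ]]
    for arn in policy_arns:
        commands.append([
            "aws", "iam", "attach-role-policy",
            "--role-name", role_name,
            "--policy-arn", arn,
        ])
    return commands


for command in iam_role_commands(ROLE_NAME, "trust.json", MANAGED_POLICY_ARNS):
    print(shlex.join(command))
```

AmazonS3FullAccess is broad; outside a tutorial, a policy scoped to the single bucket would usually be a safer choice.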
After running these commands, we can find the AWSGlueServiceRoleEduc role listed on the “IAM > Roles” page. To access it, go to the IAM homepage and click “Roles” in the sidebar.
Next, we create the crawler with a single command.
We name the crawler movies-crawler-educative and give it the location of our Movies folder in the S3 bucket as the data source, which tells the crawler which data to extract metadata from. We also specify crawler-metadata-educative as the output database.
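A hedged sketch of such a command, assuming the AWS CLI's aws glue create-crawler call, is shown below. It reuses the names from this Answer and only builds and prints the command; the helper name create_crawler_command is hypothetical.

```python
import json
import shlex


def create_crawler_command(name, role, database, s3_path):
    """Build (but do not run) an AWS CLI command that defines an S3 crawler."""
    targets = {"S3Targets": [{"Path": s3_path}]}
    return [
        "aws", "glue", "create-crawler",
        "--name", name,
        "--role", role,
        "--database-name", database,
        "--targets", json.dumps(targets),
    ]


command = create_crawler_command(
    name="movies-crawler-educative",
    role="AWSGlueServiceRoleEduc",
    database="crawler-metadata-educative",
    s3_path="s3://educative-3213/Movies/",
)
print(shlex.join(command))
```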
After running the above command, we find our crawler on the “AWS Glue > Crawlers” page, with its state being “Ready.” We can get to this page by going to the AWS Glue homepage and clicking on “Crawlers” from the sidebar.
Once the setup is complete, we run the crawler with a single command.
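Assuming the AWS CLI is used, starting the crawler could look like the sketch below. As before, the command is only built and printed, and the helper name start_crawler_command is hypothetical.

```python
import shlex


def start_crawler_command(name):
    """Build (but do not run) an AWS CLI command that starts a Glue crawler."""
    return ["aws", "glue", "start-crawler", "--name", name]


print(shlex.join(start_crawler_command("movies-crawler-educative")))
```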
When this command runs, the “AWS Glue > Crawlers” page shows the crawler movies-crawler-educative in a “Running” state. After some time, it changes to a “Stopping” state. Under “Table changes,” it should show “1 created,” meaning that the crawler created one table during this run.
The crawler’s final state will be “Ready,” with the “Last run” showing a “Succeeded” sign.
On the movies-crawler-educative page, under “Table changes,” we see that the crawler has created 1 new table and has identified 13 partitions.
The crawler has saved the metadata in the database we created and specified for it. It has created a new table named movies within the crawler-metadata-educative database. The number of partitions in this table can be checked with a single command.
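One way to check the partition count, assuming the AWS CLI's aws glue get-partitions call, is sketched below. The command is only built and printed, and the counting helper is exercised on a made-up, heavily trimmed response rather than real output; both helper names are hypothetical.

```python
import json
import shlex


def get_partitions_command(database, table):
    """Build (but do not run) an AWS CLI command that lists a table's partitions."""
    return [
        "aws", "glue", "get-partitions",
        "--database-name", database,
        "--table-name", table,
    ]


def count_partitions(response_json):
    """Count the entries in the Partitions list of a get-partitions response."""
    return len(json.loads(response_json).get("Partitions", []))


# Made-up response with two partitions, used only to exercise count_partitions.
fake_response = json.dumps(
    {"Partitions": [{"Values": ["2006"]}, {"Values": ["2007"]}]}
)

print(shlex.join(get_partitions_command("crawler-metadata-educative", "movies")))
print(count_partitions(fake_response))
```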
The table holds several details about our Movies data. It lists all the identified partitions, along with other information about the data, on the “AWS Glue > Data Catalog > Databases > Tables > movies” page. To get there, go to the AWS Glue homepage, click “Tables” in the sidebar, and then choose the movies table.
However, if we run the crawler again, no new table will be produced. This is because our data’s structure, along with other metadata components, would remain unchanged.
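The idea of detecting change between runs can be illustrated with a simple comparison of two column-to-type mappings. This is only a conceptual sketch, not how AWS Glue implements change detection; the function names and the example columns are chosen for illustration.

```python
def diff_schema(old, new):
    """Compare two column -> type mappings and report what changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(name for name in shared if old[name] != new[name]),
    }


def has_changes(diff):
    """Return True if any category of change is non-empty."""
    return any(diff.values())


first_run = {"show_id": "string", "title": "string", "release_year": "bigint"}
same_data = dict(first_run)
new_column = {**first_run, "rating": "string"}

print(has_changes(diff_schema(first_run, same_data)))
print(diff_schema(first_run, new_column))
```

With unchanged data the diff is empty, which mirrors why a second crawler run on the same files produces no new table.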
To run the commands from the steps above, first provide your AWS access key ID and secret access key in your terminal session. If you don’t have these keys, follow the steps in the AWS documentation under the “Managing access keys (console)” heading to generate them.
Note: Keep the following in mind:
In the commands above, you should change the name of the bucket to make it globally unique. Every command using the bucket's name should reflect this change.
After running the command to run the crawler, wait for the state of the crawler to change to "Ready" before running the last command. This usually takes up to 2-3 minutes.
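The waiting step can be automated with a small polling loop. The sketch below takes the state lookup as a function argument so that it runs without AWS access; in practice, get_state could wrap a call such as aws glue get-crawler --name movies-crawler-educative --query Crawler.State --output text, which is an assumption rather than a command from the original. The polling interval and timeout are arbitrary illustrative values.

```python
import time


def wait_until_ready(get_state, poll_seconds=15, timeout_seconds=600, sleep=time.sleep):
    """Poll get_state() until it returns READY; return the seconds spent waiting."""
    waited = 0
    while True:
        state = get_state()
        if state == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"crawler still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated state sequence, with sleeping disabled so the demo finishes instantly.
states = iter(["RUNNING", "STOPPING", "READY"])
elapsed = wait_until_ready(lambda: next(states), sleep=lambda seconds: None)
print(elapsed)
```

Passing sleep as a parameter keeps the loop easy to test, since a no-op can replace the real delay.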
Here are the benefits of using an AWS Glue Crawler:
Automates metadata discovery: Scans and infers schemas/partitions, saving time.
Simplifies integration: Populates the Data Catalog for easy use with Athena or ETL tools.
Boosts query speed: Identifies partitions for faster, cost-effective queries.
Enhances governance: Enables secure, role-based access control.
Cuts costs: Reduces manual effort and resource usage.
The AWS Glue Crawler is a useful tool for extracting and storing metadata about your data. It stores this information in an organized manner and, when run again, can detect changes to the structure and partitions of the data.