How to use SQL to Query S3 files with AWS Athena | Step by Step Tutorial

Introduction to AWS Athena

Wondering what AWS Athena is and how it works? Let me give you a quick introduction…

AWS Athena is a serverless query service that can be used to read and analyze large amounts of data directly from S3 using standard SQL.

It uses an approach known as schema-on-read, which allows you to project your schema onto your data at the time you execute a query. This eliminates the need for any data loading or ETL.

It uses Apache Hive to create, drop, and alter tables and partitions. You can write Hive-compliant DDL statements and ANSI SQL statements in the Athena query editor. You can also use complex joins, window functions and complex data types on Athena.
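To make that concrete, here is a minimal sketch of what a Hive-compliant DDL statement and an ANSI SQL query with a window function might look like in the Athena query editor. The table name, column names, and S3 prefix below are illustrative assumptions based on the transaction data described later in this article, not part of the actual setup.

-- Hive-compliant DDL: define an external table over JSON files in S3
-- (table name, column names, and the S3 prefix are hypothetical)
CREATE EXTERNAL TABLE transactions (
  id               string,
  type             string,
  amount           double,
  transaction_date string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://aws-simplified-athena-demo/input/';

-- ANSI SQL with a window function: running total of amounts per transaction type
SELECT id,
       type,
       amount,
       SUM(amount) OVER (PARTITION BY type ORDER BY transaction_date) AS running_total
FROM transactions;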

It follows a pay-as-you-go pricing model where you pay for the queries you run. The cost is determined by the amount of data each query scans, and the amount of data scanned is displayed in the console after each execution.
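As a rough illustration: assuming the commonly cited rate of about $5 per terabyte scanned (check the Athena pricing page for your region), a query that scans 1 GB of data would cost roughly half a cent, while a query that scans a full terabyte would cost around $5.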

In this article, I will show you how to read data stored in an Amazon S3 bucket by connecting it to AWS Athena and executing some SQL queries. Buckle up!

If you prefer to learn through video, check out my YouTube video on this guide here.

How to read data from Amazon S3 using AWS Athena

Source data in Amazon S3

Before we get to the how-to steps, let’s first look at the sample data in the S3 bucket we want to read…

I have created two buckets in S3: one contains the raw source data to read, and the other stores the results of the SQL queries executed in AWS Athena.

The S3 bucket “aws-simplified-athena-demo” contains the source data I want to query.

The source data in this bucket is raw transaction data in JSON format. Each row has a unique ID, the type of transaction, the purchase amount, and the date of the transaction.

Note: Unfortunately, the AWS Glue classifier does not work correctly with standard JSON format. There should be just one record per line, with no commas at the end of each line and no square brackets wrapping the records in a JSON array.
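For illustration, a correctly formatted file would contain one JSON object per line, something like the hypothetical records below (the field names and values are made up for this example):

{"id": "1", "type": "purchase", "amount": 49.99, "transaction_date": "2021-05-01"}
{"id": "2", "type": "refund", "amount": 19.99, "transaction_date": "2021-05-02"}
{"id": "3", "type": "purchase", "amount": 5.49, "transaction_date": "2021-05-03"}

rather than a standard JSON array such as [ {...}, {...}, {...} ], which the classifier would not handle correctly.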

The S3 bucket “aws-simplified-results” will store the results of my SQL queries. It is empty for now, since no queries have been executed yet.

You can set up some interesting use cases on this bucket. For example, you can trigger an AWS Lambda function to run anytime a file gets uploaded to the bucket.

Prerequisites

1) Copy the path of the source data file

Navigate to the AWS S3 service. Go to the S3 bucket where the source data is stored and click on the file.

Click on the Copy Path button to copy the S3 URI for the file.

2) Configure Output Path in Athena

Navigate to AWS Athena service.

In the Settings tab at the top right, enter the S3 bucket where the results of Athena queries will be stored (for example, s3://aws-simplified-results/), then click Save.

Steps to analyze S3 data using AWS Athena Query Editor

Step 1: Connect Data Source

In AWS Athena Query Editor, click on Connect Data Source.

Step 2: Choose a data source

Choose where the data is located and which metadata catalog to use. We are going to query data in Amazon S3, since that’s where our raw data is, and use the AWS Glue Data Catalog to hold the schema for the source data.

Click on the Next button.

Step 3: Configure Connection details

Select the first option to set up an AWS Glue crawler that retrieves the schema for the source data automatically.

Click on the Connect to AWS Glue button. This will open up a new tab to set up a new crawler in AWS Glue.