Introduction to AWS Athena
Wondering what is AWS Athena and how it works? Let me give you a quick introduction…
AWS Athena is a serverless query service which can be used to read and analyze large amounts of data directly from S3 using Standard SQL.
It uses an approach known as schema-on-read, which allows you to project your schema onto your data at the time you execute a query. This eliminates the need for any data loading or ETL.
It uses Apache Hive to create, drop, and alter tables and partitions. You can write Hive-compliant DDL statements and ANSI SQL statements in the Athena query editor. You can also use complex joins, window functions and complex data types on Athena.
It follows a pay-as-you-go pricing model where you pay for the queries you run. The cost is determined by the amount of data scanned by each query executed. Amount of data scanned by the query is displayed after each execution in the console.
In this article, I will show you how to read the data stored in Amazon S3 bucket by connecting it to AWS Athena and executing some SQL queries. Buckle up!
If you prefer to learn through video, check out my YouTube video on this guide here.
How to read data from Amazon S3 using AWS Athena
Source data in Amazon S3
Before we go to how-to steps, let’s first look at the sample data in the S3 bucket we want to read…
I have two buckets created in S3, one contains raw source data to read, and another is to store the results of SQL queries executed in AWS Athena.
S3 bucket “aws-simplified-athena-demo” contains source data I want to query.
Source data in this bucket contains raw transaction data in JSON format. Each row has a unique ID, type of transaction, purchase amount and the date of transaction.
Note: Unfortunately, the classifier does not work correctly with standard JSON format. There should be just one entry per line, no commas at the end of each line and no square brackets at the start and end of the JSON array.
S3 bucket “aws-simplified-results” is to store the results of my SQL queries executed, which is empty as no queries are executed as of now.
You can set up some interesting use-cases on this bucket. For example, you can trigger an AWS Lambda function to execute anytime a file gets uploaded to this bucket.
1) Copy the path of the source data file
Navigate to AWS S3 service. Go to the S3 bucket where source data is stored and click on the file.
Click on the Copy Path button to copy the S3 URI for file.
2) Configure Output Path in Athena
Navigate to AWS Athena service.
In the Settings tab on top-right, enter S3 Bucket name where results of Athena queries will be stored, then click Save.
Steps to analyze S3 data using AWS Athena Query Editor
Step 1: Connect Data Source
In AWS Athena Query Editor, click on Connect Data Source.
Step 2: Choose a data source
Choose where the data is located and metadata catalog. We are going to use query data in Amazon S3 since that’s where our raw data is and AWS Glue data catalog for getting schema for the source data.
Click on the Next button.
Step 3: Configure Connection details
Select the first option to set up AWS Glue Crawler to retrieve schema for source data automatically.
Click on the Connect to AWS Glue button. This will open up a new tab to set up a new crawler in AWS Glue.
There are two ways to get the schema for your source data. First way is to use AWS Glue Crawler which will crawl your data and inspect the JSON objects that are within the source data S3 Bucket, connecting that to a pseudo table in AWS Athena so that you can query over the columns in your JSON file. Alternative way is to use a manual process where you can create tables in Athena, manually specifying the names and the types of each column.
Before using one of the above two ways, keep in mind that you will incur additional cost by using AWS Glue Crawler whereas there is no cost associated with the manual process! For more information on pricing, see https://calculator.aws
In this example, we are going to use AWS Glue Crawler to identify schema for source data automatically.
Step 4: Create crawler in AWS Glue
In the new tab opened in Step 3, follow the instructions below to set up new crawler:
1. Enter the crawler name and click Next.
2. Select Data Stores as the crawler’s source type and click Next.