DynamoDB Scan and Query operations may appear similar at face value. They both retrieve data from your DynamoDB table, but the two APIs are very different in terms of speed, cost and scaling. In this article, I’m going to help you understand the difference between these two APIs, and help you decide when to use what in your application.
Prefer a video format? Check out my YouTube video on DynamoDB Scan vs Query here.
DynamoDB scan is an operations that allows you to retrieve every item in your table. It is the equivalent of a
SELECT * FROM table in SQL world. To filter your results based on some attribute, you can use the
FilterExpression parameter as part of your scan call. FilterExpressions are the equivalent of adding
WHERE clauses if you’re coming from SQL.
There are a couple important considerations when using the Scan operation highlighted below:
1 – A Scan Call Will Only Ever Return 1MB of Data (Maximum)
If you’re coming from SQL, this can be a bit of a shocker. However DynamoDB is subject to some of the same limitations that SQL databases are. This means that scan operations have a limit on how much data can be retrieved in one go. If you want to get more data, you need to make a paginated call to get the next batch of results.
2 – Scans Can Be VERY Expensive
A scan operation can return every result in your table. This may seem great at face value, however there are significant cost implications.
An important component of the cost structure for DynamoDB is a concept called Read Capacity Units, or RCUs for short. DynamoDB charges you based on the amount of RCUs you use. So naturally, a table that has high read throughput will have a higher cost than a table with lower read throughput (assuming storage is the same).
However, with the DynamoDB Scan operation, you are charged for iterating over every item in your table, even if you are using a FilterExpression to reduce your result set.
This was a shocker for me when I first started using Dynamo. Scans can be ridiculously expensive. Even if you use a FilterExpression to reduce the amount of results you get returned, DynamoDB will still consume RCUs for every item in the table. This means scans can get expensive quick, especially at large table sizes.
The reason is because DynamoDB internally needs to iterate over every row in order to determine if the criteria you specified in your FilterExpression matches the item.
As a general rule of thumb, you should be avoiding scans altogether. They seem attractive to developers coming from SQL, but they have significant cost and performance implications when your table size grows.
Here’s an example of a simple query expression:
table = dynamodb.Table('CustomerOrders') response = table.query( KeyConditionExpression=Key('CustomerId').eq(1) )
All in all, scan operations can be useful in some very isolated circumstances. However I would strongly encourage you to revisit your DynamoDB Table Design if you’re overly relying on them. Typically, you should be leveraging direct GetItem or Query operations to retrieve your data. Below is a great video I put together on DynamoDB Schema Design that will help you understand optimal table structure and key design.
You can learn all about DynamoDB Schema Design in my YouTube video.
Now, lets talk about Queries.
DynamoDB Queries can be thought of as a subset of a Scan operation, but without the performance and cost concerns.
Queries allow you to retrieve data with the matching provided partition key. For example, consider if you had a table that is used to retrieve all orders for a given customer, it may look like this:
|CustomerId (Partition Key)||OrderId (Range Key)||Cost (Other Attribute)|
Using the query operation, we can make a single call to our table to get all orders for CustomerId 1. This call would result in the top two records being returned.
Unlike Scans, DynamoDB does not need to iterate over the entire table to find your result set, so you’re only charged RCUs for reading the two rows that match the partition key. In the language of Dynamo, you specify the partition key using the
KeyConditionExpression (see additional documentation and examples here) which has a fairly straight forward syntax.
Much like the Scan Operation, you can also provide a
FilterExpression into your API call. This will reduce the amount of results returned back to you. However, the same rules apply as with scan – you will still be charged for reading all results with the matching partition key despite only being returned back the subset.
For example, if you were to perform a Query for CustomerId 1, but you specify a FilterExpression to filter results for those with Cost > 50.00, you will be charged RCUs for both rows despite only being returned the first row – so tread carefully.
Other things to remember when using Query is that they follow a very similar set of rules as Scan. That is, only 1MB resultset size, and FilterExpressions are applied AFTER the results are internally retrieved in Dynamo.
One other thing is that Queries are only really applicable if you’re table design uses a range key. If you ONLY have a partition key and no range key, there is no need to use Queries since by definition, all rows will have a unique key and can be retrieved with a GetItem call.
DynamoDB Scan and Query operations seem similar at face value, but are fundamentally different in how they should be used. Scans allow you to retrieve all your data, but can be costly and have poor performance. Queries allow you to retrieve data by partition key and are much preferred to scans, but can limit how you can access your data.
For the above reason, its extremely important to design your table carefully so you can leverage Query or GetItem operations as much as possible. This will minimize your cost and ensure optimal performance even at the scale.
If you enjoyed this article, I have a YouTube playlist with dozens of DynamoDB videos you should check out here.