These are the top 5 tips I wish someone had told me before I started using AWS DynamoDB.
Standard transactional databases aren't always the best fit for every use case. Modern NoSQL databases like DynamoDB are designed to deliver predictable performance at any scale.
My goal in this post is to walk through the top 5 things you need to know when working with DynamoDB. But before we do, let's quickly review what DynamoDB is.
Let's get started.
What is DynamoDB?
Amazon DynamoDB is a fully managed NoSQL database service that offloads the administrative burden of operating and scaling a distributed database. With a managed database like DynamoDB, you don't need to deal with hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
At its foundation, DynamoDB is a key-value and document database that provides fast, predictable performance with seamless scalability. It offers single-digit-millisecond latency at any scale, and AWS reports it can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB offers features such as:
- Multi-Region, multi-active replication via global tables
- Built-in security
- Backup and restore
- In-memory caching for internet-scale applications
My 5 Golden tips for working with DynamoDB
1. Use GSIs sparingly
GSI stands for Global Secondary Index, and GSIs are a way to perform fast lookups on the values of an attribute that is not part of the table's primary key.
For instance, imagine you have a customer table and want to retrieve all customer orders for a particular country. You can set up a GSI on the Country attribute so that, instead of having to supply customer and order details when you query the table, you can query the index and find all of the data related to that country. Isn't that helpful?
One thing to keep in mind is that GSIs have cost implications. Each index maintains its own copy of the data, so every write to the base table is replicated to the index, and adding one effectively doubles the read and write capacity you pay for. With this in mind, use GSIs sparingly.
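As a concrete sketch, here is what querying a GSI looks like with the low-level DynamoDB client. The `Orders` table and `CountryIndex` index names are hypothetical; the `IndexName` and `KeyConditionExpression` parameters are part of the real Query API.

```python
def orders_by_country(dynamodb_client, country):
    """Query a GSI instead of the base table's primary key.
    "Orders" and "CountryIndex" are illustrative names."""
    resp = dynamodb_client.query(
        TableName="Orders",
        IndexName="CountryIndex",
        KeyConditionExpression="Country = :c",
        ExpressionAttributeValues={":c": {"S": country}},
    )
    return resp["Items"]

# Usage (assumes AWS credentials and the table/index exist):
# import boto3
# orders_by_country(boto3.client("dynamodb"), "Canada")
```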
2. Be careful with item sizes and pagination
This is an issue that is probably the most common and the most troublesome because it looks like you’re doing everything right, but you’re not getting all of your data back when performing a query.
For example, many of you may be running scans or queries and hitting an issue where you have a hundred rows in your table but only get back 50 records. What's going on here?
Behind the scenes in DynamoDB, every Query or Scan operation returns at most one megabyte of data. So if your first 50 rows constitute one megabyte and you query that table, you're only going to get back those first 50 rows; you need to make another call to get the next 50. A lot of people get really confused by this. You need to look at LastEvaluatedKey, a property of the response that indicates more results exist, and make a subsequent call using that key to retrieve the entire result set.
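A minimal pagination loop looks like the sketch below. `LastEvaluatedKey` and `ExclusiveStartKey` are real fields of the DynamoDB Query API; the client is whatever boto3 client you already have.

```python
def query_all_pages(dynamodb_client, **query_kwargs):
    """Keep calling Query until LastEvaluatedKey is absent,
    accumulating items across DynamoDB's 1 MB pages."""
    items = []
    kwargs = dict(query_kwargs)
    while True:
        resp = dynamodb_client.query(**kwargs)
        items.extend(resp.get("Items", []))
        last_key = resp.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key  # resume where we stopped
    return items
```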
3. Leverage pre-joined data with one-to-many relationships
You can leverage pre-joined data in a one-to-many relationship by using a partition key and sort key combination. Let me explain exactly what I mean using a real-life use case.
Assume you have a use case involving customers and orders, and you want to be able to retrieve all of a customer's orders over time by customer ID. Traditionally, in a SQL-style system, you'd have a Customers table and an Orders table with a foreign key relationship between them; you'd join the two on that key to get your results. That works perfectly in SQL, but it's not how you would do it in DynamoDB.
You could imitate that in DynamoDB by performing one query, inspecting the results, and then performing a second query, but this is just not a scalable way to access your data.
A better approach is to leverage this one-to-many relationship. In this specific example, you would make the customer ID the partition key and the order ID the sort key. If you think about it practically, customers place many orders over time, so the customer ID repeats many times as a partition key in your table, with a different sort key for every order ID that exists. Each customer ID and order ID pair is a unique combination, and that combination constitutes your primary key in this pattern.
We can then answer the question very quickly using the Query API: it gives you all of the orders for a particular customer just by querying on that customer ID. This is a really handy way to model your data and retrieve it quickly.
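Assuming a hypothetical `CustomerOrders` table with `CustomerId` as the partition key and `OrderId` as the sort key, the query might look like this:

```python
def orders_for_customer(dynamodb_client, customer_id):
    """Fetch every order for one customer in a single Query.
    The key condition matches the partition key only, so every
    sort-key value (order ID) under that customer is returned."""
    resp = dynamodb_client.query(
        TableName="CustomerOrders",  # hypothetical table name
        KeyConditionExpression="CustomerId = :cid",
        ExpressionAttributeValues={":cid": {"S": customer_id}},
    )
    return resp["Items"]
```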
4. Avoid Scans
Scans are the equivalent of a SELECT * FROM in SQL, and they're not what DynamoDB is designed for. If you don't already know, when you scan a table you consume read capacity for every item read. Some of you may say, "That's okay, I have a filter expression," but it will not help you.
Although a filter expression reduces the number of records you get back, it doesn't reduce the cost: DynamoDB still reads a full page of data and applies the filter only after that page is retrieved. In other words, DynamoDB charges you for scanning the entire page of data. So be careful when using scans; they can end up costing an arm and a leg.
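You can see this gap in the Scan response itself: `Count` is the number of items that passed the filter, while `ScannedCount` is the number of items actually read, which is what you pay for. A sketch, with the table and attribute names as assumptions:

```python
def scan_with_filter(dynamodb_client, table_name, country):
    """Scan with a filter expression. The filter runs AFTER each page
    is read, so you pay for ScannedCount items, not Count items."""
    resp = dynamodb_client.scan(
        TableName=table_name,
        FilterExpression="Country = :c",  # "Country" is an illustrative attribute
        ExpressionAttributeValues={":c": {"S": country}},
    )
    return resp["Count"], resp["ScannedCount"]
```

If `Count` is 2 but `ScannedCount` is 10,000, you were billed for reading 10,000 items.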
Queries are the preferred way to access data in DynamoDB.
5. DynamoDB Streams
In my opinion, DynamoDB Streams is one of the coolest and most underutilized features of DynamoDB in general.
DynamoDB Streams captures item-level changes to the records in your table and outputs them as a stream that you can consume events from.
You can listen for updates on that stream and hook it up to a Lambda function, so that the function is automatically invoked whenever an item in your table changes.
From there, you can analyze the 'before' and 'after' of an item-level update to perform a diff. Alternatively, you can capture the new value and keep an up-to-date dashboard based on changes in your dataset.
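As a sketch, a stream-triggered Lambda handler can compute that diff from the old and new images, assuming the stream's view type is NEW_AND_OLD_IMAGES. The record layout (`eventName`, `OldImage`, `NewImage`) is the standard DynamoDB Streams event format; the diff logic here is only illustrative and ignores deleted attributes.

```python
def handler(event, context):
    """For each MODIFY record in a DynamoDB stream batch, collect the
    set of attributes that changed or were added in the new image."""
    diffs = []
    for record in event["Records"]:
        if record["eventName"] == "MODIFY":
            old = record["dynamodb"].get("OldImage", {})
            new = record["dynamodb"].get("NewImage", {})
            changed = {k for k in new if old.get(k) != new.get(k)}
            diffs.append(changed)
    return diffs
```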