S3 is a general object storage service built ontop of Amazon’s cloud infrastructure. Learn about the Core S3 Concepts that you need to know about it in this article.
S3 is one of the most popular services on AWS. Launched in 2006, the service has since added a ton of new and useful features, but many of the core concepts have stayed the same.
In this blog post, I want to introduce you to the Core Concepts of Amazon S3. We’re going to learn what S3 is, some of its major concepts, useful features, pricing, use case examples, and much more.
So let’s get started…
What is Amazon S3?
S3 stands for Simple Storage Service and is an general object storage service built by AWS. It takes advantage of the vast infrastructure AWS all over the world.
I like to think of S3 as something similar to Dropbox or Google Drive in the sense that you can use it to store any file type (within some reasonable size limits). Although similar to these other products, its more oriented towards software oriented projects, but really, there’s nothing stopping you from using it to archive your personal photo library if you wish. After all, this is AWS we’re talking about here!
In terms of content that you can store on S3, you can store big files, small files, media content, source code, spreadsheets, emails, json documents, and basically anything that you can think of. Keep in mind though for a single object, there is a maximum size limit of 5 TB. This size limit probably won’t affect 99.9999% of you, but its important to know what the constraints are.
Below, I run you through some of the main performance dimensions you need to know about if using S3.
Performance is another reason many users flock to S3 as an object storage solution, and this is really where the advantage is over more conventional cloud storage products such as Google Drive and Dropbox.
S3 is an extremely scalable solution. You can think of it like an all you can eat buffet – there is no limit to the amount of content you can upload to S3. In distributed system world, we call S3 a horizontally scalable solution. Note that horizontally scaled systems can continue to provide predictable performance even when size grows tremendously. S3 shines in this regard.
The nice thing about S3 is that it can support applications that need to PUT or GET objects at very high throughputs, and still experience very very low latencies. For example, I once built an application that needed to read S3 objects out of a bucket at over 50 read calls PER SECOND. Size volumes varied from around 50KB to 100KB, but latencies were always low and predictable (often lower than 100ms).
In the performance category, S3 has defined a whole bunch of best practices when it comes to using S3 in highly concurrent scenarios. This includes strategies such as using multiple connections, using appropriate retry strategies, and many more tips. The link above calls out many of the suggestions by AWS – I highly suggest you give it a read.
One of the really nice things about relying on cloud storage by AWS is that any service built on it gets to take advantage of the highly distributed nature of cloud computing. Service creators are able to spread out their work across different logical units (data centers, regions) to ensure that the product / service they are offering is consistently accessible. Also, AWS has dedicated networking backchannels between its data centers that allows for it to quickly start serving traffic from another data center in the scenario where one is having problems.
S3 is a rockstar when it comes to availability. Since it piggy backs ontop of AWS cloud infrastructure, its able to offer very reliable availability guarantees. Do note that the actual availability guarantee (usually defined in % format like 99%) is dependent on the different Storage Tiers that you apply to your data (more on this later).
The standard tier which is default offers 99.99% availability guarantees. There are other tiers with slightly lower guarantees (99.5% is the lowest), but this only applies to certain data storage tiers that have lower cost. If you care about your data being always accessible and don’t mind paying a premium for it, then the Standard Tier will do just fine. If you’re looking to save on cost, you can consider some of the lower availability tiers. We’ll discuss in depth the different storage tiers in an upcoming section to help you understand more of when to use what.
Another nice thing to know about is that Amazon S3 has an SLA in terms of availability, and if Amazon ever fails to meet its SLA, they will offer service credits based on the downtime percentage to your nice bill. This is a commitment to availability on AWS side. There’s a screenshot below of the uptime percentage that AWS guarantees and the credits that get applied if AWS fails to meet them.
Do keep in mind though that outages in S3 are extremely rare, so you may never need to worry about this.
In cloud computing, durability refers to how healthy or resilient your is when it comes to data loss. Since data in an S3 bucket for example is stored on the cloud, we need a way to measure how likely it is for your data to become lost. In S3’s case, its durability is advertised as 99.99999999999% (11 9’s). Without having to say it, the chances of you losing your data are not impossible, just insanely improbable.
So how does S3 achieve this incredible durability? The process involves making multiple copies of the data you upload, and separating them onto multiple different physical devices across a minimum of three availability zones.
Additionally, S3 also offers a “versioning” feature such that each time you update a version of your object, the previous version is persisted alongside it. This won’t make ‘duplicate objects’ in your bucket per se – it’ll just allow you to revert back to any other previous version with a couple clicks. Versioning does come at an extra cost, but is useful for those of you looking to go the extra mile when it comes to durability.
Integrations with Other AWS Services
This is by far my favourite feature of using S3 – its native integration into other AWS services. S3 in itself is pretty un-interesting – cloud storage…. cool?
However, what makes S3 so powerful is the fact that you can combine some of its features with other AWS services to achieve some very interesting functions. For example, you can take advantage of the S3 events feature and integrate it with a Lambda function. With this setup, you can trigger a lambda function every time a S3 object is uploaded. This can be useful for data processing, file upload notifications, and many more interesting use cases.
These are just a couple examples, but there’s many other interrogations with AWS services worth noting. We’ll go over a couple of the more popular ones later in this post.
In the next section, we’ll start to get into the details of Amazon S3’s core concepts: buckets and objects.
Buckets are the highest order concept in Amazon S3. A bucket is really just a container for items that you would like to store within a certain namespace.
When you create a bucket, you give it a name. Its important to note here that bucket names must be globally unique across ALL of AWS. So there can’t be two buckets called test, or two called production, even if they are owned by someone on a different AWS account.
In terms of what buckets represent, I like to think of buckets as a general purpose file system. The bucket itself is the top level folder and within it you can have subfolders, files, subfolders within subfolders, so on and so forth. Basically, the same as a folder on Mac / Linux.
In terms of the visualizing buckets, here’s what a view of what they look like from the AWS console. Note that bucket names cannot contain special characters, must be globally unique, and need to be assigned to a corresponding AWS region upon creation.
Now we know about the basic file structure of S3. Lets move on to the content we’re storing inside them, more commonly referred to as S3 Objects.
Objects are the content that you’ll be storing inside of your buckets. They are the files, or collection of files that you upload to S3. This can range from media files, source code, zip files, and the list goes on.
Its important to remember that 5 TB size limit here – objects uploaded cannot exceed this limit.
Larger objects may have problems getting uploaded into the AWS Cloud – this can be due to spotty networking, flaky wifi networks, and many more reasons. For those of you with rather large files, you can break your files up into smaller ‘chunks’ using S3’s multi-part upload feature. This allows you to take advantage of parallel threading to upload your objects quicker and more reliably into the cloud.
Below, we can see a list of Objects stored within an S3 Bucket. Notice we can assess object file types, modification dates, size, and storage class (more on this later). Also notice that the last record in this list is a sub-folder. These subfolders may contain objects or other sub folders. Its up to the user to decide their bucket/subfolder structure.
In terms of accessing objects, there’s a couple different ways that you can retrieve your uploaded content from s3. The first method is using a URL following the schema below.
You’ll notice this is a HTTP link – this means that those who have the link will be able to read your object. Do note that this will not work by default – bucket owners need to explicitly specify that content within their bucket CAN be public, and whether certain objects or folders/subfolders ARE public. Its a two-step process that helps you shoot yourself in the foot by accidentally exposing your bucket to the public internet.
The second (and more popular) way to access your bucket’s content is programmatically. The code itself is pretty trivial. In the below example, I’m using Python and leveraging Boto3 library’s
get_object method to retrieve my object.
Content retrieved can be deserialized as a bytestream into a system known object format where it can be manipulated based on your needs. When downloading from S3, there are different availability guarantees depending on which storage class you’re using. In the next section, we’ll talk about these storage classes and how they can drastically impact the price of using S3.
S3 Storage Classes
When S3 was originally created, there was never a concept of Storage Classes – there was only one classification you could use and only one pricing point. Storage Classes are a relatively new concept introduced post release.
Over time, S3 began being used from a variety of different use cases – and the list keeps on growing. This includes applications leveraging S3 for real time applications, data processing, data archiving, event based architectures, and many more examples.
It turned out that one size doesn’t fit all. These different use cases have different different requirements in terms of things like latency and availability. Storage classes address this problem by allowing you to classify objects within buckets into different storage classes that have different performance characteristics, and also different pricing models.
Firstly, the thing to remember is that S3 allow you to drastically reduce costs, but it comes with certain sacrifices. The biggest ones are in terms of access time (latency) to your objects, availability (how often your data is available from the S3 service) and durability (how ‘redundant’ or ‘secure’ your data is on the cloud).
A list of storage classes and a brief explanation is outlined below
- Standard Tier – The default and most commonly used tier. Great for ‘read after write’ scenarios requiring low latency, high durability and availability guarantees.
- Intelligent Tier – A tier that attempts to shuffle around your data into different tiers based on access patterns. Shuffling is done automatically and unbeknowns to the customer.
- Standard IA – Infrequent access tier. This is more suitable for older data that doesn’t need to be access often, but when it does, you need low and predictable performance.
- One Zone IA – The same as Standard IA, except your data is only persisted in one availability zone instead of a minimum of 3. Has the lowest durability guarantees at 99.5% and comes with lower cost points.
- Glacier – Suitable for archive data that needs occasional access. SLA for object retrieval ranges from a few minutes to a few hours. Low cost point.
- Deep Glacier – Best for long lasting archival data (i.e. for compliance, regulation, or policy purposes). Object retrieval within 12 hours of request time. Lowest cost point.
- Outposts – On premise S3. Essentially emulates Amazon S3 service within on-prem devices and makes the files available from local machines.
As you may have imagined, each tier comes with a wide variety of performance characteristics. For a more comprehensive visual of the difference between the tiers see the chart below.
In terms of pricing – hold off on that thought. We’re going to visit the pricing of each of these tiers in