DynamoDB Scan: the most efficient operation 😉

December 20, 2020 | AWS, NoSQL

By Franck Pachot

The title is provocative on purpose, because you can read in many places that you should avoid scans, and that Scan operations are less efficient than other operations in DynamoDB. The risk, when reading those messages without understanding what is behind them, is that people will actually avoid Scans and replace them with something that is even worse. If you want to compare the efficiency of an operation, you must compare it doing the same thing, or it is an apples vs. oranges comparison. Here I'll compare two extreme use cases: the need to get all items, and the need to get one item only. Then I'll explain further what is behind the "avoid scans" idea.

I have created a table with 5000 items:


aws dynamodb create-table --table-name Demo \
 --attribute-definitions AttributeName=K,AttributeType=N \
 --key-schema            AttributeName=K,KeyType=HASH \
 --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=25,WriteCapacityUnits=25

for i in {1..5000} ; do
aws dynamodb put-item     --table-name Demo --item '{"K":{"N":"'${i}'"},"V":{"S":"'"$RANDOM"'"}}'
done

Because each time I demo on a small table someone comments "this proves nothing, the table is too small", I have to make clear that you don't need petabytes to understand how it scales. Especially with DynamoDB, which is designed to scale linearly: there is no magic that happens after reaching a threshold, as you can have in an RDBMS (small scans optimized with cache, large scans optimized with storage indexes / zone maps). If you have doubts, you can run the same test and change 5000 to 5000000000 and you will observe the same, but you do that on your own cloud bill, not mine 😉

Let’s count the items:


$ aws dynamodb scan --table-name Demo --select=COUNT --return-consumed-capacity TOTAL --output text

5000    None    5000
CONSUMEDCAPACITY        6.0     Demo

This is a Scan operation. The consumed capacity is 6 RCU. Is this good or bad? Efficient or not?
First, let's understand those 6 RCU. I have 5000 items, each a bit less than 10 bytes (2 attributes with one-character names, and a number of up to 5 digits). That is about 48 KiloBytes, read with eventual consistency (we don't read all mirrors), where reading 4 KiloBytes costs 0.5 RCU. The maths is easy: 48 / 4 / 2 = 6. If you test it on 5000 million items, as I suggested for those who don't believe in small test cases, you will see 6 million RCU. It is just elementary arithmetic, a simple cross-multiplication, there's no magic. So, with the maximum on-demand read capacity, which I think is 40000 RCU per second by default, you can count those 5000 million items in two and a half minutes. Is that inefficient? Try parallel scans…
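
To give an idea of what "try parallel scans" means, here is a minimal sketch using the Segment/TotalSegments parameters of the Scan operation, splitting the same Scan into 4 segments that can run concurrently (the segment count of 4 is an arbitrary choice for this illustration, not something measured above):

# Minimal parallel-scan sketch: the table is logically split into 4 segments
# and each segment is counted by its own background process.
# The total RCU consumed stays the same as a single Scan; only the elapsed time changes.
for segment in 0 1 2 3 ; do
 aws dynamodb scan --table-name Demo \
  --select=COUNT --return-consumed-capacity TOTAL \
  --total-segments 4 --segment ${segment} \
  --output text &
done
wait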

Scan

You see where I'm going. There's no operation to "avoid" or ban. It just depends on what you want to do. Counting all items is done with a Scan and you cannot do it faster in DynamoDB. Except if you maintain a global counter, but then you double the cost of each PutItem. You don't make it faster, you just transfer the cost to another part of the application.
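
To be complete on that point, maintaining such a counter usually means one extra write for each insert, for example an atomic counter incremented with UpdateItem and an ADD expression. This is only a sketch to illustrate the doubled write cost; the counter item (K=0) and the ItemCount attribute are made up for this example:

# Hypothetical global counter: every insert is followed by a second write
# that increments a counter item. This second write is what doubles the WCU cost.
aws dynamodb put-item --table-name Demo --item '{"K":{"N":"5001"},"V":{"S":"'"$RANDOM"'"}}'
aws dynamodb update-item --table-name Demo \
 --key '{"K":{"N":"0"}}' \
 --update-expression "ADD ItemCount :one" \
 --expression-attribute-values '{":one":{"N":"1"}}'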

You may want to do something more complex than a count. This is a scan that sums the values of the attribute “V”:


$ aws dynamodb scan --table-name Demo --select=SPECIFIC_ATTRIBUTES --projection-expression=V --return-consumed-capacity TOTAL --output text \
| awk '/^CONSUMEDCAPACITY/{rcu=rcu+$2}/^V/{sum=sum+$2;cnt=cnt+1}END{printf "%10.2f rcu   %10d items %10d sum(V)\n",rcu,cnt,sum}'

      6.00 rcu         5000 items   81599797 sum(V)

The code handles pagination (not needed here, as my table is less than 1MB, but those trying on 5000 million items can copy/paste this). I've described Scan pagination in a previous post, which explains why I use the "text" output here. No surprise: a Scan is a Scan, and there's no cache in DynamoDB to make it faster when you read the same data frequently: 6 RCU again.
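
For those who want to drive the pagination explicitly rather than let the CLI handle it behind the scenes, here is a minimal sketch that follows NextToken page by page using the CLI pagination options and jq (the page size of 1000 items is an arbitrary choice for this illustration):

# Hypothetical explicit-pagination loop: fetch up to 1000 items per call and
# follow NextToken until the whole table has been scanned.
token="" ; count=0
while true ; do
 if [ -z "$token" ] ; then
  page=$(aws dynamodb scan --table-name Demo --projection-expression V \
   --max-items 1000 --output json)
 else
  page=$(aws dynamodb scan --table-name Demo --projection-expression V \
   --max-items 1000 --starting-token "$token" --output json)
 fi
 count=$(( count + $(echo "$page" | jq '.Items | length') ))
 token=$(echo "$page" | jq -r '.NextToken // empty')
 [ -z "$token" ] && break
done
echo "$count items"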

GetItem

Then, what happens if you tell your developers that they must avoid Scans? The table design is already there, and they need the count and the sum. This is not a critical use case, maybe just a figure for a daily dashboard, so there's no point in adding the overhead of maintaining counters with a Lambda or AWS Glue Elastic Views. A Scan is perfectly valid here. But they try to avoid this "inefficient Scan" and come up with this idea: they know the last item number inserted (5000 in my demo) and use the "efficient" GetItem call:


$ for i in {1..5000} ; do  aws dynamodb get-item --table-name Demo --key '{"K":{"N":"'$i'"}}' --return-consumed-capacity TOTAL ; done \
| awk '/^CONSUMEDCAPACITY/{rcu=rcu+$2}/^V/{sum=sum+$2;cnt=cnt+1}END{printf "%10.2f rcu   %10d items %10d sum(V)\n",rcu,cnt,sum}'

   2500.00 rcu         5000 items   81599797 sum(V)

No surprise if you know how it works: each GetItem costs 0.5 RCU, so the total is 2500 RCU. Most of the time you read the same block of data from storage, but it still counts as RCU. This is 416 times more expensive than the Scan. So, let's refine the "Scan is the least efficient operation" claim:

  • Scan is the least efficient operation to get one item
  • Scan is the most efficient operation to get many items

Size

What does "many" mean here? Getting all items, as I did here, is where a Scan is the most efficient. But given the numbers of my example, with GetItem costing 0.5 RCU per item and the Scan costing 6 RCU, we can say that the Scan is the most efficient operation as soon as you get more than 12 items. However, this depends on two things. First, depending on which predicate filters those 12 items, a Query may be faster than a Scan. This depends on the data model, and it is not the case with my table here. Second, this factor of 12 depends on the size of the items (see the small break-even sketch after the list below). Because:

  • The Scan operation cost depends on the size of the table (all items with all attributes) and not on the number of items returned
  • The GetItem operation cost depends on the number of items read (and their size when larger than 4KB)
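
To make this break-even point concrete, here is a small back-of-the-envelope sketch with the numbers of this post (48KB of data, eventually consistent reads at 0.5 RCU per 4KB for the Scan and 0.5 RCU per GetItem):

# Break-even between one Scan and N GetItem calls, using the numbers above.
table_size_kb=48                                                         # 5000 items of ~10 bytes
scan_rcu=$(awk -v kb="$table_size_kb" 'BEGIN{printf "%.1f", kb/4*0.5}')  # 0.5 RCU per 4KB read
getitem_rcu=0.5                                                          # 0.5 RCU per eventually consistent GetItem
awk -v s="$scan_rcu" -v g="$getitem_rcu" \
 'BEGIN{printf "Scan: %.1f RCU, break-even: %d items\n",s,s/g}'

This prints "Scan: 6.0 RCU, break-even: 12 items", matching what we measured above.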

In my example, I have small items (10 bytes), so a Scan can get more than 400 items per 0.5 RCU, where GetItem gets at most one item per 0.5 RCU. With this, the Scan is quickly more efficient than GetItem. And this does not depend on the size of the table, but on the size of each item. This is important because the best practice documentation also says "you should avoid using a Scan operation on a large table or index with a filter that removes many results". If we take "avoid" as absolute, this is true, but it can also apply to any operation: avoid reading your data and everything will be faster and cheaper 😉 If we take "avoid" as using another access type, like GetItem, then this is wrong: the table size does not count in the efficiency. This claim is right only when the "filter that removes many results" is an equality predicate on the partition key. But by the time the developer reads this, the table design is done and it is too late. In NoSQL, you don't have the agility to change the partition key without a huge refactoring of the code, because you don't have the logical data independence of an RDBMS. The best you can do for this use case is a Scan and, maybe, if it happens too frequently, caching it in your application code or with the DAX service.

All this is not new for SQL people. The myth that "full table scans are evil" is very old. Then people realized that a full table scan may be the most efficient access, especially with all the optimizations of the last decades (hash joins, prefetching, direct-path reads, storage indexes, adaptive plans,…). Please never say that something is inefficient without the context, or you will miss the best of it. When a "best practice" is spread without its context, it becomes a myth. DynamoDB has the advantage of being simple (limited access paths, no cost-based optimizer,…), so it is easy to understand the cost of an access path rather than apply some best practices blindly.

How do you measure efficiency? When you look at the number of items you can get with one RCU, a Scan is actually the most efficient operation. And, please, don't think that we should "avoid" Scans as if another operation could be more efficient for the same result. What we should avoid with DynamoDB is a data model that requires Scans for critical operations. Remember that it is a key-value datastore: optimized to get one item with GetItem (or one item collection with Query) for one hash key value. When you need to read many items, it is still efficient with an appropriate composite key defined for that, like in Single Table Design where one Query can retrieve, with one RCU, all the items to be joined, or with a Global Secondary Index, which is a replica with a different partitioning scheme. But as soon as you read from all partitions, a Scan is the most efficient operation.
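
As an illustration of the "one Query for one partition key" access pattern, here is a hedged sketch. It assumes a hypothetical single-table-design table (let's call it DemoSTD) with a composite key PK/SK, which is not the Demo table created above:

# Hypothetical Query on a composite-key table: one call retrieves the whole item
# collection for one partition key (e.g. a customer and its orders), usually in
# very few RCU because the items are colocated in the same partition.
aws dynamodb query --table-name DemoSTD \
 --key-condition-expression "PK = :pk" \
 --expression-attribute-values '{":pk":{"S":"CUSTOMER#42"}}' \
 --return-consumed-capacity TOTAL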

3 Comments

  • Sam says:

    This is a phenomenal article that does a really good job of lay-explaining with terrific examples. I would include some timing measures.
    E.g. I scan 200K lines, which takes me 17 seconds serially, 10s in parallel, with 50% filtering, vs. querying the same in N seconds…

  • Franck Pachot says:

    Hi Sam,

    Thanks for the feedback and for the occasion to explain why I prefer to put metrics like RCU rather than clock timings.

    The first reason is reproducible measures. The time to process this depends on many parameters in the service settings (provisioned or on-demand capacity, auto-scaling, adaptive capacity), in the usage configuration (latency to the DynamoDB endpoint, CLI), the size of data read and retrieved,… Those will be different in every environment. In my opinion, time comparisons and benchmarks are only good for vendors to select the right workload that shows they are x42 faster than their competitor. By reading the time improvement of one specific case, people may make bad decisions for their own system, which is different. When we talk about concepts and operation numbers (like RCU/WCU), the idea can be projected onto a specific context, because the concepts are the same. Doing this with response time is misleading.

    The second reason is that DynamoDB is billed by RCU/WCU, so this is what counts in the end. If you want to increase the throughput, you can scale. If you want to improve the response time, you can add an accelerator service. The limit is the price, so the provisioned capacity units.

    The third reason is that the response time can be optimized by the cloud provider without us even knowing it (AWS can upgrade the hardware and the software), and my measure from one day cannot be compared to another day. The goal of this blog post is to understand the concepts: Scan/Query for many colocated items, GetItem for a few scattered items. And the concepts will not change in the future.

    Fourth, my blogging style is to show test cases that can easily be copy/pasted and reproduced to understand the concepts. Readers can do this for free, as I limit RCU/WCU to 25, which is possible in the AWS free tier. I like experimenting on free labs to learn. Of course, those who want to see the timing in their environment can just run the same with more data.

    That's a remark I see very often when I talk about services built for massive scale. People are obsessed with timing figures (like "single-digit milliseconds"). Cloud and hardware vendors like that because they have offerings to sell that reduce the response time and increase the throughput. But true scalability comes from the application design.

    Franck.

  • Claudio says:

    I found the article very insightful. Thanks, C.
