Infrastructure at your Service

Franck Pachot

Amazon DynamoDB: a r(el)ational Glossary

By September 17, 2020 AWS, Cloud No Comments

By Franck Pachot

.
There are many NoSQL databases. And, because SQL is an ISO standard, “No SQL” also means “No Standard”. Many have a similar API and similar objects, but with completely different names. Today, NoSQL databases are used as an additional datastore for some well-defined use cases for which a hashed key-value store fits better than a relational table. And it quickly became “Not Only SQL” as it is complementary to RDBMS databases using SQL. But at the origin, the idea was to replace the RDBMS databases, refusing the SQL API, and then inventing a “No SQL” data store. When you want to replace something rather than proposing something new, you often adopt the same language to make it look similar. And this why, in my opinion we find some relational database terms like “Table” and “Index”. But they have a different meaning. Here is a dictionary where I try to explain the DynamoDB artifacts and differentiate from their Relational and SQL meaning.

Attribute

This is taken from the Relational semantic. An “Attribute” (a “Column” in SQL) is an element of a “Relation” (a “Table” in SQL) for which a “Tuple” (a “Row” in SQL) can have a value. Basically, when you store a list of your friends, they have a first name, a last name, a birthdate,… and many other properties. “First Name”, “Last Name” are attributes of the relation “Friend”. The combination of those properties is a fact like a sentence saying “I have a friend whose first name is Donald and last name is Chamberlain and was born on December 21, 1944”. In DynamoDB an attribute is a JSON member. The previous tuple could be written as: {“first name”:”Donald”, “last name”: “Chamberlain”, birthdate: “1944-12-21”}.

But there are many differences. A relational attribute has a “Domain” (a “Data Type” in SQL) which defines the set of possible values. A “First Name” or “Last Name” is a combination of characters within a characterset, with a specific maximum length (because the application that manipulates it have some memory buffers, screen width, paper length,… physical limits). A “Birthdate” is a day within a specific calendar,… They are constrained so that the developer that reads an attribute knows that she will not encounter any edge case at runtime. If the domain says that the first name holds maximum 30 of [A-Za-z-‘] characters, then she knows that she will not buffer overflow when she allocates 30 bytes (or 60 in UTF-16, or 120 in UTF-8), and that she doesn’t have to code some additional assertions in her code to be sure there’s no non-alphanumeric characters there. In JSON you can encounter any character string in a value. And DynamoDB attributes have only a few data type constraints: Null, Number, String, Boolean, Binary. And they are defined per attribute value as there is no schema. The format of a date is just a character string with a convention. Nothing will tell you that “1752-09-13” is not a valid date.

If you come from Oracle, you can see a DynamoDB attribute like the ANYDATA data type:

[email protected]> create table demo (att anydata);
Table DEMO created.
[email protected]> insert into demo values (anydata.convertNumber(42));
1 row inserted.
[email protected]> insert into demo values (anydata.convertVarchar2(q'{Don't panic}'));
1 row inserted.
[email protected]> select * from demo;
                                                                            ATT
_______________________________________________________________________________
OPAQUE(ANYDATA TypeCode: "TYPECODE_NUMBER" - ANYDATA Value: "42")
OPAQUE(ANYDATA TypeCode: "TYPECODE_VARCHAR2" - ANYDATA Value: "Don't panic")

Another difference is that a relational attribute should be scalar (holds only one value). DynamoDB can store a whole hierarchy in a JSON attribute. The 1st Normal Form says that the value must be atomic even if Chris Date accepts nested tables in 1NF. And SQL allows this. Actually, it is not NoSQL vs. SQL here. Usually, in SQL you store hierarchies into another table but can store it as a hierarchy with nested tables, XML, or JSON. Usually, in NoSQL document database you store a whole hierarchy as one value but the “single-table” design splits them into multiple items of a collection.

If you compare with MongoDB, a DynamoDB attribute is similar to a MongoDB field.

Item

Even if an Item in DynamoDB looks like a relational “Tuple” (or “Row” in SQL) there’s no ambiguity. We are in a key-value store and the item is the value. We are in a key-value store and the item is the value. This would be stored in a SQL BLOB (Binary Large Object) in a relational database. As JSON, it could be the OSON datatype in Oracle or JSONB in PostgreSQL. It is a column, not a row.

The equivalent of a DynamoDB item would be a BLOB storing JSON in a table with a key and this BLOB only. This is, for example, how SODA, the NoSQL API for Oracle RDBMS, stores a collection:


[email protected]_tp> soda create MyDocumentStore
Successfully created collection: MyDocumentStore

[email protected]_tp> ddl "MyDocumentStore"

  CREATE TABLE "DEMO"."MyDocumentStore"
   (    "ID" VARCHAR2(255) COLLATE "USING_NLS_COMP" NOT NULL ENABLE,
        "CREATED_ON" TIMESTAMP (6) DEFAULT sys_extract_utc(SYSTIMESTAMP) NOT NULL ENABLE,
        "LAST_MODIFIED" TIMESTAMP (6) DEFAULT sys_extract_utc(SYSTIMESTAMP) NOT NULL ENABLE,
        "VERSION" VARCHAR2(255) COLLATE "USING_NLS_COMP" NOT NULL ENABLE,
        "JSON_DOCUMENT" BLOB,
         CHECK ("JSON_DOCUMENT" is json format oson (size limit 32m)) ENABLE,
         PRIMARY KEY ("ID") 
   ) ;

Here JSON_DOCUMENT is in format OSON, the Oracle binary implementation for JSON, stored in a BLOB which can be any size. The “soda” and “ddl” are standard SQLcl commands.

If you compare with MongoDB, a DynamoDB item is similar to a MongoDB document except that it is a JSON string value where MongoDB can store larger documents (up to 16MB) in binary format (BSON). But we will see that the idea is to split a document to multiple items in DynamoDB, with a key starting with the same prefix to be hashed and range partitioned together.

Collection

This is a NoSQL term. A “Collection” is a set of “Item”. But in DynamoDB this has a specific meaning and the SQL equivalent would be a table “Partition”. When you HASH partition an RDBMS table, you do the same as a DynamoDB collection: you apply a hash function on the partition key value which determines the physical segment of the table where you want to store it. This is how you scale any database, NoSQL or RDBMS: partition by hash and distribute those partition over the storage, processes, and maybe nodes.

A DynamoDB collection is not the same as a MongoDB collection. It is more like a chunk. What is called a collection in MongoDB is called a table in DynamoDB.

Table

This is where using the same name in DynamoDB as in SQL database can mislead to a bad data model. A “Relation” (“Table” in SQL) is a set of tuples (“rows”) that belongs to the same business entity. In DynamoDB you can store anything in the same table as it is just a key-value store. For example, you store the customer information with their orders within the same table and within the same collection. And that’s the goal: the data model is optimized for one use-case and if this use case is to display one customer orders, then you want to retrieve all that with one ‘get’. A SQL table has the opposite goal: centralize the data for many use-cases. You have one table with your customers. And one table with your orders. And the CRM can query the customers only. The Delivery will query only the orders. The Marketing wants to get some customer information for specific orders. The billing wants to get all orders for a customer…

Again, it is not NoSQL vs. SQL and you can have some nuances of both. In a RDBMS when you want to pre-build a hierarchy to optimize a specific use-case, you get the RDBMS to maintain transparently a redundant physical store like an index or a materialized view. And in DynamoDB the “single-table” design splits a document into multiple items to get it accessed from multiple use cases.

Actually, a Table in DynamoDB could be seen as a Materialized View in SQL rather than as a table. It is pre-built by the application and not by the RDBMS but it is like a view (purpose-built for specific queries) and materialized (the physical colocation of data is there by design). The worst mistake a developer can do is consider it as a SQL table and store related items in multiple DynamoDB tables. This would be bad for performance and cost. What you store in a DynamoDB table is probably at the same scope as what you store in a SQL schema for a small application.

Index

An index in RDBMS is a physical structure that provides fast access to a small number of rows because it is ordered on a value. That’s the same idea in DynamoDB: access with another attribute than the key. Another similarity is the fact that the index is transparently maintained. The developer declares the index and doesn’t have to take care of the maintenance in his code. Of course, in NoSQL the maintenance of the index can be in eventual consistency but it is still automatically maintained. However, there is a huge difference for your coding effort. In RDBMS you never query the index. The developer doesn’t have to know the physical design. An index is a physical structure created for performance reasons. SQL provides a logical view on the database: you query the table and the RDBMS optimizer decides how it is the most efficient get the rows required for the result. In DynamoDB you see the indexes as a table: you need to “get” on them if you want to use them.

In NoSQL, you have no optimizer. You do it in the code. If you add a new index for fast access, you need to change the code to use it. In RDBMS you have the agility to add an index (even online without application interruption in some RDBMS) and queries will use it when it is appropriate. It is all about having a simple API with calls to embed in your procedural code or calling a database service that does the procedural access for you.

In DynamoDB you have local indexes and global indexes. That’s the same idea as in RDBMS where in index can be local, partitioned in the same way as the table, or global, partitioned on another value. And you find the same optimization ideas. I have a few blog posts related to covering indexes in DynamoDB.

Primary Key

In DynamoDB as well as in SQL a Primary Key is a unique identifier for an item or row. It is enforced to be unique: If you put an item that already exists, it will replace it (the default in DynamoDB and can do the same in SQL with the MERGE statement) or reject it (the default in SQL and you can do the same in DynamoDB with an attribute_not_exists condition). RDBMS usually uses the primary key for the physical organization as well. Oracle is very agile on that and only specific table organization (like IOT) requires that the physical order follows the primary key. Other RDBMS (like MySQL InnoDB) will store the tables clustered on the primary key. In DynamoDB, like in any key-value store, the primary key determines the physical access through physical structures. A DynamoDB is hash partitioned on the first column of the primary key. And may be sorted on a second attribute. If you want to hash on multiple attributes, your code will concatenate the values to be the first attribute. Again, NoSQL provides a simple API and the developer does the additional logic in the procedural code. In RDBMS, a primary key is a declared constraint that may, or may not, be related to the physical organization. And can be compound with multiple columns. RDBMS can also have additional keys for which the uniqueness is enforced and for which the referential integrity can be validated.

In summary

This glossary is aimed at people working in SQL and NoSQL to avoid confusion between the different meanings of the same terms used in RDBMS and DynamoDB. The most important you should remember is that a Table in DynamoDB is different than a Table in SQL. Yon don’t need to separate to different tables because of the size: tables are automatically partitioned. You don’t need to separate to different tables because of the structure: all are key-value. And, you will quickly have a generic key name where you put different entities and the key value being build to control the clustering of data. Think of a DynamoDB table as a key-value store. You can have multiple stores to isolate multiple applications with different performance requirement (and cost). But an application should store its data into one store only. I’m not giving examples here because you can follow the two links on “single-table” which will bring you to the excellent examples from Rick Houlian and Alex DeBrie. I wrote this blog post as an expansion of my following tweet:

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Franck Pachot
Franck Pachot

Principal Consultant / Database Evangelist
Oracle ACE Director
Oracle Database OCM 12c certified
AWS Database Specialty certified
Oak Table member

RSS for this blog: feed
Twitter: @FranckPachot
LinkedIn : www.linkedin.com/in/franckpachot
Podcast en français: DBPod