I would like to share some personal thoughts. On September 28th, 2021, Oracle released the eleventh generation of its Exadata Database Machine: X9M-2 (2 CPU sockets), X9M-8 (8 sockets) and ZDLRA X9M. Exadata is a computing platform for running Oracle RDBMS. The Zero Data Loss Recovery Appliance (ZDLRA) is a platform for backing up Oracle RDBMS and is based on Exadata hardware.

Combining hardware and software, and the importance of documentation

Oracle Exadata is undoubtedly an interesting platform to run Oracle RDBMS on. It’s the combination of hardware and software that sets it apart from other platforms. Maybe you remember the famous release of the Apple iPhone, which redefined the smartphone market by hitting the sweet spot between hardware and software?
For technically interested readers, I recommend having a look at the documentation. See here, for example, for hardware details that allow you to compare old and new. You will also find a comparison in the next section.
Regarding ZDLRA, the documentation covers site planning, network configuration, hardware and software installation, and maintenance. To date, there is no documentation available on new features, but you may check all books at a later time.

What changed? A comparison

The documentation lists the hardware details on a single web page. In its white papers, Oracle briefly introduces the features and lists some benchmark results:

Compute nodes hardware

| Metric | # CPU sockets | X8M | X9M | Increase % |
|---|---|---|---|---|
| 8K database read I/Os per second | 2 (X?M-2) | 1’500’000 | 2’800’000 | 87 % |
| | 8 (X?M-8) | 5’000’000 | 5’000’000 | 0 % |
| 8K flash write I/Os per second | 2 (X?M-2) | 980’000 | 2’000’000 | 104 % |
| | 8 (X?M-8) | 3’000’000 | 3’000’000 | 0 % |
| Max memory in GB | 2 (X?M-2) | 1536 | 2048 | 33 % |
| | 8 (X?M-8) | 6144 | 6144 | 0 % |
| Total CPU cores | 2 (X?M-2) | 48 | 64 | 33 % |
| Single-core marks | 2 (X?M-2) | ca. 960 | ca. 1200 | 25 % |
| Multi-core marks | 2 (X?M-2) | ca. 24’000 | ca. 45’000 | 87.5 % |
| CPU model | 2 (X?M-2) | Intel® Xeon® 8260, 2.4 GHz | Intel® Xeon® 8358, 2.6 GHz | n/a |
| Total CPU cores | 8 (X?M-8) | 192 | 192 | 0 % |
| Single-core marks | 8 (X?M-8) | ca. 960 | ca. 960 | 0 % |
| Multi-core marks | 8 (X?M-8) | ca. 26’000 | ca. 26’000 | 0 % |
| CPU model | 8 (X?M-8) | Intel® Xeon® 8268, 2.9 GHz | Intel® Xeon® 8268, 2.9 GHz | n/a |

CPU scores are taken from the Geekbench 5 results database; treat them with caution, as they may not be reliable. There is no increase on the X?M-8 compute nodes because both versions use the same hardware. The main reason to use X?M-8 is the larger maximum memory per compute node (think of the In-Memory database option). No average latency is given for the IOPS figures, so the results are not completely documented and need to be interpreted with caution.
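The increase column can be recomputed from the X8M and X9M values. The following is a minimal sketch in Python; the function name `pct_increase` and the dictionary layout are my own choices for illustration, and the input numbers are the 2-socket compute node values quoted in this post.

```python
def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# (X8M, X9M) pairs for the 2-socket compute nodes, as quoted in this post.
compute_2_socket = {
    "8K database read I/Os per second": (1_500_000, 2_800_000),
    "8K flash write I/Os per second": (980_000, 2_000_000),
    "Max memory in GB": (1536, 2048),
    "Total CPU cores": (48, 64),
}

for metric, (x8m, x9m) in compute_2_socket.items():
    print(f"{metric}: {pct_increase(x8m, x9m):.1f} %")
```

Note that a value that doubles is an increase of 100 %, not 200 %: the ratio new/old and the relative increase differ by exactly 100 percentage points.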

Storage nodes hardware

| Metric | Model | X8M | X9M | Increase % |
|---|---|---|---|---|
| 8K database read I/Os per second | Extreme Flash (EF) | 1’500’000 | 2’300’000 | 53 % |
| | High capacity (HC) | 1’500’000 | 2’300’000 | 53 % |
| | Extended (XT) | n/a | n/a | n/a |
| 8K flash write I/Os per second | Extreme Flash (EF) | 470’000 | 614’000 | 31 % |
| | High capacity (HC) | 470’000 | 614’000 | 31 % |
| | Extended (XT) | n/a | n/a | n/a |
| Flash raw capacity in TB | Extreme Flash (EF) | 51.2, PCIe 3.0 | 51.2, PCIe 4.0 | 0 % |
| | High capacity (HC) | 25.6, PCIe 3.0 | 25.6, PCIe 4.0 | 0 % |
| | Extended (XT) | 0 | 1 | |
| Disk raw capacity in TB | Extreme Flash (EF) | 0 | 0 | 0 % |
| | High capacity (HC) | 168 | 216 | 28.6 % |
| | Extended (XT) | 168 | 216 | 28.6 % |
| Persistent memory raw capacity in TB | Extreme Flash (EF) | 1.5, Series 100 | 1.5, Series 200 | 0 % (32 % bandwidth) |
| | High capacity (HC) | 1.5, Series 100 | 1.5, Series 200 | 0 % (32 % bandwidth) |
| Total CPU cores | Extreme Flash (EF) | 32, Intel® Xeon® 5218, 2.3 GHz | 32, Intel® Xeon® 8352Y, 2.2 GHz | 0 % |
| | High capacity (HC) | 32, Intel® Xeon® 5218, 2.3 GHz | 32, Intel® Xeon® 8352Y, 2.2 GHz | 0 % |

There is no increase in flash storage or persistent memory capacity. The PMEM bandwidth increase is the figure stated by Intel. No average latency is given for the IOPS figures, so the results are not completely documented and need to be interpreted with caution.

Full rack configuration

The performance increases between the X8M and X9M platforms according to Oracle’s tests are quite impressive (although we don’t know how Oracle tests or whether the tests are consistent). Interestingly, the X9M full rack with high capacity storage uses 14 instead of 12 storage nodes. Besides that, there is more CPU power, as more modern hardware is available on both compute and storage nodes. No average latency is given for the IOPS figures, so the results are not completely documented and need to be interpreted with caution.

| Metric | X8M | X9M | Increase |
|---|---|---|---|
| 8K database read I/Os per second | 12’000’000 | 22’400’000 | 86.7 % |
| 8K flash write I/Os per second | 5’640’000 | 8’596’000 | 52.4 % |
| Database servers | 8 x X8M-2 | 8 x X9M-2 | 0 % |
| Database servers, total cores | 400 | 512 | 28 % |
| Storage servers (total cores) | 12 x High capacity (384) | 14 x High capacity (448) | 16.7 % |
| Disk raw capacity in TB | 2016 | 3024 | 50 % |
| Flash raw capacity in TB | 307.2 | 358.4 | 16.7 % |
| Persistent memory raw capacity in TB | 18 | 21 | |
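Several of the full rack figures are consistent with simply multiplying the per-node values from the earlier tables by the node counts. The sketch below illustrates that arithmetic. It is an observation about the numbers quoted in this post, not a statement about how Oracle derives its figures, and the names `PER_NODE`, `NODE_COUNTS` and `full_rack` are made up for this example.

```python
# Per-node values as quoted in the compute and storage node tables of this post.
PER_NODE = {
    "X8M": {"db_read_iops": 1_500_000, "hc_write_iops": 470_000,
            "hc_disk_tb": 168, "hc_flash_tb": 25.6, "hc_pmem_tb": 1.5},
    "X9M": {"db_read_iops": 2_800_000, "hc_write_iops": 614_000,
            "hc_disk_tb": 216, "hc_flash_tb": 25.6, "hc_pmem_tb": 1.5},
}

# Full rack node counts from the text: 8 database servers,
# 12 (X8M) or 14 (X9M) high capacity storage servers.
NODE_COUNTS = {
    "X8M": {"db": 8, "hc": 12},
    "X9M": {"db": 8, "hc": 14},
}


def full_rack(generation: str) -> dict:
    """Scale per-node values to a full rack by plain multiplication."""
    node = PER_NODE[generation]
    count = NODE_COUNTS[generation]
    return {
        "read_iops": count["db"] * node["db_read_iops"],
        "write_iops": count["hc"] * node["hc_write_iops"],
        "disk_tb": count["hc"] * node["hc_disk_tb"],
        # Rounded to one decimal to hide floating-point noise.
        "flash_tb": round(count["hc"] * node["hc_flash_tb"], 1),
        "pmem_tb": count["hc"] * node["hc_pmem_tb"],
    }


for generation in ("X8M", "X9M"):
    print(generation, full_rack(generation))
```

Under this reading, the read figure scales with the database servers while the write, capacity and persistent memory figures scale with the storage servers, so part of the X9M gain comes from the two additional storage nodes.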

Please note there are also other configurations available:

| Configuration | Compute nodes X8M-2 | Compute nodes X9M-2 | Storage nodes X8M-2 | Storage nodes X9M-2 |
|---|---|---|---|---|
| Eighth rack | 2 | 2 | 3 | 3 |
| Quarter rack | 2 | 2 | 3 | 3 |
| Half rack | 4 | 4 | 6 | 7 |
| Full rack | 8 X8M-2 or 3 X8M-8 | 8 X9M-2 or 3 X9M-8 | 12 | 14 |
| Elastic configuration, single rack | up to 19, 912 cores max | up to 19, 1216 cores max | up to 19, 608 cores max | up to 18, 576 cores max |
| Elastic configuration, multi rack | up to 32 compute nodes | ?, but possibly higher than X8M | up to 64 storage nodes | ?, but possibly higher than X8M |

Remarks

Eighth rack
  • Minimum Exadata configuration
  • Only half the CPU cores of a quarter rack are activated, and only half the storage is built in
  • Expand by upgrading to a quarter rack first, then add compute and storage nodes
Quarter rack
  • Expand by adding compute and storage nodes (elastic configuration)
Elastic configuration, single rack
  • One can combine compute and storage nodes in a rack
Elastic configuration, multi rack
  • Multi rack deployments are possible by connecting racks via the integrated RoCE network fabric. RoCE stands for Remote direct memory access over Converged Ethernet.
  • Limits without using additional RoCE network fabric
    X8M: 8 racks
    X9M: 12 racks
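The core maxima for the elastic single rack configuration follow from the node limit multiplied by the cores per node given in the hardware tables. This is a small sketch of that calculation; `max_cores` is a hypothetical helper name, and the per-node core counts (48 and 64 for compute, 32 for storage) are the ones quoted earlier in this post.

```python
def max_cores(max_nodes: int, cores_per_node: int) -> int:
    """Upper bound on cores in a rack filled with one node type."""
    return max_nodes * cores_per_node


# (node limit, cores per node) for the elastic single rack configuration.
elastic_single_rack = {
    "X8M-2 compute": (19, 48),
    "X9M-2 compute": (19, 64),
    "X8M storage": (19, 32),
    "X9M storage": (18, 32),
}

for node_type, (limit, cores) in elastic_single_rack.items():
    print(f"{node_type}: up to {limit} nodes, {max_cores(limit, cores)} cores max")
```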

Oracle Exadata Configuration Assistant (OECA) simplifies the elastic configuration process. OECA helps you investigate and plan a variety of elastic configuration scenarios. It has not yet been updated with X9M platform data, but it certainly will be in the future. Download it here; an Oracle account is needed.

Some words on performance

Please be advised: the performance increase is not only due to hardware changes (latest Intel Xeon Ice Lake CPUs, PCIe 4.0 instead of 3.0, Series 200 persistent memory instead of Series 100, you name it), but also to software, which may be used on both the X8M and X9M platforms. It’s interesting to see the implications of choosing a processor platform. Maybe we will see AMD and/or ARM chips in Exadata in the future?
Furthermore, I believe hardware tuning is the easiest of all methods to tune Oracle RDBMS, but not the one with the biggest effect. Tuning at SQL level, or simply keeping active data sets in databases small (e.g. by using partitioning and offloading data on a regular basis), can have a bigger impact. Both SQL tuning and data archiving require a tuning specialist to understand how end users use a service, and how the service is meant to be used. Unfortunately, there is a difference between the two… Tuning is therefore individual to every company, even when the exact same software is used. Writing this, I was reminded of the 2009 German movie “Same Same But Different”…
To sum it up: tuning is hard work, a true craft, sometimes even an art… It’s advisable to tune with a scientific approach (knowing the effects before implementing them) and to end tuning efforts when the impact on performance from an end user perspective becomes too low. So get end users into the tuning boat.

What can be learned from cloud

Speaking of performance tuning, one can think of services that cannot be run on public clouds because of network latency. But one can learn from cloud services how to engineer environments on-premises. I see at least two interesting aspects:

  1. Automatisation
    Saves time and ensures quality. DBAs end up becoming developers. Running services autonomously by data mining the data dictionary (or simply using “statistics”, but this term is not as modern as “data mining”) is also part of automatisation.
  2. Standardisation
    Many database workloads can fit into a few standards. Maybe you start to define them like T-shirt sizes S, M, L, XL for your Oracle RDBMS offerings? There is no rule without exception; there are and always will be services demanding more or less. But don’t underestimate the power of standardisation.
    T-shirt sizes are useful for defining limits when consolidating a large number of databases on the same hardware. Exadata offers Resource Manager to limit CPU, memory and disk usage.
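The T-shirt size idea can be sketched as a small catalogue plus a lookup that picks the smallest size covering a workload’s demand. Everything in this example is hypothetical: the `Size` class, the `pick_size` function and especially the core and memory limits are placeholders for illustration, not Oracle recommendations or Resource Manager settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Size:
    name: str
    cpu_cores: int
    memory_gb: int


# Hypothetical limits for illustration only; real values depend on your workloads.
SIZES = [
    Size("S", 2, 16),
    Size("M", 4, 64),
    Size("L", 8, 128),
    Size("XL", 16, 256),
]


def pick_size(cpu_cores: int, memory_gb: int) -> Optional[Size]:
    """Return the smallest standard size covering the demand.

    None means the workload is an exception that needs individual sizing.
    """
    for size in SIZES:  # ordered from smallest to largest
        if cpu_cores <= size.cpu_cores and memory_gb <= size.memory_gb:
            return size
    return None


print(pick_size(3, 32))    # fits the hypothetical M size
print(pick_size(32, 512))  # exceeds every standard size
```

A workload that returns None is one of the exceptions mentioned above. The limits of the chosen size are what one would then enforce when consolidating databases on shared hardware.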

To me, it’s important to understand that both automatisation and standardisation can be achieved when running services on-premises as well, though probably at higher cost and with more time and effort. Data can be the second most valuable asset in a company after its employees and must be protected (see section “On trust”), so money is not the deciding factor. Time and agility are among the main reasons to move workloads to the cloud. But a supplier should not make decisions for the customer; it is always the customer who decides.

On trust

In the end, trust between customer and supplier is an important factor in choosing a solution. Trust is a cultural thing; not everybody acts the same. This makes life quite interesting, don’t you think? Regarding the Exadata Cloud@Customer service, customers trust Oracle Corporation to run their workloads and handle their data even though the hardware stays in the customer’s data center. Not every supplier is capable of protecting data, as many incidents have shown in the past, and risks will increase with standardisation (fewer variants to choose from) and monopolisation (fewer vendors). There would be an interesting experiment with Cloud@Customer services: what happens if one cuts internet access to the Exadata hardware? Will the databases continue to run? Legally, you may not be allowed to try.

On appliances

A vendor may know the capabilities and limits of its solution best, so it’s a logical step to also offer appliances. The more complex a solution, the more this holds true, and Oracle RDBMS is a complex piece of software. Appliances and services have to fit into an existing environment. My experience is: the bigger the company, the bigger the environment, the more vendors and silos there are, and the fewer standards are used. Quite an impressive number of people are busy migrating data and services from one platform to another. This is by no means always successful. Furthermore, legacy platforms stay longer than expected. It’s easy to end up as a zoo keeper.

Time to refactor?

On the other hand: if only the vendor is able to build meaningful appliances for its software, has the vendor reached or even missed the point in time to refactor its software products? Oracle RDBMS was first commercially available in 1979. Honestly, I have no idea how much code and how many ideas from older times still exist in today’s versions. And personally, I think Oracle’s strategy is mainly about introducing new features (cloud drives development) and not about robustness and fixing bugs. It’s also quite hard for developers to keep up with what hardware and software offer today. Modern programming languages, remote direct memory access (RDMA) and persistent memory are just some examples. One goal is for hardware to interact with hardware as directly as possible, using the least amount of software (and main CPU cycles), so that overall performance improves. With Oracle RDBMS, this may also lower CPU-based licence costs. Maybe one day you will pay for payload and not for CPU power?

On future

But maybe, in the end, the most fundamental steps are to

  • release software under an appropriate open source licence
  • work constantly on simplicity and closely with end customers
  • identify and reduce technical debt before implementing new features
  • monetise not by selling licences but by selling services

Back to reality

I think most developers, C-level managers and end users don’t care about databases at all. When it comes to Oracle RDBMS, few people understand the complex engine. Most workloads fit a standard configuration, but there is a huge risk of misconfiguration. If in doubt, leave the default values and avoid using hidden parameters. PL/SQL is far too complex; as a result, most payload is simply processed at the application server or client level. Transporting data has negative effects on costs, energy consumption and performance. On the other hand, processing data outside the database may avoid database vendor lock-in. Applications come and go, database engines stay.

An optimal service

  • should be available promptly,
  • should be self-explanatory to use (focus on usability),
  • should keep the payload under end user control (with the basic feature to erase it) and
  • should scale its performance up and down automatically according to the current workload while maintaining low end user response times.

Easy requirements, hard to materialise.

Thinking out loud. What is your opinion?

Casimir Schmid

Consultant