Infrastructure at your Service

By Franck Pachot

.
This post is the first one from a series of small examples on recent Oracle features. My goal is to present them to people outside of Oracle and relational databases usage, maybe some NoSQL players. And this is why the title is “select from a flat-file” rather than “Inline External Tables”. In my opinion, the names of the features of Oracle Database are invented by the architects and developers, sometimes renamed by Marketing or CTO, and all that is very far from what the users are looking for. In order to understand “Inline External Table” you need to know all the history behind: there were tables, then external tables, and there were queries, and inlined queries, and… But imagine a junior who just wants to query a file, he will never find this feature. He has a file, it is not a table, it is not external, and it is not inline. What is external to him is this SQL language and what we want to show him is that this language can query his file.

I’m running this in the Oracle 20c preview in the Oracle Cloud.

In this post, my goal is to load a small fact and dimension table for the next posts about some recent features that are interesting in data warehouses. It is the occasion to show that with Oracle we can easily select from a .csv file, without the need to run SQL*Loader or create an external table.
I’m running everything from SQLcl and then I use the host command to call curl:


host curl -L http://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ | dos2unix | (sed -u 1q; sort -t, -n -k4,4 -k3,3 -k2,2 ) > /tmp/covid-19.csv

This gets the latest number of COVID-19 cases per day and per country.

It looks like this:


SQL> host head  /tmp/covid-19.csv

dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
31/12/2019,31,12,2019,0,0,Afghanistan,AF,AFG,37172386,Asia
31/12/2019,31,12,2019,0,0,Algeria,DZ,DZA,42228429,Africa
31/12/2019,31,12,2019,0,0,Armenia,AM,ARM,2951776,Europe
31/12/2019,31,12,2019,0,0,Australia,AU,AUS,24992369,Oceania
31/12/2019,31,12,2019,0,0,Austria,AT,AUT,8847037,Europe
31/12/2019,31,12,2019,0,0,Azerbaijan,AZ,AZE,9942334,Europe
31/12/2019,31,12,2019,0,0,Bahrain,BH,BHR,1569439,Asia
31/12/2019,31,12,2019,0,0,Belarus,BY,BLR,9485386,Europe
31/12/2019,31,12,2019,0,0,Belgium,BE,BEL,11422068,Europe

I sorted them on date on purpose (next post may talk about data clustering) but this is the way the file comes anyway.

I need a directory object to access the file:


SQL> create or replace directory "/tmp" as '/tmp';

Directory created.

You don’t have to use quoted identifiers if you don’t like it. I find it convenient here.

I can directly select from the file, the EXTERNAL clause mentioning what we had to put in an external table before 18c:


SQL> 
select *
 from external (
  (
   dateRep                    date
   ,n_day                     number
   ,n_month                   number
   ,n_year                    number
   ,cases                     number
   ,deaths                    number
   ,countriesAndTerritories   varchar2(50)
   ,geoId                     varchar2(10)
   ,countryterritoryCode      varchar2(3)
   ,popData2018               number
   ,continentExp              varchar2(10)
  )
  default directory "/tmp"
  access parameters (
     records delimited by newline skip 1 -- skip header
     logfile 'covid-19.log'
     badfile 'covid-19.bad'
     fields terminated by "," optionally enclosed by '"'
(dateRep date 'dd/mm/yyyy',n_day,n_month,n_year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp))
  location ('covid.csv')
  reject limit 10000 - because I've seen some bad data ;)
 )
.


SQL> /

      DATEREP    N_DAY    N_MONTH    N_YEAR    CASES    DEATHS                       COUNTRIESANDTERRITORIES       GEOID    COUNTRYTERRITORYCODE    POPDATA2018    CONTINENTEXP
_____________ ________ __________ _________ ________ _________ _____________________________________________ ___________ _______________________ ______________ _______________
31/12/2019          31         12      2019        0         0 Afghanistan                                   AF          AFG                           37172386 Asia
31/12/2019          31         12      2019        0         0 Algeria                                       DZ          DZA                           42228429 Africa
31/12/2019          31         12      2019        0         0 Armenia                                       AM          ARM                            2951776 Europe
31/12/2019          31         12      2019        0         0 Australia                                     AU          AUS                           24992369 Oceania
31/12/2019          31         12      2019        0         0 Austria                                       AT          AUT                            8847037 Europe
31/12/2019          31         12      2019        0         0 Azerbaijan                                    AZ          AZE                            9942334 Europe
31/12/2019          31         12      2019        0         0 Bahrain                                       BH          BHR                            1569439 Asia

ORA-01013: user requested cancel of current operation

SQL>

I cancelled it as that’s too long to display here.

As the query is still in the buffer, I just add a CREATE TABLE in front of it:


SQL> 1
  1* select *
SQL> c/select/create table covid as select/
   create table covid as select *
  2   from external (
  3    (
  4     dateRep                    varchar2(10)
  5     ,day                       number
...

SQL> /

Table created.

SQL>

While I’m there I’ll quickly create a fact table and a dimension hierarchy:


SQL> 
create table continents as select rownum continent_id, continentexp continent_name from (select distinct continentexp from covid where continentexp!='Other');

Table created.

SQL> create table countries as select country_id,country_code,country_name,continent_id,popdata2018 from (select distinct geoid country_id,countryterritorycode country_code,countriesandterritories country_name,continentexp continent_name,popdata2018 from covid where continentexp!='Other') left join continents using(continent_name);

Table created.

SQL> 
create table cases as select daterep, geoid country_id,cases from covid where continentexp!='Other';

Table created.

SQL> 
alter table continents add primary key (continent_id);

Table altered.

SQL> 
alter table countries add foreign key (continent_id) references continents;

Table altered.

SQL> 
alter table countries add primary key (country_id);

Table altered.

SQL> 
alter table cases add foreign key (country_id) references countries;

Table altered.

SQL> 
alter table cases add primary key (country_id,daterep);

Table altered.

SQL>

This creates a CASES fact table with only one measure (covid-19 cases) and two dimensions. To get it simple, the date dimension here is just a date column (you usually have a foreign key to a calendar dimension). The geographical dimension is a foreign key to the COUNTRIES table which itself has a foreign key referencing the CONTINENTS table.

12c Top-N queries

In 12c we have a nice syntax for Top-N queries with the FETCH FIRST clause of the ORDER BY:


SQL> 
select continent_name,country_code,max(cases) from cases join countries using(country_id) join continents using(continent_id) group by continent_name,country_code order by max(cases) desc fetch first 10 rows only;

   CONTINENT_NAME    COUNTRY_CODE    MAX(CASES)
_________________ _______________ _____________
America           USA                     48529
America           BRA                     33274
Asia              CHN                     15141
Europe            RUS                     11656
America           ECU                     11536
Asia              IND                      9851
Europe            ESP                      9181
America           PER                      8875
Europe            GBR                      8719
Europe            FRA                      7578

10 rows selected.

This returns the 10 countries which had the maximum covid-19 cases per day.

20c WINDOW clauses

If I want to show the date with the maximum value, I can use analytic functions and in 20c I don’t have to repeat the window several times:


SQL> 
select continent_name,country_code,top_date,top_cases from (
   select continent_name,country_code,daterep,cases
    ,first_value(daterep)over(w) top_date
    ,first_value(cases)over(w) top_cases
    ,row_number()over(w) r
    from cases join countries using(country_id) join continents using(continent_id)
    window w as (partition by continent_id order by cases desc)
   )
   where r=1 -- this to get the rows with the highest value only
   order by top_cases desc fetch first 10 rows only;

   CONTINENT_NAME    COUNTRY_CODE      TOP_DATE    TOP_CASES
_________________ _______________ _____________ ____________
America           USA             26/04/2020           48529
Asia              CHN             13/02/2020           15141
Europe            RUS             12/05/2020           11656
Africa            ZAF             05/06/2020            3267
Oceania           AUS             23/03/2020             611

The same can be done before 20c but you have to write the (partition by continent_id order by cases desc) for each projection.

In the next post I’ll show a very nice feature. Keeping the 3 tables normalized data model but, because storage is cheap, materializing some pre-computed joins. If you are a fan of NoSQL because “storage is cheap” and “joins are expensive”, then you will see what we can do with SQL in this area…

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Franck Pachot
Franck Pachot

Principal Consultant / Database Evangelist