Infrastructure at your Service

Michael Schwalm

Oracle LOBs: Infer the file type with dbms_lob package

LOBs (Large OBjects) first appeared in Oracle 7 and were created to store large amount of data such as document files, video, pictures, etc. Today, we can store up to 128 TB of data in a LOB, depending on the DB_BLOCK_SIZE parameter settings. In this posting, I will show you how to load LOB files into the Oracle database and will present a way to identify the file type of the LOB stored in the database, based on the LOB value column only.

Each file type is associated to a “file signature”. This signature is composed of several bytes (i. e. 00 01), and allows to uniquely identify a file type (i.e. zip file, png file, etc). The signature corresponds to the first bytes of the file. To identify the file type of a LOB stored in a database, we just have to extract the first bytes of the LOB value and then compare them to a list of known bytes. I have created a PL/SQL script for that purpose.

The first step is to create a table containing LOB files. Here, I will create a simple table to store internal Binary objects (BLOB). The statement used is:

CREATE TABLE app_doc (  doc_id NUMBER,  doc_value BLOB);
 
SQL> desc app_doc;
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
DOC_ID                                             NUMBER
DOC_VALUE                                          BLOB

Now, I want to load a file from the operating file system to the table APP_DOC. The file to load is a picture saved as PNG format. The path is “/oracle/documents/my_pic.png”.

First, we must create an Oracle directory in order to access the file:

SQL> CREATE OR REPLACE DIRECTORY DIR_DOCUMENTS AS '/oracle/documents';

We then use the DBMS_LOB package through a PL/SQL block to load a file in the database:

DECLARE
  src_lob  BFILE := BFILENAME('DIR_DOCUMENTS', 'my_pic.png');
  dest_lob BLOB;
BEGIN
  INSERT INTO app_doc VALUES(1, EMPTY_BLOB())
     RETURNING doc_value INTO dest_lob;
  DBMS_LOB.OPEN(src_lob, DBMS_LOB.LOB_READONLY);
  DBMS_LOB.LoadFromFile( DEST_LOB => dest_lob,
                         SRC_LOB  => src_lob,
                         AMOUNT   => DBMS_LOB.GETLENGTH(src_lob) );
  DBMS_LOB.CLOSE(src_lob);
  COMMIT;
END;
/

The table has now a new record:

SQL> select count (*) from app_doc;
  COUNT(*)
----------
         1

I have populated the table with 28 files of several types (png, pdf, bmp, etc.) and the table has 28 records. Here is the list of files loaded into the table:

oracle@vmtest:/oracle/documents/ [rdbms11203] ls -1
001.png
002.png
003.png
004.png
005.png
006.png
007.png
008.png
009.png
010.png
011.png
012.png
013.png
014.png
015.png
016.png
017.png
018.png
100021667-logo-dbi-services-sa.jpg
dbi_twitter_bigger.jpg
e18294-SecureFiles_and_Large_Objects_Developers_Guide.pdf
e25523-VLDB_and_Partitioning_Guide.pdf
file.zip
Mybmp2.bmp
Mybmp.bmp
Mydoc1.doc
mypdf.pdf
text3.txt

Here is an extract of the PL/SQL written to identify the LOB file type:

-- Create a temporary table to store known document types
CREATE GLOBAL TEMPORARY TABLE temp_doc_types (id NUMBER, type VARCHAR(5), sign VARCHAR(4)) ON COMMIT DELETE ROWS;

 

-- Populate the temporary table
INSERT INTO temp_doc_types VALUES (1,'jpeg','FFD8');
INSERT INTO temp_doc_types VALUES (2,'gif','4749');
INSERT INTO temp_doc_types VALUES (3,'png','8950');
INSERT INTO temp_doc_types VALUES (4,'bmp','424D');
INSERT INTO temp_doc_types VALUES (5,'pdf','2550');
INSERT INTO temp_doc_types VALUES (6,'doc','D0CF');
INSERT INTO temp_doc_types VALUES (7,'zip','504B');
INSERT INTO temp_doc_types VALUES (8,'rar','5261');

 

In this example, I chose to compare only the four first bytes of the LOB value. The number of bytes composing a file signature can be larger, but to simplify the example, I only used file signatures that have four bytes.

The number of files of each extension is returned by this statement:

SQL> select decode(type,NULL,'Unknown document type',type) as "TYPE OF DOCUMENT",
  2  count(*) as "NUMBER OF DOCUMENTS"
  3  from temp_doc_types a,
  4  (select (dbms_lob.substr(doc_value,2,1)) as FILE_TYPE from app_doc) b
  5  where a.sign(+)=b.FILE_TYPE group by a.type;
 
TYPE OF DOCUMENT      NUMBER OF DOCUMENTS
--------------------- -------------------
Unknown document type                   1
doc                                     1
bmp                                     2
png                                    18
pdf                                     3
jpeg                                    2
zip                                     1

 

The statement gets the first four bytes of each LOB and compares them to the temporary table, matching the LOB to a type.

  • The decode function is used to display ‘Unknown document type’ when NULL is returned and no extension matches between the LOB and the temporary table. This is the case for text files which have no file signature.
  • The DBMS_LOB.SUBSTR function is used to retrieve only the first four bytes from the LOB value. Parameters are the source LOB value, the amount of bytes to read (4), and the starting position (1).
  • An outer join is used to count LOBs not matching the temporary table. This will correspond to unidentified files types.

To finish, I drop the temporary table before exiting the job:

-- The temporary table is dropped before exit
DROP TABLE temp_doc_types;
exit;

And just for the eyes, this is the result in a HTML fashion, generated directly from sqlplus using the MARKUP option and an css style sheet. I let you read the Oracle documentation in order to know how to use the MARKUP option ;-)

sql_report

 

Leave a Reply


− five = 3

Michael Schwalm
Michael Schwalm

Consultant