Infrastructure at your Service

Franck Pachot

Oracle recovery concepts

By March 29, 2020 Oracle No Comments

I’ve published a while ago a twitter thead on some Oracle recovery concepts. For those who are not following twitter, I’m putting the whole thread here:

🔴⏬ Here I start a thread about some Oracle Database concepts. We will see how far it goes - all questions/comments welcome.

🔴⏬ A database (or DBMS - database management system) stores (for short and long term) and manipulates (from many concurrent users/devices) your #data.

🔴⏬ #data is logically structured (tablespaces, schemas, tables, columns, datatypes, constraints,…). The structure is described by #metadata.

🔴⏬ Stored #data (tables and indexes) is physically structured (blocks, extents, segments, datafiles), which is also maintained by #metadata

🔴⏬ The logical #metadata and some physical #metadata are stored in the #dictionary ,also known #catalog

🔴⏬ The #dictionary is also data itself: it is the system data describing the user data. Its own metadata is fixed, hardcoded in the software for #bootstrap

🔴⏬ The set of persistent files that stores system data (metadata for user data) and user data is what we call (with Oracle) the #database

🔴⏬ The database files are internally referenced by an identifier (file_id, file#, Absolute File Number,…). The file names are not defined in the #dictionary but (only) in the main metadata file for the database, the #controlfile

🔴⏬ So, we have user #data and its #metadata (system data) in #datafiles. Where is metadata for system data? The code that contains this metadata (structures), and the functions to manipulate data, is the database #software

🔴⏬ Oracle database software installed on the server is often referred by its install location environment variable, the $ORACLE_HOME

🔴⏬ As with any software, the Oracle Database binary code is loaded by the OS to be run by multiple #process, which work in #memory

🔴⏬ The processes and memory running the Oracle software for one database system is what is called an Oracle #instance

🔴⏬ I think the #instance was simply called the Oracle ‘system’ at some time as we identify with Oracle System ID (ORACLE_SID) and modify it with ALTER SYSTEM

🔴⏬ The #instance processes open the database files when needed, to read or write data (user #data or #metadata), when started in state #OPEN

🔴⏬ Before being in #OPEN state, where all database files are opened, the instance must read the list of database files from the #controlfile, when the state is #MOUNT

🔴⏬ Multiple instances can open the same database from different nodes, in the case of Real Application Cluster (RAC) and synchronizes themselves though the #shared storage and the private network #interconnect

🔴⏬ For the instance to know which #controlfile to read, we provide its name as an instance parameter, which is read when the instance is started in its first state: #NOMOUNT

🔴⏬Those parameters (memory sizes, database to open, flags…) are stored on the server in the instance configuration file, the server-side parameter file: the #spfile

🔴⏬ The #spfile is in binary format, stored on the server, updated by the instance. When we create a new database we create the spfile from a #pfile

🔴⏬ The #pfile is a simple text file that lists the instance parameter names and value (and comments). It is also referred to as #init.ora

🔴⏬ So, the #spfile or #pfile is the instance metadata used to open the #controlfile, which is the database metadata, which is used to open the dictionary, which is the user data metadata used to… but at the root is the $ORACLE_SID

🔴⏬ $ORACLE_SID identifies which #spfile or #pfile to read when starting the instance in #nomount. It is an #environment #variable

🔴⏬ The instance will read by default, in $ORACLE_HOME/dbs, spfile$ORACLE.SID or init$ORACLE_SID.ora or init.ora when not provided with a specific #pfile

🔴⏬ That’s for the first to start the #instance. But when the instance is running we can connect to it by attaching our process to the shared memory structure, called System Global Area: #SGA

🔴⏬ The #SGA shared memory is identified with a key that is derived from ORACLE_SID and ORACLE_HOME. If no instance was started with the same ORACLE_SID and ORACLE_HOME (literally the same) you get a “connect to idle instance”

🔴⏬ Of course, connecting by attaching to the SGA is possible only from the same host. This protocol is called #bequeath connection

🔴⏬ In order to connect remotely, there’s a process running on the server which knows the $ORACLE_SID and $ORACLE_HOME. It is the local #listener

🔴⏬ The local listener listens on a TCP/IP port for incoming connection and handles the creation of process and attach to the SGA, just by being provided with the desired #service_name

🔴⏬ So, how does the local listener know which #service_name goes to which instance (and then which database)? It can be listed in the listener’s configuration, that’s #static registration

🔴⏬ But in High Availability where multiple instances can run one service, it is the instance which tells it’s local listener which service it runs. That’s #dynamic registration

🔴⏬ Of course, the connection can start a session only when authorized (CREATE SESSION privilege) so the user/password hash is verified and also the privileges. All this is stored in the dictionary. V$SESSION_CONNECT_INFO shows that as #database #authentication

🔴⏬ Database authentication can be done only when the database is opened (access to the dictionary). Not possible in #nomount or #mount. For these, the system passwords are cached in a #password file

🔴⏬ The password file is found in $ORACLE_HOME/dbs and its name is orapw$ORACLE_SID and, once created, changing of password must be done from the database to be sure it is in sync between dictionary and password file

🔴⏬ When connecting locally with bequeath protocol, belonging to a privileged system group may be sufficient. The password provided is not even verified in that case. That uses #OS authentication

🔴⏬ #OS authentication is one case of passwordless authentication. By default, the Linux users in group ‘dba’ have passwordless bequeath (i.e local) access with the highest privilege #sysdba

🔴⏬ I said that data modifications are written to database files, but that would not be efficient when having multiple users because the database is #shared

🔴⏬ Reading from #shared resources requires only a ‘share’ lock but the only way to write without corruption to a shared resource is with an #exclusive lock

🔴⏬ Locking a full portion of data to write directly to disk is done only for specific non-concurrent bulk loads (direct-path inserts like with APPEND hint). All conventional modifications are done in memory into shared #buffers

🔴⏬ Writing the modifications to disk is done asynchronously by a background process, the #dbwriter

🔴⏬ Reads are also going through the #shared buffers to be sure to see the current version. This is a #logical read

🔴⏬ If the buffer is not already in memory, then before the logical read, it must be read from disk with a #physical read

🔴⏬ Keeping many buffers in memory for a while also saves a lot of disk access which is usually latency expensive. This memory is stored in the #SGA as the #buffer cache

🔴⏬ As the changes are written in memory (buffer cache), they can be lost in case of instance or server crash. To protect for this, all changes are logged to a #log buffer

🔴⏬ The log buffer is in memory and asynchronously flushed to persistent storage (disk), to the #online redo logs (and, maybe, to some #standby redo logs in remote locations).

🔴⏬ When a user commits a transaction, the server must be sure that the redo which protects his changes is flushed to disk. If not, before saying ‘commit successful’ it waits on #log file sync

🔴⏬ When changes are written in memory (#buffer cache) the #redo that protects it from an instance crash must be written to persistent storage before the change itself. This is #WAL (Write Ahead Logging)

🔴⏬ This redo is written by #logwriter. It must be fast because this is where the user may have to wait for physical writes. The advantage is that the writes are sequential, with higher throughput than the #dbwriter which has to do #random writes scattered in all datafiles

🔴⏬ The #redo log stream can be long as it contains all changes. But for server/instance recovery we need only the redo for the changes that were not yet flushed to disk by #dbwriter, in the #dirty blocks

🔴⏬ The instance ensures that regularly all #dirty buffers are flushed, so that the previous #redo can be discarded. It is known as a #checkpoint

🔴⏬ That’s sufficient for instance recovery (redo the changes that were made only in memory and lost by the instance crash) but what if we lose or corrupt a #datafile, like #media failure?

🔴⏬ As with any #persistent data, we must take backups (copy of files) so that, in case of some loss or corruption, we can #restore in a predictable time.

🔴⏬ After restoring the backup we need to apply the redo to roll forward the modifications that happened between the beginning of backup until the point of failure. That’s media #recovery

🔴⏬ The recovery may need more than the online redo logs for the changes between the restored backup and the last checkpoint. This is why before being overwritten, the online redo logs are #archived

🔴⏬ We always want to protect for instance failure (or all the database is inconsistent) but we can choose not to protect for media failure (and accept outage at backup and transaction loss at restore) when the database is in #noarchivelog mode

🔴⏬ If the redo cannot be written to disk, the database cannot accept more changes as it cannot ensure the D in ACID: transaction #durability

🔴⏬ As the online redo logs are allocated and formated at instance startup, they can always be written even if the filesystem is full (except if size of fs. is virtual). But they can be overwritten only when checkpoint made them inactive, or we wait on “checkpoint not complete”

🔴⏬ In archive log mode, there’s another requirement to overwrite an online redo log: it must have been archived to ensure media recovery. If not yet archived, we wait on “file switch (archiving needed)”.

🔴⏬ Archived logs are never overwritten. If the destination is full, the instance hangs. You need to move or backup them elsewhere. The most important to monitor is V$RECOVERY_AREA_USAGE so that PERCENT_SPACE_USED – PERCENT_SPACE_RECLAIMABLE never goes to 100% (stuck archiver)

🔴⏬ How long to keep the archived logs or their backups? You need the redo from the latest backup you may want to restore: the backup retention window. When a database backup is obsolete, the previous archived logs become obsolete. RMAN knows that and you just “delete obsolete”

🔴⏬ In the recovery area, files are managed by Oracle and you don’t need to “delete obsolete”. Obsolete files are automatically deleted when space is needed (“space pressure”). That’s the PERCENT_SPACE_RECLAIMABLE of V$RECOVERY_AREA_USAGE

🔴⏬ at the end of recovery, the database state is recovered at the same state as it was at the point-in-time when the last applied redo was generated. Only some uncommitted changes are lost (those that were in log buffer, in memory, and now lost).

🔴⏬ If the recovery reaches the point of failure, all new changes can continue from there. If not, because we can’t or just because we do a point-in-time recovery, the chain of redo is broken and we #resetlogs to a new #incarnation.

🔴⏬ At the end of recovery, the transactions that are not committed cannot continue (we lost the state of the session, probably some redo and block changes, and the connection to the user is lost) and must be un-done with #rollback

🔴⏬ Oracle does not store directly the #rollback information in the redo stream like other databases, because #rollback is also used for another reason: rollback a transaction or re-build a past version of a block.

🔴⏬ When data is changed (in a memory buffer) the #rollback information that can be used to build the previous version of the buffer is stored as special system data: the #undo

🔴⏬ Actually, the #rollback segment is involved for all letters in ACID, mainly: Atomicity (if we cannot commit we must rollback) and Isolation (undo the uncommitted changes made by others)

🔴⏬ Whether the #redo is primarily optimized for a sequential access on time (replay all changes in the same order), the #undo is optimized to be accessed by #transaction (@jloracle Why Undo? …)

🔴⏬In summary, changes made to data and undo blocks generate the #redo for it, which is applied in memory to the buffer. This redo goes to disk asynchronously. When your changes are committed, the database guarantees that your redo reached the disk so that recovery is possible.

🔴⏬ The rollforward + rollback is common to many databases, but some are faster than others there. PostgreSQL stores old versions in-place and this rollback phase is immediate. Oracle stores it in UNDO, checkpointed with data, but has to rollback all incomplete transactions.

🔴⏬ Mysql InnoDB is similar to Oracle. SQL Server stores the undo with the redo, then may have to the transaction log from before the last checkpoint if transactions stay long Then rollforward time can be unpredictable. This changed recently with Accelerated Database Recovery.

🔴⏬For an in-depth on Oracle recovery internals, there is this old document still around on internet archives. From Oracle7 – 25 years ago! – but the concepts are still valid. 
Any questions?

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Franck Pachot
Franck Pachot

Principal Consultant / Database Evangelist
Oracle ACE Director, Oracle Database OCM 12c
AWS Database Specialty certified, AWS Data Hero
Oak Table member

RSS for this blog: feed
Twitter: @FranckPachot
LinkedIn :
Podcast en français: DBPod