A while ago I published a Twitter thread on some Oracle recovery concepts. For those who are not following Twitter, here is the whole thread:
🔴⏬ Here I start a thread about some Oracle Database concepts. We will see how far it goes - all questions/comments welcome.
🔴⏬ A database (or DBMS - database management system) stores (for short and long term) and manipulates (from many concurrent users/devices) your #data.
🔴⏬ The set of persistent files that stores system data (metadata for user data) and user data is what we call (with Oracle) the #database
🔴⏬ The database files are internally referenced by an identifier (file_id, file#, Absolute File Number,…). The file names are not defined in the #dictionary but (only) in the main metadata file for the database, the #controlfile
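A quick way to see those identifiers and names is to query V$DATAFILE, which reads the controlfile (so it works already in MOUNT state, before the dictionary is available):

```sql
-- V$DATAFILE reads the controlfile: file numbers and names, no dictionary needed
select file#, name from v$datafile;

-- DBA_DATA_FILES is a dictionary view: it needs the database to be OPEN
select file_id, file_name from dba_data_files;
```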
🔴⏬ So, we have user #data and its #metadata (system data) in #datafiles. Where is metadata for system data? The code that contains this metadata (structures), and the functions to manipulate data, is the database #software
🔴⏬ The Oracle database software installed on the server is often referred to by its install location environment variable, the $ORACLE_HOME
🔴⏬ The processes and memory running the Oracle software for one database system is what is called an Oracle #instance
🔴⏬ I think the #instance was simply called the Oracle ‘system’ at some point, since we identify it with the Oracle System ID (ORACLE_SID) and modify it with ALTER SYSTEM
🔴⏬ Multiple instances can open the same database from different nodes, in the case of Real Application Clusters (RAC), and synchronize themselves through the #shared storage and the private network #interconnect
🔴⏬ The instance parameters (memory sizes, database to open, flags…) are stored on the server in the instance configuration file, the server-side parameter file: the #spfile
🔴⏬ So, the #spfile or #pfile is the instance metadata used to open the #controlfile, which is the database metadata, which is used to open the dictionary, which is the user data metadata used to… but at the root is the $ORACLE_SID
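This chain of metadata maps to the startup phases; a sketch from SQL*Plus:

```sql
SQL> startup nomount;       -- reads the spfile, allocates the SGA, starts the processes
SQL> show parameter control_files    -- the spfile names the controlfile(s)
SQL> alter database mount;  -- opens the controlfile: the datafile names are now known
SQL> alter database open;   -- opens the datafiles and the dictionary
```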
🔴⏬ The #SGA shared memory is identified with a key that is derived from ORACLE_SID and ORACLE_HOME. If no instance was started with the same ORACLE_SID and ORACLE_HOME (literally the same), you get “Connected to an idle instance”
🔴⏬ Of course, connecting by attaching to the SGA is possible only from the same host. This protocol is called #bequeath connection
🔴⏬ The local listener listens on a TCP/IP port for incoming connections and handles the creation of the server process and its attachment to the SGA, given only the desired #service_name
🔴⏬ But in High Availability, where multiple instances can run one service, it is the instance which tells its local listener which services it runs. That’s #dynamic registration
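Both sides of dynamic registration can be checked; a sketch (the listener named by LOCAL_LISTENER is site-specific):

```sql
SQL> show parameter local_listener   -- where the instance registers
SQL> show parameter service_names    -- what it registers
SQL> alter system register;          -- force an immediate registration
-- on the server, "lsnrctl status" then lists those services as READY
```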
🔴⏬ Of course, the connection can start a session only when authorized (CREATE SESSION privilege) so the user/password hash is verified and also the privileges. All this is stored in the dictionary. V$SESSION_CONNECT_INFO shows that as #database #authentication
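For example, you can check how your own session was authenticated:

```sql
select authentication_type, osuser
  from v$session_connect_info
 where sid = sys_context('userenv', 'sid');
```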
🔴⏬ The password file is found in $ORACLE_HOME/dbs, named orapw$ORACLE_SID, and once it is created, password changes must be done from the database so that the dictionary and the password file stay in sync
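The file itself is created with the orapwd utility, and the users stored in it are visible from the database; a sketch (the SID in the filename is an example):

```sql
-- OS command (example):  orapwd file=$ORACLE_HOME/dbs/orapwORCL
-- then, the administrative users stored in the password file:
select username, sysdba, sysoper from v$pwfile_users;
```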
🔴⏬ When connecting locally with bequeath protocol, belonging to a privileged system group may be sufficient. The password provided is not even verified in that case. That uses #OS authentication
🔴⏬ I said that data modifications are written to database files, but that would not be efficient with multiple users, because the database is #shared
🔴⏬ Locking a full portion of data to write directly to disk is done only for specific non-concurrent bulk loads (direct-path inserts like with APPEND hint). All conventional modifications are done in memory into shared #buffers
🔴⏬ Writing the modifications to disk is done asynchronously by a background process, the #dbwriter
🔴⏬ If the buffer is not already in memory, then before the logical read, it must be read from disk with a #physical read
🔴⏬ As the changes are written in memory (buffer cache), they can be lost in case of an instance or server crash. To protect against this, all changes are logged to the #log buffer
🔴⏬ When a user commits a transaction, the server must be sure that the redo which protects their changes is flushed to disk. If not, before saying ‘commit successful’, it waits on #log file sync
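The accumulated time on this wait event is visible system-wide:

```sql
select event, total_waits, time_waited_micro
  from v$system_event
 where event = 'log file sync';
```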
🔴⏬ When changes are written in memory (#buffer cache) the #redo that protects it from an instance crash must be written to persistent storage before the change itself. This is #WAL (Write Ahead Logging)
🔴⏬ This redo is written by #logwriter. It must be fast because this is where the user may have to wait for physical writes. The advantage is that the writes are sequential, with higher throughput than the #dbwriter which has to do #random writes scattered in all datafiles
🔴⏬ The #redo log stream can be long as it contains all changes. But for server/instance recovery we need only the redo for the changes that were not yet flushed to disk by #dbwriter, in the #dirty blocks
🔴⏬ After restoring the backup we need to apply the redo to roll forward the modifications that happened between the beginning of backup until the point of failure. That’s media #recovery
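With RMAN, media recovery is this restore + recover pair; a minimal sketch, assuming a usable backup exists:

```sql
RMAN> restore database;     -- bring the datafiles back from backup
RMAN> recover database;     -- roll forward: apply archived and online redo
RMAN> alter database open;  -- the rollback of uncommitted transactions follows
```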
🔴⏬ The recovery may need more than the online redo logs for the changes between the restored backup and the last checkpoint. This is why before being overwritten, the online redo logs are #archived
🔴⏬ We always want to protect against instance failure (or the whole database is inconsistent) but we can choose not to protect against media failure (and accept an outage at backup and transaction loss at restore) when the database is in #noarchivelog mode
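The mode is displayed, and changed, in MOUNT state; a sketch:

```sql
SQL> archive log list;            -- current mode, destination, sequence numbers
SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database archivelog;   -- or NOARCHIVELOG to give up media recovery
SQL> alter database open;
```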
🔴⏬ If the redo cannot be written to disk, the database cannot accept more changes as it cannot ensure the D in ACID: transaction #durability
🔴⏬ As the online redo logs are allocated and formatted at instance startup, they can always be written, even if the filesystem is full (unless the filesystem size is virtual, as with thin provisioning). But they can be overwritten only when a checkpoint has made them inactive, or we wait on “checkpoint not complete”
🔴⏬ In archivelog mode, there’s another requirement to overwrite an online redo log: it must have been archived to ensure media recovery. If not yet archived, we wait on “log file switch (archiving needed)”.
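V$LOG shows which online log groups can be reused:

```sql
select group#, sequence#, status, archived from v$log;
-- CURRENT:  the group being written
-- ACTIVE:   still needed for instance recovery (checkpoint not complete)
-- INACTIVE: can be overwritten (once ARCHIVED=YES in archivelog mode)
```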
🔴⏬ Archived logs are never overwritten. If the destination is full, the instance hangs. You need to move them, or back them up, elsewhere. The most important thing to monitor is V$RECOVERY_AREA_USAGE, so that PERCENT_SPACE_USED – PERCENT_SPACE_RECLAIMABLE never reaches 100% (stuck archiver)
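For example:

```sql
select file_type, percent_space_used, percent_space_reclaimable
  from v$recovery_area_usage;
```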
🔴⏬ How long to keep the archived logs or their backups? You need the redo from the latest backup you may want to restore: the backup retention window. When a database backup is obsolete, the previous archived logs become obsolete. RMAN knows that and you just “delete obsolete”
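A sketch with RMAN (the 7-day window is an example):

```sql
RMAN> configure retention policy to recovery window of 7 days;
RMAN> report obsolete;   -- backups and archived logs no longer needed for the window
RMAN> delete obsolete;
```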
🔴⏬ In the recovery area, files are managed by Oracle and you don’t need to “delete obsolete”. Obsolete files are automatically deleted when space is needed (“space pressure”). That’s the PERCENT_SPACE_RECLAIMABLE of V$RECOVERY_AREA_USAGE
🔴⏬ At the end of recovery, the database is recovered to the same state it was in at the point in time when the last applied redo was generated. Only some uncommitted changes are lost (those that were in the log buffer, in memory, and are now lost).
🔴⏬ If the recovery reaches the point of failure, all new changes can continue from there. If not, because we can’t or just because we do a point-in-time recovery, the chain of redo is broken and we #resetlogs to a new #incarnation.
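A point-in-time recovery sketch (the target time is an example):

```sql
RMAN> run {
        set until time "to_date('2019-06-21 12:00','yyyy-mm-dd hh24:mi')";
        restore database;
        recover database;
      }
RMAN> alter database open resetlogs;  -- breaks the redo chain: new incarnation
RMAN> list incarnation;               -- shows the incarnation history
```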
🔴⏬ At the end of recovery, the transactions that are not committed cannot continue (we lost the state of the session, probably some redo and block changes, and the connection to the user is lost) and must be undone with #rollback
🔴⏬ Oracle does not store the #rollback information directly in the redo stream like some other databases do, because #rollback is also used for other purposes: rolling back a transaction, or rebuilding a past version of a block.
🔴⏬ Actually, the #rollback segment is involved for all letters in ACID, mainly: Atomicity (if we cannot commit we must rollback) and Isolation (undo the uncommitted changes made by others)
🔴⏬ Whereas the #redo is primarily optimized for sequential access ordered by time (replay all changes in the same order), the #undo is optimized to be accessed by #transaction (@jloracle Why Undo? https://jonathanlewis.wordpress.com/2010/02/09/why-undo/ …)
🔴⏬ In summary: changes made to data and undo blocks generate the #redo for them, which is applied in memory to the buffers. This redo goes to disk mostly asynchronously, but when your changes are committed, the database guarantees that your redo has reached the disk, so that recovery is possible.
🔴⏬ The rollforward + rollback sequence is common to many databases, but some are faster than others here. PostgreSQL stores old row versions in-place, so the rollback phase is immediate. Oracle stores them in UNDO, checkpointed with the data, but has to roll back all incomplete transactions.
🔴⏬ MySQL InnoDB is similar to Oracle. SQL Server stores the undo with the redo, and may then have to read the transaction log from before the last checkpoint when transactions stay open for a long time, so the rollforward time can be unpredictable. This changed recently with Accelerated Database Recovery.
🔴⏬ For an in-depth look at Oracle recovery internals, there is an old document still available on the Internet Archive. It is from Oracle7 – 25 years ago! – but the concepts are still valid.