Morgan Patou

Understand the Lifecycle of Alfresco Nodes

I guess you all already know what a lifecycle is. We are born, we live and we die… In fact, it’s exactly the same for Alfresco Nodes! Well, at least from an end user point of view. But what is really going on behind that? This is what I will try to explain in this post.

First of all, what is an Alfresco Node? For most people, a Node is just a document stored in Alfresco but in reality it’s much more than that: everything in Alfresco is a Node! A Node has a type that defines its properties and it also has some associations with other Nodes. This is a very short and simplified description but we don’t need to understand exactly what a Node is to understand the lifecycle process. So as said above, Alfresco Nodes have their own lifecycle but it’s a little bit more complicated than just three simple steps.

Please note that in this post, I will use $ALF_HOME as a reference to the location where Alfresco has been installed (e.g. /opt/alfresco-4.2.c) and $ALF_DATA as a reference to the alf_data location. By default the alf_data folder is $ALF_HOME/alf_data/.

I. Creation

In this post I will use a document as an Alfresco Node to make the process easier to understand. So this lifecycle starts with the creation of a new document named “Test_Lifecycle.docx” with the following creation date: February the 1st, 2015 at 16:45.

When a document is created in Alfresco, three things are done:

  • File System: the content of this file is stored in the Alfresco “Content Store”. The Content Store is by default under $ALF_DATA/contentstore. The file is put in a subfolder that depends on the creation time (year/month/day/hour/minute) and an ID is given to this file. For our file, it would be: $ALF_DATA/contentstore/2015/2/1/16/45/408a6980-237e-4315-88cd-6955053787c3.bin.
  • Database: the metadata of this file is stored in the Alfresco Database. In fact in the DB, this document is mainly referenced using its NodeRef or NodeID. This NodeRef is something we can see on the Alfresco Web Interface from the document details page (web preview of a document): http://HOSTNAME:PORT/share/page/document-details?nodeRef=workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6. Please be aware that the NodeRef contains a UUID but it’s not the same as the ID on the Content Store side… Moreover, there is a property in the DB that links the NodeRef to the Content Store’s ID for Alfresco to be able to retrieve the content of a file.
  • Index: an index is created for this file in the Search engine (can be Lucene for older versions of Alfresco or Solr for newer versions). This index is in the “workspace” store.
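
To make the folder layout of the Content Store more concrete, here is a small sketch that rebuilds the relative folder from a creation time. The function name content_store_dir is hypothetical (it is not part of Alfresco), and the layout without zero padding is simply inferred from the example path above:

```shell
#!/bin/bash

# Hypothetical helper: build the content store sub-folder for a creation time.
# Arguments: year month day hour minute (zero padded or not).
content_store_dir() {
  local year=$1 month=$2 day=$3 hour=$4 minute=$5
  # 10#... forces base 10 so that values like "08" are not read as octal
  printf '%d/%d/%d/%d/%d\n' \
    "$((10#$year))" "$((10#$month))" "$((10#$day))" "$((10#$hour))" "$((10#$minute))"
}

# Creation time of Test_Lifecycle.docx: February the 1st, 2015 at 16:45
content_store_dir 2015 02 01 16 45
```

The .bin file name itself cannot be computed this way, since the ID is generated by Alfresco when the content is written.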

II. Update, Review, Approve, Publish, etc.

Once the document is created, its life really begins. You can update/review/approve/publish it manually or automatically with different processes. All these actions are part of the active life of a document. From an administration point of view, there isn’t that much to say here.

III. Deletion – User level

When a document isn’t needed anymore, for any reason, a user with sufficient permissions is able to delete it. For our example, let’s say that a user deleted our file “Test_Lifecycle.docx” using the Alfresco Share Web Interface on February the 20th, 2015 at 15:30 (19 days after creation). When using the Web Interface or Web Services to delete a document, the “nodeService.deleteNode” method is called. So what happened to our “three things”?

  • FS: nothing changed on the Content Store. The file content is still here.
  • DB: on the DB side, the NodeRef changed from workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 to archive://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6 (the “store_id” field changed on the “alf_node” table).
  • Index: same thing for the search index: the index is moved from the “workspace” store to the “archive” store.
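
The NodeRef change described above is only a change of the store protocol. As a small illustration (archived_noderef is a hypothetical helper written for this post, not an Alfresco command), the NodeRef of a deleted document can be derived like this:

```shell
#!/bin/bash

# Hypothetical helper: show the NodeRef a document gets once it is in the trashcan.
archived_noderef() {
  local noderef=$1
  case $noderef in
    workspace://SpacesStore/*)
      # Only the protocol changes, the store identifier and the UUID stay the same
      echo "archive://SpacesStore/${noderef#workspace://SpacesStore/}"
      ;;
    *)
      echo "not a workspace://SpacesStore NodeRef: $noderef" >&2
      return 1
      ;;
  esac
}

archived_noderef "workspace://SpacesStore/09a8bd9f-0246-47a8-9701-29436c7d29a6"
```

This is handy when you know the NodeRef of a document from its old Share URL and want to look for it in the “archive” store.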

Actually, when a user deletes a document from a Web Interface of Alfresco, the document is just moved to a “global trashcan”. By default all users have access to this global trashcan in Alfresco Explorer to restore the documents they may have deleted by mistake. Of course they can’t see all documents but only the ones related to them. On Alfresco Share, the access to this global trashcan is configured in the share-config.xml file and by default on most versions, only administrators have access to it.

The only way to avoid this global trashcan is to programmatically delete the document by applying the aspect “cm:temporary” to the document and then calling “nodeService.deleteNode” on it. That way, the document is removed from the UI and isn’t put in the global trashcan.

IV. Deletion – Administration level

What I describe here as an “Administration level” is the second level of deletion that happens by default. This level is the deletion of the document from the global trashcan. As long as the document is still in the global trashcan, Administrators (or users if you are using Alfresco Explorer) can still restore it. If the document is “un-deleted”, then it will return exactly where it was before and of course the metadata & index of this document will be moved from the “archive” store back to the “workspace” store to return to an active life.

On April the 1st, 2015 at 08:05 (40 days after deletion at user level), an administrator decides to remove “Test_Lifecycle.docx” from the global trashcan. This can be done manually or programmatically. Moreover, there are also some existing add-ons that can be configured to automatically delete elements in the trashcan older than XX days. This time, the “nodeArchiveService.purgeArchivedNode” method is called (archiveService.purge in some older Alfresco versions). So what happened to our “three things” this time?

  • FS: still nothing changed on the Content Store. The file content is still here.
  • DB: on the DB side, the document is still there but some references/fields (not all) are removed. All references in the “alf_content_data” table are removed, whereas only some fields are emptied in the “alf_node” table. For Alfresco 4.0 and below, the “node_deleted” field of the “alf_node” table is changed from 0 to 1. On newer versions of Alfresco, “node_deleted” doesn’t exist anymore but the QNAME of the node (field “type_qname_id”) in the “alf_node” table is changed from 51 (“content”) to 140 (“deleted”); these numeric IDs are the ones from this example and may differ on another installation. So the Node is now deleted from the global trashcan and Alfresco knows that this node can be safely deleted, but this will not be done now… Once “node_deleted” or “type_qname_id” is set, the “orphan_time” field of the “alf_content_url” table for this document is also changed from NULL to the current Unix timestamp in milliseconds (+ GMT offset). In our case it will be orphan_time=1427875500540.
  • Index: the search index for this Node is removed.
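
Since the orphan_time value is not very readable, here is a minimal sketch to convert it back to a date. The helper name orphan_time_to_utc is hypothetical, and the snippet assumes GNU date (the “-d @seconds” syntax is not available everywhere):

```shell
#!/bin/bash

# Hypothetical helper: convert an orphan_time value (milliseconds) to a UTC date.
orphan_time_to_utc() {
  local millis=$1
  # GNU date expects seconds after the "@", so the milliseconds are dropped
  date -u -d "@$((millis / 1000))" '+%Y-%m-%d %H:%M:%S'
}

# Value from our example, prints 2015-04-01 08:05:00
orphan_time_to_utc 1427875500540
```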

As you can see, there are still some remaining elements on the File System and in the DB. That’s why there is a last step in our lifecycle…

V. Deletion – One more step.

As you saw before, the document “Test_Lifecycle.docx” is now considered an “orphaned” Node. In Alfresco, by default, all orphaned Nodes are protected for 14 days. That means that during this period, the orphaned Nodes will NOT be touched at all. Of course this value can easily be changed in the Alfresco configuration files. So what happens after 14 days? Well in fact every day at 4am (again, by default… can be changed), a scheduled job (the “contentStoreCleaner”) scans Alfresco for Nodes orphaned for more than 14 days. At 04:00 on April the 15th, 2015 our file is still a few hours short of 14 days, so it is on April the 16th, 2015 at 04:00 that the scheduled job picks it up, and here is what it does:

  • FS: the content file is moved from $ALF_DATA/contentstore/ to $ALF_DATA/contentstore.deleted/
  • DB: on the DB side, the document is still here but the line related to this document on the “alf_content_url” table (this table contains the orphan_time and reference to the FS location) is removed.
  • Index: nothing to do, the search index was already removed.
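
To check by hand whether a given orphan is old enough for the cleaner, the comparison can be sketched as below. This is only an illustration: the function past_protection is hypothetical, and it assumes that the job simply compares orphan_time with “now minus the protection period”, everything being expressed in milliseconds:

```shell
#!/bin/bash

# Hypothetical helper: is an orphan past its protection period at a given time?
# All times are in milliseconds since the epoch, like orphan_time.
past_protection() {
  local orphan_ms=$1 now_ms=$2 protect_days=$3
  local cutoff_ms=$((now_ms - protect_days * 24 * 3600 * 1000))
  # Only content orphaned before the cutoff is old enough to be cleaned
  [ "$orphan_ms" -lt "$cutoff_ms" ]
}

# Check the orphan_time of our example against the current time
now_ms=$(($(date +%s) * 1000))
if past_protection 1427875500540 "$now_ms" 14; then
  echo "past the protection period"
else
  echo "still protected"
fi
```

The third argument is the protection period in days, so the same check works if you changed the default of 14 days.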

You can avoid this step where documents are put in the “contentstore.deleted” folder by setting “system.content.eagerOrphanCleanup=true” in the alfresco-global.properties configuration file. If you do so, after 14 days, the document on the File System is not moved but deleted instead.

VI. Deletion – Still one more step…!

As said before, there are still some references to the “Test_Lifecycle.docx” document in the Alfresco Database (especially in the “alf_node” table). Another scheduled job, the nodeServiceCleanup, runs every day at 21:00 to clean everything that is related to Nodes that have been deleted (orphaned nodes) for more than 30 days. So here is the result:

  • FS: the content file is still on the $ALF_DATA/contentstore.deleted/ folder
  • DB: the DB is finally clean!
  • Index: nothing to do, the search index was already removed.

VII. Deletion – Oh you must be kidding me!?

So many steps, aren’t there! As seen before, the only remaining thing to do is to remove the content file from the $ALF_DATA/contentstore.deleted/ folder. You probably think that there is also a job that does that for you after XX days but it’s not the case: nothing in Alfresco deletes the content file from this location. As a consequence, if you want to clean the File System, you will have to do it yourself.

On Unix for example, you can simply create a crontab entry (replace $ALF_HOME with the real path, because cron does not know this variable):

50 23 * * * $ALF_HOME/scripts/cleanContentStoreDeleted.sh

Then create this script. Below is an example of content that will do the trick but you can put what you want/need in this script to remove the content as per your requirements, create a backup, etc.:

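#!/bin/bash

# Example path, adapt it to your installation (cron does not know $ALF_DATA)
ALF_DATA=/opt/alfresco-4.2.c/alf_data
CS_DELETED="$ALF_DATA/contentstore.deleted"

# Stop right away if the folder does not exist
[ -d "$CS_DELETED" ] || exit 1

# Remove all files from contentstore.deleted older than 30 days
find "$CS_DELETED" -type f -mtime +30 -delete
# Remove all empty folders older than 60 days (never contentstore.deleted itself)
find "$CS_DELETED" -mindepth 1 -type d -mtime +60 -empty -delete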

Please be aware that you should be sure that the folder “$ALF_DATA/contentstore.deleted” exists… And when I say that, I mean YOU MUST ABSOLUTELY BE SURE that it exists. Please also never remove anything under the “contentstore” folder and never remove the “contentstore.deleted” folder itself!
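
Before letting cron delete anything, it can be reassuring to do a dry run first. The sketch below only lists the files that are older than a given number of days and refuses to run if the folder is missing. The function name list_old_content is hypothetical, the demo uses a temporary folder instead of a real contentstore.deleted, and “touch -d” assumes GNU coreutils:

```shell
#!/bin/bash

# Hypothetical helper: list (without deleting) the files older than N days.
list_old_content() {
  local dir=$1 days=$2
  # Refuse to run on a folder that does not exist
  [ -d "$dir" ] || { echo "missing folder: $dir" >&2; return 1; }
  find "$dir" -type f -mtime +"$days" -print
}

# Demo on a temporary folder: only old.bin is listed
demo_dir=$(mktemp -d)
touch -d '40 days ago' "$demo_dir/old.bin"
touch "$demo_dir/recent.bin"
list_old_content "$demo_dir" 30
rm -r "$demo_dir"
```

On a real system you would call it with the full path of your contentstore.deleted folder and compare the output with what you expect to lose.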

You may wonder why the DB isn’t cleaned automatically when the trashcan is cleaned and why the file content is also kept 14 days… Well I can assure you that there are several reasons and the main one is backup/restore performance. I will not explain it in detail but basically, as the file content isn’t touched for 14 days by default, you can restore your database up to 14 days in the past and it will still be consistent with the File System, without having to restore the FS! Of course if you just do that you will lose the documents uploaded/changed in the last 14 days because your database wasn’t aware of these files 14 days ago. But you can just back up/restore the content files created in the last 14 days with an incremental backup.

E.g.: today (05-May-2015), I want to back up the FS (only the last 14 days). I will have to back up all folders inside $ALF_DATA/contentstore/2015/5 AND all folders inside $ALF_DATA/contentstore/2015/4 with a folder name greater than or equal to 30 (days in April) + 5 (days in May) – 14 = 21.
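
The same calculation can be left to the shell. This is only a sketch: recent_content_dirs is a hypothetical helper, it assumes GNU date, and it prints the day folders relative to $ALF_DATA/contentstore so that you can feed them to the backup tool of your choice:

```shell
#!/bin/bash

# Hypothetical helper: list the day folders (relative to the content store)
# from a reference date back to N days earlier. Requires GNU date.
recent_content_dirs() {
  local ref_date=$1 days=$2 ref_epoch i
  # Work in UTC with plain seconds so that DST changes cannot shift a day
  ref_epoch=$(date -u -d "$ref_date" +%s)
  for ((i = 0; i <= days; i++)); do
    # %-m and %-d drop the zero padding, matching folders such as 2015/4/21
    date -u -d "@$((ref_epoch - i * 86400))" +%Y/%-m/%-d
  done
}

# Folders to back up on 05-May-2015: from 2015/5/5 down to 2015/4/21
recent_content_dirs 2015-05-05 14
```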

I hope this post was clear enough because it’s true that it can be hard to understand everything regarding the lifecycle of Alfresco Nodes and to deal with it properly.

15 Comments

  • Anand Mahajan says:

    Hi Morgan,

    Can u pls share the name of the property and XML for the 4.0 community version of Alfresco, by which I can schedule the deletion of docs from the Trash.

  • Alexandre says:

    Hi Morgan,
    How schedule the deletion of documents from the trashcan (admin task) in 5.1?

  • Titto says:

    Hi Morgan,
    And during unit test?
    For example, when testing a REPO AMP project, it doesn’t have an index (not supported in Alfresco 5), and if I perform a nodeService.deleteNode(nodeRef) and then try to get the node by the nodeRef, it is always available.

    How to delete a node definitively for unit test purpose?

  • vikash says:

    awesome blog…

  • Chaitanya says:

    Hi Morgan, I have cleared my trashcan content with admin login. Can I directly delete the contents of contentstore.deleted manually on the same day (or) I need to wait until DB is cleared

    • Morgan Patou says:

      Hi Chaitanya,

      If you take a look at the step IV of my blog, you will see that as soon as you delete something from the Trashcan, the only thing that is done is that some DB properties are changed and some others are removed and the Index is also removed. But there is no change on the File System.

      So can you remove the content of the contentstore.deleted folder? => Yes, it is always safe to do that
      Will the document that you just deleted from the Trashcan be deleted by doing that? => No the document removed from the Trashcan will remain on the “contentstore” for 14 days (default) and therefore it will not be taken in your “rm” command… After this period, the document will be moved to the contentstore.deleted.

      Regards,
      Morgan

      • Chaitanya says:

        Thanks Morgan,

        With the content of your blog I understood the life cycle of nodes in alfresco

      • Chaitanya says:

        Hi Morgan,

        We can restore/backup the content when the content is in orphan state(14 days before moving content to contentstore.deleted). Can we still have our contents backup from the contentstore.deleted ?? And What is the use of having contents in the contentstore.deleted folder ?? Please Clarify

        • Morgan Patou says:

          Hi Chaitanya,

          When the node becomes an orphan (still in contentstore but deleted from trashcan), you will still have the relation between the DB and the FileSystem location. Therefore you can find which nodes are in this state. However some important information will already be removed from the DB. Therefore if you want to restore a document in this state, you can either update the DB to reinsert the missing properties, but that’s a little bit tricky… Or, a simpler solution, just take the file from the FileSystem of your Alfresco and then reimport the document back into Alfresco. So either a restore of a part of the DB or just an import of the document as a new one… Each solution has pros and cons.
          Regards,
          Morgan

          • Chaitanya says:

            Hi Morgan,

            I didn’t understand the line “just take the file from the FileSystem of your Alfresco and then reimport the document back into Alfresco”. Up to my knowledge, content (any format) uploaded into Alfresco will be saved in binary (.bin format) in the File System of Alfresco. How can we convert that (.bin) back to normal content of the respective format (.doc, .ppt, .mp4)? And what is the exact use of having the contentstore.deleted folder in Alfresco?

          • Morgan Patou says:

            Hi Chaitanya,

            The files stored in the contentstore or contentstore.deleted aren’t converted to a special binary format. Yes the extension of the file is .bin BUT the content of this .bin file is actually the original file, unchanged. So if you import a PDF file into Alfresco, it will create a .bin file. If you copy this file to /tmp for example and then rename it to .pdf, then you will be able to open it with a PDF reader.
            => This is only true if you aren’t using the ContentStore Encryption which is an Enterprise feature, in which case the files will be encrypted…

            The use of the contentstore.deleted is just so that you can keep removed files as long as you want/need to… It’s not Alfresco that will decide the retention policy at your place.

            Regards,
            Morgan

  • Chaitanya says:

    Hi Morgan,

    Thanks for giving reply to my previous query.

    Will the activity feed for the deleted nodes be automatically removed from the alf_activity_feed table (after the content has been moved to contentstore.deleted)?
    If not how can I proceed to clean the unnecessary activity records in the alf_activity_feed table.
    With the limited infrastructure, when alf_activity_feed table is storing more feed records, the web scripts such as user-feed / admin-user-feed are returning errors(timeout , internal server errors) and even my-activities dashlet is also unable to load.

    Please suggest.


Morgan Patou

Senior Consultant & Technology Leader ECM