Infrastructure at your Service

Morgan Patou

Alfresco: some useful database queries

In my previous post, I talked about the Lifecycle of Alfresco Nodes. You may have noticed that I tried to insert in my explanations some elements that are specific to databases (tables, fields, aso…). These elements are quite essential to prepare a post like this one: more database oriented. I already explained what exactly are the consequences on the database side when a node is removed and I will try in this post to share some useful queries regarding these points but not only!
For this post, I used my local Alfresco Community 4.2.c installation with a PostgreSQL database. For your information, it just take 30 minutes to get this test environment ready with the Alfresco’s installer (Windows, Mac or Unix). Of course, use the Database only for your daily administration work is certainly not the best idea but in some cases, it can really be faster and easier to just run some SQL commands at the DB level…

I. Document information

So let start this post with some generic queries that can be used to retrieve some information about documents. In this part, all columns of the results will be the same because I just pick up the same fields in my queries but the filter part (the WHERE clause) changes a little bit to be able to retrieve some information from different elements.

The first command I would like to show you is how to retrieve some information about documents based on the size of the content. Here, I just uploaded the document “Test_Lifecycle.docx” with a size of 52MB. So based on that, let’s say that I want to retrieve all elements on my Alfresco installation with a content that is bigger than40MB. In the same approach, you can select all elements with a content that is smaller than or between XX and YYMB. The conversion in MB is done using the round() function. Therefore, if you want this value to be in KB instead, just remove one division by 1024 in each round() function:

All documents bigger than 40MB

SELECT n.id AS "Node ID",
n.store_id AS "Store ID",
round(u.content_size/1024/1024,2) AS "Size (MB)",
n.uuid AS "Document ID (UUID)",
n.audit_creator AS "Creator",
n.audit_created AS "Creation Date",
n.audit_modifier AS "Modifier",
n.audit_modified AS "Modification Date",
p1.string_value AS "Document Name",
u.content_url AS "Location"
FROM alf_node AS n,
alf_node_properties AS p,
alf_node_properties AS p1,
alf_namespace AS ns,
alf_qname AS q,
alf_content_data AS d,
alf_content_url AS u
WHERE n.id=p.node_id
AND ns.id=q.ns_id
AND p.qname_id=q.id
AND p.long_value=d.id
AND d.content_url_id=u.id
AND p1.node_id=n.id
AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
AND round(u.content_size/1024/1024,2)>40
ORDER BY u.content_size DESC;

I will just put it once but here is the result of this command in this case:

 Node ID | Store ID | Size (MB) |          Document ID (UUID)          | Creator |         Creation Date         | Modifier |       Modification Date       |    Document Name    |                            Location                            
---------+----------+-----------+--------------------------------------+---------+-------------------------------+----------+-------------------------------+---------------------+----------------------------------------------------------------
  131856 |        6 |     52.00 | eb267742-c018-4ba5-8ca4-75ca23c860f0 | Morgan  | 2015-04-30T12:05:50.613+02:00 | Morgan   | 2015-04-30T12:05:50.613+02:00 | Test_Lifecycle.docx | store://2015/4/30/12/5/0e111f05-7fcf-4a44-b719-b94cd04dd5ab.bin

So why did I selected these fields?!

  • Node ID: can be useful to join different tables
  • Store ID: a Store ID of 6 means that your document is in its active life. A Store ID of 5 means that this document has been deleted by a user and is now in the global trashcan
  • Size (MB): what we are searching for…
  • Document ID (UUID): the unique identifier of this document. The simplest way to preview this document is just to open the following url in any browser: http://HOSTNAME:PORT/share/page/document-details?nodeRef=workspace://SpacesStore/eb267742-c018-4ba5-8ca4-75ca23c860f0 (workspace://SpacesStore for store_id=6)
  • Creator, Modifier, Dates: well…
  • Document Name: can be useful to know the type of document without opening an URL (file extension)
  • Location: the actual location of the content’s file on the File System. The “store://” refers to $ALF_DATA/contentstore/

The second command I would like to show you is how to retrieve some information based on the actual UUID of a document. As explained above, the UUID of a document can be found in the URL of its detail’s page:

A document using its UUID

SELECT n.id AS "Node ID",
n.store_id AS "Store ID",
round(u.content_size/1024/1024,2) AS "Size (MB)",
n.uuid AS "Document ID (UUID)",
n.audit_creator AS "Creator",
n.audit_created AS "Creation Date",
n.audit_modifier AS "Modifier",
n.audit_modified AS "Modification Date",
p1.string_value AS "Document Name",
u.content_url AS "Location"
FROM alf_node AS n,
alf_node_properties AS p,
alf_node_properties AS p1,
alf_namespace AS ns,
alf_qname AS q,
alf_content_data AS d,
alf_content_url AS u
WHERE n.id=p.node_id
AND ns.id=q.ns_id
AND p.qname_id=q.id
AND p.long_value=d.id
AND d.content_url_id=u.id
AND p1.node_id=n.id
AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
AND n.uuid='eb267742-c018-4ba5-8ca4-75ca23c860f0';

Another possible command would be to find some information based on the File System location. That can be useful for example if there is a big document on the File System and you want to know the type of this document with the extension, its name or maybe some other information about the creator/modifier:

A document using its path on the File System

SELECT n.id AS "Node ID",
n.store_id AS "Store ID",
round(u.content_size/1024/1024,2) AS "Size (MB)",
n.uuid AS "Document ID (UUID)",
n.audit_creator AS "Creator",
n.audit_created AS "Creation Date",
n.audit_modifier AS "Modifier",
n.audit_modified AS "Modification Date",
p1.string_value AS "Document Name",
u.content_url AS "Location"
FROM alf_node AS n,
alf_node_properties AS p,
alf_node_properties AS p1,
alf_namespace AS ns,
alf_qname AS q,
alf_content_data AS d,
alf_content_url AS u
WHERE n.id=p.node_id
AND ns.id=q.ns_id
AND p.qname_id=q.id
AND p.long_value=d.id
AND d.content_url_id=u.id
AND p1.node_id=n.id
AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name='name')
AND u.content_url='store://2015/4/30/12/5/0e111f05-7fcf-4a44-b719-b94cd04dd5ab.bin';

 

II. Number of…

From a reporting point of view, let’s say that you need some information regarding the number of… something. In this case and if you want to use your DB directly, then there is a really simple solution (simple but is it the best?) because Alfresco provide a Database architecture that is quite simple to understand and to use to get what you need. Indeed, if you take a look at the “alf_qname” table, you will see that every element that is part of Alfresco has its QName listed here. A QName is the Qualified Name of a repository item. This can be seen as a kind of “Super-Type”:

alfresco=> SELECT * FROM alf_qname;
 id | version | ns_id | local_name
----+---------+-------+------------
  1 |       0 |     1 | store_root
  2 |       0 |     1 | aspect_root
  3 |       0 |     1 | container
  4 |       0 |     1 | children
  5 |       0 |     2 | user
...
 24 |       0 |     6 | folder
...
 51 |       0 |     6 | content
...
133 |       0 |     6 | thumbnail
134 |       0 |    13 | rendition
...

As you can see above, if you are searching for something that has a content, it can be done quite easily using the id or the local_name that correspond to that. So based on this table, here are some queries that can be useful:

Retrieve the number of users in the Repository

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q
WHERE n.type_qname_id=q.id
AND q.local_name='user';

Retrieve the number of elements with a content in the Repository (include system’s documents)

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q
WHERE n.type_qname_id=q.id
AND q.local_name='content';

Retrieve the number of thumbnails in the Repository

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q
WHERE n.type_qname_id=q.id
AND q.local_name='thumbnail';

Retrieve the number of renditions in the Repository

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q
WHERE n.type_qname_id=q.id
AND q.local_name='rendition';

Of course you can do that for all QNames but you can also be more precise! So based on the query to retrieve the number of elements with a content, if you only want the number of documents of a specific type, then you can simply complete your query:

Retrieve the number of XML documents in the Repository

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q,
alf_node_properties AS p
WHERE n.type_qname_id=q.id
AND p.node_id=n.id
AND p.qname_id IN
(SELECT id
FROM alf_qname
WHERE local_name='name')
AND q.local_name='content'
AND p.string_value LIKE '%.xml';

Retrieve the number of PDF documents in the Repository

SELECT count(*)
FROM alf_node AS n,
alf_qname AS q,
alf_node_properties AS p
WHERE n.type_qname_id=q.id
AND p.node_id=n.id
AND p.qname_id IN
(SELECT id
FROM alf_qname
WHERE local_name='name')
AND q.local_name='content'
AND p.string_value LIKE '%.pdf';

As the creation date, creator, modification date and modifier information are also stored on the “alf_node” table, you can also very easily filter your query based on the creation/update date of an Alfresco Node. That’s pretty cool, right?! ;)

III. Lifecycle specific

To complete the relation between this blog post and the previous one, I wanted to share some queries that can be used to identify the current state of a document. As explained in my previous post, a document that is not yet deleted will be in the store named “workspace://SpacesStore”. A document that has been deleted by a user will be in the sotre named “archive://SpacesStore” and when this document is removed from the global trashcan, the orphan_time is set to the current timestamp. With all these information and with the “alf_node” and “alf_content_url” tables we can easily build our own queries to find what is needed.

alfresco=> SELECT * FROM alf_store;
 id | version | protocol  |       identifier        | root_node_id
----+---------+-----------+-------------------------+--------------
  1 |       1 | user      | alfrescoUserStore       |            1
  2 |       1 | system    | system                  |            5
  3 |       1 | workspace | lightWeightVersionStore |            9
  4 |       1 | workspace | version2Store           |           10
  5 |       1 | archive   | SpacesStore             |           11
  6 |       1 | workspace | SpacesStore             |           12
(6 rows)

So let’s find all documents that have been created and aren’t deleted yet:

All documents created in their active life

SELECT *
FROM alf_node
WHERE store_id=6
AND type_qname_id=51;

The next step on the lifecycle is when the documents have been deleted by a user but aren’t deleted from the global trashcan (orphan_time is still NULL):

All documents created that are in the global trashcan (deleted by users)

SELECT *
FROM alf_node
WHERE store_id=5
AND type_qname_id=51;

Finally, when the documents are removed from the global trashcan, some references/fields are removed, the QName of these documents change from “content” (51) to “deleted” (140) on the “alf_node” table and the orphan_time is set to the current timestamp on the “alf_content_url” table:

All elements that have been removed from the global trashcan and that are now orphaned

SELECT *
FROM alf_content_url
WHERE orphan_time IS NOT NULL;

I hope you enjoyed this blog post because it was quite hard for me to write something about database queries without giving up my soul to the DB world! See you soon ;).

 

12 Comments

  • myriam says:

    Thank you very much!! this helped a lot

     
  • Mallesham says:

    HI Morgan,

    this is really pretty good for learners and appreciated your effort.

    Can you please help me out to understand how the version nodes will have a link/reference with workspace/spaceStore noderefs from DB side.

    All nodes i can see in alf_node with different store_id so what is the common property for store-id 4 & 6 to get all the current nodes and version nodes.

    Just want to see all the current and version nodes relationship in DB side. if possible please provide me the sql

    Thanks in advance!!!!!

    Regards,
    Mallesham

     
  • Wim says:

    Thanks lots for these examples.
    We have a multi-tenanted server, and in order to get the sizes per tenant I combined a few of your sample queries to break down content count & size by created user. Then I could split the usernames to find the tenant and use Excel to report on usage by tenant.
    Not the cleanest code but it works!

    See below:
    SELECT max(audit_creator) creator, sum(s.bin_size) size, count(*) cnt
    FROM alf_node AS n,
    alf_qname AS q,
    (SELECT n.id AS “id”,
    n.store_id AS “Store ID”,
    round(u.content_size/1024/1024,2) AS “bin_size”,
    n.uuid AS uuid,
    n.audit_creator AS “Creator”,
    n.audit_created AS “Creation Date”,
    n.audit_modifier AS “Modifier”,
    n.audit_modified AS “Modification Date”,
    p1.string_value AS “Document Name”,
    u.content_url AS “Location”
    FROM alf_node AS n,
    alf_node_properties AS p,
    alf_node_properties AS p1,
    alf_namespace AS ns,
    alf_qname AS q,
    alf_content_data AS d,
    alf_content_url AS u
    WHERE n.id=p.node_id
    AND ns.id=q.ns_id
    AND p.qname_id=q.id
    AND p.long_value=d.id
    AND d.content_url_id=u.id
    AND p1.node_id=n.id
    AND p1.qname_id IN (SELECT id FROM alf_qname WHERE local_name=’name’)) s
    WHERE n.type_qname_id=q.id
    and n.id=s.id
    AND q.local_name=’content’
    group by n.audit_creator;

     
    • Morgan Patou says:

      Thanks for this precision Wim! I must admit that I never tried that on a multi-tenant installation. I would have expected another solution to separate the documents per tenant since Alfresco is able to separate them properly and I don’t think that Alfresco is checking which user created the document for that but that’s still a solution so thanks ;).

      Regards,
      Morgan

       
  • Very helpful information. Thank you.

     
  • Julian says:

    Hi Morgan,

    Thanks for the examples.

    Anyway, at the first one (Document Information) you’re assuming that the unique property that has a long type is the content size. I think it would be better to filter explicity by the size property, adding something like this toe the where clause:

    p.qname_id = (SELECT id FROM alf_qname WHERE local_name=’content’ and ns_id = (select id from alf_namespace where uri =”http://www.alfresco.org/model/content/1.0″) )

    My two cents.

    Kind regards,
    Julian

     
  • Ilse says:

    Hi,

    thanks for the example. Where do I have to put the sql queries?

    Regards,
    Ilse

     
  • Mozart says:

    You saved my day! Thank you!

     
  • Parth says:

    This is very useful. Thanks Morgan.

    Many times users would edit “Name” of document and remove file extension. Is there a way to capture extension of the file even if document name has been changed from say ‘Training.docx’ to ‘Training’?

     
    • Morgan Patou says:

      Hi Parth,

      It is true that on the queries above, I used the name of the file but if you check the full properties of a file, it should contain a field saying that this file is a Microsoft Word 2007 file for example. Therefore there is a field in the Database that store this information, it isn’t in the name of the file but I’m afraid this might not be exactly what you are searching for.

      Of course, you can also create your own extension which would link a new property to all documents of your Alfresco installation and that would store the extension.

      Now the problem with renaming the file inside Alfresco without extension… If new users are coming and downloading these files without extensions, then it might cause issues. I already saw corrupted files because someone renamed a file without its extension and overwritten something inside the repository. A best solution for me would be to actually hide the extension on the Share UI or making it a non-editable field, allowing just the users to rename the files without touching the extension at all… That would require a lot of customization from my point of view.

      Regards,
      Morgan

       

Leave a Reply


− five = 4

Morgan Patou
Morgan Patou

Senior Consultant