Infrastructure at your Service

Documentum uses xPlore for its Full Text indexing/search processes. If you aren’t very familiar with how xPlore works, you might wonder how it is possible to index large documents, or you might be confused about some documents not being indexed (and therefore not searchable). In this blog, I will explain how xPlore can be configured to index these big documents without causing too much trouble, because by default these documents are simply not indexed, which might be an issue. Documents tend to get bigger and bigger, and therefore the default thresholds for xPlore indexing might be a little bit outdated…

In this blog, I will go through all the thresholds that can be configured on the different components and I will try to explain a little bit what each of them is about. Before starting, a very short (and definitely not exhaustive) introduction to the xPlore indexing process is required. As soon as you install an IndexAgent, it will trigger the creation of several things on the associated repository, including the registration of new events in the ‘dmi_registry‘. When working with documents, these events (‘dm_save‘, ‘dm_saveasnew‘, …) will generate new entries in the ‘dmi_queue_item‘. The IndexAgent will then read the ‘dmi_queue_item‘ and retrieve the documents that need indexing (add/update/remove from index). From there, a CPS is called to process the document (language identification, text extraction, tokenization, lemmatization, stemming, …). My point here is that there are two main sides to the indexing process: the IndexAgent and then the CPS. This is also true for the thresholds: you will need to configure them properly on both sides.

 

I. IndexAgent

 
On the IndexAgent side, there is only one configuration parameter strictly related to the size of documents, but it is arguably the most important one since it is the first barrier that will block your indexing if not configured properly.

In the file indexagent.xml (found under $JBOSS_HOME/server/DctmServer_Indexagent/deployments/IndexAgent.war/WEB-INF/classes), in the exporter section, you can find the parameter ‘contentSizeLimit‘. This parameter controls the maximum size of a document that can be sent for indexing. This is the real size of the document (‘content_size‘/’full_content_size‘); it is not the size of its text once extracted. The reason for that is simple: this limit is on the IndexAgent side and the text hasn’t been extracted yet, so the IndexAgent does not know how big the extracted text will be. If the size of the document exceeds the value defined for ‘contentSizeLimit‘, then the IndexAgent will not even try to process it: it will simply reject it, and you will see a message that the document exceeded the limit, both in the IndexAgent logs and in the ‘dmi_queue_item‘ object. Other documents of the same batch aren’t impacted: the parameter ‘contentSizeLimit‘ applies to each and every document individually. The default value for this parameter is 20 000 000 bytes (19.07 MB).
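As an illustration, the exporter section of indexagent.xml holds its settings as name/value parameter pairs; a minimal sketch of what the relevant entry could look like (the exact element layout may vary between xPlore versions, and the 100 MB value is only a sample):

```xml
<exporter>
  <parameter_list>
    <!-- Maximum document size (in bytes) sent to indexing; default 20000000. -->
    <parameter>
      <parameter_name>contentSizeLimit</parameter_name>
      <parameter_value>104857600</parameter_value> <!-- sample: raised to 100 MB -->
    </parameter>
  </parameter_list>
</exporter>
```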

If you are going to change this value, then you might need some other updates as well. Several other parameters can be tweaked if you are seeing issues while indexing large documents, all of them inside this same indexagent.xml file. For example, you might want to look at ‘content_clean_interval‘ (in milliseconds), which controls when the export of the document (the dftxml file) is removed from the staging area of the IndexAgent (the location defined by ‘local_content_area‘). If the value is too small, the CPS might try to retrieve a file to process it for indexing while the IndexAgent has already removed it. The default value for this parameter is 1 200 000 (20 minutes).
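Following the same assumed name/value layout, the cleanup interval could be adjusted like this (the doubled value is only a sample, useful when larger documents take longer to process):

```xml
<!-- Staging-area cleanup delay in milliseconds; default 1200000 (20 minutes). -->
<parameter>
  <parameter_name>content_clean_interval</parameter_name>
  <parameter_value>2400000</parameter_value> <!-- sample: 40 minutes -->
</parameter>
```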

 

II. CPS

 
On the CPS side, there are several other size-related parameters to look at. You can find these parameters (and many others) in two main locations. The first one is global to the Federation: indexserverconfig.xml (found under $XPLORE_HOME/config by default, but that location can be changed (e.g.: a shared location for a Multi-Node Full Text setup)). The second one is a CPS-specific configuration file: PrimaryDsearch_local_configuration.xml for a PrimaryDsearch or <CPS_Name>_configuration.xml for a CPS Only (found under $XPLORE_HOME/dsearch/cps/cps_daemon/).

The first parameter to look for is ‘max_text_threshold‘. This parameter controls the maximum text size of a document. This is the size of its text after extraction; it is not the real size of the document. If the text size of the document exceeds the value defined for ‘max_text_threshold‘, then the CPS will act according to the value defined for ‘cut_off_text‘. With ‘cut_off_text‘ set to true, documents that exceed ‘max_text_threshold‘ will have their first ‘max_text_threshold‘ bytes of text indexed, but the CPS will stop once it reaches the limit. In this case, the CPS log will contain something like ‘doc**** is partially processed’ and the dftxml of this document will contain the marker ‘partialIndexed‘. This means that the CPS stopped at the defined limit and therefore the index might be missing some content. With ‘cut_off_text‘ set to false (the default value), documents that exceed ‘max_text_threshold‘ will be rejected and therefore not full text indexed at all (only their metadata is indexed). Other documents of the same batch aren’t impacted: the parameter ‘max_text_threshold‘ applies to each and every document individually. The default value for this parameter is 10 485 760 bytes (10 MB) and the maximum possible value is 2 147 483 648 (2 GB).
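As a sketch (the exact XML layout of the CPS configuration file may differ between xPlore versions, and the 40 MB value is only a sample), these two parameters could look like this in <CPS_Name>_configuration.xml:

```xml
<!-- Maximum text size after extraction, in bytes; default 10485760 (10 MB). -->
<property name="max_text_threshold" value="41943040"/> <!-- sample: 40 MB -->
<!-- true = index the first max_text_threshold bytes (partialIndexed);
     false (default) = reject the document, only metadata is indexed. -->
<property name="cut_off_text" value="true"/>
```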

The second parameter to look for is ‘max_data_per_process‘. This parameter controls the maximum text size that a CPS batch should handle. The CPS indexes documents/items in batches (‘CPS-requests-batch-size‘). By default, a CPS will process up to 5 documents per batch but, if I’m not mistaken, it can be fewer if there aren’t enough documents to process. If the total text size to be processed by the CPS for the complete batch is above ‘max_data_per_process‘, then the CPS will reject the full batch and it will therefore not full text index the content of any of its documents. This becomes an issue if you increased the previous parameters but overlooked this one: you might end up with very small documents not being indexed simply because they were in a batch containing some big documents. To make sure this parameter never blocks any batch, you can set it to ‘CPS-requests-batch-size‘ * ‘max_text_threshold‘. The default value for this parameter is 31 457 280 bytes (30 MB) and the maximum possible value is 2 147 483 648 (2 GB).
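Keeping the same assumed layout, and applying the ‘CPS-requests-batch-size‘ * ‘max_text_threshold‘ rule with the sample values above (5 documents per batch, 40 MB text threshold):

```xml
<!-- Batch-wide text size limit; default 31457280 bytes (30 MB).
     Sized so a full batch of maximum-size documents always fits:
     5 * 41943040 bytes = 209715200 bytes (200 MB). -->
<property name="max_data_per_process" value="209715200"/>
```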

As for the IndexAgent, if you are going to change these values, then you might need some other updates. There are a few timeout values like ‘request_time_out‘ (default 600 seconds), ‘text_extraction_time_out‘ (between 60 and 300 – defaults to 300 seconds) or ‘linguistic_processing_time_out‘ (between 60 and 360 – defaults to 360 seconds) that are probably going to be exceeded if you are processing large documents, so you might need to tweak these values as well.
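A sketch of the corresponding timeout entries, again assuming the same property layout as the size parameters above (the values shown are the documented defaults, to be raised as needed for large documents):

```xml
<!-- All values in seconds; defaults shown, allowed ranges in the comments. -->
<property name="request_time_out" value="600"/>
<property name="text_extraction_time_out" value="300"/>       <!-- 60-300 -->
<property name="linguistic_processing_time_out" value="360"/> <!-- 60-360 -->
```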

 

III. Summary

 

| Parameter | Limit on | Short Description | Default Value | Sample Value |
|---|---|---|---|---|
| contentSizeLimit | IndexAgent (indexagent.xml) | Maximum size of the document | 20 000 000 bytes (19.07 MB) | 104 857 600 bytes (100 MB) |
| max_text_threshold | CPS (*_configuration.xml) | Maximum text size of the document’s content | 10 485 760 bytes (10 MB) | 41 943 040 bytes (40 MB) |
| max_data_per_process | CPS (*_configuration.xml) | Maximum text size of the CPS batch | 31 457 280 bytes (30 MB) | 5 * 41 943 040 bytes = 209 715 200 (200 MB) |

 
In summary, the first factor to consider is ‘contentSizeLimit‘ on the IndexAgent side. All documents with a size (the document size itself) bigger than ‘contentSizeLimit‘ won’t be submitted to full text indexing; they will simply be skipped. The second factor is then either ‘max_text_threshold‘ or ‘max_data_per_process‘ or both, depending on the values you assigned to them. They both rely on the text size after extraction, and they can both cause a document (or the whole batch) to be rejected from indexing.

Increasing the size thresholds is a somewhat complex exercise that needs careful thinking and alignment of numerous satellite parameters so that they can all work together without disrupting the performance or stability of the xPlore processes. These satellite parameters can be timeouts, cleanup, batch size, request size or even JVM size.

 


Morgan Patou

Senior Consultant & Technology Leader ECM