Infrastructure at your Service

Cesare Cervini

Two techniques for cloning a repository filestore, part I

I must confess that my initial thought for the title was “An optimal repository filestore copy”. Optimal, really ? Relatively to what ? Which variable(s) define(s) the optimality ? Speed/time to clone ? Too dependent on the installed hardware and software, and the available resources and execution constraints. Simplicity to do it ? Too simple a method can result in a very long execution time while complexity can give a faster solution but be fragile, and vice-versa. Besides, simplicity is a relative concept; a solution may look simple to someone and cause nightmares to some others. Beauty ? I like that one but no, too fuzzy too. Finally, I settled for the present title for it is neutral and up to the point. I leave it up to the reader to judge if the techniques are optimal or simple or beautiful. I only hope that they can be useful to someone.
This article has two parts. In each, I’ll give an alternative for copying a repository’s filestores from one filesystem to another. Actually, both techniques are very similar; they differ only in the way the content files’ path locations are determined. But let’s start.

A few simple alternatives

Cloning a Documentum repository is a well-known procedure nowadays. Globally, it consists in first creating a placeholder docbase for the clone and then copying the meta-data stored in the source database, e.g. through an export/import, usually while the docbase is stopped, plus the document contents, generally stored on disks. If a special storage peripheral is used, such as a Centera CAS or a NAS, there might be a fast, low-level way to clone the content files directly at the device level; check with the manufacturer.
If all we want is an identical copy of the whole contents’ filesystem, the command dd could be used, e.g. supposing that the source docbase and the clone docbase each use a dedicated filesystem for their contents, residing on the devices /dev/mapper/vol01 and /dev/mapper/vol02 respectively:

sudo dd if=/dev/mapper/vol01 bs=1024K of=/dev/mapper/vol02

The clone’s filesystem /dev/mapper/vol02 could be mounted temporarily on the source docbase’s machine for the copy and later dismounted. If this is not possible, dd can be used over the network, e.g.:

# from the clone machine;
ssh [email protected] 'dd if=/dev/mapper/vol01 bs=1024K | gzip -1 -' | zcat | dd of=/dev/mapper/vol02
 
# from the source machine as root;
dd if=/dev/mapper/vol01 bs=1024K | gzip -1 - | ssh [email protected] 'zcat | dd of=/dev/mapper/vol02'

dd, and other partition imaging utilities, perform a block by block mirror copy of the source, which is much faster than working at the file level, although it depends on the percentage of used space (if the source filesystem is almost empty, dd will spend most of its time copying unused blocks, which is useless. Here, a simple file by file copy would be more effective).
If it is not possible to work at this low level, e.g. filesystems are not dedicated to repositories’ contents or the types of the filesystems differ, then a file by file copy is required. Modern disks are quite fast, especially for deep-pocket companies using SSD, and so file copy operations should be acceptably quick. A naive command such as the one below could even be used to copy the whole repository’s content (we suppose that the clone will be on the same machine):

my_docbase_root=/data/Documentum/my_docbase
dest_path=/some/other/or/identical/path/my_docbase
cp -rp ${my_docbase_root} ${dest_path}/../.

If $dest_path differs from $my_docbase_root, don’t forget to edit the dm_filestore’s dm_locations of the clone docbase accordingly.

If confidentiality is requested and/or the copy occurs across a network, scp is recommended as it also encrypts the transferred data:

scp -rp ${my_docbase_root} [email protected]${dest_machine}:${dest_path}/../.

The venerable tar command could also be used, on-the-fly and without creating an intermediate archive file, e.g.:

( cd ${my_docbase_root}; tar cvzf - * ) | ssh [email protected]${dest_machine} "(cd ${dest_path}/; tar xvzf - )"

Even better, the command rsync could be used as it is much more versatile and efficient (and still secure too if configured to use ssh, which is the default), especially if the copy is done live several times in advance during the days preceding the kick-off of the new docbase; such copies will be performed incrementally and will execute quickly, providing an easy way to synchronize the copy with the source. Example of use:

rsync -avz --stats ${my_docbase_root}/ [email protected]{dest_machine}:${dest_path}

The trailing / in the source path means: copy the content of ${my_docbase_root} but not the directory itself, as we assume it already exists on the destination.
Alternatively, we can include the directory too:

rsync -avz --stats ${my_docbase_root} [email protected]{dest_machine}:${dest_path}/../.

If we run it from the destination machine, the command changes to:

rsync -avz --stats [email protected]_machine:${my_docbase_root}/ ${dest_path}/.

The --stats option is handy to obtain a summary of the transferred files and the resulting performance.

Still, if the docbase is large and contains millions to hundreds of millions of documents, copying them to another machine can take some time. If rsync is used, repeated executions will just copy over modified or new documents, and optionally remove the deleted ones if the --delete option is added (the archive mode requested by -a above does not by itself delete anything on the destination), but the first run will take time anyway. Logically, reasonably taking advantage of the available I/O bandwidth by having several rsync processes running at once should reduce the time to clone the filestore, shouldn’t it ? Is it possible to apply here a divide-and-conquer technique and process each part simultaneously ? It is, and here is how.

How Documentum stores the contents on disk

Besides the sheer volume of documents and possibly the limited network and disk I/O bandwidth, one reason the copy can take a long time, independently from the tools used, is the peculiar way Documentum stores its contents on disk, with all the content files exclusively at the bottom of a 6-level deep sub-tree with the following structure (assuming $my_docbase_root has the same value as above; the letter at column 1 means d for directory and f for file):

cd $my_docbase_root
d filestore_01
d    <docbase_id>, e.g. 0000c350
d       80     starts with 80 and increases by 1 up to ff, for a total of 2^7 = 128 sub-trees;
d          00     up to 16^2 = 256 directories directly below 80, from 00 to ff
d             00     again, up to 256 directories directly below, from 00 to ff, up to 64K total subdirectories at this level; let's call these innermost directories "terminal directories";
f                00[.ext]     files are here, up to 256 files, from 00 to ff per terminal directory
f                01[.ext]
f                ...
f                ff[.ext]
d             01
f                00[.ext]
f                01[.ext]
f                ...
f                ff[.ext]
d             ...
d             ff
f                00[.ext]
f                01[.ext]
f                ...
f                ff[.ext]
d          01
d          ...
d          ff
d       81
d          00
d          01
d          ...
d          ff
d       ...
d       ff
d          00
d          01
d          ...
d          ff
f                00[.ext]
f                01[.ext]
f                ...
f                ff[.ext]

The content files on disk may have an optional extension and are located exclusively at the extremities of their sub-tree; there are no files in the intermediate sub-directories. Said otherwise, the files are the leaves of a filestore directory tree.
All the nodes have a 2-digit lowercase hexadecimal name, from 00 to ff, possibly with holes in the sequences when sub-directories or deleted documents have been swept by the DMClean job. With such a layout, each filestore can store up to (2^7).(2^8).(2^8).(2^8) files, i.e. 2^31 files or a bit more than 2.1 billion files. Any number of filestores can be used for a virtually “unlimited” number of content files. However, since each content object has a 16-digit hexadecimal id of which only the last 8 hexadecimal digits are really distinctive (and directly map to the filesystem path of the content file, see below), a docbase can effectively contain “only” up to 16^8 files, i.e. 2^32 content files or slightly more than 4.2 billion files distributed among all the filestores. Too few ? There is hope, aside from spawning new docbases. The knowledge base note here explains how “galactic” r_object_id are allocated if more than 4 billion documents are present in a repository, so it should be possible to have literally gazillions of documents in a docbase. It is not clear though whether this galactic concept is implemented yet or whether it has ever been triggered once, so let us stay with our feet firmly on planet Earth for the time being.
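The two capacity figures above can be double-checked with a few lines of Python. This is only an arithmetic sketch; the variable names are mine and are not part of any Documentum API.

```python
# Capacity of one filestore: 128 top-level sub-trees (80 to ff), then
# 256 x 256 sub-directories, then up to 256 files per terminal directory.
top_level_dirs = 0xff - 0x80 + 1                       # 128 = 2^7
files_per_filestore = top_level_dirs * 256 * 256 * 256

# Capacity of a docbase: 8 distinctive hexadecimal digits in the content id.
files_per_docbase = 16 ** 8

print(f"{files_per_filestore:,} files per filestore")
print(f"{files_per_docbase:,} content files per docbase")
```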

It should be emphasized that such a particular layout in no way causes a performance penalty in accessing the documents’ contents from within the repository because their full path can easily be computed by the content server out of their dmr_content.data_ticket attribute (a signed decimal number), e.g. as shown with the one-liner:

data_ticket=-2147440384
echo $data_ticket | gawk '{printf("%x\n", $0 + 4294967296)}'
8000a900

or, more efficiently entirely in the bash shell:

printf "%x\n" $(($data_ticket + 4294967296))
8000a900

This value is now split apart by groups of 2 hex digits with a slash as separator: 80/00/a9/00
To compute the full path, the filestore’s location and the docbase id still need to be prepended to the above partial path, e.g. ${my_docbase_root}/filestore_01/0000c350/80/00/a9/00. Thus, knowing the r_object_id of a document, we can find its content file on the filesystem as shown (or, preferably, using the getpath API function) and knowing the full path of a content file makes it possible to find back the document (or documents as the same content can be shared among several documents) in the repository it belongs to. To be complete, the explanation still needs the concept of filestore and its relationship with a location but let’s stop digressing (check paragraph 3 below for a hint) and focus back to the subject at hand. We have now enough information to get us started.
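For readers who prefer Python to the shell one-liners, the same computation could be sketched as follows. The function names and the sample location path are assumptions made for the illustration; in real life, the getpath API function remains the preferred way to obtain the path, and the optional file extension is not handled here.

```python
def ticket_to_relative_path(data_ticket: int) -> str:
    """Map a signed 32-bit data_ticket to its xx/xx/xx/xx partial path."""
    # reinterpret the signed value as an unsigned 32-bit number
    # (same effect as adding 4294967296 to a negative ticket);
    hex_ticket = f"{data_ticket & 0xFFFFFFFF:08x}"
    # split into groups of 2 hex digits separated by slashes;
    return "/".join(hex_ticket[i:i + 2] for i in range(0, 8, 2))

def ticket_to_full_path(location_path: str, docbase_id: int, data_ticket: int) -> str:
    """Prepend the filestore's location (hypothetical value) and the hex docbase id."""
    return f"{location_path}/{docbase_id:08x}/{ticket_to_relative_path(data_ticket)}"

print(ticket_to_relative_path(-2147440384))
print(ticket_to_full_path("/data/Documentum/my_docbase/filestore_01", 0xc350, -2147440384))
```

The first call prints the partial path 80/00/a9/00 computed above, the second one the same path prefixed by the location and the docbase id.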

As there are no files in the intermediate levels, it is necessary to walk the entire tree to reach them and start their processing, which is very time-consuming. Depending on your hardware, a ‘ls -1R’ command can take hours to complete on a set of fairly dense sub-trees. A contrario, this is an advantage for processing through rsync because rsync is able to create all the necessary sub-path levels (aka “implied directories” in rsync lingo) if the -R|--relative option is provided, as if a “mkdir -p” were issued; thus, in order to optimally copy an entire filestore, it would be enough to rsync only the terminal directories, once identified, and the whole sub-tree would be recreated implicitly. In the illustration above, the rsync commands for those paths are:

cd ${my_docbase_root}
rsync -avzR --stats filestore_01/80/00/00 [email protected]{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/80/00/01 [email protected]{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/80/00/ff [email protected]{dest_machine}:${dest_path}
rsync -avzR --stats filestore_01/ff/ff/00 [email protected]{dest_machine}:${dest_path}

In rsync ≥ v2.6.7, it is even possible to restrict the part within the source full path that should be copied remotely, so no preliminary cd is necessary, e.g.:

rsync -avzR --stats ${my_docbase_root}/./filestore_01/80/00/00 [email protected]{dest_machine}:${dest_path}

Note the /./ path component, it marks the start of the relative path to reproduce remotely. This command will create the directory ${dest_path}/filestore_01/80/00/00 on the remote host and copy its content there.
Path specification can be quite complicated, so use the -n|--dry-run and -v|--verbose (or even -vv for more details) options to have a peek at rsync’s actions before they are applied.

With the -R option, we get to transfer only the terminal sub-directories AND their original relative paths, efficiency and convenience !
We potentially replace millions of file by file copy commands with only up to 128 * 64K directory copy commands per filestore, which is much more concise and efficient.

However, if there are N content files to transfer, at least ⌈N/256⌉ such rsync commands will be necessary, e.g. a minimum of about 3’900 commands for 1 million files, subject of course to their distribution in the sub-tree (some less dense terminal sub-directories can contain fewer than 256 files so more directories and hence more commands are required). It is not documented how the content server distributes them over the sub-directories and there is no balancing job that relocates the content files in order to reduce the number of terminal directories by increasing the density of the remaining ones and removing the emptied ones. All this is quite sensible because, while it may matter to us, it is a non-issue to the repository.
Nevertheless, on the bright side, since a tree is by definition acyclic, rsync transfers won’t overlap and can therefore be parallelized without synchronization issues, even when the creation of the same intermediate “implied directories” is requested simultaneously by 2 or more rsync commands (if rsync performs an operation equivalent to “mkdir -p”, possible errors due to race conditions during concurrent execution can simply be ignored since the operation is idempotent).
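As a quick check of that lower bound, here is a tiny Python sketch; the function and constant names are illustrative only.

```python
import math

FILES_PER_TERMINAL_DIR = 256

def min_rsync_commands(nb_files: int) -> int:
    """Lower bound on the number of terminal directories, hence of rsync commands;
    it is reached only if every terminal directory is completely full."""
    return math.ceil(nb_files / FILES_PER_TERMINAL_DIR)

print(min_rsync_commands(1_000_000))
```

For 1 million files, this gives 3907 commands, i.e. the rounded figure quoted above.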

Empty terminal paths can be skipped without fear because they are not referenced in the docbase (only content files are) and hence their absence from the copy cannot introduce inconsistencies.
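To make the idea more concrete, here is a minimal Python sketch that walks a filestore, keeps only the directories that actually contain files and prints one rsync -R command for each of them. It is not the script announced for this article, just an illustration under simple assumptions: the function names are mine, the destination string is a placeholder and no attempt is made at speeding up the walk itself.

```python
import os

def terminal_directories(filestore_root):
    """Yield, relative to filestore_root, the directories that directly contain files;
    empty directories are silently skipped."""
    for dirpath, _dirnames, filenames in os.walk(filestore_root):
        if filenames:
            yield os.path.relpath(dirpath, filestore_root)

def rsync_commands(filestore_root, destination):
    """Build one 'rsync -avzR' command per non-empty terminal directory."""
    parent, name = os.path.split(filestore_root.rstrip(os.sep))
    for rel in sorted(terminal_directories(filestore_root)):
        # the /./ component marks where the relative path to reproduce begins;
        yield f"rsync -avzR --stats {parent}/./{name}/{rel} {destination}"

if __name__ == "__main__":
    # hypothetical source and destination, adapt to your environment;
    for cmd in rsync_commands("/data/Documentum/my_docbase/filestore_01", "some_host:/some/dest/path"):
        print(cmd)
```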

Of course, in the example above, those hypothetical 3900 rsync commands won’t be launched at once but in groups of some convenient value depending on the load that the machines can endure or the application’s response time degradation if the commands are executed during the normal work hours. Since we are dealing with files and possibly the network, the biggest bottleneck will be the I/Os and care should be exercised not to saturate them. When dedicated hardware such as high-speed networked NAS with SSD drives are used, this is less of a problem and more parallel rsync instances can be started, but such expensive resources are often shared across a company so that someone might still be impacted at one point. I for one remember one night as I was running one data-intensive DQL query in ten different repositories at once and users were suddenly complaining that they couldn’t work their Word documents any more because response time fell down to a crawl. How was that possible ? What was the relationship between DQL queries in several repositories and a desktop program ? Well, the repositories used Oracle databases whose datafiles were stored on a NAS also used as a host for networked drives mounted on desktop machines. Evidently, that configuration was somewhat sloppy but one never knows how things are configured at a large company, so be prepared for the worst and set up the transfer so that they can be easily suspended, interrupted and resumed. The good thing with rsync in archive mode is that the transfers can be resumed where they left off at a minimal cost just by relaunching the same commands with no need to compute a restarting point.
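One possible way to throttle the transfers is sketched below in Python: the commands are read from a list and at most max_parallel of them run at any time. Note that this is a sliding window rather than fixed groups separated by waits; the function name and the default value of 10 are assumptions of mine. Since relaunching an rsync command is cheap, the failed ones can simply be collected and submitted again.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_commands(commands, max_parallel=10):
    """Run the given command lines, at most max_parallel at a time.
    Returns a list of (command, return code) pairs in the input order."""
    def run_one(command):
        # each thread just waits for its own child process to terminate;
        completed = subprocess.run(shlex.split(command))
        return command, completed.returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))

def failed(results):
    """Return the commands that ended with a non-zero status, to be relaunched."""
    return [command for command, rc in results if rc != 0]
```

Typical use would be something like failed(run_commands(list_of_rsync_commands, max_parallel=5)), repeated until the returned list is empty.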

It goes without saying that setting up public key authentication or ssh connection sharing is mandatory when rsync-ing to a remote machine in order to suppress the thousands of authentication requests that will pop up during the batch execution.

But up to 128 * 64K rsync commands per filestore : isn’t that a bit too much ?

rsync performs mostly I/O operations, reading and writing the filesystems, possibly across a network. The time spent in those slow operations (mostly waiting for them to complete, actually) by far outweighs the time spent launching the rsync processes and executing user or system code, especially if the terminal sub-directories are densely filled with large files. Moreover, if the DMClean job has been run ahead of time, this 128 * 64K figure is a maximum; it is only reached if none of the terminal sub-directories is empty.
Still, rsync has some cleverness of its own in processing the files to copy so why not let it do its job ? Is it possible to reduce the number of commands ? Of course, by just stopping one level before the terminal sub-directories, at their parents’ level. From there, up to 128 * 256 rsync commands are necessary, down from 128 * 64K commands. rsync would then itself explore the terminal directories below each parent (up to 256 per command, 64K per top-level sub-tree), hopefully more efficiently than when explicitly told so. For sparse sub-directories or small docbases, this could be more efficient. If so, what would be the cut-off point ? This is a complex question depending on so many factors that there is no other way to answer it than to experiment with several situations. A few informal tests show that copying terminal sub-directories with up to 64K rsync commands is about 40% faster than copying their parent sub-directories. If optimality is defined as speed, then the “64k” variant is the best; if it is defined as compactness, then the “256” variant is the best. One explanation for this could be that the finer and simpler the tasks to perform, the quicker they terminate and free up processes to be reused. Or maybe rsync is not so good at exploring that many sub-directories on its own and needs some help. The scripts in this article allow experimenting with the “256” variant.

To summarize up to this point, we will address the cost of navigating the special Documentum sub-tree layout by walking the location sub-trees down to the last directory level (or up to 2 levels above it if so requested) and generating efficient rsync commands that can easily be parallelized. But before we start, how about asking the repository about its contents ? As it keeps track of it, wouldn’t this alternative be much easier and faster than navigating complex directory sub-trees ? Let’s see.

Alternate solution: just ask the repository !

Since a repository obviously knows where its content files are stored on disks, it makes sense to get this information directly from it. In order to be sure to include all the possible renditions as well, we should query dmr_content instead of dm_document(all) (note that the DQL function MFILE_URL() returns those too, so that a “select MFILE_URL('') from dm_document(all)” could also be used here). Also, unless the DMClean job is run beforehand, dmr_content includes orphan contents as well, so this point must be settled ahead of time. Anyway, by querying dmr_content we are sure not to omit any content, orphans or not.
The short python/bash-like pseudo-code shows how we could do it:

for each filestore in the repository:
(
   for each content in the filestore:
      get its path, discard its filename;
      add the path to the set of paths to transfer;
   for each path in the set of paths:
      generate an rsync -avzR --stats command;
) &

Line 4 just gets the terminal sub-directories, while line 5 ensures that they are unique in order to avoid rsync-ing the same path multiple times. We use sets here to guarantee distinct terminal path values (set elements are unique in the set).
Line 7 outputs the rsync commands for the terminal sub-directories and the destination.
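Lines 4 and 5 of the pseudo-code boil down to a set comprehension, as in the following Python sketch. The sample paths are invented for the illustration and the function name is mine.

```python
import posixpath

def terminal_dirs(content_paths):
    """Return the set of distinct directories holding the given content files;
    the file names are discarded and duplicates collapse automatically."""
    return {posixpath.dirname(path) for path in content_paths}

# made-up content file paths, as the repository could return them;
sample_paths = [
    "/data/dctm/dmtest/filestore_01/0000c350/80/00/0a/01.txt",
    "/data/dctm/dmtest/filestore_01/0000c350/80/00/0a/02.pdf",
    "/data/dctm/dmtest/filestore_01/0000c350/80/00/0b/00.txt",
]
for directory in sorted(terminal_dirs(sample_paths)):
    print(directory)
```

Three content files give only two terminal directories here, hence two rsync commands instead of three.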
Even though all filestores are processed concurrently, there could be millions of contents in each filestore and such queries could take forever. However, if we run several such queries in parallel, each working on its own partition (i.e. a non-overlapping subset of the whole such that their union is equal to the whole), we could considerably speed it up. Constraints such as “where r_object_id like '%0'”, “where r_object_id like '%1'”, “where r_object_id like '%2'”, .. “where r_object_id like '%f'” can slice up the whole set of documents into 16 more or less equally-sized subsets (since the r_object_id is essentially a sequence, its modulo 16 or 256 distribution is uniform), which can then be worked on independently and concurrently. Constraints like “where r_object_id like '%00'” .. “where r_object_id like '%ff'” can produce 256 slices, and so on.
Here is a short Python 3 program that does all this:
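The slicing predicates are easy to generate programmatically; a possible sketch in Python follows, where the function name and its parameter are assumptions of mine. One trailing hex digit gives 16 slices, two give 256, and so on.

```python
def slice_predicates(nb_hex_digits=1):
    """Return the DQL LIKE predicates that partition the contents
    according to the last nb_hex_digits hex digits of r_object_id."""
    nb_slices = 16 ** nb_hex_digits
    return [f"r_object_id like '%{i:0{nb_hex_digits}x}'" for i in range(nb_slices)]

for predicate in slice_predicates():
    print(f"select r_object_id from dmr_content where {predicate}")
```

Since every id ends with exactly one of the generated suffixes, the slices do not overlap and their union covers the whole set.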

#!/usr/bin/env python3

# 12/2018, C. Cervini, dbi-services;
 
import os
import sys
import traceback
import getopt
from datetime import datetime
import DctmAPI
import multiprocessing
import multiprocessing.pool

def Usage():
   print("""Purpose:
Connects as dmadmin/xxxx to a given repository and generates rsync commands to transfer the given filestores' contents to the given destination;
Usage:
   ./walk_docbase.py -h|--help | -r|--repository  [-f|--filestores [{,}]|all] [-d|--dest ] [-l|--level ]
Example:
   ./walk_docbase.py --docbase dmtest
will list all the filestores in docbase dmtest and exit;
   ./walk_docbase.py --docbase dmtest --filestore all --dest [email protected]:/documentum/data/cloned_dmtest
will transfer all the filestores' content to the remote destination in /documentum/data, e.g.:
   dm_location.path = /data/dctm/dmtest/filestore_01 --> /documentum/data/cloned_dmtest/filestore_01
if dest does not contain a file path, the same path as the source is used, e.g.:
   ./walk_docbase.py --docbase dmtest --filestore filestore_01 --dest [email protected]
will transfer the filestore_01 filestore's content to the remote destination into the same directory, e.g.:
   dm_location.path = /data/dctm/dmtest/filestore_01 --> /data/dctm/dmtest/filestore_01
In any case, the destination root directory, if any is given, must exist as rsync does not create it (although it creates the implied directories);
Generated statements can be dry-tested by adding the option --dry-run, e.g.
rsync -avzR --dry-run /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c [email protected]:/home/dctm/dmtest
Commented out SQL statements to update the dm_location.file_system_path for each dm_filestore are output if the destination path differs from the source's;
level is the starting sub-directory level that will be copied by rsync;
Allowed values for level are 0 (the default, the terminal directories level), -1 and -2;
level -1 means the sub-directory level above the terminal directories, and so on;
Practically, use 0 for better granularity and parallelization;
""")

def print_error(error):
   print(error)

def collect_results(list_paths):
   global stats
   for s in list_paths:
      stats.paths = stats.paths.union(s)

def explore_fs_slice(stmt):
   unique_paths = set()
   lev = level
   try:
      for item in DctmAPI.select_cor(session, stmt):
         fullpath = DctmAPI.select2dict(session, f"execute get_path for '{item['r_object_id']}'")
         last_sep = fullpath[0]['result'].rfind(os.sep)
         fullpath = fullpath[0]['result'][last_sep - 8 : last_sep]
         for lev in range(level, 0):
            last_sep = fullpath.rfind(os.sep)
            fullpath = fullpath[ : last_sep]
         unique_paths.add(fullpath)
   except Exception as e:
      print(e)
      traceback.print_exc()
   DctmAPI.show(f"for stmt {stmt}, unique_paths={unique_paths}")
   return unique_paths

# --------------------------------------------------------
# main;
if __name__ == "__main__":
   DctmAPI.logLevel = 0
 
   # parse the command-line parameters;
   # old-style for more flexibility is not needed here;
   repository = None
   s_filestores = None
   dest = None
   user = ""
   dest_host = ""
   dest_path = ""
   level = 0
   try:
      (opts, args) = getopt.getopt(sys.argv[1:], "hr:f:d:l:", ["help", "repository=", "docbase=", "filestores=", "dest=", "level="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./walk_docbase.py -h|--help | [-r|--repository ] [-f|--filestores [{,}]|all] [-d|--dest ][-l|--level ]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-r", "--repository", "--docbase"):   # --docbase is accepted as an alias, as in the Usage examples;
         repository = arg
      elif opt in ("-f", "--filestores"):
         s_filestores = arg
      elif opt in ("-d", "--dest"):
         dest = arg
         p_at = arg.rfind("@")
         p_colon = arg.rfind(":")
         if -1 != p_at:
            user = arg[ : p_at]
            if -1 != p_colon:
               dest_host = arg[p_at + 1 : p_colon]
               dest_path = arg[p_colon + 1 : ]
            else:
               dest_path = arg[p_at + 1 : ]
         elif -1 != p_colon:
            dest_host = arg[ : p_colon]
            dest_path = arg[p_colon + 1 : ]
         else:
            dest_path = arg
      elif opt in ("-l", "--level"):
         try:
            level = int(arg)
            if level < -2 or level > 0:
               # out of range: handled like a non-numeric value below;
               raise ValueError(arg)
         except ValueError:
            print("level must be an integer in the interval [-2, 0]")
            sys.exit(1)
   if None == repository:
      print("the repository is mandatory")
      Usage()
      sys.exit()
   if None == dest or "" == dest:
      if None != s_filestores:
         print("the destination is mandatory")
         Usage()
         sys.exit()
   if None == s_filestores or 'all' == s_filestores:
      # all filestores requested;
      s_filestores = "all"
      filestores = None
   else:
      filestores = s_filestores.split(",")
 
   # global parameters;
   # we use a typical idiom to create a cheap namespace;
   class gp: pass
   gp.maxWorkers = 100
   class stats: pass

   # connect to the repository;
   DctmAPI.show(f"Will connect to docbase(s): {repository} and transfer filestores [{s_filestores}] to destination {dest if dest else 'None'}")

   status = DctmAPI.dmInit()
   session = DctmAPI.connect(docbase = repository, user_name = "dmadmin", password = "dmadmin")
   if session is None:
      print("no session opened, exiting ...")
      sys.exit(1)

   # we need the docbase id in hex format;
   gp.docbase_id = "{:08x}".format(int(DctmAPI.dmAPIGet("get,c,docbaseconfig,r_docbase_id")))

   # get the requested filestores' dm_locations;
   stmt = 'select fs.r_object_id, fs.name, fs.root, l.r_object_id as "loc_id", l.file_system_path from dm_filestore fs, dm_location l where {:s}fs.root = l.object_name'.format(f"fs.name in ({str(filestores)[1:-1]}) and " if filestores else "")
   fs_dict = DctmAPI.select2dict(session, stmt)
   DctmAPI.show(fs_dict)
   if None == dest:
      print(f"filestores in repository {repository}")
      for s in fs_dict:
         print(s['name'])
      sys.exit()

   # filestores are processed sequentially but inside each filestore, the contents are queried concurrently;
   for storage_id in fs_dict:
      print(f"# rsync statements for filestore {storage_id['name']};")
      stats.paths = set()
      stmt = f"select r_object_id from dmr_content where storage_id = '{storage_id['r_object_id']}' and r_object_id like "
      a_stmts = []
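      for hex_digit in range(16):
         a_stmts.append(stmt + "'%{:x}'".format(hex_digit))   # one slice per trailing hex digit of r_object_id: '%0' .. '%f';
      gp.path_pool = multiprocessing.pool.Pool(processes = 16)
      gp.path_pool.map_async(explore_fs_slice, a_stmts, chunksize = 1, callback = collect_results, error_callback = print_error)   # chunksize = 1 so that each slice goes to its own worker process;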
      gp.path_pool.close()
      gp.path_pool.join()
      SQL_stmts = set()
      for p in stats.paths:
         last_sep = storage_id['file_system_path'].rfind(os.sep)
         if "" == dest_path:
            dp = storage_id['file_system_path'][ : last_sep] 
         else:
            dp = dest_path
         # note the dot in the source path: relative implied directories will be created from that position; 
         print(f"rsync -avzR {storage_id['file_system_path'][ : last_sep]}/.{storage_id['file_system_path'][last_sep : ]}/{str(gp.docbase_id)}/{p} {(user + '@') if user else ''}{dest_host}{':' if dest_host else ''}{dp}")
         if storage_id['file_system_path'][ : last_sep] != dp:
            # dm_location.file_system_path has changed, give the SQL statements to update them in clone's schema;
            SQL_stmts.add(f"UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '{storage_id['file_system_path'][ : last_sep]}', '{dp}') WHERE r_object_id = '{storage_id['loc_id']}';")
      # commented out SQL statements to run before starting the repository clone;
      for stmt in SQL_stmts:
         print(f"# {stmt}")

   status = DctmAPI.disconnect(session)
   if not status:
      print("error while disconnecting")

On line 10, the module DctmAPI is imported; this module was presented in a previous article (see here) but I include an updated version at the end of the present article.
Note the call to DctmAPI.select_cor() on line 51; this is a special version of DctmAPI.select2dict() where _cor stands for coroutine; actually, in python, it is called a generator but it looks very much like a coroutine from other programming languages. Its main interest is to separate navigating through the result set from consuming the returned data, for more clarity; also, since the data are consumed one row at a time, there is no need to read them all into memory at once and pass them to the caller, which is especially efficient here where we can potentially have millions of documents. DctmAPI.select2dict() is still available and used when the expected result set is very limited, as for the list of dm_locations on line 154. By the way, DctmAPI.select2dict() invokes DctmAPI.select_cor() from within a list constructor, so they share that part of the code.
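The relationship between the two functions can be illustrated with the following stand-alone Python sketch. It is not the actual DctmAPI code: the functions rows() and rows_as_list() and the fetch_next callable are hypothetical stand-ins for the module's navigation of a DQL result set.

```python
def rows(fetch_next):
    """Generator: yield the rows one at a time until fetch_next() returns None;
    only the current row is held in memory."""
    while True:
        row = fetch_next()
        if row is None:
            return
        yield row

def rows_as_list(fetch_next):
    """Eager variant for small result sets, built on top of the generator."""
    return list(rows(fetch_next))

# a fake result set standing in for a DQL collection;
fake_result_set = iter([{"r_object_id": "id_1"}, {"r_object_id": "id_2"}])
for row in rows(lambda: next(fake_result_set, None)):
    print(row["r_object_id"])
```

The consumer loop does not need to know how the rows are fetched, and the eager variant comes for free by wrapping the generator in a list constructor.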
On line 171, function map_async is used to start 16 concurrent calls to explore_fs_slice on line 47 per filestore (the r_object_id % 16 expressed in DQL as r_object_id like '%0' .. '%f'), each in its own process. That function repeatedly gets an object_id from the coroutine above and calls the administrative method get_path on it (we could query the dmr_content.data_ticket and compute the file path ourselves but would it be any faster ?); the function returns a set of unique paths for its slice of ids. map_async then waits until all the processes terminate. Their result is collected by the callback collect_results starting on line 42; its parameter, list_paths, is a list of sets received from map_async (which received the sets from the terminating concurrent invocations of explore_fs_slice and put them in a list) that are further made unique by union-ing them into a global set. Starting on line 175, this set is iterated to generate the rsync commands.
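The map_async/callback mechanics can be tried in isolation with the small Python sketch below. To keep it self-contained, it uses multiprocessing.pool.ThreadPool, which exposes the same map_async() interface as the process pool of the program, and a made-up worker function with invented sample data instead of repository queries.

```python
from multiprocessing.pool import ThreadPool

def paths_for_slice(slice_rows):
    """Stand-in for explore_fs_slice: return the distinct directories of one slice."""
    return {row.rsplit("/", 1)[0] for row in slice_rows}

all_paths = set()

def collect(list_of_sets):
    """Callback: called once with the list of the per-slice sets when all tasks are done."""
    all_paths.update(*list_of_sets)

# two invented slices with one directory in common;
slices = [
    ["80/00/0a/01.txt", "80/00/0a/02.pdf"],
    ["80/00/0b/00.txt", "80/00/0a/03.txt"],
]
pool = ThreadPool(processes=2)
pool.map_async(paths_for_slice, slices, chunksize=1, callback=collect, error_callback=print)
pool.close()
pool.join()
print(sorted(all_paths))
```

The directory common to both slices appears only once in the final set, which is the whole point of the union in the callback.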
Example of execution:

./walk-docbase.py -r dmtest -f all -d [email protected]:/home/dctm/dmtest | tee clone-content
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/02 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/06 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/04 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/09 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/03 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/07 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/05 [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0a [email protected]:/home/dctm/dmtest
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/08 [email protected]:/home/dctm/dmtest
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', '/home/dctm/dmtest') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;

This execution generated the rsync commands to copy all the dmtest repository’s filestores to the remote host dmtest’s new location /home/dctm/dmtest. As the filestores’ dm_location has changed, an SQL statement (to be taken as an example because it is for an Oracle RDBMS; the syntax may differ in another RDBMS) has been generated too to accommodate the new path. We do this in SQL because the docbase clone will still be down at this time and the change must be done at the database level.
The other default filestores in the test docbase are empty, so no rsync commands are necessary for them; normally, the placeholder docbase has already initialized their sub-trees.
Those rsync commands could be executed in parallel, say, 10 at a time, by launching them in the background with “wait” commands inserted in between, like this:

./walk-docbase.py -r dmtest -f all -d [email protected]:/home/dctm/dmtest | gawk -v nb_rsync=10 'BEGIN {
print "#!/bin/bash"
nb_dirs = 0
}
{
# comments and empty lines are copied as-is and not counted;
if (!$0 || match($0, /^#/)) {
print
next
}
print $0 " &"
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' | tee migr.sh
#!/bin/bash
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/08 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/02 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/04 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/03 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/06 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0a [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/07 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0c [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/05 [email protected]:/home/dctm/dmtest &
wait
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/09 [email protected]:/home/dctm/dmtest &
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b [email protected]:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', '/home/dctm/dmtest') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;
wait
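For readers who prefer to stay in Python, the same batching can be sketched as a small function. batch_commands and its parameters are hypothetical names, not part of walk-docbase.py; in this sketch, comment and empty lines are passed through unchanged and do not count towards the batch size.

```python
def batch_commands(lines, nb_rsync = 10):
    """
    returns the lines of a bash script that runs the given commands nb_rsync at a time;
    """
    script = ["#!/bin/bash"]
    nb_cmds = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            # comments and empty lines are copied as-is;
            script.append(line)
            continue
        script.append(line + " &")
        nb_cmds += 1
        if 0 == nb_cmds % nb_rsync:
            script.append("wait")
    # wait for the last, possibly incomplete, batch;
    if 0 != nb_cmds % nb_rsync:
        script.append("wait")
    return script

# made-up commands, for illustration only;
sample = ["# rsync statements for filestore filestore_01;"] + \
         ["rsync -avzR src/./dir%02d dest" % i for i in range(5)]
print("\n".join(batch_commands(sample, nb_rsync = 2)))
```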

Even though there may be a count difference of up to 255 files between some rsync commands, they should complete roughly at the same time so that nb_rsync commands should be running at any time. If not, i.e. if the transfers frequently wait for a few long running rsync to complete (it could happen with huge files), it may be worth using a task manager that makes sure the requested parallelism degree is respected at any one time throughout the whole execution.
Let’s now make the generated script executable and launch it:

chmod +x migr.sh
time ./migr.sh
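Should the batches stall on a few long-running rsync commands, the task-manager idea mentioned above could be sketched as follows, assuming python3. Instead of waiting for a whole batch, a pool of workers starts the next command as soon as any running one ends. run_all and read_commands are hypothetical helpers, not part of the scripts presented here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def read_commands(script_lines):
    """keeps only the commands, dropping comments, empty lines, wait and the trailing &;"""
    cmds = []
    for line in script_lines:
        line = line.strip()
        if not line or line.startswith("#") or "wait" == line:
            continue
        cmds.append(line.rstrip("&").rstrip())
    return cmds

def run_all(commands, max_parallel = 10):
    """
    runs the shell commands with at most max_parallel of them active at any time;
    returns their exit codes, in the same order as the commands;
    """
    with ThreadPoolExecutor(max_workers = max_parallel) as pool:
        return list(pool.map(lambda cmd: subprocess.call(cmd, shell = True), commands))
```

It could be invoked as run_all(read_commands(open("migr.sh")), 10); a non-zero value in the returned list points to a command that needs to be re-run.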

The parameter level lets one choose the level of the sub-directories that rsync will copy: 0 (the default) for the terminal sub-directories, -1 for the level right above them and -2 for the level above those. As discussed, the lower the level, the fewer rsync commands are necessary, e.g. up to 128 * 64K for level = 0, up to 128 * 256 for level = -1 and up to 128 for level = -2.
Example of execution:

./walk-docbase.py -r dmtest -d dd --level -1
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80/00 [email protected]:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;

And also:

./walk-docbase.py -r dmtest -d dd --level -2
# rsync statements for filestore filestore_01;
rsync -avzR /home/dmadmin/documentum/data/dmtest/./content_storage_01/0000c350/80 [email protected]:/home/dctm/dmtest &
# UPDATE dm_location SET file_system_path = REPLACE(file_system_path, '/home/dmadmin/documentum/data/dmtest', 'dd') WHERE r_object_id = '3a00c3508000013f';
# rsync statements for filestore thumbnail_store_01;
# rsync statements for filestore streaming_store_01;
# rsync statements for filestore replicate_temp_store;
# rsync statements for filestore replica_filestore_01;
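The effect of the level parameter on a single content path can be illustrated with the short sketch below. dir_at_level is a hypothetical helper, not the function used in walk-docbase.py, and the file name 1f.txt is made up.

```python
import posixpath

def dir_at_level(content_path, level = 0):
    """
    returns the directory to rsync for the given content file path;
    level 0 is the terminal sub-directory, -1 its parent and -2 its grand-parent;
    """
    if level not in (0, -1, -2):
        raise ValueError("level must be 0, -1 or -2")
    directory = posixpath.dirname(content_path)
    for _ in range(-level):
        directory = posixpath.dirname(directory)
    return directory

path = "/home/dmadmin/documentum/data/dmtest/content_storage_01/0000c350/80/00/0c/1f.txt"
for level in (0, -1, -2):
    print(level, dir_at_level(path, level))
```

Collecting these directories in a set, as the script does with the paths, is what makes the number of rsync commands shrink as the level decreases.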

So, this first solution looks quite simple even though it initially puts a little, yet tolerable, stress on the docbase. The python script connects to the repository and generates the required rsync commands (with a user-selectable compactness level) and a gawk filter prepares an executable with those statements launched in parallel, N (user-selectable) at a time.
Performance-wise, it’s not so good because all the contents must be queried for their full path, and that’s a lot of queries for a large repository.
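As for the alternative evoked earlier, i.e. computing the file path from dm_content.data_ticket instead of calling get_path, the sketch below shows the mapping commonly described for file stores: the signed 32-bit data_ticket is reinterpreted as unsigned, written as 8 hexadecimal digits and split into 4 pairs, the last one being the file name; the docbase id, also written as 8 hexadecimal digits, comes right under the store's root. This is an assumption to be checked against get_path on your own repository before relying on it; the function name, the sample ticket and the extension are hypothetical.

```python
def ticket_to_path(store_root, docbase_id, data_ticket, extension = ""):
    """
    returns the presumed content file path for the given data_ticket;
    assumption: a plain file store laid out as root/docbase_id_hex/xx/xx/xx/xx.ext;
    """
    # two's complement: reinterpret the signed 32-bit ticket as unsigned;
    hex_ticket = "%08x" % (data_ticket & 0xFFFFFFFF)
    pairs = [hex_ticket[i:i + 2] for i in range(0, 8, 2)]
    path = "/".join([store_root.rstrip("/"), "%08x" % docbase_id] + pairs)
    return path + ("." + extension if extension else "")

print(ticket_to_path("/home/dmadmin/documentum/data/dmtest/content_storage_01", 50000, -2147474649, "txt"))
```

Since the tickets can be fetched with one query per filestore, such a computation would replace one get_path call per content by some local arithmetic.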

All this being said, let’s now see if a direct and faster filesystem copy procedure, working outside of the repository, can be devised. Please follow the rest of this article in part II. The next paragraph just lists the latest version of the module DctmAPI.py.

DctmAPI.py revisited

"""
This module is a python - Documentum binding based on ctypes;
requires libdmcl40.so/libdmcl.so to be reachable through LD_LIBRARY_PATH;
C. Cervini - dbi-services.com - december 2018

The binding works as-is for both python2 and python3; no recompilation required; that's the good thing with ctypes compared to e.g. distutils/SWIG;
Under a 32-bit O/S, it must use the libdmcl40.so, whereas under a 64-bit Linux it must use the java backed one, libdmcl.so;

For compatibility with python3 (where strings are now unicode ones and no longer arrays of bytes), ctypes string parameters are always converted to bytes, either by prefixing them
with a b if literal or by invoking their encode('ascii', 'ignore') method; to get back to text from bytes, b.decode() is used; this works in python2 as well as in python3 so the source is compatible with these two versions of the language;
"""

import os
import ctypes
import sys, traceback

# use foreign C library;
# use this library in eContent server >= v6.x, 64-bit Linux;
dmlib = 'libdmcl.so'

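dm = 0

# verbosity used by show(); 0 silences it, callers may set DctmAPI.logLevel = 1 to trace the calls;
logLevel = 0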

class getOutOfHere(Exception):
   pass

def show(mesg, beg_sep = False, end_sep = False):
   "displays the message mesg if logLevel allows it"
   if logLevel > 0:
      print(("\n" if beg_sep else "") + repr(mesg), ("\n" if end_sep else ""))

def dmInit():
   """
   initializes the Documentum part;
   dmAPI* are global aliases on their respective dm.dmAPI* for some syntactic sugar;
   since they already have an implicit namespace through their dm prefix, dm.dmAPI* would be redundant so let's get rid of it;
   returns True if successful, False otherwise;
   """

   show("in dmInit()")
   global dm

   try:
      dm = ctypes.cdll.LoadLibrary(dmlib);  dm.restype = ctypes.c_char_p
      show("dm=" + str(dm) + " after loading library " + dmlib)
      dm.dmAPIInit.restype    = ctypes.c_int;
      dm.dmAPIDeInit.restype  = ctypes.c_int;
      dm.dmAPIGet.restype     = ctypes.c_char_p;      dm.dmAPIGet.argtypes  = [ctypes.c_char_p]
      dm.dmAPISet.restype     = ctypes.c_int;         dm.dmAPISet.argtypes  = [ctypes.c_char_p, ctypes.c_char_p]
      dm.dmAPIExec.restype    = ctypes.c_int;         dm.dmAPIExec.argtypes = [ctypes.c_char_p]
      status  = dm.dmAPIInit()
   except Exception as e:
      print("exception in dmInit(): ")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show("exiting dmInit()")
      return True if 0 != status else False
   
def dmAPIDeInit():
   """
   releases the memory structures in documentum's library;
   returns True if no error, False otherwise;
   """
   status = dm.dmAPIDeInit()
   return True if 0 != status else False
   
def dmAPIGet(s):
   """
   passes the string s to dmAPIGet() method;
   returns a non-empty string if OK, None otherwise;
   """
   value = dm.dmAPIGet(s.encode('ascii', 'ignore'))
   return value.decode() if value is not None else None

def dmAPISet(s, value):
   """
   passes the string s to dmAPISet() method;
   returns True if OK, False otherwise;
   """
   status = dm.dmAPISet(s.encode('ascii', 'ignore'), value.encode('ascii', 'ignore'))
   return True if 0 != status else False

def dmAPIExec(stmt):
   """
   passes the string stmt to dmAPIExec() method;
   returns True if OK, False otherwise;
   """
   status = dm.dmAPIExec(stmt.encode('ascii', 'ignore'))
   return True if 0 != status else False

def connect(docbase, user_name, password):
   """
   connects to given docbase as user_name/password;
   returns a session id if OK, None otherwise
   """
   show("in connect(), docbase = " + docbase + ", user_name = " + user_name + ", password = " + password) 
   try:
      session = dmAPIGet("connect," + docbase + "," + user_name + "," + password)
      if session is None or not session:
         raise(getOutOfHere)
      else:
         show("successful session " + session)
         show(dmAPIGet("getmessage," + session).rstrip())
   except getOutOfHere:
      print("unsuccessful connection to docbase " + docbase + " as user " + user_name)
      session = None
   except Exception as e:
      print("Exception in connect():")
      print(e)
      traceback.print_stack()
      session = None
   finally:
      show("exiting connect()")
      return session

def execute(session, dql_stmt):
   """
   execute non-SELECT DQL statements;
   returns True if OK, False otherwise;
   """
   show("in execute(), dql_stmt=" + dql_stmt)
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)
      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         raise(getOutOfHere)
      status = True
   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in execute():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show(dmAPIGet("getmessage," + session).rstrip())
      show("exiting execute()")
      return status

def select2dict(session, dql_stmt, attr_name = None):
   """
   executes the DQL SELECT statement passed in dql_stmt and returns a list of dictionaries (one per row);
   attr_name is an optional list; if not None, the names of the extracted attributes (the ones in SELECT ..., as interpreted by the server) are appended to it, otherwise no names are returned;
   """
   show("in select2dict(), dql_stmt=" + dql_stmt)
   return list(select_cor(session, dql_stmt, attr_name))

def select_cor(session, dql_stmt, attr_name = None):
   """
   executes the DQL SELECT statement passed in dql_stmt and yields one row at a time, as a dictionary;
   coroutine version;
   if the optional attr_name is not None, the names of the attributes in the result set are appended to it, otherwise no names are returned;
   the final status (True if OK, False otherwise) is not visible to a for loop; in python3, it is only available as the value of the StopIteration exception;
   """
   show("in select_cor(), dql_stmt=" + dql_stmt)

   status = False
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)

      # iterate through the result set;
      row_counter = 0
      if attr_name is None:
         attr_name = []
      width = {}
      while dmAPIExec("next," + session + "," + query_id):
         result = {}
         nb_attrs = dmAPIGet("count," + session + "," + query_id)
         if nb_attrs is None:
            show("Error retrieving the count of returned attributes: " + dmAPIGet("getmessage," + session))
            raise(getOutOfHere)
         nb_attrs = int(nb_attrs) 
         for i in range(nb_attrs):
            if 0 == row_counter:
               # get the attributes' names only once for the whole query;
               value = dmAPIGet("get," + session + "," + query_id + ",_names[" + str(i) + "]")
               if value is None:
                  show("error while getting the attribute name at position " + str(i) + ": " + dmAPIGet("getmessage," + session))
                  raise(getOutOfHere)
               attr_name.append(value)
               if value in width:
                  width[value] = max(width[attr_name[i]], len(value))
               else:
                  width[value] = len(value)

            is_repeating = dmAPIGet("repeating," + session + "," + query_id + "," + attr_name[i])
            if is_repeating is None:
               show("error while getting the arity of attribute " + attr_name[i] + ": " + dmAPIGet("getmessage," + session))
               raise(getOutOfHere)
            is_repeating = int(is_repeating)

            if 1 == is_repeating:
               # multi-valued attributes;
               result[attr_name[i]] = []
               count = dmAPIGet("values," + session + "," + query_id + "," + attr_name[i])
               if count is None:
                  show("error while getting the arity of attribute " + attr_name[i] + ": " + dmAPIGet("getmessage," + session))
                  raise(getOutOfHere)
               count = int(count)

               for j in range(count):
                  value = dmAPIGet("get," + session + "," + query_id + "," + attr_name[i] + "[" + str(j) + "]")
                  if value is None:
                     value = "null"
                  #result[row_counter] [attr_name[i]].append(value)
                  result[attr_name[i]].append(value)
            else:
               # mono-valued attributes;
               value = dmAPIGet("get," + session + "," + query_id + "," + attr_name[i])
               if value is None:
                  value = "null"
               width[attr_name[i]] = len(attr_name[i])
               result[attr_name[i]] = value
         if 0 == row_counter:
            show(attr_name)
         yield result
         row_counter += 1
      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         show("Error closing the query collection: " + dmAPIGet("getmessage," + session))
         raise(getOutOfHere)

      status = True

   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in select_cor():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      return status

def select(session, dql_stmt, attribute_names):
   """
   execute the DQL SELECT statement passed in dql_stmt and outputs the result to stdout;
   attributes_names is a list of attributes to extract from the result set;
   return True if OK, False otherwise;
   """
   show("in select(), dql_stmt=" + dql_stmt)
   try:
      query_id = dmAPIGet("query," + session + "," + dql_stmt)
      if query_id is None:
         raise(getOutOfHere)

      s = ""
      for attr in attribute_names:
         s += "[" + attr + "]\t"
      print(s)
      resp_cntr = 0
      while dmAPIExec("next," + session + "," + query_id):
         s = ""
         for attr in attribute_names:
            value = dmAPIGet("get," + session + "," + query_id + "," + attr)
            if "r_object_id" == attr and value is None:
               raise(getOutOfHere)
            s += "[" + (value if value else "None") + "]\t"
         show(str(resp_cntr) + ": " + s)
         resp_cntr += 1
      show(str(resp_cntr) + " rows iterated")

      err_flag = dmAPIExec("close," + session + "," + query_id)
      if not err_flag:
         raise(getOutOfHere)

      status = True
   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
      status = False
   except Exception as e:
      print("Exception in select():")
      print(e)
      traceback.print_stack()
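      # debugging aid; these variables may not exist yet if the query itself failed, hence the lookups through locals();
      print(locals().get("resp_cntr")); print(locals().get("attr")); print(locals().get("s")); print("[" + str(locals().get("value")) + "]")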
      status = False
   finally:
      show("exiting select()")
      return status

def walk_group(session, root_group, level, result):
   """
   recursively walk a group hierarchy with root_group as top parent;
   """

   try:
      root_group_id = dmAPIGet("retrieve," + session + ",dm_group where group_name = '" + root_group + "'")
      # a group that cannot be retrieved stops the walk of its branch, whatever the level;
      if root_group_id is None or "" == root_group_id:
         show("Cannot retrieve group [" + root_group + "]:" + dmAPIGet("getmessage," + session))
         raise(getOutOfHere)
      result[root_group] = {}

      count = dmAPIGet("values," + session + "," + root_group_id + ",groups_names")
      if count is None or "" == count:
         show("error while getting the arity of attribute groups_names: " + dmAPIGet("getmessage," + session))
         raise(getOutOfHere)
      count = int(count)

      for j in range(count):
         value = dmAPIGet("get," + session + "," + root_group_id + ",groups_names[" + str(j) + "]")
         if value is not None:
            walk_group(session, value, level + 1, result[root_group])

   except getOutOfHere:
      show(dmAPIGet("getmessage," + session).rstrip())
   except Exception as e:
      print("Exception in walk_group():")
      print(e)
      traceback.print_stack()

def disconnect(session):
   """
   closes the given session;
   returns True if no error, False otherwise;
   """
   show("in disconnect()")
   try:
      status = dmAPIExec("disconnect," + session)
   except Exception as e:
      print("Exception in disconnect():")
      print(e)
      traceback.print_stack()
      status = False
   finally:
      show("exiting disconnect()")
      return status

The main changes are the addition of a generator-based function, select_cor(), which yields a dictionary for each row of the result set, and the rewrite of select2dict() to use it.
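To show why the generator form matters, here is a small sketch of the two consumption styles. It does not call Documentum at all: fake_select_cor and fake_select2dict are hypothetical stand-ins that mimic the shape of select_cor() and select2dict(), and the object ids are made up.

```python
def fake_select_cor(rows):
    """hypothetical stand-in for select_cor(): yields one dictionary per row;"""
    for row in rows:
        yield dict(row)

def fake_select2dict(rows):
    """same pattern as select2dict(): materializes the generator into a list;"""
    return list(fake_select_cor(rows))

rows = [{"r_object_id": "0900c35080000101"}, {"r_object_id": "0900c35080000102"}]

# lazy consumption: rows are processed as they arrive and the loop can stop early;
first = next(fake_select_cor(rows))
print(first["r_object_id"])

# eager consumption: the whole result set is held in memory at once;
print(len(fake_select2dict(rows)))
```

With a large result set, such as one row per content object, the lazy form avoids building the complete list before the first path can be processed.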
