Infrastructure at your Service

Cesare Cervini

Two techniques for cloning a repository filestore, part II

This is part II of a two-part article. In part I, we introduced the Documentum repository file structure and saw a first way to transfer content files from one filesystem to another using the repository to get their full path on disk. We saw how to generate an easy to parallelize set of rsync commands to completely or partially copy a repository’s content files. Here, we’ll show another way to do that which does not imply querying the repository.
Before this, however, we need test data, i.e. sub-directories with files on disk. The good thing is that those don’t have to be linked to documents in the repository, they can just be loose files, even empty ones, scattered over Documentum-style sub-trees, and they are easy to produce.

Creating a test sub-tree

The usual test repository is an out of the box, practically empty one; therefore we need to create a dense Documentum-like sub-tree with lots of directories (up to 128 x 256 x 256 = 8128K ones per filestore, but we won’t go so far) and a few files, say 1 per subdirectory. No need to actually create related documents in a repository since we want to perform the copy of the files from outside the repository. As said above, the number of files is not really important as only their parent directory (and indirectly, its content) is being passed to rsync. A one-file terminal sub-directory would be enough for it to trigger its transfer. Also, their size does not matter either (we don’t want to measure the transfer speed, just to prepare performant rsync statements), so 0-byte files will do perfectly.
The short python program below creates a Documentum-style subdirectory.

#! /usr/bin/python3
import datetime
import time
import sys
import os
import random
import multiprocessing
import multiprocessing.pool
import functools
import traceback
import getopt

# creates a Documentum tree at a given filesystem location using multi-processes for speed;
# Usage:
#   ./make-tree.py root_point
# Example:
#   ./make-tree.py "/home/dmadmin/documentum/data/dmtest/content_storage01/0000c35c/80"

# prerequisite to prevent the error OSError: [Errno 24] Too many open files if too many processes are forked;
# ulimit -n 10000

# 12/2018, C. Cervini, dbi-services;

def Usage():
   print("""Generates a dense Documentum-style sub-tree (i.e. a 3-level subtree with files only in the last one) with 255 x 256 sub-directories and 256 x 256 x 100 empty files under a given sub-directory;
Randomization of number of subdirectories and number of files at each level is possible within given limits;
This is useful to test different tree walking algorithms.
Created directories and files under the root directory have hexadecimal names, i.e. 00, 01, ... ff.
Usage:
   make-tree.py [-h|--help] | [-d|--dry-run] | [-r|--root-dir ]
A dry-run will just print to stdout the command for the nodes to be created instead of actually creating them, which is useful to process them later by some more efficient parallelizing tool;
Example:
   make-tree.py --root-dir "dmtest/content_storage01/0000c35c/80"
If a forest is needed, just invoke this program in a shell loop, e.g.:
   for i in 80 81 82 83 84 85; do
      make-tree.py --root-dir "dmtest/content_storage01/0000c35c/${i}" &
   done
This command will start 6 O/S python processes each creating a subtree named 80 .. 85 under the root directory.
The created 6-tree forest will have the following layout:
dmtest/content_storage01/0000c35c                                                d
   /80                                                                           d
      /00                                                                        d
         /00                                                                     d
            /00                                                                  f
            /01                                                                  f
            ...                                                                ...
            /63                                                                  f
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
      /01
         /00
            /00
            /01
            ...
            /63
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
      ...
      /ff
         /00
            /00
            /01
            ...
            /63
         /01
            /00
            /01
            ...
            /63
         ...
         /ff
            /00
            /01
            ...
            /63
   /81
   ...
   /82
   ...
   /83
   ...
   /84
   ...
   /85
   ...
It will contain 6 x 256 x 256 directories and 6 x 256 x 256 x 1 files, unless randomization is requested, see the gp.* parameters;
   """)

# this function cannot be local to make_tree(), though it is only used there, because of the following error:
#     Can't pickle local object 'make_tree..make_files'
# the error prevents it to be invoked as a callback;
# actually, functions used in processes must have been entirely defined prior their usage; this implies that functions cannot be fork processes that calls themselves;
# therefore, the master function make_tree is needed that detaches the processes that execute the functions defined earlier, make_files and make_level;
# moreover, processes in a pool are daemonic and are not allowed to fork other processes;
# we must therefore use a global pool of processes allocated in the master function;
def make_files(dir):
   if gp.bRandomFiles:
      nbFiles = random.randint(gp.minFiles, gp.maxFiles)
   else:
      nbFiles = gp.maxFiles
   for nf in range(nbFiles):
      fullPath = dir + "/" + (gp.fileFormat % nf)
      if gp.bTest:
         print(f"touch {fullPath}")
      else:
         try:
            fd = os.open(fullPath, os.O_CREAT); os.close(fd)
         except:
           traceback.print_exc()
           print("ignoring ...")
   return nbFiles

# ditto;
def make_level(dir):
   """
   create a directory level under dir;
   """
   global gp

   if gp.bRandomSubDirs:
      nbSubDirs = random.randint(gp.minSubDirs, gp.maxSubDirs)
   else:
      nbSubDirs = gp.maxSubDirs
   level_dirs = []
   for nd in range(nbSubDirs):
      subDir = dir + "/" + (gp.dirFormat % nd)
      if gp.bTest:
         print("mkdir", subDir)
      else:
         try:
            os.mkdir(subDir)
         except:
            traceback.print_exc()
            print("ignoring ...")
      level_dirs.append(subDir)
   return level_dirs

# the master function;
# it creates 2 levels of directories and then empty files under the deep-most directories;
def make_tree(root):
   global gp, stats
   sub_level_dirs = []

   # list_dirs contains a list of the values returned by parallelized calls to function make_level, i.e. a list of lists;
   # get_dirs is called as a callback by map_async at job completion;
   def get_dirs(list_dirs):
      global stats
      nonlocal sub_level_dirs
      for l in list_dirs:
         stats.nb_created_dirs += len(l)
         sub_level_dirs.extend(l)

   # nb_files contains a list of the values returned by parallelized calls to function make_files, i.e. a list of numbers;
   # get_nb_files is called as a callback by map_async at job completion;
   def get_nb_files(nb_files):
      global stats
      stats.nb_dirs_at_bottom += len(nb_files)
      stats.nb_created_files += functools.reduce(lambda x, y: x + y, nb_files)
      stats.nb_created_dirs_with_files += functools.reduce(lambda x, y: x + y, [1 if x > 0 else 0 for x in nb_files])

   # callback for error reporting;
   def print_error(error):
      print(error)

   # first directory level;
   level_dirs = make_level(root)
   stats.nb_created_dirs += len(level_dirs)

   # 2nd directory level;
   sub_level_dirs = []
   gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
   gp.pool.map_async(make_level, level_dirs, len(level_dirs), callback = get_dirs, error_callback = print_error)
   gp.pool.close()
   gp.pool.join()

   # make dummy files at the bottom-most directory level;
   gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
   gp.pool.map_async(make_files, sub_level_dirs, len(sub_level_dirs), callback = get_nb_files, error_callback = print_error)
   gp.pool.close()
   gp.pool.join()

# -----------------------------------------------------------------------------------------------------------
if __name__ == "__main__":
# main;

   # global parameters;
   # we use a typical idiom to create a cheap namespace;
   class gp: pass
   
   # dry run or real thing;
   gp.bTest = 0
   
   # root directory;
   gp.root = None
   try:
       (opts, args) = getopt.getopt(sys.argv[1:], "hdr:", ["help", "dry-run", "root-dir="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./make-tree.py -h|--help | [-d|--dry-run] | [-r|--root-dir ]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-d", "--dry-run"):
         gp.bTest = 1
      elif opt in ("-r", "--root-dir"):
         gp.root = arg
   if None == gp.root:
      print("Error: root_dir must be specified")
      print("Use -h or --help for help")
      sys.exit()

   # settings for Documentum;
   # nb sub-directories below the "80";
   gp.maxLevels = 2
   
   # nb tree depth levels;
   gp.maxDepth = gp.maxLevels + 1
   
   # random nb sub-directories in each level;
   # maximum is 256 for Documentum;
   gp.bRandomSubDirs = 0
   gp.minSubDirs = 200
   gp.maxSubDirs = 256
   
   # random nb files in each level;
   # maximum is 256 for Documentum;
   gp.bRandomFiles = 0
   gp.minFiles = 0
   gp.maxFiles = 1
   
   # node names' format;
   gp.dirFormat = "%02x"
   gp.fileFormat = "%02x"
   
   # maximum numbers of allowed processes in the pool; tasks will wait until some processes are available;
   # caution not to choose huge values for disk I/Os are saturated and overall performance drops to a crawl;
   gp.maxWorkers = 40
   
   # again but for the counters;
   class stats: pass
   stats.nb_dirs_at_bottom = 0
   stats.nb_created_files = 0
   stats.nb_created_dirs = 0
   stats.nb_created_dirs_with_files = 0

   # initialize random number generator;
   random.seed()
         
   startDate = datetime.datetime.now()
   print(f"started building a {gp.maxDepth}-level deep tree starting at {gp.root} at {startDate}")
   if not gp.bTest:
      try:
         os.makedirs(gp.root)
      except:
         pass
   make_tree(gp.root)
   print("{:d} created files".format(stats.nb_created_files))
   print("{:d} created dirs".format(stats.nb_created_dirs))
   print("{:d} created dirs at bottom".format(stats.nb_dirs_at_bottom))
   print("{:d} total created dirs with files".format(stats.nb_created_dirs_with_files))
   print("{:d} total created nodes".format(stats.nb_created_dirs + stats.nb_created_files))

   endDate = datetime.datetime.now()
   print(f"Ended building subtree {gp.root} at {endDate}")
   print(f"subtree {gp.root} built in {(endDate - startDate).seconds} seconds, or {time.strftime('%H:%M:%S', time.gmtime((endDate - startDate).seconds))} seconds")

No special attention as been given to the user interface and most settings are hard-coded in the script; their values are currently set to produce a dense (256 x 256 = 64K sub-directories/filestore) Documentum-style directory sub-tree each with just 1 zero-byte file. It is pointless, but possible, to have more files here since we only want to generate a set of rsync commands, one for each non empty subdirectory.
Examples of execution:
Create a dense 64K sub-tree at relative path clone/dmtest/content_storage01/0000c35a/86, with one empty file leaf in each:

./make-tree-dctm4.py --root-dir clone/dmtest/content_storage01/0000c35a/86
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/86 at 2018-12-31 18:47:20.066665
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/86 at 2018-12-31 18:47:34.612075
subtree clone/dmtest/content_storage01/0000c35a/86 built in 14 seconds, or 00:00:14 seconds

Sequentially create a 6-tree forest starting at relative path clone/dmtest/content_storage01/0000c35a:

for i in 80 81 82 83 84 85; do
./make-tree-dctm4.py --root-dir clone/dmtest/content_storage01/0000c35a/${i}
done
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/80 at 2018-12-31 18:41:31.933329
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/80 at 2018-12-31 18:41:39.694346
subtree clone/dmtest/content_storage01/0000c35a/80 built in 7 seconds, or 00:00:07 seconds
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/81 at 2018-12-31 18:41:39.738200
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/81 at 2018-12-31 18:42:14.166057
subtree clone/dmtest/content_storage01/0000c35a/81 built in 34 seconds, or 00:00:34 seconds
...
subtree clone/dmtest/content_storage01/0000c35a/84 built in 22 seconds, or 00:00:22 seconds
started building a 3-level deep tree starting at clone/dmtest/content_storage01/0000c35a/85 at 2018-12-31 18:43:06.644111
65536 created files
65792 created dirs
65536 created dirs at bottom
65536 total created dirs with files
131328 total created nodes
Ended building subtree clone/dmtest/content_storage01/0000c35a/85 at 2018-12-31 18:43:41.527459
subtree clone/dmtest/content_storage01/0000c35a/85 built in 34 seconds, or 00:00:34 seconds

So, the test creation script is quite quick with the current 40 concurrent workers. Depending on your hardware, be careful with that value because the target disk can easily be saturated and stop responding, especially if each tree of a forest is created concurrently.
Just out of curiosity, how long would it take to ls to navigate this newly created forest ?

# clear the disk cache first;
sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
 
time ls -1R clone/dmtest/content_storage01/0000c35a/* | wc -l
1840397
real 2m27,959s
user 0m4,291s
sys 0m19,587s
 
# again without clearing the disk cache;
time ls -1R clone/dmtest/content_storage01/0000c35a/* | wc -l
1840397
real 0m2,791s
user 0m0,941s
sys 0m1,950s

ls is very fast because there is just one leaf file in each sub-tree; it would be whole different story with well filled terminal sub-directories. Also, the cache has a tremendous speed up effect and the timed tests will always take care to clear them before a new run. The command above is very effective for this purpose.

Walking the trees

As written above, the main task before generating the rsync commands is to reach the terminal sub-directories. Let’s see if the obvious “find” command could be used at first, e.g.:

loc_path=/home/documentum/data/dmtest/./content_storage_01
find ${loc_path} -type d | gawk -v FS='/' -v max_level=11 '{if (max_level == NF) print}'
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01
...
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b
...

find stopped at the last subdirectory, just before the files, as requested.
Can we get rid of the extra-process used by the gawk filter ? Let’s try this:

find ${loc_path} -mindepth 4 -maxdepth 4 -type d
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/00
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/01
...
/home/documentum/data/dmtest/./content_storage_01/0000c350/80/00/0b
...

We can, good. The -mindepth and -maxdepth parameters, and the -type filter, let us jump directly to the directory level of interest, which is exactly what we want.

The “find” command is very fast; e.g. on the test forest above:

time find ${loc_path} -mindepth 4 -maxdepth 4 -type d | wc -l
458752
 
real 0m10,423s
user 0m0,536s
sys 0m2,450s

10 seconds for the forest’s 458’000 terminal directories, with disk cache emptied beforehand, impressing. If those directories were completely filled, they would contain about 117 millions files, a relatively beefy repository. Thus, find is a valuable tool, also because it is directed to stop before reaching the files. Finally, Documentum’s unusual file layout does not look so weird now, does it ? Let’s therefore use find to generate the rsync commands on the terminal sub-directories it returns:

find $loc_path -mindepth 4 -maxdepth 4 -type d | xargs -i echo "rsync -avzR {} [email protected]${dest_machine}:{}" > migr

Example of execution:

find $loc_path -mindepth 4 -maxdepth 4 -type d | xargs -i echo "rsync -avzR {} [email protected]_host:/some/other/dir/dmtest"
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/ea [email protected]_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/5e [email protected]_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/aa [email protected]_host:/some/other/dir/dmtest
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/cc [email protected]_host:/some/other/dir/dmtest
...

Subsequently, the commands in migr could be executed in parallel N at a time, with N a reasonable number that won’t hog the infrastructure.
The same gawk script showed in part I could be used here:

find $loc_path -mindepth 4 -maxdepth 4 -type d | gawk -v nb_rsync=10 -v [email protected]_machine:/some/other/place/dmtest 'BEGIN {
print "\#!/bin/bash"
}
{
printf("rsync -avzR %s %s &\n", $0, dest)
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' > migr.sh

Output:

rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/ea [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/5e [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/aa [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/cc [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/e5 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/bd [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/1d [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/61 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/39 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/75 [email protected]_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/6d [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/d2 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/8c [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/a1 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/84 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/b8 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/a4 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/27 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/fe [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/e6 [email protected]_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/4f [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/84/ea/82 [email protected]_machine:/some/other/place/dmtest &
...
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/92 [email protected]_machine:/some/other/place/dmtest &
wait
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/20 [email protected]_machine:/some/other/place/dmtest &
rsync -avzR /home/documentum/data/dmtest/./content_storage01/0000c35a/86/95/95 [email protected]_machine:/some/other/place/dmtest &
wait
 
chmod +x migr.sh
nohup ./migr.sh &

The good thing with rsync in archive mode is that the commands can be interrupted anytime; once relaunched, they will quickly resume the transfers where they left since they work incrementally.
Here, no SQL statements to correct the dm_locations, if needed, because we don’t connect to the repository to get the necessary information. It is not a big deal to produce them in case of need, however, just don’t forget.

Feeling uncomfortable because of the many produced rsync commands ? As in part I, just tell find to stop one level above the terminal sub-directory using “-mindepth 3 -maxdepth 3” at the cost of some speed decrease, though.

Any faster find alternative ?

As fast as find is when walking a docbase’s directories with lots of filestores, it can still take a long time despite the depth and type limits. Like we asked before for rsync, is there any way to speed it up by running several find commands concurrently on their own directory partition ? Such a find could be launched on sub-directory /home/dmadmin/data/dmtest/filestore_1, and others on /home/dmadmin/data/dmtest/filestore_i, all simultaneously. E.g.:

slice=0; for fs in dmtest/./*; do
(( slice += 1 ))
(
find $fs -mindepth 4 -maxdepth 4 -type d | gawk -v nb_rsync=10 -v [email protected]_machine:/some/other/place/dmtest 'BEGIN {
print "\#!/bin/bash"
}
{
printf("rsync -avzR %s %s &\n", $0, dest)
if (0 == ++nb_dirs % nb_rsync)
print "wait"
}
END {
if (0 != nb_dirs % nb_rsync)
print "wait"
}' > migr${slice}.sh
chmod +x migr${slice}.sh
./migr${slice}.sh
) &
done

But in such approach, the degree of parallelism is limited by the number of sub-directories to explore (i.e. here, the number of filestores in the repository). What about the script below instead ?

#! /usr/bin/python3
import datetime
import time
import sys
import os
import multiprocessing
import multiprocessing.pool
import getopt

# walks a Documentum tree at a given filesystem location using multi-processes for speed;
# C. Cervini, dbi-services.com, 12/2018;

def Usage():
   print("""Walks a sub-tree and prints the files' full path;
Usage:
walk-tree.py [-h|--help] | [-p |--print] [-r|--root-dir ] [-v|--verbose] [-m|--max_level]
-p |--print for outputting the files found; default action is do nothing, which is handy for just having a count of objects;
-r|--root-dir is the starting directory to explore;
-v|--verbose for outputting timing and count information; off by default;
-m|--max_level stops recursion at the max_level-th tree level, not included; like find's -maxdepth parameter;
Example:
walk-tree.py --print --root-dir ./dmtest/content_storage01/0000c35d/80
walk-tree.py --verbose --root-dir ./dmtest/content_storage01/0000c35d/80
""")

# this function cannot be local to do_tree(), though it is only used there, because of the following error:
#     Can't pickle local object 'do_tree..do_level'
# the error prevents it to be invoked as a local callback;
# actually, functions used in processes must have been entirely defined prior to their usage; this implies that functions cannot fork processes that calls themselves;
# therefore, the master function do_tree is needed that detaches the processes that execute the functions defined earlier, do_level here;
# moreover, processes in a pool are daemonic and are not allowed to fork other processes;
# we must therefore use a global pool of processes allocated in the master function;
def do_level(dir):
   """
   lists the sub-directores and files under dir;
   """
   global gp
   result = {'nb_files': 0, 'nb_dirs': 0, 'level_dirs': []}
   if gp.max_level > 0 and len() - gp.root_level > gp.max_level:
      #print("too deep, returning")
      if gp.bPrint:
         print(dir)
      return result
   for node in os.listdir(dir):
      fullpath = os.path.join(dir, node)
      if os.path.isdir(fullpath):
         result['nb_dirs'] += 1
         result['level_dirs'].append(fullpath)
      elif os.path.isfile(fullpath):
         if gp.bPrint:
            print(fullpath)
         result['nb_files'] += 1
   return result

# the master function;
def do_tree(root):
   global gp, stats
   sub_level_dirs = []

   # list_dirs contains a list of the values returned by parallelized calls to function do_level, i.e. a list of dictionaries;
   # get_dirs is invoked as a callback by map_async once all the processes in the pool have terminated executing do_level() and returned their result;
   def get_dirs(list_dirs):
      global stats
      nonlocal sub_level_dirs
      for l in list_dirs:
         stats.nb_dirs += len(l['level_dirs'])
         stats.nb_files += l['nb_files']
         stats.nb_dirs_with_files += (1 if l['nb_files'] > 0 else 0)
         sub_level_dirs.extend(l['level_dirs'])

   # callback for error reporting;
   def print_error(error):
      print(error)

   # first directory level;
   level_data = do_level(root)
   stats.nb_files += level_data['nb_files']
   stats.nb_dirs += len(level_data['level_dirs'])
   stats.nb_dirs_with_files += (1 if level_data['nb_files'] > 0 else 0)

   # all the other directory sub-levels;
   while level_data['nb_dirs'] > 0:
      sub_level_dirs = []
      gp.pool = multiprocessing.pool.Pool(processes = gp.maxWorkers)
      gp.pool.map_async(do_level, level_data['level_dirs'], len(level_data['level_dirs']), callback = get_dirs, error_callback = print_error)
      gp.pool.close()
      gp.pool.join()
      level_data['nb_files'] = 0
      level_data['nb_dirs'] = len(sub_level_dirs) 
      level_data['level_dirs'] = sub_level_dirs

# -----------------------------------------------------------------------------------------------------------
if __name__ == "__main__":
   # main;
   # global parameters;
   # we use a typical idiom to create a cheap but effective namespace for the global variables used as execution parameters;
   class gp: pass

   # root directory;
   gp.root = None
   gp.bPrint = False
   gp.bVerbose = False
   gp.root_level = 0
   gp.max_level = 0
   try:
       (opts, args) = getopt.getopt(sys.argv[1:], "hpr:m:v", ["help", "print", "root-dir=", "verbose", "max_level="])
   except getopt.GetoptError:
      print("Illegal option")
      print("./walk-tree.py -h|--help | [-p | --print] [-r|--root-dir ] [-v|--verbose] [-m|--max_level]")
      sys.exit(1)
   for opt, arg in opts:
      if opt in ("-h", "--help"):
         Usage()
         sys.exit()
      elif opt in ("-p", "--print"):
         gp.bPrint = True
      elif opt in ("-r", "--root-dir"):
         gp.root = arg
      elif opt in ("-v", "--verbose"):
         gp.bVerbose = True
      elif opt in ("-m", "--max_level"):
         try:
            gp.max_level = int(arg)
         except:
            print("invalid value for max_level")
            sys.exit()
   if None == gp.root:
      print("Error: root_dir must be specified")
      print("Use -h or --help for help")
      sys.exit()
   gp.root_level = len()

   # maximum numbers of allowed processes; tasks will wait until some processes are available;
   # caution not to choose huge values for disk I/Os are saturated and overall performance drops to a crawl;
   gp.maxWorkers = 50

   # again but for the counters;
   class stats: pass
   stats.nb_files = 0
   stats.nb_dirs = 0
   stats.nb_dirs_with_files = 0

   startDate = datetime.datetime.now()
   if gp.bVerbose:
      print(f"started walking {gp.root} at {startDate}")
   do_tree(gp.root)
   endDate = datetime.datetime.now()
   if gp.bVerbose:
      print("{:d} found files".format(stats.nb_files))
      print("{:d} found dirs".format(stats.nb_dirs))
      print("{:d} total found nodes".format(stats.nb_dirs + stats.nb_files))
      print("{:d} total found dirs with files".format(stats.nb_dirs_with_files))
      print(f"Ended walking subtree {gp.root} at {endDate}")
      print(f"subtree {gp.root} walked in {(endDate - startDate).seconds} seconds, or {time.strftime('%H:%M:%S', time.gmtime((endDate - startDate).seconds))} seconds")

Here, we have a pool of workers which receives tasks to explore sub-directories up to a given depth, just like find’s maxdepth. But is it any faster than a sequential find ? On my test laptop with an external USB 3.1 spinning drive, find gives the best result:

sudo sysctl vm.drop_caches=3; time find dmtest/content_storage01 -mindepth 4 -maxdepth 4 -type d | wc -l
vm.drop_caches = 3
1787628
 
real 35m19,989s
user 0m9,316s
sys 0m43,632s

The python script is very close but lags 2 minutes behind with 35 workers:

sudo sysctl vm.drop_caches=3; time ./walk-tree4.py -v --root-dir dmtest/content_storage01 -m 3
vm.drop_caches = 3
started walking dmtest/content_storage01 at 2019-01-01 17:47:39.664421
0 found files
1797049 found dirs
1797049 total found nodes
0 total found dirs with files
Ended walking subtree dmtest/content_storage01 at 2019-01-01 18:25:18.437296
subtree dmtest/content_storage01 walked in 2258 seconds, or 00:37:38 seconds
real 37m38,996s
user 0m34,138s
sys 1m11,747s

Performance gets worst when the number of concurrent workers is increased, likely because the disk is saturated. It was not designed for such intensive use in the first place. Whether with find or the python script, most of the execution time is spent waiting for the I/Os to complete, with very little time spent in user or system code. Obviously, there is little to optimize here, except switch to a faster disks sub-system, e.g. beginning with an SSD drive. Even a find’s parallel equivalent would not bring any improvement, it would only put more pressure on the disk and be counter-productive. But on real production infrastructures, with large enough disk bandwidth, if may be worth parallelizing if the volume is large, and that’s where the script can make a difference.

So, which one is better ?

Both alternatives generates more or less the same rsync commands to execute later; the only difference is the source of the information to produce those commands: the repository in the first alternative and the filesystem in the second.
The first alternative looks simpler and cleaner because it works from a repository and get its information directly from it. But if one wants to be absolutely, positively sure not to forget any content file, the second alternative is better as it works directly from the filesystem; since no queries are run for each of the contents, a lot of time is saved, even despite the required disk walking. It is ideal when an exact clone is needed, orphans, and possibly garbage, included. Its python script variant is interesting in that it can readily take advantage of the faster I/Os through an easy to set concurrency level.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Cesare Cervini
Cesare Cervini