Automating bulk import


Jeff Neeve

unread,
Nov 10, 2011, 1:55:07 PM11/10/11
to Alfresco Bulk Filesystem Import
Just looking for ideas on automating this and any potential
problems...

For fun, I tried using curl to use basic http authentication to start
the webscript.

That works fine, but by default you just get html output from the
template so I modified it to output the status code (200) etc.

curl -sLG -w "%{http_code} %{url_effective}" \
  -d "sourceDirectory=/export/home/alfresco/sourceFolder&targetPath=/Company%20Home/bulktest" \
  "http://admin:admin@localhost:8080/alfresco/s/bulk/import/filesystem/initiate" \
  -o /dev/null

But, since that just reports the status of the script being started,
how could you tell whether the import was actually successful in
uploading the files?

Also, are there any issues with, say, file locking? For example, if a
process were copying a large file into the incoming directory at the
same time the bulk load process was started.

Thanks.

Peter Monks

unread,
Nov 11, 2011, 8:38:57 PM11/11/11
to Alfresco Bulk Filesystem Import
G'day Jeff,

Keep in mind that imports always run asynchronously (i.e. on one or
more background threads), so the only way to monitor the status of an
import is to poll. Although the status Web Script defaults to HTML
output, it also has an XML template, which is intended for
programmatic monitoring of import status. This XML view is accessed
as is usual for any Web Script - simply append a ".xml" extension to
the end of the Web Script URI and it'll return XML. Here's an example
URL:

http://localhost:8080/alfresco/service/bulk/import/filesystem/status.xml
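To make that polling concrete, here's a minimal sketch. Note the "in progress" text matched below is an assumption, not the documented schema; inspect the XML your Alfresco version actually returns and adjust the pattern accordingly.

```shell
# Poll the XML status endpoint until no import is running.  The text
# matched by import_in_progress is an assumption -- check it against
# the real status XML returned by your Alfresco instance.
STATUS_URL="http://admin:admin@localhost:8080/alfresco/service/bulk/import/filesystem/status.xml"

import_in_progress() {
  # Succeeds (exit 0) while the supplied status document reports an
  # active import.
  echo "$1" | grep -qi "in progress"
}

poll_until_done() {
  # Fetch the status, and keep sleeping while an import is running.
  while status=$(curl -s "$STATUS_URL") && import_in_progress "$status"; do
    sleep 5
  done
}
```

You'd call poll_until_done immediately after hitting the initiate Web Script, and only then treat the import as finished.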

As for file locking, it depends on a whole host of factors, including
which OS Alfresco is running on, which filesystem the source directory
is (NTFS, ext3/4, ZFS, etc. etc.) and how the external process is
creating the large files. In short, there's no way of knowing without
trying it out and seeing what happens.

That said, there's a bigger issue here, and that's the likelihood of
race conditions. The import tool performs file I/O in two distinct
phases:
1. a directory scanning phase (which lists the contents of a directory,
but doesn't read the files in any way)
2. a file read phase (which reads the content of the files, but
doesn't list the parent directory in any way)
Depending on the volume and size of the contents of a directory, these
phases could be temporally decoupled - when running the unit tests I
regularly see separation of several seconds, but that could quite
easily stretch into the minutes range (particularly if the source disk
subsystem is "slow" e.g. it's a remote mount).

Race conditions can creep in because of this temporal decoupling -
specifically, if any structural operation (create, move, rename,
delete) is performed on a file in between steps 1 and 2.
Unfortunately there's no simple and performant way of safeguarding
against this kind of thing, so the tool fundamentally assumes that the
source directory is immutable for the duration of the import.

If you have a need to perform imports on a changing data set (for
example you're repeatedly importing new data from some kind of feed)
then you should use double buffering. This involves having two
directories on disk, one of which is being read by the import tool and
the other being written to by the feed process. Once an import is
complete, you'd delete the contents of that directory (the one that
was just imported), have the feed process switch over to writing to
that (now empty) directory, then initiate an import on the directory
that the feed process was previously writing to. Basically what
you're doing is adding a level of coordination between the import tool
and the feed process to ensure that they're never directly accessing
the same directory at the same time, while still allowing both to
operate concurrently (and also avoiding any unnecessary copying).
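As a concrete sketch of that swap discipline (the directory names and mktemp layout here are purely illustrative assumptions, not part of the tool):

```shell
#!/bin/bash
# Double-buffering sketch: two directories exchange roles between
# "being written by the feed" and "being read by the import".
feed_dir=$(mktemp -d)    # the feed process writes here
import_dir=$(mktemp -d)  # the import tool reads here

swap_buffers() {
  # Call once an import completes: empty the just-imported buffer,
  # then exchange the roles of the two directories.
  rm -rf "${import_dir:?}"/*
  local tmp=$feed_dir
  feed_dir=$import_dir
  import_dir=$tmp
}

# One cycle: the feed writes new content into its buffer...
echo "new document" > "$feed_dir/doc1.txt"
# ...the current import finishes, so we swap; the freshly written
# buffer becomes the next import source, and the feed gets the
# now-empty one.
swap_buffers
```

After the swap you'd initiate the next import against $import_dir (e.g. via the initiate Web Script) while the feed carries on writing into $feed_dir.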

For anyone who's ever coded video games, this should be immediately
familiar as exactly the same thing as double buffering / page
flipping, but applied to files on disk rather than VRAM. ;-)

Cheers,
Peter

Peter Monks

unread,
Nov 11, 2011, 8:46:16 PM11/11/11
to Alfresco Bulk Filesystem Import
For the videogame coders it's also worth mentioning that triple
buffering can also be useful (just like in videogames) - specifically,
if the feed process can't be controlled in the way I describe below.
But the details of making that work reliably are somewhat more
complicated, not to mention OS specific, so I'll leave it as an
exercise for the reader. ;-)

Cheers,
Peter

Chini

unread,
Jul 27, 2012, 11:59:41 AM7/27/12
to alfresco-bulk-f...@googlegroups.com
Hi Peter,

I'm very impressed with the progress you've made so far on this bulk import tool.
I had actually considered building something similar myself before I came across this project of yours - great!
My motivation is a large data migration I'm facing from FileNet to Alfresco, involving millions of dossiers (folders/containers) and the documents they contain.

BUT:
I'm going to use a relational database as temporary storage for all the relevant metadata coming from FileNet. The binary content can easily be written directly to the Alfresco content store location (to allow the less time-consuming in-place import later on), initiated by the source repository extraction process (out of scope).
So the slightly different approach I'm going for is to pull the metadata to be imported from an RDBMS, for each item (dossier and document) to be transferred. Mapping metadata that is already stored in the database to an XML file structure seems like massive overhead to me, both in programming effort and in execution performance.

Have you considered implementing such functionality too?
If not, I can easily adapt your code for my own needs and (depending on the results :) contribute my extension back to your code base.

Regards,
Stephan

Colin Stephenson

unread,
Nov 14, 2013, 3:25:17 PM11/14/13
to alfresco-bulk-f...@googlegroups.com
I'm "hacking" together a script that reads a number of bulk imports from a text file and launches them in sequence. Disclaimer: this was knocked together after the Alfresco Summit party. The import file it expects contains one comma-separated source,target pair per line, like

/home/alfresco/TDGfABI/target/test_folder_00000,/Company Home/Sites/mytestsite104/documentLibrary/Purged/test_folder_00000


#!/bin/bash
# Run a series of bulk imports listed in a comma-separated import file,
# polling the status endpoint until each import completes.

importfile=$1
sleepyperiod=$2
U=$3
PW=$4
HOST=$5
PORT=$6

function usage {
  echo "$0 <import file> <sleepy period> <user> <password> <host> <:port>"
}

function pollStatus {
  sp=$1
  status=$(curl -s -G "http://${U}:${PW}@${HOST}${PORT}/alfresco/service/bulk/import/filesystem/status.json")

  # Keep polling while the status JSON reports an import in progress.
  if echo "$status" | grep -q '"inProgress" *: *true'; then
    echo -n "."
    sleep "$sp"
    return 0
  fi
  return 1
}

function ingestSite {
  src="$1"
  target="$2"

  curl -s -G "http://${U}:${PW}@${HOST}${PORT}/alfresco/s/bulk/import/filesystem/initiate?targetPath=$target&sourceDirectory=$src" > /dev/null
  return $?
}

function ingestFile {
  in=$1
  sp=$2

  while read -r line
  do
    src=$(echo "$line" | cut -f1 -d",")
    target=$(echo "$line" | cut -f2 -d",")
    # URL-encode every space (//), not just the first one (/)
    src="${src// /%20}"
    target="${target// /%20}"

    if ingestSite "$src" "$target"; then
      echo "Ingesting $src to $target"
      while pollStatus "$sp"; do
        :
      done
    fi
  done < "$in"
}

if [ $# -ne 6 ]; then
  usage
  exit 1
fi

if [ -e "$importfile" ]; then
  ingestFile "$importfile" "$sleepyperiod"
else
  echo "Cannot locate import file"
  usage
fi

Peter Monks

unread,
Nov 17, 2013, 1:50:03 PM11/17/13
to alfresco-bulk-f...@googlegroups.com
Nice Colin - thanks for sharing!  Any thoughts on hosting this somewhere more permanent, perhaps in the project itself in a "contrib" folder or something?

Cheers,
Peter

--
You received this message because you are subscribed to the Google Groups "Alfresco Bulk Filesystem Import" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alfresco-bulk-filesys...@googlegroups.com.
To post to this group, send email to alfresco-bulk-f...@googlegroups.com.
Visit this group at http://groups.google.com/group/alfresco-bulk-filesystem-import.
For more options, visit https://groups.google.com/groups/opt_out.

Colin Stephenson

unread,
Nov 17, 2013, 2:17:06 PM11/17/13
to alfresco-bulk-f...@googlegroups.com
Seeing as this was created during the Summit, not much thinking was involved ;)  Having a "contrib" folder sounds like a grand idea.  Either that or a wiki section for contributions, which could be voted on for inclusion?