I have just committed, in revision 3815, an update to the mini scripting tool together with a sample package.
This revision will be compiled tonight and will be available at http://www-igm.univ-mlv.fr/~unitex/index.php?page=3&html=beta.html; it is already available on the svn server.
The testHour.lingpkg package contains both the Unitex resources (dictionaries, graphs, Alphabet, normalization files…) and a script file.
It is an uncompressed zip file (built with the zip tool and the options -0 -X -r).
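For readers who prefer to build such an archive without the zip command line, here is a minimal Python sketch that stores a directory tree without compression. It is only an approximation of `zip -0 -X -r`: the function name `build_stored_package` is hypothetical, and whether Unitex accepts an archive produced this way has not been verified here.

```python
import os
import zipfile


def build_stored_package(source_dir, package_path):
    """Pack source_dir into package_path using the 'Stored' method (no compression)."""
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # deterministic traversal order
            # Directory entries are written too, as `zip -r` does.
            for name in dirs + sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                zf.write(full, arcname)
    return package_path
```

The output path should be outside `source_dir`, otherwise the archive would try to include itself.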
So, if all that is on disk is the Unitex executable, the testHour.lingpkg file and a file testdate.txt containing the line:
It is soon midnight and 14:05pm after. We go work at 8:00am, and to go home at 7:15pm.
then, with the right command line (described in the script below), you get two output files, first result_date.txt:
It is soon midnight and <time hour="14" min="05" meridiem="pm"/> after. We go work <time hour="8" min="00" meridiem="am"/>, and to go home <time hour="7" min="15" meridiem="pm"/>.
And result_date.xml:
<concord>
<concordance start="24" end="31"><time hour="14" min="05" meridiem="pm"/></concordance>
<concordance start="50" end="59"><time hour="8" min="00" meridiem="am"/></concordance>
<concordance start="76" end="85"><time hour="7" min="15" meridiem="pm"/></concordance>
</concord>
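To show how a calling program might consume result_date.xml, here is a small Python sketch that reads the concordance and looks up each matched span in the input line. It assumes that `start` and `end` are 0-based character offsets into the original text with `end` exclusive; that reading agrees with the sample above but is inferred from it, not taken from a specification. The function name `read_matches` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT = ("It is soon midnight and 14:05pm after. "
        "We go work at 8:00am, and to go home at 7:15pm.")

CONCORD_XML = """<concord>
<concordance start="24" end="31"><time hour="14" min="05" meridiem="pm"/></concordance>
<concordance start="50" end="59"><time hour="8" min="00" meridiem="am"/></concordance>
<concordance start="76" end="85"><time hour="7" min="15" meridiem="pm"/></concordance>
</concord>"""


def read_matches(xml_text, text):
    """Return (matched source text, tag name, attributes) for each concordance."""
    matches = []
    for node in ET.fromstring(xml_text).iter("concordance"):
        start, end = int(node.get("start")), int(node.get("end"))
        tag = node[0]  # the element produced by the graph, here <time .../>
        matches.append((text[start:end], tag.tag, dict(tag.attrib)))
    return matches


for matched, tag, attrs in read_matches(CONCORD_XML, TEXT):
    print(repr(matched), tag, attrs)
```

Under that assumption the second and third spans cover "at 8:00am" and "at 7:15pm", which is consistent with the word "at" disappearing from result_date.txt.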
The goal of this feature is to bundle, in a single .lingpkg file, both the linguistic resources and the configuration for using them, so that Unitex can be used within a processing pipeline without the pipeline's code depending on Unitex's settings.
For example, if a dictionary has to be added or a parameter changed, updating the lingpkg package is all that is needed.
Y:\>unzip -v y:\unitex\ling\English\LingPkg\testHour.lingpkg
Archive: y:/unitex/ling/English/LingPkg/testHour.lingpkg
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Stored 0 0% 24/04/2015 14:57 00000000 resource/
0 Stored 0 0% 08/04/2015 16:16 00000000 resource/english/
0 Stored 0 0% 08/04/2015 16:14 00000000 resource/english/alphabet/
450 Stored 450 0% 29/09/2011 19:12 fa0f5504 resource/english/alphabet/Alphabet.txt
0 Stored 0 0% 24/04/2015 14:59 00000000 resource/english/dela/
18 Stored 18 0% 24/04/2015 14:59 ed7cf49d resource/english/dela/sample-dlc.bin
26 Stored 26 0% 24/04/2015 14:59 ae5c0783 resource/english/dela/sample-dlc.inf
0 Stored 0 0% 08/04/2015 16:14 00000000 resource/english/graph/
2900 Stored 2900 0% 29/09/2011 19:12 ecc59d54 resource/english/graph/AAA-hoursgilles.fst2
5150 Stored 5150 0% 29/09/2011 19:12 bc59d7a5 resource/english/hourstestgilles.txt
0 Stored 0 0% 08/04/2015 16:14 00000000 resource/english/norm/
22 Stored 22 0% 29/09/2011 19:12 775865cf resource/english/norm/Norm.txt
0 Stored 0 0% 08/04/2015 16:14 00000000 resource/english/others/
450 Stored 450 0% 29/09/2011 19:12 fa0f5504 resource/english/others/Alphabet.txt
22 Stored 22 0% 29/09/2011 19:12 775865cf resource/english/others/Norm.txt
0 Stored 0 0% 24/04/2015 15:44 00000000 script/
4106 Stored 4106 0% 24/04/2015 15:57 3bdb71c1 script/standard.uniscript
-------- ------- --- -------
13144 13144 0% 17 files
Y:\>unzip -c y:\unitex\ling\English\LingPkg\testHour.lingpkg script/standard.uniscript
Archive: y:/unitex/ling/English/LingPkg/testHour.lingpkg
extracting: script/standard.uniscript
# demo for Unitex Script and Resource package
# below are sample usages of the package with the UnitexToolLogger executable
# (just remove the # and merge all lines into one command line)
# These samples are for MS-Windows
# put testHour.lingpkg in c:\HourTest
# and a file named testdate.txt with this line:
# It is soon midnight and 14:05pm after. We go work at 8:00am, and to go home at 7:15pm.
# you'll get the results in two files, result_date.txt and result_date.xml
# to uncompress resource package on disk at c:\UnitexPkgResource and working file at c:\UnitexPkgWork
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "c:\UnitexPkgResource" -v }
# { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=c:\UnitexPkgWork" -a "PACKAGE_DIR=c:\UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "c:\UnitexPkgResource\script\standard.uniscript" }
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "c:\UnitexPkgResource" -u -v }
# to work on Virtual File (resource package on $:UnitexPkgResource and working file at $:UnitexPkgWork)
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "$:UnitexPkgResource" -v }
# { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=$:UnitexPkgWork/" -a "PACKAGE_DIR=$:UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "$:UnitexPkgResource\script\standard.uniscript" }
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "$:UnitexPkgResource" -u -v }
# to work on Unitex Optimized Virtual File (resource package on *UnitexPkgResource and working file at *UnitexPkgWork)
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "*UnitexPkgResource" -v }
# { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=*UnitexPkgWork/" -a "PACKAGE_DIR=*UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "*UnitexPkgResource\script\standard.uniscript" }
# { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "*UnitexPkgResource" -u -v }
# With Linux/Unix/Mac, just replace \ by / and c:\ by your home directory in these command lines.
# Using the Unitex Library, you can install the package once, and then call RunScript several times
# Note:
# When this script file is read, all \ are converted to / under Linux/Unix/Mac
# When this script file is read, all / are converted to \ under MS-Windows
# define a subdirectory with a unique value:
# when the Unitex Library is running in multithread mode,
# two scripts running at the same time (in different threads of the same process)
# will always have different values for CURRENT_WORK_DIR
CURRENT_WORK_DIR = {CORPUS_WORK_DIR}/{UNIQUE_VALUE}
#
DuplicateFile --make-dir {CURRENT_WORK_DIR}
# we copy the input file from user path to working location
DuplicateFile -i {INPUT_FILE_1} {CURRENT_WORK_DIR}/corpus.txt
DuplicateFile --make-dir {CURRENT_WORK_DIR}/corpus_snt
Normalize {CURRENT_WORK_DIR}/corpus.txt -r {PACKAGE_DIR}/resource/english/norm/Norm.txt
Tokenize {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt
Dico -t {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt {PACKAGE_DIR}/resource/english/dela/sample-dlc.bin
Locate -t {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt -L -R --all -b -Y {PACKAGE_DIR}/resource/english/graph/AAA-hoursgilles.fst2
Concord {CURRENT_WORK_DIR}/corpus_snt/concord.ind -m {CURRENT_WORK_DIR}/corpus.txt
Concord {CURRENT_WORK_DIR}/corpus_snt/concord.ind --xml
# we copy the output files from working location to user path
# merged file from first concord
DuplicateFile -i {CURRENT_WORK_DIR}/corpus.txt {OUTPUT_FILE_1}
# xml file from second concord
DuplicateFile -i {CURRENT_WORK_DIR}/corpus_snt/concord.xml {OUTPUT_FILE_2}
# perform a cleanup of working directory
DuplicateFile --recursive-delete {CURRENT_WORK_DIR}
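
The install / run / uninstall sequence given in the script comments can also be assembled from a calling program. The following Python sketch only builds the argument list; it uses the options that appear in those comments and nothing else. The function name `build_toollogger_args` is hypothetical, and it is an assumption that the executable is reachable as `UnitexToolLogger` and takes each brace and option as a separate argument.

```python
def build_toollogger_args(package, input_file, out_txt, out_xml,
                          resource_dir, work_dir, script_path,
                          exe="UnitexToolLogger"):
    """Build one command line: install the package, run the script, uninstall."""
    install = ["{", "InstallLingResourcePackage", "-r", "-p", package,
               "-x", resource_dir, "-v", "}"]
    run = ["{", "RunScript", "-v",
           "-a", "INPUT_FILE_1=" + input_file,
           "-a", "CORPUS_WORK_DIR=" + work_dir,
           "-a", "PACKAGE_DIR=" + resource_dir,
           "-a", "OUTPUT_FILE_1=" + out_txt,
           "-a", "OUTPUT_FILE_2=" + out_xml,
           script_path, "}"]
    # Same call as the install step, with -u added as in the script comments.
    uninstall = ["{", "InstallLingResourcePackage", "-r", "-p", package,
                 "-x", resource_dir, "-u", "-v", "}"]
    return [exe] + install + run + uninstall


args = build_toollogger_args(
    package=r"c:\HourTest\testHour.lingpkg",
    input_file=r"c:\HourTest\testdate.txt",
    out_txt=r"c:\HourTest\result_date.txt",
    out_xml=r"c:\HourTest\result_date.xml",
    resource_dir=r"c:\UnitexPkgResource",
    work_dir=r"c:\UnitexPkgWork",
    script_path=r"c:\UnitexPkgResource\script\standard.uniscript",
)
print(" ".join(args))
```

Passing `args` to `subprocess.run` would launch the tool. Because the configuration lives in the package, a change of dictionary or parameter should leave this calling code untouched.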