Script and linguistic resource package in Unitex


Gilles Vollant

Apr 24, 2015, 11:26:04 AM
to unitex-...@googlegroups.com

 

I have just committed, in revision 3815, an update to the mini scripting tool along with a sample package.

 

This revision will be compiled tonight and made available at http://www-igm.univ-mlv.fr/~unitex/index.php?page=3&html=beta.html (it is available right now on the svn server).

 

The testHour.lingpkg package contains both the Unitex resources (dictionaries, graphs, Alphabet, normalization files…) and a script file.

 

It is an uncompressed zip file (created with the zip tool and the options -0 -X -r).
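For anyone who wants to build such an archive from a program instead of the zip tool, here is a minimal Python sketch. It is an assumption that an archive written by Python's standard zipfile module is accepted by Unitex in the same way as one made with zip -0 -X -r; I have not verified that. The function name build_stored_package is made up for this illustration.

```python
import os
import zipfile


def build_stored_package(source_dir, package_path):
    """Pack source_dir recursively into an uncompressed (stored) zip archive."""
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # deterministic traversal order
            rel_root = os.path.relpath(root, source_dir)
            if rel_root != ".":
                # Add an explicit directory entry, as zip -r does.
                archive.write(root, rel_root.replace(os.sep, "/"))
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Entry names are relative to source_dir and use forward slashes.
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                archive.write(full_path, arcname)
```

Calling build_stored_package("mypackage", "testHour.lingpkg") on a folder that holds the resource and script subfolders should give a layout similar to the listing shown further down in this message.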

 

So, if all we have on disk is the Unitex executable, the testHour.lingpkg file and a file testdate.txt containing the line:

It is soon  midnight and 14:05pm after. We go work at 8:00am, and to go home at 7:15pm.

 

With the right command line (described in the script below), we get the file result_date.txt:

It is soon midnight and <time hour="14" min="05" meridiem="pm"/> after. We go work <time hour="8" min="00" meridiem="am"/>, and to go home <time hour="7" min="15" meridiem="pm"/>.

 

And result_date.xml:

<concord>

<concordance start="24" end="31"><time hour="14" min="05" meridiem="pm"/></concordance>

<concordance start="50" end="59"><time hour="8" min="00" meridiem="am"/></concordance>

<concordance start="76" end="85"><time hour="7" min="15" meridiem="pm"/></concordance>

</concord>
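A processing chain would typically read this XML back. The sketch below parses the sample above with Python's standard library. It assumes the file is well-formed XML exactly as shown; the real file may carry an XML declaration or a different encoding, which is not covered here. The function name read_times is hypothetical.

```python
import xml.etree.ElementTree as ET

# The concordance shown above, copied as a string for the demonstration.
SAMPLE = """<concord>
<concordance start="24" end="31"><time hour="14" min="05" meridiem="pm"/></concordance>
<concordance start="50" end="59"><time hour="8" min="00" meridiem="am"/></concordance>
<concordance start="76" end="85"><time hour="7" min="15" meridiem="pm"/></concordance>
</concord>"""


def read_times(xml_text):
    """Return one dict per <concordance> element that wraps a <time> tag."""
    matches = []
    for conc in ET.fromstring(xml_text).iter("concordance"):
        time_tag = conc.find("time")
        if time_tag is None:
            continue  # skip matches that carry no <time> annotation
        matches.append({
            "start": int(conc.get("start")),
            "end": int(conc.get("end")),
            "hour": int(time_tag.get("hour")),
            "min": int(time_tag.get("min")),
            "meridiem": time_tag.get("meridiem"),
        })
    return matches


for match in read_times(SAMPLE):
    print(match)
```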

 

The purpose of this feature is to bundle, in the .lingpkg file, both the linguistic resources and the configuration for using them, so that Unitex can be used in a processing chain without the code of that chain depending on Unitex's settings.

For example, if a dictionary has to be added or a parameter changed, updating the lingpkg package is enough.
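To illustrate what this means for the calling code, here is a Python sketch of a wrapper that knows only the package path and the input and output files. The tool names and options are copied from the sample commands quoted in the script comments further down in this message. Grouping the three commands in braces on one UnitexToolLogger command line is my reading of "remove the # and merge all lines". The executable name or location, the function names, and the use of a non-zero exit code to signal failure are assumptions.

```python
import os
import subprocess


def build_unitex_argv(package, input_file, output_txt, output_xml,
                      resource_dir, work_dir, exe="UnitexToolLogger"):
    """Build one command line: install the package, run its script, uninstall."""
    install = ["InstallLingResourcePackage", "-r", "-p", package, "-x", resource_dir]
    run = [
        "RunScript", "-v",
        "-a", "INPUT_FILE_1=" + input_file,
        "-a", "CORPUS_WORK_DIR=" + work_dir,
        "-a", "PACKAGE_DIR=" + resource_dir,
        "-a", "OUTPUT_FILE_1=" + output_txt,
        "-a", "OUTPUT_FILE_2=" + output_xml,
        os.path.join(resource_dir, "script", "standard.uniscript"),
    ]
    argv = [exe]
    for command in (install + ["-v"], run, install + ["-u", "-v"]):
        argv += ["{", *command, "}"]
    return argv


def annotate(package, input_file, output_txt, output_xml, resource_dir, work_dir):
    """Run the packaged script on one file (assumes a non-zero exit code on failure)."""
    argv = build_unitex_argv(package, input_file, output_txt, output_xml,
                             resource_dir, work_dir)
    subprocess.run(argv, check=True)


# Print the command line without running it.
print(" ".join(build_unitex_argv(
    "testHour.lingpkg", "testdate.txt", "result_date.txt", "result_date.xml",
    "UnitexPkgResource", "UnitexPkgWork")))
```

Nothing in this wrapper mentions a dictionary, a graph or a Locate option: all of that stays inside the package.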

 

Y:\>unzip -v y:\unitex\ling\English\LingPkg\testHour.lingpkg

Archive:  y:/unitex/ling/English/LingPkg/testHour.lingpkg

Length   Method    Size  Cmpr    Date    Time   CRC-32   Name

--------  ------  ------- ---- ---------- ----- --------  ----

       0  Stored        0   0% 24/04/2015 14:57 00000000  resource/

       0  Stored        0   0% 08/04/2015 16:16 00000000  resource/english/

       0  Stored        0   0% 08/04/2015 16:14 00000000  resource/english/alphabet/

     450  Stored      450   0% 29/09/2011 19:12 fa0f5504  resource/english/alphabet/Alphabet.txt

       0  Stored        0   0% 24/04/2015 14:59 00000000  resource/english/dela/

      18  Stored       18   0% 24/04/2015 14:59 ed7cf49d  resource/english/dela/sample-dlc.bin

      26  Stored       26   0% 24/04/2015 14:59 ae5c0783  resource/english/dela/sample-dlc.inf

       0  Stored        0   0% 08/04/2015 16:14 00000000  resource/english/graph/

    2900  Stored     2900   0% 29/09/2011 19:12 ecc59d54  resource/english/graph/AAA-hoursgilles.fst2

    5150  Stored     5150   0% 29/09/2011 19:12 bc59d7a5  resource/english/hourstestgilles.txt

       0  Stored        0   0% 08/04/2015 16:14 00000000  resource/english/norm/

      22  Stored       22   0% 29/09/2011 19:12 775865cf  resource/english/norm/Norm.txt

       0  Stored        0   0% 08/04/2015 16:14 00000000  resource/english/others/

     450  Stored      450   0% 29/09/2011 19:12 fa0f5504  resource/english/others/Alphabet.txt

      22  Stored       22   0% 29/09/2011 19:12 775865cf  resource/english/others/Norm.txt

       0  Stored        0   0% 24/04/2015 15:44 00000000  script/

    4106  Stored     4106   0% 24/04/2015 15:57 3bdb71c1  script/standard.uniscript

--------          -------  ---                            -------

   13144            13144   0%                            17 files
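The same check (every entry listed as Stored) can be done from a program. Here is a small Python sketch with the standard zipfile module. The function name compressed_entries is hypothetical, and the message does not say how Unitex reacts to a package that contains compressed entries, so this is only a sanity check.

```python
import zipfile


def compressed_entries(package_path):
    """Return the names of entries that are NOT stored uncompressed."""
    with zipfile.ZipFile(package_path) as archive:
        return [info.filename for info in archive.infolist()
                if info.compress_type != zipfile.ZIP_STORED]


def print_listing(package_path):
    """Print size and name of each entry, roughly like unzip -v."""
    with zipfile.ZipFile(package_path) as archive:
        for info in archive.infolist():
            print(f"{info.file_size:8d}  {info.filename}")
```

An empty list returned by compressed_entries("testHour.lingpkg") means the package matches the "Stored" column above.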

 

Y:\>unzip -c y:\unitex\ling\English\LingPkg\testHour.lingpkg script/standard.uniscript

Archive:  y:/unitex/ling/English/LingPkg/testHour.lingpkg

extracting: script/standard.uniscript

# demo for Unitex Script and Resource package

 

 

 

# Here are sample usages of the package with the UnitexToolLogger executable

# (just remove the # and merge all lines)

 

# These samples are for MS-Windows

# put testHour.lingpkg in c:\HourTest

# and a file named testdate.txt with this line:

#  It is soon  midnight and 14:05pm after. We go work at 8:00am, and to go home at 7:15pm.

# you'll get the results in two files, result_date.txt and result_date.xml

 

# to unpack the resource package on disk at c:\UnitexPkgResource, with working files at c:\UnitexPkgWork

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "c:\UnitexPkgResource" -v }

#  { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=c:\UnitexPkgWork" -a "PACKAGE_DIR=c:\UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "c:\UnitexPkgResource\script\standard.uniscript" }

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "c:\UnitexPkgResource" -u -v }

 

# to work on virtual files (resource package at $:UnitexPkgResource and working files at $:UnitexPkgWork)

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "$:UnitexPkgResource" -v }

#  { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=$:UnitexPkgWork/" -a "PACKAGE_DIR=$:UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "$:UnitexPkgResource\script\standard.uniscript" }

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "$:UnitexPkgResource" -u -v }

 

# to work on Unitex optimized virtual files (resource package at *UnitexPkgResource and working files at *UnitexPkgWork)

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "*UnitexPkgResource" -v }

#  { RunScript -v -a INPUT_FILE_1=c:\HourTest\testdate.txt -a "CORPUS_WORK_DIR=*UnitexPkgWork/" -a "PACKAGE_DIR=*UnitexPkgResource" -a OUTPUT_FILE_1=c:\HourTest\result_date.txt -a OUTPUT_FILE_2=c:\HourTest\result_date.xml "*UnitexPkgResource\script\standard.uniscript" }

#  { InstallLingResourcePackage -r -p c:\HourTest\testHour.lingpkg -x "*UnitexPkgResource" -u -v }

 

# On Linux/Unix/Mac, just replace \ with / and c:\ with your home directory in these command lines.

 

# Using the Unitex library, you can install the package once and then run RunScript several times

 

# Note:

# When this script file is read, all \ are converted to / under Linux/Unix/Mac

# When this script file is read, all / are converted to \ under MS-Windows

 

# define a subdirectory with a unique value

# when the Unitex library is running in multithread mode,

#   two scripts running at the same time (in different threads of the same process)

#   will always have a different value for CURRENT_WORK_DIR

CURRENT_WORK_DIR = {CORPUS_WORK_DIR}/{UNIQUE_VALUE}

 

# create the working directory

DuplicateFile --make-dir {CURRENT_WORK_DIR}

 

# we copy the input file from user path to working location

DuplicateFile -i {INPUT_FILE_1} {CURRENT_WORK_DIR}/corpus.txt

DuplicateFile --make-dir {CURRENT_WORK_DIR}/corpus_snt

Normalize {CURRENT_WORK_DIR}/corpus.txt -r {PACKAGE_DIR}/resource/english/norm/Norm.txt

Tokenize {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt

Dico -t {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt {PACKAGE_DIR}/resource/english/dela/sample-dlc.bin

Locate -t {CURRENT_WORK_DIR}/corpus.snt -a {PACKAGE_DIR}/resource/english/alphabet/Alphabet.txt -L -R --all -b -Y {PACKAGE_DIR}/resource/english/graph/AAA-hoursgilles.fst2

Concord {CURRENT_WORK_DIR}/corpus_snt/concord.ind -m {CURRENT_WORK_DIR}/corpus.txt

Concord {CURRENT_WORK_DIR}/corpus_snt/concord.ind --xml

 

# we copy the output files from working location to user path

# merged file from first concord

DuplicateFile -i {CURRENT_WORK_DIR}/corpus.txt {OUTPUT_FILE_1}

 

# xml file from second concord

DuplicateFile -i {CURRENT_WORK_DIR}/corpus_snt/concord.xml {OUTPUT_FILE_2}

 

# perform a cleanup of working directory

DuplicateFile --recursive-delete {CURRENT_WORK_DIR}
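To make the {NAME} placeholders and the slash conversion described in the notes above concrete, here is a Python sketch of how a script line could be expanded. This is not the Unitex implementation. The function names, the sample value used for UNIQUE_VALUE, and the order of the two steps are assumptions made for illustration only.

```python
import re


def expand(line, variables):
    """Replace each {NAME} with its value; unknown names are left untouched."""
    def lookup(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\{([A-Z0-9_]+)\}", lookup, line)


def convert_slashes(line, windows):
    """Convert / to \\ for MS-Windows, or \\ to / for Linux/Unix/Mac."""
    return line.replace("/", "\\") if windows else line.replace("\\", "/")


# Values that would come from the -a options of RunScript.
variables = {
    "INPUT_FILE_1": "c:/HourTest/testdate.txt",
    "CORPUS_WORK_DIR": "c:/UnitexPkgWork",
    "UNIQUE_VALUE": "0001",  # hypothetical: the real value is chosen by Unitex
}

# An assignment line defines a new variable from existing ones.
variables["CURRENT_WORK_DIR"] = expand("{CORPUS_WORK_DIR}/{UNIQUE_VALUE}", variables)

command = expand("DuplicateFile -i {INPUT_FILE_1} {CURRENT_WORK_DIR}/corpus.txt", variables)
print(convert_slashes(command, windows=True))
```

Because each run gets its own UNIQUE_VALUE, two scripts given the same CORPUS_WORK_DIR still work in separate subdirectories.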

 

 

 

 

 

 

 

 
