CDX - INDEX

Carlos Córdova

Oct 25, 2016, 11:52:01 PM
to openwayback-dev
Hello friends.

I'm trying to configure CDX indexing on a Red Hat server with the following setup.

My WARC files are in /opt/warc:

[root@webarchive-testing WARC]# ls -l /opt/warc/

archivo1.warc.gz 
archivo2.warc.gz 
cdx-index/
cdx.sh 
path-index.txt

I copied the script from the installation page, but it shows me the following error:

[root@webarchive-testing WARC]# sh -x cdx.sh
+ ARCHIVE_BASE_DIR=/opt/warc
+ TARGET_FILE=file
+ tempfile=file.tmp
+ unset a i
cdx.sh: line 14: syntax error near unexpected token `<'
cdx.sh: line 14: `done <<(find $ARCHIVE_BASE_DIR -type f -regex ".*\.w?arc\.gz$" -print0)'

I have set the export in my /etc/profile:

[root@webarchive-testing WARC]# cat  /etc/profile | grep "LC"
export LC_ALL=C

I can't figure out why the script does not work.

Greetings.

--Carlos

Kristinn Sigurðsson

Oct 26, 2016, 5:34:52 AM
to openway...@googlegroups.com

Hi Carlos,

 

Just to be clear, the script you are referring to (at the bottom of https://github.com/iipc/openwayback/wiki/How-to-configure) is only to create the ‘path-index’ file. I.e. the file that maps WARC and ARC filenames to actual URIs (either via HTTP or on the local filesystem).

 

The script is also provided as an example only. It may not be suitable for everyone’s needs.

 

In your case, you seem to have made a mistake in copying this line:

 

done < <(find $ARCHIVE_BASE_DIR -type f -regex ".*\.w?arc\.gz$" -print0)

 

Note the space between the first and second ‘<’.

That space is vitally important!
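
To illustrate the line above, here is a minimal sketch, assuming bash and GNU find are available. The file name path-index.sh and the temporary directory are made up for the illustration. The first ‘<’ redirects standard input, and ‘<(...)’ is bash process substitution. Since that is bash syntax rather than POSIX sh syntax, it may also be worth running the script as "bash cdx.sh" instead of "sh cdx.sh", in case sh on your system rejects it.

```shell
# Sketch only: demo_dir and path-index.sh are hypothetical names for this example.
demo_dir=$(mktemp -d)
touch "$demo_dir/archivo1.warc.gz" "$demo_dir/archivo2.warc.gz" "$demo_dir/notes.txt"

# Write the loop to a throwaway script. The quoted 'EOF' keeps $1, $file, etc. literal.
cat > "$demo_dir/path-index.sh" <<'EOF'
# Read NUL-delimited file names; "< <(" needs the space between the two "<".
while IFS= read -r -d '' file; do
    printf '%s\t%s\n' "$(basename "$file")" "$file"
done < <(find "$1" -type f -regex '.*\.w?arc\.gz$' -print0)
EOF

# Invoke bash explicitly so that process substitution is accepted.
bash "$demo_dir/path-index.sh" "$demo_dir" | LC_ALL=C sort > "$demo_dir/path-index.txt"
cat "$demo_dir/path-index.txt"
```

Only the two .warc.gz files should end up in path-index.txt, one "name<TAB>path" entry per line; notes.txt is skipped by the regex.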

 

Best,

Kris

 

Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is

Carlos Córdova

Oct 26, 2016, 1:17:05 PM
to openwayback-dev

Thank you very much for your reply. I copied and pasted the line and I get the same result. This is the script:

#!/bin/bash
# Find all ARC/WARC files
ARCHIVE_BASE_DIR=/opt/warc;
TARGET_FILE=file;
tempfile="$TARGET_FILE.tmp";
unset a i
while IFS= read -r -d $'\0' file; do
    archive=$(basename $file);
    echo -e "$archive\t$file" >> $tempfile;
done < <(find $ARCHIVE_BASE_DIR -type f -regex ".*\.w?arc\.gz$" -print0)
# Now sort the file
export LC_ALL=C;
sort $tempfile > $TARGET_FILE;
rm $tempfile

Is there any program or script to generate the *.cdx files and path-index.txt?

Thanks.

--Carlos