I have a small book collection (~150 books) that I thought would be neat to catalog by Library of Congress call number. I have found a LOC search form that lets me enter an ISBN and returns the information I want.
I have the list of book ISBNs in a text file, so scripting this should be quite easy. The problem is that I can't figure out how to submit the form from the command line. I figured wget would be the best way, but everything I try results in downloading a single line that reads "Your form didn't include an ACTION!", so I thought I would turn here for help. The test ISBN I am using is 1886411484 (The Linux Cookbook), which should return QA76.76.O63S788 2001.
And a related side question. From my reading, I've learned that the Z39.50 protocol is used to query databases, usually library related. Is anyone aware of an ISBN database table that can be downloaded by the user, preferably in a format that can be imported into MySQL or PostgreSQL?
Thanks, Craig
Sent - Gtek Web Mail
--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/134832611...@webmail.gtek.biz
Yeah, I was messing with the --post-data, but I didn't know I had to use an ACTION key. Will play with that.
> But I get "session expired" :-(
>
> (note the "SESSION_ID" field value is completely arbitrary in the above line)
>
>> And a related side question. From my reading, I've learned that the
>> Z39.50 protocol is used to query databases, usually library related. Is
>> anyone aware of an ISBN database table that can be downloaded by the
>> user, preferably in a format that can be imported into MySQL or
>> PostgreSQL?
>
> Well, according to this:
>
> http://www.loc.gov/z3950/gateway.html#about
>
> You can query the database by means of Z39.50 client, should you find one ;-)
I kind of figured that would be what I needed, but I'm not aware of any Z39.50 clients.
Thanks!
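For the --post-data route, one way to keep the ACTION key from being forgotten is to assemble the form body from key=value pairs. This is only a sketch: build_post_data is a made-up helper, the field names (ACTION, SESSION_ID, TERM_1) are taken from the script posted later in this thread, and whether the gateway accepts them via POST rather than in the URL is an assumption.

```shell
#!/bin/bash
# Join "key=value" arguments into a form body separated by "&".
# No percent-encoding is done, so values must already be URL-safe
# (bare ISBN digits are).
build_post_data() {
    local IFS='&'
    printf '%s\n' "$*"
}

# Hypothetical usage (needs network access and a valid session ID):
# data=$(build_post_data ACTION=SEARCH "SESSION_ID=$sessionid" TERM_1=1886411484)
# wget -q -O - --post-data="$data" "http://www.loc.gov/cgi-bin/zgate"
```

Because nothing is percent-encoded, this is fine for bare ISBNs but not for arbitrary search terms.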
http://www.ukoln.ac.uk/interop-focus/bath/
There are also bindings for C, C++ and PHP. You'll find them at
IndexData's web site.
As for importing into MySQL or PostgreSQL, that depends on how you
decide to map the Bath Profile (most likely the one used) onto your
own database structure. The database being queried via Z39.50 probably
holds its data in the MARC21 format, which has over 1000 fields and
subfields, each with a specific meaning.
Thanks for the info. I didn't realize MARC21 was so complex, but I can always create queries that select only what I need; I just need to know what to query against. I will read up on what you provided.
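As a starting point for what to query against: in MARC 21 the ISBN is field 020 and the Library of Congress call number is field 050, and in the Bib-1 attribute set used by Z39.50, use attribute 7 means ISBN. A possible sketch with yaz-client, the command-line client in IndexData's YAZ toolkit, follows. The helper name isbn_query_commands is invented, the host, port and database name are lifted from the FORM_HOST_PORT and DBNAME values in the script posted later in this thread, and whether that server accepts direct Z39.50 connections is an assumption.

```shell
#!/bin/bash
# Print the yaz-client commands that search for one ISBN and display
# the first matching record.
# "@attr 1=7" selects Bib-1 use attribute 7 (ISBN).
isbn_query_commands() {
    printf 'find @attr 1=7 %s\n' "$1"
    printf 'show 1\n'
    printf 'quit\n'
}

# Hypothetical usage (needs yaz-client installed and network access):
# isbn_query_commands 1886411484 | yaz-client z3950.loc.gov:7090/voyager
```

The record that comes back is a full MARC record, so pulling out just the 050 field is left to whatever post-processing you prefer.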
What you need to do is capture the HTTP request sent when you click
"submit query", then replace the test ISBN with whatever number
you want to search for. Wireshark can do this. Simply look for the query
packet(s).
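Once a request has been captured, swapping in another ISBN is a plain string substitution. A minimal sketch, assuming the captured query string carries the ISBN in a field named TERM_1 (as in the script posted later in this thread); swap_isbn is a hypothetical helper name:

```shell
#!/bin/bash
# Replace the value of the TERM_1 field in a captured query string.
# $1 = captured query string, $2 = ISBN to search for
swap_isbn() {
    printf '%s\n' "$1" | sed -e "s/TERM_1=[^&]*/TERM_1=$2/"
}

# Hypothetical usage with a captured string saved in $captured:
# swap_isbn "$captured" 0596100795
```

The pattern stops at the next "&", so it works whether TERM_1 is the last field or sits in the middle of the string.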
At some point I thought about trying to capture what was being submitted, but my knowledge of HTTP is limited, and I assumed the information might simply be sent as part of the URL, which would make wget perfect for this. I've got Wireshark loaded on something around here, so I will investigate this line of thought. Thanks!
Ah, that makes sense. I will probably get after this again later today or tomorrow, and I will definitely post any success stories. It will probably take me a while to get back up to speed with Perl, since I haven't touched it in a couple of years.
Craig
> Have a look at:
> http://search.cpan.org/dist/WWW-Mechanize/
>
> Have a read of:
> http://www.perl.com/pub/2003/01/22/mechanize.html
>
> Do a google search on "perl www::mechanize"
Thanks for the reply (and to the other kind folks that took time
to reply). I will have to put this quest off until the weekend at this
point, so know that I am not ignoring the help, please.
Craig
I've been doing some reading, and there is work under way to modernize the
classification system. In the meantime this works for my needs. I do appreciate
the suggestion.
The script and my list of 25 test ISBNs are included below. Interestingly,
about five of them (20%) come up with no record found.
If I try to do anything fancier, I will learn how to query the MARC
system directly. The LOC site has a lot of information available.
I appreciate all of the help and suggestions I received.
#!/bin/bash
#*******************************************#
# getLOCinfo.sh                             #
#                                           #
# A script to read a list of ISBN numbers   #
# from an input file, and to retrieve the   #
# LOC info for that item from the LOC web   #
# search form.                              #
#                                           #
# The input file is expected to contain     #
# a single line of ISBN numbers separated   #
# by whitespace. Alternatively, the file    #
# can contain one ISBN per line as long as  #
# all but the final line ends with white-   #
# space followed by a backslash (actually   #
# I think all lines can end that way).      #
#*******************************************#
# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate"
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0
# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL
# A function to get a new sessionid:
GetSessionID ()
{
session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
session_url=$session_url"locils2.html,z3950.loc.gov,7090"
sessionid=`wget "$session_url" -o /dev/null -O - | \
grep SESSION_ID | \
cut -d "\"" -f4`
if [ -z "$sessionid" ]
then
echo "Unable to get session ID. Exiting"
exit $E_NO_SESSION_ID
fi
}
# A function to "build" the request URL:
BuildURL ()
{
url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
url=$url"TERM_1=$2"
}
# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
echo "ERROR: Incorrect number of parameters supplied. Exiting..."
exit $E_BAD_ARGS
fi
# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
echo "ERROR: $1 not found or is an empty file. Exiting..."
exit $E_BAD_FILE
fi
# Truncate the output file if necessary:
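if [ -s "$2" ]
then
echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
read input
# Quote the substitution so that an empty answer (just Enter) counts as "no"
# instead of breaking the test:
if [ "$(echo "$input" | tr A-Z a-z)" != "y" ]
then
echo "Please provide a valid output file name"
exit $E_BAD_FILE
fi
cat /dev/null > "$2"
fi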
# Get a session ID:
GetSessionID
# Read the file contents:
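# read (without -r) treats a trailing backslash as a line continuation,
# so the whole list arrives as a single line:
read isbn_list < "$1"
for isbn in $isbn_list
do
BuildURL "$sessionid" "$isbn"
# Quote the URL so that "?" is never treated as a glob character:
result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
while [ -n "`echo $result | sed -n -e '/Your session has expired/Ip'`" ] &&
[ $expired_count -lt $NUM_EXPIRED ]
do
let "expired_count+=1"
GetSessionID
BuildURL "$sessionid" "$isbn"
result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
done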
if [ $expired_count -eq $NUM_EXPIRED ]
then
echo "Unable to get session ID. Exiting"
exit $E_NO_SESSION_ID
else
expired_count=0
fi
if [ -n "`echo $result | sed -n -e '/No records matched your query/Ip'`" ]
then
# Print the not found message to stderr:
echo "$isbn: No record found" >&2
else
echo -n "\"$isbn\"," >> "$2"
echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' | \
sed -e 's/ \+/ /g' | \
sed -e 's/^Author: /"/' | \
sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
sed -e 's/\. Title: /","/' | \
sed -e 's/\. Published: /","/' | \
sed -e 's/, c\([0-9]\{4\}\)\. LC Call No.: /","\1","/' | \
sed -e 's/ *$/"/' \
>> "$2"
fi
done
exit $SUCCESS
##### ISBN List: ###############################################################
0805375651 \
0314027157 \
0201087987 \
9780980232714 \
0131774115 \
0789731274 \
1874416656 \
1886411484 \
9780425238981 \
0070726922 \
0495011622 \
1565927699 \
0673524841 \
0721659659 \
9781847991683 \
0596100795 \
0596001584 \
9780980455205 \
0835930513 \
9780954452971 \
0619121475 \
9780321553577 \
0130424110 \
0201612445 \
9780123705488
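The backslash requirement in the input file comes from reading the whole list with a single read. A looser alternative, sketched here and not part of the script above, normalises the file first so that one ISBN per line, several per line, or the backslash style all work. The function name list_isbns is made up for illustration, and the pattern only checks length and character set (10 or 13 characters), not the ISBN check digit.

```shell
#!/bin/bash
# Print every ISBN-looking token in the file named by $1, one per line.
# Backslashes, spaces, tabs and carriage returns are all treated as
# separators; anything that is not 10 or 13 characters long is dropped.
list_isbns() {
    tr -s '\\ \t\r' '\n' < "$1" | grep -E '^([0-9]{9}[0-9Xx]|[0-9]{13})$'
}

# Hypothetical use inside the script, replacing the read/for pair:
# for isbn in $(list_isbns "$1"); do ...; done
```

Tokens that fail the pattern are silently skipped, so a mistyped ISBN would never reach the lookup at all; printing the rejects to stderr might be a worthwhile addition.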
Well that certainly looks a lot better than what I came up with. I will
have to give it a try, but doubt I will have time before Friday to play
with this again. I'll let you know. Out of curiosity, can this be done
with lynx instead since I have it installed? If not, I can always
install elinks.
Thanks!