
Using wget to fill in a form


cr...@gtek.biz

Sep 22, 2012, 11:20:03 AM
Greetings,

I have a small book collection (~150) that I thought would be neat to catalog by the Library of Congress catalog numbers. I have found a LOC search form that will allow me to input the ISBN, and it will return the information I want:

[code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]

I have the list of book ISBNs in a text file, so scripting this should be quite easy. The problem is I can't figure out how to submit the form from the command line. I figured wget would be the best way, but everything I try results in downloading a single line that reads "Your form didn't include an ACTION!" So I thought I would turn to here for help. The test ISBN I am using is for The Linux Cookbook: 1886411484, QA76.76.O63S788 2001.

And a related side question. From my reading, I've learned that the Z39.50 protocol is used to query databases, usually library related. Is anyone aware of an ISBN database table that can be downloaded by the user, preferably in a format that can be imported into MySQL or PostgreSQL?

Thanks, Craig


Sent - Gtek Web Mail



--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: http://lists.debian.org/134832611...@webmail.gtek.biz

Gary Dale

Sep 22, 2012, 11:30:02 AM
On 22/09/12 11:01 AM, cr...@gtek.biz wrote:
> Greetings,
>
> I have a small book collection (~150) that I thought would be neat to catalog by the Library of Congress catalog numbers. I have found a LOC search form that will allow me to input the ISBN, and it will return the information I want:
>
> [code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]
>
> I have the list of book ISBNs in a text file, so scripting this should be quite easy. The problem is I can't figure out how to submit the form from the command line. I figured wget would be the best way, but everything I try results in downloading a single line that reads "Your form didn't include an ACTION!" So I thought I would turn to here for help. The test ISBN I am using is for The Linux Cookbook: 1886411484, QA76.76.O63S788 2001.
>
> And a related side question. From my reading, I've learned that the Z39.50 protocol is used to query databases, usually library related. Is anyone aware of an ISBN database table that can be downloaded by the user, preferably in a format that can be imported into MySQL or PostgreSQL?
>
> Thanks, Craig
>
The url you give is for the form. If you enter an ISBN number it will do
the search.

What you need to do is capture the http header sent when you click
"submit query" then replace the test ISBN number with whatever number
you want to search. Wireshark can do this. Simply look for the query
packet(s).



Lars Noodén

Sep 22, 2012, 11:30:02 AM
On 9/22/12 6:01 PM, cr...@gtek.biz wrote:
[snip]
> And a related side question. From my reading, I've learned that the
> Z39.50 protocol is used to query databases, usually library related.
> Is anyone aware of an ISBN database table that can be downloaded by
> the user, preferably in a format that can be imported into MySQL or
> PostgreSQL?
[snip]

You could use Perl and ZOOM to make Z39.50 queries directly:

http://search.cpan.org/~mirk/Net-Z3950-ZOOM/lib/ZOOM.pod

For background see the Bath Profile:

http://www.ukoln.ac.uk/interop-focus/bath/

There are also bindings for C, C++ and PHP. You'll find them at
IndexData's web site.

As far as importing into MySQL or Postgresql, that is up to how you
decide to map the Bath Profile (most likely the one used) over to your
own database structure. The database being queried via Z39.50 probably
has its data in the MARC21 format and has over 1000 fields and subfields
each with a specific meaning.



Regards
/Lars



Lars Noodén

Sep 22, 2012, 11:40:02 AM
On 9/22/12 6:01 PM, cr...@gtek.biz wrote:
> Greetings,
>
> I have a small book collection (~150) that I thought would be neat to
> catalog by the Library of Congress catalog numbers. I have found a
> LOC search form that will allow me to input the ISBN, and it will
> return the information I want:
>
> [code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]
>
> I have the list of book ISBNs in a text file, so scripting this
> should be quite easy. The problem is I can't figure out how to submit
> the form from the command line. I figured wget would be the best way,
> but everything I try results in downloading a single line that reads
> "Your form didn't include an ACTION!" So I thought I would turn to
> here for help. The test ISBN I am using is for The Linux Cookbook:
> 1886411484, QA76.76.O63S788 2001.
[snip]

If you want to screen scrape, the URI would be like this:

http://www.loc.gov/cgi-bin/zgate?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=4493330&TERM_1=1886411484

However, the session ID expires after only a few minutes so you will
need a fresh one.
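
Because every search needs a live session ID, one way to get a fresh one is to download the form page and read the hidden SESSION_ID field out of it. What follows is only a minimal sketch: the function name `extract_session_id` is made up for illustration, and it assumes the form writes the field as NAME="SESSION_ID" immediately followed by VALUE="...", as the LOC form did at the time of this thread.

```bash
#!/bin/sh
# Hypothetical helper: read the form's HTML on stdin and print the value of
# the hidden SESSION_ID field (prints nothing if the field is absent).
extract_session_id() {
    sed -n 's/.*NAME="SESSION_ID" *VALUE="\([^"]*\)".*/\1/p' | head -n 1
}

# Offline demonstration with a hidden field in the form's own markup style:
printf '%s\n' '<INPUT NAME="SESSION_ID"VALUE="5923056"TYPE="HIDDEN">' |
    extract_session_id
```

Against the live form this would be used as `wget -q -O - "<form URL>" | extract_session_id`, with the URL in quotes so the shell does not act on the `&` it contains.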

Regards,
/Lars



Gary Dale

Sep 22, 2012, 11:50:02 AM
The fields you need are shown in the page source:

<FORM METHOD="POST"ACTION="/cgi-bin/zgate">
<INPUT NAME="ACTION"VALUE="SEARCH"TYPE="HIDDEN">
<INPUT NAME="DBNAME"VALUE="VOYAGER"TYPE="HIDDEN">
<INPUT NAME="ESNAME"VALUE="B"TYPE="HIDDEN">
<INPUT NAME="MAXRECORDS"VALUE="20"TYPE="HIDDEN">
<INPUT NAME="RECSYNTAX"VALUE="1.2.840.10003.5.10"TYPE="HIDDEN">
<INPUT NAME="REINIT"TYPE="HIDDEN"VALUE="/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090">
<INPUT NAME="srchtype"VALUE="1,1016,2,102,3,3,4,2,5,100,6,1"TYPE="HIDDEN">

<P>
<STRONG>Enter Search Term(s):</STRONG><br>(The search term can be a single word or a phrase from anywhere in the record. Enter an author's name in indirect order, i.e., last_name, first_name.)<p>
<INPUT NAME="TERM_1"SIZE="60">
<p>
<INPUT TYPE="SUBMIT"VALUE="Submit Query">
<INPUT Type="RESET"VALUE="Clear Form">
<HR>
Use of this form results in a search of the LC Voyager database (approximately
14 million records). This database contains records in all bibliographic
formats (i.e., books, serials, music, maps, manuscripts, computer files, and
visual materials), and includes the retrospective, unedited older bibliographic
records known as the PreMARC File. LC name and subject authority records
cannot be searched.
<INPUT NAME="SESSION_ID"VALUE="5923056"TYPE="HIDDEN">
</FORM>


You need to construct the query using those fields with those values, with TERM_1 containing the ISBN number.

From the error you are getting, it seems like your query either didn't include the SEARCH action or the header wasn't understood.
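
As a rough illustration of constructing the query from those fields, the sketch below assembles a POST body suitable for wget's --post-data option. The helper names `urlencode` and `build_post_data` are hypothetical. The percent-encoding step is an assumption about what the gateway expects: it is what a browser does when submitting the form, and it matters here because the REINIT value itself contains `?`, `&` and `=`.

```bash
#!/bin/sh
# Hypothetical helper: percent-encode every character of $1 except the
# unreserved set (letters, digits, ".", "_", "~", "-"). ASCII input only.
urlencode() (
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Hypothetical helper: $1 = session ID, $2 = ISBN. Prints the form fields
# from the page source as one application/x-www-form-urlencoded string.
build_post_data() (
    reinit='/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090'
    printf 'ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20'
    printf '&RECSYNTAX=1.2.840.10003.5.10'
    printf '&REINIT=%s' "$(urlencode "$reinit")"
    printf '&srchtype=%s' "$(urlencode '1,1016,2,102,3,3,4,2,5,100,6,1')"
    printf '&SESSION_ID=%s' "$(urlencode "$1")"
    printf '&TERM_1=%s\n' "$(urlencode "$2")"
)
```

A call would then look like `wget -q -O - --post-data="$(build_post_data "$sid" 1886411484)" http://www.loc.gov/cgi-bin/zgate`, where `$sid` holds a session ID taken from a freshly fetched form.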





Camaleón

Sep 22, 2012, 12:10:02 PM
On Sat, 22 Sep 2012 10:01:51 -0500, craig wrote:

> Greetings,
>
> I have a small book collection (~150) that I thought would be neat to
> catalog by the Library of Congress catalog numbers. I have found a LOC
> search form that will allow me to input the ISBN, and it will return the
> information I want:
>
> [code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]
>
> I have the list of book ISBNs in a text file, so scripting this should
> be quite easy. The problem is I can't figure out how to submit the form
> from the command line. I figured wget would be the best way, but
> everything I try results in downloading a single line that reads "Your
> form didn't include an ACTION!" So I thought I would turn to here for
> help. The test ISBN I am using is for The Linux Cookbook: 1886411484,
> QA76.76.O63S788 2001.

As others suggest, the query should be something like:

wget http://www.loc.gov/cgi-bin/zgate --post-data="ACTION=SEARCH&TERM_1=1886411484&SESSION_ID=1234567"

But I get "session expired" :-(

(note the "SESSION_ID" field value is completely arbitrary in the above line)

> And a related side question. From my reading, I've learned that the
> Z39.50 protocol is used to query databases, usually library related. Is
> anyone aware of an ISBN database table that can be downloaded by the
> user, preferably in a format that can be imported into MySQL or
> PostgreSQL?

Well, according to this:

http://www.loc.gov/z3950/gateway.html#about

You can query the database by means of Z39.50 client, should you find one ;-)

Greetings,

--
Camaleón



cr...@gtek.biz

Sep 22, 2012, 12:30:01 PM
> As others suggest, the query should be something like:
>
> wget http://www.loc.gov/cgi-bin/zgate
> --post-data="ACTION=SEARCH&TERM_1=1886411484&SESSION_ID=1234567"

Yeah, I was messing with the --post-data, but I didn't know I had to use an ACTION key. Will play with that.

> But I get "session expired" :-(
>
> (note the "SESSION_ID" field value is completely arbitrary in the above line)
>
>> And a related side question. From my reading, I've learned that the
>> Z39.50 protocol is used to query databases, usually library related. Is
>> anyone aware of an ISBN database table that can be downloaded by the
>> user, preferably in a format that can be imported into MySQL or
>> PostgreSQL?
>
> Well, according to this:
>
> http://www.loc.gov/z3950/gateway.html#about
>
> You can query the database by means of Z39.50 client, should you find one ;-)

I kind of figured that would be what I needed, but I'm not aware of any Z39.50 clients.

Thanks!



cr...@gtek.biz

Sep 22, 2012, 12:30:02 PM
> For background see the Bath Profile:
>
> http://www.ukoln.ac.uk/interop-focus/bath/
>
> There are also bindings for C, C++ and PHP. You'll find them at
> IndexData's web site.
>
> As far as importing into MySQL or Postgresql, that is up to how you
> decide to map the Bath Profile (most likely the one used) over to your
> own database structure. The database being queried via Z39.50 probably
> has its data in the MARC21 format and has over 1000 fields and subfields
> each with a specific meaning.

Thanks for the info. I didn't realize MARC21 was so complex, but I can always create queries that select what I need; I just need to know what to query against. I will read up on what you provided.



cr...@gtek.biz

Sep 22, 2012, 12:30:02 PM
> The url you give is for the form. If you enter an ISBN number it will do
> the search.
>
> What you need to do is capture the http header sent when you click
> "submit query" then replace the test ISBN number with whatever number
> you want to search. Wireshark can do this. Simply look for the query
> packet(s).

At some point I thought about trying to capture what was being submitted, but since my HTTP protocol knowledge is limited I thought the information might also be sent as a URL, which I figured would make wget perfect for this. I've got Wireshark loaded on something around here, so I will investigate this line of thought. Thanks!



Lars Noodén

Sep 22, 2012, 12:50:02 PM
On 9/22/12 7:28 PM, cr...@gtek.biz wrote:
[snip]
> I kind of figured that would be what I needed, but I'm not aware of any Z39.50 clients.
[snip]

Using ZOOM, mentioned in my previous post, you can use your perl script
as a Z39.50 client to search the LOC catalog directly. There are also
C, C++ and PHP bindings.
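
For a quick test from the shell before writing any Perl, the sketch below only prints the commands such a lookup would need in a command-line Z39.50 client. It rests on assumptions that are not from this thread: that the `yaz-client` program (from Index Data's YAZ toolkit) is installed and that it accepts a command file via `-f`. The helper name `z3950_commands` is hypothetical. The host, port and database name are the ones that appear in the gateway URLs, and Bib-1 use attribute 7 is the ISBN access point.

```bash
#!/bin/sh
# Hypothetical helper: print a client command script that looks up one ISBN.
# Host, port and database name are taken from the gateway URLs in this thread.
z3950_commands() {
    printf 'open %s\n' 'z3950.loc.gov:7090/Voyager'
    # Bib-1 use attribute 7 means "ISBN"; the query is in prefix (PQF) notation.
    printf 'find @attr 1=7 %s\n' "$1"
    printf 'show 1\n'
    printf 'quit\n'
}

z3950_commands 1886411484
```

Under those assumptions the output could be saved to a file and run with `yaz-client -f lookup.cmd`. The same PQF query string should carry over to a ZOOM script unchanged.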

Regards,
/Lars



cr...@gtek.biz

Sep 22, 2012, 12:50:02 PM
> Using ZOOM, mentioned in my previous post, you can use your perl script
> as a Z39.50 client to search the LOC catalog directly. There are also
> C, C++ and PHP bindings.

Ah, that makes sense. I will probably get after this again later today or tomorrow, and I will definitely post any success stories. It will probably take me a while to get back up to speed with perl since I haven't touched it in a couple of years.

Craig



Lars Noodén

Sep 22, 2012, 1:10:02 PM
On 9/22/12 7:46 PM, cr...@gtek.biz wrote:
>> Using ZOOM, mentioned in my previous post, you can use your perl
>> script as a Z39.50 client to search the LOC catalog directly.
>> There are also C, C++ and PHP bindings.
>
> Ah, that makes sense. I will probably get after this again later
> today or tomorrow, and I will definitely post any success stories. It
> will probably take me a while to get back up to speed with perl since
> I haven't touched it in a couple of years.

It's there in the repository as libnet-z3950-zoom-perl

Regards,
/Lars



Camaleón

Sep 22, 2012, 1:40:01 PM
On Sat, 22 Sep 2012 11:28:50 -0500, craig wrote:

>> As others suggest, the query should be something like:
>>
>> wget http://www.loc.gov/cgi-bin/zgate
>> --post-data="ACTION=SEARCH&TERM_1=1886411484&SESSION_ID=1234567"
>
> Yeah, I was messing with the --post-data, but I didn't know I had to use
> an ACTION key. Will play with that.

(...)

Mmm... there's another door you can knock:

wget http://www.loc.gov/search --post-data="q=1886411484&all=true&st=list"

Greetings,

--
Camaleón



Gary Dale

Sep 22, 2012, 10:40:01 PM
On 22/09/12 11:34 AM, Lars Noodén wrote:
> On 9/22/12 6:01 PM, cr...@gtek.biz wrote:
>> Greetings,
>>
>> I have a small book collection (~150) that I thought would be neat to
>> catalog by the Library of Congress catalog numbers. I have found a
>> LOC search form that will allow me to input the ISBN, and it will
>> return the information I want:
>>
>> [code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]
>>
>> I have the list of book ISBNs in a text file, so scripting this
>> should be quite easy. The problem is I can't figure out how to submit
>> the form from the command line. I figured wget would be the best way,
>> but everything I try results in downloading a single line that reads
>> "Your form didn't include an ACTION!" So I thought I would turn to
>> here for help. The test ISBN I am using is for The Linux Cookbook:
>> 1886411484, QA76.76.O63S788 2001.
> [snip]
>
> If you want to screen scrape, the URI would be like this:
>
> http://www.loc.gov/cgi-bin/zgate?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=4493330&TERM_1=1886411484
>
> However, the session ID expires after only a few minutes so you will
> need a fresh one.
>
> Regards,
> /Lars
The solution is to wget the form to get a session id then submit the
query using that session id. If running multiple queries then keep
submitting them using the session id until one is rejected. With any
luck, you should be able to run multiple queries and also be able to
detect when a query is rejected due to an expired session.

You also could simply keep the get form / submit query pairing since I
doubt that the (possibly) unnecessary extra form gets are going to cause
a huge slowdown. I just think it's better to try to minimize traffic
where possible.
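
The reuse-until-rejected idea can be sketched roughly as follows. This is an illustration under assumptions, not a tested client: the function names and the MAX_RETRIES limit are invented here, the expiry message is assumed to contain "session has expired", and the query URI is the one quoted above with the session ID and ISBN substituted.

```bash
#!/bin/sh
# Sketch of "reuse the session ID until a query is rejected".
BASE_URL='http://www.loc.gov/cgi-bin/zgate'
INIT_QUERY='ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090'
MAX_RETRIES=3    # assumed limit on refresh attempts per ISBN
session_id=      # shared across calls to lookup

# Fetch the form and print its hidden SESSION_ID value.
get_session_id() {
    wget -q -O - "$BASE_URL?$INIT_QUERY" |
        sed -n 's/.*NAME="SESSION_ID" *VALUE="\([^"]*\)".*/\1/p' | head -n 1
}

# $1 = session ID, $2 = ISBN; prints the gateway's HTML response.
run_query() {
    wget -q -O - "$BASE_URL?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?$INIT_QUERY&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&TERM_1=$2"
}

# Succeeds when the response text contains the expiry message.
is_expired() {
    printf '%s\n' "$1" | grep -qi 'session has expired'
}

# lookup ISBN: print the response, fetching a new session only when needed.
lookup() {
    tries=0
    while :; do
        if [ -z "$session_id" ]; then
            session_id=$(get_session_id)
            [ -n "$session_id" ] || return 1
        fi
        response=$(run_query "$session_id" "$1")
        if ! is_expired "$response"; then
            printf '%s\n' "$response"
            return 0
        fi
        session_id=              # rejected: force a fresh ID next time round
        tries=$((tries + 1))
        [ "$tries" -lt "$MAX_RETRIES" ] || return 1
    done
}
```

Calling `lookup` once per ISBN in the same shell keeps the traffic down in the way described: the form is fetched again only after a query has actually been rejected.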



Jude DaShiell

Sep 22, 2012, 11:50:01 PM
wget isn't the right tool for that job. However, its brother wput may be.
---------------------------------------------------------------------------
jude <jdas...@shellworld.net>
Adobe fiend for failing to Flash




Pertti Kosunen

Sep 23, 2012, 5:40:02 AM
On 22.9.2012 18:01, cr...@gtek.biz wrote:
> I have the list of book ISBNs in a text file, so scripting this
> should be quite easy. The problem is I can't figure out how to submit
> the form from the command line.

http://curl.haxx.se/docs/manpage.html

It should be quite easy with curl.
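
A minimal sketch of what the curl version might look like, using the field names from the form's page source. The function name `loc_search_curl` is hypothetical, and a session ID from a freshly fetched form is still assumed to be required. curl joins repeated --data options with `&`, and --data-urlencode percent-encodes the part after the `=`, which is useful for the REINIT value because it contains `?` and `&`.

```bash
#!/bin/sh
# Hypothetical helper: $1 = session ID, $2 = ISBN. Submits the search form
# as a POST request and prints the response on stdout.
loc_search_curl() {
    curl --silent \
        --data 'ACTION=SEARCH' \
        --data 'DBNAME=VOYAGER' \
        --data 'ESNAME=B' \
        --data 'MAXRECORDS=20' \
        --data 'RECSYNTAX=1.2.840.10003.5.10' \
        --data-urlencode 'REINIT=/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090' \
        --data-urlencode 'srchtype=1,1016,2,102,3,3,4,2,5,100,6,1' \
        --data-urlencode "SESSION_ID=$1" \
        --data-urlencode "TERM_1=$2" \
        'http://www.loc.gov/cgi-bin/zgate'
}
```

Usage would be along the lines of `loc_search_curl "$sid" 1886411484 > result.html`, with `$sid` read from the form first.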



Chris Bannister

Sep 25, 2012, 10:20:02 AM
On Sat, Sep 22, 2012 at 10:01:51AM -0500, cr...@gtek.biz wrote:
> Greetings,
>
> I have a small book collection (~150) that I thought would be neat to catalog by the Library of Congress catalog numbers. I have found a LOC search form that will allow me to input the ISBN, and it will return the information I want:
>
> [code]http://www.loc.gov/cgi-bin/zgate?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,7090[/code]
>
> I have the list of book ISBNs in a text file, so scripting this should be quite easy. The problem is I can't figure out how to submit the form from the command line. I figured wget would be the best way, but everything I try results in downloading a single line that reads "Your form didn't include an ACTION!" So I thought I would turn to here for help. The test ISBN I am using is for The Linux Cookbook: 1886411484, QA76.76.O63S788 2001.

Have a look at:
http://search.cpan.org/dist/WWW-Mechanize/

Have a read of:
http://www.perl.com/pub/2003/01/22/mechanize.html

Do a google search on "perl www::mechanize"


Apologies for the 'z' in "mechanize"

--
"If you're not careful, the newspapers will have you hating the people
who are being oppressed, and loving the people who are doing the
oppressing." --- Malcolm X



cr...@gtek.biz

Sep 25, 2012, 10:30:02 AM

> Have a look at:
> http://search.cpan.org/dist/WWW-Mechanize/
>
> Have a read of:
> http://www.perl.com/pub/2003/01/22/mechanize.html
>
> Do a google search on "perl www::mechanize"

Thanks for the reply (and to the other kind folks that took time
to reply). I will have to put this quest off until the weekend at this
point, so know that I am not ignoring the help, please.

Craig

Hendrik Boom

Sep 28, 2012, 6:40:02 PM
On Sat, 22 Sep 2012 10:01:51 -0500, craig wrote:

> Greetings,
>
> I have a small book collection (~150) that I thought would be neat to
> catalog by the Library of Congress catalog numbers.

This isn't what you asked for at all, but you might consider the BLISS
classification instead. It's more modern, and its classification guides
are legitimately available for free download. Some are scanned PDFs,
others are available as source code (XML, I believe).

They've learned a lot about the structure of classification systems since
LC was set up.

It's used by a number of libraries in England, I believe.

-- hendrik



John Hasler

Sep 28, 2012, 7:20:01 PM
Hendrik Boom writes:
> It's more modern, and its classification guides are legitimately
> available for free download.

What about LCC is not in the public domain?

<http://www.loc.gov/catdir/cpso/lcco/>
--
John Hasler



cr...@gtek.biz

Sep 29, 2012, 6:20:02 PM
> They've learned a lot about the structure of classification systems since
> LC was set up.

I've been doing some reading, and there is work under way to modernize the
classification system. In the meantime this works for my needs. I do appreciate
the suggestion.



cr...@gtek.biz

Sep 29, 2012, 6:50:02 PM
In the end I did pretty much as suggested, using wget and re-using session IDs.
I created a bash script that gets a session ID, reads the list of ISBN numbers,
and then tries to retrieve their info. If the retrieval returns a session
expired then it gets a new one. It also does a decent job of outputting the
retrieved records into a csv format for easy import into a database or XML.

The script and my list of 25 test ISBNs are included below. Interestingly,
about five, or 20%, come up with no record found.

If I try to do anything more fancy then I will learn how to query the MARC
system directly. The LOC site has a lot of information available.

I appreciate all of the help and suggestions I received.

#!/bin/bash

#*******************************************#
# getLOCinfo.sh #
# #
# A script to read a list of ISBN numbers #
# from an input file, and to retrieve the #
# LOC info for that item from the LOC web #
# search form. #
# #
# The input file is expected to contain #
# a single line of ISBN numbers separated #
# by whitespace. Alternatively, the file #
# can contain one ISBN per line as long as #
# all but the final line ends with white- #
# space followed by a backslash (actually #
# I think all lines can end that way). #
#*******************************************#

# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate"
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0

# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL

# A function to get a new session ID:
GetSessionID ()
{
    session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
    session_url=$session_url"locils2.html,z3950.loc.gov,7090"
    # Quote the URL so the shell never treats "?" as a glob character.
    sessionid=`wget "$session_url" -o /dev/null -O - | \
        grep SESSION_ID | \
        cut -d "\"" -f4`
    if [ -z "$sessionid" ]
    then
        echo "Unable to get session ID. Exiting"
        exit $E_NO_SESSION_ID
    fi
}

# A function to "build" the request URL:
BuildURL ()
{
    url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
    url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
    url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
    url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
    url=$url"TERM_1=$2"
}

# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
    echo "ERROR: Incorrect number of parameters supplied. Exiting..."
    exit $E_BAD_ARGS
fi

# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
    echo "ERROR: $1 not found or is an empty file. Exiting..."
    exit $E_BAD_FILE
fi

# Truncate the output file if necessary:
if [ -s "$2" ]
then
    echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
    read input
    # Lower-case the answer into a variable and quote it in the test, so that
    # pressing Enter (an empty answer) counts as "no" instead of breaking [ ].
    answer=`echo "$input" | tr A-Z a-z`
    if [ "$answer" != "y" ]
    then
        echo "Please provide a valid output file name"
        exit $E_BAD_FILE
    fi
    cat /dev/null > "$2"
fi

# Get a session ID:
GetSessionID

# Read the file contents:
read isbn_list < "$1"

for isbn in $isbn_list
do
    BuildURL "$sessionid" "$isbn"
    result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
    # Refresh the session ID and retry while the gateway reports expiry:
    while [ -n "`echo $result | sed -n -e '/Your session has expired/Ip'`" ] &&
          [ $expired_count -lt $NUM_EXPIRED ]
    do
        let "expired_count+=1"
        GetSessionID
        BuildURL "$sessionid" "$isbn"
        result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
    done

    if [ $expired_count -eq $NUM_EXPIRED ]
    then
        echo "Unable to get session ID. Exiting"
        exit $E_NO_SESSION_ID
    else
        expired_count=0
    fi

    if [ -n "`echo $result | sed -n -e '/No records matched your query/Ip'`" ]
    then
        # Print the not found message to stderr:
        echo "$isbn: No record found" >&2
    else
        echo -n "\"$isbn\"," >> "$2"
        echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' | \
            sed -e 's/ \+/ /g' | \
            sed -e 's/^Author: /"/' | \
            sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
            sed -e 's/\. Title: /","/' | \
            sed -e 's/\. Published: /","/' | \
            sed -e 's/, c\([0-9]\{4\}\)\. LC Call No.: /","\1","/' | \
            sed -e 's/ *$/"/' \
            >> "$2"
    fi
done

exit $SUCCESS

##### ISBN List: ###############################################################

0805375651 \
0314027157 \
0201087987 \
9780980232714 \
0131774115 \
0789731274 \
1874416656 \
1886411484 \
9780425238981 \
0070726922 \
0495011622 \
1565927699 \
0673524841 \
0721659659 \
9781847991683 \
0596100795 \
0596001584 \
9780980455205 \
0835930513 \
9780954452971 \
0619121475 \
9780321553577 \
0130424110 \
0201612445 \
9780123705488
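
The input-format caveat in the script header comes from `read`, which treats a trailing backslash as a line continuation. As a possible alternative, the sketch below accepts either layout (a single line, or one ISBN per line with or without the backslashes) by normalizing the list first. The helper name `normalize_isbns` is made up for illustration.

```bash
#!/bin/sh
# Hypothetical helper: read an ISBN list on stdin in either layout and
# print one ISBN per line.
normalize_isbns() {
    # Turn spaces, tabs and backslashes into newlines, squeeze the repeats,
    # then drop the empty line that a leading separator can leave behind.
    tr -s ' \t\\' '\n\n\n' | grep -v '^$'
}

# Offline demonstration with three entries in the backslash layout:
printf '0805375651 \\\n0314027157 \\\n1886411484\n' | normalize_isbns
```

The main loop could then iterate with `for isbn in $(normalize_isbns < "$1")` instead of relying on `read` to join the continued lines.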



Morten Bo Johansen

Sep 30, 2012, 7:40:02 AM
cr...@gtek.biz <cr...@gtek.biz> wrote:

> I have a small book collection (~150) that I thought would be neat to
> catalog by the Library of Congress catalog numbers. I have found a LOC
> search form that will allow me to input the ISBN, and it will return
> the information I want:

[..]

> I have the list of book ISBNs in a text file, so scripting this should
> be quite easy. The problem is I can't figure out how to submit the form
> from the command line. I figured wget would be the best way, but
> everything I try results in downloading a single line that reads "Your
> form didn't include an ACTION!" So I thought I would turn to here for
> help. The test ISBN I am using is for The Linux Cookbook: 1886411484,
> QA76.76.O63S788 2001.

There are several urls on loc.gov that will retrieve book information
from an ISBN. The one below has no problem with session cookies. So
wouldn't this quick and dirty one-liner do what you want?


#!/bin/sh

# loc.sh <ISBN>

elinks -dump -dump-charset utf8 -no-references -no-numbering \
"http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=\
7090&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F&startrec=\
1&maxrecords=10&dbname=Voyager&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=\
$1"

so loc.sh 1886411484 will output the information for the Linux Cookbook
in a pure text format.

> And a related side question. From my reading, I've learned that the
> Z39.50 protocol is used to query databases, usually library related. Is
> anyone aware of an ISBN database table that can be downloaded by the
> user, preferably in a format that can be imported into MySQL or
> PostgreSQL?

Probably, but I suppose the output is very standardized and then you can
easily convert it to csv-format or something.


Regards,

Morten



cr...@gtek.biz

Oct 2, 2012, 11:50:02 AM
> There are several urls on loc.gov that will retrieve book information
> from an ISBN. The one below has no problem with session cookies. So
> wouldn't this quick and dirty one-liner do what you want?
>
>
> #!/bin/sh
>
> # loc.sh <ISBN>
>
> elinks -dump -dump-charset utf8 -no-references -no-numbering \
> "http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=\
> 7090&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F&startrec=\
> 1&maxrecords=10&dbname=Voyager&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=\
> $1"
>
> so loc.sh 1886411484 will output the information for the Linux Cookbook
> in a pure text format.
>

Well that certainly looks a lot better than what I came up with. I will
have to give it a try, but doubt I will have time before Friday to play
with this again. I'll let you know. Out of curiosity, can this be done
with lynx instead since I have it installed? If not, I can always
install elinks.
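
On the lynx question, a rough sketch, assuming lynx's -dump (render the page as text on stdout) and -nolist (omit the list of links) are close enough to the elinks options quoted above. The function names are hypothetical; the URL is the zclient one from the elinks script, joined back into a single string.

```bash
#!/bin/sh
# Hypothetical helper: $1 = ISBN. Prints the zclient query URL.
zclient_url() {
    printf '%s' 'http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=7090'
    printf '%s' '&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F'
    printf '%s' '&startrec=1&maxrecords=10&dbname=Voyager'
    printf '%s' '&srchtype=1,7,2,3,3,1,4,1,5,1,6,1'
    printf '&term_term_1=%s\n' "$1"
}

# Hypothetical helper: $1 = ISBN. Dumps the record page as plain text.
loc_lynx() {
    lynx -dump -nolist "$(zclient_url "$1")"
}
```

With these defined, `loc_lynx 1886411484` would be the lynx counterpart of `loc.sh 1886411484`.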

Thanks!


