How to trick the USPTO into talking to me?


Robert S. Kissel

Mar 14, 2012, 9:24:02 PM
to www-mecha...@googlegroups.com
You'd think the U.S. Patent & Trademark Office would just have a very
simple query for this, but no.

I've been scratching my head for days, now, trying to figure out what
I need to do to retrieve the form on this page:

https://ramps.uspto.gov/eram/getMaintFeesInfo.do

The obvious business of capturing the hidden field contents and
loading a fields hash appropriately will only return a page with an
error they probably submit whenever they guess that you're a program,
rather than a Web browser; but if you try to retrieve the form in
Opera or Firefox all will go as expected.

Since the transaction is over HTTPS, I can't just sniff at it and
write a simulation of every byte the server and client are throwing at
one another.

This should be an hour's coding, not days and days of fruitless head scratching.

Does anybody know what the trouble is? Is there a special step I have
to do to make the Mechanize object impersonate a browser or hurl back
whatever idiotic cookies and markers and other such tin-cans the
server is tossing at visitors?

Code that doesn't do what we want follows:
==========================
#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;

# Patent number and application (serial) number to look up.
my $patno = '7107206';
my $serno = '10130357';

my $url = 'https://ramps.uspto.gov/eram/patentMaintFees.do';
my $m = WWW::Mechanize->new();
$m->get($url);

# Scrape the session parameter and the hidden fields from the landing page.
my $html = $m->content;
my( $parameter ) = $html =~ m{getMaintFeesInfo\.do;([^'"]*)};
my( $signature ) = $html =~ m/name="signature"\s+value="([^"]+)"/;
my( $loadtime )  = $html =~ m/name="loadTime"\s+value="([^"]+)">/;
my( $sessionId ) = $html =~ m/"sessionId" value="([^"]+)">/;
print "$parameter\n$signature\n$sessionId\n$loadtime\n";

my $fields = {
    'patentNum'      => $patno,
    'applicationNum' => $serno,
    'signature'      => $signature,
    'loadTime'       => $loadtime,
    'sessionId'      => $sessionId,
    'maintFeeAction' => 'Retrieve Fees to Pay',
    'maintFeeYear'   => '04',
};

# Fill in the first form on the page and submit it.
my $r = $m->submit_form( form_number => 1, fields => $fields );

print $r->content;

bem

Mar 16, 2012, 6:29:35 AM
to www-mecha...@googlegroups.com
You could use WWW::Mechanize::FormFiller

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::FormFiller;
use URI::URL;

my $agent = WWW::Mechanize->new( autocheck => 1 );
my $formfiller = WWW::Mechanize::FormFiller->new();
$agent->env_proxy();

$agent->get('https://ramps.uspto.gov/eram/getMaintFeesInfo.do');
$agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
{ local $^W; $agent->current_form->value('patentNum', '7107206'); };
{ local $^W; $agent->current_form->value('applicationNum', '10130357'); };
$agent->submit();

Robert S. Kissel

Mar 17, 2012, 12:19:10 AM
to WWW::Mechanize users
The problem isn't that the form isn't getting filled; the problem is
that instead of coming back with an answer, it comes back (with the
form filled) and says it cannot answer because I'm a machine, rather
than a human using a Web client program on a GUI interface. Somehow
it is detecting that, and I don't know what I can do to sniff a
legitimate transaction (since it's in https) to try to write a
simulator of what Opera or Chrome or Firefox does when it connects.

Andy Lester

Mar 17, 2012, 12:21:24 AM
to www-mecha...@googlegroups.com

On Mar 16, 2012, at 11:19 PM, Robert S. Kissel wrote:

> it cannot answer because I'm a machine, rather
> than a human using a Web client program on a GUI interface.

That's because you're a machine, rather than a human using a Web client program on a GUI interface.

If you're not supposed to be scraping the site, then maybe you shouldn't be scraping the site.

xoa


Natxo Asenjo

Mar 17, 2012, 6:51:48 AM
to www-mecha...@googlegroups.com
It probably checks your user agent, sees that it is WWW::Mechanize, and decides to block you. Try changing it to something else:

http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm#$mech-%3Eagent_alias%28_$alias_%29
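
For what it's worth, here is a minimal sketch of switching the alias, which runs without touching the network. 'Windows Mozilla' is one of the stock alias names; the exact User-Agent string behind it depends on the installed WWW::Mechanize version, so treat the printed values as illustrative.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
print "default: ", $mech->agent, "\n";   # a WWW-Mechanize/x.y style string

# Swap in a browser-like User-Agent header for all later requests.
$mech->agent_alias('Windows Mozilla');
print "aliased: ", $mech->agent, "\n";

# List every alias this installation knows about.
print "  $_\n" for $mech->known_agent_aliases;
```

If the server still refuses after this, the User-Agent header is probably not what it is keying on.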

--
groet,
natxo

Robert S. Kissel

Mar 18, 2012, 9:55:11 AM
to WWW::Mechanize users
On Mar 17, 5:51 am, Natxo Asenjo <natxo.ase...@gmail.com> wrote:
> it probably checks your user agent, sees it is www::mechanize and decides
> then to block you. Try changing it to something else:
> http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm...

I believe I tried various aliases already, and got the same results.
Does the Mechanize agent hurl back cookies set by the server?

Meir Guttman

Mar 18, 2012, 10:14:26 AM
to www-mecha...@googlegroups.com

> I believe I tried various aliases already, and got the same results.
> Does the Mechanize agent hurl back cookies set by the server?

Yes, but only if you «use HTTP::Cookies;» see http://search.cpan.org/~gaas/HTTP-Cookies-6.01/lib/HTTP/Cookies.pm

In many sites, the cookies are the only link between successive queries.
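
A note with a sketch: as far as I know, WWW::Mechanize (unlike a bare LWP::UserAgent) creates an in-memory HTTP::Cookies jar by default, so cookies should already be sent back without extra setup. The snippet below checks that offline. The cookie name, value, and the example.com host are made up for illustration.

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Request;

my $mech = WWW::Mechanize->new();
my $jar  = $mech->cookie_jar;            # present by default in Mechanize
print ref($jar), "\n";

# Pretend a server set a session cookie (hypothetical name, value, and host).
# Arguments: version, key, value, path, domain, port, path_spec, secure, maxage, discard
$jar->set_cookie(0, 'session', 'abc123', '/', 'example.com', undef, 1, 0, 3600, 0);

# Build a request for that host and let the jar attach its cookies,
# which is what happens automatically on every $mech->get or $mech->submit.
my $req = HTTP::Request->new(GET => 'http://example.com/');
$jar->add_cookie_header($req);
print "Cookie header: ", $req->header('Cookie'), "\n";

# After a real fetch, dump the jar to see what the server handed out:
# print $mech->cookie_jar->as_string;
```

Dumping the jar after the first GET is a quick way to confirm whether the session cookie ever arrived.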


Robert S. Kissel

Mar 19, 2012, 12:38:38 AM
to WWW::Mechanize users
On Mar 18, 9:14 am, Meir Guttman <m...@guttman.co.il> wrote:
> > Does the Mechanize agent hurl back cookies set by the server?
> Yes, but only if you «use HTTP::Cookies;» see http://search.cpan.org/~gaas/HTTP-Cookies-6.01/lib/HTTP/Cookies.pm
> In many sites, the cookies are the only link between successive queries.

Hmm. Let me see if that makes a difference. I'm already pretty sure
there *IS* some sort of cookie going back and forth that has some
magic numbers in it. Sometimes I long for the days of gopher.

Meir Guttman

Mar 19, 2012, 9:20:16 AM
to www-mecha...@googlegroups.com
Dear Robert

An indispensable tool for monitoring the traffic in both directions is the "Wireshark" application (http://www.wireshark.org/), which is very easy to work with.

Meir

Warren Humphreys

Mar 19, 2012, 11:43:46 AM
to www-mecha...@googlegroups.com, www-mecha...@googlegroups.com
For an even better http/https debugging tool, give www.charlesproxy.com a try. No affiliation, just a happy user.

Warren


bruce

Mar 19, 2012, 12:02:00 PM
to www-mecha...@googlegroups.com
hi..

i do a lot of screen/app scraping. don't use perl for it though.

here's a high level of what i've found to work.

use one of the plugins for firefox, to examine the net traffic between
browser/target site. this gives you a good idea of what the expected
to/from process should be. it also lets you know if you can do a
straight curl kind of process, or if you need to get more complex.

simple cookie/login/form kinds of sites can be handled using
python/libxml2dom functions, with xpath to extract the required data
from the dom.

for complex sites that implement javascript/dynamic content
generation, you aren't going to easily get the content unless you
replicate the browser session. this can be accomplished, using one of
the headless java apps like htmlunit, or selenium.

bottom line, takes some effort, but most results are doable.

peace

Robert S. Kissel

Mar 21, 2012, 4:58:37 PM
to WWW::Mechanize users
On Mar 19, 10:43 am, Warren Humphreys <w...@rren.co.uk> wrote:
> For an even better http/https debugging tool, give www.charlesproxy.com a try. No affiliation, just a happy user.

Yes, I downloaded "Charles," tried it, and seeing that it does pretty
much what I needed without my having to set up my own "tunnel" to
sniff at the secure communication, purchased it.

I see that the USPTO is hurling a "session" cookie of its own back and
forth, in addition to whatever "jsession" the server is tossing
around, and so I suppose the next thing I must do is figure out where
this gets set, and make sure that Mechanize is hurling it back.

It never fails to amaze me how COMPLEX everybody tries to make
everything on the Web, when the whole idea is simplicity, brief,
stateless connections, and "webbing" information together. The whole
IDEA is to make it possible for mechanized readings and writings and
sharing and propagation of information. It seems nobody utilizes a
simple GET query anymore if they can set up javascript and java and
"best viewed in" and cookies right and left--and then everybody moans
about needing more bandwidth!

Especially NOW with so many people accessing HTTP from their
telephones, you'd think people would get it.

Meir Guttman

Mar 21, 2012, 5:42:57 PM
to www-mecha...@googlegroups.com
Well my dear Robert, you just joined the club...


Robert S. Kissel

Mar 21, 2012, 8:26:24 PM
to WWW::Mechanize users
Well, thanks to Warren Humphreys' recommendation of the "Charles"
proxy, I was at last able to confirm that while my code is throwing
back the cookies, for some reason, Mechanize is NOT sending in the
value of one of the fields--and that might well be the whole problem.

What I may need to do is make up a POST request instead of using the
"submit_form" method, unless one of you knows a way to force the
Mechanize agent to include the name and value attributes of an input
of type SUBMIT when it sends a form.

The code
my $fields = {
'patentNum' => $patno,
'applicationNum' => $serno,
'signature' => $signature,
'loadTime' => $loadtime,
'sessionId' => $sessionId,
'maintFeeAction' => 'Retrieve Fees to Pay',
'maintFeeYear' => '04',
};

results in the following POST string:

patentNum=7107206&applicationNum=10130357&signature=...&loadTime=...&sessionId=...&maintFeeYear=04

whereas the "real" browser sends this one:

patentNum=7107206&applicationNum=10130357&signature=...&loadTime=...&sessionId=...&maintFeeAction=Retrieve+Fees+to+Pay&maintFeeYear=04
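
A hand-built POST of the kind mentioned above might look roughly like this. It is only a sketch: the hidden-field values are placeholders (the real ones would be scraped from the landing page), and the target URL is an assumption based on the page name quoted earlier in the thread, not a confirmed form action. Nothing is sent; the request is just built and printed.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder values: the real signature, loadTime, and sessionId would be
# scraped from the landing page as in the original script.
my %hidden = ( signature => 'SIG', loadTime => 'TIME', sessionId => 'SID' );

# An array ref (not a hash ref) keeps the fields in the order listed.
my $req = POST 'https://ramps.uspto.gov/eram/getMaintFeesInfo.do', [
    patentNum      => '7107206',
    applicationNum => '10130357',
    signature      => $hidden{signature},
    loadTime       => $hidden{loadTime},
    sessionId      => $hidden{sessionId},
    maintFeeAction => 'Retrieve Fees to Pay',   # the submit button's name and value
    maintFeeYear   => '04',
];

# Show the urlencoded body; no network access happens here.
print $req->content, "\n";

# To send it through an existing Mechanize object (cookies included):
# my $response = $mech->request($req);
```

Passing the request to the Mechanize object rather than a fresh user agent keeps the session cookies in play.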

Warren Humphreys

Mar 21, 2012, 8:59:32 PM
to www-mecha...@googlegroups.com
It's rather handy isn't it? :-)

It made my life a *lot* easier when I was writing some scripts to automatically grab my bank statement. I can't remember the exact details now, but I was having similar problems with values in fields which Charles helped me spot very quickly!

Warren

Andy Lester

Mar 21, 2012, 9:30:40 PM
to www-mecha...@googlegroups.com

On Mar 21, 2012, at 7:26 PM, Robert S. Kissel wrote:

> What I may need to do is make up a POST request instead of using the
> "submit_form" method, unless one of you knows a way to force the
> Mechanize agent to include the name and value attributes of an input
> of type SUBMIT when it sends a form.

You use the ->click() method to specify which button you're clicking, or specify button => 'whatever' in ->submit_form().
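
To illustrate the difference offline, here is a small sketch using HTML::Form, which is what Mechanize uses underneath for its forms. The HTML is a hypothetical cut-down form, since the real USPTO markup is not shown in this thread; only the field names are taken from the posts above.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical cut-down form; field names come from the thread, the rest is made up.
my $html = <<'HTML';
<form method="post" action="/eram/getMaintFeesInfo.do">
  <input type="text"   name="patentNum"      value="7107206">
  <input type="text"   name="applicationNum" value="10130357">
  <input type="submit" name="maintFeeAction" value="Retrieve Fees to Pay">
  <input type="text"   name="maintFeeYear"   value="04">
</form>
HTML

my ($form) = HTML::Form->parse($html, 'https://ramps.uspto.gov/');

# Roughly what ->submit() or submit_form() without "button" builds:
# no button is pressed, so the submit input contributes nothing.
my $plain = $form->make_request;
print "no button: ", $plain->content, "\n";

# Roughly what ->click('maintFeeAction') or
# submit_form( button => 'maintFeeAction' ) builds.
my $clicked = $form->click('maintFeeAction');
print "clicked:   ", $clicked->content, "\n";

# With a live Mechanize object the call would be along the lines of:
# $m->submit_form( form_number => 1, fields => $fields, button => 'maintFeeAction' );
```

Only the clicked request carries the maintFeeAction pair, which matches the difference between the two POST strings captured with Charles.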

Don Bowen

Jan 27, 2014, 3:35:22 PM
to www-mecha...@googlegroups.com
Robert, did you ever figure it out completely?

I'm trying to catch up and perhaps emulate what you did in Python.

Robert S. Kissel

Feb 4, 2014, 5:28:52 PM
to www-mecha...@googlegroups.com
> Don Bowen <don.e...@gmail.com> Jan 27 12:35PM -0800
> Robert, did you ever figure it out completely?
> I'm trying to catch up and perhaps emulate what you did in Python.

No, Don Bowen, I'm afraid I never figured out what was going amiss.
