Parts List extraction

Jeffrey Dixon

Oct 24, 2014, 5:34:45 PM

to user--petap...@googlegroups.com

I think that it should be very doable to extract a reasonably accurate parts list from the HTML text of a US patent document, which is contained in the element html > body > coma (everything but the abstract).

An example regular expression to start from: \b(\w+)\b\s+\b(\d+)\b. This is obviously very rudimentary. Some refinements could be added: picking up multiple comma-separated reference numerals instead of just the first one, as in "fasteners 20, 22"; keeping group 1 from being the word "and", as in "FIGS. 1 and 2"; and not capturing claim references like "the device of claim 6". None of that should be too hard. A slightly trickier coding challenge may be capturing multi-word parts, as in "upper platen 20" and "lower platen 22," but a couple of promising approaches immediately come to mind. For instance, you could search backwards from a match through successive preceding words, and keep collecting them as long as they appear before every instance of that particular match. Or you could search backwards from the first instance of each match until you hit the indefinite article "a"/"an"; everything between the article and the number is your part, e.g., "a flexible, heat-resistant, non-stick release sheet 24 composed of PTFE or similar material."
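As a rough illustration of the refinements and the a/an idea described above, here is a minimal Python sketch. The names STOP_WORDS, PART_RE, extract_parts, and full_part_name are made up for this example, the stop-word list is only a guess at what would need excluding, and limiting the backwards search to the current sentence is an added assumption, not something Petapator is known to do.

```python
import re

# Assumed stop list: words that often precede a number without naming a part.
STOP_WORDS = {"and", "or", "to", "of", "claim", "claims",
              "fig", "figs", "figure", "figures"}

# One word followed by one or more comma-separated numerals ("fasteners 20, 22").
PART_RE = re.compile(r"\b([A-Za-z]\w*)\s+(\d+(?:\s*,\s*\d+)*)\b")


def extract_parts(text):
    """Return {numeral: word} for each word/numeral pair, skipping stop words."""
    parts = {}
    for match in PART_RE.finditer(text):
        word = match.group(1)
        if word.lower() in STOP_WORDS:
            continue
        for numeral in re.findall(r"\d+", match.group(2)):
            # Keep the word seen at the first occurrence of each numeral.
            parts.setdefault(numeral, word)
    return parts


def full_part_name(text, numeral):
    """Return the text between the nearest preceding 'a'/'an' and the first
    occurrence of the numeral, or None if no article is found."""
    first = re.search(r"\b%s\b" % re.escape(numeral), text)
    if first is None:
        return None
    # Assumed heuristic: only look back as far as the last period.
    sentence_start = text.rfind(".", 0, first.start()) + 1
    before = text[sentence_start:first.start()]
    articles = list(re.finditer(r"\b(?:a|an)\s+", before, re.IGNORECASE))
    if not articles:
        return None
    return before[articles[-1].end():].strip()
```

For a sentence such as "a flexible, heat-resistant, non-stick release sheet 24 composed of PTFE", extract_parts would report only "sheet" for numeral 24, while full_part_name would return the whole phrase after the article. The article search will misfire when the first occurrence of a number is not a part reference, so it would likely need to be combined with the stop-word filtering.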

Anyway, an image of the output from regex101.com for matches of the simple regular expression \b(\w+)\b\s+\b(\d+)\b found in the HTML text of USP 6,293,874 is attached. You can see that it does a halfway decent job of finding parts, and it would be helpful even just to provide a list of these matches on Petapator, so that you don't have to manually transcribe part names onto the figures, or keep flipping back and forth to see what they are. The split screen view helps for that if you are viewing the patent on the web with Petapator enabled, so that at least you can keep the figure in view while you search for the part name, but a parts list would be better and useful for reviewing the PDF as well.

PartsListRegEx.JPG

Kenneth Yip

Oct 25, 2014, 10:29:36 AM

Hi Jeffrey,

Thanks for the suggestion and the example regular expression. I tried this before and even used a frequency count to check that the parts list was correct. It can be done.
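The thread does not say how the frequency count was implemented. One possible reading, sketched in Python below, is to count which word most often precedes each numeral and keep only pairs seen at least a few times. The names PAIR_RE and parts_by_frequency and the min_count threshold are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# The simple word/numeral pattern from the first post, restricted to
# words that start with a letter.
PAIR_RE = re.compile(r"\b([A-Za-z]\w*)\s+(\d+)\b")


def parts_by_frequency(text, min_count=2):
    """Map each numeral to the word that most often precedes it.

    Pairs whose most common word appears fewer than min_count times are
    dropped as probable noise (e.g. "claim 6", "in 20").
    """
    counts = defaultdict(Counter)
    for word, numeral in PAIR_RE.findall(text):
        counts[numeral][word.lower()] += 1

    parts = {}
    for numeral, words in counts.items():
        word, count = words.most_common(1)[0]
        if count >= min_count:
            parts[numeral] = word
    return parts
```

A part that is mentioned only once in the description would be dropped by this filter, so the threshold is a trade-off between noise and completeness.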

Where should I display the parts list? On the individual patent page or the search results page?

Thanks

Kenneth

Jeffrey Dixon

Oct 25, 2014, 3:08:04 PM

Hi Kenneth,

I think both - in the search results you could make it another "Show" option you can toggle on and off like Word Cloud, Figures, and Details.

In the individual patent page, you could have a separate parts list button alongside your "F AS E P G." From there you would preferably have it pop up in a separate window that a user can minimize/restore so that it doesn't have to block the text or figures when the user is not looking at it. The option to print the parts list separately and/or download it as a .txt file or similar would also be very helpful.

Thanks,

Jeff

--
You received this message because you are subscribed to a topic in the Google Groups "User Forum Petapator / Aspator" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/user--petapator-aspator/P3QNzQvxjPc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to user--petapator-a...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kenneth Yip

Oct 25, 2014, 10:31:41 PM

Hi Jeff,

Your wish is my command. I will put this feature in the next release. But I don't have time now. Will probably need to wait until Christmas. Is this fine with you?

Thanks

Kenneth

Jeffrey Dixon

Oct 26, 2014, 11:57:35 AM

Sure, thanks for listening!