RegEx for extracting text strings from HTML

treid

Aug 31, 2010, 8:15:26 PM

to TextSoap

Hi

I want to extract data from some html files and am looking to TextSoap
to help me do this.

Essentially, all the data in the HTML has been tagged with IDs, e.g.:

Product A
Some text describing the product
Manufacturer A

And I would like to extract the text in columns separated by tabs, e.g.:

Product A <tab> Some text describing the product <tab> Manufacturer A

As a newbie to TextSoap and RegEx, I'm hoping someone can lend some
advice on possible ways to go about this. As I understand it,
RegEx is the way forward here, but to be honest I'm a little lost
on how to apply it to the examples above.

Ultimately, I want to batch process a number of HTML files to extract
the data into a table. Some advice on how to do this would be great as
well.

Thanks

Mark Munz

Sep 2, 2010, 2:43:26 AM

to text...@googlegroups.com

This is sort of a quick draft, but should give you something to work with:

You'll need 4 actions:

1. Extract Text
Use the expression: <\s*span\s+id="product-\w+">.*?<\s*/\s*span\s*>

2. Find/Replace Text - Regular Expression
Find: <\s*span\s+id="product-name">(.*?)<\s*/\s*span\s*>\s*\n
Replace: \1\t

3. Find/Replace Text - Regular Expression
Find: <\s*span\s+id="product-description">(.*?)<\s*/\s*span\s*>\s*\n
Replace: \1\t

4. Find/Replace Text - Regular Expression
Find: <\s*span\s+id="product-manufacturer">(.*?)<\s*/\s*span\s*>\s*\n
Replace: \1\n

Don't let this scare you too much. The \s* just means "zero or more
whitespace characters here", and it needs to go anywhere a space could
potentially appear. The first action matches any span
with an id that starts with product-(someword).
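
To see what the \s* pieces buy you, here is a minimal sketch in Python. It uses Python's re module as a stand-in for TextSoap's regex engine (an assumption; the two may differ in details), and the sample strings are made up for illustration.

```python
import re

# Same shape as the Extract Text expression: any span whose id starts with "product-".
SPAN = re.compile(r'<\s*span\s+id="product-\w+">.*?<\s*/\s*span\s*>')

# Hypothetical sample lines, not taken from the original files.
samples = [
    '<span id="product-name">Product A</span>',        # tidy markup
    '< span   id="product-name">Product A< / span >',  # stray spaces inside the tags
    '<span id="price">9.99</span>',                    # id does not start with product-
]

# True where the expression finds a product span on the line.
matches = [bool(SPAN.search(s)) for s in samples]
print(matches)
```

The first two samples match even though the second has extra spaces inside the tags; the third is skipped because its id does not start with product-.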

It wasn't clear if these spans were embedded in other markup or on
their own lines. If they're embedded, you may need to take off the
\s*\n at the end of the find/replace expressions.
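
A small sketch of the embedded case, again using Python's re module as a stand-in for TextSoap (an assumption). The table-cell markup is invented for illustration, and the lazy .*? is my choice here so that a capture stops at the nearest closing tag.

```python
import re

# Hypothetical line where the spans sit inside other markup instead of on their own lines.
line = ('<td><span id="product-name">Product A</span></td>'
        '<td><span id="product-description">Some text</span></td>')

# The find expression with the trailing \s*\n, and the same expression without it.
with_newline = re.compile(r'<\s*span\s+id="product-name">(.*?)<\s*/\s*span\s*>\s*\n')
without_newline = re.compile(r'<\s*span\s+id="product-name">(.*?)<\s*/\s*span\s*>')

# No closing span tag is followed by a newline here, so the first pattern finds nothing.
found_with = with_newline.search(line)
# Dropping \s*\n lets the span match wherever it sits on the line.
found_without = without_newline.search(line).group(1)

print(found_with)
print(found_without)
```

With the trailing \s*\n the search comes back empty, because the closing tag is followed by more markup rather than a line break.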

The three find/replace actions each find one span, capture the
text inside it, and replace the whole span with that text plus a tab.
The last one replaces with the text plus a newline for manufacturer,
since it is the last item on the line.
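
The four actions can be sketched end to end in Python. This is only an approximation of what TextSoap does: it assumes Extract Text outputs each match on its own line, it uses Python's re module rather than TextSoap's engine, and the sample HTML and the helper name span are hypothetical.

```python
import re

# Hypothetical input, modelled on the ids used in the expressions above.
html = """<html><body>
<h1>Catalog</h1>
<span id="product-name">Product A</span>
<span id="product-description">Some text describing the product</span>
<span id="product-manufacturer">Manufacturer A</span>
</body></html>
"""

def span(field):
    """Build the span expression for one id suffix (helper name is illustrative)."""
    return rf'<\s*span\s+id="product-{field}">(.*?)<\s*/\s*span\s*>'

# Action 1: keep only the product-* spans, one per line (assumed Extract Text behaviour).
extracted = "\n".join(m.group(0) for m in re.finditer(span(r"\w+"), html)) + "\n"

# Actions 2-4: swap each span for its captured text plus a tab or a newline.
text = re.sub(span("name") + r"\s*\n", "\\1\t", extracted)
text = re.sub(span("description") + r"\s*\n", "\\1\t", text)
text = re.sub(span("manufacturer") + r"\s*\n", "\\1\n", text)

print(repr(text))
```

Everything outside the product spans (the heading, the body tags) is dropped by the first step, and the three replacements collapse the remaining lines into one tab-separated row.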

I would test this out on a few files, then use Automator to create a
workflow with a Clean Text Files action that calls your custom cleaner,
and then create a droplet from it. You can then drop your HTML files
on the droplet. I always recommend working on a copy of your original
files to ensure no data is lost.
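
If you ever want to do the batch step outside TextSoap and Automator, a rough Python equivalent might look like the sketch below. It is a different technique from the droplet workflow, not a description of it. It assumes one product per file and the three ids used above; the function names, folder name, and output file name are all hypothetical. The originals are only read, never modified.

```python
import re
from pathlib import Path

# Id suffixes assumed from the expressions in this thread.
FIELDS = ("name", "description", "manufacturer")

def html_to_row(html):
    """Return the three product fields as one tab-separated line."""
    values = []
    for field in FIELDS:
        m = re.search(rf'<\s*span\s+id="product-{field}">(.*?)<\s*/\s*span\s*>', html)
        # A missing field becomes an empty column so the columns stay aligned.
        values.append(m.group(1).strip() if m else "")
    return "\t".join(values)

def convert_folder(src_dir, out_path):
    """Read every .html file in src_dir and write one row per file to out_path."""
    rows = [html_to_row(p.read_text(encoding="utf-8"))
            for p in sorted(Path(src_dir).glob("*.html"))]
    Path(out_path).write_text("\n".join(rows) + "\n", encoding="utf-8")
    return len(rows)

# Example call (paths are placeholders):
# convert_folder("html_copies", "products.tsv")
```

The resulting file can be opened in a spreadsheet as tab-separated text. Note that the dot in these expressions does not cross line breaks, so a span whose text wraps onto a second line would come out as an empty column.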

Hope that helps. I do realize that regular expressions can quickly
overwhelm you. But understanding some of the common tasks folks want
to do with text processing may lead to some new actions down the road
to try and address those more common tasks.

Mark
