Using regular expressions to extract text between HTML tags


Zeynel

Nov 15, 2009, 8:51:35 PM
to Regex
Hello,

I am planning to crawl a large number of lawyer bios by using 80legs
(http://www.80legs.com/). I am trying to understand whether regular
expressions can be used to extract the data inside the <title> </title>
tags and arrange it as CSV rows.

I know that we can extract the text inside the tags by using regex as
explained here http://www.regular-expressions.info/examples.html but
is it possible to extract the lawyer name and firm name like this

Gregg M Galardi, Skadden Arps
John M. Simpson, Fulbright & Jaworski
Todd Wagner, Sidley Austin LLP

from the below data by using regular expressions:

<title>Gregg M Galardi - Skadden, Arps</title>
<title> John M. Simpson - Partner - The International Law Firm of
Fulbright & Jaworski</title>
<title>Sidley Austin LLP - Our People - Todd Wagner</title>

Thank you.

eugeny....@gmail.com

Nov 16, 2009, 1:09:09 PM
to Regex
That is an example of a task that is easier to solve without
regexes.
I see the following difficulties:
1) Sometimes a company name follows <title>, and sometimes a lawyer name
does.
The logic for distinguishing them is not 100% clear even to me, a
human being.
Needless to say, it will be more difficult to explain to a machine. Can
you explain that logic?

2) You want commas to be removed from inside the company name.
Example
input: Skadden, Arps
output: Skadden Arps
Can there be multiple commas in the input, or only one?

3) You want phrases like "The International Law Firm of " to be
removed.
Can you provide a list of such strings to cut?

These are questions you need to answer for yourself before applying any
method, be it regular expressions or not.

Zeynel

Nov 16, 2009, 8:29:25 PM
to Regex
Thank you for your answer. I can supply a list of law firms. There are
only 250 firms but about 100,000 lawyers. Would a list of firms help?


eugeny....@gmail.com

Nov 16, 2009, 10:44:45 PM
to Regex


On 17 Nov, 05:29, Zeynel <azeyn...@gmail.com> wrote:
> Thank you for your answer. I can supply a list of law firms. There are
> only 250 firms but about 100,000 lawyers. Would a list of firms help?
>
Hm... It would require a regular expression like
(Number 1 Law Firm|The Second to None Lawyers|Etc. Law|...|250th
Lawyers)
Isn't that going to be too lengthy? It will work, but it will be
a) hard to edit
b) hard to read

This looks more like a small database task.
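If the firm list is kept in a separate file or table, the long alternation does not have to be written or edited by hand; it can be generated. A minimal Python sketch of that idea follows. The three firm names are only the ones from the examples in this thread, and FIRMS, firm_pattern and find_firm are hypothetical names made up for illustration.

```python
import re

# Hypothetical sample; the real list would hold all 250 firm names.
FIRMS = ["Skadden, Arps", "Fulbright & Jaworski", "Sidley Austin LLP"]

# re.escape protects characters such as "." that are special in a regex.
# Longest names come first so a longer name wins over a shorter prefix.
firm_pattern = re.compile(
    "|".join(re.escape(name) for name in sorted(FIRMS, key=len, reverse=True))
)

def find_firm(title_text):
    """Return the first known firm name found in the text, or None."""
    match = firm_pattern.search(title_text)
    return match.group(0) if match else None
```

The generated pattern is still long, but nobody has to read or edit it: only the list is maintained.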

How I would do it:
1) Grab the content from within the tags using a regex.
2) Replace " - " with tabs.
3) Select everything and paste it into Excel or any other spreadsheet
program.
The tabs will ensure that it lands in separate columns.
4) In the third column, drag down a COUNTIF formula that checks whether
the text in the neighbouring column is present in the law firm list. If
the formula produces 1 or more, swap the contents of the name cell and
the company name cell.
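For readers who would rather script these steps than use a spreadsheet, here is a rough Python sketch of the same idea. It is not the Excel procedure itself: the COUNTIF lookup is replaced by a substring test against a firm list, and FIRMS, split_titles and name_and_firm are hypothetical names. It also assumes the name and the firm are always the first and last pieces of the title.

```python
import re

# Hypothetical firm list, written the way the names appear in the titles.
FIRMS = {"Skadden, Arps", "Fulbright & Jaworski", "Sidley Austin LLP"}

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)

def split_titles(html):
    """Steps 1-2: grab each title and split it on ' - ' into columns."""
    rows = []
    for content in TITLE_RE.findall(html):
        text = " ".join(content.split())  # collapse line breaks and extra spaces
        rows.append(text.split(" - "))
    return rows

def name_and_firm(columns):
    """Step 4: put the lawyer name first, whichever side the firm was on."""
    first, last = columns[0], columns[-1]
    if any(firm in first for firm in FIRMS):
        first, last = last, first  # the firm came first, so swap
    return first, last
```

Calling name_and_firm on each row from split_titles gives (name, firm) pairs; prefixes such as "The International Law Firm of " and the commas are left in place and would still need a separate cleanup.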

Eugeny Sattler

Nov 18, 2009, 5:30:52 AM
to Regex
> How I would do it.
> 1) Grab content from within tags using regex
> 2) Replace " - " with tabs
> 3) select Everything and paste into Excel or any other spreadsheet
> processor.
> The tabs will ensure that it gets into two columns.
> 4) In the third column, drag down the countIF formula checking whether
> text in the neighbouring column is present in law firms list. If
> formula produces "1" or more, swap contents of name&surname cell and
> company name cell.

Here are the regex and the Excel file that illustrate my approach.

The regex (in free-spacing mode and in "dot matches newlines" mode):

<title>
# \x20 is a literal space; plain spaces are ignored in free-spacing mode
((?:(?!</title>)(?!\x20-\x20).)*?) #column 1
(\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?) #column 2
(\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)? #column 3 (optional)
</title>
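In case it helps anyone using Python, the same pattern can be tried with the re module, where re.VERBOSE is free-spacing mode and re.DOTALL is "dot matches newlines". This is only a sketch; TITLE_COLUMNS and title_columns are names made up here. Because free-spacing mode ignores plain spaces in the pattern, every literal space is written as \x20, including inside the lookaheads.

```python
import re

TITLE_COLUMNS = re.compile(
    r"""
    <title>
    ((?:(?!</title>)(?!\x20-\x20).)*?)              # column 1
    (\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)     # column 2
    (\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)?    # column 3 (optional)
    </title>
    """,
    re.VERBOSE | re.DOTALL,
)

def title_columns(html):
    """Return one list of raw column strings per matching <title> element."""
    return [
        [group for group in match.groups() if group is not None]
        for match in TITLE_COLUMNS.finditer(html)
    ]
```

Columns 2 and 3 still carry their leading " - " at this point. Titles with no " - " at all, or with more than three parts, do not match this pattern.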

As the next step, I apply this search-and-replace operation to each line that the
above regex produces:
I search for (\r\n| - |, ) and replace each match with a space.
Note 1: I assume that " - " cannot occur in a lawyer name or in a company name.
Note 2: I assume that every comma is to be deleted.
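The same search and replace could look roughly like this in Python. It is a sketch only: CLEANUP and clean are hypothetical names, and accepting a bare "\n" as well as "\r\n" is a small extension that is not part of the pattern given above.

```python
import re

# Line break, " - " separator, or comma followed by a space.
# \r?\n also accepts a bare "\n" (an assumption beyond the original pattern).
CLEANUP = re.compile(r"\r?\n| - |, ")

def clean(text):
    """Replace each match with one space and trim the ends."""
    return CLEANUP.sub(" ", text).strip()
```

As with the original pattern, this relies on both notes above: " - " never occurs inside a name, and every comma can go.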

So I get 3 columns (based on the example given; the columns are tab-separated, shown here aligned with spaces):

Gregg M Galardi    Skadden Arps
John M. Simpson    Partner       The International Law Firm of Fulbright & Jaworski
Sidley Austin LLP  Our People    Todd Wagner

On this I do some Microsoft Excel manipulations, which are shown in
the attachments.
They bring us to the desired result.
Zeynel.xls
Zeynel.jpg

Zeynel

Nov 20, 2009, 12:00:16 AM
to Regex
This is great! Thank you very much. I think it will save me a lot of
time. I also discovered Scrapy (http://scrapy.org/), which uses XPath
selectors to extract text from HTML. It allows regexes too. I am still
trying to make the spider work. I may need to create more than one
spider for websites with different structures. Or I may test with a few
pages from each law firm's site, then apply methods like this and see
what I get. I am still learning, so I appreciate your help.