Using regular expression to extract text between HTML tags
Zeynel  
From: Zeynel <azeyn...@gmail.com>
Date: Sun, 15 Nov 2009 17:51:35 -0800 (PST)
Local: Sun, Nov 15 2009 8:51 pm
Subject: Using regular expression to extract text between HTML tags
Hello,

I am planning to crawl a large number of lawyer bios using 80legs
(http://www.80legs.com/). I am trying to understand whether regular
expressions can be used to extract the data inside the <title></title>
tags and arrange it as CSV rows.

I know that we can extract the text inside the tags by using regex as
explained here http://www.regular-expressions.info/examples.html but
is it possible to extract the lawyer name and firm name like this

    Gregg M Galardi, Skadden Arps
    John M. Simpson, Fulbright & Jaworski
    Todd Wagner, Sidley Austin LLP

from the below data by using regular expressions:

<title>Gregg M Galardi - Skadden, Arps</title>
<title> John M. Simpson - Partner - The International Law Firm of
Fulbright & Jaworski</title>
<title>Sidley Austin LLP - Our People - Todd Wagner</title>

Thank you.


Eugeny.Sattler@gmail.com  
From: "Eugeny.Satt...@gmail.com" <eugeny.satt...@gmail.com>
Date: Mon, 16 Nov 2009 10:09:09 -0800 (PST)
Local: Mon, Nov 16 2009 1:09 pm
Subject: Re: Using regular expression to extract text between HTML tags
That is an example of a task that is easier to solve without
regexes.
I see the following difficulties:
1) Sometimes the company name follows <title>, and sometimes the lawyer's
name does.
The logic for telling them apart is not 100% clear even to me, a
human being.
Needless to say, it will be harder to explain to a machine. Can
you explain that logic?

2) You want commas removed from inside the company name.
Example:
input: Skadden, Arps
output: Skadden Arps
Can there be multiple commas in the input, or only one?

3) You want phrases like "The International Law Firm of " to be
removed.
Can you provide a list of such strings to cut?

These are questions you need to answer yourself before applying any
method, be it regular expression or not.


Zeynel  
From: Zeynel <azeyn...@gmail.com>
Date: Mon, 16 Nov 2009 17:29:25 -0800 (PST)
Local: Mon, Nov 16 2009 8:29 pm
Subject: Re: Using regular expression to extract text between HTML tags
Thank you for your answer. I can supply a list of law firms. There are
only 250 firms but about 100,000 lawyers. Would a list of firms help?



Eugeny.Sattler@gmail.com  
From: "Eugeny.Satt...@gmail.com" <eugeny.satt...@gmail.com>
Date: Mon, 16 Nov 2009 19:44:45 -0800 (PST)
Local: Mon, Nov 16 2009 10:44 pm
Subject: Re: Using regular expression to extract text between HTML tags

On 17 Nov, 05:29, Zeynel <azeyn...@gmail.com> wrote:

> Thank you for your answer. I can supply a list of law firms. There are
> only 250 firms but about 100,000 lawyers. Would a list of firms help?

Hm... It would require a regular expression like
(Number 1 Law Firm|The Second to None Lawyers|Etc. Law|...|250th Lawyers)
Isn't that going to be too long? It would work, but it would be
a) hard to edit
b) hard to read
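
A minimal sketch of building that alternation from a list instead of typing it by hand, which eases the editing and readability concerns. Python is an assumption here (the thread names no language), and the three firm names are placeholders taken from the sample titles, not the real list of 250 firms. `FIRMS`, `FIRM_RE`, and `find_firm` are hypothetical names.

```python
import re

# Hypothetical firm list; in practice it would be loaded from a file.
FIRMS = ["Skadden, Arps", "Fulbright & Jaworski", "Sidley Austin LLP"]

# Escape each name so that characters such as "." are matched literally.
# Longer names go first so that, at the same position, a short name
# cannot win over a longer name that starts with it.
FIRM_RE = re.compile(
    "|".join(re.escape(name) for name in sorted(FIRMS, key=len, reverse=True))
)


def find_firm(title_text):
    """Return the first known firm name found in title_text, or None."""
    match = FIRM_RE.search(title_text)
    return match.group(0) if match else None
```

The list stays easy to read and edit, and the long pattern is generated rather than maintained.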

Looks more like a small database task?

How I would do it:
1) Grab the content from within the tags using a regex.
2) Replace " - " with tabs.
3) Select everything and paste it into Excel or any other spreadsheet
program.
The tabs will ensure that it gets into two columns.
4) In the third column, drag down a COUNTIF formula that checks whether
the text in the neighbouring column is present in the law firm list. If
the formula produces 1 or more, swap the contents of the name cell and
the company name cell.
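
The same four steps can be sketched without a spreadsheet. This is only an illustration under stated assumptions: Python is not named in the thread, `FIRMS` is a hypothetical three-entry stand-in for the real firm list (with commas already removed), and the rule "if the firm is in the first field, the lawyer is in the last field, otherwise in the first" is a guess based on the three sample titles.

```python
import re

# Hypothetical stand-in for the real list of firm names (commas removed).
FIRMS = ["Skadden Arps", "Fulbright & Jaworski", "Sidley Austin LLP"]

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def name_and_firm(html):
    """Return (lawyer, firm) for the first <title> in html, or None."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Step 1: take the title text, drop commas, collapse whitespace and newlines.
    text = " ".join(match.group(1).replace(",", " ").split())
    # Steps 2-3: split on " - " (a list plays the role of spreadsheet columns).
    fields = text.split(" - ")
    if len(fields) < 2:
        return None
    # Step 4: the field containing a known firm is the company column
    # (this is the COUNTIF check); pick the lawyer field accordingly.
    for index, field in enumerate(fields):
        firm = next((known for known in FIRMS if known in field), None)
        if firm is not None:
            lawyer = fields[-1] if index == 0 else fields[0]
            return lawyer, firm
    return None
```

Titles that mention no known firm return `None`, so they can be collected and reviewed by hand.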


Eugeny Sattler  
From: Eugeny Sattler <eugeny.satt...@gmail.com>
Date: Wed, 18 Nov 2009 14:30:52 +0400
Local: Wed, Nov 18 2009 5:30 am
Subject: Re: Using regular expression to extract text between HTML tags

> How I would do it.
> 1) Grab content from within tags using regex
> 2) Replace " - " with tabs
> 3) select Everything and paste into Excel or any other spreadsheet
> processor.
> The tabs will ensure that it gets into two columns.
> 4) In the third column, drag down the COUNTIF formula checking whether
> text in the neighbouring column is present in law firms list. If
> formula produces "1" or more, swap contents of name&surname cell and
> company name cell.

Here is the regex and the excel file that illustrate my approach.

The regex (in free-spacing mode and in "dot matches newlines" mode):

<title>
         ((?:(?!</title>)(?!\x20-\x20).)*?)        # column 1
(\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)        # column 2
(\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)?       # column 3 (optional)
</title>
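
A sketch of running this pattern in Python, which is an assumption since the thread does not say which tool was used. Free-spacing mode corresponds to `re.VERBOSE` and "dot matches newlines" to `re.DOTALL`. Because free-spacing mode ignores literal spaces in the pattern, the sketch writes the spaces inside the lookahead as `\x20` too. `TITLE_COLUMNS` and `split_title` are hypothetical names.

```python
import re

TITLE_COLUMNS = re.compile(
    r"""
    <title>
             ((?:(?!</title>)(?!\x20-\x20).)*?)   # column 1
    (\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)   # column 2
    (\x20-\x20(?:(?!</title>)(?!\x20-\x20).)*?)?  # column 3 (optional)
    </title>
    """,
    re.VERBOSE | re.DOTALL,
)


def split_title(html):
    """Return the captured columns of the first matching <title>, or None."""
    match = TITLE_COLUMNS.search(html)
    if match is None:
        return None
    # Columns 2 and 3 still carry their leading " - " separator;
    # an unmatched optional column 3 is dropped.
    return [group for group in match.groups() if group is not None]
```

A title with fewer than two or more than three " - "-separated fields does not match at all, so `split_title` returns `None` for it.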

As the next step, I apply this search-and-replace operation to each line that the
above regex produces:
I search for (\r\n| - |, ) and replace it with a space.
Note 1: I assume that " - " cannot occur in a lawyer name or in a company name.
Note 2: I assume that every comma is to be deleted.
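
The search-and-replace step could look like this in Python (an assumed language; `CLEANUP` and `clean_field` are hypothetical names). One small deviation from the pattern above, made deliberately: the sketch uses `\r?\n` so that Unix line endings are handled as well as Windows ones.

```python
import re

# Same alternation as above, except that "\r?\n" also accepts a bare "\n".
CLEANUP = re.compile(r"\r?\n| - |, ")


def clean_field(field):
    """Replace line breaks, " - " separators and ", " with a single space."""
    return CLEANUP.sub(" ", field)
```

For example, `clean_field(" - Skadden, Arps")` gives `" Skadden Arps"`; the leading space can be removed with `.strip()` if it is not wanted.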

So I get 3 columns (based on the example given; tabs invisible, sorry)

Gregg M Galardi      Skadden Arps
John M. Simpson      Partner         The International Law Firm of Fulbright & Jaworski
Sidley Austin LLP    Our People      Todd Wagner

With this I do some Microsoft Excel manipulations, which are shown in
the attachment.
They bring us to the desired result.

Attachments: Zeynel.xls (28K), Zeynel.jpg (231K)

Zeynel  
From: Zeynel <azeyn...@gmail.com>
Date: Thu, 19 Nov 2009 21:00:16 -0800 (PST)
Local: Fri, Nov 20 2009 12:00 am
Subject: Re: Using regular expression to extract text between HTML tags
This is great! Thank you very much. I think it will save me a lot of
time. I also discovered Scrapy (http://scrapy.org/), which uses XPath
selectors to extract text from HTML. It allows regex too. I am still
trying to make the spider work. I may need to create more than one
spider for websites with different structures. Or I may test with a few
pages from each law firm's site, then apply methods like this and see
what I get. I am still learning, so I appreciate your help.
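
As a rough illustration of parser-based rather than regex-based extraction, here is a sketch that pulls the <title> text with Python's standard-library `html.parser`. It is not Scrapy and does not use XPath; it only shows the general idea of letting an HTML parser find the tag. `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Join the collected chunks and collapse whitespace and newlines.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the whitespace-normalised <title> text of an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

The returned string can then be split on " - " and checked against the firm list as described earlier in the thread.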


