Need help on regexp over new lines


Sathish VJ

Feb 6, 2012, 7:08:21 AM
to golan...@googlegroups.com
Hello,

I'm trying out certain things in regular expressions and attempted to do a web scraper as part of the example.  The one I tried is to pull out the filmography (year and film name) of an actor from wikipedia.  However, at the very beginning, I'm unable to scan over new lines.  Could anybody help me with the code below?

* Final goal: extract the year and film name using a single regex with appropriate submatches (not sure if it is possible though).
* But to start with, at least extract the content between the first occurrence of "Filmography.*Acting.*<table.*<caption>Film</caption>" and the corresponding occurrence of "</table>".

My setup: go version weekly.2012-01-27 11507 on Ubuntu 11.10 64bit.

thanks
Sathish

//code
package main

import (
    "fmt"
    "net/http"
    "io"
    "regexp"
    )

func main() {
    //get the webpage content as a byte array
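    c := new(http.Client)
    // NOTE: the request line is missing from the archived post. pageURL is a
    // placeholder for the Wikipedia page address, which was not preserved.
    resp, err := c.Get(pageURL)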
    if err != nil {
        fmt.Println("Error: ", err)
        return
    }
    fmt.Println(resp.Status)
    buf := make([]byte, 1024*1024)
    defer resp.Body.Close()
    io.ReadFull(resp.Body, buf)

    //some of many regexes I tried
    //beginRegStr := "<table.*<caption>Film</caption>.*<tr>.*<th>Year</th>"
    //beginRegStr := "Filmography.*Acting.*<caption>Film</caption>"
    //beginRegStr := "Filmography"
    //beginRegStr := "Filmography[.[:space:]]*</table>"
    //beginRegStr := "Filmography[.\n]*Acting"

    //create regex string
    beginRegStr := "Filmography.*\n.*\n.*Acting"
    beginRegex, err := regexp.Compile(beginRegStr)
    if err != nil {
        fmt.Println("Error in beginning regex string: ", err)
        return
    }

    pos := beginRegex.FindAllIndex(buf, 1)
    if len(pos) == 0 {
        fmt.Println("Nothing matched!")
        return
    }
    fmt.Println(string(buf[pos[0][0]:pos[0][1]]))
    fmt.Println(pos)
}

chris dollin

Feb 6, 2012, 7:53:46 AM
to golan...@googlegroups.com
On 6 February 2012 12:08, Sathish VJ <sath...@gmail.com> wrote:
> Hello,
>
> I'm trying out certain things in regular expressions and attempted to do a
> web scraper as part of the example.  The one I tried is to pull out the
> filmography (year and film name) of an actor from wikipedia.  However, at
> the very beginning, I'm unable to scan over new lines.  Could anybody help
> me with the code below?

(You might be better off parsing it as HTML and traversing the structure.
Regular expressions aren't really the tool for the job because of all the
nested structures.)

When I tried:

>     //create regex string
>     beginRegStr := "Filmography.*\n.*\n.*Acting"
>     beginRegex, err := regexp.Compile(beginRegStr)
>     if err != nil {
>         fmt.Println("Error in beginning regex string: ", err)
>         return
>     }
>
>     pos := beginRegex.FindAllIndex(buf, 1)
>     if len(pos) == 0 {
>         fmt.Println("Nothing matched!")
>         return
>     }
>     fmt.Println(string(buf[pos[0][0]:pos[0][1]]))
>     fmt.Println(pos)
> }

It worked: it did what you told it to do, which probably
isn't what you wanted. FindAllIndex will find you those things
that match your pattern, and the result it gave me:

Filmography"><span class="tocnumber">4</span> <span
class="toctext">Filmography</span></a>
<ul>
<li class="toclevel-2 tocsection-11"><a href="#Acting"><span
class="tocnumber">4.1</span> <span class="toctext">Acting
[[13984 14198]]

is indeed a chunk from the page that matches the pattern you
gave. And there are newlines in it: Go's regexp's `.` will happily match
a newline.

Chris

--
Chris "allusive" Dollin

Sathish VJ

Feb 6, 2012, 8:15:26 AM
to golan...@googlegroups.com
Actually Chris,

* I had to put \n to specifically match the new lines - "." is not matching the new lines.
* The regex right now only matches this specific instance including the counted number of "\n"s.  What I want is the text from this point, all the way up to the next </table> instance, which spans across many new line characters.

I take your point on the html parser and I can try that.  But still would like to find out how to make this work - parsing this html page was an easy example that I could send out over the internet, but the similar one I tried with a local file also has the same problem.  

regards
Sathish

Sathish VJ

Feb 6, 2012, 11:20:56 AM
to golan...@googlegroups.com
Hey Chris, I guess both of us had accidentally replied over mail rather than in the group post.  But just to close off this thread, what you told me worked - "To get . to match newlines you have to set the dot-matches-newline flag `s`, which you can do with the magic incantation (?s) at the start of your regexp."  Thanks.

Though the code below isn't perfect w.r.t. the results, it got me the regexp searching understanding that I wanted to get.  In case it is useful for others, I've copied the main part below.

thanks
Sathish

func main() {
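    c := new(http.Client)
    // NOTE: the request line is missing from the archived post. pageURL is a
    // placeholder for the Wikipedia page address, which was not preserved.
    resp, err := c.Get(pageURL)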
    if err != nil {
        fmt.Println("Error: ", err)
        return
    }
    buf := make([]byte, 1024*1024)
    defer resp.Body.Close()
    io.ReadFull(resp.Body, buf)

    beginRegStr := `(?Us)<td>([0-9]+)</td>.*<td><i><a href="/wiki.*title.*>(.*)</a></i></td>`
    beginRegex, _ := regexp.Compile(beginRegStr)

    pos := beginRegex.FindAllSubmatch(buf, 500)
    if pos == nil {
        fmt.Println("Nothing matched!")
        return
    }

    for i := 0; i < len(pos); i++ {
        fmt.Println(string(pos[i][1]), string(pos[i][2]))
    }
}
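
For anyone who wants to try the pattern above without fetching a page, here is a minimal sketch that runs it against an inline string. The markup in `sample` and the film titles are made up for illustration (the real Wikipedia table may look different), and the names `sample`, `rowRe` and `extractFilms` are not from the original code. It also swaps FindAllSubmatch(buf, 500) for FindAllStringSubmatch with -1, which just means "no limit on the number of matches".

```go
package main

import (
	"fmt"
	"regexp"
)

// sample is invented markup standing in for two filmography rows.
const sample = `<tr>
<td>1999</td>
<td><i><a href="/wiki/Film_One" title="Film One">Film One</a></i></td>
</tr>
<tr>
<td>2001</td>
<td><i><a href="/wiki/Film_Two" title="Film Two">Film Two</a></i></td>
</tr>`

// rowRe is the pattern from the post. The s flag lets `.` match "\n";
// the U flag makes every repetition non-greedy, so each `.*` stops at the
// first place the rest of the pattern can match.
var rowRe = regexp.MustCompile(`(?Us)<td>([0-9]+)</td>.*<td><i><a href="/wiki.*title.*>(.*)</a></i></td>`)

// extractFilms returns one {year, title} pair per matched row.
func extractFilms(page string) [][2]string {
	var films [][2]string
	for _, m := range rowRe.FindAllStringSubmatch(page, -1) {
		// m[0] is the whole match; m[1] and m[2] are the two submatches.
		films = append(films, [2]string{m[1], m[2]})
	}
	return films
}

func main() {
	for _, f := range extractFilms(sample) {
		fmt.Println(f[0], f[1])
	}
	// Prints:
	// 1999 Film One
	// 2001 Film Two
}
```

Without the U flag the first `.*` would be greedy and run on to the last row it can reach, so the two rows would collapse into a single match.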

spir

Feb 6, 2012, 12:53:49 PM
to golan...@googlegroups.com
On 02/06/2012 01:53 PM, chris dollin wrote:
> Go's regexp's `.` will happily match
> a newline.
Good to know. Isn't there a multiline mode to enable '\n' matching by
'.' (as in many other regexp langs)?

Denis

chris dollin

Feb 6, 2012, 1:00:31 PM
to spir, golan...@googlegroups.com
On 6 February 2012 17:53, spir <denis...@gmail.com> wrote:
> On 02/06/2012 01:53 PM, chris dollin wrote:
>>
>> Go's regexp's `.` will happily match
>> a newline.

(I was not entirely correct as it turns out ...)

> Good to know. Isn't there a multiline mode to enable '\n' matching by '.'
> (as in many other regexp langs)?

(?s) enables allow-newline-for-dot; I think it's scoped to (...) brackets
but I'd expect to park it at the front of whatever RE I was using that
didn't want to care about matching over newlines.
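
A minimal sketch of the flag behaviour described above, using only the standard regexp package. The helper name `matches` and the two-line test string are made up for illustration. It checks the default, the (?s) prefix, and the scoped (?s:...) form.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in text.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	text := "Filmography\nActing"

	// By default `.` does not match a newline.
	fmt.Println(matches(`Filmography.*Acting`, text)) // false

	// (?s) at the front applies to the whole expression.
	fmt.Println(matches(`(?s)Filmography.*Acting`, text)) // true

	// (?s:...) limits the flag to the group it opens.
	fmt.Println(matches(`Filmography(?s:.*)Acting`, text)) // true

	// Outside that group `.` is back to the default.
	fmt.Println(matches(`(?s:Filmography).Acting`, text)) // false
}
```

So parking (?s) at the front is the simplest choice when the whole expression should ignore line boundaries, and the scoped form is there when only part of it should.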
