Adding position to Tags

Greg Baker

Sep 20, 2009, 1:57:18 AM
to beautifulsoup
I have recently started playing with BeautifulSoup. Thanks all for
the hard work that went into it.

For my project, I wanted the position of the tags in the original
source file (i.e. the line/column of the start tags). It turns out
that this was a fairly minor change to the source. What I did may not
be the best way to accomplish this, but it does the job. Here's a
diff to 3.1.0.1; do with it what you will:

diff BeautifulSoup-3.1.0.1/BeautifulSoup.py BeautifulSoup-mine/BeautifulSoup.py
1015c1015
< self.soup.extractCharsetFromMeta(attrs)
---
> self.soup.extractCharsetFromMeta(attrs, self.getpos())
1017c1017
< self.soup.unknown_starttag(name, attrs)
---
> self.soup.unknown_starttag(name, attrs, self.getpos())
1397c1397
< def unknown_starttag(self, name, attrs, selfClosing=0):
---
> def unknown_starttag(self, name, attrs, pos, selfClosing=0):
1414a1415
> tag.position = pos
1443,1444c1444,1445
< def extractCharsetFromMeta(self, attrs):
< self.unknown_starttag('meta', attrs)
---
> def extractCharsetFromMeta(self, attrs, pos):
> self.unknown_starttag('meta', attrs, pos)
1553c1554
< def extractCharsetFromMeta(self, attrs):
---
> def extractCharsetFromMeta(self, attrs, pos):
1596c1597
< tag = self.unknown_starttag("meta", attrs)
---
> tag = self.unknown_starttag("meta", attrs, pos)
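For anyone who wants to see what getpos() reports before applying the patch, here is a minimal sketch. It assumes Python 3's standard-library html.parser.HTMLParser rather than the BeautifulSoup 3.1.0.1 builder that the diff modifies, and the class name StartTagPositions is hypothetical.

```python
from html.parser import HTMLParser


class StartTagPositions(HTMLParser):
    """Hypothetical helper that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []  # list of (tag name, (line, column))

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, column); lines count from 1, columns from 0.
        self.positions.append((tag, self.getpos()))


parser = StartTagPositions()
parser.feed("<html>\n  <p>hi</p>\n</html>")
parser.close()
print(parser.positions)  # [('html', (1, 0)), ('p', (2, 2))]
```

The position is read inside the start-tag handler, which is the same moment at which the patch passes self.getpos() on to unknown_starttag.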

Greg Baker

Sep 23, 2009, 4:31:35 PM
to beautifulsoup
Here is an update to that patch: it also adds a position to
NavigableString instances. As far as I know, positions are now
recorded everywhere they can go:

1015c1015
< self.soup.extractCharsetFromMeta(attrs)
---
> self.soup.extractCharsetFromMeta(attrs, position=self.getpos())
1017c1017
< self.soup.unknown_starttag(name, attrs)
---
> self.soup.unknown_starttag(name, attrs, position=self.getpos())
1020c1020
< self.soup.unknown_endtag(name)
---
> self.soup.unknown_endtag(name, position=self.getpos())
1023c1023
< self.soup.handle_data(content)
---
> self.soup.handle_data(content, position=self.getpos())
1030c1030
< self.soup.endData(subclass)
---
> self.soup.endData(subclass, position=self.getpos())
1279a1283
> self.currentDataPosition = None
1306c1310
< def endData(self, containerClass=NavigableString):
---
> def endData(self, containerClass=NavigableString, position=None):
1308a1313
> currentDataPosition = self.currentDataPosition
1316a1322
> self.currentDataPosition = None
1322a1329
> o.position = currentDataPosition
1397,1398c1404,1405
< def unknown_starttag(self, name, attrs, selfClosing=0):
< #print "Start tag %s: %s" % (name, attrs)
---
> def unknown_starttag(self, name, attrs, position, selfClosing=0):
> #print "Start tag %s: %s" % (name, attrs), position
1403c1410
< self.handle_data('<%s%s>' % (name, attrs))
---
> self.handle_data('<%s%s>' % (name, attrs), position)
1414a1422
> tag.position = position
1427c1435
< def unknown_endtag(self, name):
---
> def unknown_endtag(self, name, position):
1432c1440
< self.handle_data('</%s>' % name)
---
> self.handle_data('</%s>' % name, position)
1440c1448,1450
< def handle_data(self, data):
---
> def handle_data(self, data, position):
> if not self.currentDataPosition:
> self.currentDataPosition = position
1443,1444c1453,1454
< def extractCharsetFromMeta(self, attrs):
< self.unknown_starttag('meta', attrs)
---
> def extractCharsetFromMeta(self, attrs, position):
> self.unknown_starttag('meta', attrs, position)
1553c1563
< def extractCharsetFromMeta(self, attrs):
---
> def extractCharsetFromMeta(self, attrs, position):
1596c1606
< tag = self.unknown_starttag("meta", attrs)
---
> tag = self.unknown_starttag("meta", attrs, position)
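The handle_data/endData part of the patch keeps the position of the first chunk of a text run and attaches it when the run is flushed. The sketch below shows that idea in isolation. It assumes Python 3's html.parser instead of the patched BeautifulSoup classes, and the names TextRunPositions, end_data, and current_position are hypothetical.

```python
from html.parser import HTMLParser


class TextRunPositions(HTMLParser):
    """Hypothetical helper: each text run keeps the position of its first chunk."""

    def __init__(self):
        # convert_charrefs=False makes entities arrive as separate events,
        # so a single text run is delivered in several chunks.
        super().__init__(convert_charrefs=False)
        self.chunks = []
        self.current_position = None
        self.strings = []  # list of (text, (line, column))

    def handle_data(self, data):
        # Only the first chunk of a run sets the position.
        if self.current_position is None:
            self.current_position = self.getpos()
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Keep the entity as part of the current run, unconverted.
        self.handle_data("&%s;" % name)

    def end_data(self):
        # Flush the run with its saved position, then reset for the next run.
        if self.chunks:
            self.strings.append(("".join(self.chunks), self.current_position))
        self.chunks = []
        self.current_position = None

    def handle_starttag(self, tag, attrs):
        self.end_data()

    def handle_endtag(self, tag):
        self.end_data()


parser = TextRunPositions()
parser.feed("<p>a &amp; b</p>\n<p>c</p>")
parser.close()
parser.end_data()  # flush any trailing text
print(parser.strings[0])   # ('a &amp; b', (1, 3))
print(parser.strings[-1])  # ('c', (2, 3))
```

The sketch tests the saved position against None explicitly, where the patch uses a truthiness test.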

gsmaverick

Oct 4, 2009, 6:56:26 PM
to beautifulsoup
Thanks so much for this patch. Exactly what I'm looking for.

Aaron DeVore

Oct 4, 2009, 8:37:15 PM
to beauti...@googlegroups.com
Greg,
How about breaking position into two attributes? Maybe something like
lineNumber and columnNumber. All you need to do is replace lines like
this:

tag.position = position

with this:

tag.lineNumber, tag.columnNumber = position
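
A minimal sketch of that unpacking, assuming position is the (line, column) tuple that getpos() returns. The Tag class here is only a stand-in for BeautifulSoup's, and set_position is a hypothetical helper.

```python
class Tag:
    """Stand-in for BeautifulSoup's Tag; it only holds attributes here."""

    def __init__(self, name):
        self.name = name


def set_position(tag, position):
    # Unpack the (line, column) tuple into two separate attributes.
    tag.lineNumber, tag.columnNumber = position
    return tag


tag = set_position(Tag("p"), (12, 4))
print(tag.lineNumber, tag.columnNumber)  # 12 4
```

Note that this fails if position is None, so any code path that leaves the position unset would need a guard.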

-Aaron