Adding abbr tags to a document

Mansour Moufid

Mar 28, 2024, 12:06:17 PM

to beautifulsoup

Hello,

I'm attempting to automatically add abbr tags into HTML based on a dictionary.

Here is a small demo: https://gist.github.com/mansourmoufid/711b38e65a3ddd0863e4f61827a2e699

I iterate over the elements of the page (elements() generator), then I match abbreviations with a regular expression (abbr_exp), extract the string (e.string.extract()), and finally add in a fresh abbr tag (e.append()).

The regular expression uses a 'negative lookbehind assertion' (?<!...) and a 'negative lookahead assertion' (?!...) to find abbreviations in strings.
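A minimal sketch of how the two assertions behave, using only the standard library. The dictionary and the exact pattern here are assumptions for illustration; the real abbr_exp is in the gist and may differ.

```python
import re

# Hypothetical dictionary, for illustration only.
abbreviations = {"HTML": "HyperText Markup Language"}

abbr_exp = re.compile(
    r"(?<![\w.])"  # negative lookbehind: no word character or period just before
    + "(" + "|".join(map(re.escape, abbreviations)) + ")"
    + r"(?!\w)"    # negative lookahead: no word character just after
)

text = "XHTML, HTML5 and index.HTML are skipped, but HTML is matched."

# Only the standalone occurrence survives both assertions.
matches = list(abbr_exp.finditer(text))
found = [m.group(1) for m in matches]
```

Here "XHTML" and "index.HTML" are rejected by the lookbehind and "HTML5" by the lookahead, so only the last "HTML" is reported.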

When the element is a NavigableString, I have to wrap it in a <span> tag in order to use the append() method. I later remove the <span> tags.

I have a couple issues:

1) not all strings are iterated over; and

2) abbreviations inside <a> tags get appended after the tag rather than inside it.

Has anybody successfully done this? What approach did you use? What am I doing wrong?

I would love to hear any suggestions,

Mansour

Mansour Moufid

Mar 29, 2024, 11:50:41 AM

to beautifulsoup

On Thursday, March 28, 2024 at 12:06:17 PM UTC-4 Mansour Moufid wrote:

I have a couple issues:
1) not all strings are iterated over; and

Ok, I figured this first one out. I was iterating over the tree while modifying it. So I changed

for e in elements(...):

to

for e in list(elements(...)):

Mansour Moufid

Mar 29, 2024, 1:09:59 PM

to beautifulsoup

On Thursday, March 28, 2024 at 12:06:17 PM UTC-4 Mansour Moufid wrote:

I have a couple issues:
1) not all strings are iterated over; and
2) abbreviations inside <a> tags get appended after the tag rather than inside it.

Ok I fixed the second point too.

Now I iterate only over NavigableStrings, not all elements:

def elements(x):
    while x is not None:
        if isinstance(x, bs4.NavigableString):
            if len(x.string.strip('\n ')) > 0:
                yield x
        x = x.next_element

Then in the loop, I extract the string, recording its position in the parent element:

parent = e.parent
i = parent.contents.index(e)
x = e.extract()
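One caveat when recording the position: parent.contents is a plain list, so contents.index(e) compares by value, and a NavigableString compares equal to any sibling string with the same text. Tag.index() looks the child up by identity instead. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Two sibling strings with identical text, separated by a tag.
soup = BeautifulSoup("<p>a<b>x</b>a</p>", "html.parser")
p = soup.p
second = p.contents[2]  # the second "a"

by_value = p.contents.index(second)  # list.index compares with ==
by_identity = p.index(second)        # Tag.index compares with `is`
```

With repeated text, by_value points at the first "a" while by_identity points at the string that was actually passed in, so parent.index(e) may be the safer choice in the loop.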

and replace it (the string) with a new list which includes the newly created abbr tag:

xs = list(re.split(abbr_exp, x.string))
ys = join(functools.partial(new_abbr, abbr), xs)
for j, y in enumerate(ys):
    parent.insert(i + j, y)

and it works!
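Since join() and new_abbr() are only defined in the gist, here is a self-contained sketch of the same split-and-insert idea. It is not the gist's code: ABBRS, abbr_exp and tag_abbreviations() are hypothetical names, and skipping script, style and existing abbr elements is an added assumption.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical dictionary, for illustration only.
ABBRS = {"HTML": "HyperText Markup Language", "CSS": "Cascading Style Sheets"}

# The capturing group makes re.split() keep the matched abbreviations.
abbr_exp = re.compile(
    r"(?<![\w.])(" + "|".join(map(re.escape, ABBRS)) + r")(?!\w)"
)

def tag_abbreviations(soup):
    # Snapshot the strings first, because the loop modifies the tree.
    for s in list(soup.find_all(string=True)):
        # Skip comments and similar subclasses, and text we should not touch.
        if type(s) is not NavigableString:
            continue
        if s.parent.name in ("abbr", "script", "style"):
            continue
        parts = abbr_exp.split(str(s))
        if len(parts) == 1:
            continue  # no abbreviation in this string
        parent = s.parent
        i = parent.index(s)  # position by identity, not by value
        s.extract()
        offset = 0
        for k, part in enumerate(parts):
            if k % 2 == 1:
                # Odd positions hold the captured abbreviations.
                node = soup.new_tag("abbr", title=ABBRS[part])
                node.string = part
            elif part:
                node = NavigableString(part)
            else:
                continue  # drop empty pieces at the edges
            parent.insert(i + offset, node)
            offset += 1

soup = BeautifulSoup(
    '<p>Use HTML and CSS. <a href="#">More HTML</a></p>', "html.parser"
)
tag_abbreviations(soup)
```

Because each string is replaced inside its own parent, an abbreviation found inside an <a> tag stays inside that tag.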

See the gist for the updated script. Maybe someone will find it useful.

Chris Papademetrious

Mar 29, 2024, 9:14:16 PM

to beautifulsoup

Hi Mansour,

Another way to do this is by building a regex pattern that matches any abbreviation in the list:

abbreviations_pattern = re.compile(
    r"^(.*?)"  # stuff before abbreviation (lazy)
    + r"(?<![\w.])"  # negative lookbehind (not a word character or period)
    + "("
    + "|".join(map(re.escape, abbreviations))
    + ")"
    + r"(?!\w)"  # negative lookahead (not a word character)
    + r"(.*)$",  # stuff after abbreviation (greedy)
    flags=re.DOTALL,  # let "." match newlines so multi-line text nodes still match
)
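To see what the three groups capture, here is a standalone sketch with a made-up dictionary. It passes re.DOTALL so that "." also matches newlines; without that flag a text node containing a line break would not match this anchored pattern at all.

```python
import re

# Hypothetical dictionary, for illustration only.
abbreviations = {
    "HTML": "HyperText Markup Language",
    "CSS": "Cascading Style Sheets",
}

pattern = re.compile(
    r"^(.*?)"        # text before the first abbreviation (lazy)
    + r"(?<![\w.])"  # not preceded by a word character or period
    + "(" + "|".join(map(re.escape, abbreviations)) + ")"
    + r"(?!\w)"      # not followed by a word character
    + r"(.*)$",      # everything after it (greedy)
    flags=re.DOTALL,
)

# The lazy first group stops at the leftmost abbreviation.
first = pattern.match("Use HTML and\nCSS today").groups()

# Feeding the remainder back in finds the next abbreviation.
second = pattern.match(first[2]).groups()
```

Each match peels off one abbreviation, and the third group is the text that still needs to be searched.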

then looping through all NavigableString objects looking for matches:

for this_ns in list(soup.find_all(string=True)):
    # check if there's a match in this string
    while (this_ns is not None) and (match := re.match(abbreviations_pattern, this_ns)):
        # replace the abbreviation with the <abbr> tag
        before, abbr, after = match.groups()
        abbr_tag = soup.new_tag("abbr", title=abbreviations[abbr])
        abbr_tag.string = abbr
        this_ns.replace_with(abbr_tag)

        # if there was text before, add it
        if before != '':
            abbr_tag.insert_before(before)

        # if there was text after, add it, then continue checking for matches in it
        if after != '':
            abbr_tag.insert_after(after)
            this_ns = abbr_tag.next_sibling
        else:
            this_ns = None

For each match, (1) convert the abbreviation to its equivalent tag, (2) restore any text preceding the match, (3) restore any text following the match, and (4) if there was text following the match, continue looking in it for additional matches.

Note that in the upcoming 4.13 release of bs4, you'll be able to set the text content directly in the new_tag() method, using the string argument:

abbr_tag = soup.new_tag("abbr", string=abbr, title=abbreviations[abbr])

- Chris
