Adding abbr tags to a document


Mansour Moufid

Mar 28, 2024, 12:06:17 PM
to beautifulsoup
Hello,

I'm attempting to automatically add abbr tags into HTML based on a dictionary.


I iterate over the elements of the page (with an elements() generator), match abbreviations in each one with a regular expression (abbr_exp), extract the string (e.string.extract()), and finally append a fresh abbr tag (e.append()).

The regular expression uses a 'negative lookbehind assertion' (?<!...) and a 'negative lookahead assertion' (?!...) to find abbreviations in strings.
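For readers following along, here is a minimal sketch of what such an expression might look like. The dictionary contents and the exact character classes in the assertions are assumptions for illustration; the real abbr_exp is in the gist and may differ.

```python
import re

# Hypothetical abbreviation dictionary, for illustration only.
abbreviations = {
    "HTML": "HyperText Markup Language",
    "CSS": "Cascading Style Sheets",
}

# Try longer abbreviations first so a short one cannot shadow a longer one.
alternatives = "|".join(
    re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)
)

# (?<!\w) rejects a candidate that is preceded by a word character;
# (?!\w) rejects a candidate that is followed by one.
abbr_exp = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")

found = abbr_exp.findall("HTML and CSS, but not XHTML or CSS3.")
print(found)
```

The two assertions keep the pattern from firing inside longer words, so "XHTML" and "CSS3" are left alone while the standalone "HTML" and "CSS" are matched.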

When the element is a NavigableString, I have to wrap it in a <span> tag in order to use the append() method. I later remove the <span> tags.

I have a couple issues:
1) not all strings are iterated over; and
2) abbreviations inside <a> tags get appended after the tag rather than inside it.

Has anybody successfully done this? What approach did you use? What am I doing wrong?

I would love to hear any suggestions,
Mansour

Mansour Moufid

Mar 29, 2024, 11:50:41 AM
to beautifulsoup
On Thursday, March 28, 2024 at 12:06:17 PM UTC-4 Mansour Moufid wrote:

I have a couple issues:
1) not all strings are iterated over; and

OK, I figured the first one out: I was modifying the tree while iterating over it. So I changed

    for e in elements(...):

to

    for e in list(elements(...)):

Mansour Moufid

Mar 29, 2024, 1:09:59 PM
to beautifulsoup
On Thursday, March 28, 2024 at 12:06:17 PM UTC-4 Mansour Moufid wrote:
I have a couple issues:
1) not all strings are iterated over; and
2) abbreviations inside <a> tags get appended after the tag rather than inside it.

Ok I fixed the second point too.

Now I iterate only over NavigableStrings, not all elements:

    def elements(x):
        while x is not None:
            if isinstance(x, bs4.NavigableString):
                if len(x.string.strip('\n ')) > 0:
                    yield x
            x = x.next_element

Then in the loop, I extract the string, recording its position in the parent element:

    parent = e.parent
    i = parent.contents.index(e)
    x = e.extract()

and replace it (the string) with a new list which includes the newly created abbr tag:

    xs = list(re.split(abbr_exp, x.string))
    ys = join(functools.partial(new_abbr, abbr), xs)
    for j, y in enumerate(ys):
        parent.insert(i + j, y)

and it works!
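
For anyone who wants to try the idea without the gist, here is a self-contained sketch of the same extract-and-reinsert approach. It is not the gist's code: the abbreviations dictionary, abbr_exp, and tag_abbreviations() are assumptions for illustration, and it uses re.split with a capturing group in place of the join/new_abbr helpers, which are not shown in this thread.

```python
import re
import bs4

# Hypothetical dictionary and pattern, for illustration only.
abbreviations = {"HTML": "HyperText Markup Language"}
abbr_exp = re.compile(
    r"(?<!\w)(" + "|".join(map(re.escape, abbreviations)) + r")(?!\w)"
)

def tag_abbreviations(soup):
    # Snapshot the text nodes first, since the loop edits the tree.
    for e in list(soup.find_all(string=True)):
        if isinstance(e, bs4.Comment):
            continue  # leave comments untouched
        # With a capturing group, split() keeps the matches at odd indexes.
        pieces = abbr_exp.split(e)
        if len(pieces) == 1:
            continue  # no abbreviation in this string
        parent = e.parent
        i = parent.index(e)  # position by identity, not by string value
        e.extract()
        offset = 0
        for k, piece in enumerate(pieces):
            if piece == "":
                continue  # split() yields empty strings at the edges
            if k % 2 == 1:
                node = soup.new_tag("abbr", title=abbreviations[piece])
                node.string = piece
            else:
                node = bs4.NavigableString(piece)
            parent.insert(i + offset, node)
            offset += 1
    return soup

soup = bs4.BeautifulSoup(
    '<p>Learn HTML <a href="#">about HTML here</a>.</p>', "html.parser"
)
result = str(tag_abbreviations(soup))
print(result)
```

Because each piece is inserted into the string's own parent at the string's old position, an abbreviation inside an <a> tag stays inside it. This sketch does not skip text that is already inside an <abbr>, <script>, or <style> element, which a real script would probably want to do.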

See the gist for the updated script. Maybe someone will find it useful.


Chris Papademetrious

Mar 29, 2024, 9:14:16 PM
to beautifulsoup
Hi Mansour,

Another way to do this is by building a regex pattern that matches any abbreviation in the dictionary (abbreviations maps each abbreviation to its expansion):

import re

abbreviations_pattern = re.compile(
    r"^(.*?)"  # text before the abbreviation (lazy)
    + r"(?<![\w.])"  # negative lookbehind (not a word character or period)
    + "("
    + "|".join(map(re.escape, abbreviations))
    + ")"
    + r"(?!\w)"  # negative lookahead (not a word character)
    + r"(.*)$",  # text after the abbreviation (greedy)
    re.DOTALL,  # let "." span newlines inside a text node
)

then looping through all NavigableString objects looking for matches:

for this_ns in list(soup.find_all(string=True)):
    # check if there's a match in this string
    while (this_ns is not None) and (match := re.match(abbreviations_pattern, this_ns)):
        # replace the abbreviation with the <abbr> tag
        before, abbr, after = match.groups()
        abbr_tag = soup.new_tag("abbr", title=abbreviations[abbr])
        abbr_tag.string = abbr
        this_ns.replace_with(abbr_tag)

        # if there was text before, add it
        if before != '':
            abbr_tag.insert_before(before)

        # if there was text after, add it, then continue checking for matches in it
        if after != '':
            abbr_tag.insert_after(after)
            this_ns = abbr_tag.next_sibling
        else:
            this_ns = None


For each match, (1) convert the abbreviation to its equivalent tag, (2) restore any text preceding the match, (3) restore any text following the match, and (4) if there was text following the match, continue looking in it for additional matches.

Note that in the upcoming 4.13 release of bs4, you'll be able to set the text content directly in the new_tag() method through its string argument:

abbr_tag = soup.new_tag("abbr", string=abbr, title=abbreviations[abbr])

 - Chris