Truncating my blog posts with Python’s HTMLParser

2021-09-23 Not a bowl of beautiful soup.

I recently converted this blog to Pelican, a Python-powered static site generator. On the way I added a few customizations. One customization is a Jinja template filter to truncate a post’s HTML as a summary, using Python’s HTMLParser class. Here’s how I wrote it.

Pelican calculates a summary for each post to use on pages listing several posts. The default summary behaviour truncates the post HTML to the first N words. I enjoyed discovering its implementation is based on Django’s truncatewords_html filter.

This behaviour is reasonable for arbitrary content, but I wanted something else. I normally write my first paragraphs as an introduction to the post that follows. Therefore, for my post summaries, I wanted to truncate to the whole first paragraph, not an arbitrary word limit.

To do this, I wrote my own Jinja template filter to parse the post HTML and truncate it appropriately.

Template filters

Jinja template filters are regular functions that take and return strings, so I started with:

def excerpt(content: str) -> str:
    # TODO: truncation
    return content

A straightforward beginning.
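For context, here is a minimal sketch of how such a function might be wired into Pelican, whose JINJA_FILTERS setting maps filter names to callables. The filter name "excerpt" and defining the function directly in pelicanconf.py are assumptions for illustration, and the body is still the placeholder from above.

```python
# pelicanconf.py (sketch; the filter name "excerpt" is an assumption)


def excerpt(content: str) -> str:
    # Placeholder body, as above: returns the HTML unchanged for now.
    return content


# Pelican's JINJA_FILTERS setting maps filter names to callables.
JINJA_FILTERS = {"excerpt": excerpt}
```

A template could then apply it with something like {{ article.content|excerpt }}.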

Enter `HTMLParser`

Python has several HTML parsing packages, the most popular of which is Beautiful Soup. But for constrained tasks, the standard library’s HTMLParser class can suffice, and even be more performant.

(Note: It’s not possible to use a regex to parse HTML without ceaseless screaming.)

HTMLParser is used by Pelican’s default summary behaviour, so I tried using it as well.

To use HTMLParser, we need to subclass it and implement methods as required to handle comments, tag starts, tag ends, etc. Then we can instantiate our class, call feed() to pass it some HTML, and act on its results.

For a simple example, we can implement a parser class to count paragraph tags like so:

from __future__ import annotations

from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.paragraphs = 0

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        if tag == "p":
            self.paragraphs += 1

And use it like so:

>>> parser = ParagraphCounter()
>>> parser.feed("<p>this</p><p>that</p>")
>>> parser.paragraphs
2

HTMLParser is a non-validating parser, which means it won’t check many things, such as whether tags close in the right order. This also implies it doesn’t cover all the quirks and edge cases defined in the HTML specification. As I learnt recently in Idiosyncrasies of the HTML parser, there are a lot of special rules!

The lack of validation is okay for my use case though, as I can assume the HTML generated by Pelican is valid.
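To make "non-validating" concrete, here is a small sketch that logs the events HTMLParser reports for markup whose tags close in the wrong order. EventLogger is a hypothetical name used only for this illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tag, end tag, and text events in the order they arrive."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs) -> None:
        self.events.append(("start", tag))

    def handle_endtag(self, tag: str) -> None:
        self.events.append(("end", tag))

    def handle_data(self, data: str) -> None:
        self.events.append(("data", data))


logger = EventLogger()
# The <b> and <p> tags close in the wrong order, but no error is raised.
logger.feed("<p><b>text</p></b>")
print(logger.events)
# [('start', 'p'), ('start', 'b'), ('data', 'text'), ('end', 'p'), ('end', 'b')]
```

The parser simply reports each tag as it sees it, so any nesting checks are up to the subclass.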

First Paragraph Truncation v1

I had this idea for performing my custom truncation with HTMLParser:

Keep count of how many tags are open.
When a tag ends and the counter hits 0, that means the first tag has closed. Track where this is, and raise an exception to stop parsing.
Truncate the string up until the discovered end point.

I found that tracking the end tag’s position required overriding one private, undocumented method: parse_endtag(). This is the method that advances the position in the string after encountering the end tag.
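As a rough sketch of what that method gives us, the helper below records the argument and return value of each parse_endtag() call. EndTagPositions is a hypothetical name, and since the method is private, this behaviour is an observation about current CPython rather than a documented guarantee.

```python
from __future__ import annotations

from html.parser import HTMLParser


class EndTagPositions(HTMLParser):
    """Hypothetical helper: record what parse_endtag() receives and returns."""

    def __init__(self) -> None:
        super().__init__()
        self.positions: list[tuple[int, int]] = []

    def parse_endtag(self, i: int) -> int:
        # i is the index of the "<" that opens the end tag; the return value
        # is the index just past its closing ">".
        gtpos = super().parse_endtag(i)
        self.positions.append((i, gtpos))
        return gtpos


html = "<p>Hi</p><p>Yo</p>"
parser = EndTagPositions()
parser.feed(html)
print(parser.positions)  # [(5, 9), (14, 18)]
print(html[:9])  # <p>Hi</p>
```

Slicing the original string up to the returned index keeps everything through the end tag, which is the truncation point we are after.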

The first version of my code looked like:

from __future__ import annotations

from html.parser import HTMLParser


class FirstTagTruncator(HTMLParser):
    class TruncationCompleted(Exception):
        pass

    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.end: int | None = None
        self.tag_counter = 0

    def feed(self, data: str) -> None:
        try:
            super().feed(data)
        except self.TruncationCompleted:
            pass

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.tag_counter += 1

    def handle_endtag(self, tag: str) -> None:
        self.tag_counter -= 1

    def parse_endtag(self, i: int) -> int:
        """
        Override internal method to capture the position of the end tag
        """
        gtpos = super().parse_endtag(i)
        if self.tag_counter == 0:
            self.end = gtpos
            raise self.TruncationCompleted()
        return gtpos


def excerpt(content: str) -> str:
    """
    Truncate HTML to only the first top level tag (normally a <p>).
    """
    truncator = FirstTagTruncator()
    truncator.feed(content)
    if truncator.end is None:
        return content
    return content[: truncator.end]

FirstTagTruncator works as described above, and excerpt() wraps it up for use in templates.
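One edge case worth knowing about, shown as a small sketch rather than as part of the filter: the counter only returns to 0 if every start tag has a matching end tag. A void element written without a trailing slash, such as <br>, triggers handle_starttag() but never handle_endtag(), so the count would stay above 0 and excerpt() would fall back to returning the whole content. The XHTML-style form <br /> is reported as a start followed by an end, so it balances. DepthCounter and final_depth are hypothetical names used only here.

```python
from __future__ import annotations

from html.parser import HTMLParser


class DepthCounter(HTMLParser):
    """Hypothetical helper: the same open-tag counting, without truncation."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs) -> None:
        self.depth += 1

    def handle_endtag(self, tag: str) -> None:
        self.depth -= 1


def final_depth(html: str) -> int:
    """Return the open-tag count left over after parsing all of html."""
    counter = DepthCounter()
    counter.feed(html)
    return counter.depth


print(final_depth("<p>one<br />two</p>"))  # 0
print(final_depth("<p>one<br>two</p>"))  # 1
```

If the generated HTML always uses the self-closing form, this never comes up.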

This approach worked fine, but then I found a little extra behaviour I wanted to add...

First Paragraph Truncation v2

On some posts I add extra details in a “message” paragraph at the top, which is not really appropriate to include in the summary. For example, in my post A Guide to Python Lambda Functions, there’s a message that it’s a cross-post. Such posts have HTML that looks like:

<p class="message">
  Some message details...
</p>

<p>
  Actual summary paragraph...
</p>

<p>
  Article body...
</p>

...

I wanted to make the parser skip over tags with class “message”, and truncate to the first paragraph afterwards. To do this I extended the behaviour:

Add two new variables: the start position, and a skip flag.
When a top-level tag starts, if it has the class message, set the skip flag to True.
When a top-level tag ends, if the skip flag is set, update the start position, rather than raising TruncationCompleted.
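The class check in the second step relies on how HTMLParser hands attributes to handle_starttag(): as a list of (name, value) tuples, where value is None for attributes written without a value. A minimal sketch of that check in isolation follows; has_class and AttrLogger are hypothetical names for illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser


def has_class(attrs: list[tuple[str, str | None]], wanted: str) -> bool:
    """Hypothetical helper: is `wanted` among the space-separated classes?"""
    for name, value in attrs:
        # value is None for attributes written without a value.
        if name == "class" and value and wanted in value.split():
            return True
    return False


class AttrLogger(HTMLParser):
    """Hypothetical helper: keep the attrs list of every start tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: list[list[tuple[str, str | None]]] = []

    def handle_starttag(self, tag, attrs) -> None:
        self.seen.append(attrs)


logger = AttrLogger()
logger.feed('<p class="message note" hidden>Hi</p>')
attrs = logger.seen[0]
print(attrs)  # [('class', 'message note'), ('hidden', None)]
print(has_class(attrs, "message"))  # True
print(has_class(attrs, "mess"))  # False
```

Splitting the value on whitespace means a tag with several classes still matches, while a class that merely contains "message" as a substring does not.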

After adding this I ended up with:

from __future__ import annotations

from html.parser import HTMLParser


class FirstTagTruncator(HTMLParser):
    class TruncationCompleted(Exception):
        pass

    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.start = 0
        self.end: int | None = None
        self.tag_counter = 0
        self.skip = False

    def feed(self, data: str) -> None:
        try:
            super().feed(data)
        except self.TruncationCompleted:
            pass

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.tag_counter += 1

        if self.tag_counter == 1:
            for name, value in attrs:
                if name == "class" and value and "message" in value.split():
                    self.skip = True

    def handle_endtag(self, tag: str) -> None:
        self.tag_counter -= 1

    def parse_endtag(self, i: int) -> int:
        """
        Override internal method to capture the position of the end tag
        """
        gtpos = super().parse_endtag(i)
        if self.tag_counter == 0:
            if self.skip:
                self.start = gtpos
                self.skip = False
            else:
                self.end = gtpos
                raise self.TruncationCompleted()
        return gtpos


def excerpt(content: str) -> str:
    """
    Truncate HTML to only the first top level tag (normally a <p>),
    skipping any 'message' class items that appear before.
    """
    truncator = FirstTagTruncator()
    truncator.feed(content)
    if truncator.end is None:
        return content
    return content[truncator.start : truncator.end]

This works exactly as I wanted 😊
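For a quick check of the behaviour, here is a self-contained run. The class is a condensed copy of the v2 code above (docstrings and the convert_charrefs parameter dropped) so that the example runs on its own, and the sample HTML is made up for illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser


class FirstTagTruncator(HTMLParser):
    """Condensed copy of the v2 class above, so this example runs alone."""

    class TruncationCompleted(Exception):
        pass

    def __init__(self) -> None:
        super().__init__()
        self.start = 0
        self.end: int | None = None
        self.tag_counter = 0
        self.skip = False

    def feed(self, data: str) -> None:
        try:
            super().feed(data)
        except self.TruncationCompleted:
            pass

    def handle_starttag(self, tag, attrs) -> None:
        self.tag_counter += 1
        if self.tag_counter == 1:
            for name, value in attrs:
                if name == "class" and value and "message" in value.split():
                    self.skip = True

    def handle_endtag(self, tag: str) -> None:
        self.tag_counter -= 1

    def parse_endtag(self, i: int) -> int:
        gtpos = super().parse_endtag(i)
        if self.tag_counter == 0:
            if self.skip:
                # A message paragraph just closed: start the excerpt after it.
                self.start = gtpos
                self.skip = False
            else:
                self.end = gtpos
                raise self.TruncationCompleted()
        return gtpos


def excerpt(content: str) -> str:
    truncator = FirstTagTruncator()
    truncator.feed(content)
    if truncator.end is None:
        return content
    return content[truncator.start : truncator.end]


# Made-up sample content, not taken from a real post.
post = (
    '<p class="message">Cross-posted elsewhere.</p>'
    "<p>The <em>summary</em> paragraph.</p>"
    "<p>Article body.</p>"
)
print(excerpt(post))  # <p>The <em>summary</em> paragraph.</p>
print(excerpt("no tags at all"))  # no tags at all
```

If the source has whitespace between the message and the first real paragraph, that whitespace ends up at the start of the excerpt, which is harmless in HTML.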

Fin

Parsing HTML with Python’s standard library is fun!

—Adam


Tags: python
