Truncating my blog posts with Python’s HTMLParser
I recently converted this blog to Pelican, a Python-powered static site generator. On the way I added a few customizations. One customization is a Jinja template filter to truncate a post’s HTML as a summary, using Python’s HTMLParser class. Here’s how I wrote it.
Pelican calculates a summary for each post to use on pages listing several posts. The default summary behaviour truncates the post HTML to the first N words. I enjoyed discovering its implementation is based on Django’s truncatewords_html filter.
This behaviour is reasonable for arbitrary content, but I wanted something else. I normally write my first paragraphs as an introduction to the post that follows. Therefore for my post summaries, I wanted to truncate to the whole first paragraph, not an arbitrary word limit.
To do this, I wrote my own Jinja template filter to parse the post HTML and truncate it appropriately.
Template filters
Jinja template filters are regular functions that take and return strings, so I started with:
def excerpt(content: str) -> str:
    # TODO: truncation
    return content
A straightforward beginning.
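For the filter to be usable in templates, Pelican has to know about it. A minimal sketch, assuming the filter is registered through Pelican's JINJA_FILTERS setting in pelicanconf.py and that excerpt() is defined or imported there. The template expression in the comment is illustrative, not taken from my theme:

```python
# pelicanconf.py (sketch)


def excerpt(content: str) -> str:
    # Placeholder body for now; the truncation logic comes later.
    return content


# Pelican adds each entry of this setting to its Jinja environment,
# so a template could then use: {{ article.content|excerpt }}
JINJA_FILTERS = {
    "excerpt": excerpt,
}
```

The dictionary maps the filter name used in templates to the function that implements it.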
Enter HTMLParser
Python has several HTML parsing packages, the most popular of which is Beautiful Soup. But for constrained tasks, the standard library’s HTMLParser class can suffice, and even be more performant.
(Note: It’s not possible to use a regex to parse HTML without ceaseless screaming.)
HTMLParser is used by Pelican’s default summary behaviour, so I tried using it as well.
To use HTMLParser, we need to subclass it and implement methods as required to handle comments, tag starts, tag ends, etc. Then we can instantiate our class, call feed() to pass it some HTML, and act on its results.
For a simple example, we can implement a parser class to count paragraph tags like so:
from __future__ import annotations

from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.paragraphs = 0

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        if tag == "p":
            self.paragraphs += 1
And use it like so:
>>> parser = ParagraphCounter()
>>> parser.feed("<p>this</p><p>that</p>")
>>> parser.paragraphs
2
HTMLParser is a non-validating parser, which means it won’t check many things, such as whether tags close in the right order. This also implies it doesn’t cover all the quirks and edge cases defined in the HTML specification. As I learnt recently in Idiosyncrasies of the HTML parser, there are a lot of special rules!
The lack of validation is okay for my use case though, as I can assume the HTML generated by Pelican is valid.
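To see what “non-validating” means in practice, here is a small sketch. TagLogger is a hypothetical helper written for this illustration. It records tag events in the order the parser reports them, and the input deliberately closes its tags in the wrong order:

```python
from __future__ import annotations

from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record tag events in the order the parser reports them."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.events.append(("start", tag))

    def handle_endtag(self, tag: str) -> None:
        self.events.append(("end", tag))


logger = TagLogger()
# <b> and <p> are closed in the wrong order, yet no error is raised.
logger.feed("<p><b>text</p></b>")
print(logger.events)
# [('start', 'p'), ('start', 'b'), ('end', 'p'), ('end', 'b')]
```

The parser reports each tag as it appears and leaves any consistency checks to the subclass.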
First Paragraph Truncation v1
I had this idea for performing my custom truncation with HTMLParser:
- Keep count of how many tags are open.
- When a tag ends and the counter hits 0, that means the first tag has closed. Track where this is, and raise an exception to stop parsing.
- Truncate the string up until the discovered end point.
I found that to track the end tag’s position required overriding one private, undocumented method: parse_endtag(). This is the method that advances the position in the string after encountering the end tag.
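To illustrate what parse_endtag() receives and returns, here is a small sketch. EndTagPositions is a hypothetical class written for this illustration. Since parse_endtag() is private and undocumented, treat the behaviour shown as an observation of the current CPython implementation rather than a guarantee:

```python
from __future__ import annotations

from html.parser import HTMLParser


class EndTagPositions(HTMLParser):
    """Record where each end tag starts and where parsing resumes."""

    def __init__(self) -> None:
        super().__init__()
        self.positions: list[tuple[int, int]] = []

    def parse_endtag(self, i: int) -> int:
        # i is the index of the "</" that opens the end tag.
        gtpos = super().parse_endtag(i)
        # gtpos is the index just past the closing ">".
        self.positions.append((i, gtpos))
        return gtpos


html = "<p>one</p><p>two</p>"
parser = EndTagPositions()
parser.feed(html)
print(parser.positions)
# [(6, 10), (16, 20)]
print(html[: parser.positions[0][1]])
# <p>one</p>
```

Slicing the original string up to the returned position keeps everything through the end tag, which is what the truncation needs.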
The first version of my code looked like:
from __future__ import annotations

from html.parser import HTMLParser


class FirstTagTruncator(HTMLParser):
    class TruncationCompleted(Exception):
        pass

    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.end: int | None = None
        self.tag_counter = 0

    def feed(self, data: str) -> None:
        try:
            super().feed(data)
        except self.TruncationCompleted:
            pass

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.tag_counter += 1

    def handle_endtag(self, tag: str) -> None:
        self.tag_counter -= 1

    def parse_endtag(self, i: int) -> int:
        """
        Override internal method to capture the position of the end tag
        """
        gtpos = super().parse_endtag(i)
        if self.tag_counter == 0:
            self.end = gtpos
            raise self.TruncationCompleted()
        return gtpos


def excerpt(content: str) -> str:
    """
    Truncate HTML to only the first top level tag (normally a <p>).
    """
    truncator = FirstTagTruncator()
    truncator.feed(content)
    if truncator.end is None:
        return content
    return content[: truncator.end]
FirstTagTruncator works as described above, and excerpt() wraps it up for use in templates.
This approach worked fine, but then I found a little extra behaviour I wanted to add...
First Paragraph Truncation v2
On some posts I add extra details in a “message” paragraph at the top, which is not really appropriate to include in the summary. For example, my post A Guide to Python Lambda Functions has a message saying that it’s a cross-post. Such posts have HTML that looks like:
<p class="message">
Some message details...
</p>
<p>
Actual summary paragraph...
</p>
<p>
Article body...
</p>
...
I wanted to make the parser skip over tags with class “message”, and truncate to the first paragraph afterwards. To do this I extended the behaviour:
- Add two new variables: the start position, and a skip flag.
- When a top-level tag starts, if it has the class message, set the skip flag to True.
- When a top-level tag ends, if the skip flag is set, update the start position, rather than raising TruncationCompleted.
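The class check relies on the shape of the attrs argument that HTMLParser passes to handle_starttag(). A small sketch of that shape, where AttrRecorder and has_class() are hypothetical helpers written for this illustration:

```python
from __future__ import annotations

from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Record each start tag together with its attribute list."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: list[tuple[str, list[tuple[str, str | None]]]] = []

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.seen.append((tag, attrs))


def has_class(attrs: list[tuple[str, str | None]], name: str) -> bool:
    # A class attribute can hold several space-separated names, and an
    # attribute without a value arrives as None, so guard against both.
    for key, value in attrs:
        if key == "class" and value and name in value.split():
            return True
    return False


recorder = AttrRecorder()
recorder.feed('<p class="message info" hidden>')
tag, attrs = recorder.seen[0]
print(attrs)
# [('class', 'message info'), ('hidden', None)]
print(has_class(attrs, "message"), has_class(attrs, "mess"))
# True False
```

Splitting the value on whitespace avoids matching class names that merely contain “message” as a substring.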
After adding this I ended up with:
from __future__ import annotations

from html.parser import HTMLParser


class FirstTagTruncator(HTMLParser):
    class TruncationCompleted(Exception):
        pass

    def __init__(self, *, convert_charrefs: bool = True) -> None:
        super().__init__(convert_charrefs=convert_charrefs)
        self.start = 0
        self.end: int | None = None
        self.tag_counter = 0
        self.skip = False

    def feed(self, data: str) -> None:
        try:
            super().feed(data)
        except self.TruncationCompleted:
            pass

    def handle_starttag(
        self,
        tag: str,
        attrs: list[tuple[str, str | None]],
    ) -> None:
        self.tag_counter += 1
        if self.tag_counter == 1:
            for name, value in attrs:
                if name == "class" and value and "message" in value.split():
                    self.skip = True

    def handle_endtag(self, tag: str) -> None:
        self.tag_counter -= 1

    def parse_endtag(self, i: int) -> int:
        """
        Override internal method to capture the position of the end tag
        """
        gtpos = super().parse_endtag(i)
        if self.tag_counter == 0:
            if self.skip:
                self.start = gtpos
                self.skip = False
            else:
                self.end = gtpos
                raise self.TruncationCompleted()
        return gtpos


def excerpt(content: str) -> str:
    """
    Truncate HTML to only the first top level tag (normally a <p>),
    skipping any 'message' class items that appear before.
    """
    truncator = FirstTagTruncator()
    truncator.feed(content)
    if truncator.end is None:
        return content
    return content[truncator.start : truncator.end]
This works exactly as I wanted 😊
Tags: python