When dealing with HTML content, such as output from a Markdown processor, it is often useful to show an excerpt of the content, much like a thumbnail for an image or a trailer for a movie.

Truncating HTML isn’t a very complicated problem, and I was surprised to find that it’s not part of more standard libraries. In any case, it was a fun little exercise and a nice way to learn how to do some string processing in a new language.

In a hurry? You can grab my solution over on GitHub. It’s MIT licensed, so you’re free to use it for open-source or commercial projects.

Let’s walk through creating a function called TruncateHtml in the Go Programming Language that will truncate arbitrary, valid HTML. All of this can easily be translated to another language (like Python), but I did it in Go so I will mention some useful Go language constructs and helper functions.

Before getting started, we have to answer some questions:

  • What type of encoding is the input expected to be in?
  • How is truncation length defined?
  • If valid HTML text is to be truncated, will the truncated text also be valid HTML?

Input Encoding

According to Wikipedia, UTF-8 accounted for 83.3% of all Web pages in March of 2015. Also, the Go Programming Language has some really awesome string processing facilities which operate on UTF-8. So, for the sake of simplicity, we will require that all input to this function is UTF-8 encoded.

For iterating over characters in a string in Go, we can use the range clause in the for control structure. This range clause will automatically decode the characters (called runes in Go) of a UTF-8 encoded string and return, for each one, the byte offset at which it starts.

for pos, runeValue := range "some fancy string" {
    fmt.Printf("At byte %d is the character %q\n", pos, runeValue)    
}

See the output of this code snippet on the Go playground.

If you’re new to Go, new to Unicode and the UTF-8 encoding, or would just like to see more on this subject, check out the Go blog article on the subject.

“Visible Characters”

It’s not enough to just specify a total number of bytes to truncate the input to. What if the limit falls within a tag? What if the limit falls somewhere in the middle of a UTF-8 encoded code point?

We will create a new unit called the visible character and define the truncation limit in this unit. We define a visible character as a printable character, with a corresponding Unicode code point, that does not belong to an HTML tag and is not a whitespace character. This could be anything from a letter of the Latin alphabet (A-Z), a number (0-9), punctuation (!@#), or one of many other characters.

Note that this definition does not take into account HTML formatting which may affect which characters, if any, are drawn to the screen. It is merely an approximation of what the browser will render.

To test whether a character is a “visible character” in Go, we can easily use functions from the unicode package to form the following expression:

unicode.IsPrint(runeValue) && !unicode.IsSpace(runeValue)

As mentioned in the last section, we can use the range clause with the for control structure to loop over the decoded characters of a string. So we will advance the input pointer until the start of an HTML tag (i.e. <) is discovered. While scanning, we count only visible characters using the above condition. If the limit of visible characters has been reached while scanning, we can stop processing the input and proceed to closing any opened tags (see following sections).
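The scanning step described above could be sketched roughly as follows. This is only an illustration, not the code from my GitHub repository: the helper names isVisible and scanText, and their signatures, are made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// isVisible reports whether r counts toward the truncation limit:
// it must be printable and must not be whitespace.
func isVisible(r rune) bool {
	return unicode.IsPrint(r) && !unicode.IsSpace(r)
}

// scanText is a hypothetical helper. Starting at byte offset start, it walks
// the input rune by rune and stops at the next '<' or as soon as limit
// visible characters have been consumed. It returns the byte offset where
// scanning stopped and the number of visible characters counted.
func scanText(input string, start, limit int) (end, count int) {
	for pos, r := range input[start:] {
		if r == '<' || count == limit {
			return start + pos, count
		}
		if isVisible(r) {
			count++
		}
	}
	return len(input), count
}

func main() {
	end, n := scanText("Monty Python", 0, 5)
	fmt.Println(end, n) // 5 5
}
```

Because the limit is checked before each rune is consumed, the returned offset always falls on a rune boundary, so the output is never cut in the middle of a code point.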

Matching Tags

We can use Regular Expressions to extract the useful information from a discovered tag. In Go, we can use the regexp package to do this.

var TagExpr = regexp.MustCompile("<(/?)([A-Za-z0-9]+).*?>")

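This expression will match most well-formed HTML tags. It is an approximation: a > inside an attribute value ends the match early, and because . does not match a newline by default, a tag split across lines is missed. When the input contains a matching tag, FindStringSubmatch returns the entire match at index 0, either “/” or “” at index 1, and the tag name at index 2.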

If a match is discovered, we advance the input pointer to the end of the tag.
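Here is a minimal sketch of how the expression might be applied at the current input position. The helper matchTag and its return values are assumptions made for this example. Since the expression is not anchored, the sketch checks that the match begins at offset 0 of the remaining input.

```go
package main

import (
	"fmt"
	"regexp"
)

// TagExpr is the expression from the article.
var TagExpr = regexp.MustCompile("<(/?)([A-Za-z0-9]+).*?>")

// matchTag is a hypothetical helper. It tries to match a tag that begins
// exactly at byte offset pos and returns the tag name, whether it is a
// closing tag, and the byte offset just past the tag.
func matchTag(input string, pos int) (name string, closing bool, end int, ok bool) {
	loc := TagExpr.FindStringSubmatchIndex(input[pos:])
	if loc == nil || loc[0] != 0 {
		return "", false, pos, false
	}
	closing = loc[3] > loc[2]             // group 1 is "/" or ""
	name = input[pos+loc[4] : pos+loc[5]] // group 2 is the tag name
	return name, closing, pos + loc[1], true
}

func main() {
	input := `<a href="x">link</a>`
	name, closing, end, ok := matchTag(input, 0)
	fmt.Println(name, closing, end, ok) // a false 12 true
}
```

Searching the remaining input and then rejecting matches that start later is simple but does extra work; compiling a second expression that starts with ^ would be one way to avoid that.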

Closing Tags

Consider the following example where some HTML is truncated to 5 visible characters:

<h1>Monty Python</h1>  =>  <h1>Monty

Clearly, this is not valid HTML because the H1 element has an opening <h1> tag but is not completed by a closing </h1> tag. So, in our function, we must now make sure to keep track of which tags are open, which tags are closed, and when we are done truncating to close any open tags.

Fortunately, valid HTML tags are always (should be) closed in the reverse of the order in which they were opened, so the most recently opened tag is closed first. For example:

<h1><i>This is some fancy text.</i></h1>

This is perfectly valid HTML. But this:

<h1><i>This is some fancy text.</h1></i>

…is not.

Observing this property of HTML, we can easily leverage the stack data structure to keep track of which tags remain open. In Go, we can use the slice type to implement a stack.

Whenever an opening tag is discovered, push the tag name onto our stack. When a closing tag is discovered, pop the stack and double-check that what was popped matches the closing tag. If it does not match, the input tags were misnested.

To push onto the stack:

tagStack = append(tagStack, tagName)

To pop from the stack:

tagStack = tagStack[0:len(tagStack)-1]

Finally, when the maximum desired input is reached, simply pop all the tags off the stack and append the associated closing tags to the output.
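Put together, the stack handling might look something like the sketch below. The function names push, pop and closeAll are placeholders for this example; pop also reports whether the popped name matches the closing tag that was just read.

```go
package main

import (
	"fmt"
	"strings"
)

// push adds an open tag name to the top of the stack.
func push(stack []string, name string) []string {
	return append(stack, name)
}

// pop removes the top tag name and reports whether it matched want,
// the name of the closing tag that was just discovered.
func pop(stack []string, want string) ([]string, bool) {
	if len(stack) == 0 {
		return stack, false // closing tag without an opening tag
	}
	top := stack[len(stack)-1]
	return stack[:len(stack)-1], top == want
}

// closeAll builds closing tags for everything still open, innermost first.
func closeAll(stack []string) string {
	var b strings.Builder
	for i := len(stack) - 1; i >= 0; i-- {
		b.WriteString("</" + stack[i] + ">")
	}
	return b.String()
}

func main() {
	var tagStack []string
	tagStack = push(tagStack, "h1")
	tagStack = push(tagStack, "i")
	fmt.Println("<h1><i>Monty" + closeAll(tagStack)) // <h1><i>Monty</i></h1>
}
```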

Void Elements

Across the different versions of HTML, so-called void elements can appear with or without a trailing forward-slash. For instance:

<img src="monty.gif"> or <img src="python.gif" />

Our parser should be robust enough to handle this variation. Fortunately, the solution to this is easy. When a tag is discovered, if the tag name is one of the void element names, simply do not keep track of it and advance the input pointer to the end of the tag.

Here is the list of void elements:

area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source, track, wbr
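One simple way to make this check is a lookup table keyed by tag name. The names voidElements and isVoid below are made up for this sketch; it lowercases the name first because HTML tag names are case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// voidElements holds the void element names listed in the article.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true,
	"command": true, "embed": true, "hr": true, "img": true,
	"input": true, "keygen": true, "link": true, "meta": true,
	"param": true, "source": true, "track": true, "wbr": true,
}

// isVoid reports whether name is a void element and therefore
// should never be pushed onto the tag stack.
func isVoid(name string) bool {
	return voidElements[strings.ToLower(name)]
}

func main() {
	fmt.Println(isVoid("img"), isVoid("IMG"), isVoid("h1")) // true true false
}
```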

Character References (Entities)

A common occurrence in HTML is the character reference, or entity. This construct lets the markup refer to a character by name or by code point instead of including the character directly. For example: &copy; produces © and &#8984; produces ⌘, the looped square.

The following regular expression can be used to match character entity references:

var EntityExpr = regexp.MustCompile("&#?[A-Za-z0-9]+;")

When an entity is discovered, simply count it as one visible character and advance the input pointer to the end of the character reference.
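As a rough sketch, the check at the current input position could look like this. The helper matchEntity is hypothetical; as with tags, it only accepts a match that begins exactly at the given offset.

```go
package main

import (
	"fmt"
	"regexp"
)

// EntityExpr is the expression from the article.
var EntityExpr = regexp.MustCompile("&#?[A-Za-z0-9]+;")

// matchEntity is a hypothetical helper. It reports whether a character
// reference begins exactly at byte offset pos and, if so, returns the
// byte offset just past its terminating semicolon.
func matchEntity(input string, pos int) (end int, ok bool) {
	loc := EntityExpr.FindStringIndex(input[pos:])
	if loc == nil || loc[0] != 0 {
		return pos, false
	}
	return pos + loc[1], true
}

func main() {
	end, ok := matchEntity("&copy; 2015", 0)
	fmt.Println(end, ok) // 6 true
}
```

In the scanning loop, a successful match would add one to the visible character count and move the input pointer to end. An ampersand that does not start a reference is counted like any other visible character.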

Remarks

The TruncateHtml function is simple enough to implement and a good exercise if you’d like to learn a new language. Feel free to take a look at my Go implementation (MIT Licensed) if you’d like, or use it in your next Go project. If you do implement this function in another language, please drop a note in the comments below.