Switching my blog to a static site generator

After staying on Textpattern for more than ten years, the time was right for a new blog engine. It’s not that Textpattern is bad – it’s actually pretty good and rather sturdy security-wise. But perfect is the enemy of good, and a blog that is only static files on the server side is perfect security: no attack surface whatsoever. No PHP and no database on the server means far fewer security updates. And I can easily see locally what any modifications to the site would look like, then push to a repository that doubles as a backup – done, the changes are deployed. Finally, I simply got fed up with writing Textile when everywhere else the format of choice is Markdown.

So now this blog is generated by Hugo, it’s all static files and the server can get some rest.

[Screenshot of load average being 0.00]

As an added bonus, the concept of page bundles means that adding images or PDFs to individual posts no longer results in an unmanageable mess. Migrating content and layout from Textpattern was fairly straightforward; the custom RSS template that allows full blog posts in the RSS feed was already the most “challenging” part.

But there are two inherently dynamic parts of a blog: search and comments. Statically generated blogs very often use something like Google’s custom search and Disqus to implement these. However, I didn’t want to rely on third parties, if only for privacy reasons. Besides, I’d much rather keep comments in the repository along with all the other content instead of breaking the beautiful concept with a remote, dynamically generated frame. So here is how I solved this.

Static search with lunr.js

The Hugo website has a few suggestions for implementing search functionality. After looking through these, I thought that lunr.js would be the simplest solution. However, the hugo-lunr package mentioned there turned out to be a waste of time. Its purpose is generating a list of all the content in the blog, yet it tries to do that without considering the site configuration, so it fails to guess page URIs correctly, exports the wrong taxonomy and adds binary files to the index. I eventually realized that it is much easier to generate the index with Hugo itself. The following layouts/index.json template already does the job for me:

{{ $scratch := newScratch -}}
{{ $scratch.Add "index" slice -}}
{{ range .Site.RegularPages -}}
  {{ $scratch.Add "index" (dict "uri" .RelPermalink
                                "title" .Title
                                "description" .Description
                                "categories" .Params.categories
                                "content" (.Plain | htmlUnescape)) -}}
{{ end -}}
{{ $scratch.Get "index" | jsonify -}}

You have to enable the JSON output format for the home page in the site configuration and you are done:

outputs:
  home:
    - HTML
    - JSON
    - RSS

Now, this isn’t an actual search index but merely a list of all the content. I considered pre-building a search index but ended up abandoning the idea. A pre-built search index is larger, which would still be acceptable thanks to compression. More importantly however, it no longer contains any information about the original text. So lunr.js would give you a list of URIs as search results but nothing else – you would have neither a title nor a summary to show to the user.

End result: the search script currently used on this site downloads the JSON file with all the blog contents on first invocation. It then invokes lunr.js to build a search index and executes the search. For each search result it shows the title and a summary, the latter generated from the full content in the same way Hugo does it. It would be nice to highlight the actual keywords found, but that would be far more complicated and lunr.js does nothing to help you with this task.

A concern I have about lunr.js is its awkward query language. While it allows for more flexibility in theory, in practice nobody will want to learn it just to use the search on some stupid blog. Instead, people might put search phrases in quotation marks, which is currently a sure way to get no search results at all.

Somewhat dynamic commenting functionality

The concept of page bundles also has the nice side effect that you can put a number of comment files into an article’s directory, and a simple change to the templates will have them displayed under the article. So you can have comments in the same repository, neatly organized by article and generated statically along with all the other content. Nice!

Only issue: how do you get comments there? This is the part that’s no longer possible without some server-side code. Depending on how much you want to automate this, it might not even be a lot of code. I ended up going for full automation, so right now I’ve got around 300 lines of Python code and an additional 100 lines of templates.

Comments on my blog are always pre-moderated, which makes things easier. So when somebody submits a comment, it is merely validated and put into a queue. There is no connection to GitHub at this point – that would be slow and not entirely reliable. Contacting GitHub can wait until the comment is approved; I have more patience than the average blog visitor.
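
For illustration, a submission handler along these lines could look roughly like the following sketch. The form field names, the validation rules and the store_in_queue() helper are my assumptions rather than the actual code:

# Hypothetical sketch: validate a submitted comment and hand it to the queue.
# No GitHub access happens here; that waits until the comment is approved.

def handle_submission(form, store_in_queue):
    uri = form.get("uri", "").strip()
    author = form.get("name", "").strip()
    text = form.get("comment", "").strip()

    # Reject obviously broken submissions right away.
    if not uri.startswith("/") or not author or not text:
        raise ValueError("missing or invalid form fields")

    # Everything else, including the optional email address, goes into the
    # moderation queue (the queue storage itself is sketched further below).
    entry = {
        "uri": uri,
        "name": author,
        "email": form.get("email", "").strip(),
        "comment": text,
    }
    return store_in_queue(entry)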

Identifying the correct blog post

Each blog post has two identifiers: its URI and its directory path in the repository. Which one should be sent with the comment form and how should it be validated? This question turned out to be less obvious than it seemed, e.g. because I wanted to see the title of the blog post when moderating comments, yet I didn’t want to rely on the commenter to send the correct title with the form. Getting data from GitHub isn’t an option at this stage, so I thought: why not get it from the generated pages on the server?

The comment form now sends the URI of the blog post. The comment server uses the URI to locate the corresponding index.html file, so here we already have validation that the blog post actually exists. From that file it can get the title and (via the data-path attribute on the comment form) the article’s path in the repository. Another beneficial side effect: if the blog post doesn’t have a comment form (e.g. because comments are disabled), this validation step will fail.
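
A minimal sketch of how that lookup could work, assuming the generated site lives under a known web root and using Python’s built-in html.parser; the actual implementation may well differ:

# Hypothetical sketch: map a post URI to its generated index.html and pull
# the title and the repository path out of the rendered page.
from html.parser import HTMLParser
from pathlib import Path

WEBROOT = Path("/var/www/blog")  # assumed location of the generated site

class PageInfo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.path = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "form" and "data-path" in attrs:
            self.path = attrs["data-path"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def lookup_post(uri):
    page = WEBROOT / uri.lstrip("/") / "index.html"
    if not page.is_file():
        raise ValueError("no such blog post")      # URI validation
    info = PageInfo()
    info.feed(page.read_text(encoding="utf-8"))
    if info.path is None:
        raise ValueError("comments are disabled")  # no comment form on the page
    return info.title, info.path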

Sanitizing content

Ideally, I would add comments to the repository exactly as entered by the user and leave the conversion from Markdown up to Hugo. Unfortunately, Hugo doesn’t have a sanitizer for untrusted content, and the corresponding issue report is stale. So the comment server has to do the Markdown conversion and sanitization itself; comments are stored in the repository as already safe HTML code, with rel="nofollow" added to all links. The good news: the Python-Markdown module allows disabling individual syntax handlers, which I did for headings for example – the corresponding HTML tags would otherwise have been converted to plain text by the sanitizer.
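
As an illustration, the conversion step could look like the sketch below. Python-Markdown’s processor registry is real, but the use of the bleach library for sanitization and the exact tag whitelist are assumptions on my part:

# Hypothetical sketch: convert untrusted Markdown to HTML, strip anything
# not explicitly allowed and add rel="nofollow" to all links.
import bleach
import markdown

ALLOWED_TAGS = {"p", "a", "em", "strong", "code", "pre", "blockquote", "ul", "ol", "li"}
ALLOWED_ATTRIBUTES = {"a": ["href", "rel"]}

md = markdown.Markdown()
# Disable heading syntax so that "# foo" stays plain text instead of becoming
# an <h1> that the sanitizer would reduce to plain text anyway.
md.parser.blockprocessors.deregister("hashheader")
md.parser.blockprocessors.deregister("setextheader")

def render_comment(text):
    html = md.reset().convert(text)
    html = bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES)
    # linkify() with the nofollow callback adds rel="nofollow" to every link.
    return bleach.linkify(html, callbacks=[bleach.callbacks.nofollow])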

Securing moderation interface

I didn’t want to implement proper user management for the comment moderation mechanism. Instead I wanted to be given a link in the notification mail, and I would merely need to follow it to review the comment. Original thought: do some HMAC dance to sign the comment data in the URL. Nope, the comment data might be too large for a URL, so it needs to be stored in a temporary file for moderation. Sign the comment ID instead? Wait, why bother? If the comment ID is some lengthy random string, it will be impossible to guess.

And that’s what I implemented: comment data is stored in the queue under a random file name. Accessing the moderation interface is only possible if you know that file name. Bruteforcing it remotely is unrealistic, so no fancy crypto required here.
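
In code, the idea boils down to something like this sketch; the queue location, the URL scheme and the token length are made up for the example:

# Hypothetical sketch: store the queued comment under a random, unguessable
# name and derive the moderation link from that name alone.
import json
import secrets
from pathlib import Path

QUEUE_DIR = Path("/var/lib/blog-comments/queue")   # assumed location
MODERATION_BASE = "https://example.com/moderate/"  # assumed URL scheme

def store_in_queue(entry):
    # 32 random bytes, URL-safe: far too large a space to brute-force remotely.
    comment_id = secrets.token_urlsafe(32)
    path = QUEUE_DIR / f"{comment_id}.json"
    path.write_text(json.dumps(entry), encoding="utf-8")
    # Knowing this link is all that is needed to moderate the comment.
    return MODERATION_BASE + comment_id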

Notifications and replies

Obviously, I wouldn’t want to put people’s email addresses into a public repository. Frankly, however, I don’t think that subscribing to comments is terribly useful; comment sections of blogs simply aren’t a good place for extended conversations. So already with Textpattern, a direct reply to a comment could only come from me, and that’s the only scenario where people would get notified.

I’ve made this somewhat more explicit now, with the email field hint saying that filling it out is usually unnecessary. The address is stored along with the comment data while the comment is in the moderation queue, so I can provide a reply during moderation and the comment author will receive a notification. Once moderation is done, the comment data is removed from the queue and the email address is gone forever. Works for me, your mileage may vary.
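
A sketch of how the approval step could handle the optional address; the notify() and publish() helpers are hypothetical placeholders for the mail sending and GitHub parts, not the actual code:

# Hypothetical sketch: when a comment is approved, notify the author only if
# an email address was given and a reply was written, then drop the address.
import json
from pathlib import Path

def approve(queue_file: Path, reply, notify, publish):
    entry = json.loads(queue_file.read_text(encoding="utf-8"))

    if reply and entry.get("email"):
        notify(entry["email"], reply)   # the only notification scenario

    publish(entry, reply)               # commit to the repository via GitHub

    # Removing the queue file also removes the only copy of the address.
    queue_file.unlink()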

Adding a comment to GitHub

I’ve had some bad experiences with automating repository commits in the past – there are too many edge conditions here. So this time I decided to use the GitHub API instead, which turned out to be fairly simple. The comment server gets an access token and can then construct a commit to the repository.

Downside: adding a comment requires five HTTP requests, partly because one existing file needs to be modified (updating the lastmod setting of the article), but mostly because the API is very low-level. A high-level “all-in-one update” call exists only if you want to modify a single file. For a commit with multiple files you have to:

  • Create a new tree.
  • Create a commit for this tree.
  • Update master branch reference to point to the commit.

Altogether this means: approving a comment is expected to take a few seconds.
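
For reference, a multi-file commit via GitHub’s Git Data API could look roughly like the sketch below, using the requests library. The breakdown into five requests is my reading of the flow described above; the repository name and token handling are placeholders:

# Hypothetical sketch: commit several files in one go via GitHub's Git Data API.
# files maps repository paths to the new file contents.
import requests

API = "https://api.github.com/repos/OWNER/REPO"   # placeholder repository
HEADERS = {"Authorization": "token <access token>"}  # placeholder token

def commit_files(files, message, branch="master"):
    # 1. Current commit the branch points to.
    ref = requests.get(f"{API}/git/ref/heads/{branch}", headers=HEADERS).json()
    head_sha = ref["object"]["sha"]

    # 2. Tree of that commit, needed as base for the new tree.
    head = requests.get(f"{API}/git/commits/{head_sha}", headers=HEADERS).json()
    base_tree = head["tree"]["sha"]

    # 3. New tree containing the added or modified files (content passed inline).
    tree = requests.post(f"{API}/git/trees", headers=HEADERS, json={
        "base_tree": base_tree,
        "tree": [{"path": path, "mode": "100644", "type": "blob", "content": content}
                 for path, content in files.items()],
    }).json()

    # 4. Commit pointing at the new tree.
    commit = requests.post(f"{API}/git/commits", headers=HEADERS, json={
        "message": message, "tree": tree["sha"], "parents": [head_sha],
    }).json()

    # 5. Move the branch reference to the new commit.
    requests.patch(f"{API}/git/refs/heads/{branch}", headers=HEADERS,
                   json={"sha": commit["sha"]})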

Comments

  • Timo Ollech

    Do you have any tool for importing comments from Wordpress into your solution? That would be supercool because I'm in the process of migrating to Hugo, and before I found your blog I planned on using Isso for comments. But that would bring a lot more server-side code than your solution. Isso has the big advantage of being able to import WP comments itself though.

    Wladimir Palant

    No, I used a script to convert comments from a database dump into HTML files - but I came from Textpattern.