Dumbing down HTML content for AMO

If you publish extensions on AMO, you may have run into the same problem I did: how do you keep the content on your website in sync with the extension descriptions on AMO? It could have been simple: take the HTML code from your website, paste it into the extension description and save. Unfortunately, this usually doesn’t produce useful results. The biggest issue is that AMO doesn’t understand HTML paragraphs and strips them out (along with most other tags). Instead, it turns each line break in your HTML code into a hard line break.

Luckily, a fairly simple script can do the conversion and keep your text reasonably readable. Here is what I’ve come up with for myself:

#!/usr/bin/env python
import sys
import re

data = sys.stdin.read()

# Normalize whitespace
data = re.sub(r'\s+', ' ', data)

# Insert line breaks after block tags
data = re.sub(r'<(ul|/ul|ol|/ol|blockquote|/blockquote|/li)\b[^<>]*>\s*', '<\\1>\n', data)

# Headers aren't supported, turn them into bold text
data = re.sub(r'<h(\d)\b[^<>]*>(.*?)</h\1>\s*', '<b>\\2</b>\n\n', data)

# Convert paragraphs into line breaks
data = re.sub(r'<p\b[^<>]*>\s*', '', data)
data = re.sub(r'</p>\s*', '\n\n', data)

# Convert hard line breaks into line breaks
data = re.sub(r'<br\b[^<>]*>\s*', '\n', data)

# Remove any leading or trailing whitespace
data = data.strip()

print(data)

This script expects the original HTML code on standard input and prints the result to standard output. The conversions performed are sufficient for my needs; your mileage may vary, e.g. because you aren’t closing paragraph tags or because your pages use relative links that need resolving. I don’t intend this to be a universal solution, so feel free to add more logic to the script as needed.
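
As an illustration of the kind of extra logic you might add, here is a minimal sketch that resolves relative links. It is not part of my script: the function name resolve_links and the base URL are made up for the example, and it assumes that links are written as double-quoted href attributes on <a> tags.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Hypothetical base URL: replace it with the address of the page
# that the HTML code was taken from
BASE_URL = 'https://example.com/extensions/'

def resolve_links(html, base_url=BASE_URL):
    """Turn relative href values in <a> tags into absolute URLs."""
    def fix(match):
        # group 1: tag text up to the opening quote, group 2: the URL,
        # group 3: the closing quote
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)

    # Only double-quoted href attributes are handled (assumption, see above)
    return re.sub(r'(<a\b[^<>]*?\bhref=")([^"]*)(")', fix, html)
```

In the script you would call something like data = resolve_links(data) right after reading standard input. Links that are already absolute pass through urljoin unchanged, so running the function on mixed content should be safe.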

Edit: Alternatively, you can use the equivalent JavaScript code:

var textareas = document.getElementsByTagName("textarea");
for (var i = 0; i < textareas.length; i++)
{
  if (window.getComputedStyle(textareas[i], "").display == "none")
    continue;

  var data = textareas[i].value;

  // Normalize whitespace
  data = data.replace(/\s+/g, " ");

  // Insert line breaks after block tags
  data = data.replace(/<(ul|\/ul|ol|\/ol|blockquote|\/blockquote|\/li)\b[^<>]*>\s*/g, "<$1>\n");

  // Headers aren't supported, turn them into bold text
  data = data.replace(/<h(\d)\b[^<>]*>(.*?)<\/h\1>\s*/g, "<b>$2</b>\n\n");

  // Convert paragraphs into line breaks
  data = data.replace(/<p\b[^<>]*>\s*/g, "");
  data = data.replace(/<\/p>\s*/g, "\n\n");

  // Convert hard line breaks into line breaks
  data = data.replace(/<br\b[^<>]*>\s*/g, "\n");

  // Remove any leading or trailing whitespace
  data = data.trim();

  textareas[i].value = data;
}

This one will convert the text in all visible text areas. You can either run it on AMO pages via Scratchpad or turn it into a bookmarklet.
