Why "Save Page As HTML, complete" sucks

I read a forum question from an Opera user who was upset because Opera 9.10 now saves web pages “like IE and Firefox”, meaning that it saves them together with all the included files. His problem was easily solved with a configuration change, but it got me thinking. In general this doesn’t seem to be such a bad idea: it allows you to open a saved web page and have it look exactly the same. So I tried to understand why this user was so upset and why I almost never use this feature myself. It seems there are three reasons.

Non-obvious result

The browser doesn’t just create the file the user told it to create but also a directory for the auxiliary files. It isn’t obvious to the user that this will happen and that he has to remove the directory as well when he decides to remove the saved page. Even if he knows it, it still means some effort locating the directory which is annoying. Yes, if you happen to use Windows Explorer it will remove this directory automatically, but this is a hack and yet another very non-obvious action.

Solution: save everything into one file. I first thought of using data: URLs to embed all data inside the same HTML file. This would have the advantage of sticking to the HTML format, and nothing other than the web page saving code would need to be changed. However, I noticed disadvantages as well. The file wouldn’t be usable in Internet Explorer (it still doesn’t support data: URLs). More importantly, if the same image is used multiple times on the page it will have to be stored multiple times, because HTML has no way to say “this image uses the same URL as image XYZ”. That last one is a showstopper, so supporting Microsoft’s MHTML format is probably still the more realistic alternative even though it means much more effort.
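
To make the duplication problem concrete, here is a minimal sketch for Node.js. The helpers toDataUrl and inlineImages and the 300-byte placeholder image are made up for illustration and are not taken from any browser's saving code. The sketch inlines every image as a data: URL and then counts how often the encoded payload ends up in the saved file.

```javascript
// Hypothetical helpers for illustration only; not part of any browser's saving code.

// Encode raw bytes as a data: URL with the given MIME type.
function toDataUrl(bytes, mimeType) {
  return 'data:' + mimeType + ';base64,' + Buffer.from(bytes).toString('base64');
}

// Replace every src="..." whose value is a key in `resources` with a data: URL.
// `resources` maps a URL to { bytes, mimeType }.
function inlineImages(html, resources) {
  return html.replace(/src="([^"]*)"/g, (match, url) => {
    const res = resources[url];
    return res ? 'src="' + toDataUrl(res.bytes, res.mimeType) + '"' : match;
  });
}

// A page that uses the same (placeholder, 300-byte) image three times.
const resources = { 'logo.gif': { bytes: new Uint8Array(300), mimeType: 'image/gif' } };
const page = '<img src="logo.gif"><img src="logo.gif"><img src="logo.gif">';
const saved = inlineImages(page, resources);

// Count how many times the encoded payload appears in the saved file.
const payload = toDataUrl(resources['logo.gif'].bytes, 'image/gif');
const copies = saved.split(payload).length - 1;
console.log('copies of the image in the saved file:', copies);
```

The payload is stored once per use rather than once per page, which is exactly the showstopper described above. A container format that keeps each resource as a separate part can store it once and refer to it by its URL.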

Reproductive defects

The saved image of the web page isn’t perfect. While the HTML code is serialized from the DOM tree and is a perfect copy of what is currently displayed in the browser, JavaScript and CSS are not taken care of. So CSS might still contain relative URLs and JavaScript… well, ideally you would save JavaScript’s current state as well. Because the web page might have code like this:

<script type="text/javascript">
  document.write('<img src="image.gif">');
</script>

If you save this page and open it again, what will you see? Right, two images — one that already was on the web page when it was saved and a second that was created by this code when the saved page was opened. You can get even more images by saving again.
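
The effect can be reproduced with a toy model. This is only a sketch and not how a browser represents pages: a "page" here is a list of already rendered elements plus the scripts that run on every load, and the names load, save and writeImage are invented for the example.

```javascript
// Toy model, not a real browser: a "page" is a list of already rendered
// elements plus scripts that run again on every load.
function load(savedPage) {
  const elements = savedPage.elements.slice();
  // Each script runs on load, like the document.write call in the snippet above.
  for (const script of savedPage.scripts) {
    script(elements);
  }
  return { elements, scripts: savedPage.scripts };
}

// "Saving" serializes the current DOM state and keeps the scripts untouched.
function save(loadedPage) {
  return { elements: loadedPage.elements.slice(), scripts: loadedPage.scripts };
}

// Stand-in for document.write('<img src="image.gif">').
const writeImage = (elements) => elements.push('<img src="image.gif">');

let storedPage = { elements: [], scripts: [writeImage] }; // the original page
const imageCounts = [];
for (let i = 0; i < 3; i++) {
  const loaded = load(storedPage);   // open the page
  imageCounts.push(loaded.elements.length);
  storedPage = save(loaded);         // save it again
}
console.log(imageCounts); // [ 1, 2, 3 ]
```

Every save/open cycle adds one more image, because the serialized DOM already contains the script's output and the script still runs again.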

Solution: when saving, remove all CSS and JavaScript that was in the web page originally and replace it with a correct representation of the current web page state. This means that you would have a big style block at the beginning of the page defining all the styles that are relevant for the document (with all the relative URLs corrected of course). And you would have a block of JavaScript that would execute on page load to restore all the JavaScript functions, variables, event handlers etc. Don’t take me too seriously on this one, I know all too well that nobody will go to the trouble of implementing it, especially since there are several major issues: restoring JavaScript properties on DOM objects that don’t have an ID, references from JavaScript to DOM objects that are not in any document or are in the document of a different window, user actions during page load when the JavaScript event handlers are not yet in place.
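
The "relative URLs corrected" part is the easy piece of this. A rough sketch, assuming the saving code knows the URL each stylesheet was loaded from: the function name absolutizeCssUrls and the example.com addresses are invented for the example, and the regular expression only handles plain url(...) references, not @import strings, escapes or comments, which a real CSS parser would have to cover.

```javascript
// Rewrite relative url(...) references in a stylesheet against the URL the
// stylesheet was originally loaded from. Regex-based sketch only: @import "..."
// strings, escaped characters and comments are not handled.
function absolutizeCssUrls(cssText, baseUrl) {
  return cssText.replace(/url\(\s*(['"]?)([^'")]+)\1\s*\)/g, (match, quote, ref) => {
    // The standard URL constructor resolves `ref` relative to `baseUrl`;
    // references that are already absolute pass through unchanged.
    return 'url(' + quote + new URL(ref, baseUrl).href + quote + ')';
  });
}

const css = 'body { background: url(images/bg.png); } h1 { background: url("../img/a.gif"); }';
const fixedCss = absolutizeCssUrls(css, 'http://example.com/css/site.css');
console.log(fixedCss);
```

With the stylesheet URL above, images/bg.png resolves to http://example.com/css/images/bg.png and ../img/a.gif to http://example.com/img/a.gif. In a saved page these would then be mapped to the local copies of the files.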

Saving more than necessary

There have been complaints that even though Adblock Plus blocks the ads, the saved page will still have them. The problem is that web page saving doesn’t respect content policies and will download files even if they are blocked. That is especially concerning for web bugs that have been blocked because of privacy concerns. Previously I thought that this is the way it should be; after all, “HTML, complete” mode is supposed to create a copy of the original web page. But now I am leaning toward filing a bug on this issue.

Solution: only download files that haven’t been blocked. For a change, the implementation here shouldn’t be difficult: images and objects already implement the imageBlockingStatus property that indicates whether the image has been blocked by a content policy.
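
The check itself could look roughly like the sketch below. It runs on plain objects rather than real DOM elements: the function name urlsToDownload, the src field and the sample data are invented for the example, and the constant assumes that an accepted load is reported as 1 (as nsIContentPolicy.ACCEPT does), with anything else treated as blocked.

```javascript
// Assumption: mirrors nsIContentPolicy.ACCEPT; any other status counts as blocked.
const ACCEPT = 1;

// `elements` stands in for the images and objects found in the document.
// Each entry is a plain object with a hypothetical `src` field and the
// imageBlockingStatus value the real element would report.
function urlsToDownload(elements) {
  const urls = new Set(); // a Set avoids fetching the same URL twice
  for (const el of elements) {
    if (el.imageBlockingStatus === ACCEPT) {
      urls.add(el.src);
    }
  }
  return [...urls];
}

const found = [
  { src: 'logo.gif', imageBlockingStatus: 1 },
  { src: 'http://ads.example/banner.gif', imageBlockingStatus: -3 }, // blocked
  { src: 'photo.jpg', imageBlockingStatus: 1 },
  { src: 'logo.gif', imageBlockingStatus: 1 },
];
const toFetch = urlsToDownload(found);
console.log(toFetch);
```

Only logo.gif and photo.jpg remain in the list, so the blocked banner is never requested while saving.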

PS: If everything goes well this post should appear on Planet Mozilla. Yay, that’s exciting! :)

Comments

  1. Dao

    That’s why we need “save as PDF”. The HTML thing can safely be removed then — it’s irritating to users and useless for Web developers.

    Reply from Wladimir Palant:

    Do you know of any concrete plans? I remember people saying that producing PDFs will be possible with Cairo, but I am not sure how serious this is.

  2. Jeremy

    To address the “reproductive defects”, shouldn’t it save the page as it currently appears AFTER manipulation of the DOM by JavaScript? That would also have the benefit of not saving things that had been removed by Adblock Plus, right? Of course that solution has its own difficulties, such as how to remove the JavaScript that manipulates the DOM without removing other JavaScript. And even if you could do so you’d be removing JavaScript that is needed to get things to work correctly in IE (such as the JavaScript trick to get PNG alpha blending to work in IE).

    Reply from Wladimir Palant:

    That’s exactly how it works now, it saves the current state of the DOM. But removing some JavaScript while keeping the rest is impossible – JavaScript is Turing-complete, you can’t tell whether it will modify the DOM or not. So you have the choice between leaving all JavaScript (meaning that you will need to save its state) or removing all of it. The latter will break some pages, e.g. the ones that don’t display all their content at the same time but allow you to switch pages using JavaScript.

    As to Adblock Plus – it doesn’t change the DOM. So if you save the DOM it will still contain the images that have been blocked.

  3. Simon

    Unless I specifically need the saved HTML version, I don’t generally use that feature – mostly for the second reason. If I order something online and want a copy of the order acknowledgement, I print the page to file (currently Postscript, but PDF would be nice) so I’ve got an exact copy.

  4. the'

    For “Save as PDF” the back end is almost done:

    https://bugzilla.mozilla.org/show_bug.cgi?id=369930

    No progress on the front end:

    https://bugzilla.mozilla.org/show_bug.cgi?id=162659

    Reply from Wladimir Palant:

    Yes, there are certainly plans to fix printing using Cairo’s PDF capabilities. Judging by bug 162659 there are no plans to allow saving PDFs however (note that saving a page is in many respects not the same as printing it – e.g. you want to preserve backgrounds and disable printing-specific scaling/page breaks).

  5. Fred Wenzel

    Yes, it indeed did show up on planet.m.org — nice!

  6. funTomas

    Why not use a JAR archive? FF3 will support JAR URLs, or will it?

    Reply from Wladimir Palant:

    I thought about it. Support for JARs has been there at least since Netscape 6.0 so if they are trying to sell it to you as a new feature in FF3 – don’t believe them :). I see three problems with JARs:

    • You have compatibility issues again because no other browser will open them
    • I am not sure whether you can have relative JAR URLs (meaning pointing to other files inside the same JAR without knowing the name of the JAR file)
    • The default action if you double-click a JAR is not starting the browser but running it as a Java application

    You can solve the last two issues by saving the main HTML page outside the JAR – but then the JAR file doesn’t have any advantages compared to a directory for the same files.

  7. funTomas

    Well, I’m aware of those issues. So, is there any standard for archiving web pages? If yes, it should be followed, otherwise FF should set one. When it comes to JAR and its association with Java, I’d suggest using a different extension name, e.g. war (I know it’s been used by Java web apps, this time it’d stand for web archive). It’s an archive file format after all. This way you get a single file in a standardized archive file format, accessible by FF (other browsers would learn to use it too).

  8. Ian

    We really, really do need MHTML support – it’s a wonderful format, that lets you save a complete web page in a single file, and a great way to keep a complete, atomic copy of a web page. Everyone who has saved MHT files in IE before is now stuck with IE if they want to be able to see them in the future – we’re letting them maintain lockin by not implementing what is essentially an open format!

    Saving as PDF is also something that’s awesome and thankfully my grumbling about it made them reopen the bug for it :)

  9. Benoit

    “Even if he knows it, it still means some effort locating the directory which is annoying.”

    Huh? The directory is stored in the same directory as the HTML file, so why would it be an effort to locate it?

    Reply from Wladimir Palant:

    Because in every file browser I know directories are displayed separately from files so that you have to scroll up and find it by the name. At least if your directory has more than five files which it usually does ;)

  10. ReTuck

    You don’t see the difference between pre and post processing: by the time Adblock Plus begins to do its job, the undesirable object is probably already in the browser’s cache. “Save As” probably just functions to pull an exact copy of what’s already in the browser’s cache out into the user’s desired directory. So you get all the web bugs, etc… and depending on the relative/absolute filter paths, the Adblock Plus filter may or may not suppress correctly when opening from the local directories. In order to get pre-processing, where the bugs, etc. are not downloaded, the incoming data needs to be inspected before it reaches the browser.
    Agnitum’s free Outpost firewall does this: one configurable filter intercepts the incoming data and if incoming network traffic matches a pattern (such as a 1×1 gif), it dies. Of course this is counterproductive on webpages that use some sort of script to make certain that everything gets downloaded b/c it will try unsuccessfully forever (click, click, click…) to get its bugs/ads downloaded. Of course Outpost can identify that script and not load it either and then the webmaster makes the page unviewable until the script is loaded… so the user installs a commercial Anti-ad$ that replaces undesirable script with harmless script for a yearly$, but Anti-ad$ downloads update$, hogs system resources/CPU cycles and on and on ad infinitum.

    Reply from Wladimir Palant:

    Thanks for the junk advertising. You are wrong, Adblock Plus prevents things from being downloaded and put into the browser cache. And “Save Page As” does in fact download things it cannot find in the browser cache without consulting content policies – including images blocked by Adblock Plus. Which is a bug in the “Save Page As” feature and should be fixed.

  11. South Park

    Opera 9.10 now supports MHTML files, offering an alternative to Internet Exploder. I would love to see Firefox support MHTML as well.

  12. Extensions?

    Are there any extensions that let Firefox open MHTML files?

    Reply from Wladimir Palant:

    Google spits out this one: https://addons.mozilla.org/en-US/firefox/addon/212

    Not sure whether this is worth anything, especially since it is abandoned. I think that for these things to work properly they should be done in the core, not as an extension.

  13. T Blevins

    I realize this is old; however your blog is always a treasure trove, Wladimir.
    “restoring JavaScript properties on DOM objects that don’t have an ID, references from JavaScript to DOM objects that are not in any document or are in the document of a different window, user actions during page load when the JavaScript event handlers are not yet in place.” Wouldn’t TraceMonkey address this?

    MHTML at Wikipedia
    I think this is relevant to read and contemplate before moving forward in designing and architecture.
    Quote from article: Konqueror ..does include a feature for saving web pages as single files (“web archives”, file extension .war) that are actually gzipped tarballs..
    TGZ is a suboptimal format for rapid access, if you ask me.

    Apple went proprietary with Safari..Thanks Apple. :D

    I don’t use IE anymore — mostly because of AB+ — however the MHT concept is great. I have also used them on my phone as a reference.

    PDF would work, however last time I looked it was rather static so I think keeping HTML is better.
    MHTML should also be compressed which is better than multiple files with the current FF implementation.

    Ideally the format would be standardized by the w3c (how long would THAT take??) if RFC 2557 MHTML is insufficient.

    FYI, there is an unpacker for MHTML from the Microsoft Office Area; however, a quick search didn’t find the addon (which adds options in the context menu to compile/decompile saved files) for those interested.

    p.s. Wladimir you are an inspiration and a role model, good sir! Thank you for ALL of your amazing work!

    Reply from Wladimir Palant:

    Why should TraceMonkey help with saving the JavaScript state? It is only a different JavaScript engine, not a solution for everything. It doesn’t change the fact that serializing a JavaScript state is inherently difficult.

    Mozilla could easily use the jar: protocol to save an entire webpage in a single file – but will yet another proprietary format really make the world better? As you already mention, there are currently three different and incompatible solutions implemented by different browsers. So I still think that supporting MHTML would offer the most value to the user because it would make 90% of the browser market interoperable. But when I googled I found a big “oops”: http://www.patentstorm.us/patents/6886132/description.html. Guess supporting MHTML is out of the question then.

    Reply from Wladimir Palant:

    In http://www.informationweek.com/news/global-cio/showArticle.jhtml?articleID=162100345 it is clearer what this patent is about – guess supporting MHTML in Firefox is still an option, if somebody actually takes the time.
