Cleaning up the WYSIWYG HTML

Published on 29 Oct 2011

My blog is built with Hippo CMS and the Hippo Site Toolkit. This means that I use the Xinha (Is Not HtmlArea) WYSIWYG editor to write the content of my blog posts. The Xinha editor does its job, but it has its limitations. Editors may mess up the HTML without knowing it. There’s also an option to switch to HTML mode, where editors may again mess up the HTML, but this time knowingly. The HTML may work in their own browser, but if it’s not valid, it may render incorrectly in other browsers. If the site is meant for a Dutch governmental organisation, the law requires it to meet accessibility guidelines, which in turn require the HTML to comply with XHTML 1.0 Strict.

Switching to a different WYSIWYG editor is not trivial, and for Hippo there is an additional restriction: the editor’s license must be compatible with the Apache License. Nearly all alternatives are commercial or have some form of GPL license. That’s one reason why we kept using Xinha, with its plugin ecosystem, in both Hippo CMS 6 and 7. To ensure that its output is valid XHTML, we clean up the HTML when an editor clicks the Save button.

The HTML cleaner checks whether the HTML meets the following requirements:

Are the HTML elements and attributes valid according to the configured XHTML 1.0 DTD?
The options are Strict and Transitional. This check is the reason why YouTube <embed/> codes are stripped: the element is not part of either DTD. For historical reasons we made an exception for the (IMO evil) element <u> (underline).
Is the element or attribute allowed in the configuration?
The webmaster can disallow the use of an <h1> in the WYSIWYG HTML because that element is reserved for the main page title, or disallow the use of deprecated elements and attributes. If the design guidelines don’t allow editors to add their own styling, the class and style attributes can be stripped. The style attribute in particular makes it harder to reuse the content on multiple platforms.
Is the inserted CSS class allowed for the elements p, div, pre or span?
For these four elements, the content of the class attribute is restricted to the class names that are configured as allowed.
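
The three checks above can be sketched in a few lines of Python. This is only an illustration and not the actual Hippo cleaner: `STRICT_DTD` is a heavily truncated stand-in for the real DTD, and `CleanerConfig`, `element_allowed`, `attribute_allowed` and `filter_classes` are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field

# Assumption: a tiny, truncated stand-in for the XHTML 1.0 Strict DTD,
# mapping an element name to the attributes this sketch treats as valid.
STRICT_DTD = {
    "p": {"class", "style", "id"},
    "div": {"class", "style", "id"},
    "pre": {"class", "style", "id"},
    "span": {"class", "style", "id"},
    "h1": {"class", "style", "id"},
    "h2": {"class", "style", "id"},
}

# <u> is not valid in Strict, but is let through as a historical exception.
LEGACY_EXCEPTIONS = {"u"}

# Elements whose class attribute is limited to the configured class names.
CLASS_RESTRICTED = {"p", "div", "pre", "span"}


@dataclass
class CleanerConfig:
    """Hypothetical per-site settings chosen by the webmaster."""
    disallowed_elements: set = field(default_factory=set)
    disallowed_attributes: set = field(default_factory=set)
    allowed_classes: set = field(default_factory=set)


def element_allowed(name, config):
    """Check 1 (DTD, plus the <u> exception) and check 2 (configuration)."""
    in_dtd = name in STRICT_DTD or name in LEGACY_EXCEPTIONS
    return in_dtd and name not in config.disallowed_elements


def attribute_allowed(element, attribute, config):
    """The attribute must be valid on the element and not be disallowed."""
    valid = attribute in STRICT_DTD.get(element, set())
    return valid and attribute not in config.disallowed_attributes


def filter_classes(element, value, config):
    """Check 3: keep only configured class names on p, div, pre and span."""
    names = value.split()
    if element in CLASS_RESTRICTED:
        names = [name for name in names if name in config.allowed_classes]
    return " ".join(names)
```

With a configuration that disallows `h1` and `style` and allows only the class `intro`, `element_allowed("embed", config)` and `element_allowed("h1", config)` are both false, while `filter_classes("p", "intro fancy", config)` keeps just `intro`.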

If an HTML element or attribute does not meet the configured requirements, it is removed. The text inside a removed element is not removed, but wrapped inside an element that is allowed, usually a <p>. If a <div> element does not contain an allowed class name, it is cleaned up in one of two ways: it is either converted into a <p> or, if it has children like <p>, <ul> or <table>, only the <div> element itself is removed and its content is kept.
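
To make these removal rules concrete, here is a minimal sketch in Python. It is not the real cleaner: it assumes the input is a well-formed XML fragment, drops every attribute except an allowed class on a <div>, and uses hypothetical element lists (`BLOCK`, `INLINE`) and a hypothetical allowed class name (`note`).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: made-up configuration for this sketch only.
BLOCK = {"p", "div", "ul", "li", "table", "tr", "td", "pre", "h2", "h3"}
INLINE = {"em", "strong", "span", "u"}
ALLOWED_DIV_CLASSES = {"note"}


def join(pieces):
    """Concatenate the markup of a list of (is_block, markup) pieces."""
    return "".join(markup for _, markup in pieces)


def cleaned_children(elem):
    """Return the cleaned content of elem as (is_block, markup) pieces."""
    pieces = [(False, escape(elem.text or ""))]
    for child in elem:
        pieces.extend(cleaned_element(child))
        pieces.append((False, escape(child.tail or "")))
    return pieces


def cleaned_element(elem):
    """Return the pieces that replace elem after cleaning."""
    inner = cleaned_children(elem)
    tag = elem.tag
    if tag not in BLOCK and tag not in INLINE:
        return inner  # disallowed element: drop the tags, keep the content
    if tag == "div":
        classes = set(elem.get("class", "").split()) & ALLOWED_DIV_CLASSES
        if classes:
            names = " ".join(sorted(classes))
            return [(True, '<div class="%s">%s</div>' % (names, join(inner)))]
        if any(is_block for is_block, _ in inner):
            return inner  # block children: remove only the <div> itself
        return [(True, "<p>%s</p>" % join(inner))]  # otherwise: become a <p>
    return [(tag in BLOCK, "<%s>%s</%s>" % (tag, join(inner), tag))]


def clean(fragment):
    """Clean a fragment; loose top-level text and inline markup go in a <p>."""
    root = ET.fromstring("<root>%s</root>" % fragment)
    out, run = [], []

    def flush():
        text = "".join(run).strip()
        if text:
            out.append("<p>%s</p>" % text)
        run.clear()

    for is_block, markup in cleaned_children(root):
        if is_block:
            flush()
            out.append(markup)
        else:
            run.append(markup)
    flush()
    return "".join(out)
```

For example, `clean("<div>just <em>text</em></div>")` turns the <div> into a <p>, while `clean("<div><p>one</p><ul><li>two</li></ul></div>")` removes only the <div> and keeps the <p> and <ul> inside it.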