#3849 - HTML cleanup framework, and new admin module

Identifier #3849
Issue type Feature request or suggestion
Title HTML cleanup framework, and new admin module
Status Open
Tags

Roadmap: Over the horizon (custom)

Handling member Deleted
Addon core
Description We have a morass of HTML cleanup rules in comcode_from_html.php, which is rarely used any more (we only do HTML to Comcode conversion if it is enabled, and we recommend against enabling it).

We also have some HTML ugliness detections defined for the new Health Check in #3793.

We also have some cleanups in the Confluence integration addon for v11.

And finally, there is a broad set of things we might want to clean up but cannot clean up automatically, because whether to do so requires a user choice. A good example is HTML exported from Microsoft Excel, which tends to be dreadfully over-specified, yet we need the user to say which parts are unneeded.

So, a new unified framework of cleanups, likely hook-powered, would be good. Each hook could detect and resolve one or more problems. comcode_from_html.php would be just for Comcode conversion, and the HTML cleanup would be a separate phase controlled by our framework.
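
To make the hook idea concrete, here is a minimal sketch of what a detect/resolve hook and a runner could look like. Everything named here (`Html_cleanup_hook`, `Hook_html_cleanup_trailing_attribute_spaces`, `run_html_cleanups`) is hypothetical and not an existing API; real hooks would presumably be discovered through the normal hook system rather than passed in as an array, and the regular expression is deliberately naive.

```php
<?php

// Hypothetical contract for a cleanup hook: detect a problem, then resolve it.
interface Html_cleanup_hook
{
    // Short label an admin module could show next to a checkbox.
    public function label(): string;

    // Whether the problem is present in the given HTML.
    public function detect(string $html): bool;

    // Return the HTML with the problem resolved.
    public function resolve(string $html): string;
}

// Example hook: trailing whitespace inside double-quoted attribute values.
// Naive: it works on raw text, so it would also touch ="..." sequences outside of tags.
class Hook_html_cleanup_trailing_attribute_spaces implements Html_cleanup_hook
{
    private $pattern = '#(=\s*")([^"]*?)\s+"#';

    public function label(): string
    {
        return 'Trailing spaces on attributes';
    }

    public function detect(string $html): bool
    {
        return preg_match($this->pattern, $html) === 1;
    }

    public function resolve(string $html): string
    {
        return preg_replace($this->pattern, '$1$2"', $html);
    }
}

// Run only the hooks the user selected, and only where the problem is detected.
function run_html_cleanups(string $html, array $hooks, array $enabled): string
{
    foreach ($hooks as $name => $hook) {
        if (in_array($name, $enabled, true) && $hook->detect($html)) {
            $html = $hook->resolve($html);
        }
    }
    return $html;
}

$hooks = array('trailing_attribute_spaces' => new Hook_html_cleanup_trailing_attribute_spaces());
$cleaned = run_html_cleanups('<a href="x.htm  " class="y">link</a>', $hooks, array('trailing_attribute_spaces'));
echo $cleaned, "\n";
```

Keeping detect and resolve separate would let the admin module list only the cleanups that actually apply to the pasted HTML.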

Then there'd be a new admin module that would let you select which cleanups to do, with live preview of both the HTML code and rendered HTML.

Here's a list of new cleanups I'd like...
- List all elements, attributes, styles - ability to choose which to strip (this is the Excel example, where it'd be great to just check off most tags/attributes/style-rules for removal)
- HTML reformatting (we have XHTML reformatting code we can tie into)
- Trailing spaces on attributes (I believe that the Health Check does it for the end of HTML, but not attributes specifically)
- Move small image files (definable threshold) into "data:" URIs (this is very useful if you don't want lots of tiny image files littered around needlessly)
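
As a rough sketch of the first item in the list above, the following uses PHP's DOM extension to inventory what a document uses and then strip the attributes and style properties a user ticked. The function names (`parse_style`, `load_html_fragment`, `inventory_html`, `strip_html_parts`) are hypothetical, the style parsing is naive (it would break on a ";" inside a `url(...)` value), and stripping whole elements is left out for brevity.

```php
<?php

// Parse a style attribute into property => value pairs (naive: splits on ";" and ":").
function parse_style(string $style): array
{
    $props = array();
    foreach (explode(';', $style) as $declaration) {
        $parts = explode(':', $declaration, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $props[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $props;
}

// Load an HTML fragment leniently; office exports are rarely valid markup.
function load_html_fragment(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // Collect parse warnings instead of printing them
    $doc->loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// List the distinct elements, attributes and style properties in use.
function inventory_html(DOMDocument $doc): array
{
    $found = array('elements' => array(), 'attributes' => array(), 'styles' => array());
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->getElementsByTagName('*') as $el) {
        $found['elements'][$el->nodeName] = true;
        foreach ($el->attributes as $attr) {
            $found['attributes'][$attr->nodeName] = true;
            if ($attr->nodeName === 'style') {
                foreach (parse_style($attr->nodeValue) as $prop => $value) {
                    $found['styles'][$prop] = true;
                }
            }
        }
    }
    return array_map('array_keys', $found);
}

// Remove the chosen attributes and style properties, then return the body's inner HTML.
function strip_html_parts(DOMDocument $doc, array $attributes, array $styles): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->getElementsByTagName('*') as $el) {
        foreach ($attributes as $name) {
            $el->removeAttribute($name);
        }
        if ($el->hasAttribute('style')) {
            $kept = array_diff_key(parse_style($el->getAttribute('style')), array_flip($styles));
            $pairs = array();
            foreach ($kept as $prop => $value) {
                $pairs[] = $prop . ': ' . $value;
            }
            if ($pairs === array()) {
                $el->removeAttribute('style');
            } else {
                $el->setAttribute('style', implode('; ', $pairs));
            }
        }
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$doc = load_html_fragment('<p class="MsoNormal" style="margin:0cm; color:red"><span lang="EN-GB">Hi</span></p>');
$inventory = inventory_html($doc); // What the admin module could present as checkboxes
print_r($inventory);
$stripped = strip_html_parts($doc, array('class', 'lang'), array('margin'));
echo $stripped, "\n";
```

The inventory step is what would feed the checkbox list, and the strip step is what the live preview would re-run each time a box is ticked.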

To test all this we can throw a series of terrible HTML documents at it, exported from:
- Excel (different versions)
- LibreOffice Calc
- OpenOffice Calc
- Google Sheets
- Apple Numbers
- Microsoft Word
- LibreOffice Writer
- OpenOffice Writer
- Google Docs
- Apple Pages
- Microsoft Publisher

A good way to produce terrible HTML documents is to paste web pages into the software. That way many rich features go through a double conversion, which exposes the maximum amount of mess.
Steps to reproduce

Additional information Here's some code we can partly re-use, that moves images inline...

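// Replace every images/*.png reference in x.htm with a base64 "data:" URI
$c = file_get_contents('x.htm');
if ($c === false) {
    exit('Could not read x.htm');
}
$matches = array();
$num_matches = preg_match_all('#(images/\w+\.png)#', $c, $matches);
$remap = array();
for ($i = 0; $i < $num_matches; $i++) {
    $url = $matches[1][$i];
    $data = @file_get_contents($url);
    if ($data === false) {
        continue; // Leave the reference alone if the image cannot be read
    }
    $remap[$url] = 'data:image/png;base64,' . base64_encode($data);
}
$c = str_replace(array_keys($remap), array_values($remap), $c);
echo $c;
file_put_contents('x.htm', $c); // Overwrites the original file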
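
The snippet above only handles PNGs under images/ and ignores file size. A possible variant with the definable threshold might look like the sketch below. The function name `inline_small_images`, its parameters, and the extension-to-MIME map are all assumptions for illustration; it only looks at double-quoted `src` attributes on `<img>` tags and resolves them relative to a base directory.

```php
<?php

// Hypothetical helper: inline images at or below $max_bytes as "data:" URIs.
function inline_small_images(string $html, string $base_dir, int $max_bytes): string
{
    $mime_types = array('png' => 'image/png', 'gif' => 'image/gif', 'jpg' => 'image/jpeg', 'jpeg' => 'image/jpeg');

    return preg_replace_callback('#(<img\b[^>]*?\bsrc=")([^"]+)(")#i', function (array $m) use ($base_dir, $max_bytes, $mime_types) {
        $url = $m[2];
        $ext = strtolower(pathinfo($url, PATHINFO_EXTENSION));
        $path = $base_dir . '/' . $url;

        // Skip remote or already-inlined URLs, parent-directory paths, and unknown types
        if (preg_match('#^(\w+:|//)#', $url) || strpos($url, '..') !== false || !isset($mime_types[$ext])) {
            return $m[0];
        }

        // Skip missing files and anything above the threshold
        if (!is_file($path) || filesize($path) > $max_bytes) {
            return $m[0];
        }

        $data = file_get_contents($path);
        if ($data === false) {
            return $m[0];
        }

        return $m[1] . 'data:' . $mime_types[$ext] . ';base64,' . base64_encode($data) . $m[3];
    }, $html);
}

// Demo with stand-in files in a temporary directory (the bytes are not real images)
$dir = sys_get_temp_dir() . '/inline_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/small.png', 'abc');
file_put_contents($dir . '/big.png', str_repeat('x', 5000));

$result = inline_small_images('<img src="small.png"><img src="big.png">', $dir, 1024);
echo $result, "\n";
```

With a 1024 byte threshold only small.png is inlined, and the reference to big.png is left as it was.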
Funded? No