View Issue Details

ID: 3849
Project: Composr
Category: core
View Status: public
Last Update: 2021-05-01 02:59
Reporter: Chris Graham
Assigned To: (unassigned)
Priority: normal
Severity: feature
Status: new
Resolution: open
Summary: 3849: HTML cleanup framework, and new admin module
Description: We have a morass of HTML cleanup rules in comcode_from_html.php, which is rarely used any more (we don't do HTML to Comcode conversion unless it is enabled, and we recommend against enabling it).

We also have some HTML ugliness detections defined for the new Health Check in 3793.

We also have some cleanups in the Confluence integration addon for v11.

And finally, there is a broad set of things we might want to clean up but cannot clean up automatically, because whether to do so requires the user's choice. A good example is HTML exported from Microsoft Excel, which tends to be dreadfully over-specified, yet we need the user to say which of it is safe to remove.

So, a new unified framework of cleanups, likely hook-powered, would be good. Each hook could detect and resolve one or more problems. comcode_from_html.php would be just for Comcode conversion, and the HTML cleanup would be a separate phase controlled by our framework.
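
A rough sketch of what a detect/resolve hook might look like. The interface name, method names and the runner function are all hypothetical (this is not Composr's existing hook API), and the example hook reads "trailing spaces on attributes" as whitespace just before the closing quote of a double-quoted attribute value.

```php
<?php
// Hypothetical contract for one cleanup; not an existing Composr API.
interface HtmlCleanupHook
{
    public function label(): string;               // Shown to the user when choosing cleanups
    public function detect(string $html): bool;    // Does the problem occur in this HTML?
    public function resolve(string $html): string; // Return the HTML with the problem fixed
}

// Example hook: strips spaces/tabs just before the closing quote of an attribute value.
class TrailingAttributeSpaceCleanup implements HtmlCleanupHook
{
    private const PATTERN = '/(\s[\w:-]+="[^"]*?)[ \t]+"/';

    public function label(): string
    {
        return 'Trailing spaces in attribute values';
    }

    public function detect(string $html): bool
    {
        return preg_match(self::PATTERN, $html) === 1;
    }

    public function resolve(string $html): string
    {
        return preg_replace(self::PATTERN, '$1"', $html) ?? $html;
    }
}

// Runs only the cleanups the user ticked, and only where the problem is detected.
function run_html_cleanups(string $html, array $hooks, array $enabled): string
{
    foreach ($hooks as $name => $hook) {
        if (in_array($name, $enabled, true) && $hook->detect($html)) {
            $html = $hook->resolve($html);
        }
    }
    return $html;
}
```

Keeping detect() separate from resolve() means the admin module could list only the cleanups that actually apply to the HTML in question.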

Then there'd be a new admin module that would let you select which cleanups to do, with live preview of both the HTML code and rendered HTML.

Here's a list of new cleanups I'd like...
 - List all elements, attributes, styles - ability to choose which to strip (this is the Excel example, where it'd be great to just check off most tags/attributes/style-rules for removal)
 - HTML reformatting (we have XHTML reformatting code we can tie into)
 - Trailing spaces on attributes (I believe that the Health Check does it for the end of HTML, but not attributes specifically)
 - Move small image files (definable threshold) into "data:" URIs (this is very useful if you don't want lots of tiny image files littered around needlessly)
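
For the first item, here is a minimal sketch of how the inventory could be gathered with PHP's DOM extension. The function name and the shape of the returned array are assumptions for illustration, and it assumes the input is a fragment (the html/body wrappers the parser adds are skipped).

```php
<?php
// Sketch only: counts what appears in an HTML fragment so a UI could offer each item for removal.
function inventory_html(string $html): array
{
    $found = ['elements' => [], 'attributes' => [], 'styles' => []];
    if (trim($html) === '') {
        return $found;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // Exported HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $el) {
        if (in_array($el->nodeName, ['html', 'body'], true)) {
            continue; // Wrapper elements the parser adds around a fragment
        }
        $found['elements'][$el->nodeName] = ($found['elements'][$el->nodeName] ?? 0) + 1;

        foreach ($el->attributes as $attr) {
            $found['attributes'][$attr->nodeName] = ($found['attributes'][$attr->nodeName] ?? 0) + 1;
            if ($attr->nodeName !== 'style') {
                continue;
            }
            // Split the inline style into declarations and count each property name
            foreach (explode(';', $attr->nodeValue) as $rule) {
                $property = strtolower(trim(explode(':', $rule, 2)[0]));
                if ($property !== '') {
                    $found['styles'][$property] = ($found['styles'][$property] ?? 0) + 1;
                }
            }
        }
    }
    return $found;
}
```

The counts could drive a checklist in the admin module, with the most frequent items listed first.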

To test all this we can throw a series of terrible HTML documents at it, exported from:
 - Excel (different versions)
 - LibreOffice Calc
 - OpenOffice Calc
 - Google Sheets
 - Apple Numbers
 - Microsoft Word
 - LibreOffice Writer
 - OpenOffice Writer
 - Google Docs
 - Apple Pages
 - Microsoft Publisher

A good way to produce terrible HTML documents is to paste web pages into the software. That way it goes through a double conversion for many rich features, and really exposes maximal mess.
Additional Information: Here's some code we can partly re-use, which moves images inline...

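// Inline every referenced images/*.png file into x.htm as a base64 "data:" URI
$c = file_get_contents('x.htm');
$matches = array();
$num_matches = preg_match_all('#(images/\w+\.png)#', $c, $matches);
$remap = array();
for ($i = 0; $i < $num_matches; $i++) {
    $url = $matches[1][$i];
    $data = @file_get_contents($url); // Path is resolved relative to the current working directory
    if ($data === false) {
        continue; // Leave the reference untouched if the image cannot be read
    }
    $remap[$url] = 'data:image/png;base64,' . base64_encode($data);
}
$c = str_replace(array_keys($remap), array_values($remap), $c);
echo $c;
file_put_contents('x.htm', $c); // Overwrites the original file in place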
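
A possible generalisation of the above with the definable size threshold, as a sketch only. The function name, the parameters and the small extension-to-MIME map are assumptions, and it only looks at double-quoted src attributes on img tags with relative paths.

```php
<?php
// Sketch: inline images at or below $max_bytes; larger or unreadable ones are left alone.
function inline_small_images(string $html, string $base_dir, int $max_bytes = 2048): string
{
    $mime_types = ['png' => 'image/png', 'gif' => 'image/gif', 'jpg' => 'image/jpeg', 'jpeg' => 'image/jpeg'];

    $result = preg_replace_callback(
        '#(<img\b[^>]*?\ssrc=")([^":]+)(")#i', // [^":] skips absolute URLs and existing data: URIs
        function (array $m) use ($base_dir, $max_bytes, $mime_types): string {
            if (strpos($m[2], '..') !== false) {
                return $m[0]; // Do not follow paths that climb out of the base directory
            }
            $path = $base_dir . '/' . $m[2];
            $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
            if (!isset($mime_types[$ext]) || !is_file($path) || filesize($path) > $max_bytes) {
                return $m[0];
            }
            $data = file_get_contents($path);
            if ($data === false) {
                return $m[0];
            }
            return $m[1] . 'data:' . $mime_types[$ext] . ';base64,' . base64_encode($data) . $m[3];
        },
        $html
    );

    return $result ?? $html; // preg_replace_callback() returns null on a regex error
}
```

It could be called as inline_small_images($c, dirname('x.htm'), 4096), returning the HTML rather than writing it so that the admin module can preview the result first.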
Tags: Roadmap: Over the horizon
Time estimation (hours): 20
Sponsorship: open

Activities

Chris Graham

2021-03-24 16:37

administrator   ~7029

Here's a sample of terrible MS Word HTML for just one list item:

<li style="text-align:justify"><span style="font-size:12pt"><span style="font-family:"Times New Roman",serif"><span style="letter-spacing:0.25pt"><span style="font-family:"Segoe UI",sans-serif">Use this link to whatever.htm</span></span></span></span> <span style="font-size:12pt"><span style="font-family:"Times New Roman",serif"><span style="letter-spacing:0.25pt"><span style="font-family:"Segoe UI",sans-serif">Lorem Ipsum</span></span></span></span>

Or possibly it's a combination of MS Word HTML and style changes made in CKEditor.

In English this is saying:

    Make a list item...
    with justified text
    Inside, make the font size 12pt (overriding the site's default font size)
    Inside, make the font Times New Roman (overriding the site's default font)
    Inside, make the letter spacing 0.25pt (a very esoteric thing to do, overriding the site's default)
    Inside make the font Segoe UI (overriding the font just set)
    ... ( a bit later ) ...
    Close off all those styles
    Now, make the font size 12pt (repeating again what was just closed off)
    Now, make the font Times New Roman (")
    Now, make the letter spacing 0.25pt (")
    Now make the font Segoe UI (")

We should be able to know what the default font size and font are, and remove any top level rules that just restate them. It's not 100% perfect as a rule, as we don't know whether the global defaults are actually in play in context, but >99% of the time they will be.
We should be able to strip any rule that gets overridden by a lower-level rule before anything is rendered with it.
It needs to be recursive.
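
A recursive sketch of those two rules using PHP's DOM extension. All function names are hypothetical, the site defaults are passed in as an assumed property => value array, and only the simple case is handled: a span whose sole child is another span carrying nothing but a style attribute. Merging adjacent sibling spans that repeat the same styles is not attempted.

```php
<?php
// Parse an inline style into property => value pairs (later duplicates win).
function parse_style(string $style): array
{
    $rules = [];
    foreach (explode(';', $style) as $rule) {
        $parts = explode(':', $rule, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $rules[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $rules;
}

function has_only_style(DOMElement $el): bool
{
    return $el->attributes->length === ($el->hasAttribute('style') ? 1 : 0);
}

function ancestor_sets(DOMNode $node, string $property): bool
{
    for ($p = $node->parentNode; $p instanceof DOMElement; $p = $p->parentNode) {
        if (array_key_exists($property, parse_style($p->getAttribute('style')))) {
            return true;
        }
    }
    return false;
}

function collapse_spans(DOMNode $node, array $defaults): void
{
    // Work bottom-up, over a snapshot because the tree is modified as we go
    foreach (iterator_to_array($node->childNodes) as $child) {
        collapse_spans($child, $defaults);
    }
    if (!($node instanceof DOMElement) || $node->nodeName !== 'span') {
        return;
    }
    $rules = parse_style($node->getAttribute('style'));

    // A sole child span overrides us: merge its rules (inner wins) and lift its contents
    $only = ($node->childNodes->length === 1) ? $node->firstChild : null;
    if ($only instanceof DOMElement && $only->nodeName === 'span' && has_only_style($only)) {
        $rules = array_merge($rules, parse_style($only->getAttribute('style')));
        while ($only->firstChild) {
            $node->insertBefore($only->firstChild, $only);
        }
        $node->removeChild($only);
    }

    // Drop rules that restate an assumed site default, unless an ancestor changed that property
    foreach ($defaults as $property => $value) {
        if (($rules[$property] ?? null) === $value && !ancestor_sets($node, $property)) {
            unset($rules[$property]);
        }
    }

    if ($rules !== []) {
        $pairs = [];
        foreach ($rules as $property => $value) {
            $pairs[] = $property . ':' . $value;
        }
        $node->setAttribute('style', implode(';', $pairs));
        return;
    }
    $node->removeAttribute('style');
    if ($node->attributes->length === 0) { // Nothing left on the span, so unwrap it
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }
}

function clean_span_soup(string $html, array $defaults): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // Note: treats input as ISO-8859-1 unless it declares a charset
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }
    collapse_spans($body, $defaults);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

On a four-deep chain like the list item above, with array('font-size' => '12pt') assumed as the site default, this should leave a single span carrying only the Segoe UI font and the letter spacing.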

Chris Graham

2021-04-30 02:51

administrator   ~7081

Last edited: 2021-05-01 02:59

Another common problem is empty paragraphs appearing at the end of a document, possibly before a Tempcode comment.

EDIT: Another is setting "font-size:1em". Does absolutely nothing.
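
Both of these look simple enough to sketch with regular expressions. The function names are hypothetical, a paragraph counts as "empty" here if it holds only whitespace, &nbsp; or br tags, and the case of empty paragraphs sitting before a trailing Tempcode comment is not covered.

```php
<?php
// Sketch: remove empty paragraphs (and surrounding whitespace) from the end of the HTML.
function strip_trailing_empty_paragraphs(string $html): string
{
    $pattern = '#(?:\s*<p\b[^>]*>(?:\s|&nbsp;|<br\s*/?>)*</p>)+\s*$#i';
    return preg_replace($pattern, '', $html) ?? $html;
}

// Sketch: drop "font-size:1em" declarations from double-quoted style attributes.
function strip_noop_font_size(string $html): string
{
    $result = preg_replace_callback('#\sstyle="([^"]*)"#i', function (array $m): string {
        $kept = [];
        foreach (explode(';', $m[1]) as $rule) {
            $normalised = strtolower(preg_replace('#\s+#', '', $rule));
            if ($normalised !== '' && $normalised !== 'font-size:1em') {
                $kept[] = trim($rule);
            }
        }
        // Remove the whole attribute if no declarations are left
        return ($kept === []) ? '' : ' style="' . implode(';', $kept) . '"';
    }, $html);

    return $result ?? $html;
}
```

Splitting declarations on ";" is naive (a semicolon inside a url() value would break it), so a real hook would probably want a proper style parser.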

Issue History

Date Modified     Username      Field                        Change
2019-07-19 16:14  Chris Graham  New Issue
2021-03-15 17:33  Chris Graham  Description Updated
2021-03-24 16:37  Chris Graham  Note Added: 0007029
2021-03-24 16:37  Chris Graham  Tag Attached: Roadmap: v12
2021-04-30 02:51  Chris Graham  Note Added: 0007081
2021-05-01 02:59  Chris Graham  Note Edited: 0007081
2024-03-26 00:58  PDStig        Tag Renamed                  Roadmap: v12 => Roadmap: Over the horizon