#3587 - Internationalised e-mail addresses and URLs

We allow URL monikers (optional *) and codenames to contain Unicode characters, so long as reserved characters are not used. These will still be URL-encoded in a nasty way in the real URLs Composr uses, because URLs have to be ASCII-safe. For URLs encoded using HarmlessURLCoder (optional), we show Unicode characters directly, because we only show them in a text context that we control.

* The Composr webmaster controls whether monikers are made using Unicode, or transliteration.
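
To illustrate why the real URLs look nasty, here is a minimal sketch in Python (standing in for PHP here; the moniker value is made up). It shows a Unicode moniker being percent-encoded as UTF-8 bytes, then decoded again for display:

```python
from urllib.parse import quote, unquote

# Hypothetical moniker containing a non-ASCII character.
moniker = "café-menu"

# Percent-encode every UTF-8 byte that is not URL-safe ASCII.
encoded = quote(moniker, safe="")

# Decoding restores the readable form, e.g. for display in a text context.
decoded = unquote(encoded)

print(encoded)  # caf%C3%A9-menu
print(decoded)  # café-menu
```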

That said, our URL-encoded URLs to downloads may overflow the available database field space. In that case we bend the rules and save non-ASCII URLs into our database instead. That is the best compromise in such a case and has no practical bugs relating to it.
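
The fallback could be sketched roughly as below. This is an illustration in Python, not Composr's actual code: the function name `url_for_storage` and the 255-character limit are assumptions, since the real field size is not stated here.

```python
from urllib.parse import quote

# Hypothetical column width; the real database field size is an assumption.
MAX_URL_FIELD_LENGTH = 255

def url_for_storage(url: str, max_length: int = MAX_URL_FIELD_LENGTH) -> str:
    """Prefer the percent-encoded URL; keep the raw Unicode form if it overflows."""
    # Leave URL structure characters (and existing escapes) alone, encode the rest.
    encoded = quote(url, safe=":/?&=#%")
    if len(encoded) <= max_length:
        return encoded
    # Bend the rules: store the shorter non-ASCII form instead.
    return url
```

Each non-ASCII character grows to between six and twelve ASCII characters once encoded, so a URL that fits comfortably in its raw form can overflow after encoding. The sketch does not handle a raw URL that is itself too long.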

Additionally, we have the capability for transliteration. On old PHP versions on Windows we have to transliterate filenames (and hence URLs to those files), because those PHP versions have no Unicode filesystem support.
We always transliterate directory names due to poor PHP support.
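
For anyone unfamiliar with transliteration, here is a minimal Python sketch of the general idea. It is not the transliteration Composr uses; the function name is hypothetical and the approach (Unicode decomposition, then dropping non-ASCII) is only one simple technique:

```python
import unicodedata

def transliterate_to_ascii(name: str) -> str:
    # Decompose accented characters into base letter + combining mark (NFKD),
    # then drop everything that has no ASCII representation.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(transliterate_to_ascii("Résumé übersicht.pdf"))  # Resume ubersicht.pdf
```

Characters without a decomposition (for example Cyrillic or CJK text) are simply dropped by this sketch, so a real implementation would need proper mapping tables.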

It's also worth explaining the difference between urlencode, rawurlencode, cms_urlencode, cms_rawurlrecode, and HarmlessURLCoder.

rawurlencode - PHP function for standardised URL encoding.

urlencode - PHP function for URL encoding specifically for GET parameters. It's the same as rawurlencode except spaces become "+".

cms_urlencode - A layer around urlencode that provides Composr-specific encoding that stops Apache's mod_rewrite from corrupting certain special characters during its "smart" processing.

cms_rawurlrecode - Shortens URLs that are too long for the database by intelligently cheating in our encoding. The URLs are not technically valid but will work.

HarmlessURLCoder - Simplifies/desimplifies URLs, trading standards compliance for human-readability. Similar to what browsers do in their address bars. It is a non-destructive operation that doesn't allow for double encoding or double decoding. Non-Latin characters in URLs encoded with HarmlessURLCoder are much easier to use.
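
The space handling that separates the two stock PHP functions above can be seen with their rough Python counterparts. This is only an analogy: `quote` and `quote_plus` are Python standard library functions, not the PHP ones, and they differ from PHP in a few edge cases such as `~`. The Composr-specific functions are not reproduced here.

```python
from urllib.parse import quote, quote_plus

value = "my file.txt"

# Comparable to PHP's rawurlencode: spaces become %20.
path_style = quote(value, safe="")

# Comparable to PHP's urlencode: spaces become +.
query_style = quote_plus(value)

print(path_style)   # my%20file.txt
print(query_style)  # my+file.txt
```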

I've implemented Punycode support.
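
For reference, this is what Punycode does to an internationalised domain name. The sketch uses Python's built-in `idna` codec purely as an illustration, with a made-up domain; it is not the Composr implementation:

```python
# Hypothetical internationalised domain name.
domain = "münchen.example"

# Convert to the ASCII-compatible ("xn--") form used on the wire.
ascii_domain = domain.encode("idna").decode("ascii")

# Convert back to the readable Unicode form.
round_trip = ascii_domain.encode("ascii").decode("idna")

print(ascii_domain)  # xn--mnchen-3ya.example
print(round_trip)    # münchen.example
```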

I'm leaving e-mail alone for now, as e-mail validation is a mess:
http://emailregex.com/email-validation-summary/
And I'm happy to reinforce the consensus of simple addresses for now.