ID: 3147
Project: Composr
Category: core
View Status: public
Last Update: 2022-08-15 17:02
Reporter: Chris Graham
Assigned To: Chris Graham
Priority: normal
Severity: feature
Status: closed
Resolution: fixed
Summary: 3147: Review of cloud filesystem support
Description:
There are a few possible approaches to automatic synching of the filesystem on the cloud:
1) Mount the entire install on shared storage
2) Implement Composr's sync_file function, automatically detecting what change was done to a file then synching it out
3) Using a different subpath for all custom folders, mounting it under a path that is a shared storage mount (i.e. at the operating system level)
4) Using a different subpath for all custom folders, mounting it under a path that is a PHP file wrapper, and setting up so URLs under there are picked up by the Apache configuration too
5) rsync
6) Moving everything into the database
7) Use of an internal CDN transfer API instead of direct filesystem writing, with URLs generated according to that API (i.e. no direct correspondence between a URL and any particular file path)

It's tricky to know what to do, but we want something very architecturally clean and maintainable, not lots of different approaches needing expert configuration. If we define some design goals we can eliminate some approaches.

a) Files should be hostable on a CDN so that they may be served geographically close to the user. This will improve page load times.
b) Our CDN may not be able to host every kind of media (e.g. Cloudinary could not host non-images).
c) We need to be able to delete files.
d) It has to be reliable.
e) It has to be scalable.
f) It has to be easy to set up.
g) It can't bloat our code-base too much.
h) It has to be hard for a newbie Composr developer to forget to implement the functionality.
i) It cannot place unreasonable limitations on hardware architecture.
j) It has to have a wide compatibility with actual services people use.
k) It has to have a wide compatibility with actual web hosting people use.

We can therefore eliminate:
1 - This violates 'e' because it is a single bottleneck, and also 'i' because servers would need to be on the same cluster with a very high-performance I/O channel
2 - This violates 'h': developers can easily forget to call sync_file (they can't if they're running ocProducts PHP, but they're probably not); it also violates 'f'
3 - This violates 'a', 'f', 'h', 'j' and 'k'
4 - This violates 'a' and 'h'
5 - This probably wouldn't work at all, as rsync would not know the difference between a delete and a new file appearing on one particular server
6 - This violates 'a', 'i' and 'k' -- putting potentially GB of data into the database is not something we can reasonably expect the majority of users to accept
7 - This works, although it will be a lot of work.

I think we should remove the concept of 'sync_file'. Nobody ever used it.

Then I think we need to implement '7', combined with '4'. That is, we extend our current CDN transfer system so that CDN transfer hooks can accept control of any path/file-type combinations -- with a native PHP file access API using the PHP file wrappers functionality. CDN transfer hooks would sit behind our file wrapper. URLs would be converted via conversion functions that go each way.
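
To make the file wrapper idea concrete, here is a minimal sketch of a PHP stream wrapper. The 'cloudfs' scheme, the class name SketchCloudWrapper and the backing directory are all made up for illustration and are not Composr code. It just forwards to a local directory, where a real CDN transfer hook would sit.

```php
<?php

// Hypothetical wrapper (not Composr code): forwards cloudfs://<subpath> to a backing directory.
class SketchCloudWrapper
{
    public static $backing_dir = '/tmp/sketch_cloudfs'; // Assumed location, change before use
    public $context; // Set by PHP when a stream context is passed
    private $handle;

    // Map "cloudfs://uploads/a.txt" onto "<backing_dir>/uploads/a.txt"
    private static function resolve($uri)
    {
        $subpath = substr($uri, strlen('cloudfs://'));
        return rtrim(self::$backing_dir, '/') . '/' . ltrim($subpath, '/');
    }

    public function stream_open($uri, $mode, $options, &$opened_path)
    {
        $real = self::resolve($uri);
        // Auto-create missing directories, but only for modes that can write
        if (strpbrk($mode, 'wacx') !== false && !is_dir(dirname($real))) {
            mkdir(dirname($real), 0777, true);
        }
        $this->handle = @fopen($real, $mode);
        return $this->handle !== false;
    }

    public function stream_read($count)
    {
        return fread($this->handle, $count);
    }

    public function stream_write($data)
    {
        return fwrite($this->handle, $data);
    }

    public function stream_eof()
    {
        return feof($this->handle);
    }

    public function stream_stat()
    {
        return fstat($this->handle);
    }

    public function stream_close()
    {
        fclose($this->handle);
    }

    // Called for file_exists(), is_file(), etc.
    public function url_stat($uri, $flags)
    {
        $real = self::resolve($uri);
        return file_exists($real) ? stat($real) : false;
    }

    // Called for unlink(), which covers design goal 'c' (being able to delete files)
    public function unlink($uri)
    {
        return unlink(self::resolve($uri));
    }
}
```

After stream_wrapper_register('cloudfs', 'SketchCloudWrapper'), ordinary calls such as file_put_contents('cloudfs://uploads/banners/example.txt', 'hello') and unlink() on the same path go through the class, so calling code does not need to remember a separate sync step.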
Tags: ocProducts client-work (likely), Roadmap: v11, Type: Cloudification, Type: Cross-cutting feature, Type: External dependency, Type: Performance
Time estimation (hours): 64
Relationships

related to 3856 | Not Assigned | Guest | Addon isolation via virtual subtrees

Activities

Chris Graham

2017-03-21 22:01

administrator   ~4887

A simple default implementation of a CDN transfer hook (with associated config options) would allow just mapping of files onto a particular directory path and base URL combination.
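
A rough sketch of what such a default mapping could look like, assuming one directory path paired with one base URL. The class name SketchPathUrlMap and the example paths and URLs are invented for illustration. Conversion goes both ways, and anything outside the mapped directory or base URL is rejected with null.

```php
<?php

// Hypothetical mapper (not Composr code): one directory path paired with one base URL.
class SketchPathUrlMap
{
    private $dir;
    private $base_url;

    public function __construct($dir, $base_url)
    {
        $this->dir = rtrim($dir, '/');
        $this->base_url = rtrim($base_url, '/');
    }

    // Returns the URL for a file path, or null if the path is outside the mapped directory
    public function file_path_to_url($file_path)
    {
        if (strpos($file_path, $this->dir . '/') !== 0) {
            return null;
        }
        $subpath = substr($file_path, strlen($this->dir) + 1);
        // Encode each path segment separately so the slashes survive
        $encoded = implode('/', array_map('rawurlencode', explode('/', $subpath)));
        return $this->base_url . '/' . $encoded;
    }

    // Returns the file path for a URL, or null if the URL is outside the mapped base URL
    public function url_to_file_path($url)
    {
        if (strpos($url, $this->base_url . '/') !== 0) {
            return null;
        }
        $subpath = substr($url, strlen($this->base_url) + 1);
        $decoded = implode('/', array_map('rawurldecode', explode('/', $subpath)));
        return $this->dir . '/' . $decoded;
    }
}
```

The directory and base URL would come from the associated config options. Here they are just constructor arguments.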

Chris Graham

2017-04-03 16:15

administrator   ~4947

A peripheral thing I'd like to solve with this work is case-insensitive filenames. If you develop on Mac or Windows there's a chance you'll mess up with case-mismatches but not notice until it goes to a Linux server.

The filesystem wrapper would have an option to force case-sensitivity, even if just mapping to a case-insensitive filesystem.
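
One possible way to force case-sensitivity, sketched under the assumption that listing each directory is acceptable cost-wise: compare every path segment against the names the filesystem actually stores. The function name is hypothetical.

```php
<?php

// Hypothetical check (not Composr code).
// Returns true only if every segment of $subpath exists under $base_dir with exactly the same case.
function sketch_exists_case_sensitive($base_dir, $subpath)
{
    $current = rtrim($base_dir, '/');
    foreach (explode('/', trim($subpath, '/')) as $segment) {
        // scandir() reports names as stored on disk, even on a case-insensitive filesystem
        $entries = @scandir($current);
        if ($entries === false || !in_array($segment, $entries, true)) {
            return false;
        }
        $current .= '/' . $segment;
    }
    return true;
}
```

On Mac or Windows a plain file_exists() would accept 'example.png' for a file stored as 'Example.png'. This check would reject it, matching what a Linux server will do later.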

Chris Graham

2017-06-13 13:18

administrator   ~5139

Last edited: 2018-03-06 05:53

I gave this a lot more thought.

Notes...


function init__dyn_file_manager()
{
    define('DATA_CLASS_SYSTEM', 1);
    define('DATA_CLASS_SYSTEM_CUSTOM', 2);
    define('DATA_CLASS_USER', 4);
    define('DATA_CLASS_VOLATILE', 8);

    $GLOBALS['DYN_MANAGER'] = new DynFileManager();
}

class DynFileManager
{
    protected $hook_obs;

    function __construct()
    {
        $this->hook_obs = find_all_hook_obs('systems', 'dyn_file_manager');
    }

    function find_path($type, $data_class)
    {
        // May return a path that is a filesystem wrapper path; normal file operations can then be done
    }

    function find_file_path($type, $subpath, $data_class_filter = null)
    {
    }

    function find_url($type, $subpath, $relative = false, $data_class_filter = null)
    {
    }

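    // Convert a file path into the URL that serves it
    function file_path_to_url($file_path)
    {
    }

    // Convert a URL back into the file path behind it
    function url_to_file_path($url)
    {
    }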

    function find_unique_filename($type, $subpath)
    {
    }

    function copy_to($tmp_path, $type, $subpath)
    {
    }
}

class DatabaseFilesystemWrapper
{
}


Examples...

$GLOBALS['DYN_MANAGER']->find_file_path('uploads/banners', 'example.png');




Notes....

Instead of just "non-custom" and "custom", we now have "system", "system custom" and "user data" - and this is an override chain. Some things are probably only user data, e.g. uploads.

We move everything that changes during run-time and is shared between installs under a '_user_data' directory.
data/data_custom currently conflates too much. Have data/data_custom, resources/resources_custom, scripts/scripts_custom, logs.
uploads/website_specific will have to change, as this is not uploaded. Merge to resources_custom
Move caches and logs under a '_volatile' directory. Actually whether logs should be volatile or not should perhaps be configurable.
Document what '_volatile' and '_user_data' are (i.e. _volatile is not to be synced, _user_data is). Document the whole override chain system.

Our API will allow hooks to override the functionality

Options for specifying which directories are 'system data' vs 'user data' (so you can for example decide all theme files and Comcode pages are 'system data')
Other directories are hard-coded as one or the other
Warnings if editing anything that would edit 'system data' and therefore should be done on a development level - but only if an option is enabled for these warnings
Both kinds of data would be managed via the same API
Search both locations for data, but in priority order (even for ones hard-coded - as shared installs may be using stuff as system data).
Themes and translations should definitely be system data, as otherwise it would complicate distribution of them as addons.
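
A sketch of the priority-ordered search across data classes, reusing the DATA_CLASS_* values from the notes above. The function name, the $roots array and the exact priority order (user data first, then system custom, then system) are assumptions for illustration only.

```php
<?php

define('DATA_CLASS_SYSTEM', 1);
define('DATA_CLASS_SYSTEM_CUSTOM', 2);
define('DATA_CLASS_USER', 4);

// Hypothetical lookup (not Composr code): $roots maps data class => directory.
// Classes are checked highest-priority first; that order is an assumption.
function sketch_find_file_path(array $roots, $type, $subpath, $data_class_filter = null)
{
    $priority = [DATA_CLASS_USER, DATA_CLASS_SYSTEM_CUSTOM, DATA_CLASS_SYSTEM];
    foreach ($priority as $data_class) {
        // The filter is a bitmask of the classes the caller is willing to accept
        if ($data_class_filter !== null && ($data_class_filter & $data_class) === 0) {
            continue;
        }
        if (!isset($roots[$data_class])) {
            continue;
        }
        $candidate = $roots[$data_class] . '/' . $type . '/' . $subpath;
        if (is_file($candidate)) {
            return $candidate;
        }
    }
    return null; // Not found in any permitted class
}
```

Passing DATA_CLASS_SYSTEM | DATA_CLASS_SYSTEM_CUSTOM as the filter would skip user data entirely, which is roughly what a "system data only" lookup would need.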

What about new data (e.g. a new download) that is being added at the staging stage? Programmer will have to deal with this manually.

Is this the time to drop non-suexec support? Put check in installer that all files are owned by web user. Quick installer will now not extract using FTP, just FS - and complain if no write access. Remove all chmodding references. Remove abstract file manager. Change written minimum requirements. Remove fix_permissions.

Auto-create missing directories.

Ability to store all user data in DB. Controlled via hidden option, function to switch between that is documented.

Merge in cdn_transfer hook functionality (broadly these will become dyn_file_manager hooks)

Another kind of hook that just listens to changes (dyn_file_manager_sync). Needs to be called by filesystem wrappers and DynFileManager functions.
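
A minimal sketch of the listener idea, assuming a simple interface that gets told about each write and delete. All names here (SketchSyncListener, SketchRecordingListener, SketchNotifyingWriter) are hypothetical. In the real thing the notifications would be raised from the filesystem wrappers and DynFileManager functions.

```php
<?php

// Hypothetical listener interface (not Composr code) for change notifications.
interface SketchSyncListener
{
    public function on_change($operation, $path);
}

// Example listener that only records what happened; a real one might push to other servers
class SketchRecordingListener implements SketchSyncListener
{
    public $events = [];

    public function on_change($operation, $path)
    {
        $this->events[] = $operation . ':' . $path;
    }
}

// Stand-in for the file manager: performs the operation, then notifies listeners on success
class SketchNotifyingWriter
{
    private $listeners = [];

    public function add_listener(SketchSyncListener $listener)
    {
        $this->listeners[] = $listener;
    }

    public function write($path, $contents)
    {
        $ok = file_put_contents($path, $contents) !== false;
        if ($ok) {
            $this->notify('write', $path);
        }
        return $ok;
    }

    public function delete($path)
    {
        $ok = is_file($path) && unlink($path);
        if ($ok) {
            $this->notify('delete', $path);
        }
        return $ok;
    }

    private function notify($operation, $path)
    {
        foreach ($this->listeners as $listener) {
            $listener->on_change($operation, $path);
        }
    }
}
```

Because a delete arrives as an explicit event, a listener never has to guess whether a missing file was deleted or simply has not arrived yet, which is the ambiguity that rules out rsync.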

Re-write the tut_optimisation tutorial. Maybe rename to tut_performance.
Document that '_user_data' can be placed under shared storage. Or you can put in DB. Or you can have an addon that puts it elsewhere (or multiple places). Or you can implement a dyn_file_manager_sync hook. Document advantages - DB may be best because it is synced across machines automatically, so minimises bottleneck.
Document to NOT try and use rsync, as there is no 'master' and thus deletes would be messed up.

Kill sync_file.

get_custom_file_base and get_custom_base_url can go, as 'user data' is now same as custom data. Each install gets its own '_user_data' directory.

In dev-mode host the entire Composr filesystem under a filesystem wrapper and give errors if file-ops are done on things that should not be.

A peripheral thing I'd like to solve with this work is case-insensitive filenames. If you develop on Mac or Windows there's a chance you'll mess up with case-mismatches but not notice until it goes to a Linux server. The filesystem wrapper would have an option to force case-sensitivity, even if just mapping to a case-insensitive filesystem.

Implement "Allow backup to use cdn_transfer mechanism" (https://compo.sr/tracker/view.php?id=2962); this is almost done automatically.

Kill upload_syndication hooks. Over-complex and not really user friendly.

Chris Graham

2018-03-06 05:19

administrator   ~5550

This is now covered on this spreadsheet: https://docs.google.com/spreadsheets/d/1_yaJeGzDIsxq33I7Wg9I-lTBDk3YS22WPBwJ971v5tI
Services referenced:
 Amazon S3
 Cloudinary
 Dropbox
 Google Drive

Chris Graham

2018-06-22 18:39

administrator   ~5744

" The filesystem wrapper would have an option to force case-sensitivity, even if just mapping to a case-insensitive filesystem. " - this is now done in debug_fs.php, although it is only implemented as a specific debug option, and not related to the rest of the functionality discussed here.

Chris Graham

2018-12-24 02:54

administrator   ~5892

Last edited: 2021-11-18 00:32

A fresh look at all this, which I think is both simpler and more powerful...

Overall strategy
================

We have many different kinds of "content across servers" scenarios we need to support well:

1) Content Delivery Networks [CDN] (locate asset files geographically close to users to minimise site download time)
2) Server farms (spread load across multiple servers)
3) Staging servers (pushing content from a staging server to a live server)
4) Multi-site (having content on a Demonstratr-style master site available to satellite sites - e.g. site-builder scenario)
5) Git (implementing content inside a Git repository then pulling it live)

And here's how we approach them...

1) Content Delivery Networks -- promote smart CDNs that will automatically pull assets from the master site on-demand, with proper cache management (avoiding the need for the server to ever proactively push anything)
2) Server farms -- support mounting network storage onto the new smart filesystem feature or implement hooks on it
3) Staging servers -- new Sync UI feature
4) Multi-site -- new smart filesystem feature
5) Git -- no special approach needed

Sync UI
=======

Have a new UI for synching between a staging site and a live site.

The details of the live site would need configuration, some kind of API-key system.

It would be laid out something like...

<table>
    <thead>
        <tr>
            <th colspan="3">Repository object [sort]</th>
            <th colspan="2">Modification date [sort]</th>
            <th colspan="2">CRC</th>
            <th rowspan="2">Sync action</th>
        </tr>
        <tr>
            <th>GUID</th>
            <th>Title</th>
            <th>Moniker</th>
            <th>Staging</th>
            <th>Live</th>
            <th>Staging</th>
            <th>Live</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>xxx</td>
            <td>xxx</td>
            <td>xxx</td>
            <td>3rd Dec 2018 2:03 pm</td>
            <td>(only on staging)</td>
            <td>xxx</td>
            <td>(only on staging)</td>
            <td>
                <select>
                    <option>Leave staging-only</option>
                    <option disabled>Leave live-only</option>
                    <option>Copy from staging to live (includes revision history)</option>
                    <option disabled>Copy from live to staging (includes revision history)</option>
                    <option disabled>Delete from live</option>
                    <option>Delete from staging</option>
                    <option disabled>Delete from both staging and live</option>
                </select>
            </td>
        </tr>
        ...
    </tbody>
</table>


    Also clear caches aggressively on live after sync <input type="checkbox" />

    Take database backup on staging <input type="checkbox" />

    Take database backup on live <input type="checkbox" />



<button>Sync</button>

For this synching system to work well, we'd ideally want to completely remove IDs from Composr and replace with GUIDs. We currently do have GUIDs as an optional feature, but they're not usually used.

Also we need to consider that a sync may result in a moniker conflict, and we have to somehow ask how to resolve that.
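
To illustrate how GUIDs plus a CRC could drive the table above, here is a small sketch. The array structure (GUID => ['content' => ...]) and the function name are hypothetical. It only classifies each object and does not attempt moniker conflict resolution.

```php
<?php

// Hypothetical comparison (not Composr code).
// Each side is an array of GUID => ['content' => string].
function sketch_compare_sites(array $staging, array $live)
{
    $guids = array_unique(array_merge(array_keys($staging), array_keys($live)));
    sort($guids);

    $rows = [];
    foreach ($guids as $guid) {
        $staging_crc = isset($staging[$guid]) ? crc32($staging[$guid]['content']) : null;
        $live_crc = isset($live[$guid]) ? crc32($live[$guid]['content']) : null;

        if ($live_crc === null) {
            $state = 'only on staging';
        } elseif ($staging_crc === null) {
            $state = 'only on live';
        } elseif ($staging_crc === $live_crc) {
            $state = 'identical';
        } else {
            $state = 'differs';
        }

        $rows[$guid] = [
            'staging_crc' => $staging_crc,
            'live_crc' => $live_crc,
            'state' => $state,
        ];
    }
    return $rows;
}
```

Matching on GUID rather than numeric ID is what makes this workable: two sites can each create "ID 57" independently, but a GUID created on staging should only ever refer to the same object on live.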

Smart filesystem
================

Instead of:
 - get_custom_file_base
 - sync_file
 - (much of what is considered in this issue with having a custom class to solve it)

... have all file I/O go through a custom PHP stream-wrapper

Allow configuring into _config.php how other paths mount onto the default base directory (which may themselves be PHP stream-wrappers, FUSE-mounts, or whatever).

Allow mounting multiple paths in the same position, with precedence. This allows a multi-site scenario to work well.

All I/O operations would support hooks, so you can for example write custom sync code to sync onto different servers.
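
A sketch of how mounting with precedence might resolve a path, assuming a mount table of mount point => list of backing paths (highest precedence first). The function name, the table layout and the choice to send writes to the top layer are all assumptions, not a settled design.

```php
<?php

// Hypothetical resolver (not Composr code).
// $mounts maps a mount point to a list of backing paths, highest precedence first.
function sketch_resolve_mounted(array $mounts, $virtual_path, $for_write = false)
{
    // Try the longest mount point first so that '/uploads/banners' beats '/uploads'
    uksort($mounts, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    foreach ($mounts as $mount_point => $backings) {
        $prefix = rtrim($mount_point, '/') . '/';
        if (strpos($virtual_path, $prefix) !== 0) {
            continue;
        }
        $rest = substr($virtual_path, strlen($prefix));

        if ($for_write) {
            // Assumption: writes always land in the highest-precedence layer
            return $backings[0] . '/' . $rest;
        }
        foreach ($backings as $backing) {
            if (file_exists($backing . '/' . $rest)) {
                return $backing . '/' . $rest;
            }
        }
        return null; // Mounted here, but no layer has the file
    }
    return null; // No mount covers this path
}
```

In a multi-site setup a satellite site could list its own directory first and the master site's directory second, so it sees master content by default but can override individual files.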

Chris Graham

2020-11-07 20:18

administrator   ~6795

This issue isn't fully updated, but we have been, and are, generally moving in this direction - hopefully concluding for v12.

I hope we can do the service integration leveraging Hybridauth. This way we don't need to be fully responsible for implementing/maintaining all the integrations we might ever want to do, we pool the work with other users of Hybridauth. For 2020 Hybridauth is getting lots of new code for sharing social networks 'Atoms' back and forth, for Hybridauth providers that implement that. I've designed the API to be flexible enough to also use this same API for filesystem access.

Hybridauth work needed...

Add new providers:
 Amazon S3 (no auth)
 Cloudinary (no auth)
 Google Drive

Add Atom support to existing providers:
 Dropbox

The capabilities code in sources_custom/hybridauth_admin.php will be important, and need to be extended for these new providers. We rely on this metadata to know what providers can actually be integrated for filesystem support, and how.
I was considering trying to merge this into Hybridauth too, but I think it would bloat up that project too much.

Chris Graham

2021-11-18 00:38

administrator   ~7169

I am closing this issue.
Many superior new ideas have happened since it was created, and a lot has been completed - and really we're talking about a lot at once here.

The Hybridauth stuff was implemented last year and allows so many third party logins, pull of third party content via Atom, and push of content via an Atom-like API (replacing our old syndication functionality).

Soon I will merge support for a new "cloud filesystem" within Composr, implemented on top of PHP stream wrappers. This achieves a lot of what is discussed in this issue, but in a simpler way.
A new logging API has also been completed, allowing syndication of individual log lines via the database or via syslog.

What remains will be moved to new issues.

Issue History

Date Modified Username Field Change
2017-03-21 21:58 Chris Graham New Issue
2017-03-21 21:59 Chris Graham Relationship added related to 1392
2017-03-21 21:59 Chris Graham Tag Attached: Type: Performance
2017-03-21 21:59 Chris Graham Relationship added related to 2020
2017-03-21 22:01 Chris Graham Note Added: 0004887
2017-04-03 16:15 Chris Graham Note Added: 0004947
2017-05-01 16:04 Chris Graham Tag Attached: Type: Cross-cutting feature
2017-05-01 17:08 Chris Graham Relationship added related to 2962
2017-05-11 11:49 Chris Graham Tag Attached: ocProducts client-work (likely)
2017-06-13 13:18 Chris Graham Note Added: 0005139
2017-06-13 13:41 Chris Graham Note Edited: 0005139
2018-03-06 04:45 Chris Graham Relationship added related to 2980
2018-03-06 05:19 Chris Graham Note Added: 0005550
2018-03-06 05:53 Chris Graham Note Edited: 0005139
2018-06-22 18:39 Chris Graham Note Added: 0005744
2018-12-24 02:54 Chris Graham Note Added: 0005892
2019-06-27 19:01 Chris Graham Tag Attached: Roadmap: v12
2019-06-27 19:47 Chris Graham Tag Attached: Type: External dependency
2019-07-20 02:46 Chris Graham Relationship added related to 3549
2019-07-22 19:25 Chris Graham Relationship added related to 3856
2019-12-08 03:43 Chris Graham Relationship added related to 3792
2020-01-26 22:55 Chris Graham Relationship added related to 4052
2020-11-07 20:12 Chris Graham Tag Attached: Type: Cloudification
2020-11-07 20:12 Chris Graham Relationship deleted related to 4052
2020-11-07 20:12 Chris Graham Relationship deleted related to 3792
2020-11-07 20:12 Chris Graham Relationship deleted related to 3549
2020-11-07 20:12 Chris Graham Relationship deleted related to 2962
2020-11-07 20:12 Chris Graham Relationship deleted related to 2980
2020-11-07 20:13 Chris Graham Relationship deleted related to 1392
2020-11-07 20:18 Chris Graham Note Added: 0006795
2020-11-10 02:36 Chris Graham Relationship deleted related to 2020
2021-11-18 00:32 Chris Graham Note Edited: 0005892
2021-11-18 00:38 Chris Graham Note Added: 0007169
2021-11-18 00:38 Chris Graham Assigned To => Chris Graham
2021-11-18 00:38 Chris Graham Status Not Assigned => Closed
2021-11-18 00:38 Chris Graham Resolution open => fixed
2022-08-15 17:02 Chris Graham Tag Detached: Roadmap: v12
2022-08-15 17:02 Chris Graham Tag Attached: Roadmap: v11