Composr Tutorial: Searching your website

Written by Chris Graham
As you add content to your website, it becomes increasingly important that your visitors are able to find your content when they need to. Composr includes a 'search' feature, which allows you to search your entire website for content.
Under the hood this is implemented using either the database's full-text search capabilities or the fast custom index.


Simple searching

Image: Detailed searches

Image: The search block for use in panels (different to the search block in the header)

By default, there is a 'search' block that sits in the header of your website. This is the easiest way to use the search. There is a text box where you can type what you're looking for. The search button will then search your entire website for matching content and results will be displayed on a results screen.

Each kind of content type displays in its own way. For example, matched news posts will look similar to how news looks in the news archive.

Detailed searches

If you would like to carry out detailed searches you need to go to the search module.

There are 3 typical ways to reach this:
  1. By conducting a basic search from the top bar and then changing the options at the bottom of the search results.
  2. From the search module (site:search page-link, About > Search on the default menus).
  3. If the search block has been put on a panel then there is a 'More' button there too.

The search module has many options:

Search For

This is the text box where you type the content that you're searching for.

Natural vs Boolean searching

By default natural searches are used. These are inexact but demand less precision from the user. For example, a search for "Jump the biggest bar" might match even if "biggest" isn't in the result. Results are ranked based on how well they match. Word sequence does not matter.

You may also do boolean searches which are more exact. To activate boolean search, just use any of the following boolean operators in your query:
  • Put speech marks around words that you would like to occur in sequence
  • Put a '-' before a word to exclude it
  • Put a '+' before a word to require it.

For MySQL databases, boolean searches will still ignore stop words, or words shorter than MySQL is configured to index.
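The operators above can be illustrated with a small parser. This is only a sketch of the syntax described here, not Composr's or MySQL's actual parsing code; the function names `parse_boolean_query` and `is_boolean` are hypothetical.

```python
import re

# A token is either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_boolean_query(query):
    """Split a query into quoted phrases, required, excluded and plain words."""
    parsed = {"phrases": [], "required": [], "excluded": [], "plain": []}
    for phrase, word in TOKEN.findall(query):
        if phrase:
            parsed["phrases"].append(phrase)
        elif word.startswith("+") and len(word) > 1:
            parsed["required"].append(word[1:])
        elif word.startswith("-") and len(word) > 1:
            parsed["excluded"].append(word[1:])
        else:
            parsed["plain"].append(word)
    return parsed


def is_boolean(parsed):
    """A query counts as boolean as soon as any operator was used."""
    return bool(parsed["phrases"] or parsed["required"] or parsed["excluded"])
```

A query with no operators, such as `Jump the biggest bar`, comes back with only plain words, which corresponds to a natural search.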

Search only titles

With this option checked, Composr will only search titles of content.

Author/Submitter

In this field, you can type the name of a member on the website. If you do this, Composr will only look for entries that this person has submitted. It also matches against author names.

Submitted within the previous

In this field, you can set a cut-off date, to not show entries that are older.

Sort by

In this field, you can specify what order you would like your results shown in.

Search the following content types

Placing a checkmark beside each content type will cause Composr to search for entries in these content types. Clearing the checkbox will cause Composr not to search in these locations.

Advanced searches

Image: Performing an advanced search for a specific content type

Many forms of content allow additional filters which allow you to:
  1. search underneath a chosen category
  2. perform template searches on individual fields (these apply as searches on top of the main search query)
  3. specify some extra checkbox options
An advanced search limits you to that individual content type. The advanced searching screen is reached by clicking one of the 'Advanced' links on the main search screen.

Searching from the Forum

Image: The search button on the forum

To initiate a search while in the forum (our own Conversr forum), you need to click the 'Search' button on the forum or use the contextual search box on the forum member bar. The contextual search will search beneath your current forum if you are on a forum-view screen, or within your current topic if you are on a topic-view screen.

User hand-holding

Search autocompletion

When you start typing out a search it can autocomplete. This is based on:
  • Common past searches
  • Matching keywords for the search type
  • Matching titles for the search type

All these cases are controlled via privileges (the "Autocomplete searches based on xxx" ones), as potentially it is a leak of private or privileged information. No permissions are checked, so if you grant the privileges then content titles and keywords from private content can potentially leak out.

Did you mean?

If you have spell checking enabled on your server (pspell or enchant PHP extension), then misspellings will result in a suggestion to run a search on an autocorrected search term. Any keyword terms on the site will be considered real words and not autocorrected.

Result counts

Unfortunately result counts have to be an approximation. To de-duplicate the result count we'd need to load in the full record sets for each query pattern that runs, which can be incredibly slow, especially if searches are broad.

Improving search results

The title fields, and meta keyword fields, get precedence when search results are determined. Tuning these manually for your content can improve search results considerably. Additionally, keywords are individually queried rather than going through full sentence searching, so you can specify things more precisely, e.g. to include hyphens (which full-text search treats like spaces).

Ultimately, full-text search effectiveness resides in MySQL (or whatever database you use), not Composr. Here are some particulars for MySQL:
  • Consider turning the MySQL minimum word length down to 3 (the default is 4).
  • You can also configure the stop word list in MySQL.
  • If there is only one entry in the table, nothing will be returned, because MySQL will only match words that occur in fewer than 50% of the rows in a table.

MySQL LIKE searches are much more accurate than full-text searches, but also much slower due to a lack of indexing. Composr will only do a 'LIKE' search if it thinks MySQL's full-text-based boolean search won't be able to handle the query itself (e.g. due to using short words). Programmers can alter this logic by editing the is_under_radar function.
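As a rough illustration of the kind of decision described above, the following sketch falls back to a LIKE search when any word is shorter than the configured minimum word length. It is not the real is_under_radar logic; the name `needs_like_fallback`, its parameter, and the "any short word" rule are assumptions made for the example.

```python
def needs_like_fallback(query, min_word_length=4):
    """Return True if the full-text index would silently skip part of the query.

    Assumption: a word shorter than min_word_length is not indexed, so a
    query containing one cannot be answered reliably by full-text search.
    """
    # Strip boolean operators and quotes before measuring each word.
    words = [w.strip('+-"') for w in query.split()]
    return any(0 < len(w) < min_word_length for w in words)
```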

Many of these issues can be solved by enabling Composr's own search engine, the fast custom index – discussed later in this tutorial.

OpenSearch (advanced)

Composr can support OpenSearch, which allows your users to search your website from directly within their web browser. It also supports search suggestions, based on past searches performed.

By default OpenSearch is configured (via the HTML_HEAD.tpl template) to only be active within a zone named docs and for it to only search Comcode pages. You can, however, configure it to perform any search types you like via changing the code used in this template. You should make sure you have a 'favicon' before enabling OpenSearch, as it is important the web browser has one of these available to use.

Slow searches on large sites

If you have very large database tables due to very large amounts of content, or having large amounts of content on a multi-language site, you may experience slow-down doing some searches.

In fact, the slow-down will cause read locks which prevent writes to those tables. Composr is designed to generally function without database write access, but it's not a good situation to have.

This is a problem that MySQL has with full-text search. It is not specific to Composr in any way but is worth us documenting.

The problem happens particularly when Composr has to combine the full-text search with other search constraints.

Workaround 1: Auto-kill slow searches via MySQL setting

MySQL 5.7 introduced a query timeout setting. From MySQL 5.7.8 onwards it is named max_execution_time (early 5.7 releases called it max_statement_time). Set it in MySQL like:

Code (MySQL)

SET GLOBAL max_execution_time=10000;
 
(this is for 10 seconds, i.e. 10,000 milliseconds)

We actually automatically set this on a session level when you do a search, so there's no need to do anything if you're running MySQL 5.7+.

Workaround 2: Use InnoDB

Another workaround is to switch to InnoDB tables in MySQL. It won't stop slow queries, it'll just stop them locking the whole table and slowing other users down; your server will still suffer the load, but so long as your server is not overloaded that is likely not an issue.

Actual solution: Use the fast custom index

Read on :) .

The fast custom index

When searching large amounts of content it is important for the content to be pre-indexed for search, as searching through all the bytes is too slow. As discussed in this tutorial, Composr uses the "full-text search" capabilities present in most database software, i.e. the database software's own search engine, and this handles all the indexing for you behind the scenes. For the sake of simplicity in this section of the tutorial we will assume all users are using MySQL.

Composr also features its own search engine, which can run as a separate option triggered to run instead of MySQL full-text search in a number of configurable situations (by default it does not run). The search engine is implemented for forum posts (public and private), catalogue entries, and Comcode pages. For simplicity we'll just talk about public forum post search.

The problem with MySQL full-text search is 2-fold:
  1. The search index is totally separated out from other indexing. If you want to do a search, and then filter it down to say a particular forum, or a particular poster, then it has to cleave a big chunk out of the search index and then cross-reference that with other index(es). It can be very inefficient.
  2. If there are terms that are common on a website but not so common or irrelevant as to be filtered out by default, e.g. 'car' on a car website, then when someone searches for those words an enormous amount is going to be cleaved out of the search index.
And the worst is when these things combine. Let's say 40% of your forum posts contain the word 'car', and the user is searching for 'car maintenance' but the user is filtering to a forum with only 5% of the posts in it. Basically MySQL would cleave out 40% of its search index, calculate the ranking from all those rows and sort by that, and then cursor through almost all those rows until it gets just the top 30 that cross-reference with the forum index.
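The arithmetic behind that example can be worked through with concrete numbers. The sketch below assumes a hypothetical forum of 1,000,000 posts and assumes the word is spread evenly across forums; neither figure comes from Composr, and the function name `fulltext_workload` is made up for illustration.

```python
def fulltext_workload(total_posts, term_percent, forum_percent, page_size=30):
    """Estimate how many rows each stage of a filtered full-text search touches."""
    # Rows pulled out of the full-text index, ranked and sorted.
    ranked = total_posts * term_percent // 100
    # Rows that also satisfy the forum filter (even-spread assumption).
    in_forum = ranked * forum_percent // 100
    return {
        "ranked_and_sorted": ranked,
        "matching_filter": in_forum,
        "returned": min(page_size, in_forum),
    }


workload = fulltext_workload(1_000_000, term_percent=40, forum_percent=5)
```

Under these assumptions 400,000 rows are ranked and sorted so that 30 can be shown, and only 20,000 of the ranked rows were ever eligible.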

There's no real way around this with MySQL full-text search indexing.

The fast custom index takes a totally different approach that is much more efficient in cases where MySQL is inefficient.

The fast custom index supports the same boolean syntax that is supported by MySQL full-text search, as well as being able to do natural searches.

Technical explanation (advanced)

The forum posts database table gets a matching search indexing table, which indexes all the common search filters (poster ID, forum ID, etc) directly against individual keywords extracted from the posts.

So basically a row in that search indexing table might be like (keyword=car, forum_id=4, poster_id=300). The search indexing table is then also database-indexed against all the fields so that the database can very efficiently query out stuff from it.

Of course searches may have multiple keywords, so it revisits the table for each keyword, basically, and it has a ranking algorithm. There's a lot more to it than that, but I'm keeping it simple here. It basically ranks by how prevalent the most obscure word in the search query is in the forum post.
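The idea can be sketched in a few lines. This is a toy in-memory model under simplifying assumptions (whitespace tokenising, no stemming or stop words, every keyword ANDed), not Composr's schema or ranking code; the class `KeywordIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy stand-in for a per-keyword search indexing table."""

    def __init__(self):
        # keyword -> list of (forum_id, poster_id, post_id, occurrences) rows
        self.rows_by_keyword = defaultdict(list)
        # keyword -> total occurrences across all posts (its commonality)
        self.commonality = defaultdict(int)

    def add_post(self, post_id, forum_id, poster_id, text):
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for keyword, occurrences in counts.items():
            self.rows_by_keyword[keyword].append(
                (forum_id, poster_id, post_id, occurrences))
            self.commonality[keyword] += occurrences

    def lookup(self, keyword, forum_id=None, poster_id=None):
        """One visit to the table: keyword plus filters, all on the same rows."""
        hits = {}
        for row_forum, row_poster, post_id, occurrences in \
                self.rows_by_keyword.get(keyword, []):
            if forum_id is not None and row_forum != forum_id:
                continue
            if poster_id is not None and row_poster != poster_id:
                continue
            hits[post_id] = occurrences
        return hits

    def search(self, keywords, forum_id=None, poster_id=None):
        keywords = [k.lower() for k in keywords]
        if not keywords:
            return []
        per_keyword = {k: self.lookup(k, forum_id, poster_id) for k in keywords}
        # Every keyword must be present in a matching post.
        matching = set.intersection(*(set(h) for h in per_keyword.values()))
        # Rank by how often the most obscure keyword occurs in each post.
        rarest = min(keywords, key=lambda k: self.commonality.get(k, 0))
        rare_hits = per_keyword[rarest]
        return sorted(matching, key=lambda pid: (-rare_hits[pid], pid))
```

The point of the design is that the filter columns sit on the same row as the keyword, so a filtered lookup never has to cross-reference a separate index.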

In terms of performance, the end result is the fast custom index search system is a bit slower for most "straight" searches (searches with no additional filtering), but immensely faster for searches with additional filtering.

Pros and Cons

There are a few minor downsides to the fast custom index:
  • You cannot do 'blank' searches.
  • There will be a short lag before new content is indexed.
  • Performance considerations:
    • If you want to allow multi-word quoted phrases it will use a lot more disk space because it has to separately store each combination of adjacent keywords, up to the limit you configure. That's due to how the fast custom index is designed: it is not building a data structure for the keywords in a document, it's separately indexing each keyword against all possible search filters.
    • Fuzzy searching for large databases (basically analogous to the 'natural' MySQL full-text search) is very slow. Without fuzzy searching every keyword in the search will either be ANDed, ignored (stop words like 'is'), or excluded (if preceded with '-'). That is, individual words are not treated as suggestions for match ranking; they all have to be taken into account.
    • Ranking accuracy isn't going to be as good, as the fast custom index ranks based on just the most obscure keyword, not a blend of all keywords. This is necessary to avoid having to do cross-computation between each keyword, so that it can rank rows using direct indexing instead. You can configure all-keyword ranking, but it is not recommended.

And some upsides:
  • Radically better performance for filtered queries, as discussed. While MySQL full-text performance degrades as filters are added, the fast custom index performance is radically improved with additional filtering.
  • Configurable stop word list without requiring server admin access.
  • Superior stemming: MySQL will depluralise words but not much more, but Composr has a high-quality stemmer that will make words such as 'liking' and 'like' equivalent.
  • Can return results for tables with only 1 row in (unlike MySQL).
  • No minimum or maximum word sizes. So you can search for numbers, for example.
  • Great multilingual support.
  • Some database backends may not even provide full-text search of their own, so the fast custom index would fill the gap.

Because there are both Pros and Cons, you can configure when the fast custom index kicks in, and otherwise have MySQL full-text driving the majority of your search queries.

Stop words

Stop words are words that will be ignored by the search engine because they convey no meaning and just add noise to the search.
A default list is provided for English, and can be edited by copying the text/EN/too_common_words.txt file to text_custom/EN/too_common_words.txt and customising it.
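As a minimal sketch of how a stop word list is applied to a query, the example below assumes the list is plain text with one word per line. That file format is an assumption for illustration, and both function names are hypothetical.

```python
def load_stop_words(file_contents):
    """Parse a stop word list, assuming one word per line (blank lines ignored)."""
    return {line.strip().lower()
            for line in file_contents.splitlines()
            if line.strip()}


def remove_stop_words(query, stop_words):
    """Drop stop words from a query, keeping the remaining words in order."""
    return [w for w in query.lower().split() if w not in stop_words]
```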

Quoted phrases

As discussed, enabling multi-word quoted phrases will use a lot of disk space. Enabling it involves configuring Composr to index ngrams of length longer than 1. You get to decide how many words to allow.

If you do enable multi-word quoted phrases then it works with stop words, which MySQL cannot do. So for example "This is a test" would work (assuming you are indexing at least 4 ngrams), even though there are stop words involved there.
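To see why longer phrases cost disk space, consider how many index entries one short sentence produces as the limit grows. This sketch simply enumerates every run of adjacent tokens up to a maximum length; the function name `adjacent_ngrams` is hypothetical and the real indexer may store things differently.

```python
def adjacent_ngrams(tokens, max_length):
    """List every run of adjacent tokens from length 1 up to max_length."""
    entries = []
    for length in range(1, max_length + 1):
        for start in range(len(tokens) - length + 1):
            entries.append(" ".join(tokens[start:start + length]))
    return entries


tokens = ["this", "is", "a", "test"]
single_words = adjacent_ngrams(tokens, 1)   # 4 entries
up_to_four = adjacent_ngrams(tokens, 4)     # 4 + 3 + 2 + 1 = 10 entries
```

With a limit of 4 the full phrase "this is a test" is itself an index entry, which is why a quoted search can match it even though most of its words are stop words.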

Stemming doesn't operate when quoting phrases of more than one ngram. For example "This is a greatest day" would not be matched by a search query of "This is a great day".

Index generation

There's a background task within the system scheduler that populates the indexing tables.
The first time it runs it:
  1. Indexes all existing data.
  2. Builds up a database table of ngram frequency (keyword frequency, basically) across all the supported searchable data (the <table-prefix>ft_index_commonality table). This frequency data is used for ranking purposes.
Subsequently it just adds indexing for new content created/changed since the last indexing run.
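The incremental part of that process can be sketched as follows. This is a simplified model that assumes each content item carries an integer 'edited' timestamp and that the task remembers when it last ran; the field names and the function `incremental_index` are hypothetical, and the commonality table is left out.

```python
def incremental_index(posts, index, last_run):
    """Index only posts created/changed since last_run; return the new watermark.

    posts: list of dicts with 'id', 'text' and 'edited' (integer timestamp).
    index: dict mapping post id -> sorted list of distinct keywords.
    """
    newest = last_run
    for post in posts:
        if post["edited"] > last_run:
            index[post["id"]] = sorted(set(post["text"].lower().split()))
            newest = max(newest, post["edited"])
    return newest
```

Passing a `last_run` of 0 behaves like the first run and indexes everything; later runs only touch what changed.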

If you want to reindex (perhaps you have changed some settings, or changed stop words, or want to regenerate commonality data to reflect the current status of your site), you need to run the Rebuild the fast custom index cleanup tool (Admin Zone > Tools > Cleanup tools).

Internationalisation

Content is indexed against each language, according to translations of that language. So for example if you search for 'gift' in German you'd get results for the German meaning of that character string (very different to the English meaning!).

When translating content, you need to either edit the content in the language you are translating to or, if you use the translation queue, force reindexing for those changes to show up.

Programmers can add grammar rules for different languages a lot more easily than they can for MySQL. Look at sources/lang_stemmer_EN.php and sources/lang_tokeniser_EN.php and just make equivalent files for your language.

Not all languages use words. For this reason the fast custom index doesn't actually implement words under the hood; it uses "ngrams". For most languages an ngram is the same as a word and they are separated by spaces, but it is actually up to the tokeniser what to make an ngram and how to separate them. Chinese, for example, might have each Chinese character as its own ngram (there are no spaces). For Vietnamese you might have each syllable as an ngram (they look like words in English but alone have no real meaning). In each case you would need to configure Composr to index multiple combinations of ngrams, rather than the default combination of 1, because nobody is going to search for a collection of disconnected ngrams in these example languages like they would in English.
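The difference between the two tokenising strategies can be sketched as below. These functions are illustrative only and do not mirror the interface of sources/lang_tokeniser_EN.php; all names here are hypothetical.

```python
def tokenise_by_spaces(text):
    """English-style tokenising: one ngram per space-separated word."""
    return text.lower().split()


def tokenise_by_character(text):
    """Chinese-style tokenising: one ngram per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def index_terms(ngrams, max_combination, joiner):
    """Combine adjacent ngrams up to max_combination, joined as the language needs."""
    terms = []
    for length in range(1, max_combination + 1):
        for start in range(len(ngrams) - length + 1):
            terms.append(joiner.join(ngrams[start:start + length]))
    return terms


english = index_terms(tokenise_by_spaces("Car maintenance"), 1, " ")
chinese = index_terms(tokenise_by_character("汽车保养"), 2, "")
```

With a combination limit of 1 the Chinese text would only be searchable one character at a time, whereas a limit of 2 also indexes pairs such as 汽车 and 保养.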

Special considerations

The PHP mbstring extension is significantly faster than the iconv extension. If you use iconv then you may find search indexing is very slow.

