Semantic MediaWiki and related extensions
config

https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/usage.md "Usage" | https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/config.md "Settings" | https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/technical.md "Technical notes" | https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/faq.md "FAQ"

Accessing an Elasticsearch cluster from within Semantic MediaWiki requires changing the following settings:

Connection to Elasticsearch

$smwgElasticsearchEndpoints is a required setting that lists the endpoints available for establishing a connection with an Elasticsearch cluster.

$GLOBALS['smwgElasticsearchEndpoints'] = [
    [ 'host' => '192.168.1.126', 'port' => 9200, 'scheme' => 'http' ], // extended
    'localhost:9200' // inline
];

Please consult the official documentation for details about how to use the inline or extended form.

Configuration

$smwgElasticsearchConfig is a compound setting that collects various parameters into a single Semantic MediaWiki setting to shape the interaction with Elasticsearch, including specific index and query details.

$GLOBALS['smwgElasticsearchConfig'] = [
    // Points to index and mapping definition files
    'index_def'  => [ ... ],
    // Defines connection details for Elasticsearch endpoints
    'connection' => [ ... ],
    // Holds replication details
    'indexer'    => [ ... ],
    // Used to modify Elasticsearch specific settings
    'settings'   => [ ... ],
    // Section to optimize the query execution
    'query'      => [ ... ]
];

Changing a setting

A detailed list of settings and their explanations is available in DefaultSettings.php. After changing any setting, make sure to run php rebuildElasticIndex.php --update-settings.

When modifying a particular setting, address it by its specific key; otherwise the entire configuration may be replaced.

// Uses a specific key and therefore replaces only the specific parameter
$GLOBALS['smwgElasticsearchConfig']['query']['uri.field.case.insensitive'] = true;
// This !!overrides!! the entire configuration
$GLOBALS['smwgElasticsearchConfig'] = [
    'query' => [
        'uri.field.case.insensitive' => true
    ]
];

Shards and replicas

The default numbers of shards and replicas are defined in DefaultSettings.php.

If the numbers of shards and replicas need to be changed, it is preferable to use the $smwgElasticsearchConfig setting:

$GLOBALS['smwgElasticsearchConfig']['settings']['data'] = [
    'number_of_shards' => 3,
    'number_of_replicas' => 3
];

Elasticsearch requires the entire index to be rebuilt after any change to number_of_shards, so changes to that setting should be made carefully and in advance.

Read-heavy wikis whose Elasticsearch performance is declining might want to add replica shards, which does not require re-indexing the data (the Elasticsearch documentation notes that replica shards should be placed on separate hardware).
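As a minimal sketch of the replica-only case, assuming the remaining defaults are acceptable and treating the value 3 as purely illustrative, only the number_of_replicas key is addressed so that the shard count (and with it the existing index) stays untouched:

```php
// Illustrative value only; choose a replica count that matches the available nodes.
// Addressing the specific key leaves number_of_shards and all other settings as they are.
$GLOBALS['smwgElasticsearchConfig']['settings']['data']['number_of_replicas'] = 3;
```

As with any other setting change, php rebuildElasticIndex.php --update-settings would then need to be run.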

Field mappings

The index_def setting points to the index definitions. By default, the data index is assigned smw-data-standard.json, which defines its field mappings. These mappings influence how Elasticsearch analyzes and indexes documents, including fields identified as containing text and string elements. Those text fields rely on the standard analyzer, which should work for most applications.

The index name will be composed of a prefix such as smw-data (or smw-lookup), the wikiID, and a version indicator (part of the rollover support) so that a single ES cluster can host different indices from different Semantic MediaWiki instances without interfering with each other.

{
    "_index": "smw-data-mw-foo-v1",
    "_type": "data",
    "_id": "1",
    "_version": 2,
    "_source": ...
}
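The naming scheme described above can be illustrated with a short sketch. The helper composeIndexName is hypothetical and not part of Semantic MediaWiki; it only assumes, based on the example document, that the three parts are joined with hyphens:

```php
<?php
// Hypothetical helper for illustration only; Semantic MediaWiki builds
// the index name internally and does not expose this function.
function composeIndexName( string $prefix, string $wikiId, string $version ): string {
    // Join prefix, wiki ID, and version indicator with hyphens
    return implode( '-', [ $prefix, $wikiId, $version ] );
}

// Two wikis sharing one cluster end up with distinct index names
echo composeIndexName( 'smw-data', 'mw-foo', 'v1' ), PHP_EOL; // smw-data-mw-foo-v1
echo composeIndexName( 'smw-data', 'mw-bar', 'v1' ), PHP_EOL; // smw-data-mw-bar-v1
```

Because the wiki ID is part of the name, the indices of different wikis remain separate even when they use the same prefix and version.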

Text, languages, and analyzers

For certain languages the icu analyzer (or another language-specific configuration) may provide better results. The index definitions referenced by index_def can therefore be altered to use custom settings, such as a different language analyzer, to improve matching precision on text elements.

In a non-Latin language environment, the analysis-icu plugin provides better support for Unicode normalization and case folding. Selecting smw-data-icu.json as the index_def setting may improve match accuracy during query answering, especially on unstructured text elements or wide proximity searches.

smw-data-icu.json is provided as an example of how to alter those settings. Query results on text fields may differ from those obtained with the standard analyzer, so users are expected to evaluate whether these settings benefit query answering.

Please note that any change to the index or its analyzer settings requires rebuilding the entire index.
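Selecting the ICU definition might look like the following sketch. It assumes that the analysis-icu plugin is installed on the cluster, that the data index is addressed with a 'data' key inside index_def, and that the file lives at the placeholder path shown; the actual key and location should be verified against DefaultSettings.php:

```php
// Assumption: 'data' is the key of the data index within index_def.
// Placeholder path; point it at the actual location of smw-data-icu.json.
$GLOBALS['smwgElasticsearchConfig']['index_def']['data'] = '/path/to/smw-data-icu.json';
```

Since this changes the analyzer settings, the entire index has to be rebuilt afterwards.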

Using a profile

$smwgElasticsearchProfile simplifies the maintenance of configuration parameters by linking to a JSON file that contains, and thereby alters, individual settings.

{
    "indexer": {
        "raw.text": true
    },
    "query": {
        "uri.field.case.insensitive": true
    }
}

The profile is loaded at the end of the configuration stack and will override any default or individual settings made to $smwgElasticsearchConfig.
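Linking a profile could be sketched as follows, assuming the JSON shown above has been saved under the hypothetical file name elastic-profile.json next to the file that contains this assignment:

```php
// Hypothetical file name and location; any readable JSON file with the
// structure shown above should serve the same purpose.
$GLOBALS['smwgElasticsearchProfile'] = __DIR__ . '/elastic-profile.json';
```

Because the profile is applied last, a value such as query > uri.field.case.insensitive in the profile would take precedence over the same key set directly in $smwgElasticsearchConfig.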

