Semantic MediaWiki and related extensions
Accessing an Elasticsearch cluster from within Semantic MediaWiki requires adjusting the following settings:
$smwgElasticsearchEndpoints is a required setting that contains a list of available endpoints used to establish a connection with an Elasticsearch cluster.
$GLOBALS['smwgElasticsearchEndpoints'] = [
    [ 'host' => '192.168.1.126', 'port' => 9200, 'scheme' => 'http' ], // extended
    'localhost:9200' // inline
];
Please consult the official documentation for details about how to use the inline or extended form.
$smwgElasticsearchConfig is a compound setting that collects various parameters into one Semantic MediaWiki setting to shape the interaction with Elasticsearch, including specific index and query details.
$GLOBALS['smwgElasticsearchConfig'] = [

    // Points to index and mapping definition files
    'index_def' => [ ... ],

    // Defines connection details for Elasticsearch endpoints
    'connection' => [ ... ],

    // Holds replication details
    'indexer' => [ ... ],

    // Used to modify Elasticsearch specific settings
    'settings' => [ ... ],

    // Section to optimize the query execution
    'query' => [ ... ]
];
A detailed list of settings and their explanations is available in DefaultSettings.php. After changing any setting, make sure to run php rebuildElasticIndex.php --update-settings.
When modifying a particular setting, use the appropriate key to change the value of that parameter; otherwise, the entire configuration may be replaced.
// Uses a specific key and therefore replaces only the specific parameter
$GLOBALS['smwgElasticsearchConfig']['query']['uri.field.case.insensitive'] = true;

// This !!overrides!! the entire configuration
$GLOBALS['smwgElasticsearchConfig'] = [
    'query' => [ 'uri.field.case.insensitive' => true ]
];
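The difference between the two assignments can be reproduced with plain PHP arrays. The following is a minimal sketch, not Semantic MediaWiki code: `$defaults` is a hypothetical stand-in for the shipped configuration, and its values are assumptions chosen only for illustration.

```php
<?php
// Hypothetical stand-in for the shipped defaults; the real values are
// defined in DefaultSettings.php and are not reproduced here.
$defaults = [
    'indexer' => [ 'raw.text' => false ],
    'query'   => [ 'uri.field.case.insensitive' => false ],
];

// Variant 1: address one parameter by its key path.
$keyed = $defaults;
$keyed['query']['uri.field.case.insensitive'] = true;

// Variant 2: assign a new array, which discards every other section.
$replaced = $defaults;
$replaced = [ 'query' => [ 'uri.field.case.insensitive' => true ] ];

var_dump( array_key_exists( 'indexer', $keyed ) );    // bool(true)
var_dump( array_key_exists( 'indexer', $replaced ) ); // bool(false)
```

In the second variant the `indexer` section is gone, which is the effect the warning above describes.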
The default shard and replica configuration is:
- The data index has two primary shards and two replicas.
- The lookup index has one primary shard and no replica (the Elasticsearch documentation notes that "... consider using an index with a single shard ... lookup terms filter will prefer to execute the get request on a local node if possible ...").

If the number of shards and replicas needs to change, it is preferable to do so through the $smwgElasticsearchConfig setting:
$GLOBALS['smwgElasticsearchConfig']['settings']['data'] = [
    'number_of_shards' => 3,
    'number_of_replicas' => 3
];
Elasticsearch requires the entire index to be rebuilt whenever number_of_shards changes, so changes to that setting should be made carefully and in advance.
Read-heavy wikis might want to add replica shards (which does not require re-indexing the data) when Elasticsearch performance declines; the Elasticsearch documentation notes that replica shards should be placed on additional hardware.
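As a sketch of such a replica-only change, the fragment below sets a single key so that the rest of the configuration, including number_of_shards, stays untouched. It assumes that number_of_replicas can be set on its own under the 'data' section; the value 3 is only an example.

```php
// LocalSettings.php fragment (assumption: 'number_of_replicas' may be set
// individually; the shard count is deliberately left at its current value).
$GLOBALS['smwgElasticsearchConfig']['settings']['data']['number_of_replicas'] = 3;
```

As with any other setting change, php rebuildElasticIndex.php --update-settings would need to be run afterwards.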
The index_def setting points to the index definition. By default, the data index is assigned smw-data-standard.json to define its field mappings, which influence how Elasticsearch analyzes and indexes documents, including fields identified as containing text and string elements. Those text fields rely on the standard analyzer and should work for most applications.
The index name is composed of a prefix such as smw-data (or smw-lookup), the wikiID, and a version indicator (part of the rollover support), so that a single Elasticsearch cluster can host indices from different Semantic MediaWiki instances without them interfering with each other.
{
    "_index": "smw-data-mw-foo-v1",
    "_type": "data",
    "_id": "1",
    "_version": 2,
    "_source": ...
}
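The naming scheme can be illustrated with a small helper. This is a hypothetical function, not part of Semantic MediaWiki, and it assumes that the three parts are joined with hyphens and that the example name above splits into the prefix `smw-data`, the wikiID `mw-foo`, and the version `v1`.

```php
<?php
// Hypothetical helper for illustration only: joins prefix, wikiID, and
// version indicator with hyphens, as the example index name suggests.
function composeIndexName( string $prefix, string $wikiId, string $version ): string {
    return implode( '-', [ $prefix, $wikiId, $version ] );
}

echo composeIndexName( 'smw-data', 'mw-foo', 'v1' ), "\n"; // smw-data-mw-foo-v1
```

Because the wikiID is part of the name, two wikis sharing one cluster end up with distinct index names even though both use the same prefix.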
For certain languages, the icu analyzer (or any other language-specific configuration) may provide better results. In that case, the index_def index definitions can be altered so that custom settings, such as different language analyzers, are used to improve matching precision on text elements.
In a non-Latin language environment, the analysis-icu plugin provides better support for Unicode normalization and case folding. Selecting smw-data-icu.json as the index_def setting may improve match accuracy during query answering, especially on unstructured text elements or wide proximity searches.
smw-data-icu.json is provided as an example of how to alter those settings. Note that query results on text fields may differ from those produced with the standard analyzer, so users are expected to evaluate whether those settings are more favorable to query answering.
Please note that any change to the index or its analyzer settings requires rebuilding the entire index.
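A possible way to select the ICU definition is sketched below. Both the 'data' sub-key and the file path are assumptions made for illustration; the actual structure of index_def and the location of the shipped JSON files should be checked against DefaultSettings.php.

```php
// LocalSettings.php fragment. Assumptions: 'index_def' holds one file path
// per index type, and '/path/to/' is a placeholder for the directory that
// actually contains smw-data-icu.json.
$GLOBALS['smwgElasticsearchConfig']['index_def']['data'] = '/path/to/smw-data-icu.json';
```

Since this changes the analyzer settings, the entire index would have to be rebuilt afterwards, and the analysis-icu plugin would need to be available on the Elasticsearch cluster.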
$smwgElasticsearchProfile is provided to simplify the maintenance of configuration parameters by linking to a JSON file that hosts, and thereby alters, individual settings.
{
    "indexer": {
        "raw.text": true
    },
    "query": {
        "uri.field.case.insensitive": true
    }
}
The profile is loaded at the end of the configuration stack and will override any default or individual settings made to $smwgElasticsearchConfig.
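The override order can be illustrated with a short sketch. It is not the loading code used by Semantic MediaWiki; it only assumes that a profile behaves like a key-by-key overlay, which is modelled here with PHP's array_replace_recursive. The starting values in `$config` are hypothetical, and the profile content is inlined rather than read from a file.

```php
<?php
// Hypothetical configuration as it might look after defaults and
// individual settings were applied; the values are illustrative only.
$config = [
    'indexer' => [ 'raw.text' => false ],
    'query'   => [ 'uri.field.case.insensitive' => false ],
];

// Profile content, inlined here so the sketch is self-contained.
$profileJson = '{ "indexer": { "raw.text": true }, "query": { "uri.field.case.insensitive": true } }';
$profile = json_decode( $profileJson, true );

// Later arrays win, so profile values replace earlier ones key by key.
$effective = array_replace_recursive( $config, $profile );

var_dump( $effective['indexer']['raw.text'] );                 // bool(true)
var_dump( $effective['query']['uri.field.case.insensitive'] ); // bool(true)
```

Under this model, a value set in the profile always takes precedence over the same key set earlier, while keys that the profile does not mention keep their previous values.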