Semantic MediaWiki and related extensions
[Usage](https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/usage.md) | [Settings](https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/config.md) | [Technical notes](https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/technical.md) | [FAQ](https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/faq.md)
Updates to an Elasticsearch index happen almost instantaneously after a new revision has been saved in MediaWiki: the storage layer receives an event emitted by the Semantic MediaWiki/MediaWiki hooks, which guarantees that queries can use the latest available data set as soon as possible.
The index creation documentation describes how the index process occurs in Elasticsearch. Semantic MediaWiki provides two separate indices:

- `data` index hosts all user-facing queryable data (structured and unstructured content)
- `lookup` index stores term and lookup queries used for concept, property path, and inverse match computations

Each MediaWiki instance (hereby Semantic MediaWiki) with its own `wikiID` (the internal name of the database, identifying the wiki site) replicates to a separate index, which is why different MediaWiki installations (assuming the `wikiID` is different) can be hosted on the same Elasticsearch cluster without interfering with each other during updates or searches.
```json
{
  // `smw-data` is the fixed part, `mw-foo` describes the wikiID, and
  // `v1` identifies the active version (required for the roll-over)
  "_index": "smw-data-mw-foo-v1",
  "_type": "data"
}
```
In the normal operative mode, the `ElasticStore` uses active replication to transfer data to the Elasticsearch cluster, which means that changes from wiki pages (i.e. those caused by an update, a delete, etc.) are actively replicated and mostly instantaneously (depending on the refresh interval) visible.

By default, the `ElasticStore` is set to use a safe replication mode, which entails that if no connection to an Elasticsearch cluster could be established while a page was stored, a `smw.elasticIndexerRecovery` job is scheduled for the changes that were not replicated. These jobs should be executed on a regular basis to ensure that the data are kept in sync with the backend.
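Assuming a standard MediaWiki layout, the queued recovery jobs can be processed with the core `runJobs.php` maintenance script, for example from a cron entry; the path below is an example and depends on the installation:

```shell
# Run only the queued Semantic MediaWiki recovery jobs
# (the path to maintenance/ depends on the installation)
php maintenance/runJobs.php --type smw.elasticIndexerRecovery
```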
The `job.recovery.retries` setting defines the maximum number of retry attempts in case the job itself cannot establish a connection; once that maximum is reached, the job is canceled even though it could not recover the data.
The `rebuildElasticIndex.php` script is provided as a method to transfer existing data from the `SQLStore` (fetching information directly from tables without re-parsing any wiki pages) to the Elasticsearch backend.
The script operates in a rollover mode, which lets an existing index remain operative until a new index with a different version (v1/v2) has been created and released. The currently active index is kept untouched so that queries can continue to operate while the reindex process is ongoing; once it has completed, the new index switches places with the old index, and the old index is removed from the Elasticsearch cluster.
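The naming scheme behind the rollover can be sketched with plain shell parameter expansion (illustrative only; `smw-data-mw-foo-v1` is the example index name from above, and the actual switch is performed by the script itself):

```shell
# Derive the rollover target from the currently active index name:
# the version suffix simply alternates between v1 and v2.
active="smw-data-mw-foo-v1"
base="${active%-*}"       # smw-data-mw-foo
version="${active##*-}"   # v1
if [ "$version" = "v1" ]; then target="${base}-v2"; else target="${base}-v1"; fi
echo "$target"            # smw-data-mw-foo-v2
# The rebuild populates $target while $active keeps serving queries;
# only after completion do the two switch places and the old index is deleted.
```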
For the script-based replication to be as fast as possible, some changes to operational settings are made:

- `refresh_interval` dictates how often Elasticsearch creates new segments; in the normal operative mode it is set to `1s` by default to make updates appear near real time or instantaneously.
- During the rebuild process, the setting is changed to `-1`, as recommended by the official documentation, to speed up the data transfer.

Now, if for some reason (e.g. an aborted rebuild, a raised exception, etc.) the `refresh_interval` remains at `-1` (since the process was aborted without Semantic MediaWiki being able to intervene), changes to an index will not become visible until the `refresh_interval` has been reset. To fix the setting, it is recommended to run:
```
php rebuildElasticIndex.php --update-settings
php rebuildElasticIndex.php --force-refresh
```
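To verify that the setting was actually reset, the current value can also be inspected directly against the cluster; the host and index name below are examples from this page and will differ per installation:

```shell
# Show the refresh_interval of the example index (adjust host and index name)
curl -s "http://localhost:9200/smw-data-mw-foo-v1/_settings?filter_path=*.settings.index.refresh_interval"
```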
Replication monitoring has been added as a feature to allow users to be informed about the state of a document replication with the Elasticsearch cluster, given that the `ElasticStore` relies on active replication.
There are two different types of data (or content) that are replicated to an Elasticsearch cluster. The more obvious and reliable of the two is structured data, retrieved from:

"Unstructured", as a category, classifies loose text without any metadata or specific annotations and includes:
Two experimental settings are provided to handle unstructured content (i.e. text that does not provide any explicit annotations or structured elements) using a separate index field in Elasticsearch; they are defined by:

- `indexer.raw.text`
- `indexer.experimental.file.ingest`
It should be noted that if either of these settings is enabled, the index will grow in size for the unstructured fields, especially if users want to index large document files; the expected index size should therefore be estimated carefully.
Support for searching "unstructured text" (i.e. searching without a property assignment) is made possible by the wide proximity expression (`~~`) or the prefixes (`in:`, `phrase:`, or `not:`), which indicate to a query request such as `[[in:some text]]` or `[[phrase:the brown fox]]` that the special "unstructured" index fields should be included to match those requests.
Aside from searching for "unstructured text", combining structured and unstructured elements in a query such as `[[Has population::1000]] [[in:some text]]` improves the quality of search matches for users who want to balance the cost of maintaining structured content against the need for unstructured content to broaden the scope and depth of a search.
The `indexer.raw.text` setting (default: `false`) is provided to replicate the entire raw text of an article revision as unprocessed text to the `text_raw` field.
The `indexer.experimental.file.ingest` setting (default: `false`) is provided to support the ingestion of file content. It requires the Elasticsearch ingest-attachment plugin.
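The plugin has to be installed on every Elasticsearch node (followed by a node restart); the command below assumes a default Elasticsearch directory layout:

```shell
# Install the ingest-attachment plugin on an Elasticsearch node
bin/elasticsearch-plugin install ingest-attachment
```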
The ingest process provides a method to retrieve content from files (using the Apache Tika component bundled with Elasticsearch) and make it available via Elasticsearch to Semantic MediaWiki without requiring the actual file content to be stored within the wiki itself.
Due to the size and memory consumption requirements of Elasticsearch and Tika, file content ingestion happens exclusively in the background using the `smw.elasticFileIngest` job, and only after the job has been executed successfully will the file content and additional annotations be accessible and available as indexed (searchable) content.
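As with the recovery jobs, the queued ingest jobs are executed from the command line; the path is an example and depends on the installation:

```shell
# Execute the queued file ingest jobs
# (the path to maintenance/ depends on the installation)
php maintenance/runJobs.php --type smw.elasticFileIngest
```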
As the documentation points out, "Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node." (see also the ingest node documentation).
The replication monitoring will indicate on a file page whether the ingest process was completed or not by checking if the `File attachment` property exists for the particular file entity.
- A `File` page triggers a `FileIngestJob`, which hereby registers a `smw.elasticFileIngest` job with the job queue, waiting on command line execution
- `smw.elasticFileIngest` runs `FileIndexer`, which adds and runs the `attachment` pipeline
- `AttachmentAnnotator` adds the `File attachment` annotation

The `rebuildElasticIndex.php` maintenance script comes with two options related to the file ingestion process:
- `skip-fileindex` to skip any file ingestion during the rebuild execution
- `run-fileindex` to only run and execute file ingestions during the rebuild process

Once the ingestion and extraction of content have been successful, a `File attachment` annotation will appear on the specific `File` entity in Semantic MediaWiki, and based on the extraction quality of Tika (see the documentation for details on what is retrievable), annotations will include:
- `Content type` (corresponds to `content_type`),
- `Content author` (corresponds to `author`),
- `Content length` (corresponds to `content_length`),
- `Content language` (corresponds to `language`),
- `Content title` (corresponds to `title`),
- `Content date` (corresponds to `date`), and
- `Content keyword` (corresponds to `keywords`)

The `File attachment` property is a container object, which means that accessing the aforementioned properties requires the use of a property chain.
```
// Find all subjects (aka. files) that were ingested and indexed with a
// `image/png` content type
{{#ask: [[File attachment.Content type::image/png]]
 |?File attachment.Content title
}}

// Find all subjects (aka. files) that were ingested and indexed with a
// `application/pdf` content type and where the indexed content contains
// the "brown fox" text
{{#ask: [[File attachment.Content type::application/pdf]] [[in:brown fox]]
 |?File attachment.Content title
 |?File attachment.Content author
}}
```
If Elasticsearch (hereby Tika) doesn't provide information for some of the properties, those will not appear as part of the `File attachment` annotation.
The quality of the indexed text and of the information provided to the `File attachment` properties depends solely on Elasticsearch and Tika (a specific Tika version is bundled with each specific Elasticsearch release). Any issues with the quality of indexed content or with the recognition of specific information about a file (e.g. type, date, author, etc.) have to be addressed in Elasticsearch and are not within the scope of Semantic MediaWiki.