Replication

Usage: https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/usage.md | Settings: https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/config.md | Technical notes: https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/technical.md | FAQ: https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/faq.md

Updates to an Elasticsearch index happen instantaneously after a new revision has been saved in MediaWiki: the storage layer receives an event emitted by a Semantic MediaWiki/MediaWiki hook and pushes the change to the cluster, so that queries can use the latest available data set as soon as possible.

The index creation documentation describes how the index process occurs in Elasticsearch. Semantic MediaWiki provides two separate indices: a data index that hosts all user-facing, queryable content and a lookup index that supports concept, property path, and inverse match computations.

Each MediaWiki instance (hereby Semantic MediaWiki) replicates to its own index, identified by its wikiID (the internal name of the database that identifies the wiki site). Different MediaWiki installations can therefore be hosted on the same Elasticsearch cluster without interfering with each other during updates or searches, as long as their wikiIDs differ.

{
    // `smw-data` is the fixed part, `mw-foo` describes the wikiID, and
    // `v1` identifies the active version (required for the roll-over)
    "_index": "smw-data-mw-foo-v1",
    "_type": "data"
}
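
To see which Semantic MediaWiki indices exist on a cluster, the standard Elasticsearch cat API can be queried; a minimal sketch, assuming Elasticsearch listens on localhost:9200:

# List all Semantic MediaWiki data indices; each wiki (wikiID) owns a
# separate index, so several wikis can share one cluster
curl -XGET 'http://localhost:9200/_cat/indices/smw-data-*?v'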

Active replication (update-based)

In the normal operative mode, the ElasticStore uses active replication to transfer data to the Elasticsearch cluster, which means that changes (i.e. those caused by an update, a delete, etc.) to wiki pages are actively replicated and, depending on the refresh interval, become visible almost instantaneously.
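
How quickly replicated changes become visible is governed by the index refresh interval, which can be inspected directly; a sketch, where the host and the smw-data-mw-foo-v1 index name are assumptions:

# Show the refresh_interval currently active for the wiki's data index
curl -XGET 'http://localhost:9200/smw-data-mw-foo-v1/_settings/index.refresh_interval?pretty'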

Safe replication

The ElasticStore by default is set to use a safe replication mode, which entails that if no connection to the Elasticsearch cluster can be established while a page is stored, a smw.elasticIndexerRecovery job is scheduled for the changes that were not replicated. These jobs should be executed on a regular basis to ensure that the data are kept in sync with the backend.

The job.recovery.retries setting defines the maximum number of retry attempts in case the job itself cannot establish a connection; once that maximum is reached, the job is canceled even though the data could not be recovered.
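
One way to execute these jobs on a regular basis is MediaWiki's runJobs.php maintenance script, for example from a cron entry; a sketch, assuming a standard MediaWiki installation layout:

# Process only the queued ElasticStore recovery jobs so that changes
# that could not be replicated earlier reach the cluster
php maintenance/runJobs.php --type smw.elasticIndexerRecovery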

Script-based replication

The rebuildElasticIndex.php script is provided as a method to transfer existing data from the SQLStore (fetching information directly from its tables without re-parsing any wiki pages) to the Elasticsearch backend.

The script operates in a rollover mode: the existing index remains operative until a new index with a different version suffix (v1/v2) has been created and released. The currently active index is kept untouched so that queries can continue to operate while the reindex process is ongoing; once it is complete, the new index takes the place of the old one, which is then removed from the Elasticsearch cluster.
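
A minimal invocation might look as follows, assuming Semantic MediaWiki resides in the standard extensions directory:

# Rebuild the Elasticsearch index from the SQLStore tables; the currently
# active index keeps serving queries until the new version is released
php extensions/SemanticMediaWiki/maintenance/rebuildElasticIndex.php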

For the script-based replication to be as fast as possible, some operational settings are changed temporarily:

Refresh interval

The refresh_interval dictates how often Elasticsearch creates new segments; in the normal operative mode it is set to 1s by default so that updates appear in near real time, i.e. almost instantaneously.

During the rebuild process the setting is changed to -1, as recommended by the official Elasticsearch documentation, so that the reindexing can proceed more quickly. If for some reason (e.g. an aborted rebuild, a raised exception, etc.) the refresh_interval remains at -1 (because the process was aborted without the possibility for Semantic MediaWiki to intervene at the OS level), changes to an index will not become visible until the refresh_interval has been reset; to fix the setting, it is recommended to run one of the commands shown below.
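
A sketch of both approaches; the --update-settings option name follows the maintenance script's documented options and should be verified against its help output, and the host and index names are placeholders:

# Re-apply the operational settings via the maintenance script
php extensions/SemanticMediaWiki/maintenance/rebuildElasticIndex.php --update-settings

# Alternatively, reset the refresh_interval directly in Elasticsearch
curl -XPUT 'http://localhost:9200/smw-data-mw-foo-v1/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "1s" } }'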

Replication monitoring

Replication monitoring has been added as a feature to inform users about the state of the document replication with the Elasticsearch cluster, given that the ElasticStore relies on active replication.

Structured and unstructured data

There are two different types of data (or content) that are replicated to an Elasticsearch cluster. The more obvious and reliable of the two is structured data, i.e. content that carries explicit property annotations and value assignments.

"Unstructured" as category classifies loose text without any metadata or specific annotations and includes:

Two experimental settings are provided to handle unstructured content (i.e. text that does not provide any explicit annotations or structured elements) by indexing it into separate index fields in Elasticsearch; they are defined by the indexer.raw.text and indexer.experimental.file.ingest settings described below.

It should be noted that if either of the settings is enabled, the index will grow in size for the unstructured fields, especially if users want to index large document files; the expected index size should therefore be estimated carefully.
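
To keep an eye on the actual growth, the cat API can report document counts and on-disk sizes per index; a sketch, with the host as an assumption:

# Report document count and store size for all Semantic MediaWiki indices
curl -XGET 'http://localhost:9200/_cat/indices/smw-data-*?v&h=index,docs.count,store.size'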

Support for searching "unstructured text" (i.e. searching without a property assignment) is made possible by the wide proximity expression (~~) or the prefixes in:, phrase:, and not:, which indicate to a query request such as [[in:some text]] or [[phrase:the brown fox]] that the special "unstructured" index fields should be included when matching the request.

Aside from searching "unstructured text" alone, combining structured and unstructured elements in one query, such as [[Has population::1000]] [[in:some text]], improves the quality of search matches when users want to balance the cost of maintaining structured content against the need for unstructured content to broaden the scope and depth of a search.

Raw text

The indexer.raw.text setting (default: false) is provided to replicate the entire raw text of an article revision as unprocessed text to the text_raw field.
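
To verify that the raw text actually reaches the text_raw field, a match query can be issued against the index; a sketch, where the host and index name are assumptions and only the text_raw field name comes from the description above:

# Search the unprocessed article text replicated to the text_raw field
curl -XGET 'http://localhost:9200/smw-data-mw-foo-v1/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "text_raw": "brown fox" } } }'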

File content

The indexer.experimental.file.ingest setting (default: false) is provided to support the ingestion of file content; it requires the Elasticsearch ingest-attachment plugin.
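
The plugin is installed with the standard Elasticsearch plugin tool and has to be present on every node of the cluster (each node needs a restart afterwards):

# Install the ingest-attachment plugin on an Elasticsearch node
bin/elasticsearch-plugin install ingest-attachment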

The ingest process provides a method to retrieve content from files (using the Apache Tika component bundled with Elasticsearch) and make it available via Elasticsearch to Semantic MediaWiki without requiring the actual file content to be stored within the wiki itself.

Due to the size and memory consumption requirements of Elasticsearch and Tika, file content ingestion happens exclusively in the background using the smw.elasticFileIngest job; only after the job has been executed successfully will the file content and additional annotations be accessible and available as indexed (searchable) content.
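
Since ingestion only happens through the job queue, the pending jobs have to be executed from the command line; a sketch, assuming a standard MediaWiki installation layout:

# Execute the queued file ingest jobs; only afterwards does the file
# content become searchable in Semantic MediaWiki
php maintenance/runJobs.php --type smw.elasticFileIngest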

As the documentation points out, "Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node." (see also the ingest node documentation).

The replication monitoring will indicate on a file page whether the ingest process has been completed, by checking whether the File attachment property exists for the particular file entity.

Ingest and index process

  1. File upload (wiki upload) and creation of a File page
  2. A FileIngestJob is pushed, i.e. a smw.elasticFileIngest job is registered with the job queue and waits for command-line execution
  3. Execution of smw.elasticFileIngest runs the FileIndexer, which adds and runs the attachment pipeline
  4. The response is retrieved and the AttachmentAnnotator is run (adding the File attachment annotation)

The rebuildElasticIndex.php maintenance script comes with two options related to the file ingestion process; see the script's help output (--help) for details.

File attachment

Once the ingestion and extraction of content has been successful, a File attachment annotation will appear on the specific File entity in Semantic MediaWiki; depending on the extraction quality of Tika (see its documentation for details on what is retrievable), the annotations will include properties such as Content type, Content title, and Content author (as used in the examples below).

The File attachment property is a container object, which means that accessing the aforementioned properties requires the use of a property chain.

// Find all subjects (aka. files) that were ingested and indexed with an image/png
// content type

{{#ask: [[File attachment.Content type::image/png]]
 |?File attachment.Content title
}}

// Find all subjects (aka. files) that were ingested and indexed with an application/pdf
// content type and where the indexed content contains the "brown fox" text

{{#ask: [[File attachment.Content type::application/pdf]] [[in:brown fox]]
 |?File attachment.Content title
 |?File attachment.Content author
}}

Index quality

If Elasticsearch (hereby Tika) doesn't provide information for some of the properties, then those will not appear as part of the File attachment annotation.

The quality of the text indexed and the information provided to the File attachment properties depends solely on Elasticsearch and Tika (a specific Tika version is bundled with a specific Elasticsearch release).

Any issues with the quality of the indexed content or with the recognition of specific information about a file (e.g. type, date, author, etc.) have to be addressed in Elasticsearch and are not part of the scope of Semantic MediaWiki.

