Semantic MediaWiki and related extensions
Technical notes

Classes and objects related to the Elasticsearch interface and implementation are placed under the SMW\Elastic namespace.

SMW\Elastic
│
├─ Admin         # Classes used to extend `Special:SemanticMediaWiki`
├─ Exception
├─ Connection    # Responsible for building a connection to ES
├─ Indexer       # Contains all necessary classes for updating the ES index
├─ Lookup        # Provides additional lookup services
├─ QueryEngine   # Hosts the query builder and `#ask` language interpreter classes
│
├─ ElasticFactory
└─ ElasticStore

Data mapping and serialization

Serialization format

{
    "_index": "smw-data-mw-30-00-elastic-v1",
    "_type": "data",
    "_id": "334032",
    "_version": 2,
    "_source": {
        "subject": {
            "title": "ABC/20180716/k10011534941000",
            "subobject": "_f21687e8bab0ebee627f71654ddd4bc4",
            "namespace": 0,
            "interwiki": "",
            "sortkey": "foo ..."
        },
        "P:100": {
            "txtField": [
                "Foo bar ..."
            ]
        },
        "P:4": {
            "wpgField": [
                "foobar"
            ],
            "wpgID": [
                334125
            ]
        }
    }
}
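
As a reading aid, the following Python sketch walks a document shaped like the one above and collects the fields stored under each P:<ID> key. The function name property_values is hypothetical and not part of Semantic MediaWiki, and the sample data is abridged from the example.

```python
import json

# Abridged copy of the example document above.
DOCUMENT = json.loads("""
{
    "_id": "334032",
    "_source": {
        "subject": {"title": "ABC/20180716/k10011534941000", "namespace": 0},
        "P:100": {"txtField": ["Foo bar ..."]},
        "P:4": {"wpgField": ["foobar"], "wpgID": [334125]}
    }
}
""")


def property_values(source):
    """Return {property key: {field name: values}} for every P:<ID> entry."""
    return {
        key: dict(fields)
        for key, fields in source.items()
        if key.startswith("P:")
    }


if __name__ == "__main__":
    for key, fields in property_values(DOCUMENT["_source"]).items():
        print(key, fields)
```

The subject block is skipped because its key does not start with P:, so only property data is returned.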

It should be remembered that, besides the specific types available in ES, text fields are generally divided into analyzed and not_analyzed fields.

Field mapping

Semantic MediaWiki maps its internal structure using dynamic_templates to define the expected data types and their attributes, and possibly to add extra index fields (see multi-fields) that support certain query constructs.

The naming convention follows a pragmatic scheme, P:<ID>.<type>Field, with each new field (aka property) being mapped dynamically to a corresponding field type. For example, P:100.txtField holds the text values of the property with ID 100.
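
The naming scheme can be sketched as a small Python helper. The function field_name and its parameters are hypothetical names chosen for illustration; only the resulting pattern comes from the convention described here.

```python
def field_name(property_id, type_prefix, sub_field=None):
    """Build a field path following the P:<ID>.<type>Field convention.

    `sub_field` optionally appends a multi-field element such as `keyword`.
    """
    name = f"P:{property_id}.{type_prefix}Field"
    return f"{name}.{sub_field}" if sub_field else name


if __name__ == "__main__":
    print(field_name(100, "txt"))             # P:100.txtField
    print(field_name(4, "wpg"))               # P:4.wpgField
    print(field_name(100, "txt", "keyword"))  # P:100.txtField.keyword
```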

The SemanticData object is always serialized in its entirety so that the interface does not have to keep delta information. ES itself creates a new index document for each update, so keeping deltas would make little difference to how the data are stored and updated. Serializing the whole object also allows the indexer to take advantage of the bulk API, which makes updates faster and more resilient and avoids document comparison during an update.
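
To illustrate full-document updates through the bulk API, the sketch below builds a newline-delimited request body in which every document is sent as a complete index action. It is a minimal illustration, not the Semantic MediaWiki indexer: the function name bulk_payload is hypothetical, the index name is taken from the example document, and the _type shown there is omitted for brevity.

```python
import json


def bulk_payload(index, documents):
    """Build an NDJSON body for the ES bulk API.

    `documents` maps a document ID to its complete `_source` body. Each
    document is sent as a full `index` action, so no delta is computed.
    """
    lines = []
    for doc_id, source in documents.items():
        # Action line, followed by the full document on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = {"334032": {"P:100": {"txtField": ["Foo bar ..."]}}}
    print(bulk_payload("smw-data-mw-30-00-elastic-v1", docs), end="")
```

Because each action replaces the whole document, the sender never has to compare the new state with the previously indexed one.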

To allow for exact matches as well as full-text searches on the same field, most mapped fields have two or three additional multi-field elements that store the text as not_analyzed (that is, as keyword) and as a sortable entity.
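
The following sketch shows what a dynamic template with multi-fields could look like when assembled in Python. It is an assumption for illustration only and does not reproduce the mapping shipped with Semantic MediaWiki: the template name text_fields, the path pattern, and the sort sub-field are hypothetical.

```python
import json


def text_field_template():
    """Return an illustrative dynamic template for `P:<ID>.txtField` fields."""
    return {
        "text_fields": {  # hypothetical template name
            "path_match": "P:*.txtField",
            "mapping": {
                "type": "text",  # analyzed, used for full-text search
                "fields": {
                    # Not analyzed, used for exact matches.
                    "keyword": {"type": "keyword"},
                    # Hypothetical sub-field used for sorting.
                    "sort": {"type": "keyword"},
                },
            },
        }
    }


mapping = {"dynamic_templates": [text_field_template()]}

if __name__ == "__main__":
    print(json.dumps(mapping, indent=2))
```

With a template of this shape, a newly seen P:100.txtField would be indexed once as analyzed text and again under P:100.txtField.keyword for exact matching.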

An ES document can contain additional fields, such as the text_copy, text_raw, and attachment fields used in the query example under ES DSL mapping.

ES DSL mapping

#ask queries are transformed into an equivalent expression in the ES DSL, thereby allowing the ElasticStore to be used as a drop-in replacement for queries expressed using #ask language constructs.

For example, the [[in:lorem ipsum]] query (or, fully qualified, [[~~*lorem ipsum*]]) finds all entities that contain lorem ipsum in any document, across structured and unstructured fields. Written as ES DSL, it will look similar to:

"bool": {
    "must": {
        "query_string": {
            "fields": [
                "subject.title^8",
                "text_copy^5",
                "text_raw",
                "attachment.title^3",
                "attachment.content"
            ],
            "query": "*lorem ipsum*",
            "minimum_should_match": 1
        }
    }
}

The term lorem ipsum is queried in different fields with different boost factors, which express a preference for matches in a title over matches that are only part of a text field.
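
The boosted field list can be sketched as follows. The helper names boosted_fields and wide_proximity_query are hypothetical, and the code only reassembles the DSL fragment shown above rather than reproducing the actual query builder; the boost values are copied from that fragment.

```python
def boosted_fields(boosts):
    """Format field names with optional boost factors, e.g. `subject.title^8`."""
    return [
        name if boost is None else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]


def wide_proximity_query(term, boosts):
    """Build a query_string clause that searches `term` in all given fields."""
    return {
        "bool": {
            "must": {
                "query_string": {
                    "fields": boosted_fields(boosts),
                    "query": f"*{term}*",
                    "minimum_should_match": 1,
                }
            }
        }
    }


# Boost factors as used in the example above; None means no explicit boost.
BOOSTS = {
    "subject.title": 8,
    "text_copy": 5,
    "text_raw": None,
    "attachment.title": 3,
    "attachment.content": None,
}

if __name__ == "__main__":
    print(wide_proximity_query("lorem ipsum", BOOSTS))
```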

A request for a structured term (one assigned to a property, e.g. [[Has text::lorem ipsum]]) generates a different ES DSL query:

"bool": {
    "filter": {
        "term": {
            "P:100.txtField.keyword": "lorem ipsum"
        }
    }
}

P:100.txtField contains the text component assigned to Has text and is an analyzed field by default. The query therefore selects the keyword sub-field, whose content is not analyzed, to match the exact term. Exact term matching means that the matching process distinguishes between lorem ipsum and Lorem ipsum.
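
A minimal sketch of how such an exact-match filter could be assembled is shown below. The function exact_term_filter is a hypothetical helper, and the property ID 100 is the one used in the example, not a fixed value.

```python
def exact_term_filter(property_id, value, type_prefix="txt"):
    """Build a term filter on the not-analyzed `keyword` sub-field."""
    field = f"P:{property_id}.{type_prefix}Field.keyword"
    # The value is passed through unchanged, so letter case is preserved.
    return {"bool": {"filter": {"term": {field: value}}}}


if __name__ == "__main__":
    print(exact_term_filter(100, "lorem ipsum"))
```

Because the value is not normalized, filters built for lorem ipsum and Lorem ipsum differ and would match different stored terms.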

In contrast, a proximity request (e.g. [[Has text::~lorem ipsum*]]) has different requirements, including case folding and matching of lower and upper case, and therefore includes the analyzed field, with an ES DSL output comparable to:

"bool": {
    "must": {
        "query_string": {
            "fields": [
                "P:100.txtField",
                "P:100.txtField.keyword"
            ],
            "query": "lorem +ipsum*"
        }
    }
}
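
For comparison with the exact-match case, the fragment above can be assembled with a sketch like the following. The helper property_proximity_query is hypothetical, and the translation of the #ask value into the query string is not reproduced here; the query string is simply passed in as given.

```python
def property_proximity_query(property_id, query, type_prefix="txt"):
    """Build a query_string clause over the analyzed field and its keyword sub-field."""
    base = f"P:{property_id}.{type_prefix}Field"
    return {
        "bool": {
            "must": {
                "query_string": {
                    # Analyzed field first, then the not-analyzed sub-field.
                    "fields": [base, f"{base}.keyword"],
                    "query": query,
                }
            }
        }
    }


if __name__ == "__main__":
    print(property_proximity_query(100, "lorem +ipsum*"))
```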

Monitoring

To make it easier for administrators to monitor the interface between Semantic MediaWiki and ES, several service links provide convenient access to selected information.

The main access point is Special:SemanticMediaWiki/elastic. Only users with the smw-admin right (which is required for the Special:SemanticMediaWiki page) can access the information, and only when an ES cluster is available.

Logging

To enable connector-specific logging, use the smw-elastic identifier in your LocalSettings.php.

$wgDebugLogGroups = [
    'smw-elastic' => ".../logs/smw-elastic-{$wgDBname}.log",
];
