Fulltext indexing

If the Text Search module (Fulltext Indexing and Search) is enabled, all files uploaded to Structr are indexed automatically using the fulltext indexing engine Apache Tika (https://tika.apache.org/). If Tesseract OCR (https://github.com/tesseract-ocr) is installed, Structr even indexes the textual content extracted from images.

Language identification

Structr tries to identify the language of each indexed document and applies a language-dependent list of stop words. These words are skipped during indexing, so that only the meaningful content of the document is indexed.
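
The effect of stop word filtering can be illustrated with a minimal sketch. This is not Structr's implementation: the STOP_WORDS lists and the indexable_words function are hypothetical, and the language is passed in directly instead of being detected.

```python
# Illustrative stop word lists; these are NOT the lists Structr uses.
STOP_WORDS = {
    "en": {"a", "an", "is", "the", "this"},
    "de": {"dies", "ein", "eine", "ist", "der", "die", "das"},
}


def indexable_words(text, language):
    """Return the lower-cased words of text that are not stop words for language."""
    stop_words = STOP_WORDS.get(language, set())
    # Strip surrounding punctuation so "example." and "example" match.
    words = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [word for word in words if word and word not in stop_words]


print(indexable_words("Dies ist ein Test", "de"))    # ['test']
print(indexable_words("This is an example.", "en"))  # ['example']
```

In this sketch, a language without a stop word list falls back to an empty set, so all of its words are kept.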

Example

Files can be retrieved by querying the indexedWords field, as in the following example:

GET http://localhost:8082/structr/rest/files/ui?indexedWords=example&loose=1

The result will look roughly like this:

{
    "query_time": "0.001493350",
    "result_count": 1,
    "result": [
        {
            "id": "d7ac4f78e25141f199d3e39eb7ae3676",
            "type": "File",
            "name": "test.txt",
            "contentType": null,
            "size": 18,
            "url": null,
            "owner": {
                "id": "f02e59a47dc9492da3e6cb7fb6b3ac25",
                "type": "User",
                "name": "admin",
                "isUser": true
            },
            "path": "/test.txt",
            "isFile": true,
            "visibleToPublicUsers": false,
            "visibleToAuthenticatedUsers": false
        }
    ],
    "serialization_time": "0.000328395"
}
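
The search request above can also be assembled in code. The following is a minimal sketch that assumes the base URL from the example; the helper names build_search_url and file_names are hypothetical and not part of Structr. It only builds the URL and reads a parsed response, so the actual HTTP call, including any authentication your instance requires, is left to the HTTP client of your choice.

```python
from urllib.parse import urlencode

# Base URL taken from the example above; adjust host and port for your instance.
BASE_URL = "http://localhost:8082/structr/rest"


def build_search_url(term, loose=True, view="ui"):
    """Build the URL for a fulltext search on the indexedWords field."""
    params = {"indexedWords": term}
    if loose:
        params["loose"] = 1
    # urlencode joins the parameters with "&" and escapes special characters.
    return f"{BASE_URL}/files/{view}?{urlencode(params)}"


def file_names(response):
    """Extract the file names from a parsed (dict) search response."""
    return [entry["name"] for entry in response.get("result", [])]


print(build_search_url("example"))
# http://localhost:8082/structr/rest/files/ui?indexedWords=example&loose=1
```

Building the query string with urlencode avoids mistakes such as separating parameters with a second "?" and takes care of escaping search terms that contain spaces or special characters.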

Search Context

Structr provides a method to retrieve the context of a fulltext search hit, i.e. the paragraph or text block that contains the match. The search context can be retrieved using the following REST call, assuming that d7ac4f78e25141f199d3e39eb7ae3676 is the ID of one of the files returned by the search query above.

POST files/d7ac4f78e25141f199d3e39eb7ae3676/getSearchContext { "searchTerm": "test", "contextLength": 10 }

The result of this call will look like this:

{
    "result_count": 1,
    "result": {
        "context": [
            "Dies ist ein Test"
        ]
    },
    "serialization_time": "0.000040438"
}
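
As a rough sketch, the getSearchContext call could be prepared in Python as shown below. The base URL is assumed to be the same as in the search example, and the helper names build_search_context_request and context_snippets are hypothetical. The request object is only constructed here, not sent; sending it, and supplying whatever authentication your instance requires, is left out.

```python
import json
from urllib.request import Request

# Assumed base URL, as in the search example; adjust for your instance.
BASE_URL = "http://localhost:8082/structr/rest"


def build_search_context_request(file_id, search_term, context_length):
    """Prepare the POST request that asks for the context of a search hit."""
    body = json.dumps({"searchTerm": search_term, "contextLength": context_length})
    return Request(
        f"{BASE_URL}/files/{file_id}/getSearchContext",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def context_snippets(response):
    """Return the list of context strings from a parsed (dict) response."""
    return response.get("result", {}).get("context", [])


request = build_search_context_request("d7ac4f78e25141f199d3e39eb7ae3676", "test", 10)
print(request.get_method(), request.full_url)
```

The prepared request can be passed to urllib.request.urlopen, and the decoded JSON response handed to context_snippets to obtain the matching text blocks.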

About this article
Last change 2017-05-04
Topics Files, Structr 2.0