Fulltext indexing
If the Text Search module (Fulltext Indexing and Search) is enabled, all files uploaded to Structr are indexed automatically using the fulltext indexing engine Apache Tika (https://tika.apache.org/). If Tesseract OCR (https://github.com/tesseract-ocr) is installed, Structr even indexes the textual content extracted from images.
Language identification
Structr tries to identify the language of each indexed document and applies a language-specific list of stop words. These words are skipped during indexing, so that only the meaningful content of the document is indexed.
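To make the idea of language-dependent stop words concrete, here is a toy sketch. It is not Structr's or Tika's implementation: the stop word lists, the `guess_language` heuristic, and the function names are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: these stop word lists and the language guess are
# hypothetical and far smaller than anything a real indexer would use.
STOP_WORDS = {
    "en": {"this", "is", "a", "the", "of", "and"},
    "de": {"dies", "ist", "ein", "der", "die", "das", "und"},
}


def guess_language(tokens):
    """Pick the language whose stop word list matches the most tokens."""
    return max(STOP_WORDS, key=lambda lang: sum(t in STOP_WORDS[lang] for t in tokens))


def index_words(text):
    """Return the lower-cased words of `text` that are not stop words."""
    tokens = [t.lower() for t in text.split()]
    language = guess_language(tokens)
    return [t for t in tokens if t not in STOP_WORDS[language]]


print(index_words("Dies ist ein Test"))
print(index_words("This is a test"))
```

In both calls only the word "test" survives, because every other word appears in the stop word list of the guessed language.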
Example
Files can be retrieved by querying the indexedWords field, as in the following example:
GET http://localhost:8082/structr/rest/files/ui?indexedWords=example&loose=1
The result will look roughly like this:
{
"query_time": "0.001493350",
"result_count": 1,
"result": [
{
"id": "d7ac4f78e25141f199d3e39eb7ae3676",
"type": "File",
"name": "test.txt",
"contentType": null,
"size": 18,
"url": null,
"owner": {
"id": "f02e59a47dc9492da3e6cb7fb6b3ac25",
"type": "User",
"name": "admin",
"isUser": true
},
"path": "/test.txt",
"isFile": true,
"visibleToPublicUsers": false,
"visibleToAuthenticatedUsers": false
}
],
"serialization_time": "0.000328395"
}
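A query URL of this kind can also be assembled programmatically. The following is a minimal sketch, assuming the host, port, and `ui` view from the example request; the helper name `build_search_url` is hypothetical. Multiple query parameters are joined with `&`. Authentication is omitted and depends on how the instance is configured.

```python
from urllib.parse import urlencode

# Assumption: base URL as in the example request; adjust it to your instance.
BASE_URL = "http://localhost:8082/structr/rest"


def build_search_url(term, loose=True, view="ui"):
    """Build the URL for a fulltext query on the indexedWords field."""
    params = {"indexedWords": term}
    if loose:
        params["loose"] = 1
    return f"{BASE_URL}/files/{view}?{urlencode(params)}"


print(build_search_url("example"))
```

The resulting URL can be passed to any HTTP client as a GET request. The matching files are listed under the `result` key of the JSON response.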
Search Context
Structr provides a method to retrieve the context of a fulltext search hit, i.e. the paragraph or text block that contains the match. The search context can be retrieved using the following REST call, assuming that d7ac4f78e25141f199d3e39eb7ae3676 is the ID of one of the files returned by the search query above.
POST files/d7ac4f78e25141f199d3e39eb7ae3676/getSearchContext { "searchTerm": "test", "contextLength": 10 }
The result of this call will look like this:
{
"result_count": 1,
"result": {
"context": [
"Dies ist ein Test"
]
},
"serialization_time": "0.000040438"
}
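The search context call can be prepared from a script in the same way. This is a sketch under stated assumptions: the base URL is taken from the search example, the helper names `build_context_request` and `extract_context` are hypothetical, and authentication is left out. The request object is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Assumption: base URL as in the search example; adjust it to your instance.
BASE_URL = "http://localhost:8082/structr/rest"


def build_context_request(file_id, search_term, context_length):
    """Prepare the POST request for getSearchContext without sending it."""
    body = json.dumps({"searchTerm": search_term, "contextLength": context_length})
    return Request(
        f"{BASE_URL}/files/{file_id}/getSearchContext",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_context(response_text):
    """Return the context snippets from a response shaped like the one above."""
    return json.loads(response_text)["result"]["context"]


request = build_context_request("d7ac4f78e25141f199d3e39eb7ae3676", "test", 10)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and feed the decoded response body to `extract_context`.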