Implement synthetic source support for annotated text field (elastic#107735)

This PR adds synthetic source support for annotated_text fields. The existing implementation for text is reused, including the test infrastructure, so the majority of the change is moving code and making things accessible. Contributes to elastic#106460, elastic#78744.
henningandersen · Apr 25, 2024 · e1d902d
1 parent 4ef8b38
commit e1d902d
Showing 16 changed files with 824 additions and 300 deletions.
diff --git a/docs/changelog/107735.yaml b/docs/changelog/107735.yaml
@@ -0,0 +1,5 @@
+pr: 107735
+summary: Implement synthetic source support for annotated text field
+area: Mapping
+type: feature
+issues: []
diff --git a/docs/plugins/mapper-annotated-text.asciidoc b/docs/plugins/mapper-annotated-text.asciidoc
@@ -6,7 +6,7 @@ experimental[]
 The mapper-annotated-text plugin provides the ability to index text that is a
 combination of free-text and special markup that is typically used to identify
 items of interest such as people or organisations (see NER or Named Entity Recognition
-tools). 
+tools).
 
 
 The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
@@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
 [[mapper-annotated-text-usage]]
 ==== Using the `annotated-text` field
 
-The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see 
+The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see
 "limitations" below) but also injects any marked-up annotation tokens directly into
 the search index:
 
@@ -49,7 +49,7 @@ in the search index:
 --------------------------
 GET my-index-000001/_analyze
 {
-  "field": "my_field", 
+  "field": "my_field",
   "text":"Investors in [Apple](Apple+Inc.) rejoiced."
 }
 --------------------------
@@ -76,7 +76,7 @@ Response:
       "position": 1
     },
     {
-      "token": "Apple Inc.", <1> 
+      "token": "Apple Inc.", <1>
       "start_offset": 13,
       "end_offset": 18,
       "type": "annotation",
@@ -106,7 +106,7 @@ the token stream and at the same position (position 2) as the text token (`apple
 
 
 We can now perform searches for annotations using regular `term` queries that don't tokenize
-the provided search values. Annotations are a more precise way of matching as can be seen 
+the provided search values. Annotations are a more precise way of matching as can be seen
 in this example where a search for `Beck` will not match `Jeff Beck` :
 
 [source,console]
@@ -133,18 +133,119 @@ GET my-index-000001/_search
 }
 --------------------------
 
-<1> As well as tokenising the plain text into single words e.g. `beck`, here we 
+<1> As well as tokenising the plain text into single words e.g. `beck`, here we
 inject the single token value `Beck` at the same position as `beck` in the token stream.
 <2> Note annotations can inject multiple tokens at the same position - here we inject both
 the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
 broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
-<3> A benefit of searching with these carefully defined annotation tokens is that a query for 
+<3> A benefit of searching with these carefully defined annotation tokens is that a query for
 `Beck` will not match document 2 that contains the tokens `jeff`, `beck` and `Jeff Beck`
 
-WARNING: Any use of `=` signs in annotation values eg `[Prince](person=Prince)` will 
+WARNING: Any use of `=` signs in annotation values eg `[Prince](person=Prince)` will
 cause the document to be rejected with a parse failure. In future we hope to have a use for
 the equals signs so wil actively reject documents that contain this today.
 
+[[annotated-text-synthetic-source]]
+===== Synthetic `_source`
+
+IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
+(indices that have `index.mode` set to `time_series`). For other indices
+synthetic `_source` is in technical preview. Features in technical preview may
+be changed or removed in a future release. Elastic will work to fix
+any issues, but features in technical preview are not subject to the support SLA
+of official GA features.
+
+`annotated_text` fields support {ref}/mapping-source-field.html#synthetic-source[synthetic `_source`] if they have
+a {ref}/keyword.html#keyword-synthetic-source[`keyword`] sub-field that supports synthetic
+`_source` or if the `text` field sets `store` to `true`. Either way, it may
+not have {ref}/copy-to.html[`copy_to`].
+
+If using a sub-`keyword` field then the values are sorted in the same way as
+a `keyword` field's values are sorted. By default, that means sorted with
+duplicates removed. So:
+[source,console,id=synthetic-source-text-example-default]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "mode": "synthetic" },
+    "properties": {
+      "text": {
+        "type": "annotated_text",
+        "fields": {
+          "raw": {
+            "type": "keyword"
+          }
+        }
+      }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "text": [
+    "jumped over the lazy dog",
+    "the quick brown fox"
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+NOTE: Reordering text fields can have an effect on {ref}/query-dsl-match-query-phrase.html[phrase]
+and {ref}/span-queries.html[span] queries. See the discussion about {ref}/position-increment-gap.html[`position_increment_gap`] for more detail. You
+can avoid this by making sure the `slop` parameter on the phrase queries
+is lower than the `position_increment_gap`. This is the default.
+
+If the `annotated_text` field sets `store` to true then order and duplicates
+are preserved.
+[source,console,id=synthetic-source-text-example-stored]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "mode": "synthetic" },
+    "properties": {
+      "text": { "type": "annotated_text", "store": true }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
 
 [[mapper-annotated-text-tips]]
 ==== Data modelling tips
@@ -153,13 +254,13 @@ the equals signs so wil actively reject documents that contain this today.
 Annotations are normally a way of weaving structured information into unstructured text for
 higher-precision search.
 
-`Entity resolution` is a form of document enrichment undertaken by specialist software or people 
+`Entity resolution` is a form of document enrichment undertaken by specialist software or people
 where references to entities in a document are disambiguated by attaching a canonical ID.
 The ID is used to resolve any number of aliases or distinguish between people with the
-same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved 
-entity IDs woven into text. 
+same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
+entity IDs woven into text.
 
-These IDs can be embedded as annotations in an annotated_text field but it often makes 
+These IDs can be embedded as annotations in an annotated_text field but it often makes
 sense to include them in dedicated structured fields to support discovery via aggregations:
 
 [source,console]
@@ -214,40 +315,40 @@ GET my-index-000001/_search
 --------------------------
 
 <1> Note the `my_twitter_handles` contains a list of the annotation values
-also used in the unstructured text. (Note the annotated_text syntax requires escaping). 
-By repeating the annotation values in a structured field this application has ensured that 
-the tokens discovered in the structured field can be used for search and highlighting 
-in the unstructured field.  
+also used in the unstructured text. (Note the annotated_text syntax requires escaping).
+By repeating the annotation values in a structured field this application has ensured that
+the tokens discovered in the structured field can be used for search and highlighting
+in the unstructured field.
 <2> In this example we search for documents that talk about components of the elastic stack
 <3> We use the `my_twitter_handles` field here to discover people who are significantly
 associated with the elastic stack.
 
 ===== Avoiding over-matching annotations
-By design, the regular text tokens and the annotation tokens co-exist in the same indexed 
+By design, the regular text tokens and the annotation tokens co-exist in the same indexed
 field but in rare cases this can lead to some over-matching.
 
 The value of an annotation often denotes a _named entity_ (a person, place or company).
-The tokens for these named entities are inserted untokenized, and differ from typical text 
+The tokens for these named entities are inserted untokenized, and differ from typical text
 tokens because they are normally:
 
 * Mixed case e.g. `Madonna`
 * Multiple words e.g. `Jeff Beck`
 * Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy`
 
 This means, for the most part, a search for a named entity in the annotated text field will
-not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result 
-you can drill down to highlight uses in the text without "over matching" on any text tokens 
+not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
+you can drill down to highlight uses in the text without "over matching" on any text tokens
 like the word `apple` in this context:
 
     the apple was very juicy
-    
-However, a problem arises if your named entity happens to be a single term and lower-case e.g. the 
+
+However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
 company `elastic`. In this case, a search on the annotated text field for the token `elastic`
 may match a text document such as this:
 
     they fired an elastic band
 
-To avoid such false matches users should consider prefixing annotation values to ensure 
+To avoid such false matches users should consider prefixing annotation values to ensure
 they don't name clash with text tokens e.g.
 
     [elastic](Company_elastic) released version 7.0 of the elastic stack today
@@ -273,7 +374,7 @@ GET my-index-000001/_search
 {
   "query": {
     "query_string": {
-        "query": "cats" 
+        "query": "cats"
     }
   },
   "highlight": {
@@ -291,21 +392,21 @@ GET my-index-000001/_search
 
 The annotated highlighter is based on the `unified` highlighter and supports the same
 settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
-html-like markup such as `<em>cat</em>` the annotated highlighter uses the same 
+html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
 markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
-is the key and the matched search term is the value e.g. 
+is the key and the matched search term is the value e.g.
 
     The [cat](_hit_term=cat) sat on the [mat](sku3578)
 
-The annotated highlighter tries to be respectful of any existing markup in the original 
+The annotated highlighter tries to be respectful of any existing markup in the original
 text:
 
-* If the search term matches exactly the location of an existing annotation then the 
+* If the search term matches exactly the location of an existing annotation then the
 `_hit_term` key is merged into the url-like syntax used in the `(...)` part of the
-existing annotation. 
+existing annotation.
 * However, if the search term overlaps the span of an existing annotation it would break
 the markup formatting so the original annotation is removed in favour of a new annotation
-with just the search hit information in the results. 
+with just the search hit information in the results.
 * Any non-overlapping annotations in the original text are preserved in highlighter
 selections
 

diff --git a/docs/reference/mapping/fields/synthetic-source.asciidoc b/docs/reference/mapping/fields/synthetic-source.asciidoc
@@ -41,6 +41,7 @@ There are a couple of restrictions to be aware of:
 types:
 
 ** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
+** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
 ** <<binary-synthetic-source,`binary`>>
 ** <<boolean-synthetic-source,`boolean`>>
 ** <<numeric-synthetic-source,`byte`>>
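
Reviewer note: the two reconstruction behaviours documented for `annotated_text` in this commit (sorted with duplicates removed when the values come from a `keyword` sub-field, order and duplicates preserved when `store` is `true`) can be modelled in a few lines. This is a sketch of the documented outcome only, assuming the default `keyword` behaviour. The function name `synthetic_values` is hypothetical and the code says nothing about how Elasticsearch implements the feature internally.

```python
def synthetic_values(values, stored):
    """Model which values a synthetic _source would return for one field.

    stored=True  -> values come back exactly as indexed.
    stored=False -> values come from a keyword sub-field, so they are
                    sorted and de-duplicated (the documented default).
    """
    if stored:
        return list(values)
    return sorted(set(values))


doc = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]

from_keyword = synthetic_values(doc, stored=False)
from_store = synthetic_values(doc, stored=True)
```

With the document from the docs example, `from_keyword` holds the two distinct strings in sorted order and `from_store` is identical to the input list.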

diff --git a/plugins/mapper-annotated-text/src/main/java/module-info.java b/plugins/mapper-annotated-text/src/main/java/module-info.java
@@ -0,0 +1,19 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License
+ * 2.0 and the Server Side Public License, v 1; you may not use this file except
+ * in compliance with, at your election, the Elastic License 2.0 or the Server
+ * Side Public License, v 1.
+ */
+
+module org.elasticsearch.index.mapper.annotatedtext {
+    requires org.elasticsearch.base;
+    requires org.elasticsearch.server;
+    requires org.elasticsearch.xcontent;
+    requires org.apache.lucene.core;
+    requires org.apache.lucene.highlighter;
+
+    // exports nothing
+
+    provides org.elasticsearch.features.FeatureSpecification with org.elasticsearch.index.mapper.annotatedtext.Features;
+}