diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md
index ef55930bdf..eee548d0f7 100644
--- a/_analyzers/character-filters/html-character-filter.md
+++ b/_analyzers/character-filters/html-character-filter.md
@@ -11,6 +11,8 @@ The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and
 
 ## Example: HTML analyzer
 
+The following request applies an `html_strip` character filter to the provided text:
+
 ```json
 GET /_analyze
 {
@@ -23,15 +25,35 @@ GET /_analyze
 ```
 {% include copy-curl.html %}
 
-Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows:
+The response contains the token in which the HTML character entities have been converted to their decoded values:
 
-```
+```json
+{
+  "tokens": [
+    {
+      "token": """
+Commonly used calculus symbols include α, β and θ
+""",
+      "start_offset": 0,
+      "end_offset": 74,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
 ```
 
+## Parameters
+
+The `html_strip` character filter can be configured with the following parameter.
+
+| Parameter | Required/Optional | Data type | Description |
+|:---|:---|:---|:---|
+| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped. |
+
 ## Example: Custom analyzer with lowercase filter
 
-The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:
+The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and the `lowercase` token filter:
 
 ```json
 PUT /html_strip_and_lowercase_analyzer
@@ -57,9 +79,7 @@ PUT /html_strip_and_lowercase_analyzer
 ```
 {% include copy-curl.html %}
 
-### Testing `html_strip_and_lowercase_analyzer`
-
-You can run the following request to test the analyzer:
+Use the following request to examine the tokens generated using the analyzer:
 
 ```json
 GET /html_strip_and_lowercase_analyzer/_analyze
@@ -72,8 +92,32 @@ GET /html_strip_and_lowercase_analyzer/_analyze
 
 In the response, the HTML tags have been removed and the plain text has been converted to lowercase:
 
-```
-welcome to opensearch!
+```json
+{
+  "tokens": [
+    {
+      "token": "welcome",
+      "start_offset": 4,
+      "end_offset": 11,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "to",
+      "start_offset": 12,
+      "end_offset": 14,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "opensearch",
+      "start_offset": 23,
+      "end_offset": 42,
+      "type": "<ALPHANUM>",
+      "position": 2
+    }
+  ]
+}
 ```
 
 ## Example: Custom analyzer that preserves HTML tags
@@ -104,9 +148,7 @@ PUT /html_strip_preserve_analyzer
 ```
 {% include copy-curl.html %}
 
-### Testing `html_strip_preserve_analyzer`
-
-You can run the following request to test the analyzer:
+Use the following request to examine the tokens generated using the analyzer:
 
 ```json
 GET /html_strip_preserve_analyzer/_analyze
@@ -119,6 +161,18 @@ GET /html_strip_preserve_analyzer/_analyze
 
 In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request:
 
-```
+```json
+{
+  "tokens": [
+    {
+      "token": """
+This is a <b>bold</b> and <i>italic</i> text.
+""",
+      "start_offset": 0,
+      "end_offset": 52,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
 ```
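+
+To try the `escaped_tags` parameter without creating an index, you can define the character filter inline in an `_analyze` request. The following request is a minimal sketch (the sample text is purely illustrative) that strips all HTML except `<b>` elements:
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "html_strip",
+      "escaped_tags": ["b"]
+    }
+  ],
+  "text": "<p>The <b>long-term</b> forecast</p>"
+}
+```
+{% include copy-curl.html %}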
diff --git a/_analyzers/character-filters/mapping-character-filter.md b/_analyzers/character-filters/mapping-character-filter.md
new file mode 100644
index 0000000000..0cd882e52e
--- /dev/null
+++ b/_analyzers/character-filters/mapping-character-filter.md
@@ -0,0 +1,124 @@
+---
+layout: default
+title: Mapping
+parent: Character filters
+nav_order: 120
+---
+
+# Mapping character filter
+
+The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.
+
+The filter applies greedy matching: if multiple keys match at the same position in the text, the longest key takes precedence. For a demonstration, see the final example in this topic.
+
+The `mapping` character filter is useful in scenarios in which specific text replacements are required before tokenization.
+
+## Example
+
+The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "mapping",
+      "mappings": [
+        "I => 1",
+        "II => 2",
+        "III => 3",
+        "IV => 4",
+        "V => 5"
+      ]
+    }
+  ],
+  "text": "I have III apples and IV oranges"
+}
+```
+{% include copy-curl.html %}
+
+The response contains a token in which the Roman numerals have been replaced with Arabic numerals:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "1 have 3 apples and 4 oranges",
+      "start_offset": 0,
+      "end_offset": 32,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
+```
+
+## Parameters
+
+You can use either of the following parameters to configure the key-value map. You must specify either `mappings` or `mappings_path`.
+
+| Parameter | Required/Optional | Data type | Description |
+|:---|:---|:---|:---|
+| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
+| `mappings_path` | Optional | String | The path to a UTF-8-encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
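+
+For long mapping lists, a file-based configuration can be easier to maintain. The following request is a sketch of how `mappings_path` might be used; the index name, filter name, and mappings file (`analysis/roman-numerals.txt`, relative to the OpenSearch configuration directory, containing one `key => value` pair per line) are hypothetical:
+
+```json
+PUT /roman-numeral-index
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "roman_numeral_filter": {
+          "type": "mapping",
+          "mappings_path": "analysis/roman-numerals.txt"
+        }
+      },
+      "analyzer": {
+        "roman_numeral_analyzer": {
+          "tokenizer": "standard",
+          "char_filter": [ "roman_numeral_filter" ]
+        }
+      }
+    }
+  }
+}
+```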
+
+## Using a custom mapping character filter
+
+You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in text:
+
+```json
+PUT /test-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "custom_abbr_analyzer": {
+          "tokenizer": "standard",
+          "char_filter": [
+            "custom_abbr_filter"
+          ]
+        }
+      },
+      "char_filter": {
+        "custom_abbr_filter": {
+          "type": "mapping",
+          "mappings": [
+            "BTW => By the way",
+            "IDK => I don't know",
+            "FYI => For your information"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /test-index/_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [ "custom_abbr_filter" ],
+  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
+}
+```
+{% include copy-curl.html %}
+
+The response shows that the abbreviations were replaced:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
+      "start_offset": 0,
+      "end_offset": 153,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
+```
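+
+Because matching is greedy, a longer key always takes precedence over a shorter key that matches at the same position. The following illustrative request defines the overlapping keys `C => 100` and `CM => 900`; the input `CM` is replaced with `900` rather than `100M`:
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "mapping",
+      "mappings": [
+        "C => 100",
+        "CM => 900"
+      ]
+    }
+  ],
+  "text": "CM"
+}
+```
+{% include copy-curl.html %}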