diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md
index ef55930bdf..eee548d0f7 100644
--- a/_analyzers/character-filters/html-character-filter.md
+++ b/_analyzers/character-filters/html-character-filter.md
@@ -11,6 +11,8 @@ The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and
## Example: HTML analyzer
+The following request applies an `html_strip` character filter to the provided text:
+
```json
GET /_analyze
{
@@ -23,15 +25,35 @@ GET /_analyze
```
{% include copy-curl.html %}
-Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows:
+The response contains the token, in which the HTML character entities have been converted to their decoded values:
-```
+```json
+{
+ "tokens": [
+ {
+ "token": """
Commonly used calculus symbols include α, β and θ
+""",
+ "start_offset": 0,
+ "end_offset": 74,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
```
+## Parameters
+
+The `html_strip` character filter can be configured with the following parameter.
+
+| Parameter | Required/Optional | Data type | Description |
+|:---|:---|:---|:---|
+| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.|
+
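+You can also pass `escaped_tags` inline in an `_analyze` request. The following request is a minimal sketch (the sample text and tag choice are illustrative) that strips all HTML except the `<b>` element:
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "html_strip",
+      "escaped_tags": ["b"]
+    }
+  ],
+  "text": "<p>This is <b>important</b> text.</p>"
+}
+```
+{% include copy-curl.html %}
+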
## Example: Custom analyzer with lowercase filter
-The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:
+The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and the `lowercase` token filter:
```json
PUT /html_strip_and_lowercase_analyzer
@@ -57,9 +79,7 @@ PUT /html_strip_and_lowercase_analyzer
```
{% include copy-curl.html %}
-### Testing `html_strip_and_lowercase_analyzer`
-
-You can run the following request to test the analyzer:
+Use the following request to examine the tokens generated by the analyzer:
```json
GET /html_strip_and_lowercase_analyzer/_analyze
@@ -72,8 +92,32 @@ GET /html_strip_and_lowercase_analyzer/_analyze
In the response, the HTML tags have been removed and the plain text has been converted to lowercase:
-```
-welcome to opensearch!
+```json
+{
+ "tokens": [
+ {
+ "token": "welcome",
+ "start_offset": 4,
+ "end_offset": 11,
+        "type": "<ALPHANUM>",
+ "position": 0
+ },
+ {
+ "token": "to",
+ "start_offset": 12,
+ "end_offset": 14,
+        "type": "<ALPHANUM>",
+ "position": 1
+ },
+ {
+ "token": "opensearch",
+ "start_offset": 23,
+ "end_offset": 42,
+        "type": "<ALPHANUM>",
+ "position": 2
+ }
+ ]
+}
```
## Example: Custom analyzer that preserves HTML tags
@@ -104,9 +148,7 @@ PUT /html_strip_preserve_analyzer
```
{% include copy-curl.html %}
-### Testing `html_strip_preserve_analyzer`
-
-You can run the following request to test the analyzer:
+Use the following request to examine the tokens generated by the analyzer:
```json
GET /html_strip_preserve_analyzer/_analyze
@@ -119,6 +161,18 @@ GET /html_strip_preserve_analyzer/_analyze
In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request:
-```
+```json
+{
+ "tokens": [
+ {
+ "token": """
This is a <b>bold</b> and <i>italic</i> text.
+""",
+ "start_offset": 0,
+ "end_offset": 52,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
```
diff --git a/_analyzers/character-filters/mapping-character-filter.md b/_analyzers/character-filters/mapping-character-filter.md
new file mode 100644
index 0000000000..0cd882e52e
--- /dev/null
+++ b/_analyzers/character-filters/mapping-character-filter.md
@@ -0,0 +1,124 @@
+---
+layout: default
+title: Mapping
+parent: Character filters
+nav_order: 120
+---
+
+# Mapping character filter
+
+The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.
+
+The filter applies greedy matching: when multiple keys match at the same position, the longest key wins. For example, with mappings for both `I` and `II`, the input `II` is replaced using the `II` mapping rather than by two `I` replacements.
+
+The `mapping` character filter helps in scenarios where specific text replacements are required before tokenization.
+
+## Example
+
+The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):
+
+```json
+GET /_analyze
+{
+ "tokenizer": "keyword",
+ "char_filter": [
+ {
+ "type": "mapping",
+ "mappings": [
+ "I => 1",
+ "II => 2",
+ "III => 3",
+ "IV => 4",
+ "V => 5"
+ ]
+ }
+ ],
+ "text": "I have III apples and IV oranges"
+}
+```
+{% include copy-curl.html %}
+
+The response contains a token where Roman numerals have been replaced with Arabic numerals:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "1 have 3 apples and 4 oranges",
+ "start_offset": 0,
+ "end_offset": 32,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
+```
+
+## Parameters
+
+Use one of the following parameters to configure the key-value map. Either `mappings` or `mappings_path` must be specified.
+
+| Parameter | Required/Optional | Data type | Description |
+|:---|:---|:---|:---|
+| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
+| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
+
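+If the mapping list is long, you can store it in a file and reference it through `mappings_path`. The following request is an illustrative sketch that assumes a file named `analysis/abbr-mappings.txt` exists in the OpenSearch configuration directory and contains one `key => value` pair per line:
+
+```json
+PUT /file-mapped-index
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "file_abbr_filter": {
+          "type": "mapping",
+          "mappings_path": "analysis/abbr-mappings.txt"
+        }
+      }
+    }
+  }
+}
+```
+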
+### Using a custom mapping character filter
+
+You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in a text:
+
+```json
+PUT /test-index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_abbr_analyzer": {
+ "tokenizer": "standard",
+ "char_filter": [
+ "custom_abbr_filter"
+ ]
+ }
+ },
+ "char_filter": {
+ "custom_abbr_filter": {
+ "type": "mapping",
+ "mappings": [
+ "BTW => By the way",
+ "IDK => I don't know",
+ "FYI => For your information"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated by the analyzer:
+
+```json
+GET /test-index/_analyze
+{
+ "tokenizer": "keyword",
+ "char_filter": [ "custom_abbr_filter" ],
+ "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
+}
+```
+{% include copy-curl.html %}
+
+The response shows that the abbreviations were replaced:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
+ "start_offset": 0,
+ "end_offset": 153,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
+```