Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add Support for IP Data Type #3145

Closed
currantw opened this issue Oct 31, 2024 · 10 comments · Fixed by #3175
Closed

[FEATURE] Add Support for IP Data Type #3145

currantw opened this issue Oct 31, 2024 · 10 comments · Fixed by #3175
Labels
enhancement New feature or request

Comments

@currantw
Copy link
Contributor

currantw commented Oct 31, 2024

Is your feature request related to a problem?

OpenSearch SQL plugin does not support the IP address field type.

🚫 IP address fields are converted to strings:

search=weblog | fields host

returns host with field type string.

🚫 IP address fields cannot be correctly for equality:

search=weblog | where host = "2001:0db7::ff00:42:8329" | fields host

will not return the value 2001:0db7:0000:0000:0000:ff00:0042:8329, even though both strings represent the same IP address.

What solution would you like?

Outcomes:

  • IP addresses can be retrieved using the OpenSearch SQL plugin without conversion to strings.
  • IP addresses supports equality operations (= and !=).
  • IP addresses supports comparison operations (<, <=, >, and >) if they are both IPv4 or IPv6.
  • IP addresses supports sorting (again, if they are all IPv4 or IPv6).
  • IP addresses work with IP-specific functions (currently only cidrmatch - see Add CIDR function to PPL (#3036) #3110).

Proposed Solution:

  • Add an new IP type to ExprCoreType.
  • Replace OpenSearchExprIpValue with ExprIpValue , and update implementation.
  • Update OpenSearchDataType.MappingType to map "ip" fields to ExprCoreType.IP.
  • Update OpenSearchExprValueFactory.
  • Update other code, unit tests, and integration tests as necessary.

What alternatives have you considered?

None

Do you have any additional context?

This is closely related to #3110. This issue added a new cidrmatch(ip, cidr) function that returns whether the given IP address is within the specified CIDR IP address range. As part of this work, the SQL plugin was updated to cast IP addresses to strings - previously, it would raise an exception.

@currantw currantw added enhancement New feature or request untriaged labels Oct 31, 2024
@kedbirhan
Copy link

we really need this feature, we can't even do basic IP lookups right now!

query

SELECT * 
FROM logs-vpc
WHERE dstaddr = "10.100.138.82"
LIMIT 10;

log

400 Bad Request: "{<EOL>  "error": {<EOL>    "reason": "Invalid SQL query",<EOL>    "details": "= function expected {[BYTE,BYTE],[SHORT,SHORT],[INTEGER,INTEGER],[LONG,LONG],[FLOAT,FLOAT],[DOUBLE,DOUBLE],[STRING,STRING],[BOOLEAN,BOOLEAN],[DATE,DATE],[TIME,TIME],[DATETIME,DATETIME],[TIMESTAMP,TIMESTAMP],[INTERVAL,INTERVAL],[STRUCT,STRUCT],[ARRAY,ARRAY]}, but get [IP,STRING]",<EOL>    "type": "ExpressionEvaluationException"<EOL>  },<EOL>  "status": 400<EOL>}"

@YANG-DB
Copy link
Member

YANG-DB commented Nov 15, 2024

Is your feature request related to a problem?

OpenSearch SQL plugin does not support the IP address field type.

🚫 IP address fields are converted to strings:

search=weblog | fields host

returns host with field type string.

🚫 IP address fields cannot be correctly for equality:

search=weblog | where host = "2001:0db7::ff00:42:8329" | fields host

will not return the value 2001:0db7:0000:0000:0000:ff00:0042:8329, even though both strings represent the same IP address.

What solution would you like?

Outcomes:

  • IP addresses can be retrieved using the OpenSearch SQL plugin without conversion to strings.
  • IP addresses supports equality operations (= and !=).
  • IP addresses supports comparison operations (<, <=, >, and >) if they are both IPv4 or IPv6.
  • IP addresses supports sorting (again, if they are all IPv4 or IPv6).
  • IP addresses work with IP-specific functions (currently only cidrmatch - see Add CIDR function to PPL (#3036) #3110).

Proposed Solution:

  • Add an new IP type to ExprCoreType.
  • Replace OpenSearchExprIpValue with ExprIpValue , and update implementation.
  • Update OpenSearchDataType.MappingType to map "ip" fields to ExprCoreType.IP.
  • Update OpenSearchExprValueFactory.
  • Update other code, unit tests, and integration tests as necessary.

What alternatives have you considered?

None

Do you have any additional context?

This is closely related to #3110. This issue added a new cidrmatch(ip, cidr) function that returns whether the given IP address is within the specified CIDR IP address range. As part of this work, the SQL plugin was updated to cast IP addresses to strings - previously, it would raise an exception.

Hi @currantw - this is a great idea
my only concern is how to differentiate between a keyword field and an ip field - it can infer this implicitly but its not always the case that a user knows this in advanced ...

Can we explicitly use an ip syntax to have both the query writer and the query parser engine be aware this field is expected to be an ip and not text/keyword ?

This should be applicable for any <, <=, >, and > predicate operators

search=weblog | where host = ip("2001:0db7::ff00:42:8329") | fields host

@andrross
Copy link
Member

[Catch All Triage - 1, 2, 3, 4, 5]

currantw added a commit to Bit-Quill/opensearch-project-sql that referenced this issue Nov 22, 2024
…ort for IP address type).

Signed-off-by: currantw <[email protected]>
@currantw
Copy link
Contributor Author

currantw commented Nov 22, 2024

PROPOSED SOLUTION

  • Add an new IP type to ExprCoreType with compatible type STRING.
  • Replace OpenSearchExprIpValue with ExprIpValue.
  • Store the IP address internally as an IPAddressString.
  • Implement equal to support comparison with an IP address or a string.
  • Implement compare to support comparison with an IP address or a string of the same type (i.e. IPv4 or IPv6).
  • Implement stringValue and toString to return the original string.
  • Replace OpenSearchExprIpValueTest with ExprIpValueTest and update tests accordingly.
  • Update OpenSearchDataType.MappingType to map ip fields to ExprCoreType.IP instead of ExprCoreType.UNKNOWN.
  • Update OpenSearchExprValueFactory and corresponding tests.
  • Add new method ip that takes an ExprStringValue and returns an ExprIpValue.
  • Update cidrmatch function to support both IP address and string data types (for the first argument). Also update corresponding unit and integration tests.
  • Update other code, unit tests, and integration tests as necessary.

Additional notes:

  • For convenience, I think it makes sense to allow comparisons between IP addresses and strings, without needing to use the ip function. It's not strictly necessary, but it's easy to implement and probably more user friendly?
  • Since the cidrmatch function isn't merged yet, and the ip function is also relatively separate from the rest of the work, I may implement this as 2 or 3 successive PRs.

@normanj-bitquill
Copy link
Contributor

normanj-bitquill commented Nov 22, 2024

@currantw I think that comparisons should support both IPv4 and IPv6 addresses. For example an IPv4 address could be compared to an IPv6 address.

To implement this, I think it makes sense to have all IPv6 addresses be greater than all IPv4 addresses.

Consider an index that has an IP field. The field contains both IPv4 and IPv6 addresses. We should be able to sort based on that column. Sorting needs to be able to compare values.

For IPv6, toString and stringValue do not necessarily need to return the original string. The address 12ab:0000:0000:: is the same as the address 12ab::. It should be acceptable for toString or stringValue to return a reduced but equivalent version of the original IP address. At the very least be sure not to return a longer but equivalent IP address.

@normanj-bitquill
Copy link
Contributor

Another thing to consider is IPv4 address in IPv6 representation.
https://en.wikipedia.org/wiki/Reserved_IP_addresses#IPv6

This could mean that some IPv6 addresses are equal to some IPv4 addresses, such as:

::ffff:c0a8:2:3 = 192.168.2.3

@currantw
Copy link
Contributor Author

Thanks @normanj-bitquill for the comments and suggestions. I think it makes sense to try (to the extent possible) to model the SQL plugin's IP address data type behaviour on the existing core OpenSearch behaviour.

  • It does not supports any leading zeros for IPv4 addresses (e.g. 101.0.0.02), and simply fails. It should be relatively straight-forward to support this in the SQL plugin (and convert it to the canonical representation, e.g. 101.0.0.2), and it seems fine to me to be more permissive.
  • IPv4-mapped IPv6 addresses are automatically converted to IPv4 addresses (e.g. ::FFFF:1.2.3.4 is converted to 1.2.3.4).
  • Comparisons (including equality) are based on InetAddressPoint, which converts IPv4 addresses to IPv4-mapped IPv6 addresses, and does a bit-wise comparison. For example, this means that ::FFFF:1.2.3.4,1.2.3.4, and ::FFFF:0102:0304 are all equivalent.

@currantw
Copy link
Contributor Author

currantw commented Dec 2, 2024

UPDATE

IP Data Type Behaviour

As discussed above, the IP address data type in the SQL plugin is intended to mirror the core OpenSearch behaviour:

  • All IP addresses are transformed into their "canonical" form (e.g. strip leading zeros, compress continuous zeros in IPv6). IPv6-mapped IPv4 address are depicted as IPv4 addresses (e.g. ::FFFF:0102:0304 becomes 1.2.3.4).
  • More IP address forms are supported (via IPAddressString) than in core OpenSearch: leading zeroes are supported in IPv4 addresses (e.g. 101.0.0.02), and mixed notation is supported (e.g. ::FFFF:1.2.3.4).
  • Comparisons (and thus sorting) are based on IPv6 mapping for IPv4 addresses, which allows both IPv4 and IPv6 addresses to be sorted and compared for equality.

Casting to IP Address Data Type

  • After looking at the code, I decided to take a different approach than that suggested by @YANG-DB by building on the existing casting support for other data types. A string can be cast to an IP address using cast(value as ip). I think this achieves the same aim, and is more consistent with the existing syntax.
source=weblogs | where host_ip = cast("1.2.3.4" as ip) | fields host_ip

Outdated Sorting Syntax

During work on this issue, I discovered that PPL supports syntax for sorting numerically, lexicographically, or by IP address (e.g. sort num(field_name), sort str(field_name), and sort ip(field_name), respectively) -- but this syntax has no effect on the resulting sort order. This functionality appears to have been overtaken by the addition of data types. and sorting is always done according to the data type (i.e. numerical data types are sorted numerically, strings are sorted lexicographically), rather than as specified by str, num or ip. Moreover, this syntax is not mentioned in the supporting user documentation for either OpenSearch SQL or Spark, and there is no actual implementation of the functionality in either code bases - the specified "sort type" is simply ignored.

As such, I have removed this syntax from the SQL plugin as part of this issue (to avoid clashes with the ip data type name), and propose raising an equivalent code cleanup issue to remove it from Spark, in order to maintain consistency. Please let me know if you have any thoughts or comments.

Some Examples

Important note. Some queries are executed entirely within core OpenSearch, rather than within the SQL plugin. There are indicated with an asterisk (*) below, in order to verify that the behaviour is the same (e.g. sorting order) between core OpenSearch and the SQL plugin.

Input Data:
"::1"
"0.0.0.2"
"::3"
"::FFFF:1.2.3.4"
"1.2.3.4"
"::FFFF:1234"

> source=weblogs (*)
"::1"
"0.0.0.2"
"::3"
"1.2.3.4"
"1.2.3.4"
"::ffff:1234"

> source=weblogs | sort host_ip | fields host_ip (*)
"::1"
"::3"
"::ffff:1234"
"0.0.0.2"
"1.2.3.4"
"1.2.3.4"

> source=weblogs | fields host_ip | sort host_ip
"::1"
"::3"
"::ffff:1234"
"0.0.0.2"
"1.2.3.4"
"1.2.3.4"

> source=weblogs | where host_ip = '::ffff:0102:0304' | fields host_ip (*)
"1.2.3.4"
"1.2.3.4"

> source=weblogs | fields host_ip | where host_ip = '::ffff:0102:0304'
"1.2.3.4"
"1.2.3.4"

> source=weblogs | where host_ip < '::ffff:0102:0304' | fields host_ip (*)
"::1"
"0.0.0.2"
"::3"
"::ffff:1234"

> source=weblogs | fields host_ip | where host_ip < '::ffff:0102:0304'
"::1"
0.0.0.2"
"::3"
"::ffff:1234"

> source=weblogs | where '::ffff:0102:0304' > host_ip | fields host_ip (*)
Exception: 500 - Error occurred in OpenSearch engine: all shards failed

> source=weblogs | where cast('::ffff:0102:0304' as ip) > host_ip | fields host_ip
"::1"
0.0.0.2"
"::3"
"::ffff:1234"

@YANG-DB
Copy link
Member

YANG-DB commented Dec 2, 2024

UPDATE

IP Data Type Behaviour

As discussed above, the IP address data type in the SQL plugin is intended to mirror the core OpenSearch behaviour:

  • All IP addresses are transformed into their "canonical" form (e.g. strip leading zeros, compress continuous zeros in IPv6). IPv6-mapped IPv4 address are depicted as IPv4 addresses (e.g. ::FFFF:0102:0304 becomes 1.2.3.4).
  • More IP address forms are supported (via IPAddressString) than in core OpenSearch: leading zeroes are supported in IPv4 addresses (e.g. 101.0.0.02), and mixed notation is supported (e.g. ::FFFF:1.2.3.4).
  • Comparisons (and thus sorting) are based on IPv6 mapping for IPv4 addresses, which allows both IPv4 and IPv6 addresses to be sorted and compared for equality.

Casting to IP Address Data Type

  • After looking at the code, I decided to take a different approach than that suggested by @YANG-DB by building on the existing casting support for other data types. A string can be cast to an IP address using cast(value as ip). I think this achieves the same aim, and is more consistent with the existing syntax.
source=weblogs | where host_ip = cast("1.2.3.4" as ip) | fields host_ip

Outdated Sorting Syntax

During work on this issue, I discovered that PPL supports syntax for sorting numerically, lexicographically, or by IP address (e.g. sort num(field_name), sort str(field_name), and sort ip(field_name), respectively) -- but this syntax has no effect on the resulting sort order. This functionality appears to have been overtaken by the addition of data types. and sorting is always done according to the data type (i.e. numerical data types are sorted numerically, strings are sorted lexicographically), rather than as specified by str, num or ip. Moreover, this syntax is not mentioned in the supporting user documentation for either OpenSearch SQL or Spark, and there is no actual implementation of the functionality in either code bases - the specified "sort type" is simply ignored.

As such, I have removed this syntax from the SQL plugin as part of this issue (to avoid clashes with the ip data type name), and propose raising an equivalent code cleanup issue to remove it from Spark, in order to maintain consistency. Please let me know if you have any thoughts or comments.

Hi @currantw
Thanks for the update. Could u plz open a dedicated issue for the Outdated Sorting Syntax BUG including a detailed description of the expected behaviour and the actual one ?
thanks

@currantw
Copy link
Contributor Author

currantw commented Dec 2, 2024

Hi @currantw Thanks for the update. Could u plz open a dedicated issue for the Outdated Sorting Syntax BUG including a detailed description of the expected behaviour and the actual one ? thanks

Issues raised:

As mentioned, I will resolve #3180 as part of the same PR as this issue, and address the Spark one separately.

YANG-DB pushed a commit that referenced this issue Dec 19, 2024
* Add `ExprIpValue` and `IP` data type

Signed-off-by: currantw <[email protected]>

* Add support for casting (`cast(field_name to ip)`) and remove existing unused sorting syntax.

Signed-off-by: currantw <[email protected]>

* Update comparison logic to compare in IPv6

Signed-off-by: currantw <[email protected]>

* Fix bug casting to IP

Signed-off-by: currantw <[email protected]>

* Fix failing tests

Signed-off-by: currantw <[email protected]>

* Assert that comparison only valid if same type, update tests accordingly

Signed-off-by: currantw <[email protected]>

* Add additional tests to increase code coverage

Signed-off-by: currantw <[email protected]>

* Integrate `cidrmatch` changes

Signed-off-by: currantw <[email protected]>

* Remove `OpenSearchIPType` data type

Signed-off-by: currantw <[email protected]>

* Fix more failing tests

Signed-off-by: currantw <[email protected]>

* Minor cleanup

Signed-off-by: currantw <[email protected]>

* Add new tests for IP data type to `SortCommandIT`, and update `weblogs` test data.

Signed-off-by: currantw <[email protected]>

* Fixing IT test failure.

Signed-off-by: currantw <[email protected]>

* Spotless and update test to sort in SQL

Signed-off-by: currantw <[email protected]>

* Fix broken link

Signed-off-by: currantw <[email protected]>

* Fix failing code coverage

Signed-off-by: currantw <[email protected]>

* Fix failing doctest

Signed-off-by: currantw <[email protected]>

* Fix failing `ip.rst` doctest

Signed-off-by: currantw <[email protected]>

* Fix test failure due to merge.

Signed-off-by: currantw <[email protected]>

* Fix spotless

Signed-off-by: currantw <[email protected]>

* Add missing `url` field

Signed-off-by: currantw <[email protected]>

* Address minor review comments.

Signed-off-by: currantw <[email protected]>

* Revert sort syntax changes

Signed-off-by: currantw <[email protected]>

* Minor doc update

Signed-off-by: currantw <[email protected]>

* FIx failing `ip.rst` doctest

Signed-off-by: currantw <[email protected]>

* Add `IPComparisonIT` tests for comparison operators, rename modules and weblogs test index to make plural for consistency.

Signed-off-by: currantw <[email protected]>

---------

Signed-off-by: currantw <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants