[SPARK-50714][SQL][SS] Enable schema evolution for TransformWithState when Avro encoding is used #49277
base: master
Conversation
    readerSchema: Schema,
    valueProj: UnsafeProjection): UnsafeRow = {
  if (valueBytes != null) {
    val reader = new GenericDatumReader[Any](writerSchema, readerSchema)

Review comment: Let's add some comments here describing the args.
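For context, a minimal self-contained sketch of what the two-schema constructor above does (the record and field names here are made up for illustration): Avro decodes bytes written with the writer schema and resolves them against the reader schema, filling any newly added field from its default.

import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroEvolutionSketch {
  // Old schema the row was written with.
  val writerSchema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Value","fields":[{"name":"count","type":"long"}]}""")

  // New schema adds "newCol" with a default, so old bytes remain readable.
  val readerSchema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Value","fields":[
      |  {"name":"count","type":"long"},
      |  {"name":"newCol","type":["null","string"],"default":null}]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Encode a record with the old (writer) schema.
    val record = new GenericData.Record(writerSchema)
    record.put("count", 42L)
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()

    // Decode with both schemas: Avro fills "newCol" from its default.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
    println(reader.read(null, decoder)) // {"count": 42, "newCol": null}
  }
}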
dataType match {
  // Basic types
  case BooleanType => false

Review comment: Are these Avro defaults too?
  // Complex types
  case ArrayType(elementType, _) =>
    val defaultArray = new java.util.ArrayList[Any]()

Review comment: Why not have empty collections, i.e. an empty array/map etc.?
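For illustration, a minimal sketch (field names assumed, not the PR's code) of how such per-type defaults, including empty collections, would be attached to newly added fields in an evolved Avro schema so that older rows still decode:

import org.apache.avro.SchemaBuilder

object DefaultValueSketch {
  // Evolved value schema: "count" existed before; "flag" and "tags" are newly added
  // fields carrying defaults (false and an empty list) so old rows still decode.
  val evolved = SchemaBuilder.record("Value").fields()
    .requiredLong("count")
    .name("flag").`type`().booleanType().booleanDefault(false)
    .name("tags").`type`().array().items().stringType()
      .arrayDefault(java.util.Collections.emptyList[String]())
    .endRecord()

  def main(args: Array[String]): Unit = println(evolved.toString(true))
}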
assert(timeMode == TimeMode.EventTime.toString || timeMode == TimeMode.ProcessingTime.toString)
if (timeMode == TimeMode.EventTime.toString) {
val primaryIndex = if (timeMode == TimeMode.EventTime.toString) {

Review comment: nit: I guess this could also be split into a separate function?

Reply: Oh, I was trying to address this comment: #49277 (comment)
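As a rough illustration of the reviewer's suggestion above (the helper name and its parameters are hypothetical, not the PR's actual code), the time-mode branch could move into a small helper so the call site becomes a single expression; eventTimeIndex and procTimeIndex stand in for whatever the two branches of the original if compute.

import org.apache.spark.sql.streaming.TimeMode

object TimerIndexSketch {
  def primaryIndexFor(
      timeMode: String,
      eventTimeIndex: => String,
      procTimeIndex: => String): String = {
    // Only event-time and processing-time timers are supported here.
    assert(timeMode == TimeMode.EventTime().toString ||
      timeMode == TimeMode.ProcessingTime().toString)
    if (timeMode == TimeMode.EventTime().toString) eventTimeIndex else procTimeIndex
  }
}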
@@ -285,13 +285,13 @@ class TransformWithValueStateTTLSuite extends TransformWithStateTTLTest {
      .add("expiryTimestampMs", LongType, nullable = false)
    val schemaForValueRow: StructType = StructType(Array(StructField("__dummy__", NullType)))
    val schema0 = StateStoreColFamilySchema(
      TimerStateUtils.getTimerStateVarName(TimeMode.ProcessingTime().toString),
      schemaForKeyRow,
      TimerStateUtils.getTimerStateVarNames(TimeMode.ProcessingTime().toString)._1, 0,

Review comment: Let's also add at least a couple of tests for other composite types such as list, map, etc.

Review comment: @ericm-db - can you also format the PR description and explain in more detail what functionality this PR adds? Thanks. Also, this is a user-facing change, right?
val result2 = inputData.toDS()
  .groupByKey(x => x)
  .transformWithState(new RunningCountStatefulProcessorNestedLongs(),

Review comment: Could we also add a test case where the newly added column is also part of the output? That way we can ensure that the default values are being picked up correctly. Probably also similar for dropped columns?
What changes were proposed in this pull request?
This PR introduces stateful schema evolution for the TransformWithState operator when Avro is used.
We modified StateStoreColumnFamilySchema and the StateSchemaV3 file to keep track of key and value schema IDs in order to support versioning and lookup of schemas across query restarts.
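A simplified sketch of that idea (the case class and field names are illustrative, not the PR's actual definitions): each column family's key and value schemas carry a numeric id, so a row written under an older schema can still be matched to the exact schema it was encoded with.

import org.apache.spark.sql.types.StructType

// Hypothetical schema entry: on a schema change across restarts, the value schema id would be
// bumped while older (id -> schema) mappings are retained so existing rows remain readable.
case class ColumnFamilySchemaEntrySketch(
    colFamilyName: String,
    keySchemaId: Short,
    keySchema: StructType,
    valueSchemaId: Short,
    valueSchema: StructType)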
The AvroStateEncoder now takes a StateSchemaProvider, which lets it look up all of the active schemas in the StateStore for a given column family and pass the appropriate reader and writer schemas to the AvroEncoder class.
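A hedged sketch of that lookup pattern (the trait and method names below are illustrative rather than the PR's exact API): the decoder resolves the writer schema from the schema id stored with the row and the reader schema from the currently active schema, then lets Avro reconcile the two.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

trait SchemaLookupSketch {
  def activeSchema(colFamily: String): Schema            // schema the query writes with now
  def schemaForId(colFamily: String, id: Short): Schema  // schema a given row was written with
}

class AvroValueDecoderSketch(lookup: SchemaLookupSketch) {
  def decode(colFamily: String, writtenWithId: Short, bytes: Array[Byte]): GenericRecord = {
    val writerSchema = lookup.schemaForId(colFamily, writtenWithId)
    val readerSchema = lookup.activeSchema(colFamily)
    // Avro reconciles added/dropped fields as long as the two schemas are compatible.
    val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
    reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
  }
}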
We have also made changes so that the StateDataSource can read rows that were written with a schema ID.
Why are the changes needed?
Schema evolution is a critical feature for stateful stream processing applications that need to handle changing data schemas over time.
Does this PR introduce any user-facing change?
Yes - this change allows stateful schema evolution that was not previously possible for the TransformWithState operator.
How was this patch tested?
Unit and integration tests in RocksDBStateStoreSuite and TransformWithStateSuite.
Was this patch authored or co-authored using generative AI tooling?