A Streaming Data Source is a "continuous" stream of data and is described using the Source Contract.

A Source can generate a streaming DataFrame (aka a batch) given start and end offsets. For fault tolerance, a Source must be able to replay data given a start offset, i.e. replay an arbitrary sequence of past data in the stream using a range of offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely, and this assumption is what allows Structured Streaming to achieve end-to-end exactly-once guarantees.
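As an illustration, here is a minimal sketch of reading from a replayable source (Kafka), where per-record offsets let the engine re-request any past range after a failure. The broker address and topic name are placeholder assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("replayable-source-demo")
  .getOrCreate()

// Kafka keeps per-record offsets, so the engine can replay any range
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "events")                       // assumed topic
  .option("startingOffsets", "earliest")               // replay from the start
  .load()
```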
Format | Source
---|---
Any `FileFormat` (e.g. `csv`, `json`, `orc`, `parquet`, `text`) | `FileStreamSource`
`kafka` | `KafkaSource`
`memory` | `MemoryStream`
`rate` | `RateStreamSource`
`socket` | `TextSocketSource`
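Each format string selects a Source implementation at runtime. For example, a minimal sketch using the built-in rate source (which generates rows at a fixed rate and is handy for experiments):

```scala
// format("rate") resolves to RateStreamSource under the covers
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

rates.printSchema() // timestamp: timestamp, value: long
```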
```scala
package org.apache.spark.sql.execution.streaming

trait Source {
  def commit(end: Offset): Unit = {}
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  def getOffset: Option[Offset]
  def schema: StructType
  def stop(): Unit
}
```
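To make the contract concrete, here is a simplified sketch (not Spark's actual StreamExecution code) of how a micro-batch engine could drive a Source. The method `runOneBatch` and its parameters are illustrative only:

```scala
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Hypothetical driver step for one trigger, for illustration only
def runOneBatch(source: Source, lastCommitted: Option[Offset]): Option[Offset] = {
  source.getOffset match {
    case Some(latest) if !lastCommitted.contains(latest) =>
      // Ask the source for all data in the (lastCommitted, latest] range
      val batch = source.getBatch(lastCommitted, latest)
      // ... process `batch`, write it to the sink, checkpoint `latest` ...
      source.commit(latest) // the source may now discard data up to `latest`
      Some(latest)
    case _ =>
      lastCommitted // no new data; nothing to do in this trigger
  }
}
```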
Method | Description
---|---
`getBatch` | Generates a streaming `DataFrame` with the data between the start and end offsets. Used when the stream execution engine runs a streaming batch.
`getOffset` | Finds the latest offset available in the source (`None` if no data has arrived yet). Used exclusively when the stream execution engine constructs the next streaming batch.
`schema` | Schema of the data from this source. Used when the source is resolved as part of a streaming query plan.
`commit` | Informs the source that all data up to and including the `end` offset has been processed and will not be requested again.
`stop` | Stops the source and frees any resources it has allocated.
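As a usage sketch, `MemoryStream` is a `Source` implementation that is convenient for experimenting with the contract from a Spark shell or a test. The console sink and local master below are assumptions for the demo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("memory-stream-demo")
  .getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

// MemoryStream implements Source: addData advances the latest offset,
// and the engine fetches the new rows via getBatch
val events = MemoryStream[Int]
events.addData(1, 2, 3)

val query = events.toDF().writeStream
  .format("console")
  .start()
query.processAllAvailable() // drain everything added so far
query.stop()
```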