Streaming Data Source

Streaming Data Source is a "continuous" stream of data that is described using the Source Contract.

A Source can generate a streaming DataFrame (aka batch) with the data between given start and end offsets.

For fault tolerance, a Source must be able to replay data given a start offset.

A Source should be able to replay an arbitrary sequence of past data in a stream using a range of offsets. Streaming sources like Apache Kafka and Amazon Kinesis (with their per-record offsets) fit this model nicely. This assumption is what allows Structured Streaming to provide end-to-end exactly-once guarantees.
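For example, the Kafka source tracks per-partition offsets, so a query can be told exactly where in the stream to start reading, which is what makes replay possible. The snippet below is illustrative only; the bootstrap servers, topic name and offsets are placeholders (and a SparkSession named spark is assumed):

// Illustrative only: servers, topic and offsets are placeholders
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", """{"events":{"0":42}}""")  // replay partition 0 from offset 42
  .load()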

Table 1. Sources

Format                                                         Source
Any FileFormat (csv, hive, json, libsvm, orc, parquet, text)   FileStreamSource
kafka                                                          KafkaSource
memory                                                         MemoryStream
rate                                                           RateStreamSource
socket                                                         TextSocketSource
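The format name used with spark.readStream selects the Source implementation from the table above. For instance, the following snippet (assuming a SparkSession named spark) uses the rate format, which is backed by RateStreamSource:

// "rate" resolves to RateStreamSource, which generates (timestamp, value) rows
// at the configured rate
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()
rates.printSchema()  // timestamp: timestamp, value: long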

Source Contract

package org.apache.spark.sql.execution.streaming

trait Source {
  // Informs the source that data up to `end` has been processed and can be discarded
  def commit(end: Offset): Unit = {}
  // Returns the data between the start (exclusive, None for the very first batch) and end (inclusive) offsets
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  // Returns the latest offset available, or None when the source has no data yet
  def getOffset: Option[Offset]
  // Schema of the data produced by this source
  def schema: StructType
  // Stops the source and frees its resources
  def stop(): Unit
}
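To illustrate the contract, here is a minimal, hypothetical Source that produces consecutive numbers and can replay any offset range. It is a sketch only (the name CounterSource and its behaviour are made up) and omits details a real source needs, e.g. handling SerializedOffset when a query is restarted from a checkpoint.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A hypothetical in-memory source that emits the numbers 0, 1, 2, ... (sketch only)
class CounterSource(sqlContext: SQLContext) extends Source {
  private var latest = 0L

  override def schema: StructType = StructType(StructField("value", LongType) :: Nil)

  // Pretend 10 new rows have arrived since the last call and report the new latest offset
  override def getOffset: Option[Offset] = {
    latest += 10
    Some(LongOffset(latest))
  }

  // Replay the numbers in the (start, end] range; start is None for the very first batch
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.collect { case LongOffset(l) => l }.getOrElse(0L)
    val to = end.asInstanceOf[LongOffset].offset
    import sqlContext.implicits._
    sqlContext.sparkContext.range(from, to).toDF("value")
  }

  override def commit(end: Offset): Unit = ()  // nothing to clean up in this sketch

  override def stop(): Unit = ()
}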
Table 2. Source Contract
Method Description

getBatch

Generates a DataFrame (with new rows) for a batch between the given start (optional, exclusive) and end (inclusive) offsets.

Used when StreamExecution runs a batch and populates start offsets (populateStartOffsets).

getOffset

Finds the latest offset available (if any)

Note
Offset is…​FIXME

Used exclusively when StreamExecution runs streaming batches (to construct the next streaming micro-batch for every streaming data source in a streaming Dataset)

schema

Schema of the data from this source

Used when: