diff --git a/search/search_index.json b/search/search_index.json index fff93ab792..bcc70f46f0 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Numaflow \u00b6 Welcome to Numaflow! A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, writing boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, offering unparalleled flexibility in using the best programming language for each step and ease of using the languages you are most familiar with. Use Cases \u00b6 Event driven applications: Process events as they happen, e.g., updating inventory and sending customer notifications in e-commerce. Real time analytics: Analyze data instantly, e.g., social media analytics, observability data processing. Inference on streaming data: Perform real-time predictions, e.g., anomaly detection. Workflows running in a streaming manner. Learn more in our User Guide . Key Features \u00b6 Kubernetes-native: If you know Kubernetes, you already know how to use Numaflow. Serverless: Focus on your code and let the system scale up and down based on demand. Language agnostic: Use your favorite programming language. Exactly-Once semantics: No input element is duplicated or lost even as pods are rescheduled or restarted. Auto-scaling with back-pressure: Each vertex automatically scales from zero to whatever is needed. 
Data Integrity Guarantees \u00b6 Minimally provide at-least-once semantics Provide exactly-once semantics for unbounded and near real-time data sources Preserving order is not required Roadmap \u00b6 Map Streaming (1.3) Demo \u00b6 Getting Started \u00b6 For set-up information and running your first Numaflow pipeline, please see our getting started guide .","title":"Home"},{"location":"#numaflow","text":"Welcome to Numaflow! A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, writing boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, offering unparalleled flexibility in using the best programming language for each step and ease of using the languages you are most familiar with.","title":"Numaflow"},{"location":"#use-cases","text":"Event driven applications: Process events as they happen, e.g., updating inventory and sending customer notifications in e-commerce. Real time analytics: Analyze data instantly, e.g., social media analytics, observability data processing. Inference on streaming data: Perform real-time predictions, e.g., anomaly detection. Workflows running in a streaming manner. Learn more in our User Guide .","title":"Use Cases"},{"location":"#key-features","text":"Kubernetes-native: If you know Kubernetes, you already know how to use Numaflow. Serverless: Focus on your code and let the system scale up and down based on demand. Language agnostic: Use your favorite programming language. Exactly-Once semantics: No input element is duplicated or lost even as pods are rescheduled or restarted. 
Auto-scaling with back-pressure: Each vertex automatically scales from zero to whatever is needed.","title":"Key Features"},{"location":"#data-integrity-guarantees","text":"Minimally provide at-least-once semantics Provide exactly-once semantics for unbounded and near real-time data sources Preserving order is not required","title":"Data Integrity Guarantees"},{"location":"#roadmap","text":"Map Streaming (1.3)","title":"Roadmap"},{"location":"#demo","text":"","title":"Demo"},{"location":"#getting-started","text":"For set-up information and running your first Numaflow pipeline, please see our getting started guide .","title":"Getting Started"},{"location":"APIs/","text":"Packages: numaflow.numaproj.io/v1alpha1 numaflow.numaproj.io/v1alpha1 Resource Types: AbstractPodTemplate ( Appears on: AbstractVertex , DaemonTemplate , JetStreamBufferService , JobTemplate , NativeRedis , SideInputsManagerTemplate , VertexTemplate ) AbstractPodTemplate provides a template for pod customization in vertices, daemon deployments and so on. Field Description metadata Metadata (Optional) Metadata sets the pods\u2019s metadata, i.e. annotations and labels nodeSelector map\\[string\\]string (Optional) NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node\u2019s labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ tolerations \\[\\]Kubernetes core/v1.Toleration (Optional) If specified, the pod\u2019s tolerations. securityContext Kubernetes core/v1.PodSecurityContext (Optional) SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. imagePullSecrets \\[\\]Kubernetes core/v1.LocalObjectReference (Optional) ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. 
If specified, these secrets will be passed to individual puller implementations for them to use. For example, in the case of docker, only DockerConfig type secrets are honored. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod priorityClassName string (Optional) If specified, indicates the Redis pod\u2019s priority. \u201csystem-node-critical\u201d and \u201csystem-cluster-critical\u201d are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default. More info: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ priority int32 (Optional) The priority value. Various system components use this field to find the priority of the Redis pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority. More info: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ affinity Kubernetes core/v1.Affinity (Optional) The pod\u2019s scheduling constraints More info: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ serviceAccountName string (Optional) ServiceAccountName applied to the pod runtimeClassName string (Optional) RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the \u201clegacy\u201d RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. 
More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class automountServiceAccountToken bool (Optional) AutomountServiceAccountToken indicates whether a service account token should be automatically mounted. dnsPolicy Kubernetes core/v1.DNSPolicy (Optional) Set DNS policy for the pod. Defaults to \u201cClusterFirst\u201d. Valid values are \u2018ClusterFirstWithHostNet\u2019, \u2018ClusterFirst\u2019, \u2018Default\u2019 or \u2018None\u2019. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to \u2018ClusterFirstWithHostNet\u2019. dnsConfig Kubernetes core/v1.PodDNSConfig (Optional) Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy. AbstractSink ( Appears on: Sink ) Field Description log Log (Optional) Log sink is used to write the data to the log. kafka KafkaSink (Optional) Kafka sink is used to write the data to the Kafka. blackhole Blackhole (Optional) Blackhole sink is used to write the data to the blackhole sink, which is a sink that discards all the data written to it. udsink UDSink (Optional) UDSink sink is used to write the data to the user-defined sink. AbstractVertex ( Appears on: PipelineSpec , VertexSpec ) Field Description name string source Source (Optional) sink Sink (Optional) udf UDF (Optional) containerTemplate ContainerTemplate (Optional) Container template for the main numa container. initContainerTemplate ContainerTemplate (Optional) Container template for all the vertex pod init containers spawned by numaflow, excluding the ones specified by the user. AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) 
(Optional) volumes \\[\\]Kubernetes core/v1.Volume (Optional) limits VertexLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, will override pipeline level settings scale Scale (Optional) Settings for autoscaling initContainers \\[\\]Kubernetes core/v1.Container (Optional) List of customized init containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ sidecars \\[\\]Kubernetes core/v1.Container (Optional) List of customized sidecar containers belonging to the pod. partitions int32 (Optional) Number of partitions of the vertex owned buffers. It applies to udf and sink vertices only. sideInputs \\[\\]string (Optional) Names of the side inputs used in this vertex. sideInputsContainerTemplate ContainerTemplate (Optional) Container template for the side inputs watcher container. Authorization ( Appears on: HTTPSource , ServingSource ) Field Description token Kubernetes core/v1.SecretKeySelector (Optional) A secret selector which contains bearer token To use this, the client needs to add \u201cAuthorization: Bearer \u201d in the header BasicAuth ( Appears on: NatsAuth ) BasicAuth represents the basic authentication approach which contains a user name and a password. Field Description user Kubernetes core/v1.SecretKeySelector (Optional) Secret for auth user password Kubernetes core/v1.SecretKeySelector (Optional) Secret for auth password Blackhole ( Appears on: AbstractSink ) Blackhole is a sink to emulate /dev/null BufferFullWritingStrategy ( string alias) ( Appears on: Edge ) BufferServiceConfig ( Appears on: InterStepBufferServiceStatus ) Field Description redis RedisConfig jetstream JetStreamConfig CombinedEdge ( Appears on: VertexSpec ) CombinedEdge is a combination of Edge and some other properties such as vertex type, partitions, limits. 
It\u2019s used to decorate the fromEdges and toEdges of the generated Vertex objects, so that in the vertex pod, it knows the properties of the connected vertices, for example, how many partitioned buffers I should write to, what is the write buffer length, etc. Field Description Edge Edge (Members of Edge are embedded into this type.) fromVertexType VertexType From vertex type. fromVertexPartitionCount int32 (Optional) The number of partitions of the from vertex, if not provided, the default value is set to \u201c1\u201d. fromVertexLimits VertexLimits (Optional) toVertexType VertexType To vertex type. toVertexPartitionCount int32 (Optional) The number of partitions of the to vertex, if not provided, the default value is set to \u201c1\u201d. toVertexLimits VertexLimits (Optional) ConditionType ( string alias) ConditionType is a valid value of Condition.Type Container ( Appears on: SideInput , UDF , UDSink , UDSource , UDTransformer ) Container is used to define the container properties for user-defined functions, sinks, etc. 
Field Description image string (Optional) command \\[\\]string (Optional) args \\[\\]string (Optional) env \\[\\]Kubernetes core/v1.EnvVar (Optional) envFrom \\[\\]Kubernetes core/v1.EnvFromSource (Optional) volumeMounts \\[\\]Kubernetes core/v1.VolumeMount (Optional) resources Kubernetes core/v1.ResourceRequirements (Optional) securityContext Kubernetes core/v1.SecurityContext (Optional) imagePullPolicy Kubernetes core/v1.PullPolicy (Optional) ContainerTemplate ( Appears on: AbstractVertex , DaemonTemplate , JetStreamBufferService , JobTemplate , NativeRedis , SideInputsManagerTemplate , VertexTemplate ) ContainerTemplate defines customized spec for a container Field Description resources Kubernetes core/v1.ResourceRequirements (Optional) imagePullPolicy Kubernetes core/v1.PullPolicy (Optional) securityContext Kubernetes core/v1.SecurityContext (Optional) env \\[\\]Kubernetes core/v1.EnvVar (Optional) envFrom \\[\\]Kubernetes core/v1.EnvFromSource (Optional) DaemonTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) replicas int32 (Optional) Replicas is the number of desired replicas of the Deployment. This is a pointer to distinguish between explicit zero and unspecified. Defaults to 1. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#what-is-a-replicationcontroller containerTemplate ContainerTemplate (Optional) initContainerTemplate ContainerTemplate (Optional) Edge ( Appears on: CombinedEdge , PipelineSpec ) Field Description from string to string conditions ForwardConditions (Optional) Conditional forwarding, only allowed when \u201cFrom\u201d is a Source or UDF. onFull BufferFullWritingStrategy (Optional) OnFull specifies the behaviour for the write actions when the inter step buffer is full. There are currently two options, retryUntilSuccess and discardLatest. 
if not provided, the default value is set to \u201cretryUntilSuccess\u201d FixedWindow ( Appears on: Window ) FixedWindow describes a fixed window Field Description length Kubernetes meta/v1.Duration Length is the duration of the fixed window. streaming bool (Optional) Streaming should be set to true if the reduce udf is streaming. ForwardConditions ( Appears on: Edge ) Field Description tags TagConditions Tags used to specify tags for conditional forwarding Function ( Appears on: UDF ) Field Description name string args \\[\\]string (Optional) kwargs map\\[string\\]string (Optional) GSSAPI ( Appears on: SASL ) GSSAPI represents a SASL GSSAPI config Field Description serviceName string realm string usernameSecret Kubernetes core/v1.SecretKeySelector UsernameSecret refers to the secret that contains the username authType KRB5AuthType valid inputs - KRB5_USER_AUTH, KRB5_KEYTAB_AUTH passwordSecret Kubernetes core/v1.SecretKeySelector (Optional) PasswordSecret refers to the secret that contains the password keytabSecret Kubernetes core/v1.SecretKeySelector (Optional) KeytabSecret refers to the secret that contains the keytab kerberosConfigSecret Kubernetes core/v1.SecretKeySelector (Optional) KerberosConfigSecret refers to the secret that contains the kerberos config GeneratorSource ( Appears on: Source ) Field Description rpu int64 (Optional) duration Kubernetes meta/v1.Duration (Optional) msgSize int32 (Optional) Size of each generated message keyCount int32 (Optional) KeyCount is the number of unique keys in the payload value uint64 (Optional) Value is an optional uint64 value to be written in to the payload jitter Kubernetes meta/v1.Duration (Optional) Jitter is the jitter for the message generation, used to simulate out of order messages for example if the jitter is 10s, then the message\u2019s event time will be delayed by a random time between 0 and 10s which will result in the message being out of order by 0 to 10s valueBlob string (Optional) ValueBlob is an 
optional string which is the base64 encoding of direct payload to send. This is useful for attaching a GeneratorSource to a true pipeline to test load behavior with true messages without requiring additional work to generate messages through the external source if present, the Value and MsgSize fields will be ignored. GetDaemonDeploymentReq Field Description ISBSvcType ISBSvcType Image string PullPolicy Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar DefaultResources Kubernetes core/v1.ResourceRequirements GetJetStreamServiceSpecReq Field Description Labels map\\[string\\]string ClusterPort int32 ClientPort int32 MonitorPort int32 MetricsPort int32 GetJetStreamStatefulSetSpecReq Field Description ServiceName string Labels map\\[string\\]string NatsImage string MetricsExporterImage string ConfigReloaderImage string ClusterPort int32 ClientPort int32 MonitorPort int32 MetricsPort int32 ServerAuthSecretName string ServerEncryptionSecretName string ConfigMapName string PvcNameIfNeeded string StartCommand string DefaultResources Kubernetes core/v1.ResourceRequirements GetRedisServiceSpecReq Field Description Labels map\\[string\\]string RedisContainerPort int32 SentinelContainerPort int32 GetRedisStatefulSetSpecReq Field Description ServiceName string Labels map\\[string\\]string RedisImage string SentinelImage string MetricsExporterImage string InitContainerImage string RedisContainerPort int32 SentinelContainerPort int32 RedisMetricsContainerPort int32 CredentialSecretName string TLSEnabled bool PvcNameIfNeeded string ConfConfigMapName string ScriptsConfigMapName string HealthConfigMapName string DefaultResources Kubernetes core/v1.ResourceRequirements GetSideInputDeploymentReq Field Description ISBSvcType ISBSvcType Image string PullPolicy Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar DefaultResources Kubernetes core/v1.ResourceRequirements GetVertexPodSpecReq Field Description ISBSvcType ISBSvcType Image string PullPolicy 
Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar SideInputsStoreName string ServingSourceStreamName string PipelineSpec PipelineSpec DefaultResources Kubernetes core/v1.ResourceRequirements GroupBy ( Appears on: UDF ) GroupBy indicates it is a reducer UDF Field Description window Window Window describes the windowing strategy. keyed bool (Optional) allowedLateness Kubernetes meta/v1.Duration (Optional) AllowedLateness allows late data to be included for the Reduce operation as long as the late data is not later than (Watermark - AllowedLateness). storage PBQStorage Storage is used to define the PBQ storage for a reduce vertex. HTTPSource ( Appears on: Source ) Field Description auth Authorization (Optional) service bool (Optional) Whether to create a ClusterIP Service ISBSvcPhase ( string alias) ( Appears on: InterStepBufferServiceStatus ) ISBSvcType ( string alias) ( Appears on: GetDaemonDeploymentReq , GetSideInputDeploymentReq , GetVertexPodSpecReq , InterStepBufferServiceStatus ) IdleSource ( Appears on: Watermark ) Field Description threshold Kubernetes meta/v1.Duration Threshold is the duration after which a source is marked as Idle due to lack of data. Ex: If watermark found to be idle after the Threshold duration then the watermark is progressed by IncrementBy . stepInterval Kubernetes meta/v1.Duration (Optional) StepInterval is the duration between the subsequent increment of the watermark as long the source remains Idle. The default value is 0s which means that once we detect idle source, we will be incrementing the watermark by IncrementBy for time we detect that we source is empty (in other words, this will be a very frequent update). incrementBy Kubernetes meta/v1.Duration IncrementBy is the duration to be added to the current watermark to progress the watermark when source is idling. 
InterStepBufferService Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. spec InterStepBufferServiceSpec redis RedisBufferService jetstream JetStreamBufferService status InterStepBufferServiceStatus (Optional) InterStepBufferServiceSpec ( Appears on: InterStepBufferService ) Field Description redis RedisBufferService jetstream JetStreamBufferService InterStepBufferServiceStatus ( Appears on: InterStepBufferService ) Field Description Status Status (Members of Status are embedded into this type.) phase ISBSvcPhase message string config BufferServiceConfig type ISBSvcType observedGeneration int64 ObservedGeneration stores the generation value observed by the controller. JetStreamBufferService ( Appears on: InterStepBufferServiceSpec ) Field Description version string JetStream version, such as \u201c2.7.1\u201d replicas int32 JetStream StatefulSet size containerTemplate ContainerTemplate (Optional) ContainerTemplate contains customized spec for NATS container reloaderContainerTemplate ContainerTemplate (Optional) ReloaderContainerTemplate contains customized spec for config reloader container metricsContainerTemplate ContainerTemplate (Optional) MetricsContainerTemplate contains customized spec for metrics container persistence PersistenceStrategy (Optional) AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) settings string (Optional) Nats/JetStream configuration, if not specified, global settings in numaflow-controller-config will be used. See https://docs.nats.io/running-a-nats-service/configuration#limits and https://docs.nats.io/running-a-nats-service/configuration#jetstream . For limits, only \u201cmax_payload\u201d is supported for configuration, defaults to 1048576 (1MB), not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). 
For jetstream, only \u201cmax_memory_store\u201d and \u201cmax_file_store\u201d are supported for configuration, do not set \u201cstore_dir\u201d as it has been hardcoded. startArgs \\[\\]string (Optional) Optional arguments to start nats-server. For example, \u201c-D\u201d to enable debugging output, \u201c-DV\u201d to enable debugging and tracing. Check https://docs.nats.io/ for all the available arguments. bufferConfig string (Optional) Optional configuration for the streams, consumers and buckets to be created in this JetStream service, if specified, it will be merged with the default configuration in numaflow-controller-config. It accepts a YAML format configuration, it may include 4 sections, \u201cstream\u201d, \u201cconsumer\u201d, \u201cotBucket\u201d and \u201cprocBucket\u201d. Available fields under \u201cstream\u201d include \u201cretention\u201d (e.g. interest, limits, workerQueue), \u201cmaxMsgs\u201d, \u201cmaxAge\u201d (e.g. 72h), \u201creplicas\u201d (1, 3, 5), \u201cduplicates\u201d (e.g. 5m). Available fields under \u201cconsumer\u201d include \u201cackWait\u201d (e.g. 60s) Available fields under \u201cotBucket\u201d include \u201cmaxValueSize\u201d, \u201chistory\u201d, \u201cttl\u201d (e.g. 72h), \u201cmaxBytes\u201d, \u201creplicas\u201d (1, 3, 5). Available fields under \u201cprocBucket\u201d include \u201cmaxValueSize\u201d, \u201chistory\u201d, \u201cttl\u201d (e.g. 72h), \u201cmaxBytes\u201d, \u201creplicas\u201d (1, 3, 5). encryption bool (Optional) Whether encrypt the data at rest, defaults to false Enabling encryption might impact the performance, see https://docs.nats.io/running-a-nats-service/nats_admin/jetstream_admin/encryption_at_rest for the detail Toggling the value will impact encrypting/decrypting existing messages. 
tls bool (Optional) Whether enable TLS, defaults to false Enabling TLS might impact the performance JetStreamConfig ( Appears on: BufferServiceConfig ) Field Description url string JetStream (NATS) URL auth NatsAuth streamConfig string (Optional) tlsEnabled bool TLS enabled or not JetStreamSource ( Appears on: Source ) Field Description url string URL to connect to NATS cluster, multiple urls could be separated by comma. stream string Stream represents the name of the stream. tls TLS (Optional) TLS configuration for the nats client. auth NatsAuth (Optional) Auth information JobTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) containerTemplate ContainerTemplate (Optional) ttlSecondsAfterFinished int32 (Optional) ttlSecondsAfterFinished limits the lifetime of a Job that has finished execution (either Complete or Failed). If this field is set, ttlSecondsAfterFinished after the Job finishes, it is eligible to be automatically deleted. When the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will be honored. If this field is unset, the Job won\u2019t be automatically deleted. If this field is set to zero, the Job becomes eligible to be deleted immediately after it finishes. Numaflow defaults to 30 backoffLimit int32 (Optional) Specifies the number of retries before marking this job failed. More info: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy Numaflow defaults to 20 KRB5AuthType ( string alias) ( Appears on: GSSAPI ) KRB5AuthType describes the kerberos auth type KafkaSink ( Appears on: AbstractSink ) Field Description brokers \\[\\]string topic string tls TLS (Optional) TLS user to configure TLS connection for kafka broker TLS.enable=true default for TLS. config string (Optional) sasl SASL (Optional) SASL user to configure SASL connection for kafka broker SASL.enable=true default for SASL. 
KafkaSource ( Appears on: Source ) Field Description brokers \\[\\]string topic string consumerGroup string tls TLS (Optional) TLS user to configure TLS connection for kafka broker TLS.enable=true default for TLS. config string (Optional) sasl SASL (Optional) SASL user to configure SASL connection for kafka broker SASL.enable=true default for SASL. Lifecycle ( Appears on: PipelineSpec ) Field Description deleteGracePeriodSeconds int32 (Optional) DeleteGracePeriodSeconds used to delete pipeline gracefully desiredPhase PipelinePhase (Optional) DesiredPhase used to bring the pipeline from current phase to desired phase pauseGracePeriodSeconds int32 (Optional) PauseGracePeriodSeconds used to pause pipeline gracefully Log ( Appears on: AbstractSink ) LogicOperator ( string alias) ( Appears on: TagConditions ) Metadata ( Appears on: AbstractPodTemplate ) Field Description annotations map\\[string\\]string labels map\\[string\\]string NativeRedis ( Appears on: RedisBufferService ) Field Description version string Redis version, such as \u201c6.0.16\u201d replicas int32 Redis StatefulSet size redisContainerTemplate ContainerTemplate (Optional) RedisContainerTemplate contains customized spec for Redis container sentinelContainerTemplate ContainerTemplate (Optional) SentinelContainerTemplate contains customized spec for Redis container metricsContainerTemplate ContainerTemplate (Optional) MetricsContainerTemplate contains customized spec for metrics container initContainerTemplate ContainerTemplate (Optional) persistence PersistenceStrategy (Optional) AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) settings RedisSettings (Optional) Redis configuration, if not specified, global settings in numaflow-controller-config will be used. 
NatsAuth ( Appears on: JetStreamConfig , JetStreamSource , NatsSource ) NatsAuth defines how to authenticate the nats access Field Description basic BasicAuth (Optional) Basic auth which contains a username and a password token Kubernetes core/v1.SecretKeySelector (Optional) Token auth nkey Kubernetes core/v1.SecretKeySelector (Optional) NKey auth NatsSource ( Appears on: Source ) Field Description url string URL to connect to NATS cluster, multiple urls could be separated by comma. subject string Subject holds the name of the subject onto which messages are published. queue string Queue is used for queue subscription. tls TLS (Optional) TLS configuration for the nats client. auth NatsAuth (Optional) Auth information NoStore ( Appears on: PBQStorage ) NoStore means there will be no persistence storage and there will be data loss during pod restarts. Use this option only if you do not care about correctness (e.g., approx statistics pipeline like sampling rate, etc.). PBQStorage ( Appears on: GroupBy ) PBQStorage defines the persistence configuration for a vertex. Field Description persistentVolumeClaim PersistenceStrategy (Optional) emptyDir Kubernetes core/v1.EmptyDirVolumeSource (Optional) no_store NoStore (Optional) PersistenceStrategy ( Appears on: JetStreamBufferService , NativeRedis , PBQStorage ) PersistenceStrategy defines the strategy of persistence Field Description storageClassName string (Optional) Name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1 accessMode Kubernetes core/v1.PersistentVolumeAccessMode (Optional) Available access modes such as ReadWriteOnce, ReadWriteMany https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes volumeSize k8s.io/apimachinery/pkg/api/resource.Quantity Volume size, e.g. 50Gi Pipeline Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. 
spec PipelineSpec interStepBufferServiceName string (Optional) vertices \\[\\]AbstractVertex edges \\[\\]Edge Edges define the relationships between vertices lifecycle Lifecycle (Optional) Lifecycle define the Lifecycle properties limits PipelineLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, they could be overridden by each vertex\u2019s settings watermark Watermark (Optional) Watermark enables watermark progression across the entire pipeline. templates Templates (Optional) Templates are used to customize additional kubernetes resources required for the Pipeline sideInputs \\[\\]SideInput (Optional) SideInputs defines the Side Inputs of a pipeline. status PipelineStatus (Optional) PipelineLimits ( Appears on: PipelineSpec ) Field Description readBatchSize uint64 (Optional) Read batch size for all the vertices in the pipeline, can be overridden by the vertex\u2019s limit settings. bufferMaxLength uint64 (Optional) BufferMaxLength is used to define the max length of a buffer. Only applies to UDF and Source vertices as only they do buffer write. It can be overridden by the settings in vertex limits. bufferUsageLimit uint32 (Optional) BufferUsageLimit is used to define the percentage of the buffer usage limit, a valid value should be less than 100, for example, 85. Only applies to UDF and Source vertices as only they do buffer write. It will be overridden by the settings in vertex limits. 
readTimeout Kubernetes meta/v1.Duration (Optional) Read timeout for all the vertices in the pipeline, can be overridden by the vertex\u2019s limit settings PipelinePhase ( string alias) ( Appears on: Lifecycle , PipelineStatus ) PipelineSpec ( Appears on: GetVertexPodSpecReq , Pipeline ) Field Description interStepBufferServiceName string (Optional) vertices \\[\\]AbstractVertex edges \\[\\]Edge Edges define the relationships between vertices lifecycle Lifecycle (Optional) Lifecycle define the Lifecycle properties limits PipelineLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, they could be overridden by each vertex\u2019s settings watermark Watermark (Optional) Watermark enables watermark progression across the entire pipeline. templates Templates (Optional) Templates are used to customize additional kubernetes resources required for the Pipeline sideInputs \\[\\]SideInput (Optional) SideInputs defines the Side Inputs of a pipeline. PipelineStatus ( Appears on: Pipeline ) Field Description Status Status (Members of Status are embedded into this type.) phase PipelinePhase message string lastUpdated Kubernetes meta/v1.Time vertexCount uint32 sourceCount uint32 sinkCount uint32 udfCount uint32 observedGeneration int64 ObservedGeneration stores the generation value observed by the controller. 
RedisBufferService ( Appears on: InterStepBufferServiceSpec ) Field Description native NativeRedis Native brings up a native Redis service external RedisConfig External holds an External Redis config RedisConfig ( Appears on: BufferServiceConfig , RedisBufferService ) Field Description url string (Optional) Redis URL sentinelUrl string (Optional) Sentinel URL, will be ignored if Redis URL is provided masterName string (Optional) Only required when Sentinel is used user string (Optional) Redis user password Kubernetes core/v1.SecretKeySelector (Optional) Redis password secret selector sentinelPassword Kubernetes core/v1.SecretKeySelector (Optional) Sentinel password secret selector RedisSettings ( Appears on: NativeRedis ) Field Description redis string (Optional) Redis settings shared by both master and slaves, will override the global settings from controller config master string (Optional) Special settings for Redis master node, will override the global settings from controller config replica string (Optional) Special settings for Redis replica nodes, will override the global settings from controller config sentinel string (Optional) Sentinel settings, will override the global settings from controller config SASL ( Appears on: KafkaSink , KafkaSource ) Field Description mechanism SASLType SASL mechanism to use gssapi GSSAPI (Optional) GSSAPI contains the kerberos config plain SASLPlain (Optional) SASLPlain contains the sasl plain config scramsha256 SASLPlain (Optional) SASLSCRAMSHA256 contains the sasl plain config scramsha512 SASLPlain (Optional) SASLSCRAMSHA512 contains the sasl plain config SASLPlain ( Appears on: SASL ) Field Description userSecret Kubernetes core/v1.SecretKeySelector UserSecret refers to the secret that contains the user passwordSecret Kubernetes core/v1.SecretKeySelector (Optional) PasswordSecret refers to the secret that contains the password handshake bool SASLType ( string alias) ( Appears on: SASL ) SASLType describes the SASL type 
Scale ( Appears on: AbstractVertex ) Scale defines the parameters for autoscaling. Field Description disabled bool (Optional) Whether to disable autoscaling. Set to \u201ctrue\u201d when using Kubernetes HPA or any other 3rd party autoscaling strategies. min int32 (Optional) Minimum replicas. max int32 (Optional) Maximum replicas. lookbackSeconds uint32 (Optional) Lookback seconds to calculate the average pending messages and processing rate. cooldownSeconds uint32 (Optional) Deprecated: Use scaleUpCooldownSeconds and scaleDownCooldownSeconds instead. Cooldown seconds after a scaling operation before another one. zeroReplicaSleepSeconds uint32 (Optional) After scaling down the source vertex to 0, the number of seconds to sleep before scaling the source vertex back up to peek. targetProcessingSeconds uint32 (Optional) TargetProcessingSeconds is used to tune the aggressiveness of autoscaling for source vertices, it measures how fast you want the vertex to process all the pending messages. Typically, increasing the value leads to a lower processing rate and thus fewer replicas. It\u2019s only effective for source vertices. targetBufferAvailability uint32 (Optional) TargetBufferAvailability is used to define the target percentage of the buffer availability. A valid and meaningful value should be less than the BufferUsageLimit defined in the Edge spec (or Pipeline spec), for example, 50. It only applies to UDF and Sink vertices because only they have buffers to read. replicasPerScale uint32 (Optional) ReplicasPerScale defines the maximum number of replicas that can be scaled up or down at once. This is used to prevent overly aggressive scaling operations. scaleUpCooldownSeconds uint32 (Optional) ScaleUpCooldownSeconds defines the cooldown seconds after a scaling operation, before a follow-up scaling up. It defaults to the CooldownSeconds if not set. 
scaleDownCooldownSeconds uint32 (Optional) ScaleDownCooldownSeconds defines the cooldown seconds after a scaling operation, before a follow-up scaling down. It defaults to the CooldownSeconds if not set. ServingSource ( Appears on: Source ) ServingSource is the HTTP endpoint for Numaflow. Field Description auth Authorization (Optional) service bool (Optional) Whether to create a ClusterIP Service msgIDHeaderKey string The header key from which the message id will be extracted store ServingStore Persistent store for the callbacks for serving and tracking ServingStore ( Appears on: ServingSource ) ServingStore to track and store data and metadata for tracking and serving. Field Description url string URL of the persistent store to write the callbacks ttl Kubernetes meta/v1.Duration (Optional) TTL for the data in the store and tracker SessionWindow ( Appears on: Window ) SessionWindow describes a session window Field Description timeout Kubernetes meta/v1.Duration Timeout is the duration of inactivity after which a session window closes. SideInput ( Appears on: PipelineSpec ) SideInput defines information of a Side Input Field Description name string container Container volumes \\[\\]Kubernetes core/v1.Volume (Optional) trigger SideInputTrigger SideInputTrigger ( Appears on: SideInput ) Field Description schedule string The schedule to trigger the retrievement of the side input data. It supports cron format, for example, \u201c0 30 \\* \\* \\* \\*\u201d. Or interval based format, such as \u201c@hourly\u201d, \u201c@every 1h30m\u201d, etc. timezone string (Optional) SideInputsManagerTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) 
(Optional) containerTemplate ContainerTemplate (Optional) Template for the side inputs manager numa container initContainerTemplate ContainerTemplate (Optional) Template for the side inputs manager init container Sink ( Appears on: AbstractVertex ) Field Description AbstractSink AbstractSink (Members of AbstractSink are embedded into this type.) fallback AbstractSink (Optional) Fallback sink can be imagined as DLQ for primary Sink. The writes to Fallback sink will only be initiated if the ud-sink response field sets it. SlidingWindow ( Appears on: Window ) SlidingWindow describes a sliding window Field Description length Kubernetes meta/v1.Duration Length is the duration of the sliding window. slide Kubernetes meta/v1.Duration Slide is the slide parameter that controls the frequency at which the sliding window is created. streaming bool (Optional) Streaming should be set to true if the reduce udf is streaming. Source ( Appears on: AbstractVertex ) Field Description generator GeneratorSource (Optional) kafka KafkaSource (Optional) http HTTPSource (Optional) nats NatsSource (Optional) transformer UDTransformer (Optional) udsource UDSource (Optional) jetstream JetStreamSource (Optional) serving ServingSource (Optional) Status ( Appears on: InterStepBufferServiceStatus , PipelineStatus ) Status is a common structure which can be used for Status field. Field Description conditions \\[\\]Kubernetes meta/v1.Condition (Optional) Conditions are the latest available observations of a resource\u2019s current state. 
TLS ( Appears on: JetStreamSource , KafkaSink , KafkaSource , NatsSource ) Field Description insecureSkipVerify bool (Optional) caCertSecret Kubernetes core/v1.SecretKeySelector (Optional) CACertSecret refers to the secret that contains the CA cert certSecret Kubernetes core/v1.SecretKeySelector (Optional) CertSecret refers to the secret that contains the cert keySecret Kubernetes core/v1.SecretKeySelector (Optional) KeySecret refers to the secret that contains the key TagConditions ( Appears on: ForwardConditions ) Field Description operator LogicOperator (Optional) Operator specifies the type of operation that should be used for conditional forwarding value could be \u201cand\u201d, \u201cor\u201d, \u201cnot\u201d values \\[\\]string Values tag values for conditional forwarding Templates ( Appears on: PipelineSpec ) Field Description daemon DaemonTemplate (Optional) DaemonTemplate is used to customize the Daemon Deployment. job JobTemplate (Optional) JobTemplate is used to customize Jobs. sideInputsManager SideInputsManagerTemplate (Optional) SideInputsManagerTemplate is used to customize the Side Inputs Manager. vertex VertexTemplate (Optional) VertexTemplate is used to customize the vertices of the pipeline. Transformer ( Appears on: UDTransformer ) Field Description name string args \\[\\]string (Optional) kwargs map\\[string\\]string (Optional) UDF ( Appears on: AbstractVertex ) Field Description container Container (Optional) builtin Function (Optional) groupBy GroupBy (Optional) UDSink ( Appears on: AbstractSink ) Field Description container Container UDSource ( Appears on: Source ) Field Description container Container UDTransformer ( Appears on: Source ) Field Description container Container (Optional) builtin Transformer (Optional) Vertex ( Appears on: VertexInstance ) Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. 
spec VertexSpec AbstractVertex AbstractVertex (Members of AbstractVertex are embedded into this type.) pipelineName string interStepBufferServiceName string (Optional) replicas int32 (Optional) fromEdges \\[\\]CombinedEdge (Optional) toEdges \\[\\]CombinedEdge (Optional) watermark Watermark (Optional) Watermark indicates watermark progression in the vertex, it\u2019s populated from the pipeline watermark settings. status VertexStatus (Optional) VertexInstance VertexInstance is a wrapper of a vertex instance, which contains the vertex spec and the instance information such as hostname and replica index. Field Description vertex Vertex hostname string replica int32 VertexLimits ( Appears on: AbstractVertex , CombinedEdge ) Field Description readBatchSize uint64 (Optional) Read batch size from the source or buffer. It overrides the settings from pipeline limits. readTimeout Kubernetes meta/v1.Duration (Optional) Read timeout duration from the source or buffer It overrides the settings from pipeline limits. bufferMaxLength uint64 (Optional) BufferMaxLength is used to define the max length of a buffer. It overrides the settings from pipeline limits. bufferUsageLimit uint32 (Optional) BufferUsageLimit is used to define the percentage of the buffer usage limit, a valid value should be less than 100, for example, 85. It overrides the settings from pipeline limits. VertexPhase ( string alias) ( Appears on: VertexStatus ) VertexSpec ( Appears on: Vertex ) Field Description AbstractVertex AbstractVertex (Members of AbstractVertex are embedded into this type.) pipelineName string interStepBufferServiceName string (Optional) replicas int32 (Optional) fromEdges \\[\\]CombinedEdge (Optional) toEdges \\[\\]CombinedEdge (Optional) watermark Watermark (Optional) Watermark indicates watermark progression in the vertex, it\u2019s populated from the pipeline watermark settings. 
VertexStatus ( Appears on: Vertex ) Field Description phase VertexPhase reason string message string replicas uint32 selector string lastScaledAt Kubernetes meta/v1.Time VertexTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) containerTemplate ContainerTemplate (Optional) Template for the vertex numa container initContainerTemplate ContainerTemplate (Optional) Template for the vertex init container VertexType ( string alias) ( Appears on: CombinedEdge ) Watermark ( Appears on: PipelineSpec , VertexSpec ) Field Description disabled bool (Optional) Disabled toggles the watermark propagation, defaults to false. maxDelay Kubernetes meta/v1.Duration (Optional) Maximum delay allowed for watermark calculation, defaults to \u201c0s\u201d, which means no delay. idleSource IdleSource (Optional) IdleSource defines the idle watermark properties, it could be configured in case source is idling. Window ( Appears on: GroupBy ) Window describes windowing strategy Field Description fixed FixedWindow (Optional) sliding SlidingWindow (Optional) session SessionWindow (Optional) Generated with gen-crd-api-reference-docs .","title":"APIs"},{"location":"quick-start/","text":"Quick Start \u00b6 In this page, we will guide you through the steps to: Install Numaflow. Create and run a simple pipeline. Create and run an advanced pipeline. Before you begin: prerequisites \u00b6 To try Numaflow, you will first need to set up one of the following options to run container images: Docker Desktop podman Then use one of the following options to create a local Kubernetes cluster: Docker Desktop Kubernetes k3d kind minikube You will also need kubectl to manage the cluster. Follow these steps to install kubectl . In case you need a refresher, all the kubectl commands used in this quick start guide can be found in the kubectl Cheat Sheet . 
Installing Numaflow \u00b6 Once you have completed all the prerequisites, run the following command lines to install Numaflow and start the Inter-Step Buffer Service that handles communication between vertices. kubectl create ns numaflow-system kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml Creating a simple pipeline \u00b6 As an example, we will create a simple pipeline that contains a source vertex to generate messages, a processing vertex that echos the messages, and a sink vertex that logs the messages. Run the command below to create a simple pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml To view a list of pipelines you've created, run: kubectl get pipeline # or \"pl\" as a short name This should create a response like the following, with AGE indicating the time elapsed since the creation of your simple pipeline. NAME PHASE MESSAGE VERTICES AGE simple-pipeline Running 3 9s To inspect the status of the pipeline, use kubectl get pods . Note that the pod names will be different from the sample response: # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s simple-pipeline-daemon-78b798fb98-qf4t4 1 /1 Running 0 10s simple-pipeline-out-0-xc0pf 1 /1 Running 0 10s simple-pipeline-cat-0-kqrhy 2 /2 Running 0 10s simple-pipeline-in-0-rhpjm 1 /1 Running 0 11s Now you can watch the log for the output vertex. Run the command below and remember to replace xxxxx with the appropriate pod name above. 
kubectl logs -f simple-pipeline-out-0-xxxxx This should generate an output like the sample below: 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"VT+G+/W7Dhc=\" , \"Createdts\" :1661471977707552597 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"0TaH+/W7Dhc=\" , \"Createdts\" :1661471977707615953 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"EEGH+/W7Dhc=\" , \"Createdts\" :1661471977707618576 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"WESH+/W7Dhc=\" , \"Createdts\" :1661471977707619416 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"YEaH+/W7Dhc=\" , \"Createdts\" :1661471977707619936 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"qfomN/a7Dhc=\" , \"Createdts\" :1661471978707942057 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"aUcnN/a7Dhc=\" , \"Createdts\" :1661471978707961705 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"iUonN/a7Dhc=\" , \"Createdts\" :1661471978707962505 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"mkwnN/a7Dhc=\" , \"Createdts\" :1661471978707963034 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"jk4nN/a7Dhc=\" , \"Createdts\" :1661471978707963534 } Numaflow also comes with a built-in user interface. NOTE : Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). You can install it by running the below command. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' To port forward the UI, run the following command. # Port forward the UI to https://localhost:8443/ kubectl -n numaflow-system port-forward deployment/numaflow-server 8443 :8443 This renders the following UI on https://localhost:8443/. 
The pipeline can be deleted by issuing the following command: kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml Creating an advanced pipeline \u00b6 Now we will walk you through creating an advanced pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of an advanced pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. 
kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for the advanced pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml A pipeline with reduce (aggregation) \u00b6 To set up an example pipeline with the Reduce UDF , see Reduce Examples . What's Next \u00b6 Try more examples in the examples directory. After exploring how Numaflow pipelines run, you can check what data Sources and Sinks Numaflow supports out of the box, or learn how to write User-defined Functions . Numaflow can also be paired with Numalogic, a collection of ML models and algorithms for real-time data analytics and AIOps including anomaly detection. Visit the Numalogic homepage for more information.","title":"Quick Start"},{"location":"quick-start/#quick-start","text":"In this page, we will guide you through the steps to: Install Numaflow. Create and run a simple pipeline. 
Create and run an advanced pipeline.","title":"Quick Start"},{"location":"quick-start/#before-you-begin-prerequisites","text":"To try Numaflow, you will first need to set up one of the following options to run container images: Docker Desktop podman Then use one of the following options to create a local Kubernetes cluster: Docker Desktop Kubernetes k3d kind minikube You will also need kubectl to manage the cluster. Follow these steps to install kubectl . In case you need a refresher, all the kubectl commands used in this quick start guide can be found in the kubectl Cheat Sheet .","title":"Before you begin: prerequisites"},{"location":"quick-start/#installing-numaflow","text":"Once you have completed all the prerequisites, run the following command lines to install Numaflow and start the Inter-Step Buffer Service that handles communication between vertices. kubectl create ns numaflow-system kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml","title":"Installing Numaflow"},{"location":"quick-start/#creating-a-simple-pipeline","text":"As an example, we will create a simple pipeline that contains a source vertex to generate messages, a processing vertex that echos the messages, and a sink vertex that logs the messages. Run the command below to create a simple pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml To view a list of pipelines you've created, run: kubectl get pipeline # or \"pl\" as a short name This should create a response like the following, with AGE indicating the time elapsed since the creation of your simple pipeline. NAME PHASE MESSAGE VERTICES AGE simple-pipeline Running 3 9s To inspect the status of the pipeline, use kubectl get pods . 
Note that the pod names will be different from the sample response: # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s simple-pipeline-daemon-78b798fb98-qf4t4 1 /1 Running 0 10s simple-pipeline-out-0-xc0pf 1 /1 Running 0 10s simple-pipeline-cat-0-kqrhy 2 /2 Running 0 10s simple-pipeline-in-0-rhpjm 1 /1 Running 0 11s Now you can watch the log for the output vertex. Run the command below and remember to replace xxxxx with the appropriate pod name above. kubectl logs -f simple-pipeline-out-0-xxxxx This should generate an output like the sample below: 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"VT+G+/W7Dhc=\" , \"Createdts\" :1661471977707552597 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"0TaH+/W7Dhc=\" , \"Createdts\" :1661471977707615953 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"EEGH+/W7Dhc=\" , \"Createdts\" :1661471977707618576 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"WESH+/W7Dhc=\" , \"Createdts\" :1661471977707619416 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"YEaH+/W7Dhc=\" , \"Createdts\" :1661471977707619936 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"qfomN/a7Dhc=\" , \"Createdts\" :1661471978707942057 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"aUcnN/a7Dhc=\" , \"Createdts\" :1661471978707961705 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"iUonN/a7Dhc=\" , \"Createdts\" :1661471978707962505 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"mkwnN/a7Dhc=\" , \"Createdts\" :1661471978707963034 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"jk4nN/a7Dhc=\" , \"Createdts\" :1661471978707963534 } Numaflow also comes with a built-in user interface. NOTE : Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). You can install it by running the below command. 
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' To port forward the UI, run the following command. # Port forward the UI to https://localhost:8443/ kubectl -n numaflow-system port-forward deployment/numaflow-server 8443 :8443 This renders the following UI on https://localhost:8443/. The pipeline can be deleted by issuing the following command: kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml","title":"Creating a simple pipeline"},{"location":"quick-start/#creating-an-advanced-pipeline","text":"Now we will walk you through creating an advanced pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of an advanced pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . 
# Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for the advanced pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. 
The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Creating an advanced pipeline"},{"location":"quick-start/#a-pipeline-with-reduce-aggregation","text":"To set up an example pipeline with the Reduce UDF , see Reduce Examples .","title":"A pipeline with reduce (aggregation)"},{"location":"quick-start/#whats-next","text":"Try more examples in the examples directory. After exploring how Numaflow pipelines run, you can check what data Sources and Sinks Numaflow supports out of the box, or learn how to write User-defined Functions . Numaflow can also be paired with Numalogic, a collection of ML models and algorithms for real-time data analytics and AIOps including anomaly detection. Visit the Numalogic homepage for more information.","title":"What's Next"},{"location":"core-concepts/inter-step-buffer-service/","text":"Inter-Step Buffer Service \u00b6 Inter-Step Buffer Service is the service to provide Inter-Step Buffers . An Inter-Step Buffer Service is described by a Custom Resource . It is required to be existing in a namespace before Pipeline objects are created. A sample InterStepBufferService with JetStream implementation looks like below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment InterStepBufferService is a namespaced object. It can be used by all the Pipelines in the same namespace. By default, Pipeline objects look for an InterStepBufferService named default , so a common practice is to create an InterStepBufferService with the name default . If you give the InterStepBufferService a name other than default , then you need to give the same name in the Pipeline spec. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : # Optional, if not specified, defaults to \"default\" interStepBufferServiceName : different-name To query Inter-Step Buffer Service objects with kubectl : kubectl get isbsvc JetStream \u00b6 JetStream is one of the supported Inter-Step Buffer Service implementations. A keyword jetstream under spec means a JetStream cluster will be created in the namespace. Version \u00b6 Property spec.jetstream.version is required for a JetStream InterStepBufferService . Supported versions can be found from the ConfigMap numaflow-controller-config in the control plane namespace. Note The version latest in the ConfigMap should only be used for testing purpose. It's recommended that you always use a fixed version in your real workload. Replicas \u00b6 An optional property spec.jetstream.replicas (defaults to 3) can be specified, which gives the total number of nodes. Persistence \u00b6 Following example shows a JetStream InterStepBufferService with persistence. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi JetStream Settings \u00b6 There are 2 places to configure JetStream settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the JetStream InterStepBufferService created in the Kubernetes cluster. Property spec.jetstream.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . 
A sample JetStream configuration: # https://docs.nats.io/running-a-nats-service/configuration#limits # Only \"max_payload\" is supported for configuration in this section. # Max payload size in bytes, defaults to 1 MB. It is not recommended to use values over 8MB but max_payload can be set up to 64MB. max_payload: 1048576 # # https://docs.nats.io/running-a-nats-service/configuration#jetstream # Only configure \"max_memory_store\" or \"max_file_store\" in this section, do not set \"store_dir\" as it has been hardcoded. # # e.g. 1G. -1 means no limit, up to 75% of available memory. This only take effect for streams created using memory storage. max_memory_store: -1 # e.g. 20G. -1 means no limit, Up to 1TB if available max_file_store: 1TB Buffer Configuration \u00b6 For the Inter-Step Buffers created in JetStream ISB Service, there are 2 places to configure the default properties. ConfigMap numaflow-controller-config in the control plane namespace. This is the place to configure the default properties for the streams and consumers created in all the JetStream ISB Services in the Kubernetes cluster. Field spec.jetstream.bufferConfig in an InterStepBufferService object. This optional field can be used to customize the stream and consumer properties of that particular InterStepBufferService , and the configuration will be merged into the default one from the ConfigMap numaflow-controller-config . For example, if you only want to change maxMsgs for created streams, then you only need to give stream.maxMsgs in the field; the rest of the config will still use the default values in the control plane ConfigMap. 
Both places expect a YAML format configuration like the one below: bufferConfig : | # The properties of the buffers (streams) to be created in this JetStream service stream: # 0: Limits, 1: Interest, 2: WorkQueue retention: 1 maxMsgs: 30000 maxAge: 168h maxBytes: -1 # 0: File, 1: Memory storage: 0 replicas: 3 duplicates: 60s # The consumer properties for the created streams consumer: ackWait: 60s maxAckPending: 20000 Note Changing the buffer configuration either in the control plane ConfigMap or in the InterStepBufferService object does NOT change any buffers (streams) that already exist. TLS \u00b6 TLS is optional to configure through spec.jetstream.tls: true . Enabling TLS will use a self-signed certificate to encrypt the connection from Vertex Pods to JetStream service. By default TLS is not enabled. Encryption At Rest \u00b6 Encryption at rest can be enabled by setting spec.jetstream.encryption: true . Be aware that this will impact the performance a bit; see the details in the official doc . Once a JetStream ISB Service is created, toggling the encryption field will cause problems for the existing messages, so if you want to change the value, please delete and recreate the ISB Service, and you also need to restart all the Vertex Pods to pick up the new credentials. Other Configuration \u00b6 Check here for the full spec of spec.jetstream . Redis \u00b6 NOTE Today when using Redis, the pipeline will stall if Redis has any data loss, especially during failovers. Redis is supported as an Inter-Step Buffer Service implementation. A keyword native under spec.redis means several Redis nodes with a Master-Replicas topology will be created in the namespace. We also support external Redis. External Redis \u00b6 If you have a managed Redis, say in AWS, etc., we can make that Redis your ISB. All you need to do is provide the external Redis endpoint name.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : external : url : \"\" user : \"default\" Cluster Mode \u00b6 We support cluster mode , only if the Redis is an external managed Redis. You will have to enter the url twice to indicate that the mode is cluster. This is because we use Universal Client which requires more than one address to indicate the Redis is in cluster mode. url : \"numaflow-redis-cluster-0.numaflow-redis-cluster-headless:6379,numaflow-redis-cluster-1.numaflow-redis-cluster-headless:6379\" Version \u00b6 Property spec.redis.native.version is required for a native Redis InterStepBufferService . Supported versions can be found from the ConfigMap numaflow-controller-config in the control plane namespace. Replicas \u00b6 An optional property spec.redis.native.replicas (defaults to 3) can be specified, which gives the total number of nodes (including master and replicas). An odd number >= 3 is suggested. If the given number is less than 3, 3 will be used. Persistence \u00b6 The following example shows a native Redis InterStepBufferService with persistence.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : native : version : 6.2.6 persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi Redis Configuration \u00b6 Redis configuration includes: spec.redis.native.settings.redis - Redis configuration shared by both master and replicas spec.redis.native.settings.master - Redis configuration only for master spec.redis.native.settings.replica - Redis configuration only for replicas spec.redis.native.settings.sentinel - Sentinel configuration A sample Redis configuration: # Enable AOF https://redis.io/topics/persistence#append-only-file appendonly yes auto-aof-rewrite-percentage 100 auto-aof-rewrite-min-size 64mb # Disable RDB persistence, AOF persistence already enabled. save \"\" maxmemory 512mb maxmemory-policy allkeys-lru A sample Sentinel configuration: sentinel down-after-milliseconds mymaster 10000 sentinel failover-timeout mymaster 2000 sentinel parallel-syncs mymaster 1 There are 2 places to configure these settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the native Redis InterStepBufferService created in the Kubernetes cluster. Property spec.redis.native.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . Here is the reference to the full Redis configuration. Other Configuration \u00b6 Check here for the full spec of spec.redis.native .","title":"Inter-Step Buffer Service"},{"location":"core-concepts/inter-step-buffer-service/#inter-step-buffer-service","text":"Inter-Step Buffer Service is the service to provide Inter-Step Buffers . An Inter-Step Buffer Service is described by a Custom Resource . 
It must exist in a namespace before Pipeline objects are created. A sample InterStepBufferService with JetStream implementation looks like below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment InterStepBufferService is a namespaced object. It can be used by all the Pipelines in the same namespace. By default, Pipeline objects look for an InterStepBufferService named default , so a common practice is to create an InterStepBufferService with the name default . If you give the InterStepBufferService a name other than default , then you need to give the same name in the Pipeline spec. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : # Optional, if not specified, defaults to \"default\" interStepBufferServiceName : different-name To query Inter-Step Buffer Service objects with kubectl : kubectl get isbsvc","title":"Inter-Step Buffer Service"},{"location":"core-concepts/inter-step-buffer-service/#jetstream","text":"JetStream is one of the supported Inter-Step Buffer Service implementations. A keyword jetstream under spec means a JetStream cluster will be created in the namespace.","title":"JetStream"},{"location":"core-concepts/inter-step-buffer-service/#version","text":"Property spec.jetstream.version is required for a JetStream InterStepBufferService . Supported versions can be found from the ConfigMap numaflow-controller-config in the control plane namespace. Note The version latest in the ConfigMap should only be used for testing purposes.
It's recommended that you always use a fixed version in your real workload.","title":"Version"},{"location":"core-concepts/inter-step-buffer-service/#replicas","text":"An optional property spec.jetstream.replicas (defaults to 3) can be specified, which gives the total number of nodes.","title":"Replicas"},{"location":"core-concepts/inter-step-buffer-service/#persistence","text":"Following example shows a JetStream InterStepBufferService with persistence. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi","title":"Persistence"},{"location":"core-concepts/inter-step-buffer-service/#jetstream-settings","text":"There are 2 places to configure JetStream settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the JetStream InterStepBufferService created in the Kubernetes cluster. Property spec.jetstream.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . A sample JetStream configuration: # https://docs.nats.io/running-a-nats-service/configuration#limits # Only \"max_payload\" is supported for configuration in this section. # Max payload size in bytes, defaults to 1 MB. It is not recommended to use values over 8MB but max_payload can be set up to 64MB. max_payload: 1048576 # # https://docs.nats.io/running-a-nats-service/configuration#jetstream # Only configure \"max_memory_store\" or \"max_file_store\" in this section, do not set \"store_dir\" as it has been hardcoded. # # e.g. 1G. 
-1 means no limit, up to 75% of available memory. This only takes effect for streams created using memory storage. max_memory_store: -1 # e.g. 20G. -1 means no limit, Up to 1TB if available max_file_store: 1TB","title":"JetStream Settings"},{"location":"core-concepts/inter-step-buffer-service/#buffer-configuration","text":"For the Inter-Step Buffers created in JetStream ISB Service, there are 2 places to configure the default properties. ConfigMap numaflow-controller-config in the control plane namespace. This is the place to configure the default properties for the streams and consumers created in all the JetStream ISB Services in the Kubernetes cluster. Field spec.jetstream.bufferConfig in an InterStepBufferService object. This optional field can be used to customize the stream and consumer properties of that particular InterStepBufferService , and the configuration will be merged into the default one from the ConfigMap numaflow-controller-config . For example, if you only want to change maxMsgs for created streams, then you only need to give stream.maxMsgs in the field; all the remaining settings keep the default values from the control plane ConfigMap. Both places expect a YAML format configuration like the one below: bufferConfig : | # The properties of the buffers (streams) to be created in this JetStream service stream: # 0: Limits, 1: Interest, 2: WorkQueue retention: 1 maxMsgs: 30000 maxAge: 168h maxBytes: -1 # 0: File, 1: Memory storage: 0 replicas: 3 duplicates: 60s # The consumer properties for the created streams consumer: ackWait: 60s maxAckPending: 20000 Note Changing the buffer configuration either in the control plane ConfigMap or in the InterStepBufferService object does NOT change any buffers (streams) that already exist.","title":"Buffer Configuration"},{"location":"core-concepts/inter-step-buffer-service/#tls","text":"TLS is optional to configure through spec.jetstream.tls: true .
Enabling TLS will use a self-signed certificate to encrypt the connection from Vertex Pods to JetStream service. By default TLS is not enabled.","title":"TLS"},{"location":"core-concepts/inter-step-buffer-service/#encryption-at-rest","text":"Encryption at rest can be enabled by setting spec.jetstream.encryption: true . Be aware that this will impact the performance a bit; see the details in the official doc . Once a JetStream ISB Service is created, toggling the encryption field will cause problems for the existing messages, so if you want to change the value, please delete and recreate the ISB Service, and you also need to restart all the Vertex Pods to pick up the new credentials.","title":"Encryption At Rest"},{"location":"core-concepts/inter-step-buffer-service/#other-configuration","text":"Check here for the full spec of spec.jetstream .","title":"Other Configuration"},{"location":"core-concepts/inter-step-buffer-service/#redis","text":"NOTE Today when using Redis, the pipeline will stall if Redis has any data loss, especially during failovers. Redis is supported as an Inter-Step Buffer Service implementation. A keyword native under spec.redis means several Redis nodes with a Master-Replicas topology will be created in the namespace. We also support external Redis.","title":"Redis"},{"location":"core-concepts/inter-step-buffer-service/#external-redis","text":"If you have a managed Redis, say in AWS, etc., we can make that Redis your ISB. All you need to do is provide the external Redis endpoint name. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : external : url : \"\" user : \"default\"","title":"External Redis"},{"location":"core-concepts/inter-step-buffer-service/#cluster-mode","text":"We support cluster mode , only if the Redis is an external managed Redis. You will have to enter the url twice to indicate that the mode is cluster.
This is because we use Universal Client which requires more than one address to indicate the Redis is in cluster mode. url : \"numaflow-redis-cluster-0.numaflow-redis-cluster-headless:6379,numaflow-redis-cluster-1.numaflow-redis-cluster-headless:6379\"","title":"Cluster Mode"},{"location":"core-concepts/inter-step-buffer-service/#version_1","text":"Property spec.redis.native.version is required for a native Redis InterStepBufferService . Supported versions can be found from the ConfigMap numaflow-controller-config in the control plane namespace.","title":"Version"},{"location":"core-concepts/inter-step-buffer-service/#replicas_1","text":"An optional property spec.redis.native.replicas (defaults to 3) can be specified, which gives the total number of nodes (including master and replicas). An odd number >= 3 is suggested. If the given number is less than 3, 3 will be used.","title":"Replicas"},{"location":"core-concepts/inter-step-buffer-service/#persistence_1","text":"The following example shows a native Redis InterStepBufferService with persistence.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : native : version : 6.2.6 persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi","title":"Persistence"},{"location":"core-concepts/inter-step-buffer-service/#redis-configuration","text":"Redis configuration includes: spec.redis.native.settings.redis - Redis configuration shared by both master and replicas spec.redis.native.settings.master - Redis configuration only for master spec.redis.native.settings.replica - Redis configuration only for replicas spec.redis.native.settings.sentinel - Sentinel configuration A sample Redis configuration: # Enable AOF https://redis.io/topics/persistence#append-only-file appendonly yes auto-aof-rewrite-percentage 100 auto-aof-rewrite-min-size 64mb # Disable RDB persistence, AOF persistence already enabled. save \"\" maxmemory 512mb maxmemory-policy allkeys-lru A sample Sentinel configuration: sentinel down-after-milliseconds mymaster 10000 sentinel failover-timeout mymaster 2000 sentinel parallel-syncs mymaster 1 There are 2 places to configure these settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the native Redis InterStepBufferService created in the Kubernetes cluster. Property spec.redis.native.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . 
Here is the reference to the full Redis configuration.","title":"Redis Configuration"},{"location":"core-concepts/inter-step-buffer-service/#other-configuration_1","text":"Check here for the full spec of spec.redis.native .","title":"Other Configuration"},{"location":"core-concepts/inter-step-buffer/","text":"Inter-Step Buffer \u00b6 A Pipeline contains multiple vertices that ingest data from sources, process data, and forward processed data to sinks. Vertices are not connected directly, but through Inter-Step Buffers. Inter-Step Buffer can be implemented by a variety of data buffering technologies. Those technologies should support: Durability Offsets Transactions for Exactly-Once forwarding Concurrent reading Ability to explicitly acknowledge each data or offset Claim pending messages (read but not acknowledge) Ability to trim data (buffer size control) Fast (high throughput low latency) Ability to query buffer information Currently, there are 2 Inter-Step Buffer implementations: Nats JetStream Redis Stream","title":"Inter-Step Buffer"},{"location":"core-concepts/inter-step-buffer/#inter-step-buffer","text":"A Pipeline contains multiple vertices that ingest data from sources, process data, and forward processed data to sinks. Vertices are not connected directly, but through Inter-Step Buffers. Inter-Step Buffer can be implemented by a variety of data buffering technologies. Those technologies should support: Durability Offsets Transactions for Exactly-Once forwarding Concurrent reading Ability to explicitly acknowledge each data or offset Claim pending messages (read but not acknowledge) Ability to trim data (buffer size control) Fast (high throughput low latency) Ability to query buffer information Currently, there are 2 Inter-Step Buffer implementations: Nats JetStream Redis Stream","title":"Inter-Step Buffer"},{"location":"core-concepts/pipeline/","text":"Pipeline \u00b6 The Pipeline represents a data processing job. 
The most important concept in Numaflow, it defines: A list of vertices , which define the data processing tasks; A list of edges , which are used to describe the relationship between the vertices. Note an edge may go from a vertex to multiple vertices, and as of v0.10, an edge may also go from multiple vertices to a vertex. This many-to-one relationship is possible via Join and Cycles The Pipeline is abstracted as a Kubernetes Custom Resource . A Pipeline spec looks like below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out To query Pipeline objects with kubectl : kubectl get pipeline # or \"pl\" as a short name","title":"Pipeline"},{"location":"core-concepts/pipeline/#pipeline","text":"The Pipeline represents a data processing job. The most important concept in Numaflow, it defines: A list of vertices , which define the data processing tasks; A list of edges , which are used to describe the relationship between the vertices. Note an edge may go from a vertex to multiple vertices, and as of v0.10, an edge may also go from multiple vertices to a vertex. This many-to-one relationship is possible via Join and Cycles The Pipeline is abstracted as a Kubernetes Custom Resource . A Pipeline spec looks like below. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out To query Pipeline objects with kubectl : kubectl get pipeline # or \"pl\" as a short name","title":"Pipeline"},{"location":"core-concepts/vertex/","text":"Vertex \u00b6 The Vertex is a key component of Numaflow Pipeline where the data processing happens. Vertex is defined as a list in the pipeline spec, each representing a data processing task. There are 3 types of Vertex in Numaflow today: Source - To ingest data from sources. Sink - To forward processed data to sinks. UDF - User-defined Function, which is used to define data processing logic. We have defined a Kubernetes Custom Resource for Vertex . A Pipeline containing multiple vertices will automatically generate multiple Vertex objects by the controller. As a user, you should NOT create a Vertex object directly. In a Pipeline , the vertices are not connected directly, but through Inter-Step Buffers . To query Vertex objects with kubectl : kubectl get vertex # or \"vtx\" as a short name","title":"Vertex"},{"location":"core-concepts/vertex/#vertex","text":"The Vertex is a key component of Numaflow Pipeline where the data processing happens. Vertex is defined as a list in the pipeline spec, each representing a data processing task. There are 3 types of Vertex in Numaflow today: Source - To ingest data from sources. Sink - To forward processed data to sinks. UDF - User-defined Function, which is used to define data processing logic. We have defined a Kubernetes Custom Resource for Vertex . A Pipeline containing multiple vertices will automatically generate multiple Vertex objects by the controller. As a user, you should NOT create a Vertex object directly. 
In a Pipeline , the vertices are not connected directly, but through Inter-Step Buffers . To query Vertex objects with kubectl : kubectl get vertex # or \"vtx\" as a short name","title":"Vertex"},{"location":"core-concepts/watermarks/","text":"Watermarks \u00b6 When processing an unbounded data stream, Numaflow has to materialize the results of the processing done on the data. The materialization of the output depends on the notion of time, e.g., the total number of logins served per minute. Without the idea of time built into the platform, we will not be able to determine the passage of time, which is necessary for grouping elements together to materialize the result. Watermarks are that notion of time that will help us group unbounded data into discrete chunks. Numaflow supports watermarks out-of-the-box. Source vertices generate watermarks based on the event time, and propagate them to downstream vertices. Watermark is defined as \u201ca monotonically increasing timestamp of the oldest work/event not yet completed\u201d . In other words, if the watermark has advanced past some timestamp T, we are guaranteed by its monotonic property that no more processing will occur for on-time events at or before T. Configuration \u00b6 Disable Watermark \u00b6 Watermarks can be disabled by setting disabled: true . Idle Detection \u00b6 Watermark is assigned at the source; this means that the watermark will only progress if there is data coming into the source. There are many cases where the source might not be getting data, causing the source to idle (e.g., data is periodic, say once an hour). When the source is idling the reduce vertices won't emit results because the watermark is not moving. To detect source idling and propagate the watermark, we can use the idle detection feature. The idle source watermark progressor will make sure that the watermark cannot progress beyond time.now() - maxDelay ( maxDelay is defined below).
To enable this, we provide the following settings: Threshold \u00b6 Threshold is the duration after which a source is marked as Idle due to a lack of data flowing into the source. StepInterval \u00b6 StepInterval is the duration between subsequent increments of the watermark as long as the source remains Idle. The default value is 0s, which means that once we detect an idle source, we will increment the watermark by IncrementBy every time we detect that our source is empty (in other words, this will be a very frequent update). Default Value: 0s IncrementBy \u00b6 IncrementBy is the duration to be added to the current watermark to progress the watermark when the source is idling. Example \u00b6 The example below will consider the source as idle after there is no data at the source for 5s. After 5s, an idle watermark will be emitted every 2s, each incrementing the watermark by 3s. watermark : idleSource : threshold : 5s # The pipeline will be considered idle if the source has not emitted any data for given threshold value. incrementBy : 3s # If source is found to be idle then increment the watermark by given incrementBy value. stepInterval : 2s # If source is idling then publish the watermark only when step interval has passed. maxDelay \u00b6 Watermark assignments happen at the source. Sources could be out of order, so sometimes we want to extend the window (default is 0s ) to wait before we start marking data as late-data. You can give more time for the system to wait for late data with maxDelay so that the late data within the specified time duration will be considered as data on-time. This means the watermark propagation will be delayed by maxDelay . Example \u00b6 apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline spec : watermark : disabled : false # Optional, defaults to false. maxDelay : 60s # Optional, defaults to \"0s\". Watermark API \u00b6 When processing data in user-defined functions , you can get the current watermark through an API.
Watermark API is supported in all our client SDKs. Example Golang \u00b6 // Go func mapFn ( context context . Context , keys [] string , d mapper . Datum ) mapper . Messages { _ = d . EventTime () // Event time _ = d . Watermark () // Watermark ... ... }","title":"Watermarks"},{"location":"core-concepts/watermarks/#watermarks","text":"When processing an unbounded data stream, Numaflow has to materialize the results of the processing done on the data. The materialization of the output depends on the notion of time, e.g., the total number of logins served per minute. Without the idea of time built into the platform, we will not be able to determine the passage of time, which is necessary for grouping elements together to materialize the result. Watermarks are that notion of time that will help us group unbounded data into discrete chunks. Numaflow supports watermarks out-of-the-box. Source vertices generate watermarks based on the event time, and propagate them to downstream vertices. Watermark is defined as \u201ca monotonically increasing timestamp of the oldest work/event not yet completed\u201d . In other words, if the watermark has advanced past some timestamp T, we are guaranteed by its monotonic property that no more processing will occur for on-time events at or before T.","title":"Watermarks"},{"location":"core-concepts/watermarks/#configuration","text":"","title":"Configuration"},{"location":"core-concepts/watermarks/#disable-watermark","text":"Watermarks can be disabled by setting disabled: true .","title":"Disable Watermark"},{"location":"core-concepts/watermarks/#idle-detection","text":"Watermark is assigned at the source; this means that the watermark will only progress if there is data coming into the source. There are many cases where the source might not be getting data, causing the source to idle (e.g., data is periodic, say once an hour). When the source is idling the reduce vertices won't emit results because the watermark is not moving.
To detect source idling and propagate the watermark, we can use the idle detection feature. The idle source watermark progressor will make sure that the watermark cannot progress beyond time.now() - maxDelay ( maxDelay is defined below). To enable this, we provide the following settings:","title":"Idle Detection"},{"location":"core-concepts/watermarks/#threshold","text":"Threshold is the duration after which a source is marked as Idle due to a lack of data flowing into the source.","title":"Threshold"},{"location":"core-concepts/watermarks/#stepinterval","text":"StepInterval is the duration between subsequent increments of the watermark as long as the source remains Idle. The default value is 0s, which means that once we detect an idle source, we will increment the watermark by IncrementBy every time we detect that our source is empty (in other words, this will be a very frequent update). Default Value: 0s","title":"StepInterval"},{"location":"core-concepts/watermarks/#incrementby","text":"IncrementBy is the duration to be added to the current watermark to progress the watermark when the source is idling.","title":"IncrementBy"},{"location":"core-concepts/watermarks/#example","text":"The example below will consider the source as idle after there is no data at the source for 5s. After 5s, an idle watermark will be emitted every 2s, each incrementing the watermark by 3s. watermark : idleSource : threshold : 5s # The pipeline will be considered idle if the source has not emitted any data for given threshold value. incrementBy : 3s # If source is found to be idle then increment the watermark by given incrementBy value. stepInterval : 2s # If source is idling then publish the watermark only when step interval has passed.","title":"Example"},{"location":"core-concepts/watermarks/#maxdelay","text":"Watermark assignments happen at the source.
Sources could be out of order, so sometimes we want to extend the window (default is 0s ) to wait before we start marking data as late-data. You can give more time for the system to wait for late data with maxDelay so that the late data within the specified time duration will be considered as data on-time. This means the watermark propagation will be delayed by maxDelay .","title":"maxDelay"},{"location":"core-concepts/watermarks/#example_1","text":"apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline spec : watermark : disabled : false # Optional, defaults to false. maxDelay : 60s # Optional, defaults to \"0s\".","title":"Example"},{"location":"core-concepts/watermarks/#watermark-api","text":"When processing data in user-defined functions , you can get the current watermark through an API. Watermark API is supported in all our client SDKs.","title":"Watermark API"},{"location":"core-concepts/watermarks/#example-golang","text":"// Go func mapFn ( context context . Context , keys [] string , d mapper . Datum ) mapper . Messages { _ = d . EventTime () // Event time _ = d . Watermark () // Watermark ... ... }","title":"Example Golang"},{"location":"development/debugging/","text":"How To Debug \u00b6 To enable debug logs in a Vertex Pod, set environment variable NUMAFLOW_DEBUG to true for the Vertex. For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 100 duration : 1s - name : p1 udf : builtin : name : cat containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out To enable debug logs in the daemon pod, set environment variable NUMAFLOW_DEBUG to true for the daemon pod. 
For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : templates : daemon : containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" Profiling \u00b6 If your pipeline is running with NUMAFLOW_DEBUG then pprof is enabled in the Vertex Pod. You can also enable just pprof by setting NUMAFLOW_PPROF to true . For example, run the commands like below to profile memory usage for a Vertex Pod, a web page displaying the memory information will be automatically opened. # Port-forward kubectl port-forward simple-pipeline-p1-0-7jzbn 2469 go tool pprof -http localhost:8081 https+insecure://localhost:2469/debug/pprof/heap Tracing is also available with commands below. # Add optional \"&seconds=n\" to specify the duration. curl -skq https://localhost:2469/debug/pprof/trace?debug = 1 -o trace.out go tool trace -http localhost:8082 trace.out Debug Inside the Container \u00b6 When doing local development using command lines such as make start , or make image , the built numaflow docker image is based on alpine , which allows you to execute into the container for debugging with kubectl exec -it {pod-name} -c {container-name} -- sh . This is not allowed when running pipelines with official released images, as they are based on scratch .","title":"How To Debug"},{"location":"development/debugging/#how-to-debug","text":"To enable debug logs in a Vertex Pod, set environment variable NUMAFLOW_DEBUG to true for the Vertex. For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 100 duration : 1s - name : p1 udf : builtin : name : cat containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out To enable debug logs in the daemon pod, set environment variable NUMAFLOW_DEBUG to true for the daemon pod. 
For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : templates : daemon : containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\"","title":"How To Debug"},{"location":"development/debugging/#profiling","text":"If your pipeline is running with NUMAFLOW_DEBUG then pprof is enabled in the Vertex Pod. You can also enable just pprof by setting NUMAFLOW_PPROF to true . For example, run the commands like below to profile memory usage for a Vertex Pod, a web page displaying the memory information will be automatically opened. # Port-forward kubectl port-forward simple-pipeline-p1-0-7jzbn 2469 go tool pprof -http localhost:8081 https+insecure://localhost:2469/debug/pprof/heap Tracing is also available with commands below. # Add optional \"&seconds=n\" to specify the duration. curl -skq https://localhost:2469/debug/pprof/trace?debug = 1 -o trace.out go tool trace -http localhost:8082 trace.out","title":"Profiling"},{"location":"development/debugging/#debug-inside-the-container","text":"When doing local development using command lines such as make start , or make image , the built numaflow docker image is based on alpine , which allows you to execute into the container for debugging with kubectl exec -it {pod-name} -c {container-name} -- sh . This is not allowed when running pipelines with official released images, as they are based on scratch .","title":"Debug Inside the Container"},{"location":"development/development/","text":"Development \u00b6 This doc explains how to set up a development environment for Numaflow. Install required tools \u00b6 go 1.20+. git . kubectl . protoc 3.19 for compiling protocol buffers. pandoc 2.17 for generating API markdown. Node.js\u00ae for running the UI. yarn . A local Kubernetes cluster for development usage, pick either one of k3d , kind , or minikube . 
Example: Create a local Kubernetes cluster with kind \u00b6 # Install kind on macOS brew install kind # Create a cluster with default name kind kind create cluster # Get kubeconfig for the cluster kind export kubeconfig Metrics Server \u00b6 Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). Without the metrics-server , we will not be able to see the pods in the UI. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' Useful Commands \u00b6 make start Build the source code, image, and install the Numaflow controller in the numaflow-system namespace. make build Binaries are placed in ./dist . make manifests Regenerate all the manifests after making any base manifest changes. This is also covered by make codegen . make codegen Run after making changes to ./pkg/api/ . make test Run unit tests. make test-* Run one e2e test suite. e.g. make test-kafka-e2e to run the kafka e2e suite. make Test* Run one e2e test case. e.g. make TestKafkaSourceSink to run the TestKafkaSourceSink case in the kafka e2e suite. make image Build container image, and import it to k3d , kind , or minikube cluster if corresponding KUBECONFIG is sourced. make docs Convert the docs to GitHub pages, check if there's any error. make docs-serve Start an HTTP server on your local to host the docs generated Github pages.","title":"Development"},{"location":"development/development/#development","text":"This doc explains how to set up a development environment for Numaflow.","title":"Development"},{"location":"development/development/#install-required-tools","text":"go 1.20+. git . kubectl . protoc 3.19 for compiling protocol buffers. pandoc 2.17 for generating API markdown. Node.js\u00ae for running the UI. 
yarn . A local Kubernetes cluster for development usage, pick either one of k3d , kind , or minikube .","title":"Install required tools"},{"location":"development/development/#example-create-a-local-kubernetes-cluster-with-kind","text":"# Install kind on macOS brew install kind # Create a cluster with default name kind kind create cluster # Get kubeconfig for the cluster kind export kubeconfig","title":"Example: Create a local Kubernetes cluster with kind"},{"location":"development/development/#metrics-server","text":"Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). Without the metrics-server , we will not be able to see the pods in the UI. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]'","title":"Metrics Server"},{"location":"development/development/#useful-commands","text":"make start Build the source code, image, and install the Numaflow controller in the numaflow-system namespace. make build Binaries are placed in ./dist . make manifests Regenerate all the manifests after making any base manifest changes. This is also covered by make codegen . make codegen Run after making changes to ./pkg/api/ . make test Run unit tests. make test-* Run one e2e test suite. e.g. make test-kafka-e2e to run the kafka e2e suite. make Test* Run one e2e test case. e.g. make TestKafkaSourceSink to run the TestKafkaSourceSink case in the kafka e2e suite. make image Build container image, and import it to k3d , kind , or minikube cluster if corresponding KUBECONFIG is sourced. make docs Convert the docs to GitHub pages, check if there's any error. 
make docs-serve Start an HTTP server on your local to host the docs generated Github pages.","title":"Useful Commands"},{"location":"development/releasing/","text":"How To Release \u00b6 Release Branch \u00b6 Always create a release branch for the releases, for example branch release-0.5 is for all the v0.5.x versions release. If it's a new release branch, simply create a branch from main . Release Steps \u00b6 Cherry-pick fixes to the release branch, skip this step if it's the first release in the branch. Run make test to make sure all test cases pass locally. Push to remote branch, and make sure all the CI jobs pass. Run make prepare-release VERSION=v{x.y.z} to update version in manifests, where x.y.z is the expected new version. Follow the output of last step, to confirm if all the changes are expected, and then run make release VERSION=v{x.y.z} . Follow the output, push a new tag to the release branch, GitHub actions will automatically build and publish the new release, this will take around 10 minutes. Test the new release, make sure everything is running as expected, and then recreate a stable tag against the latest release. git tag -d stable git tag -a stable -m stable git push -d { your-remote } stable git push { your-remote } stable Find the new release tag, and edit the release notes.","title":"How To Release"},{"location":"development/releasing/#how-to-release","text":"","title":"How To Release"},{"location":"development/releasing/#release-branch","text":"Always create a release branch for the releases, for example branch release-0.5 is for all the v0.5.x versions release. If it's a new release branch, simply create a branch from main .","title":"Release Branch"},{"location":"development/releasing/#release-steps","text":"Cherry-pick fixes to the release branch, skip this step if it's the first release in the branch. Run make test to make sure all test cases pass locally. Push to remote branch, and make sure all the CI jobs pass. 
Run make prepare-release VERSION=v{x.y.z} to update version in manifests, where x.y.z is the expected new version. Follow the output of last step, to confirm if all the changes are expected, and then run make release VERSION=v{x.y.z} . Follow the output, push a new tag to the release branch, GitHub actions will automatically build and publish the new release, this will take around 10 minutes. Test the new release, make sure everything is running as expected, and then recreate a stable tag against the latest release. git tag -d stable git tag -a stable -m stable git push -d { your-remote } stable git push { your-remote } stable Find the new release tag, and edit the release notes.","title":"Release Steps"},{"location":"development/static-code-analysis/","text":"Static Code Analysis \u00b6 We use the following static code analysis tools: golangci-lint for compile time linting. Snyk for image scanning. These are at least run daily or on each pull request.","title":"Static Code Analysis"},{"location":"development/static-code-analysis/#static-code-analysis","text":"We use the following static code analysis tools: golangci-lint for compile time linting. Snyk for image scanning. These are at least run daily or on each pull request.","title":"Static Code Analysis"},{"location":"operations/controller-configmap/","text":"Controller ConfigMap \u00b6 The controller ConfigMap is used for controller-wide settings. For a detailed example, please see numaflow-controller-config.yaml . Configuration Structure \u00b6 The configuration should be under controller-config.yaml key in the ConfigMap, as a string in yaml format: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | ... isbsvc: jetstream: ... Default Controller Configuration \u00b6 Currently, we support configuring the init and main container resources for steps across all the pipelines. 
The configuration is under defaults key in the ConfigMap. For example, to set the default container resources for steps across all the pipelines: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | limits: memory: \"256Mi\" cpu: \"200m\" requests: memory: \"128Mi\" cpu: \"100m\" ISB Service Configuration \u00b6 One of the important configuration items in the ConfigMap is about ISB Service . We currently use 3rd party technologies such as JetStream to implement ISB Services, if those applications have new releases, to make them available in Numaflow, the new versions need to be added in the ConfigMap. For example, there's a new Nats JetStream version x.y.x available, a new version configuration like below needs to be added before it can be referenced in the InterStepBufferService spec. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | isbsvc: jetstream: versions: - version: x.y.x # Name it whatever you want, it will be referenced in the InterStepBufferService spec. natsImage: nats:x.y.x metricsExporterImage: natsio/prometheus-nats-exporter:0.9.1 configReloaderImage: natsio/nats-server-config-reloader:0.7.0 startCommand: /nats-server","title":"Controller Configuration"},{"location":"operations/controller-configmap/#controller-configmap","text":"The controller ConfigMap is used for controller-wide settings. For a detailed example, please see numaflow-controller-config.yaml .","title":"Controller ConfigMap"},{"location":"operations/controller-configmap/#configuration-structure","text":"The configuration should be under controller-config.yaml key in the ConfigMap, as a string in yaml format: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | ... 
isbsvc: jetstream: ...","title":"Configuration Structure"},{"location":"operations/controller-configmap/#default-controller-configuration","text":"Currently, we support configuring the init and main container resources for steps across all the pipelines. The configuration is under defaults key in the ConfigMap. For example, to set the default container resources for steps across all the pipelines: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | limits: memory: \"256Mi\" cpu: \"200m\" requests: memory: \"128Mi\" cpu: \"100m\"","title":"Default Controller Configuration"},{"location":"operations/controller-configmap/#isb-service-configuration","text":"One of the important configuration items in the ConfigMap is about ISB Service . We currently use 3rd party technologies such as JetStream to implement ISB Services, if those applications have new releases, to make them available in Numaflow, the new versions need to be added in the ConfigMap. For example, there's a new Nats JetStream version x.y.x available, a new version configuration like below needs to be added before it can be referenced in the InterStepBufferService spec. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | isbsvc: jetstream: versions: - version: x.y.x # Name it whatever you want, it will be referenced in the InterStepBufferService spec. natsImage: nats:x.y.x metricsExporterImage: natsio/prometheus-nats-exporter:0.9.1 configReloaderImage: natsio/nats-server-config-reloader:0.7.0 startCommand: /nats-server","title":"ISB Service Configuration"},{"location":"operations/grafana/","text":"Grafana \u00b6 Numaflow provides prometheus metrics on top of which you can build Grafana dashboard to monitor your pipeline. Setup Grafana \u00b6 (Pre-requisite) Follow Metrics to set up prometheus operator. 
Follow Prometheus Tutorial to install Grafana and visualize metrics. Sample Dashboard \u00b6 You can customize your own dashboard by selecting metrics that best describe the health of your pipeline. Below is a sample dashboard which includes some basic metrics. To use the sample dashboard, download the corresponding sample dashboard template , import(before importing change the uid of the datasource in json, issue link ) it to Grafana and use the dropdown menu at top-left of the dashboard to choose which pipeline/vertex/buffer metrics to display.","title":"Grafana"},{"location":"operations/grafana/#grafana","text":"Numaflow provides prometheus metrics on top of which you can build Grafana dashboard to monitor your pipeline.","title":"Grafana"},{"location":"operations/grafana/#setup-grafana","text":"(Pre-requisite) Follow Metrics to set up prometheus operator. Follow Prometheus Tutorial to install Grafana and visualize metrics.","title":"Setup Grafana"},{"location":"operations/grafana/#sample-dashboard","text":"You can customize your own dashboard by selecting metrics that best describe the health of your pipeline. Below is a sample dashboard which includes some basic metrics. To use the sample dashboard, download the corresponding sample dashboard template , import(before importing change the uid of the datasource in json, issue link ) it to Grafana and use the dropdown menu at top-left of the dashboard to choose which pipeline/vertex/buffer metrics to display.","title":"Sample Dashboard"},{"location":"operations/installation/","text":"Installation \u00b6 Numaflow can be installed in different scopes with different approaches. Cluster Scope \u00b6 A cluster scope installation watches and executes pipelines in all the namespaces in the cluster. Run following command line to install latest stable Numaflow in cluster scope. 
kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml If you use kustomize , use kustomization.yaml below. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/cluster-install?ref=stable # Or specify a version namespace : numaflow-system Namespace Scope \u00b6 A namespace scoped installation only watches and executes pipelines in the namespace it is installed (typically numaflow-system ). Configure the ConfigMap numaflow-cmd-params-config to achieve namespace scoped installation. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run in namespaced scope, defaults to false. namespaced : \"true\" Another approach to do namespace scoped installation is to add an argument --namespaced to the numaflow-controller and numaflow-server deployments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced If there are multiple namespace scoped installations in one cluster, potentially there will be backward compatibility issue when any of the installation gets upgraded to a new version that has new CRD definition. To avoid this issue, we suggest to use minimal CRD definition for namespaced installation, which does not have detailed property definitions, thus no CRD changes between different versions. # Minimal CRD kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/config/advanced-install/minimal-crds.yaml # Controller in namespaced scope kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/advanced-install/namespaced-controller-wo-crds.yaml If you use kustomize , kustomization.yaml looks like below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/advanced-install/minimal-crds?ref=stable # Or specify a version - https://github.com/numaproj/numaflow/config/advanced-install/namespaced-controller?ref=stable # Or specify a version namespace : numaflow-system Managed Namespace Scope \u00b6 A managed namespace installation watches and executes pipelines in a specific namespace. To do managed namespace installation, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run the controller and the UX server in namespaced scope, defaults to false. namespaced : \"true\" # The namespace that the controller and UX server watch when \"namespaced\" is true, defaults to the installation namespace. managed.namespace : numaflow-system Similarly, another approach is to add --managed-namespace and the specific namespace to the numaflow-controller and numaflow-server deployment arguments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced - --managed-namespace - my-namespace High Availability \u00b6 By default, the Numaflow controller is installed with Active-Passive HA strategy enabled, which means you can run the controller with multiple replicas (defaults to 1 in the manifests). There are some parameters can be tuned for the leader election mechanism of HA. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### The duration that non-leader candidates will wait to force acquire leadership. # This is measured against time of last observed ack. Default is 15 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.duration : 15s # ### The duration that the acting controlplane will retry refreshing leadership before giving up. # Default is 10 seconds. 
# The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.deadline : 10s ### The duration the LeaderElector clients should wait between tries of actions, which means every # this period of time, it tries to renew the lease. Default is 2 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.period : 2s These parameters are useful when you want to tune the frequency of leader election renewal calls to K8s API server, which are usually configured at a high priority level of API Priority and Fairness . To turn off HA, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to disable leader election for the controller, defaults to false controller.leader.election.disabled : \"true\" If HA is turned off, the controller deployment should not run with multiple replicas.","title":"Installation"},{"location":"operations/installation/#installation","text":"Numaflow can be installed in different scopes with different approaches.","title":"Installation"},{"location":"operations/installation/#cluster-scope","text":"A cluster scope installation watches and executes pipelines in all the namespaces in the cluster. Run following command line to install latest stable Numaflow in cluster scope. kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml If you use kustomize , use kustomization.yaml below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/cluster-install?ref=stable # Or specify a version namespace : numaflow-system","title":"Cluster Scope"},{"location":"operations/installation/#namespace-scope","text":"A namespace scoped installation only watches and executes pipelines in the namespace it is installed (typically numaflow-system ). Configure the ConfigMap numaflow-cmd-params-config to achieve namespace scoped installation. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run in namespaced scope, defaults to false. namespaced : \"true\" Another approach to do namespace scoped installation is to add an argument --namespaced to the numaflow-controller and numaflow-server deployments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced If there are multiple namespace scoped installations in one cluster, potentially there will be backward compatibility issue when any of the installation gets upgraded to a new version that has new CRD definition. To avoid this issue, we suggest to use minimal CRD definition for namespaced installation, which does not have detailed property definitions, thus no CRD changes between different versions. # Minimal CRD kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/config/advanced-install/minimal-crds.yaml # Controller in namespaced scope kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/advanced-install/namespaced-controller-wo-crds.yaml If you use kustomize , kustomization.yaml looks like below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/advanced-install/minimal-crds?ref=stable # Or specify a version - https://github.com/numaproj/numaflow/config/advanced-install/namespaced-controller?ref=stable # Or specify a version namespace : numaflow-system","title":"Namespace Scope"},{"location":"operations/installation/#managed-namespace-scope","text":"A managed namespace installation watches and executes pipelines in a specific namespace. To do managed namespace installation, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run the controller and the UX server in namespaced scope, defaults to false. namespaced : \"true\" # The namespace that the controller and UX server watch when \"namespaced\" is true, defaults to the installation namespace. managed.namespace : numaflow-system Similarly, another approach is to add --managed-namespace and the specific namespace to the numaflow-controller and numaflow-server deployment arguments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced - --managed-namespace - my-namespace","title":"Managed Namespace Scope"},{"location":"operations/installation/#high-availability","text":"By default, the Numaflow controller is installed with Active-Passive HA strategy enabled, which means you can run the controller with multiple replicas (defaults to 1 in the manifests). There are some parameters can be tuned for the leader election mechanism of HA. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### The duration that non-leader candidates will wait to force acquire leadership. # This is measured against time of last observed ack. Default is 15 seconds. 
# The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.duration : 15s # ### The duration that the acting controlplane will retry refreshing leadership before giving up. # Default is 10 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.deadline : 10s ### The duration the LeaderElector clients should wait between tries of actions, which means every # this period of time, it tries to renew the lease. Default is 2 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.period : 2s These parameters are useful when you want to tune the frequency of leader election renewal calls to K8s API server, which are usually configured at a high priority level of API Priority and Fairness . To turn off HA, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to disable leader election for the controller, defaults to false controller.leader.election.disabled : \"true\" If HA is turned off, the controller deployment should not run with multiple replicas.","title":"High Availability"},{"location":"operations/releases/","text":"Releases \u00b6 You can find the most recent version under Github Releases . Versioning \u00b6 Versions are expressed as vx.y.z (for example, v0.5.3 ), where x is the major version, y is the minor version, and z is the patch version, following Semantic Versioning terminology. Numaflow does not use Semantic Versioning. Minor versions may contain breaking changes. Patch versions only contain bug fixes and minor features. There's a stable tag, pointing to a latest stable release, usually it is the latest patch version. Release Cycle \u00b6 TBD as Numaflow is under active development. 
Nightly Build \u00b6 If you want to try out the new features on main branch, Numaflow provides nightly build images from main , the images are available in the format of quay.io/numaproj/numaflow:nightly-yyyyMMdd . Nightly build images expire in 30 days.","title":"Releases \u29c9"},{"location":"operations/releases/#releases","text":"You can find the most recent version under Github Releases .","title":"Releases"},{"location":"operations/releases/#versioning","text":"Versions are expressed as vx.y.z (for example, v0.5.3 ), where x is the major version, y is the minor version, and z is the patch version, following Semantic Versioning terminology. Numaflow does not use Semantic Versioning. Minor versions may contain breaking changes. Patch versions only contain bug fixes and minor features. There's a stable tag, pointing to a latest stable release, usually it is the latest patch version.","title":"Versioning"},{"location":"operations/releases/#release-cycle","text":"TBD as Numaflow is under active development.","title":"Release Cycle"},{"location":"operations/releases/#nightly-build","text":"If you want to try out the new features on main branch, Numaflow provides nightly build images from main , the images are available in the format of quay.io/numaproj/numaflow:nightly-yyyyMMdd . Nightly build images expire in 30 days.","title":"Nightly Build"},{"location":"operations/security/","text":"Security \u00b6 Controller \u00b6 Numaflow controller can be deployed in two scopes. It can be either at the Cluster level or at the Namespace level. When the Numaflow controller is deployed at the Namespace level, it will only have access to the Namespace resources. Pipeline \u00b6 Data Movement \u00b6 Data movement happens only within the namespace (no cross-namespaces). Numaflow provides the ability to encrypt data at rest and also in transit. Controller and Data Plane \u00b6 All communications between the controller and Numaflow pipeline components are encrypted. 
These are uni-directional read-only communications.","title":"Security"},{"location":"operations/security/#security","text":"","title":"Security"},{"location":"operations/security/#controller","text":"Numaflow controller can be deployed in two scopes. It can be either at the Cluster level or at the Namespace level. When the Numaflow controller is deployed at the Namespace level, it will only have access to the Namespace resources.","title":"Controller"},{"location":"operations/security/#pipeline","text":"","title":"Pipeline"},{"location":"operations/security/#data-movement","text":"Data movement happens only within the namespace (no cross-namespaces). Numaflow provides the ability to encrypt data at rest and also in transit.","title":"Data Movement"},{"location":"operations/security/#controller-and-data-plane","text":"All communications between the controller and Numaflow pipeline components are encrypted. These are uni-directional read-only communications.","title":"Controller and Data Plane"},{"location":"operations/validating-webhook/","text":"Validating Admission Webhook \u00b6 This validating webhook will prevent disallowed spec changes to immutable fields of Numaflow CRDs including Pipelines and InterStepBufferServices. It also prevents creating a CRD with a faulty spec. The user sees an error immediately returned by the server explaining why the request was denied. Installation \u00b6 To install the validating webhook, run the following command line: kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/validating-webhook-install.yaml Examples \u00b6 Currently, the validating webhook prevents updating the type of an InterStepBufferService from JetStream to Redis for example. 
Example spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : // change to redis and reapply will cause below error version : latest Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"InterStepBufferService\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"default\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"redis\\\":{\\\"native\\\":{\\\"version\\\":\\\"7.0.11\\\"}}}}\\n\" }} , \"spec\" : { \"jetstream\" :null, \"redis\" : { \"native\" : { \"version\" : \"7.0.11\" }}}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=interstepbufferservices\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=InterStepBufferService\" Name: \"default\" , Namespace: \"numaflow-system\" for : \"redis.yaml\" : error when patching \"redis.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Can not change ISB Service type from Jetstream to Redis There is also validation that prevents the interStepBufferServiceName of a Pipeline from being updated. 
Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"Pipeline\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"simple-pipeline\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"edges\\\":[{\\\"from\\\":\\\"in\\\",\\\"to\\\":\\\"cat\\\"},{\\\"from\\\":\\\"cat\\\",\\\"to\\\":\\\"out\\\"}],\\\"interStepBufferServiceName\\\":\\\"change\\\",\\\"vertices\\\":[{\\\"name\\\":\\\"in\\\",\\\"source\\\":{\\\"generator\\\":{\\\"duration\\\":\\\"1s\\\",\\\"rpu\\\":5}}},{\\\"name\\\":\\\"cat\\\",\\\"udf\\\":{\\\"builtin\\\":{\\\"name\\\":\\\"cat\\\"}}},{\\\"name\\\":\\\"out\\\",\\\"sink\\\":{\\\"log\\\":{}}}]}}\\n\" }} , \"spec\" : { \"interStepBufferServiceName\" : \"change\" , \"vertices\" : [{ \"name\" : \"in\" , \"source\" : { \"generator\" : { \"duration\" : \"1s\" , \"rpu\" :5 }}} , { \"name\" : \"cat\" , \"udf\" : { \"builtin\" : { \"name\" : \"cat\" }}} , { \"name\" : \"out\" , \"sink\" : { \"log\" : {}}}]}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=pipelines\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=Pipeline\" Name: \"simple-pipeline\" , Namespace: \"numaflow-system\" for : \"examples/1-simple-pipeline.yaml\" : error when patching \"examples/1-simple-pipeline.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Cannot update pipeline with different interStepBufferServiceName Other validations include: Pipeline: cannot change the type of an existing vertex cannot change the partition count of a reduce vertex cannot change the storage class of a reduce vertex etc. 
InterStepBufferService: cannot change the persistence configuration of an ISB Service etc.","title":"Validating Webhook"},{"location":"operations/validating-webhook/#validating-admission-webhook","text":"This validating webhook will prevent disallowed spec changes to immutable fields of Numaflow CRDs including Pipelines and InterStepBufferServices. It also prevents creating a CRD with a faulty spec. The user sees an error immediately returned by the server explaining why the request was denied.","title":"Validating Admission Webhook"},{"location":"operations/validating-webhook/#installation","text":"To install the validating webhook, run the following command line: kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/validating-webhook-install.yaml","title":"Installation"},{"location":"operations/validating-webhook/#examples","text":"Currently, the validating webhook prevents updating the type of an InterStepBufferService from JetStream to Redis for example. 
Example spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : // change to redis and reapply will cause below error version : latest Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"InterStepBufferService\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"default\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"redis\\\":{\\\"native\\\":{\\\"version\\\":\\\"7.0.11\\\"}}}}\\n\" }} , \"spec\" : { \"jetstream\" :null, \"redis\" : { \"native\" : { \"version\" : \"7.0.11\" }}}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=interstepbufferservices\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=InterStepBufferService\" Name: \"default\" , Namespace: \"numaflow-system\" for : \"redis.yaml\" : error when patching \"redis.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Can not change ISB Service type from Jetstream to Redis There is also validation that prevents the interStepBufferServiceName of a Pipeline from being updated. 
Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"Pipeline\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"simple-pipeline\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"edges\\\":[{\\\"from\\\":\\\"in\\\",\\\"to\\\":\\\"cat\\\"},{\\\"from\\\":\\\"cat\\\",\\\"to\\\":\\\"out\\\"}],\\\"interStepBufferServiceName\\\":\\\"change\\\",\\\"vertices\\\":[{\\\"name\\\":\\\"in\\\",\\\"source\\\":{\\\"generator\\\":{\\\"duration\\\":\\\"1s\\\",\\\"rpu\\\":5}}},{\\\"name\\\":\\\"cat\\\",\\\"udf\\\":{\\\"builtin\\\":{\\\"name\\\":\\\"cat\\\"}}},{\\\"name\\\":\\\"out\\\",\\\"sink\\\":{\\\"log\\\":{}}}]}}\\n\" }} , \"spec\" : { \"interStepBufferServiceName\" : \"change\" , \"vertices\" : [{ \"name\" : \"in\" , \"source\" : { \"generator\" : { \"duration\" : \"1s\" , \"rpu\" :5 }}} , { \"name\" : \"cat\" , \"udf\" : { \"builtin\" : { \"name\" : \"cat\" }}} , { \"name\" : \"out\" , \"sink\" : { \"log\" : {}}}]}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=pipelines\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=Pipeline\" Name: \"simple-pipeline\" , Namespace: \"numaflow-system\" for : \"examples/1-simple-pipeline.yaml\" : error when patching \"examples/1-simple-pipeline.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Cannot update pipeline with different interStepBufferServiceName Other validations include: Pipeline: cannot change the type of an existing vertex cannot change the partition count of a reduce vertex cannot change the storage class of a reduce vertex etc. 
InterStepBufferService: cannot change the persistence configuration of an ISB Service etc.","title":"Examples"},{"location":"operations/metrics/metrics/","text":"Metrics \u00b6 Numaflow provides the following prometheus metrics which we can use to monitor our pipeline and setup any alerts if needed. Golden Signals \u00b6 These metrics in combination can be used to determine the overall health of your pipeline Traffic \u00b6 These metrics can be used to determine throughput of your pipeline. Data-forward \u00b6 Metric name Metric type Labels Description forwarder_data_read_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages read by a given Vertex from an Inter-Step Buffer Partition forwarder_read_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes read by a given Vertex from an Inter-Step Buffer Partition forwarder_write_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages written to Inter-Step Buffer by a given Vertex forwarder_write_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes written to Inter-Step Buffer by a given Vertex forwarder_ack_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages acknowledged by a given Vertex from an Inter-Step Buffer Partition forwarder_drop_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages dropped by a given Vertex due to a full Inter-Step Buffer Partition forwarder_drop_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes dropped by a given Vertex due to a full Inter-Step Buffer Partition Kafka Source \u00b6 Metric name Metric type Labels Description kafka_source_read_total Counter pipeline= vertex= Provides the number of messages read 
by the Kafka Source Vertex/Processor. kafka_source_ack_total Counter pipeline= vertex= Provides the number of messages acknowledged by the Kafka Source Vertex/Processor Generator Source \u00b6 Metric name Metric type Labels Description tickgen_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Generator Source Vertex/Processor. Http Source \u00b6 Metric name Metric type Labels Description http_source_read_total Counter pipeline= vertex= Provides the number of messages read by the HTTP Source Vertex/Processor. Kafka Sink \u00b6 Metric name Metric type Labels Description kafka_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Kafka Sink Vertex/Processor Log Sink \u00b6 Metric name Metric type Labels Description log_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Log Sink Vertex/Processor Latency \u00b6 These metrics can be used to determine the latency of your pipeline. Metric name Metric type Labels Description pipeline_lag_milliseconds Gauge pipeline= Provides the pipeline processing lag in milliseconds watermark_cmp_now_milliseconds Gauge pipeline= Provides the Watermark compared with current time in milliseconds source_forwarder_transformer_processing_time Histogram pipeline= vertex= vertex_type= replica= partition_name= Provides a histogram distribution of the processing times of User-defined Source Transformer forwarder_udf_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of User-defined Functions. 
(UDF's) forwarder_forward_chunk_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of the forwarder function as a whole reduce_pnf_process_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the processing times of the reducer reduce_pnf_forward_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the forwarding times of the reducer Errors \u00b6 These metrics can be used to determine if there are any errors in the pipeline Metric name Metric type Labels Description forwarder_platform_error_total Counter pipeline= vertex= vertex_type= replica= Indicates any internal errors which could stop pipeline processing forwarder_read_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while reading messages by the forwarder forwarder_write_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while writing messages by the forwarder forwarder_ack_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while acknowledging messages by the forwarder kafka_source_offset_ack_errors Counter pipeline= vertex= Indicates any kafka acknowledgement errors kafka_sink_write_error_total Counter pipeline= vertex= Provides the number of errors while writing to the Kafka sink kafka_sink_write_timeout_total Counter pipeline= vertex= Provides the write timeouts while writing to the Kafka sink isb_jetstream_read_error_total Counter partition_name= Indicates any read errors with NATS Jetstream ISB isb_jetstream_write_error_total Counter partition_name= Indicates any write errors with NATS Jetstream ISB isb_redis_read_error_total Counter partition_name= Indicates any read errors with Redis ISB isb_redis_write_error_total Counter partition_name= Indicates any write errors with Redis ISB Saturation \u00b6 NATS JetStream ISB \u00b6 Metric name Metric 
type Labels Description isb_jetstream_isFull_total Counter buffer= Indicates if the ISB is full. A continual increase of this counter indicates that backpressure may be building up in the pipeline isb_jetstream_buffer_soft_usage Gauge buffer= Indicates the usage/utilization of a NATS Jetstream ISB isb_jetstream_buffer_solid_usage Gauge buffer= Indicates the solid usage of a NATS Jetstream ISB isb_jetstream_buffer_pending Gauge buffer= Indicates the number of pending messages at a given point in time. isb_jetstream_buffer_ack_pending Gauge buffer= Indicates the number of messages pending acknowledgement at a given point in time Redis ISB \u00b6 Metric name Metric type Labels Description isb_redis_isFull_total Counter buffer= Indicates if the ISB is full. A continual increase of this counter indicates that backpressure may be building up in the pipeline isb_redis_buffer_usage Gauge buffer= Indicates the usage/utilization of a Redis ISB isb_redis_consumer_lag Gauge buffer= Indicates the consumer lag of a Redis ISB Prometheus Operator for Scraping Metrics: \u00b6 You can follow the prometheus operator setup guide if you would like to use prometheus operator configured in your cluster. You can also set up prometheus operator via helm .
Configure the below Service Monitors for scraping your pipeline metrics: \u00b6 apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-pipeline-metrics spec : endpoints : - scheme : https port : metrics targetPort : 2469 tlsConfig : insecureSkipVerify : true selector : matchLabels : app.kubernetes.io/component : vertex app.kubernetes.io/managed-by : vertex-controller app.kubernetes.io/part-of : numaflow matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : Exists - key : numaflow.numaproj.io/vertex-name operator : Exists Configure the below Service Monitor if you use the NATS Jetstream ISB for your NATS Jetstream metrics: \u00b6 apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-isbsvc-jetstream-metrics spec : endpoints : - scheme : http port : metrics targetPort : 7777 selector : matchLabels : app.kubernetes.io/component : isbsvc app.kubernetes.io/managed-by : isbsvc-controller app.kubernetes.io/part-of : numaflow numaflow.numaproj.io/isbsvc-type : jetstream matchExpressions : - key : numaflow.numaproj.io/isbsvc-name operator : Exists","title":"Metrics"},{"location":"operations/metrics/metrics/#metrics","text":"Numaflow provides the following prometheus metrics which we can use to monitor our pipeline and setup any alerts if needed.","title":"Metrics"},{"location":"operations/metrics/metrics/#golden-signals","text":"These metrics in combination can be used to determine the overall health of your pipeline","title":"Golden Signals"},{"location":"operations/metrics/metrics/#traffic","text":"These metrics can be used to determine throughput of your pipeline.","title":"Traffic"},{"location":"operations/metrics/metrics/#data-forward","text":"Metric name Metric type Labels Description forwarder_data_read_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total 
number of messages read by a given Vertex from an Inter-Step Buffer Partition forwarder_read_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes read by a given Vertex from an Inter-Step Buffer Partition forwarder_write_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages written to Inter-Step Buffer by a given Vertex forwarder_write_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes written to Inter-Step Buffer by a given Vertex forwarder_ack_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages acknowledged by a given Vertex from an Inter-Step Buffer Partition forwarder_drop_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages dropped by a given Vertex due to a full Inter-Step Buffer Partition forwarder_drop_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes dropped by a given Vertex due to a full Inter-Step Buffer Partition","title":"Data-forward"},{"location":"operations/metrics/metrics/#kafka-source","text":"Metric name Metric type Labels Description kafka_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Kafka Source Vertex/Processor. 
kafka_source_ack_total Counter pipeline= vertex= Provides the number of messages acknowledged by the Kafka Source Vertex/Processor","title":"Kafka Source"},{"location":"operations/metrics/metrics/#generator-source","text":"Metric name Metric type Labels Description tickgen_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Generator Source Vertex/Processor.","title":"Generator Source"},{"location":"operations/metrics/metrics/#http-source","text":"Metric name Metric type Labels Description http_source_read_total Counter pipeline= vertex= Provides the number of messages read by the HTTP Source Vertex/Processor.","title":"Http Source"},{"location":"operations/metrics/metrics/#kafka-sink","text":"Metric name Metric type Labels Description kafka_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Kafka Sink Vertex/Processor","title":"Kafka Sink"},{"location":"operations/metrics/metrics/#log-sink","text":"Metric name Metric type Labels Description log_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Log Sink Vertex/Processor","title":"Log Sink"},{"location":"operations/metrics/metrics/#latency","text":"These metrics can be used to determine the latency of your pipeline. Metric name Metric type Labels Description pipeline_lag_milliseconds Gauge pipeline= Provides the pipeline processing lag in milliseconds watermark_cmp_now_milliseconds Gauge pipeline= Provides the Watermark compared with current time in milliseconds source_forwarder_transformer_processing_time Histogram pipeline= vertex= vertex_type= replica= partition_name= Provides a histogram distribution of the processing times of User-defined Source Transformer forwarder_udf_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of User-defined Functions. 
(UDF's) forwarder_forward_chunk_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of the forwarder function as a whole reduce_pnf_process_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the processing times of the reducer reduce_pnf_forward_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the forwarding times of the reducer","title":"Latency"},{"location":"operations/metrics/metrics/#errors","text":"These metrics can be used to determine if there are any errors in the pipeline Metric name Metric type Labels Description forwarder_platform_error_total Counter pipeline= vertex= vertex_type= replica= Indicates any internal errors which could stop pipeline processing forwarder_read_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while reading messages by the forwarder forwarder_write_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while writing messages by the forwarder forwarder_ack_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while acknowledging messages by the forwarder kafka_source_offset_ack_errors Counter pipeline= vertex= Indicates any kafka acknowledgement errors kafka_sink_write_error_total Counter pipeline= vertex= Provides the number of errors while writing to the Kafka sink kafka_sink_write_timeout_total Counter pipeline= vertex= Provides the write timeouts while writing to the Kafka sink isb_jetstream_read_error_total Counter partition_name= Indicates any read errors with NATS Jetstream ISB isb_jetstream_write_error_total Counter partition_name= Indicates any write errors with NATS Jetstream ISB isb_redis_read_error_total Counter partition_name= Indicates any read errors with Redis ISB isb_redis_write_error_total Counter partition_name= Indicates any write errors with Redis 
ISB","title":"Errors"},{"location":"operations/metrics/metrics/#saturation","text":"","title":"Saturation"},{"location":"operations/metrics/metrics/#nats-jetstream-isb","text":"Metric name Metric type Labels Description isb_jetstream_isFull_total Counter buffer= Indicates if the ISB is full. A continual increase of this counter indicates that backpressure may be building up in the pipeline isb_jetstream_buffer_soft_usage Gauge buffer= Indicates the usage/utilization of a NATS Jetstream ISB isb_jetstream_buffer_solid_usage Gauge buffer= Indicates the solid usage of a NATS Jetstream ISB isb_jetstream_buffer_pending Gauge buffer= Indicates the number of pending messages at a given point in time. isb_jetstream_buffer_ack_pending Gauge buffer= Indicates the number of messages pending acknowledgement at a given point in time","title":"NATS JetStream ISB"},{"location":"operations/metrics/metrics/#redis-isb","text":"Metric name Metric type Labels Description isb_redis_isFull_total Counter buffer= Indicates if the ISB is full. A continual increase of this counter indicates that backpressure may be building up in the pipeline isb_redis_buffer_usage Gauge buffer= Indicates the usage/utilization of a Redis ISB isb_redis_consumer_lag Gauge buffer= Indicates the consumer lag of a Redis ISB","title":"Redis ISB"},{"location":"operations/metrics/metrics/#prometheus-operator-for-scraping-metrics","text":"You can follow the prometheus operator setup guide if you would like to use prometheus operator configured in your cluster.
You can also set up prometheus operator via helm .","title":"Prometheus Operator for Scraping Metrics:"},{"location":"operations/metrics/metrics/#configure-the-below-service-monitors-for-scraping-your-pipeline-metrics","text":"apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-pipeline-metrics spec : endpoints : - scheme : https port : metrics targetPort : 2469 tlsConfig : insecureSkipVerify : true selector : matchLabels : app.kubernetes.io/component : vertex app.kubernetes.io/managed-by : vertex-controller app.kubernetes.io/part-of : numaflow matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : Exists - key : numaflow.numaproj.io/vertex-name operator : Exists","title":"Configure the below Service Monitors for scraping your pipeline metrics:"},{"location":"operations/metrics/metrics/#configure-the-below-service-monitor-if-you-use-the-nats-jetstream-isb-for-your-nats-jetstream-metrics","text":"apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-isbsvc-jetstream-metrics spec : endpoints : - scheme : http port : metrics targetPort : 7777 selector : matchLabels : app.kubernetes.io/component : isbsvc app.kubernetes.io/managed-by : isbsvc-controller app.kubernetes.io/part-of : numaflow numaflow.numaproj.io/isbsvc-type : jetstream matchExpressions : - key : numaflow.numaproj.io/isbsvc-name operator : Exists","title":"Configure the below Service Monitor if you use the NATS Jetstream ISB for your NATS Jetstream metrics:"},{"location":"operations/ui/ui-access-path/","text":"UI Access Path \u00b6 By default, the Numaflow UI server hosts the service at the root path / , i.e. localhost:8443 . If a user needs to access the UI server under a different path, this can be achieved with the following configuration.
This is useful when the UI is hosted behind a reverse proxy or ingress controller that requires a specific path. Configure server.base.href in the ConfigMap numaflow-cmd-params-config . apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Base href for Numaflow UI server, defaults to '/'. server.base.href : \"/app\" The configuration above will host the service at localhost:8443/app . Note that this new access path will work with or without a trailing slash.","title":"Access Path"},{"location":"operations/ui/ui-access-path/#ui-access-path","text":"By default, the Numaflow UI server hosts the service at the root path / , i.e. localhost:8443 . If a user needs to access the UI server under a different path, this can be achieved with the following configuration. This is useful when the UI is hosted behind a reverse proxy or ingress controller that requires a specific path. Configure server.base.href in the ConfigMap numaflow-cmd-params-config . apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Base href for Numaflow UI server, defaults to '/'. server.base.href : \"/app\" The configuration above will host the service at localhost:8443/app . Note that this new access path will work with or without a trailing slash.","title":"UI Access Path"},{"location":"operations/ui/authn/authentication/","text":"Authentication \u00b6 Numaflow UI server provides two approaches for authentication. SSO with Dex Local users There's also an option to disable authentication/authorization by setting server.disable.auth: \"true\" in the ConfigMap numaflow-cmd-params-config . In this case, everybody has full access and privileges to any features of the UI (not recommended).","title":"Overview"},{"location":"operations/ui/authn/authentication/#authentication","text":"Numaflow UI server provides two approaches for authentication.
SSO with Dex Local users There's also an option to disable authentication/authorization by setting server.disable.auth: \"true\" in the ConfigMap numaflow-cmd-params-config . In this case, everybody has full access and privileges to any features of the UI (not recommended).","title":"Authentication"},{"location":"operations/ui/authn/dex/","text":"Dex Server \u00b6 Numaflow comes with a Dex Server for authentication integration. Currently, the supported identity provider is Github. SSO configuration of Numaflow UI will require editing some configuration detailed below. 1. Register application for Github \u00b6 In Github, register a new OAuth application. The callback address should be the homepage of your Numaflow UI + /dex/callback . After registering this application, you will be given a client ID. You will need this value, and you will also need to generate a new client secret. 2. Configuring Numaflow \u00b6 First we need to configure server.disable.auth to false in the ConfigMap numaflow-cmd-params-config . This will enable authentication and authorization for the UX server. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Whether to disable authentication and authorization for the UX server, defaults to false. server.disable.auth : \"false\" # Next we need to configure the numaflow-dex-server-config ConfigMap. Set the organization name to the organization you created the application under and include the correct teams. This file will be read by the init container of the Dex server to generate the config it will serve. kind : ConfigMap apiVersion : v1 metadata : name : numaflow-dex-server-config data : config.yaml : | connectors: - type: github # https://dexidp.io/docs/connectors/github/ id: github name: GitHub config: clientID: $GITHUB_CLIENT_ID clientSecret: $GITHUB_CLIENT_SECRET orgs: - name: teams: - admin - readonly Finally we will need to create/update the numaflow-dex-secrets Secret.
You will need to add the client ID and secret you created earlier for the application here. apiVersion : v1 kind : Secret metadata : name : numaflow-dex-secrets stringData : # https://dexidp.io/docs/connectors/github/ dex-github-client-id : dex-github-client-secret : 3. Restarting Pods \u00b6 If you are enabling/disabling authorization and authentication for the Numaflow server, it will need to be restarted. Any changes or additions to the connectors in the numaflow-dex-server-config ConfigMap will need to be read and generated again requiring a restart as well.","title":"SSO with Dex"},{"location":"operations/ui/authn/dex/#dex-server","text":"Numaflow comes with a Dex Server for authentication integration. Currently, the supported identity provider is Github. SSO configuration of Numaflow UI will require editing some configuration detailed below.","title":"Dex Server"},{"location":"operations/ui/authn/dex/#1-register-application-for-github","text":"In Github, register a new OAuth application. The callback address should be the homepage of your Numaflow UI + /dex/callback . After registering this application, you will be given a client ID. You will need this value, and you will also need to generate a new client secret.","title":"1. Register application for Github"},{"location":"operations/ui/authn/dex/#2-configuring-numaflow","text":"First we need to configure server.disable.auth to false in the ConfigMap numaflow-cmd-params-config . This will enable authentication and authorization for the UX server. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Whether to disable authentication and authorization for the UX server, defaults to false. server.disable.auth : \"false\" # Next we need to configure the numaflow-dex-server-config ConfigMap. Set the organization name to the organization you created the application under and include the correct teams. This file will be read by the init container of the Dex server to generate the config it will serve.
kind : ConfigMap apiVersion : v1 metadata : name : numaflow-dex-server-config data : config.yaml : | connectors: - type: github # https://dexidp.io/docs/connectors/github/ id: github name: GitHub config: clientID: $GITHUB_CLIENT_ID clientSecret: $GITHUB_CLIENT_SECRET orgs: - name: teams: - admin - readonly Finally we will need to create/update the numaflow-dex-secrets Secret. You will need to add the client ID and secret you created earlier for the application here. apiVersion : v1 kind : Secret metadata : name : numaflow-dex-secrets stringData : # https://dexidp.io/docs/connectors/github/ dex-github-client-id : dex-github-client-secret : ","title":"2. Configuring Numaflow"},{"location":"operations/ui/authn/dex/#3-restarting-pods","text":"If you are enabling/disabling authorization and authentication for the Numaflow server, it will need to be restarted. Any changes or additions to the connectors in the numaflow-dex-server-config ConfigMap will need to be read and generated again requiring a restart as well.","title":"3. Restarting Pods"},{"location":"operations/ui/authn/local-users/","text":"Local Users \u00b6 In addition to the authentication using Dex, we also provide an authentication mechanism for local user based on JSON Web Token (JWT). NOTE \u00b6 When you create local users, each of those users will need additional RBAC rules set up, otherwise they will fall back to the default policy specified by policy.default field of the numaflow-server-rbac-config ConfigMap. Numaflow comes with a built-in admin user that has full access to the system. It is recommended to use admin user for initial configuration then switch to local users or configure SSO integration. 
Accessing with admin user \u00b6 A built-in admin user comes with a randomly generated password that is stored in numaflow-server-secrets Secret: Example \u00b6 kubectl get secret numaflow-server-secrets -n -o jsonpath = '{.data.admin\\.initial-password}' | base64 --decode Use the admin username and password obtained above to log in to the UI. Creating Users \u00b6 1. Adding the username \u00b6 Users can be created by updating the numaflow-server-local-user-config ConfigMap: Example \u00b6 apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Format: {username}.enabled: \"true\" bob.enabled: \"true\" 2. Generating the password \u00b6 When adding new users, it is necessary to generate a bcrypt hash of their password: Example \u00b6 # Format: htpasswd -bnBC 10 \"\" | tr -d ':\\n' htpasswd -bnBC 10 \"\" password | tr -d ':\\n' 3. Adding the password for the username \u00b6 To add the password generated above for the respective user, you can update the numaflow-server-secrets Secret: Example \u00b6 apiVersion: v1 kind: Secret metadata: name: numaflow-server-secrets type: Opaque stringData: # Format: {username}.password: bob.password: $2 y $10$0 TCvrnLHQsQtEJVdXNNL6eeXaxHmGnQO.R8zhh0Mwr2RM7s42knTK You can also update the password for admin user similarly, it will be considered over the initial password NOTE \u00b6 For the example above, the username is bob and the password is password . Disabling Users \u00b6 Users can be disabled by updating the numaflow-server-local-user-config ConfigMap, including the system generated admin user: Example \u00b6 apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Set the value to \"false\" to disable the user. bob.enabled: \"false\" Deleting Users \u00b6 Users can be deleted by removing the corresponding entries: 1. 
numaflow-server-local-user-config ConfigMap \u00b6 # Format: {username}.enabled: null kubectl patch configmap -n -p '{\"data\": {\"bob.enabled\": null}}' --type merge 2. numaflow-server-secrets Secret \u00b6 # Format: {username}.password: null kubectl patch secret -n -p '{\"data\": {\"bob.password\": null}}' --type merge","title":"Local Users"},{"location":"operations/ui/authn/local-users/#local-users","text":"In addition to the authentication using Dex, we also provide an authentication mechanism for local user based on JSON Web Token (JWT).","title":"Local Users"},{"location":"operations/ui/authn/local-users/#note","text":"When you create local users, each of those users will need additional RBAC rules set up, otherwise they will fall back to the default policy specified by policy.default field of the numaflow-server-rbac-config ConfigMap. Numaflow comes with a built-in admin user that has full access to the system. It is recommended to use admin user for initial configuration then switch to local users or configure SSO integration.","title":"NOTE"},{"location":"operations/ui/authn/local-users/#accessing-with-admin-user","text":"A built-in admin user comes with a randomly generated password that is stored in numaflow-server-secrets Secret:","title":"Accessing with admin user"},{"location":"operations/ui/authn/local-users/#example","text":"kubectl get secret numaflow-server-secrets -n -o jsonpath = '{.data.admin\\.initial-password}' | base64 --decode Use the admin username and password obtained above to log in to the UI.","title":"Example"},{"location":"operations/ui/authn/local-users/#creating-users","text":"","title":"Creating Users"},{"location":"operations/ui/authn/local-users/#1-adding-the-username","text":"Users can be created by updating the numaflow-server-local-user-config ConfigMap:","title":"1. 
Adding the username"},{"location":"operations/ui/authn/local-users/#example_1","text":"apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Format: {username}.enabled: \"true\" bob.enabled: \"true\"","title":"Example"},{"location":"operations/ui/authn/local-users/#2-generating-the-password","text":"When adding new users, it is necessary to generate a bcrypt hash of their password:","title":"2. Generating the password"},{"location":"operations/ui/authn/local-users/#example_2","text":"# Format: htpasswd -bnBC 10 \"\" | tr -d ':\\n' htpasswd -bnBC 10 \"\" password | tr -d ':\\n'","title":"Example"},{"location":"operations/ui/authn/local-users/#3-adding-the-password-for-the-username","text":"To add the password generated above for the respective user, you can update the numaflow-server-secrets Secret:","title":"3. Adding the password for the username"},{"location":"operations/ui/authn/local-users/#example_3","text":"apiVersion: v1 kind: Secret metadata: name: numaflow-server-secrets type: Opaque stringData: # Format: {username}.password: bob.password: $2 y $10$0 TCvrnLHQsQtEJVdXNNL6eeXaxHmGnQO.R8zhh0Mwr2RM7s42knTK You can also update the password for admin user similarly, it will be considered over the initial password","title":"Example"},{"location":"operations/ui/authn/local-users/#note_1","text":"For the example above, the username is bob and the password is password .","title":"NOTE"},{"location":"operations/ui/authn/local-users/#disabling-users","text":"Users can be disabled by updating the numaflow-server-local-user-config ConfigMap, including the system generated admin user:","title":"Disabling Users"},{"location":"operations/ui/authn/local-users/#example_4","text":"apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Set the value to \"false\" to disable the user. 
bob.enabled: \"false\"","title":"Example"},{"location":"operations/ui/authn/local-users/#deleting-users","text":"Users can be deleted by removing the corresponding entries:","title":"Deleting Users"},{"location":"operations/ui/authn/local-users/#1-numaflow-server-local-user-config-configmap","text":"# Format: {username}.enabled: null kubectl patch configmap -n -p '{\"data\": {\"bob.enabled\": null}}' --type merge","title":"1. numaflow-server-local-user-config ConfigMap"},{"location":"operations/ui/authn/local-users/#2-numaflow-server-secrets-secret","text":"# Format: {username}.password: null kubectl patch secret -n -p '{\"data\": {\"bob.password\": null}}' --type merge","title":"2. numaflow-server-secrets Secret"},{"location":"operations/ui/authz/rbac/","text":"Authorization \u00b6 Numaflow UI utilizes a role-based access control (RBAC) model to manage authorization, the RBAC policy and permissions are defined in the ConfigMap numaflow-server-rbac-config . There are two main sections in the ConfigMap. Rules \u00b6 Policies and groups are the two main entities defined in rules section, both of them work in conjunction with each other. The groups are used to define a set of users with the same permissions and the policies are used to define the specific permissions for these users or groups. # Policies go here p, role:admin, *, *, * p, role:readonly, *, *, GET # Groups go here g, admin, role:admin g, my-github-org:my-github-team, role:readonly Here we have defined two policies for the custom groups role:admin and role:readonly . The first policy allows the group role:admin to access all resources in all namespaces with all actions. The second policy allows the group role:readonly to access all resources in all namespaces with the GET action. To add a new policy , add a new line in the format: p, , , , User/Group : The user/group requesting access to a resource. This is the identifier extracted from the authentication token, such as a username, email address, or ID. 
Or could be a group defined in the groups section. Resource : The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. Action : The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. The namespace, resource and action supports a wildcard * as an allow all function. Few examples: a policy line p, test@test.com, *, *, POST would allow the user with the given email address to access all resources in all namespaces with the POST action. a policy line p, test_user, *, *, * would allow the user with the given username to access all resources in all namespaces with all actions. a policy line p, role:admin_ns, test_ns, *, * would allow the group role:admin_ns to access all resources in the namespace test_ns with all actions. a policy line p, test_user, test_ns, *, GET would allow the user with the given username to access all resources in the namespace test_ns with the GET action. Groups can be defined by adding a new line in the format: g, , Here user is the identifier extracted from the authentication token, such as a username, email address, or ID. And group is the name of the group to which the user is being added. These are useful for defining a set of users with the same permissions. The group can be used in the policy definition in place of the user. And thus any user added to the group will have the same permissions as the group. Few examples: a group line g, test@test.com, role:readonly would add the user with the given email address to the group role:readonly. a group line g, test_user, role:admin would add the user with the given username to the group role:admin. Configuration \u00b6 This defines certain properties for the Casbin enforcer. 
The properties are defined in the following format: rbac-conf.yaml: | policy.default: role:readonly policy.scopes: groups,email,username We see two properties defined here: policy.default : This defines the default role for a user. If a user does not have any roles defined, then this role will be used for the user. This is useful for defining a default role for all users. policy.scopes : The scopes field controls which authentication scopes to examine during RBAC enforcement. We can have multiple scopes, and the first scope that matches with the policy will be used. \"groups\", which means that the groups field of the user's token will be examined. This is the default value and is used if no scopes are defined. \"email\", which means that the email field of the user's token will be examined. \"username\", which means that the username field of the user's token will be examined. Multiple scopes can be provided as a comma-separated list, e.g. \"groups,email,username\". This scope information is used to extract the user information from the token and then used to enforce the policies. Thus it is important that the rules defined in the above section match the scopes expected in the configuration. Note : The rbac-conf.yaml file can be updated during runtime and the changes will be reflected immediately. This is useful for changing the default role for all users or adding a new scope to be used for RBAC enforcement.","title":"Authorization"},{"location":"operations/ui/authz/rbac/#authorization","text":"Numaflow UI utilizes a role-based access control (RBAC) model to manage authorization. The RBAC policy and permissions are defined in the ConfigMap numaflow-server-rbac-config . There are two main sections in the ConfigMap.","title":"Authorization"},{"location":"operations/ui/authz/rbac/#rules","text":"Policies and groups are the two main entities defined in the rules section, and they work in conjunction with each other.
The groups are used to define a set of users with the same permissions and the policies are used to define the specific permissions for these users or groups. # Policies go here p, role:admin, *, *, * p, role:readonly, *, *, GET # Groups go here g, admin, role:admin g, my-github-org:my-github-team, role:readonly Here we have defined two policies for the custom groups role:admin and role:readonly . The first policy allows the group role:admin to access all resources in all namespaces with all actions. The second policy allows the group role:readonly to access all resources in all namespaces with the GET action. To add a new policy , add a new line in the format: p, , , , User/Group : The user/group requesting access to a resource. This is the identifier extracted from the authentication token, such as a username, email address, or ID. Or could be a group defined in the groups section. Resource : The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. Action : The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. The namespace, resource and action supports a wildcard * as an allow all function. Few examples: a policy line p, test@test.com, *, *, POST would allow the user with the given email address to access all resources in all namespaces with the POST action. a policy line p, test_user, *, *, * would allow the user with the given username to access all resources in all namespaces with all actions. a policy line p, role:admin_ns, test_ns, *, * would allow the group role:admin_ns to access all resources in the namespace test_ns with all actions. a policy line p, test_user, test_ns, *, GET would allow the user with the given username to access all resources in the namespace test_ns with the GET action. 
Groups can be defined by adding a new line in the format: g, , Here user is the identifier extracted from the authentication token, such as a username, email address, or ID, and group is the name of the group to which the user is being added. Groups are useful for defining a set of users with the same permissions. The group can be used in the policy definition in place of the user, so any user added to the group will have the same permissions as the group. A few examples: a group line g, test@test.com, role:readonly would add the user with the given email address to the group role:readonly. a group line g, test_user, role:admin would add the user with the given username to the group role:admin.","title":"Rules"},{"location":"operations/ui/authz/rbac/#configuration","text":"This defines certain properties for the Casbin enforcer. The properties are defined in the following format: rbac-conf.yaml: | policy.default: role:readonly policy.scopes: groups,email,username We see two properties defined here: policy.default : This defines the default role for a user. If a user does not have any roles defined, then this role will be used for the user. This is useful for defining a default role for all users. policy.scopes : The scopes field controls which authentication scopes to examine during RBAC enforcement. We can have multiple scopes, and the first scope that matches with the policy will be used. \"groups\", which means that the groups field of the user's token will be examined. This is the default value and is used if no scopes are defined. \"email\", which means that the email field of the user's token will be examined. \"username\", which means that the username field of the user's token will be examined. Multiple scopes can be provided as a comma-separated list, e.g. \"groups,email,username\". This scope information is used to extract the user information from the token and then used to enforce the policies.
Thus it is important that the rules defined in the above section match the scopes expected in the configuration. Note : The rbac-conf.yaml file can be updated during runtime and the changes will be reflected immediately. This is useful for changing the default role for all users or adding a new scope to be used for RBAC enforcement.","title":"Configuration"},{"location":"specifications/authorization/","text":"UI Authorization \u00b6 We utilize a role-based access control (RBAC) model to manage authorization in Numaflow. Along with this we utilize Casbin as a library for the implementation of these policies. Permissions and Policies \u00b6 The following model configuration is given to define the policies. The policy model is defined in the Casbin policy language. [request_definition] r = sub, res, obj, act [policy_definition] p = sub, res, obj, act [role_definition] g = _, _ [policy_effect] e = some(where (p.eft == allow)) [matchers] m = g(r.sub, p.sub) && patternMatch(r.res, p.res) && stringMatch(r.obj, p.obj) && stringMatch(r.act, p.act) The policy model consists of the following sections: request_definition: The request definition section defines the request attributes. In our case, the request attributes are the user, resource, object, and action. policy_definition: The policy definition section defines the policy attributes. In our case, the policy attributes are the user, resource, object, and action. role_definition: The role definition section defines the role attributes. In our case, the role attributes are the user and role. policy_effect: The policy effect defines what action is to be taken on auth. In our case, the policy effect is allow. matchers: The matcher section defines the matching logic which decides whether a given request matches any policy or not. These matches are done in the order of the definition above and short-circuit at the first failure. There are custom functions like patternMatch and stringMatch.
patternMatch: This function is used to match the resource with the policy resource using OS path pattern matching, with added support for wildcards for allowAll. stringMatch: This function is used to match the object and action and uses a simple exact string match. This also supports wildcards for allowAll. The policy model follows this structure for all policies defined and all requests made to the UI server: User: The user requesting access to a resource. This could be any identifier, such as a username, email address, or ID. Resource: The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. We have wildcard \"*\" to allow access to all namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. We have wildcard \"*\" to allow access to all resources. Action: The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. We have wildcard \"*\" to allow access to all actions. Refer to the RBAC to learn more about how to configure authorization policies for Numaflow UI.","title":"UI Authorization"},{"location":"specifications/authorization/#ui-authorization","text":"We utilize a role-based access control (RBAC) model to manage authorization in Numaflow. Along with this we utilize Casbin as a library for the implementation of these policies.","title":"UI Authorization"},{"location":"specifications/authorization/#permissions-and-policies","text":"The following model configuration is given to define the policies. The policy model is defined in the Casbin policy language.
[request_definition] r = sub, res, obj, act [policy_definition] p = sub, res, obj, act [role_definition] g = _, _ [policy_effect] e = some(where (p.eft == allow)) [matchers] m = g(r.sub, p.sub) && patternMatch(r.res, p.res) && stringMatch(r.obj, p.obj) && stringMatch(r.act, p.act) The policy model consists of the following sections: request_definition: The request definition section defines the request attributes. In our case, the request attributes are the user, resource, object, and action. policy_definition: The policy definition section defines the policy attributes. In our case, the policy attributes are the user, resource, object, and action. role_definition: The role definition section defines the role attributes. In our case, the role attributes are the user and role. policy_effect: The policy effect defines what action is to be taken on auth. In our case, the policy effect is allow. matchers: The matcher section defines the matching logic which decides whether a given request matches any policy or not. These matches are done in the order of the definition above and short-circuit at the first failure. There are custom functions like patternMatch and stringMatch. patternMatch: This function is used to match the resource with the policy resource using OS path pattern matching, with added support for wildcards for allowAll. stringMatch: This function is used to match the object and action and uses a simple exact string match. This also supports wildcards for allowAll. The policy model follows this structure for all policies defined and all requests made to the UI server: User: The user requesting access to a resource. This could be any identifier, such as a username, email address, or ID. Resource: The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. We have wildcard \"*\" to allow access to all namespaces.
Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. We have wildcard \"*\" to allow access to all resources. Action: The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. We have wildcard \"*\" to allow access to all actions. Refer to the RBAC to learn more about how to configure authorization policies for Numaflow UI.","title":"Permissions and Policies"},{"location":"specifications/autoscaling/","text":"Autoscaling \u00b6 Scale Subresource is enabled in Vertex Custom Resource , which makes it possible to scale vertex pods. Specifically, it is enabled by adding the following comment to the Vertex struct model, from which the corresponding CRD definition is automatically generated. // +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector Pod management is done by the vertex controller. With the scale subresource implemented, a vertex object can be scaled by either horizontal or vertical pod autoscaling. Numaflow Autoscaling \u00b6 The out-of-the-box Numaflow autoscaling is done by a scaling component running in the controller manager; you can find the source code here . The autoscaling strategy is implemented differently for each type of vertex. Source Vertices \u00b6 For source vertices, we define a target time (in seconds) to finish processing the pending messages based on the processing rate (tps) of the vertex. pendingMessages / processingRate = targetSeconds For example, if targetSeconds is 3, the current replica number is 2 , the current tps is 10000/second, and there are 60000 pending messages, we calculate the desired replica number as follows: desiredReplicas = 60000 / (3 * (10000 / 2)) = 4 Numaflow autoscaling does not work for source vertices that cannot calculate pending messages.
UDF and Sink Vertices \u00b6 The pending message count of a UDF or Sink vertex does not represent the real number because of the restrained writing caused by back pressure, so we use a different model to achieve autoscaling for them. For each of the vertices, we calculate the available buffer length, and consider it to be contributed by all the replicas, so that we can get each replica's contribution. availableBufferLength = totalBufferLength * bufferLimit(%) - pendingMessages singleReplicaContribution = availableBufferLength / currentReplicas We define a target available buffer length, and then calculate how many replicas are needed to achieve the target. desiredReplicas = targetAvailableBufferLength / singleReplicaContribution Back Pressure Impact \u00b6 Back pressure is considered during autoscaling (which is only available for Source and UDF vertices). We measure the back pressure by defining a threshold of the buffer usage. For example, if the total buffer length is 50000, the buffer limit is 80%, and the back pressure threshold is 90%, then if the average number of pending messages over the past period of time is more than 36000 (50000 * 80% * 90%) , we consider there to be back pressure. When the calculated desired replicas is greater than current replicas: For vertices which have back pressure from the directly connected vertices, instead of increasing the replica number, we decrease it by 1; For vertices which have back pressure in any of their downstream vertices, the replica number remains unchanged. Autoscaling Tuning \u00b6 Numaflow autoscaling can be tuned by updating some parameters; find the details at the doc .","title":"Autoscaling"},{"location":"specifications/autoscaling/#autoscaling","text":"Scale Subresource is enabled in Vertex Custom Resource , which makes it possible to scale vertex pods. Specifically, it is enabled by adding the following comment to the Vertex struct model, from which the corresponding CRD definition is automatically generated.
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector Pods management is done by vertex controller. With scale subresource implemented, vertex object can be scaled by either horizontal or vertical pod autoscaling.","title":"Autoscaling"},{"location":"specifications/autoscaling/#numaflow-autoscaling","text":"The out of box Numaflow autoscaling is done by a scaling component running in the controller manager, you can find the source code here . The autoscaling strategy is implemented according to different type of vertices.","title":"Numaflow Autoscaling"},{"location":"specifications/autoscaling/#source-vertices","text":"For source vertices, we define a target time (in seconds) to finish processing the pending messages based on the processing rate (tps) of the vertex. pendingMessages / processingRate = targetSeconds For example, if targetSeconds is 3, current replica number is 2 , current tps is 10000/second, and the pending messages is 60000, so we calculate the desired replica number as following: desiredReplicas = 60000 / (3 * (10000 / 2)) = 4 Numaflow autoscaling does not work for those source vertices that can not calculate pending messages.","title":"Source Vertices"},{"location":"specifications/autoscaling/#udf-and-sink-vertices","text":"Pending messages of a UDF or Sink vertex does not represent the real number because of the restrained writing caused by back pressure, so we use a different model to achieve autoscaling for them. For each of the vertices, we calculate the available buffer length, and consider it is contributed by all the replicas, so that we can get each replica's contribution. availableBufferLength = totalBufferLength * bufferLimit(%) - pendingMessages singleReplicaContribution = availableBufferLength / currentReplicas We define a target available buffer length, and then calculate how many replicas are needed to achieve the target. 
desiredReplicas = targetAvailableBufferLength / singleReplicaContribution","title":"UDF and Sink Vertices"},{"location":"specifications/autoscaling/#back-pressure-impact","text":"Back pressure is considered during autoscaling (which is only available for Source and UDF vertices). We measure the back pressure by defining a threshold of the buffer usage. For example, the total buffer length is 50000, buffer limit is 80%, and the back pressure threshold is 90%, if in the past period of time, the average pending messages is more than 36000 (50000 * 80% * 90%) , we consider there's back pressure. When the calculated desired replicas is greater than current replicas: For vertices which have back pressure from the directly connected vertices, instead of increasing the replica number, we decrease it by 1; For vertices which have back pressure in any of its downstream vertices, the replica number remains unchanged.","title":"Back Pressure Impact"},{"location":"specifications/autoscaling/#autoscaling-tuning","text":"Numaflow autoscaling can be tuned by updating some parameters, find the details at the doc .","title":"Autoscaling Tuning"},{"location":"specifications/controllers/","text":"Controllers \u00b6 Currently in Numaflow , there are 3 CRDs introduced, each one has a corresponding controller. interstepbufferservices.numaflow.numaproj.io pipelines.numaflow.numaproj.io vertices.numaflow.numaproj.io The source code of the controllers is located at ./pkg/reconciler/ . Inter-Step Buffer Service Controller \u00b6 Inter-Step Buffer Service Controller is used to watch InterStepBufferService object, depending on the spec of the object, it might install services (such as JetStream, or Redis) in the namespace, or simply provide the configuration of the InterStepBufferService (for example, when an external redis ISB Service is given). 
Pipeline Controller \u00b6 Pipeline Controller is used to watch Pipeline objects, it does following major things when there's a pipeline object created. Spawn a Kubernetes Job to create buffers and buckets in the Inter-Step Buffer Services . Create Vertex objects according to .spec.vertices defined in Pipeline object. Create some other Kubernetes objects used for the Pipeline, such as a Deployment and a Service for daemon service application. Vertex Controller \u00b6 Vertex controller watches the Vertex objects, based on the replica defined in the spec, creates a number of pods to run the workloads.","title":"Controllers"},{"location":"specifications/controllers/#controllers","text":"Currently in Numaflow , there are 3 CRDs introduced, each one has a corresponding controller. interstepbufferservices.numaflow.numaproj.io pipelines.numaflow.numaproj.io vertices.numaflow.numaproj.io The source code of the controllers is located at ./pkg/reconciler/ .","title":"Controllers"},{"location":"specifications/controllers/#inter-step-buffer-service-controller","text":"Inter-Step Buffer Service Controller is used to watch InterStepBufferService object, depending on the spec of the object, it might install services (such as JetStream, or Redis) in the namespace, or simply provide the configuration of the InterStepBufferService (for example, when an external redis ISB Service is given).","title":"Inter-Step Buffer Service Controller"},{"location":"specifications/controllers/#pipeline-controller","text":"Pipeline Controller is used to watch Pipeline objects, it does following major things when there's a pipeline object created. Spawn a Kubernetes Job to create buffers and buckets in the Inter-Step Buffer Services . Create Vertex objects according to .spec.vertices defined in Pipeline object. 
Create some other Kubernetes objects used for the Pipeline, such as a Deployment and a Service for the daemon service application.","title":"Pipeline Controller"},{"location":"specifications/controllers/#vertex-controller","text":"The Vertex controller watches Vertex objects and, based on the replica count defined in the spec, creates a number of pods to run the workloads.","title":"Vertex Controller"},{"location":"specifications/edges-buffers-buckets/","text":"Edges, Buffers and Buckets \u00b6 This document describes the concepts of Edge , Buffer and Bucket in a pipeline. Edges \u00b6 Edge is the connection between the vertices; specifically, an edge is defined in the pipeline spec under .spec.edges . No matter if the to vertex is a Map, or a Reduce with multiple partitions, it is considered as one edge. In the following pipeline, there are 3 edges defined ( in - atoi , atoi - compute-sum , compute-sum - out ). apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : even-odd-sum spec : vertices : - name : in source : http : {} - name : atoi scale : min : 1 udf : container : image : quay.io/numaio/numaflow-go/map-even-odd:v0.5.0 - name : compute-sum partitions : 2 udf : container : image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true - name : out scale : min : 1 sink : log : {} edges : - from : in to : atoi - from : atoi to : compute-sum - from : compute-sum to : out Each edge could have a name for internal usage; the naming convention is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Buffers \u00b6 Buffer is InterStepBuffer . Each buffer has an owner, which is the vertex that reads from it. Each udf and sink vertex in a pipeline owns a group of partitioned buffers. Each buffer has a name with the naming convention {pipeline-name}-{vertex-name}-{index} , where the index is the partition index, starting from 0. This naming convention applies to the buffers of both map and reduce udf vertices.
When multiple vertices connect to the same vertex, if the to vertex is a Map, the data from all the from vertices will be forwarded to the group of partitioned buffers in a round-robin fashion. If the to vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitioned buffers based on the partitioning key. A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitioned buffers in a round-robin fashion. If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitioned buffers based on the partitioning key. Buckets \u00b6 Bucket is a K/V store (or a pair of stores) used for watermark propagation. There are 3 types of buckets in a pipeline: Edge Bucket : Each edge has a bucket, used for edge watermark propagation, no matter if the vertex that the edge leads to is a Map or a Reduce. The naming convention of an edge bucket is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Source Bucket : Each Source vertex has a source bucket, used for source watermark propagation. The naming convention of a source bucket is {pipeline-name}-{vertex-name}-SOURCE . Sink Bucket : Sitting on the right side of a Sink vertex, used for sink watermark. The naming convention of a sink bucket is {pipeline-name}-{vertex-name}-SINK . Diagrams \u00b6 Map Reduce","title":"Edges, Buffers and Buckets"},{"location":"specifications/edges-buffers-buckets/#edges-buffers-and-buckets","text":"This document describes the concepts of Edge , Buffer and Bucket in a pipeline.","title":"Edges, Buffers and Buckets"},{"location":"specifications/edges-buffers-buckets/#edges","text":"Edge is the connection between the vertices; specifically, an edge is defined in the pipeline spec under .spec.edges .
No matter if the to vertex is a Map, or a Reduce with multiple partitions, it is considered as one edge. In the following pipeline, there are 3 edges defined ( in - atoi , atoi - compute-sum , compute-sum - out ). apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : even-odd-sum spec : vertices : - name : in source : http : {} - name : atoi scale : min : 1 udf : container : image : quay.io/numaio/numaflow-go/map-even-odd:v0.5.0 - name : compute-sum partitions : 2 udf : container : image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true - name : out scale : min : 1 sink : log : {} edges : - from : in to : atoi - from : atoi to : compute-sum - from : compute-sum to : out Each edge could have a name for internal usage; the naming convention is {pipeline-name}-{from-vertex-name}-{to-vertex-name} .","title":"Edges"},{"location":"specifications/edges-buffers-buckets/#buffers","text":"Buffer is InterStepBuffer . Each buffer has an owner, which is the vertex that reads from it. Each udf and sink vertex in a pipeline owns a group of partitioned buffers. Each buffer has a name with the naming convention {pipeline-name}-{vertex-name}-{index} , where the index is the partition index, starting from 0. This naming convention applies to the buffers of both map and reduce udf vertices. When multiple vertices connect to the same vertex, if the to vertex is a Map, the data from all the from vertices will be forwarded to the group of partitioned buffers in a round-robin fashion. If the to vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitioned buffers based on the partitioning key. A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitioned buffers in a round-robin fashion.
If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitioned buffers based on the partitioning key.","title":"Buffers"},{"location":"specifications/edges-buffers-buckets/#buckets","text":"Bucket is a K/V store (or a pair of stores) used for watermark propagation. There are 3 types of buckets in a pipeline: Edge Bucket : Each edge has a bucket, used for edge watermark propagation, no matter if the vertex that the edge leads to is a Map or a Reduce. The naming convention of an edge bucket is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Source Bucket : Each Source vertex has a source bucket, used for source watermark propagation. The naming convention of a source bucket is {pipeline-name}-{vertex-name}-SOURCE . Sink Bucket : Sitting on the right side of a Sink vertex, used for sink watermark. The naming convention of a sink bucket is {pipeline-name}-{vertex-name}-SINK .","title":"Buckets"},{"location":"specifications/edges-buffers-buckets/#diagrams","text":"Map Reduce","title":"Diagrams"},{"location":"specifications/overview/","text":"Numaflow Dataplane High-Level Architecture \u00b6 Synopsis \u00b6 Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. Reliable data processing is highly desirable and exactly-once semantics is often required by many data processing applications. This document describes the use cases, requirements, and design for providing exactly-once semantics with Numaflow. Use Cases Continuous stream processing for unbounded streams. Efficient batch processing for bounded streams and data sets. Definitions \u00b6 Pipeline A pipeline contains multiple processors, which include source processors, data processors, and sink processors. These processors are not connected directly, but through inter-step buffers . 
Source The actual source for the data (not a step in the Numaflow). Sink The actual sink for the data (not a step in the Numaflow). Inter-Step Buffers Inter-step buffers are used to connect processors and they should support the following Durability Support offsets Support transactions for Exactly-Once forwarding Concurrent operations (reader group) Ability to explicitly ack each data/offset Claim pending messages (read but never acknowledged) Ability to trim (buffer size controls) Fast (high throughput and low latency) Ability to query buffer information (observability) Source Processors Source processors are the initial processors that ingest data into the Numaflow. They sit in front of the first data processor, ingest the data from the data source, and forward to inter-step buffers. Logic: Read data from the data source; Write to the inter-step buffer; Ack the data in the data source. Data Processors The data processors execute idempotent user-defined functions and will be sandwiched between source and sink processors. There could be one or more data processors. A data processor only reads from one upstream buffer, but it might write to multiple downstream buffers. Logic: Read data from the upstream inter-step buffer; Process data; Write to downstream inter-step buffers; Ack the data in the upstream buffer. Sink Processors Sink processors are the final processors used to write processed data to sinks. A sink processor only reads from one upstream buffer and writes to a single sink. Logic: Read data from the upstream inter-step buffer; Write to the sink; Ack the data in the upstream buffer. UDF (User-defined Function) User-defined Functions run in data processors. UDFs implement a unified interface to process data. UDFs are typically implemented by end-users, but there will be some built-in functions that can be used without writing any code.
UDFs can be implemented in different languages, a pseudo-interface might look like the below, where the function signatures include step context and input payload and returns a result. The Result contains the processed data as well as optional labels that will be exposed to the DSL to do complex conditional forwarding. Process(key, message, context) (result, err) UDFs should only focus on user logic, buffer message reading and writing should not be handled by this function. UDFs should be idempotent. Matrix of Operations Source Processor Sink ReadFromBuffer Read From Source Generic Generic CallUDF Void User Defined Void Forward Generic Generic Write To Sink Ack Ack Source Generic Generic Requirements \u00b6 Exactly once semantics from the source processor to the sink processor. Be able to support a variety of data buffering technologies. Numaflow is restartable if aborted or steps fail while preserving exactly-once semantics. Do not generate more output than can be used by the next stage in a reasonable amount of time, i.e., the size of buffers between steps should be limited, (aka backpressure). User code should be isolated from offset management, restart, exactly once, backpressure, etc. Streaming process systems inherently require a concept of time, this time will be either derived from the Source (LOG_APPEND_TIME in Kafka, etc.) or will be inserted at ingestion time if the source doesn't provide it. Every processor is connected by an inter-step buffer. Source processors add a \"header\" to each \"item\" received from the source in order to: Uniquely identify the item for implementing exactly-once Uniquely identify the source of the message. Sink processors should avoid writing output for the same input when possible. Numaflow should support the following types of flows: Line Tree Diamond (In Future) Multiple Sources with the same schema (In Future) Non-Requirements \u00b6 Support for non-idempotent data processors (UDFs?) 
Distributed transactions/checkpoints are not needed Open Issues \u00b6 None Closed Issues \u00b6 In order to be able to support various buffering technologies, we will persist and manage stream \"offsets\" rather than relying on the buffering technology (e.g., Kafka) Each processor may persist state associated with their processing no distributed transactions are needed for checkpointing If we have a tree DAG, how will we manage acknowledgments? We will use back-pressure and exactly-once semantics on the buffer to solve it. How/where will offsets be persisted? Buffer will have a \"lookup - insert - update\" as a txn What will be used to implement the inter-step buffers between processors? The interface is abstracted out, but internally we will use Redis Streams (supports streams, hash, txn) Design Details \u00b6 Duplicates \u00b6 Numaflow (like any other stream processing engine) at its core has a Read -> Process -> Forward -> Acknowledge loop for every message it has to process. Given that the user-defined process is idempotent, there are two failure mode scenarios where there could be duplicates. The message has been forwarded but the information failed to reach back (we do not know whether we really have successfully forwarded the message). A retry on forwarding again could lead to duplication. Acknowledgment has been sent back to the source buffer, but we do not know whether we have really acknowledged the successful processing of the message. A retry on reading could end up in duplications (both in processing and forwarding, but we need to worry only about forwarding because processing is idempotent). To detect duplicates, make sure the delivery is Exactly-Once: A unique and immutable identifier for the message from the upstream buffer will be used as the key of the data in the downstream buffer Best effort of the transactional commit. Data processors make transactional commits for data forwarding to the next buffer, and upstream buffer acknowledgment.
Source processors have no way to do similar transactional operations for data source message acknowledgment and message forwarding, but #1 will make sure there's no duplicate after retrying in case of failure. Sink processors cannot do transactional operations unless there's a contract between Numaflow and the sink, which is out of the scope of this doc. We will rely on the sink to implement this (e.g., \"enable.idempotence\" in Kafka producer). Unique Identifier for Message \u00b6 To detect duplicates, we first need to uniquely identify each message. We will be relying on the \"identifier\" available (e.g., \"offset\" in Kafka) in the buffer to uniquely identify each message. If such an identifier is not available, we will be creating a unique identifier (sequence numbers are tough because there are multiple readers). We can use this unique identifier to ensure that we forward only if the message has not been forwarded yet. We will only look back for a fixed window of time since this is a stream processing application on an unbounded stream of data, and we do not have infinite resources. The same offset will not be used across all the steps in Numaflow, but we will be using the current offset only while forwarding to the next step. Step N will use the offset from step N-1 to deduplicate. This requires each step to generate a unique ID. The reason we are not sticking to the original offset is because there will be operations in future which will require, say aggregations, where multiple messages will be grouped together and we will not be able to choose an offset from the original messages because the single output is based on multiple messages. Restarting After a Failure \u00b6 Numaflow needs to be able to recover from the failure of any step (pods) or even the complete failure of the Numaflow while preserving exactly-once semantics.
When a message is successfully processed by a processor, it should have been written to the downstream buffer, and its status in the upstream buffer becomes \"Acknowledged\". So when a processor restarts, it checks if any message assigned to it in the upstream buffer is in the \"In-Flight\" state, if yes, it will read and process those messages before picking up other messages. Processing those messages follows the flowchart above, which makes sure they will only be processed once. Back Pressure \u00b6 The durable buffers allocated to the processors are not infinite but have a bounded buffer. Backpressure handling in Numaflow utilizes the buffer. At any time t, the durable buffer should contain messages in the following states: Acked messages - processed messages to be deleted Inflight messages - messages being handled by downstream processor Pending messages - messages to be read by the downstream processor The buffer acts like a sliding window, new messages will always be written to the right, and there's some automation to clean up the acknowledged messages on the left. If the processor is too slow, the pending messages will buffer up, and the space available for writing will become limited. Every time (or periodically for better throughput) before the upstream processor writes a message to the buffer, it checks if there's any available space, or else it stops writing (or slows down the processing while approaching the buffer limit). 
This buffer pressure will then pass back to the beginning of the pipeline, which is the buffer used by the source processor so that the entire flow will stop (or slow down).","title":"Overview"},{"location":"specifications/overview/#numaflow-dataplane-high-level-architecture","text":"","title":"Numaflow Dataplane High-Level Architecture"},{"location":"specifications/overview/#synopsis","text":"Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. Reliable data processing is highly desirable and exactly-once semantics is often required by many data processing applications. This document describes the use cases, requirements, and design for providing exactly-once semantics with Numaflow. Use Cases Continuous stream processing for unbounded streams. Efficient batch processing for bounded streams and data sets.","title":"Synopsis"},{"location":"specifications/overview/#definitions","text":"Pipeline A pipeline contains multiple processors, which include source processors, data processors, and sink processors. These processors are not connected directly, but through inter-step buffers . Source The actual source for the data (not a step in the Numaflow). Sink The actual sink for the data (not a step in the Numaflow). Inter-Step Buffers Inter-step buffers are used to connect processors and they should support the following Durability Support offsets Support transactions for Exactly-Once forwarding Concurrent operations (reader group) Ability to explicitly ack each data/offset Claim pending messages (read but never acknowledged) Ability to trim (buffer size controls) Fast (high throughput and low latency) Ability to query buffer information (observability) Source Processors Source processors are the initial processors that ingest data into the Numaflow. 
They sit in front of the first data processor, ingest the data from the data source, and forward to inter-step buffers. Logic: Read data from the data source; Write to the inter-step buffer; Ack the data in the data source. Data Processors The data processors execute idempotent user-defined functions and will be sandwiched between source and sink processors. There could be one or more data processors. A data processor only reads from one upstream buffer, but it might write to multiple downstream buffers. Logic: Read data from the upstream inter-step buffer; Process data; Write to downstream inter-step buffers; Ack the data in the upstream buffer. Sink Processors Sink processors are the final processors used to write processed data to sinks. A sink processor only reads from one upstream buffer and writes to a single sink. Logic: Read data from the upstream inter-step buffer; Write to the sink; Ack the data in the upstream buffer. UDF (User-defined Function) User-defined Functions run in data processors. UDFs implement a unified interface to process data. UDFs are typically implemented by end-users, but there will be some built-in functions that can be used without writing any code. UDFs can be implemented in different languages, a pseudo-interface might look like the below, where the function signatures include step context and input payload and returns a result. The Result contains the processed data as well as optional labels that will be exposed to the DSL to do complex conditional forwarding. Process(key, message, context) (result, err) UDFs should only focus on user logic, buffer message reading and writing should not be handled by this function. UDFs should be idempotent.
Matrix of Operations Source Processor Sink ReadFromBuffer Read From Source Generic Generic CallUDF Void User Defined Void Forward Generic Generic Write To Sink Ack Ack Source Generic Generic","title":"Definitions"},{"location":"specifications/overview/#requirements","text":"Exactly once semantics from the source processor to the sink processor. Be able to support a variety of data buffering technologies. Numaflow is restartable if aborted or steps fail while preserving exactly-once semantics. Do not generate more output than can be used by the next stage in a reasonable amount of time, i.e., the size of buffers between steps should be limited, (aka backpressure). User code should be isolated from offset management, restart, exactly once, backpressure, etc. Streaming process systems inherently require a concept of time, this time will be either derived from the Source (LOG_APPEND_TIME in Kafka, etc.) or will be inserted at ingestion time if the source doesn't provide it. Every processor is connected by an inter-step buffer. Source processors add a \"header\" to each \"item\" received from the source in order to: Uniquely identify the item for implementing exactly-once Uniquely identify the source of the message. Sink processors should avoid writing output for the same input when possible. Numaflow should support the following types of flows: Line Tree Diamond (In Future) Multiple Sources with the same schema (In Future)","title":"Requirements"},{"location":"specifications/overview/#non-requirements","text":"Support for non-idempotent data processors (UDFs?) 
Distributed transactions/checkpoints are not needed","title":"Non-Requirements"},{"location":"specifications/overview/#open-issues","text":"None","title":"Open Issues"},{"location":"specifications/overview/#closed-issues","text":"In order to be able to support various buffering technologies, we will persist and manage stream \"offsets\" rather than relying on the buffering technology (e.g., Kafka) Each processor may persist state associated with their processing no distributed transactions are needed for checkpointing If we have a tree DAG, how will we manage acknowledgments? We will use back-pressure and exactly-once semantics on the buffer to solve it. How/where will offsets be persisted? Buffer will have a \"lookup - insert - update\" as a txn What will be used to implement the inter-step buffers between processors? The interface is abstracted out, but internally we will use Redis Streams (supports streams, hash, txn)","title":"Closed Issues"},{"location":"specifications/overview/#design-details","text":"","title":"Design Details"},{"location":"specifications/overview/#duplicates","text":"Numaflow (like any other stream processing engine) at its core has a Read -> Process -> Forward -> Acknowledge loop for every message it has to process. Given that the user-defined process is idempotent, there are two failure mode scenarios where there could be duplicates. The message has been forwarded but the information failed to reach back (we do not know whether we really have successfully forwarded the message). A retry on forwarding again could lead to duplication. Acknowledgment has been sent back to the source buffer, but we do not know whether we have really acknowledged the successful processing of the message. A retry on reading could end up in duplications (both in processing and forwarding, but we need to worry only about forwarding because processing is idempotent).
To detect duplicates, make sure the delivery is Exactly-Once: A unique and immutable identifier for the message from the upstream buffer will be used as the key of the data in the downstream buffer Best effort of the transactional commit. Data processors make transactional commits for data forwarding to the next buffer, and upstream buffer acknowledgment. Source processors have no way to do similar transactional operations for data source message acknowledgment and message forwarding, but #1 will make sure there's no duplicate after retrying in case of failure. Sink processors cannot do transactional operations unless there's a contract between Numaflow and the sink, which is out of the scope of this doc. We will rely on the sink to implement this (e.g., \"enable.idempotence\" in Kafka producer).","title":"Duplicates"},{"location":"specifications/overview/#unique-identifier-for-message","text":"To detect duplicates, we first need to uniquely identify each message. We will be relying on the \"identifier\" available (e.g., \"offset\" in Kafka) in the buffer to uniquely identify each message. If such an identifier is not available, we will be creating a unique identifier (sequence numbers are tough because there are multiple readers). We can use this unique identifier to ensure that we forward only if the message has not been forwarded yet. We will only look back for a fixed window of time since this is a stream processing application on an unbounded stream of data, and we do not have infinite resources. The same offset will not be used across all the steps in Numaflow, but we will be using the current offset only while forwarding to the next step. Step N will use the offset from step N-1 to deduplicate. This requires each step to generate a unique ID.
The reason we are not sticking to the original offset is because there will be operations in future which will require, say aggregations, where multiple messages will be grouped together and we will not be able to choose an offset from the original messages because the single output is based on multiple messages.","title":"Unique Identifier for Message"},{"location":"specifications/overview/#restarting-after-a-failure","text":"Numaflow needs to be able to recover from the failure of any step (pods) or even the complete failure of the Numaflow while preserving exactly-once semantics. When a message is successfully processed by a processor, it should have been written to the downstream buffer, and its status in the upstream buffer becomes \"Acknowledged\". So when a processor restarts, it checks if any message assigned to it in the upstream buffer is in the \"In-Flight\" state, if yes, it will read and process those messages before picking up other messages. Processing those messages follows the flowchart above, which makes sure they will only be processed once.","title":"Restarting After a Failure"},{"location":"specifications/overview/#back-pressure","text":"The durable buffers allocated to the processors are not infinite but have a bounded buffer. Backpressure handling in Numaflow utilizes the buffer. At any time t, the durable buffer should contain messages in the following states: Acked messages - processed messages to be deleted Inflight messages - messages being handled by downstream processor Pending messages - messages to be read by the downstream processor The buffer acts like a sliding window, new messages will always be written to the right, and there's some automation to clean up the acknowledged messages on the left. If the processor is too slow, the pending messages will buffer up, and the space available for writing will become limited. 
Every time (or periodically for better throughput) before the upstream processor writes a message to the buffer, it checks if there's any available space, or else it stops writing (or slows down the processing while approaching the buffer limit). This buffer pressure will then pass back to the beginning of the pipeline, which is the buffer used by the source processor so that the entire flow will stop (or slow down).","title":"Back Pressure"},{"location":"specifications/side-inputs/","text":"Side Inputs \u00b6 Side Inputs allow the user-defined functions (including UDF, UDSink, Transformer, etc.) to access slow updated data or configuration (such as database, file system, etc.) without needing to load it during each message processing. Side Inputs are read-only and can be used in both batch and streaming jobs. Requirements \u00b6 The Side Inputs should be programmable with any language. The Side Inputs should be updated centralized (for a pipeline), and be able to broadcast to each of the vertex pods in an efficient manner. The Side Inputs update could be based on a configurable interval. Assumptions \u00b6 Size of a Side Input data could be up to 1MB. The Side Inputs data is updated at a low frequency (minutes level). As a platform, Numaflow has no idea about the data format of the Side Inputs, instead, the pipeline owner (programmer) is responsible for parsing the data. Design Proposal \u00b6 Data Format \u00b6 Numaflow processes the Side Inputs data as bytes array, thus there\u2019s no data format requirement for it, the pipeline developers are supposed to parse the Side Inputs data from bytes array to any format they expect. Architecture \u00b6 There will be the following components introduced when a pipeline has Side Inputs enabled. A Side Inputs Manager - a service for Side Inputs data updating. A Side Inputs watcher sidecar - a container enabled for each of the vertex pods to receive updated Side Inputs. 
Side Inputs data store - a data store to store the latest Side Inputs data. Data Store \u00b6 Data store is the place where the latest retrieved Side Inputs data stays. The data is published by the Side Inputs Manager after retrieving from the Side Inputs data source, and consumed by each of the vertex Pods. The data store implementation could be a Key/Value store in JetStream, which by default supports maximum 1MB - 64MB size data. Extended implementation could be Key/Value store + object store, which makes it possible to store large sizes of data. Data Store management is supposed to be done by the controller, through the same Kubernetes Job to create/delete Inter-Step Buffers and Buckets. Side Inputs Manager \u00b6 A Side Inputs Manager is a pod (or a group of pods with active-passive HA), created by the Numaflow controller, used to run cron like jobs to retrieve the Side Inputs data and save to a data store. Each Side Inputs Manager is only responsible for corresponding pipeline, and is only created when Side Inputs is enabled for the pipeline. A pipeline may have multiple Side Inputs sources, each of them will have a Side Inputs Manager. Each of the Side Inputs Manager pods contains: An init container, which checks if the data store is ready. A user-defined container, which runs a predefined Numaflow SDK to start a service, calling a user implemented function to get Side Input data. A numa container, which runs a cron like job to call the service in the user-defined container, and store the returned data in the data store. The communication protocol between the 2 containers could be based on UDS or FIFO (TBD). High Availability \u00b6 Side Inputs Manager needs to run with Active-Passive HA, which requires a leader election mechanism support. Kubernetes has a native leader election API backed by etcd, but it requires extra RBAC privileges to use it.
Considering a similar leader election mechanism is needed in some other scenarios such as Active-Passive User-defined Source, a proposal is to implement our own leader election mechanism by leveraging ISB Service. Why NOT CronJob? \u00b6 Using Kubernetes CronJob could also achieve the cron like job orchestration, but there are a few downsides. A K8s Job has to be used together with the CronJob to solve the immediate starting problem - A CronJob cannot trigger a job immediately after it\u2019s created, it has to wait until the first trigger condition is met. Using K8s CronJob/Job will be a challenge when ServiceMesh (Istio) is enabled. Vertex Pod Sidecar \u00b6 When Side Inputs is enabled for a pipeline, each of its vertex pods will have a second init container added, the init container will have a shared volume (emptyDir) mounted, and the same volume will be mounted to the User-defined Function/Sink/Transformer container. The init container reads from the data store, and saves to the shared volume. A sidecar container will also be injected by the controller, and it mounts the same volume as above. The sidecar runs a service provided by numaflow, watching the Side Inputs data from the data store, if there\u2019s any update, reads the data and updates the shared volume. In the User-defined Function/Sink/Transformer container, a helper function will be provided by Numaflow SDK, to return the Side Input data. The helper function caches the Side Inputs data in the memory, but performs thread safe updates if it watches the changes in the shared volume. Numaflow SDK \u00b6 Some new features will be added to the Numaflow SDK. Interface for the users to implement the Side Inputs retrieval. A pseudo interface might look like below. RetrieveSideInput () ([] byte , error ) A main function to start the service in the Side Inputs Manager user container.
A helper function to be used in the udf/udsink/transformer containers to get the Side Inputs, which reads, watches and caches the data from the shared volume. SideInput [ T any ]( name string , parseFunc func ([] byte ) ( T , error )) ( T , error ) User Spec \u00b6 Side Inputs support is exposed through sideInputs in the pipeline spec, it\u2019s updated based on cron like schedule, specified in the pipeline spec with a trigger field. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : devPortal container : image : my-sideinputs-image:v1 trigger : schedule : \"*/15 * * * *\" # interval: 180s # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - devPortal Open Issues \u00b6 To support multiple ways to trigger Side Inputs updating other than cron only? Event based side inputs where the changes are coming via a stream?","title":"Side Inputs"},{"location":"specifications/side-inputs/#side-inputs","text":"Side Inputs allow the user-defined functions (including UDF, UDSink, Transformer, etc.) to access slow updated data or configuration (such as database, file system, etc.) without needing to load it during each message processing. Side Inputs are read-only and can be used in both batch and streaming jobs.","title":"Side Inputs"},{"location":"specifications/side-inputs/#requirements","text":"The Side Inputs should be programmable with any language. The Side Inputs should be updated centralized (for a pipeline), and be able to broadcast to each of the vertex pods in an efficient manner. The Side Inputs update could be based on a configurable interval.","title":"Requirements"},{"location":"specifications/side-inputs/#assumptions","text":"Size of a Side Input data could be up to 1MB. The Side Inputs data is updated at a low frequency (minutes level). 
As a platform, Numaflow has no idea about the data format of the Side Inputs, instead, the pipeline owner (programmer) is responsible for parsing the data.","title":"Assumptions"},{"location":"specifications/side-inputs/#design-proposal","text":"","title":"Design Proposal"},{"location":"specifications/side-inputs/#data-format","text":"Numaflow processes the Side Inputs data as bytes array, thus there\u2019s no data format requirement for it, the pipeline developers are supposed to parse the Side Inputs data from bytes array to any format they expect.","title":"Data Format"},{"location":"specifications/side-inputs/#architecture","text":"There will be the following components introduced when a pipeline has Side Inputs enabled. A Side Inputs Manager - a service for Side Inputs data updating. A Side Inputs watcher sidecar - a container enabled for each of the vertex pods to receive updated Side Inputs. Side Inputs data store - a data store to store the latest Side Inputs data.","title":"Architecture"},{"location":"specifications/side-inputs/#data-store","text":"Data store is the place where the latest retrieved Side Inputs data stays. The data is published by the Side Inputs Manager after retrieving from the Side Inputs data source, and consumed by each of the vertex Pods. The data store implementation could be a Key/Value store in JetStream, which by default supports maximum 1MB - 64MB size data. Extended implementation could be Key/Value store + object store, which makes it possible to store large sizes of data. Data Store management is supposed to be done by the controller, through the same Kubernetes Job to create/delete Inter-Step Buffers and Buckets.","title":"Data Store"},{"location":"specifications/side-inputs/#side-inputs-manager","text":"A Side Inputs Manager is a pod (or a group of pods with active-passive HA), created by the Numaflow controller, used to run cron like jobs to retrieve the Side Inputs data and save to a data store. 
Each Side Inputs Manager is only responsible for corresponding pipeline, and is only created when Side Inputs is enabled for the pipeline. A pipeline may have multiple Side Inputs sources, each of them will have a Side Inputs Manager. Each of the Side Inputs Manager pods contains: An init container, which checks if the data store is ready. A user-defined container, which runs a predefined Numaflow SDK to start a service, calling a user implemented function to get Side Input data. A numa container, which runs a cron like job to call the service in the user-defined container, and store the returned data in the data store. The communication protocol between the 2 containers could be based on UDS or FIFO (TBD).","title":"Side Inputs Manager"},{"location":"specifications/side-inputs/#high-availability","text":"Side Inputs Manager needs to run with Active-Passive HA, which requires a leader election mechanism support. Kubernetes has a native leader election API backed by etcd, but it requires extra RBAC privileges to use it. Considering a similar leader election mechanism is needed in some other scenarios such as Active-Passive User-defined Source, a proposal is to implement our own leader election mechanism by leveraging ISB Service.","title":"High Availability"},{"location":"specifications/side-inputs/#why-not-cronjob","text":"Using Kubernetes CronJob could also achieve the cron like job orchestration, but there are a few downsides. A K8s Job has to be used together with the CronJob to solve the immediate starting problem - A CronJob cannot trigger a job immediately after it\u2019s created, it has to wait until the first trigger condition is met.
Using K8s CronJob/Job will be a challenge when ServiceMesh (Istio) is enabled.","title":"Why NOT CronJob?"},{"location":"specifications/side-inputs/#vertex-pod-sidecar","text":"When Side Inputs is enabled for a pipeline, each of its vertex pods will have a second init container added, the init container will have a shared volume (emptyDir) mounted, and the same volume will be mounted to the User-defined Function/Sink/Transformer container. The init container reads from the data store, and saves to the shared volume. A sidecar container will also be injected by the controller, and it mounts the same volume as above. The sidecar runs a service provided by Numaflow, watching the Side Inputs data from the data store, if there\u2019s any update, reads the data and updates the shared volume. In the User-defined Function/Sink/Transformer container, a helper function will be provided by Numaflow SDK, to return the Side Input data. The helper function caches the Side Inputs data in memory, and performs thread-safe updates when it detects changes in the shared volume.","title":"Vertex Pod Sidecar"},{"location":"specifications/side-inputs/#numaflow-sdk","text":"Some new features will be added to the Numaflow SDK. Interface for the users to implement the Side Inputs retrieval. A pseudo interface might look like below. RetrieveSideInput () ([] byte , error ) A main function to start the service in the Side Inputs Manager user container. A helper function to be used in the udf/udsink/transformer containers to get the Side Inputs, which reads, watches and caches the data from the shared volume. SideInput [ T any ]( name string , parseFunc func ([] byte ) ( T , error )) ( T , error )","title":"Numaflow SDK"},{"location":"specifications/side-inputs/#user-spec","text":"Side Inputs support is exposed through sideInputs in the pipeline spec, and is updated on a cron-like schedule, specified in the pipeline spec with a trigger field. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : devPortal container : image : my-sideinputs-image:v1 trigger : schedule : \"*/15 * * * *\" # interval: 180s # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - devPortal","title":"User Spec"},{"location":"specifications/side-inputs/#open-issues","text":"To support multiple ways to trigger Side Inputs updating other than cron only? Event based side inputs where the changes are coming via a stream?","title":"Open Issues"},{"location":"user-guide/reference/autoscaling/","text":"Autoscaling \u00b6 Numaflow is able to run with both Horizontal Pod Autoscaling and Vertical Pod Autoscaling . Horizontal Pod Autoscaling \u00b6 Horizontal Pod Autoscaling approaches supported in Numaflow include: Numaflow Autoscaling Kubernetes HPA Third Party Autoscaling (such as KEDA ) Numaflow Autoscaling \u00b6 Numaflow provides 0 - N autoscaling capability out of the box, it's available for all the UDF , Sink and most of the Source vertices (please check each source for more details). Numaflow autoscaling is enabled by default, there are some parameters can be tuned to achieve better results. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : false # Optional, defaults to false. min : 0 # Optional, minimum replicas, defaults to 0. max : 20 # Optional, maximum replicas, defaults to 50. lookbackSeconds : 120 # Optional, defaults to 120. scaleUpCooldownSeconds : 90 # Optional, defaults to 90. scaleDownCooldownSeconds : 90 # Optional, defaults to 90. zeroReplicaSleepSeconds : 120 # Optional, defaults to 120. targetProcessingSeconds : 20 # Optional, defaults to 20. targetBufferAvailability : 50 # Optional, defaults to 50. replicasPerScale : 2 # Optional, defaults to 2. disabled - Whether to disable Numaflow autoscaling, defaults to false . 
min - Minimum replicas, valid value could be an integer >= 0. Defaults to 0 , which means it could be scaled down to 0. max - Maximum replicas, positive integer which should not be less than min , defaults to 50 . if max and min are the same, that will be the fixed replica number. lookbackSeconds - How many seconds to lookback for vertex average processing rate (tps) and pending messages calculation, defaults to 120 . Rate and pending messages metrics are critical for autoscaling, you might need to tune this parameter a bit to see better results. For example, your data source only have 1 minute data input in every 5 minutes, and you don't want the vertices to be scaled down to 0 . In this case, you need to increase lookbackSeconds to overlap 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, in order to prevent from scaling down to 0. scaleUpCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling up, defaults to 90 . Please make sure that the time is greater than the pod to be Running and start processing, because the autoscaling algorithm will divide the TPS by the number of pods even if the pod is not Running . scaleDownCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling down, defaults to 90 . zeroReplicaSleepSeconds - After scaling a source vertex replicas down to 0 , how many seconds to wait before scaling up to 1 replica to peek, defaults to 120 . Numaflow autoscaler periodically scales up a source vertex pod to \"peek\" the incoming data, this is the period of time to wait before peeking. targetProcessingSeconds - It is used to tune the aggressiveness of autoscaling for source vertices, it measures how fast you want the vertex to process all the pending messages, defaults to 20 . 
It is only effective for the Source vertices that support autoscaling, typically increasing the value leads to lower processing rate, thus less replicas. targetBufferAvailability - Targeted buffer availability in percentage, defaults to 50 . It is only effective for UDF and Sink vertices, it determines how aggressive you want to do for autoscaling, increasing the value will bring more replicas. replicasPerScale - Maximum number of replicas change happens in one scale up or down operation, defaults to 2 . For example, if current replica number is 3, the calculated desired replica number is 8; instead of scaling up the vertex to 8, it only does 5. To disable Numaflow autoscaling, set disabled: true as following. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : true Notes Numaflow autoscaling does not apply to reduce vertices, and the source vertices which do not have a way to calculate their pending messages. Generator HTTP Nats For User-defined Sources, if the function Pending() returns a negative value, autoscaling will not be applied. Kubernetes HPA \u00b6 Kubernetes HPA is supported in Numaflow for any type of Vertex. To use HPA, remember to point the scaleTargetRef to the vertex as below, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : autoscaling/v2 kind : HorizontalPodAutoscaler metadata : name : my-vertex-hpa spec : minReplicas : 1 maxReplicas : 3 metrics : - resource : name : cpu targetAverageUtilization : 50 type : Resource scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex With the configuration above, Kubernetes HPA controller will keep the target utilization of the pods of the Vertex at 50%. Kubernetes HPA autoscaling is useful for those Source vertices not able to count pending messages, such as HTTP . 
Third Party Autoscaling \u00b6 Third party autoscaling tools like KEDA are also supported in Numaflow, which can be used to autoscale any type of vertex with the scalers it supports. To use KEDA for vertex autoscaling, same as Kubernetes HPA, point the scaleTargetRef to your vertex, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : keda.sh/v1alpha1 kind : ScaledObject metadata : name : my-keda-scaler spec : scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex ... ... Vertical Pod Autoscaling \u00b6 Vertical Pod Autoscaling can be achieved by setting the targetRef to Vertex objects as following. spec : targetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex","title":"Autoscaling"},{"location":"user-guide/reference/autoscaling/#autoscaling","text":"Numaflow is able to run with both Horizontal Pod Autoscaling and Vertical Pod Autoscaling .","title":"Autoscaling"},{"location":"user-guide/reference/autoscaling/#horizontal-pod-autoscaling","text":"Horizontal Pod Autoscaling approaches supported in Numaflow include: Numaflow Autoscaling Kubernetes HPA Third Party Autoscaling (such as KEDA )","title":"Horizontal Pod Autoscaling"},{"location":"user-guide/reference/autoscaling/#numaflow-autoscaling","text":"Numaflow provides 0 - N autoscaling capability out of the box, it's available for all the UDF , Sink and most of the Source vertices (please check each source for more details). Numaflow autoscaling is enabled by default, there are some parameters can be tuned to achieve better results. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : false # Optional, defaults to false. min : 0 # Optional, minimum replicas, defaults to 0. max : 20 # Optional, maximum replicas, defaults to 50. lookbackSeconds : 120 # Optional, defaults to 120. scaleUpCooldownSeconds : 90 # Optional, defaults to 90. 
scaleDownCooldownSeconds : 90 # Optional, defaults to 90. zeroReplicaSleepSeconds : 120 # Optional, defaults to 120. targetProcessingSeconds : 20 # Optional, defaults to 20. targetBufferAvailability : 50 # Optional, defaults to 50. replicasPerScale : 2 # Optional, defaults to 2. disabled - Whether to disable Numaflow autoscaling, defaults to false . min - Minimum replicas, valid value could be an integer >= 0. Defaults to 0 , which means it could be scaled down to 0. max - Maximum replicas, positive integer which should not be less than min , defaults to 50 . if max and min are the same, that will be the fixed replica number. lookbackSeconds - How many seconds to lookback for vertex average processing rate (tps) and pending messages calculation, defaults to 120 . Rate and pending messages metrics are critical for autoscaling, you might need to tune this parameter a bit to see better results. For example, your data source only have 1 minute data input in every 5 minutes, and you don't want the vertices to be scaled down to 0 . In this case, you need to increase lookbackSeconds to overlap 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, in order to prevent from scaling down to 0. scaleUpCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling up, defaults to 90 . Please make sure that the time is greater than the pod to be Running and start processing, because the autoscaling algorithm will divide the TPS by the number of pods even if the pod is not Running . scaleDownCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling down, defaults to 90 . zeroReplicaSleepSeconds - After scaling a source vertex replicas down to 0 , how many seconds to wait before scaling up to 1 replica to peek, defaults to 120 . 
Numaflow autoscaler periodically scales up a source vertex pod to \"peek\" the incoming data, this is the period of time to wait before peeking. targetProcessingSeconds - It is used to tune the aggressiveness of autoscaling for source vertices, it measures how fast you want the vertex to process all the pending messages, defaults to 20 . It is only effective for the Source vertices that support autoscaling, typically increasing the value leads to lower processing rate, thus less replicas. targetBufferAvailability - Targeted buffer availability in percentage, defaults to 50 . It is only effective for UDF and Sink vertices, it determines how aggressive you want to do for autoscaling, increasing the value will bring more replicas. replicasPerScale - Maximum number of replicas change happens in one scale up or down operation, defaults to 2 . For example, if current replica number is 3, the calculated desired replica number is 8; instead of scaling up the vertex to 8, it only does 5. To disable Numaflow autoscaling, set disabled: true as following. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : true Notes Numaflow autoscaling does not apply to reduce vertices, and the source vertices which do not have a way to calculate their pending messages. Generator HTTP Nats For User-defined Sources, if the function Pending() returns a negative value, autoscaling will not be applied.","title":"Numaflow Autoscaling"},{"location":"user-guide/reference/autoscaling/#kubernetes-hpa","text":"Kubernetes HPA is supported in Numaflow for any type of Vertex. To use HPA, remember to point the scaleTargetRef to the vertex as below, and disable Numaflow autoscaling in your Pipeline spec. 
apiVersion : autoscaling/v2 kind : HorizontalPodAutoscaler metadata : name : my-vertex-hpa spec : minReplicas : 1 maxReplicas : 3 metrics : - resource : name : cpu targetAverageUtilization : 50 type : Resource scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex With the configuration above, Kubernetes HPA controller will keep the target utilization of the pods of the Vertex at 50%. Kubernetes HPA autoscaling is useful for those Source vertices not able to count pending messages, such as HTTP .","title":"Kubernetes HPA"},{"location":"user-guide/reference/autoscaling/#third-party-autoscaling","text":"Third party autoscaling tools like KEDA are also supported in Numaflow, which can be used to autoscale any type of vertex with the scalers it supports. To use KEDA for vertex autoscaling, same as Kubernetes HPA, point the scaleTargetRef to your vertex, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : keda.sh/v1alpha1 kind : ScaledObject metadata : name : my-keda-scaler spec : scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex ... ...","title":"Third Party Autoscaling"},{"location":"user-guide/reference/autoscaling/#vertical-pod-autoscaling","text":"Vertical Pod Autoscaling can be achieved by setting the targetRef to Vertex objects as following. spec : targetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex","title":"Vertical Pod Autoscaling"},{"location":"user-guide/reference/conditional-forwarding/","text":"Conditional Forwarding \u00b6 After processing the data, conditional forwarding is doable based on the Tags returned in the result. Below is a list of different logic operations that can be done on tags. - and - forwards the message if all the tags specified are present in Message's tags. - or - forwards the message if one of the tags specified is present in Message's tags. 
- not - forwards the message if all the tags specified are not present in Message's tags. For example, there's a UDF used to process numbers, and forward the result to different vertices based on the number is even or odd. In this case, you can set the tag to even-tag or odd-tag in each of the returned messages, and define the edges as below: edges : - from : p1 to : even-vertex conditions : tags : operator : or # Optional, defaults to \"or\". values : - even-tag - from : p1 to : odd-vertex conditions : tags : operator : not values : - odd-tag - from : p1 to : all conditions : tags : operator : and values : - odd-tag - even-tag","title":"Conditional Forwarding"},{"location":"user-guide/reference/conditional-forwarding/#conditional-forwarding","text":"After processing the data, conditional forwarding is doable based on the Tags returned in the result. Below is a list of different logic operations that can be done on tags. - and - forwards the message if all the tags specified are present in Message's tags. - or - forwards the message if one of the tags specified is present in Message's tags. - not - forwards the message if all the tags specified are not present in Message's tags. For example, there's a UDF used to process numbers, and forward the result to different vertices based on the number is even or odd. In this case, you can set the tag to even-tag or odd-tag in each of the returned messages, and define the edges as below: edges : - from : p1 to : even-vertex conditions : tags : operator : or # Optional, defaults to \"or\". values : - even-tag - from : p1 to : odd-vertex conditions : tags : operator : not values : - odd-tag - from : p1 to : all conditions : tags : operator : and values : - odd-tag - even-tag","title":"Conditional Forwarding"},{"location":"user-guide/reference/edge-tuning/","text":"Edge Tuning \u00b6 Drop message onFull \u00b6 We need to have an edge level setting to drop the messages if the buffer.isFull == true . 
Even if the UDF or UDSink drops a message due to some internal error in the user-defined code, the processing latency will spike up causing a natural back pressure. A kill switch to drop messages can help alleviate/avoid any repercussions on the rest of the DAG. This is an edge-level setting configured through onFull ; the default is retryUntilSuccess (the other option is discardLatest ). This is a data loss scenario but can be useful in cases where we are doing user-introduced experimentations, like A/B testing, on the pipeline. It is totally okay for the experimentation side of the DAG to have data loss while the production is unaffected. discardLatest \u00b6 Setting onFull to discardLatest will drop the message on the floor if the edge is full. edges : - from : a to : b onFull : discardLatest retryUntilSuccess \u00b6 The default setting for onFull is retryUntilSuccess , which will make sure the message is retried until successful. edges : - from : a to : b onFull : retryUntilSuccess","title":"Edge Tuning"},{"location":"user-guide/reference/edge-tuning/#edge-tuning","text":"","title":"Edge Tuning"},{"location":"user-guide/reference/edge-tuning/#drop-message-onfull","text":"We need to have an edge level setting to drop the messages if the buffer.isFull == true . Even if the UDF or UDSink drops a message due to some internal error in the user-defined code, the processing latency will spike up causing a natural back pressure. A kill switch to drop messages can help alleviate/avoid any repercussions on the rest of the DAG. This is an edge-level setting configured through onFull ; the default is retryUntilSuccess (the other option is discardLatest ). This is a data loss scenario but can be useful in cases where we are doing user-introduced experimentations, like A/B testing, on the pipeline. 
It is totally okay for the experimentation side of the DAG to have data loss while the production is unaffected.","title":"Drop message onFull"},{"location":"user-guide/reference/edge-tuning/#discardlatest","text":"Setting onFull to discardLatest will drop the message on the floor if the edge is full. edges : - from : a to : b onFull : discardLatest","title":"discardLatest"},{"location":"user-guide/reference/edge-tuning/#retryuntilsuccess","text":"The default setting for onFull is retryUntilSuccess , which will make sure the message is retried until successful. edges : - from : a to : b onFull : retryUntilSuccess","title":"retryUntilSuccess"},{"location":"user-guide/reference/join-vertex/","text":"Joins and Cycles \u00b6 Numaflow Pipeline Edges can be defined such that multiple Vertices can forward messages to a single vertex. Quick Start \u00b6 Please see the following examples: Join on Map Vertex Join on Reduce Vertex Join on Sink Vertex Cycle to Self Cycle to Previous Why do we need JOIN \u00b6 Without JOIN \u00b6 Without JOIN, Numaflow could only allow users to build pipelines where vertices could only read from previous one vertex. This meant that Numaflow could only support simple pipelines or tree-like pipelines. Supporting pipelines where you had to read from multiple sources or UDFs were cumbersome and required creating redundant vertices. With JOIN \u00b6 Join vertices allow users the flexibility to read from multiple sources, process data from multiple UDFs, and even write to a single sink. The Pipeline Spec doesn't change at all with JOIN, now you can create multiple Edges that have the same \u201cTo\u201d Vertex, which would have otherwise been prohibited. There is no limitation on which vertices can be joined. For instance, one can join Map or Reduce vertices as shown below: Benefits \u00b6 The introduction of Join Vertex allows users to eliminate redundancy in their pipelines. 
It supports many-to-one data flow without needing multiple vertices performing the same job. Examples \u00b6 Join on Sink Vertex \u00b6 By joining the sink vertices, we now only need a single vertex responsible for sending to the data sink. Example \u00b6 Join on Sink Vertex Join on Map Vertex \u00b6 Two different Sources containing similar data that can be processed the same way can now point to a single vertex. Example \u00b6 Join on Map Vertex Join on Reduce Vertex \u00b6 This feature allows for efficient aggregation of data from multiple sources. Example \u00b6 Join on Reduce Vertex Cycles \u00b6 A special case of a \"Join\" is a Cycle (a Vertex which can send either to itself or to a previous Vertex.) An example use of this is a Map UDF which does some sort of reprocessing of data under certain conditions such as a transient error. Cycles are permitted, except in the case that there's a Reduce Vertex at or downstream of the cycle. (This is because a cycle inevitably produces late data, which would get dropped by the Reduce Vertex. For this reason, cycles should be used sparingly.) The following examples are of Cycles: Cycle to Self Cycle to Previous","title":"Joins and Cycles"},{"location":"user-guide/reference/join-vertex/#joins-and-cycles","text":"Numaflow Pipeline Edges can be defined such that multiple Vertices can forward messages to a single vertex.","title":"Joins and Cycles"},{"location":"user-guide/reference/join-vertex/#quick-start","text":"Please see the following examples: Join on Map Vertex Join on Reduce Vertex Join on Sink Vertex Cycle to Self Cycle to Previous","title":"Quick Start"},{"location":"user-guide/reference/join-vertex/#why-do-we-need-join","text":"","title":"Why do we need JOIN"},{"location":"user-guide/reference/join-vertex/#without-join","text":"Without JOIN, Numaflow could only allow users to build pipelines where vertices could only read from previous one vertex. 
This meant that Numaflow could only support simple pipelines or tree-like pipelines. Supporting pipelines where you had to read from multiple sources or UDFs were cumbersome and required creating redundant vertices.","title":"Without JOIN"},{"location":"user-guide/reference/join-vertex/#with-join","text":"Join vertices allow users the flexibility to read from multiple sources, process data from multiple UDFs, and even write to a single sink. The Pipeline Spec doesn't change at all with JOIN, now you can create multiple Edges that have the same \u201cTo\u201d Vertex, which would have otherwise been prohibited. There is no limitation on which vertices can be joined. For instance, one can join Map or Reduce vertices as shown below:","title":"With JOIN"},{"location":"user-guide/reference/join-vertex/#benefits","text":"The introduction of Join Vertex allows users to eliminate redundancy in their pipelines. It supports many-to-one data flow without needing multiple vertices performing the same job.","title":"Benefits"},{"location":"user-guide/reference/join-vertex/#examples","text":"","title":"Examples"},{"location":"user-guide/reference/join-vertex/#join-on-sink-vertex","text":"By joining the sink vertices, we now only need a single vertex responsible for sending to the data sink.","title":"Join on Sink Vertex"},{"location":"user-guide/reference/join-vertex/#example","text":"Join on Sink Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#join-on-map-vertex","text":"Two different Sources containing similar data that can be processed the same way can now point to a single vertex.","title":"Join on Map Vertex"},{"location":"user-guide/reference/join-vertex/#example_1","text":"Join on Map Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#join-on-reduce-vertex","text":"This feature allows for efficient aggregation of data from multiple sources.","title":"Join on Reduce 
Vertex"},{"location":"user-guide/reference/join-vertex/#example_2","text":"Join on Reduce Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#cycles","text":"A special case of a \"Join\" is a Cycle (a Vertex which can send either to itself or to a previous Vertex.) An example use of this is a Map UDF which does some sort of reprocessing of data under certain conditions such as a transient error. Cycles are permitted, except in the case that there's a Reduce Vertex at or downstream of the cycle. (This is because a cycle inevitably produces late data, which would get dropped by the Reduce Vertex. For this reason, cycles should be used sparingly.) The following examples are of Cycles: Cycle to Self Cycle to Previous","title":"Cycles"},{"location":"user-guide/reference/multi-partition/","text":"Multi-partitioned Edges \u00b6 To achieve higher throughput(> 10K but < 30K tps), users can create multi-partitioned edges. Multi-partitioned edges are only supported for pipelines with JetStream as ISB. Please ensure that the JetStream is provisioned with more nodes to support higher throughput. Since partitions are owned by the vertex reading the data, to create a multi-partitioned edge we need to configure the vertex reading the data (to-vertex) to have multiple partitions. The following code snippet provides an example of how to configure a vertex (in this case, the cat vertex) to have multiple partitions, which enables it ( cat vertex) to read at a higher throughput. - name : cat partitions : 3 udf : builtin : name : cat # A built-in UDF which simply cats the message","title":"Multi-partitioned Edges"},{"location":"user-guide/reference/multi-partition/#multi-partitioned-edges","text":"To achieve higher throughput(> 10K but < 30K tps), users can create multi-partitioned edges. Multi-partitioned edges are only supported for pipelines with JetStream as ISB. Please ensure that the JetStream is provisioned with more nodes to support higher throughput. 
Since partitions are owned by the vertex reading the data, to create a multi-partitioned edge we need to configure the vertex reading the data (to-vertex) to have multiple partitions. The following code snippet provides an example of how to configure a vertex (in this case, the cat vertex) to have multiple partitions, which enables it ( cat vertex) to read at a higher throughput. - name : cat partitions : 3 udf : builtin : name : cat # A built-in UDF which simply cats the message","title":"Multi-partitioned Edges"},{"location":"user-guide/reference/pipeline-operations/","text":"Pipeline Operations \u00b6 Update a Pipeline \u00b6 You might want to make some changes to an existing pipeline, for example, updating request CPU, or changing the minimal replicas for a vertex. Updating a pipeline is as simple as applying the new pipeline spec to the existing one. But there are some scenarios that you'd better not update the pipeline, instead, you should delete and recreate it. The scenarios include but are not limited to: Topology changes such as adding or removing vertices, or updating the edges between vertices. Updating the partitions for a keyed reduce vertex. Updating the user-defined container image for a vertex, while the new image can not properly handle the unprocessed data in its backlog. To summarize, if there are unprocessed messages in the pipeline, and the new pipeline spec will change the way how the messages are processed, then you should delete and recreate the pipeline. Pause a Pipeline \u00b6 To pause a pipeline, use the command below, it will bring the pipeline to Paused status, and terminate all the running vertex pods. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Paused\"}}}' Pausing a pipeline will not cause data loss. It does not clean up the unprocessed data in the pipeline, but just terminates the running pods. 
When the pipeline is resumed, the pods will be restarted and continue processing the unprocessed data. When pausing a pipeline, it will shutdown the source vertex pods first, and then wait for the other vertices to finish the backlog before terminating them. However, it will not wait forever and will terminate the pods after pauseGracePeriodSeconds . This is default set to 30 and can be customized by setting spec.lifecycle.pauseGracePeriodSeconds . If there's a reduce vertex in the pipeline, please make sure it uses Persistent Volume Claim for storage, otherwise the data will be lost. Resume a Pipeline \u00b6 The command below will bring the pipeline back to Running status. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Running\"}}}' Delete a Pipeline \u00b6 When deleting a pipeline, before terminating all the pods, it will try to wait for all the backlog messages that have already been ingested into the pipeline to be processed. However, it will not wait forever, if the backlog is too large, it will terminate the pods after terminationGracePeriodSeconds , which defaults to 30, and can be customized by setting spec.lifecycle.terminationGracePeriodSeconds .","title":"Pipeline Operations"},{"location":"user-guide/reference/pipeline-operations/#pipeline-operations","text":"","title":"Pipeline Operations"},{"location":"user-guide/reference/pipeline-operations/#update-a-pipeline","text":"You might want to make some changes to an existing pipeline, for example, updating request CPU, or changing the minimal replicas for a vertex. Updating a pipeline is as simple as applying the new pipeline spec to the existing one. But there are some scenarios that you'd better not update the pipeline, instead, you should delete and recreate it. The scenarios include but are not limited to: Topology changes such as adding or removing vertices, or updating the edges between vertices. Updating the partitions for a keyed reduce vertex. 
Updating the user-defined container image for a vertex, while the new image can not properly handle the unprocessed data in its backlog. To summarize, if there are unprocessed messages in the pipeline, and the new pipeline spec will change the way how the messages are processed, then you should delete and recreate the pipeline.","title":"Update a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#pause-a-pipeline","text":"To pause a pipeline, use the command below, it will bring the pipeline to Paused status, and terminate all the running vertex pods. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Paused\"}}}' Pausing a pipeline will not cause data loss. It does not clean up the unprocessed data in the pipeline, but just terminates the running pods. When the pipeline is resumed, the pods will be restarted and continue processing the unprocessed data. When pausing a pipeline, it will shutdown the source vertex pods first, and then wait for the other vertices to finish the backlog before terminating them. However, it will not wait forever and will terminate the pods after pauseGracePeriodSeconds . This is default set to 30 and can be customized by setting spec.lifecycle.pauseGracePeriodSeconds . If there's a reduce vertex in the pipeline, please make sure it uses Persistent Volume Claim for storage, otherwise the data will be lost.","title":"Pause a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#resume-a-pipeline","text":"The command below will bring the pipeline back to Running status. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Running\"}}}'","title":"Resume a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#delete-a-pipeline","text":"When deleting a pipeline, before terminating all the pods, it will try to wait for all the backlog messages that have already been ingested into the pipeline to be processed. 
However, it will not wait forever: if the backlog is too large, it will terminate the pods after terminationGracePeriodSeconds , which defaults to 30 and can be customized by setting spec.lifecycle.terminationGracePeriodSeconds .","title":"Delete a Pipeline"},{"location":"user-guide/reference/pipeline-tuning/","text":"Pipeline Tuning \u00b6 For a data processing pipeline, each vertex keeps running the cycle of reading data from an Inter-Step Buffer (or data source), processing the data, and writing to the next Inter-Step Buffers (or sinks). This data processing cycle can be tuned with the following parameters. readBatchSize - How many messages to read for each cycle, defaults to 500 . bufferMaxLength - How many unprocessed messages can exist in the Inter-Step Buffer, defaults to 30000 . bufferUsageLimit - The percentage of the buffer usage limit, a valid number should be less than 100. Default value is 80 , which means 80% . These parameters can be customized under spec.limits as below. Once defined, they apply to all the vertices and Inter-Step Buffers of the pipeline. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 They can also be defined at the vertex level, which will override the pipeline level settings. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : # Default limits for all the vertices and edges (buffers) of this pipeline readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat limits : readBatchSize : 200 # It overrides the default limit \"100\" bufferMaxLength : 20000 # It overrides the default limit \"30000\" for the buffers owned by this vertex bufferUsageLimit : 70 # It overrides the default limit \"85\" for the buffers owned by this vertex - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out","title":"Pipeline Tuning"},{"location":"user-guide/reference/pipeline-tuning/#pipeline-tuning","text":"For a data processing pipeline, each vertex keeps running the cycle of reading data from an Inter-Step Buffer (or data source), processing the data, and writing to the next Inter-Step Buffers (or sinks). This data processing cycle can be tuned with the following parameters. readBatchSize - How many messages to read for each cycle, defaults to 500 . bufferMaxLength - How many unprocessed messages can exist in the Inter-Step Buffer, defaults to 30000 . bufferUsageLimit - The percentage of the buffer usage limit, a valid number should be less than 100. Default value is 80 , which means 80% . These parameters can be customized under spec.limits as below. Once defined, they apply to all the vertices and Inter-Step Buffers of the pipeline. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 They can also be defined at the vertex level, which will override the pipeline level settings. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : # Default limits for all the vertices and edges (buffers) of this pipeline readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat limits : readBatchSize : 200 # It overrides the default limit \"100\" bufferMaxLength : 20000 # It overrides the default limit \"30000\" for the buffers owned by this vertex bufferUsageLimit : 70 # It overrides the default limit \"85\" for the buffers owned by this vertex - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out","title":"Pipeline Tuning"},{"location":"user-guide/reference/side-inputs/","text":"Side Inputs \u00b6 For an unbounded pipeline in Numaflow that never terminates, there are many cases where users want to update a configuration of the UDF without restarting the pipeline. Numaflow enables it by the Side Inputs feature where we can broadcast changes to vertices automatically. The Side Inputs feature achieves this by allowing users to write custom UDFs to broadcast changes to the vertices that are listening in for updates. Using Side Inputs in Numaflow \u00b6 The Side Inputs are updated based on a cron-like schedule, specified in the pipeline spec with a trigger field. Multiple side inputs are supported as well. Below is an example of pipeline spec with side inputs, which runs the custom UDFs every 15 mins and broadcasts the changes if there is any change to be broadcasted. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : s3 container : image : my-sideinputs-s3-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles - name : redis container : image : my-sideinputs-redis-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - s3 - name : my-vertex-multiple-side-inputs sideInputs : - s3 - redis Implementing User-defined Side Inputs \u00b6 To use the Side Inputs feature, a User-defined function implementing an interface defined in the Numaflow SDK ( Go , Python , Java ) is needed to retrieve the data. You can choose the SDK of your choice to create a User-defined Side Input image which implements the Side Inputs Update. Example in Golang \u00b6 Here is an example of how to write a User-defined Side Input in Golang, // handle is the side input handler function. func handle ( _ context . Context ) sideinputsdk . Message { t := time . Now () // val is the side input message value. This would be the value that the side input vertex receives. val := \"an example: \" + string ( t . String ()) // randomly drop side input message. Note that the side input message is not retried. // NoBroadcastMessage() is used to drop the message and not to // broadcast it to other side input vertices. counter = ( counter + 1 ) % 10 if counter % 2 == 0 { return sideinputsdk . NoBroadcastMessage () } // BroadcastMessage() is used to broadcast the message with the given value to other side input vertices. // val must be converted to []byte. return sideinputsdk . BroadcastMessage ([] byte ( val )) } Similarly, this can be written in Python and Java as well. After performing the retrieval/update, the side input value is then broadcasted to all vertices that use the side input. // BroadcastMessage() is used to broadcast the message with the given value. sideinputsdk . 
BroadcastMessage ([] byte ( val )) In some cases, the user may want to drop the message and not broadcast the side input value further. // NoBroadcastMessage() is used to drop the message and not to broadcast it further sideinputsdk . NoBroadcastMessage () UDF \u00b6 Users need to add a watcher on the filesystem to fetch the updated side inputs in their User-defined Source/Function/Sink in order to apply the new changes into the data process. For each side input there will be a file with the given path and after any update to the side input value the file will be updated. The directory is fixed and can be accessed through a sideinput constant and the file name is the name of the side input. sideinput . DirPath - > \"/var/numaflow/side-inputs\" sideInputFileName - > \"/var/numaflow/side-inputs/sideInputName\" Here are some examples of watching the side input filesystem for changes in Golang , Python and Java .","title":"Side Inputs"},{"location":"user-guide/reference/side-inputs/#side-inputs","text":"For an unbounded pipeline in Numaflow that never terminates, there are many cases where users want to update a configuration of the UDF without restarting the pipeline. Numaflow enables it by the Side Inputs feature where we can broadcast changes to vertices automatically. The Side Inputs feature achieves this by allowing users to write custom UDFs to broadcast changes to the vertices that are listening in for updates.","title":"Side Inputs"},{"location":"user-guide/reference/side-inputs/#using-side-inputs-in-numaflow","text":"The Side Inputs are updated based on a cron-like schedule, specified in the pipeline spec with a trigger field. Multiple side inputs are supported as well. Below is an example of pipeline spec with side inputs, which runs the custom UDFs every 15 mins and broadcasts the changes if there is any change to be broadcasted. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : s3 container : image : my-sideinputs-s3-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles - name : redis container : image : my-sideinputs-redis-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - s3 - name : my-vertex-multiple-side-inputs sideInputs : - s3 - redis","title":"Using Side Inputs in Numaflow"},{"location":"user-guide/reference/side-inputs/#implementing-user-defined-side-inputs","text":"To use the Side Inputs feature, a User-defined function implementing an interface defined in the Numaflow SDK ( Go , Python , Java ) is needed to retrieve the data. You can choose the SDK of your choice to create a User-defined Side Input image which implements the Side Inputs Update.","title":"Implementing User-defined Side Inputs"},{"location":"user-guide/reference/side-inputs/#example-in-golang","text":"Here is an example of how to write a User-defined Side Input in Golang, // handle is the side input handler function. func handle ( _ context . Context ) sideinputsdk . Message { t := time . Now () // val is the side input message value. This would be the value that the side input vertex receives. val := \"an example: \" + string ( t . String ()) // randomly drop side input message. Note that the side input message is not retried. // NoBroadcastMessage() is used to drop the message and not to // broadcast it to other side input vertices. counter = ( counter + 1 ) % 10 if counter % 2 == 0 { return sideinputsdk . NoBroadcastMessage () } // BroadcastMessage() is used to broadcast the message with the given value to other side input vertices. // val must be converted to []byte. return sideinputsdk . BroadcastMessage ([] byte ( val )) } Similarly, this can be written in Python and Java as well. 
After performing the retrieval/update, the side input value is then broadcasted to all vertices that use the side input. // BroadcastMessage() is used to broadcast the message with the given value. sideinputsdk . BroadcastMessage ([] byte ( val )) In some cases, the user may want to drop the message and not broadcast the side input value further. // NoBroadcastMessage() is used to drop the message and not to broadcast it further sideinputsdk . NoBroadcastMessage ()","title":"Example in Golang"},{"location":"user-guide/reference/side-inputs/#udf","text":"Users need to add a watcher on the filesystem to fetch the updated side inputs in their User-defined Source/Function/Sink in order to apply the new changes into the data process. For each side input there will be a file with the given path and after any update to the side input value the file will be updated. The directory is fixed and can be accessed through a sideinput constant and the file name is the name of the side input. sideinput . DirPath - > \"/var/numaflow/side-inputs\" sideInputFileName - > \"/var/numaflow/side-inputs/sideInputName\" Here are some examples of watching the side input filesystem for changes in Golang , Python and Java .","title":"UDF"},{"location":"user-guide/reference/configuration/container-resources/","text":"Container Resources \u00b6 Container Resources can be customized for all the types of vertices. For configuring container resources on pods not owned by a vertex, see Pipeline Customization . 
Numa Container \u00b6 To specify resources for the numa container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex containerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDF Container \u00b6 To specify resources for udf container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex udf : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDSource Container \u00b6 To specify resources for udsource container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : udsource : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Source Transformer Container \u00b6 To specify resources for transformer container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : transformer : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDSink Container \u00b6 To specify resources for udsink container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex sink : udsink : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Init Container \u00b6 To specify resources for the init init-container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex initContainerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Container resources for user init-containers are instead 
specified at .spec.vertices[*].initContainers[*].resources .","title":"Container Resources"},{"location":"user-guide/reference/configuration/container-resources/#container-resources","text":"Container Resources can be customized for all the types of vertices. For configuring container resources on pods not owned by a vertex, see Pipeline Customization .","title":"Container Resources"},{"location":"user-guide/reference/configuration/container-resources/#numa-container","text":"To specify resources for the numa container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex containerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"Numa Container"},{"location":"user-guide/reference/configuration/container-resources/#udf-container","text":"To specify resources for udf container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex udf : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDF Container"},{"location":"user-guide/reference/configuration/container-resources/#udsource-container","text":"To specify resources for udsource container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : udsource : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDSource Container"},{"location":"user-guide/reference/configuration/container-resources/#source-transformer-container","text":"To specify resources for transformer container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : transformer : container : resources : limits : cpu : 
\"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"Source Transformer Container"},{"location":"user-guide/reference/configuration/container-resources/#udsink-container","text":"To specify resources for udsink container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex sink : udsink : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDSink Container"},{"location":"user-guide/reference/configuration/container-resources/#init-container","text":"To specify resources for the init init-container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex initContainerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Container resources for user init-containers are instead specified at .spec.vertices[*].initContainers[*].resources .","title":"Init Container"},{"location":"user-guide/reference/configuration/environment-variables/","text":"Environment Variables \u00b6 For the numa container of vertex pods, environment variable NUMAFLOW_DEBUG can be set to true for debugging . In udf , udsink and transformer containers, there are some preset environment variables that can be used directly. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. NUMAFLOW_CPU_REQUEST - resources.requests.cpu , roundup to N cores, 0 if missing. NUMAFLOW_CPU_LIMIT - resources.limits.cpu , roundup to N cores, use host cpu cores if missing. NUMAFLOW_MEMORY_REQUEST - resources.requests.memory in bytes, 0 if missing. NUMAFLOW_MEMORY_LIMIT - resources.limits.memory in bytes, use host memory if missing. For setting environment variables on pods not owned by a vertex, see Pipeline Customization . 
Your Own Environment Variables \u00b6 To add your own environment variables to udf or udsink containers, check the example below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest env : - name : env01 value : value01 - name : env02 valueFrom : secretKeyRef : name : my-secret key : my-key - name : my-sink sink : udsink : container : image : my-sink:latest env : - name : env03 value : value03 Similarly, envFrom also can be specified in udf or udsink containers. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest envFrom : - configMapRef : name : my-config - name : my-sink sink : udsink : container : image : my-sink:latest envFrom : - secretRef : name : my-secret","title":"Environment Variables"},{"location":"user-guide/reference/configuration/environment-variables/#environment-variables","text":"For the numa container of vertex pods, environment variable NUMAFLOW_DEBUG can be set to true for debugging . In udf , udsink and transformer containers, there are some preset environment variables that can be used directly. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. NUMAFLOW_CPU_REQUEST - resources.requests.cpu , roundup to N cores, 0 if missing. NUMAFLOW_CPU_LIMIT - resources.limits.cpu , roundup to N cores, use host cpu cores if missing. NUMAFLOW_MEMORY_REQUEST - resources.requests.memory in bytes, 0 if missing. NUMAFLOW_MEMORY_LIMIT - resources.limits.memory in bytes, use host memory if missing. 
For setting environment variables on pods not owned by a vertex, see Pipeline Customization .","title":"Environment Variables"},{"location":"user-guide/reference/configuration/environment-variables/#your-own-environment-variables","text":"To add your own environment variables to udf or udsink containers, check the example below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest env : - name : env01 value : value01 - name : env02 valueFrom : secretKeyRef : name : my-secret key : my-key - name : my-sink sink : udsink : container : image : my-sink:latest env : - name : env03 value : value03 Similarly, envFrom also can be specified in udf or udsink containers. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest envFrom : - configMapRef : name : my-config - name : my-sink sink : udsink : container : image : my-sink:latest envFrom : - secretRef : name : my-secret","title":"Your Own Environment Variables"},{"location":"user-guide/reference/configuration/init-containers/","text":"Init Containers \u00b6 Init Containers can be provided for all the types of vertices. The following example shows how to add an init-container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf initContainers : - name : my-init image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-init is running!\\\" && sleep 60\" ] udf : container : image : my-function:latest The following example shows how to use init-containers and volumes together to provide a udf container files on startup. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf volumes : - name : my-udf-data emptyDir : {} initContainers : - name : my-init image : amazon/aws-cli:latest command : [ \"/bin/sh\" , \"-c\" , \"aws s3 sync s3://path/to/my-s3-data /path/to/my-init-data\" ] volumeMounts : - mountPath : /path/to/my-init-data name : my-udf-data udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-data name : my-udf-data","title":"Init Containers"},{"location":"user-guide/reference/configuration/init-containers/#init-containers","text":"Init Containers can be provided for all the types of vertices. The following example shows how to add an init-container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf initContainers : - name : my-init image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-init is running!\\\" && sleep 60\" ] udf : container : image : my-function:latest The following example shows how to use init-containers and volumes together to provide a udf container files on startup. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf volumes : - name : my-udf-data emptyDir : {} initContainers : - name : my-init image : amazon/aws-cli:latest command : [ \"/bin/sh\" , \"-c\" , \"aws s3 sync s3://path/to/my-s3-data /path/to/my-init-data\" ] volumeMounts : - mountPath : /path/to/my-init-data name : my-udf-data udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-data name : my-udf-data","title":"Init Containers"},{"location":"user-guide/reference/configuration/istio/","text":"Running on Istio \u00b6 If you want to get pipeline vertices running on Istio, so that they are able to talk to other services with Istio enabled, one approach is to whitelist the ports that Numaflow uses. To whitelist the ports, add traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts annotations to your vertex configuration: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port udf : container : image : my-udf-image:latest ... 
If you want to apply the same configuration to all the vertices, use Vertex Template : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port vertices : - name : my-vertex-1 udf : container : image : my-udf-1-image:latest - name : my-vertex-2 udf : container : image : my-udf-2-image:latest ...","title":"Running on Istio"},{"location":"user-guide/reference/configuration/istio/#running-on-istio","text":"If you want to get pipeline vertices running on Istio, so that they are able to talk to other services with Istio enabled, one approach is to whitelist the ports that Numaflow uses. To whitelist the ports, add traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts annotations to your vertex configuration: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port udf : container : image : my-udf-image:latest ... 
If you want to apply the same configuration to all the vertices, use Vertex Template : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port vertices : - name : my-vertex-1 udf : container : image : my-udf-1-image:latest - name : my-vertex-2 udf : container : image : my-udf-2-image:latest ...","title":"Running on Istio"},{"location":"user-guide/reference/configuration/labels-and-annotations/","text":"Labels And Annotations \u00b6 Sometimes customized Labels or Annotations are needed for the vertices, for example, adding an annotation to enable or disable Istio sidecar injection. To do that, a metadata field with labels or annotations can be added to the vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : labels : key1 : val1 key2 : val2 annotations : key3 : val3 key4 : val4","title":"Labels And Annotations"},{"location":"user-guide/reference/configuration/labels-and-annotations/#labels-and-annotations","text":"Sometimes customized Labels or Annotations are needed for the vertices, for example, adding an annotation to enable or disable Istio sidecar injection. To do that, a metadata field with labels or annotations can be added to the vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : labels : key1 : val1 key2 : val2 annotations : key3 : val3 key4 : val4","title":"Labels And Annotations"},{"location":"user-guide/reference/configuration/max-message-size/","text":"Maximum Message Size \u00b6 The default maximum message size is 1MB . 
There's a way to increase this limit in case you want to, but please think it through before doing so. The max message size is determined by: Max message size supported by gRPC (default value is 64MB in Numaflow). Max message size supported by the Inter-Step Buffer implementation. If JetStream is used as the Inter-Step Buffer implementation, the default max message size for it is configured as 1MB . You can change it by setting the spec.jetstream.settings in the InterStepBufferService specification. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : settings : | max_payload: 8388608 # 8MB It's not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). Please be aware that if you increase the max message size of the InterStepBufferService , you probably will also need to change some other limits. For example, if the size of each message is as large as 8MB, then 100 messages flowing in the pipeline will make each Inter-Step Buffer need at least 800MB of disk space to store the messages, and the memory consumption will also be high, which will probably cause the Inter-Step Buffer Service to crash. In that case, you might need to update the retention policy in the Inter-Step Buffer Service to make sure the messages are not stored for too long. Check out the Inter-Step Buffer Service for more details.","title":"Maximum Message Size"},{"location":"user-guide/reference/configuration/max-message-size/#maximum-message-size","text":"The default maximum message size is 1MB . There's a way to increase this limit in case you want to, but please think it through before doing so. The max message size is determined by: Max message size supported by gRPC (default value is 64MB in Numaflow). Max message size supported by the Inter-Step Buffer implementation. 
If JetStream is used as the Inter-Step Buffer implementation, the default max message size for it is configured as 1MB . You can change it by setting the spec.jetstream.settings in the InterStepBufferService specification. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : settings : | max_payload: 8388608 # 8MB It's not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). Please be aware that if you increase the max message size of the InterStepBufferService , you probably will also need to change some other limits. For example, if the size of each message is as large as 8MB, then 100 messages flowing in the pipeline will make each Inter-Step Buffer need at least 800MB of disk space to store the messages, and the memory consumption will also be high, which will probably cause the Inter-Step Buffer Service to crash. In that case, you might need to update the retention policy in the Inter-Step Buffer Service to make sure the messages are not stored for too long. Check out the Inter-Step Buffer Service for more details.","title":"Maximum Message Size"},{"location":"user-guide/reference/configuration/pipeline-customization/","text":"Pipeline Customization \u00b6 There is an optional .spec.templates field in the Pipeline resource which may be used to customize Kubernetes resources owned by the Pipeline. Vertex customization is described separately in more detail (e.g., Environment Variables , Container Resources ). Daemon Deployment \u00b6 The following example shows how to configure a Daemon Deployment with all currently supported fields. The .spec.templates.daemon field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : daemon : # Deployment spec replicas : 3 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : app.kubernetes.io/component operator : In values : - daemon - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Jobs \u00b6 The following example shows how to configure kubernetes Jobs owned by a Pipeline with all currently supported fields. The .spec.templates.job field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : job : # Job spec ttlSecondsAfterFinished : 600 # numaflow defaults to 30 backoffLimit : 5 # numaflow defaults to 20 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : {} # Container containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Vertices \u00b6 The following example shows how to configure all the vertex pods owned by a pipeline with all currently supported fields. Be aware that these configurations, which apply to all vertex pods, can be overridden by the vertex configuration. The .spec.templates.vertex field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Side Inputs \u00b6 The following example shows how to configure all the Side Inputs Manager pods owned by a pipeline with all currently supported fields. The .spec.templates.sideInputsManager field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : sideInputsManager : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Pipeline Customization"},{"location":"user-guide/reference/configuration/pipeline-customization/#pipeline-customization","text":"There is an optional .spec.templates field in the Pipeline resource which may be used to customize kubernetes resources owned by the Pipeline. Vertex customization is described separately in more detail (i.e. Environment Variables , Container Resources , etc.).","title":"Pipeline Customization"},{"location":"user-guide/reference/configuration/pipeline-customization/#daemon-deployment","text":"The following example shows how to configure a Daemon Deployment with all currently supported fields. The .spec.templates.daemon field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : daemon : # Deployment spec replicas : 3 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : app.kubernetes.io/component operator : In values : - daemon - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Daemon Deployment"},{"location":"user-guide/reference/configuration/pipeline-customization/#jobs","text":"The following example shows how to configure kubernetes Jobs owned by a Pipeline with all currently supported fields. The .spec.templates.job field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : job : # Job spec ttlSecondsAfterFinished : 600 # numaflow defaults to 30 backoffLimit : 5 # numaflow defaults to 20 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : {} # Container containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Jobs"},{"location":"user-guide/reference/configuration/pipeline-customization/#vertices","text":"The following example shows how to configure all the vertex pods owned by a pipeline with all currently supported fields. Be aware that these configurations, which apply to all vertex pods, can be overridden by the vertex configuration. The .spec.templates.vertex field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Vertices"},{"location":"user-guide/reference/configuration/pipeline-customization/#side-inputs","text":"The following example shows how to configure all the Side Inputs Manager pods owned by a pipeline with all currently supported fields. The .spec.templates.sideInputsManager field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : sideInputsManager : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Side Inputs"},{"location":"user-guide/reference/configuration/sidecar-containers/","text":"Sidecar Containers \u00b6 Additional \" sidecar \" containers can be provided for udf and sink vertices. source vertices do not currently support sidecars. The following example shows how to add a sidecar container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf sidecars : - name : my-sidecar image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-sidecar is running!\\\" && tail -f /dev/null\" ] udf : container : image : my-function:latest There are various use-cases for sidecars. One possible use-case is a udf container that needs functionality from a library written in a different language. The library's functionality could be made available through gRPC over Unix Domain Socket. The following example shows how that could be accomplished using a shared volume . 
It is the sidecar owner's responsibility to come up with a protocol that can be used with the UDF. It could be volume, gRPC, TCP, HTTP 1.x, etc., apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf-vertex volumes : - name : my-udf-volume emptyDir : medium : Memory sidecars : - name : my-sidecar image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && socat UNIX-LISTEN:/path/to/my-sidecar-mount-path/my.sock - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-sidecar-mount-path name : my-udf-volume udf : container : image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && echo \\\"hello\\\" | socat UNIX-CONNECT:/path/to/my-udf-mount-path/my.sock,forever - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-udf-mount-path name : my-udf-volume","title":"Sidecar Containers"},{"location":"user-guide/reference/configuration/sidecar-containers/#sidecar-containers","text":"Additional \" sidecar \" containers can be provided for udf and sink vertices. source vertices do not currently support sidecars. The following example shows how to add a sidecar container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf sidecars : - name : my-sidecar image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-sidecar is running!\\\" && tail -f /dev/null\" ] udf : container : image : my-function:latest There are various use-cases for sidecars. One possible use-case is a udf container that needs functionality from a library written in a different language. The library's functionality could be made available through gRPC over Unix Domain Socket. The following example shows how that could be accomplished using a shared volume . It is the sidecar owner's responsibility to come up with a protocol that can be used with the UDF. 
It could be volume, gRPC, TCP, HTTP 1.x, etc., apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf-vertex volumes : - name : my-udf-volume emptyDir : medium : Memory sidecars : - name : my-sidecar image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && socat UNIX-LISTEN:/path/to/my-sidecar-mount-path/my.sock - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-sidecar-mount-path name : my-udf-volume udf : container : image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && echo \\\"hello\\\" | socat UNIX-CONNECT:/path/to/my-udf-mount-path/my.sock,forever - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-udf-mount-path name : my-udf-volume","title":"Sidecar Containers"},{"location":"user-guide/reference/configuration/volumes/","text":"Volumes \u00b6 Volumes can be mounted to udsource , udf or udsink containers. Following example shows how to mount a ConfigMap to an udsource vertex, an udf vertex and an udsink vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-source volumes : - name : my-udsource-config configMap : name : udsource-config source : udsource : container : image : my-source:latest volumeMounts : - mountPath : /path/to/my-source-config name : my-udsource-config - name : my-udf volumes : - name : my-udf-config configMap : name : udf-config udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-function-config name : my-udf-config - name : my-sink volumes : - name : my-udsink-config configMap : name : udsink-config sink : udsink : container : image : my-sink:latest volumeMounts : - mountPath : /path/to/my-sink-config name : my-udsink-config","title":"Volumes"},{"location":"user-guide/reference/configuration/volumes/#volumes","text":"Volumes can be mounted to udsource , udf or udsink containers. 
Following example shows how to mount a ConfigMap to an udsource vertex, an udf vertex and an udsink vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-source volumes : - name : my-udsource-config configMap : name : udsource-config source : udsource : container : image : my-source:latest volumeMounts : - mountPath : /path/to/my-source-config name : my-udsource-config - name : my-udf volumes : - name : my-udf-config configMap : name : udf-config udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-function-config name : my-udf-config - name : my-sink volumes : - name : my-udsink-config configMap : name : udsink-config sink : udsink : container : image : my-sink:latest volumeMounts : - mountPath : /path/to/my-sink-config name : my-udsink-config","title":"Volumes"},{"location":"user-guide/reference/kustomize/kustomize/","text":"Kustomize Integration \u00b6 Transformers \u00b6 Kustomize Transformer Configurations can be used to do lots of powerful operations such as ConfigMap and Secret generations, applying common labels and annotations, updating image names and tags. To use these features with Numaflow CRD objects, download numaflow-transformer-config.yaml into your kustomize directory, and add it to configurations section. kind : Kustomization apiVersion : kustomize.config.k8s.io/v1beta1 configurations : - numaflow-transformer-config.yaml # Or reference the remote configuration directly. # - https://raw.githubusercontent.com/numaproj/numaflow/main/docs/user-guide/reference/kustomize/numaflow-transformer-config.yaml Here is an example to use transformers with a Pipeline. Patch \u00b6 Starting from version 4.5.5, kustomize can use Kubernetes OpenAPI schema to provide merge key and patch strategy information. To use that with Numaflow CRD objects, download schema.json into your kustomize directory, and add it to openapi section. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization openapi : path : schema.json # Or reference the remote configuration directly. # path: https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json For example, given the following Pipeline spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : my-udf udf : container : image : my-pipeline/my-udf:v0.1 - name : out sink : log : {} edges : - from : in to : my-udf - from : my-udf to : out You can update the source spec via a patch in a kustomize file. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - my-pipeline.yaml openapi : path : https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json patches : - patch : |- apiVersion: numaflow.numaproj.io/v1alpha1 kind: Pipeline metadata: name: my-pipeline spec: vertices: - name: in source: generator: rpu: 500 See the full example here .","title":"Kustomize Integration"},{"location":"user-guide/reference/kustomize/kustomize/#kustomize-integration","text":"","title":"Kustomize Integration"},{"location":"user-guide/reference/kustomize/kustomize/#transformers","text":"Kustomize Transformer Configurations can be used to do lots of powerful operations such as ConfigMap and Secret generations, applying common labels and annotations, updating image names and tags. To use these features with Numaflow CRD objects, download numaflow-transformer-config.yaml into your kustomize directory, and add it to configurations section. kind : Kustomization apiVersion : kustomize.config.k8s.io/v1beta1 configurations : - numaflow-transformer-config.yaml # Or reference the remote configuration directly. 
# - https://raw.githubusercontent.com/numaproj/numaflow/main/docs/user-guide/reference/kustomize/numaflow-transformer-config.yaml Here is an example to use transformers with a Pipeline.","title":"Transformers"},{"location":"user-guide/reference/kustomize/kustomize/#patch","text":"Starting from version 4.5.5, kustomize can use Kubernetes OpenAPI schema to provide merge key and patch strategy information. To use that with Numaflow CRD objects, download schema.json into your kustomize directory, and add it to openapi section. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization openapi : path : schema.json # Or reference the remote configuration directly. # path: https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json For example, given the following Pipeline spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : my-udf udf : container : image : my-pipeline/my-udf:v0.1 - name : out sink : log : {} edges : - from : in to : my-udf - from : my-udf to : out You can update the source spec via a patch in a kustomize file. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - my-pipeline.yaml openapi : path : https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json patches : - patch : |- apiVersion: numaflow.numaproj.io/v1alpha1 kind: Pipeline metadata: name: my-pipeline spec: vertices: - name: in source: generator: rpu: 500 See the full example here .","title":"Patch"},{"location":"user-guide/sinks/blackhole/","text":"Blackhole Sink \u00b6 A Blackhole sink is where the output is drained without writing to any sink, it is to emulate /dev/null . 
spec : vertices : - name : output sink : blackhole : {} NOTE: For efficiency, the previous vertex should ideally not forward the message at all, which avoids network latency.","title":"Blackhole Sink"},{"location":"user-guide/sinks/blackhole/#blackhole-sink","text":"A Blackhole sink is where the output is drained without writing to any sink, it is to emulate /dev/null . spec : vertices : - name : output sink : blackhole : {} NOTE: For efficiency, the previous vertex should ideally not forward the message at all, which avoids network latency.","title":"Blackhole Sink"},{"location":"user-guide/sinks/fallback/","text":"Fallback Sink \u00b6 A Fallback Sink functions as a Dead Letter Queue (DLQ) Sink and can be configured to serve as a backup when the primary sink is down, unavailable, or under maintenance. This is particularly useful when multiple sinks are in a pipeline; if a sink fails, the resulting back-pressure will back-propagate and stop the source vertex from reading more data. A Fallback Sink can be set up to prevent this from happening. This backup sink stores data while the primary sink is offline. The stored data can be replayed once the primary sink is back online. Note: The fallback field is optional. Users are required to return a fallback response from the user-defined sink when the primary sink fails; only then will the messages be directed to the fallback sink. Example of a fallback response in a user-defined sink: here CAVEATs \u00b6 The fallback field can only be utilized when the primary sink is a User Defined Sink. Example \u00b6 Builtin Kafka \u00b6 An example using builtin kafka as fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic UD Sink \u00b6 An example using custom user-defined sink as fallback sink.
User Defined Sink as a fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : udsink : container : image : my-sink:latest","title":"Fallback Sink"},{"location":"user-guide/sinks/fallback/#fallback-sink","text":"A Fallback Sink functions as a Dead Letter Queue (DLQ) Sink and can be configured to serve as a backup when the primary sink is down, unavailable, or under maintenance. This is particularly useful when multiple sinks are in a pipeline; if a sink fails, the resulting back-pressure will back-propagate and stop the source vertex from reading more data. A Fallback Sink can be set up to prevent this from happening. This backup sink stores data while the primary sink is offline. The stored data can be replayed once the primary sink is back online. Note: The fallback field is optional. Users are required to return a fallback response from the user-defined sink when the primary sink fails; only then will the messages be directed to the fallback sink. Example of a fallback response in a user-defined sink: here","title":"Fallback Sink"},{"location":"user-guide/sinks/fallback/#caveats","text":"The fallback field can only be utilized when the primary sink is a User Defined Sink.","title":"CAVEATs"},{"location":"user-guide/sinks/fallback/#example","text":"","title":"Example"},{"location":"user-guide/sinks/fallback/#builtin-kafka","text":"An example using builtin kafka as fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic","title":"Builtin Kafka"},{"location":"user-guide/sinks/fallback/#ud-sink","text":"An example using custom user-defined sink as fallback sink.
User Defined Sink as a fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : udsink : container : image : my-sink:latest","title":"UD Sink"},{"location":"user-guide/sinks/kafka/","text":"Kafka Sink \u00b6 A Kafka sink is used to forward the messages to a Kafka topic. Kafka sink supports configuration overrides. spec : vertices : - name : kafka-output sink : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference
which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true # Optional, a yaml format string which could apply more configuration for the sink. # The configuration hierarchy follows the Struct of sarama.Config at https://github.com/IBM/sarama/blob/main/config.go. config : | producer: compression: 2","title":"Kafka Sink"},{"location":"user-guide/sinks/kafka/#kafka-sink","text":"A Kafka sink is used to forward the messages to a Kafka topic. Kafka sink supports configuration overrides. spec : vertices : - name : kafka-output sink : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. 
name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret
reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true # Optional, a yaml format string which could apply more configuration for the sink. # The configuration hierarchy follows the Struct of sarama.Config at https://github.com/IBM/sarama/blob/main/config.go. config : | producer: compression: 2","title":"Kafka Sink"},{"location":"user-guide/sinks/log/","text":"Log Sink \u00b6 A Log sink is very useful for debugging, it prints all the received messages to stdout . spec : vertices : - name : output sink : log : {}","title":"Log Sink"},{"location":"user-guide/sinks/log/#log-sink","text":"A Log sink is very useful for debugging, it prints all the received messages to stdout . spec : vertices : - name : output sink : log : {}","title":"Log Sink"},{"location":"user-guide/sinks/overview/","text":"Sinks \u00b6 The Sink serves as the endpoint for processed data that has been outputted from the platform, which is then sent to an external system or application. The purpose of the Sink is to deliver the processed data to its ultimate destination, such as a database, data warehouse, visualization tool, or alerting system. It's the opposite of the Source vertex, which receives input data into the platform. Sink vertex may require transformation or formatting of data prior to sending it to the target system. Depending on the target system's needs, this transformation can be simple or complex. A pipeline can have many Sink vertices, unlike the Source vertex. 
Numaflow currently supports the following Sinks Kafka Log Black Hole User-defined Sink A user-defined sink is a custom Sink that a user can write using Numaflow SDK when the user needs to output the processed data to a system or using a certain transformation that is not supported by the platform's built-in sinks. As an example, once we have processed the input messages, we can use Elasticsearch as a user-defined sink to store the processed data and enable search and analysis on the data. Fallback Sink (DLQ) \u00b6 There is explicit DLQ support for sinks using a concept called fallback sink . For the rest of the vertices, if you need DLQ, please use conditional-forwarding . A Sink cannot do conditional-forwarding since it is a terminal state, hence the explicit fallback option.","title":"Overview"},{"location":"user-guide/sinks/overview/#sinks","text":"The Sink serves as the endpoint for processed data that has been outputted from the platform, which is then sent to an external system or application. The purpose of the Sink is to deliver the processed data to its ultimate destination, such as a database, data warehouse, visualization tool, or alerting system. It's the opposite of the Source vertex, which receives input data into the platform. Sink vertex may require transformation or formatting of data prior to sending it to the target system. Depending on the target system's needs, this transformation can be simple or complex. A pipeline can have many Sink vertices, unlike the Source vertex. Numaflow currently supports the following Sinks Kafka Log Black Hole User-defined Sink A user-defined sink is a custom Sink that a user can write using Numaflow SDK when the user needs to output the processed data to a system or using a certain transformation that is not supported by the platform's built-in sinks.
As an example, once we have processed the input messages, we can use Elasticsearch as a user-defined sink to store the processed data and enable search and analysis on the data.","title":"Sinks"},{"location":"user-guide/sinks/overview/#fallback-sink-dlq","text":"There is explicit DLQ support for sinks using a concept called fallback sink . For the rest of the vertices, if you need a DLQ, please use conditional-forwarding . A Sink cannot do conditional-forwarding since it is a terminal state, and hence we have an explicit fallback option.","title":"Fallback Sink (DLQ)"},{"location":"user-guide/sinks/user-defined-sinks/","text":"User-defined Sinks \u00b6 A Pipeline may have multiple Sinks, those sinks could either be a pre-defined sink such as kafka , log , etc., or a user-defined sink . A pre-defined sink vertex runs single-container pods, a user-defined sink runs two-container pods. Build Your Own User-defined Sinks \u00b6 You can build your own user-defined sinks in multiple languages. Check the links below to see the examples for different languages. Golang Java Python A user-defined sink vertex looks like below. spec : vertices : - name : output sink : udsink : container : image : my-sink:latest Available Environment Variables \u00b6 Some environment variables are available in the user-defined sink container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. User-defined Sinks contributed from the open source community \u00b6 If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sinks like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined sink.","title":"User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#user-defined-sinks","text":"A Pipeline may have multiple Sinks, those sinks could either be a pre-defined sink such as kafka , log , etc., or a user-defined sink . A pre-defined sink vertex runs single-container pods, a user-defined sink runs two-container pods.","title":"User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#build-your-own-user-defined-sinks","text":"You can build your own user-defined sinks in multiple languages. Check the links below to see the examples for different languages. Golang Java Python A user-defined sink vertex looks like below. spec : vertices : - name : output sink : udsink : container : image : my-sink:latest","title":"Build Your Own User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#available-environment-variables","text":"Some environment variables are available in the user-defined sink container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sinks/user-defined-sinks/#user-defined-sinks-contributed-from-the-open-source-community","text":"If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . These user-defined sinks like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined sink.","title":"User-defined Sinks contributed from the open source community"},{"location":"user-guide/sources/generator/","text":"Generator Source \u00b6 Generator Source is mainly used for development purpose, where you want to have self-contained source to generate some messages. 
We mainly use generator for load testing and integration testing of Numaflow. The load generated is per replica. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : # How many messages to generate in the duration. rpu : 100 duration : 1s # Optional, size of each generated message, defaults to 10. msgSize : 1024 - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out User Defined Data \u00b6 The default data created by the generator is not likely to be useful in testing user pipelines with specific business logic. To allow this to help with user testing, a user defined value can be provided which will be emitted for each of the generator. - name: in source: generator: # How many messages to generate in the duration. rpu: 100 duration: 1s # Base64 encoding of data to send. Can be example serialized packet to # run through user pipeline to exercise particular capability or path through pipeline valueBlob: \"InlvdXIgc3BlY2lmaWMgZGF0YSI=\" # Note: msgSize and value will be ignored if valueBlob is set","title":"Generator Source"},{"location":"user-guide/sources/generator/#generator-source","text":"Generator Source is mainly used for development purpose, where you want to have self-contained source to generate some messages. We mainly use generator for load testing and integration testing of Numaflow. The load generated is per replica. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : # How many messages to generate in the duration. rpu : 100 duration : 1s # Optional, size of each generated message, defaults to 10. 
msgSize : 1024 - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out","title":"Generator Source"},{"location":"user-guide/sources/generator/#user-defined-data","text":"The default data created by the generator is not likely to be useful in testing user pipelines with specific business logic. To allow this to help with user testing, a user defined value can be provided which will be emitted for each of the generator. - name: in source: generator: # How many messages to generate in the duration. rpu: 100 duration: 1s # Base64 encoding of data to send. Can be example serialized packet to # run through user pipeline to exercise particular capability or path through pipeline valueBlob: \"InlvdXIgc3BlY2lmaWMgZGF0YSI=\" # Note: msgSize and value will be ignored if valueBlob is set","title":"User Defined Data"},{"location":"user-guide/sources/http/","text":"HTTP Source \u00b6 HTTP Source starts an HTTP service with TLS enabled to accept POST request in the Vertex Pod. It listens to port 8443, with request URI /vertices/{vertexName} . A Pipeline with HTTP Source: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : {} - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out Sending Data \u00b6 Data could be sent to an HTTP source through: ClusterIP Service (within the cluster) Ingress or LoadBalancer Service (outside of the cluster) Port-forward (for testing) ClusterIP Service \u00b6 An HTTP Source Vertex can generate a ClusterIP Service if service: true is specified, the service name is in the format of {pipelineName}-{vertexName} , so the HTTP Source can be accessed through https://{pipelineName}-{vertexName}.{namespace}.svc:8443/vertices/{vertexName} within the cluster. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : service : true LoadBalancer Service or Ingress \u00b6 To create a LoadBalancer type Service, or a NodePort one for Ingress, you need to do it on your own. Just remember to use a selector like the following in the Service: numaflow.numaproj.io/pipeline-name : http-pipeline # pipeline name numaflow.numaproj.io/vertex-name : in # vertex name Port-forwarding \u00b6 To test an HTTP source, you can do it from your local machine through port-forwarding. kubectl port-forward pod ${ pod -name } 8443 curl -kq -X POST -d \"hello world\" https://localhost:8443/vertices/in x-numaflow-id \u00b6 When posting data to the HTTP Source, an optional HTTP header x-numaflow-id can be specified, which will be used to dedup. If it's not provided, the HTTP Source will generate a random UUID to do it. curl -kq -X POST -H \"x-numaflow-id: ${ id } \" -d \"hello world\" ${ http -source-url } x-numaflow-event-time \u00b6 By default, the time the data arrives at the HTTP source is used as the event time; it can be overridden by setting the HTTP header x-numaflow-event-time to the number of milliseconds elapsed since January 1, 1970 UTC. curl -kq -X POST -H \"x-numaflow-event-time: 1663006726000\" -d \"hello world\" ${ http -source-url } Auth \u00b6 A Bearer token can be configured to prevent the HTTP Source from being accessed by unexpected clients. To do so, a Kubernetes Secret needs to be created to store the token, and the valid clients also need to include the token in their HTTP request headers. Firstly, create a k8s secret containing your token. 
echo -n 'tr3qhs321fjglwf1e2e67dfda4tr' > ./token.txt kubectl create secret generic http-source-token --from-file = my-token = ./token.txt Then add auth to the Source Vertex: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : auth : token : name : http-source-token key : my-token When the clients post data to the Source Vertex, add Authorization: Bearer tr3qhs321fjglwf1e2e67dfda4tr to the header, for example: TOKEN = \"Bearer tr3qhs321fjglwf1e2e67dfda4tr\" # Post data from a Pod in the same namespace of the cluster curl -kq -X POST -H \"Authorization: $TOKEN \" -d \"hello world\" https://http-pipeline-in:8443/vertices/in Health Check \u00b6 The HTTP Source also has an endpoint /health created automatically, which is useful for LoadBalancer or Ingress configuration, where a health check endpoint is often required by the cloud provider.","title":"HTTP Source"},{"location":"user-guide/sources/http/#http-source","text":"HTTP Source starts an HTTP service with TLS enabled to accept POST request in the Vertex Pod. It listens to port 8443, with request URI /vertices/{vertexName} . 
A Pipeline with HTTP Source: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : {} - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out","title":"HTTP Source"},{"location":"user-guide/sources/http/#sending-data","text":"Data could be sent to an HTTP source through: ClusterIP Service (within the cluster) Ingress or LoadBalancer Service (outside of the cluster) Port-forward (for testing)","title":"Sending Data"},{"location":"user-guide/sources/http/#clusterip-service","text":"An HTTP Source Vertex can generate a ClusterIP Service if service: true is specified, the service name is in the format of {pipelineName}-{vertexName} , so the HTTP Source can be accessed through https://{pipelineName}-{vertexName}.{namespace}.svc:8443/vertices/{vertexName} within the cluster. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : service : true","title":"ClusterIP Service"},{"location":"user-guide/sources/http/#loadbalancer-service-or-ingress","text":"To create a LoadBalancer type Service, or a NodePort one for Ingress, you need to do it on your own. Just remember to use a selector like the following in the Service: numaflow.numaproj.io/pipeline-name : http-pipeline # pipeline name numaflow.numaproj.io/vertex-name : in # vertex name","title":"LoadBalancer Service or Ingress"},{"location":"user-guide/sources/http/#port-forwarding","text":"To test an HTTP source, you can do it from your local machine through port-forwarding. kubectl port-forward pod ${ pod -name } 8443 curl -kq -X POST -d \"hello world\" https://localhost:8443/vertices/in","title":"Port-forwarding"},{"location":"user-guide/sources/http/#x-numaflow-id","text":"When posting data to the HTTP Source, an optional HTTP header x-numaflow-id can be specified, which will be used to dedup. 
If it's not provided, the HTTP Source will generate a random UUID to do it. curl -kq -X POST -H \"x-numaflow-id: ${ id } \" -d \"hello world\" ${ http -source-url }","title":"x-numaflow-id"},{"location":"user-guide/sources/http/#x-numaflow-event-time","text":"By default, the time the data arrives at the HTTP source is used as the event time; it can be overridden by setting the HTTP header x-numaflow-event-time to the number of milliseconds elapsed since January 1, 1970 UTC. curl -kq -X POST -H \"x-numaflow-event-time: 1663006726000\" -d \"hello world\" ${ http -source-url }","title":"x-numaflow-event-time"},{"location":"user-guide/sources/http/#auth","text":"A Bearer token can be configured to prevent the HTTP Source from being accessed by unexpected clients. To do so, a Kubernetes Secret needs to be created to store the token, and the valid clients also need to include the token in their HTTP request headers. Firstly, create a k8s secret containing your token. echo -n 'tr3qhs321fjglwf1e2e67dfda4tr' > ./token.txt kubectl create secret generic http-source-token --from-file = my-token = ./token.txt Then add auth to the Source Vertex: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : auth : token : name : http-source-token key : my-token When the clients post data to the Source Vertex, add Authorization: Bearer tr3qhs321fjglwf1e2e67dfda4tr to the header, for example: TOKEN = \"Bearer tr3qhs321fjglwf1e2e67dfda4tr\" # Post data from a Pod in the same namespace of the cluster curl -kq -X POST -H \"Authorization: $TOKEN \" -d \"hello world\" https://http-pipeline-in:8443/vertices/in","title":"Auth"},{"location":"user-guide/sources/http/#health-check","text":"The HTTP Source also has an endpoint /health created automatically, which is useful for LoadBalancer or Ingress configuration, where a health check endpoint is often required by the cloud provider.","title":"Health 
Check"},{"location":"user-guide/sources/kafka/","text":"Kafka Source \u00b6 A Kafka source is used to ingest the messages from a Kafka topic. Numaflow uses consumer-groups to manage offsets. spec : vertices : - name : input source : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic consumerGroup : my-consumer-group config : | # Optional. consumer: offsets: initial: -2 # -2 for sarama.OffsetOldest, -1 for sarama.OffsetNewest. Default to sarama.OffsetNewest. tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the keytab (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a 
secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true FAQ \u00b6 How to start the Kafka Source from a specific offset based on datetime? \u00b6 In order to start the Kafka Source from a specific offset based on datetime, we need to reset the offset before we start the pipeline. For example, we have a topic quickstart-events with 3 partitions and a consumer group console-consumer-94457 . This example uses Kafka 3.6.1 and localhost. 
\u279c kafka_2.13-3.6.1 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic quickstart-events Topic: quickstart-events TopicId: WqIN6j7hTQqGZUQWdF7AdA PartitionCount: 3 ReplicationFactor: 1 Configs: Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 1 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 2 Leader: 0 Replicas: 0 Isr: 0 \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list --all-groups console-consumer-94457 We have already consumed all the available messages in the topic quickstart-events , but we want to go back to some datetime and re-consume the data from that datetime. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-94457 GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID console-consumer-94457 quickstart-events 0 56 56 0 - - - console-consumer-94457 quickstart-events 1 38 38 0 - - - console-consumer-94457 quickstart-events 2 4 4 0 - - - To achieve that, before the pipeline start, we need to first stop the consumers in the consumer group console-consumer-94457 because offsets can only be reset if the group console-consumer-94457 is inactive. Then, reset the offsets using the desired date and time. The example command below uses UTC time. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group console-consumer-94457 --topic quickstart-events --to-datetime 2024 -01-19T19:26:00.000 GROUP TOPIC PARTITION NEW-OFFSET console-consumer-94457 quickstart-events 0 54 console-consumer-94457 quickstart-events 1 26 console-consumer-94457 quickstart-events 2 0 Now, we can start the pipeline, and the Kafka source will start consuming the topic quickstart-events with consumer group console-consumer-94457 from the NEW-OFFSET . 
You may need to create a property file which contains the connectivity details and use it to connect to the clusters. Below are two example config.properties files: SASL/PLAIN and TLS . ssl.endpoint.identification.algorithm=https sasl.mechanism=PLAIN request.timeout.ms=20000 bootstrap.servers= retry.backoff.ms=500 sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \\ username=\"\" \\ password=\"\"; security.protocol=SASL_SSL request.timeout.ms=20000 bootstrap.servers= security.protocol=SSL ssl.enabled.protocols=TLSv1.2 ssl.truststore.location= ssl.truststore.password= Run the command with the --command-config option. bin/kafka-consumer-groups.sh --bootstrap-server --command-config config.properties --execute --reset-offsets --group --topic --to-datetime Reference: - How to Use Kafka Tools With Confluent Cloud - Apache Kafka Security","title":"Kafka Source"},{"location":"user-guide/sources/kafka/#kafka-source","text":"A Kafka source is used to ingest the messages from a Kafka topic. Numaflow uses consumer-groups to manage offsets. spec : vertices : - name : input source : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic consumerGroup : my-consumer-group config : | # Optional. consumer: offsets: initial: -2 # -2 for sarama.OffsetOldest, -1 for sarama.OffsetNewest. Default to sarama.OffsetNewest. tls : # Optional. insecureSkipVerify : # Optional, whether to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. 
name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the keytab (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret 
reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true","title":"Kafka Source"},{"location":"user-guide/sources/kafka/#faq","text":"","title":"FAQ"},{"location":"user-guide/sources/kafka/#how-to-start-the-kafka-source-from-a-specific-offset-based-on-datetime","text":"In order to start the Kafka Source from a specific offset based on datetime, we need to reset the offset before we start the pipeline. For example, we have a topic quickstart-events with 3 partitions and a consumer group console-consumer-94457 . This example uses Kafka 3.6.1 and localhost. \u279c kafka_2.13-3.6.1 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic quickstart-events Topic: quickstart-events TopicId: WqIN6j7hTQqGZUQWdF7AdA PartitionCount: 3 ReplicationFactor: 1 Configs: Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 1 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 2 Leader: 0 Replicas: 0 Isr: 0 \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list --all-groups console-consumer-94457 We have already consumed all the available messages in the topic quickstart-events , but we want to go back to some datetime and re-consume the data from that datetime. 
\u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-94457 GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID console-consumer-94457 quickstart-events 0 56 56 0 - - - console-consumer-94457 quickstart-events 1 38 38 0 - - - console-consumer-94457 quickstart-events 2 4 4 0 - - - To achieve that, before the pipeline starts, we need to first stop the consumers in the consumer group console-consumer-94457 because offsets can only be reset if the group console-consumer-94457 is inactive. Then, reset the offsets using the desired date and time. The example command below uses UTC time. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group console-consumer-94457 --topic quickstart-events --to-datetime 2024-01-19T19:26:00.000 GROUP TOPIC PARTITION NEW-OFFSET console-consumer-94457 quickstart-events 0 54 console-consumer-94457 quickstart-events 1 26 console-consumer-94457 quickstart-events 2 0 Now, we can start the pipeline, and the Kafka source will start consuming the topic quickstart-events with consumer group console-consumer-94457 from the NEW-OFFSET . You may need to create a property file which contains the connectivity details and use it to connect to the clusters. Below are two example config.properties files: SASL/PLAIN and TLS . ssl.endpoint.identification.algorithm=https sasl.mechanism=PLAIN request.timeout.ms=20000 bootstrap.servers= retry.backoff.ms=500 sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \\ username=\"\" \\ password=\"\"; security.protocol=SASL_SSL request.timeout.ms=20000 bootstrap.servers= security.protocol=SSL ssl.enabled.protocols=TLSv1.2 ssl.truststore.location= ssl.truststore.password= Run the command with the --command-config option. 
bin/kafka-consumer-groups.sh --bootstrap-server --command-config config.properties --execute --reset-offsets --group --topic --to-datetime Reference: - How to Use Kafka Tools With Confluent Cloud - Apache Kafka Security","title":"How to start the Kafka Source from a specific offset based on datetime?"},{"location":"user-guide/sources/nats/","text":"Nats Source \u00b6 A Nats source is used to ingest the messages from a nats subject. spec : vertices : - name : input source : nats : url : nats://demo.nats.io # Multiple urls separated by comma. subject : my-subject queue : my-queue # Queue subscription, see https://docs.nats.io/using-nats/developer/receiving/queues tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key auth : # Optional. basic : # Optional, pointing to the secret references which contain user name and password. user : name : my-secret key : my-user password : name : my-secret key : my-password Auth \u00b6 The auth strategies supported in nats source include basic (user and password), token and nkey , check the API for the details.","title":"Nats Source"},{"location":"user-guide/sources/nats/#nats-source","text":"A Nats source is used to ingest the messages from a nats subject. spec : vertices : - name : input source : nats : url : nats://demo.nats.io # Multiple urls separated by comma. subject : my-subject queue : my-queue # Queue subscription, see https://docs.nats.io/using-nats/developer/receiving/queues tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. 
caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key auth : # Optional. basic : # Optional, pointing to the secret references which contain user name and password. user : name : my-secret key : my-user password : name : my-secret key : my-password","title":"Nats Source"},{"location":"user-guide/sources/nats/#auth","text":"The auth strategies supported in nats source include basic (user and password), token and nkey , check the API for the details.","title":"Auth"},{"location":"user-guide/sources/overview/","text":"Sources \u00b6 Source vertex is responsible for reliable reading data from an unbounded source into Numaflow. Source vertex may require transformation or formatting of data prior to sending it to the output buffers. Source Vertex also does Watermark tracking and late data detection. In Numaflow, we currently support the following sources Kafka HTTP Ticker Nats User-defined Source A user-defined source is a custom source that a user can write using Numaflow SDK when the user needs to read data from a system that is not supported by the platform's built-in sources. User-defined source also supports custom acknowledge management for exactly-once reading.","title":"Overview"},{"location":"user-guide/sources/overview/#sources","text":"Source vertex is responsible for reliable reading data from an unbounded source into Numaflow. Source vertex may require transformation or formatting of data prior to sending it to the output buffers. Source Vertex also does Watermark tracking and late data detection. 
In Numaflow, we currently support the following sources Kafka HTTP Ticker Nats User-defined Source A user-defined source is a custom source that a user can write using Numaflow SDK when the user needs to read data from a system that is not supported by the platform's built-in sources. User-defined source also supports custom acknowledge management for exactly-once reading.","title":"Sources"},{"location":"user-guide/sources/user-defined-sources/","text":"User-defined Sources \u00b6 A Pipeline may have multiple Sources, those sources could either be a pre-defined source such as kafka , http , etc., or a user-defined source . With no source data transformer, A pre-defined source vertex runs single-container pods; a user-defined source runs two-container pods. Build Your Own User-defined Sources \u00b6 You can build your own user-defined sources in multiple languages. Check the links below to see the examples for different languages. Golang Java Python After building a docker image for the written user-defined source, specify the image as below in the vertex spec. spec : vertices : - name : input source : udsource : container : image : my-source:latest Available Environment Variables \u00b6 Some environment variables are available in the user-defined source container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. User-defined sources contributed from the open source community \u00b6 If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sources like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined source.","title":"User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#user-defined-sources","text":"A Pipeline may have multiple Sources, those sources could either be a pre-defined source such as kafka , http , etc., or a user-defined source . With no source data transformer, A pre-defined source vertex runs single-container pods; a user-defined source runs two-container pods.","title":"User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#build-your-own-user-defined-sources","text":"You can build your own user-defined sources in multiple languages. Check the links below to see the examples for different languages. Golang Java Python After building a docker image for the written user-defined source, specify the image as below in the vertex spec. spec : vertices : - name : input source : udsource : container : image : my-source:latest","title":"Build Your Own User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#available-environment-variables","text":"Some environment variables are available in the user-defined source container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sources/user-defined-sources/#user-defined-sources-contributed-from-the-open-source-community","text":"If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sources like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined source.","title":"User-defined sources contributed from the open source community"},{"location":"user-guide/sources/transformer/overview/","text":"Source Data Transformer \u00b6 The Source Data Transformer is a feature that allows users to execute custom code to transform their data at source. This functionality offers two primary advantages to users: Event Time Assignment - It enables users to extract the event time from the message payload, providing a more precise and accurate event time than the default mechanisms like LOG_APPEND_TIME of Kafka for Kafka source, custom HTTP header for HTTP source, and others. Early data processing - It pre-processes the data, or filters out unwanted data at source vertex, saving the cost of creating another UDF vertex and an inter-step buffer. Source Data Transformer runs as a sidecar container in a Source Vertex Pod. Data processing in the transformer is supposed to be idempotent. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket. Built-in Transformers \u00b6 There are some Built-in Transformers that can be used directly. Build Your Own Transformer \u00b6 You can build your own transformer in multiple languages. A user-defined transformer could be as simple as the example below in Golang. In the example, the transformer extracts event times from timestamp of the JSON payload and assigns them to messages as new event times. It also filters out unwanted messages based on filterOut of the payload. package main import ( \"context\" \"encoding/json\" \"log\" \"time\" \"github.com/numaproj/numaflow-go/pkg/sourcetransformer\" ) func transform ( _ context . Context , keys [] string , data sourcetransformer . Datum ) sourcetransformer . Messages { /* Input messages are in JSON format. 
Sample: {\"timestamp\": 1673239888, \"filterOut\": true}. Field \"timestamp\" shows the real event time of the message, in epoch seconds. Field \"filterOut\" indicates whether the message should be filtered out, as a boolean. */ var jsonObject map [ string ] interface {} json . Unmarshal ( data . Value (), & jsonObject ) // event time assignment eventTime := data . EventTime () // if timestamp field exists, extract event time from payload. if ts , ok := jsonObject [ \"timestamp\" ]; ok { eventTime = time . Unix ( int64 ( ts .( float64 )), 0 ) } // data filtering var filterOut bool if f , ok := jsonObject [ \"filterOut\" ]; ok { filterOut = f .( bool ) } if filterOut { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . MessageToDrop ( eventTime )) } else { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . NewMessage ( data . Value (), eventTime ). WithKeys ( keys )) } } func main () { err := sourcetransformer . NewServer ( sourcetransformer . SourceTransformFunc ( transform )). Start ( context . Background ()) if err != nil { log . Panic ( \"Failed to start source transform server: \" , err ) } } Check the links below to see another transformer example in various programming languages, where we apply conditional forwarding based on the input event time. Python Golang Java After building a docker image for the written transformer, specify the image as below in the source vertex spec. spec : vertices : - name : my-vertex source : http : {} transformer : container : image : my-python-transformer-example:latest Available Environment Variables \u00b6 Some environment variables are available in the source transformer container; they might be useful in your own source data transformer implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. 
Configuration \u00b6 Configuration data can be provided to the transformer container at runtime in multiple ways. environment variables args command volumes init containers","title":"Overview"},{"location":"user-guide/sources/transformer/overview/#source-data-transformer","text":"The Source Data Transformer is a feature that allows users to execute custom code to transform their data at source. This functionality offers two primary advantages to users: Event Time Assignment - It enables users to extract the event time from the message payload, providing a more precise and accurate event time than the default mechanisms like LOG_APPEND_TIME of Kafka for Kafka source, custom HTTP header for HTTP source, and others. Early data processing - It pre-processes the data, or filters out unwanted data at source vertex, saving the cost of creating another UDF vertex and an inter-step buffer. Source Data Transformer runs as a sidecar container in a Source Vertex Pod. Data processing in the transformer is supposed to be idempotent. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket.","title":"Source Data Transformer"},{"location":"user-guide/sources/transformer/overview/#built-in-transformers","text":"There are some Built-in Transformers that can be used directly.","title":"Built-in Transformers"},{"location":"user-guide/sources/transformer/overview/#build-your-own-transformer","text":"You can build your own transformer in multiple languages. A user-defined transformer could be as simple as the example below in Golang. In the example, the transformer extracts event times from timestamp of the JSON payload and assigns them to messages as new event times. It also filters out unwanted messages based on filterOut of the payload. package main import ( \"context\" \"encoding/json\" \"log\" \"time\" \"github.com/numaproj/numaflow-go/pkg/sourcetransformer\" ) func transform ( _ context . 
Context , keys [] string , data sourcetransformer . Datum ) sourcetransformer . Messages { /* Input messages are in JSON format. Sample: {\"timestamp\": 1673239888, \"filterOut\": true}. Field \"timestamp\" shows the real event time of the message, in epoch seconds. Field \"filterOut\" indicates whether the message should be filtered out, as a boolean. */ var jsonObject map [ string ] interface {} json . Unmarshal ( data . Value (), & jsonObject ) // event time assignment eventTime := data . EventTime () // if timestamp field exists, extract event time from payload. if ts , ok := jsonObject [ \"timestamp\" ]; ok { eventTime = time . Unix ( int64 ( ts .( float64 )), 0 ) } // data filtering var filterOut bool if f , ok := jsonObject [ \"filterOut\" ]; ok { filterOut = f .( bool ) } if filterOut { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . MessageToDrop ( eventTime )) } else { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . NewMessage ( data . Value (), eventTime ). WithKeys ( keys )) } } func main () { err := sourcetransformer . NewServer ( sourcetransformer . SourceTransformFunc ( transform )). Start ( context . Background ()) if err != nil { log . Panic ( \"Failed to start source transform server: \" , err ) } } Check the links below to see another transformer example in various programming languages, where we apply conditional forwarding based on the input event time. Python Golang Java After building a docker image for the written transformer, specify the image as below in the source vertex spec. 
spec : vertices : - name : my-vertex source : http : {} transformer : container : image : my-python-transformer-example:latest","title":"Build Your Own Transformer"},{"location":"user-guide/sources/transformer/overview/#available-environment-variables","text":"Some environment variables are available in the source transformer container; they might be useful in your own source data transformer implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sources/transformer/overview/#configuration","text":"Configuration data can be provided to the transformer container at runtime in multiple ways. environment variables args command volumes init containers","title":"Configuration"},{"location":"user-guide/sources/transformer/builtin-transformers/","text":"Built-in Functions \u00b6 Numaflow provides some built-in source data transformers that can be used directly. Filter A filter built-in transformer filters messages based on an expression. The payload keyword represents the message object. See the documentation for filter expressions here spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100 Event Time Extractor An eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on an expression and a user-specified format. The payload keyword represents the message object. See the documentation for event time extractor expressions here . If you want to handle event times in epoch format, you can find a helpful resource here . 
spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : json(payload).item[0].time format : 2006-01-02T15:04:05Z07:00 Time Extraction Filter A timeExtractionFilter implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and, if valid, extracts event time from the payload of the message. spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Overview"},{"location":"user-guide/sources/transformer/builtin-transformers/#built-in-functions","text":"Numaflow provides some built-in source data transformers that can be used directly. Filter A filter built-in transformer filters messages based on an expression. The payload keyword represents the message object. See the documentation for filter expressions here spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100 Event Time Extractor An eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on an expression and a user-specified format. The payload keyword represents the message object. See the documentation for event time extractor expressions here . If you want to handle event times in epoch format, you can find a helpful resource here . spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : json(payload).item[0].time format : 2006-01-02T15:04:05Z07:00 Time Extraction Filter A timeExtractionFilter implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and, if valid, extracts event time from the payload of the message. 
spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Built-in Functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/","text":"Event Time Extractor \u00b6 A eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on a user-provided expression and an optional format specification. expression is used to compile the payload to a string representation of the event time. format is used to convert the event time in string format to a time.Time object. Expression (required) \u00b6 Event Time Extractor expression is implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. payload keyword represents the message object. It will be the root element to represent the message object in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp Format (optional) \u00b6 Depending on whether a format is specified, Event Time Extractor uses different approaches to convert the event time string to a time.Time object. When specified \u00b6 When format is specified, the native time.Parse(layout, value string) library is used to make the conversion. In this process, the format parameter is passed as the layout input to the time.Parse() function, while the event time string is passed as the value parameter. 
When not specified \u00b6 When format is not specified, the extractor uses dateparse to parse the event time string without knowing the format in advance. How to specify format \u00b6 Please refer to golang format library . Error Scenarios \u00b6 When encountering parsing errors, event time extractor skips the extraction and passes on the message without modifying the original input message event time. Errors can occur for a variety of reasons, including: format is specified but the event time string can't be parsed with the specified format. format is not specified but dateparse can't convert the event time string to a time.Time object. Ambiguous event time strings \u00b6 Event time strings can be ambiguous when it comes to date format, such as MM/DD/YYYY versus DD/MM/YYYY. When using such a format, you're required to explicitly specify format , to avoid confusion. If no format is provided, event time extractor treats ambiguous event time strings as an error scenario. Epoch format \u00b6 If the event time string in your message payload is in epoch format, you can skip specifying a format . You can rely on dateparse to recognize a wide range of epoch timestamp formats, including Unix seconds, milliseconds, microseconds, and nanoseconds. Event Time Extractor Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : sprig.trim(string(json(payload).timestamp)) format : 2006-01-02T15:04:05Z07:00","title":"Event Time Extractor"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#event-time-extractor","text":"An eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on a user-provided expression and an optional format specification. expression is used to compile the payload to a string representation of the event time. 
format is used to convert the event time in string format to a time.Time object.","title":"Event Time Extractor"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#expression-required","text":"The Event Time Extractor expression is implemented with the expr and sprig libraries.","title":"Expression (required)"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#data-conversion-functions","text":"These functions can be accessed directly in the expression. The payload keyword represents the message object. It is the root element that represents the message object in the expression. json - Convert payload into a JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#sprig-functions","text":"The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#format-optional","text":"Depending on whether a format is specified, Event Time Extractor uses different approaches to convert the event time string to a time.Time object.","title":"Format (optional)"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#when-specified","text":"When format is specified, Go's native time.Parse(layout, value string) function is used to make the conversion. 
In this process, the format parameter is passed as the layout input to the time.Parse() function, while the event time string is passed as the value parameter.","title":"When specified"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#when-not-specified","text":"When format is not specified, the extractor uses dateparse to parse the event time string without knowing the format in advance.","title":"When not specified"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#how-to-specify-format","text":"Please refer to golang format library .","title":"How to specify format"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#error-scenarios","text":"When encountering parsing errors, event time extractor skips the extraction and passes on the message without modifying the original input message event time. Errors can occur for a variety of reasons, including: format is specified but the event time string can't be parsed with the specified format. format is not specified but dateparse can't convert the event time string to a time.Time object.","title":"Error Scenarios"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#ambiguous-event-time-strings","text":"Event time strings can be ambiguous when it comes to date format, such as MM/DD/YYYY versus DD/MM/YYYY. When using such a format, you're required to explicitly specify format , to avoid confusion. If no format is provided, event time extractor treats ambiguous event time strings as an error scenario.","title":"Ambiguous event time strings"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#epoch-format","text":"If the event time string in your message payload is in epoch format, you can skip specifying a format . 
You can rely on dateparse to recognize a wide range of epoch timestamp formats, including Unix seconds, milliseconds, microseconds, and nanoseconds.","title":"Epoch format"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#event-time-extractor-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : sprig.trim(string(json(payload).timestamp)) format : 2006-01-02T15:04:05Z07:00","title":"Event Time Extractor Spec"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/","text":"Filter \u00b6 A filter is a special-purpose built-in function. It is used to evaluate on each message in a pipeline and is often used to filter the number of messages that are passed to next vertices. Filter function supports comprehensive expression language which extends flexibility write complex expressions. payload will be root element to represent the message object in expression. Expression \u00b6 Filter expression implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100 Filter Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#filter","text":"A filter is a special-purpose built-in function. 
It is evaluated on each message in a pipeline and is often used to reduce the number of messages that are passed to the next vertices. The filter function supports a comprehensive expression language, which gives you the flexibility to write complex expressions. payload is the root element that represents the message object in the expression.","title":"Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#expression","text":"The filter expression is implemented with the expr and sprig libraries.","title":"Expression"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#data-conversion-functions","text":"These functions can be accessed directly in the expression. json - Convert payload into a JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#sprig-functions","text":"The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#filter-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter Spec"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/","text":"Time Extraction Filter \u00b6 A timeExtractionFilter built-in transformer implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and, if valid, extracts event time from the payload of the message. filterExpr is used to evaluate and drop invalid messages. 
eventTimeExpr is used to compile the payload to a string representation of the event time. eventTimeFormat is used to convert the event time in string format to a time.Time object. Expression (required) \u00b6 The expressions for the filter and event time extractor are implemented with the expr and sprig libraries. Data conversion functions \u00b6 These functions can be accessed directly in the expression. The payload keyword represents the message object. It is the root element that represents the message object in the expression. json - Convert payload into a JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp Format (optional) \u00b6 Depending on whether a format is specified, the Event Time Extractor uses different approaches to convert the event time string to a time.Time object. Time Extraction Filter Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Event Time Extraction Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#time-extraction-filter","text":"A timeExtractionFilter built-in transformer implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and, if valid, extracts event time from the payload of the message. filterExpr is used to evaluate and drop invalid messages. eventTimeExpr is used to compile the payload to a string representation of the event time. 
eventTimeFormat is used to convert the event time in string format to a time.Time object.","title":"Time Extraction Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#expression-required","text":"The expressions for the filter and event time extractor are implemented with the expr and sprig libraries.","title":"Expression (required)"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#data-conversion-functions","text":"These functions can be accessed directly in the expression. The payload keyword represents the message object. It is the root element that represents the message object in the expression. json - Convert payload into a JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#sprig-functions","text":"The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. 
sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#format-optional","text":"Depending on whether a format is specified, the Event Time Extractor uses different approaches to convert the event time string to a time.Time object.","title":"Format (optional)"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#time-extraction-filter-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Time Extraction Filter Spec"},{"location":"user-guide/use-cases/monitoring-and-observability/","text":"Monitoring and Observability \u00b6 Docs \u00b6 How Intuit platform engineers use Numaflow to compute golden signals . Videos \u00b6 Numaflow as the stream-processing solution in Intuit\u2019s Customer Centric Observability Journey Using AIOps Using Numaflow for fast incident detection: Argo CD Observability with AIOps - Detect Incident Fast Implementing anomaly detection with Numaflow: Cluster Golden Signals to Avoid Alert Fatigue at Scale Appendix: What is Monitoring and Observability? \u00b6 Monitoring and observability are two critical concepts in software engineering that help developers ensure the health and performance of their applications. Monitoring refers to the process of collecting and analyzing data about an application's performance. This data can include metrics such as CPU usage, memory usage, network traffic, and response times. Monitoring tools allow developers to track these metrics over time and set alerts when certain thresholds are exceeded. This enables them to quickly identify and respond to issues before they become critical. 
Observability, on the other hand, is a more holistic approach to monitoring that focuses on understanding the internal workings of an application. Observability tools provide developers with deep insights into the behavior of their applications, allowing them to understand how different components interact with each other and how changes in one area can affect the overall system. This includes collecting data on things like logs, traces, and events, which can be used to reconstruct the state of the system at any given point in time. Together, monitoring and observability provide developers with a comprehensive view of their applications' performance, enabling them to quickly identify and respond to issues as they arise. By leveraging these tools, software engineers can ensure that their applications are running smoothly and efficiently, delivering the best possible experience to their users.","title":"Monitoring and Observability"},{"location":"user-guide/use-cases/monitoring-and-observability/#monitoring-and-observability","text":"","title":"Monitoring and Observability"},{"location":"user-guide/use-cases/monitoring-and-observability/#docs","text":"How Intuit platform engineers use Numaflow to compute golden signals .","title":"Docs"},{"location":"user-guide/use-cases/monitoring-and-observability/#videos","text":"Numaflow as the stream-processing solution in Intuit\u2019s Customer Centric Observability Journey Using AIOps Using Numaflow for fast incident detection: Argo CD Observability with AIOps - Detect Incident Fast Implementing anomaly detection with Numaflow: Cluster Golden Signals to Avoid Alert Fatigue at Scale","title":"Videos"},{"location":"user-guide/use-cases/monitoring-and-observability/#appendix-what-is-monitoring-and-observability","text":"Monitoring and observability are two critical concepts in software engineering that help developers ensure the health and performance of their applications. 
Monitoring refers to the process of collecting and analyzing data about an application's performance. This data can include metrics such as CPU usage, memory usage, network traffic, and response times. Monitoring tools allow developers to track these metrics over time and set alerts when certain thresholds are exceeded. This enables them to quickly identify and respond to issues before they become critical. Observability, on the other hand, is a more holistic approach to monitoring that focuses on understanding the internal workings of an application. Observability tools provide developers with deep insights into the behavior of their applications, allowing them to understand how different components interact with each other and how changes in one area can affect the overall system. This includes collecting data on things like logs, traces, and events, which can be used to reconstruct the state of the system at any given point in time. Together, monitoring and observability provide developers with a comprehensive view of their applications' performance, enabling them to quickly identify and respond to issues as they arise. By leveraging these tools, software engineers can ensure that their applications are running smoothly and efficiently, delivering the best possible experience to their users.","title":"Appendix: What is Monitoring and Observability?"},{"location":"user-guide/use-cases/overview/","text":"Overview \u00b6 Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. In this section, you'll find sample use cases for Numaflow and learn how to leverage its features for your stream processing tasks. Real-time data analytics applications. Event-driven applications: anomaly detection and monitoring . Streaming applications: data instrumentation and movement. 
Any workflows running in a streaming manner. Numaflow is still a relatively new tool, and there are likely many other use cases that we haven't yet explored. We're committed to keeping this page up-to-date with the latest use cases and best practices for using Numaflow. We welcome contributions from the community and encourage you to share your own use cases and experiences with us. As we continue to develop and improve Numaflow, we look forward to seeing the cool things you build with it!","title":"Overview"},{"location":"user-guide/use-cases/overview/#overview","text":"Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. In this section, you'll find sample use cases for Numaflow and learn how to leverage its features for your stream processing tasks. Real-time data analytics applications. Event-driven applications: anomaly detection and monitoring . Streaming applications: data instrumentation and movement. Any workflows running in a streaming manner. Numaflow is still a relatively new tool, and there are likely many other use cases that we haven't yet explored. We're committed to keeping this page up-to-date with the latest use cases and best practices for using Numaflow. We welcome contributions from the community and encourage you to share your own use cases and experiences with us. As we continue to develop and improve Numaflow, we look forward to seeing the cool things you build with it!","title":"Overview"},{"location":"user-guide/user-defined-functions/user-defined-functions/","text":"A Pipeline consists of multiple vertices, Source , Sink and UDF(user-defined functions) . User-defined functions (UDF) is the vertex where users can run custom code to transform the data. Data processing in the UDF is supposed to be idempotent. 
UDF runs as a sidecar container in a Vertex Pod, processes the received data. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket. There are two kinds of processing users can run Map Reduce","title":"Overview"},{"location":"user-guide/user-defined-functions/map/examples/","text":"Map Examples \u00b6 Please read map to get the best out of these examples. Prerequisites \u00b6 Inter-Step Buffer Service (ISB Service) \u00b6 What is ISB Service? \u00b6 An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB. How to install the ISB Service \u00b6 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url } Creating a Simple Map Pipeline \u00b6 Now we will walk you through creating a map pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of a map pipeline. 
An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. 
# Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for a pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Examples"},{"location":"user-guide/user-defined-functions/map/examples/#map-examples","text":"Please read map to get the best out of these examples.","title":"Map Examples"},{"location":"user-guide/user-defined-functions/map/examples/#prerequisites","text":"","title":"Prerequisites"},{"location":"user-guide/user-defined-functions/map/examples/#inter-step-buffer-service-isb-service","text":"","title":"Inter-Step Buffer Service (ISB Service)"},{"location":"user-guide/user-defined-functions/map/examples/#what-is-isb-service","text":"An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. 
Please refer to the doc Inter-Step Buffer Service for more information on ISB.","title":"What is ISB Service?"},{"location":"user-guide/user-defined-functions/map/examples/#how-to-install-the-isb-service","text":"kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url }","title":"How to install the ISB Service"},{"location":"user-guide/user-defined-functions/map/examples/#creating-a-simple-map-pipeline","text":"Now we will walk you through creating a map pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of a map pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . 
Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for a pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. 
The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Creating a Simple Map Pipeline"},{"location":"user-guide/user-defined-functions/map/map/","text":"Map UDF \u00b6 Map in a Map vertex takes an input and returns 0, 1, or more outputs (also known as flat-map operation). Map is an element wise operator. Builtin UDF \u00b6 There are some Built-in Functions that can be used directly. Build Your Own UDF \u00b6 You can build your own UDF in multiple languages. Check the links below to see the UDF examples for different languages. Python Golang Java After building a docker image for the written UDF, specify the image as below in the vertex spec. spec : vertices : - name : my-vertex udf : container : image : my-python-udf-example:latest Streaming Mode \u00b6 In cases the map function generates more than one output (e.g., flat map), the UDF can be configured to run in a streaming mode instead of batching, which is the default mode. In streaming mode, the messages will be pushed to the downstream vertices once generated instead of in a batch at the end. Note that to maintain data orderliness, we restrict the read batch size to be 1 . spec : vertices : - name : my-vertex limits : # mapstreaming won't work if readBatchSize is != 1 readBatchSize : 1 Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java Available Environment Variables \u00b6 Some environment variables are available in the user-defined function container, they might be useful in your own UDF implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. Configuration \u00b6 Configuration data can be provided to the UDF container at runtime multiple ways. 
environment variables args command volumes init containers","title":"Overview"},{"location":"user-guide/user-defined-functions/map/map/#map-udf","text":"Map in a Map vertex takes an input and returns 0, 1, or more outputs (also known as a flat-map operation). Map is an element-wise operator.","title":"Map UDF"},{"location":"user-guide/user-defined-functions/map/map/#builtin-udf","text":"There are some Built-in Functions that can be used directly.","title":"Builtin UDF"},{"location":"user-guide/user-defined-functions/map/map/#build-your-own-udf","text":"You can build your own UDF in multiple languages. Check the links below to see the UDF examples for different languages. Python Golang Java After building a docker image for the written UDF, specify the image as below in the vertex spec. spec : vertices : - name : my-vertex udf : container : image : my-python-udf-example:latest","title":"Build Your Own UDF"},{"location":"user-guide/user-defined-functions/map/map/#streaming-mode","text":"In cases where the map function generates more than one output (e.g., flat map), the UDF can be configured to run in streaming mode instead of batching, which is the default mode. In streaming mode, the messages will be pushed to the downstream vertices as soon as they are generated instead of in a batch at the end. Note that to preserve data ordering, the read batch size must be 1 . spec : vertices : - name : my-vertex limits : # mapstreaming won't work if readBatchSize is != 1 readBatchSize : 1 Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/map/map/#available-environment-variables","text":"Some environment variables are available in the user-defined function container; they might be useful in your own UDF implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. 
NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/user-defined-functions/map/map/#configuration","text":"Configuration data can be provided to the UDF container at runtime in multiple ways. environment variables args command volumes init containers","title":"Configuration"},{"location":"user-guide/user-defined-functions/map/builtin-functions/","text":"Built-in Functions \u00b6 Numaflow provides some built-in functions that can be used directly. Cat A cat built-in UDF simply returns the messages it receives unchanged, which makes it useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat Filter A filter built-in UDF filters messages based on an expression. The payload keyword represents the message object. See the documentation for expressions here spec : vertices : - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(object(payload).id) > 100","title":"Overview"},{"location":"user-guide/user-defined-functions/map/builtin-functions/#built-in-functions","text":"Numaflow provides some built-in functions that can be used directly. Cat A cat built-in UDF simply returns the messages it receives unchanged, which makes it useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat Filter A filter built-in UDF filters messages based on an expression. The payload keyword represents the message object. See the documentation for expressions here spec : vertices : - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(object(payload).id) > 100","title":"Built-in Functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/cat/","text":"Cat \u00b6 A cat built-in function simply returns the messages it receives unchanged, which makes it useful for debugging and testing. 
spec : vertices : - name : cat-vertex udf : builtin : name : cat","title":"Cat"},{"location":"user-guide/user-defined-functions/map/builtin-functions/cat/#cat","text":"A cat built-in function simply returns the messages it receives unchanged, which makes it useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat","title":"Cat"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/","text":"Filter \u00b6 A filter is a special-purpose built-in function. It evaluates an expression against each message in a pipeline and is often used to reduce the number of messages that are passed to the next vertices. The filter function supports a comprehensive expression language, which gives you the flexibility to write complex expressions. payload is the root element that represents the message object in an expression. Expression \u00b6 Filter expressions are implemented with the expr and sprig libraries. Data conversion functions \u00b6 These functions can be accessed directly in an expression. json - Convert the payload into a JSON object, e.g., json(payload) int - Convert an element/payload into an int value, e.g., int(json(payload).id) string - Convert an element/payload into a string value, e.g., string(json(payload).amount) Sprig functions \u00b6 The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. sprig functions E.g.: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100 Filter Spec \u00b6 - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#filter","text":"A filter is a special-purpose built-in function. It evaluates an expression against each message in a pipeline and is often used to reduce the number of messages that are passed to the next vertices. 
The filter function supports a comprehensive expression language, which gives you the flexibility to write complex expressions. payload is the root element that represents the message object in an expression.","title":"Filter"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#expression","text":"Filter expressions are implemented with the expr and sprig libraries.","title":"Expression"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#data-conversion-functions","text":"These functions can be accessed directly in an expression. json - Convert the payload into a JSON object, e.g., json(payload) int - Convert an element/payload into an int value, e.g., int(json(payload).id) string - Convert an element/payload into a string value, e.g., string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#sprig-functions","text":"The Sprig library has 70+ functions. The sprig prefix needs to be added to access the sprig functions. sprig functions E.g.: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100","title":"Sprig functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#filter-spec","text":"- name : filter-vertex udf : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter Spec"},{"location":"user-guide/user-defined-functions/reduce/examples/","text":"Reduce Examples \u00b6 Please read reduce to get the best out of these examples. Prerequisites \u00b6 Inter-Step Buffer Service (ISB Service) \u00b6 What is ISB Service? \u00b6 An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB. 
How to install the ISB Service \u00b6 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url } Sum Pipeline Using Fixed Window \u00b6 This is a simple reduce pipeline that just does summation (sum of numbers) but uses fixed window. The snippet for the reduce vertex is as follows. - name : compute-sum udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true 6-reduce-fixed-window.yaml has the complete pipeline definition. In this example we use a partitions of 2 . We are setting a partitions > 1 because it is a keyed window. 
- name : compute-sum partitions : 2 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/6-reduce-fixed-window.yaml Output : 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 120000 End - 180000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 120000 End - 180000 2023/01/05 11:54:42 (sink) Payload - 600 Key - even Start - 180000 End - 240000 2023/01/05 11:54:42 (sink) Payload - 300 Key - odd Start - 180000 End - 240000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we have considered a fixed window of length 60s, and also we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000. So the output will be 300(5 * 60) and 600(10 * 60). If we had used a non keyed window ( keyed: false ), we would have seen one single output with value of 900(300 of odd + 600 of even) for each window. Sum Pipeline Using Sliding Window \u00b6 This is a simple reduce pipeline that just does summation (sum of numbers) but uses sliding window. The snippet for the reduce vertex is as follows. 
- name : reduce-sliding udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : sliding : length : 60s slide : 10s keyed : true 7-reduce-sliding-window.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/7-reduce-sliding-window.yaml Output: 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 80000 End - 140000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 80000 End - 140000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we use a sliding window of length 60s with a slide of 10s, and we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000, and because the slide duration is 10s, the next set of windows will be created with a start time of 70000 and an end time of 130000. Since it's a sum operation, the output will be 300(5 * 60) and 600(10 * 60). 
For an output line such as Payload - 50 Key - odd Start - 10000 End - 70000 , we see 50 for odd because the first window has only 10 elements. Complex Reduce Pipeline \u00b6 In the complex reduce example, we will: chain reduce functions; use both fixed and sliding windows; use keyed and non-keyed windowing. 8-reduce-complex-pipeline.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/8-reduce-complex-pipeline.yaml Output: 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 80000 End - 140000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 90000 End - 150000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 100000 End - 160000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 110000 End - 170000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 120000 End - 180000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 130000 End - 190000 In our example, first we have the reduce vertex with a fixed window of duration 5s. Since the input is 5 and 10, the output from the first reduce vertex will be 25 (5 * 5) and 50 (5 * 10). This will be passed to the next non-keyed reduce vertex with a fixed window duration of 10s. Because this vertex is non-keyed, it will combine the inputs and produce the output of 150(25 * 2 + 50 * 2), which will be passed to the reduce vertex with a sliding window of duration 60s and a slide duration of 10s. 
Hence the final output will be 900(150 * 6).","title":"Examples"},{"location":"user-guide/user-defined-functions/reduce/examples/#reduce-examples","text":"Please read reduce to get the best out of these examples.","title":"Reduce Examples"},{"location":"user-guide/user-defined-functions/reduce/examples/#prerequisites","text":"","title":"Prerequisites"},{"location":"user-guide/user-defined-functions/reduce/examples/#inter-step-buffer-service-isb-service","text":"","title":"Inter-Step Buffer Service (ISB Service)"},{"location":"user-guide/user-defined-functions/reduce/examples/#what-is-isb-service","text":"An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB.","title":"What is ISB Service?"},{"location":"user-guide/user-defined-functions/reduce/examples/#how-to-install-the-isb-service","text":"kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. 
An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url }","title":"How to install the ISB Service"},{"location":"user-guide/user-defined-functions/reduce/examples/#sum-pipeline-using-fixed-window","text":"This is a simple reduce pipeline that just does summation (sum of numbers) but uses fixed window. The snippet for the reduce vertex is as follows. - name : compute-sum udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true 6-reduce-fixed-window.yaml has the complete pipeline definition. In this example we use a partitions of 2 . We are setting a partitions > 1 because it is a keyed window. - name : compute-sum partitions : 2 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/6-reduce-fixed-window.yaml Output : 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 120000 End - 180000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 120000 End - 180000 2023/01/05 11:54:42 (sink) Payload - 600 Key - even Start - 180000 End - 240000 2023/01/05 11:54:42 (sink) Payload - 300 Key - odd Start - 180000 End - 240000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we have considered a fixed window of length 60s, and also we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000. So the output will be 300(5 * 60) and 600(10 * 60). 
If we had used a non-keyed window ( keyed: false ), we would have seen a single output with a value of 900(300 of odd + 600 of even) for each window.","title":"Sum Pipeline Using Fixed Window"},{"location":"user-guide/user-defined-functions/reduce/examples/#sum-pipeline-using-sliding-window","text":"This is a simple reduce pipeline that just does summation (sum of numbers) but uses a sliding window. The snippet for the reduce vertex is as follows. - name : reduce-sliding udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : sliding : length : 60s slide : 10s keyed : true 7-reduce-sliding-window.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/7-reduce-sliding-window.yaml Output: 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 80000 End - 140000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 80000 End - 140000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we use a sliding window of length 60s with a slide of 10s, and we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000, and because the slide duration is 10s, the next set of windows will be created with a start time of 70000 and an end time of 130000. Since it's a sum operation, the output will be 300(5 * 60) and 600(10 * 60). 
For an output line such as Payload - 50 Key - odd Start - 10000 End - 70000 , we see 50 for odd because the first window has only 10 elements.","title":"Sum Pipeline Using Sliding Window"},{"location":"user-guide/user-defined-functions/reduce/examples/#complex-reduce-pipeline","text":"In the complex reduce example, we will: chain reduce functions; use both fixed and sliding windows; use keyed and non-keyed windowing. 8-reduce-complex-pipeline.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/8-reduce-complex-pipeline.yaml Output: 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 80000 End - 140000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 90000 End - 150000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 100000 End - 160000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 110000 End - 170000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 120000 End - 180000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 130000 End - 190000 In our example, first we have the reduce vertex with a fixed window of duration 5s. Since the input is 5 and 10, the output from the first reduce vertex will be 25 (5 * 5) and 50 (5 * 10). This will be passed to the next non-keyed reduce vertex with a fixed window duration of 10s. Because this vertex is non-keyed, it will combine the inputs and produce the output of 150(25 * 2 + 50 * 2), which will be passed to the reduce vertex with a sliding window of duration 60s and a slide duration of 10s. Hence the final output will be 900(150 * 6).","title":"Complex Reduce Pipeline"},{"location":"user-guide/user-defined-functions/reduce/reduce/","text":"Reduce UDF \u00b6 Overview \u00b6 Reduce is one of the most commonly used abstractions in a stream processing pipeline to define aggregation functions on a stream of data. 
It is the reduce feature that helps us solve problems like \"performing a summary operation (such as counting the number of occurrences of a key, yielding user login frequencies), etc.\" Since the input is an unbounded stream (with infinite entries), we need an additional parameter to convert the unbounded problem to a bounded problem and provide results on that. That bounding condition is \"time\", e.g., \"number of users logged in per minute\". So while processing an unbounded stream of data, we need a way to group elements into finite chunks using time. To build these chunks, the reduce function is applied to the set of records produced using the concept of windowing . Reduce Pseudo code \u00b6 Unlike in a map vertex, where only a single element is given to the user-defined function, in reduce there is a group of elements, so an iterator is passed to the reduce function. The following is a generic outline of a reduce function. I have written the pseudo-code using the accumulator to show that very powerful functions can be applied using this reduce semantics. # reduceFn func is a generic reduce function that processes a set of elements def reduceFn ( keys : List [ str ], datums : Iterator [ Datum ], md : Metadata ) -> Messages : # initialize_accumalor could be any function that starts off with an empty # state. e.g., accumulator = 0 accumulator = initialize_accumalor () # we are iterating on the input set of elements for d in datums : # accumulator.add_input() can be any function. # e.g., it could be as simple as accumulator += 1 accumulator . add_input ( d ) # once we are done with iterating on the elements, we return the result # accumulator.result() can be str.encode(accumulator) return Messages ( Message ( accumulator . result (), keys )) Specification \u00b6 The structure for defining a reduce vertex is as follows. - name : my-reduce-udf udf : container : image : my-reduce-udf:latest groupBy : window : ... keyed : ... storage : ... 
The reduce spec adds a new section called groupBy and this is how we differentiate a map vertex from a reduce vertex. There are two important fields, the window and keyed . These two fields play an important role in grouping the data together and passing it to the user-defined reduce code. Reduce supports parallel processing by defining partitions in the vertex. This is because auto-scaling is not supported in a reduce vertex. If partitions is not defined, a default of one will be used. - name : my-reduce-udf partitions : integer It is wrong to set partitions > 1 if it is a non-keyed vertex ( keyed: false ). There are a couple of examples that demonstrate Fixed windows, Sliding windows, chaining of windows, keyed streams, etc. Time Characteristics \u00b6 All windowing operations generate new records as an output of reduce operations. Event-time and Watermark are two main primitives that determine how the time propagates in a streaming application. So for all new records generated in a reduce operation, event time is set to the end time of the window. For example, for a reduce operation over a keyed/non-keyed window with a start and end defined by [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) , event time for all the records generated will be set to 2031-09-29T18:47:59.999Z . Since millisecond is the smallest granularity (as of now), event time is set to the last timestamp that belongs to a window. Watermark is treated similarly: the watermark is set to the last timestamp for a given window. So for the example above, the value of the watermark will be set to the last timestamp, i.e., 2031-09-29T18:47:59.999Z . This applies to all the window types regardless of whether they are keyed or non-keyed windows. Allowed Lateness \u00b6 The allowedLateness flag on the Reduce vertex will allow late data to be processed by slowing down the close-of-book operation of the Reduce vertex. 
Late data will be included for the Reduce operation as long as the late data is not later than (CurrentWatermark - AllowedLateness) . Without allowedLateness , late data will be rejected and dropped. Each Reduce vertex can have its own allowedLateness . vertices : - name : my-udf udf : groupBy : allowedLateness : 5s # Optional, allowedLateness is disabled by default Storage \u00b6 Reduce unlike map requires persistence. To support persistence user has to define the storage configuration. We replay the data stored in this storage on pod startup if there has been a restart of the reduce pod caused due to pod migrations, etc. vertices : - name : my-udf udf : groupBy : storage : .... Persistent Volume Claim (PVC) \u00b6 persistentVolumeClaim supports the following fields, volumeSize , storageClassName , and accessMode . As name suggests, volumeSize specifies the size of the volume. accessMode can be of many types, but for reduce use case we need only ReadWriteOnce . storageClassName can also be provided, more info on storage class can be found here . The default value of storageClassName is default which is default StorageClass may be deployed to a Kubernetes cluster by addon manager during installation. Example \u00b6 vertices : - name : my-udf udf : groupBy : storage : persistentVolumeClaim : volumeSize : 10Gi accessMode : ReadWriteOnce EmptyDir \u00b6 We also support emptyDir for quick experimentation. We do not recommend this in production setup. If we use emptyDir , we will end up in data loss if there are pod migrations. emptyDir also takes an optional sizeLimit . medium field controls where emptyDir volumes are stored. By default emptyDir volumes are stored on whatever medium that backs the node such as disk, SSD, or network storage, depending on your environment. If you set the medium field to \"Memory\" , Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead. 
Example \u00b6 vertices : - name : my-udf udf : groupBy : storage : emptyDir : {}","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/reduce/#reduce-udf","text":"","title":"Reduce UDF"},{"location":"user-guide/user-defined-functions/reduce/reduce/#overview","text":"Reduce is one of the most commonly used abstractions in a stream processing pipeline to define aggregation functions on a stream of data. It is the reduce feature that helps us solve problems like \"performing a summary operation (such as counting the number of occurrences of a key, yielding user login frequencies), etc.\" Since the input is an unbounded stream (with infinite entries), we need an additional parameter to convert the unbounded problem to a bounded problem and provide results on that. That bounding condition is \"time\", e.g., \"number of users logged in per minute\". So while processing an unbounded stream of data, we need a way to group elements into finite chunks using time. To build these chunks, the reduce function is applied to the set of records produced using the concept of windowing .","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/reduce/#reduce-pseudo-code","text":"Unlike in a map vertex, where only a single element is given to the user-defined function, in reduce there is a group of elements, so an iterator is passed to the reduce function. The following is a generic outline of a reduce function. I have written the pseudo-code using the accumulator to show that very powerful functions can be applied using this reduce semantics. # reduceFn func is a generic reduce function that processes a set of elements def reduceFn ( keys : List [ str ], datums : Iterator [ Datum ], md : Metadata ) -> Messages : # initialize_accumalor could be any function that starts off with an empty # state. 
e.g., accumulator = 0 accumulator = initialize_accumalor () # we are iterating on the input set of elements for d in datums : # accumulator.add_input() can be any function. # e.g., it could be as simple as accumulator += 1 accumulator . add_input ( d ) # once we are done with iterating on the elements, we return the result # accumulator.result() can be str.encode(accumulator) return Messages ( Message ( accumulator . result (), keys ))","title":"Reduce Pseudo code"},{"location":"user-guide/user-defined-functions/reduce/reduce/#specification","text":"The structure for defining a reduce vertex is as follows. - name : my-reduce-udf udf : container : image : my-reduce-udf:latest groupBy : window : ... keyed : ... storage : ... The reduce spec adds a new section called groupBy and this is how we differentiate a map vertex from a reduce vertex. There are two important fields, the window and keyed . These two fields play an important role in grouping the data together and passing it to the user-defined reduce code. Reduce supports parallel processing by defining partitions in the vertex. This is because auto-scaling is not supported in a reduce vertex. If partitions is not defined, a default of one will be used. - name : my-reduce-udf partitions : integer It is wrong to set partitions > 1 if it is a non-keyed vertex ( keyed: false ). There are a couple of examples that demonstrate Fixed windows, Sliding windows, chaining of windows, keyed streams, etc.","title":"Specification"},{"location":"user-guide/user-defined-functions/reduce/reduce/#time-characteristics","text":"All windowing operations generate new records as an output of reduce operations. Event-time and Watermark are two main primitives that determine how the time propagates in a streaming application. So for all new records generated in a reduce operation, event time is set to the end time of the window. 
For example, for a reduce operation over a keyed/non-keyed window with a start and end defined by [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) , event time for all the records generated will be set to 2031-09-29T18:47:59.999Z . Since millisecond is the smallest granularity (as of now), event time is set to the last timestamp that belongs to a window. Watermark is treated similarly: the watermark is set to the last timestamp for a given window. So for the example above, the value of the watermark will be set to the last timestamp, i.e., 2031-09-29T18:47:59.999Z . This applies to all the window types regardless of whether they are keyed or non-keyed windows.","title":"Time Characteristics"},{"location":"user-guide/user-defined-functions/reduce/reduce/#allowed-lateness","text":"The allowedLateness flag on the Reduce vertex will allow late data to be processed by slowing down the close-of-book operation of the Reduce vertex. Late data will be included for the Reduce operation as long as the late data is not later than (CurrentWatermark - AllowedLateness) . Without allowedLateness , late data will be rejected and dropped. Each Reduce vertex can have its own allowedLateness . vertices : - name : my-udf udf : groupBy : allowedLateness : 5s # Optional, allowedLateness is disabled by default","title":"Allowed Lateness"},{"location":"user-guide/user-defined-functions/reduce/reduce/#storage","text":"Reduce, unlike map, requires persistence. To support persistence, the user has to define the storage configuration. We replay the data stored in this storage on pod startup if there has been a restart of the reduce pod caused by pod migrations, etc. vertices : - name : my-udf udf : groupBy : storage : ....","title":"Storage"},{"location":"user-guide/user-defined-functions/reduce/reduce/#persistent-volume-claim-pvc","text":"persistentVolumeClaim supports the following fields, volumeSize , storageClassName , and accessMode . As the name suggests, volumeSize specifies the size of the volume. 
accessMode can be of many types, but for reduce use case we need only ReadWriteOnce . storageClassName can also be provided, more info on storage class can be found here . The default value of storageClassName is default which is default StorageClass may be deployed to a Kubernetes cluster by addon manager during installation.","title":"Persistent Volume Claim (PVC)"},{"location":"user-guide/user-defined-functions/reduce/reduce/#example","text":"vertices : - name : my-udf udf : groupBy : storage : persistentVolumeClaim : volumeSize : 10Gi accessMode : ReadWriteOnce","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/reduce/#emptydir","text":"We also support emptyDir for quick experimentation. We do not recommend this in production setup. If we use emptyDir , we will end up in data loss if there are pod migrations. emptyDir also takes an optional sizeLimit . medium field controls where emptyDir volumes are stored. By default emptyDir volumes are stored on whatever medium that backs the node such as disk, SSD, or network storage, depending on your environment. If you set the medium field to \"Memory\" , Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead.","title":"EmptyDir"},{"location":"user-guide/user-defined-functions/reduce/reduce/#example_1","text":"vertices : - name : my-udf udf : groupBy : storage : emptyDir : {}","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/","text":"Fixed \u00b6 Overview \u00b6 Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30 second windows, one minute windows, etc. They are generally aligned, i.e. every window applies across all the data for the corresponding period of time. It has a fixed size measured in time and does not overlap. The element which belongs to one window will not belong to any other tumbling window. 
For example, a window size of 20 seconds will include all entities of the stream which came in a certain 20-second interval. To enable Fixed window, we use fixed under window section. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". Length \u00b6 The length is the window size of the fixed window. Example \u00b6 A 60-second window size can be defined as following. vertices : - name : my-udf udf : groupBy : window : fixed : length : 60s The yaml snippet above contains an example spec of a reduce vertex that uses fixed window aggregation. As we can see, the length of the window is 60s. This means only one window will be active at any point in time. It is also possible to have multiple inactive and non-empty windows (based on out-of-order arrival of elements). The window boundaries for the first window (post bootstrap) are determined by rounding down from time.now() to the nearest multiple of length of the window. So considering the above example, if the time.now() corresponds to 2031-09-29T18:46:30Z , then the start-time of the window will be adjusted to 2031-09-29T18:46:00Z and the end-time is set accordingly to 2031-09-29T18:47:00Z . Windows are left inclusive and right exclusive which means an element with event time (considering event time characteristic) of 2031-09-29T18:47:00Z will belong to the window with boundaries [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) It is important to note that because of this property, for a constant throughput, the first window may contain fewer elements than other windows. Check the links below to see the UDF examples for different languages. 
Python Golang Java Streaming Mode \u00b6 Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the fixed window configuration. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Fixed"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#fixed","text":"","title":"Fixed"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#overview","text":"Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30 second windows, one minute windows, etc. They are generally aligned, i.e. every window applies across all the data for the corresponding period of time. It has a fixed size measured in time and does not overlap. The element which belongs to one window will not belong to any other tumbling window. For example, a window size of 20 seconds will include all entities of the stream which came in a certain 20-second interval. To enable Fixed window, we use fixed under window section. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". 
Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#length","text":"The length is the window size of the fixed window.","title":"Length"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#example","text":"A 60-second window size can be defined as following. vertices : - name : my-udf udf : groupBy : window : fixed : length : 60s The yaml snippet above contains an example spec of a reduce vertex that uses fixed window aggregation. As we can see, the length of the window is 60s. This means only one window will be active at any point in time. It is also possible to have multiple inactive and non-empty windows (based on out-of-order arrival of elements). The window boundaries for the first window (post bootstrap) are determined by rounding down from time.now() to the nearest multiple of length of the window. So considering the above example, if the time.now() corresponds to 2031-09-29T18:46:30Z , then the start-time of the window will be adjusted to 2031-09-29T18:46:00Z and the end-time is set accordingly to 2031-09-29T18:47:00Z . Windows are left inclusive and right exclusive which means an element with event time (considering event time characteristic) of 2031-09-29T18:47:00Z will belong to the window with boundaries [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) It is important to note that because of this property, for a constant throughput, the first window may contain fewer elements than other windows. Check the links below to see the UDF examples for different languages. Python Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#streaming-mode","text":"Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. 
This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the fixed window configuration. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/","text":"Session \u00b6 Overview \u00b6 Session window is a type of Unaligned window where the window\u2019s end time keeps moving until there is no data for a given time duration. Unlike fixed and sliding windows, session windows do not overlap, nor do they have a set start and end time. They can be used to group data based on activity. vertices : - name : my-udf udf : groupBy : window : session : timeout : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". timeout \u00b6 The timeout is the duration of inactivity (no data flowing in for the particular key) after which the session is considered to be closed. Example \u00b6 To create a session window of timeout 1 minute, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : session : timeout : 60s The yaml snippet above contains an example spec of a reduce vertex that uses session window aggregation. As we can see, the timeout of the window is 60s. 
This means that if no data arrives for a particular key for 60 seconds, we will mark the window as closed. Let's say the current time, time.now() , in the pipeline is 2031-09-29T18:46:30Z and we have a session gap of 30s. If we receive events in this pattern: Event-1 at 2031-09-29T18:45:40Z Event-2 at 2031-09-29T18:45:55Z # Notice the 15 sec interval from Event-1, still within session gap Event-3 at 2031-09-29T18:46:20Z # Notice the 25 sec interval from Event-2, still within session gap Event-4 at 2031-09-29T18:46:55Z # Notice the 35 sec interval from Event-3, beyond the session gap Event-5 at 2031-09-29T18:47:10Z # Notice the 15 sec interval from Event-4, within the new session gap This would lead to two session windows as follows: [2031-09-29T18:45:40Z, 2031-09-29T18:46:20Z) # includes Event-1, Event-2 and Event-3 [2031-09-29T18:46:55Z, 2031-09-29T18:47:10Z) # includes Event-4 and Event-5 In this example, the start time is inclusive and the end time is exclusive. Event-1 , Event-2 , and Event-3 fall within the first window, and this window closes 30 seconds after Event-3 at 2031-09-29T18:46:50Z . Event-4 arrives 5 seconds later, meaning it's beyond the session gap of the previous window, initiating a new window. The second window includes Event-4 and Event-5 , and it closes 30 seconds after Event-5 at 2031-09-29T18:47:40Z , if no further events arrive for the key until the timeout. Note: Streaming mode is enabled by default for session windows. Check the links below to see the UDF examples for different languages. Currently, we have SDK support for Golang and Java. Golang Java","title":"Session"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#session","text":"","title":"Session"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#overview","text":"Session window is a type of Unaligned window where the window\u2019s end time keeps moving until there is no data for a given time duration. 
Unlike fixed and sliding windows, session windows do not overlap, nor do they have a set start and end time. They can be used to group data based on activity. vertices : - name : my-udf udf : groupBy : window : session : timeout : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#timeout","text":"The timeout is the duration of inactivity (no data flowing in for the particular key) after which the session is considered to be closed.","title":"timeout"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#example","text":"To create a session window of timeout 1 minute, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : session : timeout : 60s The yaml snippet above contains an example spec of a reduce vertex that uses session window aggregation. As we can see, the timeout of the window is 60s. This means that if no data arrives for a particular key for 60 seconds, we will mark the window as closed. Let's say the current time, time.now() , in the pipeline is 2031-09-29T18:46:30Z and we have a session gap of 30s. 
If we receive events in this pattern: Event-1 at 2031-09-29T18:45:40Z Event-2 at 2031-09-29T18:45:55Z # Notice the 15 sec interval from Event-1, still within session gap Event-3 at 2031-09-29T18:46:20Z # Notice the 25 sec interval from Event-2, still within session gap Event-4 at 2031-09-29T18:46:55Z # Notice the 35 sec interval from Event-3, beyond the session gap Event-5 at 2031-09-29T18:47:10Z # Notice the 15 sec interval from Event-4, within the new session gap This would lead to two session windows as follows: [2031-09-29T18:45:40Z, 2031-09-29T18:46:20Z) # includes Event-1, Event-2 and Event-3 [2031-09-29T18:46:55Z, 2031-09-29T18:47:10Z) # includes Event-4 and Event-5 In this example, the start time is inclusive and the end time is exclusive. Event-1 , Event-2 , and Event-3 fall within the first window, and this window closes 30 seconds after Event-3 at 2031-09-29T18:46:50Z . Event-4 arrives 5 seconds later, meaning it's beyond the session gap of the previous window, initiating a new window. The second window includes Event-4 and Event-5 , and it closes 30 seconds after Event-5 at 2031-09-29T18:47:40Z , if no further events arrive for the key until the timeout. Note: Streaming mode is by default enabled for session windows. Check the links below to see the UDF examples for different languages. Currently, we have the SDK support for Golang and Java. Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/","text":"Sliding \u00b6 Overview \u00b6 Sliding windows are similar to Fixed windows, the size of the windows is measured in time and is fixed. The important difference from the Fixed window is the fact that it allows an element to be present in more than one window. The additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows will be overlapping and the slide should be smaller than the window length. 
vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". Length \u00b6 The length is the window size of the sliding window. Slide \u00b6 slide is the slide parameter that controls the frequency at which the sliding window is created. Example \u00b6 To create a sliding window of length 1 minute which slides every 10 seconds, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : sliding : length : 60s slide : 10s The yaml snippet above contains an example spec of a reduce vertex that uses sliding window aggregation. As we can see, the length of the window is 60s and sliding frequency is once every 10s. This means there will be multiple windows active at any point in time. Let's say time.now() in the pipeline is 2031-09-29T18:46:30Z ; the active window boundaries will be as follows (there are a total of 6 windows, 60s/10s ) [2031-09-29T18:45:40Z, 2031-09-29T18:46:40Z) [2031-09-29T18:45:50Z, 2031-09-29T18:46:50Z) # notice the 10 sec shift from the above window [2031-09-29T18:46:00Z, 2031-09-29T18:47:00Z) [2031-09-29T18:46:10Z, 2031-09-29T18:47:10Z) [2031-09-29T18:46:20Z, 2031-09-29T18:47:20Z) [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) Windows are always left inclusive and right exclusive. That is why the [2031-09-29T18:45:30Z, 2031-09-29T18:46:30Z) window is not considered active (it fell on the previous window, right exclusive) but [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) is active (left inclusive). The first window always ends one slide duration after time.Now() , and the start time of the window will be the nearest integer multiple of the slide which is less than the message's event time. 
So the first window starts in the past and ends _sliding_duration (based on time progression in the pipeline and not the wall time) from present. It is important to note that regardless of the window boundary (starting in the past or ending in the future) the target element set totally depends on the matching time (in case of event time, all the elements with the time that falls with in the boundaries of the window, and in case of system time, all the elements that arrive from the present until the end of window present + sliding ) From the point above, it follows then that immediately upon startup, for the first window, fewer elements may get aggregated depending on the current lateness of the data stream. Check the links below to see the UDF examples for different languages. Python Golang Java Streaming Mode \u00b6 Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the sliding window configuration. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Sliding"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#sliding","text":"","title":"Sliding"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#overview","text":"Sliding windows are similar to Fixed windows, the size of the windows is measured in time and is fixed. 
The important difference from the Fixed window is the fact that it allows an element to be present in more than one window. The additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows will be overlapping and the slide should be smaller than the window length. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#length","text":"The length is the window size of the sliding window.","title":"Length"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#slide","text":"slide is the slide parameter that controls the frequency at which the sliding window is created.","title":"Slide"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#example","text":"To create a sliding window of length 1 minute which slides every 10 seconds, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : sliding : length : 60s slide : 10s The yaml snippet above contains an example spec of a reduce vertex that uses sliding window aggregation. As we can see, the length of the window is 60s and sliding frequency is once every 10s. This means there will be multiple windows active at any point in time. 
Let's say time.now() in the pipeline is 2031-09-29T18:46:30Z ; the active window boundaries will be as follows (there are a total of 6 windows, 60s/10s ) [2031-09-29T18:45:40Z, 2031-09-29T18:46:40Z) [2031-09-29T18:45:50Z, 2031-09-29T18:46:50Z) # notice the 10 sec shift from the above window [2031-09-29T18:46:00Z, 2031-09-29T18:47:00Z) [2031-09-29T18:46:10Z, 2031-09-29T18:47:10Z) [2031-09-29T18:46:20Z, 2031-09-29T18:47:20Z) [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) Windows are always left inclusive and right exclusive. That is why the [2031-09-29T18:45:30Z, 2031-09-29T18:46:30Z) window is not considered active (it fell on the previous window, right exclusive) but [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) is active (left inclusive). The first window always ends one slide duration after time.Now() , and the start time of the window will be the nearest integer multiple of the slide which is less than the message's event time. So the first window starts in the past and ends _sliding_duration (based on time progression in the pipeline and not the wall time) from present. It is important to note that regardless of the window boundary (starting in the past or ending in the future) the target element set totally depends on the matching time (in case of event time, all the elements with the time that falls within the boundaries of the window, and in case of system time, all the elements that arrive from the present until the end of window present + sliding ). From the point above, it follows that immediately upon startup, for the first window, fewer elements may get aggregated depending on the current lateness of the data stream. Check the links below to see the UDF examples for different languages. Python Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#streaming-mode","text":"Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. 
This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the sliding window configuration. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/","text":"Windowing \u00b6 Overview \u00b6 In the world of data processing on an unbounded stream, Windowing is a concept of grouping data using temporal boundaries. We use event-time to discover temporal boundaries on an unbounded, infinite stream and Watermark to ensure the datasets within the boundaries are complete. The reduce is applied on these grouped datasets. For example, when we say, we want to find number of users online per minute, we use windowing to group the users into one minute buckets. The entirety of windowing is under the groupBy section. vertices : - name : my-udf udf : groupBy : window : ... keyed : ... Since a window can be Non-Keyed v/s Keyed , we have an explicit field called keyed to differentiate between both (see below). Under the window section we will define different types of windows. Window Types \u00b6 Numaflow supports the following types of windows Fixed Sliding Session Non-Keyed v/s Keyed Windows \u00b6 Non-Keyed \u00b6 A non-keyed partition is a partition where the window is the boundary condition. Data processing on a non-keyed partition cannot be scaled horizontally because only one partition exists. 
A non-keyed partition is usually used after aggregation and is hardly seen at the head section of any data processing pipeline. (There is a concept called Global Window where there is no windowing, but let us table that for later). Keyed \u00b6 A keyed partition is a partition where the partition boundary is a composite key of both the window and the key from the payload (e.g., GROUP BY country, where country names are the keys). Each smaller partition now has a complete set of datasets for that key and boundary. The subdivision of dividing a huge window-based partition into smaller partitions by adding keys along with the window will help us horizontally scale the distribution. Keyed partitions are heavily used to aggregate data and are frequently seen throughout the processing pipeline. We could also convert a non-keyed problem to a set of keyed problems and apply a non-keyed function at the end. This will help solve the original problem in a scalable manner without affecting the result's completeness and/or accuracy. When a keyed window is used, an optional partitions can be specified in the vertex for parallel processing. Usage \u00b6 Numaflow supports both Keyed and Non-Keyed windows. We set keyed to either true (keyed) or false (non-keyed). Please note that the non-keyed windows are not horizontally scalable as mentioned above. vertices : - name : my-reduce partitions : 5 # Optional, defaults to 1 udf : groupBy : window : ... keyed : true # Optional, defaults to false","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#windowing","text":"","title":"Windowing"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#overview","text":"In the world of data processing on an unbounded stream, Windowing is a concept of grouping data using temporal boundaries. 
We use event-time to discover temporal boundaries on an unbounded, infinite stream and Watermark to ensure the datasets within the boundaries are complete. The reduce is applied on these grouped datasets. For example, when we say, we want to find number of users online per minute, we use windowing to group the users into one minute buckets. The entirety of windowing is under the groupBy section. vertices : - name : my-udf udf : groupBy : window : ... keyed : ... Since a window can be Non-Keyed v/s Keyed , we have an explicit field called keyed to differentiate between both (see below). Under the window section we will define different types of windows.","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#window-types","text":"Numaflow supports the following types of windows Fixed Sliding Session","title":"Window Types"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#non-keyed-vs-keyed-windows","text":"","title":"Non-Keyed v/s Keyed Windows"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#non-keyed","text":"A non-keyed partition is a partition where the window is the boundary condition. Data processing on a non-keyed partition cannot be scaled horizontally because only one partition exists. A non-keyed partition is usually used after aggregation and is hardly seen at the head section of any data processing pipeline. (There is a concept called Global Window where there is no windowing, but let us table that for later).","title":"Non-Keyed"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#keyed","text":"A keyed partition is a partition where the partition boundary is a composite key of both the window and the key from the payload (e.g., GROUP BY country, where country names are the keys). Each smaller partition now has a complete set of datasets for that key and boundary. 
The subdivision of dividing a huge window-based partition into smaller partitions by adding keys along with the window will help us horizontally scale the distribution. Keyed partitions are heavily used to aggregate data and are frequently seen throughout the processing pipeline. We could also convert a non-keyed problem to a set of keyed problems and apply a non-keyed function at the end. This will help solve the original problem in a scalable manner without affecting the result's completeness and/or accuracy. When a keyed window is used, an optional partitions can be specified in the vertex for parallel processing.","title":"Keyed"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#usage","text":"Numaflow supports both Keyed and Non-Keyed windows. We set keyed to either true (keyed) or false (non-keyed). Please note that the non-keyed windows are not horizontally scalable as mentioned above. vertices : - name : my-reduce partitions : 5 # Optional, defaults to 1 udf : groupBy : window : ... keyed : true # Optional, defaults to false","title":"Usage"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Numaflow \u00b6 Welcome to Numaflow! A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, writing boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, offering unparalleled flexibility in using the best programming language for each step and ease of using the languages you are most familiar with. 
Use Cases \u00b6 Event driven applications: Process events as they happen, e.g., updating inventory and sending customer notifications in e-commerce. Real time analytics: Analyze data instantly, e.g., social media analytics, observability data processing. Inference on streaming data: Perform real-time predictions, e.g., anomaly detection. Workflows running in a streaming manner. Learn more in our User Guide . Key Features \u00b6 Kubernetes-native: If you know Kubernetes, you already know how to use Numaflow. Serverless: Focus on your code and let the system scale up and down based on demand. Language agnostic: Use your favorite programming language. Exactly-Once semantics: No input element is duplicated or lost even as pods are rescheduled or restarted. Auto-scaling with back-pressure: Each vertex automatically scales from zero to whatever is needed. Data Integrity Guarantees \u00b6 Minimally provide at-least-once semantics Provide exactly-once semantics for unbounded and near real-time data sources Preserving order is not required Roadmap \u00b6 Map Streaming (1.3) Demo \u00b6 Getting Started \u00b6 For set-up information and running your first Numaflow pipeline, please see our getting started guide .","title":"Home"},{"location":"#numaflow","text":"Welcome to Numaflow! A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, writing boilerplate code, or operational complexities. 
Each step of the pipeline can be written in any programming language, offering unparalleled flexibility in using the best programming language for each step and ease of using the languages you are most familiar with.","title":"Numaflow"},{"location":"#use-cases","text":"Event driven applications: Process events as they happen, e.g., updating inventory and sending customer notifications in e-commerce. Real time analytics: Analyze data instantly, e.g., social media analytics, observability data processing. Inference on streaming data: Perform real-time predictions, e.g., anomaly detection. Workflows running in a streaming manner. Learn more in our User Guide .","title":"Use Cases"},{"location":"#key-features","text":"Kubernetes-native: If you know Kubernetes, you already know how to use Numaflow. Serverless: Focus on your code and let the system scale up and down based on demand. Language agnostic: Use your favorite programming language. Exactly-Once semantics: No input element is duplicated or lost even as pods are rescheduled or restarted. 
Auto-scaling with back-pressure: Each vertex automatically scales from zero to whatever is needed.","title":"Key Features"},{"location":"#data-integrity-guarantees","text":"Minimally provide at-least-once semantics Provide exactly-once semantics for unbounded and near real-time data sources Preserving order is not required","title":"Data Integrity Guarantees"},{"location":"#roadmap","text":"Map Streaming (1.3)","title":"Roadmap"},{"location":"#demo","text":"","title":"Demo"},{"location":"#getting-started","text":"For set-up information and running your first Numaflow pipeline, please see our getting started guide .","title":"Getting Started"},{"location":"APIs/","text":"Packages: numaflow.numaproj.io/v1alpha1 numaflow.numaproj.io/v1alpha1 Resource Types: AbstractPodTemplate ( Appears on: AbstractVertex , DaemonTemplate , JetStreamBufferService , JobTemplate , NativeRedis , SideInputsManagerTemplate , VertexTemplate ) AbstractPodTemplate provides a template for pod customization in vertices, daemon deployments and so on. Field Description metadata Metadata (Optional) Metadata sets the pods\u2019s metadata, i.e. annotations and labels nodeSelector map\\[string\\]string (Optional) NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node\u2019s labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ tolerations \\[\\]Kubernetes core/v1.Toleration (Optional) If specified, the pod\u2019s tolerations. securityContext Kubernetes core/v1.PodSecurityContext (Optional) SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. imagePullSecrets \\[\\]Kubernetes core/v1.LocalObjectReference (Optional) ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. 
If specified, these secrets will be passed to individual puller implementations for them to use. For example, in the case of docker, only DockerConfig type secrets are honored. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod priorityClassName string (Optional) If specified, indicates the Redis pod\u2019s priority. \u201csystem-node-critical\u201d and \u201csystem-cluster-critical\u201d are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default. More info: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ priority int32 (Optional) The priority value. Various system components use this field to find the priority of the Redis pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority. More info: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ affinity Kubernetes core/v1.Affinity (Optional) The pod\u2019s scheduling constraints More info: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ serviceAccountName string (Optional) ServiceAccountName applied to the pod runtimeClassName string (Optional) RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the \u201clegacy\u201d RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. 
More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class automountServiceAccountToken bool (Optional) AutomountServiceAccountToken indicates whether a service account token should be automatically mounted. dnsPolicy Kubernetes core/v1.DNSPolicy (Optional) Set DNS policy for the pod. Defaults to \u201cClusterFirst\u201d. Valid values are \u2018ClusterFirstWithHostNet\u2019, \u2018ClusterFirst\u2019, \u2018Default\u2019 or \u2018None\u2019. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to \u2018ClusterFirstWithHostNet\u2019. dnsConfig Kubernetes core/v1.PodDNSConfig (Optional) Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy. AbstractSink ( Appears on: Sink ) Field Description log Log (Optional) Log sink is used to write the data to the log. kafka KafkaSink (Optional) Kafka sink is used to write the data to the Kafka. blackhole Blackhole (Optional) Blackhole sink is used to write the data to the blackhole sink, which is a sink that discards all the data written to it. udsink UDSink (Optional) UDSink sink is used to write the data to the user-defined sink. AbstractVertex ( Appears on: PipelineSpec , VertexSpec ) Field Description name string source Source (Optional) sink Sink (Optional) udf UDF (Optional) containerTemplate ContainerTemplate (Optional) Container template for the main numa container. initContainerTemplate ContainerTemplate (Optional) Container template for all the vertex pod init containers spawned by numaflow, excluding the ones specified by the user. AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) 
(Optional) volumes \\[\\]Kubernetes core/v1.Volume (Optional) limits VertexLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, will override pipeline level settings scale Scale (Optional) Settings for autoscaling initContainers \\[\\]Kubernetes core/v1.Container (Optional) List of customized init containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ sidecars \\[\\]Kubernetes core/v1.Container (Optional) List of customized sidecar containers belonging to the pod. partitions int32 (Optional) Number of partitions of the vertex owned buffers. It applies to udf and sink vertices only. sideInputs \\[\\]string (Optional) Names of the side inputs used in this vertex. sideInputsContainerTemplate ContainerTemplate (Optional) Container template for the side inputs watcher container. Authorization ( Appears on: HTTPSource , ServingSource ) Field Description token Kubernetes core/v1.SecretKeySelector (Optional) A secret selector which contains bearer token To use this, the client needs to add \u201cAuthorization: Bearer \u201d in the header BasicAuth ( Appears on: NatsAuth ) BasicAuth represents the basic authentication approach which contains a user name and a password. Field Description user Kubernetes core/v1.SecretKeySelector (Optional) Secret for auth user password Kubernetes core/v1.SecretKeySelector (Optional) Secret for auth password Blackhole ( Appears on: AbstractSink ) Blackhole is a sink to emulate /dev/null BufferFullWritingStrategy ( string alias) ( Appears on: Edge ) BufferServiceConfig ( Appears on: InterStepBufferServiceStatus ) Field Description redis RedisConfig jetstream JetStreamConfig CombinedEdge ( Appears on: VertexSpec ) CombinedEdge is a combination of Edge and some other properties such as vertex type, partitions, limits. 
It\u2019s used to decorate the fromEdges and toEdges of the generated Vertex objects, so that in the vertex pod, it knows the properties of the connected vertices, for example, how many partitioned buffers I should write to, what is the write buffer length, etc. Field Description Edge Edge (Members of Edge are embedded into this type.) fromVertexType VertexType From vertex type. fromVertexPartitionCount int32 (Optional) The number of partitions of the from vertex, if not provided, the default value is set to \u201c1\u201d. fromVertexLimits VertexLimits (Optional) toVertexType VertexType To vertex type. toVertexPartitionCount int32 (Optional) The number of partitions of the to vertex, if not provided, the default value is set to \u201c1\u201d. toVertexLimits VertexLimits (Optional) ConditionType ( string alias) ConditionType is a valid value of Condition.Type Container ( Appears on: SideInput , UDF , UDSink , UDSource , UDTransformer ) Container is used to define the container properties for user-defined functions, sinks, etc. 
Field Description image string (Optional) command \\[\\]string (Optional) args \\[\\]string (Optional) env \\[\\]Kubernetes core/v1.EnvVar (Optional) envFrom \\[\\]Kubernetes core/v1.EnvFromSource (Optional) volumeMounts \\[\\]Kubernetes core/v1.VolumeMount (Optional) resources Kubernetes core/v1.ResourceRequirements (Optional) securityContext Kubernetes core/v1.SecurityContext (Optional) imagePullPolicy Kubernetes core/v1.PullPolicy (Optional) ContainerTemplate ( Appears on: AbstractVertex , DaemonTemplate , JetStreamBufferService , JobTemplate , NativeRedis , SideInputsManagerTemplate , VertexTemplate ) ContainerTemplate defines customized spec for a container Field Description resources Kubernetes core/v1.ResourceRequirements (Optional) imagePullPolicy Kubernetes core/v1.PullPolicy (Optional) securityContext Kubernetes core/v1.SecurityContext (Optional) env \\[\\]Kubernetes core/v1.EnvVar (Optional) envFrom \\[\\]Kubernetes core/v1.EnvFromSource (Optional) DaemonTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) replicas int32 (Optional) Replicas is the number of desired replicas of the Deployment. This is a pointer to distinguish between explicit zero and unspecified. Defaults to 1. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#what-is-a-replicationcontroller containerTemplate ContainerTemplate (Optional) initContainerTemplate ContainerTemplate (Optional) Edge ( Appears on: CombinedEdge , PipelineSpec ) Field Description from string to string conditions ForwardConditions (Optional) Conditional forwarding, only allowed when \u201cFrom\u201d is a Source or UDF. onFull BufferFullWritingStrategy (Optional) OnFull specifies the behaviour for the write actions when the inter step buffer is full. There are currently two options, retryUntilSuccess and discardLatest. 
if not provided, the default value is set to \u201cretryUntilSuccess\u201d FixedWindow ( Appears on: Window ) FixedWindow describes a fixed window Field Description length Kubernetes meta/v1.Duration Length is the duration of the fixed window. streaming bool (Optional) Streaming should be set to true if the reduce udf is streaming. ForwardConditions ( Appears on: Edge ) Field Description tags TagConditions Tags used to specify tags for conditional forwarding Function ( Appears on: UDF ) Field Description name string args \\[\\]string (Optional) kwargs map\\[string\\]string (Optional) GSSAPI ( Appears on: SASL ) GSSAPI represents a SASL GSSAPI config Field Description serviceName string realm string usernameSecret Kubernetes core/v1.SecretKeySelector UsernameSecret refers to the secret that contains the username authType KRB5AuthType valid inputs - KRB5_USER_AUTH, KRB5_KEYTAB_AUTH passwordSecret Kubernetes core/v1.SecretKeySelector (Optional) PasswordSecret refers to the secret that contains the password keytabSecret Kubernetes core/v1.SecretKeySelector (Optional) KeytabSecret refers to the secret that contains the keytab kerberosConfigSecret Kubernetes core/v1.SecretKeySelector (Optional) KerberosConfigSecret refers to the secret that contains the kerberos config GeneratorSource ( Appears on: Source ) Field Description rpu int64 (Optional) duration Kubernetes meta/v1.Duration (Optional) msgSize int32 (Optional) Size of each generated message keyCount int32 (Optional) KeyCount is the number of unique keys in the payload value uint64 (Optional) Value is an optional uint64 value to be written in to the payload jitter Kubernetes meta/v1.Duration (Optional) Jitter is the jitter for the message generation, used to simulate out of order messages for example if the jitter is 10s, then the message\u2019s event time will be delayed by a random time between 0 and 10s which will result in the message being out of order by 0 to 10s valueBlob string (Optional) ValueBlob is an 
optional string which is the base64 encoding of direct payload to send. This is useful for attaching a GeneratorSource to a true pipeline to test load behavior with true messages without requiring additional work to generate messages through the external source if present, the Value and MsgSize fields will be ignored. GetDaemonDeploymentReq Field Description ISBSvcType ISBSvcType Image string PullPolicy Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar DefaultResources Kubernetes core/v1.ResourceRequirements GetJetStreamServiceSpecReq Field Description Labels map\\[string\\]string ClusterPort int32 ClientPort int32 MonitorPort int32 MetricsPort int32 GetJetStreamStatefulSetSpecReq Field Description ServiceName string Labels map\\[string\\]string NatsImage string MetricsExporterImage string ConfigReloaderImage string ClusterPort int32 ClientPort int32 MonitorPort int32 MetricsPort int32 ServerAuthSecretName string ServerEncryptionSecretName string ConfigMapName string PvcNameIfNeeded string StartCommand string DefaultResources Kubernetes core/v1.ResourceRequirements GetRedisServiceSpecReq Field Description Labels map\\[string\\]string RedisContainerPort int32 SentinelContainerPort int32 GetRedisStatefulSetSpecReq Field Description ServiceName string Labels map\\[string\\]string RedisImage string SentinelImage string MetricsExporterImage string InitContainerImage string RedisContainerPort int32 SentinelContainerPort int32 RedisMetricsContainerPort int32 CredentialSecretName string TLSEnabled bool PvcNameIfNeeded string ConfConfigMapName string ScriptsConfigMapName string HealthConfigMapName string DefaultResources Kubernetes core/v1.ResourceRequirements GetSideInputDeploymentReq Field Description ISBSvcType ISBSvcType Image string PullPolicy Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar DefaultResources Kubernetes core/v1.ResourceRequirements GetVertexPodSpecReq Field Description ISBSvcType ISBSvcType Image string PullPolicy 
Kubernetes core/v1.PullPolicy Env \\[\\]Kubernetes core/v1.EnvVar SideInputsStoreName string ServingSourceStreamName string PipelineSpec PipelineSpec DefaultResources Kubernetes core/v1.ResourceRequirements GroupBy ( Appears on: UDF ) GroupBy indicates it is a reducer UDF Field Description window Window Window describes the windowing strategy. keyed bool (Optional) allowedLateness Kubernetes meta/v1.Duration (Optional) AllowedLateness allows late data to be included for the Reduce operation as long as the late data is not later than (Watermark - AllowedLateness). storage PBQStorage Storage is used to define the PBQ storage for a reduce vertex. HTTPSource ( Appears on: Source ) Field Description auth Authorization (Optional) service bool (Optional) Whether to create a ClusterIP Service ISBSvcPhase ( string alias) ( Appears on: InterStepBufferServiceStatus ) ISBSvcType ( string alias) ( Appears on: GetDaemonDeploymentReq , GetSideInputDeploymentReq , GetVertexPodSpecReq , InterStepBufferServiceStatus ) IdleSource ( Appears on: Watermark ) Field Description threshold Kubernetes meta/v1.Duration Threshold is the duration after which a source is marked as Idle due to lack of data. Ex: If watermark found to be idle after the Threshold duration then the watermark is progressed by IncrementBy . stepInterval Kubernetes meta/v1.Duration (Optional) StepInterval is the duration between the subsequent increment of the watermark as long the source remains Idle. The default value is 0s which means that once we detect idle source, we will be incrementing the watermark by IncrementBy for time we detect that we source is empty (in other words, this will be a very frequent update). incrementBy Kubernetes meta/v1.Duration IncrementBy is the duration to be added to the current watermark to progress the watermark when source is idling. 
InterStepBufferService Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. spec InterStepBufferServiceSpec redis RedisBufferService jetstream JetStreamBufferService status InterStepBufferServiceStatus (Optional) InterStepBufferServiceSpec ( Appears on: InterStepBufferService ) Field Description redis RedisBufferService jetstream JetStreamBufferService InterStepBufferServiceStatus ( Appears on: InterStepBufferService ) Field Description Status Status (Members of Status are embedded into this type.) phase ISBSvcPhase message string config BufferServiceConfig type ISBSvcType observedGeneration int64 ObservedGeneration stores the generation value observed by the controller. JetStreamBufferService ( Appears on: InterStepBufferServiceSpec ) Field Description version string JetStream version, such as \u201c2.7.1\u201d replicas int32 JetStream StatefulSet size containerTemplate ContainerTemplate (Optional) ContainerTemplate contains customized spec for NATS container reloaderContainerTemplate ContainerTemplate (Optional) ReloaderContainerTemplate contains customized spec for config reloader container metricsContainerTemplate ContainerTemplate (Optional) MetricsContainerTemplate contains customized spec for metrics container persistence PersistenceStrategy (Optional) AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) settings string (Optional) Nats/JetStream configuration, if not specified, global settings in numaflow-controller-config will be used. See https://docs.nats.io/running-a-nats-service/configuration#limits and https://docs.nats.io/running-a-nats-service/configuration#jetstream . For limits, only \u201cmax_payload\u201d is supported for configuration, defaults to 1048576 (1MB), not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). 
For jetstream, only \u201cmax_memory_store\u201d and \u201cmax_file_store\u201d are supported for configuration, do not set \u201cstore_dir\u201d as it has been hardcoded. startArgs \\[\\]string (Optional) Optional arguments to start nats-server. For example, \u201c-D\u201d to enable debugging output, \u201c-DV\u201d to enable debugging and tracing. Check https://docs.nats.io/ for all the available arguments. bufferConfig string (Optional) Optional configuration for the streams, consumers and buckets to be created in this JetStream service, if specified, it will be merged with the default configuration in numaflow-controller-config. It accepts a YAML format configuration, it may include 4 sections, \u201cstream\u201d, \u201cconsumer\u201d, \u201cotBucket\u201d and \u201cprocBucket\u201d. Available fields under \u201cstream\u201d include \u201cretention\u201d (e.g. interest, limits, workerQueue), \u201cmaxMsgs\u201d, \u201cmaxAge\u201d (e.g. 72h), \u201creplicas\u201d (1, 3, 5), \u201cduplicates\u201d (e.g. 5m). Available fields under \u201cconsumer\u201d include \u201cackWait\u201d (e.g. 60s) Available fields under \u201cotBucket\u201d include \u201cmaxValueSize\u201d, \u201chistory\u201d, \u201cttl\u201d (e.g. 72h), \u201cmaxBytes\u201d, \u201creplicas\u201d (1, 3, 5). Available fields under \u201cprocBucket\u201d include \u201cmaxValueSize\u201d, \u201chistory\u201d, \u201cttl\u201d (e.g. 72h), \u201cmaxBytes\u201d, \u201creplicas\u201d (1, 3, 5). encryption bool (Optional) Whether encrypt the data at rest, defaults to false Enabling encryption might impact the performance, see https://docs.nats.io/running-a-nats-service/nats_admin/jetstream_admin/encryption_at_rest for the detail Toggling the value will impact encrypting/decrypting existing messages. 
tls bool (Optional) Whether enable TLS, defaults to false Enabling TLS might impact the performance JetStreamConfig ( Appears on: BufferServiceConfig ) Field Description url string JetStream (NATS) URL auth NatsAuth streamConfig string (Optional) tlsEnabled bool TLS enabled or not JetStreamSource ( Appears on: Source ) Field Description url string URL to connect to NATS cluster, multiple urls could be separated by comma. stream string Stream represents the name of the stream. tls TLS (Optional) TLS configuration for the nats client. auth NatsAuth (Optional) Auth information JobTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) containerTemplate ContainerTemplate (Optional) ttlSecondsAfterFinished int32 (Optional) ttlSecondsAfterFinished limits the lifetime of a Job that has finished execution (either Complete or Failed). If this field is set, ttlSecondsAfterFinished after the Job finishes, it is eligible to be automatically deleted. When the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will be honored. If this field is unset, the Job won\u2019t be automatically deleted. If this field is set to zero, the Job becomes eligible to be deleted immediately after it finishes. Numaflow defaults to 30 backoffLimit int32 (Optional) Specifies the number of retries before marking this job failed. More info: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy Numaflow defaults to 20 KRB5AuthType ( string alias) ( Appears on: GSSAPI ) KRB5AuthType describes the kerberos auth type KafkaSink ( Appears on: AbstractSink ) Field Description brokers \\[\\]string topic string tls TLS (Optional) TLS user to configure TLS connection for kafka broker TLS.enable=true default for TLS. config string (Optional) sasl SASL (Optional) SASL user to configure SASL connection for kafka broker SASL.enable=true default for SASL. 
KafkaSource ( Appears on: Source ) Field Description brokers \\[\\]string topic string consumerGroup string tls TLS (Optional) TLS user to configure TLS connection for kafka broker TLS.enable=true default for TLS. config string (Optional) sasl SASL (Optional) SASL user to configure SASL connection for kafka broker SASL.enable=true default for SASL. Lifecycle ( Appears on: PipelineSpec ) Field Description deleteGracePeriodSeconds int32 (Optional) DeleteGracePeriodSeconds used to delete pipeline gracefully desiredPhase PipelinePhase (Optional) DesiredPhase used to bring the pipeline from current phase to desired phase pauseGracePeriodSeconds int32 (Optional) PauseGracePeriodSeconds used to pause pipeline gracefully Log ( Appears on: AbstractSink ) LogicOperator ( string alias) ( Appears on: TagConditions ) Metadata ( Appears on: AbstractPodTemplate ) Field Description annotations map\\[string\\]string labels map\\[string\\]string NativeRedis ( Appears on: RedisBufferService ) Field Description version string Redis version, such as \u201c6.0.16\u201d replicas int32 Redis StatefulSet size redisContainerTemplate ContainerTemplate (Optional) RedisContainerTemplate contains customized spec for Redis container sentinelContainerTemplate ContainerTemplate (Optional) SentinelContainerTemplate contains customized spec for Redis container metricsContainerTemplate ContainerTemplate (Optional) MetricsContainerTemplate contains customized spec for metrics container initContainerTemplate ContainerTemplate (Optional) persistence PersistenceStrategy (Optional) AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) settings RedisSettings (Optional) Redis configuration, if not specified, global settings in numaflow-controller-config will be used. 
NatsAuth ( Appears on: JetStreamConfig , JetStreamSource , NatsSource ) NatsAuth defines how to authenticate the nats access Field Description basic BasicAuth (Optional) Basic auth which contains a username and a password token Kubernetes core/v1.SecretKeySelector (Optional) Token auth nkey Kubernetes core/v1.SecretKeySelector (Optional) NKey auth NatsSource ( Appears on: Source ) Field Description url string URL to connect to NATS cluster, multiple urls could be separated by comma. subject string Subject holds the name of the subject onto which messages are published. queue string Queue is used for queue subscription. tls TLS (Optional) TLS configuration for the nats client. auth NatsAuth (Optional) Auth information NoStore ( Appears on: PBQStorage ) NoStore means there will be no persistence storage and there will be data loss during pod restarts. Use this option only if you do not care about correctness (e.g., approx statistics pipeline like sampling rate, etc.). PBQStorage ( Appears on: GroupBy ) PBQStorage defines the persistence configuration for a vertex. Field Description persistentVolumeClaim PersistenceStrategy (Optional) emptyDir Kubernetes core/v1.EmptyDirVolumeSource (Optional) no_store NoStore (Optional) PersistenceStrategy ( Appears on: JetStreamBufferService , NativeRedis , PBQStorage ) PersistenceStrategy defines the strategy of persistence Field Description storageClassName string (Optional) Name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1 accessMode Kubernetes core/v1.PersistentVolumeAccessMode (Optional) Available access modes such as ReadWriteOnce, ReadWriteMany https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes volumeSize k8s.io/apimachinery/pkg/api/resource.Quantity Volume size, e.g. 50Gi Pipeline Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. 
spec PipelineSpec interStepBufferServiceName string (Optional) vertices \\[\\]AbstractVertex edges \\[\\]Edge Edges define the relationships between vertices lifecycle Lifecycle (Optional) Lifecycle define the Lifecycle properties limits PipelineLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, they could be overridden by each vertex\u2019s settings watermark Watermark (Optional) Watermark enables watermark progression across the entire pipeline. templates Templates (Optional) Templates are used to customize additional kubernetes resources required for the Pipeline sideInputs \\[\\]SideInput (Optional) SideInputs defines the Side Inputs of a pipeline. status PipelineStatus (Optional) PipelineLimits ( Appears on: PipelineSpec ) Field Description readBatchSize uint64 (Optional) Read batch size for all the vertices in the pipeline, can be overridden by the vertex\u2019s limit settings. bufferMaxLength uint64 (Optional) BufferMaxLength is used to define the max length of a buffer. Only applies to UDF and Source vertices as only they do buffer write. It can be overridden by the settings in vertex limits. bufferUsageLimit uint32 (Optional) BufferUsageLimit is used to define the percentage of the buffer usage limit, a valid value should be less than 100, for example, 85. Only applies to UDF and Source vertices as only they do buffer write. It will be overridden by the settings in vertex limits. 
readTimeout Kubernetes meta/v1.Duration (Optional) Read timeout for all the vertices in the pipeline, can be overridden by the vertex\u2019s limit settings PipelinePhase ( string alias) ( Appears on: Lifecycle , PipelineStatus ) PipelineSpec ( Appears on: GetVertexPodSpecReq , Pipeline ) Field Description interStepBufferServiceName string (Optional) vertices \\[\\]AbstractVertex edges \\[\\]Edge Edges define the relationships between vertices lifecycle Lifecycle (Optional) Lifecycle define the Lifecycle properties limits PipelineLimits (Optional) Limits define the limitations such as buffer read batch size for all the vertices of a pipeline, they could be overridden by each vertex\u2019s settings watermark Watermark (Optional) Watermark enables watermark progression across the entire pipeline. templates Templates (Optional) Templates are used to customize additional kubernetes resources required for the Pipeline sideInputs \\[\\]SideInput (Optional) SideInputs defines the Side Inputs of a pipeline. PipelineStatus ( Appears on: Pipeline ) Field Description Status Status (Members of Status are embedded into this type.) phase PipelinePhase message string lastUpdated Kubernetes meta/v1.Time vertexCount uint32 sourceCount uint32 sinkCount uint32 udfCount uint32 observedGeneration int64 ObservedGeneration stores the generation value observed by the controller. 
RedisBufferService ( Appears on: InterStepBufferServiceSpec ) Field Description native NativeRedis Native brings up a native Redis service external RedisConfig External holds an External Redis config RedisConfig ( Appears on: BufferServiceConfig , RedisBufferService ) Field Description url string (Optional) Redis URL sentinelUrl string (Optional) Sentinel URL, will be ignored if Redis URL is provided masterName string (Optional) Only required when Sentinel is used user string (Optional) Redis user password Kubernetes core/v1.SecretKeySelector (Optional) Redis password secret selector sentinelPassword Kubernetes core/v1.SecretKeySelector (Optional) Sentinel password secret selector RedisSettings ( Appears on: NativeRedis ) Field Description redis string (Optional) Redis settings shared by both master and slaves, will override the global settings from controller config master string (Optional) Special settings for Redis master node, will override the global settings from controller config replica string (Optional) Special settings for Redis replica nodes, will override the global settings from controller config sentinel string (Optional) Sentinel settings, will override the global settings from controller config SASL ( Appears on: KafkaSink , KafkaSource ) Field Description mechanism SASLType SASL mechanism to use gssapi GSSAPI (Optional) GSSAPI contains the kerberos config plain SASLPlain (Optional) SASLPlain contains the sasl plain config scramsha256 SASLPlain (Optional) SASLSCRAMSHA256 contains the sasl plain config scramsha512 SASLPlain (Optional) SASLSCRAMSHA512 contains the sasl plain config SASLPlain ( Appears on: SASL ) Field Description userSecret Kubernetes core/v1.SecretKeySelector UserSecret refers to the secret that contains the user passwordSecret Kubernetes core/v1.SecretKeySelector (Optional) PasswordSecret refers to the secret that contains the password handshake bool SASLType ( string alias) ( Appears on: SASL ) SASLType describes the SASL type 
Scale ( Appears on: AbstractVertex ) Scale defines the parameters for autoscaling. Field Description disabled bool (Optional) Whether to disable autoscaling. Set to \u201ctrue\u201d when using Kubernetes HPA or any other 3rd party autoscaling strategies. min int32 (Optional) Minimum replicas. max int32 (Optional) Maximum replicas. lookbackSeconds uint32 (Optional) Lookback seconds to calculate the average pending messages and processing rate. cooldownSeconds uint32 (Optional) Deprecated: Use scaleUpCooldownSeconds and scaleDownCooldownSeconds instead. Cooldown seconds after a scaling operation before another one. zeroReplicaSleepSeconds uint32 (Optional) The number of seconds to sleep after scaling the source vertex down to 0 before scaling it back up to peek. targetProcessingSeconds uint32 (Optional) TargetProcessingSeconds is used to tune the aggressiveness of autoscaling for source vertices; it measures how fast you want the vertex to process all the pending messages. Typically, increasing the value leads to a lower processing rate and thus fewer replicas. It\u2019s only effective for source vertices. targetBufferAvailability uint32 (Optional) TargetBufferAvailability is used to define the target percentage of the buffer availability. A valid and meaningful value should be less than the BufferUsageLimit defined in the Edge spec (or Pipeline spec), for example, 50. It only applies to UDF and Sink vertices because only they have buffers to read. replicasPerScale uint32 (Optional) ReplicasPerScale defines the maximum number of replicas that can be scaled up or down at once. This is used to prevent overly aggressive scaling operations. scaleUpCooldownSeconds uint32 (Optional) ScaleUpCooldownSeconds defines the cooldown seconds after a scaling operation, before a follow-up scaling up. It defaults to the CooldownSeconds if not set. 
scaleDownCooldownSeconds uint32 (Optional) ScaleDownCooldownSeconds defines the cooldown seconds after a scaling operation, before a follow-up scaling down. It defaults to the CooldownSeconds if not set. ServingSource ( Appears on: Source ) ServingSource is the HTTP endpoint for Numaflow. Field Description auth Authorization (Optional) service bool (Optional) Whether to create a ClusterIP Service msgIDHeaderKey string The header key from which the message id will be extracted store ServingStore Persistent store for the callbacks for serving and tracking ServingStore ( Appears on: ServingSource ) ServingStore to track and store data and metadata for tracking and serving. Field Description url string URL of the persistent store to write the callbacks ttl Kubernetes meta/v1.Duration (Optional) TTL for the data in the store and tracker SessionWindow ( Appears on: Window ) SessionWindow describes a session window Field Description timeout Kubernetes meta/v1.Duration Timeout is the duration of inactivity after which a session window closes. SideInput ( Appears on: PipelineSpec ) SideInput defines information of a Side Input Field Description name string container Container volumes \\[\\]Kubernetes core/v1.Volume (Optional) trigger SideInputTrigger SideInputTrigger ( Appears on: SideInput ) Field Description schedule string The schedule to trigger the retrievement of the side input data. It supports cron format, for example, \u201c0 30 \\* \\* \\* \\*\u201d. Or interval based format, such as \u201c@hourly\u201d, \u201c@every 1h30m\u201d, etc. timezone string (Optional) SideInputsManagerTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) 
(Optional) containerTemplate ContainerTemplate (Optional) Template for the side inputs manager numa container initContainerTemplate ContainerTemplate (Optional) Template for the side inputs manager init container Sink ( Appears on: AbstractVertex ) Field Description AbstractSink AbstractSink (Members of AbstractSink are embedded into this type.) fallback AbstractSink (Optional) Fallback sink can be imagined as DLQ for primary Sink. The writes to Fallback sink will only be initiated if the ud-sink response field sets it. SlidingWindow ( Appears on: Window ) SlidingWindow describes a sliding window Field Description length Kubernetes meta/v1.Duration Length is the duration of the sliding window. slide Kubernetes meta/v1.Duration Slide is the slide parameter that controls the frequency at which the sliding window is created. streaming bool (Optional) Streaming should be set to true if the reduce udf is streaming. Source ( Appears on: AbstractVertex ) Field Description generator GeneratorSource (Optional) kafka KafkaSource (Optional) http HTTPSource (Optional) nats NatsSource (Optional) transformer UDTransformer (Optional) udsource UDSource (Optional) jetstream JetStreamSource (Optional) serving ServingSource (Optional) Status ( Appears on: InterStepBufferServiceStatus , PipelineStatus ) Status is a common structure which can be used for Status field. Field Description conditions \\[\\]Kubernetes meta/v1.Condition (Optional) Conditions are the latest available observations of a resource\u2019s current state. 
TLS ( Appears on: JetStreamSource , KafkaSink , KafkaSource , NatsSource ) Field Description insecureSkipVerify bool (Optional) caCertSecret Kubernetes core/v1.SecretKeySelector (Optional) CACertSecret refers to the secret that contains the CA cert certSecret Kubernetes core/v1.SecretKeySelector (Optional) CertSecret refers to the secret that contains the cert keySecret Kubernetes core/v1.SecretKeySelector (Optional) KeySecret refers to the secret that contains the key TagConditions ( Appears on: ForwardConditions ) Field Description operator LogicOperator (Optional) Operator specifies the type of operation that should be used for conditional forwarding value could be \u201cand\u201d, \u201cor\u201d, \u201cnot\u201d values \\[\\]string Values tag values for conditional forwarding Templates ( Appears on: PipelineSpec ) Field Description daemon DaemonTemplate (Optional) DaemonTemplate is used to customize the Daemon Deployment. job JobTemplate (Optional) JobTemplate is used to customize Jobs. sideInputsManager SideInputsManagerTemplate (Optional) SideInputsManagerTemplate is used to customize the Side Inputs Manager. vertex VertexTemplate (Optional) VertexTemplate is used to customize the vertices of the pipeline. Transformer ( Appears on: UDTransformer ) Field Description name string args \\[\\]string (Optional) kwargs map\\[string\\]string (Optional) UDF ( Appears on: AbstractVertex ) Field Description container Container (Optional) builtin Function (Optional) groupBy GroupBy (Optional) UDSink ( Appears on: AbstractSink ) Field Description container Container UDSource ( Appears on: Source ) Field Description container Container UDTransformer ( Appears on: Source ) Field Description container Container (Optional) builtin Transformer (Optional) Vertex ( Appears on: VertexInstance ) Field Description metadata Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field. 
spec VertexSpec AbstractVertex AbstractVertex (Members of AbstractVertex are embedded into this type.) pipelineName string interStepBufferServiceName string (Optional) replicas int32 (Optional) fromEdges \\[\\]CombinedEdge (Optional) toEdges \\[\\]CombinedEdge (Optional) watermark Watermark (Optional) Watermark indicates watermark progression in the vertex, it\u2019s populated from the pipeline watermark settings. status VertexStatus (Optional) VertexInstance VertexInstance is a wrapper of a vertex instance, which contains the vertex spec and the instance information such as hostname and replica index. Field Description vertex Vertex hostname string replica int32 VertexLimits ( Appears on: AbstractVertex , CombinedEdge ) Field Description readBatchSize uint64 (Optional) Read batch size from the source or buffer. It overrides the settings from pipeline limits. readTimeout Kubernetes meta/v1.Duration (Optional) Read timeout duration from the source or buffer It overrides the settings from pipeline limits. bufferMaxLength uint64 (Optional) BufferMaxLength is used to define the max length of a buffer. It overrides the settings from pipeline limits. bufferUsageLimit uint32 (Optional) BufferUsageLimit is used to define the percentage of the buffer usage limit, a valid value should be less than 100, for example, 85. It overrides the settings from pipeline limits. VertexPhase ( string alias) ( Appears on: VertexStatus ) VertexSpec ( Appears on: Vertex ) Field Description AbstractVertex AbstractVertex (Members of AbstractVertex are embedded into this type.) pipelineName string interStepBufferServiceName string (Optional) replicas int32 (Optional) fromEdges \\[\\]CombinedEdge (Optional) toEdges \\[\\]CombinedEdge (Optional) watermark Watermark (Optional) Watermark indicates watermark progression in the vertex, it\u2019s populated from the pipeline watermark settings. 
VertexStatus ( Appears on: Vertex ) Field Description phase VertexPhase reason string message string replicas uint32 selector string lastScaledAt Kubernetes meta/v1.Time VertexTemplate ( Appears on: Templates ) Field Description AbstractPodTemplate AbstractPodTemplate (Members of AbstractPodTemplate are embedded into this type.) (Optional) containerTemplate ContainerTemplate (Optional) Template for the vertex numa container initContainerTemplate ContainerTemplate (Optional) Template for the vertex init container VertexType ( string alias) ( Appears on: CombinedEdge ) Watermark ( Appears on: PipelineSpec , VertexSpec ) Field Description disabled bool (Optional) Disabled toggles the watermark propagation, defaults to false. maxDelay Kubernetes meta/v1.Duration (Optional) Maximum delay allowed for watermark calculation, defaults to \u201c0s\u201d, which means no delay. idleSource IdleSource (Optional) IdleSource defines the idle watermark properties, it could be configured in case source is idling. Window ( Appears on: GroupBy ) Window describes windowing strategy Field Description fixed FixedWindow (Optional) sliding SlidingWindow (Optional) session SessionWindow (Optional) Generated with gen-crd-api-reference-docs .","title":"APIs"},{"location":"quick-start/","text":"Quick Start \u00b6 On this page, we will guide you through the steps to: Install Numaflow. Create and run a simple pipeline. Create and run an advanced pipeline. Before you begin: prerequisites \u00b6 To try Numaflow, you will first need to set up one of the following options to run container images: Docker Desktop podman Then use one of the following options to create a local Kubernetes cluster: Docker Desktop Kubernetes k3d kind minikube You will also need kubectl to manage the cluster. Follow these steps to install kubectl . In case you need a refresher, all the kubectl commands used in this quick start guide can be found in the kubectl Cheat Sheet . 
Installing Numaflow \u00b6 Once you have completed all the prerequisites, run the following commands to install Numaflow and start the Inter-Step Buffer Service that handles communication between vertices. kubectl create ns numaflow-system kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml Creating a simple pipeline \u00b6 As an example, we will create a simple pipeline that contains a source vertex to generate messages, a processing vertex that echoes the messages, and a sink vertex that logs the messages. Run the command below to create a simple pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml To view a list of pipelines you've created, run: kubectl get pipeline # or \"pl\" as a short name This should produce a response like the following, with AGE indicating the time elapsed since the creation of your simple pipeline. NAME PHASE MESSAGE VERTICES AGE simple-pipeline Running 3 9s To inspect the status of the pipeline, use kubectl get pods . Note that the pod names will be different from the sample response: # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s simple-pipeline-daemon-78b798fb98-qf4t4 1 /1 Running 0 10s simple-pipeline-out-0-xc0pf 1 /1 Running 0 10s simple-pipeline-cat-0-kqrhy 2 /2 Running 0 10s simple-pipeline-in-0-rhpjm 1 /1 Running 0 11s Now you can watch the log for the output vertex. Run the command below and remember to replace xxxxx with the appropriate pod name above. 
kubectl logs -f simple-pipeline-out-0-xxxxx This should generate an output like the sample below: 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"VT+G+/W7Dhc=\" , \"Createdts\" :1661471977707552597 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"0TaH+/W7Dhc=\" , \"Createdts\" :1661471977707615953 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"EEGH+/W7Dhc=\" , \"Createdts\" :1661471977707618576 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"WESH+/W7Dhc=\" , \"Createdts\" :1661471977707619416 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"YEaH+/W7Dhc=\" , \"Createdts\" :1661471977707619936 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"qfomN/a7Dhc=\" , \"Createdts\" :1661471978707942057 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"aUcnN/a7Dhc=\" , \"Createdts\" :1661471978707961705 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"iUonN/a7Dhc=\" , \"Createdts\" :1661471978707962505 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"mkwnN/a7Dhc=\" , \"Createdts\" :1661471978707963034 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"jk4nN/a7Dhc=\" , \"Createdts\" :1661471978707963534 } Numaflow also comes with a built-in user interface. NOTE : Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). You can install it by running the below command. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' To port forward the UI, run the following command. # Port forward the UI to https://localhost:8443/ kubectl -n numaflow-system port-forward deployment/numaflow-server 8443 :8443 This renders the following UI on https://localhost:8443/. 
The pipeline can be deleted by issuing the following command: kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml Creating an advanced pipeline \u00b6 Now we will walk you through creating an advanced pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of an advanced pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. 
kubectl port-forward even-odd-in-0-xxxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for the advanced pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You can also replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. The pipeline can be deleted by running kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml A pipeline with reduce (aggregation) \u00b6 To set up an example pipeline with the Reduce UDF , see Reduce Examples . What's Next \u00b6 Try more examples in the examples directory. After exploring how Numaflow pipelines run, you can check what data Sources and Sinks Numaflow supports out of the box, or learn how to write User-defined Functions . Numaflow can also be paired with Numalogic, a collection of ML models and algorithms for real-time data analytics and AIOps including anomaly detection. Visit the Numalogic homepage for more information.","title":"Quick Start"},{"location":"quick-start/#quick-start","text":"On this page, we will guide you through the steps to: Install Numaflow. Create and run a simple pipeline. 
Create and run an advanced pipeline.","title":"Quick Start"},{"location":"quick-start/#before-you-begin-prerequisites","text":"To try Numaflow, you will first need to set up one of the following options to run container images: Docker Desktop podman Then use one of the following options to create a local Kubernetes cluster: Docker Desktop Kubernetes k3d kind minikube You will also need kubectl to manage the cluster. Follow these steps to install kubectl . In case you need a refresher, all the kubectl commands used in this quick start guide can be found in the kubectl Cheat Sheet .","title":"Before you begin: prerequisites"},{"location":"quick-start/#installing-numaflow","text":"Once you have completed all the prerequisites, run the following commands to install Numaflow and start the Inter-Step Buffer Service that handles communication between vertices. kubectl create ns numaflow-system kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml","title":"Installing Numaflow"},{"location":"quick-start/#creating-a-simple-pipeline","text":"As an example, we will create a simple pipeline that contains a source vertex to generate messages, a processing vertex that echoes the messages, and a sink vertex that logs the messages. Run the command below to create a simple pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml To view a list of pipelines you've created, run: kubectl get pipeline # or \"pl\" as a short name This should produce a response like the following, with AGE indicating the time elapsed since the creation of your simple pipeline. NAME PHASE MESSAGE VERTICES AGE simple-pipeline Running 3 9s To inspect the status of the pipeline, use kubectl get pods . 
Note that the pod names will be different from the sample response: # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s simple-pipeline-daemon-78b798fb98-qf4t4 1 /1 Running 0 10s simple-pipeline-out-0-xc0pf 1 /1 Running 0 10s simple-pipeline-cat-0-kqrhy 2 /2 Running 0 10s simple-pipeline-in-0-rhpjm 1 /1 Running 0 11s Now you can watch the log for the output vertex. Run the command below and remember to replace xxxxx with the appropriate pod name above. kubectl logs -f simple-pipeline-out-0-xxxxx This should generate an output like the sample below: 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"VT+G+/W7Dhc=\" , \"Createdts\" :1661471977707552597 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"0TaH+/W7Dhc=\" , \"Createdts\" :1661471977707615953 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"EEGH+/W7Dhc=\" , \"Createdts\" :1661471977707618576 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"WESH+/W7Dhc=\" , \"Createdts\" :1661471977707619416 } 2022 /08/25 23 :59:38 ( out ) { \"Data\" : \"YEaH+/W7Dhc=\" , \"Createdts\" :1661471977707619936 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"qfomN/a7Dhc=\" , \"Createdts\" :1661471978707942057 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"aUcnN/a7Dhc=\" , \"Createdts\" :1661471978707961705 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"iUonN/a7Dhc=\" , \"Createdts\" :1661471978707962505 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"mkwnN/a7Dhc=\" , \"Createdts\" :1661471978707963034 } 2022 /08/25 23 :59:39 ( out ) { \"Data\" : \"jk4nN/a7Dhc=\" , \"Createdts\" :1661471978707963534 } Numaflow also comes with a built-in user interface. NOTE : Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). You can install it by running the below command. 
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' To port forward the UI, run the following command. # Port forward the UI to https://localhost:8443/ kubectl -n numaflow-system port-forward deployment/numaflow-server 8443 :8443 This renders the following UI on https://localhost:8443/. The pipeline can be deleted by issuing the following command: kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/1-simple-pipeline.yaml","title":"Creating a simple pipeline"},{"location":"quick-start/#creating-an-advanced-pipeline","text":"Now we will walk you through creating an advanced pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of an advanced pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . 
# Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for the advanced pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You can also replace the Log Sink with another sink, such as Kafka, to forward the data to Kafka topics. 
The pipeline can be deleted by running kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Creating an advanced pipeline"},{"location":"quick-start/#a-pipeline-with-reduce-aggregation","text":"To set up an example pipeline with the Reduce UDF , see Reduce Examples .","title":"A pipeline with reduce (aggregation)"},{"location":"quick-start/#whats-next","text":"Try more examples in the examples directory. After exploring how Numaflow pipelines run, you can check what data Sources and Sinks Numaflow supports out of the box, or learn how to write User-defined Functions . Numaflow can also be paired with Numalogic, a collection of ML models and algorithms for real-time data analytics and AIOps including anomaly detection. Visit the Numalogic homepage for more information.","title":"What's Next"},{"location":"core-concepts/inter-step-buffer-service/","text":"Inter-Step Buffer Service \u00b6 Inter-Step Buffer Service is the service that provides Inter-Step Buffers . An Inter-Step Buffer Service is described by a Custom Resource . It must exist in a namespace before Pipeline objects are created. A sample InterStepBufferService with the JetStream implementation looks like this: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment InterStepBufferService is a namespaced object. It can be used by all the Pipelines in the same namespace. By default, Pipeline objects look for an InterStepBufferService named default , so a common practice is to create an InterStepBufferService with the name default . If you give the InterStepBufferService a name other than default , then you need to give the same name in the Pipeline spec. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : # Optional, if not specified, defaults to \"default\" interStepBufferServiceName : different-name To query Inter-Step Buffer Service objects with kubectl : kubectl get isbsvc JetStream \u00b6 JetStream is one of the supported Inter-Step Buffer Service implementations. A keyword jetstream under spec means a JetStream cluster will be created in the namespace. Version \u00b6 Property spec.jetstream.version is required for a JetStream InterStepBufferService . Supported versions can be found in the ConfigMap numaflow-controller-config in the control plane namespace. Note The version latest in the ConfigMap should only be used for testing purposes. It's recommended that you always use a fixed version in your real workload. Replicas \u00b6 An optional property spec.jetstream.replicas (defaults to 3) can be specified, which gives the total number of nodes. Persistence \u00b6 The following example shows a JetStream InterStepBufferService with persistence. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi JetStream Settings \u00b6 There are 2 places to configure JetStream settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the JetStream InterStepBufferService objects created in the Kubernetes cluster. Property spec.jetstream.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . 
A sample JetStream configuration: # https://docs.nats.io/running-a-nats-service/configuration#limits # Only \"max_payload\" is supported for configuration in this section. # Max payload size in bytes, defaults to 1 MB. It is not recommended to use values over 8MB but max_payload can be set up to 64MB. max_payload: 1048576 # # https://docs.nats.io/running-a-nats-service/configuration#jetstream # Only configure \"max_memory_store\" or \"max_file_store\" in this section, do not set \"store_dir\" as it has been hardcoded. # # e.g. 1G. -1 means no limit, up to 75% of available memory. This only takes effect for streams created using memory storage. max_memory_store: -1 # e.g. 20G. -1 means no limit, up to 1TB if available max_file_store: 1TB Buffer Configuration \u00b6 For the Inter-Step Buffers created in JetStream ISB Service, there are 2 places to configure the default properties. ConfigMap numaflow-controller-config in the control plane namespace. This is the place to configure the default properties for the streams and consumers created in all the JetStream ISB Services in the Kubernetes cluster. Field spec.jetstream.bufferConfig in an InterStepBufferService object. This optional field can be used to customize the stream and consumer properties of that particular InterStepBufferService , and the configuration will be merged into the default one from the ConfigMap numaflow-controller-config . For example, if you only want to change maxMsgs for created streams, then you only need to give stream.maxMsgs in the field, and the rest of the configuration will still use the default values in the control plane ConfigMap. 
Both places expect a YAML configuration like the one below: bufferConfig : | # The properties of the buffers (streams) to be created in this JetStream service stream: # 0: Limits, 1: Interest, 2: WorkQueue retention: 1 maxMsgs: 30000 maxAge: 168h maxBytes: -1 # 0: File, 1: Memory storage: 0 replicas: 3 duplicates: 60s # The consumer properties for the created streams consumer: ackWait: 60s maxAckPending: 20000 Note Changing the buffer configuration either in the control plane ConfigMap or in the InterStepBufferService object does NOT change buffers (streams) that already exist. TLS \u00b6 TLS can optionally be enabled by setting spec.jetstream.tls: true . Enabling TLS will use a self-signed certificate to encrypt the connection from Vertex Pods to the JetStream service. By default TLS is not enabled. Encryption At Rest \u00b6 Encryption at rest can be enabled by setting spec.jetstream.encryption: true . Be aware that this will impact performance slightly; see the details in the official doc . Once a JetStream ISB Service is created, toggling the encryption field will cause problems for the existing messages, so if you want to change the value, please delete and recreate the ISB Service, and you also need to restart all the Vertex Pods to pick up the new credentials. Other Configuration \u00b6 Check here for the full spec of spec.jetstream . Redis \u00b6 NOTE Today when using Redis, the pipeline will stall if Redis has any data loss, especially during failovers. Redis is supported as an Inter-Step Buffer Service implementation. A keyword native under spec.redis means several Redis nodes with a Master-Replicas topology will be created in the namespace. We also support external Redis. External Redis \u00b6 If you have a managed Redis, say in AWS, etc., we can make that Redis your ISB. All you need to do is provide the external Redis endpoint name. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : external : url : \"\" user : \"default\" Cluster Mode \u00b6 We support cluster mode only if the Redis is an external managed Redis. You will have to enter the url twice to indicate that the mode is cluster. This is because we use the Universal Client, which requires more than one address to indicate the Redis is in cluster mode. url : \"numaflow-redis-cluster-0.numaflow-redis-cluster-headless:6379,numaflow-redis-cluster-1.numaflow-redis-cluster-headless:6379\" Version \u00b6 Property spec.redis.native.version is required for a native Redis InterStepBufferService . Supported versions can be found in the ConfigMap numaflow-controller-config in the control plane namespace. Replicas \u00b6 An optional property spec.redis.native.replicas (defaults to 3) can be specified, which gives the total number of nodes (including master and replicas). An odd number >= 3 is suggested. If the given number < 3, 3 will be used. Persistence \u00b6 The following example shows a native Redis InterStepBufferService with persistence. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : native : version : 6.2.6 persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi Redis Configuration \u00b6 Redis configuration includes: spec.redis.native.settings.redis - Redis configuration shared by both master and replicas spec.redis.native.settings.master - Redis configuration only for master spec.redis.native.settings.replica - Redis configuration only for replicas spec.redis.native.settings.sentinel - Sentinel configuration A sample Redis configuration: # Enable AOF https://redis.io/topics/persistence#append-only-file appendonly yes auto-aof-rewrite-percentage 100 auto-aof-rewrite-min-size 64mb # Disable RDB persistence, AOF persistence already enabled. save \"\" maxmemory 512mb maxmemory-policy allkeys-lru A sample Sentinel configuration: sentinel down-after-milliseconds mymaster 10000 sentinel failover-timeout mymaster 2000 sentinel parallel-syncs mymaster 1 There are 2 places to configure these settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the native Redis InterStepBufferService created in the Kubernetes cluster. Property spec.redis.native.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . Here is the reference to the full Redis configuration. Other Configuration \u00b6 Check here for the full spec of spec.redis.native .","title":"Inter-Step Buffer Service"},{"location":"core-concepts/inter-step-buffer-service/#inter-step-buffer-service","text":"Inter-Step Buffer Service is the service to provide Inter-Step Buffers . An Inter-Step Buffer Service is described by a Custom Resource . 
It must exist in a namespace before Pipeline objects are created. A sample InterStepBufferService with the JetStream implementation looks like this: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment InterStepBufferService is a namespaced object. It can be used by all the Pipelines in the same namespace. By default, Pipeline objects look for an InterStepBufferService named default , so a common practice is to create an InterStepBufferService with the name default . If you give the InterStepBufferService a name other than default , then you need to give the same name in the Pipeline spec. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : # Optional, if not specified, defaults to \"default\" interStepBufferServiceName : different-name To query Inter-Step Buffer Service objects with kubectl : kubectl get isbsvc","title":"Inter-Step Buffer Service"},{"location":"core-concepts/inter-step-buffer-service/#jetstream","text":"JetStream is one of the supported Inter-Step Buffer Service implementations. A keyword jetstream under spec means a JetStream cluster will be created in the namespace.","title":"JetStream"},{"location":"core-concepts/inter-step-buffer-service/#version","text":"Property spec.jetstream.version is required for a JetStream InterStepBufferService . Supported versions can be found in the ConfigMap numaflow-controller-config in the control plane namespace. Note The version latest in the ConfigMap should only be used for testing purposes. 
It's recommended that you always use a fixed version in your real workload.","title":"Version"},{"location":"core-concepts/inter-step-buffer-service/#replicas","text":"An optional property spec.jetstream.replicas (defaults to 3) can be specified, which gives the total number of nodes.","title":"Replicas"},{"location":"core-concepts/inter-step-buffer-service/#persistence","text":"The following example shows a JetStream InterStepBufferService with persistence. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : version : latest # Do NOT use \"latest\" but a specific version in your real deployment persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi","title":"Persistence"},{"location":"core-concepts/inter-step-buffer-service/#jetstream-settings","text":"There are 2 places to configure JetStream settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the JetStream InterStepBufferService objects created in the Kubernetes cluster. Property spec.jetstream.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . A sample JetStream configuration: # https://docs.nats.io/running-a-nats-service/configuration#limits # Only \"max_payload\" is supported for configuration in this section. # Max payload size in bytes, defaults to 1 MB. It is not recommended to use values over 8MB but max_payload can be set up to 64MB. max_payload: 1048576 # # https://docs.nats.io/running-a-nats-service/configuration#jetstream # Only configure \"max_memory_store\" or \"max_file_store\" in this section, do not set \"store_dir\" as it has been hardcoded. # # e.g. 1G. 
-1 means no limit, up to 75% of available memory. This only takes effect for streams created using memory storage. max_memory_store: -1 # e.g. 20G. -1 means no limit, up to 1TB if available max_file_store: 1TB","title":"JetStream Settings"},{"location":"core-concepts/inter-step-buffer-service/#buffer-configuration","text":"For the Inter-Step Buffers created in JetStream ISB Service, there are 2 places to configure the default properties. ConfigMap numaflow-controller-config in the control plane namespace. This is the place to configure the default properties for the streams and consumers created in all the JetStream ISB Services in the Kubernetes cluster. Field spec.jetstream.bufferConfig in an InterStepBufferService object. This optional field can be used to customize the stream and consumer properties of that particular InterStepBufferService , and the configuration will be merged into the default one from the ConfigMap numaflow-controller-config . For example, if you only want to change maxMsgs for created streams, then you only need to give stream.maxMsgs in the field, and the rest of the configuration will still use the default values in the control plane ConfigMap. Both places expect a YAML configuration like the one below: bufferConfig : | # The properties of the buffers (streams) to be created in this JetStream service stream: # 0: Limits, 1: Interest, 2: WorkQueue retention: 1 maxMsgs: 30000 maxAge: 168h maxBytes: -1 # 0: File, 1: Memory storage: 0 replicas: 3 duplicates: 60s # The consumer properties for the created streams consumer: ackWait: 60s maxAckPending: 20000 Note Changing the buffer configuration either in the control plane ConfigMap or in the InterStepBufferService object does NOT change buffers (streams) that already exist.","title":"Buffer Configuration"},{"location":"core-concepts/inter-step-buffer-service/#tls","text":"TLS can optionally be enabled by setting spec.jetstream.tls: true . 
Enabling TLS will use a self-signed certificate to encrypt the connection from Vertex Pods to the JetStream service. By default TLS is not enabled.","title":"TLS"},{"location":"core-concepts/inter-step-buffer-service/#encryption-at-rest","text":"Encryption at rest can be enabled by setting spec.jetstream.encryption: true . Be aware that this will impact performance slightly; see the details in the official doc . Once a JetStream ISB Service is created, toggling the encryption field will cause problems for the existing messages, so if you want to change the value, please delete and recreate the ISB Service, and you also need to restart all the Vertex Pods to pick up the new credentials.","title":"Encryption At Rest"},{"location":"core-concepts/inter-step-buffer-service/#other-configuration","text":"Check here for the full spec of spec.jetstream .","title":"Other Configuration"},{"location":"core-concepts/inter-step-buffer-service/#redis","text":"NOTE Today when using Redis, the pipeline will stall if Redis has any data loss, especially during failovers. Redis is supported as an Inter-Step Buffer Service implementation. A keyword native under spec.redis means several Redis nodes with a Master-Replicas topology will be created in the namespace. We also support external Redis.","title":"Redis"},{"location":"core-concepts/inter-step-buffer-service/#external-redis","text":"If you have a managed Redis, say in AWS, etc., we can make that Redis your ISB. All you need to do is provide the external Redis endpoint name. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : external : url : \"\" user : \"default\"","title":"External Redis"},{"location":"core-concepts/inter-step-buffer-service/#cluster-mode","text":"We support cluster mode only if the Redis is an external managed Redis. You will have to enter the url twice to indicate that the mode is cluster. 
This is because we use the Universal Client, which requires more than one address to indicate the Redis is in cluster mode. url : \"numaflow-redis-cluster-0.numaflow-redis-cluster-headless:6379,numaflow-redis-cluster-1.numaflow-redis-cluster-headless:6379\"","title":"Cluster Mode"},{"location":"core-concepts/inter-step-buffer-service/#version_1","text":"Property spec.redis.native.version is required for a native Redis InterStepBufferService . Supported versions can be found in the ConfigMap numaflow-controller-config in the control plane namespace.","title":"Version"},{"location":"core-concepts/inter-step-buffer-service/#replicas_1","text":"An optional property spec.redis.native.replicas (defaults to 3) can be specified, which gives the total number of nodes (including master and replicas). An odd number >= 3 is suggested. If the given number < 3, 3 will be used.","title":"Replicas"},{"location":"core-concepts/inter-step-buffer-service/#persistence_1","text":"The following example shows a native Redis InterStepBufferService with persistence. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : redis : native : version : 6.2.6 persistence : storageClassName : standard # Optional, will use K8s cluster default storage class if not specified accessMode : ReadWriteOnce # Optional, defaults to ReadWriteOnce volumeSize : 10Gi # Optional, defaults to 20Gi","title":"Persistence"},{"location":"core-concepts/inter-step-buffer-service/#redis-configuration","text":"Redis configuration includes: spec.redis.native.settings.redis - Redis configuration shared by both master and replicas spec.redis.native.settings.master - Redis configuration only for master spec.redis.native.settings.replica - Redis configuration only for replicas spec.redis.native.settings.sentinel - Sentinel configuration A sample Redis configuration: # Enable AOF https://redis.io/topics/persistence#append-only-file appendonly yes auto-aof-rewrite-percentage 100 auto-aof-rewrite-min-size 64mb # Disable RDB persistence, AOF persistence already enabled. save \"\" maxmemory 512mb maxmemory-policy allkeys-lru A sample Sentinel configuration: sentinel down-after-milliseconds mymaster 10000 sentinel failover-timeout mymaster 2000 sentinel parallel-syncs mymaster 1 There are 2 places to configure these settings: ConfigMap numaflow-controller-config in the control plane namespace. This is the default configuration for all the native Redis InterStepBufferService created in the Kubernetes cluster. Property spec.redis.native.settings in an InterStepBufferService object. This optional property can be used to override the default configuration defined in the ConfigMap numaflow-controller-config . 
Here is the reference to the full Redis configuration.","title":"Redis Configuration"},{"location":"core-concepts/inter-step-buffer-service/#other-configuration_1","text":"Check here for the full spec of spec.redis.native .","title":"Other Configuration"},{"location":"core-concepts/inter-step-buffer/","text":"Inter-Step Buffer \u00b6 A Pipeline contains multiple vertices that ingest data from sources, process data, and forward processed data to sinks. Vertices are not connected directly, but through Inter-Step Buffers. Inter-Step Buffer can be implemented by a variety of data buffering technologies. Those technologies should support: Durability Offsets Transactions for Exactly-Once forwarding Concurrent reading Ability to explicitly acknowledge each data or offset Claim pending messages (read but not acknowledge) Ability to trim data (buffer size control) Fast (high throughput low latency) Ability to query buffer information Currently, there are 2 Inter-Step Buffer implementations: Nats JetStream Redis Stream","title":"Inter-Step Buffer"},{"location":"core-concepts/inter-step-buffer/#inter-step-buffer","text":"A Pipeline contains multiple vertices that ingest data from sources, process data, and forward processed data to sinks. Vertices are not connected directly, but through Inter-Step Buffers. Inter-Step Buffer can be implemented by a variety of data buffering technologies. Those technologies should support: Durability Offsets Transactions for Exactly-Once forwarding Concurrent reading Ability to explicitly acknowledge each data or offset Claim pending messages (read but not acknowledge) Ability to trim data (buffer size control) Fast (high throughput low latency) Ability to query buffer information Currently, there are 2 Inter-Step Buffer implementations: Nats JetStream Redis Stream","title":"Inter-Step Buffer"},{"location":"core-concepts/pipeline/","text":"Pipeline \u00b6 The Pipeline represents a data processing job. 
The most important concept in Numaflow, it defines: A list of vertices , which define the data processing tasks; A list of edges , which are used to describe the relationship between the vertices. Note an edge may go from a vertex to multiple vertices, and as of v0.10, an edge may also go from multiple vertices to a vertex. This many-to-one relationship is possible via Join and Cycles The Pipeline is abstracted as a Kubernetes Custom Resource . A Pipeline spec looks like below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out To query Pipeline objects with kubectl : kubectl get pipeline # or \"pl\" as a short name","title":"Pipeline"},{"location":"core-concepts/pipeline/#pipeline","text":"The Pipeline represents a data processing job. The most important concept in Numaflow, it defines: A list of vertices , which define the data processing tasks; A list of edges , which are used to describe the relationship between the vertices. Note an edge may go from a vertex to multiple vertices, and as of v0.10, an edge may also go from multiple vertices to a vertex. This many-to-one relationship is possible via Join and Cycles The Pipeline is abstracted as a Kubernetes Custom Resource . A Pipeline spec looks like below. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out To query Pipeline objects with kubectl : kubectl get pipeline # or \"pl\" as a short name","title":"Pipeline"},{"location":"core-concepts/vertex/","text":"Vertex \u00b6 The Vertex is a key component of Numaflow Pipeline where the data processing happens. Vertex is defined as a list in the pipeline spec, each representing a data processing task. There are 3 types of Vertex in Numaflow today: Source - To ingest data from sources. Sink - To forward processed data to sinks. UDF - User-defined Function, which is used to define data processing logic. We have defined a Kubernetes Custom Resource for Vertex . A Pipeline containing multiple vertices will automatically generate multiple Vertex objects by the controller. As a user, you should NOT create a Vertex object directly. In a Pipeline , the vertices are not connected directly, but through Inter-Step Buffers . To query Vertex objects with kubectl : kubectl get vertex # or \"vtx\" as a short name","title":"Vertex"},{"location":"core-concepts/vertex/#vertex","text":"The Vertex is a key component of Numaflow Pipeline where the data processing happens. Vertex is defined as a list in the pipeline spec, each representing a data processing task. There are 3 types of Vertex in Numaflow today: Source - To ingest data from sources. Sink - To forward processed data to sinks. UDF - User-defined Function, which is used to define data processing logic. We have defined a Kubernetes Custom Resource for Vertex . A Pipeline containing multiple vertices will automatically generate multiple Vertex objects by the controller. As a user, you should NOT create a Vertex object directly. 
In a Pipeline , the vertices are not connected directly, but through Inter-Step Buffers . To query Vertex objects with kubectl : kubectl get vertex # or \"vtx\" as a short name","title":"Vertex"},{"location":"core-concepts/watermarks/","text":"Watermarks \u00b6 When processing an unbounded data stream, Numaflow has to materialize the results of the processing done on the data. The materialization of the output depends on the notion of time, e.g., the total number of logins served per minute. Without the idea of time built into the platform, we will not be able to determine the passage of time, which is necessary for grouping elements together to materialize the result. Watermarks are that notion of time that will help us group unbounded data into discrete chunks. Numaflow supports watermarks out-of-the-box. Source vertices generate watermarks based on the event time, and propagate them to downstream vertices. Watermark is defined as \u201ca monotonically increasing timestamp of the oldest work/event not yet completed\u201d . In other words, if the watermark has advanced past some timestamp T, we are guaranteed by its monotonic property that no more processing will occur for on-time events at or before T. Configuration \u00b6 Disable Watermark \u00b6 Watermarks can be disabled by setting disabled: true . Idle Detection \u00b6 Watermark is assigned at the source; this means that the watermark will only progress if there is data coming into the source. There are many cases where the source might not be getting data, causing the source to idle (e.g., data is periodic, say once an hour). When the source is idling, the reduce vertices won't emit results because the watermark is not moving. To detect source idling and propagate the watermark, we can use the idle detection feature. The idle source watermark progressor will make sure that the watermark cannot progress beyond time.now() - maxDelay ( maxDelay is defined below). 
To enable this, we provide the following settings: Threshold \u00b6 Threshold is the duration after which a source is marked as idle due to a lack of data flowing into the source. StepInterval \u00b6 StepInterval is the duration between subsequent increments of the watermark as long as the source remains idle. The default value is 0s, which means that once we detect an idle source, we will increment the watermark by IncrementBy for as long as we detect that our source is empty (in other words, this will be a very frequent update). Default Value: 0s IncrementBy \u00b6 IncrementBy is the duration to be added to the current watermark to progress the watermark when the source is idling. Example \u00b6 The example below will consider the source idle after there is no data at the source for 5s. After 5s, an idle watermark will be emitted every 2s, each incrementing the watermark by 3s. watermark : idleSource : threshold : 5s # The pipeline will be considered idle if the source has not emitted any data for given threshold value. incrementBy : 3s # If source is found to be idle then increment the watermark by given incrementBy value. stepInterval : 2s # If source is idling then publish the watermark only when step interval has passed. maxDelay \u00b6 Watermark assignments happen at the source. Sources could be out of order, so sometimes we want to extend the window (default is 0s ) to wait before we start marking data as late data. You can give more time for the system to wait for late data with maxDelay so that the late data within the specified time duration will be treated as on-time data. This means the watermark propagation will be delayed by maxDelay . Example \u00b6 apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline spec : watermark : disabled : false # Optional, defaults to false. maxDelay : 60s # Optional, defaults to \"0s\". Watermark API \u00b6 When processing data in user-defined functions , you can get the current watermark through an API. 
Watermark API is supported in all our client SDKs. Example Golang \u00b6 // Go func mapFn ( context context . Context , keys [] string , d mapper . Datum ) mapper . Messages { _ = d . EventTime () // Event time _ = d . Watermark () // Watermark ... ... }","title":"Watermarks"},{"location":"core-concepts/watermarks/#watermarks","text":"When processing an unbounded data stream, Numaflow has to materialize the results of the processing done on the data. The materialization of the output depends on the notion of time, e.g., the total number of logins served per minute. Without the idea of time built into the platform, we will not be able to determine the passage of time, which is necessary for grouping elements together to materialize the result. Watermarks are that notion of time that will help us group unbounded data into discrete chunks. Numaflow supports watermarks out-of-the-box. Source vertices generate watermarks based on the event time, and propagate them to downstream vertices. Watermark is defined as \u201ca monotonically increasing timestamp of the oldest work/event not yet completed\u201d . In other words, if the watermark has advanced past some timestamp T, we are guaranteed by its monotonic property that no more processing will occur for on-time events at or before T.","title":"Watermarks"},{"location":"core-concepts/watermarks/#configuration","text":"","title":"Configuration"},{"location":"core-concepts/watermarks/#disable-watermark","text":"Watermarks can be disabled by setting disabled: true .","title":"Disable Watermark"},{"location":"core-concepts/watermarks/#idle-detection","text":"Watermark is assigned at the source; this means that the watermark will only progress if there is data coming into the source. There are many cases where the source might not be getting data, causing the source to idle (e.g., data is periodic, say once an hour). When the source is idling, the reduce vertices won't emit results because the watermark is not moving. 
To detect source idling and propagate the watermark, we can use the idle detection feature. The idle source watermark progressor will make sure that the watermark cannot progress beyond time.now() - maxDelay ( maxDelay is defined below). To enable this, we provide the following settings:","title":"Idle Detection"},{"location":"core-concepts/watermarks/#threshold","text":"Threshold is the duration after which a source is marked as Idle due to a lack of data flowing into the source.","title":"Threshold"},{"location":"core-concepts/watermarks/#stepinterval","text":"StepInterval is the duration between subsequent increments of the watermark as long as the source remains Idle. The default value is 0s, which means that once we detect an idle source, we increment the watermark by IncrementBy every time we detect that the source is empty (in other words, this will be a very frequent update). Default Value: 0s","title":"StepInterval"},{"location":"core-concepts/watermarks/#incrementby","text":"IncrementBy is the duration to be added to the current watermark to progress the watermark when the source is idling.","title":"IncrementBy"},{"location":"core-concepts/watermarks/#example","text":"The example below considers the source idle once no data has arrived at the source for 5s. After that, an idle watermark is emitted every 2s, each advancing the watermark by 3s. watermark : idleSource : threshold : 5s # The pipeline will be considered idle if the source has not emitted any data for given threshold value. incrementBy : 3s # If source is found to be idle then increment the watermark by given incrementBy value. stepInterval : 2s # If source is idling then publish the watermark only when step interval has passed.","title":"Example"},{"location":"core-concepts/watermarks/#maxdelay","text":"Watermark assignments happen at the source. 
Sources could be out of order, so sometimes we want to extend the window (default is 0s ) to wait before we start marking data as late-data. You can give more time for the system to wait for late data with maxDelay so that the late data within the specified time duration will be considered as data on-time. This means the watermark propagation will be delayed by maxDelay .","title":"maxDelay"},{"location":"core-concepts/watermarks/#example_1","text":"apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline spec : watermark : disabled : false # Optional, defaults to false. maxDelay : 60s # Optional, defaults to \"0s\".","title":"Example"},{"location":"core-concepts/watermarks/#watermark-api","text":"When processing data in user-defined functions , you can get the current watermark through an API. Watermark API is supported in all our client SDKs.","title":"Watermark API"},{"location":"core-concepts/watermarks/#example-golang","text":"// Go func mapFn ( context context . Context , keys [] string , d mapper . Datum ) mapper . Messages { _ = d . EventTime () // Event time _ = d . Watermark () // Watermark ... ... }","title":"Example Golang"},{"location":"development/debugging/","text":"How To Debug \u00b6 To enable debug logs in a Vertex Pod, set environment variable NUMAFLOW_DEBUG to true for the Vertex. For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 100 duration : 1s - name : p1 udf : builtin : name : cat containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out To enable debug logs in the daemon pod, set environment variable NUMAFLOW_DEBUG to true for the daemon pod. 
For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : templates : daemon : containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" Profiling \u00b6 If your pipeline is running with NUMAFLOW_DEBUG then pprof is enabled in the Vertex Pod. You can also enable just pprof by setting NUMAFLOW_PPROF to true . For example, run the commands like below to profile memory usage for a Vertex Pod, a web page displaying the memory information will be automatically opened. # Port-forward kubectl port-forward simple-pipeline-p1-0-7jzbn 2469 go tool pprof -http localhost:8081 https+insecure://localhost:2469/debug/pprof/heap Tracing is also available with commands below. # Add optional \"&seconds=n\" to specify the duration. curl -skq https://localhost:2469/debug/pprof/trace?debug = 1 -o trace.out go tool trace -http localhost:8082 trace.out Debug Inside the Container \u00b6 When doing local development using command lines such as make start , or make image , the built numaflow docker image is based on alpine , which allows you to execute into the container for debugging with kubectl exec -it {pod-name} -c {container-name} -- sh . This is not allowed when running pipelines with official released images, as they are based on scratch .","title":"How To Debug"},{"location":"development/debugging/#how-to-debug","text":"To enable debug logs in a Vertex Pod, set environment variable NUMAFLOW_DEBUG to true for the Vertex. For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : rpu : 100 duration : 1s - name : p1 udf : builtin : name : cat containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\" - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out To enable debug logs in the daemon pod, set environment variable NUMAFLOW_DEBUG to true for the daemon pod. 
For example: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : templates : daemon : containerTemplate : env : - name : NUMAFLOW_DEBUG value : !!str \"true\"","title":"How To Debug"},{"location":"development/debugging/#profiling","text":"If your pipeline is running with NUMAFLOW_DEBUG then pprof is enabled in the Vertex Pod. You can also enable just pprof by setting NUMAFLOW_PPROF to true . For example, run the commands like below to profile memory usage for a Vertex Pod, a web page displaying the memory information will be automatically opened. # Port-forward kubectl port-forward simple-pipeline-p1-0-7jzbn 2469 go tool pprof -http localhost:8081 https+insecure://localhost:2469/debug/pprof/heap Tracing is also available with commands below. # Add optional \"&seconds=n\" to specify the duration. curl -skq https://localhost:2469/debug/pprof/trace?debug = 1 -o trace.out go tool trace -http localhost:8082 trace.out","title":"Profiling"},{"location":"development/debugging/#debug-inside-the-container","text":"When doing local development using command lines such as make start , or make image , the built numaflow docker image is based on alpine , which allows you to execute into the container for debugging with kubectl exec -it {pod-name} -c {container-name} -- sh . This is not allowed when running pipelines with official released images, as they are based on scratch .","title":"Debug Inside the Container"},{"location":"development/development/","text":"Development \u00b6 This doc explains how to set up a development environment for Numaflow. Install required tools \u00b6 go 1.20+. git . kubectl . protoc 3.19 for compiling protocol buffers. pandoc 2.17 for generating API markdown. Node.js\u00ae for running the UI. yarn . A local Kubernetes cluster for development usage, pick either one of k3d , kind , or minikube . 
Example: Create a local Kubernetes cluster with kind \u00b6 # Install kind on macOS brew install kind # Create a cluster with default name kind kind create cluster # Get kubeconfig for the cluster kind export kubeconfig Metrics Server \u00b6 Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). Without the metrics-server , we will not be able to see the pods in the UI. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]' Useful Commands \u00b6 make start Build the source code, image, and install the Numaflow controller in the numaflow-system namespace. make build Binaries are placed in ./dist . make manifests Regenerate all the manifests after making any base manifest changes. This is also covered by make codegen . make codegen Run after making changes to ./pkg/api/ . make test Run unit tests. make test-* Run one e2e test suite. e.g. make test-kafka-e2e to run the kafka e2e suite. make Test* Run one e2e test case. e.g. make TestKafkaSourceSink to run the TestKafkaSourceSink case in the kafka e2e suite. make image Build container image, and import it to k3d , kind , or minikube cluster if corresponding KUBECONFIG is sourced. make docs Convert the docs to GitHub pages, check if there's any error. make docs-serve Start an HTTP server on your local to host the docs generated Github pages.","title":"Development"},{"location":"development/development/#development","text":"This doc explains how to set up a development environment for Numaflow.","title":"Development"},{"location":"development/development/#install-required-tools","text":"go 1.20+. git . kubectl . protoc 3.19 for compiling protocol buffers. pandoc 2.17 for generating API markdown. Node.js\u00ae for running the UI. 
yarn . A local Kubernetes cluster for development usage, pick either one of k3d , kind , or minikube .","title":"Install required tools"},{"location":"development/development/#example-create-a-local-kubernetes-cluster-with-kind","text":"# Install kind on macOS brew install kind # Create a cluster with default name kind kind create cluster # Get kubeconfig for the cluster kind export kubeconfig","title":"Example: Create a local Kubernetes cluster with kind"},{"location":"development/development/#metrics-server","text":"Please install the metrics server if your local Kubernetes cluster does not bring it by default (e.g., Kind). Without the metrics-server , we will not be able to see the pods in the UI. kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml kubectl patch -n kube-system deployment metrics-server --type = json -p '[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/args/-\",\"value\":\"--kubelet-insecure-tls\"}]'","title":"Metrics Server"},{"location":"development/development/#useful-commands","text":"make start Build the source code, image, and install the Numaflow controller in the numaflow-system namespace. make build Binaries are placed in ./dist . make manifests Regenerate all the manifests after making any base manifest changes. This is also covered by make codegen . make codegen Run after making changes to ./pkg/api/ . make test Run unit tests. make test-* Run one e2e test suite. e.g. make test-kafka-e2e to run the kafka e2e suite. make Test* Run one e2e test case. e.g. make TestKafkaSourceSink to run the TestKafkaSourceSink case in the kafka e2e suite. make image Build container image, and import it to k3d , kind , or minikube cluster if corresponding KUBECONFIG is sourced. make docs Convert the docs to GitHub pages, check if there's any error. 
make docs-serve Start an HTTP server on your local machine to host the generated GitHub Pages docs.","title":"Useful Commands"},{"location":"development/releasing/","text":"How To Release \u00b6 Release Branch \u00b6 Always create a release branch for releases; for example, branch release-0.5 is for all v0.5.x releases. If it's a new release branch, simply create a branch from main . Release Steps \u00b6 Cherry-pick fixes to the release branch; skip this step if it's the first release in the branch. Run make test to make sure all test cases pass locally. Push to the remote branch, and make sure all the CI jobs pass. Run make prepare-release VERSION=v{x.y.z} to update the version in manifests, where x.y.z is the expected new version. Follow the output of the last step to confirm that all the changes are expected, and then run make release VERSION=v{x.y.z} . Follow the output and push a new tag to the release branch; GitHub Actions will automatically build and publish the new release, which takes around 10 minutes. Test the new release, make sure everything is running as expected, and then recreate a stable tag against the latest release. git tag -d stable git tag -a stable -m stable git push -d { your-remote } stable git push { your-remote } stable Find the new release tag, and edit the release notes.","title":"How To Release"},{"location":"development/releasing/#how-to-release","text":"","title":"How To Release"},{"location":"development/releasing/#release-branch","text":"Always create a release branch for releases; for example, branch release-0.5 is for all v0.5.x releases. If it's a new release branch, simply create a branch from main .","title":"Release Branch"},{"location":"development/releasing/#release-steps","text":"Cherry-pick fixes to the release branch; skip this step if it's the first release in the branch. Run make test to make sure all test cases pass locally. Push to the remote branch, and make sure all the CI jobs pass. 
Run make prepare-release VERSION=v{x.y.z} to update the version in manifests, where x.y.z is the expected new version. Follow the output of the last step to confirm that all the changes are expected, and then run make release VERSION=v{x.y.z} . Follow the output and push a new tag to the release branch; GitHub Actions will automatically build and publish the new release, which takes around 10 minutes. Test the new release, make sure everything is running as expected, and then recreate a stable tag against the latest release. git tag -d stable git tag -a stable -m stable git push -d { your-remote } stable git push { your-remote } stable Find the new release tag, and edit the release notes.","title":"Release Steps"},{"location":"development/static-code-analysis/","text":"Static Code Analysis \u00b6 We use the following static code analysis tools: golangci-lint for compile-time linting. Snyk for image scanning. These are run at least daily or on each pull request.","title":"Static Code Analysis"},{"location":"development/static-code-analysis/#static-code-analysis","text":"We use the following static code analysis tools: golangci-lint for compile-time linting. Snyk for image scanning. These are run at least daily or on each pull request.","title":"Static Code Analysis"},{"location":"operations/controller-configmap/","text":"Controller ConfigMap \u00b6 The controller ConfigMap is used for controller-wide settings. For a detailed example, please see numaflow-controller-config.yaml . Configuration Structure \u00b6 The configuration should be under controller-config.yaml key in the ConfigMap, as a string in yaml format: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | ... isbsvc: jetstream: ... Default Controller Configuration \u00b6 Currently, we support configuring the init and main container resources for steps across all the pipelines. 
The configuration is under defaults key in the ConfigMap. For example, to set the default container resources for steps across all the pipelines: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | limits: memory: \"256Mi\" cpu: \"200m\" requests: memory: \"128Mi\" cpu: \"100m\" ISB Service Configuration \u00b6 One of the important configuration items in the ConfigMap is about ISB Service . We currently use 3rd party technologies such as JetStream to implement ISB Services, if those applications have new releases, to make them available in Numaflow, the new versions need to be added in the ConfigMap. For example, there's a new Nats JetStream version x.y.x available, a new version configuration like below needs to be added before it can be referenced in the InterStepBufferService spec. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | isbsvc: jetstream: versions: - version: x.y.x # Name it whatever you want, it will be referenced in the InterStepBufferService spec. natsImage: nats:x.y.x metricsExporterImage: natsio/prometheus-nats-exporter:0.9.1 configReloaderImage: natsio/nats-server-config-reloader:0.7.0 startCommand: /nats-server","title":"Controller Configuration"},{"location":"operations/controller-configmap/#controller-configmap","text":"The controller ConfigMap is used for controller-wide settings. For a detailed example, please see numaflow-controller-config.yaml .","title":"Controller ConfigMap"},{"location":"operations/controller-configmap/#configuration-structure","text":"The configuration should be under controller-config.yaml key in the ConfigMap, as a string in yaml format: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | ... 
isbsvc: jetstream: ...","title":"Configuration Structure"},{"location":"operations/controller-configmap/#default-controller-configuration","text":"Currently, we support configuring the init and main container resources for steps across all the pipelines. The configuration is under defaults key in the ConfigMap. For example, to set the default container resources for steps across all the pipelines: apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | defaults: containerResources: | limits: memory: \"256Mi\" cpu: \"200m\" requests: memory: \"128Mi\" cpu: \"100m\"","title":"Default Controller Configuration"},{"location":"operations/controller-configmap/#isb-service-configuration","text":"One of the important configuration items in the ConfigMap is about ISB Service . We currently use 3rd party technologies such as JetStream to implement ISB Services, if those applications have new releases, to make them available in Numaflow, the new versions need to be added in the ConfigMap. For example, there's a new Nats JetStream version x.y.x available, a new version configuration like below needs to be added before it can be referenced in the InterStepBufferService spec. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-controller-config data : controller-config.yaml : | isbsvc: jetstream: versions: - version: x.y.x # Name it whatever you want, it will be referenced in the InterStepBufferService spec. natsImage: nats:x.y.x metricsExporterImage: natsio/prometheus-nats-exporter:0.9.1 configReloaderImage: natsio/nats-server-config-reloader:0.7.0 startCommand: /nats-server","title":"ISB Service Configuration"},{"location":"operations/grafana/","text":"Grafana \u00b6 Numaflow provides prometheus metrics on top of which you can build Grafana dashboard to monitor your pipeline. Setup Grafana \u00b6 (Pre-requisite) Follow Metrics to set up prometheus operator. 
Follow Prometheus Tutorial to install Grafana and visualize metrics. Sample Dashboard \u00b6 You can customize your own dashboard by selecting metrics that best describe the health of your pipeline. Below is a sample dashboard which includes some basic metrics. To use the sample dashboard, download the corresponding sample dashboard template , import it to Grafana (before importing, change the uid of the datasource in the JSON, issue link ), and use the dropdown menu at the top-left of the dashboard to choose which pipeline/vertex/buffer metrics to display.","title":"Grafana"},{"location":"operations/grafana/#grafana","text":"Numaflow provides Prometheus metrics on top of which you can build a Grafana dashboard to monitor your pipeline.","title":"Grafana"},{"location":"operations/grafana/#setup-grafana","text":"(Pre-requisite) Follow Metrics to set up the Prometheus operator. Follow Prometheus Tutorial to install Grafana and visualize metrics.","title":"Setup Grafana"},{"location":"operations/grafana/#sample-dashboard","text":"You can customize your own dashboard by selecting metrics that best describe the health of your pipeline. Below is a sample dashboard which includes some basic metrics. To use the sample dashboard, download the corresponding sample dashboard template , import it to Grafana (before importing, change the uid of the datasource in the JSON, issue link ), and use the dropdown menu at the top-left of the dashboard to choose which pipeline/vertex/buffer metrics to display.","title":"Sample Dashboard"},{"location":"operations/installation/","text":"Installation \u00b6 Numaflow can be installed in different scopes with different approaches. Cluster Scope \u00b6 A cluster scope installation watches and executes pipelines in all the namespaces in the cluster. Run following command line to install latest stable Numaflow in cluster scope. 
kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml If you use kustomize , use kustomization.yaml below. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/cluster-install?ref=stable # Or specify a version namespace : numaflow-system Namespace Scope \u00b6 A namespace scoped installation only watches and executes pipelines in the namespace it is installed (typically numaflow-system ). Configure the ConfigMap numaflow-cmd-params-config to achieve namespace scoped installation. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run in namespaced scope, defaults to false. namespaced : \"true\" Another approach to do namespace scoped installation is to add an argument --namespaced to the numaflow-controller and numaflow-server deployments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced If there are multiple namespace scoped installations in one cluster, potentially there will be backward compatibility issue when any of the installation gets upgraded to a new version that has new CRD definition. To avoid this issue, we suggest to use minimal CRD definition for namespaced installation, which does not have detailed property definitions, thus no CRD changes between different versions. # Minimal CRD kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/config/advanced-install/minimal-crds.yaml # Controller in namespaced scope kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/advanced-install/namespaced-controller-wo-crds.yaml If you use kustomize , kustomization.yaml looks like below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/advanced-install/minimal-crds?ref=stable # Or specify a version - https://github.com/numaproj/numaflow/config/advanced-install/namespaced-controller?ref=stable # Or specify a version namespace : numaflow-system Managed Namespace Scope \u00b6 A managed namespace installation watches and executes pipelines in a specific namespace. To do managed namespace installation, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run the controller and the UX server in namespaced scope, defaults to false. namespaced : \"true\" # The namespace that the controller and UX server watch when \"namespaced\" is true, defaults to the installation namespace. managed.namespace : numaflow-system Similarly, another approach is to add --managed-namespace and the specific namespace to the numaflow-controller and numaflow-server deployment arguments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced - --managed-namespace - my-namespace High Availability \u00b6 By default, the Numaflow controller is installed with Active-Passive HA strategy enabled, which means you can run the controller with multiple replicas (defaults to 1 in the manifests). There are some parameters can be tuned for the leader election mechanism of HA. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### The duration that non-leader candidates will wait to force acquire leadership. # This is measured against time of last observed ack. Default is 15 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.duration : 15s # ### The duration that the acting controlplane will retry refreshing leadership before giving up. # Default is 10 seconds. 
# The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.deadline : 10s ### The duration the LeaderElector clients should wait between tries of actions, which means every # this period of time, it tries to renew the lease. Default is 2 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.period : 2s These parameters are useful when you want to tune the frequency of leader election renewal calls to K8s API server, which are usually configured at a high priority level of API Priority and Fairness . To turn off HA, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to disable leader election for the controller, defaults to false controller.leader.election.disabled : \"true\" If HA is turned off, the controller deployment should not run with multiple replicas.","title":"Installation"},{"location":"operations/installation/#installation","text":"Numaflow can be installed in different scopes with different approaches.","title":"Installation"},{"location":"operations/installation/#cluster-scope","text":"A cluster scope installation watches and executes pipelines in all the namespaces in the cluster. Run following command line to install latest stable Numaflow in cluster scope. kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/install.yaml If you use kustomize , use kustomization.yaml below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/cluster-install?ref=stable # Or specify a version namespace : numaflow-system","title":"Cluster Scope"},{"location":"operations/installation/#namespace-scope","text":"A namespace scoped installation only watches and executes pipelines in the namespace it is installed (typically numaflow-system ). Configure the ConfigMap numaflow-cmd-params-config to achieve namespace scoped installation. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run in namespaced scope, defaults to false. namespaced : \"true\" Another approach to do namespace scoped installation is to add an argument --namespaced to the numaflow-controller and numaflow-server deployments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced If there are multiple namespace scoped installations in one cluster, potentially there will be backward compatibility issue when any of the installation gets upgraded to a new version that has new CRD definition. To avoid this issue, we suggest to use minimal CRD definition for namespaced installation, which does not have detailed property definitions, thus no CRD changes between different versions. # Minimal CRD kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/config/advanced-install/minimal-crds.yaml # Controller in namespaced scope kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/advanced-install/namespaced-controller-wo-crds.yaml If you use kustomize , kustomization.yaml looks like below. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - https://github.com/numaproj/numaflow/config/advanced-install/minimal-crds?ref=stable # Or specify a version - https://github.com/numaproj/numaflow/config/advanced-install/namespaced-controller?ref=stable # Or specify a version namespace : numaflow-system","title":"Namespace Scope"},{"location":"operations/installation/#managed-namespace-scope","text":"A managed namespace installation watches and executes pipelines in a specific namespace. To do managed namespace installation, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to run the controller and the UX server in namespaced scope, defaults to false. namespaced : \"true\" # The namespace that the controller and UX server watch when \"namespaced\" is true, defaults to the installation namespace. managed.namespace : numaflow-system Similarly, another approach is to add --managed-namespace and the specific namespace to the numaflow-controller and numaflow-server deployment arguments. This approach takes precedence over the ConfigMap approach. - args: - --namespaced - --managed-namespace - my-namespace","title":"Managed Namespace Scope"},{"location":"operations/installation/#high-availability","text":"By default, the Numaflow controller is installed with Active-Passive HA strategy enabled, which means you can run the controller with multiple replicas (defaults to 1 in the manifests). There are some parameters can be tuned for the leader election mechanism of HA. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### The duration that non-leader candidates will wait to force acquire leadership. # This is measured against time of last observed ack. Default is 15 seconds. 
# The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.duration : 15s # ### The duration that the acting controlplane will retry refreshing leadership before giving up. # Default is 10 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.deadline : 10s ### The duration the LeaderElector clients should wait between tries of actions, which means every # this period of time, it tries to renew the lease. Default is 2 seconds. # The configuration has to be: lease.duration > lease.renew.deadline > lease.renew.period controller.leader.election.lease.renew.period : 2s These parameters are useful when you want to tune the frequency of leader election renewal calls to K8s API server, which are usually configured at a high priority level of API Priority and Fairness . To turn off HA, configure the ConfigMap numaflow-cmd-params-config as following. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : # Whether to disable leader election for the controller, defaults to false controller.leader.election.disabled : \"true\" If HA is turned off, the controller deployment should not run with multiple replicas.","title":"High Availability"},{"location":"operations/releases/","text":"Releases \u00b6 You can find the most recent version under Github Releases . Versioning \u00b6 Versions are expressed as vx.y.z (for example, v0.5.3 ), where x is the major version, y is the minor version, and z is the patch version, following Semantic Versioning terminology. Numaflow does not use Semantic Versioning. Minor versions may contain breaking changes. Patch versions only contain bug fixes and minor features. There's a stable tag, pointing to a latest stable release, usually it is the latest patch version. Release Cycle \u00b6 TBD as Numaflow is under active development. 
Nightly Build \u00b6 If you want to try out the new features on main branch, Numaflow provides nightly build images from main , the images are available in the format of quay.io/numaproj/numaflow:nightly-yyyyMMdd . Nightly build images expire in 30 days.","title":"Releases \u29c9"},{"location":"operations/releases/#releases","text":"You can find the most recent version under Github Releases .","title":"Releases"},{"location":"operations/releases/#versioning","text":"Versions are expressed as vx.y.z (for example, v0.5.3 ), where x is the major version, y is the minor version, and z is the patch version, following Semantic Versioning terminology. Numaflow does not use Semantic Versioning. Minor versions may contain breaking changes. Patch versions only contain bug fixes and minor features. There's a stable tag, pointing to a latest stable release, usually it is the latest patch version.","title":"Versioning"},{"location":"operations/releases/#release-cycle","text":"TBD as Numaflow is under active development.","title":"Release Cycle"},{"location":"operations/releases/#nightly-build","text":"If you want to try out the new features on main branch, Numaflow provides nightly build images from main , the images are available in the format of quay.io/numaproj/numaflow:nightly-yyyyMMdd . Nightly build images expire in 30 days.","title":"Nightly Build"},{"location":"operations/security/","text":"Security \u00b6 Controller \u00b6 Numaflow controller can be deployed in two scopes. It can be either at the Cluster level or at the Namespace level. When the Numaflow controller is deployed at the Namespace level, it will only have access to the Namespace resources. Pipeline \u00b6 Data Movement \u00b6 Data movement happens only within the namespace (no cross-namespaces). Numaflow provides the ability to encrypt data at rest and also in transit. Controller and Data Plane \u00b6 All communications between the controller and Numaflow pipeline components are encrypted. 
These are uni-directional read-only communications.","title":"Security"},{"location":"operations/security/#security","text":"","title":"Security"},{"location":"operations/security/#controller","text":"Numaflow controller can be deployed in two scopes. It can be either at the Cluster level or at the Namespace level. When the Numaflow controller is deployed at the Namespace level, it will only have access to the Namespace resources.","title":"Controller"},{"location":"operations/security/#pipeline","text":"","title":"Pipeline"},{"location":"operations/security/#data-movement","text":"Data movement happens only within the namespace (no cross-namespaces). Numaflow provides the ability to encrypt data at rest and also in transit.","title":"Data Movement"},{"location":"operations/security/#controller-and-data-plane","text":"All communications between the controller and Numaflow pipeline components are encrypted. These are uni-directional read-only communications.","title":"Controller and Data Plane"},{"location":"operations/validating-webhook/","text":"Validating Admission Webhook \u00b6 This validating webhook will prevent disallowed spec changes to immutable fields of Numaflow CRDs including Pipelines and InterStepBufferServices. It also prevents creating a CRD with a faulty spec. The user sees an error immediately returned by the server explaining why the request was denied. Installation \u00b6 To install the validating webhook, run the following command line: kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/validating-webhook-install.yaml Examples \u00b6 Currently, the validating webhook prevents updating the type of an InterStepBufferService from JetStream to Redis for example. 
Example spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : // change to redis and reapply will cause below error version : latest Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"InterStepBufferService\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"default\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"redis\\\":{\\\"native\\\":{\\\"version\\\":\\\"7.0.11\\\"}}}}\\n\" }} , \"spec\" : { \"jetstream\" :null, \"redis\" : { \"native\" : { \"version\" : \"7.0.11\" }}}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=interstepbufferservices\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=InterStepBufferService\" Name: \"default\" , Namespace: \"numaflow-system\" for : \"redis.yaml\" : error when patching \"redis.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Can not change ISB Service type from Jetstream to Redis There is also validation that prevents the interStepBufferServiceName of a Pipeline from being updated. 
Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"Pipeline\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"simple-pipeline\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"edges\\\":[{\\\"from\\\":\\\"in\\\",\\\"to\\\":\\\"cat\\\"},{\\\"from\\\":\\\"cat\\\",\\\"to\\\":\\\"out\\\"}],\\\"interStepBufferServiceName\\\":\\\"change\\\",\\\"vertices\\\":[{\\\"name\\\":\\\"in\\\",\\\"source\\\":{\\\"generator\\\":{\\\"duration\\\":\\\"1s\\\",\\\"rpu\\\":5}}},{\\\"name\\\":\\\"cat\\\",\\\"udf\\\":{\\\"builtin\\\":{\\\"name\\\":\\\"cat\\\"}}},{\\\"name\\\":\\\"out\\\",\\\"sink\\\":{\\\"log\\\":{}}}]}}\\n\" }} , \"spec\" : { \"interStepBufferServiceName\" : \"change\" , \"vertices\" : [{ \"name\" : \"in\" , \"source\" : { \"generator\" : { \"duration\" : \"1s\" , \"rpu\" :5 }}} , { \"name\" : \"cat\" , \"udf\" : { \"builtin\" : { \"name\" : \"cat\" }}} , { \"name\" : \"out\" , \"sink\" : { \"log\" : {}}}]}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=pipelines\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=Pipeline\" Name: \"simple-pipeline\" , Namespace: \"numaflow-system\" for : \"examples/1-simple-pipeline.yaml\" : error when patching \"examples/1-simple-pipeline.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Cannot update pipeline with different interStepBufferServiceName Other validations include: Pipeline: cannot change the type of an existing vertex cannot change the partition count of a reduce vertex cannot change the storage class of a reduce vertex etc. 
InterStepBufferService: cannot change the persistence configuration of an ISB Service etc.","title":"Validating Webhook"},{"location":"operations/validating-webhook/#validating-admission-webhook","text":"This validating webhook will prevent disallowed spec changes to immutable fields of Numaflow CRDs including Pipelines and InterStepBufferServices. It also prevents creating a CRD with a faulty spec. The user sees an error immediately returned by the server explaining why the request was denied.","title":"Validating Admission Webhook"},{"location":"operations/validating-webhook/#installation","text":"To install the validating webhook, run the following command line: kubectl apply -n numaflow-system -f https://raw.githubusercontent.com/numaproj/numaflow/stable/config/validating-webhook-install.yaml","title":"Installation"},{"location":"operations/validating-webhook/#examples","text":"Currently, the validating webhook prevents updating the type of an InterStepBufferService from JetStream to Redis for example. 
Example spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : // change to redis and reapply will cause below error version : latest Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"InterStepBufferService\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"default\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"redis\\\":{\\\"native\\\":{\\\"version\\\":\\\"7.0.11\\\"}}}}\\n\" }} , \"spec\" : { \"jetstream\" :null, \"redis\" : { \"native\" : { \"version\" : \"7.0.11\" }}}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=interstepbufferservices\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=InterStepBufferService\" Name: \"default\" , Namespace: \"numaflow-system\" for : \"redis.yaml\" : error when patching \"redis.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Can not change ISB Service type from Jetstream to Redis There is also validation that prevents the interStepBufferServiceName of a Pipeline from being updated. 
Error from server ( BadRequest ) : error when applying patch: { \"metadata\" : { \"annotations\" : { \"kubectl.kubernetes.io/last-applied-configuration\" : \"{\\\"apiVersion\\\":\\\"numaflow.numaproj.io/v1alpha1\\\",\\\"kind\\\":\\\"Pipeline\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"simple-pipeline\\\",\\\"namespace\\\":\\\"numaflow-system\\\"},\\\"spec\\\":{\\\"edges\\\":[{\\\"from\\\":\\\"in\\\",\\\"to\\\":\\\"cat\\\"},{\\\"from\\\":\\\"cat\\\",\\\"to\\\":\\\"out\\\"}],\\\"interStepBufferServiceName\\\":\\\"change\\\",\\\"vertices\\\":[{\\\"name\\\":\\\"in\\\",\\\"source\\\":{\\\"generator\\\":{\\\"duration\\\":\\\"1s\\\",\\\"rpu\\\":5}}},{\\\"name\\\":\\\"cat\\\",\\\"udf\\\":{\\\"builtin\\\":{\\\"name\\\":\\\"cat\\\"}}},{\\\"name\\\":\\\"out\\\",\\\"sink\\\":{\\\"log\\\":{}}}]}}\\n\" }} , \"spec\" : { \"interStepBufferServiceName\" : \"change\" , \"vertices\" : [{ \"name\" : \"in\" , \"source\" : { \"generator\" : { \"duration\" : \"1s\" , \"rpu\" :5 }}} , { \"name\" : \"cat\" , \"udf\" : { \"builtin\" : { \"name\" : \"cat\" }}} , { \"name\" : \"out\" , \"sink\" : { \"log\" : {}}}]}} to: Resource: \"numaflow.numaproj.io/v1alpha1, Resource=pipelines\" , GroupVersionKind: \"numaflow.numaproj.io/v1alpha1, Kind=Pipeline\" Name: \"simple-pipeline\" , Namespace: \"numaflow-system\" for : \"examples/1-simple-pipeline.yaml\" : error when patching \"examples/1-simple-pipeline.yaml\" : admission webhook \"webhook.numaflow.numaproj.io\" denied the request: Cannot update pipeline with different interStepBufferServiceName Other validations include: Pipeline: cannot change the type of an existing vertex cannot change the partition count of a reduce vertex cannot change the storage class of a reduce vertex etc. 
InterStepBufferService: cannot change the persistence configuration of an ISB Service etc.","title":"Examples"},{"location":"operations/metrics/metrics/","text":"Metrics \u00b6 Numaflow provides the following prometheus metrics which we can use to monitor our pipeline and setup any alerts if needed. Golden Signals \u00b6 These metrics in combination can be used to determine the overall health of your pipeline Traffic \u00b6 These metrics can be used to determine throughput of your pipeline. Data-forward \u00b6 Metric name Metric type Labels Description forwarder_data_read_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages read by a given Vertex from an Inter-Step Buffer Partition forwarder_read_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes read by a given Vertex from an Inter-Step Buffer Partition forwarder_write_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages written to Inter-Step Buffer by a given Vertex forwarder_write_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes written to Inter-Step Buffer by a given Vertex forwarder_ack_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages acknowledged by a given Vertex from an Inter-Step Buffer Partition forwarder_drop_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages dropped by a given Vertex due to a full Inter-Step Buffer Partition forwarder_drop_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes dropped by a given Vertex due to a full Inter-Step Buffer Partition Kafka Source \u00b6 Metric name Metric type Labels Description kafka_source_read_total Counter pipeline= vertex= Provides the number of messages read 
by the Kafka Source Vertex/Processor. kafka_source_ack_total Counter pipeline= vertex= Provides the number of messages acknowledged by the Kafka Source Vertex/Processor Generator Source \u00b6 Metric name Metric type Labels Description tickgen_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Generator Source Vertex/Processor. Http Source \u00b6 Metric name Metric type Labels Description http_source_read_total Counter pipeline= vertex= Provides the number of messages read by the HTTP Source Vertex/Processor. Kafka Sink \u00b6 Metric name Metric type Labels Description kafka_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Kafka Sink Vertex/Processor Log Sink \u00b6 Metric name Metric type Labels Description log_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Log Sink Vertex/Processor Latency \u00b6 These metrics can be used to determine the latency of your pipeline. Metric name Metric type Labels Description pipeline_lag_milliseconds Gauge pipeline= Provides the pipeline processing lag in milliseconds watermark_cmp_now_milliseconds Gauge pipeline= Provides the Watermark compared with current time in milliseconds source_forwarder_transformer_processing_time Histogram pipeline= vertex= vertex_type= replica= partition_name= Provides a histogram distribution of the processing times of User-defined Source Transformer forwarder_udf_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of User-defined Functions. 
(UDF's) forwarder_forward_chunk_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of the forwarder function as a whole reduce_pnf_process_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the processing times of the reducer reduce_pnf_forward_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the forwarding times of the reducer Errors \u00b6 These metrics can be used to determine if there are any errors in the pipeline Metric name Metric type Labels Description forwarder_platform_error_total Counter pipeline= vertex= vertex_type= replica= Indicates any internal errors which could stop pipeline processing forwarder_read_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while reading messages by the forwarder forwarder_write_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while writing messages by the forwarder forwarder_ack_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while acknowledging messages by the forwarder kafka_source_offset_ack_errors Counter pipeline= vertex= Indicates any kafka acknowledgement errors kafka_sink_write_error_total Counter pipeline= vertex= Provides the number of errors while writing to the Kafka sink kafka_sink_write_timeout_total Counter pipeline= vertex= Provides the write timeouts while writing to the Kafka sink isb_jetstream_read_error_total Counter partition_name= Indicates any read errors with NATS Jetstream ISB isb_jetstream_write_error_total Counter partition_name= Indicates any write errors with NATS Jetstream ISB isb_redis_read_error_total Counter partition_name= Indicates any read errors with Redis ISB isb_redis_write_error_total Counter partition_name= Indicates any write errors with Redis ISB Saturation \u00b6 NATS JetStream ISB \u00b6 Metric name Metric 
type Labels Description isb_jetstream_isFull_total Counter buffer= Indicates if the ISB is full. Continual increase of this counter metric indicates potential backpressure building up in the pipeline isb_jetstream_buffer_soft_usage Gauge buffer= Indicates the usage/utilization of a NATS Jetstream ISB isb_jetstream_buffer_solid_usage Gauge buffer= Indicates the solid usage of a NATS Jetstream ISB isb_jetstream_buffer_pending Gauge buffer= Indicates the number of pending messages at a given point in time. isb_jetstream_buffer_ack_pending Gauge buffer= Indicates the number of messages pending acknowledgement at a given point in time Redis ISB \u00b6 Metric name Metric type Labels Description isb_redis_isFull_total Counter buffer= Indicates if the ISB is full. Continual increase of this counter metric indicates potential backpressure building up in the pipeline isb_redis_buffer_usage Gauge buffer= Indicates the usage/utilization of a Redis ISB isb_redis_consumer_lag Gauge buffer= Indicates the consumer lag of a Redis ISB Prometheus Operator for Scraping Metrics: \u00b6 You can follow the prometheus operator setup guide if you would like to use prometheus operator configured in your cluster. You can also set up prometheus operator via helm . 
Configure the below Service Monitors for scraping your pipeline metrics: \u00b6 apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-pipeline-metrics spec : endpoints : - scheme : https port : metrics targetPort : 2469 tlsConfig : insecureSkipVerify : true selector : matchLabels : app.kubernetes.io/component : vertex app.kubernetes.io/managed-by : vertex-controller app.kubernetes.io/part-of : numaflow matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : Exists - key : numaflow.numaproj.io/vertex-name operator : Exists Configure the below Service Monitor if you use the NATS Jetstream ISB for your NATS Jetstream metrics: \u00b6 apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-isbsvc-jetstream-metrics spec : endpoints : - scheme : http port : metrics targetPort : 7777 selector : matchLabels : app.kubernetes.io/component : isbsvc app.kubernetes.io/managed-by : isbsvc-controller app.kubernetes.io/part-of : numaflow numaflow.numaproj.io/isbsvc-type : jetstream matchExpressions : - key : numaflow.numaproj.io/isbsvc-name operator : Exists","title":"Metrics"},{"location":"operations/metrics/metrics/#metrics","text":"Numaflow provides the following prometheus metrics which we can use to monitor our pipeline and setup any alerts if needed.","title":"Metrics"},{"location":"operations/metrics/metrics/#golden-signals","text":"These metrics in combination can be used to determine the overall health of your pipeline","title":"Golden Signals"},{"location":"operations/metrics/metrics/#traffic","text":"These metrics can be used to determine throughput of your pipeline.","title":"Traffic"},{"location":"operations/metrics/metrics/#data-forward","text":"Metric name Metric type Labels Description forwarder_data_read_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total 
number of messages read by a given Vertex from an Inter-Step Buffer Partition forwarder_read_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes read by a given Vertex from an Inter-Step Buffer Partition forwarder_write_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages written to Inter-Step Buffer by a given Vertex forwarder_write_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes written to Inter-Step Buffer by a given Vertex forwarder_ack_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages acknowledged by a given Vertex from an Inter-Step Buffer Partition forwarder_drop_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of messages dropped by a given Vertex due to a full Inter-Step Buffer Partition forwarder_drop_bytes_total Counter pipeline= vertex= vertex_type= replica= partition_name= Provides the total number of bytes dropped by a given Vertex due to a full Inter-Step Buffer Partition","title":"Data-forward"},{"location":"operations/metrics/metrics/#kafka-source","text":"Metric name Metric type Labels Description kafka_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Kafka Source Vertex/Processor. 
kafka_source_ack_total Counter pipeline= vertex= Provides the number of messages acknowledged by the Kafka Source Vertex/Processor","title":"Kafka Source"},{"location":"operations/metrics/metrics/#generator-source","text":"Metric name Metric type Labels Description tickgen_source_read_total Counter pipeline= vertex= Provides the number of messages read by the Generator Source Vertex/Processor.","title":"Generator Source"},{"location":"operations/metrics/metrics/#http-source","text":"Metric name Metric type Labels Description http_source_read_total Counter pipeline= vertex= Provides the number of messages read by the HTTP Source Vertex/Processor.","title":"Http Source"},{"location":"operations/metrics/metrics/#kafka-sink","text":"Metric name Metric type Labels Description kafka_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Kafka Sink Vertex/Processor","title":"Kafka Sink"},{"location":"operations/metrics/metrics/#log-sink","text":"Metric name Metric type Labels Description log_sink_write_total Counter pipeline= vertex= Provides the number of messages written by the Log Sink Vertex/Processor","title":"Log Sink"},{"location":"operations/metrics/metrics/#latency","text":"These metrics can be used to determine the latency of your pipeline. Metric name Metric type Labels Description pipeline_lag_milliseconds Gauge pipeline= Provides the pipeline processing lag in milliseconds watermark_cmp_now_milliseconds Gauge pipeline= Provides the Watermark compared with current time in milliseconds source_forwarder_transformer_processing_time Histogram pipeline= vertex= vertex_type= replica= partition_name= Provides a histogram distribution of the processing times of User-defined Source Transformer forwarder_udf_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of User-defined Functions. 
(UDF's) forwarder_forward_chunk_processing_time Histogram pipeline= vertex= vertex_type= replica= Provides a histogram distribution of the processing times of the forwarder function as a whole reduce_pnf_process_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the processing times of the reducer reduce_pnf_forward_time Histogram pipeline= vertex= replica= Provides a histogram distribution of the forwarding times of the reducer","title":"Latency"},{"location":"operations/metrics/metrics/#errors","text":"These metrics can be used to determine if there are any errors in the pipeline Metric name Metric type Labels Description forwarder_platform_error_total Counter pipeline= vertex= vertex_type= replica= Indicates any internal errors which could stop pipeline processing forwarder_read_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while reading messages by the forwarder forwarder_write_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while writing messages by the forwarder forwarder_ack_error_total Counter pipeline= vertex= vertex_type= replica= partition_name= Indicates any errors while acknowledging messages by the forwarder kafka_source_offset_ack_errors Counter pipeline= vertex= Indicates any kafka acknowledgement errors kafka_sink_write_error_total Counter pipeline= vertex= Provides the number of errors while writing to the Kafka sink kafka_sink_write_timeout_total Counter pipeline= vertex= Provides the write timeouts while writing to the Kafka sink isb_jetstream_read_error_total Counter partition_name= Indicates any read errors with NATS Jetstream ISB isb_jetstream_write_error_total Counter partition_name= Indicates any write errors with NATS Jetstream ISB isb_redis_read_error_total Counter partition_name= Indicates any read errors with Redis ISB isb_redis_write_error_total Counter partition_name= Indicates any write errors with Redis 
ISB","title":"Errors"},{"location":"operations/metrics/metrics/#saturation","text":"","title":"Saturation"},{"location":"operations/metrics/metrics/#nats-jetstream-isb","text":"Metric name Metric type Labels Description isb_jetstream_isFull_total Counter buffer= Indicates if the ISB is full. Continual increase of this counter metric indicates potential backpressure building up in the pipeline isb_jetstream_buffer_soft_usage Gauge buffer= Indicates the usage/utilization of a NATS Jetstream ISB isb_jetstream_buffer_solid_usage Gauge buffer= Indicates the solid usage of a NATS Jetstream ISB isb_jetstream_buffer_pending Gauge buffer= Indicates the number of pending messages at a given point in time. isb_jetstream_buffer_ack_pending Gauge buffer= Indicates the number of messages pending acknowledgement at a given point in time","title":"NATS JetStream ISB"},{"location":"operations/metrics/metrics/#redis-isb","text":"Metric name Metric type Labels Description isb_redis_isFull_total Counter buffer= Indicates if the ISB is full. Continual increase of this counter metric indicates potential backpressure building up in the pipeline isb_redis_buffer_usage Gauge buffer= Indicates the usage/utilization of a Redis ISB isb_redis_consumer_lag Gauge buffer= Indicates the consumer lag of a Redis ISB","title":"Redis ISB"},{"location":"operations/metrics/metrics/#prometheus-operator-for-scraping-metrics","text":"You can follow the prometheus operator setup guide if you would like to use prometheus operator configured in your cluster. 
You can also set up prometheus operator via helm .","title":"Prometheus Operator for Scraping Metrics:"},{"location":"operations/metrics/metrics/#configure-the-below-service-monitors-for-scraping-your-pipeline-metrics","text":"apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-pipeline-metrics spec : endpoints : - scheme : https port : metrics targetPort : 2469 tlsConfig : insecureSkipVerify : true selector : matchLabels : app.kubernetes.io/component : vertex app.kubernetes.io/managed-by : vertex-controller app.kubernetes.io/part-of : numaflow matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : Exists - key : numaflow.numaproj.io/vertex-name operator : Exists","title":"Configure the below Service Monitors for scraping your pipeline metrics:"},{"location":"operations/metrics/metrics/#configure-the-below-service-monitor-if-you-use-the-nats-jetstream-isb-for-your-nats-jetstream-metrics","text":"apiVersion : monitoring.coreos.com/v1 kind : ServiceMonitor metadata : labels : app.kubernetes.io/part-of : numaflow name : numaflow-isbsvc-jetstream-metrics spec : endpoints : - scheme : http port : metrics targetPort : 7777 selector : matchLabels : app.kubernetes.io/component : isbsvc app.kubernetes.io/managed-by : isbsvc-controller app.kubernetes.io/part-of : numaflow numaflow.numaproj.io/isbsvc-type : jetstream matchExpressions : - key : numaflow.numaproj.io/isbsvc-name operator : Exists","title":"Configure the below Service Monitor if you use the NATS Jetstream ISB for your NATS Jetstream metrics:"},{"location":"operations/ui/ui-access-path/","text":"UI Access Path \u00b6 By default, Numaflow UI server will host the service at the root / ie. localhost:8443 . If a user needs to access the UI server under a different path, this can be achieved with following configuration. 
This is useful when the UI is hosted behind a reverse proxy or ingress controller that requires a specific path. Configure server.base.href in the ConfigMap numaflow-cmd-params-config . apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Base href for Numaflow UI server, defaults to '/'. server.base.href : \"/app\" The configuration above will host the service at localhost:8443/app . Note that this new access path will work with or without a trailing slash.","title":"Access Path"},{"location":"operations/ui/ui-access-path/#ui-access-path","text":"By default, Numaflow UI server will host the service at the root / ie. localhost:8443 . If a user needs to access the UI server under a different path, this can be achieved with following configuration. This is useful when the UI is hosted behind a reverse proxy or ingress controller that requires a specific path. Configure server.base.href in the ConfigMap numaflow-cmd-params-config . apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Base href for Numaflow UI server, defaults to '/'. server.base.href : \"/app\" The configuration above will host the service at localhost:8443/app . Note that this new access path will work with or without a trailing slash.","title":"UI Access Path"},{"location":"operations/ui/authn/authentication/","text":"Authentication \u00b6 Numaflow UI server provides 2 approaches for authentication. SSO with Dex Local users There's also an option to disable authentication/authorization by setting server.disable.auth: \"true\" in the ConfigMap numaflow-cmd-params-config . In this case, everybody has full access and privileges to all features of the UI (not recommended).","title":"Overview"},{"location":"operations/ui/authn/authentication/#authentication","text":"Numaflow UI server provides 2 approaches for authentication. 
SSO with Dex Local users There's also an option to disable authentication/authorization by setting server.disable.auth: \"true\" in the ConfigMap numaflow-cmd-params-config . In this case, everybody has full access and privileges to all features of the UI (not recommended).","title":"Authentication"},{"location":"operations/ui/authn/dex/","text":"Dex Server \u00b6 Numaflow comes with a Dex Server for authentication integration. Currently, the supported identity provider is Github. SSO configuration of Numaflow UI will require editing some configuration detailed below. 1. Register application for Github \u00b6 In Github, register a new OAuth application. The callback address should be the homepage of your Numaflow UI + /dex/callback . After registering this application, you will be given a client ID. You will need this value and also generate a new client secret. 2. Configuring Numaflow \u00b6 First we need to configure server.disable.auth to false in the ConfigMap numaflow-cmd-params-config . This will enable authentication and authorization for the UX server. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Whether to disable authentication and authorization for the UX server, defaults to false. server.disable.auth : \"false\" # Next we need to configure the numaflow-dex-server-config ConfigMap. Change to your organization you created the application under and include the correct teams. This file will be read by the init container of the Dex server, which generates the config it will serve. kind : ConfigMap apiVersion : v1 metadata : name : numaflow-dex-server-config data : config.yaml : | connectors: - type: github # https://dexidp.io/docs/connectors/github/ id: github name: GitHub config: clientID: $GITHUB_CLIENT_ID clientSecret: $GITHUB_CLIENT_SECRET orgs: - name: teams: - admin - readonly Finally we will need to create/update the numaflow-dex-secrets Secret. 
You will need to add the client ID and secret you created earlier for the application here. apiVersion : v1 kind : Secret metadata : name : numaflow-dex-secrets stringData : # https://dexidp.io/docs/connectors/github/ dex-github-client-id : dex-github-client-secret : 3. Restarting Pods \u00b6 If you are enabling/disabling authorization and authentication for the Numaflow server, it will need to be restarted. Any changes or additions to the connectors in the numaflow-dex-server-config ConfigMap will need to be read and generated again requiring a restart as well.","title":"SSO with Dex"},{"location":"operations/ui/authn/dex/#dex-server","text":"Numaflow comes with a Dex Server for authentication integration. Currently, the supported identity provider is Github. SSO configuration of Numaflow UI will require editing some configuration detailed below.","title":"Dex Server"},{"location":"operations/ui/authn/dex/#1-register-application-for-github","text":"In Github, register a new OAuth application. The callback address should be the homepage of your Numaflow UI + /dex/callback . After registering this application, you will be given a client ID. You will need this value and also generate a new client secret.","title":"1. Register application for Github"},{"location":"operations/ui/authn/dex/#2-configuring-numaflow","text":"First we need to configure server.disable.auth to false in the ConfigMap numaflow-cmd-params-config . This will enable authentication and authorization for the UX server. apiVersion : v1 kind : ConfigMap metadata : name : numaflow-cmd-params-config data : ### Whether to disable authentication and authorization for the UX server, defaults to false. server.disable.auth : \"false\" # Next we need to configure the numaflow-dex-server-config ConfigMap. Change to your organization you created the application under and include the correct teams. This file will be read by the init container of the Dex server, which generates the config it will serve. 
kind : ConfigMap apiVersion : v1 metadata : name : numaflow-dex-server-config data : config.yaml : | connectors: - type: github # https://dexidp.io/docs/connectors/github/ id: github name: GitHub config: clientID: $GITHUB_CLIENT_ID clientSecret: $GITHUB_CLIENT_SECRET orgs: - name: teams: - admin - readonly Finally we will need to create/update the numaflow-dex-secrets Secret. You will need to add the client ID and secret you created earlier for the application here. apiVersion : v1 kind : Secret metadata : name : numaflow-dex-secrets stringData : # https://dexidp.io/docs/connectors/github/ dex-github-client-id : dex-github-client-secret : ","title":"2. Configuring Numaflow"},{"location":"operations/ui/authn/dex/#3-restarting-pods","text":"If you are enabling/disabling authorization and authentication for the Numaflow server, it will need to be restarted. Any changes or additions to the connectors in the numaflow-dex-server-config ConfigMap will need to be read and generated again requiring a restart as well.","title":"3. Restarting Pods"},{"location":"operations/ui/authn/local-users/","text":"Local Users \u00b6 In addition to the authentication using Dex, we also provide an authentication mechanism for local user based on JSON Web Token (JWT). NOTE \u00b6 When you create local users, each of those users will need additional RBAC rules set up, otherwise they will fall back to the default policy specified by policy.default field of the numaflow-server-rbac-config ConfigMap. Numaflow comes with a built-in admin user that has full access to the system. It is recommended to use admin user for initial configuration then switch to local users or configure SSO integration. 
Accessing with admin user \u00b6 A built-in admin user comes with a randomly generated password that is stored in numaflow-server-secrets Secret: Example \u00b6 kubectl get secret numaflow-server-secrets -n -o jsonpath = '{.data.admin\\.initial-password}' | base64 --decode Use the admin username and password obtained above to log in to the UI. Creating Users \u00b6 1. Adding the username \u00b6 Users can be created by updating the numaflow-server-local-user-config ConfigMap: Example \u00b6 apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Format: {username}.enabled: \"true\" bob.enabled: \"true\" 2. Generating the password \u00b6 When adding new users, it is necessary to generate a bcrypt hash of their password: Example \u00b6 # Format: htpasswd -bnBC 10 \"\" | tr -d ':\\n' htpasswd -bnBC 10 \"\" password | tr -d ':\\n' 3. Adding the password for the username \u00b6 To add the password generated above for the respective user, you can update the numaflow-server-secrets Secret: Example \u00b6 apiVersion: v1 kind: Secret metadata: name: numaflow-server-secrets type: Opaque stringData: # Format: {username}.password: bob.password: $2 y $10$0 TCvrnLHQsQtEJVdXNNL6eeXaxHmGnQO.R8zhh0Mwr2RM7s42knTK You can also update the password for admin user similarly, it will be considered over the initial password NOTE \u00b6 For the example above, the username is bob and the password is password . Disabling Users \u00b6 Users can be disabled by updating the numaflow-server-local-user-config ConfigMap, including the system generated admin user: Example \u00b6 apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Set the value to \"false\" to disable the user. bob.enabled: \"false\" Deleting Users \u00b6 Users can be deleted by removing the corresponding entries: 1. 
numaflow-server-local-user-config ConfigMap \u00b6 # Format: {username}.enabled: null kubectl patch configmap -n -p '{\"data\": {\"bob.enabled\": null}}' --type merge 2. numaflow-server-secrets Secret \u00b6 # Format: {username}.password: null kubectl patch secret -n -p '{\"data\": {\"bob.password\": null}}' --type merge","title":"Local Users"},{"location":"operations/ui/authn/local-users/#local-users","text":"In addition to the authentication using Dex, we also provide an authentication mechanism for local user based on JSON Web Token (JWT).","title":"Local Users"},{"location":"operations/ui/authn/local-users/#note","text":"When you create local users, each of those users will need additional RBAC rules set up, otherwise they will fall back to the default policy specified by policy.default field of the numaflow-server-rbac-config ConfigMap. Numaflow comes with a built-in admin user that has full access to the system. It is recommended to use admin user for initial configuration then switch to local users or configure SSO integration.","title":"NOTE"},{"location":"operations/ui/authn/local-users/#accessing-with-admin-user","text":"A built-in admin user comes with a randomly generated password that is stored in numaflow-server-secrets Secret:","title":"Accessing with admin user"},{"location":"operations/ui/authn/local-users/#example","text":"kubectl get secret numaflow-server-secrets -n -o jsonpath = '{.data.admin\\.initial-password}' | base64 --decode Use the admin username and password obtained above to log in to the UI.","title":"Example"},{"location":"operations/ui/authn/local-users/#creating-users","text":"","title":"Creating Users"},{"location":"operations/ui/authn/local-users/#1-adding-the-username","text":"Users can be created by updating the numaflow-server-local-user-config ConfigMap:","title":"1. 
Adding the username"},{"location":"operations/ui/authn/local-users/#example_1","text":"apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Format: {username}.enabled: \"true\" bob.enabled: \"true\"","title":"Example"},{"location":"operations/ui/authn/local-users/#2-generating-the-password","text":"When adding new users, it is necessary to generate a bcrypt hash of their password:","title":"2. Generating the password"},{"location":"operations/ui/authn/local-users/#example_2","text":"# Format: htpasswd -bnBC 10 \"\" | tr -d ':\\n' htpasswd -bnBC 10 \"\" password | tr -d ':\\n'","title":"Example"},{"location":"operations/ui/authn/local-users/#3-adding-the-password-for-the-username","text":"To add the password generated above for the respective user, you can update the numaflow-server-secrets Secret:","title":"3. Adding the password for the username"},{"location":"operations/ui/authn/local-users/#example_3","text":"apiVersion: v1 kind: Secret metadata: name: numaflow-server-secrets type: Opaque stringData: # Format: {username}.password: bob.password: $2 y $10$0 TCvrnLHQsQtEJVdXNNL6eeXaxHmGnQO.R8zhh0Mwr2RM7s42knTK You can also update the password for admin user similarly, it will be considered over the initial password","title":"Example"},{"location":"operations/ui/authn/local-users/#note_1","text":"For the example above, the username is bob and the password is password .","title":"NOTE"},{"location":"operations/ui/authn/local-users/#disabling-users","text":"Users can be disabled by updating the numaflow-server-local-user-config ConfigMap, including the system generated admin user:","title":"Disabling Users"},{"location":"operations/ui/authn/local-users/#example_4","text":"apiVersion: v1 kind: ConfigMap metadata: name: numaflow-server-local-user-config data: # Set the value to \"false\" to disable the user. 
bob.enabled: \"false\"","title":"Example"},{"location":"operations/ui/authn/local-users/#deleting-users","text":"Users can be deleted by removing the corresponding entries:","title":"Deleting Users"},{"location":"operations/ui/authn/local-users/#1-numaflow-server-local-user-config-configmap","text":"# Format: {username}.enabled: null kubectl patch configmap -n -p '{\"data\": {\"bob.enabled\": null}}' --type merge","title":"1. numaflow-server-local-user-config ConfigMap"},{"location":"operations/ui/authn/local-users/#2-numaflow-server-secrets-secret","text":"# Format: {username}.password: null kubectl patch secret -n -p '{\"data\": {\"bob.password\": null}}' --type merge","title":"2. numaflow-server-secrets Secret"},{"location":"operations/ui/authz/rbac/","text":"Authorization \u00b6 Numaflow UI utilizes a role-based access control (RBAC) model to manage authorization, the RBAC policy and permissions are defined in the ConfigMap numaflow-server-rbac-config . There are two main sections in the ConfigMap. Rules \u00b6 Policies and groups are the two main entities defined in rules section, both of them work in conjunction with each other. The groups are used to define a set of users with the same permissions and the policies are used to define the specific permissions for these users or groups. # Policies go here p, role:admin, *, *, * p, role:readonly, *, *, GET # Groups go here g, admin, role:admin g, my-github-org:my-github-team, role:readonly Here we have defined two policies for the custom groups role:admin and role:readonly . The first policy allows the group role:admin to access all resources in all namespaces with all actions. The second policy allows the group role:readonly to access all resources in all namespaces with the GET action. To add a new policy , add a new line in the format: p, , , , User/Group : The user/group requesting access to a resource. This is the identifier extracted from the authentication token, such as a username, email address, or ID. 
Or could be a group defined in the groups section. Resource : The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. Action : The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. The namespace, resource and action supports a wildcard * as an allow all function. Few examples: a policy line p, test@test.com, *, *, POST would allow the user with the given email address to access all resources in all namespaces with the POST action. a policy line p, test_user, *, *, * would allow the user with the given username to access all resources in all namespaces with all actions. a policy line p, role:admin_ns, test_ns, *, * would allow the group role:admin_ns to access all resources in the namespace test_ns with all actions. a policy line p, test_user, test_ns, *, GET would allow the user with the given username to access all resources in the namespace test_ns with the GET action. Groups can be defined by adding a new line in the format: g, , Here user is the identifier extracted from the authentication token, such as a username, email address, or ID. And group is the name of the group to which the user is being added. These are useful for defining a set of users with the same permissions. The group can be used in the policy definition in place of the user. And thus any user added to the group will have the same permissions as the group. Few examples: a group line g, test@test.com, role:readonly would add the user with the given email address to the group role:readonly. a group line g, test_user, role:admin would add the user with the given username to the group role:admin. Configuration \u00b6 This defines certain properties for the Casbin enforcer. 
The properties are defined in the following format: rbac-conf.yaml: | policy.default: role:readonly policy.scopes: groups,email,username We see two properties defined here: policy.default : This defines the default role for a user. If a user does not have any roles defined, then this role will be used for the user. This is useful for defining a default role for all users. policy.scopes : The scopes field controls which authentication scopes to examine during RBAC enforcement. We can have multiple scopes, and the first scope that matches with the policy will be used. \"groups\", which means that the groups field of the user's token will be examined. This is the default value and is used if no scopes are defined. \"email\", which means that the email field of the user's token will be examined. \"username\", which means that the username field of the user's token will be examined. Multiple scopes can be provided as a comma-separated list, e.g. \"groups,email,username\". This scope information is used to extract the user information from the token and then used to enforce the policies. Thus it is important to have the rules defined in the above section to map with the scopes expected in the configuration. Note : The rbac-conf.yaml file can be updated during runtime and the changes will be reflected immediately. This is useful for changing the default role for all users or adding a new scope to be used for RBAC enforcement.","title":"Authorization"},{"location":"operations/ui/authz/rbac/#authorization","text":"Numaflow UI utilizes a role-based access control (RBAC) model to manage authorization. The RBAC policy and permissions are defined in the ConfigMap numaflow-server-rbac-config . There are two main sections in the ConfigMap.","title":"Authorization"},{"location":"operations/ui/authz/rbac/#rules","text":"Policies and groups are the two main entities defined in the rules section; they work in conjunction with each other.
The groups are used to define a set of users with the same permissions and the policies are used to define the specific permissions for these users or groups. # Policies go here p, role:admin, *, *, * p, role:readonly, *, *, GET # Groups go here g, admin, role:admin g, my-github-org:my-github-team, role:readonly Here we have defined two policies for the custom groups role:admin and role:readonly . The first policy allows the group role:admin to access all resources in all namespaces with all actions. The second policy allows the group role:readonly to access all resources in all namespaces with the GET action. To add a new policy , add a new line in the format: p, , , , User/Group : The user/group requesting access to a resource. This is the identifier extracted from the authentication token, such as a username, email address, or ID. Or could be a group defined in the groups section. Resource : The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. Action : The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. The namespace, resource and action supports a wildcard * as an allow all function. Few examples: a policy line p, test@test.com, *, *, POST would allow the user with the given email address to access all resources in all namespaces with the POST action. a policy line p, test_user, *, *, * would allow the user with the given username to access all resources in all namespaces with all actions. a policy line p, role:admin_ns, test_ns, *, * would allow the group role:admin_ns to access all resources in the namespace test_ns with all actions. a policy line p, test_user, test_ns, *, GET would allow the user with the given username to access all resources in the namespace test_ns with the GET action. 
Groups can be defined by adding a new line in the format: g, , Here user is the identifier extracted from the authentication token, such as a username, email address, or ID. And group is the name of the group to which the user is being added. These are useful for defining a set of users with the same permissions. The group can be used in the policy definition in place of the user. And thus any user added to the group will have the same permissions as the group. Few examples: a group line g, test@test.com, role:readonly would add the user with the given email address to the group role:readonly. a group line g, test_user, role:admin would add the user with the given username to the group role:admin.","title":"Rules"},{"location":"operations/ui/authz/rbac/#configuration","text":"This defines certain properties for the Casbin enforcer. The properties are defined in the following format: rbac-conf.yaml: | policy.default: role:readonly policy.scopes: groups,email,username We see two properties defined here: policy.default : This defines the default role for a user. If a user does not have any roles defined, then this role will be used for the user. This is useful for defining a default role for all users. policy.scopes : The scopes field controls which authentication scopes to examine during rbac enforcement. We can have multiple scopes, and the first scope that matches with the policy will be used. \"groups\", which means that the groups field of the user's token will be examined, This is default value and is used if no scopes are defined. \"email\", which means that the email field of the user's token will be examined \"username\", which means that the username field of the user's token will be examined Multiple scopes can be provided as a comma-separated, e.g \"groups,email,username\" This scope information is used to extract the user information from the token and then used to enforce the policies. 
Thus it is important to have the rules defined in the above section to map with the scopes expected in the configuration. Note : The rbac-conf.yaml file can be updated during runtime and the changes will be reflected immediately. This is useful for changing the default role for all users or adding a new scope to be used for RBAC enforcement.","title":"Configuration"},{"location":"specifications/authorization/","text":"UI Authorization \u00b6 We utilize a role-based access control (RBAC) model to manage authorization in Numaflow. Along with this we utilize Casbin as a library for the implementation of these policies. Permissions and Policies \u00b6 The following model configuration is given to define the policies. The policy model is defined in the Casbin policy language. [request_definition] r = sub, res, obj, act [policy_definition] p = sub, res, obj, act [role_definition] g = _, _ [policy_effect] e = some(where (p.eft == allow)) [matchers] m = g(r.sub, p.sub) && patternMatch(r.res, p.res) && stringMatch(r.obj, p.obj) && stringMatch(r.act, p.act) The policy model consists of the following sections: request_definition: The request definition section defines the request attributes. In our case, the request attributes are the user, resource, object, and action. policy_definition: The policy definition section defines the policy attributes. In our case, the policy attributes are the user, resource, object, and action. role_definition: The role definition section defines the role attributes. In our case, the role attributes are the user and role. policy_effect: The policy effect defines what action is to be taken on auth. In our case, the policy effect is allow. matchers: The matcher section defines the matching logic which decides whether a given request matches any policy or not. These matches are done in order of the definition above and short-circuit at the first failure. There are custom functions like patternMatch and stringMatch.
patternMatch: This function is used to match the resource with the policy resource using OS path pattern matching along with adding support for wildcards for allowAll. stringMatch: This function is used to match the object and action and uses a simple exact string match. This also supports wildcards for allowAll. Our policy model uses the following structure for all policies defined and any requests made to the UI server: User: The user requesting access to a resource. This could be any identifier, such as a username, email address, or ID. Resource: The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. We have wildcard \"*\" to allow access to all namespaces. Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. We have wildcard \"*\" to allow access to all resources. Action: The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. We have wildcard \"*\" to allow access to all actions. Refer to the RBAC to learn more about how to configure authorization policies for Numaflow UI.","title":"UI Authorization"},{"location":"specifications/authorization/#ui-authorization","text":"We utilize a role-based access control (RBAC) model to manage authorization in Numaflow. Along with this we utilize Casbin as a library for the implementation of these policies.","title":"UI Authorization"},{"location":"specifications/authorization/#permissions-and-policies","text":"The following model configuration is given to define the policies. The policy model is defined in the Casbin policy language.
[request_definition] r = sub, res, obj, act [policy_definition] p = sub, res, obj, act [role_definition] g = _, _ [policy_effect] e = some(where (p.eft == allow)) [matchers] m = g(r.sub, p.sub) && patternMatch(r.res, p.res) && stringMatch(r.obj, p.obj) && stringMatch(r.act, p.act) The policy model consists of the following sections: request_definition: The request definition section defines the request attributes. In our case, the request attributes are the user, resource, object, and action. policy_definition: The policy definition section defines the policy attributes. In our case, the policy attributes are the user, resource, object, and action. role_definition: The role definition section defines the role attributes. In our case, the role attributes are the user and role. policy_effect: The policy effect defines what action is to be taken on auth. In our case, the policy effect is allow. matchers: The matcher section defines the matching logic which decides whether a given request matches any policy or not. These matches are done in order of the definition above and short-circuit at the first failure. There are custom functions like patternMatch and stringMatch. patternMatch: This function is used to match the resource with the policy resource using OS path pattern matching along with adding support for wildcards for allowAll. stringMatch: This function is used to match the object and action and uses a simple exact string match. This also supports wildcards for allowAll. Our policy model uses the following structure for all policies defined and any requests made to the UI server: User: The user requesting access to a resource. This could be any identifier, such as a username, email address, or ID. Resource: The namespace in the cluster which is being accessed by the user. This can allow for selective access to namespaces. We have wildcard \"*\" to allow access to all namespaces.
Object : This could be a specific resource in the namespace, such as a pipeline, isbsvc or any event based resource. We have wildcard \"*\" to allow access to all resources. Action: The action being performed on the resource using the API. These follow the standard HTTP verbs, such as GET, POST, PUT, DELETE, etc. We have wildcard \"*\" to allow access to all actions. Refer to the RBAC to learn more about how to configure authorization policies for Numaflow UI.","title":"Permissions and Policies"},{"location":"specifications/autoscaling/","text":"Autoscaling \u00b6 Scale Subresource is enabled in Vertex Custom Resource , which makes it possible to scale vertex pods. To be specifically, it is enabled by adding following comments to Vertex struct model, and then corresponding CRD definition is automatically generated. // +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector Pods management is done by vertex controller. With scale subresource implemented, vertex object can be scaled by either horizontal or vertical pod autoscaling. Numaflow Autoscaling \u00b6 The out of box Numaflow autoscaling is done by a scaling component running in the controller manager, you can find the source code here . The autoscaling strategy is implemented according to different type of vertices. Source Vertices \u00b6 For source vertices, we define a target time (in seconds) to finish processing the pending messages based on the processing rate (tps) of the vertex. pendingMessages / processingRate = targetSeconds For example, if targetSeconds is 3, current replica number is 2 , current tps is 10000/second, and the pending messages is 60000, so we calculate the desired replica number as following: desiredReplicas = 60000 / (3 * (10000 / 2)) = 4 Numaflow autoscaling does not work for those source vertices that can not calculate pending messages. 
UDF and Sink Vertices \u00b6 Pending messages of a UDF or Sink vertex does not represent the real number because of the restrained writing caused by back pressure, so we use a different model to achieve autoscaling for them. For each of the vertices, we calculate the available buffer length, and consider it is contributed by all the replicas, so that we can get each replica's contribution. availableBufferLength = totalBufferLength * bufferLimit(%) - pendingMessages singleReplicaContribution = availableBufferLength / currentReplicas We define a target available buffer length, and then calculate how many replicas are needed to achieve the target. desiredReplicas = targetAvailableBufferLength / singleReplicaContribution Back Pressure Impact \u00b6 Back pressure is considered during autoscaling (which is only available for Source and UDF vertices). We measure the back pressure by defining a threshold of the buffer usage. For example, the total buffer length is 50000, buffer limit is 80%, and the back pressure threshold is 90%, if in the past period of time, the average pending messages is more than 36000 (50000 * 80% * 90%) , we consider there's back pressure. When the calculated desired replicas is greater than current replicas: For vertices which have back pressure from the directly connected vertices, instead of increasing the replica number, we decrease it by 1; For vertices which have back pressure in any of its downstream vertices, the replica number remains unchanged. Autoscaling Tuning \u00b6 Numaflow autoscaling can be tuned by updating some parameters, find the details at the doc .","title":"Autoscaling"},{"location":"specifications/autoscaling/#autoscaling","text":"Scale Subresource is enabled in Vertex Custom Resource , which makes it possible to scale vertex pods. To be specifically, it is enabled by adding following comments to Vertex struct model, and then corresponding CRD definition is automatically generated. 
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector Pods management is done by vertex controller. With scale subresource implemented, vertex object can be scaled by either horizontal or vertical pod autoscaling.","title":"Autoscaling"},{"location":"specifications/autoscaling/#numaflow-autoscaling","text":"The out of box Numaflow autoscaling is done by a scaling component running in the controller manager, you can find the source code here . The autoscaling strategy is implemented according to different type of vertices.","title":"Numaflow Autoscaling"},{"location":"specifications/autoscaling/#source-vertices","text":"For source vertices, we define a target time (in seconds) to finish processing the pending messages based on the processing rate (tps) of the vertex. pendingMessages / processingRate = targetSeconds For example, if targetSeconds is 3, current replica number is 2 , current tps is 10000/second, and the pending messages is 60000, so we calculate the desired replica number as following: desiredReplicas = 60000 / (3 * (10000 / 2)) = 4 Numaflow autoscaling does not work for those source vertices that can not calculate pending messages.","title":"Source Vertices"},{"location":"specifications/autoscaling/#udf-and-sink-vertices","text":"Pending messages of a UDF or Sink vertex does not represent the real number because of the restrained writing caused by back pressure, so we use a different model to achieve autoscaling for them. For each of the vertices, we calculate the available buffer length, and consider it is contributed by all the replicas, so that we can get each replica's contribution. availableBufferLength = totalBufferLength * bufferLimit(%) - pendingMessages singleReplicaContribution = availableBufferLength / currentReplicas We define a target available buffer length, and then calculate how many replicas are needed to achieve the target. 
desiredReplicas = targetAvailableBufferLength / singleReplicaContribution","title":"UDF and Sink Vertices"},{"location":"specifications/autoscaling/#back-pressure-impact","text":"Back pressure is considered during autoscaling (which is only available for Source and UDF vertices). We measure the back pressure by defining a threshold of the buffer usage. For example, the total buffer length is 50000, buffer limit is 80%, and the back pressure threshold is 90%, if in the past period of time, the average pending messages is more than 36000 (50000 * 80% * 90%) , we consider there's back pressure. When the calculated desired replicas is greater than current replicas: For vertices which have back pressure from the directly connected vertices, instead of increasing the replica number, we decrease it by 1; For vertices which have back pressure in any of its downstream vertices, the replica number remains unchanged.","title":"Back Pressure Impact"},{"location":"specifications/autoscaling/#autoscaling-tuning","text":"Numaflow autoscaling can be tuned by updating some parameters, find the details at the doc .","title":"Autoscaling Tuning"},{"location":"specifications/controllers/","text":"Controllers \u00b6 Currently in Numaflow , there are 3 CRDs introduced, each one has a corresponding controller. interstepbufferservices.numaflow.numaproj.io pipelines.numaflow.numaproj.io vertices.numaflow.numaproj.io The source code of the controllers is located at ./pkg/reconciler/ . Inter-Step Buffer Service Controller \u00b6 Inter-Step Buffer Service Controller is used to watch InterStepBufferService object, depending on the spec of the object, it might install services (such as JetStream, or Redis) in the namespace, or simply provide the configuration of the InterStepBufferService (for example, when an external redis ISB Service is given). 
Pipeline Controller \u00b6 Pipeline Controller is used to watch Pipeline objects, it does following major things when there's a pipeline object created. Spawn a Kubernetes Job to create buffers and buckets in the Inter-Step Buffer Services . Create Vertex objects according to .spec.vertices defined in Pipeline object. Create some other Kubernetes objects used for the Pipeline, such as a Deployment and a Service for daemon service application. Vertex Controller \u00b6 Vertex controller watches the Vertex objects, based on the replica defined in the spec, creates a number of pods to run the workloads.","title":"Controllers"},{"location":"specifications/controllers/#controllers","text":"Currently in Numaflow , there are 3 CRDs introduced, each one has a corresponding controller. interstepbufferservices.numaflow.numaproj.io pipelines.numaflow.numaproj.io vertices.numaflow.numaproj.io The source code of the controllers is located at ./pkg/reconciler/ .","title":"Controllers"},{"location":"specifications/controllers/#inter-step-buffer-service-controller","text":"Inter-Step Buffer Service Controller is used to watch InterStepBufferService object, depending on the spec of the object, it might install services (such as JetStream, or Redis) in the namespace, or simply provide the configuration of the InterStepBufferService (for example, when an external redis ISB Service is given).","title":"Inter-Step Buffer Service Controller"},{"location":"specifications/controllers/#pipeline-controller","text":"Pipeline Controller is used to watch Pipeline objects, it does following major things when there's a pipeline object created. Spawn a Kubernetes Job to create buffers and buckets in the Inter-Step Buffer Services . Create Vertex objects according to .spec.vertices defined in Pipeline object. 
Create some other Kubernetes objects used for the Pipeline, such as a Deployment and a Service for daemon service application.","title":"Pipeline Controller"},{"location":"specifications/controllers/#vertex-controller","text":"Vertex controller watches the Vertex objects, based on the replica defined in the spec, creates a number of pods to run the workloads.","title":"Vertex Controller"},{"location":"specifications/edges-buffers-buckets/","text":"Edges, Buffers and Buckets \u00b6 This document describes the concepts of Edge , Buffer and Bucket in a pipeline. Edges \u00b6 Edge is the connection between the vertices, specifically, edge is defined in the pipeline spec under .spec.edges . No matter if the to vertex is a Map, or a Reduce with multiple partitions, it is considered as one edge. In the following pipeline, there are 3 edges defined ( in - atoi , atoi - compute-sum , compute-sum - out ). apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : even-odd-sum spec : vertices : - name : in source : http : {} - name : atoi scale : min : 1 udf : container : image : quay.io/numaio/numaflow-go/map-even-odd:v0.5.0 - name : compute-sum partitions : 2 udf : container : image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true - name : out scale : min : 1 sink : log : {} edges : - from : in to : atoi - from : atoi to : compute-sum - from : compute-sum to : out Each edge could have a name for internal usage, the naming convention is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Buffers \u00b6 Buffer is InterStepBuffer . Each buffer has an owner, which is the vertex that reads from it. Each udf and sink vertex in a pipeline owns a group of partitioned buffers. Each buffer has a name with the naming convention {pipeline-name}-{vertex-name}-{index} , where the index is the partition index, starting from 0. This naming convention applies to the buffers of both map and reduce udf vertices.
When multiple vertices connect to the same vertex, if the to vertex is a Map, the data from all the from vertices will be forwarded to the group of partitioned buffers in a round-robin manner. If the to vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitioned buffers based on the partitioning key. A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitioned buffers in a round-robin manner. If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitioned buffers based on the partitioning key. Buckets \u00b6 Bucket is a K/V store (or a pair of stores) used for watermark propagation. There are 3 types of buckets in a pipeline: Edge Bucket : Each edge has a bucket, used for edge watermark propagation, no matter if the vertex that the edge leads to is a Map or a Reduce. The naming convention of an edge bucket is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Source Bucket : Each Source vertex has a source bucket, used for source watermark propagation. The naming convention of a source bucket is {pipeline-name}-{vertex-name}-SOURCE . Sink Bucket : Sitting on the right side of a Sink vertex, used for sink watermark. The naming convention of a sink bucket is {pipeline-name}-{vertex-name}-SINK . Diagrams \u00b6 Map Reduce","title":"Edges, Buffers and Buckets"},{"location":"specifications/edges-buffers-buckets/#edges-buffers-and-buckets","text":"This document describes the concepts of Edge , Buffer and Bucket in a pipeline.","title":"Edges, Buffers and Buckets"},{"location":"specifications/edges-buffers-buckets/#edges","text":"Edge is the connection between the vertices, specifically, edge is defined in the pipeline spec under .spec.edges .
No matter if the to vertex is a Map, or a Reduce with multiple partitions, it is considered as one edge. In the following pipeline, there are 3 edges defined ( in - atoi , atoi - compute-sum , compute-sum - out ). apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : even-odd-sum spec : vertices : - name : in source : http : {} - name : atoi scale : min : 1 udf : container : image : quay.io/numaio/numaflow-go/map-even-odd:v0.5.0 - name : compute-sum partitions : 2 udf : container : image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true - name : out scale : min : 1 sink : log : {} edges : - from : in to : atoi - from : atoi to : compute-sum - from : compute-sum to : out Each edge could have a name for internal usage, the naming convention is {pipeline-name}-{from-vertex-name}-{to-vertex-name} .","title":"Edges"},{"location":"specifications/edges-buffers-buckets/#buffers","text":"Buffer is InterStepBuffer . Each buffer has an owner, which is the vertex that reads from it. Each udf and sink vertex in a pipeline owns a group of partitioned buffers. Each buffer has a name with the naming convention {pipeline-name}-{vertex-name}-{index} , where the index is the partition index, starting from 0. This naming convention applies to the buffers of both map and reduce udf vertices. When multiple vertices connect to the same vertex, if the to vertex is a Map, the data from all the from vertices will be forwarded to the group of partitioned buffers in a round-robin manner. If the to vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitioned buffers based on the partitioning key. A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitioned buffers in a round-robin manner.
If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitioned buffers based on the partitioning key.","title":"Buffers"},{"location":"specifications/edges-buffers-buckets/#buckets","text":"Bucket is a K/V store (or a pair of stores) used for watermark propagation. There are 3 types of buckets in a pipeline: Edge Bucket : Each edge has a bucket, used for edge watermark propagation, no matter if the vertex that the edge leads to is a Map or a Reduce. The naming convention of an edge bucket is {pipeline-name}-{from-vertex-name}-{to-vertex-name} . Source Bucket : Each Source vertex has a source bucket, used for source watermark propagation. The naming convention of a source bucket is {pipeline-name}-{vertex-name}-SOURCE . Sink Bucket : Sitting on the right side of a Sink vertex, used for sink watermark. The naming convention of a sink bucket is {pipeline-name}-{vertex-name}-SINK .","title":"Buckets"},{"location":"specifications/edges-buffers-buckets/#diagrams","text":"Map Reduce","title":"Diagrams"},{"location":"specifications/overview/","text":"Numaflow Dataplane High-Level Architecture \u00b6 Synopsis \u00b6 Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. Reliable data processing is highly desirable and exactly-once semantics is often required by many data processing applications. This document describes the use cases, requirements, and design for providing exactly-once semantics with Numaflow. Use Cases Continuous stream processing for unbounded streams. Efficient batch processing for bounded streams and data sets. Definitions \u00b6 Pipeline A pipeline contains multiple processors, which include source processors, data processors, and sink processors. These processors are not connected directly, but through inter-step buffers . 
Source The actual source for the data (not a step in the Numaflow). Sink The actual sink for the data (not a step in the Numaflow). Inter-Step Buffers Inter-step buffers are used to connect processors and they should support the following Durability Support offsets Support transactions for Exactly-Once forwarding Concurrent operations (reader group) Ability to explicitly ack each data/offset Claim pending messages (read but never acknowledged) Ability to trim (buffer size controls) Fast (high throughput and low latency) Ability to query buffer information (observability) Source Processors Source processors are the initial processors that ingest data into the Numaflow. They sit in front of the first data processor, ingest the data from the data source, and forward to inter-step buffers. Logic: Read data from the data source; Write to the inter-step buffer; Ack the data in the data source. Data Processors The data processors execute idempotent user-defined functions and will be sandwiched between source and sink processors. There could be one or more data processors. A data processor only reads from one upstream buffer, but it might write to multiple downstream buffers. Logic: Read data from the upstream inter-step buffer; Process data; Write to downstream inter-step buffers; Ack the data in the upstream buffer. Sink Processors Sink processors are the final processors used to write processed data to sinks. A sink processor only reads from one upstream buffer and writes to a single sink. Logic: Read data from the upstream inter-step buffer; Write to the sink; Ack the data in the upstream buffer. UDF (User-defined Function) User-defined Functions run in data processors. UDFs implement a unified interface to process data. UDFs are typically implemented by end-users, but there will be some built-in functions that can be used without writing any code.
UDFs can be implemented in different languages, a pseudo-interface might look like the below, where the function signatures include step context and input payload and returns a result. The Result contains the processed data as well as optional labels that will be exposed to the DSL to do complex conditional forwarding. Process(key, message, context) (result, err) UDFs should only focus on user logic, buffer message reading and writing should not be handled by this function. UDFs should be idempotent. Matrix of Operations Source Processor Sink ReadFromBuffer Read From Source Generic Generic CallUDF Void User Defined Void Forward Generic Generic Write To Sink Ack Ack Source Generic Generic Requirements \u00b6 Exactly once semantics from the source processor to the sink processor. Be able to support a variety of data buffering technologies. Numaflow is restartable if aborted or steps fail while preserving exactly-once semantics. Do not generate more output than can be used by the next stage in a reasonable amount of time, i.e., the size of buffers between steps should be limited, (aka backpressure). User code should be isolated from offset management, restart, exactly once, backpressure, etc. Streaming process systems inherently require a concept of time, this time will be either derived from the Source (LOG_APPEND_TIME in Kafka, etc.) or will be inserted at ingestion time if the source doesn't provide it. Every processor is connected by an inter-step buffer. Source processors add a \"header\" to each \"item\" received from the source in order to: Uniquely identify the item for implementing exactly-once Uniquely identify the source of the message. Sink processors should avoid writing output for the same input when possible. Numaflow should support the following types of flows: Line Tree Diamond (In Future) Multiple Sources with the same schema (In Future) Non-Requirements \u00b6 Support for non-idempotent data processors (UDFs?) 
Distributed transactions/checkpoints are not needed Open Issues \u00b6 None Closed Issues \u00b6 In order to be able to support various buffering technologies, we will persist and manage stream \"offsets\" rather than relying on the buffering technology (e.g., Kafka) Each processor may persist state associated with their processing no distributed transactions are needed for checkpointing If we have a tree DAG, how will we manage acknowledgments? We will use back-pressure and exactly-once semantics on the buffer to solve it. How/where will offsets be persisted? Buffer will have a \"lookup - insert - update\" as a txn What will be used to implement the inter-step buffers between processors? The interface is abstracted out, but internally we will use Redis Streams (supports streams, hash, txn) Design Details \u00b6 Duplicates \u00b6 Numaflow (like any other stream processing engine) at its core has Read -> Process -> Forward -> Acknowledge loop for every message it has to process. Given that the user-defined process is idempotent, there are two failure mode scenarios where there could be duplicates. The message has been forwarded but the information failed to reach back (we do not know whether we really have successfully forwarded the message). A retry on forwarding again could lead to duplication. Acknowledgment has been sent back to the source buffer, but we do not know whether we have really acknowledged the successful processing of the message. A retry on reading could end up in duplications (both in processing and forwarding, but we need to worry only about forwarding because processing is idempotent). To detect duplicates, make sure the delivery is Exactly-Once: A unique and immutable identifier for the message from the upstream buffer will be used as the key of the data in the downstream buffer Best effort of the transactional commit. Data processors make transactional commits for data forwarding to the next buffer, and upstream buffer acknowledgment.
Source processors have no way to do similar transactional operations for data source message acknowledgment and message forwarding, but #1 will make sure there's no duplicate after retrying in case of failure. Sink processors cannot do transactional operations unless there's a contract between Numaflow and the sink, which is out of the scope of this doc. We will rely on the sink to implement this (e.g., \"enable.idempotence\" in Kafka producer). Unique Identifier for Message \u00b6 To detect duplicates, we first need to uniquely identify each message. We will be relying on the \"identifier\" available (e.g., \"offset\" in Kafka) in the buffer to uniquely identify each message. If such an identifier is not available, we will be creating a unique identifier (sequence numbers are tough because there are multiple readers). We can use this unique identifier to ensure that we forward only if the message has not been forwarded yet. We will only look back for a fixed window of time since this is a stream processing application on an unbounded stream of data, and we do not have infinite resources. The same offset will not be used across all the steps in Numaflow, but we will be using the current offset only while forwarding to the next step. Step N will use step N-1th offset to deduplicate. This requires each step to generate a unique ID. The reason we are not sticking to the original offset is because there will be operations in future which will require, say aggregations, where multiple messages will be grouped together and we will not be able to choose an offset from the original messages because the single output is based on multiple messages. Restarting After a Failure \u00b6 Numaflow needs to be able to recover from the failure of any step (pods) or even the complete failure of the Numaflow while preserving exactly-once semantics.
When a message is successfully processed by a processor, it should have been written to the downstream buffer, and its status in the upstream buffer becomes \"Acknowledged\". So when a processor restarts, it checks if any message assigned to it in the upstream buffer is in the \"In-Flight\" state, if yes, it will read and process those messages before picking up other messages. Processing those messages follows the flowchart above, which makes sure they will only be processed once. Back Pressure \u00b6 The durable buffers allocated to the processors are not infinite but have a bounded buffer. Backpressure handling in Numaflow utilizes the buffer. At any time t, the durable buffer should contain messages in the following states: Acked messages - processed messages to be deleted Inflight messages - messages being handled by downstream processor Pending messages - messages to be read by the downstream processor The buffer acts like a sliding window, new messages will always be written to the right, and there's some automation to clean up the acknowledged messages on the left. If the processor is too slow, the pending messages will buffer up, and the space available for writing will become limited. Every time (or periodically for better throughput) before the upstream processor writes a message to the buffer, it checks if there's any available space, or else it stops writing (or slows down the processing while approaching the buffer limit). 
This buffer pressure will then pass back to the beginning of the pipeline, which is the buffer used by the source processor so that the entire flow will stop (or slow down).","title":"Overview"},{"location":"specifications/overview/#numaflow-dataplane-high-level-architecture","text":"","title":"Numaflow Dataplane High-Level Architecture"},{"location":"specifications/overview/#synopsis","text":"Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. Reliable data processing is highly desirable and exactly-once semantics is often required by many data processing applications. This document describes the use cases, requirements, and design for providing exactly-once semantics with Numaflow. Use Cases Continuous stream processing for unbounded streams. Efficient batch processing for bounded streams and data sets.","title":"Synopsis"},{"location":"specifications/overview/#definitions","text":"Pipeline A pipeline contains multiple processors, which include source processors, data processors, and sink processors. These processors are not connected directly, but through inter-step buffers . Source The actual source for the data (not a step in the Numaflow). Sink The actual sink for the data (not a step in the Numaflow). Inter-Step Buffers Inter-step buffers are used to connect processors and they should support the following Durability Support offsets Support transactions for Exactly-Once forwarding Concurrent operations (reader group) Ability to explicitly ack each data/offset Claim pending messages (read but never acknowledged) Ability to trim (buffer size controls) Fast (high throughput and low latency) Ability to query buffer information (observability) Source Processors Source processors are the initial processors that ingest data into the Numaflow. 
They sit in front of the first data processor, ingest the data from the data source, and forward to inter-step buffers. Logic: Read data from the data source; Write to the inter-step buffer; Ack the data in the data source. Data Processors The data processors execute idempotent user-defined functions and will be sandwiched between source and sink processors. There could be one or more data processors. A data processor only reads from one upstream buffer, but it might write to multiple downstream buffers. Logic: Read data from the upstream inter-step buffer; Process data; Write to downstream inter-step buffers; Ack the data in the upstream buffer. Sink Processors Sink processors are the final processors used to write processed data to sinks. A sink processor only reads from one upstream buffer and writes to a single sink. Logic: Read data from the upstream inter-step buffer; Write to the sink; Ack the data in the upstream buffer. UDF (User-defined Function) User-defined Functions run in data processors. UDFs implement a unified interface to process data. UDFs are typically implemented by end-users, but there will be some built-in functions that can be used without writing any code. UDFs can be implemented in different languages, a pseudo-interface might look like the below, where the function signatures include step context and input payload and returns a result. The Result contains the processed data as well as optional labels that will be exposed to the DSL to do complex conditional forwarding. Process(key, message, context) (result, err) UDFs should only focus on user logic, buffer message reading and writing should not be handled by this function. UDFs should be idempotent.
Matrix of Operations Source Processor Sink ReadFromBuffer Read From Source Generic Generic CallUDF Void User Defined Void Forward Generic Generic Write To Sink Ack Ack Source Generic Generic","title":"Definitions"},{"location":"specifications/overview/#requirements","text":"Exactly once semantics from the source processor to the sink processor. Be able to support a variety of data buffering technologies. Numaflow is restartable if aborted or steps fail while preserving exactly-once semantics. Do not generate more output than can be used by the next stage in a reasonable amount of time, i.e., the size of buffers between steps should be limited, (aka backpressure). User code should be isolated from offset management, restart, exactly once, backpressure, etc. Streaming process systems inherently require a concept of time, this time will be either derived from the Source (LOG_APPEND_TIME in Kafka, etc.) or will be inserted at ingestion time if the source doesn't provide it. Every processor is connected by an inter-step buffer. Source processors add a \"header\" to each \"item\" received from the source in order to: Uniquely identify the item for implementing exactly-once Uniquely identify the source of the message. Sink processors should avoid writing output for the same input when possible. Numaflow should support the following types of flows: Line Tree Diamond (In Future) Multiple Sources with the same schema (In Future)","title":"Requirements"},{"location":"specifications/overview/#non-requirements","text":"Support for non-idempotent data processors (UDFs?) 
Distributed transactions/checkpoints are not needed","title":"Non-Requirements"},{"location":"specifications/overview/#open-issues","text":"None","title":"Open Issues"},{"location":"specifications/overview/#closed-issues","text":"In order to be able to support various buffering technologies, we will persist and manage stream \"offsets\" rather than relying on the buffering technology (e.g., Kafka) Each processor may persist state associated with their processing no distributed transactions are needed for checkpointing If we have a tree DAG, how will we manage acknowledgments? We will use back-pressure and exactly-once semantics on the buffer to solve it. How/where will offsets be persisted? Buffer will have a \"lookup - insert - update\" as a txn What will be used to implement the inter-step buffers between processors? The interface is abstracted out, but internally we will use Redis Streams (supports streams, hash, txn)","title":"Closed Issues"},{"location":"specifications/overview/#design-details","text":"","title":"Design Details"},{"location":"specifications/overview/#duplicates","text":"Numaflow (like any other stream processing engine) at its core has Read -> Process -> Forward -> Acknowledge loop for every message it has to process. Given that the user-defined process is idempotent, there are two failure mode scenarios where there could be duplicates. The message has been forwarded but the information failed to reach back (we do not know whether we really have successfully forwarded the message). A retry on forwarding again could lead to duplication. Acknowledgment has been sent back to the source buffer, but we do not know whether we have really acknowledged the successful processing of the message. A retry on reading could end up in duplications (both in processing and forwarding, but we need to worry only about forwarding because processing is idempotent).
To detect duplicates, make sure the delivery is Exactly-Once: A unique and immutable identifier for the message from the upstream buffer will be used as the key of the data in the downstream buffer Best effort of the transactional commit. Data processors make transactional commits for data forwarding to the next buffer, and upstream buffer acknowledgment. Source processors have no way to do similar transactional operations for data source message acknowledgment and message forwarding, but #1 will make sure there's no duplicate after retrying in case of failure. Sink processors cannot do transactional operations unless there's a contract between Numaflow and the sink, which is out of the scope of this doc. We will rely on the sink to implement this (e.g., \"enable.idempotence\" in Kafka producer).","title":"Duplicates"},{"location":"specifications/overview/#unique-identifier-for-message","text":"To detect duplicates, we first need to uniquely identify each message. We will be relying on the \"identifier\" available (e.g., \"offset\" in Kafka) in the buffer to uniquely identify each message. If such an identifier is not available, we will be creating a unique identifier (sequence numbers are tough because there are multiple readers). We can use this unique identifier to ensure that we forward only if the message has not been forwarded yet. We will only look back for a fixed window of time since this is a stream processing application on an unbounded stream of data, and we do not have infinite resources. The same offset will not be used across all the steps in Numaflow, but we will be using the current offset only while forwarding to the next step. Step N will use step N-1th offset to deduplicate. This requires each step to generate a unique ID.
The reason we are not sticking to the original offset is because there will be operations in future which will require, say aggregations, where multiple messages will be grouped together and we will not be able to choose an offset from the original messages because the single output is based on multiple messages.","title":"Unique Identifier for Message"},{"location":"specifications/overview/#restarting-after-a-failure","text":"Numaflow needs to be able to recover from the failure of any step (pods) or even the complete failure of the Numaflow while preserving exactly-once semantics. When a message is successfully processed by a processor, it should have been written to the downstream buffer, and its status in the upstream buffer becomes \"Acknowledged\". So when a processor restarts, it checks if any message assigned to it in the upstream buffer is in the \"In-Flight\" state, if yes, it will read and process those messages before picking up other messages. Processing those messages follows the flowchart above, which makes sure they will only be processed once.","title":"Restarting After a Failure"},{"location":"specifications/overview/#back-pressure","text":"The durable buffers allocated to the processors are not infinite but have a bounded buffer. Backpressure handling in Numaflow utilizes the buffer. At any time t, the durable buffer should contain messages in the following states: Acked messages - processed messages to be deleted Inflight messages - messages being handled by downstream processor Pending messages - messages to be read by the downstream processor The buffer acts like a sliding window, new messages will always be written to the right, and there's some automation to clean up the acknowledged messages on the left. If the processor is too slow, the pending messages will buffer up, and the space available for writing will become limited. 
Every time (or periodically for better throughput) before the upstream processor writes a message to the buffer, it checks if there's any available space, or else it stops writing (or slows down the processing while approaching the buffer limit). This buffer pressure will then pass back to the beginning of the pipeline, which is the buffer used by the source processor so that the entire flow will stop (or slow down).","title":"Back Pressure"},{"location":"specifications/side-inputs/","text":"Side Inputs \u00b6 Side Inputs allow the user-defined functions (including UDF, UDSink, Transformer, etc.) to access slow updated data or configuration (such as database, file system, etc.) without needing to load it during each message processing. Side Inputs are read-only and can be used in both batch and streaming jobs. Requirements \u00b6 The Side Inputs should be programmable with any language. The Side Inputs should be updated centralized (for a pipeline), and be able to broadcast to each of the vertex pods in an efficient manner. The Side Inputs update could be based on a configurable interval. Assumptions \u00b6 Size of a Side Input data could be up to 1MB. The Side Inputs data is updated at a low frequency (minutes level). As a platform, Numaflow has no idea about the data format of the Side Inputs, instead, the pipeline owner (programmer) is responsible for parsing the data. Design Proposal \u00b6 Data Format \u00b6 Numaflow processes the Side Inputs data as bytes array, thus there\u2019s no data format requirement for it, the pipeline developers are supposed to parse the Side Inputs data from bytes array to any format they expect. Architecture \u00b6 There will be the following components introduced when a pipeline has Side Inputs enabled. A Side Inputs Manager - a service for Side Inputs data updating. A Side Inputs watcher sidecar - a container enabled for each of the vertex pods to receive updated Side Inputs. 
Side Inputs data store - a data store to store the latest Side Inputs data. Data Store \u00b6 Data store is the place where the latest retrieved Side Inputs data stays. The data is published by the Side Inputs Manager after retrieving from the Side Inputs data source, and consumed by each of the vertex Pods. The data store implementation could be a Key/Value store in JetStream, which by default supports maximum 1MB - 64MB size data. Extended implementation could be Key/Value store + object store, which makes it possible to store large sizes of data. Data Store management is supposed to be done by the controller, through the same Kubernetes Job to create/delete Inter-Step Buffers and Buckets. Side Inputs Manager \u00b6 A Side Inputs Manager is a pod (or a group of pods with active-passive HA), created by the Numaflow controller, used to run cron like jobs to retrieve the Side Inputs data and save to a data store. Each Side Inputs Manager is only responsible for its corresponding pipeline, and is only created when Side Inputs is enabled for the pipeline. A pipeline may have multiple Side Inputs sources, each of them will have a Side Inputs Manager. Each of the Side Inputs Manager pods contains: An init container, which checks if the data store is ready. A user-defined container, which runs a predefined Numaflow SDK to start a service, calling a user implemented function to get Side Input data. A numa container, which runs a cron like job to call the service in the user-defined container, and store the returned data in the data store. The communication protocol between the 2 containers could be based on UDS or FIFO (TBD). High Availability \u00b6 Side Inputs Manager needs to run with Active-Passive HA, which requires a leader election mechanism support. Kubernetes has a native leader election API backed by etcd, but it requires extra RBAC privileges to use it.
Considering a similar leader election mechanism is needed in some other scenarios such as Active-Passive User-defined Source, a proposal is to implement our own leader election mechanism by leveraging ISB Service. Why NOT CronJob? \u00b6 Using Kubernetes CronJob could also achieve the cron like job orchestration, but there are a few downsides. A K8s Job has to be used together with the CronJob to solve the immediate starting problem - A CronJob cannot trigger a job immediately after it\u2019s created, it has to wait until the first trigger condition is met. Using K8s CronJob/Job will be a challenge when ServiceMesh (Istio) is enabled. Vertex Pod Sidecar \u00b6 When Side Inputs is enabled for a pipeline, each of its vertex pods will have a second init container added, the init container will have a shared volume (emptyDir) mounted, and the same volume will be mounted to the User-defined Function/Sink/Transformer container. The init container reads from the data store, and saves to the shared volume. A sidecar container will also be injected by the controller, and it mounts the same volume as above. The sidecar runs a service provided by numaflow, watching the Side Inputs data from the data store, if there\u2019s any update, reads the data and updates the shared volume. In the User-defined Function/Sink/Transformer container, a helper function will be provided by Numaflow SDK, to return the Side Input data. The helper function caches the Side Inputs data in the memory, but performs thread safe updates if it watches the changes in the shared volume. Numaflow SDK \u00b6 Some new features will be added to the Numaflow SDK. Interface for the users to implement the Side Inputs retrieval. A pseudo interface might look like below. RetrieveSideInput () ([] byte , error ) A main function to start the service in the Side Inputs Manager user container.
A helper function to be used in the udf/udsink/transformer containers to get the Side Inputs, which reads, watches and caches the data from the shared volume. SideInput [ T any ]( name string , parseFunc func ([] byte ) ( T , error )) ( T , error ) User Spec \u00b6 Side Inputs support is exposed through sideInputs in the pipeline spec, it\u2019s updated based on cron like schedule, specified in the pipeline spec with a trigger field. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : devPortal container : image : my-sideinputs-image:v1 trigger : schedule : \"*/15 * * * *\" # interval: 180s # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - devPortal Open Issues \u00b6 To support multiple ways to trigger Side Inputs updating other than cron only? Event based side inputs where the changes are coming via a stream?","title":"Side Inputs"},{"location":"specifications/side-inputs/#side-inputs","text":"Side Inputs allow the user-defined functions (including UDF, UDSink, Transformer, etc.) to access slow updated data or configuration (such as database, file system, etc.) without needing to load it during each message processing. Side Inputs are read-only and can be used in both batch and streaming jobs.","title":"Side Inputs"},{"location":"specifications/side-inputs/#requirements","text":"The Side Inputs should be programmable with any language. The Side Inputs should be updated centralized (for a pipeline), and be able to broadcast to each of the vertex pods in an efficient manner. The Side Inputs update could be based on a configurable interval.","title":"Requirements"},{"location":"specifications/side-inputs/#assumptions","text":"Size of a Side Input data could be up to 1MB. The Side Inputs data is updated at a low frequency (minutes level). 
As a platform, Numaflow has no idea about the data format of the Side Inputs, instead, the pipeline owner (programmer) is responsible for parsing the data.","title":"Assumptions"},{"location":"specifications/side-inputs/#design-proposal","text":"","title":"Design Proposal"},{"location":"specifications/side-inputs/#data-format","text":"Numaflow processes the Side Inputs data as bytes array, thus there\u2019s no data format requirement for it, the pipeline developers are supposed to parse the Side Inputs data from bytes array to any format they expect.","title":"Data Format"},{"location":"specifications/side-inputs/#architecture","text":"There will be the following components introduced when a pipeline has Side Inputs enabled. A Side Inputs Manager - a service for Side Inputs data updating. A Side Inputs watcher sidecar - a container enabled for each of the vertex pods to receive updated Side Inputs. Side Inputs data store - a data store to store the latest Side Inputs data.","title":"Architecture"},{"location":"specifications/side-inputs/#data-store","text":"Data store is the place where the latest retrieved Side Inputs data stays. The data is published by the Side Inputs Manager after retrieving from the Side Inputs data source, and consumed by each of the vertex Pods. The data store implementation could be a Key/Value store in JetStream, which by default supports maximum 1MB - 64MB size data. Extended implementation could be Key/Value store + object store, which makes it possible to store large sizes of data. Data Store management is supposed to be done by the controller, through the same Kubernetes Job to create/delete Inter-Step Buffers and Buckets.","title":"Data Store"},{"location":"specifications/side-inputs/#side-inputs-manager","text":"A Side Inputs Manager is a pod (or a group of pods with active-passive HA), created by the Numaflow controller, used to run cron like jobs to retrieve the Side Inputs data and save to a data store. 
Each Side Inputs Manager is only responsible for its corresponding pipeline, and is only created when Side Inputs is enabled for the pipeline. A pipeline may have multiple Side Inputs sources, each of which will have a Side Inputs Manager. Each of the Side Inputs Manager pods contains: An init container, which checks if the data store is ready. A user-defined container, which runs a predefined Numaflow SDK to start a service, calling a user implemented function to get Side Input data. A numa container, which runs a cron like job to call the service in the user-defined container, and store the returned data in the data store. The communication protocol between the 2 containers could be based on UDS or FIFO (TBD).","title":"Side Inputs Manager"},{"location":"specifications/side-inputs/#high-availability","text":"Side Inputs Manager needs to run with Active-Passive HA, which requires leader election support. Kubernetes has a native leader election API backed by etcd, but it requires extra RBAC privileges to use it. Considering a similar leader election mechanism is needed in some other scenarios such as Active-Passive User-defined Source, a proposal is to implement our own leader election mechanism by leveraging ISB Service.","title":"High Availability"},{"location":"specifications/side-inputs/#why-not-cronjob","text":"Using Kubernetes CronJob could also achieve the cron like job orchestration, but there are a few downsides. A K8s Job has to be used together with the CronJob to solve the immediate starting problem - A CronJob cannot trigger a job immediately after it\u2019s created, it has to wait until the first trigger condition is met.
Using K8s CronJob/Job will be a challenge when ServiceMesh (Istio) is enabled.","title":"Why NOT CronJob?"},{"location":"specifications/side-inputs/#vertex-pod-sidecar","text":"When Side Inputs is enabled for a pipeline, each of its vertex pods will have a second init container added, the init container will have a shared volume (emptyDir) mounted, and the same volume will be mounted to the User-defined Function/Sink/Transformer container. The init container reads from the data store, and saves to the shared volume. A sidecar container will also be injected by the controller, and it mounts the same volume as above. The sidecar runs a service provided by Numaflow, watching the Side Inputs data from the data store, if there\u2019s any update, reads the data and updates the shared volume. In the User-defined Function/Sink/Transformer container, a helper function will be provided by Numaflow SDK, to return the Side Input data. The helper function caches the Side Inputs data in memory, and performs thread-safe updates when it detects changes in the shared volume.","title":"Vertex Pod Sidecar"},{"location":"specifications/side-inputs/#numaflow-sdk","text":"Some new features will be added to the Numaflow SDK. Interface for the users to implement the Side Inputs retrieval. A pseudo interface might look like below. RetrieveSideInput () ([] byte , error ) A main function to start the service in the Side Inputs Manager user container. A helper function to be used in the udf/udsink/transformer containers to get the Side Inputs, which reads, watches and caches the data from the shared volume. SideInput [ T any ]( name string , parseFunc func ([] byte ) ( T , error )) ( T , error )","title":"Numaflow SDK"},{"location":"specifications/side-inputs/#user-spec","text":"Side Inputs support is exposed through sideInputs in the pipeline spec, it\u2019s updated based on cron like schedule, specified in the pipeline spec with a trigger field.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : devPortal container : image : my-sideinputs-image:v1 trigger : schedule : \"*/15 * * * *\" # interval: 180s # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - devPortal","title":"User Spec"},{"location":"specifications/side-inputs/#open-issues","text":"To support multiple ways to trigger Side Inputs updating other than cron only? Event based side inputs where the changes are coming via a stream?","title":"Open Issues"},{"location":"user-guide/reference/autoscaling/","text":"Autoscaling \u00b6 Numaflow is able to run with both Horizontal Pod Autoscaling and Vertical Pod Autoscaling . Horizontal Pod Autoscaling \u00b6 Horizontal Pod Autoscaling approaches supported in Numaflow include: Numaflow Autoscaling Kubernetes HPA Third Party Autoscaling (such as KEDA ) Numaflow Autoscaling \u00b6 Numaflow provides 0 - N autoscaling capability out of the box, it's available for all the UDF , Sink and most of the Source vertices (please check each source for more details). Numaflow autoscaling is enabled by default, and some parameters can be tuned to achieve better results. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : false # Optional, defaults to false. min : 0 # Optional, minimum replicas, defaults to 0. max : 20 # Optional, maximum replicas, defaults to 50. lookbackSeconds : 120 # Optional, defaults to 120. scaleUpCooldownSeconds : 90 # Optional, defaults to 90. scaleDownCooldownSeconds : 90 # Optional, defaults to 90. zeroReplicaSleepSeconds : 120 # Optional, defaults to 120. targetProcessingSeconds : 20 # Optional, defaults to 20. targetBufferAvailability : 50 # Optional, defaults to 50. replicasPerScale : 2 # Optional, defaults to 2. disabled - Whether to disable Numaflow autoscaling, defaults to false .
min - Minimum replicas, valid value could be an integer >= 0. Defaults to 0 , which means it could be scaled down to 0. max - Maximum replicas, positive integer which should not be less than min , defaults to 50 . If max and min are the same, that will be the fixed replica number. lookbackSeconds - How many seconds to look back for vertex average processing rate (tps) and pending messages calculation, defaults to 120 . Rate and pending messages metrics are critical for autoscaling, you might need to tune this parameter a bit to see better results. For example, suppose your data source only has 1 minute of data input in every 5 minutes, and you don't want the vertices to be scaled down to 0 . In this case, you need to increase lookbackSeconds to cover the full 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, which prevents scaling down to 0. scaleUpCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling up, defaults to 90 . Please make sure that the time is greater than the time it takes for the pod to be Running and start processing, because the autoscaling algorithm will divide the TPS by the number of pods even if the pod is not Running . scaleDownCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling down, defaults to 90 . zeroReplicaSleepSeconds - After scaling a source vertex replicas down to 0 , how many seconds to wait before scaling up to 1 replica to peek, defaults to 120 . Numaflow autoscaler periodically scales up a source vertex pod to \"peek\" the incoming data, this is the period of time to wait before peeking. targetProcessingSeconds - It is used to tune the aggressiveness of autoscaling for source vertices, it measures how fast you want the vertex to process all the pending messages, defaults to 20 .
It is only effective for the Source vertices that support autoscaling, typically increasing the value leads to lower processing rate, thus fewer replicas. targetBufferAvailability - Targeted buffer availability in percentage, defaults to 50 . It is only effective for UDF and Sink vertices, it determines how aggressive you want autoscaling to be, increasing the value will bring more replicas. replicasPerScale - Maximum number of replicas changed in one scale up or down operation, defaults to 2 . For example, if current replica number is 3, the calculated desired replica number is 8; instead of scaling up the vertex to 8, it only scales up to 5. To disable Numaflow autoscaling, set disabled: true as follows. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : true Notes Numaflow autoscaling does not apply to reduce vertices, and the source vertices which do not have a way to calculate their pending messages. Generator HTTP Nats For User-defined Sources, if the function Pending() returns a negative value, autoscaling will not be applied. Kubernetes HPA \u00b6 Kubernetes HPA is supported in Numaflow for any type of Vertex. To use HPA, remember to point the scaleTargetRef to the vertex as below, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : autoscaling/v2 kind : HorizontalPodAutoscaler metadata : name : my-vertex-hpa spec : minReplicas : 1 maxReplicas : 3 metrics : - resource : name : cpu target : type : Utilization averageUtilization : 50 type : Resource scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex With the configuration above, Kubernetes HPA controller will keep the target utilization of the pods of the Vertex at 50%. Kubernetes HPA autoscaling is useful for those Source vertices not able to count pending messages, such as HTTP .
Third Party Autoscaling \u00b6 Third party autoscaling tools like KEDA are also supported in Numaflow, which can be used to autoscale any type of vertex with the scalers it supports. To use KEDA for vertex autoscaling, same as Kubernetes HPA, point the scaleTargetRef to your vertex, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : keda.sh/v1alpha1 kind : ScaledObject metadata : name : my-keda-scaler spec : scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex ... ... Vertical Pod Autoscaling \u00b6 Vertical Pod Autoscaling can be achieved by setting the targetRef to Vertex objects as follows. spec : targetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex","title":"Autoscaling"},{"location":"user-guide/reference/autoscaling/#autoscaling","text":"Numaflow is able to run with both Horizontal Pod Autoscaling and Vertical Pod Autoscaling .","title":"Autoscaling"},{"location":"user-guide/reference/autoscaling/#horizontal-pod-autoscaling","text":"Horizontal Pod Autoscaling approaches supported in Numaflow include: Numaflow Autoscaling Kubernetes HPA Third Party Autoscaling (such as KEDA )","title":"Horizontal Pod Autoscaling"},{"location":"user-guide/reference/autoscaling/#numaflow-autoscaling","text":"Numaflow provides 0 - N autoscaling capability out of the box, it's available for all the UDF , Sink and most of the Source vertices (please check each source for more details). Numaflow autoscaling is enabled by default, and some parameters can be tuned to achieve better results. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : false # Optional, defaults to false. min : 0 # Optional, minimum replicas, defaults to 0. max : 20 # Optional, maximum replicas, defaults to 50. lookbackSeconds : 120 # Optional, defaults to 120. scaleUpCooldownSeconds : 90 # Optional, defaults to 90.
scaleDownCooldownSeconds : 90 # Optional, defaults to 90. zeroReplicaSleepSeconds : 120 # Optional, defaults to 120. targetProcessingSeconds : 20 # Optional, defaults to 20. targetBufferAvailability : 50 # Optional, defaults to 50. replicasPerScale : 2 # Optional, defaults to 2. disabled - Whether to disable Numaflow autoscaling, defaults to false . min - Minimum replicas, valid value could be an integer >= 0. Defaults to 0 , which means it could be scaled down to 0. max - Maximum replicas, positive integer which should not be less than min , defaults to 50 . If max and min are the same, that will be the fixed replica number. lookbackSeconds - How many seconds to look back for vertex average processing rate (tps) and pending messages calculation, defaults to 120 . Rate and pending messages metrics are critical for autoscaling, you might need to tune this parameter a bit to see better results. For example, suppose your data source only has 1 minute of data input in every 5 minutes, and you don't want the vertices to be scaled down to 0 . In this case, you need to increase lookbackSeconds to cover the full 5 minutes, so that the calculated average rate and pending messages won't be 0 during the silent period, which prevents scaling down to 0. scaleUpCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling up, defaults to 90 . Please make sure that the time is greater than the time it takes for the pod to be Running and start processing, because the autoscaling algorithm will divide the TPS by the number of pods even if the pod is not Running . scaleDownCooldownSeconds - After a scaling operation, how many seconds to wait for the same vertex, if the follow-up operation is a scaling down, defaults to 90 . zeroReplicaSleepSeconds - After scaling a source vertex replicas down to 0 , how many seconds to wait before scaling up to 1 replica to peek, defaults to 120 .
Numaflow autoscaler periodically scales up a source vertex pod to \"peek\" the incoming data, this is the period of time to wait before peeking. targetProcessingSeconds - It is used to tune the aggressiveness of autoscaling for source vertices, it measures how fast you want the vertex to process all the pending messages, defaults to 20 . It is only effective for the Source vertices that support autoscaling, typically increasing the value leads to lower processing rate, thus fewer replicas. targetBufferAvailability - Targeted buffer availability in percentage, defaults to 50 . It is only effective for UDF and Sink vertices, it determines how aggressive you want autoscaling to be, increasing the value will bring more replicas. replicasPerScale - Maximum number of replicas changed in one scale up or down operation, defaults to 2 . For example, if current replica number is 3, the calculated desired replica number is 8; instead of scaling up the vertex to 8, it only scales up to 5. To disable Numaflow autoscaling, set disabled: true as follows. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex scale : disabled : true Notes Numaflow autoscaling does not apply to reduce vertices, and the source vertices which do not have a way to calculate their pending messages. Generator HTTP Nats For User-defined Sources, if the function Pending() returns a negative value, autoscaling will not be applied.","title":"Numaflow Autoscaling"},{"location":"user-guide/reference/autoscaling/#kubernetes-hpa","text":"Kubernetes HPA is supported in Numaflow for any type of Vertex. To use HPA, remember to point the scaleTargetRef to the vertex as below, and disable Numaflow autoscaling in your Pipeline spec.
apiVersion : autoscaling/v2 kind : HorizontalPodAutoscaler metadata : name : my-vertex-hpa spec : minReplicas : 1 maxReplicas : 3 metrics : - resource : name : cpu target : type : Utilization averageUtilization : 50 type : Resource scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex With the configuration above, Kubernetes HPA controller will keep the target utilization of the pods of the Vertex at 50%. Kubernetes HPA autoscaling is useful for those Source vertices not able to count pending messages, such as HTTP .","title":"Kubernetes HPA"},{"location":"user-guide/reference/autoscaling/#third-party-autoscaling","text":"Third party autoscaling tools like KEDA are also supported in Numaflow, which can be used to autoscale any type of vertex with the scalers it supports. To use KEDA for vertex autoscaling, same as Kubernetes HPA, point the scaleTargetRef to your vertex, and disable Numaflow autoscaling in your Pipeline spec. apiVersion : keda.sh/v1alpha1 kind : ScaledObject metadata : name : my-keda-scaler spec : scaleTargetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex ... ...","title":"Third Party Autoscaling"},{"location":"user-guide/reference/autoscaling/#vertical-pod-autoscaling","text":"Vertical Pod Autoscaling can be achieved by setting the targetRef to Vertex objects as follows. spec : targetRef : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Vertex name : my-vertex","title":"Vertical Pod Autoscaling"},{"location":"user-guide/reference/conditional-forwarding/","text":"Conditional Forwarding \u00b6 After processing the data, conditional forwarding is doable based on the Tags returned in the result. Below is a list of different logic operations that can be done on tags. - and - forwards the message if all the tags specified are present in Message's tags. - or - forwards the message if one of the tags specified is present in Message's tags.
- not - forwards the message if all the tags specified are not present in Message's tags. For example, there's a UDF used to process numbers, which forwards the result to different vertices based on whether the number is even or odd. In this case, you can set the tag to even-tag or odd-tag in each of the returned messages, and define the edges as below: edges : - from : p1 to : even-vertex conditions : tags : operator : or # Optional, defaults to \"or\". values : - even-tag - from : p1 to : odd-vertex conditions : tags : operator : not values : - odd-tag - from : p1 to : all conditions : tags : operator : and values : - odd-tag - even-tag","title":"Conditional Forwarding"},{"location":"user-guide/reference/conditional-forwarding/#conditional-forwarding","text":"After processing the data, conditional forwarding is doable based on the Tags returned in the result. Below is a list of different logic operations that can be done on tags. - and - forwards the message if all the tags specified are present in Message's tags. - or - forwards the message if one of the tags specified is present in Message's tags. - not - forwards the message if all the tags specified are not present in Message's tags. For example, there's a UDF used to process numbers, which forwards the result to different vertices based on whether the number is even or odd. In this case, you can set the tag to even-tag or odd-tag in each of the returned messages, and define the edges as below: edges : - from : p1 to : even-vertex conditions : tags : operator : or # Optional, defaults to \"or\". values : - even-tag - from : p1 to : odd-vertex conditions : tags : operator : not values : - odd-tag - from : p1 to : all conditions : tags : operator : and values : - odd-tag - even-tag","title":"Conditional Forwarding"},{"location":"user-guide/reference/edge-tuning/","text":"Edge Tuning \u00b6 Drop message onFull \u00b6 We need to have an edge level setting to drop the messages if the buffer.isFull == true .
Even if the UDF or UDSink drops a message due to some internal error in the user-defined code, the processing latency will spike up causing a natural back pressure. A kill switch to drop messages can help alleviate/avoid any repercussions on the rest of the DAG. This setting is an edge-level setting and can be enabled by onFull and the default is retryUntilSuccess (other option is discardLatest ). This is a data loss scenario but can be useful in cases where we are doing user-introduced experimentations, like A/B testing, on the pipeline. It is totally okay for the experimentation side of the DAG to have data loss while the production is unaffected. discardLatest \u00b6 Setting onFull to discardLatest will drop the message on the floor if the edge is full. edges : - from : a to : b onFull : discardLatest retryUntilSuccess \u00b6 The default setting for onFull is retryUntilSuccess , which will make sure the message is retried until successful. edges : - from : a to : b onFull : retryUntilSuccess","title":"Edge Tuning"},{"location":"user-guide/reference/edge-tuning/#edge-tuning","text":"","title":"Edge Tuning"},{"location":"user-guide/reference/edge-tuning/#drop-message-onfull","text":"We need to have an edge level setting to drop the messages if the buffer.isFull == true . Even if the UDF or UDSink drops a message due to some internal error in the user-defined code, the processing latency will spike up causing a natural back pressure. A kill switch to drop messages can help alleviate/avoid any repercussions on the rest of the DAG. This setting is an edge-level setting and can be enabled by onFull and the default is retryUntilSuccess (other option is discardLatest ). This is a data loss scenario but can be useful in cases where we are doing user-introduced experimentations, like A/B testing, on the pipeline.
It is totally okay for the experimentation side of the DAG to have data loss while the production is unaffected.","title":"Drop message onFull"},{"location":"user-guide/reference/edge-tuning/#discardlatest","text":"Setting onFull to discardLatest will drop the message on the floor if the edge is full. edges : - from : a to : b onFull : discardLatest","title":"discardLatest"},{"location":"user-guide/reference/edge-tuning/#retryuntilsuccess","text":"The default setting for onFull is retryUntilSuccess , which will make sure the message is retried until successful. edges : - from : a to : b onFull : retryUntilSuccess","title":"retryUntilSuccess"},{"location":"user-guide/reference/join-vertex/","text":"Joins and Cycles \u00b6 Numaflow Pipeline Edges can be defined such that multiple Vertices can forward messages to a single vertex. Quick Start \u00b6 Please see the following examples: Join on Map Vertex Join on Reduce Vertex Join on Sink Vertex Cycle to Self Cycle to Previous Why do we need JOIN \u00b6 Without JOIN \u00b6 Without JOIN, Numaflow could only allow users to build pipelines where vertices could only read from one previous vertex. This meant that Numaflow could only support simple pipelines or tree-like pipelines. Supporting pipelines where you had to read from multiple sources or UDFs was cumbersome and required creating redundant vertices. With JOIN \u00b6 Join vertices allow users the flexibility to read from multiple sources, process data from multiple UDFs, and even write to a single sink. The Pipeline Spec doesn't change at all with JOIN, now you can create multiple Edges that have the same \u201cTo\u201d Vertex, which would have otherwise been prohibited. There is no limitation on which vertices can be joined. For instance, one can join Map or Reduce vertices as shown below: Benefits \u00b6 The introduction of Join Vertex allows users to eliminate redundancy in their pipelines.
It supports many-to-one data flow without needing multiple vertices performing the same job. Examples \u00b6 Join on Sink Vertex \u00b6 By joining the sink vertices, we now only need a single vertex responsible for sending to the data sink. Example \u00b6 Join on Sink Vertex Join on Map Vertex \u00b6 Two different Sources containing similar data that can be processed the same way can now point to a single vertex. Example \u00b6 Join on Map Vertex Join on Reduce Vertex \u00b6 This feature allows for efficient aggregation of data from multiple sources. Example \u00b6 Join on Reduce Vertex Cycles \u00b6 A special case of a \"Join\" is a Cycle (a Vertex which can send either to itself or to a previous Vertex.) An example use of this is a Map UDF which does some sort of reprocessing of data under certain conditions such as a transient error. Cycles are permitted, except in the case that there's a Reduce Vertex at or downstream of the cycle. (This is because a cycle inevitably produces late data, which would get dropped by the Reduce Vertex. For this reason, cycles should be used sparingly.) The following examples are of Cycles: Cycle to Self Cycle to Previous","title":"Joins and Cycles"},{"location":"user-guide/reference/join-vertex/#joins-and-cycles","text":"Numaflow Pipeline Edges can be defined such that multiple Vertices can forward messages to a single vertex.","title":"Joins and Cycles"},{"location":"user-guide/reference/join-vertex/#quick-start","text":"Please see the following examples: Join on Map Vertex Join on Reduce Vertex Join on Sink Vertex Cycle to Self Cycle to Previous","title":"Quick Start"},{"location":"user-guide/reference/join-vertex/#why-do-we-need-join","text":"","title":"Why do we need JOIN"},{"location":"user-guide/reference/join-vertex/#without-join","text":"Without JOIN, Numaflow could only allow users to build pipelines where vertices could only read from one previous vertex.
This meant that Numaflow could only support simple pipelines or tree-like pipelines. Supporting pipelines where you had to read from multiple sources or UDFs was cumbersome and required creating redundant vertices.","title":"Without JOIN"},{"location":"user-guide/reference/join-vertex/#with-join","text":"Join vertices allow users the flexibility to read from multiple sources, process data from multiple UDFs, and even write to a single sink. The Pipeline Spec doesn't change at all with JOIN, now you can create multiple Edges that have the same \u201cTo\u201d Vertex, which would have otherwise been prohibited. There is no limitation on which vertices can be joined. For instance, one can join Map or Reduce vertices as shown below:","title":"With JOIN"},{"location":"user-guide/reference/join-vertex/#benefits","text":"The introduction of Join Vertex allows users to eliminate redundancy in their pipelines. It supports many-to-one data flow without needing multiple vertices performing the same job.","title":"Benefits"},{"location":"user-guide/reference/join-vertex/#examples","text":"","title":"Examples"},{"location":"user-guide/reference/join-vertex/#join-on-sink-vertex","text":"By joining the sink vertices, we now only need a single vertex responsible for sending to the data sink.","title":"Join on Sink Vertex"},{"location":"user-guide/reference/join-vertex/#example","text":"Join on Sink Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#join-on-map-vertex","text":"Two different Sources containing similar data that can be processed the same way can now point to a single vertex.","title":"Join on Map Vertex"},{"location":"user-guide/reference/join-vertex/#example_1","text":"Join on Map Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#join-on-reduce-vertex","text":"This feature allows for efficient aggregation of data from multiple sources.","title":"Join on Reduce
Vertex"},{"location":"user-guide/reference/join-vertex/#example_2","text":"Join on Reduce Vertex","title":"Example"},{"location":"user-guide/reference/join-vertex/#cycles","text":"A special case of a \"Join\" is a Cycle (a Vertex which can send either to itself or to a previous Vertex.) An example use of this is a Map UDF which does some sort of reprocessing of data under certain conditions such as a transient error. Cycles are permitted, except in the case that there's a Reduce Vertex at or downstream of the cycle. (This is because a cycle inevitably produces late data, which would get dropped by the Reduce Vertex. For this reason, cycles should be used sparingly.) The following examples are of Cycles: Cycle to Self Cycle to Previous","title":"Cycles"},{"location":"user-guide/reference/multi-partition/","text":"Multi-partitioned Edges \u00b6 To achieve higher throughput (> 10K but < 30K tps), users can create multi-partitioned edges. Multi-partitioned edges are only supported for pipelines with JetStream as ISB. Please ensure that the JetStream is provisioned with more nodes to support higher throughput. Since partitions are owned by the vertex reading the data, to create a multi-partitioned edge we need to configure the vertex reading the data (to-vertex) to have multiple partitions. The following code snippet provides an example of how to configure a vertex (in this case, the cat vertex) to have multiple partitions, which enables it ( cat vertex) to read at a higher throughput. - name : cat partitions : 3 udf : builtin : name : cat # A built-in UDF which simply cats the message","title":"Multi-partitioned Edges"},{"location":"user-guide/reference/multi-partition/#multi-partitioned-edges","text":"To achieve higher throughput (> 10K but < 30K tps), users can create multi-partitioned edges. Multi-partitioned edges are only supported for pipelines with JetStream as ISB. Please ensure that the JetStream is provisioned with more nodes to support higher throughput.
Since partitions are owned by the vertex reading the data, to create a multi-partitioned edge we need to configure the vertex reading the data (to-vertex) to have multiple partitions. The following code snippet provides an example of how to configure a vertex (in this case, the cat vertex) to have multiple partitions, which enables it ( cat vertex) to read at a higher throughput. - name : cat partitions : 3 udf : builtin : name : cat # A built-in UDF which simply cats the message","title":"Multi-partitioned Edges"},{"location":"user-guide/reference/pipeline-operations/","text":"Pipeline Operations \u00b6 Update a Pipeline \u00b6 You might want to make some changes to an existing pipeline, for example, updating request CPU, or changing the minimal replicas for a vertex. Updating a pipeline is as simple as applying the new pipeline spec to the existing one. But there are some scenarios that you'd better not update the pipeline, instead, you should delete and recreate it. The scenarios include but are not limited to: Topology changes such as adding or removing vertices, or updating the edges between vertices. Updating the partitions for a keyed reduce vertex. Updating the user-defined container image for a vertex, while the new image can not properly handle the unprocessed data in its backlog. To summarize, if there are unprocessed messages in the pipeline, and the new pipeline spec will change the way how the messages are processed, then you should delete and recreate the pipeline. Pause a Pipeline \u00b6 To pause a pipeline, use the command below, it will bring the pipeline to Paused status, and terminate all the running vertex pods. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Paused\"}}}' Pausing a pipeline will not cause data loss. It does not clean up the unprocessed data in the pipeline, but just terminates the running pods. 
When the pipeline is resumed, the pods will be restarted and continue processing the unprocessed data. When pausing a pipeline, it will shutdown the source vertex pods first, and then wait for the other vertices to finish the backlog before terminating them. However, it will not wait forever and will terminate the pods after pauseGracePeriodSeconds . This is default set to 30 and can be customized by setting spec.lifecycle.pauseGracePeriodSeconds . If there's a reduce vertex in the pipeline, please make sure it uses Persistent Volume Claim for storage, otherwise the data will be lost. Resume a Pipeline \u00b6 The command below will bring the pipeline back to Running status. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Running\"}}}' Delete a Pipeline \u00b6 When deleting a pipeline, before terminating all the pods, it will try to wait for all the backlog messages that have already been ingested into the pipeline to be processed. However, it will not wait forever, if the backlog is too large, it will terminate the pods after terminationGracePeriodSeconds , which defaults to 30, and can be customized by setting spec.lifecycle.terminationGracePeriodSeconds .","title":"Pipeline Operations"},{"location":"user-guide/reference/pipeline-operations/#pipeline-operations","text":"","title":"Pipeline Operations"},{"location":"user-guide/reference/pipeline-operations/#update-a-pipeline","text":"You might want to make some changes to an existing pipeline, for example, updating request CPU, or changing the minimal replicas for a vertex. Updating a pipeline is as simple as applying the new pipeline spec to the existing one. But there are some scenarios that you'd better not update the pipeline, instead, you should delete and recreate it. The scenarios include but are not limited to: Topology changes such as adding or removing vertices, or updating the edges between vertices. Updating the partitions for a keyed reduce vertex. 
Updating the user-defined container image for a vertex, while the new image can not properly handle the unprocessed data in its backlog. To summarize, if there are unprocessed messages in the pipeline, and the new pipeline spec will change the way how the messages are processed, then you should delete and recreate the pipeline.","title":"Update a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#pause-a-pipeline","text":"To pause a pipeline, use the command below, it will bring the pipeline to Paused status, and terminate all the running vertex pods. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Paused\"}}}' Pausing a pipeline will not cause data loss. It does not clean up the unprocessed data in the pipeline, but just terminates the running pods. When the pipeline is resumed, the pods will be restarted and continue processing the unprocessed data. When pausing a pipeline, it will shutdown the source vertex pods first, and then wait for the other vertices to finish the backlog before terminating them. However, it will not wait forever and will terminate the pods after pauseGracePeriodSeconds . This is default set to 30 and can be customized by setting spec.lifecycle.pauseGracePeriodSeconds . If there's a reduce vertex in the pipeline, please make sure it uses Persistent Volume Claim for storage, otherwise the data will be lost.","title":"Pause a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#resume-a-pipeline","text":"The command below will bring the pipeline back to Running status. kubectl patch pl my-pipeline --type = merge --patch '{\"spec\": {\"lifecycle\": {\"desiredPhase\": \"Running\"}}}'","title":"Resume a Pipeline"},{"location":"user-guide/reference/pipeline-operations/#delete-a-pipeline","text":"When deleting a pipeline, before terminating all the pods, it will try to wait for all the backlog messages that have already been ingested into the pipeline to be processed. 
However, it will not wait forever, if the backlog is too large, it will terminate the pods after terminationGracePeriodSeconds , which defaults to 30, and can be customized by setting spec.lifecycle.terminationGracePeriodSeconds .","title":"Delete a Pipeline"},{"location":"user-guide/reference/pipeline-tuning/","text":"Pipeline Tuning \u00b6 For a data processing pipeline, each vertex keeps running the cycle of reading data from an Inter-Step Buffer (or data source), processing the data, and writing to next Inter-Step Buffers (or sinks). It is possible to make some tuning for this data processing cycle. readBatchSize - How many messages to read for each cycle, defaults to 500 . bufferMaxLength - How many unprocessed messages can be existing in the Inter-Step Buffer, defaults to 30000 . bufferUsageLimit - The percentage of the buffer usage limit, a valid number should be less than 100. Default value is 80 , which means 80% . These parameters can be customized under spec.limits as below, once defined, they apply to all the vertices and Inter-Step Buffers of the pipeline. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 They also can be defined in a vertex level, which will override the pipeline level settings. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : # Default limits for all the vertices and edges (buffers) of this pipeline readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat limits : readBatchSize : 200 # It overrides the default limit \"100\" bufferMaxLength : 20000 # It overrides the default limit \"30000\" for the buffers owned by this vertex bufferUsageLimit : 70 # It overrides the default limit \"85\" for the buffers owned by this vertex - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out","title":"Pipeline Tuning"},{"location":"user-guide/reference/pipeline-tuning/#pipeline-tuning","text":"For a data processing pipeline, each vertex keeps running the cycle of reading data from an Inter-Step Buffer (or data source), processing the data, and writing to next Inter-Step Buffers (or sinks). It is possible to make some tuning for this data processing cycle. readBatchSize - How many messages to read for each cycle, defaults to 500 . bufferMaxLength - How many unprocessed messages can be existing in the Inter-Step Buffer, defaults to 30000 . bufferUsageLimit - The percentage of the buffer usage limit, a valid number should be less than 100. Default value is 80 , which means 80% . These parameters can be customized under spec.limits as below, once defined, they apply to all the vertices and Inter-Step Buffers of the pipeline. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 They also can be defined in a vertex level, which will override the pipeline level settings. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : limits : # Default limits for all the vertices and edges (buffers) of this pipeline readBatchSize : 100 bufferMaxLength : 30000 bufferUsageLimit : 85 vertices : - name : in source : generator : rpu : 5 duration : 1s - name : cat udf : builtin : name : cat limits : readBatchSize : 200 # It overrides the default limit \"100\" bufferMaxLength : 20000 # It overrides the default limit \"30000\" for the buffers owned by this vertex bufferUsageLimit : 70 # It overrides the default limit \"85\" for the buffers owned by this vertex - name : out sink : log : {} edges : - from : in to : cat - from : cat to : out","title":"Pipeline Tuning"},{"location":"user-guide/reference/side-inputs/","text":"Side Inputs \u00b6 For an unbounded pipeline in Numaflow that never terminates, there are many cases where users want to update a configuration of the UDF without restarting the pipeline. Numaflow enables it by the Side Inputs feature where we can broadcast changes to vertices automatically. The Side Inputs feature achieves this by allowing users to write custom UDFs to broadcast changes to the vertices that are listening in for updates. Using Side Inputs in Numaflow \u00b6 The Side Inputs are updated based on a cron-like schedule, specified in the pipeline spec with a trigger field. Multiple side inputs are supported as well. Below is an example of pipeline spec with side inputs, which runs the custom UDFs every 15 mins and broadcasts the changes if there is any change to be broadcasted. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : s3 container : image : my-sideinputs-s3-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles - name : redis container : image : my-sideinputs-redis-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - s3 - name : my-vertex-multiple-side-inputs sideInputs : - s3 - redis Implementing User-defined Side Inputs \u00b6 To use the Side Inputs feature, a User-defined function implementing an interface defined in the Numaflow SDK ( Go , Python , Java ) is needed to retrieve the data. You can choose the SDK of your choice to create a User-defined Side Input image which implements the Side Inputs Update. Example in Golang \u00b6 Here is an example of how to write a User-defined Side Input in Golang, // handle is the side input handler function. func handle ( _ context . Context ) sideinputsdk . Message { t := time . Now () // val is the side input message value. This would be the value that the side input vertex receives. val := \"an example: \" + string ( t . String ()) // randomly drop side input message. Note that the side input message is not retried. // NoBroadcastMessage() is used to drop the message and not to // broadcast it to other side input vertices. counter = ( counter + 1 ) % 10 if counter % 2 == 0 { return sideinputsdk . NoBroadcastMessage () } // BroadcastMessage() is used to broadcast the message with the given value to other side input vertices. // val must be converted to []byte. return sideinputsdk . BroadcastMessage ([] byte ( val )) } Similarly, this can be written in Python and Java as well. After performing the retrieval/update, the side input value is then broadcasted to all vertices that use the side input. // BroadcastMessage() is used to broadcast the message with the given value. sideinputsdk . 
BroadcastMessage ([] byte ( val )) In some cases the user may want to drop the message and not to broadcast the side input value further. // NoBroadcastMessage() is used to drop the message and not to broadcast it further sideinputsdk . NoBroadcastMessage () UDF \u00b6 Users need to add a watcher on the filesystem to fetch the updated side inputs in their User-defined Source/Function/Sink in order to apply the new changes into the data process. For each side input there will be a file with the given path and after any update to the side input value the file will be updated. The directory is fixed and can be accessed through a sideinput constant and the file name is the name of the side input. sideinput . DirPath - > \"/var/numaflow/side-inputs\" sideInputFileName - > \"/var/numaflow/side-inputs/sideInputName\" Here are some examples of watching the side input filesystem for changes in Golang , Python and Java .","title":"Side Inputs"},{"location":"user-guide/reference/side-inputs/#side-inputs","text":"For an unbounded pipeline in Numaflow that never terminates, there are many cases where users want to update a configuration of the UDF without restarting the pipeline. Numaflow enables it by the Side Inputs feature where we can broadcast changes to vertices automatically. The Side Inputs feature achieves this by allowing users to write custom UDFs to broadcast changes to the vertices that are listening in for updates.","title":"Side Inputs"},{"location":"user-guide/reference/side-inputs/#using-side-inputs-in-numaflow","text":"The Side Inputs are updated based on a cron-like schedule, specified in the pipeline spec with a trigger field. Multiple side inputs are supported as well. Below is an example of pipeline spec with side inputs, which runs the custom UDFs every 15 mins and broadcasts the changes if there is any change to be broadcasted. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : sideInputs : - name : s3 container : image : my-sideinputs-s3-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles - name : redis container : image : my-sideinputs-redis-image:v1 trigger : schedule : \"*/15 * * * *\" # timezone: America/Los_Angeles vertices : - name : my-vertex sideInputs : - s3 - name : my-vertex-multiple-side-inputs sideInputs : - s3 - redis","title":"Using Side Inputs in Numaflow"},{"location":"user-guide/reference/side-inputs/#implementing-user-defined-side-inputs","text":"To use the Side Inputs feature, a User-defined function implementing an interface defined in the Numaflow SDK ( Go , Python , Java ) is needed to retrieve the data. You can choose the SDK of your choice to create a User-defined Side Input image which implements the Side Inputs Update.","title":"Implementing User-defined Side Inputs"},{"location":"user-guide/reference/side-inputs/#example-in-golang","text":"Here is an example of how to write a User-defined Side Input in Golang, // handle is the side input handler function. func handle ( _ context . Context ) sideinputsdk . Message { t := time . Now () // val is the side input message value. This would be the value that the side input vertex receives. val := \"an example: \" + string ( t . String ()) // randomly drop side input message. Note that the side input message is not retried. // NoBroadcastMessage() is used to drop the message and not to // broadcast it to other side input vertices. counter = ( counter + 1 ) % 10 if counter % 2 == 0 { return sideinputsdk . NoBroadcastMessage () } // BroadcastMessage() is used to broadcast the message with the given value to other side input vertices. // val must be converted to []byte. return sideinputsdk . BroadcastMessage ([] byte ( val )) } Similarly, this can be written in Python and Java as well. 
After performing the retrieval/update, the side input value is then broadcasted to all vertices that use the side input. // BroadcastMessage() is used to broadcast the message with the given value. sideinputsdk . BroadcastMessage ([] byte ( val )) In some cases the user may want to drop the message and not to broadcast the side input value further. // NoBroadcastMessage() is used to drop the message and not to broadcast it further sideinputsdk . NoBroadcastMessage ()","title":"Example in Golang"},{"location":"user-guide/reference/side-inputs/#udf","text":"Users need to add a watcher on the filesystem to fetch the updated side inputs in their User-defined Source/Function/Sink in order to apply the new changes into the data process. For each side input there will be a file with the given path and after any update to the side input value the file will be updated. The directory is fixed and can be accessed through a sideinput constant and the file name is the name of the side input. sideinput . DirPath - > \"/var/numaflow/side-inputs\" sideInputFileName - > \"/var/numaflow/side-inputs/sideInputName\" Here are some examples of watching the side input filesystem for changes in Golang , Python and Java .","title":"UDF"},{"location":"user-guide/reference/configuration/container-resources/","text":"Container Resources \u00b6 Container Resources can be customized for all the types of vertices. For configuring container resources on pods not owned by a vertex, see Pipeline Customization . 
Numa Container \u00b6 To specify resources for the numa container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex containerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDF Container \u00b6 To specify resources for udf container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex udf : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDSource Container \u00b6 To specify resources for udsource container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : udsource : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Source Transformer Container \u00b6 To specify resources for transformer container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : transformer : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi UDSink Container \u00b6 To specify resources for udsink container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex sink : udsink : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Init Container \u00b6 To specify resources for the init init-container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex initContainerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Container resources for user init-containers are instead 
specified at .spec.vertices[*].initContainers[*].resources .","title":"Container Resources"},{"location":"user-guide/reference/configuration/container-resources/#container-resources","text":"Container Resources can be customized for all the types of vertices. For configuring container resources on pods not owned by a vertex, see Pipeline Customization .","title":"Container Resources"},{"location":"user-guide/reference/configuration/container-resources/#numa-container","text":"To specify resources for the numa container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex containerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"Numa Container"},{"location":"user-guide/reference/configuration/container-resources/#udf-container","text":"To specify resources for udf container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex udf : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDF Container"},{"location":"user-guide/reference/configuration/container-resources/#udsource-container","text":"To specify resources for udsource container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : udsource : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDSource Container"},{"location":"user-guide/reference/configuration/container-resources/#source-transformer-container","text":"To specify resources for transformer container of a source vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex source : transformer : container : resources : limits : cpu : 
\"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"Source Transformer Container"},{"location":"user-guide/reference/configuration/container-resources/#udsink-container","text":"To specify resources for udsink container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex sink : udsink : container : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi","title":"UDSink Container"},{"location":"user-guide/reference/configuration/container-resources/#init-container","text":"To specify resources for the init init-container of vertex pods: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex initContainerTemplate : resources : limits : cpu : \"3\" memory : 6Gi requests : cpu : \"1\" memory : 4Gi Container resources for user init-containers are instead specified at .spec.vertices[*].initContainers[*].resources .","title":"Init Container"},{"location":"user-guide/reference/configuration/environment-variables/","text":"Environment Variables \u00b6 For the numa container of vertex pods, environment variable NUMAFLOW_DEBUG can be set to true for debugging . In udf , udsink and transformer containers, there are some preset environment variables that can be used directly. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. NUMAFLOW_CPU_REQUEST - resources.requests.cpu , roundup to N cores, 0 if missing. NUMAFLOW_CPU_LIMIT - resources.limits.cpu , roundup to N cores, use host cpu cores if missing. NUMAFLOW_MEMORY_REQUEST - resources.requests.memory in bytes, 0 if missing. NUMAFLOW_MEMORY_LIMIT - resources.limits.memory in bytes, use host memory if missing. For setting environment variables on pods not owned by a vertex, see Pipeline Customization . 
Your Own Environment Variables \u00b6 To add your own environment variables to udf or udsink containers, check the example below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest env : - name : env01 value : value01 - name : env02 valueFrom : secretKeyRef : name : my-secret key : my-key - name : my-sink sink : udsink : container : image : my-sink:latest env : - name : env03 value : value03 Similarly, envFrom also can be specified in udf or udsink containers. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest envFrom : - configMapRef : name : my-config - name : my-sink sink : udsink : container : image : my-sink:latest envFrom : - secretRef : name : my-secret","title":"Environment Variables"},{"location":"user-guide/reference/configuration/environment-variables/#environment-variables","text":"For the numa container of vertex pods, environment variable NUMAFLOW_DEBUG can be set to true for debugging . In udf , udsink and transformer containers, there are some preset environment variables that can be used directly. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. NUMAFLOW_CPU_REQUEST - resources.requests.cpu , roundup to N cores, 0 if missing. NUMAFLOW_CPU_LIMIT - resources.limits.cpu , roundup to N cores, use host cpu cores if missing. NUMAFLOW_MEMORY_REQUEST - resources.requests.memory in bytes, 0 if missing. NUMAFLOW_MEMORY_LIMIT - resources.limits.memory in bytes, use host memory if missing. 
For setting environment variables on pods not owned by a vertex, see Pipeline Customization .","title":"Environment Variables"},{"location":"user-guide/reference/configuration/environment-variables/#your-own-environment-variables","text":"To add your own environment variables to udf or udsink containers, check the example below. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest env : - name : env01 value : value01 - name : env02 valueFrom : secretKeyRef : name : my-secret key : my-key - name : my-sink sink : udsink : container : image : my-sink:latest env : - name : env03 value : value03 Similarly, envFrom also can be specified in udf or udsink containers. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf udf : container : image : my-function:latest envFrom : - configMapRef : name : my-config - name : my-sink sink : udsink : container : image : my-sink:latest envFrom : - secretRef : name : my-secret","title":"Your Own Environment Variables"},{"location":"user-guide/reference/configuration/init-containers/","text":"Init Containers \u00b6 Init Containers can be provided for all the types of vertices. The following example shows how to add an init-container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf initContainers : - name : my-init image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-init is running!\\\" && sleep 60\" ] udf : container : image : my-function:latest The following example shows how to use init-containers and volumes together to provide a udf container files on startup. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf volumes : - name : my-udf-data emptyDir : {} initContainers : - name : my-init image : amazon/aws-cli:latest command : [ \"/bin/sh\" , \"-c\" , \"aws s3 sync s3://path/to/my-s3-data /path/to/my-init-data\" ] volumeMounts : - mountPath : /path/to/my-init-data name : my-udf-data udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-data name : my-udf-data","title":"Init Containers"},{"location":"user-guide/reference/configuration/init-containers/#init-containers","text":"Init Containers can be provided for all the types of vertices. The following example shows how to add an init-container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf initContainers : - name : my-init image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-init is running!\\\" && sleep 60\" ] udf : container : image : my-function:latest The following example shows how to use init-containers and volumes together to provide a udf container files on startup. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf volumes : - name : my-udf-data emptyDir : {} initContainers : - name : my-init image : amazon/aws-cli:latest command : [ \"/bin/sh\" , \"-c\" , \"aws s3 sync s3://path/to/my-s3-data /path/to/my-init-data\" ] volumeMounts : - mountPath : /path/to/my-init-data name : my-udf-data udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-data name : my-udf-data","title":"Init Containers"},{"location":"user-guide/reference/configuration/istio/","text":"Running on Istio \u00b6 If you want to get pipeline vertices running on Istio, so that they are able to talk to other services with Istio enabled, one approach is to whitelist the ports that Numaflow uses. To whitelist the ports, add traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts annotations to your vertex configuration: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port udf : container : image : my-udf-image:latest ... 
If you want to apply same configuration to all the vertices, use Vertex Template : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port vertices : - name : my-vertex-1 udf : container : image : my-udf-1-image:latest - name : my-vertex-2 udf : container : image : my-udf-2-image:latest ...","title":"Running on Istio"},{"location":"user-guide/reference/configuration/istio/#running-on-istio","text":"If you want to get pipeline vertices running on Istio, so that they are able to talk to other services with Istio enabled, one approach is to whitelist the ports that Numaflow uses. To whitelist the ports, add traffic.sidecar.istio.io/excludeInboundPorts and traffic.sidecar.istio.io/excludeOutboundPorts annotations to your vertex configuration: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port udf : container : image : my-udf-image:latest ... 
If you want to apply same configuration to all the vertices, use Vertex Template : apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : metadata : annotations : sidecar.istio.io/inject : \"true\" traffic.sidecar.istio.io/excludeOutboundPorts : \"4222\" # Nats JetStream port traffic.sidecar.istio.io/excludeInboundPorts : \"2469\" # Numaflow vertex metrics port vertices : - name : my-vertex-1 udf : container : image : my-udf-1-image:latest - name : my-vertex-2 udf : container : image : my-udf-2-image:latest ...","title":"Running on Istio"},{"location":"user-guide/reference/configuration/labels-and-annotations/","text":"Labels And Annotations \u00b6 Sometimes customized Labels or Annotations are needed for the vertices, for example, adding an annotation to enable or disable Istio sidecar injection. To do that, a metadata with labels or annotations can be added to the vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : labels : key1 : val1 key2 : val2 annotations : key3 : val3 key4 : val4","title":"Labels And Annotations"},{"location":"user-guide/reference/configuration/labels-and-annotations/#labels-and-annotations","text":"Sometimes customized Labels or Annotations are needed for the vertices, for example, adding an annotation to enable or disable Istio sidecar injection. To do that, a metadata with labels or annotations can be added to the vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-vertex metadata : labels : key1 : val1 key2 : val2 annotations : key3 : val3 key4 : val4","title":"Labels And Annotations"},{"location":"user-guide/reference/configuration/max-message-size/","text":"Maximum Message Size \u00b6 The default maximum message size is 1MB . 
There's a way to increase this limit in case you want to, but please think it through before doing so. The max message size is determined by: Max message size supported by gRPC (default value is 64MB in Numaflow). Max message size supported by the Inter-Step Buffer implementation. If JetStream is used as the Inter-Step Buffer implementation, the default max message size for it is configured as 1MB . You can change it by setting the spec.jetstream.settings in the InterStepBufferService specification. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : settings : | max_payload: 8388608 # 8MB It's not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). Please be aware that if you increase the max message size of the InterStepBufferService , you probably will also need to change some other limits. For example, if the size of each message is as large as 8MB, then 100 messages flowing in the pipeline will make each Inter-Step Buffer need at least 800MB of disk space to store the messages, and the memory consumption will also be high, which will probably cause the Inter-Step Buffer Service to crash. In that case, you might need to update the retention policy in the Inter-Step Buffer Service to make sure the messages are not stored for too long. Check out the Inter-Step Buffer Service for more details.","title":"Maximum Message Size"},{"location":"user-guide/reference/configuration/max-message-size/#maximum-message-size","text":"The default maximum message size is 1MB . There's a way to increase this limit in case you want to, but please think it through before doing so. The max message size is determined by: Max message size supported by gRPC (default value is 64MB in Numaflow). Max message size supported by the Inter-Step Buffer implementation.
If JetStream is used as the Inter-Step Buffer implementation, the default max message size for it is configured as 1MB . You can change it by setting the spec.jetstream.settings in the InterStepBufferService specification. apiVersion : numaflow.numaproj.io/v1alpha1 kind : InterStepBufferService metadata : name : default spec : jetstream : settings : | max_payload: 8388608 # 8MB It's not recommended to use values over 8388608 (8MB) but max_payload can be set up to 67108864 (64MB). Please be aware that if you increase the max message size of the InterStepBufferService , you probably will also need to change some other limits. For example, if the size of each message is as large as 8MB, then 100 messages flowing in the pipeline will make each Inter-Step Buffer need at least 800MB of disk space to store the messages, and the memory consumption will also be high, which will probably cause the Inter-Step Buffer Service to crash. In that case, you might need to update the retention policy in the Inter-Step Buffer Service to make sure the messages are not stored for too long. Check out the Inter-Step Buffer Service for more details.","title":"Maximum Message Size"},{"location":"user-guide/reference/configuration/pipeline-customization/","text":"Pipeline Customization \u00b6 There is an optional .spec.templates field in the Pipeline resource which may be used to customize Kubernetes resources owned by the Pipeline. Vertex customization is described separately in more detail (e.g. Environment Variables , Container Resources , etc.). Daemon Deployment \u00b6 The following example shows how to configure a Daemon Deployment with all currently supported fields. The .spec.templates.daemon field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : daemon : # Deployment spec replicas : 3 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : app.kubernetes.io/component operator : In values : - daemon - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Jobs \u00b6 The following example shows how to configure kubernetes Jobs owned by a Pipeline with all currently supported fields. The .spec.templates.job field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : job : # Job spec ttlSecondsAfterFinished : 600 # numaflow defaults to 30 backoffLimit : 5 # numaflow defaults to 20 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : {} # Container containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Vertices \u00b6 The following example shows how to configure all the vertex pods owned by a pipeline with all currently supported fields. Be aware that these configurations, which apply to all vertex pods, can be overridden by the vertex configuration. The .spec.templates.vertex field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi Side Inputs \u00b6 The following example shows how to configure all the Side Inputs Manager pods owned by a pipeline with all currently supported fields. The .spec.templates.sideInputsManager field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : sideInputsManager : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Pipeline Customization"},{"location":"user-guide/reference/configuration/pipeline-customization/#pipeline-customization","text":"There is an optional .spec.templates field in the Pipeline resource which may be used to customize Kubernetes resources owned by the Pipeline. Vertex customization is described separately in more detail (e.g. Environment Variables , Container Resources , etc.).","title":"Pipeline Customization"},{"location":"user-guide/reference/configuration/pipeline-customization/#daemon-deployment","text":"The following example shows how to configure a Daemon Deployment with all currently supported fields. The .spec.templates.daemon field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : daemon : # Deployment spec replicas : 3 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : app.kubernetes.io/component operator : In values : - daemon - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Daemon Deployment"},{"location":"user-guide/reference/configuration/pipeline-customization/#jobs","text":"The following example shows how to configure kubernetes Jobs owned by a Pipeline with all currently supported fields. The .spec.templates.job field and all fields directly under it are optional. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : job : # Job spec ttlSecondsAfterFinished : 600 # numaflow defaults to 30 backoffLimit : 5 # numaflow defaults to 20 # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : {} # Container containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Jobs"},{"location":"user-guide/reference/configuration/pipeline-customization/#vertices","text":"The following example shows how to configure all the vertex pods owned by a pipeline with all currently supported fields. Be aware that these configurations, which apply to all vertex pods, can be overridden by the vertex configuration. The .spec.templates.vertex field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : vertex : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Vertices"},{"location":"user-guide/reference/configuration/pipeline-customization/#side-inputs","text":"The following example shows how to configure all the Side Inputs Manager pods owned by a pipeline with all currently supported fields. The .spec.templates.sideInputsManager field and all fields directly under it are optional.
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : templates : sideInputsManager : # Pod metadata metadata : labels : my-label-name : my-label-value annotations : my-annotation-name : my-annotation-value # Pod spec nodeSelector : my-node-label-name : my-node-label-value tolerations : - key : \"my-example-key\" operator : \"Exists\" effect : \"NoSchedule\" securityContext : {} imagePullSecrets : - name : regcred priorityClassName : my-priority-class-name priority : 50 serviceAccountName : my-service-account affinity : podAntiAffinity : requiredDuringSchedulingIgnoredDuringExecution : - labelSelector : matchExpressions : - key : numaflow.numaproj.io/pipeline-name operator : In values : - my-pipeline topologyKey : kubernetes.io/hostname # Containers containerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi initContainerTemplate : env : - name : MY_ENV_NAME value : my-env-value resources : limits : memory : 500Mi","title":"Side Inputs"},{"location":"user-guide/reference/configuration/sidecar-containers/","text":"Sidecar Containers \u00b6 Additional \" sidecar \" containers can be provided for udf and sink vertices. source vertices do not currently support sidecars. The following example shows how to add a sidecar container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf sidecars : - name : my-sidecar image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-sidecar is running!\\\" && tail -f /dev/null\" ] udf : container : image : my-function:latest There are various use-cases for sidecars. One possible use-case is a udf container that needs functionality from a library written in a different language. The library's functionality could be made available through gRPC over Unix Domain Socket. The following example shows how that could be accomplished using a shared volume . 
It is the sidecar owner's responsibility to come up with a protocol that can be used with the UDF. It could be volume, gRPC, TCP, HTTP 1.x, etc., apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf-vertex volumes : - name : my-udf-volume emptyDir : medium : Memory sidecars : - name : my-sidecar image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && socat UNIX-LISTEN:/path/to/my-sidecar-mount-path/my.sock - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-sidecar-mount-path name : my-udf-volume udf : container : image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && echo \\\"hello\\\" | socat UNIX-CONNECT:/path/to/my-udf-mount-path/my.sock,forever - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-udf-mount-path name : my-udf-volume","title":"Sidecar Containers"},{"location":"user-guide/reference/configuration/sidecar-containers/#sidecar-containers","text":"Additional \" sidecar \" containers can be provided for udf and sink vertices. source vertices do not currently support sidecars. The following example shows how to add a sidecar container to a udf vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf sidecars : - name : my-sidecar image : busybox:latest command : [ \"/bin/sh\" , \"-c\" , \"echo \\\"my-sidecar is running!\\\" && tail -f /dev/null\" ] udf : container : image : my-function:latest There are various use-cases for sidecars. One possible use-case is a udf container that needs functionality from a library written in a different language. The library's functionality could be made available through gRPC over Unix Domain Socket. The following example shows how that could be accomplished using a shared volume . It is the sidecar owner's responsibility to come up with a protocol that can be used with the UDF. 
It could be volume, gRPC, TCP, HTTP 1.x, etc., apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-udf-vertex volumes : - name : my-udf-volume emptyDir : medium : Memory sidecars : - name : my-sidecar image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && socat UNIX-LISTEN:/path/to/my-sidecar-mount-path/my.sock - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-sidecar-mount-path name : my-udf-volume udf : container : image : alpine:latest command : [ \"/bin/sh\" , \"-c\" , \"apk add socat && echo \\\"hello\\\" | socat UNIX-CONNECT:/path/to/my-udf-mount-path/my.sock,forever - && tail -f /dev/null\" ] volumeMounts : - mountPath : /path/to/my-udf-mount-path name : my-udf-volume","title":"Sidecar Containers"},{"location":"user-guide/reference/configuration/volumes/","text":"Volumes \u00b6 Volumes can be mounted to udsource , udf or udsink containers. Following example shows how to mount a ConfigMap to an udsource vertex, an udf vertex and an udsink vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-source volumes : - name : my-udsource-config configMap : name : udsource-config source : udsource : container : image : my-source:latest volumeMounts : - mountPath : /path/to/my-source-config name : my-udsource-config - name : my-udf volumes : - name : my-udf-config configMap : name : udf-config udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-function-config name : my-udf-config - name : my-sink volumes : - name : my-udsink-config configMap : name : udsink-config sink : udsink : container : image : my-sink:latest volumeMounts : - mountPath : /path/to/my-sink-config name : my-udsink-config","title":"Volumes"},{"location":"user-guide/reference/configuration/volumes/#volumes","text":"Volumes can be mounted to udsource , udf or udsink containers. 
Following example shows how to mount a ConfigMap to an udsource vertex, an udf vertex and an udsink vertex. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : my-source volumes : - name : my-udsource-config configMap : name : udsource-config source : udsource : container : image : my-source:latest volumeMounts : - mountPath : /path/to/my-source-config name : my-udsource-config - name : my-udf volumes : - name : my-udf-config configMap : name : udf-config udf : container : image : my-function:latest volumeMounts : - mountPath : /path/to/my-function-config name : my-udf-config - name : my-sink volumes : - name : my-udsink-config configMap : name : udsink-config sink : udsink : container : image : my-sink:latest volumeMounts : - mountPath : /path/to/my-sink-config name : my-udsink-config","title":"Volumes"},{"location":"user-guide/reference/kustomize/kustomize/","text":"Kustomize Integration \u00b6 Transformers \u00b6 Kustomize Transformer Configurations can be used to do lots of powerful operations such as ConfigMap and Secret generations, applying common labels and annotations, updating image names and tags. To use these features with Numaflow CRD objects, download numaflow-transformer-config.yaml into your kustomize directory, and add it to configurations section. kind : Kustomization apiVersion : kustomize.config.k8s.io/v1beta1 configurations : - numaflow-transformer-config.yaml # Or reference the remote configuration directly. # - https://raw.githubusercontent.com/numaproj/numaflow/main/docs/user-guide/reference/kustomize/numaflow-transformer-config.yaml Here is an example to use transformers with a Pipeline. Patch \u00b6 Starting from version 4.5.5, kustomize can use Kubernetes OpenAPI schema to provide merge key and patch strategy information. To use that with Numaflow CRD objects, download schema.json into your kustomize directory, and add it to openapi section. 
apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization openapi : path : schema.json # Or reference the remote configuration directly. # path: https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json For example, given the following Pipeline spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : my-udf udf : container : image : my-pipeline/my-udf:v0.1 - name : out sink : log : {} edges : - from : in to : my-udf - from : my-udf to : out You can update the source spec via a patch in a kustomize file. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - my-pipeline.yaml openapi : path : https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json patches : - patch : |- apiVersion: numaflow.numaproj.io/v1alpha1 kind: Pipeline metadata: name: my-pipeline spec: vertices: - name: in source: generator: rpu: 500 See the full example here .","title":"Kustomize Integration"},{"location":"user-guide/reference/kustomize/kustomize/#kustomize-integration","text":"","title":"Kustomize Integration"},{"location":"user-guide/reference/kustomize/kustomize/#transformers","text":"Kustomize Transformer Configurations can be used to do lots of powerful operations such as ConfigMap and Secret generations, applying common labels and annotations, updating image names and tags. To use these features with Numaflow CRD objects, download numaflow-transformer-config.yaml into your kustomize directory, and add it to configurations section. kind : Kustomization apiVersion : kustomize.config.k8s.io/v1beta1 configurations : - numaflow-transformer-config.yaml # Or reference the remote configuration directly. 
# - https://raw.githubusercontent.com/numaproj/numaflow/main/docs/user-guide/reference/kustomize/numaflow-transformer-config.yaml Here is an example to use transformers with a Pipeline.","title":"Transformers"},{"location":"user-guide/reference/kustomize/kustomize/#patch","text":"Starting from version 4.5.5, kustomize can use Kubernetes OpenAPI schema to provide merge key and patch strategy information. To use that with Numaflow CRD objects, download schema.json into your kustomize directory, and add it to openapi section. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization openapi : path : schema.json # Or reference the remote configuration directly. # path: https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json For example, given the following Pipeline spec: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : my-pipeline spec : vertices : - name : in source : generator : rpu : 5 duration : 1s - name : my-udf udf : container : image : my-pipeline/my-udf:v0.1 - name : out sink : log : {} edges : - from : in to : my-udf - from : my-udf to : out You can update the source spec via a patch in a kustomize file. apiVersion : kustomize.config.k8s.io/v1beta1 kind : Kustomization resources : - my-pipeline.yaml openapi : path : https://raw.githubusercontent.com/numaproj/numaflow/main/api/json-schema/schema.json patches : - patch : |- apiVersion: numaflow.numaproj.io/v1alpha1 kind: Pipeline metadata: name: my-pipeline spec: vertices: - name: in source: generator: rpu: 500 See the full example here .","title":"Patch"},{"location":"user-guide/sinks/blackhole/","text":"Blackhole Sink \u00b6 A Blackhole sink is where the output is drained without writing to any sink, it is to emulate /dev/null . 
spec : vertices : - name : output sink : blackhole : {} NOTE: Ideally, the previous vertex should not forward the message at all, which is more efficient because it avoids network latency.","title":"Blackhole Sink"},{"location":"user-guide/sinks/blackhole/#blackhole-sink","text":"A Blackhole sink drains the output without writing to any sink; it emulates /dev/null . spec : vertices : - name : output sink : blackhole : {} NOTE: Ideally, the previous vertex should not forward the message at all, which is more efficient because it avoids network latency.","title":"Blackhole Sink"},{"location":"user-guide/sinks/fallback/","text":"Fallback Sink \u00b6 A Fallback Sink functions as a Dead Letter Queue (DLQ) Sink and can be configured to serve as a backup when the primary sink is down, unavailable, or under maintenance. This is particularly useful when multiple sinks are in a pipeline; if a sink fails, the resulting back-pressure will back-propagate and stop the source vertex from reading more data. A Fallback Sink can be set up to prevent this from happening. This backup sink stores data while the primary sink is offline. The stored data can be replayed once the primary sink is back online. Note: The fallback field is optional. Users are required to return a fallback response from the user-defined sink when the primary sink fails; only then will the messages be directed to the fallback sink. Example of a fallback response in a user-defined sink: here CAVEATs \u00b6 The fallback field can only be utilized when the primary sink is a User Defined Sink. Example \u00b6 Builtin Kafka \u00b6 An example using builtin kafka as fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic UD Sink \u00b6 An example using custom user-defined sink as fallback sink.
User Defined Sink as a fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : udsink : container : image : my-sink:latest","title":"Fallback Sink"},{"location":"user-guide/sinks/fallback/#fallback-sink","text":"A Fallback Sink functions as a Dead Letter Queue (DLQ) Sink and can be configured to serve as a backup when the primary sink is down, unavailable, or under maintenance. This is particularly useful when multiple sinks are in a pipeline; if a sink fails, the resulting back-pressure will back-propagate and stop the source vertex from reading more data. A Fallback Sink can be set up to prevent this from happening. This backup sink stores data while the primary sink is offline. The stored data can be replayed once the primary sink is back online. Note: The fallback field is optional. Users are required to return a fallback response from the user-defined sink when the primary sink fails; only then will the messages be directed to the fallback sink. Example of a fallback response in a user-defined sink: here","title":"Fallback Sink"},{"location":"user-guide/sinks/fallback/#caveats","text":"The fallback field can only be utilized when the primary sink is a User Defined Sink.","title":"CAVEATs"},{"location":"user-guide/sinks/fallback/#example","text":"","title":"Example"},{"location":"user-guide/sinks/fallback/#builtin-kafka","text":"An example using builtin kafka as fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic","title":"Builtin Kafka"},{"location":"user-guide/sinks/fallback/#ud-sink","text":"An example using custom user-defined sink as fallback sink.
User Defined Sink as a fallback sink: - name : out sink : udsink : container : image : my-sink:latest fallback : udsink : container : image : my-sink:latest","title":"UD Sink"},{"location":"user-guide/sinks/kafka/","text":"Kafka Sink \u00b6 A Kafka sink is used to forward the messages to a Kafka topic. Kafka sink supports configuration overrides. spec : vertices : - name : kafka-output sink : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic tls : # Optional. insecureSkipVerify : # Optional, whether to skip TLS verification. Defaults to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference
which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true # Optional, a yaml format string which could apply more configuration for the sink. # The configuration hierarchy follows the Struct of sarama.Config at https://github.com/IBM/sarama/blob/main/config.go. config : | producer: compression: 2","title":"Kafka Sink"},{"location":"user-guide/sinks/kafka/#kafka-sink","text":"A Kafka sink is used to forward the messages to a Kafka topic. Kafka sink supports configuration overrides. spec : vertices : - name : kafka-output sink : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic tls : # Optional. insecureSkipVerify : # Optional, whether to skip TLS verification. Defaults to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert.
name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret 
reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true # Optional, a yaml format string which could apply more configuration for the sink. # The configuration hierarchy follows the Struct of sarama.Config at https://github.com/IBM/sarama/blob/main/config.go. config : | producer: compression: 2","title":"Kafka Sink"},{"location":"user-guide/sinks/log/","text":"Log Sink \u00b6 A Log sink is very useful for debugging, it prints all the received messages to stdout . spec : vertices : - name : output sink : log : {}","title":"Log Sink"},{"location":"user-guide/sinks/log/#log-sink","text":"A Log sink is very useful for debugging, it prints all the received messages to stdout . spec : vertices : - name : output sink : log : {}","title":"Log Sink"},{"location":"user-guide/sinks/overview/","text":"Sinks \u00b6 The Sink serves as the endpoint for processed data that has been outputted from the platform, which is then sent to an external system or application. The purpose of the Sink is to deliver the processed data to its ultimate destination, such as a database, data warehouse, visualization tool, or alerting system. It's the opposite of the Source vertex, which receives input data into the platform. Sink vertex may require transformation or formatting of data prior to sending it to the target system. Depending on the target system's needs, this transformation can be simple or complex. A pipeline can have many Sink vertices, unlike the Source vertex. 
Numaflow currently supports the following Sinks Kafka Log Black Hole User-defined Sink A user-defined sink is a custom Sink that a user can write using Numaflow SDK when the user needs to output the processed data to a system or using a certain transformation that is not supported by the platform's built-in sinks. As an example, once we have processed the input messages, we can use Elasticsearch as a user-defined sink to store the processed data and enable search and analysis on the data. Fallback Sink (DLQ) \u00b6 There is explicit DLQ support for sinks using a concept called fallback sink . For the rest of the vertices, if you need a DLQ, please use conditional-forwarding . A Sink cannot do conditional-forwarding since it is a terminal state, and hence we have an explicit fallback option.","title":"Overview"},{"location":"user-guide/sinks/overview/#sinks","text":"The Sink serves as the endpoint for processed data that has been outputted from the platform, which is then sent to an external system or application. The purpose of the Sink is to deliver the processed data to its ultimate destination, such as a database, data warehouse, visualization tool, or alerting system. It's the opposite of the Source vertex, which receives input data into the platform. Sink vertex may require transformation or formatting of data prior to sending it to the target system. Depending on the target system's needs, this transformation can be simple or complex. A pipeline can have many Sink vertices, unlike the Source vertex. Numaflow currently supports the following Sinks Kafka Log Black Hole User-defined Sink A user-defined sink is a custom Sink that a user can write using Numaflow SDK when the user needs to output the processed data to a system or using a certain transformation that is not supported by the platform's built-in sinks. 
As an example, once we have processed the input messages, we can use Elasticsearch as a user-defined sink to store the processed data and enable search and analysis on the data.","title":"Sinks"},{"location":"user-guide/sinks/overview/#fallback-sink-dlq","text":"There is explicit DLQ support for sinks using a concept called fallback sink . For the rest of the vertices, if you need a DLQ, please use conditional-forwarding . A Sink cannot do conditional-forwarding since it is a terminal state, and hence we have an explicit fallback option.","title":"Fallback Sink (DLQ)"},{"location":"user-guide/sinks/user-defined-sinks/","text":"User-defined Sinks \u00b6 A Pipeline may have multiple Sinks. Those sinks could either be a pre-defined sink such as kafka , log , etc., or a user-defined sink . A pre-defined sink vertex runs single-container pods, while a user-defined sink runs two-container pods. Build Your Own User-defined Sinks \u00b6 You can build your own user-defined sinks in multiple languages. Check the links below to see the examples for different languages. Golang Java Python A user-defined sink vertex looks like below. spec : vertices : - name : output sink : udsink : container : image : my-sink:latest Available Environment Variables \u00b6 Some environment variables are available in the user-defined sink container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. User-defined Sinks contributed from the open source community \u00b6 If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sinks like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined sink.","title":"User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#user-defined-sinks","text":"A Pipeline may have multiple Sinks, those sinks could either be a pre-defined sink such as kafka , log , etc., or a user-defined sink . A pre-defined sink vertex runs single-container pods, a user-defined sink runs two-container pods.","title":"User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#build-your-own-user-defined-sinks","text":"You can build your own user-defined sinks in multiple languages. Check the links below to see the examples for different languages. Golang Java Python A user-defined sink vertex looks like below. spec : vertices : - name : output sink : udsink : container : image : my-sink:latest","title":"Build Your Own User-defined Sinks"},{"location":"user-guide/sinks/user-defined-sinks/#available-environment-variables","text":"Some environment variables are available in the user-defined sink container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sinks/user-defined-sinks/#user-defined-sinks-contributed-from-the-open-source-community","text":"If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . These user-defined sinks like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined sink.","title":"User-defined Sinks contributed from the open source community"},{"location":"user-guide/sources/generator/","text":"Generator Source \u00b6 Generator Source is mainly used for development purpose, where you want to have self-contained source to generate some messages. 
We mainly use generator for load testing and integration testing of Numaflow. The load generated is per replica. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : # How many messages to generate in the duration. rpu : 100 duration : 1s # Optional, size of each generated message, defaults to 10. msgSize : 1024 - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out User Defined Data \u00b6 The default data created by the generator is not likely to be useful in testing user pipelines with specific business logic. To support user testing, a user-defined value can be provided, which will be emitted as the payload of each generated message. - name: in source: generator: # How many messages to generate in the duration. rpu: 100 duration: 1s # Base64 encoding of data to send. Can be example serialized packet to # run through user pipeline to exercise particular capability or path through pipeline valueBlob: \"InlvdXIgc3BlY2lmaWMgZGF0YSI=\" # Note: msgSize and value will be ignored if valueBlob is set","title":"Generator Source"},{"location":"user-guide/sources/generator/#generator-source","text":"Generator Source is mainly used for development purposes, where you want a self-contained source to generate some messages. We mainly use generator for load testing and integration testing of Numaflow. The load generated is per replica. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : simple-pipeline spec : vertices : - name : in source : generator : # How many messages to generate in the duration. rpu : 100 duration : 1s # Optional, size of each generated message, defaults to 10. 
msgSize : 1024 - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out","title":"Generator Source"},{"location":"user-guide/sources/generator/#user-defined-data","text":"The default data created by the generator is not likely to be useful in testing user pipelines with specific business logic. To support user testing, a user-defined value can be provided, which will be emitted as the payload of each generated message. - name: in source: generator: # How many messages to generate in the duration. rpu: 100 duration: 1s # Base64 encoding of data to send. Can be example serialized packet to # run through user pipeline to exercise particular capability or path through pipeline valueBlob: \"InlvdXIgc3BlY2lmaWMgZGF0YSI=\" # Note: msgSize and value will be ignored if valueBlob is set","title":"User Defined Data"},{"location":"user-guide/sources/http/","text":"HTTP Source \u00b6 HTTP Source starts an HTTP service with TLS enabled in the Vertex Pod to accept POST requests. It listens on port 8443, with request URI /vertices/{vertexName} . A Pipeline with HTTP Source: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : {} - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out Sending Data \u00b6 Data could be sent to an HTTP source through: ClusterIP Service (within the cluster) Ingress or LoadBalancer Service (outside of the cluster) Port-forward (for testing) ClusterIP Service \u00b6 An HTTP Source Vertex can generate a ClusterIP Service if service: true is specified. The service name is in the format {pipelineName}-{vertexName} , so the HTTP Source can be accessed through https://{pipelineName}-{vertexName}.{namespace}.svc:8443/vertices/{vertexName} within the cluster. 
apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : service : true LoadBalancer Service or Ingress \u00b6 To create a LoadBalancer type Service, or a NodePort one for Ingress, you need to do it on your own. Just remember to use a selector like the following in the Service: numaflow.numaproj.io/pipeline-name : http-pipeline # pipeline name numaflow.numaproj.io/vertex-name : in # vertex name Port-forwarding \u00b6 To test an HTTP source, you can do it from your local machine through port-forwarding. kubectl port-forward pod ${ pod -name } 8443 curl -kq -X POST -d \"hello world\" https://localhost:8443/vertices/in x-numaflow-id \u00b6 When posting data to the HTTP Source, an optional HTTP header x-numaflow-id can be specified, which will be used for deduplication. If it's not provided, the HTTP Source will generate a random UUID for that purpose. curl -kq -X POST -H \"x-numaflow-id: ${ id } \" -d \"hello world\" ${ http -source-url } x-numaflow-event-time \u00b6 By default, the time at which the data arrives at the HTTP source is used as the event time. It can be overridden by setting an HTTP header x-numaflow-event-time whose value is the number of milliseconds elapsed since January 1, 1970 UTC. curl -kq -X POST -H \"x-numaflow-event-time: 1663006726000\" -d \"hello world\" ${ http -source-url } Auth \u00b6 A Bearer token can be configured to prevent the HTTP Source from being accessed by unexpected clients. To do so, a Kubernetes Secret needs to be created to store the token, and valid clients also need to include the token in their HTTP request headers. First, create a k8s secret containing your token. 
echo -n 'tr3qhs321fjglwf1e2e67dfda4tr' > ./token.txt kubectl create secret generic http-source-token --from-file = my-token = ./token.txt Then add auth to the Source Vertex: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : auth : token : name : http-source-token key : my-token When the clients post data to the Source Vertex, add Authorization: Bearer tr3qhs321fjglwf1e2e67dfda4tr to the header, for example: TOKEN = \"Bearer tr3qhs321fjglwf1e2e67dfda4tr\" # Post data from a Pod in the same namespace of the cluster curl -kq -X POST -H \"Authorization: $TOKEN \" -d \"hello world\" https://http-pipeline-in:8443/vertices/in Health Check \u00b6 The HTTP Source also has an endpoint /health created automatically, which is useful for LoadBalancer or Ingress configuration, where a health check endpoint is often required by the cloud provider.","title":"HTTP Source"},{"location":"user-guide/sources/http/#http-source","text":"HTTP Source starts an HTTP service with TLS enabled to accept POST request in the Vertex Pod. It listens to port 8443, with request URI /vertices/{vertexName} . 
A Pipeline with HTTP Source: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : {} - name : p1 udf : builtin : name : cat - name : out sink : log : {} edges : - from : in to : p1 - from : p1 to : out","title":"HTTP Source"},{"location":"user-guide/sources/http/#sending-data","text":"Data could be sent to an HTTP source through: ClusterIP Service (within the cluster) Ingress or LoadBalancer Service (outside of the cluster) Port-forward (for testing)","title":"Sending Data"},{"location":"user-guide/sources/http/#clusterip-service","text":"An HTTP Source Vertex can generate a ClusterIP Service if service: true is specified. The service name is in the format {pipelineName}-{vertexName} , so the HTTP Source can be accessed through https://{pipelineName}-{vertexName}.{namespace}.svc:8443/vertices/{vertexName} within the cluster. apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : service : true","title":"ClusterIP Service"},{"location":"user-guide/sources/http/#loadbalancer-service-or-ingress","text":"To create a LoadBalancer type Service, or a NodePort one for Ingress, you need to do it on your own. Just remember to use a selector like the following in the Service: numaflow.numaproj.io/pipeline-name : http-pipeline # pipeline name numaflow.numaproj.io/vertex-name : in # vertex name","title":"LoadBalancer Service or Ingress"},{"location":"user-guide/sources/http/#port-forwarding","text":"To test an HTTP source, you can do it from your local machine through port-forwarding. kubectl port-forward pod ${ pod -name } 8443 curl -kq -X POST -d \"hello world\" https://localhost:8443/vertices/in","title":"Port-forwarding"},{"location":"user-guide/sources/http/#x-numaflow-id","text":"When posting data to the HTTP Source, an optional HTTP header x-numaflow-id can be specified, which will be used for deduplication. 
If it's not provided, the HTTP Source will generate a random UUID for that purpose. curl -kq -X POST -H \"x-numaflow-id: ${ id } \" -d \"hello world\" ${ http -source-url }","title":"x-numaflow-id"},{"location":"user-guide/sources/http/#x-numaflow-event-time","text":"By default, the time at which the data arrives at the HTTP source is used as the event time. It can be overridden by setting an HTTP header x-numaflow-event-time whose value is the number of milliseconds elapsed since January 1, 1970 UTC. curl -kq -X POST -H \"x-numaflow-event-time: 1663006726000\" -d \"hello world\" ${ http -source-url }","title":"x-numaflow-event-time"},{"location":"user-guide/sources/http/#auth","text":"A Bearer token can be configured to prevent the HTTP Source from being accessed by unexpected clients. To do so, a Kubernetes Secret needs to be created to store the token, and valid clients also need to include the token in their HTTP request headers. First, create a k8s secret containing your token. echo -n 'tr3qhs321fjglwf1e2e67dfda4tr' > ./token.txt kubectl create secret generic http-source-token --from-file = my-token = ./token.txt Then add auth to the Source Vertex: apiVersion : numaflow.numaproj.io/v1alpha1 kind : Pipeline metadata : name : http-pipeline spec : vertices : - name : in source : http : auth : token : name : http-source-token key : my-token When the clients post data to the Source Vertex, add Authorization: Bearer tr3qhs321fjglwf1e2e67dfda4tr to the header, for example: TOKEN = \"Bearer tr3qhs321fjglwf1e2e67dfda4tr\" # Post data from a Pod in the same namespace of the cluster curl -kq -X POST -H \"Authorization: $TOKEN \" -d \"hello world\" https://http-pipeline-in:8443/vertices/in","title":"Auth"},{"location":"user-guide/sources/http/#health-check","text":"The HTTP Source also has an endpoint /health created automatically, which is useful for LoadBalancer or Ingress configuration, where a health check endpoint is often required by the cloud provider.","title":"Health 
Check"},{"location":"user-guide/sources/kafka/","text":"Kafka Source \u00b6 A Kafka source is used to ingest the messages from a Kafka topic. Numaflow uses consumer-groups to manage offsets. spec : vertices : - name : input source : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic consumerGroup : my-consumer-group config : | # Optional. consumer: offsets: initial: -2 # -2 for sarama.OffsetOldest, -1 for sarama.OffsetNewest. Default to sarama.OffsetNewest. tls : # Optional. insecureSkipVerify : # Optional, whether to skip TLS verification. Defaults to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a 
secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true FAQ \u00b6 How to start the Kafka Source from a specific offset based on datetime? \u00b6 In order to start the Kafka Source from a specific offset based on datetime, we need to reset the offset before we start the pipeline. For example, we have a topic quickstart-events with 3 partitions and a consumer group console-consumer-94457 . This example uses Kafka 3.6.1 and localhost. 
\u279c kafka_2.13-3.6.1 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic quickstart-events Topic: quickstart-events TopicId: WqIN6j7hTQqGZUQWdF7AdA PartitionCount: 3 ReplicationFactor: 1 Configs: Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 1 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 2 Leader: 0 Replicas: 0 Isr: 0 \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list --all-groups console-consumer-94457 We have already consumed all the available messages in the topic quickstart-events , but we want to go back to some datetime and re-consume the data from that datetime. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-94457 GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID console-consumer-94457 quickstart-events 0 56 56 0 - - - console-consumer-94457 quickstart-events 1 38 38 0 - - - console-consumer-94457 quickstart-events 2 4 4 0 - - - To achieve that, before the pipeline start, we need to first stop the consumers in the consumer group console-consumer-94457 because offsets can only be reset if the group console-consumer-94457 is inactive. Then, reset the offsets using the desired date and time. The example command below uses UTC time. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group console-consumer-94457 --topic quickstart-events --to-datetime 2024 -01-19T19:26:00.000 GROUP TOPIC PARTITION NEW-OFFSET console-consumer-94457 quickstart-events 0 54 console-consumer-94457 quickstart-events 1 26 console-consumer-94457 quickstart-events 2 0 Now, we can start the pipeline, and the Kafka source will start consuming the topic quickstart-events with consumer group console-consumer-94457 from the NEW-OFFSET . 
You may need to create a properties file which contains the connectivity details and use it to connect to the clusters. Below are two example config.properties files: SASL/PLAIN and TLS . ssl.endpoint.identification.algorithm=https sasl.mechanism=PLAIN request.timeout.ms=20000 bootstrap.servers= retry.backoff.ms=500 sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \\ username=\"\" \\ password=\"\"; security.protocol=SASL_SSL request.timeout.ms=20000 bootstrap.servers= security.protocol=SSL ssl.enabled.protocols=TLSv1.2 ssl.truststore.location= ssl.truststore.password= Run the command with the --command-config option. bin/kafka-consumer-groups.sh --bootstrap-server --command-config config.properties --execute --reset-offsets --group --topic --to-datetime Reference: - How to Use Kafka Tools With Confluent Cloud - Apache Kafka Security","title":"Kafka Source"},{"location":"user-guide/sources/kafka/#kafka-source","text":"A Kafka source is used to ingest the messages from a Kafka topic. Numaflow uses consumer-groups to manage offsets. spec : vertices : - name : input source : kafka : brokers : - my-broker1:19700 - my-broker2:19700 topic : my-topic consumerGroup : my-consumer-group config : | # Optional. consumer: offsets: initial: -2 # -2 for sarama.OffsetOldest, -1 for sarama.OffsetNewest. Default to sarama.OffsetNewest. tls : # Optional. insecureSkipVerify : # Optional, whether to skip TLS verification. Defaults to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. 
name : my-pk key : my-pk-key sasl : # Optional mechanism : GSSAPI # PLAIN, GSSAPI, SCRAM-SHA-256 or SCRAM-SHA-512, other mechanisms not supported gssapi : # Optional, for GSSAPI mechanism serviceName : my-service realm : my-realm # KRB5_USER_AUTH for auth using password # KRB5_KEYTAB_AUTH for auth using keytab authType : KRB5_KEYTAB_AUTH usernameSecret : # Pointing to a secret reference which contains the username name : gssapi-username key : gssapi-username-key # Pointing to a secret reference which contains the keytab (authType: KRB5_KEYTAB_AUTH) keytabSecret : name : gssapi-keytab key : gssapi-keytab-key # Pointing to a secret reference which contains the password (authType: KRB5_USER_AUTH) passwordSecret : name : gssapi-password key : gssapi-password-key kerberosConfigSecret : # Pointing to a secret reference which contains the kerberos config name : my-kerberos-config key : my-kerberos-config-key plain : # Optional, for PLAIN mechanism userSecret : # Pointing to a secret reference which contains the user name : plain-user key : plain-user-key passwordSecret : # Pointing to a secret reference which contains the password name : plain-password key : plain-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha256 : # Optional, for SCRAM-SHA-256 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-256-user key : scram-sha-256-user-key passwordSecret : # Pointing to a secret reference which contains the password name : scram-sha-256-password key : scram-sha-256-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true scramsha512 : # Optional, for SCRAM-SHA-512 mechanism userSecret : # Pointing to a secret reference which contains the user name : scram-sha-512-user key : scram-sha-512-user-key passwordSecret : # Pointing to a secret 
reference which contains the password name : scram-sha-512-password key : scram-sha-512-password-key # Send the Kafka SASL handshake first if enabled (defaults to true) # Set this to false if using a non-Kafka SASL proxy handshake : true","title":"Kafka Source"},{"location":"user-guide/sources/kafka/#faq","text":"","title":"FAQ"},{"location":"user-guide/sources/kafka/#how-to-start-the-kafka-source-from-a-specific-offset-based-on-datetime","text":"In order to start the Kafka Source from a specific offset based on datetime, we need to reset the offset before we start the pipeline. For example, we have a topic quickstart-events with 3 partitions and a consumer group console-consumer-94457 . This example uses Kafka 3.6.1 and localhost. \u279c kafka_2.13-3.6.1 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic quickstart-events Topic: quickstart-events TopicId: WqIN6j7hTQqGZUQWdF7AdA PartitionCount: 3 ReplicationFactor: 1 Configs: Topic: quickstart-events Partition: 0 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 1 Leader: 0 Replicas: 0 Isr: 0 Topic: quickstart-events Partition: 2 Leader: 0 Replicas: 0 Isr: 0 \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list --all-groups console-consumer-94457 We have already consumed all the available messages in the topic quickstart-events , but we want to go back to some datetime and re-consume the data from that datetime. 
\u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-94457 GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID console-consumer-94457 quickstart-events 0 56 56 0 - - - console-consumer-94457 quickstart-events 1 38 38 0 - - - console-consumer-94457 quickstart-events 2 4 4 0 - - - To achieve that, before the pipeline starts, we need to first stop the consumers in the consumer group console-consumer-94457 because offsets can only be reset if the group console-consumer-94457 is inactive. Then, reset the offsets using the desired date and time. The example command below uses UTC time. \u279c kafka_2.13-3.6.1 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group console-consumer-94457 --topic quickstart-events --to-datetime 2024 -01-19T19:26:00.000 GROUP TOPIC PARTITION NEW-OFFSET console-consumer-94457 quickstart-events 0 54 console-consumer-94457 quickstart-events 1 26 console-consumer-94457 quickstart-events 2 0 Now, we can start the pipeline, and the Kafka source will start consuming the topic quickstart-events with consumer group console-consumer-94457 from the NEW-OFFSET . You may need to create a properties file which contains the connectivity details and use it to connect to the clusters. Below are two example config.properties files: SASL/PLAIN and TLS . ssl.endpoint.identification.algorithm=https sasl.mechanism=PLAIN request.timeout.ms=20000 bootstrap.servers= retry.backoff.ms=500 sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \\ username=\"\" \\ password=\"\"; security.protocol=SASL_SSL request.timeout.ms=20000 bootstrap.servers= security.protocol=SSL ssl.enabled.protocols=TLSv1.2 ssl.truststore.location= ssl.truststore.password= Run the command with the --command-config option. 
bin/kafka-consumer-groups.sh --bootstrap-server --command-config config.properties --execute --reset-offsets --group --topic --to-datetime Reference: - How to Use Kafka Tools With Confluent Cloud - Apache Kafka Security","title":"How to start the Kafka Source from a specific offset based on datetime?"},{"location":"user-guide/sources/nats/","text":"Nats Source \u00b6 A Nats source is used to ingest the messages from a nats subject. spec : vertices : - name : input source : nats : url : nats://demo.nats.io # Multiple urls separated by comma. subject : my-subject queue : my-queue # Queue subscription, see https://docs.nats.io/using-nats/developer/receiving/queues tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key auth : # Optional. basic : # Optional, pointing to the secret references which contain user name and password. user : name : my-secret key : my-user password : name : my-secret key : my-password Auth \u00b6 The auth strategies supported in nats source include basic (user and password), token and nkey , check the API for the details.","title":"Nats Source"},{"location":"user-guide/sources/nats/#nats-source","text":"A Nats source is used to ingest the messages from a nats subject. spec : vertices : - name : input source : nats : url : nats://demo.nats.io # Multiple urls separated by comma. subject : my-subject queue : my-queue # Queue subscription, see https://docs.nats.io/using-nats/developer/receiving/queues tls : # Optional. insecureSkipVerify : # Optional, where to skip TLS verification. Default to false. 
caCertSecret : # Optional, a secret reference, which contains the CA Cert. name : my-ca-cert key : my-ca-cert-key certSecret : # Optional, pointing to a secret reference which contains the Cert. name : my-cert key : my-cert-key keySecret : # Optional, pointing to a secret reference which contains the Private Key. name : my-pk key : my-pk-key auth : # Optional. basic : # Optional, pointing to the secret references which contain user name and password. user : name : my-secret key : my-user password : name : my-secret key : my-password","title":"Nats Source"},{"location":"user-guide/sources/nats/#auth","text":"The auth strategies supported in nats source include basic (user and password), token and nkey , check the API for the details.","title":"Auth"},{"location":"user-guide/sources/overview/","text":"Sources \u00b6 Source vertex is responsible for reliable reading data from an unbounded source into Numaflow. Source vertex may require transformation or formatting of data prior to sending it to the output buffers. Source Vertex also does Watermark tracking and late data detection. In Numaflow, we currently support the following sources Kafka HTTP Ticker Nats User-defined Source A user-defined source is a custom source that a user can write using Numaflow SDK when the user needs to read data from a system that is not supported by the platform's built-in sources. User-defined source also supports custom acknowledge management for exactly-once reading.","title":"Overview"},{"location":"user-guide/sources/overview/#sources","text":"Source vertex is responsible for reliable reading data from an unbounded source into Numaflow. Source vertex may require transformation or formatting of data prior to sending it to the output buffers. Source Vertex also does Watermark tracking and late data detection. 
In Numaflow, we currently support the following sources Kafka HTTP Ticker Nats User-defined Source A user-defined source is a custom source that a user can write using Numaflow SDK when the user needs to read data from a system that is not supported by the platform's built-in sources. User-defined source also supports custom acknowledge management for exactly-once reading.","title":"Sources"},{"location":"user-guide/sources/user-defined-sources/","text":"User-defined Sources \u00b6 A Pipeline may have multiple Sources, those sources could either be a pre-defined source such as kafka , http , etc., or a user-defined source . With no source data transformer, A pre-defined source vertex runs single-container pods; a user-defined source runs two-container pods. Build Your Own User-defined Sources \u00b6 You can build your own user-defined sources in multiple languages. Check the links below to see the examples for different languages. Golang Java Python After building a docker image for the written user-defined source, specify the image as below in the vertex spec. spec : vertices : - name : input source : udsource : container : image : my-source:latest Available Environment Variables \u00b6 Some environment variables are available in the user-defined source container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. User-defined sources contributed from the open source community \u00b6 If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sources like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined source.","title":"User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#user-defined-sources","text":"A Pipeline may have multiple Sources, those sources could either be a pre-defined source such as kafka , http , etc., or a user-defined source . With no source data transformer, A pre-defined source vertex runs single-container pods; a user-defined source runs two-container pods.","title":"User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#build-your-own-user-defined-sources","text":"You can build your own user-defined sources in multiple languages. Check the links below to see the examples for different languages. Golang Java Python After building a docker image for the written user-defined source, specify the image as below in the vertex spec. spec : vertices : - name : input source : udsource : container : image : my-source:latest","title":"Build Your Own User-defined Sources"},{"location":"user-guide/sources/user-defined-sources/#available-environment-variables","text":"Some environment variables are available in the user-defined source container: NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sources/user-defined-sources/#user-defined-sources-contributed-from-the-open-source-community","text":"If you're looking for examples and usages contributed by the open source community, head over to the numaproj-contrib repositories . 
These user-defined sources like AWS SQS, GCP Pub/Sub, provide valuable insights and guidance on how to use and write a user-defined source.","title":"User-defined sources contributed from the open source community"},{"location":"user-guide/sources/transformer/overview/","text":"Source Data Transformer \u00b6 The Source Data Transformer is a feature that allows users to execute custom code to transform their data at source. This functionality offers two primary advantages to users: Event Time Assignment - It enables users to extract the event time from the message payload, providing a more precise and accurate event time than the default mechanisms like LOG_APPEND_TIME of Kafka for Kafka source, custom HTTP header for HTTP source, and others. Early data processing - It pre-processes the data, or filters out unwanted data at source vertex, saving the cost of creating another UDF vertex and an inter-step buffer. Source Data Transformer runs as a sidecar container in a Source Vertex Pod. Data processing in the transformer is supposed to be idempotent. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket. Built-in Transformers \u00b6 There are some Built-in Transformers that can be used directly. Build Your Own Transformer \u00b6 You can build your own transformer in multiple languages. A user-defined transformer could be as simple as the example below in Golang. In the example, the transformer extracts event times from timestamp of the JSON payload and assigns them to messages as new event times. It also filters out unwanted messages based on filterOut of the payload. package main import ( \"context\" \"encoding/json\" \"log\" \"time\" \"github.com/numaproj/numaflow-go/pkg/sourcetransformer\" ) func transform ( _ context . Context , keys [] string , data sourcetransformer . Datum ) sourcetransformer . Messages { /* Input messages are in JSON format. 
Sample: {\"timestamp\": 1673239888, \"filterOut\": true}. Field \"timestamp\" shows the real event time of the message, in the format of epoch. Field \"filterOut\" indicates whether the message should be filtered out, in the format of boolean. */ var jsonObject map [ string ] interface {} json . Unmarshal ( data . Value (), & jsonObject ) // event time assignment eventTime := data . EventTime () // if timestamp field exists, extract event time from payload. if ts , ok := jsonObject [ \"timestamp\" ]; ok { eventTime = time . Unix ( int64 ( ts .( float64 )), 0 ) } // data filtering var filterOut bool if f , ok := jsonObject [ \"filterOut\" ]; ok { filterOut = f .( bool ) } if filterOut { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . MessageToDrop ( eventTime )) } else { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . NewMessage ( data . Value (), eventTime ). WithKeys ( keys )) } } func main () { err := sourcetransformer . NewServer ( sourcetransformer . SourceTransformFunc ( transform )). Start ( context . Background ()) if err != nil { log . Panic ( \"Failed to start source transform server: \" , err ) } } Check the links below to see another transformer example in various programming languages, where we apply conditional forwarding based on the input event time. Python Golang Java After building a docker image for the written transformer, specify the image as below in the source vertex spec. spec : vertices : - name : my-vertex source : http : {} transformer : container : image : my-python-transformer-example:latest Available Environment Variables \u00b6 Some environment variables are available in the source transformer container; they might be useful in your own source data transformer implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.
Configuration \u00b6 Configuration data can be provided to the transformer container at runtime multiple ways. environment variables args command volumes init containers","title":"Overview"},{"location":"user-guide/sources/transformer/overview/#source-data-transformer","text":"The Source Data Transformer is a feature that allows users to execute custom code to transform their data at source. This functionality offers two primary advantages to users: Event Time Assignment - It enables users to extract the event time from the message payload, providing a more precise and accurate event time than the default mechanisms like LOG_APPEND_TIME of Kafka for Kafka source, custom HTTP header for HTTP source, and others. Early data processing - It pre-processes the data, or filters out unwanted data at source vertex, saving the cost of creating another UDF vertex and an inter-step buffer. Source Data Transformer runs as a sidecar container in a Source Vertex Pod. Data processing in the transformer is supposed to be idempotent. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket.","title":"Source Data Transformer"},{"location":"user-guide/sources/transformer/overview/#built-in-transformers","text":"There are some Built-in Transformers that can be used directly.","title":"Built-in Transformers"},{"location":"user-guide/sources/transformer/overview/#build-your-own-transformer","text":"You can build your own transformer in multiple languages. A user-defined transformer could be as simple as the example below in Golang. In the example, the transformer extracts event times from timestamp of the JSON payload and assigns them to messages as new event times. It also filters out unwanted messages based on filterOut of the payload. package main import ( \"context\" \"encoding/json\" \"log\" \"time\" \"github.com/numaproj/numaflow-go/pkg/sourcetransformer\" ) func transform ( _ context . 
Context , keys [] string , data sourcetransformer . Datum ) sourcetransformer . Messages { /* Input messages are in JSON format. Sample: {\"timestamp\": 1673239888, \"filterOut\": true}. Field \"timestamp\" shows the real event time of the message, in the format of epoch. Field \"filterOut\" indicates whether the message should be filtered out, in the format of boolean. */ var jsonObject map [ string ] interface {} json . Unmarshal ( data . Value (), & jsonObject ) // event time assignment eventTime := data . EventTime () // if timestamp field exists, extract event time from payload. if ts , ok := jsonObject [ \"timestamp\" ]; ok { eventTime = time . Unix ( int64 ( ts .( float64 )), 0 ) } // data filtering var filterOut bool if f , ok := jsonObject [ \"filterOut\" ]; ok { filterOut = f .( bool ) } if filterOut { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . MessageToDrop ( eventTime )) } else { return sourcetransformer . MessagesBuilder (). Append ( sourcetransformer . NewMessage ( data . Value (), eventTime ). WithKeys ( keys )) } } func main () { err := sourcetransformer . NewServer ( sourcetransformer . SourceTransformFunc ( transform )). Start ( context . Background ()) if err != nil { log . Panic ( \"Failed to start source transform server: \" , err ) } } Check the links below to see another transformer example in various programming languages, where we apply conditional forwarding based on the input event time. Python Golang Java After building a docker image for the written transformer, specify the image as below in the source vertex spec.
spec : vertices : - name : my-vertex source : http : {} transformer : container : image : my-python-transformer-example:latest","title":"Build Your Own Transformer"},{"location":"user-guide/sources/transformer/overview/#available-environment-variables","text":"Some environment variables are available in the source transformer container; they might be useful in your own source data transformer implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/sources/transformer/overview/#configuration","text":"Configuration data can be provided to the transformer container at runtime multiple ways. environment variables args command volumes init containers","title":"Configuration"},{"location":"user-guide/sources/transformer/builtin-transformers/","text":"Built-in Functions \u00b6 Numaflow provides some built-in source data transformers that can be used directly. Filter A filter built-in transformer filters the message based on expression. payload keyword represents message object. see documentation for filter expression here spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100 Event Time Extractor A eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on expression and user-specified format. payload keyword represents message object. see documentation for event time extractor expression here . If you want to handle event times in epoch format, you can find helpful resource here .
spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : json(payload).item[0].time format : 2006-01-02T15:04:05Z07:00 Time Extraction Filter A timeExtractionFilter implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and if valid, extracts event time from the payload of the message. spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Overview"},{"location":"user-guide/sources/transformer/builtin-transformers/#built-in-functions","text":"Numaflow provides some built-in source data transformers that can be used directly. Filter A filter built-in transformer filters the message based on expression. payload keyword represents message object. see documentation for filter expression here spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100 Event Time Extractor A eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on expression and user-specified format. payload keyword represents message object. see documentation for event time extractor expression here . If you want to handle event times in epoch format, you can find helpful resource here . spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : json(payload).item[0].time format : 2006-01-02T15:04:05Z07:00 Time Extraction Filter A timeExtractionFilter implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and if valid, extracts event time from the payload of the message.
spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Built-in Functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/","text":"Event Time Extractor \u00b6 A eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on a user-provided expression and an optional format specification. expression is used to compile the payload to a string representation of the event time. format is used to convert the event time in string format to a time.Time object. Expression (required) \u00b6 Event Time Extractor expression is implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. payload keyword represents the message object. It will be the root element to represent the message object in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp Format (optional) \u00b6 Depending on whether a format is specified, Event Time Extractor uses different approaches to convert the event time string to a time.Time object. When specified \u00b6 When format is specified, the native time.Parse(layout, value string) library is used to make the conversion. In this process, the format parameter is passed as the layout input to the time.Parse() function, while the event time string is passed as the value parameter. 
When not specified \u00b6 When format is not specified, the extractor uses dateparse to parse the event time string without knowing the format in advance. How to specify format \u00b6 Please refer to golang format library . Error Scenarios \u00b6 When encountering parsing errors, event time extractor skips the extraction and passes on the message without modifying the original input message event time. Errors can occur for a variety of reasons, including: format is specified but the event time string can't parse to the specified format. format is not specified but dateparse can't convert the event time string to a time.Time object. Ambiguous event time strings \u00b6 Event time strings can be ambiguous when it comes to date format, such as MM/DD/YYYY versus DD/MM/YYYY. When using such a format, you're required to explicitly specify format , to avoid confusion. If no format is provided, event time extractor treats ambiguous event time strings as an error scenario. Epoch format \u00b6 If the event time string in your message payload is in epoch format, you can skip specifying a format . You can rely on dateparse to recognize a wide range of epoch timestamp formats, including Unix seconds, milliseconds, microseconds, and nanoseconds. Event Time Extractor Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : sprig.trim(string(json(payload).timestamp)) format : 2006-01-02T15:04:05Z07:00","title":"Event Time Extractor"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#event-time-extractor","text":"A eventTimeExtractor built-in transformer extracts event time from the payload of the message, based on a user-provided expression and an optional format specification. expression is used to compile the payload to a string representation of the event time.
format is used to convert the event time in string format to a time.Time object.","title":"Event Time Extractor"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#expression-required","text":"Event Time Extractor expression is implemented with expr and sprig libraries.","title":"Expression (required)"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#data-conversion-functions","text":"These function can be accessed directly in expression. payload keyword represents the message object. It will be the root element to represent the message object in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#sprig-functions","text":"Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#format-optional","text":"Depending on whether a format is specified, Event Time Extractor uses different approaches to convert the event time string to a time.Time object.","title":"Format (optional)"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#when-specified","text":"When format is specified, the native time.Parse(layout, value string) library is used to make the conversion. 
In this process, the format parameter is passed as the layout input to the time.Parse() function, while the event time string is passed as the value parameter.","title":"When specified"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#when-not-specified","text":"When format is not specified, the extractor uses dateparse to parse the event time string without knowing the format in advance.","title":"When not specified"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#how-to-specify-format","text":"Please refer to golang format library .","title":"How to specify format"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#error-scenarios","text":"When encountering parsing errors, event time extractor skips the extraction and passes on the message without modifying the original input message event time. Errors can occur for a variety of reasons, including: format is specified but the event time string can't parse to the specified format. format is not specified but dateparse can't convert the event time string to a time.Time object.","title":"Error Scenarios"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#ambiguous-event-time-strings","text":"Event time strings can be ambiguous when it comes to date format, such as MM/DD/YYYY versus DD/MM/YYYY. When using such a format, you're required to explicitly specify format , to avoid confusion. If no format is provided, event time extractor treats ambiguous event time strings as an error scenario.","title":"Ambiguous event time strings"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#epoch-format","text":"If the event time string in your message payload is in epoch format, you can skip specifying a format .
You can rely on dateparse to recognize a wide range of epoch timestamp formats, including Unix seconds, milliseconds, microseconds, and nanoseconds.","title":"Epoch format"},{"location":"user-guide/sources/transformer/builtin-transformers/event-time-extractor/#event-time-extractor-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : eventTimeExtractor kwargs : expression : sprig.trim(string(json(payload).timestamp)) format : 2006-01-02T15:04:05Z07:00","title":"Event Time Extractor Spec"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/","text":"Filter \u00b6 A filter is a special-purpose built-in function. It is used to evaluate on each message in a pipeline and is often used to filter the number of messages that are passed to next vertices. Filter function supports comprehensive expression language which extends flexibility write complex expressions. payload will be root element to represent the message object in expression. Expression \u00b6 Filter expression implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100 Filter Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#filter","text":"A filter is a special-purpose built-in function. 
It is used to evaluate on each message in a pipeline and is often used to filter the number of messages that are passed to next vertices. Filter function supports comprehensive expression language which extends flexibility write complex expressions. payload will be root element to represent the message object in expression.","title":"Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#expression","text":"Filter expression implemented with expr and sprig libraries.","title":"Expression"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#data-conversion-functions","text":"These function can be accessed directly in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#sprig-functions","text":"Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . int(json(sprig.b64dec(payload)).id) < 100","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/filter/#filter-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter Spec"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/","text":"Time Extraction Filter \u00b6 A timeExtractionFilter built-in transformer implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and if valid, extracts event time from the payload of the message. filterExpr is used to evaluate and drop invalid messages.
eventTimeExpr is used to compile the payload to a string representation of the event time. format is used to convert the event time in string format to a time.Time object. Expression (required) \u00b6 The expressions for the filter and event time extractor are implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. payload keyword represents the message object. It will be the root element to represent the message object in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp Format (optional) \u00b6 Depending on whether a format is specified, the Event Time Extractor uses different approaches to convert the event time string to a time.Time object. Time Extraction Filter Spec \u00b6 spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Event Time Extraction Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#time-extraction-filter","text":"A timeExtractionFilter built-in transformer implements both the eventTimeExtractor and filter built-in functions. It evaluates a message on a pipeline and if valid, extracts event time from the payload of the message. filterExpr is used to evaluate and drop invalid messages. eventTimeExpr is used to compile the payload to a string representation of the event time.
format is used to convert the event time in string format to a time.Time object.","title":"Time Extraction Filter"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#expression-required","text":"The expressions for the filter and event time extractor are implemented with expr and sprig libraries.","title":"Expression (required)"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#data-conversion-functions","text":"These function can be accessed directly in expression. payload keyword represents the message object. It will be the root element to represent the message object in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#sprig-functions","text":"Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. 
sprig functions E.g: sprig.trim(string(json(payload).timestamp)) # Remove spaces from either side of the value of field timestamp","title":"Sprig functions"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#format-optional","text":"Depending on whether a format is specified, the Event Time Extractor uses different approaches to convert the event time string to a time.Time object.","title":"Format (optional)"},{"location":"user-guide/sources/transformer/builtin-transformers/time-extraction-filter/#time-extraction-filter-spec","text":"spec : vertices : - name : in source : http : {} transformer : builtin : name : timeExtractionFilter kwargs : filterExpr : int(json(payload).id) < 100 eventTimeExpr : json(payload).item[1].time eventTimeFormat : 2006-01-02T15:04:05Z07:00","title":"Time Extraction Filter Spec"},{"location":"user-guide/use-cases/monitoring-and-observability/","text":"Monitoring and Observability \u00b6 Docs \u00b6 How Intuit platform engineers use Numaflow to compute golden signals . Videos \u00b6 Numaflow as the stream-processing solution in Intuit\u2019s Customer Centric Observability Journey Using AIOps Using Numaflow for fast incident detection: Argo CD Observability with AIOps - Detect Incident Fast Implementing anomaly detection with Numaflow: Cluster Golden Signals to Avoid Alert Fatigue at Scale Appendix: What is Monitoring and Observability? \u00b6 Monitoring and observability are two critical concepts in software engineering that help developers ensure the health and performance of their applications. Monitoring refers to the process of collecting and analyzing data about an application's performance. This data can include metrics such as CPU usage, memory usage, network traffic, and response times. Monitoring tools allow developers to track these metrics over time and set alerts when certain thresholds are exceeded. This enables them to quickly identify and respond to issues before they become critical. 
Observability, on the other hand, is a more holistic approach to monitoring that focuses on understanding the internal workings of an application. Observability tools provide developers with deep insights into the behavior of their applications, allowing them to understand how different components interact with each other and how changes in one area can affect the overall system. This includes collecting data on things like logs, traces, and events, which can be used to reconstruct the state of the system at any given point in time. Together, monitoring and observability provide developers with a comprehensive view of their applications' performance, enabling them to quickly identify and respond to issues as they arise. By leveraging these tools, software engineers can ensure that their applications are running smoothly and efficiently, delivering the best possible experience to their users.","title":"Monitoring and Observability"},{"location":"user-guide/use-cases/monitoring-and-observability/#monitoring-and-observability","text":"","title":"Monitoring and Observability"},{"location":"user-guide/use-cases/monitoring-and-observability/#docs","text":"How Intuit platform engineers use Numaflow to compute golden signals .","title":"Docs"},{"location":"user-guide/use-cases/monitoring-and-observability/#videos","text":"Numaflow as the stream-processing solution in Intuit\u2019s Customer Centric Observability Journey Using AIOps Using Numaflow for fast incident detection: Argo CD Observability with AIOps - Detect Incident Fast Implementing anomaly detection with Numaflow: Cluster Golden Signals to Avoid Alert Fatigue at Scale","title":"Videos"},{"location":"user-guide/use-cases/monitoring-and-observability/#appendix-what-is-monitoring-and-observability","text":"Monitoring and observability are two critical concepts in software engineering that help developers ensure the health and performance of their applications. 
Monitoring refers to the process of collecting and analyzing data about an application's performance. This data can include metrics such as CPU usage, memory usage, network traffic, and response times. Monitoring tools allow developers to track these metrics over time and set alerts when certain thresholds are exceeded. This enables them to quickly identify and respond to issues before they become critical. Observability, on the other hand, is a more holistic approach to monitoring that focuses on understanding the internal workings of an application. Observability tools provide developers with deep insights into the behavior of their applications, allowing them to understand how different components interact with each other and how changes in one area can affect the overall system. This includes collecting data on things like logs, traces, and events, which can be used to reconstruct the state of the system at any given point in time. Together, monitoring and observability provide developers with a comprehensive view of their applications' performance, enabling them to quickly identify and respond to issues as they arise. By leveraging these tools, software engineers can ensure that their applications are running smoothly and efficiently, delivering the best possible experience to their users.","title":"Appendix: What is Monitoring and Observability?"},{"location":"user-guide/use-cases/overview/","text":"Overview \u00b6 Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. In this section, you'll find sample use cases for Numaflow and learn how to leverage its features for your stream processing tasks. Real-time data analytics applications. Event-driven applications: anomaly detection and monitoring . Streaming applications: data instrumentation and movement. 
Any workflows running in a streaming manner. Numaflow is still a relatively new tool, and there are likely many other use cases that we haven't yet explored. We're committed to keeping this page up-to-date with the latest use cases and best practices for using Numaflow. We welcome contributions from the community and encourage you to share your own use cases and experiences with us. As we continue to develop and improve Numaflow, we look forward to seeing the cool things you build with it!","title":"Overview"},{"location":"user-guide/use-cases/overview/#overview","text":"Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes. In this section, you'll find sample use cases for Numaflow and learn how to leverage its features for your stream processing tasks. Real-time data analytics applications. Event-driven applications: anomaly detection and monitoring . Streaming applications: data instrumentation and movement. Any workflows running in a streaming manner. Numaflow is still a relatively new tool, and there are likely many other use cases that we haven't yet explored. We're committed to keeping this page up-to-date with the latest use cases and best practices for using Numaflow. We welcome contributions from the community and encourage you to share your own use cases and experiences with us. As we continue to develop and improve Numaflow, we look forward to seeing the cool things you build with it!","title":"Overview"},{"location":"user-guide/user-defined-functions/user-defined-functions/","text":"A Pipeline consists of multiple vertices, Source , Sink and UDF(user-defined functions) . User-defined functions (UDF) is the vertex where users can run custom code to transform the data. Data processing in the UDF is supposed to be idempotent. 
UDF runs as a sidecar container in a Vertex Pod, processes the received data. The communication between the main container (platform code) and the sidecar container (user code) is through gRPC over Unix Domain Socket. There are two kinds of processing users can run Map Reduce","title":"Overview"},{"location":"user-guide/user-defined-functions/map/examples/","text":"Map Examples \u00b6 Please read map to get the best out of these examples. Prerequisites \u00b6 Inter-Step Buffer Service (ISB Service) \u00b6 What is ISB Service? \u00b6 An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB. How to install the ISB Service \u00b6 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url } Creating a Simple Map Pipeline \u00b6 Now we will walk you through creating a map pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of a map pipeline. 
An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. 
# Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for a pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Examples"},{"location":"user-guide/user-defined-functions/map/examples/#map-examples","text":"Please read map to get the best out of these examples.","title":"Map Examples"},{"location":"user-guide/user-defined-functions/map/examples/#prerequisites","text":"","title":"Prerequisites"},{"location":"user-guide/user-defined-functions/map/examples/#inter-step-buffer-service-isb-service","text":"","title":"Inter-Step Buffer Service (ISB Service)"},{"location":"user-guide/user-defined-functions/map/examples/#what-is-isb-service","text":"An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. 
Please refer to the doc Inter-Step Buffer Service for more information on ISB.","title":"What is ISB Service?"},{"location":"user-guide/user-defined-functions/map/examples/#how-to-install-the-isb-service","text":"kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url }","title":"How to install the ISB Service"},{"location":"user-guide/user-defined-functions/map/examples/#creating-a-simple-map-pipeline","text":"Now we will walk you through creating a map pipeline. In our example, this is called the even-odd pipeline, illustrated by the following diagram: There are five vertices in this example of a map pipeline. An HTTP source vertex which serves an HTTP endpoint to receive numbers as source data, a UDF vertex to tag the ingested numbers with the key even or odd , three Log sinks, one to print the even numbers, one to print the odd numbers, and the other one to print both the even and odd numbers. Run the following command to create the even-odd pipeline. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml You may opt to view the list of pipelines you've created so far by running kubectl get pipeline . 
Otherwise, proceed to inspect the status of the pipeline, using kubectl get pods . # Wait for pods to be ready kubectl get pods NAME READY STATUS RESTARTS AGE even-odd-daemon-64d65c945d-vjs9f 1 /1 Running 0 5m3s even-odd-even-or-odd-0-pr4ze 2 /2 Running 0 30s even-odd-even-sink-0-unffo 1 /1 Running 0 22s even-odd-in-0-a7iyd 1 /1 Running 0 5m3s even-odd-number-sink-0-zmg2p 1 /1 Running 0 7s even-odd-odd-sink-0-2736r 1 /1 Running 0 15s isbsvc-default-js-0 3 /3 Running 0 10m isbsvc-default-js-1 3 /3 Running 0 10m isbsvc-default-js-2 3 /3 Running 0 10m Next, port-forward the HTTP endpoint, and make a POST request using curl . Remember to replace xxxxx with the appropriate pod names both here and in the next step. kubectl port-forward even-odd-in-0-xxxx 8444 :8443 # Post data to the HTTP endpoint curl -kq -X POST -d \"101\" https://localhost:8444/vertices/in curl -kq -X POST -d \"102\" https://localhost:8444/vertices/in curl -kq -X POST -d \"103\" https://localhost:8444/vertices/in curl -kq -X POST -d \"104\" https://localhost:8444/vertices/in Now you can watch the log for the even and odd vertices by running the commands below. # Watch the log for the even vertex kubectl logs -f even-odd-even-sink-0-xxxxx 2022 /09/07 22 :29:40 ( even-sink ) 102 2022 /09/07 22 :29:40 ( even-sink ) 104 # Watch the log for the odd vertex kubectl logs -f even-odd-odd-sink-0-xxxxx 2022 /09/07 22 :30:19 ( odd-sink ) 101 2022 /09/07 22 :30:19 ( odd-sink ) 103 View the UI for a pipeline at https://localhost:8443/. The source code of the even-odd user-defined function can be found here . You also can replace the Log Sink with some other sinks like Kafka to forward the data to Kafka topics. 
The pipeline can be deleted by kubectl delete -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/2-even-odd-pipeline.yaml","title":"Creating a Simple Map Pipeline"},{"location":"user-guide/user-defined-functions/map/map/","text":"Map UDF \u00b6 Map in a Map vertex takes an input and returns 0, 1, or more outputs (also known as flat-map operation). Map is an element wise operator. Builtin UDF \u00b6 There are some Built-in Functions that can be used directly. Build Your Own UDF \u00b6 You can build your own UDF in multiple languages. Check the links below to see the UDF examples for different languages. Python Golang Java After building a docker image for the written UDF, specify the image as below in the vertex spec. spec : vertices : - name : my-vertex udf : container : image : my-python-udf-example:latest Streaming Mode \u00b6 In cases the map function generates more than one output (e.g., flat map), the UDF can be configured to run in a streaming mode instead of batching, which is the default mode. In streaming mode, the messages will be pushed to the downstream vertices once generated instead of in a batch at the end. Note that to maintain data orderliness, we restrict the read batch size to be 1 . spec : vertices : - name : my-vertex limits : # mapstreaming won't work if readBatchSize is != 1 readBatchSize : 1 Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java Batch Map Mode \u00b6 BatchMap is an interface that allows developers to process multiple data items in a UDF single call, rather than each item in separate calls. The BatchMap interface can be helpful in scenarios where performing operations on a group of data can be more efficient. Important Considerations \u00b6 When using BatchMap, there are a few important considerations to keep in mind: Ensure that the BatchResponses object is tagged with the correct request ID. 
Each Datum has a unique ID tag, which will be used by Numaflow to ensure correctness. Ensure that the length of the BatchResponses list is equal to the number of requests received. This means that for every input data item, there should be a corresponding response in the BatchResponses list. Check the links below to see the UDF examples in batch mode for different languages. Python Golang Java Rust Available Environment Variables \u00b6 Some environment variables are available in the user-defined function container, they might be useful in your own UDF implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex. Configuration \u00b6 Configuration data can be provided to the UDF container at runtime multiple ways. environment variables args command volumes init containers","title":"Overview"},{"location":"user-guide/user-defined-functions/map/map/#map-udf","text":"Map in a Map vertex takes an input and returns 0, 1, or more outputs (also known as flat-map operation). Map is an element wise operator.","title":"Map UDF"},{"location":"user-guide/user-defined-functions/map/map/#builtin-udf","text":"There are some Built-in Functions that can be used directly.","title":"Builtin UDF"},{"location":"user-guide/user-defined-functions/map/map/#build-your-own-udf","text":"You can build your own UDF in multiple languages. Check the links below to see the UDF examples for different languages. Python Golang Java After building a docker image for the written UDF, specify the image as below in the vertex spec. 
spec : vertices : - name : my-vertex udf : container : image : my-python-udf-example:latest","title":"Build Your Own UDF"},{"location":"user-guide/user-defined-functions/map/map/#streaming-mode","text":"In cases the map function generates more than one output (e.g., flat map), the UDF can be configured to run in a streaming mode instead of batching, which is the default mode. In streaming mode, the messages will be pushed to the downstream vertices once generated instead of in a batch at the end. Note that to maintain data orderliness, we restrict the read batch size to be 1 . spec : vertices : - name : my-vertex limits : # mapstreaming won't work if readBatchSize is != 1 readBatchSize : 1 Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/map/map/#batch-map-mode","text":"BatchMap is an interface that allows developers to process multiple data items in a UDF single call, rather than each item in separate calls. The BatchMap interface can be helpful in scenarios where performing operations on a group of data can be more efficient.","title":"Batch Map Mode"},{"location":"user-guide/user-defined-functions/map/map/#important-considerations","text":"When using BatchMap, there are a few important considerations to keep in mind: Ensure that the BatchResponses object is tagged with the correct request ID. Each Datum has a unique ID tag, which will be used by Numaflow to ensure correctness. Ensure that the length of the BatchResponses list is equal to the number of requests received. This means that for every input data item, there should be a corresponding response in the BatchResponses list. Check the links below to see the UDF examples in batch mode for different languages. 
Python Golang Java Rust","title":"Important Considerations"},{"location":"user-guide/user-defined-functions/map/map/#available-environment-variables","text":"Some environment variables are available in the user-defined function container, they might be useful in your own UDF implementation. NUMAFLOW_NAMESPACE - Namespace. NUMAFLOW_POD - Pod name. NUMAFLOW_REPLICA - Replica index. NUMAFLOW_PIPELINE_NAME - Name of the pipeline. NUMAFLOW_VERTEX_NAME - Name of the vertex.","title":"Available Environment Variables"},{"location":"user-guide/user-defined-functions/map/map/#configuration","text":"Configuration data can be provided to the UDF container at runtime multiple ways. environment variables args command volumes init containers","title":"Configuration"},{"location":"user-guide/user-defined-functions/map/builtin-functions/","text":"Built-in Functions \u00b6 Numaflow provides some built-in functions that can be used directly. Cat A cat builtin UDF does nothing but return the same messages it receives, it is very useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat Filter A filter built-in UDF does filter the message based on expression. payload keyword represents message object. see documentation for expression here spec : vertices : - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(object(payload).id) > 100","title":"Overview"},{"location":"user-guide/user-defined-functions/map/builtin-functions/#built-in-functions","text":"Numaflow provides some built-in functions that can be used directly. Cat A cat builtin UDF does nothing but return the same messages it receives, it is very useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat Filter A filter built-in UDF does filter the message based on expression. payload keyword represents message object. 
see documentation for expression here spec : vertices : - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(object(payload).id) > 100","title":"Built-in Functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/cat/","text":"Cat \u00b6 A cat builtin function does nothing but return the same messages it receives, it is very useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat","title":"Cat"},{"location":"user-guide/user-defined-functions/map/builtin-functions/cat/#cat","text":"A cat builtin function does nothing but return the same messages it receives, it is very useful for debugging and testing. spec : vertices : - name : cat-vertex udf : builtin : name : cat","title":"Cat"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/","text":"Filter \u00b6 A filter is a special-purpose built-in function. It is used to evaluate on each message in a pipeline and is often used to filter the number of messages that are passed to next vertices. Filter function supports comprehensive expression language which extend flexibility write complex expressions. payload will be root element to represent the message object in expression. Expression \u00b6 Filter expression implemented with expr and sprig libraries. Data conversion functions \u00b6 These function can be accessed directly in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount) Sprig functions \u00b6 Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . 
int(json(sprig.b64dec(payload)).id) < 100 Filter Spec \u00b6 - name : filter-vertex udf : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#filter","text":"A filter is a special-purpose built-in function. It is used to evaluate on each message in a pipeline and is often used to filter the number of messages that are passed to next vertices. Filter function supports comprehensive expression language which extend flexibility write complex expressions. payload will be root element to represent the message object in expression.","title":"Filter"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#expression","text":"Filter expression implemented with expr and sprig libraries.","title":"Expression"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#data-conversion-functions","text":"These function can be accessed directly in expression. json - Convert payload in JSON object. e.g: json(payload) int - Convert element/payload into int value. e.g: int(json(payload).id) string - Convert element/payload into string value. e.g: string(json(payload).amount)","title":"Data conversion functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#sprig-functions","text":"Sprig library has 70+ functions. sprig prefix need to be added to access the sprig functions. sprig functions E.g: sprig.contains('James', json(payload).name) # James is contained in the value of name . 
int(json(sprig.b64dec(payload)).id) < 100","title":"Sprig functions"},{"location":"user-guide/user-defined-functions/map/builtin-functions/filter/#filter-spec","text":"- name : filter-vertex udf : builtin : name : filter kwargs : expression : int(json(payload).id) < 100","title":"Filter Spec"},{"location":"user-guide/user-defined-functions/reduce/examples/","text":"Reduce Examples \u00b6 Please read reduce to get the best out of these examples. Prerequisites \u00b6 Inter-Step Buffer Service (ISB Service) \u00b6 What is ISB Service? \u00b6 An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB. How to install the ISB Service \u00b6 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url } Sum Pipeline Using Fixed Window \u00b6 This is a simple reduce pipeline that just does summation (sum of numbers) but uses fixed window. The snippet for the reduce vertex is as follows. 
- name : compute-sum udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true 6-reduce-fixed-window.yaml has the complete pipeline definition. In this example we use a partitions of 2 . We are setting a partitions > 1 because it is a keyed window. - name : compute-sum partitions : 2 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/6-reduce-fixed-window.yaml Output : 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 120000 End - 180000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 120000 End - 180000 2023/01/05 11:54:42 (sink) Payload - 600 Key - even Start - 180000 End - 240000 2023/01/05 11:54:42 (sink) Payload - 300 Key - odd Start - 180000 End - 240000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we have considered a fixed window of length 60s, and also we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000. So the output will be 300(5 * 60) and 600(10 * 60). If we had used a non keyed window ( keyed: false ), we would have seen one single output with value of 900(300 of odd + 600 of even) for each window. Sum Pipeline Using Sliding Window \u00b6 This is a simple reduce pipeline that just does summation (sum of numbers) but uses sliding window. The snippet for the reduce vertex is as follows. 
- name : reduce-sliding udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : sliding : length : 60s slide : 10s keyed : true 7-reduce-sliding-window.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/7-reduce-sliding-window.yaml Output: 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 80000 End - 140000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 80000 End - 140000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we have considered a sliding window of length 60s and slide 10s, and we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000, and because the slide duration is 10s, the next set of windows will be created with a start time of 70000 and an end time of 130000. Since it's a sum operation, the output will be 300(5 * 60) and 600(10 * 60). 
Payload - 50 Key - odd Start - 10000 End - 70000 , we see 50 here for odd because the first window has only 10 elements Complex Reduce Pipeline \u00b6 In the complex reduce example, we will chain of reduce functions use both fixed and sliding windows use keyed and non-keyed windowing 8-reduce-complex-pipeline.yaml has the complete pipeline definition kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/8-reduce-complex-pipeline.yaml Output: 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 80000 End - 140000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 90000 End - 150000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 100000 End - 160000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 110000 End - 170000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 120000 End - 180000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 130000 End - 190000 In our example, first we have the reduce vertex with a fixed window of duration 5s. Since the input is 5 and 10, the output from the first reduce vertex will be 25 (5 * 5) and 50 (5 * 10). This will be passed to the next non-keyed reduce vertex with the fixed window duration of 10s. This being a non-keyed, it will combine the inputs and produce the output of 150(25 * 2 + 50 * 2), which will be passed to the reduce vertex with a sliding window of duration 60s and with the slide duration of 10s. 
Hence the final output will be 900(150 * 6).","title":"Examples"},{"location":"user-guide/user-defined-functions/reduce/examples/#reduce-examples","text":"Please read reduce to get the best out of these examples.","title":"Reduce Examples"},{"location":"user-guide/user-defined-functions/reduce/examples/#prerequisites","text":"","title":"Prerequisites"},{"location":"user-guide/user-defined-functions/reduce/examples/#inter-step-buffer-service-isb-service","text":"","title":"Inter-Step Buffer Service (ISB Service)"},{"location":"user-guide/user-defined-functions/reduce/examples/#what-is-isb-service","text":"An Inter-Step Buffer Service is described by a Custom Resource , which is used to pass data between vertices of a numaflow pipeline. Please refer to the doc Inter-Step Buffer Service for more information on ISB.","title":"What is ISB Service?"},{"location":"user-guide/user-defined-functions/reduce/examples/#how-to-install-the-isb-service","text":"kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/stable/examples/0-isbsvc-jetstream.yaml The expected output of the above command is shown below: $ kubectl get isbsvc NAME TYPE PHASE MESSAGE AGE default jetstream Running 3d19h # Wait for pods to be ready $ kubectl get pods NAME READY STATUS RESTARTS AGE isbsvc-default-js-0 3 /3 Running 0 19s isbsvc-default-js-1 3 /3 Running 0 19s isbsvc-default-js-2 3 /3 Running 0 19s NOTE The Source used in the examples is an HTTP source producing messages with values 5 and 10 with event time starting from 60000. Please refer to the doc http source on how to use an HTTP source. 
An example will be as follows, curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"5\" ${ http -source-url } curl -kq -X POST -H \"x-numaflow-event-time: 60000\" -d \"10\" ${ http -source-url }","title":"How to install the ISB Service"},{"location":"user-guide/user-defined-functions/reduce/examples/#sum-pipeline-using-fixed-window","text":"This is a simple reduce pipeline that just does summation (sum of numbers) but uses fixed window. The snippet for the reduce vertex is as follows. - name : compute-sum udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : fixed : length : 60s keyed : true 6-reduce-fixed-window.yaml has the complete pipeline definition. In this example we use a partitions of 2 . We are setting a partitions > 1 because it is a keyed window. - name : compute-sum partitions : 2 kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/6-reduce-fixed-window.yaml Output : 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 11:54:41 (sink) Payload - 300 Key - odd Start - 120000 End - 180000 2023/01/05 11:54:41 (sink) Payload - 600 Key - even Start - 120000 End - 180000 2023/01/05 11:54:42 (sink) Payload - 600 Key - even Start - 180000 End - 240000 2023/01/05 11:54:42 (sink) Payload - 300 Key - odd Start - 180000 End - 240000 In our example, input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Since we have considered a fixed window of length 60s, and also we are producing two messages with different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000. So the output will be 300(5 * 60) and 600(10 * 60). 
If we had used a non keyed window ( keyed: false ), we would have seen one single output with value of 900(300 of odd + 600 of even) for each window.","title":"Sum Pipeline Using Fixed Window"},{"location":"user-guide/user-defined-functions/reduce/examples/#sum-pipeline-using-sliding-window","text":"This is a simple reduce pipeline that just does summation (sum of numbers) but uses sliding window. The snippet for the reduce vertex is as follows. - name : reduce-sliding udf : container : # compute the sum image : quay.io/numaio/numaflow-go/reduce-sum:v0.5.0 groupBy : window : sliding : length : 60s slide : 10s keyed : true 7-reduce-sliding-window.yaml has the complete pipeline definition. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/7-reduce-sliding-window.yaml Output: 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 60000 End - 120000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 70000 End - 130000 2023/01/05 15:13:16 (sink) Payload - 300 Key - odd Start - 80000 End - 140000 2023/01/05 15:13:16 (sink) Payload - 600 Key - even Start - 80000 End - 140000 In our example, the input is an HTTP source producing 2 messages each second with values 5 and 10, and the event time starts from 60000. Because we use a sliding window of length 60s and slide 10s, and the two messages carry different keys \"even\" and \"odd\", Numaflow will create two different windows with a start time of 60000 and an end time of 120000. Because the slide duration is 10s, the next set of windows will be created with a start time of 70000 and an end time of 130000. Since it's a sum operation, the output will be 300 (5 * 60) and 600 (10 * 60).
Payload - 50 Key - odd Start - 10000 End - 70000 , we see 50 here for odd because the first window has only 10 elements (5 * 10).","title":"Sum Pipeline Using Sliding Window"},{"location":"user-guide/user-defined-functions/reduce/examples/#complex-reduce-pipeline","text":"In the complex reduce example, we will chain reduce functions, use both fixed and sliding windows, and use keyed and non-keyed windowing. 8-reduce-complex-pipeline.yaml has the complete pipeline definition. kubectl apply -f https://raw.githubusercontent.com/numaproj/numaflow/main/examples/8-reduce-complex-pipeline.yaml Output: 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 80000 End - 140000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 90000 End - 150000 2023/01/05 15:33:55 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 100000 End - 160000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 110000 End - 170000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 120000 End - 180000 2023/01/05 15:33:56 (sink) Payload - 900 Key - NON_KEYED_STREAM Start - 130000 End - 190000 In our example, first we have the reduce vertex with a fixed window of duration 5s. Since the input is 5 and 10, the output from the first reduce vertex will be 25 (5 * 5) and 50 (5 * 10). This will be passed to the next non-keyed reduce vertex with a fixed window duration of 10s. Being non-keyed, it will combine the inputs and produce the output of 150 (25 * 2 + 50 * 2), which will be passed to the reduce vertex with a sliding window of duration 60s and a slide duration of 10s. Hence the final output will be 900(150 * 6).","title":"Complex Reduce Pipeline"},{"location":"user-guide/user-defined-functions/reduce/reduce/","text":"Reduce UDF \u00b6 Overview \u00b6 Reduce is one of the most commonly used abstractions in a stream processing pipeline to define aggregation functions on a stream of data.
It is the reduce feature that helps us solve problems like \"perform a summary operation (such as counting the number of occurrences of a key, yielding user login frequencies)\", etc. Since the input is an unbounded stream (with infinite entries), we need an additional parameter to convert the unbounded problem to a bounded problem and provide results on that. That bounding condition is \"time\", e.g., \"number of users logged in per minute\". So while processing an unbounded stream of data, we need a way to group elements into finite chunks using time. To build these chunks, the reduce function is applied to the set of records produced using the concept of windowing . Reduce Pseudo code \u00b6 Unlike a map vertex, where only a single element is given to the user-defined function, a reduce vertex works on a group of elements, so an iterator is passed to the reduce function. The following is a generic outline of a reduce function. The pseudo-code uses an accumulator to show that very powerful functions can be applied using these reduce semantics. # reduceFn func is a generic reduce function that processes a set of elements def reduceFn ( keys : List [ str ], datums : Iterator [ Datum ], md : Metadata ) -> Messages : # initialize_accumulator could be any function that starts off with an empty # state. e.g., accumulator = 0 accumulator = initialize_accumulator () # we are iterating on the input set of elements for d in datums : # accumulator.add_input() can be any function. # e.g., it could be as simple as accumulator += 1 accumulator . add_input ( d ) # once we are done with iterating on the elements, we return the result # accumulator.result() can be str.encode(accumulator) return Messages ( Message ( accumulator . result (), keys )) Specification \u00b6 The structure for defining a reduce vertex is as follows. - name : my-reduce-udf udf : container : image : my-reduce-udf:latest groupBy : window : ... keyed : ... storage : ...
The reduce spec adds a new section called groupBy and this is how we differentiate a map vertex from a reduce vertex. There are two important fields, the window and keyed . These two fields play an important role in grouping the data together and passing it to the user-defined reduce code. Reduce supports parallel processing by defining partitions in the vertex. This is because auto-scaling is not supported in a reduce vertex. If partitions is not defined, a default of one will be used. - name : my-reduce-udf partitions : integer It is wrong to set partitions > 1 for a non-keyed vertex ( keyed: false ). There are a couple of examples that demonstrate Fixed windows, Sliding windows, chaining of windows, keyed streams, etc. Time Characteristics \u00b6 All windowing operations generate new records as an output of reduce operations. Event-time and Watermark are two main primitives that determine how the time propagates in a streaming application. So for all new records generated in a reduce operation, event time is set to the end time of the window. For example, for a reduce operation over a keyed/non-keyed window with a start and end defined by [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) , event time for all the records generated will be set to 2031-09-29T18:47:59.999Z . Since millisecond is the smallest granularity (as of now), event time is set to the last timestamp that belongs to a window. Watermark is treated similarly: the watermark is set to the last timestamp for a given window. So for the example above, the value of the watermark will be set to the last timestamp, i.e., 2031-09-29T18:47:59.999Z . This applies to all the window types regardless of whether they are keyed or non-keyed windows. Allowed Lateness \u00b6 allowedLateness flag on the Reduce vertex will allow late data to be processed by slowing down the close-of-book operation of the Reduce vertex.
Late data will be included for the Reduce operation as long as the late data is not later than (CurrentWatermark - AllowedLateness) . Without allowedLateness , late data will be rejected and dropped. Each Reduce vertex can have its own allowedLateness . vertices : - name : my-udf udf : groupBy : allowedLateness : 5s # Optional, allowedLateness is disabled by default Storage \u00b6 Reduce, unlike map, requires persistence. To support persistence, the user has to define the storage configuration. We replay the data stored in this storage on pod startup if there has been a restart of the reduce pod caused by pod migrations, etc. vertices : - name : my-udf udf : groupBy : storage : .... Persistent Volume Claim (PVC) \u00b6 persistentVolumeClaim supports the following fields, volumeSize , storageClassName , and accessMode . As the name suggests, volumeSize specifies the size of the volume. accessMode can be of many types, but for the reduce use case we need only ReadWriteOnce . storageClassName can also be provided; more info on storage class can be found here . The default value of storageClassName is default , which is the default StorageClass that may be deployed to a Kubernetes cluster by the addon manager during installation. Example \u00b6 vertices : - name : my-udf udf : groupBy : storage : persistentVolumeClaim : volumeSize : 10Gi accessMode : ReadWriteOnce EmptyDir \u00b6 We also support emptyDir for quick experimentation. We do not recommend this in a production setup. If we use emptyDir , we will end up with data loss if there are pod migrations. emptyDir also takes an optional sizeLimit . The medium field controls where emptyDir volumes are stored. By default emptyDir volumes are stored on whatever medium backs the node, such as disk, SSD, or network storage, depending on your environment. If you set the medium field to \"Memory\" , Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead.
Example \u00b6 vertices : - name : my-udf udf : groupBy : storage : emptyDir : {}","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/reduce/#reduce-udf","text":"","title":"Reduce UDF"},{"location":"user-guide/user-defined-functions/reduce/reduce/#overview","text":"Reduce is one of the most commonly used abstractions in a stream processing pipeline to define aggregation functions on a stream of data. It is the reduce feature that helps us solve problems like \"perform a summary operation (such as counting the number of occurrences of a key, yielding user login frequencies)\", etc. Since the input is an unbounded stream (with infinite entries), we need an additional parameter to convert the unbounded problem to a bounded problem and provide results on that. That bounding condition is \"time\", e.g., \"number of users logged in per minute\". So while processing an unbounded stream of data, we need a way to group elements into finite chunks using time. To build these chunks, the reduce function is applied to the set of records produced using the concept of windowing .","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/reduce/#reduce-pseudo-code","text":"Unlike a map vertex, where only a single element is given to the user-defined function, a reduce vertex works on a group of elements, so an iterator is passed to the reduce function. The following is a generic outline of a reduce function. The pseudo-code uses an accumulator to show that very powerful functions can be applied using these reduce semantics. # reduceFn func is a generic reduce function that processes a set of elements def reduceFn ( keys : List [ str ], datums : Iterator [ Datum ], md : Metadata ) -> Messages : # initialize_accumulator could be any function that starts off with an empty # state.
e.g., accumulator = 0 accumulator = initialize_accumulator () # we are iterating on the input set of elements for d in datums : # accumulator.add_input() can be any function. # e.g., it could be as simple as accumulator += 1 accumulator . add_input ( d ) # once we are done with iterating on the elements, we return the result # accumulator.result() can be str.encode(accumulator) return Messages ( Message ( accumulator . result (), keys ))","title":"Reduce Pseudo code"},{"location":"user-guide/user-defined-functions/reduce/reduce/#specification","text":"The structure for defining a reduce vertex is as follows. - name : my-reduce-udf udf : container : image : my-reduce-udf:latest groupBy : window : ... keyed : ... storage : ... The reduce spec adds a new section called groupBy and this is how we differentiate a map vertex from a reduce vertex. There are two important fields, the window and keyed . These two fields play an important role in grouping the data together and passing it to the user-defined reduce code. Reduce supports parallel processing by defining partitions in the vertex. This is because auto-scaling is not supported in a reduce vertex. If partitions is not defined, a default of one will be used. - name : my-reduce-udf partitions : integer It is wrong to set partitions > 1 for a non-keyed vertex ( keyed: false ). There are a couple of examples that demonstrate Fixed windows, Sliding windows, chaining of windows, keyed streams, etc.","title":"Specification"},{"location":"user-guide/user-defined-functions/reduce/reduce/#time-characteristics","text":"All windowing operations generate new records as an output of reduce operations. Event-time and Watermark are two main primitives that determine how the time propagates in a streaming application. So for all new records generated in a reduce operation, event time is set to the end time of the window.
For example, for a reduce operation over a keyed/non-keyed window with a start and end defined by [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) , event time for all the records generated will be set to 2031-09-29T18:47:59.999Z . Since millisecond is the smallest granularity (as of now), event time is set to the last timestamp that belongs to a window. Watermark is treated similarly: the watermark is set to the last timestamp for a given window. So for the example above, the value of the watermark will be set to the last timestamp, i.e., 2031-09-29T18:47:59.999Z . This applies to all the window types regardless of whether they are keyed or non-keyed windows.","title":"Time Characteristics"},{"location":"user-guide/user-defined-functions/reduce/reduce/#allowed-lateness","text":"allowedLateness flag on the Reduce vertex will allow late data to be processed by slowing down the close-of-book operation of the Reduce vertex. Late data will be included for the Reduce operation as long as the late data is not later than (CurrentWatermark - AllowedLateness) . Without allowedLateness , late data will be rejected and dropped. Each Reduce vertex can have its own allowedLateness . vertices : - name : my-udf udf : groupBy : allowedLateness : 5s # Optional, allowedLateness is disabled by default","title":"Allowed Lateness"},{"location":"user-guide/user-defined-functions/reduce/reduce/#storage","text":"Reduce, unlike map, requires persistence. To support persistence, the user has to define the storage configuration. We replay the data stored in this storage on pod startup if there has been a restart of the reduce pod caused by pod migrations, etc. vertices : - name : my-udf udf : groupBy : storage : ....","title":"Storage"},{"location":"user-guide/user-defined-functions/reduce/reduce/#persistent-volume-claim-pvc","text":"persistentVolumeClaim supports the following fields, volumeSize , storageClassName , and accessMode . As the name suggests, volumeSize specifies the size of the volume.
accessMode can be of many types, but for the reduce use case we need only ReadWriteOnce . storageClassName can also be provided; more info on storage class can be found here . The default value of storageClassName is default , which is the default StorageClass that may be deployed to a Kubernetes cluster by the addon manager during installation.","title":"Persistent Volume Claim (PVC)"},{"location":"user-guide/user-defined-functions/reduce/reduce/#example","text":"vertices : - name : my-udf udf : groupBy : storage : persistentVolumeClaim : volumeSize : 10Gi accessMode : ReadWriteOnce","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/reduce/#emptydir","text":"We also support emptyDir for quick experimentation. We do not recommend this in a production setup. If we use emptyDir , we will end up with data loss if there are pod migrations. emptyDir also takes an optional sizeLimit . The medium field controls where emptyDir volumes are stored. By default emptyDir volumes are stored on whatever medium backs the node, such as disk, SSD, or network storage, depending on your environment. If you set the medium field to \"Memory\" , Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead.","title":"EmptyDir"},{"location":"user-guide/user-defined-functions/reduce/reduce/#example_1","text":"vertices : - name : my-udf udf : groupBy : storage : emptyDir : {}","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/","text":"Fixed \u00b6 Overview \u00b6 Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30 second windows, one minute windows, etc. They are generally aligned, i.e. every window applies across all the data for the corresponding period of time. It has a fixed size measured in time and does not overlap. The element which belongs to one window will not belong to any other tumbling window.
For example, a window size of 20 seconds will include all entities of the stream which came in a certain 20-second interval. To enable Fixed window, we use fixed under window section. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". Length \u00b6 The length is the window size of the fixed window. Example \u00b6 A 60-second window size can be defined as following. vertices : - name : my-udf udf : groupBy : window : fixed : length : 60s The yaml snippet above contains an example spec of a reduce vertex that uses fixed window aggregation. As we can see, the length of the window is 60s. This means only one window will be active at any point in time. It is also possible to have multiple inactive and non-empty windows (based on out-of-order arrival of elements). The window boundaries for the first window (post bootstrap) are determined by rounding down from time.now() to the nearest multiple of length of the window. So considering the above example, if the time.now() corresponds to 2031-09-29T18:46:30Z , then the start-time of the window will be adjusted to 2031-09-29T18:46:00Z and the end-time is set accordingly to 2031-09-29T18:47:00Z . Windows are left inclusive and right exclusive which means an element with event time (considering event time characteristic) of 2031-09-29T18:47:00Z will belong to the window with boundaries [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) It is important to note that because of this property, for a constant throughput, the first window may contain fewer elements than other windows. Check the links below to see the UDF examples for different languages. 
Python Golang Java Streaming Mode \u00b6 Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the fixed window configuration. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Fixed"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#fixed","text":"","title":"Fixed"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#overview","text":"Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30 second windows, one minute windows, etc. They are generally aligned, i.e. every window applies across all the data for the corresponding period of time. It has a fixed size measured in time and does not overlap. The element which belongs to one window will not belong to any other tumbling window. For example, a window size of 20 seconds will include all entities of the stream which came in a certain 20-second interval. To enable Fixed window, we use fixed under window section. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". 
Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#length","text":"The length is the window size of the fixed window.","title":"Length"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#example","text":"A 60-second window size can be defined as following. vertices : - name : my-udf udf : groupBy : window : fixed : length : 60s The yaml snippet above contains an example spec of a reduce vertex that uses fixed window aggregation. As we can see, the length of the window is 60s. This means only one window will be active at any point in time. It is also possible to have multiple inactive and non-empty windows (based on out-of-order arrival of elements). The window boundaries for the first window (post bootstrap) are determined by rounding down from time.now() to the nearest multiple of length of the window. So considering the above example, if the time.now() corresponds to 2031-09-29T18:46:30Z , then the start-time of the window will be adjusted to 2031-09-29T18:46:00Z and the end-time is set accordingly to 2031-09-29T18:47:00Z . Windows are left inclusive and right exclusive which means an element with event time (considering event time characteristic) of 2031-09-29T18:47:00Z will belong to the window with boundaries [2031-09-29T18:47:00Z, 2031-09-29T18:48:00Z) It is important to note that because of this property, for a constant throughput, the first window may contain fewer elements than other windows. Check the links below to see the UDF examples for different languages. Python Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/fixed/#streaming-mode","text":"Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. 
This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the fixed window configuration. vertices : - name : my-udf udf : groupBy : window : fixed : length : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/","text":"Session \u00b6 Overview \u00b6 Session window is a type of Unaligned window where the window\u2019s end time keeps moving until there is no data for a given time duration. Unlike fixed and sliding windows, session windows do not overlap, nor do they have a set start and end time. They can be used to group data based on activity. vertices : - name : my-udf udf : groupBy : window : session : timeout : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". timeout \u00b6 The timeout is the duration of inactivity (no data flowing in for the particular key) after which the session is considered to be closed. Example \u00b6 To create a session window of timeout 1 minute, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : session : timeout : 60s The yaml snippet above contains an example spec of a reduce vertex that uses session window aggregation. As we can see, the timeout of the window is 60s. 
This means that if no data arrives for a particular key for 60 seconds, we will mark the window as closed. Let's say, time.now() in the pipeline is 2031-09-29T18:46:30Z as the current time, and we have a session gap of 30s. If we receive events in this pattern: Event-1 at 2031-09-29T18:45:40Z Event-2 at 2031-09-29T18:45:55Z # Notice the 15 sec interval from Event-1, still within session gap Event-3 at 2031-09-29T18:46:20Z # Notice the 25 sec interval from Event-2, still within session gap Event-4 at 2031-09-29T18:46:55Z # Notice the 35 sec interval from Event-3, beyond the session gap Event-5 at 2031-09-29T18:47:10Z # Notice the 15 sec interval from Event-4, within the new session gap This would lead to two session windows as follows: [2031-09-29T18:45:40Z, 2031-09-29T18:46:20Z) # includes Event-1, Event-2 and Event-3 [2031-09-29T18:46:55Z, 2031-09-29T18:47:10Z) # includes Event-4 and Event-5 In this example, the start time is inclusive and the end time is exclusive. Event-1 , Event-2 , and Event-3 fall within the first window, and this window closes 30 seconds after Event-3 at 2031-09-29T18:46:50Z . Event-4 arrives 5 seconds later, meaning it's beyond the session gap of the previous window, initiating a new window. The second window includes Event-4 and Event-5 , and it closes 30 seconds after Event-5 at 2031-09-29T18:47:40Z , if no further events arrive for the key until the timeout. Note: Streaming mode is by default enabled for session windows. Check the links below to see the UDF examples for different languages. Currently, we have the SDK support for Golang and Java. Golang Java","title":"Session"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#session","text":"","title":"Session"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#overview","text":"Session window is a type of Unaligned window where the window\u2019s end time keeps moving until there is no data for a given time duration.
Unlike fixed and sliding windows, session windows do not overlap, nor do they have a set start and end time. They can be used to group data based on activity. vertices : - name : my-udf udf : groupBy : window : session : timeout : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#timeout","text":"The timeout is the duration of inactivity (no data flowing in for the particular key) after which the session is considered to be closed.","title":"timeout"},{"location":"user-guide/user-defined-functions/reduce/windowing/session/#example","text":"To create a session window of timeout 1 minute, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : session : timeout : 60s The yaml snippet above contains an example spec of a reduce vertex that uses session window aggregation. As we can see, the timeout of the window is 60s. This means that if no data arrives for a particular key for 60 seconds, we will mark the window as closed. Let's say, time.now() in the pipeline is 2031-09-29T18:46:30Z as the current time, and we have a session gap of 30s.
If we receive events in this pattern: Event-1 at 2031-09-29T18:45:40Z Event-2 at 2031-09-29T18:45:55Z # Notice the 15 sec interval from Event-1, still within session gap Event-3 at 2031-09-29T18:46:20Z # Notice the 25 sec interval from Event-2, still within session gap Event-4 at 2031-09-29T18:46:55Z # Notice the 35 sec interval from Event-3, beyond the session gap Event-5 at 2031-09-29T18:47:10Z # Notice the 15 sec interval from Event-4, within the new session gap This would lead to two session windows as follows: [2031-09-29T18:45:40Z, 2031-09-29T18:46:20Z) # includes Event-1, Event-2 and Event-3 [2031-09-29T18:46:55Z, 2031-09-29T18:47:10Z) # includes Event-4 and Event-5 In this example, the start time is inclusive and the end time is exclusive. Event-1 , Event-2 , and Event-3 fall within the first window, and this window closes 30 seconds after Event-3 at 2031-09-29T18:46:50Z . Event-4 arrives 5 seconds later, meaning it's beyond the session gap of the previous window, initiating a new window. The second window includes Event-4 and Event-5 , and it closes 30 seconds after Event-5 at 2031-09-29T18:47:40Z , if no further events arrive for the key until the timeout. Note: Streaming mode is by default enabled for session windows. Check the links below to see the UDF examples for different languages. Currently, we have the SDK support for Golang and Java. Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/","text":"Sliding \u00b6 Overview \u00b6 Sliding windows are similar to Fixed windows, the size of the windows is measured in time and is fixed. The important difference from the Fixed window is the fact that it allows an element to be present in more than one window. The additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows will be overlapping and the slide should be smaller than the window length. 
vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\". Length \u00b6 The length is the window size of the fixed window. Slide \u00b6 slide is the slide parameter that controls the frequency at which the sliding window is created. Example \u00b6 To create a sliding window of length 1 minute which slides every 10 seconds, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : sliding : length : 60s slide : 10s The yaml snippet above contains an example spec of a reduce vertex that uses sliding window aggregation. As we can see, the length of the window is 60s and sliding frequency is once every 10s. This means there will be multiple windows active at any point in time. Let's say time.now() in the pipeline is 2031-09-29T18:46:30Z ; the active window boundaries will be as follows (there are a total of 6 windows, 60s/10s ) [2031-09-29T18:45:40Z, 2031-09-29T18:46:40Z) [2031-09-29T18:45:50Z, 2031-09-29T18:46:50Z) # notice the 10 sec shift from the above window [2031-09-29T18:46:00Z, 2031-09-29T18:47:00Z) [2031-09-29T18:46:10Z, 2031-09-29T18:47:10Z) [2031-09-29T18:46:20Z, 2031-09-29T18:47:20Z) [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) The window is always left inclusive and right exclusive. That is why the [2031-09-29T18:45:30Z, 2031-09-29T18:46:30Z) window is not considered active (it fell on the previous window, right exclusive) but [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) is active (left inclusive). The first window always ends after the sliding seconds from the time.Now() , the start time of the window will be the nearest integer multiple of the slide which is less than the message's event time.
So the first window starts in the past and ends _sliding_duration (based on time progression in the pipeline and not the wall time) from present. It is important to note that regardless of the window boundary (starting in the past or ending in the future) the target element set totally depends on the matching time (in case of event time, all the elements with the time that falls within the boundaries of the window, and in case of system time, all the elements that arrive from the present until the end of window present + sliding ). From the point above, it follows then that immediately upon startup, for the first window, fewer elements may get aggregated depending on the current lateness of the data stream. Check the links below to see the UDF examples for different languages. Python Golang Java Streaming Mode \u00b6 Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex. This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the sliding window configuration. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Sliding"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#sliding","text":"","title":"Sliding"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#overview","text":"Sliding windows are similar to Fixed windows, the size of the windows is measured in time and is fixed.
The important difference from the Fixed window is the fact that it allows an element to be present in more than one window. The additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows will be overlapping and the slide should be smaller than the window length. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"1.5h\" or \"2h45m\". Valid time units are \"ns\", \"us\" (or \"\u00b5s\"), \"ms\", \"s\", \"m\", \"h\".","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#length","text":"The length is the window size of the fixed window.","title":"Length"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#slide","text":"slide is the slide parameter that controls the frequency at which the sliding window is created.","title":"Slide"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#example","text":"To create a sliding window of length 1 minute which slides every 10 seconds, we can use the following snippet. vertices : - name : my-udf udf : groupBy : window : sliding : length : 60s slide : 10s The yaml snippet above contains an example spec of a reduce vertex that uses sliding window aggregation. As we can see, the length of the window is 60s and sliding frequency is once every 10s. This means there will be multiple windows active at any point in time. 
Let's say time.now() in the pipeline is 2031-09-29T18:46:30Z ; the active window boundaries will be as follows (there are a total of 6 windows, 60s/10s ) [2031-09-29T18:45:40Z, 2031-09-29T18:46:40Z) [2031-09-29T18:45:50Z, 2031-09-29T18:46:50Z) # notice the 10 sec shift from the above window [2031-09-29T18:46:00Z, 2031-09-29T18:47:00Z) [2031-09-29T18:46:10Z, 2031-09-29T18:47:10Z) [2031-09-29T18:46:20Z, 2031-09-29T18:47:20Z) [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) The window is always left inclusive and right exclusive. That is why the [2031-09-29T18:45:30Z, 2031-09-29T18:46:30Z) window is not considered active (it fell on the previous window, right exclusive) but [2031-09-29T18:46:30Z, 2031-09-29T18:47:30Z) is active (left inclusive). The first window always ends after the sliding seconds from the time.Now() , the start time of the window will be the nearest integer multiple of the slide which is less than the message's event time. So the first window starts in the past and ends _sliding_duration (based on time progression in the pipeline and not the wall time) from present. It is important to note that regardless of the window boundary (starting in the past or ending in the future) the target element set totally depends on the matching time (in case of event time, all the elements with the time that falls within the boundaries of the window, and in case of system time, all the elements that arrive from the present until the end of window present + sliding ). From the point above, it follows then that immediately upon startup, for the first window, fewer elements may get aggregated depending on the current lateness of the data stream. Check the links below to see the UDF examples for different languages. Python Golang Java","title":"Example"},{"location":"user-guide/user-defined-functions/reduce/windowing/sliding/#streaming-mode","text":"Reduce can be enabled on streaming mode to stream messages or forward partial responses to the next vertex.
This is useful for custom triggering, where we want to forward responses to the next vertex quickly, even before the fixed window closes. The close-of-book and a final triggering will still happen even if partial results have been emitted. To enable reduce streaming, set the streaming flag to true in the sliding window configuration. vertices : - name : my-udf udf : groupBy : window : sliding : length : duration slide : duration streaming : true # set streaming to true to enable reduce streamer Note: UDFs should use the ReduceStreamer functionality in the SDKs to use this feature. Check the links below to see the UDF examples in streaming mode for different languages. Python Golang Java","title":"Streaming Mode"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/","text":"Windowing \u00b6 Overview \u00b6 In the world of data processing on an unbounded stream, Windowing is a concept of grouping data using temporal boundaries. We use event-time to discover temporal boundaries on an unbounded, infinite stream and Watermark to ensure the datasets within the boundaries are complete. The reduce is applied on these grouped datasets. For example, when we say, we want to find number of users online per minute, we use windowing to group the users into one minute buckets. The entirety of windowing is under the groupBy section. vertices : - name : my-udf udf : groupBy : window : ... keyed : ... Since a window can be Non-Keyed v/s Keyed , we have an explicit field called keyed to differentiate between both (see below). Under the window section we will define different types of windows. Window Types \u00b6 Numaflow supports the following types of windows Fixed Sliding Session Non-Keyed v/s Keyed Windows \u00b6 Non-Keyed \u00b6 A non-keyed partition is a partition where the window is the boundary condition. Data processing on a non-keyed partition cannot be scaled horizontally because only one partition exists. 
A non-keyed partition is usually used after aggregation and is hardly seen at the head section of any data processing pipeline. (There is a concept called Global Window where there is no windowing, but let us table that for later). Keyed \u00b6 A keyed partition is a partition where the partition boundary is a composite key of both the window and the key from the payload (e.g., GROUP BY country, where country names are the keys). Each smaller partition now has a complete set of datasets for that key and boundary. The subdivision of dividing a huge window-based partition into smaller partitions by adding keys along with the window will help us horizontally scale the distribution. Keyed partitions are heavily used to aggregate data and are frequently seen throughout the processing pipeline. We could also convert a non-keyed problem to a set of keyed problems and apply a non-keyed function at the end. This will help solve the original problem in a scalable manner without affecting the result's completeness and/or accuracy. When a keyed window is used, an optional partitions can be specified in the vertex for parallel processing. Usage \u00b6 Numaflow supports both Keyed and Non-Keyed windows. We set keyed to either true (keyed) or false (non-keyed). Please note that the non-keyed windows are not horizontally scalable as mentioned above. vertices : - name : my-reduce partitions : 5 # Optional, defaults to 1 udf : groupBy : window : ... keyed : true # Optional, defaults to false","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#windowing","text":"","title":"Windowing"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#overview","text":"In the world of data processing on an unbounded stream, Windowing is a concept of grouping data using temporal boundaries. 
We use event-time to discover temporal boundaries on an unbounded, infinite stream and Watermark to ensure the datasets within the boundaries are complete. The reduce is applied on these grouped datasets. For example, when we say, we want to find number of users online per minute, we use windowing to group the users into one minute buckets. The entirety of windowing is under the groupBy section. vertices : - name : my-udf udf : groupBy : window : ... keyed : ... Since a window can be Non-Keyed v/s Keyed , we have an explicit field called keyed to differentiate between both (see below). Under the window section we will define different types of windows.","title":"Overview"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#window-types","text":"Numaflow supports the following types of windows Fixed Sliding Session","title":"Window Types"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#non-keyed-vs-keyed-windows","text":"","title":"Non-Keyed v/s Keyed Windows"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#non-keyed","text":"A non-keyed partition is a partition where the window is the boundary condition. Data processing on a non-keyed partition cannot be scaled horizontally because only one partition exists. A non-keyed partition is usually used after aggregation and is hardly seen at the head section of any data processing pipeline. (There is a concept called Global Window where there is no windowing, but let us table that for later).","title":"Non-Keyed"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#keyed","text":"A keyed partition is a partition where the partition boundary is a composite key of both the window and the key from the payload (e.g., GROUP BY country, where country names are the keys). Each smaller partition now has a complete set of datasets for that key and boundary. 
The subdivision of dividing a huge window-based partition into smaller partitions by adding keys along with the window will help us horizontally scale the distribution. Keyed partitions are heavily used to aggregate data and are frequently seen throughout the processing pipeline. We could also convert a non-keyed problem to a set of keyed problems and apply a non-keyed function at the end. This will help solve the original problem in a scalable manner without affecting the result's completeness and/or accuracy. When a keyed window is used, an optional partitions can be specified in the vertex for parallel processing.","title":"Keyed"},{"location":"user-guide/user-defined-functions/reduce/windowing/windowing/#usage","text":"Numaflow supports both Keyed and Non-Keyed windows. We set keyed to either true (keyed) or false (non-keyed). Please note that the non-keyed windows are not horizontally scalable as mentioned above. vertices : - name : my-reduce partitions : 5 # Optional, defaults to 1 udf : groupBy : window : ... keyed : true # Optional, defaults to false","title":"Usage"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 93f52df084..3b1f74ac01 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/user-guide/user-defined-functions/map/map/index.html b/user-guide/user-defined-functions/map/map/index.html index 52085d897e..af971c99f5 100644 --- a/user-guide/user-defined-functions/map/map/index.html +++ b/user-guide/user-defined-functions/map/map/index.html @@ -1034,6 +1034,26 @@ Streaming Mode + + +
  • + + Batch Map Mode + + + +
  • @@ -2285,6 +2305,26 @@ Streaming Mode +
  • + +
  • + + Batch Map Mode + + + +
  • @@ -2363,6 +2403,25 @@

    Streaming ModeGolang

  • Java
  • +

    Batch Map Mode

    +

    BatchMap is an interface that allows developers to process multiple data items in a single UDF call, +rather than each item in separate calls.

    +

    The BatchMap interface can be helpful in scenarios where performing operations on a group of data can be more efficient.

    +

    Important Considerations

    +

    When using BatchMap, there are a few important considerations to keep in mind:

    +
      +
    • Ensure that the BatchResponses object is tagged with the correct request ID. +Each Datum has a unique ID tag, which will be used by Numaflow to ensure correctness.
    • +
    • Ensure that the length of the BatchResponses list is equal to the number of requests received. This means that for +every input data item, there should be a corresponding response in the BatchResponses list.
    • +
    +

    Check the links below to see the UDF examples in batch mode for different languages.
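
    The two considerations above can be illustrated with a minimal, self-contained Python sketch. It does not use the Numaflow SDK: the `Datum` and `BatchResponse` classes, the `batch_map` handler, and the `check_batch` helper are all hypothetical stand-ins written only to show the invariants (one response per request, each tagged with the ID of the request it answers). The SDK examples remain the reference for the real types and signatures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datum:
    """Hypothetical stand-in for one input item (not the SDK's Datum type)."""
    id: str
    value: bytes


@dataclass
class BatchResponse:
    """Hypothetical per-request response, tagged with the originating Datum ID."""
    id: str
    results: List[bytes] = field(default_factory=list)


def batch_map(datums: List[Datum]) -> List[BatchResponse]:
    """Process the whole batch in one call and emit one response per request."""
    responses = []
    for datum in datums:
        response = BatchResponse(id=datum.id)  # tag with the request ID
        response.results.append(datum.value.upper())  # placeholder transformation
        responses.append(response)
    return responses


def check_batch(datums: List[Datum], responses: List[BatchResponse]) -> None:
    """Raise ValueError if either consideration from the text is violated."""
    if len(responses) != len(datums):
        raise ValueError(
            f"expected {len(datums)} responses, got {len(responses)}"
        )
    if {d.id for d in datums} != {r.id for r in responses}:
        raise ValueError("response IDs do not match request IDs")


batch = [Datum(id="req-0", value=b"a"), Datum(id="req-1", value=b"b")]
out = batch_map(batch)
check_batch(batch, out)  # passes: same length, same IDs
```

    Building the response inside the loop, directly from `datum.id`, keeps the ID tagging and the one-to-one length requirement correct by construction even if the handler later filters or expands the results for an individual request.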

    +

    Available Environment Variables

    Some environment variables are available in the user-defined function container; they might be useful in your own UDF implementation.