From 57fca943ee2dc551d03e24f7d4dcff3d18da39c1 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Wed, 19 Feb 2025 16:58:45 -0800 Subject: [PATCH 01/14] [0029] Set cooperative vectors proposal to under review --- proposals/0029-cooperative-vector.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 76a4400c..24c11f14 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -3,7 +3,9 @@ * Proposal: [0029](0029-cooperative-vector.md) * Author(s): [Anupama Chandrasekhar][anupamachandra] * Sponsor: [Damyan Pepper][damyanp], [Greg Roth][pow2clk] -* Status: **Under Consideration** +* Status: **Under Review** +* Planned Version: Shader Model 6.9 + [anupamachandra]: https://github.com/anupamachandra [damyanp]: https://github.com/damyanp From 32bbb2446cdc7e5b1a365084d739f1f54a68c2ef Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Wed, 19 Feb 2025 21:28:34 -0800 Subject: [PATCH 02/14] Refocus proposal on DXIL operations without specifying HLSL API --- proposals/0029-cooperative-vector.md | 608 +++++++++++++-------------- 1 file changed, 283 insertions(+), 325 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 24c11f14..b81abcb6 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -1,7 +1,7 @@ * Proposal: [0029](0029-cooperative-vector.md) -* Author(s): [Anupama Chandrasekhar][anupamachandra] +* Author(s): [Anupama Chandrasekhar][anupamachandra], [Damyan Pepper][damyanp] * Sponsor: [Damyan Pepper][damyanp], [Greg Roth][pow2clk] * Status: **Under Review** * Planned Version: Shader Model 6.9 @@ -11,14 +11,23 @@ [damyanp]: https://github.com/damyanp [pow2clk]: https://github.com/pow2clk -# HLSL Cooperative Vectors +# Cooperative Vectors ## Introduction In research and in industry, machine learning based approaches have made their way to mainstream, replacing/augmenting traditional techniques. In graphics, neural network (NN) based rendering methods are gaining popularity over traditional methods of image reconstruction, texture compression, material shading etc. Simultaneously, the increasing use of GPUs for general purpose ML/DL means that GPU vendors continue to add more specialized hardware in GPUs to -accelerate neural network computations, like accelerating matrix operations. This specification introduces HLSL and DXIL intrinsics for vector-matrix operations that can accelerated by the underlying hardware. +accelerate neural network computations, like accelerating matrix operations. + +This proposal introduces DXIL operations for vector-matrix operations that can +accelerated by the underlying hardware, building on support for long verctors +described in proposals [0026] and [0030]. This proposal describes HLSL builtins +that can be used for example and testing purposes as well as the building blocks +for a high-level HLSL API. The high-level API is described in proposal \[TBD\]. 
+
+[0026]: 0026-hlsl-long-vector-type.md
+[0030]: 0030-dxil-vectors.md

## Motivation

@@ -28,7 +37,7 @@ Note that the NN simply replaces the computations in the original shader with no

**Original Shader**

-```
+```c++
void ps_main(args) // args: texture, normal, position
{
    PreProcessing(args);
@@ -48,7 +57,7 @@ void ps_main(args) // args: texture, normal, position

The shader below is in HLSL-like pseudocode, to highlight the idea of what replacing physical computations with a neural network based evaluation looks like. The exact syntax for the new intrinsics is intentionally skipped to keep it simple; later sections contain examples with the correct syntax and sample descriptors.

-```
+```c++
ByteAddressBuffer inputMatrix0;
ByteAddressBuffer inputMatrix1;
ByteAddressBuffer biasVector0;
@@ -92,7 +101,7 @@ void ps_main(args) // args: texture, normal, position

## Proposed solution

-Introduce new HLSL intrinsics to accelarate matrix-vector operations. In this specification we add four operations:
+Introduce new DXIL operations to accelerate matrix-vector operations. In this specification we add four operations:

* **Matrix-Vector Multiply:** Multiply a matrix in memory and a vector parameter.
* **Matrix-Vector Multiply-Add:** Multiply a matrix in memory and a vector parameter and add a vector from memory.
@@ -102,383 +111,335 @@ Introduce new DXIL operations to accelerate matrix-vector operations. In this sp

## Detailed design

-### Intrinsics for Vector-Matrix Operations
+### Matrix-Vector Multiply and Multiply-Add Operations

-**Matrix-Vector Multiply and Add Intrinsic**
-
-Intrinsics for specifying a multiplication operation between a matrix (Dim: M x K) loaded from memory and a vector (Dim: K), and a
-variant of this with an add, where a bias vector (Dim: M), loaded from memory, is added to the result vector (Dim: M) of the matrix-vector
-multiply operation.
-
-Note that the dimensions of the matrix are `M X K` versus `M x N` usually found in linear algebra texbooks. This is to
-futureproof for potential Matrix-Matrix operations in the future where the inputs could be `M X K` and `K x N` to
-produce an `M X N` result matrix.
+#### Syntax

-The `InputVector` is an HLSL vector and the `Matrix` and `BiasVector` are loaded from memory at specified offsets.
+> NOTES FOR REVIEW, REMOVE BEFORE MERGING:
+>
+> Added @dx.op.matvecmuladd version, and also renamed from
+> @dx.op.vecmatmul[add] to @dx.op.matvecmul[add] since it is a left matrix
+> multiply.

``` llvm
declare <[NUMo] x [TYo]> @dx.op.matvecmul.v[NUMo][TYo].v[NUMi][TYi](
  i32               ; opcode
  <[NUMi] x [TYi]>, ; input vector
  i32,              ; input interpretation
  %dx.types.Handle, ; matrix resource
  i32,              ; matrix offset
  i32,              ; matrix interpretation
  i32,              ; matrix M dimension
  i32,              ; matrix K dimension
  i32,              ; matrix layout
  i32,              ; matrix transpose <<< should this be i1?
  i32,              ; matrix stride
  i1)               ; isResultSigned <<< what is this for?

declare <[NUMo] x [TYo]> @dx.op.matvecmuladd.v[NUMo][TYo].v[NUMi][TYi](
  i32               ; opcode
  <[NUMi] x [TYi]>, ; input vector
  i32,              ; input interpretation
  %dx.types.Handle, ; matrix resource
  i32,              ; matrix offset
  i32,              ; matrix interpretation
  i32,              ; matrix M dimension
  i32,              ; matrix K dimension
  i32,              ; matrix layout
  i32,              ; matrix transpose <<< should this be i1?
  i32,              ; matrix stride
  %dx.types.Handle, ; bias vector resource
  i32,              ; bias vector offset
  i32,              ; bias vector interpretation
  i1)               ; isResultSigned <<< what is this for?
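
; Editorial illustration (not normative): the .v[NUMo][TYo].v[NUMi][TYi]
; suffixes select the concrete result and input vector overloads. For example,
; the sample calls later in this proposal use
;   @dx.op.matvecmul.v[32][i32].v[8][i32]
; for a 32 x i32 result computed from an 8 x i32 packed input vector.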
``` -// Result = Matrix * InputVector + Bias -template -vector VectorMatrixMulAdd(vector InputVector, - (RW)ByteAddressBuffer Matrix, - uint MatrixOffset, - uint MatrixStride, - (RW)ByteAddressBuffer BiasVector, - uint BiasOffset); - -// Result = Matrix * InputVector -template -vector VectorMatrixMul(vector InputVector, - (RW)ByteAddressBuffer Matrix, - uint MatrixOffset, - uint MatrixStride); -``` +#### Overview -Note that the `InputVector` has a physical storage type `InputTy` and an interpretation type that specifies how it is -interpreted. Similarly,`Matrix` and `BiasVector` are loaded from a memory buffer and have interpretation parameters -that specify how the buffer elements are interpreted. See the section on Type Interpretation for more details. - -``` -enum class DXILTypeInterpretation :uint { - Float16 = 0, - Float32 = 1, - UnsignedInt8 = 2, - UnsignedInt16 = 3, - UnsignedInt32 = 4, - SignedInt8 = 5, - SignedInt16 = 6, - SignedInt32 = 7, - SignedInt8x4Packed = 8, - UnsignedInt8x4Packed = 9, - FloatE4M3 = 10, - FloatE5M2 = 11, - Unsupported = 32 -}; - -enum class DXILMatrixLayout : uint { - RowMajor = 0, - ColumnMajor = 1, - InferencingOptimal = 2, - TrainingOptimal = 3, -}; +The `@dx.op.matvecmul` operation multiplies a **MxK** dimension matrix and a +**K** sized input vector. The matrix is loaded from memory while the vector is +stored in a variable. -template -struct VecMatOpDescriptor { - static const uint M = m; - static const uint K = k; - static const uint Ii = input_interp; - static const uint Mi = matrix_interp; - static const uint Bi = bias_interp; - static const uint Layout = layout; - static const bool Transposed = transpose; -}; +The `@dx.op.matvecmuladd` operation behaves as `@dx.op.matvecmul`, but also adds +an **M**-sized bias vector (loaded from memory) to the result. -// Result = Matrix * InputVector + Bias -template -vector VectorMatrixMulAdd(vector InputVector, - (RW)ByteAddressBuffer Matrix, - uint MatrixOffset, - uint MatrixStride, - (RW)ByteAddressBuffer BiasVector, - uint BiasOffset); - -// Result = Matrix * InputVector -template -vector VectorMatrixMul(vector InputVector, - (RW)ByteAddressBuffer Matrix, - uint MatrixOffset, - uint MatrixStride); +> Note that the dimensions of the matrix are **M**x**K** versus **M**x**N** +> usually found in linear algebra texbooks. This is to futureproof for potential +> matrix-matrix operations in the future where the inputs could be **M**x**K** +> and **K**x**N** to produce an **M**x**N** result matrix. -``` +#### Arguments -*InputVector* is the vector operand of the matrix-vector mul/mul-add operation. *InputTy* is the physical storage type - of the elements of the vector, which might vary from the actual type that the elements of the vector are interpreted - as, *InputInptretation* from *DESC*. *InputComponents* is the number of components in the input vector, which equals - the matrix dimension *K* for a non-packed type and for a packed type, equals the least number that can hold *K* values - of the packed type. Where, packed type, refers to types like `SignedInt8x4Packed` where each 32-bit element of the - vector corresponds to four 8-bit signed integers; Unpacked types are the standard types like float16, uint etc. The - elements of the *InputVector* are converted to type specified by *DESC: Ii* present in, if it is legal. More details - in the [Type Interpretations](#type-interpretations) section. 
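To make the semantics of these operations concrete, the following scalar reference sketch (editorial, not part of the proposal; names are illustrative, and type interpretations and conversions are elided) shows what `@dx.op.matvecmul` and `@dx.op.matvecmuladd` compute:

```c++
#include <cstddef>

// Sketch: result = Matrix * InputVector (+ BiasVector), for an M x K matrix A
// (row-major here for clarity), a K-element input vector x and an optional
// M-element bias vector b.
void MatVecMulAddRef(const float* A, const float* x,
                     const float* b /* null for matvecmul */,
                     float* result, size_t M, size_t K) {
    for (size_t m = 0; m < M; ++m) {
        float acc = (b != nullptr) ? b[m] : 0.0f;  // matvecmul acts as if b were zero
        for (size_t k = 0; k < K; ++k)
            acc += A[m * K + k] * x[k];
        result[m] = acc;
    }
}
```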
+##### Input Vector -*Matrix* is loaded starting from a byte offset *MatrixOffset* from the start of Buffer, and raw data is loaded according - to the type interpretation parameter *DESC: Mi*. *DESC: MxK* is the dimension of the matrix. No conversion is - performed. The *MatrixOffset* and the base address of the Matrix buffer must be 64B aligned. The *DESC: Layout* of the - matrix is one of the enum values *DXILMatrixLayout* listed above. +The **input vector** is of size `NUMi` and contains elements of physical type +`TYi`. The **input interpretation** describes how to interpret the contents of +the vector. `NUMi` has a relationship with **K** as follows: -*MatrixStride*, for RowMajor or ColumnMajor layouts, is the number of bytes to go from one row/column to the next. For - optimal layouts, *MatrixStride* is ignored. +* for non-packed interpretations: `NUMi` equals **K**, +* for packed interpretations: `NUMi` equals the least number that can hold **K** + values of the packed type. -*BiasVector*, the bias vector, is loaded starting from a byte offset of *BiasOffset* from the start of the array, and - raw data is loaded according to the type interpretation parameter *DESC: Bi*. *M* consecutive elements are loaded. No - conversion is performed. The *BiasOffset* and the base address of the BiasVector buffer must be 64B aligned. +Non-packed interpretations are standard types such as float16, uint etc. Packed +types are types such as **SignedInt8x4Packed** where each 32-bit element of the +vector corresponds to four 8-bit signed integers. See [Type Interpretations] for +details. - **VecMatOpDescriptor Parameters** - The *VecMatOpDescriptor* describes the interpretation of the Input, Matrix and Bias elements. Bias interpretation - applies only for the *VectorMatrixMulAdd* operation and is ignored for *VectorMatrixMul* operation. +##### Matrix -*Ii* Input Interpretation, *Mi* MatrixInterpretation and *Bi* BiasInterpretation define what type the respective objects - will be interpreted as. These values are constrained by the combinations allowed by the device, *Matrix* and *Bias* - are typeless buffers and their respective interpretations determine the types. See [Type Interpretations] - (#type-interpretations) section for more details. +The matrix is loaded from the raw-buffer, **matrix resource**, starting at +**matrix offset**. The **matrix interpretation** argument specifies the element +type of the matrix (see [Type Interpretations]). The **matrix M dimension** and +**matrix K dimension** arguments specify the dimensions of the matrix. The +**matrix layout** argument specifies the layout of the matrix (see [Matrix +Layouts]). If the **matrix transpose** is non-zero then the matrix is transposed +before performing the multiply (see [Matrix Transpose]). For row-major and +column-major layouts, **matrix stride** specifies the number of bytes to go from +one row/column to the next. For optimal layouts, **matrix stride** is ignored. +Only non-packed interpretations are valid for matrices. -*Mi* and *Bi* determines the type of the Weight Matrix and Bias Vector elements. +The base address of **matrix resource** and **matrix offset** must be 64 byte +aligned. -For the unpacked case, *M x K* is the dimension of the Matrix, *M* is the size of the result vector, *K* is the size of -the input vector. For the packed case, the number of components in the input vector must large enough to hold the *K* -packed components. -*Layout* is an enum value, `DXILMatrixLayout`. 
Optimal layouts are opaque implementation specific layouts, the D3D call - `CooperativeVectorConvertMatrix` can be used to convert the *Matrix* to an optimal layout. Row-Major and Column-Major - layouts are also supported. -The *Transposed* parameter indicates if the *Matrix* is transposed before performing the multiply. In linear algebra, -the[transpose](https://en.wikipedia.org/wiki/Transpose) of a matrix is an operator which flips a matrix over its -diagonal; that is, it switches the row and column indices of the matrix. Transposing is not supported for the -RowMajor/ColumnMajor layouts. Not all component types support transposing. It is left to implementations to define -which types support matrix transposing. "TransposeSupported" flag from the [CheckFeatureSupport] -(#check-feature-support) struct is used to determine if a matrix transpose is supported. Note that even for the -type/interpretation combinations with guaranteed [support](#minimum-support-set), transpose support isn't guaranteed -and needs to be checked explicitly. +> TODO: consider using a different set of interpretation values for in-memory +> interpretations versus vector interpretations. -**Type Interpretations** -The types of *InputVector*, *Matrix* and *BiasVector* are all determined by their respective interpretation parameters. -For the Matrix and BiasVector which are stored in (RW)ByteAddressBuffers, this is straightforward: the *M* -and *K* *VecMatOpDescriptor* parameters describe the dimensions of the *Matrix*/*BiasVector*, these are loaded from the -offsets *MatrixOffset* and *Biasoffset* respectively and the *Mi* and *Bi* parameters which -are *DXILTypeInterpretation* enums specify the element type. -*InputVector* is an HLSL vectors of a given type *InputTy* . However, the type that the elements of this vector are - interpreted as in the matrix-vector operation is specified by the *InputInterpretation* parameter. The reason is that - the interpretation parameter allows the elements to be interpreted as types not natively supported by HLSL, e.g. - uint8/sint8. +##### Bias Matrix -The legal conversions from the declared *InputType* to *InputInterpretation: Ii* and the -corresponding *MatrixInterpretation: Mi* and *BiasInterpretation: Bi* are implementation dependent and can be queried. -See[CheckFeatureSupport](#check-feature-support) section for details. An exception to this rule is the set of -combinations guaranteed to be supported on all devices supporting this feature. See [Minimum Support Set] -(#minimum-support-set). Note that *Transposed* is always queried. +> * TODO: alignment requirements for **bias vector offset** +> * TODO: are packed types allowed for bias vectors? -Non-"Packed" type interpretations are used to request arithmetic conversions. Input type must be a 32-bit or 16-bit -scalar integer or a 32-bit or 16-bit float. Integer to integer conversion saturates, float to float conversion is -implementation dependent and preserves the value as accurately as possible. Float to integer conversion is RTNE and -saturating. Integer to float conversion is RTNE. +The bias matrix is loaded from the raw-buffer, **bias vector resource**, +starting at **bias vector offset**. The **bias vector interpretation** argument +specifies the element type of the bias vector (see [Type Interpretations]). -/// XXX TODO: These rules make sense for NN applications but diverge from HLSL conversion rules [here] - (https://microsoft.github.io/hlsl-specs/specs/hlsl.html#Conv). 
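As a concrete illustration of the float-to-integer conversion rule above (an editorial sketch; `FloatToSInt8` is a hypothetical helper, not part of the proposal):

```c++
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch: float -> SignedInt8 conversion as described above - round to
// nearest-even, then saturate to the target range.
int8_t FloatToSInt8(float v) {
    float r = std::nearbyintf(v);                // RTNE under the default rounding mode
    r = std::min(std::max(r, -128.0f), 127.0f);  // saturate
    return static_cast<int8_t>(r);
}
// e.g. FloatToSInt8(2.5f) == 2 (ties round to even), FloatToSInt8(300.0f) == 127
```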
+The base address of **bias vector resource** and **bias vector offset** must be +64 byte aligned. -"Packed" type conversions are bitcasts to a smaller type. The declared input type must be 32-bit unsigned integer. -/// XXX TODO: Error handling for illegal conversions. +### Vector Outer Product -Examples: +#### Syntax -Packed Case: +``` llvm +declare void @dx.op.vecouterproductacc.v[M][TY].v[N][TY]( + i32, ; opcode + <[M] x [TY]>, ; input vector 1 + <[N] x [TY]>, ; input vector 2 + %dx.types.Handle, ; matrix resource + i32, ; matrix offset + i32, ; matrix stride + i32, ; matrix interpretation + i32) ; matrix layout ``` -// Declare an input vector -vector ipVector; - -// Set interpretation value to DXILCoopVectorTypeInterpretation::SignedInt8x4Packed -// Each uint element (32-bit) in the input vector, ipVector, will be interpreted as 4 int8 values in the VectorMatrixMul intrinsic. -// Note that InputTy = uint and InputComponents = 8 (8 x 4 = 32 sint8 values ) -VecMatOpDescriptor<32 /*M*/, - 32 /*K*/, - DXILTypeInterpretation::SignedInt8x4Packed /*InputInterpretation*/, - DXILTypeInterpretation::SignedInt8 /*MatrixInterpretation*/, - DXILTypeInterpretation::Unsupported /*BiasInterpretation*/, - DXILMatrixLayout::InferencingOptimal /*Layout*/, - false /*Transpose*/> desc; - -vector resultVector; //Note that the ResultComponents equals M(32) -// Matrix is a ByteAddressBuffer -resultVector = VectorMatrixMul(ipVector, Matrix, 0/*MatrixOffset*/, 0/*MatrixStride*/); -``` +#### Overview -Non-Packed Case: -``` -// Declare an input vector -vector ipVector; +Computes the outer product between column vectors and an **M**x**N** matrix* is accumulated atomically (with device scope) in memory. -// Set interpretation value to DXILCoopVectorTypeInterpretation::SignedInt8x4Packed -// Each float element of the input vector, ipVector, will be arithmetically converted to a sint8 value in the VectorMatrixMul intrinsic. -VecMatOpDescriptor<64 /*M*/, - 32 /*K*/, - DXILTypeInterpretation::SignedInt8 /*InputInterpretation*/, - DXILTypeInterpretation::SignedInt8 /*MatrixInterpretation*/, - DXILTypeInterpretation::SignedInt8 /*BiasInterpretation*/, - DXILMatrixLayout::InferencingOptimal /*Layout*/, - false /*Transpose*/> desc; +``` +ResultMatrix = InputVector1 * Transpose(InputVector2); +``` -vector resultVector; // Note that the ResultComponents equals M(64) -// Matrix and Bias are ByteAddressBuffers -resultVector = VectorMatrixMul(ipVector, Matrix, 0/*MatrixOffset*/, 0/*MatrixStride*/, Bias, 0/*BiasStride*/); +#### Arguments -``` +The two input vectors are specified via **input vector 1** and **input vector +2**. +The matrix is accumulated to the writeable raw-buffer specified by **matrix +resource**, with **matrix offset**, **matrix stride**, **matrix interpretation** +and **matrix layout** behaving as described +[above](#matrix-vector-multiply-and-multiply-add-operations). -**Vector Outer Product** +The base address of **matrix resource** and **matrix offset** must be 64 byte +aligned. -Computes the outer product between column vectors and an *MxN Matrix* is accumulated atomically (with device scope) in memory. The device should be queried in `CheckFeatureSupport` to determine type of InputVector supported and the corresponding Accumulation type. -An exception to this rule is the set of combinations guaranteed to be supported on all devices supporting the cooperative vector feature. See [here](#minimum-support-set). 
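For intuition, a scalar sketch of the accumulation (editorial; atomicity, layouts and type interpretations are elided):

```c++
#include <cstddef>

// Sketch: ResultMatrix += InputVector1 * Transpose(InputVector2), i.e. element
// (i, j) accumulates v1[i] * v2[j]. The real operation performs each
// component-wise addition atomically with device scope.
void OuterProductAccumulateRef(const float* v1, size_t M,
                               const float* v2, size_t N,
                               float* matrix /* M x N, row-major */) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
            matrix[i * N + j] += v1[i] * v2[j];
}
```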
+Not all combinations of vector element type and matrix interpretations are +supported by all implementations. [CheckFeatureSupport] can be used to determine +which combinations. A list of combinations that are guaranteed to be supported +on all implementations can be found in [Minimum Support Set]. -``` -ResultMatrix = InputVector1 * Transpose(InputVector2); -``` +### Reduce Sum Accumulate -``` -template -struct OuterProductAccDescriptor{ - static const uint Mi = matrix_interp; - static const uint Layout = layout; -}; +#### Syntax -template -void OuterProductAccumulate(vector InputVector1, - vector InputVector2, - RWByteAddressBuffer ResultMatrix, - uint ResultMatrixOffset, - uint ResultMatrixStride); +``` llvm +declare void @dx.op.vecreducesumacc.v[NUM][TY]( + i32, ; opcode + <[NUM] x [TY]>, ; input vector + %dx.types.Handle, ; output array resource + i32) ; output array offset ``` -*InputVector1* is an M component vector of type T. +#### Overview -*InputVector2* is an N component vector of type T. +Accumulates the components of a vector atomically (with device scope) to the +corresponding elements of an array in memory. -*ResultMatrix* is the resulting *MxN* matrix accumulated atomically (with device scope) in memory - (RWByteAddressBuffer) at offset *ResultMatrixOffset*. The base address and *ResultMatrixOffset* of the Matrix buffer - must be 64B aligned. +#### Arguments -*ResultMatrixStride* for RowMajor or ColumnMajor layouts, is the number of bytes to go from one row/column to the next. - For optimal lyouts, stride is ignored. +The input vector is specified by **input vector**, and has `NUM` elements of type `TY`. - **OuterProductAccDescriptor Parameters** +The output array is accumulated to the writeable raw-buffer resource specified +by **output array resource** and **output array offset**. The base address and +**output array offset** must be 64 byte aligned. - *Mi* determines the type of the Result Matrix. See [Type Interpretations](#type-interpretations) section for more - details. +[CheckFeatureSupport] can be used to determine which vector element types can be accumulated. A list of types that are guaranteed to be supported on all devices can be found in [Minimum Support Set]. - *Layout* is an enum value, `DXILMatrixLayout`. Optimal layouts are opaque implementation specific layouts, the D3D call - `CooperativeVectorConvertMatrix` can be used to convert the *Matrix* to an optimal layout. Row-Major and Column-Major - layouts are also supported. -The device should be queried in [Check Feature Support](#check-feature-support) to determine datatypes of InputVector -supported along with the AccumulationType. An exception to this rule is the set of combinations guaranteed to be -supported on all devices supporting this feature. See [Minimum Support Set](#minimum-support-set). +[Type Interpretations]: #type-interpretations +[Matrix Layouts]: #matrix-layouts +[Matrix Transpose]: #matrix-transpose +[Minimum Support Set]: #minimum-support-set +[CheckFeatureSupport]: #check-feature-support -**Reduce Sum Accumulate** +### Type Interpretations -Accumulates the components of a vector atomically (with device scope) to the corresponding elements of an array in -memory. 
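A scalar sketch of this accumulation (editorial; the atomic, device-scope behavior is elided):

```c++
#include <cstddef>

// Sketch: each component of the input vector is added to the corresponding
// array element; the real operation performs each add atomically with device
// scope, so concurrent invocations accumulate correctly.
void ReduceSumAccumulateRef(const float* inputVector, float* outputArray, size_t n) {
    for (size_t i = 0; i < n; ++i)
        outputArray[i] += inputVector[i];
}
```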
- -``` -template -void ReduceSumAccumulate(vector InputVector, - RWByteAddressBuffer Buf, - uint BufOffset); +The various "interpretation" arguments specify a value from the following enum: +```c++ +enum class DXILTypeInterpretation :uint { + Float16 = 0, + Float32 = 1, + UnsignedInt8 = 2, + UnsignedInt16 = 3, + UnsignedInt32 = 4, + SignedInt8 = 5, + SignedInt16 = 6, + SignedInt32 = 7, + SignedInt8x4Packed = 8, + UnsignedInt8x4Packed = 9, + FloatE4M3 = 10, + FloatE5M2 = 11, + Unsupported = 32 +}; ``` -*InputVector* is an M component vector of type T. +For matrices and vectors that are specified by resource handles and stored in +raw-buffers, the interpretation value directly specifies the element type. It +is invalid to specify a packed interpretation in these cases. -*Buf* is the array into which the *InputVector* is accummulated. The base address and *BufOffset* of the buffer - must be 64B aligned. +For input vectors that come from variables there is a distinction between the +physical type and the logical type. The **input interpretation** argument for +these vectors describes how to convert from the physical to logical type. This +allows elements to be interpreted as types not natively supported by HLSL, e.g. +uint8/sint8. For packed interpretations, a single physical element can expand +into multiple logical elements. -*BufOffset* is the offset to the first element of the array to which the *InputVector* is accummulated. It is 64B aligned. +[CheckFeatureSupport] can be used to determine what combinations of **TYi**, +**input interpretation**, **matrix interpretation**, **matrix transpose**, +**bias vector interpretation** and **TYo** are supported on a particular +implementation. A list of combinations that are guaranteed to be supported on +all implementations can be found in [Minimum Support Set]. Note that there is no +guaranteed support for **matrix tranpose**, and so it must always queried. s +queried. -The device should be queried in [Check Feature Support](#check-feature-support) to determine datatypes of InputVector supported along -with the AccumulationType. An exception to this rule is the set of combinations guaranteed to be supported on all -devices supporting this feature. See [Minimum Support Set](#minimum-support-set). - -### Example HLSL Shader - -// XXX TODO +Non-"Packed" type interpretations are used to request arithmetic conversions. Input type must be a 32-bit or 16-bit +scalar integer or a 32-bit or 16-bit float. Integer to integer conversion saturates, float to float conversion is +implementation dependent and preserves the value as accurately as possible. Float to integer conversion is RTNE and +saturating. Integer to float conversion is RTNE. -### Interchange Format Additions +> TODO: These rules make sense for NN applications but diverge from HLSL +> conversion rules +> [here](https://microsoft.github.io/hlsl-specs/specs/hlsl.html#Conv). -**Vector Matrix Multiply(Add)** +"Packed" type conversions are bitcasts to a smaller type. The declared input type must be 32-bit unsigned integer. -*HLSL* +> /// XXX TODO: Error handling for illegal conversions. -``` -template -vector VectorMatrixMulAdd(vector InputVector, - (RW)ByteAddressBuffer Matrix, - uint MatrixOffset, - uint MatrixStride, - (RW)ByteAddressBuffer BiasVector, - uint BiasOffset); +Examples: +Packed Case: +``` llvm +; Using SignedInt8x4Packed input interpretation, each uint element (32-bit) in the +; input vector will be interpreted as 4 int8 values. 
+; +; Note that TYi = i32 and NUMi = 8 (8 x 4 = 32 sint8 values ), and the result is a +; 32-element vector. + +%inputVector = <8 x i32> ... + +%result = <32 x i32> call @dx.op.matvecmul.v[32][i32].v[8][i32]( + OPCODE, + %inputVector, + 8, ; input interpretation - SignedInt8x4Packed + %matrixResource, + 0, ; matrix offset + 5, ; matrix interpretation - SignedInt8 + 32, ; matrix M dimension + 32, ; matrix K dimension + 2, ; matrix layout - InferencingOptimal + 0, ; matrix transpose - false + 0, ; matrix stride + 1); ; isResultSigned - true ``` -*DXIL* - -``` - @dx.op.vecmatmul.v.v(i32 opcode, - %ipVec, - i32 inputInterpretation, - %dx.types.Handle %matrix, - i32 %matrixoffset, - i32 matrixInterpretation, - i32 matrixMdim, - i32 matrixKdim, - i32 matrixLayout, - i32 matrixTranspose, - i32 matrixStride - i1 isResultSigned); -``` +Non-Packed Case: +``` llvm +; Using SignedInt8 input interpretation, each float element will be arithmetically +; converted to a sint8 value. -**Outer Product Accumulate** +%inputVector = <32 x float> ... -*HLSL* +%result = <64 x i32> call @dx.op.matvecmul.v[64][i32].v[32][float]( + OPCODE, + %inputVector, + 5, ; input interpretation - SignedInt8 + %matrixResource, + 0, ; matrix offset + 5, ; matrix interpretation - SignedInt8 + 64, ; matrix M dimension + 32, ; matrix K dimension + 2, ; matrix layout - InferencingOptimal + 0, ; matrix transpose - false + 0, ; matrix stride + 1) ; isResultSigned - true +``` -``` -template -void OuterProductAccumulate(vector InputVector1, - vector InputVector2, - RWByteAddressBuffer ResultMatrix, - uint ResultMatrixOffset, - uint ResultMatrixStride); -``` +### Matrix Layouts -*DXIL* +The **matrix layout** argument specifies a value from the following enum: -``` -void @dx.op.vecouterproductacc.v.v(i32 opcode, %ipVec1, - %ipVec2, - %dx.types.Handle %matrix, - i32 %matrixoffset, - i32 %matrixstride, - i32 matrixInterpretation, - i32 matrixLayout); +```c++ +enum class DXILMatrixLayout : uint { + RowMajor = 0, + ColumnMajor = 1, + InferencingOptimal = 2, + TrainingOptimal = 3, +}; ``` +Optimal layouts are opaque implementation specific layouts, the D3D call +`CooperativeVectorConvertMatrix` can be used to convert the *Matrix* to an +optimal layout. Row-Major and Column-Major layouts are also supported. -**Reduce Sum Accumulate** + +### Matrix Transpose -*HLSL* +The **matrix transpose** parameter indicates if the matrix is transposed before +performing the multiply. In linear algebra, +the[transpose](https://en.wikipedia.org/wiki/Transpose) of a matrix is an +operator which flips a matrix over its diagonal; that is, it switches the row +and column indices of the matrix. -``` -void ReduceSumAccumulate(vector InputVector, - RWByteAddressBuffer Buf, - uint BufOffset); +Transposing is not supported for the RowMajor/ColumnMajor layouts. -``` - -*DXIL* -``` -void @dx.op.vecreducesumacc.v(i32 opcode, - %ipVec, - %dx.types.Handle %buf, - i32 %bufoffset); -``` +Not all component types support transposing. It is left to implementations to +define which types support matrix transposing. "TransposeSupported" flag from +the [CheckFeatureSupport] (#check-feature-support) struct is used to determine +if a matrix transpose is supported. Note that even for the type/interpretation +combinations described in [Minimum Support Set], transpose support isn't +guaranteed and needs to be checked explicitly. 
### Non-Uniform control flow @@ -488,25 +449,25 @@ implementations can enable fast paths by allowing vectors to cooperate behind th fully occupied waves and uniform values for Matrix, Matrix Offset, Matrix Interpretation, Matrix Layout, Matrix Stride, Matrix Transpose and Bias, Bias Offset, Bias Interpretation, but this is not a requirement for functionality. -### Shade Stages +### Shader Stages The vector-matrix intrinsics are expected to be supported in all shader stages. -// XXX TODO: Add query to determine which shader stages support these intrinsics. +> TODO: Add query to determine which shader stages support these intrinsics. ### Diagnostic Changes * Diagnostics for incorrect use of the new intrinsics. -#### Validation Changes +### Validation Changes -### D3D12 API Additions +#### D3D12 API Additions Note: The enums and structs need to be updated from the coop_vec name, once a new name for the feature is decided. -#### Check Feature Support +### Check Feature Support This feature requires calling CheckFeatureSupport(). Additional D3D12_FEATURE enum and corresponding D3D12_FEATURE_DATA* structs (listed below) are added to enable discovering the Cooperative Vector tier along with the datatype and interpretation combinations supported by new vector-matrix intrinsics. @@ -585,15 +546,15 @@ If pProperties is non-NULL for any intrinsic but its PropCount is less than the // XXX TODO: Add query for emulated types. For example E4M3 and E5M2 might not be supported on certain h/w, but since these are in the minimum support set, they need to be emulated, possibly using FP16. Add capability for the application to query which types are natively supported and which ones are emulated. -### Minimum Support Set +#### Minimum Support Set Minimum set of properties that implementations are required to support for each intrinsic are listed below. -#### For VectorMatrixMulAdd +##### For Matrix-Vector Multiply and Multiply-Add Note that value of `TransposeSupported` is never guaranteed and needs to be explicitly checked for the combinations below. -``` + | InputType | InputInterpretation | MatrixInterpretation | BiasInterpretation | OutputType | |--------------|---------------------|----------------------|--------------------|------------| | FP16 | FP16 | FP16 | FP16 | FP16 | @@ -601,28 +562,25 @@ Note that value of `TransposeSupported` is never guaranteed and needs to be expl | FP16 | E5M2 | E5M2 | FP16 | FP16 | | SINT8_PACKED | SINT8 | SINT8 | SINT32 | SINT32 | | FP32 | SINT8 | SINT8 | SINT32 | SINT32 | -``` -#### For OuterProductAccumulate -``` +##### For OuterProductAccumulate + | InputType | AccumulationType | |-----------|------------------| | FP16 | FP16 | | FP16 | FP32 | -``` -#### For ReduceSumAccumulate +##### For ReduceSumAccumulate -``` | InputType | AccumulationType | |-----------|------------------| | FP16 | FP16 | -``` -**Usage Example:** -``` +#### Usage Example + +```c++ // Check for CooperativeVector support and query properties for VectorMatrixMulAdd D3D12_FEATURE_DATA_D3D12_OPTIONSNN CoopVecSupport = {}; @@ -660,8 +618,8 @@ specific alignment constraints and performance characteristics. We introduce a d dataype of the weight matrix from and to any of the layouts in `D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT` and datatypes in `D3D12_COOPERATIVE_VECTOR_DATATYPE`. 
-``` -typedef enum D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT { +```c++ +enum D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT { D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT_ROW_MAJOR, D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT_COLUMN_MAJOR, D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT_INFERENCING_OPTIMAL, @@ -673,7 +631,7 @@ typedef enum D3D12_COOPERATIVE_VECTOR_MATRIX_LAYOUT { The destination buffer (to hold the matrix) size can be implementation dependent. The API `GetCooperativeVectorMatrixConversionDestinationInfo` is added to query the size of the destination buffer in the desired layout and datatype. It takes a pointer to `D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_DEST_INFO` descriptor that provides the inputs required to calculate the necessary size. The same descriptor, updated with the calculated output size, is then passed to the conversion API. -``` +```c++ // Descriptor to query the destination buffer size typedef struct D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_DEST_INFO { @@ -702,7 +660,7 @@ void ID3D12Device::GetCooperativeVectorMatrixConversionDestinationInfo( After the size of the destination buffer is known, user can pass the `D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_DEST_INFO` descriptor along with information of source layout and datatype in `D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_SOURCE_INFO` and addresses of the source and destination buffers to the layout and datatype conversion API. -``` +```c++ // GPU VAs of source and destination buffers @@ -743,10 +701,10 @@ typedef struct D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_INFO { New API is added to the ID3D12CommandList interface. Multiple conversions can be done in a single call of the API. The number of descriptors pointed to by pDesc is specified using descCount. If DestSize passed to this API is less than the number of bytes returned in call to `GetCooperativeVectorMatrixConversionDestinationInfo`, behavior is undefined. -``` +```c++ // Converts source matrix to desired layout and datatype void ID3D12CommandList::CooperativeVectorConvertMatrix(D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_INFO* pDesc, - UINT DescCount); + UINT DescCount); ``` @@ -766,7 +724,7 @@ void ID3D12CommandList::CooperativeVectorConvertMatrix(D3D12_COOPERATIVE_VECTOR_ *Usage Example:* -``` +```c++ D3D12_COOPERATIVE_VECTOR_MATRIX_CONVERSION_INFO infoDesc = { From 260b917b9862e39866cdadcc302ce68dfff04513 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 11:57:50 -0800 Subject: [PATCH 03/14] Add immargs, fix typos, remove unnecessary TODO --- proposals/0029-cooperative-vector.md | 51 ++++++++++++++-------------- 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index b81abcb6..23024c46 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -123,35 +123,35 @@ Introduce new DXIL operations to accelarate matrix-vector operations. In this sp ``` llvm declare <[NUMo] x [TYo] @dx.op.matvecmul.v[NUMo][TYo].v[NUMi][TYi]( - i32 ; opcode + immarg i32 ; opcode <[NUMi] x [TYi]>, ; input vector - i32, ; input interpretation + immarg i32, ; input interpretation %dx.types.Handle, ; matrix resource i32, ; matrix offset - i32, ; matrix interpretation - i32, ; matrix M dimension - i32, ; matrix K dimension - i32, ; matrix layout - i32, ; matrix transpose <<< should this be i1? 
+ immarg i32, ; matrix interpretation + immarg i32, ; matrix M dimension + immarg i32, ; matrix K dimension + immarg i32, ; matrix layout + immarg i32, ; matrix transpose <<< should this be i1? i32, ; matrix stride - i1) ; isResultSigned <<< what is this for? + immarg i1) ; isResultSigned <<< See #399 declare <[NUMo] x [TYo]> @dx.op.matvecmuladd.v[NUMo][TYo].v[NUMi][TYi]( - i32 ; opcode + immarg i32 ; opcode <[NUMi] x [TYi]>, ; input vector - i32, ; input interpretation + immarg i32, ; input interpretation %dx.types.Handle, ; matrix resource i32, ; matrix offset - i32, ; matrix interpretation - i32, ; matrix M dimension - i32, ; matrix K dimension - i32, ; matrix layout - i32, ; matrix transpose <<< should this be i1? + immarg i32, ; matrix interpretation + immarg i32, ; matrix M dimension + immarg i32, ; matrix K dimension + immarg i32, ; matrix layout + immarg i32, ; matrix transpose <<< should this be i1? i32, ; matrix stride %dx.types.Handle, ; bias vector resource i32, ; bias vector offset - i32, ; bias vector interpretation - i1) ; isResultSigned <<< what is this for? + immarg i32, ; bias vector interpretation + immarg i1) ; isResultSigned <<< See #399 ``` #### Overview @@ -164,12 +164,14 @@ The `@dx.op.matvecmuladd` operation behaves as `@dx.op.matvecmul`, but also adds an **M**-sized bias vector (loaded from memory) to the result. > Note that the dimensions of the matrix are **M**x**K** versus **M**x**N** -> usually found in linear algebra texbooks. This is to futureproof for potential -> matrix-matrix operations in the future where the inputs could be **M**x**K** -> and **K**x**N** to produce an **M**x**N** result matrix. +> usually found in linear algebra textbooks. This is to futureproof for +> potential matrix-matrix operations in the future where the inputs could be +> **M**x**K** and **K**x**N** to produce an **M**x**N** result matrix. #### Arguments +| Argument | Type | + ##### Input Vector The **input vector** is of size `NUMi` and contains elements of physical type @@ -212,7 +214,6 @@ aligned. ##### Bias Matrix -> * TODO: alignment requirements for **bias vector offset** > * TODO: are packed types allowed for bias vectors? The bias matrix is loaded from the raw-buffer, **bias vector resource**, @@ -229,14 +230,14 @@ The base address of **bias vector resource** and **bias vector offset** must be ``` llvm declare void @dx.op.vecouterproductacc.v[M][TY].v[N][TY]( - i32, ; opcode + immarg i32, ; opcode <[M] x [TY]>, ; input vector 1 <[N] x [TY]>, ; input vector 2 %dx.types.Handle, ; matrix resource i32, ; matrix offset i32, ; matrix stride - i32, ; matrix interpretation - i32) ; matrix layout + immarg i32, ; matrix interpretation + immarg i32) ; matrix layout ``` #### Overview @@ -273,7 +274,7 @@ on all implementations can be found in [Minimum Support Set]. 
``` llvm declare void @dx.op.vecreducesumacc.v[NUM][TY]( - i32, ; opcode + immarg i32, ; opcode <[NUM] x [TY]>, ; input vector %dx.types.Handle, ; output array resource i32) ; output array offset From 6fec8ee76e7c7395b1a4bd18197487961917d314 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 12:48:31 -0800 Subject: [PATCH 04/14] Remove the NOTES FOR REVIEW section, can cover in PR --- proposals/0029-cooperative-vector.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 23024c46..856d6d8c 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -115,12 +115,6 @@ Introduce new DXIL operations to accelarate matrix-vector operations. In this sp #### Syntax -> NOTES FOR REVIEW, REMOVE BEFORE MERGING: -> -> Added @dx.op.matvecmuladd version, and also renamed from -> @dx.op.vecmatmul[add] to @dx.op.matvecmul[add] since it is a left matrix -> multiply. - ``` llvm declare <[NUMo] x [TYo] @dx.op.matvecmul.v[NUMo][TYo].v[NUMi][TYi]( immarg i32 ; opcode From be36c0c7da6f0cb34c4747da6e162fd03714729e Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:02:24 -0800 Subject: [PATCH 05/14] matrix transpose is now i1 --- proposals/0029-cooperative-vector.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 856d6d8c..a6780d26 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -126,7 +126,7 @@ declare <[NUMo] x [TYo] @dx.op.matvecmul.v[NUMo][TYo].v[NUMi][TYi]( immarg i32, ; matrix M dimension immarg i32, ; matrix K dimension immarg i32, ; matrix layout - immarg i32, ; matrix transpose <<< should this be i1? + immarg i1, ; matrix transpose i32, ; matrix stride immarg i1) ; isResultSigned <<< See #399 @@ -140,7 +140,7 @@ declare <[NUMo] x [TYo]> @dx.op.matvecmuladd.v[NUMo][TYo].v[NUMi][TYi]( immarg i32, ; matrix M dimension immarg i32, ; matrix K dimension immarg i32, ; matrix layout - immarg i32, ; matrix transpose <<< should this be i1? + immarg i1, ; matrix transpose i32, ; matrix stride %dx.types.Handle, ; bias vector resource i32, ; bias vector offset From 4d4d41614e4c3b85a6d946213148236e58fc3585 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:06:44 -0800 Subject: [PATCH 06/14] Remove TODO for thing now tracked by #402 --- proposals/0029-cooperative-vector.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index a6780d26..223d6b16 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -200,12 +200,6 @@ The base address of **matrix resource** and **matrix offset** must be 64 byte aligned. - -> TODO: consider using a different set of interpretation values for in-memory -> interpretations versus vector interpretations. - - - ##### Bias Matrix > * TODO: are packed types allowed for bias vectors? 
From 015801171eee2db7595803eafde827e94fbbad76 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:07:42 -0800 Subject: [PATCH 07/14] Fix formatting --- proposals/0029-cooperative-vector.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 223d6b16..7b8a7c62 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -416,10 +416,10 @@ optimal layout. Row-Major and Column-Major layouts are also supported. ### Matrix Transpose The **matrix transpose** parameter indicates if the matrix is transposed before -performing the multiply. In linear algebra, -the[transpose](https://en.wikipedia.org/wiki/Transpose) of a matrix is an -operator which flips a matrix over its diagonal; that is, it switches the row -and column indices of the matrix. +performing the multiply. In linear algebra, the +[transpose](https://en.wikipedia.org/wiki/Transpose) of a matrix is an operator +which flips a matrix over its diagonal; that is, it switches the row and column +indices of the matrix. Transposing is not supported for the RowMajor/ColumnMajor layouts. From 77a6800f707195ab5897d39170e6a185c608858d Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:08:16 -0800 Subject: [PATCH 08/14] Remove out-of-date TODO --- proposals/0029-cooperative-vector.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index 7b8a7c62..ce6ad997 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -442,8 +442,6 @@ Matrix Transpose and Bias, Bias Offset, Bias Interpretation, but this is not a r The vector-matrix intrinsics are expected to be supported in all shader stages. -> TODO: Add query to determine which shader stages support these intrinsics. - ### Diagnostic Changes * Diagnostics for incorrect use of the new intrinsics. From 842ac448696923429c4ef98427aed2c26c56eec3 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:08:48 -0800 Subject: [PATCH 09/14] Fix typos / grammar --- proposals/0029-cooperative-vector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index ce6ad997..e3360db8 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -21,7 +21,7 @@ use of GPUs for general purpose ML/DL means that GPU vendors continue to add mor accelerate neural network computations, like accelerating matrix operations. This proposal introduces DXIL operations for vector-matrix operations that can -accelerated by the underlying hardware, building on support for long verctors +be accelerated by the underlying hardware, building on support for long vectors described in proposals [0026] and [0030]. This proposal describes HLSL builtins that can be used for example and testing purposes as well as the building blocks for a high-level HLSL API. The high-level API is described in proposal \[TBD\]. 
From 783acc057892257d42deff10d10ff750b711bf8a Mon Sep 17 00:00:00 2001
From: Damyan Pepper
Date: Thu, 20 Feb 2025 17:09:47 -0800
Subject: [PATCH 10/14] update / fix bias vector section

---
 proposals/0029-cooperative-vector.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md
index e3360db8..e07955e1 100644
--- a/proposals/0029-cooperative-vector.md
+++ b/proposals/0029-cooperative-vector.md
@@ -200,14 +200,14 @@
 aligned.
 
-##### Bias Matrix
+##### Bias Vector
 
-> * TODO: are packed types allowed for bias vectors?
-
-The bias matrix is loaded from the raw-buffer, **bias vector resource**,
+The bias vector is loaded from the raw-buffer, **bias vector resource**,
 starting at **bias vector offset**. The **bias vector interpretation** argument
 specifies the element type of the bias vector (see [Type Interpretations]).
 
+Only non-packed interpretations are valid for bias vectors.
+
 The base address of **bias vector resource** and **bias vector offset** must be
 64 byte aligned.

From e931151076bc26dcdac0e72c0a51b8e20ffb2482 Mon Sep 17 00:00:00 2001
From: Damyan Pepper
Date: Thu, 20 Feb 2025 17:12:05 -0800
Subject: [PATCH 11/14] Clarify atomic operations, add Conversion Rules section header

---
 proposals/0029-cooperative-vector.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md
index e07955e1..d5e24ac9 100644
--- a/proposals/0029-cooperative-vector.md
+++ b/proposals/0029-cooperative-vector.md
@@ -230,7 +230,8 @@ declare void @dx.op.vecouterproductacc.v[M][TY].v[N][TY](
 
 #### Overview
 
-Computes the outer product between column vectors and an **M**x**N** matrix* is accumulated atomically (with device scope) in memory.
+The outer product of the two input column vectors is computed, and the resulting
+**M**x**N** matrix is accumulated component-wise atomically (with device scope) in memory.
 
 ```
 ResultMatrix = InputVector1 * Transpose(InputVector2);
 ```
@@ -270,8 +271,8 @@ declare void @dx.op.vecreducesumacc.v[NUM][TY](
 
 #### Overview
 
-Accumulates the components of a vector atomically (with device scope) to the
-corresponding elements of an array in memory.
+Accumulates the components of a vector component-wise atomically (with device
+scope) to the corresponding elements of an array in memory.
 
 #### Arguments
 
@@ -332,6 +333,8 @@ all implementations can be found in [Minimum Support Set]. Note that there is no
 guaranteed support for **matrix tranpose**, and so it must always queried. s
 queried.
 
+#### Conversion Rules
+
 Non-"Packed" type interpretations are used to request arithmetic conversions. Input type must be a 32-bit or 16-bit
 scalar integer or a 32-bit or 16-bit float. Integer to integer conversion saturates, float to float conversion is
 implementation dependent and preserves the value as accurately as possible.
Float to integer conversion is RTNE and From ddd41197c51d0d400be905783ac14d4da8a0fd6a Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 17:26:03 -0800 Subject: [PATCH 12/14] Add shashankw to author list --- proposals/0029-cooperative-vector.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index d5e24ac9..ea248e6a 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -1,7 +1,8 @@ * Proposal: [0029](0029-cooperative-vector.md) -* Author(s): [Anupama Chandrasekhar][anupamachandra], [Damyan Pepper][damyanp] +* Author(s): [Anupama Chandrasekhar][anupamachandra], [Damyan Pepper][damyanp], + [Shashank Wadhwa][shashankw] * Sponsor: [Damyan Pepper][damyanp], [Greg Roth][pow2clk] * Status: **Under Review** * Planned Version: Shader Model 6.9 @@ -10,6 +11,7 @@ [anupamachandra]: https://github.com/anupamachandra [damyanp]: https://github.com/damyanp [pow2clk]: https://github.com/pow2clk +[shashankw]: https://github.com/shashankw # Cooperative Vectors From 0cbeddb6368f45f503be080faeea918ddada3c36 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Thu, 20 Feb 2025 20:07:54 -0800 Subject: [PATCH 13/14] Remove incompleted table --- proposals/0029-cooperative-vector.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index ea248e6a..cb8cb99e 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -166,8 +166,6 @@ an **M**-sized bias vector (loaded from memory) to the result. #### Arguments -| Argument | Type | - ##### Input Vector The **input vector** is of size `NUMi` and contains elements of physical type From 594b6c45fd0f21cc19ca373249b1c91bf3ce9d68 Mon Sep 17 00:00:00 2001 From: Damyan Pepper Date: Fri, 21 Feb 2025 11:15:18 -0800 Subject: [PATCH 14/14] Address @spall's feedback. --- proposals/0029-cooperative-vector.md | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/proposals/0029-cooperative-vector.md b/proposals/0029-cooperative-vector.md index cb8cb99e..bfd49b3f 100644 --- a/proposals/0029-cooperative-vector.md +++ b/proposals/0029-cooperative-vector.md @@ -24,9 +24,8 @@ accelerate neural network computations, like accelerating matrix operations. This proposal introduces DXIL operations for vector-matrix operations that can be accelerated by the underlying hardware, building on support for long vectors -described in proposals [0026] and [0030]. This proposal describes HLSL builtins -that can be used for example and testing purposes as well as the building blocks -for a high-level HLSL API. The high-level API is described in proposal \[TBD\]. +described in proposals [0026] and [0030]. The high-level API is described in +proposal \[TBD\]. [0026]: 0026-hlsl-long-vector-type.md [0030]: 0030-dxil-vectors.md @@ -159,7 +158,7 @@ stored in a variable. The `@dx.op.matvecmuladd` operation behaves as `@dx.op.matvecmul`, but also adds an **M**-sized bias vector (loaded from memory) to the result. -> Note that the dimensions of the matrix are **M**x**K** versus **M**x**N** +> Note that the dimensions of the matrix are **M**x**K** versus the **M**x**N** > usually found in linear algebra textbooks. This is to futureproof for > potential matrix-matrix operations in the future where the inputs could be > **M**x**K** and **K**x**N** to produce an **M**x**N** result matrix. 
@@ -173,8 +172,8 @@ The **input vector** is of size `NUMi` and contains elements of physical type the vector. `NUMi` has a relationship with **K** as follows: * for non-packed interpretations: `NUMi` equals **K**, -* for packed interpretations: `NUMi` equals the least number that can hold **K** - values of the packed type. +* for packed interpretations: `NUMi` equals the smallest number that can hold + **K** values of the packed type. Non-packed interpretations are standard types such as float16, uint etc. Packed types are types such as **SignedInt8x4Packed** where each 32-bit element of the @@ -253,8 +252,8 @@ aligned. Not all combinations of vector element type and matrix interpretations are supported by all implementations. [CheckFeatureSupport] can be used to determine -which combinations. A list of combinations that are guaranteed to be supported -on all implementations can be found in [Minimum Support Set]. +which combinations are supported. A list of combinations that are guaranteed to +be supported on all implementations can be found in [Minimum Support Set]. ### Reduce Sum Accumulate @@ -330,8 +329,7 @@ into multiple logical elements. **bias vector interpretation** and **TYo** are supported on a particular implementation. A list of combinations that are guaranteed to be supported on all implementations can be found in [Minimum Support Set]. Note that there is no -guaranteed support for **matrix tranpose**, and so it must always queried. s -queried. +guaranteed support for **matrix tranpose**, and so it must always be queried. #### Conversation Rules