- The `Gccjit` backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
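  A minimal sketch of the idea, assuming a hypothetical generated routine `step` with one input and one output tensor (the names, sizes, and signature are invented for illustration, not OCANNL's actual generated code):

  ```c
  #include <string.h>

  #define N 1024

  /* Hypothetical sketch of a generated routine: the "device" copy lives on
     the C stack, so the hot loop touches stack memory rather than the
     shared host arrays. */
  void step(float *host_x, float *host_y) {
    float x[N], y[N];              /* "device memory" = this function's stack */
    memcpy(x, host_x, sizeof x);   /* host-to-device at the beginning */
    for (int i = 0; i < N; ++i)
      y[i] = 2.0f * x[i];          /* compute entirely on the stack copies */
    memcpy(host_y, y, sizeof y);   /* device-to-host at the end */
  }
  ```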
- Three / four synchronization heuristics (illustrated in the sketch after this list):
  - "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
  - "update on host": the tensor is copied host-to-device at the beginning; each write is an update that reads the old value from the host and writes the new value back to the host. Thus each write is a synchronization point.
  - "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
  - "device-only": no copying to/from host.
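  A sketch contrasting the copy behavior of the four heuristics in a hypothetical per-task routine (all names and sizes are invented; this is not OCANNL's actual output):

  ```c
  #include <string.h>

  #define SLICE 256
  #define STATS 16

  /* Hypothetical per-task routine showing one tensor per heuristic. */
  void task(int task_id, float *host_w, float *host_loss, float *host_stats) {
    /* "parallel": copy in only this task's slice and copy the same slice
       out; tasks don't interfere because their slices are disjoint. */
    float w[SLICE];
    memcpy(w, host_w + task_id * SLICE, sizeof w);
    for (int i = 0; i < SLICE; ++i)
      w[i] *= 0.99f;
    memcpy(host_w + task_id * SLICE, w, sizeof w);

    /* "update on host": each write reads the old value from the host and
       writes the updated value back, so every write is a sync point. */
    host_loss[0] = host_loss[0] + 0.5f;

    /* "replicated": every task copies the tensor in; only task 0 copies
       it back out. */
    float stats[STATS];
    memcpy(stats, host_stats, sizeof stats);
    for (int i = 0; i < STATS; ++i)
      stats[i] += 1.0f;
    if (task_id == 0)
      memcpy(host_stats, stats, sizeof stats);

    /* "device-only": a stack temporary with no host counterpart. */
    float scratch[SLICE];
    scratch[0] = w[0];
    (void)scratch;
  }
  ```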
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- Currently, for tensors processed in parallel, we only support processing of a contiguous tensor slice (copied "to device" using `memcpy`).
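  The contiguity restriction is what allows a slice to move with a single `memcpy`; a strided slice would need an element-by-element loop, which is the (currently unsupported) alternative. A sketch with invented names:

  ```c
  #include <string.h>

  /* Contiguous slice: supported -- one memcpy per direction. */
  void copy_in_contiguous(float *dev, const float *host, int task_id, int len) {
    memcpy(dev, host + (size_t)task_id * len, (size_t)len * sizeof *dev);
  }

  /* Strided slice: would need a per-element copy loop instead of a single
     memcpy; this is the case that is not supported yet. */
  void copy_in_strided(float *dev, const float *host, int task_id,
                       int len, int stride) {
    for (int i = 0; i < len; ++i)
      dev[i] = host[task_id + i * stride];
  }
  ```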
- A new syntax `%nn_rs` (a "postprocess results" variant of `%nn_dt`) for computations that should happen at the end of a task execution / refresh step. It's meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- Keeping the `Rebalance` constructor around, but it is not playing any role.
- Got rid of `debug_virtual_nodes`; it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (see the sketch below).
- Removed the "indexing axes from-right" functionality for now (it fails as not implemented).
- Dynamic indexing now can produce virtual nodes.
- Dynamic indexing fixes.
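  A worked sketch of the skipping rule, with invented shapes: for a tensor with axes `[Parallel(T); N; M]`, a dynamic index `i` leaves the `Parallel` axis in place and indexes into `N`, so the resulting slice has axes `[Parallel(T); M]`:

  ```c
  #define T 4   /* Parallel axis */
  #define N 8   /* axis the dynamic index lands on */
  #define M 16

  /* slice[t][m] = x[t][i][m]: the Parallel axis t is preserved and the
     dynamic index i applies to the next axis to the right. */
  void dyn_index(float slice[T][M], const float x[T][N][M], int i) {
    for (int t = 0; t < T; ++t)
      for (int m = 0; m < M; ++m)
        slice[t][m] = x[t][i][m];
  }
  ```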
- Thread-local parameter `task_id` for automated iteration over a `Parallel` dimension (see the sketch below).
- This implements multicore SGD.
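  A sketch of the resulting per-task SGD update, where the thread-local `task_id` selects the slice of the `Parallel` dimension (the names and slice size are invented):

  ```c
  #define SLICE 1024

  /* Each task updates only its own slice of the parameters; disjoint
     slices mean no write conflicts between tasks. */
  void sgd_task(int task_id, float *w, const float *grad, float lr) {
    int lo = task_id * SLICE;
    for (int i = lo; i < lo + SLICE; ++i)
      w[i] -= lr * grad[i];
  }
  ```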
- Rebalancing of computations that don't use `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- Already provides significant speedups in the interpreter (6-7x for me), but that's a moot point.
- Giving up further work on this approach for now, because the bottleneck is memory access with `Gccjit`.
- Keeping the new representation capability around; maybe it will be a stepping stone to other things.
- Monolithic step update with "macrobatch" (multiple steps within one backend call).
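  Schematically (an invented sketch, not the actual generated code), a macrobatch folds several step updates into one generated routine, so the per-call overhead is paid once per macrobatch rather than once per step:

  ```c
  #define MACROBATCH 32

  /* One backend call executes MACROBATCH step updates back to back;
     grads holds one gradient vector of length n per step. */
  void macrobatch_step(float *w, const float *grads, int n, float lr) {
    for (int s = 0; s < MACROBATCH; ++s)   /* multiple steps ... */
      for (int i = 0; i < n; ++i)
        w[i] -= lr * grads[s * n + i];     /* ... within one call */
  }
  ```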
- Streamlined the source code, e.g. removed the `OCaml` backend.
- Better syntax for `%nn_dt` and `%nn_op` shape specification; allows identifiers.
- Improved virtual node and scalar constant inlining.
- Better debugging, e.g. an option to "trace" `Gccjit` execution by printing the comments.
- An inline constants optimization that computes scalar constant subexpressions at compile time and inlines the resulting values.
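  For example (illustrative only; a C compiler would often fold this too, but here it is the code generator that does it before emitting the code):

  ```c
  /* Before: the scalar constant subexpression appears in the emitted code. */
  void scale_before(float *y, const float *x, int n) {
    for (int i = 0; i < n; ++i)
      y[i] = x[i] * (2.0f * 3.14159265f);
  }

  /* After: the subexpression was computed at compile time and the
     resulting value was inlined. */
  void scale_after(float *y, const float *x, int n) {
    for (int i = 0; i < n; ++i)
      y[i] = x[i] * 6.2831853f;
  }
  ```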
- Improved debuggability.
- Fixed a last-minute breaking bug (it would be nice to have a pre-release or a pre-publish hook to run tests!).
- The virtual nodes optimization is more robust: it is correct even with aggressive inlining settings (e.g. thanks to an escaping-variables check).
- The first changes-tracking release. Earlier development history is still somewhat documented via closed issues.
- Supports single- and double-precision floats; more precisions in the future.
- Generates a monolithic step update routine executed by `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times, to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can, for example, log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
- A virtual nodes (tensors) optimization that inlines the computation of a cell in lieu of tensor accesses; it can sometimes reduce memory consumption by 1/3.
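  A sketch of the effect on generated code (invented example): with the optimization, the cell computation of the intermediate tensor is inlined where the tensor would have been read, so the intermediate array is never allocated:

  ```c
  /* Without the optimization: the intermediate tensor t is materialized. */
  void before(float *u, float *t, const float *a, const float *b,
              const float *c, int n) {
    for (int i = 0; i < n; ++i) t[i] = a[i] + b[i];
    for (int i = 0; i < n; ++i) u[i] = t[i] * c[i];
  }

  /* With the optimization: t is virtual -- its cell computation is inlined
     in lieu of the tensor access, and no memory for t exists. */
  void after(float *u, const float *a, const float *b,
             const float *c, int n) {
    for (int i = 0; i < n; ++i)
      u[i] = (a[i] + b[i]) * c[i];
  }
  ```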