- The `Gccjit` backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
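  A minimal sketch of the idea, assuming a hypothetical generated routine `step` with one input and one output tensor (the names, sizes, and signature are invented for illustration, not OCANNL's actual generated code):

  ```c
  #include <string.h>

  #define N 1024

  /* Hypothetical sketch of a generated routine: the "device" copy lives on
     the C stack, so the hot loop touches stack memory rather than the
     shared host arrays. */
  void step(float *host_x, float *host_y) {
    float x[N], y[N];              /* "device memory" = this function's stack */
    memcpy(x, host_x, sizeof x);   /* host-to-device at the beginning */
    for (int i = 0; i < N; ++i)
      y[i] = 2.0f * x[i];          /* compute entirely on the stack copies */
    memcpy(host_y, y, sizeof y);   /* device-to-host at the end */
  }
  ```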
- Three / four synchronization heuristics (illustrated in the sketch after this list):
  - "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
  - "update on host": the tensor is copied host-to-device at the beginning; each write is an update that reads the old value from the host and writes the new value back to the host. Thus each write is a synchronization point.
  - "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
  - "device-only": no copying to/from host.
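  A sketch contrasting the copy behavior of the four heuristics in a hypothetical per-task routine (all names and sizes are invented; this is not OCANNL's actual output):

  ```c
  #include <string.h>

  #define SLICE 256
  #define STATS 16

  /* Hypothetical per-task routine showing one tensor per heuristic. */
  void task(int task_id, float *host_w, float *host_loss, float *host_stats) {
    /* "parallel": copy in only this task's slice and copy the same slice
       out; tasks don't interfere because their slices are disjoint. */
    float w[SLICE];
    memcpy(w, host_w + task_id * SLICE, sizeof w);
    for (int i = 0; i < SLICE; ++i)
      w[i] *= 0.99f;
    memcpy(host_w + task_id * SLICE, w, sizeof w);

    /* "update on host": each write reads the old value from the host and
       writes the updated value back, so every write is a sync point. */
    host_loss[0] = host_loss[0] + 0.5f;

    /* "replicated": every task copies the tensor in; only task 0 copies
       it back out. */
    float stats[STATS];
    memcpy(stats, host_stats, sizeof stats);
    for (int i = 0; i < STATS; ++i)
      stats[i] += 1.0f;
    if (task_id == 0)
      memcpy(host_stats, stats, sizeof stats);

    /* "device-only": a stack temporary with no host counterpart. */
    float scratch[SLICE];
    scratch[0] = w[0];
    (void)scratch;
  }
  ```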
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- Currently, for tensors processed in parallel, we only support processing of a contiguous tensor slice (copied "to device" using `memcpy`).
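  The contiguity restriction is what allows a slice to move with a single `memcpy`; a strided slice would need an element-by-element loop, which is the (currently unsupported) alternative. A sketch with invented names:

  ```c
  #include <string.h>

  /* Contiguous slice: supported -- one memcpy per direction. */
  void copy_in_contiguous(float *dev, const float *host, int task_id, int len) {
    memcpy(dev, host + (size_t)task_id * len, (size_t)len * sizeof *dev);
  }

  /* Strided slice: would need a per-element copy loop instead of a single
     memcpy; this is the case that is not supported yet. */
  void copy_in_strided(float *dev, const float *host, int task_id,
                       int len, int stride) {
    for (int i = 0; i < len; ++i)
      dev[i] = host[task_id + i * stride];
  }
  ```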
- A new syntax `%nn_rs` (a "postprocess results" variant of `%nn_dt`) for computations that should happen at the end of a task execution / refresh step. It's meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- Keeping the `Rebalance` constructor around, but it is not playing any role.
- Got rid of `debug_virtual_nodes`; it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (see the sketch below).
- Removed the "indexing axes from-right" functionality for now (it fails as not implemented).
- Dynamic indexing now can produce virtual nodes.
- Dynamic indexing fixes.
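  A worked sketch of the skipping rule, with invented shapes: for a tensor with axes `[Parallel(T); N; M]`, a dynamic index `i` leaves the `Parallel` axis in place and indexes into `N`, so the resulting slice has axes `[Parallel(T); M]`:

  ```c
  #define T 4   /* Parallel axis */
  #define N 8   /* axis the dynamic index lands on */
  #define M 16

  /* slice[t][m] = x[t][i][m]: the Parallel axis t is preserved and the
     dynamic index i applies to the next axis to the right. */
  void dyn_index(float slice[T][M], const float x[T][N][M], int i) {
    for (int t = 0; t < T; ++t)
      for (int m = 0; m < M; ++m)
        slice[t][m] = x[t][i][m];
  }
  ```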
- Thread-local parameter `task_id` for automated iteration over a `Parallel` dimension (see the sketch below).
- This implements multicore SGD.
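  A sketch of the resulting per-task SGD update, where the thread-local `task_id` selects the slice of the `Parallel` dimension (the names and slice size are invented):

  ```c
  #define SLICE 1024

  /* Each task updates only its own slice of the parameters; disjoint
     slices mean no write conflicts between tasks. */
  void sgd_task(int task_id, float *w, const float *grad, float lr) {
    int lo = task_id * SLICE;
    for (int i = lo; i < lo + SLICE; ++i)
      w[i] -= lr * grad[i];
  }
  ```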
- Rebalancing of computations that don't use `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- Already provides significant speedups in the interpreter (6-7x for me), but that's a moot point.
- Giving up further work on this approach for now, because the bottleneck is memory access with `Gccjit`.
- Keeping the new representation capability around; maybe it will be a stepping stone to other things.
- Monolithic step update with "macrobatch" (multiple steps within one backend call).
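  Schematically (an invented sketch, not the actual generated code), a macrobatch folds several step updates into one generated routine, so the per-call overhead is paid once per macrobatch rather than once per step:

  ```c
  #define MACROBATCH 32

  /* One backend call executes MACROBATCH step updates back to back;
     grads holds one gradient vector of length n per step. */
  void macrobatch_step(float *w, const float *grads, int n, float lr) {
    for (int s = 0; s < MACROBATCH; ++s)   /* multiple steps ... */
      for (int i = 0; i < n; ++i)
        w[i] -= lr * grads[s * n + i];     /* ... within one call */
  }
  ```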
- Streamlined the source code, e.g. removed the `OCaml` backend.
- Better syntax for `%nn_dt` and `%nn_op` shape specification; allows identifiers.
- Improved virtual node and scalar constant inlining.
- Better debugging, e.g. an option to "trace" `Gccjit` execution by printing the comments.
- An inline constants optimization that computes scalar constant subexpressions at compile time and inlines the resulting values.
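  For example (illustrative only; a C compiler would often fold this too, but here it is the code generator that does it before emitting the code):

  ```c
  /* Before: the scalar constant subexpression appears in the emitted code. */
  void scale_before(float *y, const float *x, int n) {
    for (int i = 0; i < n; ++i)
      y[i] = x[i] * (2.0f * 3.14159265f);
  }

  /* After: the subexpression was computed at compile time and the
     resulting value was inlined. */
  void scale_after(float *y, const float *x, int n) {
    for (int i = 0; i < n; ++i)
      y[i] = x[i] * 6.2831853f;
  }
  ```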
- Improved debuggability.
- Fixed a last-minute breaking bug (it would be nice to have a pre-release or a pre-publish hook to run tests!).
- The virtual nodes optimization is more robust: it is correct even with aggressive inlining settings (e.g. thanks to an escaping-variables check).
- The first changes-tracking release. Earlier development history is still somewhat documented via closed issues.
- Supports single- and double-precision floats; more precisions in the future.
- Generates a monolithic step update routine executed by `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times, to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can, for example, log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
- A virtual nodes (tensors) optimization that inlines the computation of a cell in lieu of tensor accesses; it can sometimes reduce memory consumption by 1/3.
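  A sketch of the effect on generated code (invented example): with the optimization, the cell computation of the intermediate tensor is inlined where the tensor would have been read, so the intermediate array is never allocated:

  ```c
  /* Without the optimization: the intermediate tensor t is materialized. */
  void before(float *u, float *t, const float *a, const float *b,
              const float *c, int n) {
    for (int i = 0; i < n; ++i) t[i] = a[i] + b[i];
    for (int i = 0; i < n; ++i) u[i] = t[i] * c[i];
  }

  /* With the optimization: t is virtual -- its cell computation is inlined
     in lieu of the tensor access, and no memory for t exists. */
  void after(float *u, const float *a, const float *b,
             const float *c, int n) {
    for (int i = 0; i < n; ++i)
      u[i] = (a[i] + b[i]) * c[i];
  }
  ```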