## [0.2.0] -- 2023-06-03

### Added

- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
  - Four synchronization heuristics (the fourth being no synchronization at all; sketched after this list):
    - "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
    - "update on host": the tensor is copied host-to-device at the beginning; each write is an update that reads the old value from the host and updates it on the host, so every write is a synchronization point.
    - "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
    - "device-only": no copying to/from the host.
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions: Frozen. It is analogous to the Parallel axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
  - Currently, for tensors processed in parallel, we only support processing of a contiguous tensor slice (copied "to device" using memcpy).
- A new syntax %nn_rs (a "postprocess results" variant of %nn_dt) for computations that should happen at the end of a task execution / refresh step. It is meant to prepare the data to be copied back to the host.
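
A minimal sketch of how these per-tensor choices can be represented; the type and function below are hypothetical illustrations, not the library's actual API:

```ocaml
(* Hypothetical encoding of the synchronization heuristics above;
   these names are illustrative, not the library's actual API. *)
type sync_heuristic =
  | Parallel       (* each task copies its own slice host-to-device at the
                      start and device-to-host at the end; slices never
                      interfere across tasks *)
  | Update_on_host (* copy host-to-device at the start; every write reads
                      the old value from the host and updates it there *)
  | Replicated     (* copy host-to-device at the start; only task 0 copies
                      device-to-host at the end *)
  | Device_only    (* no copying to or from the host *)

(* Does this task copy the tensor device-to-host when it finishes? *)
let device_to_host_at_end ~task_id = function
  | Parallel -> true        (* each task writes back its own slice *)
  | Update_on_host -> false (* writes were already synchronized *)
  | Replicated -> task_id = 0
  | Device_only -> false
```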

### Changed

- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
  - The Rebalance constructor is kept around, but it no longer plays any role.
- Got rid of debug_virtual_nodes; it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a Parallel axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (see the sketch after this list).
  - Removed the "indexing axes from-right" functionality for now (it fails as not implemented).
- Dynamic indexing can now produce virtual nodes.
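
A rough sketch of the axis-skipping rule, with hypothetical types standing in for the real shape representation:

```ocaml
(* Hypothetical shape representation; not the library's actual types. *)
type axis_kind = Parallel | Frozen | Regular
type axis = { kind : axis_kind; dim : int }

(* Dynamic indexing preserves leading Parallel axes and indexes into the
   first non-parallel axis instead, dropping that axis from the slice. *)
let rec dynamic_index axes ~idx =
  match axes with
  | ({ kind = Parallel; _ } as a) :: rest ->
      a :: dynamic_index rest ~idx  (* keep the Parallel axis in the slice *)
  | _ :: rest -> rest               (* index into this axis and drop it *)
  | [] -> invalid_arg "dynamic_index: no axis left to index into"
```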

### Fixed

- Dynamic indexing fixes.

## [0.1.2] -- 2023-05-12

### Added

- A thread-local parameter task_id for automated iteration over a Parallel dimension.
  - This implements multicore SGD (see the sketch after this list).
  - Rebalancing of computations that don't use Parallel, and synchronization in the Gccjit backend, are left as future work.
  - Already provides significant speedups in the interpreter (6-7x for me), but that's a moot point.
  - Giving up on further work on this approach for now, because the bottleneck is memory access with Gccjit.
  - Keeping the new representation capability around; maybe it will be a stepping stone to other things.
- Monolithic step update with a "macrobatch" (multiple steps within one backend call).
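
To illustrate the idea (this is not the library's API), a data-parallel step where each task processes the batch slice selected by its task_id, sketched with OCaml 5 domains:

```ocaml
(* Hypothetical multicore-SGD step: each task handles the slice of the
   batch selected by its task_id, as with a Parallel dimension. *)
let parallel_step ~num_tasks ~grad_of_slice ~apply_update =
  let grads =
    Array.init num_tasks (fun task_id ->
        Domain.spawn (fun () -> grad_of_slice ~task_id))
    |> Array.map Domain.join
  in
  (* Combine the per-task gradients, then apply a single update. *)
  apply_update (Array.fold_left ( +. ) 0.0 grads)
```

In the library itself the per-task work would be the generated step-update code rather than an OCaml closure.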

### Changed

- Streamlined the source code, e.g. removed the OCaml backend.
- Better syntax for %nn_dt and %nn_op shape specification; it now allows identifiers.
- Improved virtual node and scalar constant inlining.
- Better debugging, e.g. an option to "trace" Gccjit execution by printing the comments.

## [0.1.1] -- 2023-05-06

### Added

- An inline constants optimization that computes scalar constant subexpressions at compile time and inlines the resulting values (see the sketch below).
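
The idea, sketched on a toy expression type (hypothetical; not the library's actual IR):

```ocaml
(* Toy expression IR: constant subexpressions are evaluated at compile
   time and replaced by literals. *)
type expr =
  | Const of float
  | Get of string  (* a tensor-cell access: never a constant *)
  | Add of expr * expr
  | Mul of expr * expr

let rec fold = function
  | (Const _ | Get _) as e -> e
  | Add (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x +. y)  (* inline the computed value *)
      | a', b' -> Add (a', b'))
  | Mul (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x *. y)
      | a', b' -> Mul (a', b'))
```

For example, `fold (Mul (Add (Const 2., Const 3.), Get "x"))` evaluates to `Mul (Const 5., Get "x")`.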

### Changed

- Improved debuggability.

### Fixed

- A last-minute breaking bug (it would be nice to have a pre-release or pre-publish hook to run tests!).
- The virtual nodes optimization is more robust: it is now correct even with aggressive inlining settings (e.g. via an escaping-variables check).

## [0.1.0] -- 2023-05-04

### Added

- The first changes-tracking release. Earlier development history is still somewhat documented via closed issues.
- Supports single- and double-precision floats; more precisions to come in the future.
- Generates a monolithic step update routine executed by refresh_session (), but can also generate arbitrary additional routines at any point, to be executed at arbitrary other times within a session.
- An Interpreter backend that can, for example, log all individual tensor modifications.
- A Gccjit backend that can sometimes be 400x faster than the Interpreter backend (comparing runs without any debug work/output).
- A virtual nodes (tensors) optimization that inlines the computation of a cell in lieu of tensor accesses; it can sometimes reduce memory consumption by 1/3 (see the sketch below).
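
Roughly, instead of materializing an intermediate tensor and reading its cells back, the cell's defining computation is substituted at each use site. A hypothetical before/after sketch:

```ocaml
(* Without the optimization: an intermediate tensor v is materialized
   in memory and then read back cell by cell. *)
let materialized a b c =
  let v = Array.init (Array.length a) (fun i -> a.(i) +. b.(i)) in
  Array.init (Array.length c) (fun i -> v.(i) *. c.(i))

(* With v virtual: its cell computation is inlined at the use site,
   so no memory is ever allocated for v. *)
let virtualized a b c =
  Array.init (Array.length c) (fun i -> (a.(i) +. b.(i)) *. c.(i))
```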