Skip to content

Latest commit

 

History

History

examples

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

IBM Analog Hardware Acceleration Kit: Examples

We have many different examples to explore the many features of the IBM Analog Hardware Acceleration Kit:

In this example a single fully connected analog layer is used to predict the output tensor y, based on the input tensor x. The rpu_config parameter of the AnalogLinear layer can be used to define different settings for the network. In this case it specifies the device ConstantStepDevice, but different Resistive Processing Units exist:

# Define a single-layer network, using a constant step device type.
rpu_config = SingleRPUConfig(device=ConstantStepDevice())
model = AnalogLinear(4, 2, bias=True,
                     rpu_config=rpu_config)

In the first few examples both training and inference are done on an analog chip, while few of the later examples will use the Hardware Aware (HWA) training. HWA training uses digital chip for the training part and analog one for the inference one. The network is trained for 100 epochs using the analog Stochastic Gradient Descent and the loss is printed for every epoch:

Loss error: 0.5686419010162354
...
Loss error: 0.0137629760429263

Process finished with exit code 0

The second example uses a larger fully connected network to predict the output tensor y based on the input tensor x. In this network the multiple fully connected analog layer are wrapped by a sequential container. Similarly to the first example, the network uses a ConstantStepDevice and it is trained for 100 epochs with the loss printed at every epoch.

This MNIST training example is based on the paper: Gokmen T and Vlasov Y (2016) Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations. Front. Neurosci.. It builds a deep neural network of fully connected layers with 784, 256, 128 and 10 neurons, sigmoids and softmax activation functions.

Sequential(
  (0): AnalogLinear(in_features=784, out_features=256, bias=True, is_cuda=False)
  (1): Sigmoid()
  (2): AnalogLinear(in_features=256, out_features=128, bias=True, is_cuda=False)
  (3): Sigmoid()
  (4): AnalogLinear(in_features=128, out_features=10, bias=True, is_cuda=False)
  (5): LogSoftmax()
)

The device used in the analog tile is a ConstantStepDevice. The network is trained with a standard MNIST dataset with a Stochastic Gradient Descent optimizer. The network is trained for 30 epochs with batch size of 64 and a fixed learning rate of 0.05 and the loss is printed at every epoch.

This CNN MNIST training example is based on the paper: Gokmen T, Onen M and Haensch W (2017) Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices. Front. Neurosci.. It is a deep neural network of convolutional layers with an architecture similar to LeNet5. It has 2 convolutional layers with 5x5 kernels, tanh and softmax activation functions and uses max-pooling as subsampling layer. The full architecture is printed in terminal during execution:

LeNet5(
  (feature_extractor): Sequential(
    (0): AnalogConv2d(1, 16, kernel_size=(5, 5), stride=(1, 1))
    (1): Tanh()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): AnalogConv2d(16, 32, kernel_size=(5, 5), stride=(1, 1))
    (4): Tanh()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Tanh()
  )
  (classifier): Sequential(
    (0): AnalogLinear(in_features=512, out_features=128, bias=True, is_cuda=True)
    (1): Tanh()
    (2): AnalogLinear(in_features=128, out_features=10, bias=True, is_cuda=True)
  )
)

This example can be used to evaluate the accuracy difference between training with an analog network (specified by ConstantStepDevice) or a digital network (specified by FloatingPointDevice), by selecting the proper SingleRPUConfig definition:

# Select the device model to use in the training.
# * If `SingleRPUConfig(device=ConstantStepDevice())` then analog tiles with
#   constant step devices will be used,
# * If `FloatingPointRPUConfig(device=FloatingPointDevice())` then standard
#   floating point devices will be used
RPU_CONFIG = SingleRPUConfig(device=ConstantStepDevice())
# RPU_CONFIG = FloatingPointRPUConfig(device=FloatingPointDevice())

The CNN is trained for 30 epochs with a batch size of 8, a Stochastic Gradient Descent optimizer and a learning rate of 0.01. When training on GPU (V100) it takes ~20s/epoch to train the network; when working with CPU, the time required for training with a ConstantSetpDevice is substantially longer (on a standard laptop ~5min/epoch) than FloatingPointDevice is used. The timestamp, losses, test error and accuracy are printed for every epoch:

10:20:13 --- Started LeNet5 Example
10:20:32 --- Epoch: 0	Train loss: 1.9803	Valid loss: 1.6184	Test error: 10.66%	Accuracy: 89.34%
...
10:29:27 --- Epoch: 29	Train loss: 1.4724	Valid loss: 1.4768	Test error: 1.32%	Accuracy: 98.68%
10:29:27 --- Completed LeNet5 Example

At the end of the training/evaluation phases of the network, the losses, accuracy and test_error are also saved to file in a results/LENET5/ folder (from the path where the example was executed) together with a plot that shows the training evolution over the multiple epochs.

Test Losses

Test Error

Templating from the example 1, this example use the same input/output tensors and same architecture, but uses the hardware aware training functionality. In hardware aware training the network training is performed on a digital network, with many noise sources typical of Phase Change Memory (PCM) inserted in this phase to make the network more resilient to these. The inference part is instead performed on the analog network to take full advantage of the speed of the analog computation. Many features and parameters can be defined in the rpu_config to specify the noise sources characteristics:

# Define a single-layer network, using inference/hardware-aware training tile
rpu_config = InferenceRPUConfig()
rpu_config.forward.out_res = -1.  # Turn off (output) ADC discretization.
rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
rpu_config.forward.w_noise = 0.02  # Short-term w-noise.

rpu_config.clip.type = WeightClipType.FIXED_VALUE
rpu_config.clip.fixed_value = 1.0
rpu_config.modifier.pdrop = 0.03  # Drop connect.
rpu_config.modifier.type = WeightModifierType.ADD_NORMAL  # Fwd/bwd weight noise.
rpu_config.modifier.std_dev = 0.1
rpu_config.modifier.rel_to_actual_wmax = True

# Inference noise model.
rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)

Additionally this example explore the use of drift_compensation methods to improve the accuracy of the analog network during the inference phase in presence of weight drift. The drift_compensation is also specified in the rpu_config option:

# drift compensation
rpu_config.drift_compensation = GlobalDriftCompensation()

The weight drift is exercised through the drift_analog_weights function that is applied to the analog model, and it can be looped over to explore inference accuracy at different times since training:

for t_inference in [0., 1., 20., 1000., 1e5]:
    model.drift_analog_weights(t_inference)
    pred_drift = model(x)
    print('Prediction after drift (t={}, correction={:1.3f}):\t {}'.format(
        t_inference, model.analog_tile.alpha.cpu().numpy(),
        pred_drift.detach().cpu().numpy().flatten()))

More details are discussed in the Inference and PCM statistical model documentation.

This example templates from the CNN of example 4 and add noise and non-idealities typical of the PCM.

# Define the properties of the neural network in terms of noise simulated during
# the inference/training pass
RPU_CONFIG = InferenceRPUConfig()
RPU_CONFIG.forward.out_res = -1.  # Turn off (output) ADC discretization.
RPU_CONFIG.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
RPU_CONFIG.forward.w_noise = 0.02
RPU_CONFIG.noise_model = PCMLikeNoiseModel(g_max=25.0)

As the previous example it then explores the effect of the PCM weight drift on the inference accuracy:

    # Simulation of inference pass at different times after training.
    for t_inference in [0., 1., 20., 1000., 1e5]:
        model.drift_analog_weights(t_inference)

This example templates from example 1. However, rather than having a single device at every cross-point of the RPU array it defines 3 different devices, each one with its specific parameters, and the total weights is represented by the sum of the weights of these 3 devices:

# 3 arbitrary single unit cell devices (of the same type) per cross-point.
rpu_config.device = VectorUnitCell(
    unit_cell_devices=[
        ConstantStepDevice(w_max=0.3),
        ConstantStepDevice(w_max_dtod=0.4),
        ConstantStepDevice(up_down_dtod=0.1),
    ])

The example also define the specific update policy that is used to learn the weights of the network during the training phase. In this example the update is ony done on a single device which is randomly selected among the 3:

# Only one of the devices should receive a single update.
# That is selected randomly, the effective weights is the sum of all
# weights.
rpu_config.device.update_policy = VectorUnitCellUpdatePolicy.SINGLE_RANDOM

More information can be find in the Unit Cell Device documentation. Similarly to example 1 the network is trained over 100 epochs with an analog Stochastic Gradient Descent optimizer and the loss is printed for every epoch.

This example aims at exploring a training algorithm specifically developed for RPU and resistive device and discussed in Gokmen T and Haensch W (2020) Algorithm for Training Neural Networks on Resistive Device Arrays. Front. Neurosci.. The Tiki-Taka algorithm relaxes the symmetricity requirements on the switching characteristic of resistive devices used in RPU cross-point array. In this example 2 Compound Device are defined at each crosspoint and the Tiki-Taka algorithm parameters are specified as part of the UnitCellRPUConfig:

# The Tiki-taka learning rule can be implemented using the transfer device.
rpu_config = UnitCellRPUConfig(
    device=TransferCompound(

        # Devices that compose the Tiki-taka compound.
        unit_cell_devices=[
            SoftBoundsDevice(w_min=-0.3, w_max=0.3),
            SoftBoundsDevice(w_min=-0.6, w_max=0.6)
        ],

        # Make some adjustments of the way Tiki-Taka is performed.
        units_in_mbatch=True,    # batch_size=1 anyway
        transfer_every=2,        # every 2 batches do a transfer-read
        n_cols_per_transfer=1,   # one forward read for each transfer
        gamma=0.0,               # all SGD weight in second device
        scale_transfer_lr=True,  # in relative terms to SGD LR
        transfer_lr=1.0,         # same transfer LR as for SGD
    )
)

Similarly to example 1 the network is trained over 100 epochs with an analog Stochastic Gradient Descent optimizer and the loss is printed for every epoch.

This example shows how to select different pulsing schemes for the analog update. By default, we use stochastic pulses for activation and errors to update the devices with a step in up or down direction only if pulses coincide at a crosspoint. Note that we indeed generate stochastic pulse trains for each input line and thus do no short-cut in the simulation.

Here, in this example, we change the update to be performed using deterministic pulse trains that yield always the same number of coincident pulses at a cross point, if the same inputs are given. Implicitly, one can design the stored pulse train structure to quantize the input values in a number of bins, which can be changed by the *_res_imilicit parameters.

rpu_config = SingleRPUConfig(device=ConstantStepDevice())
rpu_config.update.pulse_type = PulseType.DETERMINISTIC_IMPLICIT
rpu_config.update.desired_bl = 10 # max number in this case
rpu_config.update.update_bl_management = True # will vary up to 10 on demand
rpu_config.update.d_res_implicit = 0.1 # effective resolution of x bit lines
rpu_config.update.x_res_implicit = 0.1 # effective resolution of d bit lines

Similarly to example 1 the network is trained over 100 epochs with an analog Stochastic Gradient Descent optimizer and the loss is printed for every epoch.

Example 10: 10_plot_presets.py

This example prints 5 of the different preset devices and preset configs available. Using these presets makes easy to test the different memory elements that can be used to train a neural network. The presets (except for the "Idealized analog device") are calibrated on the measured characteristics of real hardware devices that have been fabricated at IBM. Device non-ideal characteristics, noise, and variability are accurately simulated in all presets.

This example train a VGG8 network on the Street View House Number (SVHN) dataset. The device used in the rpu_config is the GokmenVlasovPreset config. In addition, it uses weight scaling in the analog layers in order to improve the accuracy.

There are many built-in preset devices and corresponding preset configs which makes convenient to explore the different memory element that can be used to train a neural network. The presets (except for the "Idealized analog device") are calibrated on the measured characteristics of real hardware devices that have been fabricated at IBM. Device non-ideal characteristics, noise, and variability are accurately simulated in all presets.

This example uses a single fully connected analog layer to predict the output tensor y based on the input tensor x (using the same values and network structure as the example 01.). However, this example uses a custom RPU config for using a digital rank update and transfer to analog device (as in mixed precision), set to a mixed precision compound which in turn uses a ConstantStep analog device.

In this example the usage of the Experiments module is shown. A BasicTraining experiment using three fully connected analog layers (based on the example 03) is created, and executed locally using the LocalRunner.

The Experiments module provides a convenient interface for running different types of experiments by encapsulating the experiments in single classes.

This example illustrates how to customize the Experiments for adding new features. By subclassing the base BasicTraining experiment used in example 13, the training algorithm is modified in order to support the use of a scheduler.

In particular, the new class specializes the training() function in order to create the scheduler when the training is started, and the training_step() function in order to step the scheduler every time an epoch is completed.

Example 15: 15_simple_lstm.py

This example performs hardware-aware training of an analog Long Short-Term Memory (LSTM) network on a simple sequence. In hardware aware training, a digital network is trained using noise sources typical of Phase Change Memory (PCM) in order to make the network more resilient to noise sources. Inference is then performed using the analog LSTM network to take full advantage of the speed and efficiency gains of the analog computation. Hardware features can be defined using rpu_config:

# Define a single-layer network, using inference/hardware-aware training tile
rpu_config = InferenceRPUConfig()
rpu_config.forward.out_res = -1.  # Turn off (output) ADC discretization.
rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
rpu_config.forward.w_noise = 0.02  # Short-term w-noise.

rpu_config.clip.type = WeightClipType.FIXED_VALUE
rpu_config.clip.fixed_value = 1.0
rpu_config.modifier.pdrop = 0.03  # Drop connect.
rpu_config.modifier.type = WeightModifierType.ADD_NORMAL  # Fwd/bwd weight noise.
rpu_config.modifier.std_dev = 0.1
rpu_config.modifier.rel_to_actual_wmax = True

# Inference noise model.
rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)

This example also demonstrates the efficacy of drift_compensation to improve the inference accuracy of the analog LSTM network over time in presence of weight drift. The drift_compensation method is also specified as an rpu_config option:

# drift compensation
rpu_config.drift_compensation = GlobalDriftCompensation()

Weight drift is applied to the weights within the analog LSTM model using the drift_analog_weights function. This function can be looped over to quantify the impact of weight drift on inference accuracy over time:

for t_inference in [0., 1., 20., 1000., 1e5]:
    model.drift_analog_weights(t_inference)
    states = reset_states()
    pred_drift, states = model(y_in, states)

More details on Phase Change Memory (PCM) noise sources and weight drift can be found in the Inference and PCM statistical model documentation.

Example 16: 16_mnist_gan.py

This example depicts how to train a Generative Adversarial Network (GAN) using the aihwkit package and its analog capabilities.

To do so, we implemented a very basic GAN architecture in which an analog generator learns to draw characters, trying to mimic those appearing in the MNIST dataset. In contrast, a non-analog discriminator tries to classify the generated images as fake or original (to give feedback to the generator). The full architecture of the generator and the discriminator are printed in terminal during execution:

Generator(
  (gen): AnalogSequential(
    (0): AnalogSequential(
      (0): AnalogLinear(in_features=64, out_features=128, bias=True)
      (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (1): AnalogSequential(
      (0): AnalogLinear(in_features=128, out_features=256, bias=True)
      (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (2): AnalogSequential(
      (0): AnalogLinear(in_features=256, out_features=512, bias=True)
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (3): AnalogSequential(
      (0): AnalogLinear(in_features=512, out_features=1024, bias=True)
      (1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (4): AnalogLinear(in_features=1024, out_features=784, bias=True)
    (5): Sigmoid()
  )
)
Discriminator(
  (disc): Sequential(
    (0): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (1): Sequential(
      (0): Linear(in_features=512, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (2): Sequential(
      (0): Linear(in_features=256, out_features=128, bias=True)
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (3): Linear(in_features=128, out_features=1, bias=True)
  )
)

The results of this example are the fake images generated during the training and a GIF animation that gives an overview of the generator's capabilities evolution. An example of the resulting GIF animation:

replay_fake_images_gan

This example downloads a pre-defined network architecture (ResNet34) from the pytorch library and converts it to its analog counterpart using the function:

   # Convert the model to its analog version
    model = convert_to_analog(model, RPU_CONFIG, weight_scaling_omega=0.6)

This function passes all the additional parameters that define the analog tile to the model so that these can be used during the training/inference phase. The model from the PyTorch library is suitable for training/inference on the ImageNet dataset.

In this example we define a ResNet34 inspired model that is trained using the CIFAR10 dataset. The model is defined using the various digital layer and then the convert_to_analog() is called to convert the model to its analog version. The CIFAR10 dataset is then used to train the model and the weights are then saved so that these can be used for additional training or inference.

In this example, we define a Lenet model and apply the analog_summary function. The analog_summary(model, input_size) function provides a per-layer analog information and statistics. Given the model and the input size, it prints a table containing the following per-layer information:

  • Layer Name
  • Layer type (Analog or Digital)
  • Input Shape
  • Output Shape
  • Kernel Shape
  • Number of analog tiles used by the layer
  • The re-use factor (the number of computed operations using the same tile)
  • Per-tile information :
    • Logical tile shape (out, in)
    • Physical tile shape (out, in)
    • Utilization (%)

The function also prints some general information, such as the total number of tiles or analog layers.

Example 20: 20_mnist_ddp.py

In this example we define a linear model that is trained using the MNIST dataset. The model is an analog sequential model that is defined using both the analog linear and analog linear mapped layers. The model also uses PyTorch's Distributed Data Parallel (DDP) for training, which efficiently uses multiple GPUs for parallel training in order to significantly decrease the training time.

In this example shows how one could quickly fit real device data for analog training using the PiecewiseStepDevice.

This example explores the study done in Gokmen T and Haensch W (2020) Algorithm for Training Neural Networks on Resistive Device Arrays. Front. Neurosci. on a 2-layer LSTM network trained with the War and Peace dataset (which needs to be downloaded separately).

In the example, there are 4 different RPU that can be used: a floating point tile to run the workload; an analog tile with ideal devices that use SGD for training; an analog tile with ideal and symmetric device that use SGD for training; an analog tile which uses ideal device with tiki-taka v2 algorithm that achieves better performance than SGD.

Here it is illustrated how the analog tiles can be directly used to implement an analog mat-vec (without a pytorch layer).

This example is adapted from https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb

The example loads a pre-trained BERT model trained on the SQuAD dataset. It then applies convert_to_analog() to examine the effects of drift_analog_weights() on inference performance at different weight noise levels. Tensorboard is used to display the SQuAD metrics evaluated using the model at various times after training completed.

Commandline arguments can be used to control certain options. For example: python /path/to/aihwkit/examples/24_bert_on_squad.py -n 0.1 -r "run 1" -l 0.0005 -t to set the weight noise to 0.1, name the run in Tensorboard "run 1", set the learning rate to 0.0005, and do hardware-aware training

Hardware-aware training example using the TorchInferenceRPUConfig which has currently has more limited functionality than using the InferenceRPUConfig, however, it lets torch autograd define the backward pass instead of explicitely defining it. Note that for hardware-aware training, the backward and update step is not needed to be custom-defined as for analog training. Using autograd has the advantage of ease of implementation of new hardware-aware training methods. This is show-cased by using a sample-by-sample noise injection method that is only available for the autograd-based tile.

Show-casing and plotting different specialized analog training algorithms in an simple example of constant gradients.

The example plots similar figures as discussed in Rasch MJ, Carta F, Fagbohungbe O and Gokmen T (2023) Fast offset-corrected in-memeory training. ArXiv preprint.

Make sure to first run the simple HW-aware training of LeNet using python examples/25_torch_tile_lenet5_hardware_aware.py. This saves the pretrained model that is used in this example. This script calibrates the input ranges of the model post-training using three techniques:

  • CACHE_QUANTILE simply caches the tile inputs and calculates a hard-coded quantile as the input range.
  • MOVING_QUANTILE approximates the quantile on-the-fly using a moving average. This saves memory since we don't need to cache everything.
  • MOVING_STD also uses a moving average but uses the X-th standard deviaiton as the input range.

For each method, the performance (under quantization) is printed.

Example of how to use the time-dependent IR drop for inference. The example shows what RPU Configuration has to be used and plots a number of comparisons of the advanced IR drop in comparison to simplified IR drop implementation.

Example of how to use the linaer algebra module. It uses an flexible GMRES algorithm with analog preconditioner. Note that the example needs additional python packages which can be installed with pip install -r requirements-examples.txt.

Example of how to import and perform inference using a model which has been trained in a hardware-aware fashion using an external library (i.e., not the AIHWKIT).

Example of how to modify analog device time-dependent drift characteristics.

Example of how to implement different unit cells for weight programming. These include the default SinglePairConductanceConverter along with a DualPairConductanceConverter, which uses two conductance pairs per unit cell. An NPairConductanceConverter is also available for studying the effects of additional conductance pairs along with a CustomPairConductanceConverter, which enables highly customizable weight programming strategies where allowing the user to define how to split weights across multiple conductances as a function of the weight being implemented.

Example of how to improve inference accuracy--including as a function of time–-via more optimized weight programming strategies based on specific analog device characterstics such as programming errors, conductance drift, and read noise. This example is based on the paper: C. Mackin, et al., "Optimised weight programming for analogue memory-based deep neural networks," Nature Communications, 2022.