various simplifications #3

Open · wants to merge 2 commits into main

Conversation

@iacore commented Dec 6, 2023

If I understand this correctly, this is like a fuzzy key-value database.

Is this a form of unsupervised behavioral learning?

- Remove DANGER_START (doesn't seem to do anything)
- Replace loops with numpy operators
@Blimpyway (Owner) left a comment

Hi,
Thanks for the feedback. I would like to keep DANGER_START exactly because, despite varying it from 0 to absurdly large values (e.g. 1e6, which means all danger steps are virtually equal), the model still converges.

This challenges the Q-learning idea of a gradual discount factor propagating indefinitely back in time.

I speculate this is very "animalistic" - if anything slightly smells of danger, the default behavior is to stay a few feet away from that cliff's edge, just to be sure. A simpler on/off mapping of dangers.

Another interesting difference here is the danger is cumulative across episodes instead of being averaged (or approximated) as in Q-tables. So if a (state,action) leads to failure several times, its negative reward score accumulates.
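
A minimal sketch of that distinction (illustrative names only, not this repo's actual code):

```python
# Hypothetical sketch -- not this repo's implementation.
# Cumulative danger: every failure *adds* to the (state, action) score.
danger = {}  # (state, action) -> accumulated negative score

def punish(state, action, amount=1.0):
    danger[(state, action)] = danger.get((state, action), 0.0) - amount

# Q-table style update, for contrast: new samples are blended in with a
# learning rate, so repeated failures move the estimate toward a fixed
# value instead of growing without bound. (The discounted next-state
# term is omitted for brevity.)
q_table = {}
ALPHA = 0.1

def q_update(state, action, reward):
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (reward - old)
```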


Getting rid of the non-njit version of the encoder is great, thanks.

@iacore (Author) commented Dec 19, 2023

> ...DANGER_START...
> I speculate this is very "animalistic" - if anything slightly smells of danger, the default behavior is to stay a few feet away from that cliff's edge, just to be sure. A simpler on/off mapping of dangers.

For me the model learns just as fast. In this example, any danger value other than zero behaves the same.

Here, danger simulates animalistic instinct. Instinct will not trigger "stay far away" behavior.

However, I do not like projecting (animal) individual behavior onto such simple models. Nature has much more energy-efficient ways of dealing with danger.

@Blimpyway (Owner) commented

> Is this a form of unsupervised behavioral learning?

Nope, it is a reinforcement learning example, which means the environment's response is the "supervisor".
Ogma's algorithm is good at time-series learning, but it performs comparatively poorly on CartPole; I can only guess why. One reason might be that CartPole is so simple that the last time step's state is sufficient to make an action choice, so considering previous timesteps only messes things up.
Another reason might be that Ogma's CSDR encoding doesn't overlap close (similar) values. They use a one-bit-out-of-N (N ~10 to 30) encoding for each state parameter, while "normal" SDR encodings have an overlap between proximal/similar values, and that matters a lot.
They're very fond of their own C(olumnar)SDR concept; I think this holds them back. SPHs would work better with normal SDRs.

@iacore (Author) commented Dec 19, 2023

In a CSDR, every column has one active state (one of 0..N), while in an SDR each "column" is a single bit that is either 1 or 0, and only a sparse fraction of the bits may be 1.

In SPH, every layer tries to predict multiple outputs of the layer below.

I don't quite understand HTM.

What's the benefit of SDR in SPH?

@Blimpyway (Owner) commented

Ok, let's compare the two. First, an SDR - as a data representation - is a simple one-dimensional vector of 0 or 1 bits. There are no "columns" in there. You can compare it with an embedding vector in mainstream ML, where the elements are scalars, most commonly floats. There is no implicit structure to how the SDR is organized or what each bit means.
A CSDR is a 3D structure of bits, where columns are arranged in a 2D "horizontal" plane and all columns share a given height. What the CSDR enforces is that each column must have a single bit turned on.
This structure seems more expressive, but it is also more restrictive.
Usually the lowest-level encoder, e.g. for CartPole, will dedicate one column per state value, so you have 4 columns with a sufficiently large height to achieve a meaningful resolution. For example, if the height is 20 and the pole's tilt varies between -20 and +20 degrees, then each "level" covers a 2-degree-wide interval: the first bit encodes -20 to -18, and so on, with the last one meaning 18 to 20 degrees.
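
A rough sketch of that per-column binning, using the numbers above (the function name and numpy usage are illustrative, not Ogma's code):

```python
import numpy as np

def encode_column(value, lo=-20.0, hi=20.0, height=20):
    """One active cell per column: bin `value` into `height` levels."""
    level = int((value - lo) / (hi - lo) * height)
    level = min(max(level, 0), height - 1)  # clamp to a valid level
    column = np.zeros(height, dtype=np.uint8)
    column[level] = 1
    return column

encode_column(-19.0)  # first cell ON: the -20..-18 degree bin
encode_column(19.0)   # last cell ON: the 18..20 degree bin
```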

In HTM theory, SDRs encode similarity through overlap. That's why each encoded value gets assigned more than one bit - 3, 4 or more - so that the learning part can "sense" that -10 degrees is "somewhat close" to the neighboring -14 degrees and "even closer" to -12 degrees.

The same way one can use cosine or Euclidean distance as a similarity metric between two scalar vectors (aka embeddings) in mainstream NNs, similarity between SDRs is represented as degree of overlap - the closer two SDRs are, the more common (overlapping) bits they share.
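
A sketch of the overlap idea with a classic scalar SDR encoder (all parameters here are illustrative):

```python
import numpy as np

def encode_overlapping(value, lo=-20.0, hi=20.0, size=40, active_bits=4):
    """Scalar SDR encoder: a run of `active_bits` consecutive ON bits,
    so nearby values share bits and distant values share none."""
    start = int((value - lo) / (hi - lo) * (size - active_bits))
    start = min(max(start, 0), size - active_bits)
    sdr = np.zeros(size, dtype=np.uint8)
    sdr[start:start + active_bits] = 1
    return sdr

def overlap(a, b):
    """Similarity = number of shared ON bits."""
    return int(np.sum(a & b))

overlap(encode_overlapping(-12.0), encode_overlapping(-11.0))  # close values: several shared bits
overlap(encode_overlapping(-12.0), encode_overlapping(10.0))   # distant values: no shared bits
```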

The constraint of having a single ON bit per column in CSDRs pretty much spoils this important property.

@iacore (Author) commented Dec 19, 2023

So the fundamental difference is that each SDR "bit" can only be 0 or 1, while a CSDR "column" can have column_size possible states.

If CSDR column_size is set to 2, I don't see a difference in mathematical representation.

@Blimpyway (Owner) commented Dec 19, 2023

Plus, in SDRs there is no restriction on how many encoding bits are allocated to each value; there are also encodings that can span indefinitely (arbitrarily large intervals) despite having a fixed-size SDR, and there is more to it.
You can compensate for the "lack of width" of a single CSDR column by having several columns encode the same value at different resolutions: one maps 2-degree steps, another 3-degree, another 5-degree steps. But since all columns have the same size, you end up with unused bits - a 5-degree resolution needs only 8 levels, yet the column is 20 bits high to match the finest resolution.

So things get more complicated, while with SDRs one can use (and test) all kinds of encodings.
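
A sketch of that multi-resolution workaround (my own illustration, not AOgmaNeo code): coarser columns use fewer of their cells, which is where the unused bits come from.

```python
import numpy as np

def encode_multi_res(value, lo=-20.0, hi=20.0, height=20, steps=(2.0, 3.0, 5.0)):
    """Several same-height columns encoding one value at different resolutions."""
    columns = []
    for step in steps:
        levels = int(np.ceil((hi - lo) / step))        # e.g. only 8 levels for 5-degree steps
        level = min(int((value - lo) / step), levels - 1)
        col = np.zeros(height, dtype=np.uint8)         # but still `height` cells tall
        col[level] = 1
        columns.append(col)
    return np.stack(columns)                           # shape: (len(steps), height)
```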

E.g. cp_vmap_ovenc.py uses all bits to overlap all values! And doing so it is so much more sample efficient - there are sessions in which the model solves the environment with only two failed balancing episodes (20 or fewer time steps). I don't know what gymnastics I would have to do to get similar performance with a CSDR representation.

And that's another thing I speculate is "animalistic" about it - the ability to figure out from very few samples what the "right" move is to keep the pole up.

@Blimpyway (Owner) commented Dec 19, 2023

> If CSDR column_size is set to 2, I don't see a difference in mathematical representation.

Mathematically, sure, but why would I waste computing resources? A "0" for SDR-based machines usually means "just ignore it". Being sparse, the algorithm cares only about the 1% or 5% or 25% of bits that are 1 (ON), while the CSDR either can't do that, or if it does, it means you can simply ignore all columns whose first bit is 1. What's the benefit?
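
To illustrate what "just ignore the zeros" buys (a toy sketch, not this repo's code):

```python
import numpy as np

def score(sdr, weights):
    """Work is proportional to the number of ON bits, not the SDR length."""
    active = np.flatnonzero(sdr)   # e.g. only ~2-5% of the bits
    return float(weights[active].sum())
```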

@Blimpyway (Owner) commented

Regarding DANGER_START, apparently the model converges just as fast with a flat value of 1 or -1 for all danger steps:

```python
my.vmap.add(sdr, left_or_right)
# instead of
# my.vmap.add(sdr, left_or_right * (DANGER_START + danger))
```

However, I recall running a few hundred sessions with varying hyperparameter values in order to get a clearer picture of whether or how these parameters influence the convergence time.

Also, I don't know whether this indifference extrapolates to other RL problems; I did not find the time to test them.
So I would rather remove both, but that would suggest that grading the Q-value isn't necessary in general, which is quite radical.
I'm inclined to keep them as a reminder that this is an open question to investigate.

@222464 commented Dec 20, 2023

Hi, just saw this randomly - @Blimpyway, the benefit of a CSDR over an SDR is two-fold: local receptive fields are a trivial lookup (an SDR would need a KD-tree to do it efficiently, or use locks when multi-threading), and it combines better with reinforcement learning (discrete actions are one-hot). It works better on computers, at a small capacity reduction (which can be offset via an extra column or two).

The RF lookup issue is similar to that of spatial partitioning in video games - our approach is similar to a grid lookup, while unstructured SDRs (variable numbers of bits enter an RF) need some sort of spatial partitioning, or need to propagate "bottom up", which has negative implications for parallelism (needs atomics). Since we know there is exactly 1 item per "spatial cell", the grid lookup is most efficient.
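
A toy illustration of the grid-lookup point (not AOgmaNeo code): storing a CSDR as one active-cell index per column makes reading a local receptive field a plain array slice.

```python
import numpy as np

# 32x32 columns, each column's single active cell stored as an index 0..15.
csdr = np.random.randint(0, 16, size=(32, 32), dtype=np.uint8)

def receptive_field(csdr, cx, cy, radius=2):
    """All active cells in the RF, read with a simple slice -- no search
    structure, no atomics, because every column holds exactly one value."""
    return csdr[max(cx - radius, 0):cx + radius + 1,
                max(cy - radius, 0):cy + radius + 1]
```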

Other than that, it has the same properties as an SDR. For local similarity, one or a few columns change while the rest remain static. CSDRs can also approximate arbitrary inputs to high precision, by just adding more columns (the columns jointly represent the input).

The only downside of a CSDR is slightly reduced representational capacity for an equal number of cells, at a benefit to computational complexity. Since the capacity is exponential w.r.t. the number of columns, adding an extra column or two makes up for the representational capacity loss.

@Blimpyway (Owner) commented

Hi Eric, thanks for chiming in.

Maybe I'm just more comfortable with "normal" SDRs, which, thanks to their less constrained structure, I find fun to stretch in "unnatural" ways :D

I haven't used K-D trees enough to judge their suitability for handling local receptive fields. If I ever get into handling visual data, I would be tempted towards a more dynamic, "foveic" motion, like an RL game in which reward is granted when (the model behind) the fovea can predict what it is expected to "see" when queried with an arbitrary move across the scene.

What I do find valuable in your work is the idea of stacking blocks with an increasing span of temporal awareness; IMO that architecture shouldn't be tied to a specific data representation.

An example of it applied not even to SDRs but to plain scalar vectors (or embeddings) could be more compelling to the ML community.

After all, a CSDR column is just a scalar - an integer with a limited size/resolution. However, scalar vectors admit good distance metrics in which 2 is "closer" to 5 than to 20 within the same column; a CSDR under a pure "bit-overlap" metric would mark them as equally far apart.
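
A two-line illustration of that point:

```python
import numpy as np

def one_hot(i, height=32):
    col = np.zeros(height, dtype=np.uint8)
    col[i] = 1
    return col

# Pure bit-overlap treats all distinct column states as equally dissimilar:
int(np.sum(one_hot(2) & one_hot(5)))   # 0 shared bits
int(np.sum(one_hot(2) & one_hot(20)))  # 0 shared bits, even though |2-5| < |2-20|
```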

Unfortunately SDRs as a concept are fringe to almost everyone despite recent interest in sparsifying large NNs.

Speaking from my limited experience with them.

@222464 commented Dec 21, 2023

I have also looked into foveating, but more as a way to implement a sort of data augmentation for visual systems. The local RFs in SPH are more about making a spatial hierarchy for complexity reasons, kind of like a convolutional neural network but without shared weights.

You are correct about the exponential memory idea (which is just local RFs through time): it isn't specific to SPH, and it could be used with Deep Learning methods - or, more precisely, with DL-based encoders and decoders, since backpropagating end-to-end through an EM hierarchy would require exponential time and memory (SPH doesn't do end-to-end backprop, since it doesn't backprop at all) and would be pointless.

> After all, a CSDR column is just a scalar - an integer with a limited size/resolution. However, scalar vectors admit good distance metrics in which 2 is "closer" to 5 than to 20 within the same column; a CSDR under a pure "bit-overlap" metric would mark them as equally far apart.

That is true for regular SDRs as well - but actually, we have some branches we have been experimenting with where (unlike both classic SDRs and CSDRs) topology is preserved using self-organizing map type approaches. In those branches, each column's active cell literally "moves" when the input changes just a little, instead of "teleporting" around. While interesting, we haven't pursued this too much since these types of topology-preserving systems are hard to train online.

> Unfortunately SDRs as a concept are fringe to almost everyone despite recent interest in sparsifying large NNs.

Yes, it's unfortunate. I think when most people in DL talk about sparsification, they mean the weights/parameters, not the representations, so it doesn't actually ever exhibit hard branching/conditional computation. The reason they don't care about it is that it breaks backpropagation (it's non-differentiable) and passthrough optimizers don't do a good job.

@iacore (Author) commented Dec 21, 2023

Hello @222464,

While re-implementing SPH, I noticed that your implementation (AOgmaNeo) can be changed in the following ways:

- Each column is coded as an i32 (C int). This could be a u8 (column size <= 256).
- Because each column is encoded as an integer, with some added complexity the column size of a layer could grow and shrink on demand without losing much data. When shrinking, it should work like shrinking an image with nearest neighbour (see the sketch after this list).
- With much added complexity, each layer could have columns of unequal height (mentioned by @Blimpyway as an advantage of SDRs).
- Some operations (like clearing weights) can trivially be sped up with SIMD.
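
A hedged sketch of the nearest-neighbour resize idea from the second bullet (names and numpy usage are mine, not AOgmaNeo's API):

```python
import numpy as np

def resize_columns(active, old_size, new_size):
    """Rescale each column's active-cell index from height `old_size` to
    height `new_size`, like nearest-neighbour image scaling."""
    active = np.asarray(active, dtype=np.int64)
    scaled = np.rint(active * (new_size - 1) / (old_size - 1)).astype(np.uint8)
    return scaled  # u8 is enough as long as new_size <= 256

resize_columns([0, 7, 15], old_size=16, new_size=8)  # -> array([0, 3, 7], dtype=uint8)
```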
