various simplifications #3
base: main
Conversation
- Remove DANGER_START (doesn't seem to do anything)
- Replace loops with numpy operators
Hi,
Thanks for the feedback. I would like to keep DANGER_START exactly because, despite varying it from 0 to absurdly large values (e.g. 1e6, which makes all danger steps virtually equal), the model still converges.
This challenges the Q-learning idea of a gradual discount factor propagating indefinitely back in time.
I speculate this is very "animalistic": if anything smells even slightly of danger, the default behavior is to stay a few feet away from the cliff's edge, just to be sure. A simpler on/off mapping of danger.
Another interesting difference here is that danger is cumulative across episodes, instead of being averaged (or approximated) as in Q-tables. So if a (state, action) pair leads to failure several times, its negative reward score accumulates.
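To illustrate the cumulative-danger idea, here is a minimal sketch (hypothetical names, not the repository's actual code): every failure adds a flat penalty to the (state, action) pairs that preceded it, with no averaging or decay:

```python
from collections import defaultdict

# Cumulative danger table: every failure adds to the score;
# unlike a Q-table, it is never averaged or discounted.
danger = defaultdict(float)

def record_failure(trajectory, penalty=1.0):
    """Add a flat penalty to every (state, action) pair
    that preceded a failure (hypothetical helper)."""
    for state, action in trajectory:
        danger[(state, action)] += penalty  # accumulates across episodes

def safest_action(state, actions):
    """Pick the action with the least accumulated danger."""
    return min(actions, key=lambda a: danger[(state, a)])
```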
Getting rid of the non-njit version of the encoder is great, thanks.
The model learns just as fast for me. In this example, any nonzero danger value behaves the same. Here, however, I do not like projecting (animal) individual behavior onto such simple models; nature has much more energy-efficient ways of dealing with danger.
> Is this a form of unsupervised behavioral learning?

Nope, it is a reinforcement learning example, which means the environment's response is the "supervisor".
In CSDR, every column has one active state (one of 0..N), while in an SDR each "column" is either 1 or 0 but must be sparsely 1. In SPH, every layer tries to predict multiple outputs of the layer below. I don't quite understand HTM. What's the benefit of SDR in SPH?
Ok, let's compare the two. First, an SDR - as a data representation - is a simple one-dimensional vector of 0 or 1 bits. There are no "columns" in there. You can compare it with an embedding vector in mainstream ML, where the entries are scalars, most commonly floats. There is no implicit structure dictating how the SDR is organized or what each bit means. In HTM theory, SDRs encode similarity through overlap. That's why each encoded value gets assigned more than one bit - 3, 4 or more - so that the learning part can "sense" that -10 degrees is "somewhat close" to a neighboring -14 degrees and "even closer" to -12 degrees. Just as one can use cosine or euclidean distance between two scalar vectors (aka embeddings) as a similarity metric in mainstream NNs, similarity between SDRs is represented as degree of overlap: the closer two SDRs are, the more common (overlapping) bits they share. The constraint of having a single ON bit per column in CSDRs pretty much spoils this important property.
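To make the overlap property concrete, here is a minimal sketch (my own illustration, not code from this repository) of a scalar encoder that assigns a run of adjacent bits to each value, so nearby values share ON bits and overlap acts as the similarity metric:

```python
import numpy as np

def encode_scalar(value, v_min=-20.0, v_max=20.0, sdr_size=64, active_bits=8):
    """Encode a scalar as an SDR: a run of `active_bits` adjacent ON bits.
    Nearby values share ON bits, so overlap encodes similarity."""
    sdr = np.zeros(sdr_size, dtype=np.uint8)
    start = int((value - v_min) / (v_max - v_min) * (sdr_size - active_bits))
    start = max(0, min(sdr_size - active_bits, start))
    sdr[start:start + active_bits] = 1
    return sdr

def overlap(a, b):
    """Number of shared ON bits between two SDRs."""
    return int(np.sum(a & b))

a = encode_scalar(-14.0)
b = encode_scalar(-12.0)
c = encode_scalar(-10.0)
print(overlap(a, b), overlap(a, c))  # -14 overlaps -12 more than -10
```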
So the fundamental difference is that each SDR "bit" can only be 0 or 1, while a CSDR "column" can have column_size possible states.
Plus, in SDRs there is no restriction on how many encoding bits are allocated to each value; there are also encodings that can span arbitrarily large intervals despite having a fixed-size SDR, and there is more to it. So things get more complicated there, while with SDRs one can use (and test) all kinds of encodings. E.g. cp_vmap_ovenc.py uses all bits to overlap all values! And doing so is so much more sample efficient: there are runs in which the model solves the environment after only two failed balancing sessions (20 or fewer time steps needed). I don't know what gymnastics I would have to do to get similar performance in a CSDR representation. And that's another thing I speculate is "animalistic" about it - the ability to figure out from very few samples what the "right" move is to keep the pole up.
> If CSDR column_size is set to 2, I don't see a difference in mathematical representation.

Mathematically, sure, but why would I waste computing resources? A "0" for SDR-based machines usually means "just ignore it". Being sparse, the algorithm cares only about the 1% or 5% or 25% of bits that are ON (1), while the CSDR either can't do that, or if it can, it amounts to simply ignoring all columns whose first bit is 1. What's the benefit?
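A toy sketch of the computational point (sizes and sparsity levels are assumptions for illustration): a sparse algorithm visits only the few ON indices, while a naive read of an equivalent column_size=2 CSDR touches every column, even though both represent the same information:

```python
import numpy as np

n = 1024
sdr = np.zeros(n, dtype=np.uint8)
sdr[np.random.choice(n, size=20, replace=False)] = 1  # ~2% of bits are ON

w = np.random.rand(n)

# SDR view: only the ~20 ON bits participate in the computation.
on_bits = np.flatnonzero(sdr)
s_sparse = w[on_bits].sum()

# Equivalent CSDR with column_size=2: each column holds state 0 or 1,
# and a naive reader visits all n columns to find the active states.
csdr = sdr.copy()
s_dense = sum(w[i] for i in range(n) if csdr[i] == 1)

assert np.isclose(s_sparse, s_dense)  # same result, very different work
```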
Regarding DANGER_START: apparently the model converges just as fast with a flat value of 1 or -1 for all danger steps.
However, I recall running a few hundred sessions with varying hyperparameter values in order to get a clearer picture of whether and how these parameters influence convergence time. Also, I don't know whether this indifference extrapolates to other RL problems; I did not find the time to test them.
Hi, just saw this randomly - @Blimpyway the benefit of a CSDR over an SDR is two-fold: local receptive fields are a trivial lookup (an SDR would need a k-d tree to do it efficiently, or use locks when multi-threading), and it combines better with reinforcement learning (discrete actions are one-hot). It works better on computers, at a small capacity reduction (which can be offset via an extra column or two).

The RF lookup issue is similar to that of spatial partitioning in video games - our approach is similar to a grid lookup, while unstructured SDRs (where a variable number of bits enters an RF) need some sort of spatial partitioning, or need to propagate "bottom up", which has negative implications for parallelism (it needs atomics). Since we know there is exactly one item per "spatial cell", the grid lookup is most efficient.

Other than that, it has the same properties as an SDR. For local similarity, one or a few columns change while the rest remain static. CSDRs can also approximate arbitrary inputs to high precision by just adding more columns (the columns jointly represent the input). The only downside of a CSDR is slightly reduced representational capacity for an equal number of cells, in exchange for lower computational complexity. Since the capacity is exponential w.r.t. the number of columns, adding an extra column or two makes up for the representational capacity loss.
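A toy sketch of the lookup difference as I understand it (names and shapes are illustrative, not AOgmaNeo's actual API): with exactly one active cell per grid column, reading a receptive field is a plain array slice, whereas an unstructured SDR first has to search for which of its active bits fall inside the field:

```python
import numpy as np

H, W, column_size = 16, 16, 8

# CSDR: one active cell index per grid column -> RF read is direct indexing.
csdr = np.random.randint(0, column_size, size=(H, W))

def csdr_rf(csdr, y, x, radius=2):
    """Exactly one item per 'spatial cell', so this is a plain grid slice."""
    return csdr[max(0, y - radius):y + radius + 1,
                max(0, x - radius):x + radius + 1]

# Unstructured SDR: a flat list of active bit positions; finding the ones
# inside an RF requires a search (or a spatial index such as a k-d tree).
active_positions = np.random.randint(0, H * W, size=40)

def sdr_rf(active_positions, y, x, radius=2):
    """Filter active bits down to those inside the receptive field."""
    ys, xs = np.divmod(active_positions, W)
    mask = (np.abs(ys - y) <= radius) & (np.abs(xs - x) <= radius)
    return active_positions[mask]
```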
Hi Eric, thanks for chiming in. Maybe I'm just more comfortable with "normal" SDRs, which, due to their less constrained structure, I have had fun stretching in "unnatural" ways :D I haven't used k-d trees enough to be knowledgeable about their suitability for handling local receptive fields. If I ever get into handling visual data, I would be tempted towards a more dynamic, "foveic" motion, like an RL game in which reward is granted when (the model behind) the fovea can predict what it is expected to "see" when queried with an arbitrary move across the scene.

What I do find valuable in your work is the idea of stacking blocks with an increasing span of temporal awareness; IMO that architecture shouldn't be tied to a specific data representation. An example of it applied not even to SDRs but to linear scalar vectors (or embeddings) could be more compelling to the ML community. After all, a CSDR column is just a scalar - an integer with limited size/resolution. However, scalar vectors give good distance metrics, in which 2 is "closer" to 5 than to 20 within the same column; a CSDR with a pure "bit-overlap" metric would mark them as equally far apart. Unfortunately, SDRs as a concept are fringe to almost everyone, despite recent interest in sparsifying large NNs. Speaking from my limited experience with them.
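For instance (a toy comparison, not taken from either library): treating a column's state as a scalar gives a graded distance, while a pure bit-overlap metric sees any two distinct one-hot states as equally dissimilar:

```python
# Three states of the same column.
a, b, c = 2, 5, 20

# Scalar metric: 2 is closer to 5 than to 20.
print(abs(a - b), abs(a - c))        # 3 18

# Pure bit-overlap metric: distinct one-hot states share 0 bits,
# so 5 and 20 look equally far from 2.
overlap = lambda x, y: int(x == y)
print(overlap(a, b), overlap(a, c))  # 0 0
```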
I have also looked into foveating, but more as a way to implement a sort of data augmentation for visual systems. The local RFs in SPH are more about making a spatial hierarchy for complexity reasons, kind of like a convolutional neural network, but with weights not shared. You are correct about the exponential memory idea (which is just local RFs through time); it isn't specific to SPH, it could be used with Deep Learning methods. Or, more precisely, with DL-based encoders and decoders, since backpropagating end-to-end through an EM hierarchy would require exponential time and memory (SPH doesn't do end-to-end backprop, since it doesn't backprop at all) and would be pointless.
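As a toy illustration of the exponential memory idea (the halving update schedule below is an assumption for illustration, not AOgmaNeo's exact mechanism): if each layer ticks half as often as the one below it, a stack of layers covers a temporal span that grows exponentially with depth:

```python
# Toy "exponential memory" schedule (an assumption, not AOgmaNeo's
# exact update rule): layer l ticks every 2**l steps, so num_layers
# layers jointly span roughly 2**num_layers bottom-level steps.
num_layers = 4
for t in range(16):
    ticking = [l for l in range(num_layers) if t % (2 ** l) == 0]
    print(f"t={t:2d}  layers updating: {ticking}")
```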
That is true for regular SDRs as well - but actually, we have some branches we have been experimenting with where (unlike both classic SDRs and CSDRs) topology is preserved using self-organizing map type approaches. In those branches, each column's active cell literally "moves" when the input changes just a little, instead of "teleporting" around. While interesting, we haven't pursued this too much since these types of topology-preserving systems are hard to train online.
Yes, it's unfortunate. I think when most people in DL talk about sparsification, they mean the weights/parameters, not the representations. So it doesn't actually ever exhibit hard branching/conditional computation. The reason they don't care about it is that it breaks backpropagation (it's non-differentiable), and passthrough optimizers don't do a good job.
Hello @222464, while re-implementing SPH, I noticed that your implementation (AOgmaNeo) can be changed in the following ways:
If I understand this correctly, this is like a fuzzy key-value database.