fit the training data #4
Comments
Hi Chen, That's a great toy experiment! It's also quite different from the ones we ran, but here are my thoughts:
Hi Fabian, Thanks for all the great suggestions! I would like to report some empirical observations. I reduced the number of KNN graphs built from the point clouds to 1. In that case, I am able to fit the data exactly. However, if I increase the number of graphs even slightly, to 5, I am no longer able to fit the data. It also seems that the more graphs I have, the harder it is for the SE(3)-Transformer to fit the data (the cosine similarity decreases as I increase the number of graphs). I spent two hours trying different K, different graph sizes, varying the architecture (switching from GConvSE3 to GSE3Res), and learning rate decay; however, I didn't see significant improvement. According to this recent paper https://arxiv.org/abs/2010.02449 by Haggai Maron, SE(3)-Transformers are universal, so I would like to try my best to solve this toy problem before moving to real data. Let me know if you want to try it yourself; I can share the script.
Hi Chen, yes, it would be very interesting to have a look at your script. Feel free to send it over!
I am using the following version. You can modify n_pts and n_nbrs for your needs. Let me know if you need any clarification. Thank you!
Awesome, thank you! Hopefully, I will find the time to play around with it next week, but no promises.
Another suggestion for analysing what's happening: you could try to have a few fully connected layers on each of the per-point outputs. This will obviously break the equivariance, but it could help analyse where the overfitting breaks. You can also directly compare that to using the same amount of fully connected layers applied directly (and again in a per-point fashion) to the inputs.
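A minimal sketch of the per-point probe Fabian suggests, assuming per-point features arrive as an [n_points, d] tensor (the module name and shapes are my own, not from the repo):

```python
import torch.nn as nn

# Hypothetical per-point MLP probe: apply it (a) on top of the equivariant
# model's per-point outputs, and (b) directly to the raw per-point inputs,
# then compare how well each variant overfits the training data.
class PerPointMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, feats):      # feats: [n_points, d_in]
        return self.net(feats)     # applied independently to each point
```

If variant (a) overfits while the equivariant head alone does not, the bottleneck likely sits in the final equivariant layers; if neither overfits, the problem is earlier in the pipeline.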
No worries. I guess right now everyone is busy with ICLR rebuttals :-)
Hi Fabian, I have another question regarding the time and memory usage of the SE(3)-Transformer (and equivariant NNs in general). It seems that equivariance comes at the price of slower runtime and more memory. In the paper, you mention that it takes 2.5 days to train the network for QM9. If that is for a single regression task (or is it the total time for all 6 tasks?), then it's roughly 72 min per epoch. In contrast, I ran a simple example on QM9 here https://github.com/rusty1s/pytorch_geometric/blob/master/examples/qm9_nn_conv.py and it took 3 min per epoch. I was wondering, is roughly 20x more time the right scale here? Also, it seems that equivariant nets in general are memory-expensive. Could you point out the source of the slow speed and high memory usage? I am interested in improving it if it's not too hard.
Hi @Chen-Cai-OSU and @FabianFuchsML, Multiple things to check here: 1) that the output rotates properly when you rotate the input, and 2) that your loss accounts for the sign ambiguity of eigenvectors (both v and -v are valid answers).
Regarding memory and runtime of equivariant networks, (at least for e3nn) these bottlenecks are primarily due to the combinatorial nature of the geometric tensor product (contracting two representation indices with Clebsch-Gordan coefficients) and the fact that there are no readily available CUDA kernels for doing these operations. There are many ways around this. For example, one can create specialized geometric tensor product modules that do not do all possible products, but rather a subset. Hope that helps a bit!
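To make the combinatorial point concrete, here is a small counting sketch (my own illustration in plain Python, not e3nn code):

```python
# Count the Clebsch-Gordan "paths" in a full geometric tensor product.
# Every pair of input rotation orders (l1, l2) couples to every output
# order l_out in |l1 - l2| .. l1 + l2, and each path is a contraction
# over (2*l1+1)(2*l2+1)(2*l_out+1) components.
def count_tp_paths(l_max_in1, l_max_in2, l_max_out):
    paths = 0
    for l1 in range(l_max_in1 + 1):
        for l2 in range(l_max_in2 + 1):
            for l_out in range(abs(l1 - l2), l1 + l2 + 1):
                if l_out <= l_max_out:
                    paths += 1
    return paths

for L in range(1, 5):
    print(L, count_tp_paths(L, L, L))   # path count grows roughly as L**3
```

Restricting to a subset of these paths, e.g. the depthwise-style interactions Tess suggests below, cuts the cost proportionally.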
Hi @blondegeek! Thanks a lot for all the suggestions and explanations. Could you elaborate on the "combinatorial nature of the geometric tensor product"? Do you mean that calculating the tensor product of type-a and type-b (a, b = 0, 1, 2, ...) irreducible representations is very expensive? How do you choose the subset of products? Thank you!
Hi @Chen-Cai-OSU,
Yes, the tensor product is expensive. How you choose subsets will depend on the application and is determined by experiment :) For example, you can take an approach similar to depthwise convolutions, where you only interact certain subsets of features with each other. In general, hidden features up to max L=1 seem effective for several tasks; there are more geometric tasks where you need higher L.
Have you been able to successfully overfit to one example? That should help debug whether the task is set up correctly.
Another thing to be aware of: even if your loss function allows for both -v and v to be "correct", the network will ALWAYS output the linear combination of the two degenerate possibilities, which in this case is zero. (More detail on why this is here: https://arxiv.org/abs/2007.02005)
Best,
Tess
…On Sun, Nov 15, 2020 at 10:54 AM Chen-Cai-OSU wrote:
Hi @blondegeek!
Thanks a lot for all the suggestions and explanations.
For 1), I have checked that the output rotates properly when I rotate the input. For 2), I am using the (negative) absolute cosine similarity as both loss and metric, so the 'up-to-sign' problem should already be solved. However, I am not able to fit even three point clouds (200 points per cloud), which is a bit puzzling. I will wait for Fabian to take a look at the dataset.
Hi both! Some additional remarks about speed (I will put this somewhere in the readme):
Here are some ideas about speeding up the SE3-Transformer:
Best
This is super interesting! It sounds sort of similar to what I found out when I spent some time digging into what the bottlenecks are - but then also not quite. I wish I could remember more precisely what my findings were. In the beginning, the bottleneck was definitely purely the spherical harmonics. But after speeding them up by shifting the computations to the GPU (all the credit here goes to Daniel), the bottleneck was equally split between multiple parts - one of them being constructing the basis vectors from the spherical harmonics and the Clebsch-Gordan coefficients. It sounds like there is some potential if one wanted to get into CUDA programming.
Thanks @blondegeek for the reference and explanations.
Yes. For a single point cloud I can overfit. For two point clouds, I can also overfit. But starting from 3 point clouds, I cannot overfit anymore :-(
I don't understand why this is the case. In your paper, I understand that it's easy to convert a rectangle into a square, since the latter has more symmetry, but not the other way around. But how is this related to the prediction of eigenvectors? If my loss is set to encourage the output to be either v or -v, why would the network want to output 0? I guess this is a subtle point that I haven't yet grasped.
Hi Chen, I had a little time to look through your code today. It's a great toy example and I would love to try to debug / crack it, but I am not too optimistic that I will have the time to get to it. I looked at how you sample points on an ellipsoid in
Hi Fabian, I remember trying both 1) points -= np.mean(points, axis=0) and 2) not recentering. np.mean(points, axis=0) is very close to the center; it's just the sample mean. I tried both versions and didn't see significant differences in terms of fitting the training data.
Ok, so here's how I would do it with e3nn. Here's a simple network fitting on a single example. Note, I'm not fitting to eigenvectors but rather the rotated scale matrix, which is basically the matrix you are getting the eigenvectors for. Here's the same network fitting on multiple training examples. It needs more training, but it's starting to get the idea.
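A sketch of the target construction Tess describes, with hypothetical names (the actual notebooks may differ):

```python
import torch

# Regress the full symmetric matrix M = R S R^T instead of an eigenvector:
# S is the diagonal scaling of the ellipsoid, R the rotation applied to it.
# M has no sign ambiguity, and its eigenvectors recover the principal axes.
def rotated_scale_matrix(R, scales):
    S = torch.diag(scales)        # e.g. scales = torch.tensor([1., 1., 2.])
    return R @ S @ R.T            # symmetric 3x3 target
```

A symmetric 3x3 matrix decomposes into an L=0 (trace) part and an L=2 (traceless symmetric) part, which is presumably why the Rs_out = [(1, 0, 1), (1, 2, 1)] output mentioned later in the thread fits this target.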
Thanks @blondegeek for the nice notebooks! I am starting to try out TFN now. My quick question is about predicting the eigenvectors: is it a bad idea to use an equivariant NN to predict the eigenvectors in this case? Is it because both v and -v are right answers, so the NN will tend to output 0? I still don't understand why this is the case. Even if I set the loss (to maximize the absolute cosine similarity between the predicted and true eigenvector) to account for this "up-to-sign" problem, is this still doomed to fail?
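For reference, the "up-to-sign" loss Chen describes could look like this (a sketch; the actual script may differ):

```python
import torch.nn.functional as F

# Negative absolute cosine similarity: v and -v score identically,
# so minimizing this loss is indifferent to the eigenvector's sign.
def abs_cosine_loss(pred, target):      # both [batch, 3]
    cos = F.cosine_similarity(pred, target, dim=-1)
    return -cos.abs().mean()
```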
The key issue is that eigenvector solvers are not symmetry preserving; they "pick" an eigenvector, typically based on a random initialization or a similarly arbitrary procedure. This becomes especially problematic for symmetric structures. Let's consider two higher-symmetry cases. Say the scaling matrix is the identity, torch.eye(3). What are the principal eigenvectors? They are degenerate -- a sphere is radially symmetric, so any three orthogonal directions are equally valid and can be in any order. How about if the scaling matrix is something like torch.diag(torch.tensor([1., 1., 2.])), so that the ellipsoid is radially symmetric along one axis? You have a similar problem: there is no unique way to choose the eigenvectors in a rotation-equivariant manner. So the issue is more than "just a sign" -- the issue is that the question is symmetrically ill posed. Principal axes are not vectors, they are double-headed rays, and you need L=2 features to describe them. A 3x3 matrix output can handle both the spherically symmetric case (it will just predict a matrix that is a scalar times the identity) and any less symmetric case. Hope that helps!
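The degeneracy is easy to see numerically; a small sketch (my own, using torch.linalg.eigh):

```python
import math
import torch

S = torch.diag(torch.tensor([1., 1., 2.]))   # radially symmetric ellipsoid
evals, evecs = torch.linalg.eigh(S)          # eigenvalue 1 has multiplicity 2

c, s = math.cos(0.7), math.sin(0.7)          # any rotation about the z-axis
Rz = torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])
evecs_rot = Rz @ evecs                       # rotate all the eigenvectors

# Still a valid eigendecomposition of S: the choice is genuinely non-unique,
# so there is no single "correct" answer for an equivariant model to learn.
print(torch.allclose(evecs_rot @ torch.diag(evals) @ evecs_rot.T, S, atol=1e-6))
```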
Thanks @blondegeek for the further clarification!
I understand the case where an eigenvalue has multiplicity: any vector in the eigenspace can be taken as an eigenvector. But what exactly is a double-headed ray? I saw this slide (at around 20 min) in your talk https://sites.google.com/view/equiv-data-aug/home I can find pseudovector on Wikipedia, but I didn't find any good references on double-headed rays and spirals. I am familiar with covariant/contravariant tensors but have never heard of double-headed rays or spirals. Do you mind pointing out some references?
@blondegeek Hi Tess, I am using e3nn now (it's nice that the change-of-basis matrix can be handled by to_irrep_transformation), but I had some issues verifying the equivariance of the 3x3 matrix output (Rs_out = [(1, 0, 1), (1, 2, 1)]). Would you mind taking a look at e3nn/e3nn#149? Many thanks!
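Independent of e3nn internals, the property being tested can be checked generically; a sketch with hypothetical model and input conventions (points as rows, output a single 3x3 matrix):

```python
import torch

# For a rotation R, an equivariant 3x3 matrix output must satisfy
# f(points @ R.T) == R @ f(points) @ R.T (up to numerical tolerance).
def check_matrix_equivariance(model, points, R, atol=1e-4):
    out = model(points)              # [3, 3]
    out_rot = model(points @ R.T)    # rotate every input point by R
    return torch.allclose(out_rot, R @ out @ R.T, atol=atol)
```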
Hi Fabian,
When I use the SE3-Transformer on my dataset, I find it quite difficult for the model to fit the training data. To understand why, I created a simple task in the following way.
I generate a few hundred point clouds sampled on the surface of an ellipsoid in 3D (centered at (0,0,0)) and construct a KNN (k=5) graph for each point cloud. The goal is to predict the first eigenvector of the covariance matrix of each point cloud, which is a type-1 feature. I am using the following model
However, when I train with the Adam optimizer (learning rate 1e-3) to minimize the MSE loss between the predicted and true eigenvectors, the training loss ends up around 0.05, and the cosine similarity between the predicted and true eigenvectors is roughly 0.65 (which means the angle is more than 45 degrees).
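For concreteness, the toy data described above might be generated along these lines (a sketch with my own names; this is not the script shared earlier in the thread, and this sampling is not exactly uniform over the ellipsoid surface):

```python
import numpy as np

def make_example(n_pts=200, scales=(1., 2., 3.), k=5, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_pts, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # points on the unit sphere
    pts = u * np.asarray(scales)                    # stretch onto an ellipsoid
    pts -= pts.mean(axis=0)                         # recenter at the sample mean
    # k-NN neighbor indices (k=5), excluding each point itself
    d = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]
    # target: first (largest-eigenvalue) eigenvector of the covariance;
    # np.linalg.eigh sorts eigenvalues in ascending order
    _, evecs = np.linalg.eigh(np.cov(pts.T))
    return pts, nbrs, evecs[:, -1]
```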
The model has 643,296 parameters, which is probably not large, but my dataset is also tiny (200 neighborhood graphs constructed from the point clouds). So I am a bit surprised that the model cannot even fit the data exactly. (I tried using more layers, but CUDA memory quickly runs out.)
Are there places I should pay special attention to when using an SE3-Transformer? Maybe the equivariance constraint placed on the kernel makes the model less flexible and therefore harder to fit to the training data? Should I try to increase the model size, or try different optimizers and learning rates? I can share the data if needed.
Thank you!