Skip to content
This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

[WIP] Adagrad solver for arbitrary orders. #8

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

vene
Copy link
Collaborator

@vene vene commented Sep 12, 2016

Implements the SG algorithm from Higher-order Factorization Machines.
Mathieu Blondel, Akinori Fujino, Naonori Ueda, Masakazu Ishihata.
In Proceedings of Neural Information Processing Systems (NIPS), December 2016.

Todos

  • Explicit fitting of the lower matrices, as in CD
  • Benchmarks
  • Consider avoiding the numpy import in cython
  • Unify callback API with CD before merging
  • Test coverage.

Returns
-------
Gram matrix : array of shape (n_samples_1, n_samples_2)
"""
if degree == 2:

if degree > 3 or method == 'dp':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about degree == 3?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cases degree in (2, 3) are dealt with using the old closed-form approach. My benchmarks show this is faster in batch settings like this.


cdef Py_ssize_t t, j, jj

for jj in range(nnz + 1):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython's numpy should support low level vectorized = operations, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean A[:, 0] = 1? I think that works only for numpy arrays, not sure if it's implemented for generic memoryviews. I'll give it a try.

@@ -0,0 +1,67 @@
cdef inline double _fast_anova_kernel(double[::1, :] A,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file needs the cython directives for efficiency, too. I added them but didn't get to commit yet, because I'm in the middle of some profiling: the adagrad solver is at the moment slower than I expected.

@EthanRosenthal
Copy link

Is there anything I can do to help complete this PR?

@vene
Copy link
Collaborator Author

vene commented Nov 11, 2016

Hi @EthanRosenthal

I'm a bit busy these days, but I plan to focus on this project and get a release out by the end of the year.

The current problem with this PR is that the adagrad solver seems to take quite more time per epoch than CD. If you manage to pinpoint the issue, it would be great.

--(Let me try to push the work-in-progress benchmark code that I have around.)-- edit: it was already up

@EthanRosenthal
Copy link

@vene Sounds good - I'm happy to take a look and see if I can improve the performance.

@vene
Copy link
Collaborator Author

vene commented Dec 21, 2016

Making P and P-like arrays (especially the gradient output) fortran-ordered makes this ~1.5x faster, but coordinate descent solvers are still faster per iteration, it seems. Also weird that degree 3 is faster than degree 2. I set a very low tolerance to prevent it from converging early, but I should check why.

I just realized it might be because the alpha and beta regularization parameters have different meanings for the two solvers unless accounted for.

(with P in C order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3-ada          129.4632s    0.3683s     0.0000     0.9576
fm-2-ada          199.0683s    0.1267s     0.0000     0.9576
fm-2               15.9484s    0.1455s     0.1437     0.7594
polynet-3          15.5964s    0.0836s     0.1624     0.8425
fm-3               21.2889s    0.5154s     0.3695     0.9429
polynet-2          12.9678s    0.0915s     0.4800     0.9620

(with P (and everything else) in F order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-2-ada          138.9727s    0.1163s     0.0000     0.9576
fm-3-ada           72.2692s    0.3488s     0.0000     0.9576
fm-2               16.2693s    0.1453s     0.1437     0.7594
polynet-3          15.6552s    0.0843s     0.1624     0.8425
fm-3               21.4048s    0.5239s     0.3695     0.9429
polynet-2          13.0174s    0.0924s     0.4800     0.9620

@mblondel
Copy link
Member

Indeed the regularization are not the same because of the 1/n factor in front of the loss when using stochastic gradient algorithms.

Regularization scaling is now ON by default. I think this is
sensible, because it keeps the choice independent of data split.

Adagrad seems very sensitive to the initial norm of P, so I changed
the init to have unit variance rather than 0.01.
Makes benchmark more reasonable but norms are still weird.
Finnicky tests (fm warm starts) had to be updated, but most
things behave well.
@vene
Copy link
Collaborator Author

vene commented Dec 21, 2016

Here's the performance after a bunch of tweaking and making the problem easier.

I'm printing out the norm of P to emphasize the big inconsistency of the solutions, even after correctly setting regularization terms. This is weird. When initializing P with standard deviation 0.01, adagrad sets it to zero very quickly. (especially with lower learning rates).

20 newsgroups
=============
X_train.shape = (1079, 130107)
X_train.dtype = float64
X_train density = 0.0013896454072061155
y_train (1079,)
Training class ratio: 0.4448563484708063
X_test (717, 130107)
X_test.dtype = float64
y_test (717,)

Classifier Training
===================
Training fm-2 ... done
||P|| = 20542.6551563
Training fm-2-ada ... done
||P|| = 49.2109170758
Training fm-3 ... done
||P|| = 71852.8549548
Training fm-3-ada ... done
||P|| = 37.0119191191
Training polynet-2 ... done
Training polynet-3 ... done
Classification performance:
===========================

Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3                6.5862s    0.0839s     0.4524     0.5105
fm-2                4.5888s    0.0178s     0.4645     0.5690
fm-3-ada           20.1919s    0.0684s     0.4993     0.4965
polynet-3           5.4179s    0.0099s     0.5114     0.5523
polynet-2           4.1679s    0.0099s     0.5449     0.5621
fm-2-ada           18.6872s    0.0126s     0.5698     0.5788

@vene
Copy link
Collaborator Author

vene commented Dec 21, 2016

Appveyor crash is not a test failure, for some reason I get Command exited with code -1073740940. No idea why right now...

@vene
Copy link
Collaborator Author

vene commented Dec 21, 2016

Now supports explicit fitting of lower orders. No performance degradation but the code is a bit unrolled and could be written clearer.

I'm a bit confused by the way adagrad reacts to the learning rate, especially in the example, and why it seems to shrink to 0 faster with lower learning rates. But the tests, at least, suggest things are sensible.

@vene
Copy link
Collaborator Author

vene commented Dec 22, 2016

On second thought, the windows crash is not a fluke.

The exit code -1073740940, in hex, is 0xC0000374, which apparently means heap corruption

It seems that my last commit fixed it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants