Great work! I have a question: Is the role of higher-order moments significant? In Table 2, can degree=1 be understood as an operation without using higher-order moments?
Yep, exactly. degree=1 is just the average of the tokens (after a linear+activation), degree=2 corresponds to projections of the covariance matrix, degree=3 to the third-order statistics, etc.
If you think of the sequence as samples from an unknown distribution, then the idea of the high-order moments is to capture information about that distribution. If you have enough information to characterize this distribution, then you no longer need the tokens. How many orders (the degree) you need depends on the distribution, but in practice 2 is already pretty good.
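To make the degree picture concrete, here is a rough NumPy sketch. It is only an illustration, not the code from this repository: `moment_features`, `tokens`, and `proj` are hypothetical names, the activation is left out so the identity is exact, and the second moment is uncentered (subtracting the mean token first would give the covariance version).

```python
import numpy as np

def moment_features(tokens, proj, degree):
    """Summarize a token sequence by its first `degree` projected moments.

    tokens: (n, d) array of n tokens.
    proj:   (d, k) array, a hypothetical linear projection.
    Returns a (degree, k) array; row p-1 is the mean of (tokens @ proj) ** p.
    """
    z = tokens @ proj  # (n, k) projected tokens
    return np.stack([np.mean(z ** p, axis=0) for p in range(1, degree + 1)])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 16))  # toy sequence: 128 tokens of dimension 16
proj = rng.normal(size=(16, 4))      # toy projection to 4 features

feats = moment_features(tokens, proj, degree=3)

# degree=1: the projected average token, no higher-order information.
mean_token = tokens.mean(axis=0)

# degree=2: each feature is w^T S w, a projection of the
# (uncentered) second-moment matrix S = X^T X / n.
second_moment = tokens.T @ tokens / len(tokens)
quad = np.einsum("dk,de,ek->k", proj, second_moment, proj)
```

With degree=1 only the first row is kept, so the summary is a plain average. Each extra degree adds one more row of statistics about the token distribution, and the output size does not depend on the sequence length.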