
mlpe vs generalized_wishart #4

Open
sjordi opened this issue Sep 13, 2022 · 1 comment
sjordi commented Sep 13, 2022

Dear Nathaniel,
I was wondering if and how I can compare the mlpe and generalized_wishart (gwis) outputs of my analyses.
I am trying both for each of the tested models.
mlpe always shows a much lower fit (loglik and AIC) than gwis.
Is it statistically sound to use the loglik and AIC to compare these?
Thanks a lot in advance.

Jordi

nspope (Owner) commented Sep 14, 2022

Hi Jordi, that's a good question that I don't have a clear answer to. The generalized Wishart uses a hyperparameter nu, representing the number of loci, which effectively concentrates or flattens the loglikelihood to reflect the quantity of data used in the calculation of pairwise genetic distances. This has to be provided by the user; it's not identifiable from the data alone. There's no such analogue in the MLPE model: any error in the estimates of genetic distance is ignored (error that might be minimal if distances are calculated from genome-wide SNPs, but will definitely be non-trivial for things like microsat datasets). So, I don't really think they're comparable via loglikelihood.
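To make the role of nu concrete, here is a toy numerical sketch (plain Python, not radish's actual R implementation; the likelihood function and data are invented for illustration): multiplying a per-datum loglikelihood by nu rescales both the loglikelihood and the AIC by an arbitrary user-supplied factor, so comparing those numbers against a model that has no such factor isn't meaningful.

```python
import numpy as np

def toy_wishart_style_loglik(resid, nu):
    # Toy stand-in (NOT radish's actual likelihood): a Gaussian-style
    # loglikelihood whose magnitude is multiplied by nu, mimicking how
    # the number of loci concentrates or flattens the surface.
    return nu * np.sum(-0.5 * resid**2)

resid = np.array([0.1, -0.2, 0.15])  # made-up residuals
for nu in (10, 1000):
    ll = toy_wishart_style_loglik(resid, nu)
    aic = 2 * 1 - 2 * ll  # pretend the toy model has one parameter
    print(f"nu={nu}: loglik={ll:.4f}, AIC={aic:.4f}")
```

The same residuals give a loglikelihood (and AIC) 100 times larger in magnitude when nu goes from 10 to 1000, even though the fit hasn't changed at all; that scale factor is exactly what MLPE's loglikelihood lacks.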

To do model selection, I'd use something like cross-validation: split your samples or populations into 70/30 training/test sets; fit both models to the training data; generate predicted genetic distances for the test data; then compare the predicted vs actual genetic distances to see which model performs better on out-of-sample data. I think cross-validation is the way to go here because MLPE or generalized Wishart or whatever will always be a crude, heuristic approximation to the actual stochastic process generating the data, which would include noise from random sample genealogies, mutations, and sequencing errors. So, IMO the best we can do is ask how well these different loss functions work for out-of-sample prediction.
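As a rough illustration of that workflow (plain Python with synthetic data; the model fit is a stand-in least-squares regression, not radish's MLPE or Wishart fitting, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one row per population pair, with a resistance-based
# predictor and an observed genetic distance (synthetic, for illustration).
n_pairs = 200
resistance = rng.uniform(0, 10, n_pairs)
genetic_dist = 0.5 * resistance + rng.normal(0, 1, n_pairs)

# 70/30 train/test split of the pairs
idx = rng.permutation(n_pairs)
train, test = idx[:140], idx[140:]

def fit_and_predict(x_train, y_train, x_test):
    # Stand-in for fitting one candidate model (MLPE, generalized
    # Wishart, ...): here just ordinary least squares on training pairs.
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return slope * x_test + intercept

pred = fit_and_predict(resistance[train], genetic_dist[train], resistance[test])
rmse = np.sqrt(np.mean((pred - genetic_dist[test]) ** 2))
print(f"out-of-sample RMSE: {rmse:.3f}")
```

You'd repeat this for each candidate model and prefer the one with the lower out-of-sample error. Note that because pairwise distances share populations, a more careful version would hold out whole populations, as suggested above, rather than individual pairs, to avoid leakage between training and test sets.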

As a side note, the generalized Wishart implementation in radish needs more testing before it can be fully trusted. When I wrote this package (a while back), the other implementations of the generalized Wishart that I could find were a bit different from the formula in the McCullagh paper introducing the distribution. I went with the form in the paper, which seemed to give reasonable answers on simulated data, but I need to revisit it.
