Consider alternatives to Qrank #1

Open
hugolpz opened this issue Mar 16, 2021 · 4 comments
@hugolpz
hugolpz commented Mar 16, 2021

Simple suggestion. Your numbers seem to show views rather than rank. Qviews or Qvisits would be more accurate.

@brawer
Owner

brawer commented Mar 17, 2021

Currently that’s true, but see Future work.

@athalhammer
Hi @brawer,

Nice work!
I'm running some experiments that compute PageRank on a link graph combining ~321 Wikimedia language editions via their Q-IDs. If you are interested in some discussion/collaboration on this topic, please let me know.

@brawer
Owner

brawer commented Aug 11, 2021

@athalhammer, that’s super interesting! Basically, PageRank is a mathematical model to predict hypothetical popularity from the shape of the link graph, whereas QRank is trivially counting beans to measure actual popularity from log analysis. Super curious, have you by chance compared the two approaches to ranking? It would be really interesting to know how well the mathematical model is able to predict reality in practice, and where exactly the differences are.

Also curious, have you tried seeding your PageRank computation with measured numbers from QRank, and then run the classic iterative algorithm until it converges? In the classic version of PageRank, all nodes start equal; but obviously you could also seed your initial weights by taking empiric measurements (like those from QRank) into account. In case you’ve tried that, what were your findings?

Personally, I could well imagine that the QRank pipeline eventually gets extended with some graph algorithm(s). Not really to replace the current log analysis; why use a prediction model if we can measure reality? However, the majority of entities in Wikidata will never get a corresponding page in Wikipedia (or in Wikisource, Wikitravel, Wikispecies, Wikibooks, etc.); and currently such page-less entities are not getting ranked. Finding a way to fix this problem would be quite interesting. For example, once the world’s citation graph gets stored in Wikidata, it might be interesting to enrich QRank with a simulated random walk along the citation graph. Likewise, it might be interesting to propagate some of an artist’s fame to their works, in particular for works that have no Wikipedia page.
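To make the last idea concrete, here is a minimal sketch of fame propagation. Everything in it is an assumption for illustration: the function name `propagate_fame`, the `share` parameter, the one-hop rule, and the toy numbers are hypothetical and not part of the QRank pipeline.

```python
def propagate_fame(views, creator_of, share=0.1):
    """Give page-less works a fraction of their creator's measured views.

    views      -- dict: entity -> measured page views (only entities with pages)
    creator_of -- dict: work -> creator entity
    share      -- assumed fraction of the creator's views passed on to a work
    """
    scores = dict(views)  # measured values are kept untouched
    for work, creator in creator_of.items():
        # Only fill in works that have no measurement of their own.
        if work not in scores and creator in views:
            scores[work] = share * views[creator]
    return scores


# Toy data, invented for this example.
views = {"artist": 1000, "famous_work": 300}
creator_of = {"famous_work": "artist", "obscure_work": "artist"}
scores = propagate_fame(views, creator_of)
```

In this sketch, measured views always win over propagated ones, so the log-based numbers stay the source of truth and the graph step only fills gaps.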

Anyhow, if you’re still interested, let’s talk! Feel free to contact me at [email protected].

@athalhammer
Thanks @brawer, for your detailed answer and also for your questions!

> Super curious, have you by chance compared the two approaches to ranking?

During my PhD I ran some experiments comparing page-view-based rankings with link-based (PageRank) ones. Not surprisingly, pages like 'Rihanna' and 'Eminem' show up more often in the top 100 of the page-view-based ranking. As a matter of fact, if page-view-based statistics are used for auto-suggest or recommendation, they often have a self-amplifying effect. The current auto-suggest interface of English Wikipedia actually does that (it is page-view-based, to the best of my knowledge). I'm not sure how they deal with this self-amplification, or whether they see it as an issue and address it at all.
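A comparison of this kind could be sketched as follows. This is only an illustration, not the setup used in those experiments: the helper names `spearman` and `top_k_overlap` and the scores are made up, and the Spearman formula used here assumes there are no tied scores.

```python
def ranks(scores):
    """Map each item to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation over the items both rankings share.

    Uses 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is exact only without ties.
    Needs at least two shared items.
    """
    common = set(scores_a) & set(scores_b)
    n = len(common)
    ra = ranks({k: scores_a[k] for k in common})
    rb = ranks({k: scores_b[k] for k in common})
    d2 = sum((ra[k] - rb[k]) ** 2 for k in common)
    return 1 - 6 * d2 / (n * (n * n - 1))


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k items that both rankings have in common."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Hypothetical scores for illustration only; not real QRank or PageRank data.
page_views = {"item_a": 9000, "item_b": 4000, "item_c": 700, "item_d": 50}
link_rank = {"item_a": 0.10, "item_b": 0.05, "item_c": 0.40, "item_d": 0.45}

rho = spearman(page_views, link_rank)
overlap = top_k_overlap(page_views, link_rank, k=2)
```

The correlation summarizes agreement over the whole list, while the top-k overlap focuses on the head of the ranking, which is where effects like the 'Rihanna' and 'Eminem' pages would show up.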

> Also curious, have you tried seeding your PageRank computation with measured numbers from QRank, and then run the classic iterative algorithm until it converges?

That's a good question. It would be one way of combining the two approaches. My initial thought is that the output will depend a lot on the number of PageRank iterations: with each iteration, it converges a bit more towards the equally seeded PageRank. There are other ways of combining the two that do not depend on the number of iterations (e.g., Personalized PageRank).
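A small sketch of that difference, in plain Python. The graph, the view counts, and the function names are invented for illustration; this is not danker's or QRank's code. Seeding only changes the starting vector, while Personalized PageRank changes the teleport distribution and therefore the fixed point itself.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def pagerank(links, teleport=None, init=None, damping=0.85, iterations=200):
    """Power iteration on a dict: node -> list of outgoing link targets.

    Every link target must also appear as a key of `links`.
    teleport -- distribution for random jumps (uniform if None)
    init     -- starting distribution (uniform if None)
    """
    nodes = list(links)
    uniform = {v: 1.0 / len(nodes) for v in nodes}
    teleport = teleport or uniform
    rank = dict(init or uniform)
    for _ in range(iterations):
        # Rank held by nodes without out-links is redistributed via teleport.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new = {v: (1 - damping + damping * dangling) * teleport[v] for v in nodes}
        for v in nodes:
            for w in links[v]:
                new[w] += damping * rank[v] / len(links[v])
        rank = new
    return rank


# Toy graph and hypothetical page-view counts.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
views = normalize({"A": 10, "B": 1000, "C": 50, "D": 5})

classic = pagerank(links)
seeded_early = pagerank(links, init=views, iterations=2)
seeded_converged = pagerank(links, init=views)
personalized = pagerank(links, teleport=views)

def max_diff(a, b):
    return max(abs(a[v] - b[v]) for v in a)
```

With these toy numbers, `seeded_early` still differs noticeably from `classic`, `seeded_converged` matches it up to numerical precision, and `personalized` stays different no matter how many iterations are run.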

> However, the majority of entities in Wikidata will never get a corresponding page in Wikipedia (or in Wikisource, Wikitravel, Wikispecies, Wikibooks, etc.); and currently such page-less entities are not getting ranked.

That's an issue danker is facing as well... I thought about ways of combining a purely Wikidata-based PageRank with the Wikipedia-based ones that danker provides. At this point the same question arises again: what would be a good way of combining these?

Overall, I believe that exploring ways of combining different rankings (e.g., seeding as you suggest) and measuring the differences as well as trying to find something like an "optimal" ranking could probably be a whole PhD thesis in itself (or multiple ones)... In any case, I will get in touch.
