identify potential schemas for a property graph for Latex content #23

Open
bhpayne opened this issue Dec 17, 2023 · 2 comments

Comments

@bhpayne
Member

bhpayne commented Dec 17, 2023

What nodes and edges would be useful for a property graph of LaTeX content? What properties would those nodes and edges have?

@msgoff
Contributor

msgoff commented Dec 17, 2023

Gephi's edge table uses these columns:
Source,Target,Type,Id,Label,Weight

A possible mapping for LaTeX content:
Source -> FileID
Target -> TokenID
Type -> token category (LaTeX environments, verbs, math mode)
Label -> Token:String
Weight -> TF-IDF

Edges might consist of transformations applied to a node.
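
As a rough illustration of the mapping above, here is a minimal Python sketch that writes an edge table with those six columns. The file IDs, token IDs, and weights are made-up placeholders, and `write_edge_table` is a hypothetical helper, not something from the existing code base. As far as I know, Gephi itself reads `Type` as Directed/Undirected on import, so the token category may end up needing a column of its own.

```python
import csv
import io

# Hypothetical records: (FileID, TokenID, category, token string, TF-IDF weight).
# All values are invented for illustration.
records = [
    ("file_a", "tok_frac", "Math mode", "\\frac", 0.42),
    ("file_a", "tok_equation", "Latex Environments", "equation", 0.17),
]

def write_edge_table(rows, out):
    """Write rows as a CSV edge table with the column names from the comment."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Type", "Id", "Label", "Weight"])
    for edge_id, (file_id, token_id, kind, label, weight) in enumerate(rows):
        # Id is just the row number here; any unique value would do.
        writer.writerow([file_id, token_id, kind, edge_id, label, weight])

buf = io.StringIO()
write_edge_table(records, buf)
print(buf.getvalue())
```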

@msgoff
Contributor

msgoff commented Dec 23, 2023

I think the structure implemented in the existing code base handles the three variants listed in the Wikipedia link:

Inverted index
Impact-ordered postings, using tf-idf scoring, which can easily be replaced by similar scores such as BM25
Positional postings list

https://en.wikipedia.org/wiki/Postings_list
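
To make the impact-ordered variant concrete, here is a small sketch (not the existing implementation) that builds a postings list per token and sorts each list by descending tf-idf. The function name, the input shape, and the particular tf-idf formula (relative term frequency times the natural log of N/df) are my assumptions; swapping in BM25 would only change the scoring line.

```python
import math
from collections import Counter

def impact_ordered_postings(docs):
    """docs: {file_id: [token, ...]} -> {token: [(file_id, score), ...]},
    with each postings list sorted by descending tf-idf score."""
    n_docs = len(docs)
    term_counts = {fid: Counter(tokens) for fid, tokens in docs.items()}

    # Document frequency: in how many files each token appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    postings = {}
    for fid, counts in term_counts.items():
        total = sum(counts.values())
        for token, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[token])
            postings.setdefault(token, []).append((fid, tf * idf))

    # Impact ordering: highest-scoring file first in every postings list.
    for plist in postings.values():
        plist.sort(key=lambda p: p[1], reverse=True)
    return postings

# Made-up token streams for three hypothetical files.
docs = {
    "b.tex": ["\\frac", "y", "z"],
    "a.tex": ["\\frac", "\\frac", "x"],
    "c.tex": ["x"],
}
postings = impact_ordered_postings(docs)
print([fid for fid, _ in postings["\\frac"]])
```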

Unique key
FileID:hash -> TokenIDs:list

Unique key
TokenID:hash -> FileIDs:list

Composite primary key (unique, not null)
(FileID,TokenID)

(FileID,TokenID) -> offsets:list

Sample queries
(*,*) -> list of lists of offsets for all FileIDs and all TokenIDs
(FileID,*) -> list of offset lists, one for each TokenID:offsets pair in that file
(*,TokenID) -> list of all FileIDs that contain the token

or any combination of subsets of FileIDs and TokenIDs
([fileid1,fileid2,...],[tokenid1,tokenid5,...]) -> list of lists of offsets
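
The schema and queries above could be sketched in Python roughly as follows. This is only an illustration under assumptions: `PositionalIndex` and its methods are hypothetical names, plain strings stand in for the hashed IDs, `None` plays the role of `*`, and results come back as a dict keyed by `(FileID, TokenID)` rather than a bare list of lists so the offsets stay attributable.

```python
from collections import defaultdict

class PositionalIndex:
    def __init__(self):
        self.offsets = {}                         # (file_id, token_id) -> [offset, ...]
        self.tokens_by_file = defaultdict(list)   # file_id -> [token_id, ...]
        self.files_by_token = defaultdict(list)   # token_id -> [file_id, ...]

    def add(self, file_id, token_id, offset):
        """Record one occurrence of token_id in file_id at the given offset."""
        key = (file_id, token_id)
        if key not in self.offsets:
            # First time this pair is seen: register it in both lookups.
            self.offsets[key] = []
            self.tokens_by_file[file_id].append(token_id)
            self.files_by_token[token_id].append(file_id)
        self.offsets[key].append(offset)

    def query(self, file_ids=None, token_ids=None):
        """Return {(file_id, token_id): offsets}; None means 'all' for that side."""
        files = list(self.tokens_by_file) if file_ids is None else file_ids
        wanted = None if token_ids is None else set(token_ids)
        result = {}
        for fid in files:
            for tid in self.tokens_by_file.get(fid, []):
                if wanted is None or tid in wanted:
                    result[(fid, tid)] = self.offsets[(fid, tid)]
        return result

idx = PositionalIndex()
idx.add("f1", "t1", 0)
idx.add("f1", "t2", 5)
idx.add("f1", "t1", 9)
idx.add("f2", "t1", 3)

print(idx.query())                      # (*,*)
print(idx.query(file_ids=["f1"]))       # (FileID,*)
print(idx.files_by_token["t1"])         # (*,TokenID) as a list of FileIDs
print(idx.query(["f1", "f2"], ["t2"]))  # subsets on both sides
```

Keeping the two one-sided lookups next to the pair-keyed offsets table mirrors the three mappings in the comment, at the cost of storing each ID twice.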
