Scalable blob building #148
Comments
Basic suggested approach: Establish a limit on the number of sub-tables (say 10k). Add a phase in which we determine which sub-tables we want to build without actually building them. We do this by first building all the 2-dim sub-tables, measuring their dependence, and then using these measures to estimate the quality of larger sub-tables. We then explore the most promising 3-dim sub-tables (i.e. start with the highest-dependence 2-dim sub-tables), then the most promising 4-dim sub-tables, etc. In the process, we record what
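The planning phase described above could be sketched roughly as follows. This is only an illustration, not the project's code: the function name `plan_subtables`, the `pair_dependence` mapping, and the quality proxy for larger sub-tables (the weakest pairwise dependence among a candidate's columns) are all assumptions. It also uses a single best-first queue rather than a strict 3-dim-then-4-dim sweep, so treat it as a variant of the idea.

```python
import heapq
from itertools import combinations


def plan_subtables(columns, pair_dependence, max_subtables=10_000, max_dims=4):
    """Pick which sub-tables to build, using only 2-dim dependence measures.

    columns: list of sortable column names.
    pair_dependence: dict mapping frozenset({a, b}) to a dependence score
        measured on the 2-dim sub-table (higher means more dependent).
    Returns a list of (column_tuple, score), most promising first.
    """

    def score(combo):
        # Hypothetical quality proxy: the weakest pairwise dependence in the combo.
        return min(pair_dependence[frozenset(p)] for p in combinations(combo, 2))

    # Max-heap via negated scores, seeded with every 2-dim sub-table.
    heap = [(-pair_dependence[frozenset(p)], p)
            for p in combinations(sorted(columns), 2)]
    heapq.heapify(heap)
    seen = {combo for _, combo in heap}

    planned = []
    while heap and len(planned) < max_subtables:
        neg_score, combo = heapq.heappop(heap)
        planned.append((combo, -neg_score))
        if len(combo) >= max_dims:
            continue
        # Extend the current sub-table by one column to get candidates one dim larger.
        for col in columns:
            if col in combo:
                continue
            bigger = tuple(sorted(combo + (col,)))
            if bigger not in seen:
                seen.add(bigger)
                heapq.heappush(heap, (-score(bigger), bigger))
    return planned
```

Because the proxy score of a larger sub-table can never exceed that of the sub-table it was extended from, candidates come off the queue in non-increasing score order, and the `max_subtables` budget cuts off the low-dependence tail.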
Currently we have code like this:
The code above runs inside a loop that goes through the permutation columns in order, measuring the sub-table quality as it goes, until the quality drops below a threshold.
Note: this is related to the entropy of the column, which, IIRC, is basically determined by the number of distinct values. OK, for now, just try to measure
I suspect the way to do this is to use |
Look into "mutual information" as a measure of dependency between columns. From ChatGPT:
Absolutely, you're right! Traditional correlation measures like Pearson's only capture linear relationships. Two variables can indeed have a strong non-linear relationship and show little to no correlation. This is where other measures come into play:
Mutual Information: Measures any kind of dependency between variables, not just linear. It captures both linear and non-linear relationships.
Distance Correlation: Captures both linear and non-linear associations, providing a more comprehensive view of dependency between variables.
Copula-based Measures: Capture dependency structures between variables, accounting for non-linearity and tail dependencies.
From other information, it looks like mutual information is likely the best choice.
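For reference, a minimal sketch of mutual information between two columns of discrete values, using only the standard library. The function name `mutual_information` is hypothetical, and continuous columns would need to be bucketed first (an assumption, not something settled in this thread).

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0

    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))

    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of raw counts.
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

With `xs = [-2, -1, 0, 1, 2]` and `ys = [x * x for x in xs]`, the Pearson correlation is zero but the mutual information is clearly positive, which is the non-linear case the quote above is about. For two independent columns the value is zero.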
Currently the blob system doesn't scale beyond 10 or so columns. This is because we currently build every possible sub-table.
What is needed instead are smarter sub-tables, where we avoid building sub-tables that have low quality because one or more columns are independent of the others.
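To put a number on the scaling problem, here is a small count of how many sub-tables there are, assuming "every possible sub-table" means every unordered set of two or more columns (an assumption; if column order matters, the count is larger still). The helper name `count_subtables` is hypothetical.

```python
from math import comb


def count_subtables(num_columns, max_dims=None):
    """Number of column subsets with at least 2 columns and at most max_dims."""
    top = num_columns if max_dims is None else min(max_dims, num_columns)
    return sum(comb(num_columns, k) for k in range(2, top + 1))
```

Under that assumption, 10 columns give 1,013 sub-tables and 20 columns give 1,048,555, so the count roughly doubles with each added column.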