Scalable blob building #148

Open
yoid2000 opened this issue Nov 8, 2024 · 4 comments

Comments

@yoid2000
Contributor

yoid2000 commented Nov 8, 2024

Currently the blob system doesn't scale beyond 10 or so columns, because we build every possible sub-table, and the number of possible sub-tables grows exponentially with the number of columns.

What is needed instead is smarter sub-table selection, where we avoid building sub-tables that have low quality because one or more of their columns are independent of the others.
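
To put numbers on the scaling problem, here's a quick back-of-the-envelope check (illustration only, not project code). Counting every column subset of size 2 or more gives 2^n - n - 1 sub-tables for n columns:

    from math import comb

    # Number of possible sub-tables (column subsets of size >= 2).
    for n in (5, 10, 20, 30):
        subtables = sum(comb(n, k) for k in range(2, n + 1))  # equals 2**n - n - 1
        print(f"{n} columns -> {subtables:,} possible sub-tables")

At 10 columns that's about a thousand sub-tables; at 30 it's over a billion.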

@yoid2000
Contributor Author

yoid2000 commented Nov 8, 2024

Basic suggested approach:

Establish a limit on the number of sub-tables (say 10k).

Add a phase in which we determine which sub-tables we want to build without actually building them. We do this by first building all the 2dim sub-tables, measuring their dependence, and then using these measures to estimate the quality of larger sub-tables.

We then go through a process where we explore the most promising 3dim sub-tables (i.e., starting from the highest-dependence 2dim sub-tables), then the most promising 4dim, and so on; a rough sketch follows. In the process, we record what max_weight and merge_thresh values would have led to these sub-tables under the current blob-building procedure.
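
A rough sketch of what this selection phase could look like (all names here are hypothetical, the quality measure is a stand-in, and the bookkeeping for max_weight/merge_thresh is omitted): seed with all 2dim sub-tables ranked by dependence, then best-first extend the most promising candidates one column at a time until the budget is reached.

    import heapq
    from itertools import combinations, count

    def select_subtables(columns, dep, budget=10_000):
        """dep maps each sorted column pair to a dependence score; returns chosen column sets."""
        tiebreak = count()  # keeps heap entries comparable when scores tie

        def quality(cols):
            # Stand-in quality measure: average pairwise dependence.
            pairs = list(combinations(sorted(cols), 2))
            return sum(dep[p] for p in pairs) / len(pairs)

        # Seed with all 2dim sub-tables, highest dependence first (max-heap via negation).
        heap = [(-dep[p], next(tiebreak), frozenset(p)) for p in combinations(sorted(columns), 2)]
        heapq.heapify(heap)

        chosen = set()
        while heap and len(chosen) < budget:
            _, _, cols = heapq.heappop(heap)
            if cols in chosen:
                continue
            chosen.add(cols)
            # Extend promising sub-tables by one column: 3dim from 2dim, 4dim from 3dim, etc.
            for col in columns:
                if col not in cols and (ext := cols | {col}) not in chosen:
                    heapq.heappush(heap, (-quality(ext), next(tiebreak), ext))
        return chosen

Because it is best-first rather than strictly level-by-level, this doesn't finish all 3dim sub-tables before trying any 4dim one, but it does explore higher dimensions only from the strongest lower-dimensional candidates.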

@yoid2000
Contributor Author

yoid2000 commented Nov 8, 2024

Currently we have code like this:

    average_quality = sum(dependency_matrix[col, c] for c in cluster.columns) / len(cluster.columns)

    # Skip if below threshold or above weight limit.
    if average_quality < merge_thresh or (
        len(cluster.columns) > DERIVED_COLS_MIN and cluster.total_entropy + col_weights[col] > capacity
    ):

The context of the above code is a loop that goes through the permutation's columns in order, measuring the sub-table quality as it goes, until the quality drops below merge_thresh.

Note that merge_thresh defaults to 0.1. As the number of columns in a sub-table increases, the effect of a single low-dependence pair on the average is diluted (see the illustration below). But this might not matter in practice (i.e., we may not need an additional parameter like min_dependence).
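
To make the dilution concrete, a tiny illustration with invented numbers: one fully independent pair (dependence 0.0) among otherwise weakly dependent pairs (0.15) drags a small sub-table below the default merge_thresh, but is averaged away as the sub-table grows:

    merge_thresh = 0.1
    for n_pairs in (2, 4, 8, 16):
        deps = [0.0] + [0.15] * (n_pairs - 1)  # one independent pair, the rest weak
        avg = sum(deps) / len(deps)
        print(f"{n_pairs} pairs: average_quality = {avg:.4f} -> "
              f"{'merge' if avg >= merge_thresh else 'skip'}")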

max_weight (default 15) limits the sum of column weights, which puts a bound on sub-table size. Column weights are computed like this:

col_weights = list(_col_weight(col) for col in context.entropy_1dim)

So the weight is related to the entropy of the column, which is basically determined by the number of distinct values, IIRC.
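
For reference, a minimal sketch of Shannon entropy for a single column (this is not the project's _col_weight, which may transform the entropy further):

    import math
    from collections import Counter

    def column_entropy(values):
        """Shannon entropy of a column's value distribution, in bits."""
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # More distinct (and more evenly spread) values -> higher entropy.
    print(column_entropy(["a"] * 8))         # 0.0
    print(column_entropy(["a", "b"] * 4))    # 1.0
    print(column_entropy(list("abcdefgh")))  # 3.0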

OK, for now, just try to measure average_quality and col_weights and see how they behave.

@yoid2000
Contributor Author

I suspect the way to do this is to use average_quality to find the best sub-tables in the builder, and then use a solver to find the best clustering in the reader.

@yoid2000
Contributor Author

yoid2000 commented Jan 19, 2025

Look into "mutual information" as a measure of dependency between columns.

From ChatGPT:

Absolutely, you're right! Traditional correlation measures like Pearson's only capture linear relationships. Two variables can indeed have a strong non-linear relationship and show little to no correlation. This is where other measures come into play:

Mutual Information: Measures any kind of dependency between variables, not just linear. It captures both linear and non-linear relationships.

Distance Correlation: Captures both linear and non-linear associations, providing a more comprehensive view of dependency between variables.

Copula-based Measures: Capture dependency structures between variables, accounting for non-linearity and tail dependencies.


From other information, it looks like mutual information is likely the best choice.
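
For reference, a minimal sketch of mutual information between two discrete columns, computed from the joint distribution (scikit-learn's sklearn.metrics.mutual_info_score computes the same quantity, using natural log rather than log2):

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        """Mutual information between two discrete columns, in bits."""
        n = len(xs)
        px, py = Counter(xs), Counter(ys)
        pxy = Counter(zip(xs, ys))
        return sum(
            (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
            for (x, y), c in pxy.items()
        )

    xs = [0, 0, 1, 1, 2, 2]
    print(mutual_information(xs, xs))                  # maximal: equals the column's entropy
    print(mutual_information(xs, [0, 1, 0, 1, 0, 1]))  # 0.0: the columns are independent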
