Scalable blob building #148

Open
yoid2000 opened this issue Nov 8, 2024 · 4 comments

Comments

@yoid2000
Contributor

yoid2000 commented Nov 8, 2024

Currently the blob system doesn't scale beyond 10 or so columns, because we build every possible sub-table, and the number of possible sub-tables grows exponentially with the number of columns.

What is needed instead is smarter sub-table selection, where we avoid building sub-tables that have low quality because one or more of their columns are independent of the others.
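
To put numbers on the scaling problem, here's a quick back-of-the-envelope check (illustration only, not project code). Counting every column subset of size 2 or more gives 2^n - n - 1 sub-tables for n columns:

    from math import comb

    # Number of possible sub-tables (column subsets of size >= 2).
    for n in (5, 10, 20, 30):
        subtables = sum(comb(n, k) for k in range(2, n + 1))  # equals 2**n - n - 1
        print(f"{n} columns -> {subtables:,} possible sub-tables")

At 10 columns that's about a thousand sub-tables; at 30 it's over a billion.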

@yoid2000
Contributor Author

yoid2000 commented Nov 8, 2024

Basic suggested approach:

Establish a limit on the number of sub-tables (say 10k).

Add a phase in which we determine which sub-tables we want to build without actually building them. We do this by first building all the 2dim sub-tables, measuring their dependence, and then using these measures to estimate the quality of larger sub-tables.

We then go through a process where we explore the most promising 3dim sub-tables (i.e., starting from the highest-dependence 2dim sub-tables), then the most promising 4dim, and so on; a rough sketch follows. In the process, we record what max_weight and merge_thresh values would have led to these sub-tables under the current blob-building procedure.
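
A rough sketch of what this selection phase could look like (all names here are hypothetical, the quality measure is a stand-in, and the bookkeeping for max_weight/merge_thresh is omitted): seed with all 2dim sub-tables ranked by dependence, then best-first extend the most promising candidates one column at a time until the budget is reached.

    import heapq
    from itertools import combinations, count

    def select_subtables(columns, dep, budget=10_000):
        """dep maps each sorted column pair to a dependence score; returns chosen column sets."""
        tiebreak = count()  # keeps heap entries comparable when scores tie

        def quality(cols):
            # Stand-in quality measure: average pairwise dependence.
            pairs = list(combinations(sorted(cols), 2))
            return sum(dep[p] for p in pairs) / len(pairs)

        # Seed with all 2dim sub-tables, highest dependence first (max-heap via negation).
        heap = [(-dep[p], next(tiebreak), frozenset(p)) for p in combinations(sorted(columns), 2)]
        heapq.heapify(heap)

        chosen = set()
        while heap and len(chosen) < budget:
            _, _, cols = heapq.heappop(heap)
            if cols in chosen:
                continue
            chosen.add(cols)
            # Extend promising sub-tables by one column: 3dim from 2dim, 4dim from 3dim, etc.
            for col in columns:
                if col not in cols and (ext := cols | {col}) not in chosen:
                    heapq.heappush(heap, (-quality(ext), next(tiebreak), ext))
        return chosen

Because it is best-first rather than strictly level-by-level, this doesn't finish all 3dim sub-tables before trying any 4dim one, but it does explore higher dimensions only from the strongest lower-dimensional candidates.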

@yoid2000
Contributor Author

yoid2000 commented Nov 8, 2024

Currently we have code like this:

    average_quality = sum(dependency_matrix[col, c] for c in cluster.columns) / len(cluster.columns)

    # Skip if below threshold or above weight limit.
    if average_quality < merge_thresh or (
        len(cluster.columns) > DERIVED_COLS_MIN and cluster.total_entropy + col_weights[col] > capacity
    ):

The context of the above code is a loop that goes through the permutation's columns in order, measuring the sub-table quality as it goes, until the quality drops below merge_thresh.

Note that merge_thresh defaults to 0.1. As the number of columns in a sub-table increases, the effect of a single low-dependence pair on the average is diluted (see the illustration below). But this might not matter in practice (i.e., we may not need an additional parameter like min_dependence).
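
To make the dilution concrete, a tiny illustration with invented numbers: one fully independent pair (dependence 0.0) among otherwise weakly dependent pairs (0.15) drags a small sub-table below the default merge_thresh, but is averaged away as the sub-table grows:

    merge_thresh = 0.1
    for n_pairs in (2, 4, 8, 16):
        deps = [0.0] + [0.15] * (n_pairs - 1)  # one independent pair, the rest weak
        avg = sum(deps) / len(deps)
        print(f"{n_pairs} pairs: average_quality = {avg:.4f} -> "
              f"{'merge' if avg >= merge_thresh else 'skip'}")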

max_weight (default 15) limits the sum of column weights, which puts a bound on sub-table size. Column weights are computed like this:

col_weights = list(_col_weight(col) for col in context.entropy_1dim)

So the weight is related to the entropy of the column, which is basically determined by the number of distinct values, IIRC.
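
For reference, a minimal sketch of Shannon entropy for a single column (this is not the project's _col_weight, which may transform the entropy further):

    import math
    from collections import Counter

    def column_entropy(values):
        """Shannon entropy of a column's value distribution, in bits."""
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # More distinct (and more evenly spread) values -> higher entropy.
    print(column_entropy(["a"] * 8))         # 0.0
    print(column_entropy(["a", "b"] * 4))    # 1.0
    print(column_entropy(list("abcdefgh")))  # 3.0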

OK, for now, just try to measure average_quality and col_weights and see how they behave.

@yoid2000
Contributor Author

I suspect the way to do this is to use average_quality to find the best sub-tables in the builder, and then use a solver to find the best clustering in the reader.

@yoid2000
Contributor Author

yoid2000 commented Jan 19, 2025

Look into "mutual information" as a measure of dependency between columns.

From ChatGPT:

Absolutely, you're right! Traditional correlation measures like Pearson's only capture linear relationships. Two variables can indeed have a strong non-linear relationship and show little to no correlation. This is where other measures come into play:

Mutual Information: Measures any kind of dependency between variables, not just linear. It captures both linear and non-linear relationships.

Distance Correlation: Captures both linear and non-linear associations, providing a more comprehensive view of dependency between variables.

Copula-based Measures: Capture dependency structures between variables, accounting for non-linearity and tail dependencies.


From other information, it looks like mutual information is likely the best choice.
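
For reference, a minimal sketch of mutual information between two discrete columns, computed from the joint distribution (scikit-learn's sklearn.metrics.mutual_info_score computes the same quantity, using natural log rather than log2):

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        """Mutual information between two discrete columns, in bits."""
        n = len(xs)
        px, py = Counter(xs), Counter(ys)
        pxy = Counter(zip(xs, ys))
        return sum(
            (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
            for (x, y), c in pxy.items()
        )

    xs = [0, 0, 1, 1, 2, 2]
    print(mutual_information(xs, xs))                  # maximal: equals the column's entropy
    print(mutual_information(xs, [0, 1, 0, 1, 0, 1]))  # 0.0: the columns are independent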
