Add DES python bindings #508

Merged 4 commits on Jan 27, 2025
4 changes: 4 additions & 0 deletions README.md
@@ -42,6 +42,7 @@ The currently supported data patterns are:
- Exact unique column combination (discovery and validation)
- Approximate unique column combination, with $g_1$ metric (discovery and validation)
* Association rules (discovery)
* Numerical association rules (discovery)
* Matching dependencies (discovery)
* Variable heterogeneous denial constraints (validation)

@@ -219,6 +220,9 @@ Here is a list of papers about patterns, organized in the recommended reading or
- [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf)
* Association rules
- [Charu C. Aggarwal, Jiawei Han. 2014. Frequent Pattern Mining. Springer Cham. pp 471.](https://link.springer.com/book/10.1007/978-3-319-07821-2)
* Numerical association rules
- [Minakshi Kaushik, Rahul Sharma, Iztok Fister Jr., and Dirk Draheim. 2023. Numerical Association Rule Mining: A Systematic Literature Review. 1, 1 (July 2023), 50 pages.](https://arxiv.org/abs/2307.00662)
- [Iztok Fister and Iztok Fister Jr. 2020. uARMSolver: A framework for Association Rule Mining. arXiv:2010.10884.](https://doi.org/10.48550/arXiv.2010.10884)
* Matching dependencies
- [Philipp Schirmer, Thorsten Papenbrock, Ioannis Koumarelas, and Felix Naumann. 2020. Efficient Discovery of Matching Dependencies. ACM Trans. Database Syst. 45, 3, Article 13 (September 2020), 33 pages. https://doi.org/10.1145/3392778](https://dl.acm.org/doi/10.1145/3392778)
* Denial constraints
4 changes: 4 additions & 0 deletions README_PYPI.md
@@ -60,6 +60,7 @@ The currently supported data patterns are:
- Exact unique column combination (discovery and validation)
- Approximate unique column combination, with $g_1$ metric (discovery and validation)
* Association rules (discovery)
* Numerical association rules (discovery)
* Matching dependencies (discovery)
* Variable heterogeneous denial constraints (validation)

@@ -220,6 +221,9 @@ Here is a list of papers about patterns, organized in the recommended reading or
- [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf)
* Association rules
- [Charu C. Aggarwal, Jiawei Han. 2014. Frequent Pattern Mining. Springer Cham. pp 471.](https://link.springer.com/book/10.1007/978-3-319-07821-2)
* Numerical association rules
- [Minakshi Kaushik, Rahul Sharma, Iztok Fister Jr., and Dirk Draheim. 2023. Numerical Association Rule Mining: A Systematic Literature Review. 1, 1 (July 2023), 50 pages.](https://arxiv.org/abs/2307.00662)
- [Iztok Fister and Iztok Fister Jr. 2020. uARMSolver: A framework for Association Rule Mining. arXiv:2010.10884.](https://doi.org/10.48550/arXiv.2010.10884)
* Matching dependencies
- [Philipp Schirmer, Thorsten Papenbrock, Ioannis Koumarelas, and Felix Naumann. 2020. Efficient Discovery of Matching Dependencies. ACM Trans. Database Syst. 45, 3, Article 13 (September 2020), 33 pages. https://doi.org/10.1145/3392778](https://dl.acm.org/doi/10.1145/3392778)
* Denial constraints
1 change: 1 addition & 0 deletions examples/basic/README.md
@@ -9,6 +9,7 @@ These scenarios showcase a single pattern by discussing its definition and provi
+ [mining_ac.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_ac.py) — a scenario showing the discovery of algebraic constraints.
+ [mining_afd.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_afd.py) — a scenario showing how to discover approximate functional dependencies.
+ [mining_ar.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_ar.py) — a scenario showing how to discover association rules.
+ [mining_nar.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_nar.py) — a scenario showing how to discover numerical association rules.
+ [mining_aucc.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_aucc.py) — a scenario showing how to discover approximate unique column combinations.
+ [mining_cfd.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_cfd.py) — a scenario showing how to discover conditional functional dependencies.
+ [mining_dd.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_dd.py) — a scenario showing how to discover differential dependencies.
110 changes: 110 additions & 0 deletions examples/basic/mining_nar.py
@@ -0,0 +1,110 @@
import desbordante
import pandas
from colorama import Fore, Style, Back

TABLE = 'examples/datasets/dog_breeds.csv'

DOWN_ARROW = " |\n |\n V"


def print_rule_part(rule_part, columns):
    for column_index, value in rule_part.items():
        print(f'{columns[column_index]}{value}')


def print_10_nars(nars, df_columns):
    for i, nar in enumerate(nars[:10], start=1):
        print(f"NAR {i}:{Style.BRIGHT}")
        print_rule_part(nar.ante, df_columns)
        print(DOWN_ARROW)
        print_rule_part(nar.cons, df_columns)
        print(f" support = {nar.support}")
        print(f" confidence = {nar.confidence}")
        print(f"{Style.RESET_ALL}")


if __name__ == '__main__':
    algo = desbordante.nar.algorithms.DES()
    algo.load_data(table=(TABLE, ',', True))
    print("Numerical Association Rules (NAR) are an extension of traditional "
          "Association Rules (AR), which help to discover patterns in data. Unlike ARs, "
          "which work with binary attributes (e.g., whether an item was purchased "
          "or not), NARs can handle numerical data (e.g., how many units of an item "
          "were purchased). This makes NARs more flexible for discovering relationships "
          "in datasets with numerical data.")
    print("Suppose we have a table containing students' exam grades and how many "
          "hours they studied for the exam. Such a table might hold the following "
          "numerical association rule:\n")
    print(f"{Style.BRIGHT}Study_Hours[15.5 - 30.2] {Fore.BLUE}⎤-Antecedent")
    print(f"{Fore.RESET}Subject[Topology] {Fore.BLUE}⎦")
    print(f"{Fore.RESET}{DOWN_ARROW}")
    print(f"Grade[3 - 5] {Fore.BLUE}]-Consequent")
    print(f"{Fore.RESET} support = 0.21")
    print(" confidence = 0.93")
    print()
    print(f"{Style.RESET_ALL}This rule states that students who study Topology for "
          "between 15.5 and 30.2 hours will receive a grade between 3 and 5. This "
          "rule has support of 0.21, which means that 21% of rows in the dataset "
          "satisfy both the antecedent's and consequent's requirements. This rule also "
          "has confidence of 0.93, meaning that 93% of rows that satisfy the "
          "antecedent also satisfy the consequent. Note that attributes can be integers, "
          "floating point numbers, or strings.\n")
    print('Desbordante implements an algorithm called "Differential Evolution Solver" '
          '(DES), described by Iztok Fister et al. in "uARMSolver: A framework for '
          'Association Rule Mining". It is a nature-inspired stochastic optimization '
          "algorithm.\n")
    print("As a demonstration of working with some of DES' parameters, let's inspect "
          "a dataset containing information about 159 dog breeds.\n")
    df = pandas.read_csv(TABLE)

    print("Fragment of the dog_breeds.csv table:")
    print(df)
    print("\nA fragment of the table is presented above. In total, each dog breed has"
          " 13 attributes.")
    print("Now, let's mine some NARs. We will use a minimum support of 0.1 and a minimum "
          "confidence of 0.7. We will also use a population size of 500 and "
          "max_fitness_evaluations of 700. Larger values for max_fitness_evaluations "
          "tend to return larger rules encompassing more attributes. The population size "
          "parameter affects the number of NARs being generated and mutated. Larger values "
          "are slower but output more NARs.\n")
    algo.execute(minconf=0.7, minsup=0.1, population_size=500,
                 max_fitness_evaluations=700)
    if len(algo.get_nars()) != 1:
        raise ValueError("example requires that a single NAR was mined")
    discovered_nar = algo.get_nars()[0]
    print_10_nars([discovered_nar], df.columns)

    print("\nThe above NAR is the only one discovered with these settings. The NAR "
          "states that about 92% of all dog breeds of type "
          "'Hound' have an intelligence rating between 6 and 8 out of 10 and are between "
          "sizes 0 and 4 out of 5 (0 being 'Toy' and 5 being 'Giant'). This suggests "
          "that, in general, hounds are intelligent dogs and no more than "
          "8% of all hounds are of 'Giant' size. Let's see if that is true.\n")

    hound_rows = df[df['Type'] == 'Hound']

    violating_row_indices = []
    min_intelligence = discovered_nar.cons[9].lower_bound
    max_intelligence = discovered_nar.cons[9].upper_bound
    for i, (_, row) in enumerate(hound_rows.iterrows()):
        intelligence = row['Intelligence']
        if intelligence < min_intelligence or intelligence > max_intelligence:
            violating_row_indices.append(i)

    header, *hound_row_strings = hound_rows[['Name', 'Type', 'Intelligence', 'Size']].to_string().splitlines()
    for i, hound_row_string in enumerate(hound_row_strings):
        if i in violating_row_indices:
            print(f"{Back.RED}{hound_row_string}{Back.RESET}")
        else:
            print(hound_row_string)

    print("\nAs observed, only 2 rows with 'Type' equal to 'Hound' fall outside "
          "the intelligence rating range of 6 to 8. These two records account for "
          "the (27-2)/27 ~= 92% confidence level of this rule.\n")
    print("Let's try again, but this time with different settings. This time, minimum support "
          "will have a more lenient value of 0.05 and the population size will be 700. "
          "This will help discover more NARs. The value of max_fitness_evaluations "
          "will also need to be increased to 1500 in accordance with the population "
          "size to produce a non-empty result.\n")
    algo.execute(minconf=0.7, minsup=0.05, population_size=700,
                 max_fitness_evaluations=1500)
    print_10_nars(algo.get_nars(), df.columns)
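The example above relies on Desbordante to report support and confidence for each mined rule. As a standalone illustration of how those two measures are defined for interval-based rules, here is a small sketch using only pandas; the toy table and the rule bounds are invented for this illustration and are not part of the PR:

```python
import pandas as pd

# Toy table of study hours and grades, invented for this illustration.
df = pd.DataFrame({
    'Study_Hours': [5.0, 16.0, 20.0, 28.0, 31.0, 18.0],
    'Grade':       [2,   4,    3,    5,    5,    1],
})

# Rule: Study_Hours[15.5 - 30.2] -> Grade[3 - 5]
ante = df['Study_Hours'].between(15.5, 30.2)  # antecedent holds on these rows
cons = df['Grade'].between(3, 5)              # consequent holds on these rows

# support: fraction of all rows satisfying both antecedent and consequent
support = (ante & cons).mean()
# confidence: fraction of antecedent rows that also satisfy the consequent
confidence = (ante & cons).sum() / ante.sum()

print(f'support = {support:.2f}')      # 3 of 6 rows satisfy both parts
print(f'confidence = {confidence:.2f}')  # 3 of the 4 antecedent rows
```

These are the same definitions the script's explanatory text gives for the 0.21 support and 0.93 confidence of its Study_Hours example.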
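The DES algorithm bound by this PR is described in the script as a nature-inspired stochastic optimizer based on differential evolution. The following is a generic sketch of the classic DE/rand/1/bin scheme on a toy objective, not Desbordante's implementation; the function name, parameters, and objective are invented for the example:

```python
import random

random.seed(0)  # for a reproducible run of this sketch

def differential_evolution(f, bounds, pop_size=20, max_evals=2000, F=0.8, CR=0.9):
    """Minimize f over the box `bounds` using DE/rand/1/bin."""
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [f(x) for x in pop]
    evals = pop_size
    while evals < max_evals:
        for i in range(pop_size):
            # Pick three distinct vectors other than the current one.
            a, b, c = random.sample([x for j, x in enumerate(pop) if j != i], 3)
            j_rand = random.randrange(dim)  # guarantees at least one mutated gene
            trial = [
                min(max(a[d] + F * (b[d] - c[d]), bounds[d][0]), bounds[d][1])
                if (random.random() < CR or d == j_rand) else pop[i][d]
                for d in range(dim)
            ]
            trial_fit = f(trial)
            evals += 1
            if trial_fit <= fitness[i]:  # greedy selection
                pop[i], fitness[i] = trial, trial_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]

# Minimize the 3-dimensional sphere function as a toy objective.
best_x, best_f = differential_evolution(lambda x: sum(v * v for v in x),
                                        bounds=[(-5.0, 5.0)] * 3)
```

In DES, an analogous population of candidate vectors encodes attribute intervals, and the fitness rewards rules with high support and confidence; the `population_size` and `max_fitness_evaluations` parameters used in the example script correspond to `pop_size` and `max_evals` here.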