bug fixes to the evaluate algorithm + speedups for reduce/equivalence. Added diagram to README and a short blurb on how to reproduce our results
BatsResearch · Jun 26, 2024 · 90b8d60
1 parent a482c8b
Showing 12 changed files with 532 additions and 428 deletions.
diff --git a/README.md b/README.md
@@ -32,9 +32,9 @@ rm -rf tmp
 ## Basic Usage
 To evaluate a PDDL problem description, we can use the `planetarium.evaluate` module:
 ```python
-from planetarium import evaluate
+import planetarium
 ...
-evaluate.evaluate(gt_pddl_str, pred_pddl_str)
+planetarium.evaluate(gt_pddl_str, pred_pddl_str)
 ```
 The supported domains are `blocksworld` and `gripper` domains.
 
@@ -47,6 +47,7 @@ from datasets import load_dataset
 
 dataset = load_dataset("BatsResearch/planetarium")
 ```
+Here, `dataset["test"]` is the main test set used in the paper. You may evaluate on this set to reproduce our results.
 
 You can reproduce the dataset, the splits, and a report by running the following command:
 ```bash
@@ -74,4 +75,18 @@ Total number of problems: $132,037$.
 | $20$ - $40$ | $10,765$ | $2,112$ |
 | $40$ - $60$ | $50,793$ | $9,412$ |
 | $60$ - $80$ | $26,316$ | $25,346$ |
-| $80$ - inf | $3,464$ | $2,438$ |
+| $80$ - inf | $3,464$ | $2,438$ |
+
+## How it Works
+Planetarium🪐 compares two PDDL problem descriptions by first transcribing them into a graph representation.
+Graphs make it easier to detect and manipulate relationships between objects and propositions.
+Next, we build "fully specified" graph representations by adding "trivial" propositions (propositions that do not appear in the problem description but must hold in any state that satisfies that description).
+Finally, we use graph isomorphism to compare the fully specified graph representations of the two PDDL problem descriptions, either comparing the entire problem graph or the individual initial and goal scene graphs.
+This lets us check the correctness of a translation from a natural language description into PDDL without ever needing to run a planner.
+
+Below is a flowchart providing an overview of the equivalence algorithm:
+
+![Equivalence Algorithm Overview](assets/equivalence.png)
+<p style="text-align: center;">(Left) Two planning problems, shown as PDDL problem descriptions, real-world scenarios, and graph representations. (Center) Fully specified graph representations. (Right) Graph isomorphism.</p>
+
+The key to this algorithm is a specially crafted "fully specify" function, which we build for each domain we want to support. We provide implementations for the `blocksworld` and `gripper` domains in the `planetarium.oracle` module.
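The README's "How it Works" section describes three steps: transcribe each problem into a structure, fully specify it by adding trivial propositions, then compare the results up to isomorphism. The sketch below is a toy illustration of that idea under heavy assumptions: scenes are plain sets of proposition tuples rather than graphs, the hypothetical `fully_specify` only adds `clear` facts to blocksworld states in which every block's position is given, and isomorphism is brute-forced over object renamings. None of these helpers (`objects`, `fully_specify`, `rename`, `isomorphic`, `equivalent`) are part of planetarium; the real per-domain logic lives in `planetarium.oracle`.

```python
from itertools import permutations

# A proposition is a tuple (predicate, arg1, arg2, ...); a scene is a set of them.
# All helpers below are hypothetical toys, not planetarium APIs.


def objects(scene):
    """Collect every object name mentioned in a scene, sorted for determinism."""
    return sorted({arg for prop in scene for arg in prop[1:]})


def fully_specify(scene):
    """Toy blocksworld 'fully specify': assuming every block is placed via
    on-table or on, add clear(x) for each block that has nothing on top of it."""
    blocks = objects(scene)
    covered = {prop[2] for prop in scene if prop[0] == "on"}  # ("on", top, bottom)
    implied = {("clear", b) for b in blocks if b not in covered}
    return frozenset(scene) | implied


def rename(scene, mapping):
    """Apply an object-name mapping to every argument of every proposition."""
    return frozenset((prop[0], *(mapping[a] for a in prop[1:])) for prop in scene)


def isomorphic(a, b):
    """Brute-force check: does some bijection of object names map scene a onto b?"""
    objs_a, objs_b = objects(a), objects(b)
    if len(objs_a) != len(objs_b) or len(a) != len(b):
        return False
    return any(rename(a, dict(zip(objs_a, perm))) == b for perm in permutations(objs_b))


def equivalent(a, b):
    """Compare two scenes after fully specifying both."""
    return isomorphic(fully_specify(a), fully_specify(b))


# p1 states clear(b) explicitly; p2 leaves clear implicit and uses other names.
p1 = {("on-table", "a"), ("on", "b", "a"), ("clear", "b")}
p2 = {("on-table", "x"), ("on", "y", "x")}

print(frozenset(p1) == frozenset(p2))            # False
print(isomorphic(frozenset(p1), frozenset(p2)))  # False: p2 lacks clear(y)
print(equivalent(p1, p2))                        # True
```

Plain set equality fails here because the problems name their blocks differently, and isomorphism alone fails because only one problem states the trivial `clear` fact; only the fully specified comparison succeeds. This appears to be the same distinction behind replacing the direct `!=` check on initial scenes with `metric.equals(...)` in the `planetarium/evaluate.py` diff. Brute force over permutations grows factorially with the number of objects, so it only makes sense for tiny examples; a real implementation would use a proper graph isomorphism routine.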
diff --git a/assets/equivalence.png b/assets/equivalence.png
diff --git a/evaluate.py b/evaluate.py
@@ -200,8 +200,8 @@ def result():
         parseable = True
 
         # reduce and further validate the LLM output
-        oracle.reduce(llm_problem_graph.decompose()[0], validate=True)
-        oracle.reduce(llm_problem_graph.decompose()[1], validate=True)
+        oracle.reduce(llm_problem_graph.init())
+        oracle.reduce(llm_problem_graph.goal())
         valid = True
 
         problem_graph = builder.build(problem_pddl)

diff --git a/planetarium/__init__.py b/planetarium/__init__.py
@@ -1,7 +1,9 @@
-__all__ = ["builder", "downward", "graph", "metric", "oracle"]
+__all__ = ["builder", "downward", "graph", "metric", "oracle", "evaluate"]
 
 from . import builder
 from . import downward
 from . import graph
 from . import metric
 from . import oracle
+
+from .evaluate import evaluate
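The `__init__.py` change re-exports the function with `from .evaluate import evaluate`. Importing a submodule normally binds it as a package attribute, but this line then rebinds the name `evaluate` in the package namespace to the function, so `planetarium.evaluate` becomes callable and the old README pattern `from planetarium import evaluate` followed by `evaluate.evaluate(...)` would no longer work. That is consistent with the README now calling `planetarium.evaluate(...)` directly. The self-contained sketch below reproduces the shadowing with a throwaway package; `demo_pkg` and its toy `evaluate` are hypothetical stand-ins, not planetarium code.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package "demo_pkg" mirroring the new layout:
# a submodule evaluate.py defining evaluate(), re-exported by __init__.py.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "evaluate.py").write_text(textwrap.dedent("""
    def evaluate(source, target):
        return source == target
"""))
(pkg / "__init__.py").write_text(textwrap.dedent("""
    __all__ = ["evaluate"]
    from .evaluate import evaluate
"""))

sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

# The package attribute is now the function, not the submodule...
print(callable(demo_pkg.evaluate))                      # True
# ...while the submodule object is still registered in sys.modules.
print(type(sys.modules["demo_pkg.evaluate"]).__name__)  # module

# The old README pattern now imports the function, which has no .evaluate.
from demo_pkg import evaluate
print(hasattr(evaluate, "evaluate"))                    # False
```

The temporary directory is left on disk for simplicity; a longer-lived script would remove it with `shutil.rmtree` once done.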
diff --git a/planetarium/evaluate.py b/planetarium/evaluate.py
@@ -3,7 +3,7 @@
 from pddl.parser.problem import LenientProblemParser
 from pddl.formatter import problem_to_string
 
-from planetarium import *
+from planetarium import builder, oracle, metric, downward
 
 
 VALIDATE = os.getenv("VALIDATE", "Validate")
@@ -129,7 +129,7 @@ def evaluate(
 
     if source_graph == target_graph:
         equivalent = True
-    elif source_graph.decompose()[0] != target_graph.decompose()[0]:
+    elif not metric.equals(source_graph.init(), target_graph.init()):
         equivalent = False
     else:
         equivalent = metric.equals(