
refactor(Pipelines): Smart Data Frame Pipeline #735

Merged
15 commits merged into release/v1.5 on Nov 12, 2023

Conversation

milind-sinaptik
Contributor

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.

Contributor

coderabbitai bot commented Nov 7, 2023

Important

Auto Review Skipped

Auto reviews are disabled on base/target branches other than the default branch. Please add the base/target branch pattern to the list of additional branches to be reviewed in the settings.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with the CodeRabbit bot directly on specific lines of code in the PR, mention @coderabbitai in your review comment.
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

@gventuri requested a review from ArslanSaleem November 7, 2023 22:59
"output_type_helper", output_type_helper
)
pipeline_context.add_intermediate_value("viz_lib_helper", viz_lib_helper)
pipeline_context.add_intermediate_value("last_reasoning", self._last_reasoning)
Collaborator

No need to pass this info to the pipeline, as it will only be set and never read within the pipeline

Collaborator

Same for last_answer, last_code_generated

Contributor Author

used in pandasai/smart_datalake/code_generator.py

Contributor Author
@milind-sinaptik Nov 8, 2023

Only last_answer and last_reasoning are never used; getting rid of them.

"last_code_generated", self._last_code_generated
)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
pipeline_context.add_intermediate_value("llm", self.llm)
Collaborator

The llm can be found within the config, no need to pass it as an additional value.
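
For illustration, a minimal sketch of what this suggestion implies, assuming the pipeline context exposes the parsed config:

# Instead of pipeline_context.add_intermediate_value("llm", self.llm),
# a logic unit can read the LLM from the config it already carries:
llm = pipeline_context.config.llm  # assumed attribute path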

Contributor Author

fixing this

pipeline_context.add_intermediate_value("code_manager", self._code_manager)
pipeline_context.add_intermediate_value(
"response_parser", self._response_parser
)
Collaborator

I suggest we move the full logic of code_manager and response_parser into the most appropriate logic unit. If it is used elsewhere, we can still replicate the whole logic within the related logic unit.

Contributor Author
@milind-sinaptik Nov 8, 2023

There is not a lot of logic here; we're only passing the parser object, which in turn does the parsing with its parse method. Have a look at result_parsing.py; the whole logic has moved there.

Collaborator
@ArslanSaleem Nov 10, 2023

@milind-sinaptik check pandasai/pipelines/logic_units/output_logic_unit.py to see how we moved ResponseParser.
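
For readers without that file open, a hedged sketch of the pattern being referenced: a logic unit that constructs the parser itself rather than receiving it as an intermediate value (the class name and import paths are illustrative, not the exact file contents):

from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # assumed path
from pandasai.responses.response_parser import ResponseParser


class ProcessOutput(BaseLogicUnit):
    """Builds the response parser inside the pipeline step."""

    def execute(self, input, **kwargs):
        context = kwargs.get("context")
        # The parser is built from the pipeline context, so nothing has
        # to be smuggled in via add_intermediate_value.
        return ResponseParser(context).parse(input)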

context=context,
logger=logger,
steps=[
CodeGenerator(),
Collaborator

@milind-sinaptik can we add one more logic unit before the code generator to generate the prompt, and use the one that is already there to execute the LLM? You can check the already implemented pipeline for the synthetic dataframe as an example.
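
Roughly, the pipeline shape being suggested (PromptGeneration and PromptExecution are placeholder unit names; only CodeGenerator appears in the diff above):

pipeline = Pipeline(
    context=context,
    logger=logger,
    steps=[
        PromptGeneration(),  # build the prompt from context and memory
        PromptExecution(),   # call the LLM with the generated prompt
        CodeGenerator(),     # turn the LLM response into runnable code
    ],
)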

Contributor Author

done

pipeline_context.add_intermediate_value(
"output_type_helper", output_type_helper
)
pipeline_context.add_intermediate_value("viz_lib_helper", viz_lib_helper)
Collaborator

@milind-sinaptik we don't need all of these intermediate values to be stored; only some might be needed. viz_lib comes from the config, which is already present. These helpers can be called inside the logic unit.

Contributor Author

done


self.last_result = result
self.logger.log(f"Answer: {result}")
result = GenerateSmartDatalakePipeline(pipeline_context, self.logger).run()
Collaborator

@milind-sinaptik I would say: why not pass query and output_type as a dict to pipeline.run(input)? This way we can reduce the usage of add_intermediate_value.
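
In other words, a sketch of the proposed calling convention, assuming Pipeline.run forwards its argument to the steps:

result = GenerateSmartDatalakePipeline(pipeline_context, self.logger).run(
    {"query": query, "output_type": output_type}
)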

pipeline_context.add_intermediate_value(
"last_code_generated", self._last_code_generated
)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
Collaborator

@milind-sinaptik this get_prompt functionality can be moved to the logic unit that takes care of prompt generation. This way we can completely decouple it from the SmartDatalake.
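
A sketch of what that decoupling could look like, assuming the GeneratePythonCodePrompt class and its set_var API (the unit name and variable choices are illustrative):

from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # assumed path
from pandasai.prompts import GeneratePythonCodePrompt


class PromptGeneration(BaseLogicUnit):
    def execute(self, input, **kwargs):
        context = kwargs.get("context")
        # Build the prompt from the pipeline context alone, with no
        # callback into SmartDatalake._get_prompt.
        prompt = GeneratePythonCodePrompt()
        prompt.set_var("dfs", context.dfs)
        prompt.set_var("conversation", context.memory.get_conversation())
        return prompt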

Milind Lalwani and others added 6 commits November 8, 2023 16:34
@gventuri requested a review from ArslanSaleem November 9, 2023 23:40
@milind-sinaptik changed the base branch from release/v1.5 to main November 10, 2023 09:35
@milind-sinaptik changed the base branch from main to release/v1.5 November 10, 2023 09:57
@@ -79,6 +78,10 @@ def run(self, data: Any = None) -> Any:
try:
for index, logic in enumerate(self._steps):
self._logger.log(f"Executing Step {index}: {logic.__class__.__name__}")

if logic.skip_if is not None and logic.skip_if(self._context):
continue
Collaborator

@milind-sinaptik add a log when a step is skipped, for debugging purposes.
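
i.e., something along these lines (the message wording is illustrative):

if logic.skip_if is not None and logic.skip_if(self._context):
    self._logger.log(f"Skipping Step {index}: {logic.__class__.__name__}")
    continue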

Contributor Author

done

@@ -628,11 +439,25 @@ def _validate_output(self, result: dict, output_type: Optional[str] = None):
)
raise ValueError("Output validation failed")

def _get_output_type_hint(self, output_type: Optional[str]) -> str:
def _get_viz_library_type(self) -> str:
Collaborator

@milind-sinaptik I think we don't need this function; it's not used.

Contributor Author

this was not added by me

Contributor Author

done

query_exec_tracker=self._query_exec_tracker,
)
pipeline_context.add_intermediate_value("is_present_in_cache", False)
pipeline_context.add_intermediate_value(
Collaborator

@milind-sinaptik I think this output_type_helper doesn't need to be passed. We simply store input = { query: str, output_type: str }, and whichever logic unit needs it gets it from the factory there.

Contributor Author

The query is already in the memory. The reason we are not providing it as input is that we would be providing the output type to all seven stages when only one needs it. That would be a bit unnecessary; and since the output of one stage is the input of the next, we would also have to manipulate the output of each stage.

Collaborator
@ArslanSaleem Nov 10, 2023

OK, then instead of passing output_type_helper, store the raw output_type from the chat method and get it from the output-type factory where it is used.
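
A sketch of the agreed approach, assuming the output_type_factory helper used elsewhere in pandasai for output-type handling:

# In chat(): store only the raw string.
pipeline_context.add_intermediate_value("output_type", output_type)

# In the one logic unit that needs it:
output_type_helper = output_type_factory(
    context.get_intermediate_value("output_type"), logger=logger
)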

Contributor Author

done

)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id)
pipeline_context.add_intermediate_value("skills", self._skills)
Collaborator

PipelineContext has skills in its constructor; they can be retrieved from there.

Contributor Author

done

"last_code_generated", self._last_code_generated
)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id)
Collaborator

@milind-sinaptik I think last_prompt_id can be handled by the tracker. We need to store things in the right place, so I would say let's completely remove last_prompt_id from SmartDatalake and initialize it in the QueryTracker -> start_new_track method.
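
Concretely, a sketch of that move (the tracker internals here are assumed):

import uuid


class QueryExecTracker:
    def start_new_track(self):
        # The prompt id now lives on the tracker, not on SmartDatalake.
        self._last_prompt_id = uuid.uuid4()

    @property
    def last_prompt_id(self) -> uuid.UUID:
        return self._last_prompt_id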

Contributor Author

done

pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id)
pipeline_context.add_intermediate_value("skills", self._skills)
pipeline_context.add_intermediate_value("code_manager", self._code_manager)
Collaborator

We can move the code manager completely into the code-execution logic unit, because it is needed only there, and remove it completely from SmartDataLake.
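
A sketch of that move, with the code-execution unit building its own CodeManager (constructor and call signatures assumed from how SmartDatalake uses it):

from pandasai.helpers.code_manager import CodeManager
from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # assumed path


class CodeExecution(BaseLogicUnit):
    def execute(self, input, **kwargs):
        context = kwargs.get("context")
        logger = kwargs.get("logger")
        # Built here, so SmartDatalake no longer has to carry a
        # code_manager intermediate value.
        code_manager = CodeManager(
            dfs=context.dfs, config=context.config, logger=logger
        )
        return code_manager.execute_code(input, context)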

Contributor Author

done


def _add_result_to_memory(self, result: dict):
pipeline_context = PipelineContext(
Collaborator

@milind-sinaptik I would say construct PipelineContext and Pipeline in the constructor of SmartDatalake, and in the chat method use it like self._pipeline.run({query, output_type}).
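
In code, the split being proposed might look like this (a sketch; the Logger and PipelineContext arguments are abbreviated):

class SmartDatalake:
    def __init__(self, dfs, config=None):
        self.logger = Logger()
        self._pipeline_context = PipelineContext(dfs=dfs, config=config)
        self._pipeline = GenerateSmartDatalakePipeline(
            self._pipeline_context, self.logger
        )

    def chat(self, query: str, output_type: str = None):
        return self._pipeline.run({"query": query, "output_type": output_type})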

Contributor Author

done

"code": code,
"error_returned": e,
}
error_correcting_instruction = context.get_intermediate_value("get_prompt")(
Collaborator

@milind-sinaptik why not move the get_prompt function here into this function, and set_vars with what is necessary for CorrectErrorPrompt?

Contributor Author

I tried moving this, but some tests start failing when the function is moved. Putting this aside for now.

self._add_result_to_memory(result=result, context=pipeline_context)

result = pipeline_context.query_exec_tracker.execute_func(
pipeline_context.get_intermediate_value("response_parser").parse, result
Collaborator
@ArslanSaleem Nov 10, 2023

@milind-sinaptik the ResponseParser object can be constructed here.
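
That is, roughly (ResponseParser's constructor argument is assumed):

result = pipeline_context.query_exec_tracker.execute_func(
    ResponseParser(pipeline_context).parse, result
)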

@gventuri merged commit 5dd26ba into release/v1.5 Nov 12, 2023
9 checks passed