refactor(Pipelines): Smart Data Frame Pipeline #735
Conversation
milind-sinaptik commented on Nov 7, 2023
- closes #xxxx (Replace xxxx with the GitHub issue number)
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
pandasai/smart_datalake/__init__.py
Outdated
"output_type_helper", output_type_helper | ||
) | ||
pipeline_context.add_intermediate_value("viz_lib_helper", viz_lib_helper) | ||
pipeline_context.add_intermediate_value("last_reasoning", self._last_reasoning) |
No need to pass this info to the pipeline, as it will only be set and never read within the pipeline
Same for last_answer, last_code_generated
used in pandasai/smart_datalake/code_generator.py
Only last_answer and last_reasoning are never used; getting rid of them.
pandasai/smart_datalake/__init__.py
Outdated
"last_code_generated", self._last_code_generated | ||
) | ||
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt) | ||
pipeline_context.add_intermediate_value("llm", self.llm) |
The llm can be found within the config, no need to pass it as an additional value.
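A minimal sketch of that inside a logic unit (the `kwargs["context"]` access and the `context.config.llm` attribute are assumptions, not confirmed API):

```python
# Sketch: inside a logic unit's execute(), read the llm from the config the
# context already carries, instead of a dedicated intermediate value.
def execute(self, input, **kwargs):
    context = kwargs["context"]
    llm = context.config.llm  # rather than context.get_intermediate_value("llm")
    ...
```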
fixing this
pipeline_context.add_intermediate_value("code_manager", self._code_manager) | ||
pipeline_context.add_intermediate_value( | ||
"response_parser", self._response_parser | ||
) |
I suggest we move the full logic of code_manager and response_parser into whichever logic unit is most appropriate. If it is used elsewhere, we can still replicate the whole logic within the related logic unit.
There is not a lot of logic here; we are only passing the parser object, which in turn does the parsing with its parse method. Have a look at result_parsing.py: the whole logic is moved there.
@milind-sinaptik check pandasai/pipelines/logic_units/output_logic_unit.py to see how we moved ResponseParser.
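For readers following the thread, the pattern looks roughly like this; the import paths and the ResponseParser(context) constructor signature are assumptions based on the file names mentioned here:

```python
from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # path assumed
from pandasai.responses.response_parser import ResponseParser  # path assumed


class ProcessOutput(BaseLogicUnit):
    """Sketch: the logic unit builds its own parser instead of receiving it
    from SmartDatalake as an intermediate value."""

    def execute(self, input, **kwargs):
        context = kwargs["context"]
        # ResponseParser(context) is an assumed constructor signature.
        return ResponseParser(context).parse(input)
```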
context=context,
logger=logger,
steps=[
    CodeGenerator(),
@milind-sinaptik can we add one more logic unit before the code generator to generate the prompt, and use the one that is already there to run/execute the LLM? You can check the example of the already implemented pipeline for the synthetic dataframe.
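Something along these lines (a sketch; PromptGeneration is a placeholder name, only CodeGenerator appears in this diff):

```python
# Sketch of the suggested step order; PromptGeneration is a placeholder name.
pipeline = Pipeline(
    context=context,
    logger=logger,
    steps=[
        PromptGeneration(),  # build the prompt from the context
        CodeGenerator(),     # send it to the LLM and extract the code
    ],
)
```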
done
pipeline_context.add_intermediate_value(
    "output_type_helper", output_type_helper
)
pipeline_context.add_intermediate_value("viz_lib_helper", viz_lib_helper)
@milind-sinaptik we don't need all of these intermediate values to be stored; only some might be needed, and viz_lib comes from the config, which is already present. These helpers can be called inside the logic unit.
done
self.last_result = result
self.logger.log(f"Answer: {result}")
result = GenerateSmartDatalakePipeline(pipeline_context, self.logger).run()
@milind-sinaptik I would say: why not pass query and output_type as a dict to pipeline.run(input)? This way we can reduce the usage of add_intermediate_value.
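A sketch of what that call site could look like, mirroring the run() line above (the key names are taken from this comment, not from merged code):

```python
# Sketch: feed the user request in as pipeline input instead of
# intermediate values.
result = GenerateSmartDatalakePipeline(pipeline_context, self.logger).run(
    {"query": query, "output_type": output_type}
)
```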
pipeline_context.add_intermediate_value(
    "last_code_generated", self._last_code_generated
)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
@milind-sinaptik this get_prompt functionality can be moved to the logic unit that takes care of prompt generation. This way we can completely decouple it from the SmartDatalake.
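Roughly (a sketch; the import paths, the prompt class, and the set_var/dfs API calls are assumptions):

```python
from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # path assumed
from pandasai.prompts.generate_python_code import GeneratePythonCodePrompt  # path assumed


class PromptGeneration(BaseLogicUnit):
    """Sketch: own the prompt construction instead of calling back into
    SmartDatalake._get_prompt."""

    def execute(self, input, **kwargs):
        context = kwargs["context"]
        # set_var and the dfs attribute are assumptions about the prompt
        # and context APIs.
        prompt = GeneratePythonCodePrompt()
        prompt.set_var("dfs", context.dfs)
        return prompt
```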
…a Smart Lake pipeline
Co-authored-by: Sourcery AI <>
@@ -79,6 +78,10 @@ def run(self, data: Any = None) -> Any:
         try:
             for index, logic in enumerate(self._steps):
                 self._logger.log(f"Executing Step {index}: {logic.__class__.__name__}")
+
+                if logic.skip_if is not None and logic.skip_if(self._context):
+                    continue
@milind-sinaptik add a log when a step is skipped, for debugging purposes.
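For example, mirroring the existing "Executing Step" message (a sketch; the exact wording is assumed):

```python
if logic.skip_if is not None and logic.skip_if(self._context):
    self._logger.log(f"Skipping Step {index}: {logic.__class__.__name__}")
    continue
```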
done
@@ -628,11 +439,25 @@ def _validate_output(self, result: dict, output_type: Optional[str] = None):
             )
             raise ValueError("Output validation failed")

-    def _get_output_type_hint(self, output_type: Optional[str]) -> str:
+    def _get_viz_library_type(self) -> str:
@milind-sinaptik I think we don't need this function; it's not used.
this was not added by me
done
    query_exec_tracker=self._query_exec_tracker,
)
pipeline_context.add_intermediate_value("is_present_in_cache", False)
pipeline_context.add_intermediate_value(
@milind-sinaptik I think this output_type_helper does not need to be passed. We simply store Input = { query: str, output_type: str }, and whichever logic unit needs it gets the helper from the factory there.
query is already in the memory. The reason we are not providing it as input is that we would be providing output_type to all seven stages when only one needs it, which would be a bit unnecessary. Also, the output of one stage is the input of the next, so we would have to manipulate the output of each stage as well.
Ok, then instead of passing output_type_helper, store the raw output_type from the chat method and get it from output_type_factory where it is used.
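A sketch of that split (output_type_factory is the factory named in this thread; its signature here is an assumption):

```python
# Sketch: in chat(), store only the raw string...
pipeline_context.add_intermediate_value("output_type", output_type)

# ...and resolve the helper inside the one logic unit that needs it.
output_type_helper = output_type_factory(
    context.get_intermediate_value("output_type")
)
```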
done
)
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt)
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id)
pipeline_context.add_intermediate_value("skills", self._skills)
PipelineContext has skills in its constructor; they can be fetched from there.
done
"last_code_generated", self._last_code_generated | ||
) | ||
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt) | ||
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id) |
I think last_prompt_id can live in the query tracker. We need to store things in the right place, so I would say let's completely remove last_prompt_id from SmartDatalake and initialize it in the QueryExecTracker -> start_new_track method.
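A sketch of what that could look like (start_new_track is the method named here; generating the id with uuid4 is an assumption):

```python
import uuid


class QueryExecTracker:
    def start_new_track(self):
        # Sketch: the tracker owns the prompt id instead of SmartDatalake.
        self._last_prompt_id = uuid.uuid4()

    @property
    def last_prompt_id(self) -> uuid.UUID:
        return self._last_prompt_id
```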
done
pipeline_context.add_intermediate_value("get_prompt", self._get_prompt) | ||
pipeline_context.add_intermediate_value("last_prompt_id", self.last_prompt_id) | ||
pipeline_context.add_intermediate_value("skills", self._skills) | ||
pipeline_context.add_intermediate_value("code_manager", self._code_manager) |
We can move the code manager completely into the code-execution logic unit, because it is needed only there, and remove it completely from SmartDataLake.
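A sketch of that move (the import paths, the CodeManager constructor arguments, and the execute_code entry point are all assumptions):

```python
from pandasai.helpers.code_manager import CodeManager  # path assumed
from pandasai.pipelines.base_logic_unit import BaseLogicUnit  # path assumed


class CodeExecution(BaseLogicUnit):
    """Sketch: construct the code manager where it is used, so SmartDatalake
    no longer needs a _code_manager attribute."""

    def execute(self, input, **kwargs):
        context = kwargs["context"]
        # Constructor arguments are assumptions about CodeManager's API.
        code_manager = CodeManager(
            dfs=context.dfs, config=context.config, logger=kwargs.get("logger")
        )
        return code_manager.execute_code(input)  # assumed entry point
```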
done
def _add_result_to_memory(self, result: dict):
pipeline_context = PipelineContext(
@milind-sinaptik I would say construct the PipelineContext and Pipeline in the constructor of SmartDatalake, and in the chat method use it like self._pipeline.run({"query": query, "output_type": output_type}).
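A sketch of the suggested wiring (the import path and the PipelineContext constructor arguments are assumptions):

```python
from pandasai.pipelines.pipeline_context import PipelineContext  # path assumed


class SmartDatalake:
    def __init__(self, dfs, config=None, logger=None):
        # Sketch: build the context and pipeline once, up front.
        self._context = PipelineContext(dfs=dfs, config=config)
        self._pipeline = GenerateSmartDatalakePipeline(self._context, logger)

    def chat(self, query, output_type=None):
        return self._pipeline.run({"query": query, "output_type": output_type})
```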
done
"code": code, | ||
"error_returned": e, | ||
} | ||
error_correcting_instruction = context.get_intermediate_value("get_prompt")( |
@milind-sinaptik why not move the get_prompt logic here, into this function, and set the vars that are necessary for CorrectErrorPrompt?
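i.e. something like this (a sketch; the import path and the set_var API are assumptions):

```python
from pandasai.prompts.correct_error_prompt import CorrectErrorPrompt  # path assumed

# Sketch: build the prompt inline instead of going through get_prompt.
prompt = CorrectErrorPrompt()
prompt.set_var("code", code)
prompt.set_var("error_returned", e)
```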
I tried moving this, but some tests start failing when I move this function. Putting this aside for now.
self._add_result_to_memory(result=result, context=pipeline_context)

result = pipeline_context.query_exec_tracker.execute_func(
    pipeline_context.get_intermediate_value("response_parser").parse, result
@milind-sinaptik the ResponseParser object can be constructed here.
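A sketch of that, mirroring the lines above (the ResponseParser(context) constructor signature is an assumption):

```python
# Sketch: construct the parser at the point of use instead of passing it in
# as an intermediate value.
result = pipeline_context.query_exec_tracker.execute_func(
    ResponseParser(pipeline_context).parse, result
)
```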