[Docs Agent] Docs Agent release version 0.4.0 (#533)

* [Docs Agent] Docs Agent release version 0.4.0

  - **Multi-modal support:** The Docs Agent CLI supports image, audio, and video files as part of a prompt to the Gemini model.
  - **Formatted output:** Select the format of Docs Agent CLI's responses with the `--response_type json` and `--plaintext` options.
  - **Autocomplete script:** The `autocomplete.sh` script is added to include Docs Agent CLI commands and options, making it easier and faster to use the Docs Agent CLI on a terminal.

* [Docs Agent] Docs Agent release version 0.4.0 (Files missed in the previous commit)
google · Oct 29, 2024 · 4203530
1 parent 213c6ce
commit 4203530

Showing 20 changed files with 2,819 additions and 2,049 deletions.
diff --git a/examples/gemini/python/docs-agent/README.md b/examples/gemini/python/docs-agent/README.md
@@ -26,10 +26,10 @@ check out the [Set up Docs Agent][set-up-docs-agent] section below.
 
 Docs Agent's `agent runtask` command allows you to run pre-defined chains of prompts,
 which are referred to as **tasks**. These tasks simplify complex interactions by defining
-a series of steps that the Docs Agent will execute. The tasks are defined in `.yaml` files
-stored in the [`tasks`][tasks-dir] directory of your Docs Agent project. The tasks are
+a series of steps that the Docs Agent CLI will execute. The tasks are defined in `.yaml`
+files stored in the [`tasks`][tasks-dir] directory of your Docs Agent project. The tasks are
 designed to be reusable and can be used to automate common workflows, such as generating
-release notes, updating documentation, or analyzing complex information.
+release notes, drafting overview pages, or analyzing complex information.
 
 A task file example:
 
@@ -101,6 +101,16 @@ The list below summarizes the tasks and features supported by Docs Agent:
   agent runtask --task DraftReleaseNotes
   ```
 
+- **Multi-modal support**: Docs Agent's `agent helpme` command can include image,
+  audio, and video files as part of a prompt to the Gemini 1.5 model, for example:
+
+  ```sh
+  agent helpme Provide a concise, descriptive alt text for this PNG image --file ./my_image_example.png
+  ```
+
+  You can use this feature for creating tasks as well. For example, see the
+  [DescribeImages][describe-images] task.
+
 For more information on Docs Agent's architecture and features,
 see the [Docs Agent concepts][docs-agent-concepts] page.
 
@@ -241,6 +251,13 @@ Clone the Docs Agent project and install dependencies:
    **Important**: From this point, all `agent` command lines below need to
    run in this `poetry shell` environment.
 
+5. (**Optional**) To enable autocompletion of Docs Agent commands and
+   flags in your shell environment, run the following command:
+
+   ```
+   source scripts/autocomplete.sh
+   ```
+
 ### 5. Edit the Docs Agent configuration file
 
 This guide uses the [open source Flutter documents][flutter-docs-src] as an example dataset,
@@ -458,3 +475,4 @@ Meggin Kearney (`@Meggin`), and Kyo Lee (`@kyolee415`).
 [chunking-process]: docs/chunking-process.md
 [new-15-mode]: docs/config-reference.md#app_mode
 [tasks-dir]: tasks/
+[describe-images]: tasks/describe-images-for-alt-text-task.yaml
diff --git a/examples/gemini/python/docs-agent/apps_script/drive_to_markdown.gs b/examples/gemini/python/docs-agent/apps_script/drive_to_markdown.gs
@@ -235,6 +235,6 @@ function convertDriveFolder(folderName, outputFolderName="", indexFile="") {
       insertRichText(sheet, md_chip, "E", row_number);
       insertRichText(sheet, folder_chip, "I", row_number);
     }
+    return gdoc_count, pdf_count, new_file_count, updated_file_count, unchanged_file_count
   }
-  return gdoc_count, pdf_count, new_file_count, updated_file_count, unchanged_file_count
 }
diff --git a/examples/gemini/python/docs-agent/docs/cli-reference.md b/examples/gemini/python/docs-agent/docs/cli-reference.md
@@ -258,6 +258,20 @@ For example:
 agent helpme write a concept doc covering all features in this project? --allfiles ~/my-project --new
 ```
 
+### Ask the model to print the output in JSON
+
+The command below prints the output from the model in JSON format:
+
+```sh
+agent helpme <REQUEST> --response_type json
+```
+
+For example:
+
+```sh
+agent helpme how do I cook pasta? --response_type json
+```
+
 ### Ask the model to run a pre-defined chain of prompts
 
 The command below runs a task (a sequence of prompts) defined in
@@ -297,6 +311,22 @@ For example:
 agent runtask --task IndexPageGenerator --custom_input ~/my_example/docs/development/
 ```
 
+### Ask the model to print the output in plain text
+
+By default, the `agent runtask` command uses Python's Rich console
+to format its output. You can disable it by using the `--plaintext`
+flag:
+
+```sh
+agent runtask --task <TASK> --plaintext
+```
+
+For example:
+
+```sh
+agent runtask --task DraftReleaseNotes --plaintext
+```
+
 ## Managing online corpora
 
 ### List all existing online corpora

diff --git a/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py b/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py
@@ -17,6 +17,7 @@
 """Docs Agent"""
 
 import typing
+import os, pathlib
 
 from absl import logging
 import google.api_core
@@ -573,6 +574,73 @@ def ask_content_model_to_fact_check_prompt(self, context: str, prev_response: st
     def generate_embedding(self, text, task_type: str = "SEMANTIC_SIMILARITY"):
         return self.gemini.embed(text, task_type)[0]
 
+    # Generate a response to an image
+    def ask_model_about_image(self, prompt: str, image):
+        if not prompt:
+            prompt = f"Describe this image:"
+        if self.context_model.startswith("models/gemini-1.5"):
+            try:
+                # Adding prompt in the beginning allows long contextual
+                # information to be added.
+                response = self.gemini.generate_content([prompt, image])
+            except google.api_core.exceptions.InvalidArgument:
+                return self.config.conditions.model_error_message
+        else:
+            logging.error(f"The {self.context_model} can't read an image.")
+            response = None
+            exit(1)
+        return response
+
+    # Generate a response to audio
+    def ask_model_about_audio(self, prompt: str, audio):
+        if not prompt:
+            prompt = f"Describe this audio clip:"
+        audio_size = os.path.getsize(audio)
+        # Limit is 20MB
+        if audio_size > 20000000:
+            logging.error(f"The audio clip {audio} is too large: {audio_size} bytes.")
+            exit(1)
+        # Get the mime type of the audio file and trim the . from the extension.
+        mime_type = "audio/" + pathlib.Path(audio).suffix[1:]
+        audio_clip = {
+            "mime_type": mime_type,
+            "data": pathlib.Path(audio).read_bytes()
+        }
+        if self.context_model.startswith("models/gemini-1.5"):
+            try:
+                response = self.gemini.generate_content([prompt, audio_clip])
+            except google.api_core.exceptions.InvalidArgument:
+                return self.config.conditions.model_error_message
+        else:
+            logging.error(f"The {self.context_model} can't read an audio clip.")
+            exit(1)
+        return response
+
+    # Generate a response to video
+    def ask_model_about_video(self, prompt: str, video):
+        if not prompt:
+            prompt = f"Describe this video clip:"
+        video_size = os.path.getsize(video)
+        # Limit is 2GB
+        if video_size > 2147483648:
+            logging.error(f"The video clip {video} is too large: {video_size} bytes.")
+            exit(1)
+        request_options = {
+            "timeout": 600
+        }
+        mime_type = "video/" + pathlib.Path(video).suffix[1:]
+        video_clip_uploaded = self.gemini.upload_file(video)
+        video_clip = self.gemini.get_file(video_clip_uploaded)
+        if self.context_model.startswith("models/gemini-1.5"):
+            try:
+                response = self.gemini.generate_content([prompt, video_clip],
+                                                        request_options=request_options)
+            except google.api_core.exceptions.InvalidArgument:
+                return self.config.conditions.model_error_message
+        else:
+            logging.error(f"The {self.context_model} can't see video clips.")
+            exit(1)
+        return response
 
 # Function to give an embedding function for gemini using an API key
 def embedding_function_gemini_retrieval(api_key, embedding_model: str):
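
The audio and video helpers in this hunk check a file's size against a limit and build a MIME type from the file extension with the leading dot trimmed. A minimal sketch of those two steps in isolation is shown below. The helper names `mime_type_from_suffix` and `is_within_limit` are hypothetical and not part of Docs Agent, and lowercasing the extension is an assumption added for illustration. The limits are the ones stated in the hunk's comments.

```python
import pathlib

# Size limits as stated in the comments of the hunk above.
AUDIO_LIMIT_BYTES = 20_000_000      # "Limit is 20MB"
VIDEO_LIMIT_BYTES = 2_147_483_648   # "Limit is 2GB"


def mime_type_from_suffix(path: str, family: str) -> str:
    """Build a MIME type such as "audio/mp3" from a file's extension.

    Hypothetical helper for illustration; not part of Docs Agent.
    """
    suffix = pathlib.Path(path).suffix  # e.g. ".mp3", or "" if there is none
    if not suffix:
        raise ValueError(f"{path!r} has no file extension")
    # suffix[1:] drops the leading dot; suffix[:1] would return only the dot.
    # Lowercasing is an assumption, not something the hunk does.
    return f"{family}/{suffix[1:].lower()}"


def is_within_limit(size_bytes: int, limit_bytes: int) -> bool:
    """Mirror the hunk's `size > limit` rejection: sizes up to the limit pass."""
    return size_bytes <= limit_bytes
```

Whether the resulting string is a MIME type the model accepts depends on the file format, so a caller may still want to validate it against the formats the Gemini API documents.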

diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/README.md b/examples/gemini/python/docs-agent/docs_agent/interfaces/README.md
@@ -101,10 +101,16 @@ from your `$HOME` directory.
    poetry shell
    ```
 
-   Entering the `poetry shell` environment is **required** for
-   running the `agent` command.
+   **Important**: You must always enter the `poetry shell` environment
+   to run the `agent` command.
 
-2. Run the `agent helpme` command, for example:
+2. Enable autocomplete for Docs Agent CLI options in your environment:
+
+   ```
+   source scripts/autocomplete.sh
+   ```
+
+3. Run the `agent helpme` command, for example:
 
    ```
    agent helpme how do I cook pasta?
@@ -113,7 +119,7 @@ from your `$HOME` directory.
    This command returns the Gemini model's response to your input prompt
    `how do I cook pasta?`.
 
-3. View the list of Docs Agent tasks available in your setup:
+4. View the list of Docs Agent tasks available in your setup:
 
    ```
    agent runtask
@@ -122,7 +128,7 @@ from your `$HOME` directory.
    This command prints a list of Docs Agent tasks that you can run.
    (See the `tasks` directory in your local Docs Agent setup.)
 
-4. Run the `agent runtask` command, for example:
+5. Run the `agent runtask` command, for example:
 
    ```
    agent runtask --task IndexPageGenerator