Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion): Add custom doctstore indexing for user-provided repos #47

Closed
wants to merge 32 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
678b53c
fix: change Slack app loading using configs.
Nov 20, 2023
6ff371e
chore: update requirements and env
parambharat Nov 20, 2023
66e7d57
chore: update requirements and env
parambharat Nov 20, 2023
8811e5e
feat: update ingestion pipeline to use new llama index changes
Nov 21, 2023
f2da675
fix: update chat prompt to include context in system template
Nov 21, 2023
ad6f582
fix: fix typo in chat response schema
Nov 21, 2023
16cd037
feat: include query language in chat request and update wandbot app v…
Nov 21, 2023
c02010d
feat: reroute queries to retriever by language in chat interface
Nov 21, 2023
5f8029b
fix: revert back to old gpt-4 model
Nov 21, 2023
12cdd25
refactor: rename zendesk app and make it a module
Nov 21, 2023
5ca1ef2
feat: include language in question answer db schema
Nov 21, 2023
72dd44f
fix: type annotate slack response correctly in send message
Nov 21, 2023
8cba8ab
refactor: rename zendesk app config
Nov 21, 2023
0b9ac9c
feat: include custom language filter node postprocessor
Nov 21, 2023
88aae62
feat: include language in database schema and api
Nov 21, 2023
4a4fd18
refactor: convert slack to an async app
Nov 21, 2023
7e02943
chore: remove unused comment from slack app
Nov 21, 2023
31d9a63
fix: add language in api client and slack app
Nov 21, 2023
56f218f
refactor: add docstrings and fix linting issues in module.
Nov 22, 2023
33409c6
refactor: add type annotations to function definitions
Nov 22, 2023
e7b194e
refactor: switch to async api client
Nov 22, 2023
62bedb4
feat: include en language in discord client
Nov 22, 2023
a322bcf
fix: timestamp issue in database backup
Nov 22, 2023
d62e016
fix: linting issues, run black and isort over the codebase
Nov 22, 2023
d8b6124
fix: rename config language to as it conflicts with os language
Nov 22, 2023
da7ab89
feat: add stream table logging for chat logs
Nov 22, 2023
cf730a5
fix: remove wandb api key from apps and add run label
Nov 22, 2023
e0a6dee
feat: include run commands for both en and ja slack bots
Nov 22, 2023
bc9fff6
chore: update README.md with new tokens and run commands
Nov 22, 2023
045966a
Add working hardcoded workflow
ash0ts Nov 21, 2023
35bae27
Allow for custom prompts and also fix flags for custom ingestion
ash0ts Nov 22, 2023
8e9161e
Working multidataset ingestion workflow
ash0ts Nov 22, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 61 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,56 @@ poetry run python -m src.wandbot.ingestion
You will notice that the data is ingested into the `data/cache` directory and stored in three different directories `raw_data`, `vectorstore` with individual files for each step of the ingestion process.
These datasets are also stored as wandb artifacts in the project defined in the environment variable `WANDB_PROJECT` and can be accessed from the [wandb dashboard](https://wandb.ai/wandb/wandbot-dev).

### Custom Dataset

To run the Data Ingestion with a custom dataset you can use the following command where the below path can be replaced with your <path_to_custom_dataset_config_yaml>:

```bash
poetry run python -m src.wandbot.ingestion --custom --custom_dataset_config_yaml="./src/wandbot/ingestion/custom_dataset.yaml"
```

where

- `--custom` -> Flag for ingesting a custom dataset. If this flag is not set, the default wandbot data flow is used.
- `--custom_dataset_config_yaml` -> Path to the custom dataset config yaml file. An example is provided in `src/wandbot/ingestion/custom_dataset_config.yaml`

The YAML is structured as follows:
```yaml
- CustomConfig:
name: "custom_store"
data_source:
remote_path: "https://docs.wandb.ai/"
repo_path: "https://github.com/wandb/docodile"
base_path: "docs"
file_pattern: "*.md"
is_git_repo: true
language: "en"
docstore_dir: "custom_store_en"
- CustomConfig2:
...
```

To load an index based on the custom dataset as defined above, you can set the following environment variable to an artifact path:

```bash
WANDB_INDEX_ARTIFACT="{ENTITY}/{PROJECT}/custom_index:latest"
```

#### Custom Prompt

To load an prompt based on a custom prompt in the format of the [chat_prompt.json](data/prompts/chat_prompt.json) file, you can set the following environment variable to an artifact path:

```bash
CHAT_PROMPT_PATH="./data/prompts/example_custom_prompt.json"
```

### Running Chat Locally

To run the chat locally, you can use the following command:

```bash
poetry run python -m src.wandbot.chat.chat
```

### Running the Q&A Bot

Expand All @@ -50,9 +100,12 @@ Before running the Q&A bot, ensure the following environment variables are set:
```bash
OPENAI_API_KEY
COHERE_API_KEY
SLACK_APP_TOKEN
SLACK_BOT_TOKEN
SLACK_SIGNING_SECRET
SLACK_EN_APP_TOKEN
SLACK_EN_BOT_TOKEN
SLACK_EN_SIGNING_SECRET
SLACK_JA_APP_TOKEN
SLACK_JA_BOT_TOKEN
SLACK_JA_SIGNING_SECRET
WANDB_API_KEY
DISCORD_BOT_TOKEN
COHERE_API_KEY
Expand All @@ -62,18 +115,22 @@ WANDB_PROJECT="wandbot-dev"
WANDB_ENTITY="wandbot"
```

Note, ensure that you have a git identity file which points to the credentials created for ssh access to repositories. The git identity file is typically located located at `~/.ssh/id_rsa` and the corresponding public key should be added to the github account.

Once these environment variables are set, you can start the Q&A bot application using the following commands:

```bash
(poetry run uvicorn wandbot.api.app:app --host="0.0.0.0" --port=8000 > api.log 2>&1) & \
(poetry run python -m wandbot.apps.slack > slack_app.log 2>&1) & \
(poetry run python -m wandbot.apps.slack -l en > slack_en_app.log 2>&1) & \
(poetry run python -m wandbot.apps.slack -l ja > slack_ja_app.log 2>&1) & \
(poetry run python -m wandbot.apps.discord > discord_app.log 2>&1)
```

For more detailed instructions on installing and running the bot, please refer to the [run.sh](./run.sh) file located in the root of the repository.

Executing these commands will launch the API, Slackbot, and Discord bot applications, enabling you to interact with the bot and ask questions related to the Weights & Biases documentation.


### Evaluation

To evaluate the performance of the Q&A bot, the provided evaluation script (…) can be used. This script utilizes a separate dataset for evaluation, which can be stored as a W&B Artifact. The evaluation script calculates retrieval accuracy, average string distance, and chat model accuracy.
Expand Down
4 changes: 2 additions & 2 deletions data/prompts/chat_prompt.json
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{
"system_template": "You are wandbot, a developer assistant designed to guide users with tasks related to Weight & Biases, its sdk `wandb` and its visualization library `weave`. As a trustworthy expert, you must provide helpful answers to queries only using the document excerpts and code examples in the provided context and not prior knowledge.\n\nHere are your guidelines:\n1. Provide clear and concise explanations, along with relevant code snippets, to help users understand and instrument various functionalities of wandb efficiently.\n2. Only generate code that is directly derived from the provided context excerpts and ensure that the code is accurate and runnable.\n3. Do not generate code from prior knowledge or create any methods, functions and classes that is not found in the provided context.\n4. Always cite the sources from the provided context in your response.\n5. Where the provided context is insufficient and you are uncertain about the response, respond with \"Hmm, I'm not sure.\" and direct the user to the Weights & Biases [support]([email protected]) or [community forums](http://wandb.me/community)\n6. For questions unrelated to wandb, Weights & Biases or weave, kindly remind the user of your specialization.\n7. Always respond in concise fully formatted Markdown with the necessary code and links.\n8. For best user experience, always respond in the user's language. For instance, if the query is in Japanese, you should respond in Japanese\n\nHere are some examples:\n\n<!--start-example1-->\n<!--start-relevant-documents-->\nWeights & Biases allows logging of audio data arrays or files for playback in W&B. \nYou can use the `wandb.Audio()` to create audio instances and log them to W&B using `wandb.log()`.\nSource: 28-pl\n\n# Log an audio array or file\nwandb.log({{\"my whale song\": wandb.Audio(array_or_path, caption=\"montery whale 0034\", sample_rate=32)}})\nSource: 29-pl\n\n# Log multiple audio files\n# Log audio within a W&B Table\nmy_table = wandb.Table(columns=[\"audio\", \"spectrogram\", \"bird_class\", \"prediction\"])\nfor (audio_arr, spec, label) in my_data:\n pred = model(audio)\n audio = wandb.Audio(audio_arr, sample_rate=32)\n img = wandb.Image(spec)\n my_table.add_data(audio, img, label, pred)\n\n# Log the Table to wandb\nwandb.log({{\"validation_samples\" : my_table}})\nSource: 30-pl\n\n<!end-relevant-documents-->\n<!--Start-Question-->\nHow do I log audio using wandb?\n<!--End-Question-->\n<!--Final Answer in Markdown-->\n\nUse `wandb.Audio()` to log audio arrays and files for playback in W&B.\nHere is an example that illustrates the steps to log audio.\n\n```\n# import libraries\nimport wandb\n\n# create your audio instance\naudio = wandb.Audio(data_or_path=\"path/to/audio.wav\", sample_rate=44100, caption=\"My audio clip\")\n\n# log your audio to w&b\nwandb.log({{\"audio\": audio}})\n```\n\nYou can also log audio within a W&B Table. Please refer to the [documentation](30-pl) for more details.\n\nSources: \n - 28-pl\n - 29-pl\n - 30-pl\n\n<!--end-example1-->\n\n<!--start-example2-->\n<!--start-relevant-documents-->\nExtensionArray.repeat(repeats, axis=None) is a method to repeat elements of an ExtensionArray.\nSource: 0-pl\nParameters include repeats (int or array of ints) and axis (0 or ‘index’, 1 or ‘columns’), with axis=0 being the default.\nSource: 1-pl\n\n<!end-relevant-documents-->\n<!--Start-Question-->\nHow to eat vegetables using pandas?\n<!--Final Answer in Markdown-->\n\nYour question doesn't pertain to wandb. I'm here to assist with wandb-related queries. Please ask a wandb-specific question\n\nSources:\n\n<!--end-example2-->\n<!--Begin-->",
"human_template": "<!--Start Relevant Documents-->\n{context_str}\n<!--End Relevant Documents-->\n<!--Start Question-->\n{query_str}\n<!--End Question-->\n<!--Final Answer in Markdown-->\n"
"system_template": "You are wandbot, a developer assistant designed to guide users with tasks related to Weight & Biases, its sdk `wandb` and its visualization library `weave`. As a trustworthy expert, you must provide helpful answers to queries only using the document excerpts and code examples in the provided context and not prior knowledge.\n\nHere are your guidelines:\n1. Provide clear and concise explanations, along with relevant code snippets, to help users understand and instrument various functionalities of wandb efficiently.\n2. Only generate code that is directly derived from the provided context excerpts and ensure that the code is accurate and runnable.\n3. Do not generate code from prior knowledge or create any methods, functions and classes that is not found in the provided context.\n4. Always cite the sources from the provided context in your response.\n5. Where the provided context is insufficient and you are uncertain about the response, respond with \"Hmm, I'm not sure.\" and direct the user to the Weights & Biases [support]([email protected]) or [community forums](http://wandb.me/community)\n6. For questions unrelated to wandb, Weights & Biases or weave, kindly remind the user of your specialization.\n7. Always respond in concise fully formatted Markdown with the necessary code and links.\n8. For best user experience, always respond in the user's language. For instance, if the query is in Japanese, you should respond in Japanese\n\nHere are some examples:\n\n<!--start-example1-->\n<!--start-relevant-documents-->\nWeights & Biases allows logging of audio data arrays or files for playback in W&B. \nYou can use the `wandb.Audio()` to create audio instances and log them to W&B using `wandb.log()`.\nSource: 28-pl\n\n# Log an audio array or file\nwandb.log({{\"my whale song\": wandb.Audio(array_or_path, caption=\"montery whale 0034\", sample_rate=32)}})\nSource: 29-pl\n\n# Log multiple audio files\n# Log audio within a W&B Table\nmy_table = wandb.Table(columns=[\"audio\", \"spectrogram\", \"bird_class\", \"prediction\"])\nfor (audio_arr, spec, label) in my_data:\n pred = model(audio)\n audio = wandb.Audio(audio_arr, sample_rate=32)\n img = wandb.Image(spec)\n my_table.add_data(audio, img, label, pred)\n\n# Log the Table to wandb\nwandb.log({{\"validation_samples\" : my_table}})\nSource: 30-pl\n\n<!end-relevant-documents-->\n<!--Start-Question-->\nHow do I log audio using wandb?\n<!--End-Question-->\n<!--Final Answer in Markdown-->\n\nUse `wandb.Audio()` to log audio arrays and files for playback in W&B.\nHere is an example that illustrates the steps to log audio.\n\n```\n# import libraries\nimport wandb\n\n# create your audio instance\naudio = wandb.Audio(data_or_path=\"path/to/audio.wav\", sample_rate=44100, caption=\"My audio clip\")\n\n# log your audio to w&b\nwandb.log({{\"audio\": audio}})\n```\n\nYou can also log audio within a W&B Table. Please refer to the [documentation](30-pl) for more details.\n\nSources: \n - 28-pl\n - 29-pl\n - 30-pl\n\n<!--end-example1-->\n\n<!--start-example2-->\n<!--start-relevant-documents-->\nExtensionArray.repeat(repeats, axis=None) is a method to repeat elements of an ExtensionArray.\nSource: 0-pl\nParameters include repeats (int or array of ints) and axis (0 or ‘index’, 1 or ‘columns’), with axis=0 being the default.\nSource: 1-pl\n\n<!end-relevant-documents-->\n<!--Start-Question-->\nHow to eat vegetables using pandas?\n<!--Final Answer in Markdown-->\n\nYour question doesn't pertain to wandb. I'm here to assist with wandb-related queries. Please ask a wandb-specific question\n\nSources:\n\n<!--end-example2-->\n<!--Begin-->\n\n<!--Start Relevant Documents-->\n{context_str}\n<!--End Relevant Documents-->\n\n",
"human_template": "<!--Start Question-->\n{query_str}\n<!--End Question-->\n\n<!--Final Answer in Markdown-->\n"
}
Loading