diff --git a/docs/howtos/customisations/gcp-vertexai.ipynb b/docs/howtos/customisations/gcp-vertexai.ipynb
new file mode 100644
index 000000000..7ddf68f15
--- /dev/null
+++ b/docs/howtos/customisations/gcp-vertexai.ipynb
@@ -0,0 +1,416 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a6199b61-8084-4a2f-8488-2ddc4c260776",
+   "metadata": {},
+   "source": [
+    "# Using Vertex AI\n",
+    "\n",
+    "Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform. You get access to models like PaLM 2, which can be used to score your RAG responses and pipelines with Ragas instead of the default OpenAI models.\n",
+    "\n",
+    "This tutorial will show you how you can use PaLM 2 with Ragas for evaluation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79986e4d-17e4-4e88-ad2e-ca2b34a63de9",
+   "metadata": {},
+   "source": [
+    ":::{Note}\n",
+    "This guide is for folks who are using Google Cloud Vertex AI endpoints. Check the [evaluation guide](../../getstarted/evaluation.md) if you're using OpenAI endpoints.\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8bd757ae-51ee-4527-b2fb-0d0b88285939",
+   "metadata": {},
+   "source": [
+    "## Load Sample Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "0d3e6c99-c19c-44a1-8f05-4bde2de30866",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b0c4c79a44334e1198e1b7283e4532b5",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "DatasetDict({\n",
+       "    baseline: Dataset({\n",
+       "        features: ['question', 'ground_truths', 'answer', 'contexts'],\n",
+       "        num_rows: 30\n",
+       "    })\n",
+       "})"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# data\n",
+    "from datasets import load_dataset\n",
+    "\n",
+    "fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
+    "fiqa_eval"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4e67daaa-60e3-4584-8ec6-944c3c5a1a0c",
+   "metadata": {},
+   "source": [
+    "Now let's import the metrics we are going to use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "42081210-3c0d-4e27-974a-ef152364a4ab",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.metrics import (\n",
+    "    context_precision,\n",
+    "    answer_relevancy,  # AnswerRelevancy\n",
+    "    faithfulness,\n",
+    "    context_recall,\n",
+    ")\n",
+    "from ragas.metrics.critique import harmfulness\n",
+    "\n",
+    "# list of metrics we're going to use\n",
+    "metrics = [\n",
+    "    faithfulness,\n",
+    "    answer_relevancy,\n",
+    "    context_recall,\n",
+    "    context_precision,\n",
+    "    harmfulness\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90fa19c3-1356-412f-a39d-f9907c69a80e",
+   "metadata": {},
+   "source": [
+    "By default Ragas uses `ChatOpenAI` for evaluations; let's swap that out with `ChatVertexAI`. We also need to change the embeddings used for evaluations from `OpenAIEmbeddings` to `VertexAIEmbeddings` for the metrics that need them, which in our case is `answer_relevancy`.\n",
+    "\n",
+    "Now, in order to use the new `ChatVertexAI` LLM instance with Ragas metrics, you have to create a new instance of `RagasLLM` using the `ragas.llms.LangchainLLM` wrapper. It's a simple wrapper that makes Langchain LLM/Chat instances compatible with how Ragas metrics will use them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "09ef783b-12dd-40e8-bdbf-4744e41038dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import google.auth\n",
+    "from langchain.chat_models import ChatVertexAI\n",
+    "from langchain.embeddings import VertexAIEmbeddings\n",
+    "\n",
+    "from ragas.llms import LangchainLLM\n",
+    "\n",
+    "config = {\n",
+    "    \"project_id\": \"tmp-project-404003\",\n",
+    "}\n",
+    "\n",
+    "# authenticate to GCP\n",
+    "creds, _ = google.auth.default(quota_project_id=config[\"project_id\"])\n",
+    "\n",
+    "# create Langchain LLM and Embeddings\n",
+    "chat = ChatVertexAI(credentials=creds)\n",
+    "vertexai_embeddings = VertexAIEmbeddings(credentials=creds)\n",
+    "\n",
+    "# create a wrapper around the LLM so Ragas metrics can use it\n",
+    "ragas_vertexai_llm = LangchainLLM(chat)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90e739a1-dbb4-42ce-adc2-b8cf88ae7a58",
+   "metadata": {},
+   "source": [
+    "Now let's swap out the defaults with the Vertex AI LLM and Embeddings we created."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "280464c1-dbef-4200-85ec-dc2a4bffee14",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for m in metrics:\n",
+    "    # change the LLM the metric uses\n",
+    "    m.llm = ragas_vertexai_llm\n",
+    "\n",
+    "    # if the metric also needs embeddings, swap in the Vertex AI embeddings\n",
+    "    if hasattr(m, \"embeddings\"):\n",
+    "        m.embeddings = vertexai_embeddings"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "27a34b9d-796e-4025-ba8e-a799f6ab3c6b",
+   "metadata": {},
+   "source": [
+    "## Evaluation\n",
+    "\n",
+    "Running the evaluation is as simple as calling `evaluate` on the `Dataset` with the metrics of your choice."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "5e739223-4e34-4dc0-9892-4625fdea7489",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "evaluating with [faithfulness]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|█████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.53s/it]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "evaluating with [answer_relevancy]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|█████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.88s/it]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "evaluating with [context_recall]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|█████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.71s/it]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "evaluating with [context_precision]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.05it/s]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "evaluating with [harmfulness]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.02it/s]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'faithfulness': 1.0000, 'answer_relevancy': 0.9113, 'context_recall': 0.0000, 'context_precision': 0.0000, 'harmfulness': 0.0000}"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ragas import evaluate\n",
+    "\n",
+    "import nest_asyncio\n",
+    "\n",
+    "# nest_asyncio is only needed when running inside a Jupyter notebook;\n",
+    "# remove it when running as a regular script.\n",
+    "nest_asyncio.apply()\n",
+    "\n",
+    "result = evaluate(\n",
+    "    fiqa_eval[\"baseline\"].select(range(1)),  # using 1 row as an example due to quota constraints\n",
+    "    metrics=metrics,\n",
+    ")\n",
+    "\n",
+    "result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "960f88fc-c90b-4ac6-8e97-252edd2f1661",
+   "metadata": {},
+   "source": [
+    "And there you have it, all the scores you need, with each metric measuring a different part of your pipeline.\n",
+    "\n",
+    "Now, if you want to dig into the results and find the examples where your pipeline performed poorly or really well, you can easily convert the result into a pandas DataFrame and use your standard analytics tools on it!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "bc72c682-b0c0-4314-9da5-9b22aae722a4",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "|   | question | contexts | answer | ground_truths | faithfulness | answer_relevancy | context_recall | context_precision | harmfulness |\n",
+       "|---|----------|----------|--------|---------------|--------------|------------------|----------------|-------------------|-------------|\n",
+       "| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 1.0 | 0.911332 | 0.0 | 0.0 | 0 |\n",
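The table above is the evaluation result rendered as a pandas DataFrame. A minimal sketch of that last step, assuming the result object returned by `evaluate` exposes the usual `to_pandas()` helper, could look like this:

```python
# convert the evaluation result into a pandas DataFrame for further analysis
# (assumes the result object returned by ragas.evaluate provides `to_pandas()`)
df = result.to_pandas()

# e.g. surface the rows where the pipeline was least faithful
df.sort_values("faithfulness").head()
```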