fix: update endpoint path for agent retriever streaming response
Signed-off-by: Palaniappan R <[email protected]>
91bf80f
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 2.0.3 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 10.31it/s]
Evaluating: 100%|██████████| 100/100 [19:05<00:00, 11.46s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:40, 2.49test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7165138888888889
Contextual Recall 0.8484999999999999
Hallucination 0.5114780701754387
Metric Passrates:
Contextual Precision 0.68
Contextual Recall 0.8
Hallucination 0.64
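
For readers comparing the two summaries above, here is a minimal sketch of how an average score and a pass rate could be derived from per-test-case scores. It is not the evaluation script used in this run: the `summarize` helper, the 0.5 threshold, and the sample scores are all hypothetical. It also assumes that a lower Hallucination score is better, so that metric passes when the score is at or below the threshold, while the contextual metrics pass at or above it.

```python
from statistics import mean


def summarize(scores, threshold=0.5, lower_is_better=False):
    """Return (average score, pass rate) for one metric.

    `threshold` and `lower_is_better` are illustrative assumptions,
    not values taken from the run above.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if lower_is_better:
        # e.g. Hallucination: a case passes when the score is low enough
        passed = [s <= threshold for s in scores]
    else:
        # e.g. Contextual Precision / Recall: higher scores pass
        passed = [s >= threshold for s in scores]
    return mean(scores), sum(passed) / len(scores)


# Made-up per-test-case scores, for illustration only (not from this run).
example = {
    "Contextual Precision": ([1.0, 0.5, 0.25, 0.75], False),
    "Hallucination": ([0.0, 0.5, 1.0, 0.75], True),
}

for name, (scores, lower) in example.items():
    avg, rate = summarize(scores, lower_is_better=lower)
    print(f"{name}: average={avg:.3f} passrate={rate:.2f}")
```

Under these assumptions the average and the pass rate can diverge, as they do in the report: a few very low or very high scores move the average without changing how many cases cross the threshold.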