Confidence threshold for word embedding #225

TatendaMugadza · 2024-01-11T11:29:53Z

Added a confidence threshold for word embedding results
Improved word embedding preprocessing
Updated the README for embedding generation

Purpose

The aaq search was returning results even when the user had entered meaningless or invalid text.
https://praekelt.leankit.com/card/31512186572175

Approach

Added a confidence threshold that filters out any returned results with a similarity score below 25%.
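As a rough illustration of the approach, the sketch below filters candidate results by a similarity cut-off. It assumes the similarity score is a cosine similarity between embedding vectors; the names `CONFIDENCE_THRESHOLD`, `cosine_similarity`, and `filter_by_confidence` are hypothetical and not taken from this PR.

```python
import math

# Assumption: the 25% threshold from the description, expressed as a fraction.
CONFIDENCE_THRESHOLD = 0.25


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A query with no usable embedding should match nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_confidence(query_vector, candidates, threshold=CONFIDENCE_THRESHOLD):
    """Keep (title, score) pairs scoring at or above the threshold, best match first.

    `candidates` is a list of (title, vector) pairs (hypothetical shape).
    """
    scored = [(title, cosine_similarity(query_vector, vec)) for title, vec in candidates]
    kept = [(title, score) for title, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With a filter of this shape, a query whose best match scores under the threshold yields an empty result list instead of a list of weak matches.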

Open Questions and Pre-Merge TODOs

Learning

Analysing how different search inputs compared with their results, in terms of similarity score, showed that a 25% threshold struck a balance between returning relevant answers and filtering out completely unrelated responses.

Blog Posts


KaitCrawford

@TatendaMugadza two small comments but it looks good 👍

README.md

KaitCrawford · 2024-01-15T07:42:04Z

home/tests/test_api.py

+        assert content["count"] == 0
+        # it should not return search term matching pages if they are unpublished
+        page1.unpublish()
+        response = uclient.get("/api/v2/pages/?s=#(&whatsapp=true")


This is the same search term that previously didn't return any results. You probably want to use s=help here for the test to be valid

Just to add to this, I don't think we should be using the # character in the URL for our tests for this. It's the fragment identifier character, and actually needs to be escaped if you want to use it in URL parameters, otherwise it could result in everything after the # being interpreted as a fragment identifier and not as part of the query string: https://www.rfc-editor.org/rfc/rfc3986.html#section-3.5
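To make the fragment issue concrete, here is a small sketch using only Python's standard `urllib.parse`. It parses the URL from the test as written and then a percent-encoded version of it. The helper name `split_query` is hypothetical, and how a particular test client treats an unescaped `#` may differ from this standard-library parse.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def split_query(url):
    """Return (query parameters, fragment) as parsed by the standard library."""
    parts = urlsplit(url)
    return parse_qs(parts.query), parts.fragment


# The URL as written in the test: everything after '#' is parsed as the fragment,
# so the query string is just "s=" and the whatsapp parameter is lost.
raw_url = "/api/v2/pages/?s=#(&whatsapp=true"
raw_params, raw_fragment = split_query(raw_url)

# Percent-encoding the search term keeps '#' inside the query string.
encoded_url = "/api/v2/pages/?" + urlencode({"s": "#(", "whatsapp": "true"})
encoded_params, encoded_fragment = split_query(encoded_url)
```

Here `encoded_url` is `/api/v2/pages/?s=%23%28&whatsapp=true`, which decodes back to the intended `s` and `whatsapp` parameters.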

updated readme

rudigiesler · 2024-12-18T15:12:38Z

I'm closing this PR since word embeddings are no longer a part of the CMS

confidence threshold to word embedding

46c19d8

- Improved word embedding preprocessing - Updated read me for embedding generation

TatendaMugadza requested a review from KaitCrawford January 11, 2024 11:29

KaitCrawford previously approved these changes Jan 15, 2024

updated test for unpublished page

4781120

updated readme

TatendaMugadza dismissed KaitCrawford’s stale review via 4781120 January 15, 2024 08:17

removed # in search test

034d753

rudigiesler closed this Dec 18, 2024


rudigiesler Jan 15, 2024
