Confidence threshold for word embedding #225

Closed

TatendaMugadza

  • Added confidence threshold for word embedding results
  • Improved word embedding preprocessing
  • Updated README for embedding generation

Purpose

The aaq search was returning results even when the user had entered meaningless or invalid text.
https://praekelt.leankit.com/card/31512186572175

Approach

Added a confidence threshold so that any returned result with a similarity score below 25% is filtered out.
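As a rough illustration of the approach above, here is a minimal sketch of a threshold filter. The names `SearchResult`, `filter_by_confidence` and `CONFIDENCE_THRESHOLD` are hypothetical and are not taken from this PR's diff; it also assumes similarity scores are expressed on a 0 to 1 scale, so 25% becomes 0.25.

```python
from typing import List, NamedTuple

# Assumption: similarity is a float between 0 and 1, so 25% is 0.25.
CONFIDENCE_THRESHOLD = 0.25


class SearchResult(NamedTuple):
    """Hypothetical container for a page title and its similarity score."""
    title: str
    similarity: float


def filter_by_confidence(
    results: List[SearchResult],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> List[SearchResult]:
    """Keep only results whose similarity is at or above the threshold."""
    return [result for result in results if result.similarity >= threshold]


results = [
    SearchResult("Help with registration", 0.62),
    SearchResult("Unrelated page", 0.11),
    SearchResult("Borderline page", 0.25),
]
kept = filter_by_confidence(results)
```

With this sketch, a meaningless search term whose best match scores below the threshold yields an empty list rather than a set of unrelated pages.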

Open Questions and Pre-Merge TODOs

Learning

From analysing how the similarity scores of the results vary across different search inputs, a 25% similarity threshold gave a balance between returning relevant answers and filtering out completely unrelated responses.

Blog Posts

KaitCrawford
KaitCrawford previously approved these changes Jan 15, 2024

@TatendaMugadza two small comments but it looks good 👍

assert content["count"] == 0
# it should not return search term matching pages if they are unpublished
page1.unpublish()
response = uclient.get("/api/v2/pages/?s=#(&whatsapp=true")

This is the same search term that previously didn't return any results. You probably want to use s=help here for the test to be valid

Just to add to this, I don't think we should be using the # character in the URL in these tests. It's the fragment identifier character, and it needs to be escaped if you want to use it in URL parameters. Otherwise everything after the # could be interpreted as a fragment identifier and not as part of the query string: https://www.rfc-editor.org/rfc/rfc3986.html#section-3.5
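To illustrate the point, here is a small sketch using Python's standard `urllib.parse` module. It shows how a generic URL parser splits the unescaped URL from the test, and one way to escape the search term. How the test client in this project actually handles the raw `#` is not confirmed here, so treat this only as a demonstration of the RFC 3986 splitting rules.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The URL as written in the test: the "#" starts the fragment.
raw_url = "/api/v2/pages/?s=#(&whatsapp=true"
raw_parts = urlsplit(raw_url)
# The query is cut off at "#", so "whatsapp=true" lands in the fragment.
raw_query = raw_parts.query
raw_fragment = raw_parts.fragment

# Escaping the search term keeps both parameters in the query string.
escaped_url = "/api/v2/pages/?" + urlencode({"s": "#(", "whatsapp": "true"})
escaped_params = parse_qs(urlsplit(escaped_url).query)
```

Building the query string with `urlencode` percent-encodes `#` as `%23`, so the whole search term and the `whatsapp` parameter both stay in the query string.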

@rudigiesler

I'm closing this PR since word embeddings are no longer part of the CMS.
