Rank for results #16

shameelabdulla · 2013-09-20T10:06:54Z

I've been checking out the fuzzily gem and it helps greatly. It would be great if there were a rank for the suggestions returned. I know that the best suggestion is the first result. If there were a way to give a score to each suggestion (say 0 => exact match, 0.2 => deviates to some extent, 0.9 => deviates to a great extent), it would be really great.

mezis · 2013-09-21T20:36:49Z

Hi @shameelabdulla, thanks for the suggestion.

I think the easiest way to do this is after the fact, on the (limited) list of results. You could use an implementation of Levenshtein to calculate the similarity of the result strings and your input string.

If you do, I'd welcome an add-on to Fuzzily that does this!
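A minimal sketch of that after-the-fact ranking, assuming the results have already been reduced to plain name strings. The helper names `levenshtein`, `deviation` and `rank_by_deviation` are hypothetical and not part of Fuzzily; dividing by the longer string's length is one possible way to get the 0 (exact) to 1 (very different) scale asked for above.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: 0.0 means exact match, values near 1.0 mean very different.
# Comparison is case-insensitive (an assumption, not Fuzzily behaviour).
def deviation(query, candidate)
  q = query.downcase
  c = candidate.downcase
  longest = [q.length, c.length].max
  return 0.0 if longest.zero?

  levenshtein(q, c) / longest.to_f
end

# Hypothetical helper: pair each name with its deviation, best match first.
def rank_by_deviation(query, names)
  names.map { |name| [name, deviation(query, name)] }.sort_by { |_, d| d }
end
```

Calling `rank_by_deviation("York", ["New York", "York"])` would put "York" first with a deviation of 0.0.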

shameelabdulla · 2013-09-22T04:52:24Z

Oh great!! I'll add it.


airblade · 2013-09-30T09:40:57Z

This works for me: in Fuzzily::Model::Rails(2|3):

  def _matches_for_trigrams(trigrams)
    self.
      select('owner_id, owner_type, count(*) AS matches, MAX(score) AS score').
      group('owner_id, owner_type').
      order('matches DESC, score ASC').
      with_trigram(trigrams).
-     map(&:owner)
+     map do |t|
+       t.owner.tap do |o|
+         o.instance_eval "def fuzzily_score; #{t.score}; end"
+       end
+     end
  end

Would you like a patch for this?

shameelabdulla · 2013-09-30T12:57:20Z

It would be great if you could submit that as a patch.


airblade · 2013-09-30T13:00:13Z

Here's some code: airblade@8b41888.

It's not as clean as my diff above due to having to work around the problem in #18.

shameelabdulla · 2013-10-01T10:18:50Z

Hi Andy,
Thanks. However, this score does not seem to work. If you check the scores returned, they do not correspond to the level of matching.


airblade · 2013-10-01T10:21:27Z

I had assumed that score corresponded to the quality of the match. Can you give a couple of examples? I'm not particularly familiar with how trigram matching works.

airblade · 2013-10-01T10:39:08Z

Looking at the code, a trigram's score is simply the length of the word from which it came. The fuzzy finder orders its results by matches DESC, score ASC, where matches is the number of trigrams which match. So the more trigrams which match, the better. And when you have a tie, prefer shorter words.

We need a way to normalise the quality of the matches (so they're comparable across models). How about modifying my code above like this:

o.instance_eval "def fuzzily_score; #{t.matches / t.score.to_f}; end"

– although that doesn't really normalise the results to between 0 and 1.

airblade · 2013-10-01T11:52:43Z

For normalising the score, how about:

(number of matches / number of trigrams for search text) / (1 + abs(score - score of search text))

Let's say the search text is vogue. Its score is 5 and it has 6 trigrams ('**v', '*vo', 'vog', 'ogu', 'gue', 'ue*').
For a result with 0 matches, the normalised score = 0.
For an exact match, the normalised score = (6 / 6) / (1 + 0) = 1.
For two partial hits, each with three trigram matches, with scores 6 and 8:

first normalised score = (3 / 6) / (1 + (6 - 5)) = 0.25
second normalised score = (3 / 6) / (1 + (8 - 5)) = 0.125.
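The arithmetic above can be checked with a short sketch. It assumes the padding seen in the vogue example (two `*` in front, one behind) and lowercasing; the gem's real normalisation may differ. `trigrams` and `normalised_score` are hypothetical names used only for illustration.

```ruby
# Assumption: pad with "**" in front and "*" behind, as in the "vogue" example.
def trigrams(text)
  padded = "**#{text.downcase}*"
  (0..padded.length - 3).map { |i| padded[i, 3] }
end

# matches:      number of trigrams shared by the result and the search text
# result_score: length of the result string
# The search text's own score is its length.
def normalised_score(search_text, matches, result_score)
  coverage = matches / trigrams(search_text).size.to_f
  coverage / (1 + (result_score - search_text.length).abs)
end

puts trigrams("vogue").inspect
puts normalised_score("vogue", 6, 5) # exact match
puts normalised_score("vogue", 3, 6) # partial hit, result length 6
puts normalised_score("vogue", 3, 8) # partial hit, result length 8
```

With these assumptions the three calls reproduce the values 1, 0.25 and 0.125 from the worked example.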

airblade · 2013-10-01T14:01:22Z

Here's another scoring method which I quite like. It keeps the same order in which the results are returned, i.e. the matches DESC, score ASC.

The more matches the better, and the lower the score the better.
The number of matches is an integer, so we make the score adjustment lie between 0 and 1. The bigger the difference between the result's score and the search text's score, the nearer 1 the score adjustment.
And we normalise the overall quality to be between 0 and 1.

delta score = abs(result score - search text score)
score adjustment = 1 / (1 + (search text score / delta score))
quality = (number of trigram matches - score adjustment) / (number of trigrams for search string)
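A sketch of this second formula, with one assumption added: when the delta score is 0 the formula as written divides by zero, so the adjustment is taken to be 0 there (its limit as the delta shrinks). The method names are hypothetical.

```ruby
# search_score and result_score are string lengths, as in the thread.
def score_adjustment(search_score, result_score)
  delta = (result_score - search_score).abs
  return 0.0 if delta.zero? # assumption: use the limit, avoid dividing by zero

  1.0 / (1 + search_score / delta.to_f)
end

# matches:              number of matching trigrams for this result
# search_trigram_count: number of trigrams in the search text
def quality(matches, search_trigram_count, search_score, result_score)
  (matches - score_adjustment(search_score, result_score)) / search_trigram_count
end

puts quality(6, 6, 5, 5) # exact match for a 5-letter search text
puts quality(3, 6, 5, 6) # three matches, result one character longer
puts quality(3, 6, 5, 8) # three matches, result three characters longer
```

Because the adjustment stays below 1, a result with more matching trigrams always scores higher than one with fewer, which is what keeps the `matches DESC, score ASC` order intact.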

shameelabdulla · 2013-10-01T18:22:16Z

Trying them one by one with a data set. Will let you know.


shameelabdulla · 2013-10-02T07:09:51Z

What is the difference between the result score and the search text score?


airblade · 2013-10-02T07:17:37Z

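The score is simply the length of the string (https://github.com/mezis/fuzzily/blob/b91323ed8de5aa8872590dfce388d6234dda0e3d/lib/fuzzily/trigram.rb#L14). The result score is the length of the result string, and the search text score is the length of the text we're searching for.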

shameelabdulla · 2013-10-02T07:26:07Z

This does not seem to work for the data I have. I'll tell you the problem I am trying to solve. I have a big, sane database of different products. It consists of product names, classifications (bakery, grocery, household etc.), and proper images.

As input (let's call it insane data :) ) I have names of products. But the inputs have only names (which may contain spelling mistakes), and the products are not classified.

What I am trying to do is:

1. Pick each insane input (name of product).
2. Do a fuzzy search on my search database.
3. If the search comes back with a respectable score (say > 0.9), select the record.

However, for step 3 I have not yet been able to figure out a proper scoring algorithm. I tried scoring as you said, and tried Levenshtein. However, I feel all these just correspond to edit distance.

The following is the data set I tried:

Insane name entry:
ATTA AASHIRVAAD MG 5KG

Response from fuzzy with scores:
[{:t=>"AASHIRVAAD DAL BUKHARA 1",
:q=>"1",
:s=>0.4583333333333333,
:l=>0.6666666666666666},
{:t=>"AASHIRVAAD MULTIGRAIN ATTA 1 kg Pouch",
:q=>"1 kg Pouch",
:s=>0.4864864864864865,
:l=>0.7567567567567568},
{:t=>"ATTA AASHIRVAAD SELECT 5 kg",
:q=>"5 kg",
:s=>0.6666666666666666,
:l=>0.3333333333333333},
{:t=>"ATTA AASHIRVAAD SELECT 1 kg",
:q=>"1 kg",
:s=>0.6666666666666666,
:l=>0.37037037037037035},
{:t=>"ATTA AASHIRVAAD M G 1 kg",
:q=>"1 kg",
:s=>0.8333333333333334,
:l=>0.20833333333333334},
{:t=>"ATTA AASHIRVAAD M G 5 kg",
:q=>"5 kg",
:s=>0.8333333333333334,
:l=>0.16666666666666666},
{:t=>"ATTA AASHIRVAAD 10 kg",
:q=>"10 kg",
:s=>0.8571428571428571,
:l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 2 kg", :q=>"2 kg", :s=>0.9, :l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 5 kg", :q=>"5 kg", :s=>0.9, :l=>0.22727272727272727},
{:t=>"ATTA AASHIRVAAD 1 kg", :q=>"1 kg", :s=>0.9, :l=>0.22727272727272727}]

For my requirement, the 5th entry from the last ("ATTA AASHIRVAAD M G 5 kg") should have the highest score. Any thoughts are welcome :)


mezis · 2013-10-02T08:11:36Z

@airblade — while having a normalized "matchiness" metric is a hard problem, it looks like your formula works.

The reason the "score" is the length of the needle is that you want "York" to rank before "New York" when searching for "York"—between two strings that match as well in terms of number of matching trigrams, you want the shortest one, which will be the "best" match.

Implementation-wise, defining extra methods on the fly is a performance killer (it flushes the method cache, so affects an entire application), and it's probably not something you want to use the database to do either.
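One way to avoid defining methods on the fly is to return the score next to the record instead of on it. The sketch below is only an illustration of that idea, not Fuzzily's API: `ScoredMatch` and `scored_matches` are hypothetical names, and `rows` stands in for the grouped trigram rows from the diff above (anything responding to `owner` and `score`).

```ruby
# Hypothetical value object pairing a matched record with its score.
# Nothing is added to the record itself, so no singleton methods are defined.
ScoredMatch = Struct.new(:record, :score)

# rows: objects responding to #owner and #score (an assumption for this sketch).
def scored_matches(rows)
  rows.map { |row| ScoredMatch.new(row.owner, row.score) }
end

# Stand-in for a database row, for demonstration only.
DemoRow = Struct.new(:owner, :score)

results = scored_matches([DemoRow.new("York", 4), DemoRow.new("New York", 8)])
results.each { |m| puts "#{m.record}: #{m.score}" }
```

The trade-off is that callers receive wrapper objects rather than the model instances directly, so existing code that expects plain records would need to call `record` on each result.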

I haven't had much time last week but I'll try to cobble something together this weekend.

shameelabdulla · 2013-10-02T08:33:57Z

@airblade I tried a combination of the score you suggested [(x.matches / x.score.to_f) => airblade score] and Levenshtein distance in the following way:
result score = (airblade score + (1 - Levenshtein distance)) / 2

Seems to work. Analysing the results.
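A sketch of that combination, under the assumption that the Levenshtein distance has already been normalised to lie between 0 and 1 (for example by dividing by the longer string's length); the comment above does not say how it was scaled. `combined_score` is a hypothetical name.

```ruby
# matches: number of matching trigrams; score: length of the result string.
# normalised_levenshtein: assumed to be in 0.0..1.0 (0.0 = identical strings).
def combined_score(matches, score, normalised_levenshtein)
  trigram_part = matches / score.to_f
  (trigram_part + (1 - normalised_levenshtein)) / 2
end

puts combined_score(3, 6, 0.5)
```

Note that `matches / score` is not bounded by 1 in general, since a string can have more trigrams than characters, so the combined value is only a rough ranking aid rather than a strict 0 to 1 scale.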


mezis · 2013-12-01T14:30:19Z

Here's some code: airblade/fuzzily@8b41888.

As a (late) update to this, I can't use the code directly as it has no test and also has a performance issue—it adds methods on the fly, which kills the method cache in Ruby < 2.1. Working on an alternate solution based on @airblade's formula.

airblade mentioned this issue Sep 30, 2013

Results from find_by_fuzzy_name often include nil entries #18

Closed

flushentitypacket mentioned this issue Dec 31, 2014

Is there a way to return the scores with the results? #34

Closed
