# Traditional NLP Metrics

## Non LLM String Similarity

The `NonLLMStringSimilarity` metric measures the similarity between the `reference` and the `response` using traditional string distance measures such as Levenshtein, Hamming, and Jaro. It is useful for evaluating how similar the `response` is to the `reference` text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference.

### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import NonLLMStringSimilarity

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = NonLLMStringSimilarity()
await scorer.single_turn_ascore(sample)
```
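
The snippets in this guide use `await`, so they assume an already-running event loop, such as a Jupyter notebook. In a plain Python script, one way to run the same call is to wrap it with `asyncio.run`; a minimal sketch reusing the classes from the example above:

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import NonLLMStringSimilarity

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = NonLLMStringSimilarity()

# single_turn_ascore is a coroutine, so run it on an event loop.
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)
```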

You can choose from the string distance measures available in `DistanceMeasure`. Here is an example of using Hamming distance.

```python
from ragas.metrics._string import NonLLMStringSimilarity, DistanceMeasure

scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.HAMMING)
```


## BLEU Score

The [BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) score is a metric used to evaluate the quality of the `response` by comparing it with the `reference`. It measures the similarity between the response and the reference based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, but it is also used for other natural language processing tasks. Because of this origin, it expects the response and the reference to contain the same number of sentences, and the comparison is done at the sentence level. The BLEU score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._bleu_score import BleuScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = BleuScore()
await scorer.single_turn_ascore(sample)
```
Custom weights may be supplied to fine-tune the BLEU score. Pass a tuple of float weights for unigrams, bigrams, trigrams, and so on:

```python
scorer = BleuScore(weights=(0.25, 0.25, 0.25, 0.25))
```
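
For instance, to weight only unigram and bigram overlap (a hypothetical configuration, assuming the same tuple-of-weights interface shown above):

```python
# Weights for unigrams, bigrams, trigrams, and 4-grams respectively.
scorer = BleuScore(weights=(0.5, 0.5, 0.0, 0.0))
```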



## ROUGE Score

The [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) score is a set of metrics used to evaluate the quality of natural language generations. It measures the overlap between the generated `response` and the `reference` text based on n-gram recall, precision, and F1 score. The ROUGE score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._rouge_score import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore()
await scorer.single_turn_ascore(sample)
```

You can change the `rouge_type` to `rouge-1`, `rouge-2`, or `rouge-l` to calculate the ROUGE score based on unigrams, bigrams, or the longest common subsequence, respectively.

```python
scorer = RougeScore(rouge_type="rouge-1")
```

You can change the `measure_type` to `precision`, `recall`, or `f1` to calculate the ROUGE score based on precision, recall, or F1 score respectively.

```python
scorer = RougeScore(measure_type="recall")
```
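
The two options can be combined; reusing the values shown above:

```python
# ROUGE-1 recall: unigram overlap measured as recall.
scorer = RougeScore(rouge_type="rouge-1", measure_type="recall")
```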

## Exact Match
The `ExactMatch` metric checks whether the response is exactly the same as the reference text. It is useful in scenarios where the generated response must match the expected output word for word, for example when checking arguments in tool calls. The metric returns 1 if the response exactly matches the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import ExactMatch

sample = SingleTurnSample(
    response="India",
    reference="Paris"
)

scorer = ExactMatch()
await scorer.single_turn_ascore(sample)
```
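
In the sample above, the response `"India"` does not match the reference `"Paris"`, so the metric would return 0.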

## String Presence
The `StringPresence` metric checks whether the response contains the reference text. It is useful when the generated response must contain certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import StringPresence

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="Eiffel Tower"
)

scorer = StringPresence()
await scorer.single_turn_ascore(sample)
```
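
When scoring many samples, the coroutines can be run concurrently with `asyncio.gather`; a minimal sketch, assuming the same classes used above (the second sample is made up for illustration):

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import StringPresence

samples = [
    SingleTurnSample(
        response="The Eiffel Tower is located in India.",
        reference="Eiffel Tower"
    ),
    SingleTurnSample(
        response="The Statue of Liberty is in New York.",
        reference="Eiffel Tower"
    ),
]

scorer = StringPresence()

async def score_all(samples):
    # Score every sample concurrently; results keep the input order.
    return await asyncio.gather(
        *(scorer.single_turn_ascore(s) for s in samples)
    )

scores = asyncio.run(score_all(samples))
print(scores)  # expected: 1 for the first sample, 0 for the second
```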
