I'm a bit puzzled by something I encountered trying to encode sentences as embeddings. When I ran the sentences through the model one at a time, I got slightly different results from when I ran batches of sentences.
I've reduced an example down to:
from transformers import pipeline
import numpy as np
p = pipeline('feature-extraction', model='allenai/scibert_scivocab_uncased')
s = 'the scurvy dog walked home alone'.split()
for l in range(1, len(s) + 1):
    txt = ' '.join(s[:l])
    res1 = p(txt)            # same sentence, run individually, twice
    res2 = p(txt)
    res1_2 = p([txt, txt])   # the same sentence twice, as one batch
    print(l, txt, len(res1[0]))
    print(all(np.allclose(i, j) for i, j in zip(res1[0], res2[0])),
          all(np.allclose(i, j) for i, j in zip(res2[0], res1_2[0])),
          all(np.allclose(i, j) for i, j in zip(res1_2[0], res1_2[1])))
The output I get is:
1 the 3
True False True
2 the scurvy 6
True True True
3 the scurvy dog 7
True False False
4 the scurvy dog walked 9
True False True
5 the scurvy dog walked home 10
True True False
6 the scurvy dog walked home alone 11
True True True
So running a single sentence through the model seems to give the same output each time, but when I run a batch containing the same sentence twice, the results sometimes differ, both between the two batch outputs and compared to the single-sentence case.
Is this expected/explainable?
Further context: I'm running this on CPU (laptop), Python 3.8.9, in a freshly installed venv.
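One variable I haven't ruled out is CPU threading. As a sketch (assuming a recent PyTorch; I haven't verified it changes the outcome), the run can be pinned to a single thread with deterministic kernels before building the pipeline:

import torch

# Sketch: force single-threaded, deterministic CPU execution
# to check whether intra-op parallelism contributes to the differences.
torch.set_num_threads(1)
torch.use_deterministic_algorithms(True)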
The difference usually shows up in only a few indices of the embeddings and can be as large as 1e-3. It is negligible when comparing the embeddings by cosine distance, but I'd like to understand where it comes from before dismissing it.
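For reference, here is roughly how I'm quantifying that (a sketch; res1 and res1_2 hold the outputs from the last iteration of the loop above):

a = np.asarray(res1[0])      # single-sentence run, shape (tokens, hidden)
b = np.asarray(res1_2[0])    # first item of the batched run

# Largest element-wise deviation (observed up to ~1e-3)
print(np.max(np.abs(a - b)))

# Per-token cosine similarity -- effectively 1.0 despite the deviations
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
print(cos.min())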