This question originally arose as follows: when loading bge-reranker-large with different libraries (sentence_transformers, among others) and reranking the same inputs, the results (i.e., the order after reranking) were not the same. One cause is that sentence_transformers strips whitespace from its inputs. The script below simulates sentence_transformers' whitespace stripping:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "/data/reranker/bge-reranker-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


def rerank(query, documents):
    """Score (query, doc) pairs with the cross-encoder reranker."""
    pairs = [(query, doc) for doc in documents]
    inputs = tokenizer(
        pairs, padding=True, truncation=True, return_tensors="pt", max_length=512
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits.view(-1).float().tolist()
    return scores


query = 'J2智能扫地机器人能吸得干净猫毛吗?'
candidates = [
    ' Q:边刷炸毛严重\nA:机器人在地毯等材质上面清洁,可能导致出现容易炸毛的情况。边刷炸毛对扫地功能、清洁效果影响不大,可定期更换边刷使用\n',
    ' Q:边刷炸毛严重\nA:机器人在地毯等材质上面清洁,可能导致出现容易炸毛的情况。边刷炸毛对扫地功能、清洁效果影响不大,可定期更换边刷使用。\n',
]

# Score the candidates with leading/trailing whitespace stripped.
candidates_stripped = [c.strip() for c in candidates]
scores = rerank(query, candidates_stripped)
res_scores = sorted(enumerate(scores), key=lambda x: -x[1])
print("strip(), Ranked Candidates:")
for idx, score in res_scores:
    print(f"index:{idx} score:{score}")

# Score the same candidates with whitespace left intact.
scores = rerank(query, candidates)
res_scores = sorted(enumerate(scores), key=lambda x: -x[1])
print()
print("no strip() Ranked Candidates:")
for idx, score in res_scores:
    print(f"index:{idx} score:{score}")
```
The output is:
```
strip(), Ranked Candidates:
index:0 score:1.6036423444747925
index:1 score:0.7816800475120544

no strip() Ranked Candidates:
index:0 score:0.8946816325187683
index:1 score:0.8705350756645203
```
As you can see, the only difference between the two runs is that leading and trailing whitespace was stripped from the candidates, yet the scores differ substantially. If these two passages were among 100 passages (they were in fact pulled out of a set of 100), their positions after reranking would shift considerably. Is this because "\n" carried very high importance during pretraining? Does that mean it must never be removed before reranking? Thank you, I look forward to your guidance.
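Whichever explanation holds, a practical way to keep scores comparable across libraries is to apply one explicit whitespace policy to every candidate before scoring, instead of relying on a library's implicit `strip()`. A minimal sketch (pure Python, no model needed; the `normalize` helper is hypothetical, not part of sentence_transformers or FlagEmbedding):

```python
def normalize(text: str, strip_whitespace: bool = True) -> str:
    """Apply one explicit whitespace policy to a passage before reranking."""
    return text.strip() if strip_whitespace else text


candidates = [
    ' Q: edge brush sheds badly\nA: replace the edge brush periodically\n',
    ' Q: edge brush sheds badly\nA: replace the edge brush periodically.\n',
]

# With an explicit policy, every library scores exactly the same strings,
# so any score difference can only come from the model itself rather than
# from hidden preprocessing inside one library.
normalized = [normalize(c) for c in candidates]
```

Whether stripping helps or hurts absolute scores is model-dependent, but applying the same policy everywhere at least makes results reproducible across libraries.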