Bugfix unicode compatibility #30

Open: wants to merge 3 commits into base: master


wradstok

Hi,

When operating on my data set I was getting the following error:

> Traceback (most recent call last):
>   File "codes/run.py", line 361, in <module>
>     main(parse_args())
>   File "codes/run.py", line 211, in main
>     train_triples = read_triple(os.path.join(args.data_path, 'train.txt'), entity2id, relation2id)
>   File "codes/run.py", line 127, in read_triple
>     triples.append((entity2id[h], relation2id[r], entity2id[t]))
> KeyError: 'Găgăuzia\xa0'

To fix the issue, I added calls to unicodedata.normalize() when loading triples to normalize characters such as the non-breaking space (the \xa0 in the error above). I also fixed two minor grammatical mistakes in error messages.
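As a minimal sketch of what that normalization does to the key from the traceback: NFKC maps the non-breaking space (U+00A0) to an ordinary space, so a strip is still needed afterwards. The helper name normalize_field is hypothetical and not part of codes/run.py.

```python
import unicodedata


def normalize_field(text):
    """Apply NFKC, then strip surrounding whitespace (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text).strip()


# Entity name with a trailing non-breaking space, as in the KeyError above.
raw = "Găgăuzia\xa0"

# NFKC alone replaces U+00A0 with a regular space; the name still has a
# trailing character and would still miss a dictionary key without it.
nfkc_only = unicodedata.normalize("NFKC", raw)
print(repr(nfkc_only))

# Stripping after normalization removes the trailing space.
print(repr(normalize_field(raw)))
```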

codes/run.py Outdated
@@ -123,7 +124,7 @@ def read_triple(file_path, entity2id, relation2id):
     triples = []
     with open(file_path) as fin:
         for line in fin:
-            h, r, t = line.strip().split('\t')
+            h, r, t = map(lambda x: x.strip(), unicodedata.normalize('NFKC', line).split('\t'))

str.strip could be a shorter replacement for lambda x: x.strip()
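A short sketch of the suggestion, using a made-up input line: passing the unbound method str.strip to map gives the same result as the lambda, since map calls it with each field as its only argument.

```python
# Hypothetical tab-separated triple; the head carries a trailing
# non-breaking space and the tail carries the newline.
line = "Găgăuzia\xa0\tcapital\tComrat\n"

with_lambda = list(map(lambda x: x.strip(), line.split("\t")))
with_method = list(map(str.strip, line.split("\t")))

# Both forms strip every field, including the non-breaking space,
# which Python 3 treats as whitespace.
print(with_lambda == with_method)
```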

Author

Thanks, didn't realize this was possible.

@wradstok
Author

I mistakenly thought I was dealing with a Unicode issue because of the error I received. On closer investigation I realized that when the entities and relations are loaded, the entire line is stripped before it is split, so trailing whitespace is removed from the names because they come last on the line. When loading triples, however, this only happens for the tail entities. Fixed by mapping str.strip over all components when loading triples.
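The mismatch described above can be reproduced in a few lines. This is a sketch under assumptions: the sample data is invented, and load_dict and read_triples are simplified stand-ins that take lists of lines instead of file paths, not the actual functions in codes/run.py.

```python
def load_dict(lines):
    """Parse 'id<TAB>name' lines; stripping the whole line cleans the name."""
    mapping = {}
    for line in lines:
        idx, name = line.strip().split("\t")
        mapping[name] = int(idx)
    return mapping


def read_triples(lines, entity2id, relation2id):
    """Parse 'head<TAB>relation<TAB>tail' lines, stripping every field."""
    triples = []
    for line in lines:
        h, r, t = map(str.strip, line.split("\t"))
        triples.append((entity2id[h], relation2id[r], entity2id[t]))
    return triples


# Invented sample data: the entity name has a trailing non-breaking space.
entity2id = load_dict(["0\tGăgăuzia\xa0\n", "1\tComrat\n"])
relation2id = load_dict(["0\tcapital\n"])
triple_lines = ["Găgăuzia\xa0\tcapital\tComrat\n"]

# Stripping only the whole line leaves the head untouched, so the lookup fails.
try:
    h, r, t = triple_lines[0].strip().split("\t")
    entity2id[h]
    old_failed = False
except KeyError:
    old_failed = True

# Stripping each field makes the head match the key stored in entity2id.
triples = read_triples(triple_lines, entity2id, relation2id)
print(old_failed, triples)
```

Stripping each field keeps the triple loader consistent with how the dictionaries were built, without altering any characters inside the names.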

2 participants