Bugfix unicode compatibility #30

wradstok · 2020-07-20T18:56:56Z

Hi,

When operating on my data set I was getting the following error:

> Traceback (most recent call last):
>   File "codes/run.py", line 361, in <module>
>     main(parse_args())
>   File "codes/run.py", line 211, in main
>     train_triples = read_triple(os.path.join(args.data_path, 'train.txt'), entity2id, relation2id)
>   File "codes/run.py", line 127, in read_triple
>     triples.append((entity2id[h], relation2id[r], entity2id[t]))
> KeyError: 'Găgăuzia\xa0'

To fix the issue, I added calls to unicodedata.normalize() before loading triples to normalize such weird characters (non-breaking spaces). I also fixed two minor grammatical mistakes in error messages.
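For illustration, here is a minimal sketch of what NFKC normalization does to the key from the traceback. The string is copied from the `KeyError` above; the variable names are made up for this example and do not come from `codes/run.py`. NFKC maps the non-breaking space (U+00A0) to an ordinary space, which a later `strip()` then removes.

```python
import unicodedata

# Key from the traceback: 'Găgăuzia' followed by a non-breaking space (U+00A0).
raw = 'G\u0103g\u0103uzia\xa0'

# NFKC applies compatibility mappings, so U+00A0 becomes a plain space (U+0020).
normalized = unicodedata.normalize('NFKC', raw)

# Stripping afterwards removes the now-ordinary trailing space.
cleaned = normalized.strip()

print(repr(raw), repr(normalized), repr(cleaned))
```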

cthoyt · 2020-07-22T21:39:10Z

codes/run.py

@@ -123,7 +124,7 @@ def read_triple(file_path, entity2id, relation2id):
    triples = []
    with open(file_path) as fin:
        for line in fin:
-            h, r, t = line.strip().split('\t')
+            h, r, t = map(lambda x: x.strip(), unicodedata.normalize('NFKC', line).split('\t'))


str.strip could be a shorter replacement for lambda x: x.strip()

Thanks, didn't realize this was possible.

wradstok · 2020-07-23T08:27:02Z

I mistakenly thought I was dealing with a Unicode issue because of the error I received. On closer investigation, I realized that when the entities and relations are loaded, the entire line is stripped before it is split, so trailing spaces are removed because the names are at the end of the line. When loading triples, however, this only happens for the tail entities. Fixed by mapping str.strip() over all components when loading triples.
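The difference can be sketched as follows. This is an illustrative reconstruction, not the code in `codes/run.py`: the helper `parse_triple` and the sample lines are hypothetical, and the dictionary-file layout (an id, a tab, then the name) is an assumption based on the description above.

```python
def parse_triple(line):
    """Split a tab-separated triple and strip whitespace from every field."""
    h, r, t = map(str.strip, line.split('\t'))
    return h, r, t

# Hypothetical sample lines; the entity name carries a trailing U+00A0.
entity_line = '0\tG\u0103g\u0103uzia\xa0\n'
triple_line = 'G\u0103g\u0103uzia\xa0\tlocated_in\tMoldova\n'

# Dictionary loading: the name is last on the line, so line.strip() cleans it.
eid, name = entity_line.strip().split('\t')

# Old triple loading: line.strip() only touches the ends of the line,
# so the head entity keeps its trailing non-breaking space.
old_head = triple_line.strip().split('\t')[0]

# Stripping each field separately cleans head, relation, and tail alike.
new_head = parse_triple(triple_line)[0]

print(repr(name), repr(old_head), repr(new_head))
```

In Python 3, `str.strip()` without arguments already treats U+00A0 as whitespace, which is why stripping each field is enough here without any Unicode normalization.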

wradstok added 2 commits July 20, 2020 20:50

Bugfix unicode compatibility

5a8aa66

Fix indentation mistake

20c37cd

cthoyt reviewed Jul 22, 2020
