🎁 Improve word coordinates generator
This commit ensures that the word coordinates generator emits unique
values.  IIIF Print's version of the same generator has been observed to
emit the same word coordinates multiple times, which results in the UV
showing multiple annotations for the same word at the same place.  This
commit also sets up the text generator to produce a more Tesseract-like
text file, in which extra spaces are omitted for cleaner output, and
sets up the alto generator to avoid duplicate word coordinates.

- Resolves: #6
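
One plausible way such duplicates arise in a SAX-style handler is stale state: if the handler remembers the class of the last opened element and never resets it, the closing tag of an enclosing line can flush the final word a second time. The sketch below illustrates that idea only. `WordCollector`, its `reset_state:` flag, `EVENTS`, and `collect` are hypothetical names, the event list stands in for a real parser, and the behavior of `start_element` is an assumption rather than the project's actual code.

```ruby
# Hypothetical, simplified stand-in for a SAX-style hOCR handler.
class WordCollector
  attr_reader :words

  def initialize(reset_state:)
    @reset_state = reset_state
    @words = []
    @current = nil
    @element_class_name = nil
  end

  # Assumption: the handler remembers the class of the most recently
  # opened element and starts a new word for 'ocrx_word' spans.
  def start_element(_name, css_class = nil)
    @element_class_name = css_class
    @current = { word: +'' } if css_class == 'ocrx_word'
  end

  def characters(value)
    @current[:word] << value if @current
  end

  def end_element(_name)
    @words.push(@current) if @element_class_name == 'ocrx_word' && @current
    return unless @reset_state

    # Clearing both pieces of state keeps an enclosing element's closing
    # tag from flushing the previous word again.
    @current = nil
    @element_class_name = nil
  end
end

# One line containing two words; the last event closes the line.
EVENTS = [
  [:start_element, 'span', 'ocr_line'],
  [:start_element, 'span', 'ocrx_word'],
  [:characters, 'FEARFUL'],
  [:end_element, 'span'],
  [:start_element, 'span', 'ocrx_word'],
  [:characters, 'ADVENTURE.'],
  [:end_element, 'span'],
  [:end_element, 'span']
].freeze

def collect(reset_state:)
  collector = WordCollector.new(reset_state: reset_state)
  EVENTS.each { |method, *args| collector.public_send(method, *args) }
  collector.words.map { |word| word[:word] }
end
```

With `reset_state: false` the last word is collected twice, once at its own closing tag and once at the line's; with `reset_state: true` each word is collected once.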
kirkkwang committed May 25, 2023
1 parent 2e16764 commit ad5f265
Showing 3 changed files with 15 additions and 7 deletions.
@@ -121,6 +121,7 @@ def end_word
   # add trailing space to plaintext buffer for between words:
   @text += ' '
   @words.push(@current) if word_complete?
+  @current = nil # clear the current word
 end

 def end_line
@@ -156,10 +157,13 @@ def characters(value)
 # Callback for element end; at this time, flush word coordinate state
 # for current word, and append line endings to plain text:
 #
-# @param _name [String] element name.
-def end_element(_name)
-  end_line if @element_class_name == 'ocr_line'
-  end_word if @element_class_name == 'ocrx_word'
+# @param name [String] element name.
+def end_element(name)
+  if name == 'span'
+    end_word if @element_class_name == 'ocrx_word'
+    @text += "\n" if @element_class_name.nil?
+  end
+  @element_class_name = nil
 end

 # Callback for completion of parsing hOCR, used to normalize generated
@@ -13,6 +13,7 @@
   json = JSON.parse(File.read(generated_file.file_path))
   expect(json.keys).to match_array(["width", "height", "coords"])
   expect(generated_file.exist?).to be_truthy
+  expect(generated_file.file_path).to end_with("/ocr_mono_text_hocr.coordinates.json")
 end

 expect(generated_file.exist?).to be_falsey
@@ -20,16 +20,19 @@
     subject { hocr.text }

     it 'outputs plain text' do
-      expect(subject.slice(0, 40)).to eq "_A FEARFUL ADVENTURE.\n‘The Missouri. "
-      expect(subject.size).to eq 831
+      expect(subject.slice(0, 40)).to eq "_A FEARFUL ADVENTURE.\n‘The Missouri. Rep"
+      expect(subject.size).to eq 723
     end
   end

   describe '#to_json' do
     subject { hocr.to_json }
-    it 'outputs JSON that includes coords key' do
+    it 'outputs JSON that includes coords key with unique coordinate values' do
       parsed = JSON.parse(subject)
       expect(parsed['coords'].length).to be > 1
+      parsed['coords'].values.each do |value|
+        expect(value.uniq.length).to eq(value.length)
+      end
     end
   end
 end
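
The uniqueness assertion in the spec can also be sketched as a standalone check on the generated JSON. This is an illustration under assumptions: `duplicate_coordinates` is a hypothetical helper, and the shape of `coords` (a hash mapping each word to an array of coordinate arrays) is inferred from the spec's use of `parsed['coords'].values`, not confirmed by the source. The sample documents are made up.

```ruby
require 'json'

# Hypothetical helper: returns a hash of word => repeated coordinate
# arrays, or an empty hash when every coordinate list is unique.
def duplicate_coordinates(json_string)
  coords = JSON.parse(json_string).fetch('coords')
  coords.each_with_object({}) do |(word, boxes), duplicates|
    repeated = boxes.tally.select { |_box, count| count > 1 }.keys
    duplicates[word] = repeated unless repeated.empty?
  end
end

# Made-up sample documents, assuming the "width", "height", "coords" keys
# that the coordinates spec checks for.
CLEAN = {
  width: 100, height: 50,
  coords: { 'Missouri' => [[10, 20, 30, 12]] }
}.to_json

DUPLICATED = {
  width: 100, height: 50,
  coords: { 'Missouri' => [[10, 20, 30, 12], [10, 20, 30, 12]] }
}.to_json
```

Returning the offending entries, rather than a boolean, makes a failing check easier to read because it names the word and the repeated coordinates.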
