Release 0.9.0
jbaiter committed Jun 12, 2024
1 parent 90b577a commit bf9a166
Showing 9 changed files with 24 additions and 19 deletions.
4 changes: 2 additions & 2 deletions docs/alternatives.md
@@ -6,9 +6,9 @@ readings for a given sequence of characters, or they could come from an manual o
OCR correction system.

!!! note Expressing alternatives in OCR files
-- For **hOCR**, use `<span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span>` (see [hOCR specification](http://kba.cloud/hocr-spec/1.2/#segmentation))
+- For **hOCR**, use `<span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span>` (see [hOCR specification](http://kba.github.io/hocr-spec/1.2/#segmentation))
- For **ALTO**, use `<String …><ALTERNATIVE>...</ALTERNATIVE></String>` (see `AlternativeType` in the [ALTO schema](https://www.loc.gov/standards/alto/v4/alto-4-2.xsd))
-- For **MiniOCR**, delimit alternative forms with `⇿` (U+21FF) (see [MiniOCR documentation](../formats#miniocr))
+- For **MiniOCR**, delimit alternative forms with `⇿` (U+21FF) (see [MiniOCR documentation](./formats.md#miniocr))

In any case, these alternative readings can improve your user's search experience, by allowing us to
index *multiple forms for a given text position*. This enables users to find more matching passages
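
As a rough illustration of the MiniOCR convention in this hunk, the sketch below splits a token on U+21FF into a default form and its alternatives. The function name is hypothetical, and it assumes the default form comes first; this is not the plugin's own parser.

```python
# U+21FF is the delimiter the MiniOCR docs name for alternative forms.
ALT_DELIMITER = "\u21ff"


def split_alternatives(token):
    """Split a MiniOCR token into (default_form, [alternatives]).

    Assumption: the first segment is the default form and every
    following segment is an alternative reading.
    """
    default, *alternatives = token.split(ALT_DELIMITER)
    return default, alternatives


if __name__ == "__main__":
    token = "clistrias\u21ffchristmas\u21ffchristrias"
    print(split_alternatives(token))
```

A token without the delimiter simply yields an empty list of alternatives.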
9 changes: 6 additions & 3 deletions docs/changes.md
@@ -1,4 +1,7 @@
-## Unreleased
+## 0.9.0 (2024-06-12)
+[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.9.0)
+
+Major performance and stability improvements in this release, upgrading is highly recommended.

**Changed:**
- Add support for multithreaded highlighting. Uses all available logical CPU cores by default and
@@ -228,7 +231,7 @@ significantly.
shipping with Solr). For example, if your OCR file has the alternatives `christmas` and `christrias` for the token
`clistrias` in the span `presents on clistrias eve`, users would be able to search for `"presents christmas"` and
`"presents clistrias"` and would get the correct match in both cases, both with full highlighting.
-Refer to the corresponding [section in the documentation](../alternatives) for instructions on setting it up.
+Refer to the corresponding [section in the documentation](./alternatives.md) for instructions on setting it up.
- **On-the-fly repair of 'broken' markup.**
`OcrCharFilterFactory` has a new option `fixMarkup` that enables on-the-fly repair of invalid XML in OCR input documents,
namely problems that can arise when the markup contains unescaped instances of `<`, `>` and `&`.
@@ -255,7 +258,7 @@ significantly.
identify the page, improving highlighting performance due to the need for less backwards-seeking in the
input files.
- **Add new `expandAlternatives` attribute to `OcrCharFilterFactory`.** This enables the parsing of
-alternative readings from input files (see above and the [corresponding section in the documentation](../alternatives))
+alternative readings from input files (see above and the [corresponding section in the documentation](./alternatives.md))
- **Add new `hl.ocr.scorePassages` parameter to disable sorting of passages by their score.**
See the above section under *New Features* for an explanation of this flag.

2 changes: 1 addition & 1 deletion docs/formats.md
@@ -81,7 +81,7 @@ encode a word with the default form `clistrias` and two alternatives `christmas`

| Block | MiniOCR tag | notes |
| --------- | ------------ | -------------------------------- |
-| Word | `<w/>` | needs to have `box` attribute with `{x} {y} {width} {height}`. <br>Values can be integers or floats between 0 and 1, **with the leading `0.` omitted** |
+| Word | `<w/>` | needs to have `box` attribute with `{x} {y} {width} {height}`. <br>Values can be integers or floats between 0 and 1, **with the leading `0` omitted** |
| Line | `<l/>` | |
| Block | `<b/>` | |
| Page | `<p/>` | needs to have an `xml:id` attribute with a page identifier. Optionally can have a `wh` attribute with the `{width} {height}` values for the page |
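
To make the corrected wording concrete, here is a minimal sketch of parsing a `box` value under the reading that relative coordinates are written like `.25` (leading `0` dropped, dot kept). `parse_box` is a hypothetical helper for illustration, not part of the plugin.

```python
def parse_box(box):
    """Parse a MiniOCR `box` attribute into an (x, y, width, height) tuple.

    Assumption: values starting with "." are relative coordinates with
    the leading 0 omitted; anything else is an integer pixel value.
    """
    values = []
    for part in box.split():
        if part.startswith("."):
            # ".25" stands for 0.25
            values.append(float("0" + part))
        else:
            values.append(int(part))
    if len(values) != 4:
        raise ValueError(f"expected 4 values, got {len(values)}: {box!r}")
    return tuple(values)


if __name__ == "__main__":
    print(parse_box("10 20 300 40"))
    print(parse_box(".25 .5 .125 .0625"))
```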
10 changes: 5 additions & 5 deletions docs/indexing.md
@@ -30,7 +30,7 @@ the (again, potentially very large) contents themselves in the index.
how fast the underlying storage is able to perform random I/O. This is why **we highly recommend
using flash storage for the documents**.

-Another option to increase highlighting performance is
+Another option to increase indexing performance is
to **switch from UTF8 to ASCII** (with XML-escaped Unicode codepoints) for the encoding of the OCR
files. This requires less CPU during decoding, since we don't have to take multi-byte sequences into
account. To signal to the plugin that a given source path is encoded in ASCII, include the `{ascii}`
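
One way to produce ASCII with XML-escaped codepoints is Python's built-in `xmlcharrefreplace` error handler. The sketch below is only a conversion helper with a hypothetical name; how the resulting file is then referenced in the source pointer is covered by the docs, not by this code.

```python
import html


def to_ascii_xml(text):
    """Encode text as ASCII, replacing non-ASCII characters with
    numeric XML character references (e.g. "ü" becomes "&#252;")."""
    return text.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    encoded = to_ascii_xml("Bücher")
    print(encoded)
    # Round trip: resolving the references restores the original text.
    print(html.unescape(encoded.decode("ascii")))
```

If the OCR file carries an XML declaration with an encoding, it would likely need adjusting as well; that step is left out here.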
@@ -143,9 +143,9 @@ The format of the regions is inspired by [Python's slicing syntax](https://docs.


!!! note "Example Implementation"
-The [example setup on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/master/example)
-uses a [Python script](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py)
+The [example setup on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/main/example)
+uses a [Python script](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py)
to index articles from multi-page newspaper scans into Solr. It works by [first extracting the OCR
-block ids for each article from a METS file](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py#L141-L147)
-and then [finds the byte regions these OCR blocks are located in](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py#L108-L123)
+block ids for each article from a METS file](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py#L141-L148)
+and then [finds the byte regions these OCR blocks are located in](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py#L103-L124)
to build the source pointer for each article.
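
The two steps described above (locate byte regions, then build a source pointer) can be sketched roughly as follows. This is not the linked `ingest.py`; both helpers are hypothetical, and the `path[start:end,start:end]` form is an assumption modelled on the slice-inspired region syntax, so check the indexing docs for the exact format.

```python
def find_byte_region(data, start_marker, end_marker):
    """Return (start, end) byte offsets of the first span in `data` that
    begins with `start_marker` and ends with the next `end_marker`."""
    start = data.index(start_marker)
    end = data.index(end_marker, start) + len(end_marker)
    return start, end


def build_source_pointer(path, regions):
    """Join byte regions into a pointer string (assumed format)."""
    spans = ",".join(f"{start}:{end}" for start, end in regions)
    return f"{path}[{spans}]"


if __name__ == "__main__":
    ocr = b'<p><b id="a">one</b><b id="b">two</b></p>'
    region = find_byte_region(ocr, b'<b id="b">', b"</b>")
    print(build_source_pointer("/data/ocr.xml", [region]))
```

Offsets are taken on the raw bytes rather than on decoded text, which matters as soon as the file contains multi-byte characters.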
2 changes: 1 addition & 1 deletion docs/installation.md
@@ -19,7 +19,7 @@ your Solrcloud cluster. All paths are relative to the Solr installation director
`$ ./bin/solr package add-repo dbmdz.github.io https://dbmdz.github.io/solr`
- **Install package** in the latest version:<br>
`$ ./bin/solr package install ocrhighlighting` if you're on Solr 9, otherwise:
-`$ ./bin/solr package install ocrhighlighting:0.8.1-solr78`
+`$ ./bin/solr package install ocrhighlighting:0.9.0-solr78`

!!! caution "Be sure to use the `ocrhighlighting:` prefix when specifying classes in your configuration."
When using the Package Manager, classes from plugins have to be prefixed (separated by a colon) by
10 changes: 6 additions & 4 deletions docs/performance.md
@@ -58,10 +58,12 @@ in your `solrconfig.xml`. Tune these parameters to match your hardware and stora
- `numHighlightingThreads`: The number of threads that will be used to read and process the OCR files.
Defaults to the number of logical CPU cores. Set this higher if you're I/O-bottlenecked and can
support more parallel reads than you have logical CPU cores (very likely for modern NVMe drives).
-- `maxQueuedPerThread`: By default, we queue only a limited number of documents per thread as to not
-stall other requests. If this number is reached, all highlighting will be done single-threaded on
-the request thread. You usually don't have to touch this setting, but if you have large result sets
-with many concurrent requests, this can help to reduce the number of threads that are active at
+- `maxQueuedPerThread`: The thread pool used to highlight documents is shared across all requests.
+By default, we queue only a limited number of documents per thread as to not
+stall other requests. If this number is reached, all remaining highlighting
+will be done single-threaded on the request thread. You usually don't have to
+touch this setting, but if you have large result sets with many concurrent
+requests, this can help to reduce the number of threads that are active at
the same time, at least as a stopgap.
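
The queueing behaviour described for `maxQueuedPerThread` can be illustrated with a small Python sketch. The plugin itself is Java and its actual implementation is not shown in this commit; all names below are hypothetical, and the semaphore is just one simple way to model a bounded queue on a shared pool with a fallback to the calling (request) thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_shared_pool(num_threads, max_queued_per_thread):
    """Create a pool plus a semaphore capping the documents in flight."""
    pool = ThreadPoolExecutor(max_workers=num_threads)
    slots = threading.BoundedSemaphore(num_threads * max_queued_per_thread)
    return pool, slots


def highlight_all(docs, highlight_fn, pool, slots):
    """Highlight docs on the shared pool; once no slot is free, process
    all remaining docs on the calling thread instead."""
    results = [None] * len(docs)
    pending = {}
    queue_full = False
    for i, doc in enumerate(docs):
        if not queue_full and slots.acquire(blocking=False):
            future = pool.submit(highlight_fn, doc)
            # Free the slot as soon as this document is finished.
            future.add_done_callback(lambda _f: slots.release())
            pending[i] = future
        else:
            queue_full = True
            results[i] = highlight_fn(doc)
    for i, future in pending.items():
        results[i] = future.result()
    return results


if __name__ == "__main__":
    pool, slots = make_shared_pool(num_threads=2, max_queued_per_thread=1)
    print(highlight_all(["a", "b", "c", "d"], str.upper, pool, slots))
    pool.shutdown()
```

Because the pool and the semaphore are shared, a request that arrives while another one has filled the queue does its own work inline rather than waiting, which is the stall-avoidance the setting describes.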

## Runtime configuration
2 changes: 1 addition & 1 deletion docs/query.md
@@ -72,7 +72,7 @@ The objects contained under the `snippets` key are structured like this:
- `pages` contains a list of pages the snippet appears on along with their pixel dimensions. This can be useful
for rendering highlights, e.g. if the highlighting target image is scaled down from the source image.
- `regions` contains a list of regions that the snippet is located on. Usually this will contain only one item,
-but in cases where a phrase spans multiple pages, it will contain a region for every page involved in the match.
+but in cases where a phrase spans multiple pages or columns, it will contain a region for every page and column involved in the match.
The object includes coordinates for all four corners it is defined by, as well as the identifier of the `page` the
region is located on.
- `highlights` contains a list of regions that contain the actual matches for the query as well as the `text` that
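
For the scaled-down rendering case mentioned under `pages`, a sketch of rescaling a region might look like this. The key names (`ulx`, `uly`, `lrx`, `lry`, `width`, `height`) are assumptions about the response layout and should be checked against an actual response; `scale_region` is a hypothetical helper.

```python
def scale_region(region, page, target_width, target_height):
    """Scale a region from source-page pixels to a resized target image.

    Assumed keys: region has ulx/uly/lrx/lry, page has width/height.
    """
    sx = target_width / page["width"]
    sy = target_height / page["height"]
    return {
        "ulx": round(region["ulx"] * sx),
        "uly": round(region["uly"] * sy),
        "lrx": round(region["lrx"] * sx),
        "lry": round(region["lry"] * sy),
    }


if __name__ == "__main__":
    page = {"width": 2000, "height": 3000}
    region = {"ulx": 200, "uly": 300, "lrx": 1000, "lry": 600}
    print(scale_region(region, page, 1000, 1500))
```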
2 changes: 1 addition & 1 deletion example/docker-compose.yml
@@ -1,7 +1,7 @@
version: '2'
services:
solr:
-image: solr:9.5
+image: solr:9.6
ports:
- "1044:1044" # Debugging port
- "8983:8983" # Solr admin interface
2 changes: 1 addition & 1 deletion pom.xml
@@ -6,7 +6,7 @@

<groupId>de.digitalcollections</groupId>
<artifactId>solr-ocrhighlighting</artifactId>
-<version>0.9.0-SNAPSHOT</version>
+<version>0.9.0</version>

<name>Solr OCR Highlighting Plugin</name>
<description>