Release 0.9.0

dbmdz · Jun 12, 2024 · bf9a166
1 parent 90b577a
commit bf9a166

Showing 9 changed files with 24 additions and 19 deletions.
diff --git a/docs/alternatives.md b/docs/alternatives.md
@@ -6,9 +6,9 @@ readings for a given sequence of characters, or they could come from an manual o
 OCR correction system.
 
 !!! note Expressing alternatives in OCR files
-    - For **hOCR**, use `<span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span>` (see [hOCR specification](http://kba.cloud/hocr-spec/1.2/#segmentation))
+    - For **hOCR**, use `<span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span>` (see [hOCR specification](http://kba.github.io/hocr-spec/1.2/#segmentation))
     - For **ALTO**, use `<String …><ALTERNATIVE>...</ALTERNATIVE></String>` (see `AlternativeType` in the [ALTO schema](https://www.loc.gov/standards/alto/v4/alto-4-2.xsd))
-    - For **MiniOCR**, delimit alternative forms with `⇿` (U+21FF) (see [MiniOCR documentation](../formats#miniocr))
+    - For **MiniOCR**, delimit alternative forms with `⇿` (U+21FF) (see [MiniOCR documentation](./formats.md#miniocr))
 
 In any case, these alternative readings can improve your user's search experience, by allowing us to
 index *multiple forms for a given text position*. This enables users to find more matching passages
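
Reviewer note: the MiniOCR convention referenced in this hunk (alternative forms delimited by `⇿`, U+21FF) can be illustrated with a small sketch. This is not code from the plugin. The function name `split_alternatives` is hypothetical, and treating the first form as the default reading is an assumption based on the wording in `docs/formats.md`.

```python
# Hypothetical helper, not part of solr-ocrhighlighting.
ALT_DELIMITER = "\u21ff"  # the U+21FF delimiter named in docs/alternatives.md


def split_alternatives(token):
    """Split a MiniOCR word into (default_form, [alternative_forms]).

    Assumption: the first form is the default reading and every
    following form is an alternative.
    """
    default, *alternatives = token.split(ALT_DELIMITER)
    return default, alternatives


if __name__ == "__main__":
    word = ALT_DELIMITER.join(["clistrias", "christmas", "christrias"])
    print(split_alternatives(word))
```

A token without the delimiter yields the token itself and an empty list of alternatives.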

diff --git a/docs/changes.md b/docs/changes.md
@@ -1,4 +1,7 @@
-## Unreleased
+## 0.9.0 (2024-06-12)
+[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.9.0)
+
+Major performance and stability improvements in this release; upgrading is highly recommended.
 
 **Changed:**
 - Add support for multithreaded highlighting. Uses all available logical CPU cores by default and
@@ -228,7 +231,7 @@ significantly.
   shipping with Solr). For example, if your OCR file has the alternatives `christmas` and `christrias` for the token
   `clistrias` in the span `presents on clistrias eve`, users would be able to search for `"presents christmas"` and
   `"presents clistrias"` and would get the correct match in both cases, both with full highlighting.
-  Refer to the corresponding [section in the documentation](../alternatives) for instructions on setting it up.
+  Refer to the corresponding [section in the documentation](./alternatives.md) for instructions on setting it up.
 - **On-the-fly repair of 'broken' markup.**
   `OcrCharFilterFactory` has a new option `fixMarkup` that enables on-the-fly repair of invalid XML in OCR input documents,
   namely problems that can arise when the markup contains unescaped instances of `<`, `>` and `&`.
@@ -255,7 +258,7 @@ significantly.
   identify the page, improving highlighting performance due to the need for less backwards-seeking in the
   input files.
 - **Add new `expandAlternatives` attribute to `OcrCharFilterFactory`.** This enables the parsing of
-   alternative readings from input files (see above and the [corresponding section in the documentation](../alternatives))
+   alternative readings from input files (see above and the [corresponding section in the documentation](./alternatives.md))
   - **Add new `hl.ocr.scorePassages` parameter to disable sorting of passages by their score.**
     See the above section under *New Features* for an explanation of this flag.
 

diff --git a/docs/formats.md b/docs/formats.md
@@ -81,7 +81,7 @@ encode a word with the default form `clistrias` and two alternatives `christmas`
 
 | Block     | MiniOCR tag  | notes                            |
 | --------- | ------------ | -------------------------------- |
-| Word      | `<w/>`       | needs to have `box` attribute with `{x} {y} {width} {height}`. <br>Values can be integers or floats between 0 and 1, **with the leading `0.` omitted** |
+| Word      | `<w/>`       | needs to have `box` attribute with `{x} {y} {width} {height}`. <br>Values can be integers or floats between 0 and 1, **with the leading `0` omitted** |
 | Line      | `<l/>`       |                                  |
 | Block     | `<b/>`       |                                  |
 | Page      | `<p/>`       | needs to have an `xml:id` attribute with a page identifier. Optionally can have a `wh` attribute with the `{width} {height}` values for the page |
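
Reviewer note: the corrected wording for the `box` attribute (floats between 0 and 1 written with the leading `0` omitted) can be sketched as follows. This is an illustration under stated assumptions, not the plugin's own serializer. The function names, the default precision of four decimals, and the choice to write a zero coordinate as `0` are all hypothetical.

```python
# Hypothetical helpers, not part of solr-ocrhighlighting.


def format_box_value(value, precision=4):
    """Format one coordinate for a MiniOCR `box` attribute.

    Integers are written unchanged. Floats must lie in [0, 1) and are
    written without the leading zero, e.g. 0.25 -> ".25".
    """
    if isinstance(value, int):
        return str(value)
    if not 0 <= value < 1:
        raise ValueError(f"relative coordinate out of range: {value!r}")
    text = f"{value:.{precision}f}"
    if not text.startswith("0."):
        # Rounding pushed the value up to 1 at this precision.
        raise ValueError(f"{value!r} rounds to 1 at precision {precision}")
    fraction = text[2:].rstrip("0")
    # Assumption: a zero coordinate is written as the integer 0.
    return "." + fraction if fraction else "0"


def format_box(x, y, width, height):
    """Join the four values in the `{x} {y} {width} {height}` order."""
    return " ".join(format_box_value(v) for v in (x, y, width, height))


if __name__ == "__main__":
    print(format_box(0.25, 0.5, 0.125, 0.0625))
    print(format_box(10, 20, 300, 40))
```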

diff --git a/docs/indexing.md b/docs/indexing.md
@@ -30,7 +30,7 @@ the (again, potentially very large) contents themselves in the index.
     how fast the underlying storage is able to perform random I/O. This is why **we highly recommend
     using flash storage for the documents**.
 
-    Another option to increase highlighting performance is
+    Another option to increase indexing performance is
     to **switch from UTF8 to ASCII** (with XML-escaped Unicode codepoints) for the encoding of the OCR
     files. This requires less CPU during decoding, since we don't have to take multi-byte sequences into
     account. To signal to the plugin that a given source path is encoded in ASCII, include the `{ascii}`
@@ -143,9 +143,9 @@ The format of the regions is inspired by [Python's slicing syntax](https://docs.
 
 
 !!! note "Example Implementation"
-    The [example setup on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/master/example)
-    uses a [Python script](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py)
+    The [example setup on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/main/example)
+    uses a [Python script](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py)
     to index articles from multi-page newspaper scans into Solr. It works by [first extracting the OCR
-    block ids for each article from a METS file](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py#L141-L147)
-    and then [finds the byte regions these OCR blocks are located in](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/ingest.py#L108-L123)
+    block ids for each article from a METS file](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py#L141-L148)
+    and then [finds the byte regions these OCR blocks are located in](https://github.com/dbmdz/solr-ocrhighlighting/blob/main/example/ingest.py#L103-L124)
     to build the source pointer for each article.
diff --git a/docs/installation.md b/docs/installation.md
@@ -19,7 +19,7 @@ your Solrcloud cluster. All paths are relative to the Solr installation director
   `$ ./bin/solr package add-repo dbmdz.github.io https://dbmdz.github.io/solr`
 - **Install package** in the latest version:<br>
   `$ ./bin/solr package install ocrhighlighting` if you're on Solr 9, otherwise:
-  `$ ./bin/solr package install ocrhighlighting:0.8.1-solr78`
+  `$ ./bin/solr package install ocrhighlighting:0.9.0-solr78`
 
 !!! caution "Be sure to use the `ocrhighlighting:` prefix when specifying classes in your configuration."
     When using the Package Manager, classes from plugins have to be prefixed (separated by a colon) by

diff --git a/docs/performance.md b/docs/performance.md
@@ -58,10 +58,12 @@ in your `solrconfig.xml`. Tune these parameters to match your hardware and stora
 - `numHighlightingThreads`: The number of threads that will be used to read and process the OCR files.
    Defaults to the number of logical CPU cores. Set this higher if you're I/O-bottlenecked and can
    support more parallel reads than you have logical CPU cores (very likely for modern NVMe drives).
-- `maxQueuedPerThread`: By default, we queue only a limited number of documents per thread as to not
-  stall other requests. If this number is reached, all highlighting will be done single-threaded on
-  the request thread. You usually don't have to touch this setting, but if you have large result sets
-  with many concurrent requests, this can help to reduce the number of threads that are active at
+- `maxQueuedPerThread`: The thread pool used to highlight documents is shared across all requests.
+  By default, we queue only a limited number of documents per thread as to not
+  stall other requests. If this number is reached, all remaining highlighting
+  will be done single-threaded on the request thread. You usually don't have to
+  touch this setting, but if you have large result sets with many concurrent
+  requests, this can help to reduce the number of threads that are active at
   the same time, at least as a stopgap.
 
 ## Runtime configuration

diff --git a/docs/query.md b/docs/query.md
@@ -72,7 +72,7 @@ The objects contained under the `snippets` key are structured like this:
 - `pages` contains a list of pages the snippet appears on along with their pixel dimensions. This can be useful
   for rendering highlights, e.g. if the highlighting target image is scaled down from the source image.
 - `regions` contains a list of regions that the snippet is located on. Usually this will contain only one item,
-  but in cases where a phrase spans multiple pages, it will contain a region for every page involved in the match.
+  but in cases where a phrase spans multiple pages or columns, it will contain a region for every page and column involved in the match.
   The object includes coordinates for all four corners it is defined by, as well as the identifier of the `page` the
   region is located on.
 - `highlights` contains a list of regions that contain the actual matches for the query as well as the `text` that

diff --git a/example/docker-compose.yml b/example/docker-compose.yml
@@ -1,7 +1,7 @@
 version: '2'
 services:
   solr:
-    image: solr:9.5
+    image: solr:9.6
     ports:
       - "1044:1044"  # Debugging port
       - "8983:8983"  # Solr admin interface

diff --git a/pom.xml b/pom.xml
@@ -6,7 +6,7 @@
 
   <groupId>de.digitalcollections</groupId>
   <artifactId>solr-ocrhighlighting</artifactId>
-  <version>0.9.0-SNAPSHOT</version>
+  <version>0.9.0</version>
 
   <name>Solr OCR Highlighting Plugin</name>
   <description>