Have search engine ignore all abbreviation marks #2781

stsccfr · 2024-12-17T20:16:59Z

After extracting everything we have ever entered into a Forename field for #2779, it is clear that the variety of abbreviation marks that transcribers use is preventing the search engine from finding records. Abbreviation marks are not confined to the end of a name, as in Jno. or Wm.; they also occur in the middle of a name, as in Rich:d or Eliz:th, and we have cases where two abbreviation marks have been used, such as Eliz.th. and Will'm. and Rich'd. Unless Soundex is used, each of these characters counts as if it were a real letter and prevents the search engine from finding the record.

The emendation rules that we have in /lib/tasks/load_emendations.rake are what we use to handle abbreviations directly. We can only have so many of these rules, however, because they add to the processing time of an uploaded CSV file. The proposal here is to allow users to continue entering whatever abbreviation marks they think best represent what the register has, but have the search engine ignore them. What isn't clear to me is whether this should be done by pruning out abbreviation marks in the search_records (which would require rebuilding the entire DB) or whether it is possible to have the search engine ignore them if present (which could slow down searching).
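To make the second option concrete, here is a minimal sketch of ignoring the marks at query time, using plain Ruby on an in-memory list rather than the real search_records. The names MARK, MARK_RUN and mark_tolerant_pattern are hypothetical and do not exist in the codebase. The idea is to turn the typed name into a pattern that tolerates any run of abbreviation marks around each letter.

```ruby
# Abbreviation marks to ignore; the hyphen is last so it is read literally.
MARK = /[.:;`'"’-]/
# The same character class as a string, allowing zero or more marks.
MARK_RUN = "#{MARK.source}*"

# Hypothetical helper: build a case-insensitive pattern that matches the
# typed name with any number of marks before, between, or after its letters.
def mark_tolerant_pattern(term)
  letters = term.gsub(MARK, "").each_char.map { |ch| Regexp.escape(ch) }
  Regexp.new("\\A#{MARK_RUN}#{letters.join(MARK_RUN)}#{MARK_RUN}\\z",
             Regexp::IGNORECASE)
end

forenames = ["Rich:d", "Rich'd.", "Richd", "Richard", "Eliz:th"]
pattern = mark_tolerant_pattern("Richd")
matches = forenames.select { |name| pattern.match?(name) }
puts matches.inspect
```

A pattern like this has to be tested against every candidate value, which is one reason this option could slow down searching, whereas the pruning option pays the cost once when the records are built.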

Ignoring abbreviation marks is important because there are far too many abbreviated forms of names for us to cope with them by Soundex, wildcard searches or emendation rules. Soundex tends to return too many false matches, and is currently turned on for both forenames and surnames together (one cannot select one or the other). Wildcards can only be used after at least two initial letters, and only when a single Place is specified. Emendation rules are too specific, and we would need many thousands of them to cover the variety of abbreviations that we already have.


DeniseColbert · 2024-12-18T16:09:06Z

Can the search engine ignore all punctuation marks? How would we achieve it? How much work is this?

Vino will look into the options.

Vino-S · 2025-01-29T14:27:38Z

Made some changes in test2.
The query is efficient when a place is provided. Without one, the query currently times out.

Vino-S · 2025-01-29T15:42:48Z

Try autocomplete for names for test2

stsccfr · 2025-01-29T21:09:05Z

I want to recap what we discussed today so that it's documented here. The UCF characters we must retain in forenames are:

square brackets []
curly brackets {}
underscore _
asterisk *
question mark ?
comma ,

The comma only occurs inside curly brackets as a range indicator, as in _{3,5} to indicate 3 to 5 unreadable letters.

The most common abbreviation marks that the search engine should ignore are:

full stop .
colon :
single quote '
hyphen -
semicolon ;
backtick `
double quote "
smart quote ’

Happily, there is no overlap between these two lists. What I proposed was that, for each forename field in a search_record, we create a corresponding search_ field populated with whatever is in the forename field but stripped of abbreviation marks. The original forename field would then be used for display purposes, since it contains what the transcriber actually entered. For example, a burial record has the burial_person_forename field, so we would create a new search_burial_person_forename field and populate it as follows:

search_burial_person_forename = burial_person_forename.gsub(/[.:;`'"’-]/, '')

The search engine would then try to match what was in the search_burial_person_forename field, and if it matches, the displayed record would then show the contents of the original burial_person_forename field.
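The flow described above can be sketched end to end in plain Ruby. This is only an illustration under stated assumptions: records are modelled as hashes with string keys rather than real search_record objects, the helper names (strip_abbreviation_marks, with_search_forenames, find_burials_by_forename) are hypothetical, and stripping and lower-casing the typed name before comparing is my assumption rather than something we agreed.

```ruby
# Marks to ignore, as listed above; UCF characters ([] {} _ * ? ,) are untouched.
ABBREVIATION_MARKS = /[.:;`'"’-]/

def strip_abbreviation_marks(name)
  name.to_s.gsub(ABBREVIATION_MARKS, "")
end

# Hypothetical helper: add a search_ counterpart for every forename field,
# leaving the original field intact for display.
def with_search_forenames(record)
  additions = {}
  record.each do |field, value|
    next unless field.end_with?("forename")
    next if field.start_with?("search_")
    additions["search_#{field}"] = strip_abbreviation_marks(value)
  end
  record.merge(additions)
end

# Hypothetical lookup: compare the stripped query against the stripped field.
def find_burials_by_forename(records, typed)
  wanted = strip_abbreviation_marks(typed).downcase
  records.select do |record|
    record["search_burial_person_forename"].to_s.downcase == wanted
  end
end

records = [
  { "burial_person_forename" => "Eliz:th" },
  { "burial_person_forename" => "Eliz.th." },
  { "burial_person_forename" => "W_{3,5}m." },
].map { |record| with_search_forenames(record) }

hits = find_burials_by_forename(records, "Elizth")
# Display what the transcriber actually entered, not the stripped value.
hits.each { |record| puts record["burial_person_forename"] }
```

Both abbreviated spellings of Elizth are found by the one query, and the UCF range in the third record survives stripping, so only its trailing full stop is removed.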

DeniseColbert assigned Vino-S Dec 18, 2024

DeniseColbert added the needs info label Dec 18, 2024

Vino-S added the ready for testing pipeline label Feb 12, 2025
