After extracting everything we have ever entered into a Forename field for #2779, it is clear that the variety of abbreviation marks that transcribers use is preventing the search engine from finding records. Abbreviation marks are not confined to the end of a name, as in Jno. or Wm., but can also occur in the middle of a name, as in Rich:d or Eliz:th, and we also have cases where two abbreviation marks have been used, such as Eliz.th., Will'm. and Rich'd. Unless Soundex is used, each of these characters counts as if it were a real letter and prevents the search engine from finding the record.
The emendation rules that we have in /lib/tasks/load_emendations.rake are what we use to handle abbreviations directly. We can only have so many of these rules, however, because they add to the processing time of an uploaded CSV file. The proposal here is to allow users to continue entering whatever abbreviation marks they think best represent what the register has, but have the search engine ignore them. What isn't clear to me is whether this should be done by pruning out abbreviation marks in the search_records (which would require rebuilding the entire DB) or whether it is possible to have the search engine ignore them if present (which could slow down searching).
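To make the second option concrete, here is a minimal pure-Ruby sketch of what "ignoring the marks at search time" could look like: the search term is turned into a pattern that tolerates any number of abbreviation marks after each letter. The names `OPTIONAL_MARKS` and `mark_tolerant_pattern` are hypothetical and do not exist in the codebase, and how such a pattern would be passed to the actual search query is an assumption left open here.

```ruby
# Hypothetical sketch: names are illustrative, not part of the existing code.
# Character class of abbreviation marks, repeated zero or more times.
OPTIONAL_MARKS = /[.:'\-;`"’]*/.source

# Build a regular expression that matches the search term with any number
# of abbreviation marks interleaved after each letter.
def mark_tolerant_pattern(term)
  # Drop any marks the searcher typed, then escape each remaining character.
  letters = term.gsub(/[.:'\-;`"’]/, '').chars.map { |c| Regexp.escape(c) }
  Regexp.new('\A' + letters.map { |c| c + OPTIONAL_MARKS }.join + '\z')
end

pattern = mark_tolerant_pattern("Richd")
pattern.match?("Rich:d")   # true
pattern.match?("Rich'd.")  # true
pattern.match?("Richard")  # false
```

A pattern match of this kind has to be evaluated against each candidate value rather than looked up directly, which is likely where the slowdown mentioned above would come from.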
Ignoring abbreviation marks is important because there are far too many abbreviated forms of names for us to cope with them by Soundex, wildcard searches or emendation rules. Soundex tends to return too many false matches, and is currently turned on for both forenames and surnames together (one cannot select one or the other). Wildcards can only be used after at least two initial letters, and only when a single Place is specified. Emendation rules are too specific, and we would need many thousands of them to cover the variety of abbreviations that we already have.
The comma only occurs inside curly brackets as a range indicator, as in _{3,5} to indicate 3 to 5 unreadable letters.
The most common abbreviation marks that the search engine should ignore are:
full stop .
colon :
single quote '
hyphen -
semicolon ;
backtick `
double quote "
smart quote ’
Happily, there is no overlap between these two lists. What I proposed was that, for each forename field in a search_record, we create a corresponding search_ field which would be populated with whatever is in the forename field but stripped of abbreviation marks. The original forename field would then be used for display purposes since it contains what the transcriber actually entered. So for example, in a burial record we have the burial_person_forename field, so we would create a new search_burial_person_forename field and populate it according to the following:
The search engine would then try to match what was in the search_burial_person_forename field, and if it matches, the displayed record would then show the contents of the original burial_person_forename field.
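The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `strip_abbreviation_marks` and the constant `ABBREVIATION_MARKS` are hypothetical names, the record is modelled as a plain Hash rather than the real search_record, and the set of characters removed is simply the list of common abbreviation marks given earlier.

```ruby
# Hypothetical sketch: names are illustrative, not part of the existing code.
# The marks listed above: full stop, colon, single quote, hyphen,
# semicolon, backtick, double quote and smart quote.
ABBREVIATION_MARKS = /[.:'\-;`"’]/

# Return the forename with every abbreviation mark removed.
def strip_abbreviation_marks(forename)
  return forename if forename.nil?
  forename.gsub(ABBREVIATION_MARKS, '')
end

# A plain Hash stands in for a burial search_record here (assumption).
record = { burial_person_forename: "Eliz.th." }

# Populate the search_ field; the original field is left untouched for display.
record[:search_burial_person_forename] =
  strip_abbreviation_marks(record[:burial_person_forename])

record[:search_burial_person_forename]  # "Elizth"
record[:burial_person_forename]         # "Eliz.th."
```

With this approach Eliz:th, Eliz.th. and Elizth all reduce to the same searchable value, while the displayed record still shows what the transcriber entered.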