Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Automate location extraction and english translation #642

Merged
merged 24 commits into from
Aug 5, 2024
Merged

Conversation

cka-y
Copy link
Contributor

@cka-y cka-y commented Jul 30, 2024

Summary:

Closes #618, which includes adding translations for location names (country, subdivision, and municipality) to the FeedSearch materialized view and improving the functionality for geocoding locations from GTFS feeds.

Changes include:

  1. Database Schema Changes:

    • Introduced a new country column to the Location table.
    • Introduced a new Translation table with type, language_code, key, and value columns to store translations for various location elements.
    • Modified the FeedSearch materialized view to incorporate translations for country, subdivision_name, and municipality into the searchable document field.
  2. Backend Logic:

    • Created the GeocodedLocation class to include methods for handling location extraction and translations.
    • Implemented enhancements to the update_location function to integrate location translations into the database.
  3. Testing:

    • Added unit tests for the new translation functionality.

Expected behavior:

The system should now support searching feeds using English-translated names for countries, subdivisions, and municipalities. When a feed has associated locations with translations available in the Translation table, these translations will be included in the search index, enabling users to find feeds using either the original or translated location names. This change aims to improve the searchability of feeds for users who might use different languages.

Feed locations are also now automatically extracted from reverse geolocating using five points from the dataset: the extreme points (the ones with extreme lat/lon which give four points but can be less if one point represents two extremes) and the point in stops.txt closest to the center of the bounding box. Additional points are randomly selected to complete the count of five. The decision on the subdivision or municipality is based on majority voting. If there's no majority at the subdivision level, the country level is included, and multiple countries are included if necessary.

Testing tips:
Use the PR preview URL to search for locations. Example tests:

  • Espana vs España vs. Spain
  • Cairo vs القاهرة
  • 日本 vs Japan
  • ประเทศไทย vs Thailand
  • Bayern vs Bavaria
  • Aotearoa vs New Zealand

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@cka-y cka-y changed the title Feat/618 feat: Automate location extraction and english translation Jul 31, 2024
Copy link

Preview Firebase Hosting URL: https://mobility-feeds-dev--pr-642-u6p0n28u.web.app

@@ -117,6 +117,11 @@ def populate_location(self, feed, row, stable_id):
"""
Populate the location for the feed
"""
# TODO: validate behaviour for gtfs-rt feeds
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be validated as part of #623

@cka-y cka-y marked this pull request as ready for review July 31, 2024 17:24
@cka-y
Copy link
Contributor Author

cka-y commented Jul 31, 2024

Note that the integration tests are currently expected to fail due to the database changes. The corresponding API updates should be covered in issues #622 and #623.

@emmambd
Copy link
Contributor

emmambd commented Jul 31, 2024

@cka-y From a QA perspective, this looks great! (Solves the core problem we were trying to fix, where people can input full country name in either original language or English, accents or no accents). Much better user behaviour that I look forward to telling our usability testers we incorporated!

Couple outstanding questions:

I see a few examples of countries that when I include the original language name vs. the English, I get a different number of results:

Is this because the location data in the search UI is inaccurate? Or another reason? In these cases, it looks like the feeds are all based in the same country, so it's not because of text in the feed name or transit provider that matches where the location does not.

@cka-y
Copy link
Contributor Author

cka-y commented Jul 31, 2024

@emmambd

Bavaria vs. Bayern (difference in the red square):
Screenshot 2024-07-31 at 1 54 51 PM

It appears that the two feeds not returned when searching for "Bavaria" (the translation) are due to the following reasons:

  1. First Feed: It lacks a valid dataset because the producer URL returns an HTML file instead of the expected data.

  2. Second Feed: This is a very large feed, and we do not have sufficient memory in the cloud function to process it, specifically to open the stops and parse it for extracting either a bounding box or location. Currently, the memory allocation is 8Gi, but we can consider increasing it if we find it valuable, albeit at a higher cost.

In the first element, the feed is returned when searching for "Bayern" despite "Bayern" not appearing in the feed. This discrepancy is likely due to string transformation processes during the search query. For the feeds related to Spain, there may be a similar reason.
Also, searching for "Spain" returns 34 results, while "España" returns 31. The extracted location is always in the regional language (in this case, "España"), suggesting that the difference is not due to location mismatches but rather the search process (otherwise España should return more results than Spain).

@emmambd
Copy link
Contributor

emmambd commented Jul 31, 2024

@cka-y - Got it. If I understand this correctly, basically this is occurring either due to 1) issues with parsing the location because of the feed, whether it be missing data or size or 2) changes we still need to make to the API side?

I'm fine living with this as is for now.

@cka-y
Copy link
Contributor Author

cka-y commented Aug 1, 2024

@emmambd

I've done a deeper dive to understand the differences in search results. For Bavaria vs Bayern, the issue is with the feed data: either the file is too large to process or the dataset is missing because the producer URL does not return a valid zip file. However, for Spain vs Espana, I discovered that our search was only using the first element of the location data. This approach was initially fine because we typically had only one location per feed.

The problem arose with feeds containing multiple locations (e.g., feeds covering several countries including Espana). If Espana wasn't the first country listed, it would not be included in the search document, leading to discrepancies. Additionally, the English translation (Spain) was consistently appearing as the first element in the translations, further skewing the results.

I've now fixed this issue! Both Spain and Espana queries return the correct number of feeds, which is 35. I've double-checked this with the database to confirm the accuracy.

@davidgamez
Copy link
Member

Note that the integration tests are currently expected to fail due to the database changes. The corresponding API updates should be covered in issues #622 and #623.

We should comment the locations integration test, as (I believe ) it will block the production content updates from the catalog repository

@emmambd
Copy link
Contributor

emmambd commented Aug 1, 2024

@davidgamez @cka-y Does that mean that merging this is blocked until #622 is done?

@davidgamez
Copy link
Member

davidgamez commented Aug 1, 2024

@davidgamez @cka-y Does that mean that merging this is blocked until #622 is done?

Yes, this is why my suggestion is to comment/ignore the integration test for locations until the follow-up issue is completed.

@cka-y
Copy link
Contributor Author

cka-y commented Aug 1, 2024

@davidgamez @emmambd I could comment the location filtering integration tests as part of this PR, merge and then generate/uncomment tests as part of #622. Thoughts?

@cka-y cka-y requested a review from davidgamez August 2, 2024 14:52
Copy link
Member

@davidgamez davidgamez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work; I added minor non-blocking comments

task_id=task_id,
index=f"{i + 1}/{len(country_codes)}",
)
# def test_filter_by_country_code(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

picky: this can be disabled with

    @pytest.mark.skip(reason="This test is expected to fail until API location issue is implemented")
    def test_filter_by_country_code(self):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we do not use pytest to run the integration tests, ill leave this commented for now but i'll address the changes as part of #622

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cka-y Is it time to put back these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#662 needs to be addressed first

task_id=task_id,
index=f"{i + 1}/{len(municipalities)}",
)
# def test_filter_by_municipality(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as before, use pytest skip annotation.

task_id=task_id,
index=f"{i + 1}/{len(country_codes)}",
)
# def test_filter_by_country_code_gtfs(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as before, use pytest skip annotation.

task_id=task_id,
index=f"{i + 1}/{len(municipalities)}",
)
# def test_filter_by_municipality_gtfs(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as before, use pytest skip annotation.

task_id=task_id,
index=f"{i + 1}/{len(country_codes)}",
)
# def test_filter_by_country_code_gtfs_rt(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as before, use pytest skip annotation.

task_id=task_id,
index=f"{i + 1}/{len(municipalities)}",
)
# def test_filter_by_municipality_gtfs_rt(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as before, use pytest skip annotation.

liquibase/changes/feat_618_2.sql Outdated Show resolved Hide resolved
@cka-y cka-y linked an issue Aug 5, 2024 that may be closed by this pull request
@cka-y cka-y merged commit c8fbae6 into main Aug 5, 2024
5 of 6 checks passed
@cka-y cka-y deleted the feat/618 branch August 5, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error with docker-compose in gh actions Automate location generation
4 participants