feat: Automate location extraction and english translation #642

cka-y · 2024-07-30T15:18:22Z

Summary:

Closes #618, which includes adding translations for location names (country, subdivision, and municipality) to the FeedSearch materialized view and improving the functionality for geocoding locations from GTFS feeds.

Changes include:

Database Schema Changes:
- Introduced a new country column to the Location table.
- Introduced a new Translation table with type, language_code, key, and value columns to store translations for various location elements.
- Modified the FeedSearch materialized view to incorporate translations for country, subdivision_name, and municipality into the searchable document field.
Backend Logic:
- Created the GeocodedLocation class to include methods for handling location extraction and translations.
- Implemented enhancements to the update_location function to integrate location translations into the database.
Testing:
- Added unit tests for the new translation functionality.

Expected behavior:

The system should now support searching feeds using English-translated names for countries, subdivisions, and municipalities. When a feed has associated locations with translations available in the Translation table, these translations will be included in the search index, enabling users to find feeds using either the original or translated location names. This change aims to improve the searchability of feeds for users who might use different languages.
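As a rough illustration of the behavior described above, here is a minimal sketch using an in-memory SQLite table. The table mirrors only the four columns named in this PR; the primary key, the `type` values, and the `search_document` helper are assumptions made for the example, not the actual FeedSearch view definition.

```python
import sqlite3

# Hypothetical, simplified stand-in for the Translation table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE translation ("
    " type TEXT, language_code TEXT, key TEXT, value TEXT,"
    " PRIMARY KEY (type, language_code, key))"  # key choice is an assumption
)
conn.executemany(
    "INSERT INTO translation VALUES (?, ?, ?, ?)",
    [
        ("country", "en", "España", "Spain"),
        ("subdivision_name", "en", "Bayern", "Bavaria"),
    ],
)


def search_document(location: dict) -> str:
    """Join the original names with any English translation found for them."""
    terms = []
    for field in ("country", "subdivision_name", "municipality"):
        original = location.get(field)
        if not original:
            continue
        terms.append(original)
        row = conn.execute(
            "SELECT value FROM translation"
            " WHERE type = ? AND language_code = 'en' AND key = ?",
            (field, original),
        ).fetchone()
        # Skip the translation when it is identical to the original name.
        if row and row[0] != original:
            terms.append(row[0])
    return " ".join(terms)


doc = search_document({"country": "Deutschland", "subdivision_name": "Bayern"})
```

Here `doc` contains both "Bayern" and "Bavaria", so a query for either name can match, while "Deutschland" stays untranslated because no row exists for it.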

Feed locations are now also extracted automatically by reverse geocoding five points from the dataset: the extreme points (those with the minimum and maximum latitude and longitude, which gives four points, or fewer if one point holds two extremes) and the point in stops.txt closest to the center of the bounding box. If this yields fewer than five points, additional points are selected at random to bring the count to five. The subdivision and municipality are decided by majority vote. If there's no majority at the subdivision level, the country level is included, and multiple countries are included if necessary.
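The selection and voting steps could be sketched roughly as follows. This is one reading of the description, not the code in this PR: `select_points`, `majority`, and `decide_locations` are hypothetical names, the reverse-geocoding call itself is left out, and the tiered fallback (municipality, then subdivision, then countries) is an assumption.

```python
import random
from collections import Counter


def select_points(stops, count=5, rng=None):
    """Pick up to `count` distinct (lat, lon) stops: extremes, center-most, random fill."""
    rng = rng or random.Random()
    stops = list(dict.fromkeys(stops))  # de-duplicate while keeping order
    if not stops:
        return []
    lats = [lat for lat, _ in stops]
    lons = [lon for _, lon in stops]
    center = ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)
    candidates = [
        min(stops, key=lambda p: p[0]),  # southernmost
        max(stops, key=lambda p: p[0]),  # northernmost
        min(stops, key=lambda p: p[1]),  # westernmost
        max(stops, key=lambda p: p[1]),  # easternmost
        # Squared planar distance is a simplification, good enough for sampling.
        min(stops, key=lambda p: (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2),
    ]
    selected = list(dict.fromkeys(candidates))  # one stop may hold several roles
    remaining = [p for p in stops if p not in selected]
    rng.shuffle(remaining)
    selected.extend(remaining[: max(0, count - len(selected))])
    return selected[:count]


def majority(values):
    """Return the value held by more than half of the votes, else None."""
    if not values:
        return None
    value, votes = Counter(values).most_common(1)[0]
    return value if votes > len(values) / 2 else None


def decide_locations(geocoded):
    """geocoded: one (country, subdivision, municipality) tuple per sampled point."""
    full = majority(geocoded)
    if full:
        return [dict(zip(("country", "subdivision_name", "municipality"), full))]
    sub = majority([(c, s) for c, s, _ in geocoded])
    if sub:
        return [{"country": sub[0], "subdivision_name": sub[1], "municipality": None}]
    # No subdivision majority: fall back to every country seen (assumed behavior).
    countries = dict.fromkeys(c for c, _, _ in geocoded)
    return [
        {"country": c, "subdivision_name": None, "municipality": None}
        for c in countries
    ]
```

With five sampled points, a municipality or subdivision needs at least three votes to win; a feed spanning several countries with no clear winner ends up with one country-level entry per country.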

Testing tips:
Use the PR preview URL to search for locations. Example tests:

Espana vs España vs. Spain
Cairo vs القاهرة
日本 vs Japan
ประเทศไทย vs Thailand
Bayern vs Bavaria
Aotearoa vs New Zealand
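
For the "Espana vs España" case, this PR does not spell out how the search treats accents, so the following is only a general sketch of accent folding, with `fold` as a hypothetical helper. It decomposes characters and drops combining marks; scripts where such marks carry meaning would likely need more care than this.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold and strip combining marks so accented and plain forms compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


same = fold("España") == fold("Espana")
```

Folding both the indexed text and the query the same way is what makes the two spellings match; names such as 日本 pass through unchanged.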

Please make sure these boxes are checked before submitting your pull request - thanks!

Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
Add or update any needed documentation to the repo
Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
Linked all relevant issues
Include screenshot(s) showing how this pull request works and fixes the issue(s)

github-actions · 2024-07-31T14:36:15Z

Preview Firebase Hosting URL: https://mobility-feeds-dev--pr-642-u6p0n28u.web.app

cka-y · 2024-07-31T16:55:01Z

api/src/scripts/populate_db.py

@@ -117,6 +117,11 @@ def populate_location(self, feed, row, stable_id):
        """
        Populate the location for the feed
        """
+        # TODO: validate behaviour for gtfs-rt feeds


This should be validated as part of #623

cka-y · 2024-07-31T17:26:49Z

Note that the integration tests are currently expected to fail due to the database changes. The corresponding API updates should be covered in issues #622 and #623.

emmambd · 2024-07-31T17:44:10Z

@cka-y From a QA perspective, this looks great! (Solves the core problem we were trying to fix, where people can input full country name in either original language or English, accents or no accents). Much better user behaviour that I look forward to telling our usability testers we incorporated!

Couple outstanding questions:

I see a few examples of countries that when I include the original language name vs. the English, I get a different number of results:

Is this because the location data in the search UI is inaccurate? Or another reason? In these cases, it looks like the feeds are all based in the same country, so it's not because of text in the feed name or transit provider that matches where the location does not.

cka-y · 2024-07-31T18:12:00Z

@emmambd

Bavaria vs. Bayern (difference in the red square):

It appears that the two feeds not returned when searching for "Bavaria" (the translation) are due to the following reasons:

First Feed: It lacks a valid dataset because the producer URL returns an HTML file instead of the expected data.
Second Feed: This is a very large feed, and we do not have sufficient memory in the cloud function to process it, specifically to open the stops file and parse it to extract either a bounding box or a location. Currently, the memory allocation is 8Gi, but we can consider increasing it if we find it valuable, albeit at a higher cost.

In the first element, the feed is returned when searching for "Bayern" despite "Bayern" not appearing in the feed. This discrepancy is likely due to string transformation processes during the search query. For the feeds related to Spain, there may be a similar reason.
Also, searching for "Spain" returns 34 results, while "España" returns 31. The extracted location is always in the regional language (in this case, "España"), suggesting that the difference is not due to location mismatches but rather the search process (otherwise España should return more results than Spain).

emmambd · 2024-07-31T18:55:08Z

@cka-y - Got it. If I understand this correctly, basically this is occurring either due to 1) issues with parsing the location because of the feed, whether it be missing data or size or 2) changes we still need to make to the API side?

I'm fine living with this as is for now.

cka-y · 2024-08-01T13:40:04Z

@emmambd

I've done a deeper dive to understand the differences in search results. For Bavaria vs Bayern, the issue is with the feed data: either the file is too large to process or the dataset is missing because the producer URL does not return a valid zip file. However, for Spain vs Espana, I discovered that our search was only using the first element of the location data. This approach was initially fine because we typically had only one location per feed.

The problem arose with feeds containing multiple locations (e.g., feeds covering several countries including Espana). If Espana wasn't the first country listed, it would not be included in the search document, leading to discrepancies. Additionally, the English translation (Spain) was consistently appearing as the first element in the translations, further skewing the results.

I've now fixed this issue! Both Spain and Espana queries return the correct number of feeds, which is 35. I've double-checked this with the database to confirm the accuracy.
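
A minimal sketch of the difference, using hypothetical helper names and plain lists of names rather than the actual view definition:

```python
def document_first_only(locations):
    """Behavior before the fix, as described: only the first location is indexed."""
    return " ".join(locations[0]) if locations else ""


def document_all(locations):
    """Behavior after the fix: every location contributes its names once."""
    terms = []
    for names in locations:
        for name in names:
            if name not in terms:
                terms.append(name)
    return " ".join(terms)


# Each inner list holds the original name followed by its translation, if any.
feed_locations = [["Portugal"], ["España", "Spain"]]

before = document_first_only(feed_locations)
after = document_all(feed_locations)
```

In this example `before` contains only "Portugal", so the feed would be missed by both Spain and España queries, while `after` contains all three names.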

davidgamez · 2024-08-01T15:21:27Z

> Note that the integration tests are currently expected to fail due to the database changes. The corresponding API updates should be covered in issues #622 and #623.

We should comment out the locations integration test, as (I believe) it will block the production content updates from the catalog repository.

emmambd · 2024-08-01T16:25:17Z

@davidgamez @cka-y Does that mean that merging this is blocked until #622 is done?

davidgamez · 2024-08-01T16:27:37Z

> @davidgamez @cka-y Does that mean that merging this is blocked until #622 is done?

Yes, this is why my suggestion is to comment out/ignore the integration test for locations until the follow-up issue is completed.

cka-y · 2024-08-01T16:32:25Z

@davidgamez @emmambd I could comment out the location filtering integration tests as part of this PR, merge, and then generate/uncomment the tests as part of #622. Thoughts?

davidgamez

Great work; I added minor non-blocking comments

davidgamez · 2024-08-02T15:13:39Z

integration-tests/src/endpoints/feeds.py

-                task_id=task_id,
-                index=f"{i + 1}/{len(country_codes)}",
-            )
+    # def test_filter_by_country_code(self):


picky: this can be disabled with

@pytest.mark.skip(reason="This test is expected to fail until API location issue is implemented")
def test_filter_by_country_code(self):

As we do not use pytest to run the integration tests, I'll leave this commented out for now, but I'll address the changes as part of #622.

@cka-y Is it time to put back these tests?

#662 needs to be addressed first

davidgamez · 2024-08-02T15:14:46Z

integration-tests/src/endpoints/feeds.py

-                task_id=task_id,
-                index=f"{i + 1}/{len(municipalities)}",
-            )
+    # def test_filter_by_municipality(self):