
Streamify import and process osm data task #1214

Open · wants to merge 2 commits into base: main
Conversation

GabrielBruno24 (Collaborator):

Changes the import and process OSM data task so that it reads the source files with an asynchronous stream. Also adds an option to one of the functions used by the task so that it will not halt if a single GeoJSON file is missing compared to the raw OSM data.

Adds new classes that inherit from DataGeojson and DataOsmRaw.
Instead of reading the whole file at once, these classes stream it piece by piece asynchronously, allowing large files to be read without crashing the application.
Modifies the function getGeojsonsFromRawData() so that it accepts a new option parameter, continueOnMissingGeojson.
When generateNodesIfNotFound is false and continueOnMissingGeojson is true, the function will move on to the next GeoJSON and print a warning if a feature that is in the raw OSM data is not in the GeoJSON data.
Previously, the only behavior was to throw an error and interrupt the process.
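The option described above could look roughly like the following sketch. All names and signatures here (the element and feature shapes, the lookup map, the exact parameter layout of getGeojsonsFromRawData) are illustrative assumptions, not the project's actual code; only the described warn-and-skip behavior is taken from the PR description.

```typescript
// Hypothetical shapes standing in for the project's raw OSM and GeoJSON types.
type RawOsmElement = { id: string };
type GeojsonFeature = { id: string; geometry: unknown };

type GetGeojsonsOptions = {
    generateNodesIfNotFound?: boolean;
    continueOnMissingGeojson?: boolean;
};

// Sketch of the described behavior: when generateNodesIfNotFound is false and
// continueOnMissingGeojson is true, a missing feature produces a warning and
// is skipped; otherwise an error interrupts the process, as before.
function getGeojsonsFromRawData(
    rawData: RawOsmElement[],
    geojsonById: Map<string, GeojsonFeature>,
    options: GetGeojsonsOptions = {}
): GeojsonFeature[] {
    const results: GeojsonFeature[] = [];
    for (const element of rawData) {
        const feature = geojsonById.get(element.id);
        if (feature === undefined) {
            if (options.generateNodesIfNotFound !== true && options.continueOnMissingGeojson === true) {
                // New behavior: warn and continue instead of halting the task.
                console.warn(`No geojson found for raw OSM element ${element.id}, skipping.`);
                continue;
            }
            // Previous behavior: abort on the first missing feature.
            throw new Error(`No geojson found for raw OSM element ${element.id}`);
        }
        results.push(feature);
    }
    return results;
}
```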
@tahini (Collaborator) left a comment:

Let me know if I am mistaken in my review, but I'm not sure the intended goal of streaming the operation is actually achieved.

}

// Factory method so that we can create the class while calling an async function.
static async Create(filename: string): Promise<DataStreamGeojson> {
Collaborator:

lower case 'c' for the function name here

// Factory method so that we can create the class while calling an async function.
static async Create(filename: string): Promise<DataStreamGeojson> {
const instance = new DataStreamGeojson(filename);
await instance.streamDataFromFile();
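The excerpt above uses the common async-factory pattern: the constructor stays synchronous, and a static async method performs the awaitable initialization before handing back the instance. A minimal self-contained sketch (with the lower-case name suggested in the review; the class and member names are illustrative, not the project's):

```typescript
class StreamedData {
    private _loaded = false;

    // Private constructor forces callers to go through the async factory.
    private constructor(private readonly _filename: string) {}

    // Static async factory: constructors cannot be async, so the awaitable
    // setup happens here before the instance is returned.
    static async create(filename: string): Promise<StreamedData> {
        const instance = new StreamedData(filename);
        await instance.load();
        return instance;
    }

    // Placeholder for the real async work (e.g. streaming the file).
    private async load(): Promise<void> {
        this._loaded = true;
    }

    get filename(): string {
        return this._filename;
    }

    get isLoaded(): boolean {
        return this._loaded;
    }
}
```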
Collaborator:

Do you actually read the whole file in the factory method, effectively adding all features in memory right now? I'm not sure this will solve the problem. See comment in the readGeojsonData method below.

console.log('Start streaming GeoJSON data.');
const readStream = fs.createReadStream(this._filename);
const jsonParser = JSONStream.parse('features.*');
const features: GeoJSON.Feature[] = [];
Collaborator:

Here, you seem to fill the features array with the file content. If the file content is too big for memory, so will be the features array, no? Ideally, each feature should be "processed" (whatever processed means to the consumer of this class) and dropped after processing to avoid filling the memory.

Collaborator:

No, because the main original limitation was that the whole file was put into one big string, and that string had a maximum size.

Yes, this approach takes a lot of memory, but Node seems to be able to cope since the file is read in small chunks. We will need to refactor this further, but I don't think we have time for that at the moment.
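The reviewer's process-and-drop suggestion, sketched abstractly: each feature is handled as it arrives and then released, so memory stays bounded by one feature rather than the whole file. The feature source is modeled here as an AsyncIterable; in the real task it would come from fs.createReadStream piped through JSONStream.parse('features.*'). The function and type names are illustrative assumptions.

```typescript
// Minimal GeoJSON-like feature shape for the sketch.
type Feature = { type: 'Feature'; properties: Record<string, unknown> };

// Consumes features one at a time, invoking the caller's handler per feature
// instead of accumulating them in an array. Returns the number processed.
async function processFeatureStream(
    features: AsyncIterable<Feature>,
    processFeature: (feature: Feature) => Promise<void> | void
): Promise<number> {
    let count = 0;
    for await (const feature of features) {
        // The feature is processed and then goes out of scope, so it can be
        // garbage-collected before the next one is parsed.
        await processFeature(feature);
        count += 1;
    }
    return count;
}
```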

Collaborator:

OK, so it is less limited than it used to be, and an intermediary step for later when everything is in PostGIS? Fair enough. Then just fix the lowercase 'c' in the function name and it's good.

greenscientist (Collaborator), Jan 26, 2025:

We'll wait for confirmation from @GabrielBruno24 that it actually works on a region as big as Montreal.
Would be interesting to see if there's a limit.
(And maybe some stats on memory usage while it runs.)

GabrielBruno24 (Collaborator, Author):

It takes forever because the time complexity is O(N^2), but it does work if you give Node enough memory, yes. It would work better if all the write actions were part of a stream, like I did with tasks 1 and 1b, but the logic here is a lot more complex, so it would be better to just rewrite the function from scratch.
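For reference, "making the write actions part of a stream" generally means chaining the source, processing, and writes with Node's built-in pipeline, so backpressure keeps memory bounded end to end. This is only a generic sketch of that pattern under in-memory stand-ins, not the project's tasks 1/1b code.

```typescript
import { Readable, Transform, TransformCallback, Writable } from 'stream';
import { pipeline } from 'stream/promises';

// Collects output so the example is self-checking.
const processed: string[] = [];

async function run(): Promise<void> {
    // In the real task the source would be parsed OSM elements; an in-memory
    // Readable stands in for it here.
    const source = Readable.from(['node-1', 'node-2', 'node-3']);

    // Each element is transformed and passed along immediately, so no full
    // array of results ever accumulates.
    const transform = new Transform({
        objectMode: true,
        transform(chunk: string, _encoding: BufferEncoding, callback: TransformCallback): void {
            callback(null, chunk.toUpperCase());
        }
    });

    // Stand-in for the write actions (e.g. writing GeoJSON to disk).
    const sink = new Writable({
        objectMode: true,
        write(chunk: string, _encoding: BufferEncoding, callback: (error?: Error | null) => void): void {
            processed.push(chunk);
            callback();
        }
    });

    // pipeline wires the three stages together and propagates errors and
    // backpressure between them.
    await pipeline(source, transform, sink);
}
```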
