Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model redesign #30

Merged
merged 77 commits into from
Nov 27, 2023
Merged

Model redesign #30

merged 77 commits into from
Nov 27, 2023

Conversation

annehaley
Copy link
Collaborator

@annehaley annehaley commented Oct 9, 2023

This PR is the first in a series of three PRs in a large redesign effort. This PR is scoped to include changes to database models, their rest endpoints, their conversion functions, and the ingest process that creates objects from our sample data (the populate management command).

  • Reorganize models
  • Rewrite Rest API for new models
  • Rewrite populate script and tasks to fit creation of new models

Follow-up PRs address the other facets of the redesign effort. Ideally, we should prepare and review them all together so that they can be merged all at once.

image

@jjnesbitt
Copy link
Member

Considering this, developers can use this branch with its new migration history while the docker file refers to the new location, and they can use other branches with their old database location. If this is merged, we will no longer need the old database locations.

The local storage of volume data is meant as a temporary measure for reviewing/testing out this PR, correct (i.e. wouldn't be merged to master)? This configuration makes it otherwise pesky to do a local wipe of the database.

@annehaley
Copy link
Collaborator Author

The local storage of volume data is meant as a temporary measure for reviewing/testing out this PR, correct (i.e. wouldn't be merged to master)? This configuration makes it otherwise pesky to do a local wipe of the database.

Sure, this change can be removed before merging. I find it convenient that I can see the volume-mounted folders in the project location, but I understand that's just a personal preference. Besides, it deviates from the Girder 4 setup, so this change can just be scoped to this WIP.

@annehaley
Copy link
Collaborator Author

I made another iteration on the design (aaa1c7f), with the following changes @AlmightyYakob and I have discussed:

  • Changing the relationship between NetworkNodes and NetworkEdges, as suggested above.
  • Removing the inheritance for regions. OriginalRegion and DerivedRegion just have some repeated fields instead.
  • Additionally, DerivedRegion now has a reference to a VectorDataSource as its map representation, since it doesn't have a reference to a Dataset. We discussed how the region models themselves are internal representations, just like network nodes and edges. For visualization on the map, we should rely on DataSource -type objects. Therefore, DerivedRegions need a reference to one of these, which we know should be of the Vector type.
  • Removing the attempt at a DataCollection model. This is a concept we should keep in mind for future iterations, but it doesn't make sense at the moment. This concept is intended for a user to curate their own collections, so this model would likely be tied to a User object. At this stage, we disregard User objects and have not implemented authentication, so these objects would be largely purposeless. We will need them in the future, though, and they will be easy to add in later, since they are outside of the main web of models.

@jjnesbitt
Copy link
Member

Additionally, DerivedRegion now has a reference to a VectorDataSource as its map representation, since it doesn't have a reference to a Dataset. We discussed how the region models themselves are internal representations, just like network nodes and edges. For visualization on the map, we should rely on DataSource -type objects. Therefore, DerivedRegions need a reference to one of these, which we know should be of the Vector type.

I think the problem with this is that derived regions can be "derived" from regions taken from multiple datasets. I've been thinking about the lingering gaps in our modeling, and I'm starting to think we need an additional data source, DerivedRegionDataSource. This would replace DerivedRegion entirely, and would largely contain the same fields.

I also think that rather than point to a dataset, a Region should refer to a VectorDataSource (which in turn refers to a dataset), since I think that more closely models what we're trying to achieve with the DataSource abstraction. What do you think? I've included a diagram below to represent this configuration.

---
UVDAT ER Diagram
---
erDiagram
Dataset {
    CharField name
    TextField description
    ForeignKey city
}

DerivedRegionDataSource {
    CharField name
    ForeignKey city
    S3FileField geojson_data
    ManyToManyField source_regions
}

VectorDataSource {
    ForeignKey dataset
    JSONField metadata
    S3FileField geojson_data
}

Region {
    BigAutoField id
    CharField name
    JSONField properties
    MultiPolygonField boundary
    ForeignKey data_source
}
%% Relations

Dataset }|--|| City : city
VectorDataSource }|--|| Dataset: dataset
Region }|--|| VectorDataSource : data_source
DerivedRegionDataSource }|--|{ Region: source_regions
DerivedRegionDataSource }|--|| City: city
Loading

@annehaley
Copy link
Collaborator Author

annehaley commented Oct 12, 2023

I think the problem with this is that derived regions can be "derived" from regions taken from multiple datasets. I've been thinking about the lingering gaps in our modeling, and I'm starting to think we need an additional data source, DerivedRegionDataSource. This would replace DerivedRegion entirely, and would largely contain the same fields. I also think that rather than point to a dataset, a Region should refer to a VectorDataSource (which in turn refers to a dataset), since I think that more closely models what we're trying to achieve with the DataSource abstraction.

I'm not sure we would be able to capture all we want to represent for the DerivedRegion in a DataSource. The goal for the DataSource type is to separate the map representation from the internal representation. For the map representation of a DerivedRegion, a VectorDataSource has everything we need. I think the internal representation of the boundary and the original regions should be separate from this. With the current model, it's fine if the original regions are from multiple datasets; the DerivedRegion is made independent from any datasets and a VectorDataSource (with dataset=null) can be made for it.

The way I think of the DataSource concept is as a implementation detail for visualization. Ideally, a user should be able to interact with every other model without concerning themselves with the DataSources. They can fetch Datasets and their FileItems, or they can fetch NetworkNodes and NetworkEdges related to a Dataset, or they can fetch OriginalRegions and DerivedRegions in a City. An advanced user may be interested in the converted data and may look at the DataSources, but the average user would only want the main models. Thus, the DataSource-type models shouldn't contain any data essential to the objects themselves. They should be largely hidden from the user and consumed only by the map viz.

@annehaley
Copy link
Collaborator Author

With the above consideration of how I think of DataSources, I think it would be more appropriate to make Charts as they were before. They should be a sibling to Dataset, not a child of AbstractDataSource. They aren't intended to be used by the map viz. I've made these changes in 1afb1f4

@jjnesbitt
Copy link
Member

The way I think of the DataSource concept is as a implementation detail for visualization. Ideally, a user should be able to interact with every other model without concerning themselves with the DataSources. They can fetch Datasets and their FileItems, or they can fetch NetworkNodes and NetworkEdges related to a Dataset, or they can fetch OriginalRegions and DerivedRegions in a City. An advanced user may be interested in the converted data and may look at the DataSources, but the average user would only want the main models. Thus, the DataSource-type models shouldn't contain any data essential to the objects themselves. They should be largely hidden from the user and consumed only by the map viz.

This makes sense to me, and resolves my questions/concerns with DerivedRegion. I guess for that specific use case, we would use the city field to retrieve the DerivedRegions, and when it came time to view them on the map, we would use the VectorDataSource to retrieve that data.

My only other concern then is making sure the original use case regarding time series is covered. I'm a bit fuzzy on that, maybe we should discuss this offline at some point.

@annehaley
Copy link
Collaborator Author

Sounds good to me. I spent today on bdcece6, which fits our data into these new models. I tried combining the flood area datasets (grouping them by type) as a test for multiple DataSources on the same Dataset. We would do the same thing with the time series data.
Let me know what you think. We can discuss more tomorrow or next week if you'd like.

@jjnesbitt jjnesbitt mentioned this pull request Nov 22, 2023
@annehaley annehaley requested a review from jjnesbitt November 27, 2023 16:59
Copy link
Member

@jjnesbitt jjnesbitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 0a2c9ca, which mostly just ensures that newRegionName is set back to "" when a derived region selection is canceled (the rest is formatting/ergonomics). If that looks good to you, then this is ready.

@annehaley annehaley merged commit 72000c2 into master Nov 27, 2023
@annehaley annehaley deleted the model-redesign branch November 27, 2023 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants