
Where to put different (tabular) data files? #6

Open
hoijui opened this issue Mar 21, 2023 · 10 comments
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed question Further information is requested

Comments

@hoijui
Owner

hoijui commented Mar 21, 2023

Type of files

Two specific types I am thinking of:

  • gathered data from:
    • measurements
    • simulations which are too costly to run on the fly or (often) in CI
    • interviews
    • surveys
  • manually assembled data, like:
    • a list of file extensions, with info denoting whether they are text or binary formats
    • A CSV table describing a standard, like the ones in the repo

Storage locations - Ideation

  1. A separate data folder
    • data/measurement/forcesX.csv
    • data/simulation/stress1.csv
    • data/simulation/stress2.csv
    • data/survey1.csv
  2. A sub-folder under res
    • res/data/measurement/forcesX.csv
    • res/data/simulation/stress1.csv
    • res/data/simulation/stress2.csv
    • res/data/survey1.csv

General or specific folder name(s)

data is of course a very general term that, strictly speaking,
would apply to almost anything in a repo anyway.
Something like sheet or tabular, on the other hand,
seems too specific and overly focused on the format,
which is assumed to be a 2D table,
while we might also want to include data with more (or fewer) than 2 dimensions.

@hoijui hoijui added help wanted Extra attention is needed question Further information is requested labels Mar 21, 2023
@hoijui hoijui changed the title Where to put different tabular data files? Where to put different (tabular) data files? Apr 1, 2023
@hoijui
Owner Author

hoijui commented Apr 1, 2023

In another practical example, I have slightly different data:

I wrote a script that takes a git repo web URL (e.g. https://github.com/hoijui/osh-dir-std/) and, by looking at that page's HTML source, decides whether the repo is public or not.
To come up with the code, I had to do some "research": going to different git repo hosting sites and looking at the HTML source for their repos, both public and non-public (e.g. private) ones.
I then copied out the relevant parts and collected them in two Markdown files: public.md and private.md
Where do these belong?

  • src/scraped/
  • doc/scraped/
  • res/data/scraped/
  • data/scraped/
  • ...
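
For context, the kind of check described above could be sketched roughly as follows. This is not the actual script: the marker strings and the function name classify_repo_page are hypothetical placeholders for whatever snippets ended up in public.md and private.md, and fetching the page is left out entirely.

```python
from typing import Optional

# Hypothetical marker snippets; the real ones would come from the
# HTML fragments collected in public.md and private.md.
PUBLIC_MARKERS = ['<span class="label">Public</span>']
PRIVATE_MARKERS = ['<form action="/login"', "Sign in to continue"]


def classify_repo_page(html: str) -> Optional[bool]:
    """Return True if the page looks public, False if it looks private,
    and None if no known marker is found."""
    # Private markers are checked first, so a login wall wins over
    # any stray "public" label on the same page.
    if any(marker in html for marker in PRIVATE_MARKERS):
        return False
    if any(marker in html for marker in PUBLIC_MARKERS):
        return True
    return None
```

In a setup like this, the scraped snippets act as reference data for the code rather than as code or documentation, which is why their location is unclear.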

@timmwille

I think it is a very relevant question to answer, maybe it helps to check again what higher level structure we have:
https://github.com/hoijui/osh-dir-std/blob/main/mod/unixish/definition.csv

Let me collect my thoughts, just a sec

PS: I don't fully understand your "scraped" use case yet, but will come back to that too

@timmwille

timmwille commented Apr 1, 2023

Let's see where the datasets can go

So apart from Licenses and mods we have:

(@hoijui consider organizing the definition.csv alphabetically, really would help)

doc/
gen/
run/
res/
src/

existing options

let's go through one by one to clarify where datasets would go

  • doc/ : NO → this is where we want to put explanatory documentation that embeds from res/ (though I don't fully understand the difference between res/media/ and res/assets/media/)
  • gen/ : NO → only generated files/outputs go here
  • run/ : NO → only for automation, helping to build the repo and keep it organized (to my understanding so far)
  • res/ : MAYBE → it makes sense if the data is not SOURCE data that is constantly improved and worked with, and if it is used across doc/ and src/ equally (as we always want a "single source of truth") → I'll write some examples in a bit
  • src/ : MAYBE → all files that are part of the true "Source" of the project should sit here (no binaries, and no explanatory data apart from #comments in the code). It is the first place to look, and it is where the CAB review according to DIN SPEC 3105 will look (apart from going through the docs to help with understanding)!

what about new directories?

I see only three options here:

  • data/ → very generic, but would cover a lot (not only a good thing)
  • datasets/ → very clear, might be a bit long as a name
  • records/ → a bit more open than datasets/, all data records would go here, even scraped data

Pro/Con and resulting open questions:

  • Is data/ (or one of the others: datasets/, records/) a new main directory, or part of an existing one?
  • Is records clear enough not to be confused with generated?
  • How to differentiate collected or externally generated data from internally generated data that sits in gen/?

I'll evaluate this now

@timmwille

timmwille commented Apr 1, 2023

Basically that means we're discussing:

  1. Where to put it?
  • res/
  • src/
  • <new>/

and

  2. How to name it?
  • data/
  • datasets/
  • records/

@hoijui
Owner Author

hoijui commented Apr 1, 2023

other possibly useful words:

  • gather
  • collect
  • recordings
  • collections

I like records a lot though!
It fits well for tabular data, for whatever dimensionality.
An issue with it is:
it describes the data format, while (most) other dir names describe the data (content). For example, we have a directory called doc/; it is not called text/. Then again, src/ is kind of in both categories.

@timmwille

timmwille commented Apr 1, 2023

Ok I suggest:

  • res/datasets/ : for scraped datasets and other data that is just there as a resource for other parts of the documentation and references*
  • src/records/ : for all source-related work data that is compiled manually or from external sources to help with development

This would also help (at least me) to better understand res/media/ and res/datasets/ as resources in source format, whilst all binary resources sit under res/assets/ 💡

* I think maybe even survey data should go there? What about TsDC-related technical specs of the overall machine, or of external parts/modules that are proprietary?


Final thought

  • in case (for a reason I can only vaguely guess right now) we only talk about resources,
    and not at all about the source of the project

Example A

I want to collect data from a machine to evaluate the precision and have this as reference data in my repository,
so what would I do?

  • I would write a script-a in src/software/, with a src/calc/ logic file behind it (isn't that also a kind of software?),
    and some output generated through a simulation in src/sim/ that uses that calculation as well.
  • I would want to send this simulation output to ...?
    → would this go to datasets/records too? or is this a gen/sim/ output?
  • Now I take src/software/script-a to run the test with the machine, talking to it through an API of the src/firmware/, and collect the data records in ...?
    → would this go to datasets/records too? or is this a src/test/ source now?
  • This data now counts as my real-life reference for further src/sim/ simulation runs to improve the src/mech/ and src/elec/ design (maybe even to improve the script, the software or the firmware as well).

Example B

I want to create a reference data sheet for measurements out of a 3D analysis of a physical object,
from there I'll generate a parametric design, what would I do?

Example C

I want to scrape metadata from other similar hardware projects as a reference for my calculations,
design and compare with my own metadata/specs even for documentation purposes, what would I do?

Example D

I want to create a realistic image of my wind turbine rotor blade design,
by using data-points from an external Airfoil generator software, what would I do?

  • [Concept Design step] I would go to the generator and input my preset rotor blade metadata from ...?
    → would this sit in datasets/records? or in gen/calc/, as it was calculated based on power/wind/size,
    so on other machine config metadata?
  • [Mech Design step] I would take those data points from the generator for a specific 2D profile
    and, with some help of a src/calc/ mathematical logic file
    (which might also be embedded in the CAD program I'm using),
    create a nice 3D CAD model
  • [Simulation Design step] Then I import that CAD model in src/mech to create a src/sim simulation,
    improve the design a bit and send it to src/anim/ for creating a photo-realistic image that will be sent to ...?
    → is this then to go to gen/anim/, or is this image a file that will sit under res/assets/media/img/?

as reference I used this tree view:

run/
res/
res/conf/
res/media/
res/media/img/
res/assets/
res/assets/media/
res/assets/media/img/
res/assets/media/vid/
res/assets/var/
src/
src/anim/
src/calc/
src/sim/
src/elec/
src/firmware/
src/mech/
src/software/
src/test/
gen/
gen/site/
gen/anim/
gen/calc/
gen/sim/
gen/software/
gen/firmware/
gen/elec/
gen/mech/
gen/doc/
gen/doc/assembly/
gen/doc/manuf/
gen/doc/usr/
gen/doc/recycling/
doc/
doc/assembly/
doc/manuf/
doc/usr/
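
To try the examples above against a real directory layout, the tree can be created in a scratch folder. A minimal sketch, assuming Python; the function name create_tree and the shortened DIRS list are illustrative only and not part of the standard.

```python
from pathlib import Path
from typing import Iterable, List

# A shortened subset of the tree view above; extend as needed.
DIRS = [
    "res/media/img",
    "res/assets/media/img",
    "src/calc",
    "src/sim",
    "gen/sim",
    "doc/usr",
]


def create_tree(root: Path, dirs: Iterable[str] = DIRS) -> List[Path]:
    """Create each directory (and its parents) under root; return the paths."""
    created = []
    for rel in dirs:
        path = root / rel
        # parents=True also creates res/, src/, gen/ and doc/ themselves.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_tree(Path("scratch")) would then give an empty skeleton in which candidate locations such as datasets/ or records/ can be tried out.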

@timmwille

Here also #8 for easier communication

@hoijui
Owner Author

hoijui commented Apr 14, 2023

I figured, file is actually a very good fit according to its definition:

  1. a folder, cabinet, or other container in which papers, letters, etc., are arranged in convenient order for storage or reference.
  2. a collection of papers, records, etc., arranged in convenient order: to make a file for a new account.

would it really be an option though? :/

src/files/bla.csv

... too general, right?

@hoijui
Owner Author

hoijui commented Apr 14, 2023

other options:

@timmwille

timmwille commented Jun 30, 2023

Hey, sorry, I totally missed this, but I actually like src/input/ very much: it indicates source files that are simply input for other design files/processes and that might come from external/physical sources/measurements. It is then also not limited to datasets or records, but could be something else as well.

src/files/ is too generic!! So go with src/input/
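
To illustrate how files under src/input/ might be consumed by other parts of a project, here is a minimal sketch, assuming Python. The helper load_input_table is hypothetical, and nothing is assumed about the columns of any particular file.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_input_table(repo_root: Path, name: str) -> List[Dict[str, str]]:
    """Read src/input/<name> as CSV and return one dict per data row."""
    path = repo_root / "src" / "input" / name
    # newline="" lets the csv module handle line endings itself.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

A simulation or calculation script could then call something like load_input_table(Path("."), "forcesX.csv"), borrowing the file name from the first comment, without caring whether the data was measured, surveyed or scraped.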


2 participants