Export of Complete Collection Data for Migration, Backup #4970

campmlc · 2019-04-22T22:52:07Z

campmlc
Apr 22, 2019
Maintainer

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Is your feature request related to a problem? Please describe.
We have received repeated inquiries from potential new collections and existing collections as to whether collections data can be exported from Arctos can be exported for backup or migration to a different platform.

Describe the solution you'd like
We currently allow export of flat file data as specimen search results through Arctos and DWC fields through external aggregators. Perhaps provide the option of a regular, automated export of these data, ftp'd to a particular server?
Additionally, we could add in options for separate, linked downloads of transactions, projects, citations ? (by collection?), object tracking (show all objects in this container, flatten?)

Also explore the option of local Oracle backups by collection? or all Arctos?

Priority
Please assign a priority-label.

campmlc · 2019-04-22T23:45:34Z

campmlc
Apr 22, 2019
Maintainer Author

From @dustymc
Sounds reasonable to me!

The "full" dump is computationally expensive and takes a lot of disk; I'm not sure if we could support running that for everyone at any reasonable interval with current resources.

loans + accessions + projects have all those problems too, but there aren't very many of them and they don't change very often - we could probably pull that off without much problem.

FLAT is cheap and easy to query (that's why it exists!) but is missing a lot of information - eg, it contains only one locality per specimen.

The DWC files contain full (except 'unaccepted') locality data, are in a Standard exchange format, and we could probably share more data than we do. I suspect that's our best bet for a "lightweight backup" but I'd need to know more about the purpose of the backup to make that call.

The Oracle backups contain everything, we're already paying the cost to make them, you can read them with free software, and I think TACC has essentially unlimited bandwidth. Scattering them across more disks in more locations under the control of more organizations would definitely make me sleep better. I think all I need is an address and write credentials to make that happen.

0 replies

campmlc · 2019-04-22T23:51:06Z

campmlc
Apr 22, 2019
Maintainer Author

The purpose of the backup would be to allow collections to maintain local flat file copies of their most critical data sufficient to recover the majority in case they decide to switch to a different platform or in case of catastrophic failure or downtime. Having this option would provide significant peace of mind to collections staff and admin, and would provide increase Arctos usability and marketability.

0 replies

dustymc · 2019-04-23T00:00:14Z

dustymc
Apr 23, 2019
Maintainer

I would need table/column detail to proceed; I can't know what anyone considers critical. (I know what I would consider critical: an Oracle backup file, which contains the rules and structure in addition to the data.)

I think a real-world use case would be very useful.

DWC data are here: http://ipt.vertnet.org:8080/ipt/resource?r=msb_mamm

What's missing?
Of that, what could we add?

0 replies

diannakrejsa · 2019-04-23T16:30:37Z

diannakrejsa
Apr 23, 2019
Collaborator

Yes, I can speak to the desire for this as well as chief necessary fields. Are you looking for particular columns that are missing? Can you direct me to a single DWC extract sheet (where all the code table values are part of one spreadsheet)?

I've been creating a column matching sheet for migrating our data from mySQL extraction of the data from local Specify-derived server, to how the data is originally entered (so we know we're extracting all necessary fields whilst using a DWC extract schema in Specify), then mapping that to Arctos fields. It will serve as a guide to the IT expert assisting in the migration. I can share that.

This may require a phone call to be most effective if I'm missing some information or not addressing what you're asking. If there is a DWC schema (the column headings) that I can look at, I can tell you what key elements are missing for our data purposes anyway.

0 replies

dustymc · 2019-04-23T17:22:32Z

dustymc
Apr 23, 2019
Maintainer

Are you looking for particular columns that are missing?

Yes.

single DWC extract sheet (where all the code table values are part of one spreadsheet)

I don't think such a thing exists; no specimen will have eg, all Attributes, and many specimens are spread across multiple DWC:Occurrences.

DWC schema

https://www.tdwg.org/standards/dwc/, but DynamicProperties make it somewhat like Arctos in that it's not limited to a spreadsheet-like structure.

0 replies

diannakrejsa · 2019-04-24T13:17:13Z

diannakrejsa
Apr 24, 2019
Collaborator

Is there a way to code as part of an 'export all' function "export all Attributes"?

Here is the Column Matching sheet I mentioned. It starts by matching all fields that we have for mammal records extracted from the server database to what we enter in "flat sheet" data entry spreadsheets. Then those fields are mapped to how they must be entered into Arctos.

Column Matching V2.xlsx

0 replies

dustymc · 2019-04-24T15:42:45Z

dustymc
Apr 24, 2019
Maintainer

I'm not sure I'm understanding, but....

The Arctos specimen bulkloader is a greatly simplified view of the most common things shared among incoming specimens.

You can see all current Attribute types at http://arctos.database.museum/info/ctDocumentation.cfm?table=CTATTRIBUTE_TYPE.

I can certainly export attributes, the question is how. As rows in an attachment, no problem. As structured data in a cell, MAYBE they'll fit now, but that won't necessarily last - attributes can hold about 8K, any specimen can have any number of them, and various tools have various length constraints.

There are many other such data - eg, any specimen can have any number of parts, any part can have any number of part attributes.

Lacking better ideas, those would likely include eg, "determiner: John Smith." Determining if that's John Smith the expert or John Smith the dyslexic prankster needs a link back to Agents. In Arctos, an agent's old phone number (publications, relations to other entities, etc.) is very much a real part of "the attribute record." (Or in general everything is a part of everything else.) In an export, if you want any of those "ancillary" data I'll need to explicitly know about it.

The simplest model we've found that's capable of carrying the complexity of the data is the one we use. The only backup I'm aware of from which that complexity could be recovered is the native Oracle backup.

0 replies

campmlc · 2019-04-24T16:49:37Z

campmlc
Apr 24, 2019
Maintainer Author

Dusty, I think in this case what we need is as follows, in the order top to bottom of what may be feasible: 1) A large flat file exactly like what we upload with the specimen bulkloader or get via the specimen results download, We can't download this ourselves because of browser timeout issues; otherwise we would. This would include all of the possible specimen bulkloader data fields = all fields added from add/remove data fields in specimen results. (Ideally, these fields would download only if there are data to populate them.) Attributes would have a Determined by Agent and Determined date field etc. Parts would include either JSON string or, even better, be parsed out into columns as they would go into the bulkloader (including barcode field). The event downloaded would be by default the most recent accepted event. The ID downloaded would be the most recent accepted ID. Agents would be preferred name. Obviously, they could not contain any other info from the agents table in this format. Accessions would be an included column. Citations would be an included column. Can we get a column for loans added as a general concatenated field and also embedded into the parts JSON script? Is this possible? 2) Download data on multiple specimen events, ID history- how do we do this? As concatenated fields like the OTHER IDs? JSON? Multiple columns? 3) Download accessions and loans as list. 4) Figure out a way to download a separate flatten part locations for all items in the collection - this could obviously be monstrous, but would be immensely helpful to have as a periodic backup / archive. 5) Figure out a way to download the full part location tree in print format - archivable? 6) Download the agents table into something that can be archived on local servers? Anything I'm missing?

…

On Wed, Apr 24, 2019 at 9:42 AM dustymc ***@***.***> wrote: I'm not sure I'm understanding, but.... The Arctos specimen bulkloader is a greatly simplified view of the most common things shared among incoming specimens. You can see all current Attribute types at http://arctos.database.museum/info/ctDocumentation.cfm?table=CTATTRIBUTE_TYPE . I can certainly export attributes, the question is how. As rows in an attachment, no problem. As structured data in a cell, MAYBE they'll fit now, but that won't necessarily last - attributes can hold about 8K, any specimen can have any number of them, and various tools have various length constraints. There are many other such data - eg, any specimen can have any number of parts, any part can have any number of part attributes. Lacking better ideas, those would likely include eg, "determiner: John Smith." Determining if that's John Smith the expert or John Smith the dyslexic prankster needs a link back to Agents. In Arctos, an agent's old phone number (publications, relations to other entities, etc.) is very much a real part of "the attribute record." (Or in general everything is a part of everything else.) In an export, if you want any of those "ancillary" data I'll need to explicitly know about it. The simplest model we've found that's capable of carrying the complexity of the data is the one we use. The only backup I'm aware of from which that complexity could be recovered is the native Oracle backup. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#2051 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADQ7JBD5U5OBE5DEQKP4VQ3PSB5XPANCNFSM4HHTHLUA> .

0 replies

dustymc · 2019-04-24T19:10:57Z

dustymc
Apr 24, 2019
Maintainer

exactly like what we upload with the specimen bulkloader or get via the specimen results download

Those are wildly different things. Eg, one will deal with 10 (or whatever the number is) parts with "core" part-components broken out, the other will deal with any number of parts (attributes, identifications, etc.) but with the complexity concatenated in various ways. I can certainly flatten various stuff into various formats (and much of that exists as FLAT), but I need specifics that address the reality of the data.

Attributes would have a Determined by Agent and Determined date field etc.

See above for the problems with merging them into structured data. I'm not sure how eg, 30 sex determinations might be munged into a spreadsheet - I suppose we could parse them out to sex_1 and sex_determiner_1 and such, but that would lead to a variable and indefinite number of columns.

Parts would include either JSON string

Not all fit within the current limitations of Oracle. That'll get better soon, but it's just a bump from 4KB to 32KB - some may still not fit.

even better, be parsed out into columns as they would go into the bulkloader

The specimen bulkloader can currently handle a ~dozen parts and no part attributes. The data can be many more parts, each with any number of attributes.

The ID downloaded would be the most recent accepted ID.

That one we can do! (As long as you don't care about taxon-stuff that won't fit into FLAT.)

column for loans added as a general concatenated field and also embedded into the parts JSON script? Is this possible?

That depends on what precisely you mean by "loans." If it's just a list of loan numbers or similar, probably. If you want more (loan data, results, involved parts, something for data loans, ....) then it likely won't easily fit.

Citations would be an included column.

That's available, but it links to Arctos so isn't very suitable for many of your reasons.

I can find a way around Oracle's datatype limitations (eg, write to files or CLOBs), but that would be computationally expensive (we can PROBABLY afford it), require a lot of disk, and I'm not sure what software would be capable of processing the results.

The purpose of the backup would be to allow collections to maintain local flat file copies of their most critical data sufficient to recover the majority in case they decide to switch to a different platform or in case of catastrophic failure or downtime. Having this option would provide significant peace of mind to collections staff and admin, and would provide increase Arctos usability and marketability.

This approach does not seem useful for that to me.

I don't think it's possible to flatten 'critical data' without significant loss (or perhaps significant liberties in defining "flat"!).

If I were going to migrate Arctos data to any other platform, I would want to start with an Oracle backup file. Absolute worst case, I could pay a consult for a few days to get what I want from it, whatever that might be.

In the case of catastrophic failure, recovering from a fresh copy of the backups stored somewhere that wasn't affected by the fire/meteor/aliens/Texan Revolution (and the stuff on GitHub) would be trivial. Recovering from anything else would be torturous.

In the case of significant downtime, pulling Arctos up (eg, on some cloud service or at another .edu) from backups (plus github) would be technically trivial, and mostly impossible from anything else that I can imagine.

0 replies

campmlc · 2019-04-24T19:49:36Z

campmlc
Apr 24, 2019
Maintainer Author

I guess the question would be, how did you extract the Cornell data to repatriate to them to go back into Specify? That would be the easiest scenario, because they didn't do anything with their data. Or maybe we didn't give them anything back after all that work? This request is largely an assurance to potential users that if they hate us, they can go back to something non-Oracle, something like what they came in with, which is largely a collection of csv files. Think of it as part of marketing - may not be structural necessary in your view, but it is necessary psychologically and sociologically to get people to feel comfortable in our environment. Also allows users to maintain local backups just in case. I think we all want the latter, for the old "Dusty gets hit by a bus" catastrophe scenario. Don't do that, by the way, at least not until we get more funding :) In my original request, this - bulkload file format with "core" part-components broken out - would be what I ideally would want, without the part and attribute limits. If this is computationally not possible, then concatenation in a format that would allow it to be parsed out later into a csv file with "core components broken out" would be acceptable. So, the specimen results download with various types of concatenation would be an OK replacement, although not ideal (I hate having to try to parse JSON into csv - but maybe I just don't know how.) Loans - concatenated list of loan numbers OK in flat. Plus a separate download as loan list info from transactions menu. Again, it would be ideal to have the parts download show loan relationships as well in some way- back to JSON? Citations - OK even without the external links. We accept that there will be loss of data in this format. But recovering some data is better than loss of all, which is what happens when Arctos, or even our local internet, goes down. I would be happy to help go through field by field to decide on data concatenation etc if that is what it takes. Google spreadsheet?

…

On Wed, Apr 24, 2019 at 1:11 PM dustymc ***@***.***> wrote: exactly like what we upload with the specimen bulkloader or get via the specimen results download Those are wildly different things. Eg, one will deal with 10 (or whatever the number is) parts with "core" part-components broken out, the other will deal with any number of parts (attributes, identifications, etc.) but with the complexity concatenated in various ways. I can certainly flatten various stuff into various formats (and much of that exists as FLAT), but I need specifics that address the reality of the data. Attributes would have a Determined by Agent and Determined date field etc. See above for the problems with merging them into structured data. I'm not sure how eg, 30 sex determinations might be munged into a spreadsheet - I suppose we could parse them out to sex_1 and sex_determiner_1 and such, but that would lead to a variable and indefinite number of columns. Parts would include either JSON string Not all fit within the current limitations of Oracle. That'll get better soon, but it's just a bump from 4KB to 32KB - some may still not fit. even better, be parsed out into columns as they would go into the bulkloader The specimen bulkloader can currently handle a ~dozen parts and no part attributes. The data can be many more parts, each with any number of attributes. The ID downloaded would be the most recent accepted ID. That one we can do! (As long as you don't care about taxon-stuff that won't fit into FLAT.) column for loans added as a general concatenated field and also embedded into the parts JSON script? Is this possible? That depends on what precisely you mean by "loans." If it's just a list of loan numbers or similar, probably. If you want more (loan data, results, involved parts, something for data loans, ....) then it likely won't easily fit. Citations would be an included column. That's available, but it links to Arctos so isn't very suitable for many of your reasons. I can find a way around Oracle's datatype limitations (eg, write to files or CLOBs), but that would be computationally expensive (we can PROBABLY afford it), require a lot of disk, and I'm not sure what software would be capable of processing the results. The purpose of the backup would be to allow collections to maintain local flat file copies of their most critical data sufficient to recover the majority in case they decide to switch to a different platform or in case of catastrophic failure or downtime. Having this option would provide significant peace of mind to collections staff and admin, and would provide increase Arctos usability and marketability. This approach does not seem useful for that to me. I don't think it's possible to flatten 'critical data' without significant loss (or perhaps significant liberties in defining "flat"!). If I were going to migrate Arctos data to any other platform, I would want to start with an Oracle backup file. Absolute worst case, I could pay a consult for a few days to get what I want from it, whatever that might be. In the case of catastrophic failure, recovering from a fresh copy of the backups stored somewhere that wasn't affected by the fire/meteor/aliens/Texan Revolution (and the stuff on GitHub) would be trivial. Recovering from anything else would be torturous. In the case of significant downtime, pulling Arctos up (eg, on some cloud service or at another .edu) from backups (plus github) would be technically trivial, and mostly impossible from anything else that I can imagine. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#2051 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADQ7JBDDXOOO7GZL4JYE4S3PSCWEFANCNFSM4HHTHLUA> .

0 replies

dustymc · 2019-04-24T21:13:49Z

dustymc
Apr 24, 2019
Maintainer

Cornell

The "full" dump (=tables)

something non-Oracle,

From an Oracle dump:

install Oracle Express (or fire up an account with some Oracle host, or go talk nice to your local financial people who almost certainly use Oracle, or...)
impdb {dumpfile}
use one of the hundreds of available tools or scripts to pull out whatever you want (including DDL, rules, relationships, datatype, etc.) in whatever format you want it, ignore what you don't want.

From anything else:

Try to figure out what you have
Hope there's enough information to whatever you're trying to do

I don't think a flatfile "export" is a bad idea, but I do think it should come with some sort of explicit explanation of where it came from and what its limitations are.

more funding

Are you buying a bus?!

field by field

Yes, I think that's what it's going to take. Here's a sample of what's easiest to get to.

UAM@ARCTOS> create table temp_flatbits as select * from flat where guid like 'MSB:Mamm:%' and rownum<10000;

temp_flatbits.csv.zip

0 replies

campmlc · 2019-04-24T21:22:34Z

campmlc
Apr 24, 2019
Maintainer Author

Maybe an Oracle dump is good too, but how much expertise is required to clean up the output enough that a student could comprehend it or use it for subsequent data entry? I like the idea of the flat file at least as a complementary approach. @dkrejsa@angelo.edu <[email protected]> let's look over this flatbits file as a start.

…

On Wed, Apr 24, 2019 at 3:13 PM dustymc ***@***.***> wrote: Cornell The "full" dump (=tables) something non-Oracle, From an Oracle dump: 1. install Oracle Express (or fire up an account with some Oracle host, or go talk nice to your local financial people who almost certainly use Oracle, or...) 2. impdb {dumpfile} 3. use one of the hundreds of available tools or scripts to pull out whatever you want (including DDL, rules, relationships, datatype, etc.) in whatever format you want it, ignore what you don't want. From anything else: 1. Try to figure out what you have 2. Hope there's enough information to whatever you're trying to do I don't think a flatfile "export" is a bad idea, but I do think it should come with some sort of explicit explanation of where it came from and what its limitations are. more funding Are you buying a bus?! field by field Yes, I think that's what it's going to take. Here's a sample of what's easiest to get to. ***@***.***> create table temp_flatbits as select * from flat where guid like 'MSB:Mamm:%' and rownum<10000; temp_flatbits.csv.zip <https://github.com/ArctosDB/arctos/files/3114179/temp_flatbits.csv.zip> — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#2051 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADQ7JBFDC5GXFQPRMNIHGVTPSDEQ5ANCNFSM4HHTHLUA> .

0 replies

dustymc · 2019-04-24T22:20:03Z

dustymc
Apr 24, 2019
Maintainer

how much expertise is required to clean up the output enough that a student could comprehend it or use it for subsequent data entry

Nothing that hasn't been on stackoverflow a million times anyway, and the container describes the data.

The front-end is on github - it's not too hard to build a clone of Arctos from an Oracle dump and a git pull either.

I think it would just be a different type of expertise required to interpret a flatfile, assuming it contains what's needed to do whatever you'd be doing.

Here's another precompiled flat view of some data - this one will be much better at locality data, but doesn't contain any encumbered data. I'm not sure which one (if either) might be more useful.

UAM@ARCTOS> create table temp_dwc_bits as select * from digir_query.ipt_view where individualID like 'http://arctos.database.museum/guid/MSB:Mamm:%' and rownum < 10000;

temp_dwc_bits.csv.zip

0 replies

diannakrejsa · 2019-04-25T14:46:00Z

diannakrejsa
Apr 25, 2019
Collaborator

temp_flatbits_missing values.xlsx

I've looked over both flatbits files. The second tab on the attached has the column headings transposed next to each other ("Comparison"). I looked at them with our data in mind and made additions in red at the bottom of the columns for what I think they're lacking or could benefit from. There are some fields that sound like they'd contain similar information -- INFORMATIONWITHHELD and ENCUMBRANCES for example. Is there a way to query all the available fields but just export ones with values? Or there would be too many redundancies and high processing time.

0 replies

campmlc · 2019-04-25T15:17:05Z

campmlc
Apr 25, 2019
Maintainer Author

So if I understand this correctly, all the attribute information would be concatentated into a single field, without dates or determiners, correct? So ATTRIBUTE could be: "ATTRIBUTE: sex = male, age class= adult, reproductive info = scrotal, t= 4 x 2," Is that correct? And would preparator number, collector number etc be in a similar concatenated field of OTHER IDs? I agree with Dianna that it would be great if we could have a download of only those fields that are populated with data. I would also want all the fields that can be currently downloaded in the specimen results view using add/remove data fields to be options for download either as individual columns or as concatenated fields.

…

On Thu, Apr 25, 2019 at 8:46 AM diannakrejsa ***@***.***> wrote: temp_flatbits_missing values.xlsx <https://github.com/ArctosDB/arctos/files/3117463/temp_flatbits_missing.values.xlsx> I've looked over both flatbits files. The second tab on the attached has the column headings transposed next to each other ("Comparison"_. I looked at them with our data in mind and made additions in red at the bottom of the columns for what I think they're lacking or could benefit from. There are some fields that sound like they'd contain similar information -- INFORMATIONWITHHELD and ENCUMBRANCES for example. Is there a way to query all the available fields but just export ones with values? Or there would be too many redundancies and high processing time. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#2051 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADQ7JBER4MWUV4L3RGJSUHTPSG72VANCNFSM4HHTHLUA> .

0 replies

dustymc · 2020-06-16T18:17:43Z

dustymc
Jun 16, 2020
Maintainer

What's the file format you're looking for for this?

That's a question for those who want flat extracts. I see no way in which they can avoid being lossy so they can be of ~no value to me.

I keep hearing things like "loans." That could mean "partial dump of table LOAN" or it could need to include the 3rd phone number of the 9th preparator of specimens related to specimens from which parts were loaned - data which is in, and easily understandable and recoverable from, a DB export. I doubt you want that specific chunk of data, but there's an infinite amount of data which could be critical to certain tasks (understanding what was used for or intended by a citation in a publication, for example) that's an equal distance from table LOAN. A request for flat files is essentially a request to discard information; I need to know precisely what you don't want to toss, and how you want it arranged.

0 replies

diannakrejsa · 2020-06-16T18:31:57Z

diannakrejsa
Jun 16, 2020
Collaborator

From the very narrow perspective of what I want, for a start I want just what goes in the specimen bulkloader, plus preparators/prep num/other ids and parts. It would be very close to what is available in the fields of the data tools download, just a few more fields would need to be added to those available.

How would you like the information presented to you/the group for further editing? Columns ala the bulkloader (more long-form) or ala the download data tool? Then we can add a "wish list" of aspects others may want (e.g., loans) and figure out what parts of those wish list items can be added?

0 replies

dustymc · 2020-06-16T18:52:02Z

dustymc
Jun 16, 2020
Maintainer

what goes in the specimen bulkloader

I would need to know what to do with the 11th collector, 13th attribute, 2nd specimen-event, etc. (And implicit agreement that strings are sufficient for your purposes - eg, the only thing you care about regarding agents is preferred_name, all other agent data can be discarded for this.)

preparators

Those are covered by "what goes in the specimen bulkloader."

prep num/other ids and parts

I need specifics; there can be any number of otherIDs, and parts have an additional dimension for part attributes.

How would you like the information presented

I don't know, perhaps because I'm having difficulty understanding the purpose. Maybe manually munge whatever you want of a record into a CSV file as an example???? That seems a fairly painful way to approach this, but it would let me request adding another record when something doesn't fit - maybe it would provide an effective means of communication.

0 replies

diannakrejsa · 2020-06-25T23:15:21Z

diannakrejsa
Jun 25, 2020
Collaborator

Export_Template.zip

Alright, attached is a stab at a beginning template. I started with the data within a bulkloader file for ASNHC:Mamm:20000. I added columns for other common data as well example fields that Mariel and Dusty exported before (something I had saved off as temp_flatbits_missingvalues, not sure that name would ring any bells for what ya'll did to export those fields in the past). The third row includes fields that might be sunk within the column above them. At the end of the series of columns I added "Loans?" simply because I imagine someone will want that data exportable in some capacity.

0 replies

dustymc · 2020-06-25T23:34:05Z

dustymc
Jun 25, 2020
Maintainer

Excellent, thanks! I pulled that into https://docs.google.com/spreadsheets/d/1caZi8YvjKtMIklVSnlnfG3BdD1WQ3rQMUQgNGzlZqbA/edit#gid=1443094724 and anyone can edit. I made some preliminary comments. Essentially I'd need more detail; what precisely do you mean by "sex" (for example), and if there are 13972 determinations then how would you like them handled?

0 replies

diannakrejsa · 2020-06-27T21:48:58Z

diannakrejsa
Jun 27, 2020
Collaborator

Cool! I wrote some responses to these, but other folks should take a look since I don't necessarily have a stake in every field (or know the full usage someone might require of them). When going through, one thought I had was making it a multi-page export process where data managers select what database they manage they'd like exported, then it's a locality page where they check what aspects of locality for those records they may want, then it's an attributes page with all options from the attributes_code_tables list and they check which ones they want to export data from, and so on. Kind of like the Download data tools thing but with more options?
I'll let others take a look at it.

0 replies

Jegelewicz · 2020-06-29T15:46:35Z

Jegelewicz
Jun 29, 2020

multi-page export process where data managers select what database they manage they'd like exported, then it's a locality page where they check what aspects of locality for those records they may want, then it's an attributes page with all options from the attributes_code_tables list and they check which ones they want to export data from, and so on. Kind of like the Download data tools thing but with more options?

OK - I am going to say what I think I have been saying all along. A complete export should be more than one file. Here is what you need (stuff in parens are the columns for each file, not comprehensive at this point...)

A file that includes all GUIDs in a collection(s)
A file that includes all other ID's associated with those GUIDs (GUID, OTHER_ID_NUM_TYPE, OTHER_ID_NUM, OTHER_ID_RELATIONSHIP)
A file that includes all parts associated with those GUIDs (GUID, PART_NAME, CONDITION, DISPOSITION, QUANTITY, REMARKS, BARCODE, PART_NUMBER (must be the unique identifer in Arctos that relates the part to its attributes)
A file that inlcudes all part attributes (PART_NUMBER, PART_ATTRIBUTE_TYPE, PART_ATTRIBUTE_DETERMINER, PART_ATTRIBUTE_DET_DATE, etc..)
A file that includes all identifications (GUID, ID_BY, ID_DATE, ID_FORMULA, TAXON (dependent upon formula, may be more than one), NATURE_OF_ID, ID_CONFIDENCE, SENSU, ID_REMARK)
A file that includes all Specimen Events (GUID, Event stuff, collecting even and locality stuff)
A file that includes all attributes (GUID, ATTRIBUTE_TYPE, all other attribute fileds)
A file that includes all loans (LOAN_NUMBER, LOAN fields)
A file that includes all loan items (LOAN_NUMBER, GUID, PART_NAME, REMARK)
A file that includes all accessions (ACCN_NUM, Accession fields)
A file that includes all accession items (ACCN_NUM, GUID)
A file that includes all encumbrances (ENCUMBRANCE, encumbrance details)
A file that includes all encumbrance items (ENCUMBRANCE, GUID)
A file that includes all media (MEDIA_URI, MEDIA_THUMB_URI)
A file that includes all media relationships (MEDIA_URI, MEDIA_RELATIONSHIP_TYPE, MEDIA_RELATIONSHIP_VALUE)
A file that includes all media labels (MEDIA_URI, MEDIA_LABEL_TYPE, MEDIA_LABEL_VALUE)

What have I forgotten? This is going to give you "your" data in a way that could be related to each other so that you could re-create stuff in Arctos with bulkloader tools. It will not be useable as Arctos, but that isn't what we are after here is it? Each file is going to include one row of data for each "thing", so if you have an object with multiple identifications, you are going to have more than one row using that GUID in column one. This is what you are going to need if you want to import the data into something else. If object tracking is used, a file for BARCODE (BARCODE, PARENT_BARCODE) would be needed as well and maybe something else I am missing.

0 replies

dustymc · 2020-06-29T16:06:43Z

dustymc
Jun 29, 2020
Maintainer

What have I forgotten?

Probably something - "complete export" still seems a very wrong description - but I think that's closer to achievable, and more useful, than trying to pretend that Arctos is a giant spreadsheet can be.

This is getting closer to a DB dump, which includes everything you've mentioned plus whatever you've forgotten, and includes assembly instructions in a language that both computers and people can understand.

0 replies

campmlc · 2020-07-01T18:01:10Z

campmlc
Jul 1, 2020
Maintainer Author

Yes, absolutely, this is what I have been trying to request. Also need a file for BARCODE (BARCODE, PARENT_BARCODE), or better yet: Part Location Path. A DB is fine as long as it includes files that can be opened in spreadsheets. I have a server . . . ready to move MSB data there now.

…

On Mon, Jun 29, 2020 at 10:07 AM dustymc ***@***.***> wrote: * [EXTERNAL]* What have I forgotten? Probably something - "complete export" still seems a very wrong description - but I think that's closer to achievable, and more useful, than trying to pretend that Arctos is a giant spreadsheet can be. This is getting closer to a DB dump, which includes everything you've mentioned plus whatever you've forgotten, and includes assembly instructions in a language that both computers and people can understand. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#2051 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADQ7JBBACAIH6C7SFAIAZ5TRZC32FANCNFSM4HHTHLUA> .

0 replies

Jegelewicz · 2022-02-16T16:34:36Z

Jegelewicz
Feb 16, 2022

Given that @mkoo asked this of a potential incoming collection yesterday

"If you can get your data out of Specify..."

We really need to think about how this would work when a collection eventually decides to leave Arctos.

0 replies

campmlc · 2022-02-16T16:45:08Z

campmlc
Feb 16, 2022
Maintainer Author

Agree.

…

On Wed, Feb 16, 2022, 9:34 AM Teresa Mayfield-Meyer < ***@***.***> wrote: * [EXTERNAL]* Given that @mkoo <https://github.com/mkoo> asked this of a potential incoming collection yesterday "If you can get your data out of Specify..." We really need to think about how this would work when a collection eventually decides to leave Arctos. — Reply to this email directly, view it on GitHub <#2051 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADQ7JBHHUIZBNQHVJEE4NVLU3PG2PANCNFSM4HHTHLUA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were assigned.Message ID: ***@***.***>

0 replies

dustymc · 2022-02-16T16:50:22Z

dustymc
Feb 16, 2022
Maintainer

when a collection eventually decides to leave Arctos.

In the past I've provided their parts of tables as CSV.

Happy to discuss more, but I don't think this is going to go anywhere without some actionable specification - eg "LOAN fields" could literally be almost anything, I would need specifics to act.

0 replies

Jegelewicz · 2022-06-04T15:31:00Z

Jegelewicz
Jun 4, 2022

wondering how https://github.com/ArctosDB/internal/issues/168 would make this "easier"?

0 replies

Jegelewicz · 2023-04-06T17:02:32Z

Jegelewicz
Apr 6, 2023

Just putting this here. Symbiota allows for download of certain tables along with their basic "occurrence record". Maybe we could do something like this? Basic catalog record, plus parts table, identifiers table, identifications table, attributes table, events table - all zipped up. FWIW, I downloaded the DwC for all UTEP:Herb records and it took a while, but it didn't time out.

0 replies

dustymc · 2023-04-06T17:23:45Z

dustymc
Apr 6, 2023
Maintainer

Symbiota

Symbiota is built on an exchange standard, there are no useful analogies between it and Arctos.

I downloaded the DwC for all UTEP:Herb records

https://github.com/ArctosDB/internal/issues/260 would make that a lot more reliable (and allow you to make the request to vn's hardware).

didn't time out.

Timeouts exist to protect the system (and aren't very effective at this when something like this is involved - pg's copy function can overload the VM faster than it can produce the error meant to save itself). There's an issue somewhere, the capabilities are purposeful so I'm not quite complaining, but you can absolutely kill Arctos that way.

I've pgified my 'export a collection' scripts since this was started (for @jebrad), but I'm very hesitant to try to automate them for the reasons above.

I'm also struggling with this on #6018

All of this of course still suffers from the limitations above - some tables of whatever format don't include the language necessary to really understand the data.

If we need backups as a button, then I probably need a dedicated VM for it.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arctos DB

Export of Complete Collection Data for Migration, Backup #4970

{{title}}

Replies: 70 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Arctos DB

Export of Complete Collection Data for Migration, Backup #4970

campmlc Apr 22, 2019 Maintainer

Replies: 70 comments

campmlc Apr 22, 2019 Maintainer Author

campmlc Apr 22, 2019 Maintainer Author

dustymc Apr 23, 2019 Maintainer

diannakrejsa Apr 23, 2019 Collaborator

dustymc Apr 23, 2019 Maintainer

diannakrejsa Apr 24, 2019 Collaborator

dustymc Apr 24, 2019 Maintainer

campmlc Apr 24, 2019 Maintainer Author

dustymc Apr 24, 2019 Maintainer

campmlc Apr 24, 2019 Maintainer Author

dustymc Apr 24, 2019 Maintainer

campmlc Apr 24, 2019 Maintainer Author

dustymc Apr 24, 2019 Maintainer

diannakrejsa Apr 25, 2019 Collaborator

campmlc Apr 25, 2019 Maintainer Author

dustymc Jun 16, 2020 Maintainer

diannakrejsa Jun 16, 2020 Collaborator

dustymc Jun 16, 2020 Maintainer

diannakrejsa Jun 25, 2020 Collaborator

dustymc Jun 25, 2020 Maintainer

diannakrejsa Jun 27, 2020 Collaborator

Jegelewicz Jun 29, 2020

dustymc Jun 29, 2020 Maintainer

campmlc Jul 1, 2020 Maintainer Author

Jegelewicz Feb 16, 2022

campmlc Feb 16, 2022 Maintainer Author

dustymc Feb 16, 2022 Maintainer

Jegelewicz Jun 4, 2022

Jegelewicz Apr 6, 2023

dustymc Apr 6, 2023 Maintainer

campmlc
Apr 22, 2019
Maintainer

campmlc
Apr 22, 2019
Maintainer Author

campmlc
Apr 22, 2019
Maintainer Author

dustymc
Apr 23, 2019
Maintainer

diannakrejsa
Apr 23, 2019
Collaborator

dustymc
Apr 23, 2019
Maintainer

diannakrejsa
Apr 24, 2019
Collaborator

dustymc
Apr 24, 2019
Maintainer

campmlc
Apr 24, 2019
Maintainer Author

dustymc
Apr 24, 2019
Maintainer

campmlc
Apr 24, 2019
Maintainer Author

dustymc
Apr 24, 2019
Maintainer

campmlc
Apr 24, 2019
Maintainer Author

dustymc
Apr 24, 2019
Maintainer

diannakrejsa
Apr 25, 2019
Collaborator

campmlc
Apr 25, 2019
Maintainer Author

dustymc
Jun 16, 2020
Maintainer

diannakrejsa
Jun 16, 2020
Collaborator

dustymc
Jun 16, 2020
Maintainer

diannakrejsa
Jun 25, 2020
Collaborator

dustymc
Jun 25, 2020
Maintainer

diannakrejsa
Jun 27, 2020
Collaborator

Jegelewicz
Jun 29, 2020

dustymc
Jun 29, 2020
Maintainer

campmlc
Jul 1, 2020
Maintainer Author

Jegelewicz
Feb 16, 2022

campmlc
Feb 16, 2022
Maintainer Author

dustymc
Feb 16, 2022
Maintainer

Jegelewicz
Jun 4, 2022

Jegelewicz
Apr 6, 2023

dustymc
Apr 6, 2023
Maintainer