First check if a XML namespace is declared in the document:
$ head loc.mrc.xml
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>01227cam a22002894a 4500</leader>
<controlfield tag="001">12360325</controlfield>
<controlfield tag="005">20070126075126.0</controlfield>
<controlfield tag="008">010327s2001 nyua 001 0 eng </controlfield>
<datafield tag="906" ind1=" " ind2=" ">
<subfield code="a">7</subfield>
<subfield code="b">cbc</subfield>
<subfield code="c">orignew</subfield>
If a namespace is set use the "local" XML element name in the XPATH expression:
$ xmllint --xpath '//*[local-name()="controlfield"]/@tag' loc.mrc.xml
Extract all tags and count them:
$ xmllint --xpath '//@tag' loc.mrc.xml | sort | uniq -c
Extract all IDs from MARC 001:
$ xmllint --xpath '//*[local-name()="controlfield"][@tag="001"]/text()' loc.mrc.xml
Extract all subfields from MARC 245 fields:
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]' loc.mrc.xml
Extract subfield "a" from MARC 245 fields:
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]/*[local-name()="subfield"][@code="a"]/' loc.mrc.xml
Extract content from subfield "a" from MARC 245 fields:
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml
Extraxt all ISBNs:
$ xmllint --xpath '//*[local-name()="datafield"][@tag="020"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml
Extract all DDC numbers:
$ xmllint --xpath '//*[local-name()="datafield"][@tag="082"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml
Catmandu uses a domain specific language (DSL) called "fix" to extract, map and tranform data. Several "fixes" for library specifc data formats like MARC and PICA are available. Most common "fixes" are documented on a cheat sheet. "Fixes" can be used as command-line options or stored in a "fix" file:
$ catmandu convert MARC to CSV --fix 'marc_map(001,id); retain_field(id)' < loc.mrc
$ catmandu convert MARC to YAML --fix marc2dc.fix < loc.mrc
With marc_map
you can extract (sub)fields from MARC records and map them to your own data model:
marc_map(001,dc_identifier)
# {"dc_identifier":"12360325"}
MARC uses several "fixed-length" fields, where data elements are positionally defined. E.g. if you want to extract the language code from MARC 008 specify the positions with /35-37
:
marc_map(008/35-37,dc_language)
# {"dc_language":"eng"}
If you want to extract fields with certain indicators specify them within sqare brackes [1,4]
marc_map('246[1,4]',marc_varyingFormOfTitle)
# {"marc_varyingFormOfTitle":"Games, diversions & Perl culture"}
To extract certain subfields from a MARC data field use the subfield codes. By default several subfields will be joined to one string. Use option join
to join them with another string. With option split
you can split the subfields to a list. Use option pluck
if you want to extract the subfields in a certain order.
marc_map(245ab,dc_title,join:' ')
# {"dc_title":"Perl : the complete reference /"}
marc_map(245ab,dc_title,split:1)
# {"dc_title":["Perl :","the complete reference /"]}
marc_map(245ba,dc_title,split:1,pluck:1)
# {"dc_title":["the complete reference /","Perl :"]}
MARC data fields could be repeatable. Use option split
to create a list from all fields.
marc_map(650a,dc_subject,split:1)
# {"dc_subject":["Data mining.","Text processing (Computer science)","Perl (Computer program language)"]}
MARC subfields could be repeatable within a MARC data field. Use option split:1
to create a list from all fields. To create a list for all subfields within one data field use option nested_arrays:1
which will return a "list of lists" of subfields, one list for each data field.
marc_map(655ay,marc_indexTermGenre,split:1)
# {"marc_indexTermGenre":["Portrait photographs","1910-1920.","Photographic prints","1910-1920."]}
marc_map(655ay,marc_indexTermGenre,split:1,nested_arrays:1)
# {"marc_indexTermGenre":[["Portrait photographs","1910-1920."],["Photographic prints","1910-1920."]]}
To extract a subfield only if another subfield in the same data field has a certain value use a loop with a condition.
=856 4\$uhttp://journal.code4lib.org/$xVerlag$zkostenfrei
=856 4\$uhttp://www.bibliothek.uni-regensburg.de/ezeit/?2415107$xEZB
do marc_each()
if marc_match(856x,EZB)
marc_map(856u,ezb_uri)
end
end
# {"ezb_uri":"http://www.bibliothek.uni-regensburg.de/ezeit/?2415107"}
Use conditions marc_has
, marc_has_many
or marc_match
to check if an record has certain fields or match certain conditions.
set_array(errors)
# Check if a 245 field is present
unless marc_has('245')
set_field(errors.$append,"no 245 field")
end
# Check if there is more than one 245 field
if marc_has_many('245')
set_field(errors.$append,"more than one 245 field?")
end
# Check if in 008 position 7 to 10 contains a
# 4 digit number ('\d' means digit)
unless marc_match('008/07-10','\d{4}')
set_field(errors.$append,"no 4-digit year in 008 position 7->10")
end
You can add fields to MARC records with marc_add
.
marc_add(999,a,my,b,local,c,field)
marc_add(900,a,$.my.field)
Use marc_append
to append values to a (sub)field
marc_append(001,'-X')
marc_append(100a,' [author]')
Assign a new value to a MARC field with marc_set
.
marc_set(001,123456789)
marc_set(245a,'Perl - battle tested.')
Use marc_remove
to remove (sub)fields from MARC records.
marc_remove(991)
marc_remove(9..)
marc_remove(0359)
Use marc_replace_all
to replace a string in MARC (sub)fields.
marc_replace_all(001,1,X)
marc_replace_all(245a,Perl,"Perl [programming language]")
You can filter MARC records from a dataset with reject
or select
.
reject marc_has_many(245)
select marc_match(245a,Perl)
You can validate
MARC records and collect the error messages or filter valid
records.
validate(.,MARC,error_field:errors)
select valid(.,MARC)
MARC uses codes for languages and countries. You can build dictionaries based on these list and lookup names for these codes.
$ less languages.csv
eng,English
enm,English, Middle (1100-1500)
epo,Esperanto
esk,Eskimo languages
est,Estonian
...
# { "dc_language": "eng" }
lookup(dc_language,languages.csv)
lookup(dc_language,languages.csv,default:English)
lookup(dc_language,languages.csv,delete:1)
# { "dc_language": "English" }
Use issn
, isbn10
or isbn13
to normalize international identifier.
# { "issn" : "1553667x" }
issn(issn)
# { "issn" : "1553-667X" }
# { "isbn" : "1565922573" }
isbn10(isbn)
# {"isbn" : "1-56592-257-3" }
isbn13(isbn)
# { "isbn" : "978-1-56592-257-0" }
Cheat sheet: http://librecat.org/assets/catmandu_cheat_sheet.pdf
MARC mapping rules: https://github.com/LibreCat/Catmandu-MARC/wiki/Mapping-rules
MARC tutorial: https://metacpan.org/pod/distribution/Catmandu-MARC/lib/Catmandu/MARC/Tutorial.pod
Example "fix" http://librecat.org/Catmandu/#example-fix-script