Webscrape of the Zimmerman en Space podcast, and (re)publication on Wikimedia Commons (and Zenodo in the future). High 5 for CC0 licenses, space, astronomy and nerds!
Latest update : 17 September 2024
Episodes 1 - 92 are now available on Wikimedia Commons:
- Category: Zimmerman en Space podcast
- Gallery, grouped by year, sorted by date : Zimmerman en Space podcast
- Zimmerman en Space podcast URL list, Season 1, Episodes 1 to 92 : https://ookgezellig.github.io/Zimmerman-en-Space-podcast/episodes.html
- Used as the initial scrape map for webscraper.io, to scrape the data of individual episodes
Output of the webscrape, with post-processing to make the data suitable as input for Wikimedia Commons, OpenRefine and the Python scripts used below: https://ookgezellig.github.io/Zimmerman-en-Space-podcast/ZimmermanEnSpacePodcast_episodes1-92.xlsx
- Python script: download_mp3s.py
- Folder: mp3-files
- Filenames are Buzzsprout titles, e.g. 11845039-tsunami-s-op-mars.mp3
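The download step can be sketched roughly as follows. This is a minimal illustration, not the contents of download_mp3s.py: the function names `mp3_filename_from_url` and `download_mp3` are hypothetical, and it assumes each episode's mp3 URL ends in the Buzzsprout title used as the local filename.

```python
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def mp3_filename_from_url(url):
    """Return the last path segment of an mp3 URL (query string ignored)."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def download_mp3(url, dest_dir="mp3-files"):
    """Download one mp3 into dest_dir, skipping files that already exist."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / mp3_filename_from_url(url)
    if not target.exists():
        # Stream the response to disk instead of loading it into memory.
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

Calling `download_mp3(url)` for every mp3 URL in the episode spreadsheet would fill the mp3-files folder. Because existing files are skipped, the loop can be re-run safely after an interruption.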
Converting from mp3 to ogg/oga:
- Python script: convert_mp3s_to_oga.py - Make sure ffmpeg is installed on your machine and has been added to your system's PATH.
- Alternatively, use an online .mp3 to .ogg bulk converter, such as online-audio-converter.com. This was the tool actually used for converting the first batch of episodes (1-92). The file extension can be changed from .ogg to .oga without penalty.
- Folder: ogg-files
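For the ffmpeg route, a conversion loop might look like the sketch below. It is not the contents of convert_mp3s_to_oga.py. The function names are hypothetical, and the choice of the libvorbis encoder at quality level 4 is an assumption to adjust as needed.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_command(src, dst):
    """Build the ffmpeg argument list for one mp3 -> Ogg Vorbis conversion."""
    # Assumption: Vorbis audio at quality 4; not confirmed by the project.
    return ["ffmpeg", "-i", str(src), "-c:a", "libvorbis", "-q:a", "4", str(dst)]


def convert_folder(src_dir="mp3-files", dst_dir="ogg-files"):
    """Convert every .mp3 in src_dir to .ogg in dst_dir, skipping existing files."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the system PATH")
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for mp3 in sorted(Path(src_dir).glob("*.mp3")):
        target = out_dir / (mp3.stem + ".ogg")
        if target.exists():
            continue  # already converted in an earlier run
        subprocess.run(ffmpeg_command(mp3, target), check=True)
        converted.append(target)
    return converted
```

The explicit `shutil.which` check gives a clearer error than a failed subprocess call when ffmpeg is missing from PATH.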
Wikimedia Commons:
- Files must be copied and renamed from Buzzsprout titles to Wikimedia Commons syntax titles, e.g. from ogg-files/11845039-tsunami-s-op-mars.ogg to oga-files/Tsunami's_op_Mars_-Zimmerman_en_Space-S01E01-2022-12-09-_11845039.oga
- Python script: copy_and_rename_local_ogg_to_wmc_oga.py
- Folder with Commons compatible files: oga-files
- Excel file as source for the OpenRefine project, to bulk upload the .oga files and metadata to Wikimedia Commons: see the Excel file above, which includes all columns needed for the OpenRefine project.
- OpenRefine project files : ZimmermanEnSpacePodcast-episodes1-92-xlsx.openrefine.tar.gz
- Category: Zimmerman en Space podcast
- Gallery, grouped by year, sorted by date : Zimmerman en Space podcast
- Category: Hens Zimmerman
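The copy-and-rename step described above can be sketched as follows. This is not the contents of copy_and_rename_local_ogg_to_wmc_oga.py. It assumes a mapping from Buzzsprout filename to Commons filename has already been built, for instance from the episode spreadsheet (its column names are not assumed here). The function name `copy_and_rename` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_and_rename(name_map, src_dir="ogg-files", dst_dir="oga-files"):
    """Copy each source file to dst_dir under its Commons-compatible name.

    name_map maps a Buzzsprout filename (.ogg) to a Commons filename (.oga).
    Source files that do not exist are skipped. Returns the created paths.
    """
    src_root = Path(src_dir)
    dst_root = Path(dst_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for old_name, new_name in name_map.items():
        source = src_root / old_name
        if not source.is_file():
            continue  # nothing to copy for this episode
        target = dst_root / new_name
        shutil.copy2(source, target)  # copy, so the original .ogg is kept
        copied.append(target)
    return copied
```

Copying instead of moving leaves the ogg-files folder intact, so the renaming can be repeated if the Commons naming scheme changes.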
Full-text audio transcriptions will be added bit by bit to the Commons files over the coming months.
- For the ChatGPT-corrected transcripts of each episode, see the transcripts/chatgpt-corrected folder. Files are in Markdown (.md) format.
- For a first, fully worked example on Commons, see S01E01 Tsunami's op Mars.
- For current status, see this issue.
Main subject (P921) statements will be added bit by bit to the structured data of each Commons file over the coming months. These episode subjects/keywords will be extracted from the title and the full-text audio transcription using Named Entity Recognition (NER) techniques, followed by reconciliation of the found entities against Wikidata. For current status, see this issue.
For a fully worked example, see S01E01 Tsunami's op Mars.
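The reconciliation half of that plan could be approached along these lines. The sketch below only covers looking up an already extracted entity name through Wikidata's `wbsearchentities` API and does not perform NER itself. The function names, the default language `nl`, and the User-Agent string are assumptions for illustration, not the project's actual pipeline.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(term, language="nl", limit=5):
    """Build a wbsearchentities request URL for one entity name."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def candidates_from_response(payload):
    """Extract (QID, label, description) tuples from a parsed API response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def reconcile(term, language="nl"):
    """Query Wikidata and return candidate items for a single entity name."""
    # Hypothetical User-Agent; replace with one identifying your own tool.
    request = Request(
        search_url(term, language),
        headers={"User-Agent": "podcast-reconcile-sketch/0.1"},
    )
    with urlopen(request) as response:
        return candidates_from_response(json.load(response))
```

The returned candidates would still need a manual or rule-based check before a QID is written to P921, since a name search alone can match unrelated items.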
Request info about episode 14, AI en Chat GPT in de sterrenkunde
Structured data has been added to all files, so we can do some (basic) semantic searching via SPARQL queries.
- Zimmerman en Space podcast: https://www.wikidata.org/wiki/Q130355362
- Hens Zimmerman : https://www.wikidata.org/wiki/Q130279350
All episodes 1-92 of the Zimmerman en Space podcast have been licensed under the Creative Commons CC0 1.0 license, as stated in the shownotes of each episode.