Podify is the first podcast streaming service specifically designed for academic research. It closely resembles existing modern streaming services and is built to scale to large-scale user studies, implementing customisable catalogue search, manual playlist creation and curation, podcast listening, and both explicit and implicit feedback collection. All user interactions are automatically logged by the platform and can easily be exported in a readable format for subsequent analysis, so Podify aims to reduce the overhead researchers face when conducting user studies.
This repository contains the source code for the platform outlined in the demonstration paper "Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research", accepted at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023).
For the YouTube presentation of this platform, please click here.
To learn more about our research activities at NeuraSearch Laboratory, please follow us on Twitter (@NeuraSearch), and to be notified of future uploads, please subscribe to our YouTube channel!
- Ruby v3.0.2 (a version manager such as rbenv is recommended)
- Bundler: gem install bundler
- Ruby on Rails: from the Podify root folder (cd Podify), run bundle install
- Redis
- Docker
- FFmpeg
- geoip-database (preinstalled on Heroku): sudo apt-get install geoip-database
- Tailwind CSS: from the Podify root folder (cd Podify), run ./bin/rails tailwindcss:install
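Before proceeding, you may want to confirm that Redis and FFmpeg are reachable from the command line (standard checks, not specific to Podify):

```sh
# Redis should reply with PONG
redis-cli ping

# FFmpeg should print its version and build information
ffmpeg -version
```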
In a terminal window, and from the root folder (cd Podify), run:
./bin/dev
Navigate to localhost:3000 from your local browser. Before interacting with Podify, however, please complete all the steps outlined below.
Run the following command to start a local Elasticsearch instance prior to creating and seeding the database.
docker run \
-d \
--name elasticsearch-podify \
--publish 9200:9200 \
--env "discovery.type=single-node" \
--env "cluster.name=elasticsearch-rails" \
--env "cluster.routing.allocation.disk.threshold_enabled=false" \
--rm \
docker.elastic.co/elasticsearch/elasticsearch-oss:7.6.0
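To confirm that the container is up and accepting connections before seeding, you can query the node (this is standard Elasticsearch behaviour, not something Podify-specific):

```sh
# Should return a JSON document whose cluster_name is "elasticsearch-rails"
curl http://localhost:9200
```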
Create and seed the PostgreSQL database, as specified in db/seeds.rb:
rails db:reset
In this step, an admin user is also created, with the following credentials:
- Username: [email protected]
- Password: password
These credentials can be used to access the admin dashboard, available at: localhost:3000/admin
Podify uses AWS S3 buckets to generate the catalogue as well as to download and store the audio, transcript, and image files.
Please create an S3 bucket and then edit the Rails encrypted credentials, making sure to provide the access_key_id, secret_access_key, region, and bucket_name values:
EDITOR="code --wait" bin/rails credentials:edit
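As an illustration, the S3 entries in the credentials file might look roughly like the following; the nesting under aws: is an assumption, so match it to whatever the application's storage configuration expects, but the four values named above are the ones to fill in:

```yaml
# Hypothetical layout -- only the four key names are taken from the instructions
# above; adjust the nesting to match the app's storage configuration.
aws:
  access_key_id: YOUR_ACCESS_KEY_ID
  secret_access_key: YOUR_SECRET_ACCESS_KEY
  region: eu-west-2
  bucket_name: your-podify-bucket
```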
Since Podify works with RSS feeds, its usage is not restricted to, for example, the Spotify Podcast Dataset. However, the RSS feeds originating from the Spotify Podcast Dataset were used for the demonstration paper. Thus, in order to pre-process the data for the creation of the catalogue, the following scripts have to be executed:
python3 utils/1-extract_episodes.py
- [Requirement]: metadata.tsv of the Spotify Podcast Dataset
- This script creates episodes.json from metadata.tsv. Only the episodes with valid metadata and RSS feed are included in the JSON file. This list of episodes will be the catalogue.
python3 utils/2-download_audio_files.py
- [Requirement]: set up rclone as documented in the Spotify Podcast Dataset README.md file
- This script creates a new folder and downloads the audio files from the Spotify Podcast Dataset for the episodes listed in episodes.json
python3 utils/3-convert_transcripts_to_vtt.py
- [Requirement]: a folder (podcasts-transcripts) that contains all the transcript files of the Spotify Podcast Dataset. The tar.gz files have to be extracted first; the resulting podcasts-transcripts folder will be used by this script.
- This script converts the transcripts to the VTT format and to a word-level representation. The transcripts will be uploaded during the catalogue creation to be indexed by the Elasticsearch instance.
python3 utils/4-extract_transcript_files.py
- This script, similar to step (2), creates a new folder and fetches only the transcript files that are listed in episodes.json
Whilst Podify is built in Ruby on Rails, these scripts are provided in Python to make it easier for researchers to customise and adapt these procedures to their own needs.
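As a starting point for such customisation, here is a minimal sketch of the filtering performed in step (1); it assumes pandas is available and that the column names (in particular rss_link) match the published metadata.tsv schema of the Spotify Podcast Dataset, and it is not the exact script shipped in utils/:

```python
# Hypothetical sketch of the step (1) filtering -- adjust column names if your
# copy of the Spotify Podcast Dataset differs.
import pandas as pd

# Load the episode metadata shipped with the Spotify Podcast Dataset
metadata = pd.read_csv("metadata.tsv", sep="\t")

# Keep only episodes with non-empty core metadata and a valid RSS feed URL
valid = metadata.dropna(subset=["episode_name", "rss_link"])
valid = valid[valid["rss_link"].str.startswith("http")]

# The resulting list of episodes becomes the Podify catalogue
valid.to_json("episodes.json", orient="records")
```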
In a terminal window, and from the root folder (cd Podify), run Sidekiq:
bundle exec sidekiq
With Sidekiq operating and ready to accept incoming jobs, the following task will create the catalogue. Please be aware that this process may take some time, depending on the number of episodes that are going to be uploaded onto Podify.
rails episodes:seed_episodes bucket_segments_object_key="episodes.json"
Once the catalogue is fully created (the pending jobs, if any, can be found at localhost:3000/admin/sidekiq), the Sidekiq process can be stopped and the terminal closed. Although user behaviour can be manually downloaded via the admin dashboard, a cron schedule is also implemented to avoid any potential data loss. Please note that this requires a running Sidekiq process.
Install the Heroku CLI by following this guide: https://devcenter.heroku.com/articles/heroku-cli
Once the CLI is installed, and you are logged in (heroku login), run the following:
cd Podify
heroku apps:create --stack=heroku-20 neurasearch-podify
heroku buildpacks:set heroku/nodejs --index 1
heroku buildpacks:set heroku/ruby --index 2
heroku buildpacks:add --index 3 https://github.com/jonathanong/heroku-buildpack-ffmpeg-latest.git
git push heroku main
heroku run rake db:migrate
heroku ps:scale web=1
heroku open
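Since the S3 settings above are stored in the Rails encrypted credentials, the deployed app will also need access to the master key. This is the standard approach for Rails credentials on Heroku rather than a Podify-specific step:

```sh
# Expose the Rails credentials master key to the Heroku app
heroku config:set RAILS_MASTER_KEY=$(cat config/master.key)
```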
Please cite this work as follows:
@inproceedings{10.1145/3539618.3591824,
author = {Meggetto, Francesco and Moshfeghi, Yashar},
title = {Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research},
year = {2023},
isbn = {9781450394086},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3539618.3591824},
doi = {10.1145/3539618.3591824},
booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {3215–3219},
numpages = {5},
keywords = {user behaviour, platform, podcast, logging, listening, search},
location = {Taipei, Taiwan},
series = {SIGIR '23}
}
Francesco Meggetto and Yashar Moshfeghi. 2023. Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23). Association for Computing Machinery, New York, NY, USA, 3215–3219. https://doi.org/10.1145/3539618.3591824