Podify: a web-based platform for podcast streaming and consumption specifically designed for research. Published at SIGIR2023.

Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research

Podify is the first podcast streaming service specifically designed for academic research. It closely resembles existing modern streaming services and has a scalable design that accommodates large-scale user studies. It implements a customisable catalogue search, manual playlist creation and curation, podcast listening, and explicit and implicit feedback collection mechanisms. All user interactions are automatically logged by the platform and can be easily exported in a readable format for subsequent analysis, so Podify aims to reduce the overhead researchers face when conducting user studies.

This repository contains the source code for the platform outlined in the Demonstration Paper Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research, accepted at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2023).

For the YouTube presentation of this platform, please click here.

To learn more about our research activities at NeuraSearch Laboratory, please follow us on Twitter (@NeuraSearch). To get notified of future uploads, please subscribe to our YouTube channel!

Installation

  1. Ruby v3.0.2
    • It is recommended to use a version manager such as rbenv
  2. Bundler:
    • gem install bundler
  3. Ruby on Rails:
    • cd Podify
    • bundle install
  4. Redis
  5. Docker
  6. FFmpeg
  7. geoip-database
    • It is preinstalled on Heroku
    • For a local install on Debian/Ubuntu: sudo apt-get install geoip-database
  8. Tailwind CSS
    • cd Podify
    • ./bin/rails tailwindcss:install

Run Podify

In a terminal window, and from the root folder (cd Podify), run:

./bin/dev

Navigate to localhost:3000 in your local browser. Before interacting with Podify, however, please complete all the steps outlined below.

Elasticsearch Instance

Run the following command prior to creating and seeding the database.

docker run \
    -d \
    --name elasticsearch-podify \
    --publish 9200:9200 \
    --env "discovery.type=single-node" \
    --env "cluster.name=elasticsearch-rails" \
    --env "cluster.routing.allocation.disk.threshold_enabled=false" \
    --rm \
    docker.elastic.co/elasticsearch/elasticsearch-oss:7.6.0
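
Because the container starts in the background (`-d`), the node may not be ready the moment the command returns. The following is a minimal sketch, not part of the Podify repository, of how one might wait for the instance before creating and seeding the database. It polls Elasticsearch's `_cluster/health` endpoint on the port published above. The function names, retry count, and delay are illustrative assumptions.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def wait_for_elasticsearch(base_url="http://localhost:9200", attempts=30,
                           delay=2.0, fetch=fetch_json):
    """Poll the cluster health endpoint until it reports green or yellow.

    Returns the status string, or None if the node never became ready.
    `fetch` is injectable so the polling logic can be exercised offline.
    """
    for attempt in range(attempts):
        try:
            health = fetch(f"{base_url}/_cluster/health")
            if health.get("status") in ("green", "yellow"):
                return health["status"]
        except (OSError, ValueError):
            pass  # connection refused or incomplete response: still starting
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Calling `wait_for_elasticsearch()` with the defaults blocks for up to about a minute. A return value of `None` suggests the container is not running or the port is not published.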

The Database

Creation and Seeding

Create and seed the PostgreSQL database, as specified in db/seeds.rb:

rails db:reset

In this step, an admin user is also created, with the following credentials:

These credentials can be used to access the admin dashboard, available at: localhost:3000/admin

Amazon Web Services (AWS)

Podify uses AWS S3 buckets to generate the catalogue and to download and store the audio, transcript, and image files.

Please create an S3 bucket and then edit the credentials with the command below, making sure to provide the access_key_id, secret_access_key, region, and bucket_name values:

EDITOR="code --wait" bin/rails credentials:edit

Data Pre-Processing

Podify expects RSS feeds, so its use is not restricted to any particular source such as the Spotify Podcast Dataset. The RSS feeds originating from the Spotify Podcast Dataset were, however, used for the demonstration paper. To pre-process that data for the creation of the catalogue, execute the following scripts in order:

  1. python3 utils/1-extract_episodes.py
    • [Requirement]: metadata.tsv of the Spotify Podcast Dataset
    • This script creates episodes.json from metadata.tsv. Only the episodes with valid metadata and RSS feed are included in the JSON file. This list of episodes will be the catalogue.
  2. python3 utils/2-download_audio_files.py
    • [Requirement]: set up rclone as documented in the Spotify Podcast Dataset README.md file
    • This script creates a new folder and downloads the audio files from the Spotify Podcast Dataset for the episodes listed in episodes.json
  3. python3 utils/3-convert_transcripts_to_vtt.py
    • [Requirement]: a folder (podcasts-transcripts) that contains all the transcript files of the Spotify Podcast Dataset. The tar.gz files have to be extracted. The resulting podcasts-transcripts folder will be used by this script
    • This script converts the transcripts to VTT format and to a word-level representation. The transcripts are uploaded during the catalogue creation so that they can be indexed by the Elasticsearch instance
  4. python3 utils/4-extract_transcript_files.py
    • This script, similarly to step (2), creates a new folder and fetches only the transcript files for the episodes listed in episodes.json
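
To illustrate the filtering idea behind step (1), here is a minimal sketch that is not the repository's `utils/1-extract_episodes.py`. It reads a tab-separated metadata file and keeps only rows in which every required field is non-empty. The column names `episode_uri`, `episode_name`, and `rss_link` are assumptions about the TSV header, and the real script may apply additional validity checks.

```python
import csv
import json


def extract_episodes(metadata_path, output_path,
                     required=("episode_uri", "episode_name", "rss_link")):
    """Write the rows of a TSV file that have all required fields to JSON.

    The names in `required` are assumed column headers, not confirmed ones.
    Returns the number of episodes kept.
    """
    episodes = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip rows where any required field is missing or blank.
            if all((row.get(column) or "").strip() for column in required):
                episodes.append(row)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(episodes, handle, ensure_ascii=False, indent=2)
    return len(episodes)
```

For example, `extract_episodes("metadata.tsv", "episodes.json")` would write the retained rows as a JSON list and return how many were kept.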

Whilst Podify is built in Ruby on Rails, these scripts have been provided in Python. This is to ease the researchers' job of customising and adapting these procedures to their own needs.
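
As an example of the kind of adaptation this allows, the sketch below shows one way word-level timings could be turned into WebVTT cues, as in step (3). It is not the repository's `utils/3-convert_transcripts_to_vtt.py`. The input shape (a list of `(start_seconds, end_seconds, word)` tuples) and the `max_words` grouping are assumptions made for illustration, so parsing the dataset's own transcript files is left out.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def words_to_vtt(words, max_words=10):
    """Group (start, end, word) tuples into WebVTT cues of up to max_words words."""
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start, end = chunk[0][0], chunk[-1][1]
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(" ".join(word for _, _, word in chunk))
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```

Grouping by a fixed word count is the simplest choice. Splitting on pauses or sentence boundaries would likely give more readable cues.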

Catalogue Creation Procedure

In a terminal window, and from the root folder (cd Podify), run Sidekiq:

bundle exec sidekiq

With Sidekiq operating and ready to accept incoming jobs, the following task will create the catalogue. Please be aware that this process may take some time, depending on the number of episodes that are going to be uploaded onto Podify.

rails episodes:seed_episodes bucket_segments_object_key="episodes.json"

Once the catalogue is fully created (any pending jobs are listed at localhost:3000/admin/sidekiq), the Sidekiq process can be stopped and the terminal closed. Although user behaviour can be manually downloaded via the admin dashboard, a cron schedule is also implemented to avoid any potential data loss. Please note that the cron schedule requires a running Sidekiq process.

Deployment (Heroku)

Install the Heroku CLI with the following guide: https://devcenter.heroku.com/articles/heroku-cli

Once the CLI is installed, and you are logged in (heroku login), run the following:

cd Podify
heroku apps:create --stack=heroku-20 neurasearch-podify
heroku buildpacks:set heroku/nodejs --index 1
heroku buildpacks:set heroku/ruby --index 2
heroku buildpacks:add --index 3 https://github.com/jonathanong/heroku-buildpack-ffmpeg-latest.git
git push heroku main
heroku run rake db:migrate
heroku ps:scale web=1
heroku open

Cite

Please cite this work as follows:

@inproceedings{10.1145/3539618.3591824,
    author = {Meggetto, Francesco and Moshfeghi, Yashar},
    title = {Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research},
    year = {2023},
    isbn = {9781450394086},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3539618.3591824},
    doi = {10.1145/3539618.3591824},
    booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages = {3215–3219},
    numpages = {5},
    keywords = {user behaviour, platform, podcast, logging, listening, search},
    location = {Taipei, Taiwan},
    series = {SIGIR '23}
}
Francesco Meggetto and Yashar Moshfeghi. 2023. Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23). Association for Computing Machinery, New York, NY, USA, 3215–3219. https://doi.org/10.1145/3539618.3591824
