This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

Permalink
merge: Add fireside-scraper that pulls all JB episodes #30
StefanS-O committed Jun 28, 2022
1 parent be80d05 commit 563cdca
Showing 14 changed files with 607 additions and 13 deletions.
4 changes: 3 additions & 1 deletion .dockerignore
@@ -1,3 +1,5 @@
public/
.github/
.git/
fireside-scraper
scraped-data
4 changes: 4 additions & 0 deletions .gitignore
@@ -15,3 +15,7 @@ $RECYCLE.BIN/

# Editor
.idea
.vscode

# Ignore all the scraped data
scraped-data
17 changes: 16 additions & 1 deletion Makefile
@@ -5,4 +5,19 @@ build:
	hugo -D

run:
	docker-compose up -d --build
	docker-compose up -d --build jbsite

# Clean the scraped data
scrape-clean:
	rm -r scraped-data && mkdir scraped-data

# Scrape all the data from Fireside into the scraped-data dir
scrape: scrape-clean
	docker-compose up -d --build fireside-scraper && \
	docker-compose logs --no-log-prefix -f fireside-scraper

# Copy the contents of scraped-data into the project
scrape-copy:
	./scrape-copy.sh && ./generate-guests-symlinks.sh

scrape-full: scrape scrape-copy
43 changes: 42 additions & 1 deletion README.md
@@ -69,9 +69,50 @@ Deployment is done with Github Actions, see workflow file in `.github/workflows/
At the moment it is only triggered when something in the `main` branch changes, but it can also be set up to run at scheduled times.
This would also enable scheduled publishing, since Hugo by default only builds pages whose `date` in the front matter is <= `now`.


## Fireside Scraper

The [fireside-scraper](./fireside-scraper/) is based on [JB Show Notes](https://github.com/selfhostedshow/show-notes) that was written by [ironicbadger](https://github.com/ironicbadger).

It goes over all the JB Fireside shows and scrapes the episodes into the format Hugo expects for each episode (using [this template](./fireside-scraper/src/templates/episode.md.j2)).
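
For orientation, here is a minimal sketch of that flow, assuming the requests/BeautifulSoup/Jinja2 stack pinned in `requirements.txt`; the selectors, field names, and example URL are illustrative and not what `scraper.py` actually does:

```
# Hedged sketch: fetch a Fireside episode page and render it into a Hugo
# markdown file via a Jinja2 template. Selectors, field names and paths
# are illustrative assumptions, not the real scraper.py implementation.
import requests
from bs4 import BeautifulSoup
from jinja2 import Environment, FileSystemLoader


def scrape_episode(url: str) -> dict:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h1").get_text(strip=True),
        "url": url,
        # Fireside episode pages typically carry the publish date in a <time> tag.
        "date": soup.find("time")["datetime"],
    }


def render_episode(episode: dict) -> str:
    env = Environment(loader=FileSystemLoader("templates"))
    return env.get_template("episode.md.j2").render(episode=episode)


if __name__ == "__main__":
    # Example URL only; the real show URLs come from config.yml.
    print(render_episode(scrape_episode("https://selfhosted.show/1")))
```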

Besides the episodes it also scrapes and creates the json files for:

- sponsors
- hosts
- guests (every host is symlinked into the [guests dir](./data/guests/), since a host of one show could be a guest on an episode of a different show; see the sketch below)
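
The symlinking itself is done by `generate-guests-symlinks.sh`; the hedged Python sketch below shows the same idea, with the `data/hosts` path being an assumption about the layout:

```
# Hedged sketch of the host-to-guest symlinking idea behind
# generate-guests-symlinks.sh; the real script is shell, and the
# data/hosts location is an assumption about the repo layout.
from pathlib import Path

hosts_dir = Path("data/hosts")
guests_dir = Path("data/guests")
guests_dir.mkdir(parents=True, exist_ok=True)

for host_json in hosts_dir.glob("*.json"):
    link = guests_dir / host_json.name
    if not link.exists():
        # Relative target, so the link keeps working if the repo is moved.
        link.symlink_to(Path("..") / "hosts" / host_json.name)
```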

There are Makefile targets that should be used to run it.

### Run the scraper

The command below builds and starts the container, which saves all the data into the `scraped-data` dir.

```
make scrape
```

The files are organised in the same way as the files in the root project, so including all the scraped content is just a matter of copying the contents of `scraped-data` over to the root dir of the repo:

```
make scrape-copy
```

or run the following to scrape and copy everything into the root dir in one step:

```
make scrape-full
```
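
Conceptually the copy step just merges `scraped-data` into the repo root; a hedged Python sketch of what `scrape-copy.sh` does (the shell script remains the source of truth):

```
# Hedged sketch: mirror the contents of scraped-data into the repo root,
# which is conceptually what scrape-copy.sh does; this is only an
# illustration, not the actual script.
import shutil
from pathlib import Path

src = Path("scraped-data")
dst = Path(".")

for entry in src.iterdir():
    if entry.is_dir():
        # dirs_exist_ok merges into the existing content/ and data/ trees.
        shutil.copytree(entry, dst / entry.name, dirs_exist_ok=True)
    else:
        shutil.copy2(entry, dst / entry.name)
```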

### Configuring the scraper

Configure the scraper by modifying the [config.yml file](./fireside-scraper/src/config.yml).
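
The scraper presumably loads this file with PyYAML (pinned in `requirements.txt`); a minimal sketch of consuming the `shows` map, where everything beyond the key names is illustrative:

```
# Hedged sketch: read config.yml and iterate the configured shows.
# The key names match the config.yml added in this commit; what happens
# per show is illustrative, not the actual scraper logic.
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

for slug, show in config["shows"].items():
    print(f"{show['acronym']:>4}  {show['name']:<20} {show['fireside_url']}")
    # A real run would crawl show["fireside_url"] and write one markdown
    # file per episode (output path is an assumption).
```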

## Credits

- I took parts of the functionality from the Castanet Theme: https://github.com/mattstratton/castanet
  Mainly the RSS feed generation and the handling of hosts/guests.

- [ironicbadger](https://github.com/ironicbadger) and the [JB Show Notes](https://github.com/selfhostedshow/show-notes) project, which was used as the base for the `fireside-scraper`

Time spent so far: 13h
2 changes: 1 addition & 1 deletion config.toml
@@ -1,4 +1,4 @@
baseURL = 'https://jb.codefighters.net/'
baseURL = 'http://localhost:1111/'
languageCode = 'en-us'
title = 'Jupiter Broadcasting'

9 changes: 0 additions & 9 deletions data/guests/alex.json

This file was deleted.

9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -8,3 +8,12 @@ services:
      context: .
    ports:
      - 1111:80
  fireside-scraper:
    user: 1000:1000
    image: fireside-scraper
    container_name: fireside-scraper
    build:
      context: ./fireside-scraper
    volumes:
      - ./scraped-data:/data
      - ./data:/hugo-data:ro
10 changes: 10 additions & 0 deletions fireside-scraper/Dockerfile
@@ -0,0 +1,10 @@
FROM python:3.10-alpine

RUN mkdir /data && chown -R 1000:1000 /data

COPY ./src/ /
RUN chown 1000:1000 /scraper.py
RUN pip install -U -r requirements.txt

USER 1000
CMD [ "python3", "scraper.py" ]
31 changes: 31 additions & 0 deletions fireside-scraper/src/config.yml
@@ -0,0 +1,31 @@
shows:
  selfhosted:
    fireside_url: https://selfhosted.show
    header_image: /images/shows/selfhosted.png
    acronym: SSH
    name: Self-Hosted
  coderradio:
    fireside_url: https://coder.show
    header_image: /images/shows/coderradio.png
    acronym: CR
    name: Coder Radio
  linux-action-news:
    fireside_url: https://linuxactionnews.com
    header_image: /images/shows/linux-action-news.png
    acronym: LAN
    name: Linux Action News
  linuxun:
    fireside_url: https://linuxunplugged.com
    header_image: /images/shows/linuxun.png
    acronym: LUP
    name: LINUX Unplugged
  extras:
    fireside_url: https://extras.show
    header_image: /images/shows/extras.png
    acronym: JE
    name: Jupiter EXTRAS
  officehours:
    fireside_url: https://www.officehours.hair
    header_image: /images/shows/officehours.png
    acronym: JE
    name: Office Hours
7 changes: 7 additions & 0 deletions fireside-scraper/src/requirements.txt
@@ -0,0 +1,7 @@
beautifulsoup4==4.9.3
requests==2.25.1
jinja2==3.0.1
pymdown-extensions==8.2
html2text==2020.1.16
pyyaml==5.4.1
python-dateutil==2.8.2
