This is a set of Node.js scripts that crawl the Lemmy API and store the data in Redis. They can also generate JSON output bundles for the frontend.

A Redis instance must be running on `localhost:6379` for the crawler to store data.

There is an `npx tsx index.js --out` script that outputs the instances and communities to JSON for the frontend.
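For example, you can confirm that Redis is reachable and then generate the output bundles (a minimal sketch; assumes `redis-cli` is installed locally):

```bash
# Check that Redis is reachable on localhost:6379 (requires redis-cli)
redis-cli -h localhost -p 6379 ping

# Generate the JSON output bundles for the frontend
npx tsx index.js --out
```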
- Start the Redis server (`docker compose up -d redis`). The Redis server config is in `lib/const.ts`.
- Start the crawler (`yarn start`). This will use pm2 to start the crawler in the background. You can use `yarn logs` to monitor the crawler.
- Put some jobs into the queue (`npx tsx index.js --init`). This will put some jobs into the queue based off the lists in `lib/const.ts`. The full sequence is shown after this list.
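Run back to back, the steps above look like this:

```bash
# 1. Start Redis in the background (Redis config lives in lib/const.ts)
docker compose up -d redis

# 2. Start the crawler under pm2
yarn start

# 3. Seed the queue with the initial jobs from lib/const.ts
npx tsx index.js --init

# (optional) follow the crawler logs
yarn logs
```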
The crawler can also be run with docker-compose using the following commands.

- Start the Redis server in the background (`docker compose up -d redis`)
- Start the crawler in the foreground (`docker compose up crawler --build`). You can also use `docker compose up -d crawler --build` and then use `docker-compose logs --tail 400 -f crawler` to monitor it, as shown in the example after this list.
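For a fully detached setup, the commands above combine as follows:

```bash
# Start Redis and the crawler in the background, rebuilding the crawler image
docker compose up -d redis
docker compose up -d crawler --build

# Follow the last 400 lines of crawler output
docker-compose logs --tail 400 -f crawler
```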
If you want to configure auto-upload to S3 or other options, copy `.env.example` to `.env` and edit it.
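The copy step itself is just the following (the variables inside `.env.example` are project-specific, so edit the resulting file to suit your setup):

```bash
# Create a local .env from the example, then edit it
cp .env.example .env
```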
These options immediately run a specific task:

`npx tsx index.js [options]`
| Option | Description |
| --- | --- |
| `--out` | Output JSON files for frontend |
| `--clean` | Clean up old jobs |
| `--init` | Initialize queue with seed jobs |
| `--health` | Check worker health |
| `--aged` | Create jobs for aged instances and communities |
| `--kbin` | Create jobs for kbin communities |
| `--uptime` | Immediately crawl uptime data |
| `--fedi` | Immediately crawl Fediseer data |
| Action | Command |
| --- | --- |
| Output JSON bundles for the frontend | `npx tsx index.js --out` |
| Initialize queue with seed jobs | `npx tsx index.js --init` |
| View worker health | `npx tsx index.js --health` |
| Create jobs for aged instances and communities | `npx tsx index.js --aged` |
| Immediately crawl uptime data | `npx tsx index.js --uptime` |
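A typical maintenance pass using only these options might look like this (one possible ordering, not a required one):

```bash
# Seed the queue, check worker health, then refresh the frontend bundles
npx tsx index.js --init
npx tsx index.js --health
npx tsx index.js --out
```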
These start a worker that will run continuously, processing jobs from the relevant queue:

`npx tsx index.js -w [worker_name]`
| Worker | Description |
| --- | --- |
| `-w instance` | Crawl instances from the queue |
| `-w community` | Crawl communities from the queue |
| `-w single` | Crawl single communities from the queue |
| `-w kbin` | Crawl kbin communities from the queue |
| `-w cron` | Schedule all CRON jobs for aged instances and communities, etc. |
| Action | Command |
| --- | --- |
| Start Instance Worker | `npx tsx index.js -w instance` |
| Start CRON Worker | `npx tsx index.js -w cron` |
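To keep every queue moving, you would typically run one worker per queue plus the cron scheduler, each as its own long-running process (in normal operation `yarn start` manages the crawler under pm2; the commands below just show the individual workers):

```bash
# Each worker runs until stopped, so start each one in its own terminal/process
npx tsx index.js -w instance    # crawl instances from the queue
npx tsx index.js -w community   # crawl community lists from the queue
npx tsx index.js -w single      # crawl single communities from the queue
npx tsx index.js -w kbin        # crawl kbin communities from the queue
npx tsx index.js -w cron        # schedule the recurring CRON jobs
```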
These start a worker that will run a single job, then exit:

`npx tsx index.js -m [worker] [base_url] [community_name]?`
| Worker | Description |
| --- | --- |
| `-m [i\|instance] <base_url>` | Crawl a single instance |
| `-m [c\|community] <base_url>` | Crawl a single instance's community list |
| `-m [s\|single] <base_url> <community_name>` | Crawl a single community, deleting it if it does not exist |
| `-m [k\|kbin] <base_url>` | Crawl a single kbin community |
| Action | Command |
| --- | --- |
| Crawl a single instance | `npx tsx index.js -m i lemmy.tgxn.net` |
| Crawl a single instance's community list | `npx tsx index.js -m c lemmy.tgxn.net` |
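To crawl one specific community, pass the community name after the base URL (the community name below is only a placeholder):

```bash
# Crawl a single community; it is removed from storage if it no longer exists
npx tsx index.js -m s lemmy.tgxn.net example_community
```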
| Directory | Description |
| --- | --- |
| `bin/` | Helpers for the CLI interface |
| `crawler/` | Main crawler scripts |
| `lib/` | Libraries for crawling, error handling and storage |
| `output/` | Scripts to generate JSON output bundles |
| `queue/` | Queue processor scripts |
| `store/` | Redis storage classes |
Crawlers are tasks created to perform an action, such as crawling an instance, a community, or other data.
| Crawler | Description |
| --- | --- |
| `instance` | Instance Crawling |
| `community` | Community Crawling |
| `fediseer` | Fediseer Crawling |
| `uptime` | Uptime Crawling |
| `kbin` | Kbin Crawling |
Queues are where Tasks can be placed to be processed.
| Task | Description |
| --- | --- |
| `instance` | Crawl an instance |
| `community_list` | Crawl an instance's community list |
| `community_single` | Crawl a single community |
| `kbin` | Crawl a kbin community |
Redis is used to store the crawled data. You can use `docker compose up -d` to start a local Redis server. Data is persisted to a `.data/redis` directory.
| Redis Key | Description |
| --- | --- |
| `attributes:*` | Tracked attribute sets (change over time) |
| `community:*` | Community details |
| `deleted:*` | Deleted data (recycle bin in case something breaks) |
| `error:*` | Exception details |
| `fediverse:*` | Fediverse data |
| `instance:*` | Instance details |
| `last_crawl:*` | Last crawl time for instances and communities |
| `magazine:*` | Magazine data (kbin magazines) |
| `uptime:*` | Uptime data (fetched from `api.fediverse.observer`) |
Most of the keys have sub-keys for the instance `base_url` or community `base_url:community_name`.
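For example, you can inspect the stored data directly with `redis-cli` (key names below follow the patterns in the table; the instance and community names are placeholders):

```bash
# List the keys for one instance and its communities (placeholder names)
redis-cli --scan --pattern 'instance:lemmy.tgxn.net'
redis-cli --scan --pattern 'community:lemmy.tgxn.net:*'

# See which instances and communities have last-crawl timestamps recorded
redis-cli --scan --pattern 'last_crawl:*'
```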