TODO.txt
Crawling metrics:
- New URLs discovered per URL processed? A running average would show how fast the frontier is growing (sketch below)
- Should the crawler track whether a document can be indexed? Then use something like the host health tracker
to check how many of a host's last 100 responses were indexable, and discard that host's URLs if most aren't (see the sketch after this list)
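Rough sketch of both metrics (Python here since this file has no project code; class and method names are made up):

from collections import defaultdict, deque

class CrawlMetrics:
    """Running average of new URLs discovered per URL processed."""
    def __init__(self):
        self.processed = 0
        self.discovered = 0

    def record(self, new_urls_found):
        self.processed += 1
        self.discovered += new_urls_found

    @property
    def growth_rate(self):
        # > 1.0 means the frontier is still growing
        return self.discovered / self.processed if self.processed else 0.0

class HostIndexability:
    """Tracks whether each host's last 100 responses were indexable."""
    def __init__(self, window=100, threshold=0.1):
        self.window = window
        self.threshold = threshold
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, host, was_indexable):
        self.history[host].append(was_indexable)

    def should_crawl(self, host):
        seen = self.history[host]
        # Give new hosts the benefit of the doubt until the window fills
        if len(seen) < self.window:
            return True
        return sum(seen) / len(seen) >= self.threshold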
Ranking ideas:
- Use crawl depth as a ranking component, with "deeper" pages ranking lower? (sketch below)
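One possible shape: a multiplicative depth penalty (the decay constant is a guess to tune):

def depth_penalty(depth, decay=0.9):
    # Hypothetical: each level away from the capsule root multiplies
    # the score by `decay`, so depth 0 is unpenalized.
    return decay ** depth

def ranked_score(base_score, depth):
    return base_score * depth_penalty(depth)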
=== WARC work:
- Oh snap: when importing multiple WARCs into the search index, if a later import fails with an error like a timeout, we aren't clearing the FTS entries for it (see the transaction sketch after this list)
- Crash importing WARC: /Users/billy/HDD Inside/Kennedy-Work/WARCs/2023-04-20.warc 42700 316 ms
= Seeing things in the image table that aren't in the imagesearch table
= Looks like they don't have targetIDs... orphaned content?
= Also seeing LastVisit > LastSuccessful where Status = 20
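One fix for the FTS cleanup problem: wrap each WARC import in a single transaction so any failure rolls back its FTS entries too. Sketch assumes a SQLite index; table and column names are made up:

import sqlite3

def import_warc(db_path, warc_path, parse_records):
    """Import one WARC inside a single transaction so a timeout
    or parse error rolls back its FTS entries too."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on any exception
            for record in parse_records(warc_path):
                conn.execute(
                    "INSERT INTO documents (url, body) VALUES (?, ?)",
                    (record.url, record.body))
                conn.execute(
                    "INSERT INTO documents_fts (url, body) VALUES (?, ?)",
                    (record.url, record.body))
    finally:
        conn.close()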
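Diagnostic queries for the orphan and timestamp anomalies above; table and column names are guesses from these notes:

import sqlite3

def find_anomalies(db_path):
    conn = sqlite3.connect(db_path)
    # Image rows with no matching imagesearch entry (orphaned content?)
    orphans = conn.execute("""
        SELECT img.UrlID FROM Images AS img
        LEFT JOIN ImageSearch AS isr ON isr.UrlID = img.UrlID
        WHERE isr.UrlID IS NULL""").fetchall()
    # Successful responses (status 20) whose LastVisit is newer
    # than their LastSuccessful timestamp
    stale = conn.execute("""
        SELECT UrlID, LastVisit, LastSuccessful FROM Documents
        WHERE Status = 20 AND LastVisit > LastSuccessful""").fetchall()
    conn.close()
    return orphans, stale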
GeminiRequestor
=== A truncated body should still keep the content received so far, not come back empty (sketch below)
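Intended behavior, roughly (the reader interface and size cap are assumptions):

MAX_BODY_BYTES = 5 * 1024 * 1024  # assumed cap

def read_body(stream, max_bytes=MAX_BODY_BYTES):
    """Read up to max_bytes; on overflow keep what we have and
    flag the response as truncated instead of discarding it."""
    chunks, total = [], 0
    while total < max_bytes:
        chunk = stream.read(min(65536, max_bytes - total))
        if not chunk:
            return b"".join(chunks), False  # complete body
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks), True  # truncated, but content preserved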
// Crawler:
- Support status 44 slow downs (sketch below)
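Per the Gemini spec, status 44 is SLOW DOWN and the meta field is the number of seconds to wait before the next request to that host. Minimal per-host backoff sketch; the structure is an assumption:

import time

class HostBackoff:
    """Per-host delay honoring Gemini status 44 (SLOW DOWN)."""
    def __init__(self):
        self.next_allowed = {}  # host -> unix time we may request again

    def handle_response(self, host, status, meta):
        if status == 44:
            try:
                delay = int(meta)
            except (TypeError, ValueError):
                delay = 60  # assumed fallback when meta isn't an integer
            self.next_allowed[host] = time.time() + delay

    def can_request(self, host):
        return time.time() >= self.next_allowed.get(host, 0)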
// WARC refactor:
- Need to make sure the detected language is normalized to ISO language codes (see the sketch after this list)
- When converting older crawls to WARC, the truncated header isn't showing up properly in the output WARCs
- Status 44 slow downs probably shouldn't be imported into the search DB or archive
- The WarcProtocol parser is pretty slow: 2x slower than warcio on my laptop, and about 100x slower than a raw read baseline
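Sketch of a per-record import filter covering the language normalization and the status 44 filtering; the alias table is illustrative, not exhaustive:

# Illustrative aliases mapping detector output to ISO 639-1 primary subtags
LANG_ALIASES = {
    "english": "en", "eng": "en",
    "german": "de", "deu": "de", "ger": "de",
    "en-us": "en", "en-gb": "en",
}

def normalize_language(detected):
    if not detected:
        return None
    tag = detected.strip().lower()
    tag = LANG_ALIASES.get(tag, tag)
    # Keep only the primary subtag ("pt-br" -> "pt")
    return tag.split("-")[0] or None

def should_import(status):
    # At minimum, skip slow-down responses at import time
    return status != 44

# e.g. normalize_language("EN-us") -> "en"; should_import(44) -> False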
Archive maintenance:
- Need a daily/weekly job that grabs each host's robots.txt and removes newly disallowed content from the archive (sketch below)
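Sketch of that job, assuming the robots.txt body is already fetched per host and we can list archived URLs; "archiver" is the virtual user-agent from the Gemini robots.txt companion spec, everything else is made up:

from urllib.robotparser import RobotFileParser

def purge_disallowed(host, robots_txt, archived_urls, delete_url):
    """Remove archived URLs that the host's current robots.txt disallows.

    robots_txt: the fetched robots.txt body for this host (str)
    archived_urls: iterable of full URLs we hold for this host
    delete_url: callback that removes one URL from the archive
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in archived_urls:
        if not parser.can_fetch("archiver", url):
            delete_url(url)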
--- Showing all interactive elements (Gemini status 10 = INPUT) and the links pointing to them:
SELECT DISTINCT srcdoc.Title, LinkText, tardoc.Meta, tardoc.Url, srcdoc.Url
FROM Documents AS tardoc
JOIN Links ON Links.TargetUrlID = tardoc.UrlID
JOIN Documents AS srcdoc ON Links.SourceUrlID = srcdoc.UrlID
WHERE IsExternal = false AND tardoc.StatusCode = 10
ORDER BY tardoc.Url DESC