feat: Enhanced Web Crawler with Markdown Output and Customization #169
base: main
- Updated .gitignore to include .history and crawled directories.
- Modified config.ts:
  - Changed URL and match pattern
  - Added exclusion patterns for various actions (resendpwd, register, login, logout, profile, edit, diff, revisions)
  - Increased maxPagesToCrawl to 75
  - Updated selector and output format
- Added @mozilla/readability, jsdom, and turndown dependencies to package-lock.json
- Updated versions for various @crawlee, @apify, and other dependencies in package-lock.json
- Removed old versions of chalk, cli-width, figures, and other dependencies from package-lock.json
type: "list",
name: "outputFileFormat",
message: messages.outputFileFormat,
choices: ["json", "markdown", "human_readable_markdown"],
Nitpick: It occurs to me that we could store this array as a constant somewhere, so we can use it on this line and on src/config.ts.
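A minimal sketch of what that shared constant could look like. The module location and the names `OUTPUT_FILE_FORMATS`, `OutputFileFormat`, and `isOutputFileFormat` are hypothetical and not part of this PR; only the three format strings come from the diff above.

```typescript
// Hypothetical shared module (for example src/constants.ts; the file name is an assumption).
// The three values are the ones listed in the prompt's `choices` array.
export const OUTPUT_FILE_FORMATS = [
  "json",
  "markdown",
  "human_readable_markdown",
] as const;

// Union type derived from the array:
// "json" | "markdown" | "human_readable_markdown".
export type OutputFileFormat = (typeof OUTPUT_FILE_FORMATS)[number];

// Runtime guard for values that arrive as plain strings (CLI flags, config files).
export function isOutputFileFormat(value: unknown): value is OutputFileFormat {
  return (
    typeof value === "string" &&
    OUTPUT_FILE_FORMATS.some((format) => format === value)
  );
}
```

With something like this in place, the prompt could use `choices: [...OUTPUT_FILE_FORMATS]`, and src/config.ts could build its schema or type from the same array, so the list of formats lives in one place.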
if (config.exclude && Array.isArray(config.exclude)) {
const url = new URL(req.url);
for (const pattern of config.exclude) {
if (typeof pattern === "string" && pattern.includes("&do=")) {
I suppose &do= is specific to some test case? Shouldn't the exclude config option take care of excluding declared patterns?
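To illustrate the point, here is a rough sketch of exclusion that treats every declared pattern the same way, with no special case for `&do=`. It is not the PR's implementation: `globToRegExp` and `isExcluded` are hypothetical helpers, and the glob handling is deliberately simplified (any run of `*` matches any characters, which is looser than real glob semantics).

```typescript
// Convert a simple glob into an anchored RegExp.
// Assumption: "*" and "**" both mean "any run of characters".
function globToRegExp(glob: string): RegExp {
  // Escape every regex metacharacter except "*".
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*+/g, ".*") + "$");
}

// True when the full URL (query string included) matches any exclude pattern.
function isExcluded(url: string, exclude: string[] = []): boolean {
  return exclude.some((pattern) => globToRegExp(pattern).test(url));
}

// Example patterns in the spirit of the config changes described in this PR.
const exclude = ["**&do=login**", "**&do=edit**"];

const skipLogin = isExcluded(
  "https://wiki.example.com/doku.php?id=start&do=login",
  exclude,
);
const keepStart = isExcluded(
  "https://wiki.example.com/doku.php?id=start",
  exclude,
);
```

Because the match runs against the whole URL string, a pattern such as `**&do=login**` covers the query-parameter case without the crawler having to know what `&do=` means.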
Hey @robin-collins 👋 ! Thanks for opening this PR. I've left a couple comments. Simple stuff! Out of pure curiosity: what is the use case for markdown output?
The markdown output has two uses for me: (a) in Anthropic Claude 'projects', and (b) with my self-hosted LibreChat, where the AI has it as part of the RAG system. The human-readable format gives me a nice, neat markdown document for my own reference. I prefer that to hitting an online documentation site regularly.
from upstream
I need to get the image src
Added @mozilla/readability, jsdom, and turndown dependencies to enable markdown conversion and improve content extraction.
Updated versions for various @crawlee, @apify, and other dependencies to leverage the latest features and improvements.
Removed old versions of chalk, cli-width, figures, and other dependencies to streamline the project.
Added command-line options (-f, --outputFileFormat) and interactive prompts for output file format (JSON, markdown, human-readable markdown) and name, providing flexibility and customization.
Implemented enhanced markdown conversion with better handling of HTML elements (lists, links) for cleaner and more accurate output.
Added a "human-readable markdown" option (-f human_readable_markdown) with table of contents and "Back to Top" links for easier navigation in larger documents.
Refined URL exclusion logic in core.ts to handle query parameters containing &do= more effectively, allowing for more precise control over which pages are crawled.
Updated server (server.ts) to set content type to text/markdown for markdown output, ensuring correct rendering in browsers and other applications.
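As a rough illustration of the "human-readable markdown" idea (table of contents plus "Back to Top" links), the sketch below assembles one document from a list of crawled pages. The `CrawledPage` shape, the helper names, and the anchor scheme are assumptions made for the example; the PR's actual output layout may differ.

```typescript
// Hypothetical shape of one crawled page; the field names are assumptions.
interface CrawledPage {
  title: string;
  url: string;
  markdown: string;
}

// Turn a title into a URL-fragment-friendly slug.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build a single markdown document with a table of contents and a
// "Back to Top" link after every page section.
function toHumanReadableMarkdown(pages: CrawledPage[]): string {
  // Prefix each anchor with the page index so duplicate titles stay unique.
  const anchors = pages.map((page, i) => `page-${i + 1}-${slugify(page.title)}`);
  const toc = pages
    .map((page, i) => `- [${page.title}](#${anchors[i]})`)
    .join("\n");
  const sections = pages.map((page, i) =>
    [
      `<a id="${anchors[i]}"></a>`,
      `## ${page.title}`,
      `Source: ${page.url}`,
      page.markdown.trim(),
      "[Back to Top](#top)",
    ].join("\n\n"),
  );
  return ['<a id="top"></a>', "# Table of Contents", toc]
    .concat(sections)
    .join("\n\n") + "\n";
}
```

Explicit `<a id="...">` anchors are used here so the links do not depend on how a particular renderer generates heading IDs; a renderer that strips inline HTML would need a different scheme.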
Below is an example of a ./config.ts file that takes advantage of these enhancements.