Jobot

A useless, quick-and-dirty web crawler that extracts hyperlinks from web pages. For each processed URL, Jobot also saves metadata to the file <USER_HOME>/.jobot/<URL>/links.txt. The saved metadata is the list of URLs extracted from the corresponding HTML page.

Build

./gradlew clean jar

Run

java -jar build/libs/jobot.jar https://en.wikipedia.org/wiki/\(486958\)_2014_MU69

Implementation Notes

Jobot modifies and filters URLs as it processes them:

A URL is skipped if it equals any of the last 1,000,000 extracted URLs
The anchor and query parts of a URL are dropped
A URL must begin with http; otherwise it is dropped entirely
The content type of the corresponding HTTP response must begin with text; otherwise the content is dropped
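
The rules above can be sketched roughly as follows. This is a minimal illustration, not Jobot's actual code: the class UrlFilterSketch and the methods recentSet, normalize, accept and isText are hypothetical names, and using a size-capped LinkedHashMap to remember recent URLs is an assumption about how such a filter could be built.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the filtering rules; all names here are illustrative.
class UrlFilterSketch {

    // A set that remembers only the most recently added entries.
    // The eldest entry is evicted once the capacity is exceeded.
    static <T> Set<T> recentSet(final int capacity) {
        return Collections.newSetFromMap(new LinkedHashMap<T, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> eldest) {
                return size() > capacity;
            }
        });
    }

    // Drops the query and anchor parts; returns empty if the URL does not begin with "http".
    static Optional<String> normalize(String url) {
        if (!url.startsWith("http")) {
            return Optional.empty();
        }
        int cut = url.length();
        int query = url.indexOf('?');
        int anchor = url.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (anchor >= 0) {
            cut = Math.min(cut, anchor);
        }
        return Optional.of(url.substring(0, cut));
    }

    // True if the normalized URL is valid and was not seen among the recent ones.
    static boolean accept(String rawUrl, Set<String> recent) {
        Optional<String> normalized = normalize(rawUrl);
        return normalized.isPresent() && recent.add(normalized.get());
    }

    // True if the response content type begins with "text" (e.g. "text/html").
    static boolean isText(String contentType) {
        return contentType != null && contentType.startsWith("text");
    }

    public static void main(String[] args) {
        Set<String> recent = recentSet(1_000_000);
        for (String url : new String[] {
                "https://example.com/page?x=1#top",
                "https://example.com/page",
                "mailto:someone@example.com"}) {
            System.out.println(url + " -> accepted: " + accept(url, recent));
        }
    }
}
```

Because the set is capped, a URL that was seen more than 1,000,000 extractions ago can be accepted again, which matches the "last 1,000,000" wording above.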

The URL processing queue size is 1,000,000.
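
A bounded queue of this kind might look like the sketch below. The class BoundedQueueSketch and its methods are hypothetical, and the behaviour when the queue is full (the new URL is simply not enqueued) is an assumption; this README does not say what Jobot does in that case.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a bounded URL processing queue; names are illustrative.
class BoundedQueueSketch {

    // Queue size stated in the notes above.
    static final int QUEUE_CAPACITY = 1_000_000;

    // Creates a queue that holds at most the given number of URLs.
    static BlockingQueue<String> newQueue(int capacity) {
        return new LinkedBlockingQueue<>(capacity);
    }

    // Tries to enqueue without blocking; returns false if the queue is full.
    // Assumption: a URL that does not fit is dropped rather than waited on.
    static boolean enqueue(BlockingQueue<String> queue, String url) {
        return queue.offer(url);
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = newQueue(QUEUE_CAPACITY);
        boolean enqueued = enqueue(queue, "https://example.com/");
        System.out.println("enqueued: " + enqueued + ", queue size: " + queue.size());
    }
}
```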

