Crawler-Commons Change Log
Current Development 1.5-SNAPSHOT (yyyy-mm-dd)
- [Sitemaps] Added https variants of namespaces (jnioche) #487
- [Domains] Add version of public suffix list shipped with release packages (sebastian-nagel, Richard Zowalla) #433, #484
- [Domains] Improve representation of public suffix match results by class EffectiveTLD (sebastian-nagel, Richard Zowalla) #478
- Javadoc: fix links to Java core classes (sebastian-nagel, Richard Zowalla) #417, #483
- [Sitemaps] Improve logging done by SiteMapParser (Valery Yatsynovich, sebastian-nagel) #457
- [Sitemaps] Google Sitemap PageMap extensions (josepowera, sebastian-nagel, Richard Zowalla, jnioche) #388, #442
- [Domains] Installation of a gzip-compressed public suffix list from Maven cache breaks EffectiveTldFinder (sebastian-nagel, Richard Zowalla) #441, #443
- Upgrade dependencies (dependabot) #437, #444, #448, #451, #473, #465, #466, #468
- Upgrade Maven plugins (dependabot) #434, #438, #439, #449, #445, #452, #455, #459, #460, #464, #469, #467, #470, #471, #472, #474, #475, #476, #477, #480, #481, #482
Release 1.4 (2023-07-13)
- [Robots.txt] Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (sebastian-nagel, Richard Zowalla) #245, #360
- [Robots.txt] Close groups of rules as defined in RFC 9309 (kkrugler, garyillyes, jnioche, sebastian-nagel) #114, #390, #430
- [Robots.txt] Empty disallow statement not to clear other rules (sebastian-nagel, jnioche) #422, #424
- [Robots.txt] SimpleRobotRulesParser main() to follow five redirects (sebastian-nagel, jnioche) #428
- [Robots.txt] Add more spelling variants and typos of robots.txt directives (sebastian-nagel, jnioche) #425
- [Robots.txt] Document effect of rules merging in combination with multiple agent names (sebastian-nagel, Richard Zowalla) #423, #426
- [Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) (sebastian-nagel, Richard Zowalla) #427
- [Robots.txt] Rename default user-agent / robot name in unit tests (sebastian-nagel, Richard Zowalla) #429
- [Robots.txt] Add unit tests based on examples in RFC 9309 (sebastian-nagel, Richard Zowalla) #420
- [BasicNormalizer] Query parameters normalization in BasicURLNormalizer (aecio, sebastian-nagel, Richard Zowalla) #308, #421
- [Robots.txt] Deduplicate robots rules before matching (sebastian-nagel, jnioche) #416
- [Robots.txt] SimpleRobotRulesParser main to use the new API method (sebastian-nagel, jnioche) #413
- Generate JaCoCo reports when testing (jnioche) #409, #412
- Push Code Coverage to Coveralls (Richard Zowalla, jnioche) #414
- [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters (tkalistratov, sebastian-nagel, Richard Zowalla) #195, #408
- [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters (sebastian-nagel, Richard Zowalla, aecio) #389, #401
- [Robots.txt] Improve readability of robots.txt unit tests (sebastian-nagel, Richard Zowalla) #383
- Upgrade project to use Java 11 (Avi Hayun, Richard Zowalla, aecio, sebastian-nagel) #320, #376
- [Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (sebastian-nagel, Richard Zowalla) #362
- [Robots.txt] Matching user-agent names does not conform to robots.txt RFC (YossiTamari, sebastian-nagel) #192
- [Robots.txt] Improve robots check draft rfc compliance (Eduardo Jimenez) #351
- Upgrade dependencies (dependabot) #379, #384, #394, #399, #404, #419
- Upgrade Maven plugins (dependabot) #377, #381, #386, #396, #397, #398, #400, #402, #403, #405, #406, #407, #415, #418
- Javadoc: ensure Javascript search is working (sebastian-nagel, Richard Zowalla, aecio) #378, #380
Release 1.3 (2022-07-19)
- [Sitemaps] Disable support for DTDs in XML sitemaps and feeds by default (Kenneth Wong) #371
- Migrate Continuous Integration from Travis to GitHub Actions (Valery Yatsynovich) #333
- Upgrade dependencies (dependabot, Richard Zowalla) #334, #339, #345, #346, #347, #350, #354, #361, #369
- Upgrade Maven plugins (dependabot, Richard Zowalla, sebastian-nagel) #328, #329, #330, #331, #335, #336, #337, #338, #340, #341, #343, #356, #363, #364, #366, #373, #374
- Update pom.xml to address Maven warnings and deprecations (sebastian-nagel, Richard Zowalla, Avi Hayun) #342
- Enable Dependabot (Valery Yatsynovich) #327
- Remove test dependency on mockito-core (Richard Zowalla) #367
- Drop provided dependency on servlet-api (Richard Zowalla) #368
Release 1.2 (2021-10-06)
- [Sitemaps] Avoid calling java.net.URL::equals in equals method of sitemaps and extensions (sebastian-nagel) #322
- [URLs] Provide a builder class to configure the URL normalizer (aecio) #321, #324
- [URLs] Make normalization of IDNs configurable (to ASCII or Unicode) via builder (aecio, sebastian-nagel) #324
- [Sitemaps] Fix XXE vulnerability in Sitemap parser (kovyrin) #323
- [URLs] Sorting the Query Parameters (aecio) #246, #309
- [URLs] Optionally remove common irrelevant query parameters (aecio) #309
- [Sitemaps] Allow normalizing URLs in sitemaps (murderinc, sebastian-nagel) #305
- Normalize CHANGES.txt (Avi Hayun) #270
- Readme.MD Overhaul of TOC, Installation, License (Avi Hayun) #311
- [URLs] Normalize URL without a scheme (Avi Hayun, sebastian-nagel) #271
- [Domains] EffectiveTldFinder: upgrade public suffix list / Download latest effective_tld_names.dat during Maven build (Richard Zowalla) #295, #302
- [URLs] Decode percent-encoded host names (sebastian-nagel) #303
- [Sitemaps] Document options *strict* and *allowPartial* in SiteMapParser constructors (sebastian-nagel) #267
- [Robots.txt] Maximum values (crawl-delay and warnings): document and make visible (sebastian-nagel, Avi Hayun) #276
- [Sitemaps] Replace priority "NaN" by default value (sebastian-nagel) #296
- [Sitemaps] Adding duration to the map generated by VideoAttributes.asMap (evanhalley) #300
Release 1.1 (2020-06-29)
- [Sitemaps] Sitemaps to implement Serializable (cdalexndr, sebastian-nagel) #244
- [Sitemaps] Allow to deduplicate sitemap links in sitemap indexes (sebastian-nagel) #262
- [Robots.txt] Upgrade the toString() method of the Base/Simple RobotRules (Avi Hayun) #264
- Upgrade GitIgnore (Avi Hayun) #260
- [Robots.txt] Deduplicate sitemap links (sebastian-nagel) #261
- [Domains] EffectiveTldFinder to log loading of public suffix list (sebastian-nagel) #284
- [Sitemaps] SiteMapParser getPublicationDate in VideoAttributes may throw NPE (panthony, sebastian-nagel) #283
- [Robots.txt] SimpleRobotRulesParser: Trim log messages (jnioche, sebastian-nagel) #281
- [Robots.txt] SimpleRobotRulesParser: counter _numWarnings not thread-safe (sebastian-nagel, kkrugler) #278
- ParameterizedTest not executed by mvn builds (sebastian-nagel) #273
- [URLs] Empty path before query to be normalized to `/` (Avi Hayun, sebastian-nagel) #247
- [Domains] EffectiveTldFinder to validate returned domain names for length restrictions (sebastian-nagel, Avi Hayun) #251
- Upgrade unit tests to use JUnit v5.x and parameterized tests (Avi Hayun) #249, #253, #255
- [Robots.txt] Robots parser to always handle absolute sitemap URL even without valid base URL (pr3mar, kkrugler, sebastian-nagel) #240
- [Sitemaps] Adding asMap to ExtensionMetadata Interface (evanhalley) #288
- [Sitemaps] NewsAttribute.equals() compares the instance variable PublicationDate with itself (evanhalley) #289
Release 1.0 (2019-03-19)
- [Sitemaps] Unit tests depend on system timezone (kkrugler, sebastian-nagel) #238
- [Domains] EffectiveTldFinder: upgrade public suffix list (sebastian-nagel) #219
- [Sitemaps] Detection and parsing of XML sitemaps fails with whitespace before XML declaration (sebastian-nagel, jnioche) #144
- [Sitemaps] XMLHandler needs to append text in characters() vs. immediately processing (kkrugler, sebastian-nagel) #226
- [Sitemaps] XMLIndexHandler needs to accumulate the lastmod date string before parsing (kkrugler, sebastian-nagel) #225
- [Domains] EffectiveTldFinder throws IllegalArgumentException on IDN domain names containing prohibited characters (sebastian-nagel) #231
- [Sitemaps] Trim Unicode whitespace around URLs (sebastian-nagel, kkrugler) #224
- [Sitemaps] Sitemap index: stop URL at closing </loc> (sebastian-nagel, kkrugler) #213
- [Sitemaps] Allow empty price in video sitemaps (sebastian-nagel) #221
- [Sitemaps] In case of the use of a different locale, price tag can be formatted with ',' instead of '.' leading to an NPE (goldenlink) #220
- [Sitemaps] Add support for sitemap extensions (tuxnco, sebastian-nagel) #35, #36, #149, #162
- [Sitemaps] Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (sebastian-nagel) #217
- [Robots.txt] Fix for handling URLs with query parameters but no path (kkrugler) #215
Release 0.10 (2018-06-05)
- Add JAX-B dependencies to POM (jnioche) #207
- [Sitemaps] Add method to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer) (Luc Boruta) #190
- [Sitemaps] Sitemap file location to ignore query part of URL (sebastian-nagel) #202
- [Sitemaps] Link extraction from RSS feeds fails on XML entities (sebastian-nagel) #204
- [Sitemaps] Resolve relative links in RSS feeds (sebastian-nagel) #203
- [Sitemaps] Extract links from <guid> elements (sebastian-nagel) #201
- [Sitemaps] Limit on "bad url" log messages (sebastian-nagel) #145
- [Domains] EffectiveTldFinder to parse Internationalized Domain Names (sebastian-nagel) #179
- [Domains] Add main() to EffectiveTldFinder (sebastian-nagel) #187
- [Domains] Handle new suffixes in PaidLevelDomain (kkrugler) #183
- Remove Tika dependency (kkrugler) #199
- [Sitemaps] Improve MIME detection for sitemaps (sebastian-nagel) #200
- [Robots.txt] Make RobotRules accessible (aecio via kkrugler) #134
- [Robots.txt] SimpleRobotRulesParser: Expose MAX_WARNINGS and MAX_CRAWL_DELAY (aecio via kkrugler) #194
- [Robots.txt] Added main to SimpleRobotRulesParser for testing (sebastian-nagel) #193
- [Sitemaps] Allow for legacy URIs when checking sitemap namespaces (sebastian-nagel) #211
Release 0.9 (2017-10-27)
- [Sitemaps] Removed DOM-based sitemap parser (jnioche) #177
- [Domains] Incorrect domains returned by EffectiveTldFinder (sebastian-nagel) #172
- [Sitemaps] Add namespace aware DOM/SAX parsing for XML Sitemaps (Marko Milicevic, jnioche, sebastian-nagel) #176
- Upgraded Tika 1.16 (jnioche) #175
- [Sitemaps] Sitemap SAX parsing mangles target URLs (jnioche, sebastian-nagel) #169
- [Sitemaps] RSS parser ignores pubDate of link (MichealKum via kkrugler) #166
Release 0.8 (2017-06-09)
- Upgraded Tika 1.15 (jnioche) #163
- [Sitemaps] Disable XML resolvers (sebastian-nagel) #151
- Update forbiddenapis to v2.3 (jnioche) #99
- [Sitemaps] gzipped text files fail to parse (sebastian-nagel) #143
- [Sitemaps] Optionally use SAX parser (matt-deboer, jnioche, sebastian-nagel) #116
- [Sitemaps] Properly log XML parsing errors (sebastian-nagel) #146
- Use StandardCharsets where applicable (sebastian-nagel) #141
- [Sitemaps] Increase sitemap size limit to 50MB (Avi Hayun) #132
- Remove dependencies to system-specific locale (sebastian-nagel) #137
- [URLs] BasicURLNormalizer: NPE for URLs without authority (sebastian-nagel) #136
- [URLs] BasicURLNormalizer to strip empty port (sebastian-nagel) #133
- Remove deprecated HTTP fetcher (kkrugler) #96
Release 0.7 (2016-11-24)
- Upgrade to JDK 1.8 (lewismc) #126
- [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
- [Sitemaps] Faster parsing of dates (jnioche) #117
- Upgraded Tika 1.13 (jnioche) #113
- Fix license headers (jnioche) #108
- Rename package crawlercommons.url (jnioche) #107
- [Sitemaps] Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
- Deprecate HTTP fetcher support (kkrugler) #92
- [URLs] Added URLFilter interface + BasicURLNormalizer (jnioche) #106
- [Domains] Updated tld names from publicsuffix.org (jnioche) #100
- Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
- Upgraded Tika 1.10 (jnioche) #89
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
- Simplify pom file (jnioche, lewismc) #77
- Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
- [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
- [Robots.txt] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95
Release 0.6 (2015-05-27)
- Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)
- Issue 76: maven-java-formatter-plugin (jnioche)
- Issue 73: Switch groupID in pom from com.google.code.crawler-commons to crawler-commons (jnioche)
- Issue 71: Upgrade to Tika 1.8 (jnioche)
- Issue 68: [Robots.txt] Path matching should be case-sensitive (kkrugler)
- Issue 67: [Sitemaps] Parsing of lastMod date should use time portion (kkrugler)
- Issue 59: [Robots.txt] Let SimpleRobotRules and its members implement the Serializable interface (kkrugler)
- Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag (Avi Hayun)
- Issue 64: Upgraded to Tika 1.7 (jnioche)
- Issue 32: [Robots.txt] Resolve relative URL for sitemaps (jnioche)
- Issue 62: [Sitemaps] Add new parseSiteMap method (jnioche)
- Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them (Avi Hayun)
- Issue 51: Upgrade httpclient to the latest version (Avi Hayun)
- Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily (Avi Hayun)
- Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen (Avi Hayun)
- Issue 50: Add Fetch Report to FetchedResult (lewismc, avraham2)
- Issue 55: [Sitemaps] SitemapUrl "setPriority(String str)" should check for proper value (Avi Hayun)
Release 0.5 (2014-10-15)
- Issue 53: Spaces in a comma separated list of names in a User-agent: line cause rules to be applicable to all agents (kkrugler)
- Issue 45: [Sitemaps] Upgrade code after release of Tika v1.6 (Avi Hayun)
- Issue 48: Upgraded to Tika 1.6 (jnioche)
- Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well in some cases (Avi Hayun)
- Issue 40: [Sitemaps] Add Tika MediaType Support (Avi Hayun)
- Issue 39: [Sitemaps] Add a convenience method to the parser that takes only a URL argument (Avi Hayun via lewismc)
- Issue 42: [Sitemaps] Add more JUnit tests (Avi Hayun via lewismc)
- Issue 37: Upgrade the Slf4j logging Library to v1.7.7 (Avi Hayun via kkrugler)
- Issue 41: Upgrade to JUnit v4 conventions in SiteMapParser (Avi Hayun via lewismc)
- Issue 34: Upgrade the Slf4j logging in SiteMaps (Avi Hayun via lewismc)
Release 0.4 (2014-04-11)
- Issue 13: Fix deprecation in Crawler Commons Code (lewismc via kkrugler)
- Issue 8: Upgrade of httpclient to v4.2.6 (Fuad Efendi, lewismc via kkrugler)
- Issue 18: Support matching against query parameters in robots.txt rules (alparslanavci, kkrugler)
- Issue 21: Follow Google example of giving Allow directives higher match weight than Disallow directives (y.vladimirov via kkrugler)
- Issue 22: Use longest-match-wins approach to matching URLs in robots.txt (kkrugler)
- Issue 17: Support Googlebot-compatible regular expressions in URL specifications (alparslanavci, kkrugler)
- Issue 31: Missing top level domains (jnioche, kkrugler)
- Issue 23: Trivial improvements to UserAgent (lewismc)
- Issue 30: SitemapIndex should allow to skip sitemaps (Sebastian Nagel, kkrugler)
- cleanup of ANT build remnants [lib and lib-ext] (jnioche)
Release 0.3 (2013-10-11)
- Upgraded to Tika 1.4 (jnioche)
- [SiteMap] added utility class for testing sitemaps (jnioche)
- Issue 16: remove ant scripts and configuration (lewismc)
- Issue 27: [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString() (jnioche)
- Issue 26: [SiteMap] Set correct default priority for URL in a sitemap file (jnioche)
- Issue 25: [Robots.txt] Robots parser should not lowercase sitemap URLs (jnioche)
- Issue 29: [SiteMap] try urls when <loc> element is missing (jnioche)
Release 0.2 (2013-02-02)
- Move to pure Maven for CC build lifecycle (lewismc)
- Move Javadoc out of core code (lewismc)
- Substantiate Javadoc (lewismc)
- Review default.properties (lewismc)
- add HTTP status code & reason to FetchedResult (Fuad Efendi via kkrugler)
- support for multiple user agent names (Tejas Patil via kkrugler)
- added javadoc generation, publish in /doc/javadoc (kkrugler)
- switch to using eclipse-formatter.properties (kkrugler)
- support robots.txt files that have UTF-16LE and UTF-16BE BOMs (kkrugler)
- support for user agent names that contain spaces (kkrugler)
- fixed handling of BOM in sitemaps (Vivek Magotra via kkrugler)
- refactoring of SiteMap objects (Hannes Schwarz via jnioche)
- added simple support for the file: protocol (kkrugler)
- cleaned up packaging and added "install" target (kkrugler)
Release 0.1
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher