Fix for properly parsing scheme urls after non-scheme-chars precede them
gregjacobs committed Dec 16, 2024
1 parent 0bf34a3 commit e239b9a
Showing 9 changed files with 87 additions and 16 deletions.
25 changes: 23 additions & 2 deletions dist/Autolinker.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion dist/Autolinker.js.map

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion dist/Autolinker.min.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion dist/Autolinker.min.js.map

Large diffs are not rendered by default.

24 changes: 23 additions & 1 deletion src/parser/parse-matches.ts
@@ -59,7 +59,7 @@ export function parseMatches(text: string, args: ParseMatchesArgs): Match[] {

// For debugging: search for and uncomment other "For debugging" lines
// const table = new CliTable({
- // head: ['charIdx', 'char', 'states', 'charIdx', 'startIdx', 'reached accept state'],
+ // head: ['charIdx', 'char', 'code', 'type', 'states', 'charIdx', 'startIdx', 'reached accept state'],
// });

let charIdx = 0;
@@ -219,12 +219,29 @@ export function parseMatches(text: string, args: ParseMatchesArgs): Match[] {
assertNever(stateMachine.state);
}
}
+
+ // Special case for handling a colon (or other non-alphanumeric)
+ // when preceded by another character, such as in the text:
+ // Link 1:http://google.com
+ // In this case, the 'h' character after the colon wouldn't start a
+ // new scheme url because we'd be in a ipv4 or tld url and the colon
+ // would be interpreted as a port ':' char. Also, only start a new
+ // scheme url machine if there isn't currently one so we don't start
+ // new ones for colons inside a url
+ if (charIdx > 0 && isSchemeStartChar(char)) {
+     const prevChar = text.charAt(charIdx - 1);
+     if (!isSchemeStartChar(prevChar) && !stateMachines.some(isSchemeUrlStateMachine)) {
+         stateMachines.push(createSchemeUrlStateMachine(charIdx, State.SchemeChar));
+     }
+ }
}

// For debugging: search for and uncomment other "For debugging" lines
// table.push([
// charIdx,
// char,
+ // `10: ${char.charCodeAt(0)}\n0x: ${char.charCodeAt(0).toString(16)}\nU+${char.codePointAt(0)}`,
+ // stateMachines.map(machine => `${machine.type}${'matchType' in machine ? ` (${machine.matchType})` : ''}`).join('\n') || '(none)',
// stateMachines.map(machine => State[machine.state]).join('\n') || '(none)',
// charIdx,
// stateMachines.map(m => m.startIdx).join('\n'),
@@ -1071,6 +1088,7 @@ export function excludeUnbalancedTrailingBracesAndPunctuation(matchedText: strin
}

// States for the parser
+ // For debugging: temporarily remove 'const'
const enum State {
// Scheme states
SchemeChar = 0, // First char must be an ASCII letter. Subsequent characters can be: ALPHA / DIGIT / "+" / "-" / "."
@@ -1270,3 +1288,7 @@ function createPhoneNumberStateMachine(startIdx: number, state: State): PhoneNum
acceptStateReached: false,
};
}
+
+ function isSchemeUrlStateMachine(machine: StateMachine): machine is SchemeUrlStateMachine {
+     return machine.type === 'url' && machine.matchType === 'scheme';
+ }
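
The special case added to `parseMatches` above can be read as a small predicate: start a new scheme-URL machine at the current character only if that character can begin a scheme, the previous character cannot, and no scheme-URL machine is already running. The sketch below is a simplified illustration of that rule, not the library's code. The `Machine` interface, the `shouldStartSchemeMachine` helper, and the ASCII-letter definition of `isSchemeStartChar` are assumptions made for the example.

```typescript
// Hypothetical, simplified stand-ins for the parser's state machine types.
type MatchType = 'scheme' | 'tld' | 'ipV4';
interface Machine {
    type: 'url';
    matchType: MatchType;
    startIdx: number;
}

// Assumption: a scheme starts with an ASCII letter (RFC 3986, section 3.1).
function isSchemeStartChar(char: string): boolean {
    return /^[A-Za-z]$/.test(char);
}

function isSchemeUrlStateMachine(machine: Machine): boolean {
    return machine.type === 'url' && machine.matchType === 'scheme';
}

// Mirrors the condition in the diff: true when a new scheme-URL machine
// would be pushed for the character at `charIdx`.
function shouldStartSchemeMachine(text: string, charIdx: number, machines: Machine[]): boolean {
    if (charIdx === 0 || !isSchemeStartChar(text.charAt(charIdx))) {
        return false;
    }
    const prevChar = text.charAt(charIdx - 1);
    return !isSchemeStartChar(prevChar) && !machines.some(isSchemeUrlStateMachine);
}

const sample = 'Link 1:http://google.com';
const hIdx = sample.indexOf('http'); // index of the 'h' right after the colon
const ipMachine: Machine = { type: 'url', matchType: 'ipV4', startIdx: 5 };
const schemeMachine: Machine = { type: 'url', matchType: 'scheme', startIdx: 0 };

console.log(shouldStartSchemeMachine(sample, hIdx, [ipMachine])); // true
console.log(shouldStartSchemeMachine(sample, hIdx + 1, [ipMachine])); // false: 't' follows 'h'
console.log(shouldStartSchemeMachine(sample, hIdx, [schemeMachine])); // false: a scheme machine is active
```

Checking the previous character keeps the parser from starting a new machine on every letter of a word, and the `some(isSchemeUrlStateMachine)` guard avoids restarting inside a URL that already contains a colon.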
2 changes: 1 addition & 1 deletion src/parser/tld-regex.ts

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions src/regex-lib.ts
@@ -72,6 +72,8 @@ export const alphaCharsStr = /A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u0
export const emojiStr =
/\u2700-\u27bf\udde6-\uddff\ud800-\udbff\udc00-\udfff\ufe0e\ufe0f\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0\ud83c\udffb-\udfff\u200d\u3299\u3297\u303d\u3030\u24c2\ud83c\udd70-\udd71\udd7e-\udd7f\udd8e\udd91-\udd9a\udde6-\uddff\ude01-\ude02\ude1a\ude2f\ude32-\ude3a\ude50-\ude51\u203c\u2049\u25aa-\u25ab\u25b6\u25c0\u25fb-\u25fe\u00a9\u00ae\u2122\u2139\udc04\u2600-\u26FF\u2b05\u2b06\u2b07\u2b1b\u2b1c\u2b50\u2b55\u231a\u231b\u2328\u23cf\u23e9-\u23f3\u23f8-\u23fa\udccf\u2935\u2934\u2190-\u21ff/
.source;
+ // ^ high surrogate
+ // ^ low surrogate

/**
* The string form of a regular expression that would match all of the
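
The two comments added under `emojiStr` appear to label the `\ud800-\udbff` and `\udc00-\udfff` ranges in that character class, although the column alignment of the `^` markers is lost in this view. As a minimal sketch of why those ranges matter, the snippet below splits the pointing-hand emoji from the new test case into its UTF-16 code units. The helper names `toCodeUnits`, `isHighSurrogate`, and `isLowSurrogate` are hypothetical and not part of Autolinker.

```typescript
// Returns the UTF-16 code units that make up a string.
function toCodeUnits(s: string): number[] {
    const units: number[] = [];
    for (let i = 0; i < s.length; i++) {
        units.push(s.charCodeAt(i));
    }
    return units;
}

// Surrogate ranges as defined by UTF-16.
const isHighSurrogate = (unit: number): boolean => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit: number): boolean => unit >= 0xdc00 && unit <= 0xdfff;

const pointingHand = String.fromCodePoint(0x1f449); // U+1F449, the emoji used in the new test case
const units = toCodeUnits(pointingHand);

console.log(units.map(u => u.toString(16))); // [ 'd83d', 'dc49' ]
console.log(isHighSurrogate(units[0]), isLowSurrogate(units[1])); // true true
```

Because the parser reads one code unit at a time with `charAt`, an emoji outside the Basic Multilingual Plane arrives as two separate "characters", one from each range.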
6 changes: 4 additions & 2 deletions tests-integration/test-live-example/test-live-example.spec.ts
@@ -25,7 +25,9 @@ describe('Live example page -', function () {
it('should correctly load Autolinker and display the output with the default settings', async () => {
// Path to the index.html file of the live example *from the output
// directory of this .spec file* (i.e. './.tmp/tests-integration/live-example')
- const pathToHtmlFile = path.normalize(`${__dirname}/../../../docs/examples/index.html`);
+ const pathToHtmlFile = path.normalize(
+     `${__dirname}/../../../docs/examples/live-example/index.html`
+ );
if (!fs.existsSync(pathToHtmlFile)) {
throw new Error(
`The live example index.html file was not found at path: '${pathToHtmlFile}'\nDid the location of the file (or the output location of this .spec file) change? The file should be referenced from the root-level './docs/examples' folder in the repo`
@@ -39,8 +41,8 @@ describe('Live example page -', function () {
expect(autolinkerOutputHtml).toBe(
[
`<a href="http://google.com" target="_blank" rel="noopener noreferrer">google.com</a><br>`,
`<a href="http://www.google.com" target="_blank" rel="noopener noreferrer">google.com</a><br>`,
`<a href="http://google.com" target="_blank" rel="noopener noreferrer">google.com</a><br>`,
`<a href="http://192.168.0.1" target="_blank" rel="noopener noreferrer">192.168.0.1</a><br>`,
`<a href="mailto:[email protected]" target="_blank" rel="noopener noreferrer">[email protected]</a><br>`,
`<a href="tel:1234567890" target="_blank" rel="noopener noreferrer">123-456-7890</a><br>`,
`@MentionUser<br>`,
38 changes: 31 additions & 7 deletions tests/autolinker-url.spec.ts
@@ -205,9 +205,33 @@ describe('Autolinker Url Matching >', () => {
});

it('should match urls containing emoji', function () {
- const result = autolinker.link('emoji url http://📙.la/🧛🏻‍♂️ mid-sentance');
+ const result = autolinker.link('emoji url http://📙.la/🧛🏻‍♂️ mid-sentence');

- expect(result).toBe(`emoji url <a href="http://📙.la/🧛🏻‍♂️">📙.la/🧛🏻‍♂️</a> mid-sentance`);
+ expect(result).toBe(`emoji url <a href="http://📙.la/🧛🏻‍♂️">📙.la/🧛🏻‍♂️</a> mid-sentence`);
});
+
+ it('should match urls if a URL begins after a colon', function () {
+     const result = autolinker.link('stuff :https://nia.nexon.com testing');
+
+     expect(result).toBe(`stuff :<a href="https://nia.nexon.com">nia.nexon.com</a> testing`);
+ });
+
+ it(`should match urls if a URL begins after a semicolon (i.e. char that isn't part of a url)`, function () {
+     const result = autolinker.link('Link 1;https://nia.nexon.com testing');
+
+     expect(result).toBe(`Link 1;<a href="https://nia.nexon.com">nia.nexon.com</a> testing`);
+ });
+
+ it('should match urls if a URL begins after a numeric character+colon', function () {
+     const result = autolinker.link('Link 1:https://nia.nexon.com testing');
+
+     expect(result).toBe(`Link 1:<a href="https://nia.nexon.com">nia.nexon.com</a> testing`);
+ });
+
+ it('should match urls with scheme starting with an emoji', function () {
+     const result = autolinker.link('emoji url 👉http://📙.la/🧛🏻‍♂️ mid-sentence');
+
+     expect(result).toBe(`emoji url 👉<a href="http://📙.la/🧛🏻‍♂️">📙.la/🧛🏻‍♂️</a> mid-sentence`);
+ });

it("should NOT autolink possible URLs with the 'javascript:' URI scheme", () => {
@@ -758,7 +782,7 @@ describe('Autolinker Url Matching >', () => {
);
});

- it(`should correctly accept square brackets such as PHP array
+ it(`should correctly accept square brackets such as PHP array
representation in query strings, when the entire URL is surrounded
by square brackets
`, () => {
@@ -984,11 +1008,11 @@ describe('Autolinker Url Matching >', () => {
Sometimes you need to go to a path like yahoo.com/my-page
And hit query strings like yahoo.com?page=index
Port numbers on known TLDs are important too like yahoo.com:8000.
- Hashes too yahoo.com:8000/#some-link.
+ Hashes too yahoo.com:8000/#some-link.
Sometimes you need a lot of things in the URL like https://abc123def.org/path1/2path?param1=value1#hash123z
Do you see the need for dashes in these things too https://abc-def.org/his-path/?the-param=the-value#the-hash?
There's a time for lots and lots of special characters like in https://abc123def.org/-+&@#/%=~_()|\'$*[]?!:,.;/?param1=value-+&@#/%=~_()|\'$*[]?!:,.;#hash-+&@#/%=~_()|\'$*[]?!:,.;z
- Don't forget about good times with unicode https://ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица
+ Don't forget about good times with unicode https://ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица
and this unicode http://россия.рф
along with punycode http://xn--d1acufc.xn--p1ai
Oh good old www links like www.yahoo.com
@@ -1008,11 +1032,11 @@ describe('Autolinker Url Matching >', () => {
Sometimes you need to go to a path like <a href="http://yahoo.com/my-page">yahoo.com/my-page</a>
And hit query strings like <a href="http://yahoo.com?page=index">yahoo.com?page=index</a>
Port numbers on known TLDs are important too like <a href="http://yahoo.com:8000">yahoo.com:8000</a>.
- Hashes too <a href="http://yahoo.com:8000/#some-link">yahoo.com:8000/#some-link</a>.
+ Hashes too <a href="http://yahoo.com:8000/#some-link">yahoo.com:8000/#some-link</a>.
Sometimes you need a lot of things in the URL like <a href="https://abc123def.org/path1/2path?param1=value1#hash123z">abc123def.org/path1/2path?param1=value1#hash123z</a>
Do you see the need for dashes in these things too <a href="https://abc-def.org/his-path/?the-param=the-value#the-hash">abc-def.org/his-path/?the-param=the-value#the-hash</a>?
There's a time for lots and lots of special characters like in <a href="https://abc123def.org/-+&@#/%=~_()|'$*[]?!:,.;/?param1=value-+&@#/%=~_()|'$*[]?!:,.;#hash-+&@#/%=~_()|'$*[]?!:,.;z">abc123def.org/-+&@#/%=~_()|'$*[]?!:,.;/?param1=value-+&@#/%=~_()|'$*[]?!:,.;#hash-+&@#/%=~_()|'$*[]?!:,.;z</a>
- Don't forget about good times with unicode <a href="https://ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица">ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица</a>
+ Don't forget about good times with unicode <a href="https://ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица">ru.wikipedia.org/wiki/Кириллица?Кириллица=1#Кириллица</a>
and this unicode <a href="http://россия.рф">россия.рф</a>
along with punycode <a href="http://xn--d1acufc.xn--p1ai">xn--d1acufc.xn--p1ai</a>
Oh good old www links like <a href="http://www.yahoo.com">www.yahoo.com</a>