Make built-in adapters' identifiers configurable #247

Open. lafrenierejm (Contributor) wants to merge 4 commits into base: master

@lafrenierejm lafrenierejm commented Sep 4, 2024

This will allow end users to provide their own lists of extensions and/or mimetypes for each of the built-in adapters.

This feature would obsolete the need for feature requests such as:

The functionality proposed here is a superset of that in #244. That PR makes only the Zip adapter's extensions configurable, whereas this exposes the extensions and mimetypes of all built-in adapters for end-user configurability.

Output of cargo run --bin=rga -- --rga-print-config-schema from this branch.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "rga configuration",
  "description": "this is kind of a \"polyglot\" struct, since it serves three functions\n\n1. describing the command line arguments using structopt+clap and for man page / readme generation 2. describing the config file format (output as JSON schema via schemars)",
  "type": "object",
  "properties": {
    "accurate": {
      "description": "Use more accurate but slower matching by mime type\n\nBy default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).",
      "type": "boolean"
    },
    "adapters": {
      "description": "Change which adapters to use and in which priority order (descending)\n\n\"foo,bar\" means use only adapters foo and bar. \"-bar,baz\" means use all default adapters except for bar and baz. \"+bar,baz\" means use all default adapters and also bar and baz.",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "cache": {
      "$ref": "#/definitions/CacheConfig"
    },
    "max_archive_recursion": {
      "description": "Maximum nestedness of archives to recurse into\n\nWhen searching in archives, rga will recurse into archives inside archives. This option limits the depth.",
      "allOf": [
        {
          "$ref": "#/definitions/MaxArchiveRecursion"
        }
      ]
    },
    "no_prefix_filenames": {
      "description": "Don't prefix lines of files within archive with the path inside the archive.\n\nInside archives, by default rga prefixes the content of each file with the file path within the archive. This is usually useful, but can cause problems because then the inner path is also searched for the pattern.",
      "type": "boolean"
    },
    "custom_adapters": {
      "type": [
        "array",
        "null"
      ],
      "items": {
        "$ref": "#/definitions/CustomAdapterConfig"
      }
    },
    "custom_identifiers": {
      "anyOf": [
        {
          "$ref": "#/definitions/CustomIdentifiers"
        },
        {
          "type": "null"
        }
      ]
    }
  },
  "definitions": {
    "CacheConfig": {
      "type": "object",
      "properties": {
        "disabled": {
          "description": "Disable caching of results\n\nBy default, rga caches the extracted text, if it is small enough, to a database in ${XDG_CACHE_DIR-~/.cache}/ripgrep-all on Linux, ~/Library/Caches/ripgrep-all on macOS, or C:\\Users\\username\\AppData\\Local\\ripgrep-all on Windows. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.",
          "type": "boolean"
        },
        "max_blob_len": {
          "description": "Max compressed size to cache\n\nLongest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and recomputed every time.\n\nAllowed suffixes on command line: k M G",
          "allOf": [
            {
              "$ref": "#/definitions/CacheMaxBlobLen"
            }
          ]
        },
        "compression_level": {
          "description": "ZSTD compression level to apply to adapter outputs before storing in cache db\n\nRanges from 1 - 22",
          "allOf": [
            {
              "$ref": "#/definitions/CacheCompressionLevel"
            }
          ]
        },
        "path": {
          "description": "Path to store cache db",
          "allOf": [
            {
              "$ref": "#/definitions/CachePath"
            }
          ]
        }
      }
    },
    "CacheMaxBlobLen": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    },
    "CacheCompressionLevel": {
      "type": "integer",
      "format": "int32"
    },
    "CachePath": {
      "type": "string"
    },
    "MaxArchiveRecursion": {
      "type": "integer",
      "format": "int32"
    },
    "CustomAdapterConfig": {
      "type": "object",
      "required": [
        "args",
        "binary",
        "description",
        "extensions",
        "mimetypes",
        "name",
        "version"
      ],
      "properties": {
        "name": {
          "description": "the unique identifier and name of this adapter. Must only include a-z, 0-9, _",
          "type": "string"
        },
        "description": {
          "description": "a description of this adapter. shown in help",
          "type": "string"
        },
        "disabled_by_default": {
          "description": "if true, the adapter will be disabled by default",
          "type": [
            "boolean",
            "null"
          ]
        },
        "version": {
          "description": "version identifier. used to key cache entries, change if the configuration or program changes",
          "type": "integer",
          "format": "int32"
        },
        "extensions": {
          "description": "the file extensions this adapter supports. For example [\"epub\", \"mobi\"]",
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "mimetypes": {
          "description": "if not null and --rga-accurate is enabled, mime type matching is used instead of file name matching",
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "match_only_by_mime": {
          "description": "if --rga-accurate, only match by mime types, ignore extensions completely",
          "type": [
            "boolean",
            "null"
          ]
        },
        "binary": {
          "description": "the name or path of the binary to run",
          "type": "string"
        },
        "args": {
          "description": "The arguments to run the program with. Placeholders: - $input_file_extension: the file extension (without dot). e.g. foo.tar.gz -> gz - $input_file_stem, the file name without the last extension. e.g. foo.tar.gz -> foo.tar - $input_virtual_path: the full input file path. Note that this path may not actually exist on disk because it is the result of another adapter\n\nstdin of the program will be connected to the input file, and stdout is assumed to be the converted file",
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "output_path_hint": {
          "description": "The output path hint. The placeholders are the same as for `.args`\n\nIf not set, defaults to \"${input_virtual_path}.txt\"\n\nSetting this is useful if the output format is not plain text (.txt) but instead some other format that should be passed to another adapter",
          "type": [
            "string",
            "null"
          ]
        }
      }
    },
    "CustomIdentifiers": {
      "type": "object",
      "properties": {
        "bz2": {
          "description": "The identifiers to process as bz2 archives",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "ffmpeg": {
          "description": "The identifiers to process via ffmpeg",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "gz": {
          "description": "The identifiers to process as gz archives",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "xz": {
          "description": "The identifiers to process as xz archives",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "zip": {
          "description": "The identifiers to process as zip archives",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "zst": {
          "description": "The identifiers to process as zst archives",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        },
        "mbox": {
          "description": "The identifiers to process as mbox files",
          "anyOf": [
            {
              "$ref": "#/definitions/CustomIdentifier"
            },
            {
              "type": "null"
            }
          ]
        }
      }
    },
    "CustomIdentifier": {
      "type": "object",
      "properties": {
        "extensions": {
          "description": "The file extensions this adapter supports, for example `[\"gz\", \"tgz\"]`.",
          "type": [
            "array",
            "null"
          ],
          "items": {
            "type": "string"
          }
        },
        "mimetypes": {
          "description": "If not null and --rga-accurate is enabled, mimetype matching is used instead of file name matching.",
          "type": [
            "array",
            "null"
          ],
          "items": {
            "type": "string"
          }
        }
      }
    }
  }
}
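
To illustrate how the nullable `CustomIdentifier` fields in the schema above might be consumed, here is a minimal Rust sketch. It assumes that a missing or null list means "keep the built-in defaults"; that fallback rule, the function name `effective_extensions`, and the placeholder default list are all hypothetical and not taken from this branch.

```rust
/// Mirrors the `CustomIdentifier` definition in the schema: both fields are nullable.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CustomIdentifier {
    pub extensions: Option<Vec<String>>,
    pub mimetypes: Option<Vec<String>>,
}

/// Returns the user-supplied extension list if one is configured,
/// otherwise the adapter's built-in defaults.
/// ASSUMPTION: null/absent means "use defaults"; the PR text does not state this.
pub fn effective_extensions(
    custom: Option<&CustomIdentifier>,
    defaults: &[&str],
) -> Vec<String> {
    match custom.and_then(|c| c.extensions.as_ref()) {
        Some(exts) => exts.clone(),
        None => defaults.iter().map(|s| s.to_string()).collect(),
    }
}
```

Under this assumption, a user who sets `custom_identifiers.zip.extensions` replaces the zip adapter's list entirely, while leaving the key out keeps the defaults.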

@lafrenierejm lafrenierejm force-pushed the configurable-identifiers branch from b53f171 to 24f57ef Compare September 4, 2024 06:19
@lafrenierejm lafrenierejm marked this pull request as ready for review September 4, 2024 06:26
@lafrenierejm lafrenierejm force-pushed the configurable-identifiers branch 2 times, most recently from 82fb415 to 80cbb9d Compare September 15, 2024 14:49
@lafrenierejm
Contributor Author

@phiresky This is ready for your review whenever you get the chance.

@lafrenierejm
Contributor Author

@phiresky Bumping the request for review.

@phiresky
Owner

phiresky commented Oct 8, 2024

In general this seems like probably a good idea, but I'm not sure about the approach?

  • There's a lot of near-identical code for each extension group we know of, and it would need to be updated for every future adapter and extension group.
  • It can only be defined in the config file and not on the CLI. This should definitely be settable via the CLI, since often I'd only want this temporarily. If an extension is not in the default set of extensions, that likely means adding it only makes sense for some people OR in some contexts (I'd wager most xlsx files treated as zip archives generate a lot of noise in matching and not much useful info).
  • The syntax differs for configuring extensions for builtin adapters vs custom adapters. For custom adapters you have to change the config file and add the extension to the extensions array, while for builtin adapters you have to find the right part of CustomIdentifiers.
  • The custom extensions are passed around starting in the config file, then into each adapter (which, apart from decompress, does not actually care about this), and then back out to the generic (choose_adapter) code.
  • Lots of .copy() calls had to be added in order to make the metadata non-static. I'm not sure how often the relevant methods get called, so I can't say whether this matters, but if it can be avoided it should be.

Maybe it would be better and simpler to use a syntax like
rga --rga-additional-extensions jar=zip,xlsx=zip
which with a custom serde parser would be equivalent to config file additional_extensions: [{"seen_extension": "jar", "used_extension": "zip"}]

As in, you specify pairs of extensions [a,b] and every file with extension a is treated as if it had extension b. That way you also don't need the additional mapping of how the adapter should treat the file internally (only relevant for the decompress adapter afaik).
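
A minimal sketch of how the `jar=zip,xlsx=zip` value proposed above could be parsed into a seen-to-used map, using only the standard library. The function name `parse_additional_extensions`, the error messages, and the lowercasing of extensions are assumptions made for illustration; the actual option would presumably go through the project's existing structopt/serde plumbing.

```rust
use std::collections::HashMap;

/// Parses a value such as "jar=zip,xlsx=zip" into a map from the
/// extension seen on disk to the extension it should be treated as.
/// ASSUMPTION: extensions are compared case-insensitively, so both
/// sides are lowercased here.
pub fn parse_additional_extensions(arg: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    // Skip empty segments so a trailing comma is tolerated.
    for pair in arg.split(',').filter(|p| !p.trim().is_empty()) {
        let (seen, used) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected `seen=used`, got `{pair}`"))?;
        let (seen, used) = (seen.trim(), used.trim());
        if seen.is_empty() || used.is_empty() {
            return Err(format!("empty extension in `{pair}`"));
        }
        map.insert(seen.to_ascii_lowercase(), used.to_ascii_lowercase());
    }
    Ok(map)
}
```

Each entry of the resulting map corresponds to one `{"seen_extension": ..., "used_extension": ...}` object in the config-file form.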

The only change that would be needed is that the "fake" extension needs to be given to the adapter. Since some files given to an adapter already don't actually exist on the FS (e.g. within zips), this can potentially be done by just changing filepath_hint:

    /// file path. May not be an actual file on the file system (e.g. in an archive). Used for matching file extensions.
    pub filepath_hint: PathBuf,

Then no other changes are required per adapter, and the override also works to temporarily override extensions of custom adapters a user has configured.
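
The `filepath_hint` idea above could look roughly like the following sketch, which swaps the extension on the hint path before adapter selection. `remap_filepath_hint` is a hypothetical helper, not code from the repository, and it assumes the remap table is keyed by lowercase extensions.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Returns a copy of `hint` whose extension is replaced when the remap
/// table has an entry for it; otherwise returns the path unchanged.
/// ASSUMPTION: keys in `remap` are lowercase extensions without the dot.
pub fn remap_filepath_hint(hint: &Path, remap: &HashMap<String, String>) -> PathBuf {
    let seen = hint
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match seen.and_then(|e| remap.get(&e)) {
        Some(used) => hint.with_extension(used),
        None => hint.to_path_buf(),
    }
}
```

One consequence of this design is that the rewritten hint no longer matches the real file name, so any code that prints the hint to the user would need the original path kept alongside it.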

@phiresky
Owner

phiresky commented Oct 8, 2024

Going the complete other way: this problem seems to have appeared only for zip files so far, so it might be feasible to make it configurable purely for that adapter, which would also be a lot simpler (--rga-additional-zip-extensions=jar,xxx,zzz). The zip and tar adapters are somewhat special since they have a fair amount of custom logic in them to handle recursion and binary files that you can't really reproduce with a custom external adapter. If you want an additional pdf extension then you can just add a new custom adapter.

But probably the general solution above is better

@lafrenierejm lafrenierejm force-pushed the configurable-identifiers branch 3 times, most recently from 9b45e19 to 982b233 Compare November 15, 2024 21:56
@t1anchen

t1anchen commented Dec 8, 2024

Going the complete other way: this problem seems to have appeared only for zip files so far, so it might be feasible to make it configurable purely for that adapter, which would also be a lot simpler (--rga-additional-zip-extensions=jar,xxx,zzz). The zip and tar adapters are somewhat special since they have a fair amount of custom logic in them to handle recursion and binary files that you can't really reproduce with a custom external adapter. If you want an additional pdf extension then you can just add a new custom adapter.

But probably the general solution above is better

A setting like --rga-additional-zip-extensions might be a good idea, but I'm not sure whether we need a priority order for extensions. As an extreme case, a user could set xlsx in a custom adapter (probably one they built with a custom external adapter) and then also set xlsx in the additional zip extensions here...

Anyway, sorry for my delay in reviewing, but I'm always happy to see things growing :)
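
One way to think about the priority question raised above is a first-match-wins lookup over an ordered adapter list, in line with the schema's description of `adapters` as a descending priority order. The sketch below is illustrative only: `AdapterSpec` and `choose_by_extension` are hypothetical names, and it does not reproduce the project's actual `choose_adapter` logic.

```rust
/// A stripped-down stand-in for an adapter's metadata (hypothetical type).
pub struct AdapterSpec {
    pub name: &'static str,
    pub extensions: Vec<&'static str>,
}

/// Picks the first adapter in `adapters` that claims `ext`.
/// ASSUMPTION: the slice is already sorted by descending priority, so an
/// earlier adapter wins when two adapters claim the same extension.
pub fn choose_by_extension<'a>(
    adapters: &'a [AdapterSpec],
    ext: &str,
) -> Option<&'a AdapterSpec> {
    adapters
        .iter()
        .find(|a| a.extensions.iter().any(|e| e.eq_ignore_ascii_case(ext)))
}
```

With a custom xlsx adapter listed before a zip adapter that also claims xlsx, the custom adapter would be chosen; reversing the list order would reverse the outcome.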

@perplexes

sqlite databases can also be named custom things. vscode names them things like "state.vscdb" -- I'm trying to extract my Cursor LLM conversations for example, but I don't know which workspace has what uuid. rga seemed like a great fit, but the extension issue cropped up.

@phiresky
Owner

phiresky commented Dec 21, 2024

Great example of extension remapping also being useful for other purposes. @lafrenierejm would you be willing to update/rewrite your implementation to use

rga --rga-additional-extensions jar=zip,xlsx=zip,vscdb=sqlite3

and in the config file:

"additional_extensions": [{"seen_extension": "jar", "used_extension": "zip"}]

or

"additional_extensions": {"jar": "zip"}

instead of modifying individual adapters?

@perplexes: note that --rga-accurate should work for your case though

@lafrenierejm
Contributor Author

Great example of extension remapping also being useful for other purposes. @lafrenierejm would you be willing to update/rewrite your implementation…

Certainly! I think I will have time to do so within the next week or so, but I can't promise that.

@lafrenierejm lafrenierejm force-pushed the configurable-identifiers branch from 982b233 to 93fc5ab Compare December 29, 2024 05:34
lafrenierejm and others added 2 commits December 29, 2024 00:55
This will allow end users to provide their own lists of extensions and/or
mimetypes for each of the built-in adapters.
@lafrenierejm lafrenierejm force-pushed the configurable-identifiers branch from 93fc5ab to 5308890 Compare December 29, 2024 05:56
@lafrenierejm
Contributor Author

...would you be willing to update/rewrite your implementation to use

rga --rga-additional-extensions jar=zip,xlsx=zip,vscdb=sqlite3

and in the config file:

"additional_extensions": [{"seen_extension": "jar", "used_extension": "zip"}]

or

"additional_extensions": {"jar": "zip"}

instead of modifying individual adapters?

@phiresky The initial refactor for this is done. I went ahead and exposed mimetypes in addition to extensions. I named the options custom_mimetypes and custom_extensions, respectively, since they allow overwriting the built-in adapters' defaults as well as adding new entries.

I haven't implemented thorough tests yet. That should be done before this PR is considered ready for merge.
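
The "overwrite defaults as well as add new entries" behavior described above can be pictured as a map merge in which user entries take precedence. This is a sketch under a loud assumption: the shape of `custom_extensions` is not shown in this conversation, so the extension-to-adapter map used here, the function name `merge_extension_map`, and the example keys are all hypothetical.

```rust
use std::collections::HashMap;

/// Merges user-supplied mappings over built-in ones.
/// ASSUMPTION: both maps go from a file extension to an adapter name;
/// a custom entry replaces a default with the same key, and unknown
/// keys are added as new entries.
pub fn merge_extension_map(
    defaults: &HashMap<String, String>,
    custom: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.clone();
    merged.extend(custom.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}
```

A test for the real implementation could follow the same outline: one overridden key, one newly added key, and one untouched default.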

5 participants