Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc for dedicated local storage in ilab cli #104

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

RobotSail
Copy link
Member

This document outlines an approach for config and program data management that simplifies
the user flow and creates a streamlined experience.

Signed-off-by: Oleg S [email protected]

@RobotSail RobotSail requested review from cdoern and jeremyeder June 24, 2024 23:01
Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally happy with the direction here.

I'd like to hear a little more around the proposed unique refs -- how would you generate them? where would they be stored? How would they be mapped to a file (or files) on disk?

Are you intentionally leaving it vague and just "TBD based on whatever seems easy to implement" later? or do you have anything more concrete in mind?

There's also some linter complaints, but I'm sure you saw those already.


For tools to be useful to the user, they generally need to work on and know about data pertinent to the user. For example, podman and docker track image blobs, kubectl tracks cluster credentials, etc.

The InstructLab CLI manages a number of settings today such as the teacher model, taxonomy location, inference model, etc. using a local file known as `config.yaml`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI itself doesn't manage any of this. It's user config. It generates an initial file for you, but never updates it from there. I only bring this up because user config isn't the same as data that is never touched directly by a user. (config vs data)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, I would say it makes a stronger point for the proposed structure of the ~/.local/share directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be so @RobotSail but as @russellb mentioned the CLI doesn't manage this.


We propose the following:

1. `ilab` will have a dedicated config directory seated at `~/.config/ilab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if we decide to support Windows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably this works on windows with WSL? I know people have been using it on WSL, already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's WSL then we should still assume it's a linux system. The only WSL-specific bits are more on the CUDA drivers side. And even then, that works fine currently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more of "raw windows", but you're right - WSL is a great alternative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobotSail Are qwe going to support native Windows? if not, we should be explicit that it is WSL.


The steps outlined to implement this are:

1. Implement logic for `~/.config/ilab` and `~/.local/share/ilab` directories and expect the config to live and be read from there unless another config is manually specified. For simplicity, if the is read in from `~/.config/ilab` then we assume the relative paths should be realtive to the data directory `~/.local/share/ilab`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A way this can work is have a data_dir that can be changed via a CLI flag or config file option, but always default to what you have specified here -- very similar to what you describe as a default location for the config dir unless otherwise specified via a flag

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree with that.

docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
@RobotSail
Copy link
Member Author

I'd like to hear a little more around the proposed unique refs -- how would you generate them? where would they be stored? How would they be mapped to a file (or files) on disk?

Are you intentionally leaving it vague and just "TBD based on whatever seems easy to implement" later? or do you have anything more concrete in mind?

I was thinking that we could just use a simple UUID or SHA2 digest. For the initial version we could have the refs just be the relative paths in the ~/.local/share directory.

docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved
docs/cli/ilab-local-config.md Outdated Show resolved Hide resolved

We propose the following:

1. `ilab` will have a dedicated config directory seated at `~/.config/instructlab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if Windows support is added in the future.
Copy link
Member

@n1hility n1hility Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your intent to support XDG?:
https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html#introduction

Not arguing that you should necessarily but if you intended to I would change the text to be seated at $XDG_CONFIG_HOME/instructlab (.local/share/instructlab) (And XDG_DATA_HOME later on)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I was thinking of.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly recommend to use https://pypi.org/project/platformdirs/ . It has the correct defaults for all important platforms.

I also like to point out that the use of XDG directories does not work well with systemd services. @mairin just mentioned that she wants systemd service support for InstructLab. XDG directories are problematic, because they are designed for human end-users with a home directory. systemd --system need an entirely different directory structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiran Thanks for pointing this out! I had initially suggested XDG thinking of the laptop-targeted experience. But if we want a systemd service we'll need to use system directories. Ideally we'd use XDG for the laptop-target user experience and system-wide directories for the big mama GPU hw experience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ to this end, @tiran suggests I would store the paths in config.yaml and have a option to either use XDG or a common base directory for all paths.

The output would look something like


## Evolving configs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a distinct proposal, was there a reason you included it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is definitely distinct, but will be something we'll need as a feature in the future.

We propose the following:

1. `ilab` will have a dedicated config directory seated at `~/.config/instructlab` on Linux/MacOS systems, and the `%APPDATA%` equivalent if Windows support is added in the future.
2. `ilab` will have a dedicated program files directory seated at `~/.local/share/instructlab` which will store model checkpoints, downloaded models, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To carry through on the Windows spec I would say %LOCALAPPDATA% here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!


Based on these steps, #1 would be the trickiest because we assume that the paths referenced in the config file are relative to the calling process.

One approach here could be removing these paths from the default and forcing the user to specify them relatively.
Copy link
Member

@n1hility n1hility Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding what you mean on this sentence

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n1hility My wording here is bad but the idea is simple:

  • When the config read is the default one ~/.config/ilab/config.yaml then parse paths relative to ~/.local/share/ilab.
  • When that's NOT the config, we would parse relative paths with respect to the calling process. E.g. if ilab is invoked in /foo and the config.model_path references a file in bar/granite_7b.gguf then it would resolve to /foo/bar/granite_7b.gguf.
  • All absolute paths will be parsed as-is


One approach here could be removing these paths from the default and forcing the user to specify them relatively.

Another approach would be implementing the list commands at the same time as #1 and setting it up so that the relative paths passed to `ilab` are relative ***to the `~/.local/share/instructlab` directory***
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any compatibility with existing configs? For example it looks in the current dir and if there is a config.yml expects abs paths, or is this an intentional major change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the experience be like without #3 (implementing refs) in your list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would list tell you were a file was in that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a breaking change then but we'd need to make it. We can maintain compatibility by allowing users to provide a --config flag which should read in the config as we do today. Not sure how much harder this would be versus what we're doing today.

Copy link
Member

@n1hility n1hility left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I really like this proposal, but could use some clarifications on a few points made.

Signed-off-by: Oleg S <[email protected]>
Copy link
Member

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think overall this is a good proposal and direction. Thanks for pushing this @RobotSail.

Some additional points:

  • As raised by @n1hility, I think supporting the XDG Base Directory Specification is a good idea. Do you mind adding this to the proposal? You could then mention support for the environment variables as per the XDG Base Directory Specification: $XDG_CONFIG_HOME and $XDG_DATA_HOME.
  • It was mentioned about capability to change the data and config paths. This could be implemented by providing environment variables like $INSTRUCTLAB_PATH_CONFIG and $INSTRUCTLAB_PATH_CONFIG


The steps outlined to implement this are:

1. Implement logic for the `~/.config/instructlab` and `~/.local/share/instructlab` directories, and expect the config to live and be read from there unless another config is manually specified. For simplicity, if the config we read in is located at `~/.config/instructlab` - the default location - then we'll assume relative paths should be resolved with respect to the data directory `~/.local/share/instructlab`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be that relative paths are resolved with respect to the config directory?

@danmcp
Copy link
Member

danmcp commented Aug 27, 2024

@RobotSail Would you be willing to update this doc based on feedback and current state? Would be good to get it merged given the state of the code has already moved past the state of this doc.

@RobotSail
Copy link
Member Author

@danmcp Yes we can update this one to reflect what currently exists in ilab

@russellb russellb dismissed their stale review August 27, 2024 17:19

old review, code moved on independent of this doc

@danmcp
Copy link
Member

danmcp commented Aug 27, 2024

@danmcp Yes we can update this one to reflect what currently exists in ilab

Thank you!

@nathan-weinberg
Copy link
Member

@RobotSail are you still planning on updating this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants