Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update data masking for agnostic pretraining data #427

Open
aakankshaduggal opened this issue Dec 4, 2024 · 1 comment
Open

Update data masking for agnostic pretraining data #427

aakankshaduggal opened this issue Dec 4, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@aakankshaduggal
Copy link
Member

Update data masking for agnostic pretraining data

adding @Maxusmusti and @RobotSail for more details around the requirements from the training

@bbrowning bbrowning added the enhancement New feature or request label Jan 21, 2025
@bbrowning
Copy link
Contributor

The goal here is for SDG to be able to generate data once that's usable with any student model. Today, the pretraining samples we generate have to be generated in a format that matches the chat prompt of the intended student model used for training. Instead, we want to generate data with SDG once and have training do whatever post-processing it needs to our pretrain samples to adapt them to the student model being trained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants