How is the contrastive data pipeline implemented? #12
Comments
I have the same question. My guess is that it uses the same data processing as Memorizing Transformers (Figure 3)?
As mentioned in the README, the instruction fine-tuning does not use FoT.
However, this is not the implementation that was used to create the base models.
@MarkYangjiayi As described in Appendix A.2 of the FoT paper, FoT may not need the same data processing pipeline as Memorizing Transformers. C_curr and C_prev are not defined per batch; instead, they are segments (vertical) within a batch. This would explain two statements in the FoT paper:
If this is correct, how does FoT process its data? Does it split a long document into multiple subsequences, as Memorizing Transformers does, so that training can use as much of each long document as possible? Or does it simply truncate and pad every single document? @CStanKonrad
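To make the two options in the question above concrete, here is a minimal sketch of both. It is not taken from the FoT code; `chunk_document`, `truncate_and_pad`, and `PAD_ID` are hypothetical names chosen for illustration, and dropping the trailing remainder in option A is an assumption.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, not taken from the FoT code


def chunk_document(tokens: List[int], seq_len: int) -> List[List[int]]:
    """Option A: split a long document into consecutive full-length subsequences.

    The trailing remainder (fewer than seq_len tokens) is dropped here;
    that choice is an assumption made for this sketch.
    """
    n_full = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def truncate_and_pad(tokens: List[int], seq_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Option B: keep only the first seq_len tokens, padding short documents."""
    clipped = tokens[:seq_len]
    return clipped + [pad_id] * (seq_len - len(clipped))


doc = list(range(1, 11))            # a toy "document" of 10 tokens
chunks = chunk_document(doc, 4)     # uses 8 of the 10 tokens
clipped = truncate_and_pad(doc, 4)  # uses only the first 4 tokens
```

Option A keeps most of a long document available for training, while option B discards everything past the first `seq_len` tokens.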
Have there been any developments on the "official FoT large scale continual pre-training (FoT finetuning) code"?
It's been almost two weeks. What is the plan for releasing the FoT pipeline? I'm still looking forward to seeing the actual implementation of the cross-batch contrastive learning in FoT.
@hxs91 My hypothesis is that FoT uses a training strategy similar to Recurrent Memory Transformer: if you want to train with a local context of 2k and 4 segments, you feed in 8k tokens and split them in the training loop.
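The hypothesis above could be sketched roughly as follows. This only illustrates the splitting step, under the assumption that the sequence length is an exact multiple of the local context; `split_into_segments` is a hypothetical helper, not a function from the FoT repository.

```python
from typing import List


def split_into_segments(tokens: List[int], local_context: int) -> List[List[int]]:
    """Split one long token sequence into consecutive local-context segments."""
    if len(tokens) % local_context != 0:
        raise ValueError("sequence length must be a multiple of local_context")
    return [tokens[i:i + local_context] for i in range(0, len(tokens), local_context)]


# Sizes from the comment above: 8k tokens with a 2k local context give 4 segments.
sequence = list(range(8192))
segments = split_into_segments(sequence, local_context=2048)

# Inside the training loop, each segment would play the role of C_curr, and the
# segments before it (from the same input) would play the role of C_prev.
prev_counts = []
for idx, c_curr in enumerate(segments):
    c_prev = segments[:idx]
    prev_counts.append(len(c_prev))
```

Because all segments come from a single input that is split inside the loop, they stay in the same forward pass.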
Yeah, I realize that if different segments are put in different batches, they are not jointly differentiable, which is inconsistent with the description in the FoT paper.
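One way to keep everything in a single batch is to let each document attend to the previous segment of itself plus the previous segments of a few other documents in the same batch. The sketch below only computes which documents would be paired. The round-robin choice of the other documents and the name `cross_batch_prev_indices` are assumptions made for illustration, not the released FoT pipeline.

```python
from typing import List


def cross_batch_prev_indices(batch_size: int, d: int) -> List[List[int]]:
    """For each document b in the batch, list the documents whose previous
    segment (C_prev) it would see: itself first, then d - 1 other documents.

    The round-robin selection of the other documents is an assumption.
    """
    if not 1 <= d <= batch_size:
        raise ValueError("d must be between 1 and batch_size")
    return [[(b + k) % batch_size for k in range(d)] for b in range(batch_size)]


# Batch of 4 documents, each seeing its own C_prev plus that of 1 other document.
pairing = cross_batch_prev_indices(batch_size=4, d=2)
```

Since every C_prev listed for a document lives in the same batch as its C_curr, gradients can flow through all of them in one backward pass.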
Hi, I saw that the paper mentions that C_curr and C_prev come from the same document in the batch, but I didn't really see how this is implemented.
In the data_processing part of the code, the processor seems to sample a new piece of data each time. How does it guarantee that the next batch of data has the same context across different steps? Thanks