Reproduction of experiments #7
Could you provide some more details? What were the per-task results that you got? Did you use the quality filter that filters for length, numeric ratio, etc.? Did you preprocess the data into chunks?
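(As an aside, a minimal sketch of the kind of length/numeric-ratio quality filter being asked about; the function name and thresholds below are hypothetical, not the repo's actual values.)

```python
import re

def passes_quality_filter(text: str,
                          min_words: int = 40,
                          max_numeric_ratio: float = 0.15) -> bool:
    """Illustrative quality filter: keep examples that are long enough
    and not dominated by numeric tokens. Thresholds are hypothetical."""
    words = text.split()
    if len(words) < min_words:
        return False
    numeric = sum(1 for w in words if re.fullmatch(r"[\d.,%$-]+", w))
    return numeric / len(words) <= max_numeric_ratio
```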
Ah, just found a typo that was introduced when fixing the
Could you try running the resampling step again?
Hi, I have preprocessed the data by running
Actually, I believe your work is sound and I have been following it for a long time. I find that the algorithms described in the 'v1' and 'v3' versions released on arXiv are quite different. However, I am puzzled by the fact that the results reported in Table 4 of the 'v1' version are identical to those in Table 3 of the 'v2' version.
Did you try running the resampling again after your first post on this issue? Basically, this line was mistakenly moved above the for loop, which made the selection-by-domain not work (with the typo, the selected indices were the same for every domain). This affects the experiment since we treat the wikipedia and books domains differently. Regarding the different arXiv versions: the algorithm has stayed the same across all versions; any differences are due to clarification or improvement of the presentation.
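(For readers following along, here is a minimal sketch of the kind of bug described above, with illustrative names rather than the repo's actual code: when the selection line sits above the per-domain for loop, every domain reuses the same indices.)

```python
import numpy as np

rng = np.random.default_rng(0)
domains = ["wikipedia", "books", "web"]
log_weights = {d: rng.normal(size=1000) for d in domains}  # toy weights
k = 100

# Buggy version (the typo): the selection line was hoisted above the
# for loop, so the indices are computed once and reused for every
# domain -- selection-by-domain silently stops working.
chosen = np.argsort(-(log_weights["wikipedia"] + rng.gumbel(size=1000)))[:k]
buggy = {d: chosen for d in domains}

# Fixed version: compute the selection inside the loop, per domain.
fixed = {}
for d in domains:
    noisy = log_weights[d] + rng.gumbel(size=len(log_weights[d]))
    fixed[d] = np.argsort(-noisy)[:k]
```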
Hi, thanks very much.
Thank you for clarifying my confusion. Are you saying that in 'v1' you use the token distributions to compute the weights, rather than learning two generative models as 'v1' suggests?
BTW, I am also confused about the different results of top-k selection and resampled selection. In my experiments, the performance of resampled selection often falls between that of top-k selection and random selection, but the paper reports the opposite.
When you print
Generative models are just models of the data distribution - bag-of-words ("token distributions") is a simple generative model. I suppose the recent "generative AI" stuff has made it seem like generative = transformers/GPT/diffusion models.
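(To make that concrete, here is a minimal sketch of a unigram bag-of-words generative model and the resulting log importance weight. The names and smoothing are illustrative, and the actual implementation differs in details such as the features used, e.g. hashed n-grams.)

```python
import math
from collections import Counter

def fit_unigram(docs, vocab, alpha=1.0):
    """Fit a bag-of-words (unigram) generative model: smoothed token
    frequencies define a probability distribution over tokens."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def log_prob(doc, model):
    """Log-likelihood of a tokenized document (assumed in-vocab)."""
    return sum(math.log(model[tok]) for tok in doc)

def log_importance_weight(doc, target_model, raw_model):
    """Importance weight of an example x: log p_target(x) - log p_raw(x)."""
    return log_prob(doc, target_model) - log_prob(doc, raw_model)
```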
To clarify, by top-k here do you mean not perturbing the importance weights with Gumbel noise before taking the top k? I've run the resampling a couple of times before and haven't seen this, but I can take a look when I get a chance soon.
Thank you very much. Yes, the number matches 1745766302, and by top-k I mean not perturbing the importance weights with Gumbel noise. I'm excited to see the further experiments.
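(For anyone comparing the two selection schemes discussed above, a minimal sketch of both, with illustrative function names: the Gumbel-top-k trick samples k examples without replacement with probability proportional to exp(log-weight), while plain top-k deterministically keeps the k largest weights.)

```python
import numpy as np

def gumbel_top_k(log_weights, k, rng):
    """Resampling: perturb each log-weight with independent Gumbel(0, 1)
    noise, then take the k largest. This samples k items without
    replacement, proportionally to exp(log_weights)."""
    noisy = log_weights + rng.gumbel(size=len(log_weights))
    return np.argsort(-noisy)[:k]

def plain_top_k(log_weights, k):
    """Deterministic selection: just the k largest importance weights."""
    return np.argsort(-log_weights)[:k]

rng = np.random.default_rng(0)
log_w = rng.normal(size=10_000)
print(gumbel_top_k(log_w, k=100, rng=rng)[:5])
print(plain_top_k(log_w, k=100)[:5])
```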
Hi, we followed the training pipeline in
experimental
to replicate the DSIR results. However, our average performance reached only 81.05, significantly below the reported benchmark of 82.30. Are there any additional techniques or optimizations that we might have overlooked?