
Bug in generate_dataset_from_dir() #266

Closed
TimofeiKargin opened this issue Oct 20, 2023 · 1 comment
@TimofeiKargin

I am trying to generate a dataset from a bunch of PDFs using `xturing.datasets.InstructionDataset.generate_dataset_from_dir()`.
Here is the code; everything looks fine to me:

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.model_apis.openai import ChatGPT

chatgpt_token = '<my token>'
engine = ChatGPT(chatgpt_token)
dataset = InstructionDataset.generate_dataset_from_dir(path="./data/num_pdf", engine=engine)
dataset.save("./my_generated_dataset")

But it seems there is a bug; here is what I got:

Traceback (most recent call last):

  File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/myfolder/ft/gen_dataset.py:17
    dataset = InstructionDataset.generate_dataset_from_dir(path="./data/num_pdf", engine=engine)

  File ~/.local/lib/python3.10/site-packages/xturing/datasets/instruction_dataset.py:204 in generate_dataset_from_dir
    prepare_seed_tasks.prepare_seed_tasks(

  File ~/.local/lib/python3.10/site-packages/xturing/self_instruct/prepare_seed_tasks.py:52 in prepare_seed_tasks
    pairs = instruction_input_suggest(

  File ~/.local/lib/python3.10/site-packages/xturing/self_instruct/prepare_seed_tasks.py:28 in instruction_input_suggest
    outputs = engine.get_completion(prompts=[prompt])

  File ~/.local/lib/python3.10/site-packages/xturing/model_apis/openai.py:146 in get_completion
    completion = openai.ChatCompletion.create(

  File ~/.local/lib/python3.10/site-packages/openai/api_resources/chat_completion.py:25 in create
    return super().create(*args, **kwargs)

  File ~/.local/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py:155 in create
    response, _, api_key = requestor.request(

  File ~/.local/lib/python3.10/site-packages/openai/api_requestor.py:299 in request
    resp, got_stream = self._interpret_response(result, stream)

  File ~/.local/lib/python3.10/site-packages/openai/api_requestor.py:710 in _interpret_response
    self._interpret_response_line(

  File ~/.local/lib/python3.10/site-packages/openai/api_requestor.py:775 in _interpret_response_line
    raise self.handle_error_response(

InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4237 tokens. Please reduce the length of the messages.

Is there any way to make this work? Maybe a parameter I can pass to make the chunks smaller?

@StochasticRomanAgeev (Contributor)

Hi @TimofeiKargin,
The PDF texts in your directory are too long; try using smaller texts.
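As a workaround sketch (this is not part of xturing's API, and `split_text` is a hypothetical helper), one could pre-split the extracted text into smaller pieces and save each piece as its own file before pointing `generate_dataset_from_dir()` at the directory. Word count is only a rough proxy for the model's token count, so the budget should be kept well under the 4097-token limit reported in the error:

```python
import os


def split_text(text, max_words=1500):
    """Split text into chunks of at most max_words words each.

    Word count is a crude stand-in for tokens; keep the budget
    conservative so each chunk stays under the model's context limit.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def write_chunks(text, out_dir, stem, max_words=1500):
    """Write each chunk of text to its own .txt file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for n, chunk in enumerate(split_text(text, max_words)):
        with open(os.path.join(out_dir, f"{stem}_{n}.txt"), "w") as f:
            f.write(chunk)
```

After writing the chunks, `generate_dataset_from_dir()` could be called on `out_dir` instead of the original PDF directory. Whether the loader accepts plain-text files alongside PDFs would need to be checked against the xturing documentation.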
