feat: Migrate modules' internal console API calls to V2 #853

Open
wants to merge 35 commits into base: main

Commits (35)
be3d0d8
Update the warning message shown when the Redis dependency is not installed
Dobiichi-Origami Oct 29, 2024
5326d24
Initial migration of Dataset to V2
Dobiichi-Origami Oct 31, 2024
34dfb20
Merge branch 'main' into dataset_calling_v2_update
Dobiichi-Origami Nov 4, 2024
aeab72a
Update unit tests to fit the new V2 Dataset
Dobiichi-Origami Nov 4, 2024
6c3a42e
Migrate Evaluation to V2
Dobiichi-Origami Nov 5, 2024
1388dfe
Merge branch 'main' into dataset_calling_v2_update
Dobiichi-Origami Nov 12, 2024
eeafede
Update docs; migrate the model publishing feature to V2
Dobiichi-Origami Nov 13, 2024
79d7fa6
Bugfix for the model publishing feature
Dobiichi-Origami Nov 13, 2024
e697c8b
Fix a bug caused by a mismatch between the API documentation and the actual response
Dobiichi-Origami Nov 14, 2024
88152b4
Fix unit test issues
Dobiichi-Origami Nov 15, 2024
f595b74
Change function parameters
Dobiichi-Origami Nov 18, 2024
c6aaec9
Fix a bug in create_from_bos_file
Dobiichi-Origami Nov 25, 2024
c3a12f8
Make local datasets tolerate missing fields
Dobiichi-Origami Nov 27, 2024
24929cc
Remove the ak and sk fields from QianfanDataSource; support SFT dataset cleaning
Dobiichi-Origami Nov 29, 2024
d757050
Support PromptImageResponse datasets
Dobiichi-Origami Dec 2, 2024
3388226
Fix unit test errors
Dobiichi-Origami Dec 3, 2024
4b2c477
Merge branch 'main' into dataset_calling_v2_update
Dobiichi-Origami Dec 3, 2024
e454abf
Update model service deployment parameters
Dobiichi-Origami Dec 3, 2024
0e16daa
Add Schema to FakePyarrow
Dobiichi-Origami Dec 3, 2024
2409ce2
Fix the wrong reservation parameter when creating a service
Dobiichi-Origami Dec 3, 2024
f699e5e
Do not require a trailing slash when the regular expression matches a bos path
Dobiichi-Origami Dec 3, 2024
ce6ac05
Rename the model compression task parameter config to comp_config
Dobiichi-Origami Dec 3, 2024
09a896a
Keep the file extension consistent before and after upload when pyarrow is not installed and the dataset must be uploaded directly
Dobiichi-Origami Dec 3, 2024
45ceb70
Refresh parameters in the docs
Dobiichi-Origami Dec 4, 2024
4a3105f
Log a message when a console API call returns a failure
Dobiichi-Origami Dec 4, 2024
1e355f0
Change _get_transmission_bos_info so the returned bos path always starts with /
Dobiichi-Origami Dec 4, 2024
c170435
Update the arguments passed in the cookbook
Dobiichi-Origami Dec 5, 2024
eddc2d4
Support manually overriding the file format used when uploading a Qianfan dataset
Dobiichi-Origami Dec 5, 2024
35691ca
Support passing a fallback ak/sk when saving to a Qianfan data source
Dobiichi-Origami Dec 9, 2024
f3b4ec0
Bugfix: support specifying the file format when dataset_version_id is passed
Dobiichi-Origami Dec 10, 2024
84945cb
Replace the zipfile object's extractall method with _extract_all_with_utf8 to work around the cp437 default encoding
Dobiichi-Origami Dec 10, 2024
5ae29fc
Remove uploading to Bos from a share link; change how text datasets are packaged
Dobiichi-Origami Dec 12, 2024
a0c074d
Remove kwargs from the csv.DictReader constructor call
Dobiichi-Origami Dec 12, 2024
07d7577
Update docs
Dobiichi-Origami Dec 18, 2024
5d66312
Fall back to utf8 when cp437 decoding fails
Dobiichi-Origami Dec 18, 2024
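
Commits 84945cb and 5d66312 both deal with how ZIP member names are decoded. As background only, and not as the PR's `_extract_all_with_utf8` implementation, here is a minimal sketch of the underlying issue: Python's `zipfile` decodes a member name as cp437 unless the archive sets the UTF-8 flag (general-purpose bit 11, `0x800`), so UTF-8 names written without that flag come out garbled. The helper names `fix_member_name` and `extract_all_utf8_names` are hypothetical.

```python
import os
import shutil
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: member name is stored as UTF-8


def fix_member_name(name: str, flag_bits: int) -> str:
    """Undo zipfile's cp437 decoding when the raw bytes were really UTF-8."""
    if flag_bits & UTF8_FLAG:
        return name  # zipfile already decoded this name as UTF-8
    try:
        return name.encode("cp437").decode("utf-8")
    except UnicodeError:
        # Not representable in cp437, or not valid UTF-8: keep the name as decoded.
        return name


def extract_all_utf8_names(zip_path: str, dest: str) -> None:
    """Extract every member, writing files under their repaired names."""
    root = os.path.realpath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = fix_member_name(info.filename, info.flag_bits)
            target = os.path.realpath(os.path.join(root, name))
            # Refuse members that would escape the destination directory.
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"unsafe member path: {name!r}")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

The sketch keeps the name that `zipfile` produced whenever the re-decoding step fails, which is one possible reading of the fallback described in 5d66312; the PR's actual fallback order may differ.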
22 changes: 13 additions & 9 deletions README.md
@@ -167,7 +167,7 @@ print(resp["result"])
 ```python
 from qianfan.dataset import Dataset

-ds = Dataset.load(qianfan_dataset_id="your_dataset_id")
+ds = Dataset.load(qianfan_dataset_version_id="your_dataset_id")
 ```

 The Qianfan Python SDK also integrates a set of local data processing features that let users create, read, update, and delete data from multiple data sources locally; see [Dataset framework](./docs/dataset.md).
@@ -192,34 +192,38 @@ from qianfan.dataset import Dataset
 # Import from a local file
 ds = Dataset.load(data_file="path/to/dataset_file.json")

+
 def filter_func(row: Dict[str, Any]) -> bool:
-  return "answer" in row.keys()
+    return "answer" in row.keys()

+
 def map_func(row: Dict[str, Any]) -> Dict[str, Any]:
-  return {
-    "prompt": row["question"],
-    "response": row["answer"],
-  }
+    return {
+        "prompt": row["question"],
+        "response": row["answer"],
+    }

+
 # Process the data with chained calls
 ds.filter(filter_func).map(map_func).pack()

 # Upload to Qianfan
 # A dataset can be used for training only after it has been uploaded to Qianfan
 # Make sure your dataset format meets the requirements
-ds.save(qianfan_dataset_id="your_dataset_id")
+ds.save(qianfan_dataset_version_id="your_dataset_id")
 ```

 #### Trainer

 The Qianfan Python SDK chains the whole model training workflow on top of a Pipeline and gives users better control over the state of the training run; see [Trainer framework](./docs/trainer.md).
 Below is a quick example of fine-tuning ERNIE-Speed-8K:
+
 ```python
 from qianfan.dataset import Dataset
 from qianfan.trainer import Finetune

 # Load a dataset from the Qianfan platform
-ds: Dataset = Dataset.load(qianfan_dataset_id="ds-xxx")
+ds: Dataset = Dataset.load(qianfan_dataset_version_id="ds-xxx")

 # Create an LLMFinetune trainer; at minimum pass train_type and dataset
 # Note: a fine-tune task requires an annotated, non-sorted conversation dataset.
@@ -242,7 +246,7 @@ trainer.run()
 from qianfan.model import Model
 from qianfan.dataset import Dataset

-ds = Dataset.load(qianfan_dataset_id="ds-xx")
+ds = Dataset.load(qianfan_dataset_version_id="ds-xx")
 m = Model(version_id="amv-xx")

 m.batch_inference(dataset=ds)
[File header not captured; the hunks below come from a Jupyter notebook (.ipynb).]
@@ -247,7 +247,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -271,7 +271,7 @@
 }
 ],
 "source": [
-"ds = Dataset.load(qianfan_dataset_id = \"ds-1j390abu4fv5abkf\", format = FormatType.Jsonl)\n",
+"ds = Dataset.load(qianfan_dataset_version_id = \"ds-1j390abu4fv5abkf\", format = FormatType.Jsonl)\n",
 "print(ds[0])"
 ]
 },
@@ -475,9 +475,9 @@
 "\n",
 "sft_svc: Service = m.deploy(DeployConfig(\n",
 "    name=\"cusserv_1\",\n",
-"    endpoint_prefix=\"customer\",\n",
+"    endpoint_suffix=\"customer\",\n",
 "    replicas=1,\n",
-"    pool_type=DeployPoolType.PrivateResource,\n",
+"    months=1,\n",
 "    service_type=ServiceType.Completion,\n",
 "))"
 ]
[File header not captured; the hunks below come from a Jupyter notebook (.ipynb).]
@@ -338,7 +338,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 11,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -362,7 +362,7 @@
 }
 ],
 "source": [
-"ds = Dataset.load(qianfan_dataset_id = \"ds-scm8g98a7pv3zzf3\", format = FormatType.Jsonl)\n",
+"ds = Dataset.load(qianfan_dataset_version_id = \"ds-scm8g98a7pv3zzf3\", format = FormatType.Jsonl)\n",
 "print(ds[0])"
 ]
 },
@@ -548,7 +548,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -562,7 +562,7 @@
 }
 ],
 "source": [
-"eval_ds = Dataset.load(qianfan_dataset_id =\"ds-n1dg1czx3ciqrakr\",organize_data_as_group=False, input_columns=[\"prompt\"], reference_column=\"response\")"
+"eval_ds = Dataset.load(qianfan_dataset_version_id =\"ds-n1dg1czx3ciqrakr\",organize_data_as_group=False, input_columns=[\"prompt\"], reference_column=\"response\")"
 ]
 },
 {
2 changes: 1 addition & 1 deletion cookbook/awesome_demo/dpo_words_count_control/eval.py
@@ -5,7 +5,7 @@
 from qianfan.dataset import Dataset

 def eval(version_id, ds):
-    result_ds = ds.test_using_llm(model_version_id=version_id)
+    result_ds = ds.test_using_llm(model_id=version_id)
     res = []
     for i in result_ds:
18 changes: 9 additions & 9 deletions cookbook/awesome_demo/dpo_words_count_control/main.ipynb
@@ -41,7 +41,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!pip install 'qianfan>=0.3.16'"
+"!pip install 'qianfan'"
 ]
 },
 {
@@ -96,7 +96,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -115,7 +115,7 @@
 }
 ],
 "source": [
-"ds_test = Dataset.load(qianfan_dataset_id = \"ds-2hdewmq2w2yw8dz7\")\n",
+"ds_test = Dataset.load(qianfan_dataset_version_id = \"ds-2hdewmq2w2yw8dz7\")\n",
 "ds_test = ds_test.save(data_file=\"data_file/dpo_test.jsonl\")"
 ]
 },
@@ -190,7 +190,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 6,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -204,14 +204,14 @@
 ],
 "source": [
 "#enable_log(logging.INFO)\n",
-"ds_sft = Dataset.load(qianfan_dataset_id = \"ds-sjv3xchndftmg2fu\")  # SFT training set\n",
+"ds_sft = Dataset.load(qianfan_dataset_version_id = \"ds-sjv3xchndftmg2fu\")  # SFT training set\n",
 "#ds_sft = ds_sft.save(data_file=\"data_file/sft_train.jsonl\")\n",
 "#print(new_ds[0])\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 7,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -224,7 +224,7 @@
 }
 ],
 "source": [
-"ds_dpo = Dataset.load(qianfan_dataset_id = \"ds-ca94jxph35qp1ks3\")  # DPO training set\n",
+"ds_dpo = Dataset.load(qianfan_dataset_version_id = \"ds-ca94jxph35qp1ks3\")  # DPO training set\n",
 "#ds_dpo = ds_dpo.save(data_file=\"data_file/dpo_train.jsonl\")"
 ]
 },
@@ -632,7 +632,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 70,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -645,7 +645,7 @@
 }
 ],
 "source": [
-"dpo_test = Dataset.load(qianfan_dataset_id = \"ds-2hdewmq2w2yw8dz7\")  # DPO evaluation set\n",
+"dpo_test = Dataset.load(qianfan_dataset_version_id = \"ds-2hdewmq2w2yw8dz7\")  # DPO evaluation set\n",
 "# dpo_test = dpo_test.save(data_file=\"data_file/dpo_test.jsonl\")"
 ]
 },
22 changes: 5 additions & 17 deletions cookbook/awesome_demo/essay_scoring/main.ipynb
@@ -1862,14 +1862,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 21,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "# Load the preset dataset used for training\n",
-"qf_train_ds = Dataset.load(qianfan_dataset_id=\"ds-553hczysf3um4cc9\")\n",
+"qf_train_ds = Dataset.load(qianfan_dataset_version_id=\"ds-553hczysf3um4cc9\")\n",
 "# Load the preset dataset used for evaluation\n",
-"qf_eval_ds = Dataset.load(qianfan_dataset_id=\"ds-6ubasnsry5pa4azi\")"
+"qf_eval_ds = Dataset.load(qianfan_dataset_version_id=\"ds-6ubasnsry5pa4azi\")"
 ]
 },
 {
@@ -2705,7 +2705,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 76,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -2731,9 +2731,8 @@
 "\n",
 "sft_svc: Service = m.deploy(DeployConfig(\n",
 "    name=\"essay_correct_3\",\n",
-"    endpoint_prefix=\"essaycor\",\n",
+"    endpoint_suffix=\"essaycor\",\n",
 "    replicas=1,\n",
-"    pool_type=DeployPoolType.PrivateResource,\n",
 "    service_type=ServiceType.Chat,\n",
 "))"
 ]
@@ -2812,17 +2811,6 @@
 "for s in result:\n",
 "    print(s['result'])"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"\n",
-"\n",
-"Dataset.map_reduce"
-]
 }
 ],
 "metadata": {
21 changes: 9 additions & 12 deletions cookbook/awesome_demo/role_play/main.ipynb
@@ -884,16 +884,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"bos_bucket_name = \"your_bos_bucket\"\n",
-"bos_bucket_file_path = \"/your_path/\"\n",
+"bos_bucket_file_path = \"bos://your_bos_bucket/your_path/\"\n",
 "qianfan_dataset_name = \"your_ds_name\"\n",
 "\n",
 "# Create a Qianfan dataset, then upload and save to it\n",
 "qianfan_data_source = QianfanDataSource.create_bare_dataset(\n",
 "    name=qianfan_dataset_name,\n",
-"    template_type=console_consts.DataTemplateType.NonSortedConversation,\n",
-"    storage_type=console_consts.DataStorageType.PrivateBos,\n",
-"    storage_id=bos_bucket_name,\n",
+"    dataset_format=console_consts.V2.DatasetFormat.PromptResponse,\n",
+"    storage_type=console_consts.V2.StorageType.Bos,\n",
 "    storage_path=bos_bucket_file_path,\n",
 ")\n",
 "qf_ds = ds.save(qianfan_data_source, should_overwrite_existed_file=True)"
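
The hunk above folds the separate bucket name and path into a single `bos://bucket/path/` string. Purely as an illustration of that format, the sketch below splits such a URI back into its bucket and path. `parse_bos_uri`, `BosLocation`, and the regular expression are assumptions made for this example, not the SDK's own parser. The sketch treats a trailing slash as optional and always returns a path that starts with `/`, which matches the intent stated in commit 1e355f0.

```python
import re
from typing import NamedTuple


class BosLocation(NamedTuple):
    bucket: str
    path: str  # always starts with "/"


# Hypothetical pattern: "bos://", a bucket name without slashes, then an optional path.
_BOS_URI = re.compile(r"^bos://(?P<bucket>[^/]+)(?P<path>/.*)?$")


def parse_bos_uri(uri: str) -> BosLocation:
    """Split a bos:// URI into bucket and path, tolerating a missing path."""
    match = _BOS_URI.match(uri)
    if match is None:
        raise ValueError(f"not a bos:// URI: {uri!r}")
    # A bare "bos://bucket" has no path group; normalise it to the bucket root.
    path = match.group("path") or "/"
    return BosLocation(match.group("bucket"), path)
```

For the value used in the notebook cell, `parse_bos_uri("bos://your_bos_bucket/your_path/")` gives the bucket `your_bos_bucket` and the path `/your_path/`.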
@@ -981,7 +979,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 13,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -1012,7 +1010,7 @@
 "\n",
 "# Import the test set preset on the platform\n",
 "ds = Dataset.load(\n",
-"    qianfan_dataset_id=\"ds-bimjvfatbnard1we\",\n",
+"    qianfan_dataset_version_id=\"ds-bimjvfatbnard1we\",\n",
 "    organize_data_as_group=False,\n",
 "    input_columns=[\"prompt\"],\n",
 "    reference_column=\"response\",\n",
@@ -1054,7 +1052,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 16,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -1138,7 +1136,7 @@
 ],
 "source": [
 "# Load the model under evaluation; fill in the model version ID at version_id\n",
-"eb_turbo_model = Model(version_id=\"amv-3ytrunai0k0n\")\n",
+"eb_turbo_model = Model(id=\"amv-3ytrunai0k0n\")\n",
 "# Set up the local evaluator\n",
 "em = EvaluationManager(local_evaluators=[local_evaluator])\n",
 "result = em.eval([eb_turbo_model], ds)\n",
@@ -1191,7 +1189,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 24,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -1209,9 +1207,8 @@
 "    name=\"role_play_sftfin\",\n",
 "    endpoint_suffix=\"sdkcqa1\",\n",
 "    replicas=1, # number of replicas, tightly bound to QPS\n",
-"    pool_type=DeployPoolType.PrivateResource, # private resource pool\n",
 "    service_type=ServiceType.Chat,\n",
-"    hours=1,\n",
+"    months=1,\n",
 "))\n"
 ]
 },
6 changes: 3 additions & 3 deletions cookbook/dataset/batch_inference_using_dataset.ipynb
@@ -516,7 +516,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": null,
 "metadata": {},
 "outputs": [
 {
@@ -621,9 +621,9 @@
 "#-# cell_skip\n",
 "cloud_dataset_id = \"dataset_id\"\n",
 "\n",
-"qianfan_ds = Dataset.load(qianfan_dataset_id=cloud_dataset_id)\n",
+"qianfan_ds = Dataset.load(qianfan_dataset_version_id=cloud_dataset_id)\n",
 "\n",
-"result = qianfan_ds.test_using_llm(model_version_id=\"amv-qb8ijukaish3\")\n",
+"result = qianfan_ds.test_using_llm(model_id=\"amv-qb8ijukaish3\")\n",
 "print(result[0])"
 ]
 },
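
Several hunks in this PR only rename a keyword argument and leave its value untouched (`qianfan_dataset_id` to `qianfan_dataset_version_id`, `model_version_id` to `model_id`). For callers that cannot update every call site at once, one option is a thin translation step in front of the new API. The sketch below is not part of the SDK: `migrate_kwargs` and the two rename tables are hypothetical, and it assumes the old and new keywords accept the same values, as they appear to in the hunks.

```python
import warnings
from typing import Any, Dict, Mapping

# Renames observed in this PR's hunks (old keyword -> new keyword).
DATASET_LOAD_RENAMES: Mapping[str, str] = {
    "qianfan_dataset_id": "qianfan_dataset_version_id",
}
TEST_USING_LLM_RENAMES: Mapping[str, str] = {
    "model_version_id": "model_id",
}


def migrate_kwargs(
    kwargs: Mapping[str, Any], renames: Mapping[str, str]
) -> Dict[str, Any]:
    """Return a copy of kwargs with deprecated keyword names replaced."""
    migrated = dict(kwargs)
    for old, new in renames.items():
        if old not in migrated:
            continue
        if new in migrated:
            raise TypeError(f"got both {old!r} and its replacement {new!r}")
        warnings.warn(
            f"{old!r} is deprecated, use {new!r}", DeprecationWarning, stacklevel=2
        )
        migrated[new] = migrated.pop(old)
    return migrated


# Hypothetical call site; it needs the qianfan package, so it is left commented out:
# ds = Dataset.load(**migrate_kwargs({"qianfan_dataset_id": "ds-xxx"}, DATASET_LOAD_RENAMES))
```

Keeping one rename table per function avoids rewriting a keyword that happens to share a name in an unrelated call.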