[Bug]: lots of tpcc new order txn timeout during stability test on distributed mode #21065

Open
1 task done
aressu1985 opened this issue Jan 3, 2025 · 4 comments
Assignees
Labels
kind/bug Something isn't working phase/testing severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone

Comments

@aressu1985
Contributor

aressu1985 commented Jan 3, 2025

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

2.0-dev

Commit ID

bad5761

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

Many tpcc new order txns time out during the stability test in distributed mode, starting after about 6 hours.
image

mo-log:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%2238Q%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-stb-bad5761-202501021202%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221735815600000%22,%22to%22:%221735819199000%22%7D%7D%7D&schemaVersion=1&orgId=1

metrics link:
https://shanghai.idc.matrixorigin.cn:30001/d/ae3ttwulohhq8e/txn-metrics?orgId=1&var-interval=1m&var-namespace=mo-stb-bad5761-202501021202&var-pod=All&from=1735815600000&to=1735819199000

statement_info of those timeout:
statement_info.txt

Expected Behavior

No response

Steps to Reproduce

1. run a mo cluster with the config in this issue
2. run tpch 10G loop test processes in one independent tenant
3. run tpcc long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode
4. run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode
5. run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode

Additional information

No response

@aressu1985 aressu1985 added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Jan 3, 2025
@aressu1985 aressu1985 added this to the 2.0.2 milestone Jan 3, 2025
@aressu1985 aressu1985 self-assigned this Jan 3, 2025
@aressu1985 aressu1985 assigned ouyuanning and unassigned aressu1985 Jan 3, 2025
@ouyuanning
Contributor

ouyuanning commented Jan 8, 2025

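Looked into this today together with 苏动 and 徐鹏.
1. What we can see so far:

     tpch requests concentrate on one particular CN,
          -> which drives up CPU usage on that CN
                -> which slows goroutine scheduling on that CN
                      -> which starves log tail apply of resources, so that step takes longer
                            -> which raises txn latency and grows the txn queue, so creating later txns waits longer than the wait active duration, producing a large number of errors

2. However, once this happens, the sys tenant keeps a high number of connections that never drops. There is no conclusion on this part yet.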

@ouyuanning
Contributor

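Regarding point 1: we removed tpch last night and the timeouts still occur.
Next step, per this morning's meeting: first look at what the internal executor is processing that consumes so many resources.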

@aressu1985
Contributor Author

In the regression tests over the last few days, this problem has not occurred again.

@aressu1985 aressu1985 modified the milestones: 2.0.2, 2.1.0 Jan 14, 2025
@XuPeng-SH
Contributor

Not reproduced again.

5 participants