Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extreamly slow booting #207

Closed
jasl opened this issue May 2, 2023 · 9 comments
Closed

extreamly slow booting #207

jasl opened this issue May 2, 2023 · 9 comments

Comments

@jasl
Copy link

jasl commented May 2, 2023

Come from paritytech/polkadot-sdk#13

I'd catch nperf data from my Azure VM (premium SSD), and the node running in a Docker container, nperf running in the container too.

The current status is the node is stuck on booting, and it is not because of Cumulus reason, and the node is fresh sync last week (stop when speed reduced to 0 bps)

Here are flaming graphs I record the boot starting

from_start_2
from_start

flaminggraph.zip

the file system is ext4 without any configuration, the node is --state-pruning=archive-canonical --blocks-pruning=archive-canonical

@arkpar
Copy link
Member

arkpar commented May 3, 2023

This may happen if there was an abnormal shutdown and fsync is slow for whatever reason. The log file queue may grow to gigabytes and it will take a long time to re-apply all logs on startup. It has been fixed in 0.4.7. The log queue is now limited to about 256Mb.

@arkpar arkpar closed this as completed May 3, 2023
@jasl
Copy link
Author

jasl commented May 14, 2023

boot

Upgraded latest parity-db, the slow boot problem still exists (it takes about 8 minutes to start the node which using Azure premium SSD), here's another flaming graph

boot.svg.zip

UPDATE: after a very slow boot, re-run the node and it finish booting nearly instantly, what's the magic?

@jasl
Copy link
Author

jasl commented May 14, 2023

exit
exit.svg.zip

Here's another flaming graph for the node stuck on exiting

@arkpar
Copy link
Member

arkpar commented May 15, 2023

Upgraded latest parity-db, the slow boot problem still exists (it takes about 8 minutes to start the node which using Azure premium SSD), here's another flaming graph

Are you sure this database was created with parity-db 0.4.7? Previous versions created a long queue of write-ahead log files that would take a long time to process on start. So if you already a database created by an older version and open it with 0.4.7, it will be slow the first time.

@jasl
Copy link
Author

jasl commented May 15, 2023

Upgraded latest parity-db, the slow boot problem still exists (it takes about 8 minutes to start the node which using Azure premium SSD), here's another flaming graph

Are you sure this database was created with parity-db 0.4.7? Previous versions created a long queue of write-ahead log files that would take a long time to process on start. So if you already a database created by an older version and open it with 0.4.7, it will be slow the first time.

I can confirm I'm using 0.4.7, and I have removed the old database before I do the new sync.

The boot time is faster than the previous version (the previous will take more than half an hour to start, 0.4.7 about 7 minutes)

@arkpar
Copy link
Member

arkpar commented May 15, 2023

What does ls -alh log* show inside chains/polkadot/paritydb/full/ dir before the slow start?

@jasl
Copy link
Author

jasl commented May 15, 2023

ls -alh chains/khala/paritydb/full/ | grep log
-rw-r--r-- 1 root root     0 May 14 20:21 log0
-rw-r--r-- 1 root root   65M May 14 20:15 log1
-rw-r--r-- 1 root root   65M May 14 20:15 log2
-rw-r--r-- 1 root root   65M May 14 20:15 log3
-rw-r--r-- 1 root root   65M May 14 20:15 log4
-rw-r--r-- 1 root root   65M May 14 20:15 log5
-rw-r--r-- 1 root root   65M May 14 20:15 log6
-rw-r--r-- 1 root root   65M May 14 20:21 log7
ls -alh polkadot/chains/ksmcc3/paritydb/full/ | grep log
-rw-r--r-- 1 root root   57M May 14 20:19 log0
-rw-r--r-- 1 root root   75M May 14 20:15 log1
-rw-r--r-- 1 root root   65M May 14 20:15 log2
-rw-r--r-- 1 root root   88M May 14 20:15 log3
-rw-r--r-- 1 root root   65M May 14 20:15 log4
-rw-r--r-- 1 root root   65M May 14 20:16 log5
-rw-r--r-- 1 root root   72M May 14 20:16 log6

@jasl
Copy link
Author

jasl commented May 15, 2023

e4defrag 1.44.1 (24-Mar-2018)
<Fragmented files>                             now/best       size/ext
1. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/index_06_16
                                              2077/1              8 KB
2. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cc
                                                 2/1              8 KB
3. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cf
                                                 2/1              8 KB
4. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_06_ce
                                                 2/1              8 KB
5. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_d3
                                                 2/1              8 KB

 Total/best extents				17099/1270
 Average size per extent			5411 KB
 Fragmentation score				1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/data) does not need defragmentation.
 Done.

@arkpar
Copy link
Member

arkpar commented May 15, 2023

This looks normal. If it takes 8 minutes to fsync ~500Mb to the disk, you might be hitting IOPS limits on the disk or VM. Check your VM configuration or try on a different hardware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants