Error: m_topology->running_arrow_count >= 0 #227

Closed
faustus123 opened this issue Jun 27, 2023 · 2 comments

@faustus123
Collaborator

faustus123 commented Jun 27, 2023

The jana built-in benchmark was run on ejfat-5.jlab.org and crashed with the error printed below. It hit the assert when scaling to 72 threads (the machine reports 128 cores, so this was in the middle of the test). On a second run it hit the assert when scaling to 73 threads, so the failure is not perfectly reproducible.

The command run was:

jana -Pplugins=JTest -b
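
Since the assert does not fire at the same thread count on every run, it may help to rerun the benchmark in a loop until it aborts. The following is only a minimal sketch: `run_until_failure` is a hypothetical helper (not part of JANA2), and the attempt limit is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Rerun a command until it exits non-zero or the attempt limit is reached.
# run_until_failure is a hypothetical helper, not part of JANA2.
run_until_failure() {
    local max_runs="$1"
    shift
    local i=1
    local status=0
    while [ "$i" -le "$max_runs" ]; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            # Report which attempt failed and pass the exit status through.
            echo "run $i failed with exit status $status"
            return "$status"
        fi
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}
```

Usage with the command from this report would look like `run_until_failure 10 jana -Pplugins=JTest -b`. A process killed by a failed assert typically shows up in the shell as exit status 134 (128 + SIGABRT).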

The machine specs are:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    16
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7763 64-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         1500.000
CPU max MHz:                     3529.0520
CPU min MHz:                     1500.0000
BogoMIPS:                        4899.90
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB

Last part of the output, including the error message:

nthreads=69  rate=120.958Hz  (avg = 143.939 +/- 8.57713 Hz)
nthreads=69  rate=159.95Hz  (avg = 146.607 +/- 7.55132 Hz)
nthreads=69  rate=149.952Hz  (avg = 147.085 +/- 6.48766 Hz)
nthreads=69  rate=129.962Hz  (avg = 144.945 +/- 6.01944 Hz)
nthreads=69  rate=159.932Hz  (avg = 146.61 +/- 5.57619 Hz)
nthreads=69  rate=130.958Hz  (avg = 145.045 +/- 5.23364 Hz)
nthreads=69  rate=148.954Hz  (avg = 145.4 +/- 4.76991 Hz)
nthreads=69  rate=159.903Hz  (avg = 146.609 +/- 4.52294 Hz)
nthreads=69  rate=141.951Hz  (avg = 146.25 +/- 4.18919 Hz)
nthreads=69  rate=137.957Hz  (avg = 145.658 +/- 3.93162 Hz)
nthreads=69  rate=159.91Hz  (avg = 146.608 +/- 3.78258 Hz)
Setting NTHREADS = 70 ...
[INFO] Scaling to 70 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
[INFO] JArrowProcessingController: scale(): All workers are stopped
[INFO] JArrowProcessingController: scale(): Restarting 70 workers
nthreads=70  rate=159.955Hz  (avg = 159.955 +/- -nan Hz)
nthreads=70  rate=159.916Hz  (avg = 159.935 +/- 0.0103851 Hz)
nthreads=70  rate=156.941Hz  (avg = 158.937 +/- 0.81497 Hz)
nthreads=70  rate=122.96Hz  (avg = 149.943 +/- 7.81328 Hz)
nthreads=70  rate=159.953Hz  (avg = 151.945 +/- 6.50205 Hz)
nthreads=70  rate=159.903Hz  (avg = 153.271 +/- 5.55201 Hz)
nthreads=70  rate=119.955Hz  (avg = 148.512 +/- 6.48565 Hz)
nthreads=70  rate=159.953Hz  (avg = 149.942 +/- 5.83049 Hz)
nthreads=70  rate=159.923Hz  (avg = 151.051 +/- 5.28708 Hz)
nthreads=70  rate=130.956Hz  (avg = 149.041 +/- 5.12606 Hz)
nthreads=70  rate=148.948Hz  (avg = 149.033 +/- 4.66006 Hz)
nthreads=70  rate=159.875Hz  (avg = 149.936 +/- 4.35843 Hz)
nthreads=70  rate=119.946Hz  (avg = 147.629 +/- 4.59331 Hz)
nthreads=70  rate=159.933Hz  (avg = 148.508 +/- 4.34847 Hz)
nthreads=70  rate=158.902Hz  (avg = 149.201 +/- 4.11341 Hz)
Setting NTHREADS = 71 ...
[INFO] Scaling to 71 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
[INFO] JArrowProcessingController: scale(): All workers are stopped
[INFO] JArrowProcessingController: scale(): Restarting 71 workers
nthreads=71  rate=119.959Hz  (avg = 119.959 +/- 0.0187951 Hz)
nthreads=71  rate=159.951Hz  (avg = 139.955 +/- 14.1393 Hz)
nthreads=71  rate=159.918Hz  (avg = 146.609 +/- 10.88 Hz)
nthreads=71  rate=138.924Hz  (avg = 144.688 +/- 8.32788 Hz)
nthreads=71  rate=140.946Hz  (avg = 143.939 +/- 6.69584 Hz)
nthreads=71  rate=159.901Hz  (avg = 146.6 +/- 6.08542 Hz)
nthreads=71  rate=146.951Hz  (avg = 146.65 +/- 5.21628 Hz)
nthreads=71  rate=132.966Hz  (avg = 144.939 +/- 4.83657 Hz)
nthreads=71  rate=159.946Hz  (avg = 146.607 +/- 4.57759 Hz)
nthreads=71  rate=159.921Hz  (avg = 147.938 +/- 4.3091 Hz)
nthreads=71  rate=119.973Hz  (avg = 145.396 +/- 4.60666 Hz)
nthreads=71  rate=159.951Hz  (avg = 146.609 +/- 4.37954 Hz)
nthreads=71  rate=157.948Hz  (avg = 147.481 +/- 4.1286 Hz)
nthreads=71  rate=121.967Hz  (avg = 145.659 +/- 4.21678 Hz)
nthreads=71  rate=159.958Hz  (avg = 146.612 +/- 4.04198 Hz)
Setting NTHREADS = 72 ...
[INFO] Scaling to 72 threads
[INFO] JArrowProcessingController: scale(): Stopping all running workers
jana: /home/davidl/work/2023.06.26.JANA2/JANA2/src/libraries/JANA/Engine/JScheduler.cc:74: JArrow* JScheduler::next_assignment(uint32_t, JArrow*, JArrowMetrics::Status): Assertion `m_topology->running_arrow_count >= 0' failed.
Aborted (core dumped)
(venv) davidl@ejfat-5:~/work/2023.06.26.JANA2$

@nathanwbrei
Collaborator

I think I've addressed the root cause with #259, although I'll be a lot more confident after doing performance/stress testing.

@nathanwbrei
Collaborator

I did some performance testing on a farm18 node and didn't see any crashes. Note that in order to reach the ~150Hz that you did, I needed to set -Pjtest:parser_ms=0 as per #106.
