Incorrectly formatted resource request crashes pbs_server #425

Open
mattmix opened this issue May 10, 2017 · 4 comments


mattmix commented May 10, 2017

A user typo led to a segfault in pbs_server:
#PBS -l walltime=24:00:00,nodes=1,ppn=4,mem=30gb

should have been:
#PBS -l walltime=24:00:00,nodes=1:ppn=4,mem=30gb

pbs_server crashes consistently when jobs are submitted with this resource request.
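
Until the server rejects this cleanly, a client-side check can catch the typo before submission. The following is a minimal sketch, not Torque's own parser: `misplaced_node_properties` and its `node_keys` parameter are hypothetical names, and it assumes the only mistake of interest is a node property such as `ppn` appearing as its own comma-separated resource instead of being attached to `nodes` with a colon.

```python
def misplaced_node_properties(resource_list, node_keys=("ppn",)):
    """Return items that look like node properties given as standalone resources.

    Hypothetical helper: flags 'ppn=4' in 'nodes=1,ppn=4', which should have
    been written 'nodes=1:ppn=4'. It does not validate anything else.
    """
    flagged = []
    for item in resource_list.split(","):
        # The resource name is everything before the first '='.
        name = item.split("=", 1)[0].strip()
        if name in node_keys:
            flagged.append(item.strip())
    return flagged


bad = "walltime=24:00:00,nodes=1,ppn=4,mem=30gb"
good = "walltime=24:00:00,nodes=1:ppn=4,mem=30gb"

print(misplaced_node_properties(bad))   # ['ppn=4']
print(misplaced_node_properties(good))  # []
```

A check like this could run in a wrapper around qsub so that the malformed request never reaches pbs_server.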


dbeer commented May 12, 2017

This is fixed in the 6.0.3 release.

dbeer closed this as completed May 12, 2017

mattmix commented May 15, 2017

We are seeing this issue in 6.1.1.1.

dbeer reopened this May 15, 2017

dbeer commented May 15, 2017

Can you provide more information about reproducing this? It should be fixed by fd385d8, but that commit is already present in 6.1.1.1. When I submit a job with ',ppn=X' as you reported, it is rejected immediately.


mattmix commented May 19, 2017

mattmix@ln0003 ~>qsub --about 2>&1 | grep Version
Version: 6.1.1.1
mattmix@ln0003 ~>qsub -I -l walltime=24:00:00,nodes=1,ppn=4,mem=30gb
qsub: submit error (This stream has already been closed. End of File.)
mattmix@ln0003 ~>^C
mattmix@ln0003 ~>qsub -I -l walltime=24:00:00,nodes=1,ppn=4
qsub: submit error (This stream has already been closed. End of File.)
mattmix@ln0003 ~>qsub -I -l nodes=1,ppn=4,walltime=24:00:00
qsub: submit error (Job rejected by all possible destinations (check syntax, queue resources, ...))

It looks like the request is only rejected when nodes/ppn comes first. I'm not sure whether options other than walltime placed before nodes also trigger the crash, and I don't want to crash pbs_server more than necessary. The "End of File" errors correspond to a SIGSEGV in pbs_server.
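
To check the remaining orderings systematically, the resource strings can be generated up front instead of typed by hand. This is only a sketch: `resource_orderings` is a hypothetical helper that builds strings and never calls qsub. Any of these requests should only be submitted to a test pbs_server, since at least one ordering is known to crash it.

```python
from itertools import permutations


def resource_orderings(items):
    """Return every comma-joined ordering of the given resource items."""
    return [",".join(order) for order in permutations(items)]


# The three items from the failing requests in this report.
items = ["walltime=24:00:00", "nodes=1", "ppn=4"]

for request in resource_orderings(items):
    # Printed only; pass each string to `qsub -l` manually on a test server.
    print(request)
```

With three items this yields six orderings, including the two reported here: walltime first (crash) and nodes first (rejected).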

Here is a backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fca2dffb700 (LWP 3104)]
0x00000000004e38d0 in encode_unkn (attr=0x7fca84042628, phead=0x7fca2dfea8e0, UNUSED_atname=0x5508e7 "Resource_List", UNUSED_rsname=0x55ef4b "|unknown|",
UNUSED_mode=3, UNUSED_perm=2047) at attr_fn_unkn.c:216
216 pnew = (svrattrl *)calloc(1, plist->al_tsize+1);
(gdb) bt
#0 0x00000000004e38d0 in encode_unkn (attr=0x7fca84042628, phead=0x7fca2dfea8e0, UNUSED_atname=0x5508e7 "Resource_List", UNUSED_rsname=0x55ef4b "|unknown|",
UNUSED_mode=3, UNUSED_perm=2047) at attr_fn_unkn.c:216
#1 0x00000000004e0cea in encode_resc (attr=0x7fca8404bf90, phead=0x7fca2dfea8e0, atname=0x5508e7 "Resource_List", rsname=0x55ef4b "|unknown|", mode=3,
ac_perm=2047) at attr_fn_resc.c:297
#2 0x0000000000435fea in add_encoded_attributes (attr_node=0x7fca2dfebd38, pattr=0x7fca8404bc30) at job_recov.c:996
#3 0x000000000043620b in add_attributes (rnode=0x7fca2dfed158, pjob=0x7fca8404ae90) at job_recov.c:1064
#4 0x0000000000436332 in saveJobToXML (pjob=0x7fca8404ae90, filename=0x7fca2dfed1b0 "/var/spool/torque/server_priv/jobs/")
at job_recov.c:1127
#5 0x000000000043664d in job_save (pjob=0x7fca8404ae90, updatetype=1, mom_port=0) at job_recov.c:1264
#6 0x00000000004985d7 in svr_setjobstate (pjob=0x7fca8404ae90, newstate=1, newsubstate=10, has_queue_mutex=0) at svr_jobfunc.c:1299
#7 0x000000000047dca2 in perform_commit_work (preq=0x7fca2dff7bb0, pj=0x7fca8404ae90, version=2) at req_quejob.c:1510
#8 0x000000000047eb1d in req_quejob (preq=0x7fca2dff7bb0, version=2) at req_quejob.c:1925
#9 0x0000000000462188 in dispatch_request (sfds=13, request=0x7fca2dff7bb0) at process_request.c:739
#10 0x0000000000462081 in process_request (chan=0x7fca84080a40) at process_request.c:701
#11 0x00000000004bb38f in process_pbs_server_port (sock=13, is_scheduler_port=0, args=0x7fcc74002650) at incoming_request.c:162
#12 0x00000000004bb5b0 in start_process_pbs_server_port (new_sock=0x7fcc74002650) at incoming_request.c:270
#13 0x00000000004ff359 in work_thread (a=0x589c450) at u_threadpool.c:318
#14 0x0000003cb3407aa1 in start_thread (arg=0x7fca2dffb700) at pthread_create.c:301
#15 0x0000003cb2ce8aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
