-
Notifications
You must be signed in to change notification settings - Fork 151
TRAFODION-2940 In HA env, one node lose network, when recover, trafci can't use #1427
base: master
Are you sure you want to change the base?
Conversation
Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2399/ |
Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2399/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one question about the changes.
LOG.info("Connected to ZooKeeper successful, restart DCS Master."); | ||
// reset lock | ||
isLeader = new CountDownLatch(1); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this logic. Do we sit inside one of these method calls during normal processing? Or does tmpZkc.connect() and close() complete immediately? If so, it looks like we just loop around the while loop and do it again, over and over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic is for when dcsmaster return with network error situation.
in the logic , it will try to connect to zk
if it can't conn ( tmpZkc.connect(); ) , there will in catch block and do retry
if it connect to zk, close the connect, then dcsmaster will run call() method again, in the time dcsmaster rework ,there must hava another backup-master working ( because there must one dcs master work and current master lose network ,then backupmaster take over the role) , so when dcsmaster rework , it will set value in zk /rootpath/dcs/leader/ then hang by lock "isLeader = new CountDownLatch(1);"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, Zookeeper connection lost has been covered by session expired event, so this loop is useless.
Hope some DCS experts to take a review. @mashengchen please invite proper DCS experts. I cannot understand these changes well. |
@arvind-narain @kevinxu021 can you help to take a look |
Can you please describe with more details regarding the original issue in the JIRA. It is unclear to me the scenario with 2 floating IPs. Let's hold off on the merge until we understand the original issue that will help us understand the changes done |
|
@hegdean, are you happy with this change now? Should I merge it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not versed in networking enough to judge teh whole change. Just comment on duplicated code.
This code was tested against the test scenario you described? And it solved the duplicate IP problem that the old code did not?
unbindlb=`echo "$myifport"|awk '{print $NF}'` | ||
echo "Virtual ip $gv_float_external_ip is in use on node $HOSTNAME bound to interface $myinterface($unbindlb) - unbinding..." | ||
sudo /sbin/ip addr del $unbindip dev $myinterface | ||
status=$? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unnecessary to duplicate entire function just to change whether commands run locally or remotely. That can be done with a variable, so that we don't have two copies of this code to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is a little different from the one using PDSH, the ouput of $SQ_PDSH $MY_NODES /sbin/ip addr show
is different from /sbin/ip link show
, so the cut step after the " | " (pipe) is different.
@svarnau yes , I had done the test, and it solved the duplicate IP problem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some fixes needed. Waiting for the changes.
LOG.info("Connected to ZooKeeper successful, restart DCS Master."); | ||
// reset lock | ||
isLeader = new CountDownLatch(1); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, Zookeeper connection lost has been covered by session expired event, so this loop is useless.
LOG.error(e); | ||
System.exit(1); | ||
LOG.error(e.getMessage(), e); | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close zk connection while any connection exception happens. it's better to use specific Exception for zookeeper.
@@ -203,8 +244,8 @@ public void run() { | |||
} catch (KeeperException.NodeExistsException e) { | |||
// do nothing...some other server has created znodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be better to add warning for this exception that we could what happened. Any events happened should be reflected in log files.
} | ||
} | ||
return 1; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add try-catch for each progress.
if (scriptContext.getExitCode() == 0) | ||
LOG.info("Unbind Floating IP successful"); | ||
else | ||
LOG.error("Unbind Floating IP failed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to print exit code.
@@ -93,6 +93,7 @@ public void run() { | |||
try { | |||
queue.wait(); | |||
} catch (InterruptedException e) { | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log message for what happened by debug level.
setSessionRecoverSuccessful(false); | ||
} | ||
} catch (KeeperException e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log messages in file.
connect(true); | ||
} catch (IOException e) { | ||
setSessionRecoverSuccessful(false); | ||
LOG.error("session expired and throw IOException while do reconnect: " + e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use 2 params API for log. If it doesn't matter for whatever the message is, it should be LOG.warn.
setSessionRecoverSuccessful(false); | ||
LOG.error("session expired and throw IOException while do reconnect: " + e.getMessage()); | ||
} catch (InterruptedException e) { | ||
LOG.error("session expired and throw InterruptedException while do reconnect: " + e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above.
// session expired, may be never happending. but if it happen there | ||
// need to close old client and rebuild new client | ||
try { | ||
connect(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In connect(...), I saw it was catched for zookeeper exception, how can we get exception here?
c92bd61
to
c43bb2e
Compare
Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2475/ |
Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2475/ |
Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2691/ |
Test Passed. https://jenkins.esgyn.com/job/Check-PR-master/2691/ |
@arvind can u please review this |
I think the wrong Arvind was tagged :) |
@arvind-narain can u please review |
Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2887/ |
Test Failed. https://jenkins.esgyn.com/job/Check-PR-master/2887/ |
when there loses network for a long time ,and then network recover, there will trigger zookeeper session expired, at this time ,check whether current dcsmaster is leader, if not unbind this node's floating ip, and make dcsmaster init status, then rerun dcs master.