[Fix][Producer]: handle TopicNotFound/TopicTerminated/ProducerBlockedQuotaExceededException/ProducerFenced when reconnecting #1134

gunli · 2023-11-17T03:57:32Z

(If this PR fixes a github issue, please add Fixes #<xyz>.)

(or if this PR is one task of a github issue, please add Master Issue: #<xyz> to link to the master issue.)

Master Issue: #1128

Motivation

In the Java client, when we get TopicNotFound/TopicTerminated/ProducerBlockedQuotaExceededException/ProducerFenced while reconnecting, we fail the pending messages and close the producer. The Go client does not handle ProducerBlockedQuotaExceededException/ProducerFenced at all, and in #1128 we only call sr.done() where we should call failPendingMessages().

https://github.com/apache/pulsar-client-go/pull/1128/files
https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L1663

Modifications

Rename errMsgTopicNotFount to errMsgTopicNotFound.
Handle TopicTerminated/ProducerBlockedQuotaExceededException/ProducerFenced by calling failPendingMessages():

		if strings.Contains(errMsg, errMsgTopicNotFound) {
			// when topic is deleted, we should give up reconnection.
			p.log.Warn("Topic not found, stop reconnecting")
			break
		}

		if strings.Contains(errMsg, errMsgTopicTerminated) {
			p.log.Warn("Topic was terminated, failing pending messages, stop reconnecting")
			p.failPendingMessages(newError(TopicTerminated, err.Error()))
			// can not set to producerClosing , or it will fail when we call internalClose()
			// there is a Terminated state in JAVA client, maybe we should add a producerTerminated state ?
			// p.setProducerState(producerClosing)
			break
		}

		if strings.Contains(errMsg, errMsgProducerBlockedQuotaExceededException) {
			p.log.Warn("Producer was blocked by quota exceed exception, failing pending messages, stop reconnecting")
			p.failPendingMessages(newError(ProducerBlockedQuotaExceededException, err.Error()))
			break
		}

		if strings.Contains(errMsg, errMsgProducerFenced) {
			p.log.Warn("Producer was fenced, failing pending messages, stop reconnecting")
			p.failPendingMessages(newError(ProducerFenced, err.Error()))
			// can not set to producerClosing , or it will fail when we call internalClose()
			// there is a ProducerFenced state in JAVA client, maybe we should add a producerFenced state ?
			// p.setProducerState(producerClosing)
			break
		}

Verifying this change

Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (10MB)
Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (no)
The public API: (yes / no)
The schema: (yes / no / don't know)
The default values of configurations: (yes / no)
The wire protocol: (yes / no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable / docs / GoDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

gunli · 2023-11-17T03:58:15Z

@RobertIndie @tisonkun @pkumar-singh

RobertIndie

We also need to handle the errors when creating the producer here:

pulsar-client-go/pulsar/producer_partition.go

Lines 194 to 198 in 1b1dd23

    
	if err != nil {
		p.batchFlushTicker.Stop()
		logger.WithError(err).Error("Failed to create producer at newPartitionProducer")
		return nil, err
	}

But this could be considered as a separate issue and fixed in a separate PR.

pkumar-singh · 2023-11-18T22:54:35Z

LGTM

gunli · 2023-11-20T03:36:44Z

Hmm, newPartitionProducer() just calls p.grabCnx(), and when p.grabCnx() returns an error we simply stop creating the producer, so I think that path is OK for now. However, the consumer should close itself when it gets TopicNotFound in reconnectToBroker(), see https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L915-L917. I am not familiar with the consumer, so I can submit an issue to keep track of it.

RobertIndie · 2023-11-20T09:38:21Z

Hmm, newPartitionProducer() just calls p.grabCnx(), and when p.grabCnx() returns an error we simply stop creating the producer, so I think that path is OK for now.

If we don't introduce new states, then I think it's OK. I'm fine with not introducing new producer states. TopicTerminated/ProducerFenced could be treated as errors.

However, the consumer should close itself when it gets TopicNotFound in reconnectToBroker(), see https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L915-L917. I am not familiar with the consumer, so I can submit an issue to keep track of it.

We need to investigate it further. Please submit an issue for it. Thanks!

RobertIndie

Overall LGTM. Could you add some tests?

gunli · 2023-11-21T02:33:52Z

Overall LGTM. Could you add some tests?

Sure, could you please tell me how to trigger a ProducerFenced error?

RobertIndie · 2023-11-21T10:01:43Z

Sure, could you please tell me how to trigger a ProducerFenced error?

You could create two producers, both with ProducerAccessModeExclusive. The creation of the second producer should fail with ProducerFenced.
You can see this test: https://github.com/apache/pulsar/blob/3fdbc9fca6b6206bcbeef8e48937ccb3cb2d273f/pulsar-broker/src/test/java/org/apache/pulsar/broker/service/ExclusiveProducerTest.java#L83

gunli · 2023-11-21T15:44:19Z

@RobertIndie Hmm, this PR is about reconnecting, not producer creation. I think it is difficult to simulate the case where a producer's connection is closed, another producer connects to the same topic exclusively at the same moment, and the first producer then receives a ProducerFenced error when it reconnects.

RobertIndie · 2023-11-22T01:54:05Z

@RobertIndie Hmm, this PR is about reconnecting, not producer creation. I think it is difficult to simulate the case where a producer's connection is closed, another producer connects to the same topic exclusively at the same moment, and the first producer then receives a ProducerFenced error when it reconnects.

@gunli
You could invoke ConnectionClosed() to simulate the loss of a producer's connection:

pulsar-client-go/pulsar/producer_partition.go

Lines 369 to 373 in 1b1dd23

    
	func (p *partitionProducer) ConnectionClosed() {
		// Trigger reconnection in the produce goroutine
		p.log.WithField("cnx", p._getConn().ID()).Warn("Connection was closed")
		p.connectClosedCh <- connectionClosed{}
	}

You could use this approach:

pulsar-client-go/pulsar/producer_test.go

Line 1283 in ec846ff

partitionProducerImp := _producer.(*producer).producers[0].(*partitionProducer)

to get the internal partition producer.

gunli · 2023-11-22T03:51:50Z

I know that, but the timing is difficult. When we call ConnectionClosed(), the producer starts to reconnect, and another producer must occupy the same topic before that reconnection completes. The time window is very small.

RobertIndie · 2023-11-22T04:06:33Z

I know that, but the timing is difficult. When we call ConnectionClosed(), the producer starts to reconnect, and another producer must occupy the same topic before that reconnection completes. The time window is very small.

Could you try using ProducerAccessModeWaitForExclusive for the second producer?

pulsar-client-go/pulsar/producer.go

Line 69 in 50015d3

ProducerAccessModeWaitForExclusive

gunli · 2023-11-22T04:27:10Z

ProducerAccessModeWaitForExclusive

I see, I will try that later, thank you.

gunli · 2023-11-22T09:14:51Z

@RobertIndie I have tried that but failed: reconnecting is too fast, so the second producer has no chance to get connected. I also failed to simulate TopicNotFound, because the server refuses to delete a topic while it has an active producer. I have pushed the test cases but commented them out; you can check them out.

func TestTopicNotFound(t *testing.T) {
	client, err := NewClient(ClientOptions{
		URL: serviceURL,
	})
	assert.NoError(t, err)
	defer client.Close()

	topicName := newTopicName()
	producer, err := client.CreateProducer(ProducerOptions{
		Topic:       topicName,
		SendTimeout: 2 * time.Second,
	})
	assert.Nil(t, err)
	defer producer.Close()

	afterCh := time.After(5 * time.Second)
	// Buffered so the sending goroutine is less likely to block once the test has returned.
	topicNotFoundChan := make(chan bool, 1)
	go func() {
		for {
			_, err := producer.Send(context.Background(), &ProducerMessage{
				Payload: make([]byte, 1024),
			})
			if err != nil {
				// Use the comma-ok form so a non-*Error value cannot panic the goroutine.
				e, ok := err.(*Error)
				topicNotFoundChan <- (ok && e.result == TopicNotFound) || err == errProducerClosed
				// Report only the first failure, then stop sending.
				return
			}
			time.Sleep(1 * time.Millisecond)
		}
	}()

	deleteURL := adminURL + "/admin/v2/persistent/public/default/" + topicName
	log.Info(deleteURL)
	makeHTTPCall(t, http.MethodDelete, deleteURL, "")

	for {
		select {
		case d := <-topicNotFoundChan:
			assert.True(t, d)
			return
		case <-afterCh:
			assert.Fail(t, "Time is up. Topic should have been deleted by now")
			return
		}
	}
}

func TestProducerFenced(t *testing.T) {
	client, err := NewClient(ClientOptions{
		URL: serviceURL,
	})
	assert.NoError(t, err)
	defer client.Close()

	topicName := newTopicName()
	consumer, err := client.Subscribe(ConsumerOptions{
		Topic:            topicName,
		SubscriptionName: "producer_fenced_sub",
	})
	assert.Nil(t, err)
	defer consumer.Close() // subscribe but do nothing

	// create the first producer exclusively
	producer1, err := client.CreateProducer(ProducerOptions{
		Topic:                   topicName,
		SendTimeout:             2 * time.Second,
		ProducerAccessMode:      ProducerAccessModeWaitForExclusive,
		BatchingMaxMessages:     2,
		BatchingMaxSize:         200,
		BatchingMaxPublishDelay: 1 * time.Second,
	})
	assert.Nil(t, err)
	defer producer1.Close()

	go func() {
		// create the second producer wait for exclusive
		fmt.Println("create the second producer wait for exclusive...")
		producer2, err := client.CreateProducer(ProducerOptions{
			Topic:              topicName,
			SendTimeout:        2 * time.Second,
			ProducerAccessMode: ProducerAccessModeWaitForExclusive,
		})
		assert.Nil(t, err)
		defer producer2.Close()
		fmt.Println("the second producer is ready")
		// keep producer2 alive
		time.Sleep(30 * time.Second)
	}()

	time.Sleep(3 * time.Second)
	afterCh := time.After(10 * time.Second)
	producerFencedChan := make(chan bool)
	go func() {
		for {
			producer1.SendAsync(context.Background(),
				&ProducerMessage{Payload: make([]byte, 100)},
				func(id MessageID, producerMessage *ProducerMessage, err error) {
					if err != nil {
						fmt.Println(err)
						// Use the comma-ok form so a non-*Error value cannot panic the callback.
						e, ok := err.(*Error)
						producerFencedChan <- (ok && e.result == ProducerFenced) || err == errProducerClosed
					}
				},
			)

			time.Sleep(1 * time.Millisecond)
		}
	}()

	// trigger reconnecting
	doneChan := make(chan bool)
	go func() {
		ticker := time.NewTicker(1 * time.Second)
		defer ticker.Stop()
		for {
			// No default case: block until doneChan or the ticker fires instead of busy-looping.
			select {
			case <-doneChan:
				return
			case <-ticker.C:
				fmt.Println("close connections...")
				producers := producer1.(*producer).producers
				for i := 0; i < len(producers); i++ {
					partitionProducerImp := producers[i].(*partitionProducer)
					partitionProducerImp.ConnectionClosed()
				}
			}
		}
	}()

	for {
		select {
		case d := <-producerFencedChan:
			assert.True(t, d)
			doneChan <- true
			return
		case <-afterCh:
			assert.Fail(t, "Time is up. Producer should have been fenced by now")
			doneChan <- true
			return
		}
	}
}

gunli · 2023-11-30T03:19:22Z

@tisonkun

gunli · 2023-12-01T09:35:20Z

@RobertIndie The CI failed, but I can't find the root cause from the logs. Would you please check it out?

RobertIndie · 2023-12-06T12:43:42Z

There is a data race issue in the CI: https://github.com/apache/pulsar-client-go/actions/runs/6978661724/job/19165650715?pr=1134#step:5:9630

@gunli Could you take a look?

gunli · 2023-12-07T02:12:28Z

There is a data race issue in the CI: https://github.com/apache/pulsar-client-go/actions/runs/6978661724/job/19165650715?pr=1134#step:5:9630

@gunli Could you take a look?

@RobertIndie I have pushed a commit to fix it, PTAL and run the CI.

tisonkun requested review from tisonkun and RobertIndie November 17, 2023 04:56

RobertIndie reviewed Nov 17, 2023

RobertIndie reviewed Nov 20, 2023

RobertIndie reviewed Nov 23, 2023

gunli force-pushed the fix-failPendingMessages branch from 551f98d to 9662c60 Compare November 24, 2023 02:15

RobertIndie reviewed Nov 24, 2023

gunli added 8 commits November 24, 2023 16:49

fix: handle TopicNotFound/TopicTerminated/ProducerBlockedQuotaExceededException/ProducerFenced when reconnecting

066a5b1

close producer when TopicNotFound/TopicTerminated/ProducerFenced

51f7a78

add TopicNotFound test case

1712eed

fix: handle channel close

cc377a2

comment TestTopicNotFound/TestProducerFenced

6a3dc45

delete useless test cases

c1751f4

delete comments and exit event loop when dataChan/cmdChan is closed

0ac083f

revert p.batchFlushTicker.Stop()

3b278a9

gunli force-pushed the fix-failPendingMessages branch from 9662c60 to 3b278a9 Compare November 24, 2023 08:50

RobertIndie approved these changes Nov 24, 2023

fix data race

e99da7f

gunli mentioned this pull request Dec 7, 2023

fix: update ci timeout to 50m #1140

Closed

1 task

RobertIndie merged commit bd11581 into apache:master Dec 8, 2023
6 checks passed

gunli mentioned this pull request Dec 19, 2023

[Bug][Consumer] Should close the consumer when TopicNotFound #1135

Open
