Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in 7.1 with BUILD_AZURE_BACKUP turned on, need help to troubleshoot further #7570

Open
harikb opened this issue Jul 11, 2022 · 7 comments
Assignees

Comments

@harikb
Copy link
Contributor

harikb commented Jul 11, 2022

Hello,

I built a version of foundationdb 7.1 branch to enable backup to Azure blobstore. I took the latest in the 7.1 branch and followed steps outlined here

Upon running fdbbackup, it segfaults. However, I am unable to obtain a meaningful location/trace. Looking for advice on how I should get more troubleshooting details.

NOTE: I replaced the binaries on top of a 7.12 release that was installed via dpkg. I am also looking for instructions to build a package from the compiled output (instead of copying over the binaries).

It was built with the following cmake options.

$ cmake -DNDEBUG=0 -DBUILD_AZURE_BACKUP=ON -S src/foundationdb -B build_output -G Ninja

// binaries installed and foundationdb restarted
// set sample data via fdbcli - db seems to work for trivial queries (no extensive testing done)

$ export AZURE_KEY= xxxxxxxxxxx
$ fdbbackup start --destcontainer "azure://............./fdbtest02/testing/" --log

SIGNAL: Segmentation fault (11)
Trace: addr2line -e fdbbackup.debug -p -C -f -i 0x7f7fec5bf420
Segmentation fault (core dumped)
$ addr2line -e $(which fdbbackup)  -p -C -f -i 0x7f7fec5bf420
?? ??:0

Last few lines of the local trace file

$ tail -n 4 trace.127.0.0.1.32356.1657571034.0ZQNi5.0.1.xml
<Event Severity="10" Time="1657571034.860529" DateTime="2022-07-11T20:23:54Z" Type="ConnectionExchangingConnectPacket" ID="87d456bd7f6fff03" SuppressedEventCount="0" PeerAddr="127.0.0.1:4500" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="10" Time="1657571034.861009" DateTime="2022-07-11T20:23:54Z" Type="ConnectionEstablished" ID="87d456bd7f6fff03" SuppressedEventCount="0" Peer="127.0.0.1:4500" ConnectionId="1" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="10" Time="1657571034.861009" DateTime="2022-07-11T20:23:54Z" Type="ConnectedOutgoing" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="127.0.0.1:4500" ThreadID="4733494822528056024" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />
<Event Severity="40" ErrorKind="BugDetected" Time="1657571035.184829" DateTime="2022-07-11T20:23:55Z" Type="Crash" ID="0000000000000000" Signal="11" Name="Segmentation fault" Trace="addr2line -e fdbbackup.debug -p -C -f -i 0x7f7fec5bf420" ThreadID="16319990330548784926" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xf310ac 0xf2fcc0 0xf300be 0xf05873 0x7f7fec5bf420" Machine="127.0.0.1:32356" LogGroup="default" ClientDescription="primary-7.1.13-4733494822528056024" />

$ file $(which fdbbackup)
/usr/bin/fdbbackup: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=2e6cc0a000f57ab5bfc6deeff93941b133386cb5, with debug_info, not stripped
@harikb
Copy link
Contributor Author

harikb commented Jul 11, 2022

Here is some more info from my (probably misguided) attempt to run it with gdb

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6a9f700 (LWP 294882)]
[New Thread 0x7ffff625e700 (LWP 294883)]
[New Thread 0x7ffff5a3a700 (LWP 294884)]
[New Thread 0x7ffff5239700 (LWP 294885)]
[Thread 0x7ffff5239700 (LWP 294885) exited]

Thread 3 "fdbbackup" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff625e700 (LWP 294883)]
__GI___libc_free (mem=0x1) at malloc.c:3102
3102    malloc.c: No such file or directory.
(gdb) where
#0  __GI___libc_free (mem=0x1) at malloc.c:3102
#1  0x00000000010551aa in SSL_SESSION_free ()
#2  0x000000000104f899 in SSL_free ()
#3  0x00007ffff7f8b7cd in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#4  0x00007ffff7f8b825 in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#5  0x00007ffff7f498ce in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#6  0x00007ffff7f5e102 in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#7  0x00007ffff7f5e7fe in ?? () from /lib/x86_64-linux-gnu/libcurl.so.4
#8  0x00007ffff7f5fd51 in curl_multi_perform () from /lib/x86_64-linux-gnu/libcurl.so.4
#9  0x00007ffff7f561bb in curl_easy_perform () from /lib/x86_64-linux-gnu/libcurl.so.4
#10 0x00000000010357dc in azure::storage_lite::CurlEasyRequest::perform (this=this@entry=0x7fffec000bd0)
    at fdbclient/azurestorage-src/src/http/libcurl_http_client.cpp:73
#11 0x0000000001036dc2 in azure::storage_lite::CurlEasyRequest::submit(std::function<void (int, azure::storage_lite::storage_istream, CURLcode)>, std::chrono::duration<long, std::ratio<1l, 1l> >) (this=this@entry=0x7fffec000bd0, cb=..., interval=...,
    interval@entry=...) at fdbclient/azurestorage-src/include/http/libcurl_http_client.h:97
#12 0x0000000000fee037 in azure::storage_lite::async_executor<void>::submit_helper (promise=..., outcome=..., account=...,
    request=..., http=..., context=..., retry=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:87
#13 0x0000000000feff16 in azure::storage_lite::async_executor<void>::submit (account=..., request=..., http=..., context=...)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/atomicity.h:96
#14 0x0000000000fe1309 in azure::storage_lite::blob_client::create_container (this=<optimized out>, container=...)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/atomicity.h:96
#15 0x0000000000d8c45a in BackupContainerAzureBlobStore::<lambda()>::operator()(void) const (__closure=0x1b65c70)
    at /tmp/src/foundationdb/fdbclient/BackupContainerAzureBlobStore.actor.cpp:338
#16 0x0000000000d8ccb0 in AsyncTaskThread::<lambda()>::operator() (this=0x1b65c68)
    at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.h:46
#17 AsyncTask<AsyncTaskThread::execAsync(const F&, TaskPriority) [with F = BackupContainerAzureBlobStore::create()::<lambda()>; decltype (func()) = Void]::<lambda()> >::operator()(void) (this=0x1b65c60) at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.h:46
#18 0x0000000000dc4df4 in AsyncTaskThread::run (self=0x1af31a8) at /tmp/src/foundationdb/fdbclient/AsyncTaskThread.actor.cpp:88
#19 0x000000000122060f in execute_native_thread_routine ()
#20 0x00007ffff7dac609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007ffff7ccf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@harikb
Copy link
Contributor Author

harikb commented Jul 11, 2022

Well, that was quick.... I discovered the segfault was caused by my using a path that had two components - "fdbtest02/testing/" - just putting "fdbtest03" seems to have created at least that part. I can confirm it is able to talk to Azure as I can see a directory created there. The backup still failed with a different error, but let me dig this some more..

Details:
 LogBytes written - 0
 RangeBytes written - 0
 Last complete log version and timestamp        - 1456970928727, 2022/07/11.20:40:47+0000
 Last complete snapshot version and timestamp   - N/A, N/A
 Current Snapshot start version and timestamp   - 1456970928727, 2022/07/11.20:40:47+0000
 Expected snapshot end version and timestamp    - 1456970928727, 2022/07/11.20:40:47+0000
 Backup supposed to stop at next snapshot completion - Yes
Recent Errors (since initialization)
23 second(s) ago : 'Backup Container URL invalid' on 'file_backup_write_logs_5.2'

@harikb
Copy link
Contributor Author

harikb commented Jul 11, 2022

Error seem to be coming from the message - invalid URL prefix and it appears as per code that can only happen if BUILD_AZURE_BACKUP is not defined. Quite possible some binary/library hasn't been updated/replaced with the custom build. I have replaced all executable binaries and .so files, but something is still missing.

Coincidently, fdbbackup status also seem to want the AZURE_KEY even if I am merely trying to find the status of the backup.

    $ fdbbackup status
    [snip]
    BackupUID: f20c3287a008daf81d6c7067ac6b603c
    BackupURL: azure://.....[snip]...../fdbtest05
    Snapshot interval is 864000 seconds.  The initial snapshot is still running.
    
    Details:
     LogBytes written - 0
     RangeBytes written - 0
     Last complete log version and timestamp        - 1461196288219, 2022/07/11.22:27:00+0000
     Last complete snapshot version and timestamp   - N/A, N/A
     Current Snapshot start version and timestamp   - 1461196288219, 2022/07/11.22:27:00+0000
     Expected snapshot end version and timestamp    - 1461196288219, 2022/07/11.22:27:00+0000
     Backup supposed to stop at next snapshot completion - Yes
    Recent Errors (since initialization)
    47 second(s) ago : 'Backup Container URL invalid' on 'file_backup_write_logs_5.2'
    
    $ unset AZURE_KEY
    $ fdbbackup status
    ERROR: Backup error
    Fatal Error: Backup error

The trace for the above error, since it is unrelated, is logged here. We don't need to delve into it here.

@harikb
Copy link
Contributor Author

harikb commented Jul 11, 2022

It seems not all code is enabled - there should ideally be a #define here. I will make that change and see if it makes progress.

@harikb
Copy link
Contributor Author

harikb commented Jul 15, 2022

It seems the Segfault is caused at BackupContainerAzureBlobStore.actor.cpp:336 when the container already exists. However, the trace is well inside azure::storage_lite::CurlEasyRequest::perform . I also seeing other folks reporting similar bugs

Azure/azure-storage-cpplite#87
Azure/azure-storage-cpplite#114
Azure/azure-storage-cpplite#115

@sfc-gh-clin
Copy link
Collaborator

@harikb Looks like you fixed the issue, shall we close this?

@harikb
Copy link
Contributor Author

harikb commented Jul 29, 2022

@sfc-gh-clin - I still see a repeat of the issue in a different environment. I am still debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants