Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic segmentation faults when reading LAMMPS dump files using Julia bindings #37

Closed
amlimaye opened this issue Jun 13, 2020 · 12 comments · Fixed by #40
Closed
Labels

Comments

@amlimaye
Copy link

Hi,

I'm using the Julia bindings to Chemfiles to read a LAMMPS dump file (file extension .lammpstrj). The topology information is stored in a LAMMPS data file (file extension .data). For some reason, when I index the positions read out from the LAMMPS dump using indices provided by evaluating a selection on the frame, I reliably but non-deterministically get segmentation faults that crash the Julia REPL. Here's a minimal working example with Julia Chemfiles 0.9.3:

using Chemfiles

top_dir = "electrode-sim-files"
traj_fname = joinpath(top_dir, "dumptest.lammpstrj")
topo_fname = joinpath(top_dir, "dumptest.data")

traj = Trajectory(traj_fname, 'r', "LAMMPS")
topo_traj = Trajectory(topo_fname, 'r', "LAMMPS Data")
topo_frame = read(topo_traj)

selstr = "not type == \"16\" or type == \"17\""
sel = Selection(selstr)

# this loop runs successfully
for ix in 1:length(traj)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    println(sum(pos))
end

# this loop segfaults
for ix in 1:length(traj)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, frame)
    println(sum(pos[:, indices .+ 1]))
end

close(traj)

Here's the stack trace I get from the segfault, it appears to be somewhere in the getindex function for the positions array:

signal (11): Segmentation fault: 11
in expression starting at /Users/aditya/research/023_tba_electrolyte/clean.jl:22
macro expansion at ./multidimensional.jl:697 [inlined]
macro expansion at ./cartesian.jl:64 [inlined]
macro expansion at ./multidimensional.jl:694 [inlined]
_unsafe_getindex! at ./multidimensional.jl:690 [inlined]
_unsafe_getindex at ./multidimensional.jl:684
_getindex at ./multidimensional.jl:670 [inlined]
getindex at ./abstractarray.jl:981
top-level scope at /Users/aditya/research/023_tba_electrolyte/clean.jl:26

Could you please provide some advice on what might be going wrong here? The code above does not always segfault, but in the few times I've tried it, usually segfaults within the first 2-3 times running it. I'm attaching the offending LAMMPS dump and data files here as well. Thanks in advance for any help you can provide!

files.zip

@Luthaf
Copy link
Member

Luthaf commented Jun 15, 2020

Hi, and thanks for the report!

I was able to get it to segfault once over ~200 runs, so I could not get much more information.

Could you also share your OS & Julia version?

Did you try checking that all values in indices are valid indexes?

@amlimaye
Copy link
Author

Thanks for looking into this! I'm on Mac OS Mojave (10.14), running Julia 1.2.0. I included a check for all of indices being in-bounds:

# this loop segfaults
for ix in 1:length(traj)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, topo_frame)
    println(all((indices .+ 1) .< size(frame)))
    println(sum(pos[:, indices .+ 1]))
end

and I still get segfaults, but not always (happened on the third time I ran it).

If it's helpful, I wrote the analogous program using the C++ API and everything runs okay (ran 200+ times without a segfault):

#include <cstdio>
#include "chemfiles.hpp"

int main() {
    // evaluate selection on the topology frame
    chemfiles::Trajectory topo_traj("dumptest.data", 'r', "LAMMPS Data");
    auto selection = chemfiles::Selection("not type == \"16\" or type == \"17\"");
    auto topo_frame = topo_traj.read();
    auto indices = selection.list(topo_frame);

    // load up the trajectory
    chemfiles::Trajectory trajectory("dumptest.lammpstrj");

    // iterate through all the frames
    for (size_t f_ix = 0; f_ix < trajectory.nsteps(); f_ix++) {
        auto frame = trajectory.read();
        auto positions = frame.positions();

        double sumx = 0.0;
        double sumy = 0.0;
        double sumz = 0.0;

        // iterate through all atoms in the selection
        for (const auto& at_ix : indices) {
            sumx += positions[at_ix][0];
            sumy += positions[at_ix][1];
            sumz += positions[at_ix][2];
        }

        printf("sumx: %0.4f, sumy: %0.4f, sumz: %0.4f\n", sumx, sumy, sumz);
    }
}

So it seems like the issue lies specifically in the Julia bindings. I'm happy to help track down this bug but I'm not too experienced with Julia or the chemfiles codebase - if you could point me in the right direction I can try my hand at some debugging.

@ezavod
Copy link
Member

ezavod commented Jun 15, 2020

Hey, that's very interesting.
@amlimaye have you compared the resulting values between your C and jl version? And between different runs of your jl version?

I can't reproduce crashes, but the second loop generates basically random values for the first frame (looks suspiciously like uninitialized memory). This seems to bee a bug in Julia, because this does not happen with the current nighly (or at least it is really rare). I have no clue about julia, so I don't know what might have changed :(
Are you able to try a newer version of julia?

Replacing the loop in the example below with ix = 1 seems to fix the problem as well. Without the loop I get 496097.0006436603. This is the same value that is given by the nightly with loop.

minimal example
for i in {1..10}; do ./julia-1.2.0/bin/julia test.jl; done

test.jl:

using Chemfiles

top_dir = "electrode-sim-files"
traj_fname = joinpath(top_dir, "dumptest.lammpstrj")
topo_fname = joinpath(top_dir, "dumptest.data")

traj = Trajectory(traj_fname, 'r', "LAMMPS")
topo_traj = Trajectory(topo_fname, 'r', "LAMMPS Data")
topo_frame = read(topo_traj)

selstr = "not type == \"16\" or type == \"17\""
sel = Selection(selstr)

for ix in 1:1
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, frame)
    println(sum(pos[:, indices .+ 1]))
end

close(traj)

@amlimaye
Copy link
Author

Interesting. I upgraded to Julia 1.4.2, and I ran clean.jl (see below) several times at the cmdline with julia clean.jl. Everything is a success there - no segfaults after running 100 times, and all the values match the values produced by the C++ API.

It seems that I can only get segfaults in the Julia REPL when running the script using include("clean.jl"). Up until the segfault, all values that are printed match those reported by the C++ API, and are consistent between Julia runs.

I thought maybe this was due to some weird name collision issue, and so i wrapped everything in a function like the example script below. No dice, the script is successful when run at the cmdline, but segfaults within the first 10 include()s at the Julia REPL.

clean.jl

using Chemfiles

function main()
    top_dir = "."
    traj_fname = joinpath(top_dir, "dumptest.lammpstrj")
    topo_fname = joinpath(top_dir, "dumptest.data")

    traj = Trajectory(traj_fname, 'r', "LAMMPS")
    topo_traj = Trajectory(topo_fname, 'r', "LAMMPS Data")
    topo_frame = read(topo_traj)

    selstr = "not type == \"16\" or type == \"17\""
    sel = Selection(selstr)

    # this loop segfaults
    for ix in 1:length(traj)
        frame = read_step(traj, ix - 1)
        pos = positions(frame)
        indices = evaluate(sel, topo_frame)
        println(all((indices .+ 1) .< size(frame)))
        println(sum(pos[:, indices .+ 1]))
    end

    close(traj)
end

main()

Julia version info

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

@Luthaf Luthaf transferred this issue from chemfiles/chemfiles Jun 15, 2020
@Luthaf
Copy link
Member

Luthaf commented Jun 15, 2020

Moved the issue on the julia binding repo =)


I've also only seen the segfault with julia 1.0, and not with 1.5 nightly.

It seems that I can only get segfaults in the Julia REPL when running the script using include("clean.jl"). Up until the segfault, all values that are printed match those reported by the C++ API, and are consistent between Julia runs.

This looks like an issue with finalizers not being run, I found that running the code in REPL there is more chance to hit this kind of issue. To test this, you can add explicit calls to finalize(frame) inside the loop, just before the end.

We already have to copy chemfiles pointers a lot (see Atom(::Frame) and friends), so there is perhaps something going on with Frame or the positions.

I'll try to write up a longer explanation tomorrow =)

@Luthaf Luthaf added the bug label Jun 15, 2020
@amlimaye
Copy link
Author

Adding a call to finalize(frame) seems to fix the segfaults at the REPL - thanks so much! Let me know if I can help in any way with the fix in Chemfiles.jl, I'm eager to help contribute to the project.

@Luthaf
Copy link
Member

Luthaf commented Jun 16, 2020

More debugging information: if I comment the topo_traj/topo_frame part from your initial script, the segfault also goes away.

Adding a call to finalize(frame) seems to fix the segfaults at the REPL - thanks so much!

That's great!

I don't think having to manually free memory is a good experience for Julia developers though, so finding the root cause of this would be nice. Unfortunately, there are multiple things interacting here, and any of them could be wrong.

What seems to be wrong is the position function, which returns a view inside C++-owned memory

function positions(frame::Frame)
ptr = Ref{Ptr{Float64}}()
natoms = Ref{UInt64}(0)
__check(lib.chfl_frame_positions(__ptr(frame), ptr, natoms))
return unsafe_wrap(Array{Float64, 2}, ptr[], (3, Int(natoms[])); own=false)
end

The function calls chfl_frame_positions, which uses the frame.positions.data() pointer directly, meaning there is no copy of the position data. When the frame is deleted with a call to finalize (which calls chfl_free, calling chemfiles::shared_allocator::free, calling C++ delete operator); the position array is also deleted, and future access to it are undefined.
https://github.com/chemfiles/chemfiles/blob/c7a3848962c4a94251cdd7da79102d13337451b6/src/capi/frame.cpp#L53-L66

The issue do not seems to be that simple here, I've added a bit of logging to the allocation/deallocation functions and it seems to be de-allocated at the right time.

Let me know if I can help in any way with the fix in Chemfiles.jl, I'm eager to help contribute to the project.

That's very appreciated =) In addition to this bug, we also have easier issues that need a bit of love, in particular #26!


Completely unrelated side note, but your selection is a bit strange. Did you meant to use not (type == "16" or type == "17") (with parenthesis) instead? Else the type == "17" part is redundant with the not type == "16"

@amlimaye
Copy link
Author

Re: the selection string, I did think that was a little peculiar. Here's an example of the behavior I'm getting from the selections on my system. My total system has 7426 atoms. 396 atoms are type 16, and another 396 are type 17, and i want to craft a selection that excludes both atoms of type 17 and type 16.

Some basic sanity checks first:

julia> indices_all = evaluate(Selection("all"), topo_frame);

julia> indices_no17 = evaluate(Selection("not type == \"17\""), topo_frame);

julia> indices_no16 = evaluate(Selection("not type == \"16\""), topo_frame);

julia> length(indices_all)
7426

julia> length(indices_no17)
7030

julia> length(indices_no16)
7030

julia> all(indices_no17 .== indices_no16)
false

That all seems fine. Now I can use your recommended selection, and also use a selection that distributes the not operator using DeMorgan's law - these should agree and indeed they do.

julia> indices_parenthesized = evaluate(Selection("not (type == \"16\" or type == \"17\")"), topo_frame);

julia> indices_demorgan = evaluate(Selection("(not type == \"16\") and (not type == \"17\")"), topo_frame);

julia> length(indices_parenthesized)
6634

julia> all(indices_parenthesized .== indices_demorgan)
true

Oddly enough though, my original selection also gets the exact same result:

julia> indices_peculiar = evaluate(Selection("not type == \"16\" or type == \"17\""), topo_frame);

julia> length(indices_peculiar)
6634

julia> all(indices_peculiar .== indices_parenthesized)
true

I think this seems to suggest that in my original selection string, the type == 17 is not redundant with the not type == "16" portion. I'm not very familiar with the Chemfiles selection language, but I think this means the or operator has precendence over not, at least in the context of the original selection string. Is this incorrect behavior?

@Luthaf
Copy link
Member

Luthaf commented Jun 20, 2020

Wow, very nice investigation! Intuitively, I would want not to have higher precedence than or/and, so I think the current implementation is wrong. I'll open a separate issue to discuss this =)

@Luthaf
Copy link
Member

Luthaf commented Jun 21, 2020

Running the script in valgrind (valgrind julia -e 'for _ in 1:100; include("clean.jl"); end') gives multiple instances of use-after-free:

==34240== Invalid read of size 8
==34240==    at 0x12B160750: ???
==34240==    by 0x1: ???
==34240==    by 0x1048CCF6F: ???
==34240==    by 0x116E0EDAF: ???
==34240==    by 0x12A718ADF: ???
==34240==    by 0x4DBD: ???
==34240==    by 0x2: ???
==34240==    by 0x19E9: ???
==34240==    by 0x19E9: ???
==34240==    by 0x116C29A4F: ???
==34240==    by 0x10D43C20F: ???
==34240==    by 0x19E9: ???
==34240==  Address 0x12a718ae0 is 0 bytes inside a block of size 178,224 free'd
==34240==    at 0x1000D6EAD: free (in /usr/local/Cellar/valgrind/HEAD-60ab74a/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==34240==    by 0x12A158F46: std::__1::__function::__func<void chemfiles::shared_allocator::insert_new<chemfiles::Frame>(chemfiles::Frame*)::{lambda()#1}, std::__1::allocator<{lambda()#1}>, void ()>::operator()() (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12A15C510: chemfiles::shared_allocator::release(void const*) (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12A1591BF: chfl_free (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12B163016: ???
==34240==    by 0x100131330: jl_gc_run_finalizers_in_list (gc.c:210)
==34240==    by 0x100131D15: jl_gc_collect (gc.c:245)
==34240==    by 0x100131AF8: jl_gc_pool_alloc (gc.c:954)
==34240==    by 0x10013638A: jl_gc_alloc (julia_internal.h:274)
==34240==    by 0x100107C45: _new_array_ (array.c:111)
==34240==    by 0x10010866B: jl_alloc_array_2d (array.c:156)
==34240==    by 0x12B160668: ???
==34240==  Block was alloc'd at
==34240==    at 0x1000D6AD5: malloc (in /usr/local/Cellar/valgrind/HEAD-60ab74a/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==34240==    by 0x103AE4377: operator new(unsigned long) (in /usr/lib/libc++abi.dylib)
==34240==    by 0x12A12FEA0: chemfiles::Frame::Frame(chemfiles::Frame const&) (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12A1E6F47: chemfiles::Molfile<(chemfiles::MolfileFormat)2>::read_step(unsigned long, chemfiles::Frame&) (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12A12E37C: chemfiles::Trajectory::read_step(unsigned long) (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12A185906: chfl_trajectory_read_step (in /Users/guillaume/code/chemfiles/julia/deps/usr/lib/libchemfiles.0.9.3.dylib)
==34240==    by 0x12B15A0A4: ???
==34240==    by 0x1000EBCE2: jl_fptr_trampoline (gf.c:1829)
==34240==    by 0x10011CC39: jl_toplevel_eval_flex (toplevel.c:781)
==34240==    by 0x1000F804E: jl_parse_eval_all (ast.c:838)
==34240==    by 0x10011D872: jl_load_ (toplevel.c:821)
==34240==    by 0x10DC9BE91: japi1_include_relative_2922.clone_1 (boot.jl:317)
==34240==

So my best guess is that Julia pre 1.4/1.3 run the GC a bit too much aggressively, and tries to remove the frame immediately after it's last use (indices = evaluate(sel, frame)), meaning the access to pos next line is done on freed memory.

Adding an extra use of the frame after the memory access makes the segfault go away and valgrind happy.

for ix in 1:length(traj)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, frame)
    println(sum(pos[:, indices .+ 1]))
    println("$(size(frame))")
end

Disabling GC during the loop body does the same

for ix in 1:length(traj)
    Base.GC.enable(false)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, frame)
    println(sum(pos[:, indices .+ 1]))
    Base.GC.enable(true)
end

Overall, it seems to me that we can not do much here, the issue seems to be a julia bug, which was fixed somewhere in 1.3 or 1.4. For this reason, I would tend to mark this bug as "wontfix" and point everyone to use julia >= 1.4.

Is this a possibility in your case @amlimaye?

@amlimaye
Copy link
Author

Yeah, I can definitely bump my Julia version, but the issue seems to persist for me when running with Julia 1.4.2. Here's the script:

using Chemfiles

top_dir = "."
traj_fname = joinpath(top_dir, "dumptest.lammpstrj")
topo_fname = joinpath(top_dir, "dumptest.data")

traj = Trajectory(traj_fname, 'r', "LAMMPS")
topo_traj = Trajectory(topo_fname, 'r', "LAMMPS Data")
topo_frame = read(topo_traj)

selstr = "not type == \"16\" or type == \"17\""
sel = Selection(selstr)

# this loop segfaults
for ix in 1:length(traj)
    frame = read_step(traj, ix - 1)
    pos = positions(frame)
    indices = evaluate(sel, topo_frame)
    println(sum(pos[:, indices .+ 1]))
end

And results:

$ julia --version
julia version 1.4.2

$ julia -e 'for _ in 1:100; include("clean.jl"); end'
496097.0006436603
496096.05060523003
496094.9440102568
496093.70100879576
496092.30348686344
496090.76237287885
496089.1267655236
496087.4726818176
496085.9021523008
496084.53908750624
496083.4936800767
496082.8484617979
496082.65906167217
496082.9031651241
496083.53639996215
496084.4680373197
496085.59078211273
496086.830310256
496088.1045205294
496089.390346257

signal (11): Segmentation fault: 11

If I add an extra use of frame or turn off the GC like you suggested, I have no issues. Also, this gives no issues (expected, because the problem is only in interactive mode):

$ for _ in $(seq 1 100); do julia clean.jl > /dev/null; done

Which version of Julia are you using for the results you posted above? Might need to change your last statement to julia >= 1.5.

@Luthaf
Copy link
Member

Luthaf commented Jun 22, 2020

I spoke too soon ... I do also get the segfault with Julia 1.5, so we'll have to do something =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants