Scanning
The scanner is designed to be multi-scale, meaning that your workflow will stay the same regardless of whether you are working on 1 MB or 1 TB of data. SegyIO accomplishes this by being:
- Direct. Read only what needs to be read.
- Seamless. Remove the need to deal with complex filesystems.
- Performant. Read, write, and scan at disk speed.
A scanned volume provides a higher level of abstraction, removing the need for a user to directly manage individual files. Scanning a file (or a group of files) returns a SeisCon object, which contains the information needed to partition the volume into more manageable pieces and to access these partitions directly. By default, the scanner automatically partitions the volume whenever the source location changes.
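As a rough illustration of that default, the sketch below splits a list of per-trace source coordinates into contiguous blocks whenever the coordinates change. The function name `partition_by_source` and the toy coordinates are hypothetical; this shows the idea only and is not SegyIO's implementation.

```julia
# Hypothetical helper, not part of SegyIO: return the trace index ranges
# of contiguous runs that share the same (SourceX, SourceY) pair.
function partition_by_source(sx::Vector{<:Integer}, sy::Vector{<:Integer})
    length(sx) == length(sy) || throw(ArgumentError("sx and sy must have equal length"))
    blocks = UnitRange{Int}[]
    isempty(sx) && return blocks
    start = 1
    for i in 2:length(sx)
        # A new block starts whenever the source location changes
        if sx[i] != sx[i-1] || sy[i] != sy[i-1]
            push!(blocks, start:i-1)
            start = i
        end
    end
    push!(blocks, start:length(sx))   # close the final block
    return blocks
end

# Toy example: three shots with 3, 2, and 4 traces
sx = Int32[400, 400, 400, 600, 600, 800, 800, 800, 800]
sy = zeros(Int32, 9)
blocks = partition_by_source(sx, sy)
```

Each returned range covers the traces of one shot, which is the kind of unit a partition refers to in the rest of this page.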
The data/ directory contains 5 SEGY files generated from the Overthrust model, each containing roughly 20 shots. Scanning these 5 files into a SeisCon object will allow direct access to these shots without duplicating any data in memory.
julia> using SegyIO
julia> dir2scan = joinpath(SegyIO.myRoot,"data/")
".../SegyIO/data/"
julia> file_filter = "overthrust"
"overthrust"
julia> keys = ["GroupX"; "GroupY"]
2-element Array{String,1}:
"GroupX"
"GroupY"
julia> s = segy_scan(dir2scan, file_filter, keys);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
julia> length(s)
97
At the core of the SeisCon object, s, is a vector of BlockScan objects. Each BlockScan contains a metadata summary for one partition. Only the header fields defined in keys are summarized, along with the source coordinates SourceX and SourceY that appear in the output below.
The scanner automatically detected 97 shots, and created metadata summaries for all traces in each shot.
The first three fields contain the file path, the start byte, and the end byte of the partition. The last field is a dictionary containing the minimum and maximum value for each key within the partition.
julia> s.blocks[1]
SegyIO.BlockScan(".../SegyIO/data/overthrust_2D_shot_1_20.segy", 3600, 415588, Dict("SourceX"=>Int32[400, 400],"SourceY"=>Int32[0, 0],"GroupX"=>Int32[100, 6400],"GroupY"=>Int32[0, 0]))
julia> fieldnames(BlockScan)
4-element Array{Symbol,1}:
:file
:startbyte
:endbyte
:summary
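To make this layout concrete, here is a minimal sketch of a structure with the same four fields, together with a summary built from per-trace header values. `ToyBlock`, `summarize`, the file name, and the header values are all hypothetical; only the field names and the byte offsets are taken from the output above.

```julia
# Hypothetical stand-in mirroring the four BlockScan fields listed above
struct ToyBlock
    file::String
    startbyte::Int
    endbyte::Int
    summary::Dict{String,Vector{Int32}}
end

# Reduce per-trace header values to [minimum, maximum] for each key
function summarize(headers::Dict{String,Vector{Int32}})
    return Dict(k => Int32[minimum(v), maximum(v)] for (k, v) in headers)
end

# Made-up header values for a four-trace shot
headers = Dict(
    "SourceX" => Int32[400, 400, 400, 400],
    "GroupX"  => Int32[100, 200, 300, 6400],
)
block = ToyBlock("example.segy", 3600, 415588, summarize(headers))
block.summary["GroupX"]
```

Storing only the extremes keeps each summary small no matter how many traces the partition holds.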
get_sources and get_header can be used to inspect these summaries for an entire SeisCon object.
julia> src_locations = get_sources(s)
97×2 Array{Int32,2}:
400 0
600 0
800 0
⋮
19200 0
19400 0
19600 0
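The shape of that result can be reproduced from per-block summaries with a few lines of plain Julia. The sketch below assumes each summary is a dictionary of [min, max] pairs like the one shown earlier; `source_table` and the three toy summaries are hypothetical, and this is not a claim about how get_sources is written.

```julia
# Hypothetical summaries, one Dict per block, holding [min, max] per key
summaries = [
    Dict("SourceX" => Int32[400, 400], "SourceY" => Int32[0, 0]),
    Dict("SourceX" => Int32[600, 600], "SourceY" => Int32[0, 0]),
    Dict("SourceX" => Int32[800, 800], "SourceY" => Int32[0, 0]),
]

# One row per block: the SourceX and SourceY minimum stored in each summary.
# Within a single-shot block the minimum and maximum coincide.
function source_table(summaries)
    table = Matrix{Int32}(undef, length(summaries), 2)
    for (i, s) in enumerate(summaries)
        table[i, 1] = s["SourceX"][1]
        table[i, 2] = s["SourceY"][1]
    end
    return table
end

src = source_table(summaries)
```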
The data referenced in a BlockScan can be accessed by indexing the SeisCon object.
julia> d = s[1];
The combination of metadata summaries and direct access allows the user to easily perform operations that would otherwise be quite tedious. For example, let's load every trace where the source was located between gridpoints 5000 and 10000.
julia> d = s[findall(x -> 5000 <= x <= 10000, src_locations[:,1])]
By default, segy_scan distributes files queued for scanning to all workers in order to parallelize scanning.
julia> using Distributed
julia> addprocs(4)
4-element Array{Int64,1}:
2
3
4
5
julia> @everywhere using SegyIO
julia> dir2scan = joinpath(SegyIO.myRoot,"data/")
".../SegyIO/data/"
julia> file_filter = "overthrust"
"overthrust"
julia> keys = ["GroupX"; "GroupY"]
2-element Array{String,1}:
"GroupX"
"GroupY"
julia> s = segy_scan(dir2scan, file_filter, keys);
From worker 2: Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
From worker 4: Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
From worker 3: Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
From worker 5: Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
From worker 2: Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
The optional keyword argument pool can be set to define a different worker pool to distribute across.
julia> s = segy_scan(dir2scan, file_filter, keys, pool = WorkerPool(workers()[1:2]));
From worker 2: Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
From worker 3: Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
From worker 2: Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
From worker 3: Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
From worker 2: Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
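One common way to spread per-file work over a chosen pool in Julia is `pmap` from the Distributed standard library. The sketch below shows that pattern with a hypothetical stand-in for the per-file scan; the file names and `scan_one` are made up, and it is an assumption that segy_scan distributes its work in a similar way.

```julia
using Distributed

# Hypothetical stand-in for the per-file scan: returns the name and its length
files = ["a.segy", "bb.segy", "ccc.segy"]
scan_one = f -> (f, length(f))

# Restrict the work to a chosen pool. With no extra workers added,
# workers() is [1], so everything runs on the master process.
pool = WorkerPool(workers())

# pmap hands one file at a time to whichever pool member is free
# and returns the results in input order.
results = pmap(scan_one, pool, files)
```

Passing a smaller pool, such as `WorkerPool(workers()[1:2])` in the example above, leaves the remaining workers free for other tasks.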
To scan files larger than memory, the scanner reads chunksize MB of data into an IOBuffer at a time and processes that data in memory. This allows the scanner to scale to any file size, and has the added benefit of reducing the number of calls to read from the disk.
The default chunksize is 1024 MB, and can be modified using the chunksize keyword argument in segy_scan. Increasing chunksize yields moderate performance gains at the expense of increased peak memory.
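The chunked-read pattern described above can be sketched in plain Julia. The example reads a toy file of fixed-size records, pulling several records from disk per read and parsing them from an IOBuffer. `scan_chunked`, the 16-byte record layout, and the temporary file are hypothetical; values are written and read in native byte order, which is a simplification of real SEG-Y parsing.

```julia
# Hypothetical sketch: scan fixed-size records by pulling many records
# from disk in one read and parsing them from an in-memory IOBuffer.
function scan_chunked(path::AbstractString, recsize::Int, recs_per_chunk::Int)
    firsts = Int32[]   # first 4 bytes of every record, read as Int32
    nreads = 0         # number of reads issued against the file
    open(path, "r") do io
        while !eof(io)
            # One large read; the last chunk may be shorter
            buf = IOBuffer(read(io, recsize * recs_per_chunk))
            nreads += 1
            while !eof(buf)
                push!(firsts, read(buf, Int32))
                skip(buf, recsize - 4)   # jump over the rest of the record
            end
        end
    end
    return firsts, nreads
end

# Toy file: 10 records of 16 bytes, each starting with its own index
path, io = mktemp()
for i in 1:10
    write(io, Int32(i))
    write(io, zeros(UInt8, 12))
end
close(io)

firsts, nreads = scan_chunked(path, 16, 4)
rm(path)
```

A larger `recs_per_chunk` means fewer reads against the file but a larger buffer held in memory, which is the same trade-off that chunksize controls.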
julia> @time s = segy_scan(dir2scan, file_filter, keys);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
0.077360 seconds (384.77 k allocations: 5.013 GiB, 10.37% gc time)
julia> @time s = segy_scan(dir2scan, file_filter, keys, chunksize = 10);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
0.110804 seconds (385.64 k allocations: 103.844 MiB, 4.85% gc time)
The performance hit of using a small chunk size becomes more pronounced at scale. The timings below are from scanning a 330 GB file using a 1 GB and a 20 GB chunksize, respectively.
julia> @time s = segy_scan(pwd(), bigfile, ["GroupX"; "GroupY"]);
Scanning ... _____.sgy
739.929009 seconds (1.01 G allocations: 359.513 GiB, 6.06% gc time)
julia> @time s = segy_scan(pwd(), bigfile, ["GroupX"; "GroupY"], chunksize = 20*1024);
Scanning ... _____.sgy
691.861098 seconds (1.00 G allocations: 368.240 GiB, 1.63% gc time)
These timings show that chunksize should be set as large as possible for optimal performance. However, the performance hit taken when scanning files much larger than memory is acceptable: in the example above, the 20 GB chunksize was only about 6.5% faster than the 1 GB chunksize.