Scanning

Introduction

The scanner is designed to be multi-scale, meaning that your workflow stays the same whether you are working with 1 MB or 1 TB of data. SegyIO accomplishes this by being:

Direct. Read only what needs to be read.
Seamless. Remove the need to deal with complex filesystems.
Performant. Read, write, and scan at disk speed.

A scanned volume provides a higher level of abstraction, removing the need for a user to directly manage individual files. Scanning a file (or a group of files) returns a SeisCon object, which contains the information needed to partition the volume into more manageable pieces and to access these partitions directly. By default, the scanner starts a new partition whenever the source location changes.
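The default partitioning rule can be illustrated with a small sketch. This is not SegyIO's implementation; `partition_by_source` and the toy coordinate list are hypothetical, and the sketch only shows the idea of closing a partition each time the source position changes between consecutive traces.

```julia
# Split a sequence of per-trace source coordinates into index ranges,
# starting a new range whenever the coordinates change.
# (Hypothetical helper, not part of SegyIO.)
function partition_by_source(src::Vector{Tuple{Int,Int}})
    parts = UnitRange{Int}[]
    isempty(src) && return parts
    start = 1
    for i in 2:length(src)
        if src[i] != src[i-1]        # source moved: close the current partition
            push!(parts, start:i-1)
            start = i
        end
    end
    push!(parts, start:length(src))  # close the final partition
    return parts
end

# Toy input: three traces from one shot, two from the next, one from a third.
src = [(400, 0), (400, 0), (400, 0), (600, 0), (600, 0), (800, 0)]
parts = partition_by_source(src)     # three partitions: 1:3, 4:5, 6:6
```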

The data/ directory contains 5 SEGY files generated from the Overthrust model, each containing roughly 20 shots. Scanning these files into a SeisCon object allows direct access to these shots without duplicating any data in memory.

julia> using SegyIO

julia> dir2scan = joinpath(SegyIO.myRoot,"data/")
".../SegyIO/data/"

julia> file_filter = "overthrust"
"overthrust"

julia> keys = ["GroupX"; "GroupY"]
2-element Array{String,1}:
 "GroupX"
 "GroupY"

julia> s = segy_scan(dir2scan, file_filter, keys);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy

julia> length(s)
97

SeisCon Objects

At the core of the SeisCon object, s, is a vector of BlockScan objects. Each BlockScan holds the metadata summary for one partition. Only the header fields defined in keys are summarized, along with the SourceX and SourceY fields that appear in the example output below.

The scanner automatically detected 97 shots, and created metadata summaries for all traces in each shot.

The first three fields contain the file path, the start byte, and end byte of the partition. The last field is a dictionary containing the minimum and maximum value for each key within the partition.

julia> s.blocks[1]
SegyIO.BlockScan(".../SegyIO/data/overthrust_2D_shot_1_20.segy", 3600, 415588, Dict("SourceX"=>Int32[400, 400],"SourceY"=>Int32[0, 0],"GroupX"=>Int32[100, 6400],"GroupY"=>Int32[0, 0]))

julia> fieldnames(BlockScan)
4-element Array{Symbol,1}:
 :file     
 :startbyte
 :endbyte  
 :summary
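To make the shape of a block summary concrete, here is a minimal sketch that builds a record with the same four field names from toy header values. The named tuple, the `summarize` helper, and the file name are assumptions made for illustration; this is not SegyIO's BlockScan type or the way the scanner computes it.

```julia
# Reduce per-trace header values to [min, max] for each key.
# (Hypothetical helper, not part of SegyIO.)
function summarize(headers::Dict{String,Vector{Int32}})
    return Dict(k => Int32[minimum(v), maximum(v)] for (k, v) in headers)
end

# Toy header values for the traces of one partition.
headers = Dict(
    "SourceX" => Int32[400, 400, 400],
    "GroupX"  => Int32[100, 3200, 6400],
)

# A stand-in record with the same field names as listed above.
b = (file = "example.segy", startbyte = 3600, endbyte = 415588,
     summary = summarize(headers))

b.summary["GroupX"]    # Int32[100, 6400]
```

Because only the minimum and maximum of each key are stored, a summary stays small no matter how many traces the partition holds.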

get_sources and get_header can be used to inspect these summaries for an entire SeisCon object.

julia> src_locations = get_sources(s)
97×2 Array{Int32,2}:
  400  0
  600  0
  800  0
    ⋮
19200  0
19400  0
19600  0

The data referenced in a BlockScan can be accessed by indexing the SeisCon object.

julia> d = s[1];

The combination of metadata summaries and direct access allows the user to easily perform operations that would otherwise be quite tedious. For example, let's load every trace where the source was located between gridpoints 5000 and 10000.

julia> d = s[findall(x -> 5000 <= x <= 10000, src_locations[:,1])]
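The selection step can be tried without any SEGY data. The sketch below builds a stand-in for the source-location matrix, assuming the 97 source positions are evenly spaced from 400 to 19600 as the truncated output above suggests, and uses `findall` (the Julia 1.x counterpart of `find`) to pick the matching rows.

```julia
# Assumed stand-in for the matrix returned by get_sources: one row per shot,
# columns are source x and y, with x running from 400 to 19600 in steps of 200.
xs = collect(Int32, 400:200:19600)
src_locations = hcat(xs, zeros(Int32, length(xs)))

# Indices of the shots whose source x lies in [5000, 10000].
idx = findall(x -> 5000 <= x <= 10000, src_locations[:, 1])

# Under the even-spacing assumption this selects shots 24 through 49.
first(idx), last(idx), length(idx)    # (24, 49, 26)
```

Indexing the SeisCon object with `idx` would then read only those partitions from disk.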

Distributed Scanning

By default, segy_scan distributes files queued for scanning to all workers in order to parallelize scanning.

julia> using Distributed

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere using SegyIO

julia> dir2scan = joinpath(SegyIO.myRoot,"data/")
".../SegyIO/data/"

julia> file_filter = "overthrust"
"overthrust"

julia> keys = ["GroupX"; "GroupY"]
2-element Array{String,1}:
 "GroupX"
 "GroupY"

julia> s = segy_scan(dir2scan, file_filter, keys);
    From worker 2:  Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
    From worker 4:  Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
    From worker 3:  Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
    From worker 5:  Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
    From worker 2:  Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy

The optional keyword argument pool can be set to define a different worker pool to distribute across.

julia> s = segy_scan(dir2scan, file_filter, keys, pool = WorkerPool(workers()[1:2]));
    From worker 2:  Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
    From worker 3:  Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
    From worker 2:  Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
    From worker 3:  Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
    From worker 2:  Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
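The file-per-worker pattern shown above can be sketched with the standard Distributed library alone. This is an illustration of the general technique, not SegyIO's internals: the file names are hypothetical, and the closure only records which process handled each file instead of scanning it.

```julia
using Distributed

# Hypothetical file list; the real scanner discovers files from a directory
# and a filter string.
files = ["shots_1.segy", "shots_2.segy", "shots_3.segy"]

# Use every available worker. With no workers added, workers() is [1] and
# the work simply runs on the master process.
pool = WorkerPool(workers())

# pmap hands each file to the next free worker in the pool. The closure only
# records which process handled which file; a real scan would read headers.
assignments = pmap(f -> (file = f, worker = myid()), pool, files)

for a in assignments
    println("worker ", a.worker, " handled ", a.file)
end
```

Passing a smaller pool, such as `WorkerPool(workers()[1:2])`, restricts the work to those processes, which is the same mechanism the pool keyword exposes.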

Chunking

To scan files larger than memory, the scanner reads chunksize MB of data at a time into an IOBuffer and processes it in memory. This allows the scanner to scale to any file size, and has the added benefit of reducing the number of read calls made to the disk.

The default chunksize is 1024 MB, and can be modified using the chunksize keyword argument in segy_scan. Increasing chunksize yields moderate performance gains at the expense of increased peak memory use.
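The chunking idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the scanner's actual code: `scan_in_chunks` is a hypothetical function, the records use a toy fixed-size layout with an Int32 key at the start (not the SEG-Y trace format), and an in-memory buffer stands in for the file on disk.

```julia
# Scan fixed-size records from `io`, holding at most `chunk_records` of them
# in memory at once. Each record starts with an Int32 key (an assumed toy
# layout, not the SEG-Y trace format).
function scan_in_chunks(io::IO, record_bytes::Int, chunk_records::Int)
    lo, hi, nrecords = typemax(Int32), typemin(Int32), 0
    while !eof(io)
        # One large read from the source per chunk ...
        buf = IOBuffer(read(io, record_bytes * chunk_records))
        # ... then many small reads served from memory.
        while !eof(buf)
            start = position(buf)
            key = read(buf, Int32)
            lo, hi = min(lo, key), max(hi, key)
            nrecords += 1
            seek(buf, start + record_bytes)   # skip the rest of the record
        end
    end
    return (min = lo, max = hi, nrecords = nrecords)
end

# Build a small in-memory "file" of 10 records, 16 bytes each.
data = IOBuffer()
for k in Int32(1):Int32(10)
    write(data, k)                 # 4-byte key
    write(data, zeros(UInt8, 12))  # 12 bytes of padding
end
seekstart(data)

result = scan_in_chunks(data, 16, 3)   # 3 records per chunk
```

Peak memory is bounded by the chunk size rather than the file size, while a larger chunk means fewer reads from the underlying source.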

julia> @time s = segy_scan(dir2scan, file_filter, keys);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
  0.077360 seconds (384.77 k allocations: 5.013 GiB, 10.37% gc time)

julia> @time s = segy_scan(dir2scan, file_filter, keys, chunksize = 10);
Scanning ... .../SegyIO/data/overthrust_2D_shot_1_20.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_21_40.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_41_60.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_61_80.segy
Scanning ... .../SegyIO/data/overthrust_2D_shot_81_97.segy
  0.110804 seconds (385.64 k allocations: 103.844 MiB, 4.85% gc time)

The performance hit of using a small chunk size becomes more pronounced at scale. The timings below are from scanning a 330 GB file using a 1 GB and a 20 GB chunksize respectively.

julia> @time s = segy_scan(pwd(), bigfile, ["GroupX"; "GroupY"]);
Scanning ... _____.sgy
739.929009 seconds (1.01 G allocations: 359.513 GiB, 6.06% gc time) 

julia> @time s = segy_scan(pwd(), bigfile, ["GroupX"; "GroupY"], chunksize = 20*1024);
Scanning ... _____.sgy
691.861098 seconds (1.00 G allocations: 368.240 GiB, 1.63% gc time)

These timings show that chunksize should be set as large as memory allows for the best performance. However, the penalty for a smaller chunksize is modest (roughly 6% in the timings above), so the performance hit taken when scanning files much larger than memory is acceptable.
