description of dataset sizes
wkliao committed Jan 31, 2020
1 parent 445e198 commit 5d395ab
Showing 1 changed file with 37 additions and 34 deletions.
71 changes: 37 additions & 34 deletions README.md
@@ -1,27 +1,29 @@
# Parallel Data Concatenation for High Energy Physics Data Analysis
# Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

This software package contains C++ programs for concatenating multiple HDF5
files into a single one by appending individual datasets one after another.
This software package contains C++ programs for concatenating HDF5 datasets
across multiple files into a single file by appending individual datasets one
after another.

## Input HDF5 Files
* Each file contains multiple groups, each representing a "relational database
table".
* Each group contains multiple datasets, each representing a column of the
* Each group contains multiple datasets. The number of datasets in a group can
be different from others. Each dataset can be considered as a column of the
database table.
* Datasets in the same group are 2D arrays sharing the same size of 1st
dimension (most significant). The size of 2nd dimension may be different.
* Some of the datasets are actually 1D arrays whose 2nd dimension if of size 1.
* Datasets can be of size zero, i.e. either dimension is of size 0.
* All the files have the same "schema", i.e. same structure of groups and
datasets.
* A dataset in an input file may be of different 1st dimension size from the
one in other files, while the 2nd dimension should be of the same size
across files.
(most significant) dimension. The 2nd dimension size may be different.
* Some of the datasets are actually 1D arrays whose 2nd dimension is of size 1.
* Datasets can be of size zero, i.e. the 1st dimension being of size 0.
* All the files have the same "schema", i.e. same numbers of groups and
datasets with the same names.
* The size of 1st dimension of a dataset in an input file may be different from
the dataset with the same name in other files. The 2nd dimension should be of
the same size across all input files.
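
The size rules above imply a simple shape calculation for each concatenated dataset: the 1st dimension sizes add up across input files, while the 2nd dimension must match everywhere. The sketch below only illustrates that rule and is not taken from the `ph5_concat` sources; the `Dims2D` type and the `concat_dims` function are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper type: the shape of one 2D dataset in one input file.
struct Dims2D {
    std::size_t rows;  // 1st (most significant) dimension, may differ per file
    std::size_t cols;  // 2nd dimension, expected to be identical in all files
};

// Compute the shape of a dataset after appending the same-named dataset
// from every input file. Zero-size datasets (rows == 0) contribute nothing.
Dims2D concat_dims(const std::vector<Dims2D>& per_file) {
    if (per_file.empty())
        throw std::invalid_argument("no input files");

    Dims2D out{0, per_file.front().cols};
    for (const Dims2D& d : per_file) {
        if (d.cols != out.cols)
            throw std::invalid_argument("2nd dimension differs across files");
        out.rows += d.rows;  // 1st dimension sizes accumulate
    }
    return out;
}
```

For example, three files holding the same dataset with shapes 3x2, 0x2 and 5x2 would yield a concatenated dataset of shape 8x2 under these assumptions.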

## Software Requirements
* A C++ compiler that supports ISO C++0x standard or higher
* MPI C and C++ compilers
* An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled
* An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled

## Instructions to Build
0. If building from a git clone of this repository, then run command below first.
@@ -42,7 +44,7 @@ files into a single one by appending individual datasets one after another.
2. Run command "make" to create the executable file named "ph5_concat"

## Command to Run
* Run command and command-line options are:
* Command-line options are:
```
mpiexec -n <np> ./ph5_concat [-h|-q|-d|-r|-s|-p|-x] [-t num] [-m size] [-k name] [-z level] [-b size] [-o outfile] [-i infile]
@@ -84,7 +86,27 @@ files into a single one by appending individual datasets one after another.
read by all processes collectively (i.e. shared-file reads) and then all
processes collectively write to the output file.

## An example output shown on screen from a run on Cori using 128 MPI processes.
## Sample input and output files
* There are four sample input files provided in folder `examples`.
+ examples/sample_input_1.h5
+ examples/sample_input_2.h5
+ examples/sample_input_3.h5
+ examples/sample_input_4.h5
* Sample run commands
```
mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
```
The output shown on screen is stored in `examples/sample_stdout.txt`.
* Sample output files
+ The output files from concatenating the 4 sample files are available in
`examples/sample_output.h5` whose metadata dumped from command below is
also available in `examples/sample_output.metadata`.
```
h5dump -Hp sample_output.h5
```
## An example timing output from a run on Cori using 128 MPI processes.
```
% srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt -x

@@ -141,25 +163,6 @@ files into a single one by appending individual datasets one after another.
Close output files total: 0.4799
End-to-end: 314.8095
```
## Sample input and output files
* There are four sample input files provided in folder `examples`.
+ examples/sample_input_1.h5
+ examples/sample_input_2.h5
+ examples/sample_input_3.h5
+ examples/sample_input_4.h5
* Sample run commands
```
mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
```
The output shown on screen is stored in `examples/sample_stdout.txt`.
* Sample output files
+ The output files from concatenating the 4 sample files are available in
`examples/sample_output.h5` whose metadata dumped from command below is
also available in `examples/sample_output.metadata`.
```
h5dump -Hp sample_output.h5
```
## Questions/Comments:
* Sunwoo Lee <[email protected]>