This project demonstrates the use of Apache Spark for scalable and efficient post-processing of molecular dynamics (MD) simulation data. It focuses on calculating the Mass Accommodation Coefficient (MAC) to analyze phase-change phenomena such as evaporation and boiling.
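The MAC is commonly defined as the fraction of vapor molecules striking the liquid surface that are absorbed rather than reflected. A toy calculation of that ratio (the counts below are illustrative placeholders, not values from this repository's datasets):

```python
# Toy MAC calculation. The counts are illustrative placeholders,
# not results from this repository's simulations.
n_incident = 1000   # molecules that struck the liquid-vapor interface
n_absorbed = 620    # molecules absorbed into the liquid phase

# Mass accommodation coefficient: absorbed / incident
mac = n_absorbed / n_incident
print(mac)  # 0.62
```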
- Parallel processing of large MD trajectory datasets with PySpark (the Python API for Apache Spark).
- Performance benchmarking against a sequential implementation.
- Support for periodic boundary conditions and large-scale molecule trajectories.
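Handling periodic boundary conditions typically means mapping raw coordinate differences back into the simulation box. A minimal sketch of the minimum-image convention, assuming an orthorhombic box (the box length here is illustrative):

```python
# Minimum-image displacement under periodic boundary conditions.
# Assumes an orthorhombic box; the box length below is illustrative.
def minimum_image(dx: float, box_length: float) -> float:
    """Map a raw coordinate difference into [-L/2, L/2)."""
    return dx - box_length * round(dx / box_length)

# A molecule that appears to jump 9.2 units in a 10.0-unit box
# actually moved -0.8 units across the periodic boundary.
disp = minimum_image(9.2, 10.0)
print(disp)  # approximately -0.8
```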
MD simulations generate vast datasets whose post-processing demands significant computational resources. This project leverages Spark's in-memory distributed processing to reduce runtime and improve efficiency.
- Sequential Code: A MATLAB implementation serving as the performance baseline.
- Parallel Code: A PySpark implementation that distributes the analysis across Spark workers.
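For comparison, the sequential baseline (MATLAB in the repository) amounts to a plain loop over the same records; this Python sketch uses the same assumed record layout and interface height as above:

```python
# Hypothetical sequential counterpart to the parallel pipeline:
# a single loop over trajectory records (frame_id, molecule_id, z_position).
records = [(0, 1, 2.5), (0, 2, 7.1), (1, 1, 6.8), (1, 2, 7.5)]
Z_INTERFACE = 5.0  # assumed interface height

counts = {}
for frame_id, _mol_id, z in records:
    if z > Z_INTERFACE:
        counts[frame_id] = counts.get(frame_id, 0) + 1
print(counts)  # {0: 1, 1: 2}
```

The loop body is identical work; the difference is that Spark shards the records across cores or nodes instead of visiting them one at a time.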
- Achieved up to a 4x runtime speedup over the sequential baseline with optimized code.
- Benchmarked performance on datasets of up to 3 GB.
- Performance was limited by the lack of an HDFS cluster and reliance on the single-node Databricks Community Edition.
- Clone the repository:
git clone https://github.com/swargo98/Scalable-Computation-of-Molecular-Trajectory-Analysis.git
- Install dependencies:
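The exact dependency list is not stated here; at minimum the parallel code needs PySpark, which can be installed with pip (check the repository for an authoritative list):

```shell
# Assumed dependencies -- consult the repository for the full list.
pip install pyspark numpy
```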
- Adjust the sbatch file (paths, partition, resources) for your cluster before submitting the job.
- Submit the job with your trajectory data using:
sbatch your_script.sbatch
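A minimal sbatch script might look like the following; the job name, resource sizes, script name, and data path are all placeholders to adapt to your cluster, not values from the repository:

```shell
#!/bin/bash
#SBATCH --job-name=md-postproc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=md_postproc_%j.log

# Placeholder script and data path -- adjust to your setup.
module load python   # if your cluster uses environment modules

# Run Spark locally on the allocated cores of this node.
spark-submit --master "local[${SLURM_CPUS_PER_TASK}]" \
    parallel_code.py /path/to/trajectory_data
```

Running Spark in `local[N]` mode mirrors the single-node setup described above; a multi-node Spark cluster would instead point `--master` at a cluster manager.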