A prototype framework to benchmark Spark applications with multiple parameters.
- Install Python 3.8 or a newer version.
- Install required Python packages:
pip install -r python/benchmark/requirements.txt
- Download Spark on the machine where you want to start the Spark cluster.
- Copy the application you want to benchmark to the machine where the cluster is going to be deployed.
- Create the benchmark and cluster configuration file (config.json) and the application parameters configuration file (parameters.json). For examples, see python/examples/ or python/examples/CountWord.
- Run the benchmark executor with these configuration files:
python python/executor.py -c python/examples/config.json -p python/examples/parameters.json
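
The schema of parameters.json is defined by the files in python/examples/ and is not reproduced here. As a rough illustration of what benchmarking with multiple parameters involves, the sketch below expands a parameter grid into one run per combination. The parameter names (executor_memory, num_partitions), the grid layout, and the helper expand_grid are assumptions made for illustration and are not part of this framework.

```python
import itertools
import json


def expand_grid(grid):
    """Return one dict per combination of values in a {name: [values]} grid."""
    names = sorted(grid)  # fixed order so the result is reproducible
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical grid; the real parameters.json may be structured differently.
example = json.loads(
    '{"executor_memory": ["1g", "2g"], "num_partitions": [4, 8, 16]}'
)

runs = expand_grid(example)
for run in runs:
    print(run)
```

With two memory settings and three partition counts, the grid yields six runs, so the number of benchmark executions grows multiplicatively with every parameter added.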
As an example application, you can use CountWord from this repository.
- Create a fat jar from the application:
CountWord-gradle/gradlew shadowJar
- Copy the fat jar to the Spark cluster:
cp CountWord-gradle/app/build/libs/ParametrizableCountWord.jar <spark-cluster-path>
- Copy bible.txt to the Spark cluster:
cp bible.txt <spark-cluster-path>
- Adapt python/examples/CountWord/config.json and python/examples/CountWord/parameters.json to your Spark cluster.
- Run the benchmark:
python python/executor.py -c python/examples/CountWord/config.json -p python/examples/CountWord/parameters.json
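
The final step can be wrapped in a small pre-flight script that checks the required files exist before starting the executor. This is a minimal sketch, assuming it is run from the repository root. The helpers build_command, missing_files, and run_benchmark are hypothetical names and are not part of this framework. Only the -c and -p options and the file paths come from the steps above.

```python
import subprocess
import sys
from pathlib import Path

EXECUTOR = "python/executor.py"  # path taken from the steps above


def build_command(config, parameters, executor=EXECUTOR):
    """Build the executor command line using the current Python interpreter."""
    return [sys.executable, executor, "-c", str(config), "-p", str(parameters)]


def missing_files(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def run_benchmark(config, parameters):
    """Check the inputs, then run the executor; raises if it exits non-zero."""
    missing = missing_files(EXECUTOR, config, parameters)
    if missing:
        raise FileNotFoundError("Missing file(s): " + ", ".join(missing))
    return subprocess.run(build_command(config, parameters), check=True)


# Example call (commented out so that importing this file has no side effects):
# run_benchmark("python/examples/CountWord/config.json",
#               "python/examples/CountWord/parameters.json")
```

Failing early on a missing configuration file avoids starting the executor with inputs it cannot read.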