Storm Performance Experiment Tools

Introduction

This suite contains tools for running performance experiments against WASB (Windows Azure Storage Blob) and ADLS (Azure Data Lake Store) using Apache Storm. The Storm spout emits randomly generated records of a fixed size. The bolt writes each record to storage and ACKs the tuple. Once all executor threads complete their work, the topology is killed. Results for each worker process are stored on the node where that worker ran.

Prerequisites/Setup

  1. Create an HDInsight cluster of the desired size.

  2. Fork and clone this repository so you have a local copy.

  3. Install the following jar from the /lib directory of this repo into your local Maven repository:

    mvn -q install:install-file -Dfile=lib/eventhubs-storm-spout-0.9.4-jar-with-dependencies.jar -DgroupId=com.microsoft.eventhubs -DartifactId=eventhubs-storm-spout -Dversion=0.9.4 -Dpackaging=jar

  4. Build from root folder: mvn clean package

Usage

Submit topology as follows:

storm jar target/org.apache.storm.hdfs.writebuffertest-0.1.jar org.apache.storm.hdfs.WriteTopology

Required parameters:
  -workers,-w                    Number of worker processes on the cluster
  -recordSize,-x                 Size of the record generated by spout (bytes)
  -spoutParallelism,-s           Number of spout executor threads across all workers
  -numTasksSpout,-e              Number of spout tasks across all workers
  -numAckers,-k                  Number of ackers
  -boltParallelism,-b            Number of bolt executor threads across all workers
  -numTasksBolt,-t               Number of bolt tasks across all workers
  -fileRotationSize,-f           File size at which the current file is rotated and a new file is started
  -fileBufferSize,-z             Client-side buffer size. Messages are buffered up to this size before being flushed to storage.
  -numRecords,-n                 Number of records written by each bolt instance
  -maxSpoutPending,-p            Maximum number of records in flight in the topology that are still awaiting ACKs.
  -topologyName,-y               Name of the topology.
  -storageUrl,-u                 URL to WASB/ADLS Storage Endpoint
  -storageFileDirPath,-r         Relative path within the storage account. E.g. "/pathToDir/".
  
Optional parameters:
   -sizeSyncPolicyEnabled,-v     Enable size sync policy. When this is active, data is flushed only when fileBufferSize is reached.
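
Before picking parameter values, it can help to estimate how much data a run will write. The sketch below multiplies recordSize, numRecords, and numTasksBolt. It assumes that "each bolt instance" in the numRecords description means each bolt task, which this README does not state explicitly, and it uses the values from the sample commands in this README. The function name estimate_bytes is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of the total bytes written by one run.
# Assumption: numRecords applies per bolt task (one "bolt instance" = one task).
estimate_bytes() {
  # $1 = recordSize (bytes), $2 = numRecords, $3 = numTasksBolt
  echo $(( $1 * $2 * $3 ))
}

# Values from the sample commands: 100-byte records, 10000000 records, 64 bolt tasks.
total_bytes=$(estimate_bytes 100 10000000 64)
echo "Estimated bytes written: $total_bytes"
```

With the sample values this comes to 64000000000 bytes, or about 64 GB across all bolt tasks, if the per-task assumption holds.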

Running the experiment

On WASB:

storm jar target/org.apache.storm.hdfs.writebuffertest-0.1.jar org.apache.storm.hdfs.WriteTopology -workers 2 \
-recordSize 100 -spoutParallelism 8 -numTasksSpout 8 -numAckers 8 -boltParallelism 64 -numTasksBolt 64 \
-fileRotationSize 100 -numRecords 10000000 -maxSpoutPending 1000 \
-topologyName $topologyName -storageUrl "wasb://$clusterContainer@$storageAccountName.blob.core.windows.net" \
-storageFileDirPath $storageDirectory

On ADLS:

storm jar target/org.apache.storm.hdfs.writebuffertest-0.1.jar org.apache.storm.hdfs.WriteTopology -workers 2 \
-recordSize 100 -spoutParallelism 8 -numTasksSpout 8 -numAckers 8 -boltParallelism 64 -numTasksBolt 64 \
-fileRotationSize 100 -numRecords 10000000 -maxSpoutPending 1000 \
-topologyName $topologyName -storageUrl "adl://$storageAccountName.azuredatalakestore.net" \
-storageFileDirPath $storageDirectory
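
The two commands above differ only in the storage URL, and both rely on shell variables that must be set beforehand. The following is a minimal sketch of one way to set those variables and build the command for either backend. The values assigned to the variables are placeholders, and the helper names storage_url and submit_command are made up for illustration. The script only prints the command (a dry run) so it can be checked before submitting.

```shell
#!/bin/sh
# Hypothetical placeholder values, not taken from this README; replace with your own.
topologyName="perf-test-1"
clusterContainer="mycontainer"
storageAccountName="mystorageaccount"
storageDirectory="/perftest/"

# Build the storage URL for a backend: "wasb" or "adl".
storage_url() {
  case "$1" in
    wasb) echo "wasb://${clusterContainer}@${storageAccountName}.blob.core.windows.net" ;;
    adl)  echo "adl://${storageAccountName}.azuredatalakestore.net" ;;
    *)    echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Print the submit command for a backend without running it (dry run).
# The parameter values mirror the sample commands in this README.
submit_command() {
  url=$(storage_url "$1") || return 1
  echo "storm jar target/org.apache.storm.hdfs.writebuffertest-0.1.jar" \
    "org.apache.storm.hdfs.WriteTopology -workers 2" \
    "-recordSize 100 -spoutParallelism 8 -numTasksSpout 8 -numAckers 8" \
    "-boltParallelism 64 -numTasksBolt 64" \
    "-fileRotationSize 100 -numRecords 10000000 -maxSpoutPending 1000" \
    "-topologyName $topologyName -storageUrl \"$url\"" \
    "-storageFileDirPath $storageDirectory"
}

submit_command wasb
submit_command adl
```

Once the printed command looks right, it can be copied into the shell on the cluster and run as is.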

Analyzing Results

Results for each run are stored under the /tmp folder on the worker nodes. The name of the results file is the topology name specified in the input arguments.
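
Because each worker node holds its own results file, the files have to be gathered from every node before they can be compared. The sketch below prints one scp command per node as a dry run. The SSH user, the node hostnames, the local ./results directory, and the function name fetch_commands are all assumptions for illustration. Only the /tmp location and the file name (the topology name) come from this README.

```shell
#!/bin/sh
# Hypothetical values; replace with your topology name, SSH user, and worker hostnames.
topologyName="perf-test-1"
sshUser="sshuser"
workerNodes="wn0-example wn1-example"

# Print one scp command per worker node (dry run; nothing is copied).
# Each result lands in ./results with the node name appended to avoid collisions.
fetch_commands() {
  for node in $workerNodes; do
    echo "scp ${sshUser}@${node}:/tmp/${topologyName} ./results/${topologyName}.${node}"
  done
}

fetch_commands
```

Appending the node name to the local file keeps results from different workers from overwriting each other, since every node uses the same file name.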