Below we describe the methodology used to benchmark CLP and other tools.
There are two types of logs used in the benchmark:
- Unstructured logs are log events that don't follow a predefined format or structure, making them difficult to parse or analyze automatically. Unstructured logs are often just text interspersed with variable values such as the log event's timestamp, log level, usernames, and so on. For example:

  ```
  2018-06-05 06:15:29,701 DEBUG org.apache.hadoop.util.Shell: setsid exited with exit code 0
  ```
- Semi-structured logs do follow a predefined format, making them parseable, but their schema is not rigid. Semi-structured log events usually have a few consistent fields (e.g., the event's timestamp and log level), but the remaining fields may appear or disappear from one event to the next. For example:

  ```json
  { "t": { "$date": "2023-03-21T23:34:54.576-04:00" }, "s": "I", "c": "NETWORK", "id": 4915701, "ctx": "-", "msg": "Initialized wire specification", "attr": { "spec": { "incomingExternalClient": { "minWireVersion": 0, "maxWireVersion": 17 }, "incomingInternalClient": { "minWireVersion": 0, "maxWireVersion": 17 }, "outgoing": { "minWireVersion": 6, "maxWireVersion": 17 }, "isInternalClient": true } } }
  ```
The unstructured log dataset, hadoop-258GB, contains log events generated by Hadoop while running workloads from the HiBench Benchmark Suite on three Hadoop clusters, each containing 48 data nodes. The dataset is a 258 GiB subset (specifically, workers 1-3) of a larger 14 TiB dataset.
The semi-structured log dataset, mongodb, contains log events generated by MongoDB when running YCSB workloads A-E repeatedly. The dataset is 64.8 GiB and contains 186,287,600 log events.
Where we report benchmark results, we use a Linux server with an Intel Xeon E5-2630 v3 processor and 128 GB of DDR4 memory. Both the uncompressed and compressed logs are stored on a 7,200 RPM SATA HDD.
We test two possible runtime scenarios for each tool:
- Hot Run: we ingest the test dataset and then immediately run the query benchmark.
  - TODO: In the future, we'll warm up the system by running the benchmark until the results remain consistent, and then collect results.
- Cold Run: we ingest the data, restart the system, and then run the query benchmark.
  - TODO: In the future, we plan to implement a more comprehensive approach by clearing the OS's caches to simulate a completely cold environment.
Ingestion time: this measures the time taken to ingest the data. Smaller values are better, indicating faster ingestion.
Compressed size: this measures the on-disk size of the data, post-ingestion, in the tool being tested. We measure it using:

```shell
du <data> -bc
```

Smaller values are better, indicating higher compression.
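As a cross-check, the figure reported by `du` can be approximated in plain Python by summing the apparent size of every file under the data directory. This sketch is for illustration only and is not part of clp-bench (it ignores the small amount of space `du` attributes to directory entries themselves):

```python
import os

def dir_bytes(root: str) -> int:
    """Sum the apparent size (in bytes) of every file under `root`,
    roughly the total that `du <root> -bc` reports."""
    total = 0
    for dirpath, _subdirs, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```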
Memory usage: this measures average memory usage, separately, for the ingestion and query stages. Smaller values are better, indicating lower resource usage.
There are two collection methods, depending on how the tool being tested runs:
- For tools running as multiple microservices (e.g., Loki), we use `docker stats` to poll the total memory usage of all related containers, then average the results.
- For other tools, we use `ps` to poll the `RSS` (resident set size) field for all related processes, then average the results.
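The `ps`-based method can be sketched as follows; the helper names and the canned `ps aux` output are illustrative, not clp-bench's actual code:

```python
"""Sketch of the memory metric: parse `ps aux` output, sum the RSS of
matching processes per poll, then average across polls."""

def rss_kib_for(ps_output: str, needle: str) -> int:
    """Sum the RSS column (KiB, field 6 of `ps aux`) over every process
    whose command line contains `needle`."""
    total = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)
        if len(fields) >= 11 and needle in fields[10]:
            total += int(fields[5])
    return total

def average_rss(samples: list[int]) -> float:
    """Average the per-poll RSS sums collected over one benchmark stage."""
    return sum(samples) / len(samples) if samples else 0.0

# Example with canned `ps aux` output (two matching glt processes):
PS = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 10 0.0 0.1 1000 2048 ? S 10:00 0:00 /opt/glt c /data /logs
root 11 0.0 0.1 1000 1024 ? S 10:00 0:00 /opt/glt s /data q
root 12 0.0 0.1 1000 4096 ? S 10:00 0:00 /bin/bash"""
print(rss_kib_for(PS, "/opt/glt"))  # 3072
print(average_rss([3072, 1024]))    # 2048.0
```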
Query latency: this measures the time taken to completely execute a query. Smaller values are better, indicating faster query performance.
The benchmark currently tests the following tools:
- For unstructured logs:
  - CLP. Specifically, the `glt` binary.
  - Elasticsearch.
  - Loki.
  - `grep`.
- For semi-structured logs:
  - CLP. Specifically, the `clp-s` binary.
  - Elasticsearch.
Each tool requires different setup and configuration steps to run the benchmark.
To begin, download the latest released binary, or clone the most recent code and compile it locally by following these instructions. Note that the benchmark expects that the tools are run in Docker containers, so you should follow the previous instructions inside a Docker container.
We use the `glt` binary for CLP (unstructured logs) and `clp-s` for CLP-S (semi-structured logs).
For CLP, start by creating a YAML configuration file like the example below:

```yaml
system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
glt:
  container_id: xiaochong-clp-oss-jammy-xiaochong
  binary_path: /home/xiaochong/develop/clp-bench/tests/clp-unstructured/glt-local-compiled
  data_path: /home/xiaochong/develop/clp-bench/tests/clp-unstructured/offline-mode-data
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/hadoop-small
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'
```
Note that in this configuration (which also applies to the following benchmark targets):
- `data_path` specifies where the tool will store ingested data.
- `dataset_path` refers to the location of the data that is ready to be ingested.
With the YAML file configured and the container running, you can execute the following command to run the benchmarks:

```shell
clp-bench -t GLT -m {mode} -c {path-to-yaml}
```
Here, `mode` can be set to `hot`, `cold`, or `query-only` (which also applies to the following benchmark targets). For more details, run `clp-bench --help`.
No preprocessing is needed for the raw data. During data ingestion, `clp-bench` will execute:

```shell
docker exec {container_id} {binary_path} c {data_path} {dataset_path}
```
For query benchmarking, `clp-bench` runs the following command for each query:

```shell
docker exec {container_id} {binary_path} s {data_path} {query}
```
To verify the results, `clp-bench` appends `| wc -l` to count the matched log lines, ensuring the tool correctly identifies matches during the query process (which also applies to the following benchmark targets).
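For illustration, the ingestion and query command lines above could be assembled as follows; the helper names are hypothetical, not clp-bench's actual API:

```python
import shlex

def ingest_command(container_id: str, binary_path: str,
                   data_path: str, dataset_path: str) -> str:
    """Build the ingestion command shown in the docs."""
    return f"docker exec {container_id} {binary_path} c {data_path} {dataset_path}"

def query_command(container_id: str, binary_path: str,
                  data_path: str, query: str) -> str:
    """Build the query command, with `| wc -l` appended so the number of
    matched log lines can be verified."""
    return (f"docker exec {container_id} {binary_path} s {data_path} "
            f"{shlex.quote(query)} | wc -l")

print(query_command("clp", "/usr/bin/glt", "/data", '" abcde "'))
# docker exec clp /usr/bin/glt s /data '" abcde "' | wc -l
```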
For memory monitoring, `clp-bench` periodically executes `ps aux` within the container (at the intervals specified under `system_metric.memory` in the YAML file), checking the `RSS` field for processes associated with `binary_path` and `data_path`:

```shell
docker exec {container_id} ps aux
```
Separate averages of the memory usage collected during the ingestion and query benchmark stages are then computed.
CLP-S' setup is nearly identical to CLP's. Below is an example of a YAML configuration file for CLP-S:

```yaml
system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
clp_s:
  container_id: xiaochong-clp-oss-jammy-xiaochong
  binary_path: /home/xiaochong/develop/clp-bench/tests/clp-json/clp-json-x86_64-v0.1.3/bin/clp-s-local-compiled
  data_path: /home/xiaochong/develop/clp-bench/tests/clp-json/offline-mode-data
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/mongodb-single
  queries:
    - 'attr.tickets:*'
    - 'id: 22419'
    - 'attr.message.msg: log_release* AND attr.message.session_name: connection'
    - 'ctx: initandlisten AND (NOT msg: "WiredTigermessage" OR attr.message.msg: log_remove*)'
    - 'c: WTWRTLOG AND attr.message.ts_sec > 1679490000'
    - 'ctx: FlowControlRefresher AND attr.numTrimmed: 0'
```
The only difference is the query format: queries over JSON logs are written in KQL.
Similarly, with the YAML file configured and the container running, you can execute the following command to run the benchmarks:

```shell
clp-bench -t CLPS -m {mode} -c {path-to-yaml}
```
As with CLP, no preprocessing of the raw data is required. The commands for data ingestion and query benchmarking remain the same. Memory monitoring is also performed using `ps aux` and checking the `RSS` field.
grep is a command-line tool on Unix/Linux systems used to search for specific patterns within files or text. We use it as a baseline for the unstructured log query benchmark.
Below is an example of a YAML configuration file for grep:

```yaml
system_metric:
  enable: False
grep:
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/hadoop-small
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'
```
Since grep does not ingest data, there is no difference between modes; you can choose any of them:

```shell
clp-bench -t Grep -m {mode} -c {path-to-yaml}
```
For the query benchmark, `clp-bench` executes the following command for each query:

```shell
grep -r {query} {dataset_path}
```
Loki runs as two microservices: Loki itself acts as the backend, ingesting the data sent by the log collector, and we use Promtail as that log collector. The two run in separate containers and communicate through REST APIs. For the query benchmark, we use LogCLI to execute queries.
We haven't integrated launching and ingestion for Loki into `clp-bench` yet, so you need to manually launch the containers and ingest the data first, then use `clp-bench` to run the query benchmark.
To run Loki, create a `loki-config.yaml` configuration file like the following (see details here):

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 1000

schema_config:
  configs:
    - from: 1972-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: false
  ingestion_rate_mb: 1000

ruler:
  alertmanager_url: http://localhost:9093
```
Then launch a container for Loki with the following command (assuming your current directory contains the configuration file above):

```shell
docker run \
  --name loki \
  -d \
  -v $(pwd):/mnt/config \
  -p 3100:3100 \
  grafana/loki:3.0.0 \
  -config.file=/mnt/config/loki-config.yaml
```
To run Promtail, create a `promtail-config.yaml` file like the following:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: benchlogs
          __path__: /mnt/datasets/hadoop/worker*/*
```
Note that `__path__` should be a glob pattern matching the directories that contain the log files.
Then launch a container for Promtail with the following command (assuming your current directory contains the configuration file above):

```shell
docker run \
  --name promtail \
  -d \
  -v $(pwd):/mnt/config \
  -v /path/to/hadoop-log-datasets:/mnt/datasets/hadoop \
  --link loki \
  grafana/promtail:3.0.0 \
  -config.file=/mnt/config/promtail-config.yaml
```
Once the Loki and Promtail containers have been launched, run the following command until it prints `ready`; Loki will then start ingesting data:

```shell
curl -G http://localhost:3100/ready
```
Loki ingests data automatically once it is connected to Promtail. We do not do any preprocessing of the dataset.
We use `docker stats` to periodically sample the memory usage during ingestion (the polling frequency should be consistent with the `clp-bench` YAML configuration file described in the next section) until the ingestion finishes:

```shell
docker stats loki promtail --no-stream
```
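A sketch of how such samples could be parsed and summed, assuming the stats are collected with docker's `--format '{{.Name}} {{.MemUsage}}'` option (an assumption for simplicity; the default table layout has extra columns):

```python
"""Parse lines like 'loki 1.5GiB / 125.8GiB' from
`docker stats --no-stream --format '{{.Name}} {{.MemUsage}}'`
and sum the per-container memory usage."""

UNITS = {"GiB": 2**30, "MiB": 2**20, "KiB": 2**10, "B": 1}

def to_bytes(size: str) -> int:
    """Convert a docker-stats size string such as '256MiB' to bytes.
    Longest suffixes are checked first so 'MiB' is not mistaken for 'B'."""
    for suffix in ("GiB", "MiB", "KiB", "B"):
        if size.endswith(suffix):
            return int(float(size[: -len(suffix)]) * UNITS[suffix])
    raise ValueError(f"unrecognized size: {size}")

def total_usage(stats_output: str) -> int:
    """Sum the MEM USAGE figure (second field) across all containers."""
    total = 0
    for line in stats_output.strip().splitlines():
        usage = line.split()[1]  # '<name> <usage> / <limit>'
        total += to_bytes(usage)
    return total

SAMPLE = "loki 1.5GiB / 125.8GiB\npromtail 256MiB / 125.8GiB"
print(total_usage(SAMPLE))  # 1879048192 bytes (~1.75 GiB)
```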
Since Loki gives no indication of when ingestion finishes, we monitor how much data has been ingested so far with the following command:

```shell
curl -G http://localhost:3100/metrics | grep 'loki_distributor_bytes_received_total'
```
When the reported size equals the size of the dataset, ingestion has finished. We then run the following command to get the time spent ingesting the data:

```shell
curl -G http://localhost:3100/metrics | \
    grep 'loki_request_duration_seconds_sum{method="POST",route="loki_api_v1_push"'
```
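The completion check can be sketched as follows; this only parses the Prometheus text format (the canned `/metrics` excerpt and its counter value are made up), leaving the HTTP fetch to curl:

```python
"""Check ingestion progress from Loki's /metrics payload."""

def bytes_received(metrics_text: str) -> float:
    """Sum every loki_distributor_bytes_received_total sample (the
    counter may be labeled per tenant)."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("loki_distributor_bytes_received_total"):
            total += float(line.rsplit(None, 1)[-1])
    return total

def ingestion_finished(metrics_text: str, dataset_bytes: int) -> bool:
    """True once the bytes received match (or exceed) the dataset size."""
    return bytes_received(metrics_text) >= dataset_bytes

SAMPLE = (
    "# TYPE loki_distributor_bytes_received_total counter\n"
    'loki_distributor_bytes_received_total{tenant="fake"} 2.78e+11\n'
)
print(ingestion_finished(SAMPLE, 258 * 2**30))  # True
```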
An example YAML configuration file for running the Loki query benchmark with `clp-bench` looks like this:

```yaml
system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
loki:
  logcli_binary_path: /home/xiaochong/develop/loki/logcli
  job: benchlogs
  limit: 2000000
  batch: 1000
  from: '2024-10-08T10:00:00Z'
  to: '2024-10-09T10:00:00Z'
  interval: 30
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'
```
Note that in this configuration:
- `loki.job` should match the `job` label in `promtail-config.yaml` (see the example of `promtail-config.yaml` above).
- `limit` should be at least the maximum number of log lines matched by any query.
- `batch` is the maximum number of matched log lines that Loki will send to the client at once.
- `from` and `to` define the rough wall-clock time range during which the data was ingested. In this example, data ingestion happened between `2024-10-08T10:00:00Z` and `2024-10-09T10:00:00Z`.
- `interval` defines the length, in minutes, of the time slices Loki will query. For example, an interval of 30 means the time range between `from` and `to` is divided into 30-minute slices, and Loki runs the query on log lines ingested during each slice. For each query, `clp-bench` instructs Loki to execute the query across all time slices, ensuring the entire dataset is covered.
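The slicing behavior described above can be sketched as follows (an illustrative helper, not clp-bench's actual implementation):

```python
from datetime import datetime, timedelta

def time_slices(start: str, end: str, interval_minutes: int):
    """Split [start, end) into interval_minutes-long slices, mirroring
    how each Loki query is swept across the whole ingestion window."""
    t0 = datetime.fromisoformat(start.replace("Z", "+00:00"))
    t1 = datetime.fromisoformat(end.replace("Z", "+00:00"))
    step = timedelta(minutes=interval_minutes)
    slices = []
    while t0 < t1:
        slices.append((t0, min(t0 + step, t1)))
        t0 += step
    return slices

# The example configuration (from/to one day apart, interval 30) yields
# 48 half-hour slices:
slices = time_slices("2024-10-08T10:00:00Z", "2024-10-09T10:00:00Z", 30)
print(len(slices))  # 48
```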
With the YAML configuration file for `clp-bench` (which is different from the YAML files for the Loki and Promtail containers) and the Loki and Promtail containers running (all data must have been ingested), run the following command to run the query benchmark:

```shell
clp-bench -t GrafanaLoki -m query-only -c {path-to-loki-clp-bench-yaml}
```
For each query, `clp-bench` runs the following command. Here, `time_slice_start` and `time_slice_to` are the start and end timestamps of each time slice; `clp-bench` iterates over all time slices between `from` and `to` in the configuration to cover the entire dataset.

```shell
<logcli_binary_path> \
    query '{ job="{job}" } |~ {query}' \
    --limit={limit} \
    --batch={batch} \
    --from="{time_slice_start}" \
    --to="{time_slice_to}"
```
For measuring memory usage during query execution, we employ the same method used for data ingestion.
We benchmarked Elasticsearch on both unstructured and semi-structured logs. The two setups are essentially the same, except for the query format and the preprocessing of the semi-structured log dataset. We use a single-node deployment of Elasticsearch with the xpack security feature disabled.
For unstructured logs, an example YAML configuration file looks like this:

```yaml
system_metric:
  enable: True
  memory:
    ingest_polling_interval: 10
    run_query_benchmark_polling_interval: 1
elasticsearch:
  container_id: elasticsearch-xiaochong
  launch_script_path: /home/assets/start-ela.sh
  compress_script_path: /home/assets/compress.py
  search_script_path: /home/assets/query.py
  terminate_script_path: /home/assets/stop-ela.sh
  memory_polling_script_path: /home/assets/poll_mem.py
  data_path: /var/lib/elasticsearch
  log_path: /var/log/elasticsearch
  dataset_path: /home/datasets/worker*/worker*/*log*
  queries:
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " to pid 21177 as user "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 10000 reply: "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 10 reply: "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 178.2 MB "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 1.9 GB "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "job_1528179349176_24837"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "blk_1075089282_1348458"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "hdfs://master:8200/HiBench/Bayes/temp/worddict"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " abcde "}}}}, "size": 10000}'
```
For semi-structured logs, an example YAML configuration file looks like this:

```yaml
system_metric:
  enable: True
  memory:
    ingest_polling_interval: 10
    run_query_benchmark_polling_interval: 1
elasticsearch:
  container_id: elasticsearch-semi-xiaochong
  launch_script_path: /home/assets/start-ela.sh
  compress_script_path: /home/assets/compress.py
  search_script_path: /home/assets/query.py
  terminate_script_path: /home/assets/stop-ela.sh
  memory_polling_script_path: /home/assets/poll_mem.py
  data_path: /var/lib/elasticsearch
  log_path: /var/log/elasticsearch
  dataset_path: /home/datasets/mongod.log
  queries:
    - '{"query": {"exists": {"field": "attr.tickets"}}, "size": 10000}'
    - '{"query": {"term": {"id": 22419}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"wildcard": {"attr.message.msg": "log_release*"}}, {"match": {"attr.message.session_name": "connection"}}]}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"ctx": "initandlisten"}}], "should": [{"wildcard": {"attr.message.msg": "log_remove*"}}, {"bool": {"must_not": [{"match_phrase": {"msg": "WiredTiger message"}}]}}], "minimum_should_match": 1}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"c": "WTWRTLOG"}}, {"range": {"attr.message.ts_sec": {"gt": 1679490000}}}]}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"ctx": "FlowControlRefresher"}}, {"match": {"attr.numTrimmed": 0}}]}}, "size": 10000}'
```
Note that the scripts used by `clp-bench` can be found under `clp-bench/assets/elasticsearch-unstructured` (for unstructured logs) and `clp-bench/assets/elasticsearch` (for semi-structured logs). Each of these directories also contains a `Dockerfile`, which is used to build the containers with `docker_build.sh`.
Once the containers are built, use docker_run.sh
to launch them. The script requires two
arguments: the first is the absolute path of the dataset on the host, and the second is the absolute
path where the ingested data will be stored on the host.
With the YAML file and the running container corresponding to the log dataset type, we can run the following commands to benchmark Elasticsearch:

```shell
# For the unstructured log dataset
clp-bench -t ElasticsearchUnstructured -m {mode} -c {path-to-yaml}

# For the semi-structured log dataset
clp-bench -t Elasticsearch -m {mode} -c {path-to-yaml}
```
For data ingestion, no preprocessing is needed for the unstructured log dataset. However, the JSON logs generated by MongoDB require some adjustments to be searchable by Elasticsearch; for details, refer to the `traverse_data` function in `clp-bench/assets/elasticsearch/compress.py`. The general approach is to reorganize certain fields, moving them to outer or inner objects, to ensure the queries function correctly. For both types of dataset, `clp-bench` uses `streaming_bulk` from `elasticsearch.helpers` to ingest data (refer to either `clp-bench/assets/elasticsearch-unstructured/compress.py` or `clp-bench/assets/elasticsearch/compress.py`).
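The `streaming_bulk` pattern can be sketched as follows; the index name and field layout here are assumptions for illustration, not necessarily what the `compress.py` scripts use:

```python
"""A generator of bulk-index actions, the shape consumed by
elasticsearch.helpers.streaming_bulk."""

def actions(lines, index="hadoop-logs"):
    """Yield one bulk-index action per raw log line."""
    for line in lines:
        yield {"_index": index, "_source": {"log_line": line.rstrip("\n")}}

# Against a live cluster, the generator would be consumed like this:
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import streaming_bulk
#   es = Elasticsearch("http://localhost:9200")
#   for ok, _item in streaming_bulk(es, actions(open("worker1.log"))):
#       assert ok

docs = list(actions(["line one\n", "line two\n"]))
print(docs[0]["_source"]["log_line"])  # line one
```

Streaming the actions keeps memory flat during ingestion, since log lines are read and indexed incrementally rather than loaded all at once.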
For query benchmarking, `clp-bench` also uses the `elasticsearch` Python package to execute queries. For details, refer to the `execute_query_without_cache` functions in `clp-bench/assets/elasticsearch-unstructured/query.py` and `clp-bench/assets/elasticsearch/query.py`.
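As an illustration of the query path: the YAML stores each Elasticsearch query as a JSON string, which must be decoded before being handed to the client. This is a hypothetical sketch; the real logic lives in the `execute_query_without_cache` functions mentioned above:

```python
import json

def parse_query(raw: str):
    """Decode a query string from the YAML into the parts passed to the
    Elasticsearch client."""
    body = json.loads(raw)
    return body["query"], body.get("size", 10)

# Against a live cluster, the decoded parts would be passed along the
# lines of: es.search(index="...", query=query, size=size)
query, size = parse_query('{"query": {"term": {"id": 22419}}, "size": 10000}')
print(size)  # 10000
```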
For memory monitoring, similar to CLP and CLP-S, `clp-bench` uses `ps aux` and checks the `RSS` field.