Production Guide
Taking SolrWayback to production typically involves scaling challenges. Therefore the main focus of this guide is how to scale indexing and search. The guide should not be necessary for people who are only trying out the SolrWayback bundle or running smaller scale installations (fewer than 100 million records). To set up the bundle for smaller installations, have a look at this wiki page.
The guide is written from experience at the Royal Danish Library, where we run the netarchive search. It might be applicable elsewhere, but it is not intended as a general Solr guide.
Familiarity with running the SolrWayback bundle in "standard" mode is presumed. The reader does not need detailed knowledge of running Solr in production. Solr requires Java and runs on at least macOS, Linux and Windows. While this guide assumes Linux, it should be fairly easy to adapt to macOS and Windows.
Indexing is the process of populating a Solr installation with documents generated from WARC entries. There will be 1 Solr document per WARC entry.
The SolrWayback bundle comes with the script warc-indexer, which is used for the analysis and index-building part of the SolrWayback ecosystem. Underneath the surface, webarchive-discovery does all the work.
warc-indexer provides threaded indexing on a single machine and, depending on local WARC file logistics, it can be used independently on multiple machines. For collections of a few terabytes of WARCs the script is a viable option.
Analyzing the WARC entries is by far the heaviest part; updating the Solr installation with the results is comparatively cheap. A ratio of 10 CPU cores for analysis to 1 CPU core for running Solr should work well.
When moving into tens or hundreds of terabytes of WARCs with a steady influx of new material, a more organized approach is to use the Hadoop indexer from the webarchive-discovery project or to create a custom local solution. The Royal Danish Library created netsearch for handling 1 petabyte of WARCs, but it is not developed as a generally usable tool.
The SolrWayback bundle comes with the simplest possible Solr setup: a single Solr Standalone. A typical production setup will use Solr Cloud, but for collections below 100 million documents a single Solr suffices from a performance point of view. Standalone can be pressed into the 500 million document range, but this is not advisable from either a performance or a stability viewpoint.
Solr Cloud provides:
- Failure resilience using replicas
- Scaling out using shards
- Mix'n'match of content from multiple collections using alias
- Controlled backup using backup
- Complex data extraction with streaming expressions
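As an illustration of the backup point, a full collection backup is a single Collections API call. This is only a sketch: the backup name netarchive_backup_1 and the location /solr-backups are made up for the example, and the location must be writable by all Solr nodes (typically a shared filesystem or a configured backup repository).
curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=netarchive_backup_1&collection=netarchivebuilder&location=/solr-backups'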
This guide focuses on Solr Cloud as it is the preferred choice in a production setup. It provides step-by-step instructions for running a small cloud as well as high-level descriptions of multiple larger setups. The technical details of running larger setups are not part of this guide: see the Apache Solr Reference Guide instead.
At the time of writing the SolrWayback bundle comes with Solr 7. Hopefully this will soon be changed to Solr 9, which is also the version used in this guide. Note: Solr 7 has a performance problem with large segments (see LUCENE-8374) that might show up in a SolrWayback context, depending on how the index is managed.
Solr indexes are only compatible one major version back: a Solr 7 index can be used by Solr 8, but not by Solr 9. Even when the upgrade is just from one major version to the next, it is advisable to perform a full re-index.
Switching from Solr Standalone to Solr Cloud, while keeping the existing index, is possible without any drawbacks, although it requires a little sleight of hand: create a Solr Cloud with a collection consisting of a single shard and no extra replicas as per the instructions in the next section, then shut down the cloud, copy the raw index data from the standalone installation to the cloud data folder and start the cloud again.
After this the single shard can be split into the wanted number of shards, and replicas can be added as needed.
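A minimal sketch of the shuffle, assuming both the standalone core and the cloud collection are named netarchivebuilder and that the cloud replica folder has the Solr-generated default name; verify the actual data folder paths in the local installation before copying:
solr-9/bin/solr stop -all
cp -rp /path/to/standalone/server/solr/netarchivebuilder/data/* solr-9/server/solr/netarchivebuilder_shard1_replica_n1/data/
solr-9/bin/solr start -c -m 8g
# Split the single shard once the copied index is confirmed to be searchable:
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=netarchivebuilder&shard=shard1'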
Below follows a step-by-step guide on how to use SolrWayback with a Solr 9 Cloud running on a single Solr node on a single machine. This setup does not provide redundancy (replicas) due to its single-machine nature, but scales well with hardware if the number of shards is increased.
Download Solr from the Apache Solr homepage. As of 2024-05-12 the webarchive-discovery project has schemas for Solr 7 (deprecated) and Solr 9. When Solr 10 is released, it might be able to use Solr 9 schemas. In this example we'll use the binary release of Solr 9.3.0:
wget 'https://www.apache.org/dyn/closer.lua/solr/solr/9.3.0/solr-9.3.0.tgz?action=download' -O solr-9.3.0.tgz
Unpack Solr, rename the folder:
tar xovf solr-9.3.0.tgz
mv solr-9.3.0 solr-9
Allow Solr to be accessed from machines other than localhost by opening solr-9/bin/solr.in.sh (or solr-9/bin/solr.in.cmd if using Windows) and changing the line:
#SOLR_JETTY_HOST="127.0.0.1"
to:
SOLR_JETTY_HOST="0.0.0.0"
Start Solr in cloud mode (call solr-9/bin/solr without arguments to see how to use it). See the Solr Tutorials for more information:
solr-9/bin/solr start -c -m 8g
Visit http://localhost:8983/solr to see if Solr is running.
Note the 8g argument. This is to be adjusted when the index grows. The Solr Admin GUI shows the memory usage at any given time.
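For example, if the heap needs to grow from 8GB to 16GB (the value here is purely illustrative), Solr can be restarted with a larger -m value:
solr-9/bin/solr restart -c -m 16g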
When Solr starts, it is likely to complain about the number of file handles being too small with a message such as:
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
For testing this can often be ignored, but when running in production the limit must be raised! Failure to do so will lead to failed index updates when the number of shards and documents grows.
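A minimal sketch of raising the limit, assuming Solr runs as a user named solr (adjust the user name and values to the local setup); the limits.conf change only takes effect for new login sessions:
# Temporary, for the current shell before starting Solr:
ulimit -n 65000
# Permanent, by adding lines like these to /etc/security/limits.conf:
# solr  soft  nofile  65000
# solr  hard  nofile  65000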
Upload the webarchive-discovery Solr 9 configuration to the local Solr Cloud and create a collection with 2 shards. Here the webarchive-discovery project is cloned, but (eventually) the files should also be available in the SolrWayback bundle.
Note that the number of shards (2) is part of the call. This can be freely specified upon creation or adjusted later (with some limitations) using shard split. Also note that the name of the collection is netarchivebuilder to match the SolrWayback bundle default. Changing this requires matching changes to the indexer and the SolrWayback properties.
git clone git@github.com:ukwa/webarchive-discovery.git
solr-9/bin/solr create_collection -c netarchivebuilder -d webarchive-discovery/warc-indexer/src/main/solr/solr9/discovery/conf/ -n sw_conf1 -shards 2
Visit http://localhost:8983/solr/#/~cloud?view=graph to check that there is now a collection called netarchivebuilder with 2 active shards.
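On a headless server the same check can be done from the command line, e.g. through the Collections API (solr-9/bin/solr healthcheck -c netarchivebuilder is an alternative):
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=netarchivebuilder'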
That's it for Solr Cloud setup.
Index to the Solr Cloud using warc-indexer.sh from the SolrWayback bundle or by other means. It uses the same port and the same collection name as Solr Standalone. Start the Tomcat with SolrWayback (but don't start the Solr 7 Standalone, as there would be a port clash) and visit http://localhost:8080/solrwayback/
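As an illustration only (the exact arguments are documented in the bundle and may differ between versions), indexing a couple of WARC files could look something like this; the paths are made up:
./warc-indexer.sh /netarchive/warcs/example-00001.warc.gz /netarchive/warcs/example-00002.warc.gz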
Delete all documents in the index:
curl 'http://localhost:8983/solr/netarchivebuilder/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'
For a webscale Solr index, the primary problem is performance. For standard Solr searches, random access to index data is the bottleneck: CPU power is mostly secondary and mostly helps with large exports or streaming expressions. The basic "works everywhere" strategy to improve random access performance is "buy more RAM", but at webarchive scale this can be a rather expensive affair. A basic and relatively inexpensive piece of advice is to use SSDs instead of spinning drives, but this is 2023, so who even uses spinning drives for anything performance-related anyway?
How much hardware? Well, that really depends on performance requirements and index size. At the Royal Danish Library the performance requirements are low and the index size large (40+ billion records), so hardware has been dialed as far down as possible. Some details are available at 70TB, 16b docs, 4 machines, 1 SolrCloud, but it can be roughly condensed to 140+ shards, each holding 300 million records or 900GB of index data, with 0.64 CPU and 16GB of RAM available to serve each shard. This should be seen as minimum requirements, as there are plans for doubling the amount of memory at the Royal Danish Library.
A rough rule of thumb for a setup with low level of use (1-3 concurrent searches): 100 million records ~ 300GB Solr index data ~ 0.5 CPU ~ 10GB of RAM.
We have tried to create an overview of setup sizes at the following wiki page.
At this scale a basic Solr Cloud works well: deploy a cloud according to the Solr documentation and create a single collection with a shard count chosen so that each shard holds 100-300 million records of the expected total record count.
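As a worked example of the rule of thumb: an expected 4 billion records lands at roughly 15-40 shards. Assuming the webarchive-discovery Solr 9 configuration from the step-by-step section, the creation could look like this (the shard count of 20 is illustrative):
solr-9/bin/solr create_collection -c netarchivebuilder -d webarchive-discovery/warc-indexer/src/main/solr/solr9/discovery/conf/ -n sw_conf1 -shards 20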
The big decision is whether to add shard replicas or not.
- Pro: Replicas on separate hardware ensures continued operation in case of hardware failure
- Pro: Although not a substitute for proper backup, replicas can be seen as RAID 1
- Con: Replicas double the RAM requirements for the same search latency, although they do also double throughput if CPUs are scaled equivalently
- Con: Replicas lower overall performance if RAM is not doubled
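If replicas are chosen, they can also be added after collection creation on a per-shard basis through the Collections API; a sketch, reusing the collection and shard names from the step-by-step section:
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=netarchivebuilder&shard=shard1'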
At this scale a single collection with hundreds of shards is hard to manage: any newly added documents can go to any shard and trigger search delays due to updating of the index data. All Solr nodes must be able (read: have enough RAM) to have 2 indexes open at the same time, due to the nature of the Lucene index used by Solr.
The strategy at the Royal Danish Library is to use multiple collections of manageable size in a hot/cold setup:
At any given time there is a single hot collection where documents are added using the indexer. The rest of the collections are cold and never updated. When the hot collection reaches a given size, it is optimized and classified as cold, followed by creation of a new empty hot collection. The alias mechanism exposes both hot and cold collections as one single logical collection used by SolrWayback.
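A sketch of the two central operations, assuming made-up collection names netarchive_2022 to netarchive_2024 with netarchive_2024 as the hot collection being retired, and the alias name netarchivebuilder expected by SolrWayback:
# Optimize the retiring hot collection down to a single segment (heavy operation, run once):
curl 'http://localhost:8983/solr/netarchive_2024/update?optimize=true&maxSegments=1'
# Expose all collections as one logical collection through an alias:
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=netarchivebuilder&collections=netarchive_2022,netarchive_2023,netarchive_2024'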
The division into hot and cold collections makes it possible to have a dedicated builder node (with the hot collection) and dedicated searcher nodes (with the cold collections). As the cold collections are never updated, the hardware requirements for the searcher nodes go down, and backup only needs to be done once per cold collection.
When building the full index in one go, e.g. when switching major Solr version, the use of multiple collections means that they can be built fully independently, and thus indexing can be scaled indefinitely with hardware.
At the Royal Danish Library there are no extra shard replicas: if a shard is damaged due to storage failure, the whole cloud is down until the shard has been restored from backup. This halves the hardware requirements for the searchers. If a netarchive of this size has a lot of active users, adding replicas might be needed to increase search capacity.
The size of the sub-collections at the Royal Danish Library is dictated by hardware (~900GB each, to fit a single SSD). Another division might be in logical units such as years or harvest job ID. The wins in logistics and hardware stay the same.
In general, going beyond a few hundred shards in a single Solr Cloud is seen as problematic territory in the Solr community. The use of the hot/cold principle should help here, but it is unknown whether there is a point where performance becomes markedly worse or if it will just gradually degrade due to the rule of the weakest performance link in the distributed chain.
Anyone with experience on this is more than welcome to contact us.
- webarchive-discovery config
- Multiple collections
- Explicit ZooKeeper
- Compact index
- Alias