Documentation on Marginalia Search Engine Software Documentation
https://docs.marginalia.nu/
Wed, 17 Jan 2024 00:00:00 +0000

1.1 Hardware and Configuration
https://docs.marginalia.nu/1_overview/01_hardware/
Tue, 16 Jan 2024 00:00:00 +0000

The Marginalia Search Engine is designed to be run on a single server. The server should be an x86-64 machine with at least 16 GB of RAM and at least 4 cores. It is designed to run on physical hardware, and will likely be very expensive to run in the cloud. Although the system is designed to run on a single server, it is possible to run the index nodes on separate servers.

1.2 Software Requirements
https://docs.marginalia.nu/1_overview/02_software-reqs/
Tue, 16 Jan 2024 00:00:00 +0000

The software requirements for running the Marginalia Search Engine are:

- Linux. The distribution probably doesn't matter; the instructions assume something Debian- or Ubuntu-like. If you're running something else, especially something other than Linux, it may or may not work.
- Docker (install guides)
- Docker Compose (install guides)
- Java (use sdkman to install):
  - JDK 22 for the latest released version of the system
  - JDK 23 for the head of the git repository
- NPM

1.3 Installing
https://docs.marginalia.nu/1_overview/03_installing/
Tue, 16 Jan 2024 00:00:00 +0000

To install the search engine software, you need to clone the repository:

$ git clone https://github.com/MarginaliaSearch/MarginaliaSearch.git

This will create a directory called MarginaliaSearch in your current directory. Change into that directory, and run the setup.sh script. It will download a number of additional files, primarily from https://downloads.marginalia.nu.
This is necessary as the search engine uses large binary model files for language processing, and these don't sit well in git.

$ run/setup.sh

2.1 New Crawl
https://docs.marginalia.nu/2_crawling/1_new_crawl/
Tue, 16 Jan 2024 00:00:00 +0000

NOTE: Please be sure to read the crawling disclaimer before proceeding.

Bootstrapping the domain database: While a running search engine can use the link database to figure out which websites to visit, a clean system does not know of any links, so you must add a few domains yourself. To do this, either follow the link in the New Crawl GUI, or use the top menu and select Domains->Add Domains.

2.2 Recrawling
https://docs.marginalia.nu/2_crawling/2_recrawl/
Tue, 16 Jan 2024 00:00:00 +0000

The workflow with a crawl spec was a one-off process to bootstrap the search engine. To keep the search engine up to date, it is preferable to do a recrawl, which tries to reduce the amount of data that needs to be fetched. To trigger a recrawl, go to Nodes->Node N->Actions->Re-crawl. This will bring you to a page that looks similar to the 'new crawl' page, where you can select a set of existing crawl data to use as a source.

2.3 Processing and Loading
https://docs.marginalia.nu/2_crawling/3_loading/
Tue, 16 Jan 2024 00:00:00 +0000

Once the crawl is done, the data needs to be processed before it is searchable. This process extracts keywords and features from the documents, and converts them into a format that can be loaded into the search engine. This is done by going to Nodes->Node N->Actions->Process Crawl Data.

Process Crawl Data Dialog

This will start the conversion process, which will again take a while, depending on the size of the crawl. The progress bar will show the progress.
1.4 Configuration
https://docs.marginalia.nu/1_overview/04_configuring/

After installing, a directory structure will be created in the install directory. In it, the following files and directories will be created:

path                  description
conf/properties       java-style properties files for configuring the search engine
conf/suggestions.txt  A list of suggestions for the search interface
conf/db.properties    JDBC configuration
env/mariadb.env       Environment variables for the mariadb container
env/service.env       Environment variables for Marginalia Search services
logs/                 Log files
model/                Language processing models
index-1/backup        Index backups for index node 1
index-1/index         Index data for index node 1
index-1/storage       Raw and processed crawl data for index node 1
index-1/work          Temporary work directory for index node 1
index-1/uploads       Upload directory for index node 1
index-2/backup        Index backups for index node 2
index-2/index         Index data for index node 2
…                     …

For a production-like deployment, you will probably want to move the db and index directories to a separate storage device.

1.5 System Overview
https://docs.marginalia.nu/1_overview/05_system_overview/

The search engine consists of several components, each of which runs in a separate Docker container.

Index Nodes: The system is designed to be able to run with multiple partitions. At least one partition is required, but more can be added. Each partition is called an Index Node. Each index node is a separate Docker container, and can be run on a separate server or on the same server.

2.4.1 - WARCs
https://docs.marginalia.nu/2_crawling/4_sideloading/1_warc/

WARC files are the standard format for web archives. They can be created e.g. with wget.
The Marginalia software can read WARC files directly and sideload them into the index, as long as each WARC file contains only one domain. Let's for example archive www.marginalia.nu (I own this domain, so feel free to try this at home):

$ wget -r --warc-file=marginalia www.marginalia.nu

Note: If you intend to do this on other websites, you should probably add a --wait parameter to wget.

2.4.2 - ZIM
https://docs.marginalia.nu/2_crawling/4_sideloading/2_openzim/

Wikipedia is the archetype of a website that is too large to crawl. Thankfully, they provide dumps of their data in a format called ZIM. This format is optimized for offline use, and is used by the Kiwix project to provide offline access to Wikipedia. Wikipedia's ZIM files are available for download at https://download.kiwix.org/zim/wikipedia/. Since the search engine doesn't process images, we can use the smaller "no images" version of the dump, wikipedia_en_all_nopic_YYYY-MM.

2.4.3 - Stackexchange
https://docs.marginalia.nu/2_crawling/4_sideloading/3_stackexchange/

The search engine is capable of side-loading Stackexchange data dumps. These are available from https://archive.org/details/stackexchange, in a compressed XML format. It is probably a good idea to select the torrent option, as the files are quite large and archive.org's servers are not particularly fast. This will also allow you to limit the download to the sites you are interested in. The system will digest the 7z files directly, so you don't need to uncompress them.
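The --wait option mentioned in the WARC note above inserts a pause between retrievals. Combined with the archiving command, it would look something like this (the one-second delay is an illustrative value, not a recommendation from these docs):

```shell
# Polite variant of the WARC archiving command above. The 1-second
# delay between requests is an arbitrary example value.
$ wget -r --wait=1 --warc-file=marginalia www.marginalia.nu
```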
2.4.4 - Directory Tree
https://docs.marginalia.nu/2_crawling/4_sideloading/4_dirtree/

For relatively small websites, ad-hoc side-loading is available directly from a folder structure on the hard drive. This is intended for loading manuals, documentation and similar data sets that are large and slowly changing. A website can be archived with wget, like this:

wget -nc -x --continue -w 1 -r -A "html" "docs.marginalia.nu"

After doing this to a bunch of websites, create a YAML file in the upload directory, with contents something like this:

2.5 - WARC export
https://docs.marginalia.nu/2_crawling/5_warc_export/

It is possible to configure the crawler to export its crawl data in WARC format on top of the native parquet format. This is toggled in the node configuration, available from Index Nodes -> Node N -> Configuration.

The node configuration panel, showing the `Keep WARC files during crawling` option

If the option Keep WARC files during crawling is enabled, the crawler will retain a WARC record of the crawl.

3.1 Node Configuration
https://docs.marginalia.nu/3_configuration_options/1_node_configuration/

Under Nodes -> Node N -> Configuration, you will find a list of configuration options that can be set for each node.

Node Configuration Dialog

Accept Queries: This option toggles whether the query service will route queries to this node. This is useful if you want to take a node out of rotation for some reason.

Keep WARC files during crawling: If this option is enabled, the WARC files will be compacted into common-crawl style indexed WARC files.
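The directory-tree section (2.4.4) above says to repeat the wget command for a bunch of websites. A loop along these lines would do it; the flags are the ones from the docs, but the domain list is purely illustrative:

```shell
# Archive several small sites into a local directory tree, one request
# per second, keeping only HTML. The domains listed are example values.
for site in docs.marginalia.nu example.com; do
    wget -nc -x --continue -w 1 -r -A "html" "$site"
done
```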
3.2 Data Sets
https://docs.marginalia.nu/3_configuration_options/2_data_sets/

Under System -> Data Sets, you will find a couple of options. These define URLs that the system will use to download data sets from.

Data Sets

Data Set URLs

Blogs: The blogs list is a list of domains that the system considers to be blogs. This affects how these domains are processed: paths like /tags or /category are ignored, and the system will operate on the assumption that the content is a blog post.

3.3 Domain Ranking Sets
https://docs.marginalia.nu/3_configuration_options/3_ranking_sets/

Under System -> Domain Ranking Sets, you will find a list of domain ranking sets. These are configurations for the domain ranking system, which assigns a score to each domain that affects the order in which the domains are considered in the index. Thus a high ranking means results from a domain are more likely to be returned in a query. A few domain ranking sets are reserved, and cannot be deleted.

4.1 Model Files
https://docs.marginalia.nu/4_data/1_model_files/

In the model/ directory, the following files are stored:

File                  Description
English.DICT          RDRPosTagger dictionary
English.RDR           RDRPosTagger model
lid.176.ftz           fasttext language identification model
opennlp-sentence.bin  OpenNLP sentence detector model
opennlp-tokens.bin    OpenNLP tokenizer model
tfreq-new-algo3.bin   Marginalia term frequency model
ngrams.bin            Marginalia n-grams model

The RDRPosTagger models are used for fast part-of-speech tagging. These and additional models are available at the RDRPOSTagger git repository. The fasttext language identification model is used to identify the language of a document.
4.2 Data Files
https://docs.marginalia.nu/4_data/2_data_files/

In the data/ directory, the following files are stored:

File                      Description
adblock.txt               Adblock rules
asn-data-raw-table        CIDR->ASN data
asn-used-autnums          ASN->AS registry data
IP2LOCATION-LITE-DB1.CSV  IP2Location data
atags.parquet             Anchor tags

adblock.txt is used to detect ads and other problematic content in the crawled documents. The asn files are sourced from APNIC and used to map IP addresses to the corresponding CIDR and autonomous system. The ip2location data is from https://lite.ip2location.com/, and available under CC-BY-SA 4.0.

4.3 Crawl Data, Processed Data, etc.
https://docs.marginalia.nu/4_data/3_crawl_data/

The system stores crawl data in index-n/storage/, along with processed data and other long-term data. The data generated by the system is in general viewable within the control GUI, under Index Nodes -> Node N -> Storage.

Listing of data

Clicking on the paths in this view will bring up details.

Data details screenshot

This view will show the path of the data relative to the node storage root.

4.4 Backups
https://docs.marginalia.nu/4_data/4_backups/

The system automatically snapshots the index data before the index is constructed. This allows relatively quick rollbacks of the index if for some reason this operation needs to be undone. The index data is stored in the index-n/backup directory, where n is the index node number. Backup restoration is done from the control interface, under Node N->Actions->Restore Backup. This will restore the backup, and the index will be rebuilt from there.

5.
Sample Crawl Data
https://docs.marginalia.nu/2_crawling/4_sideloading/5_sample_data/

It is possible to download sample crawl data from the Marginalia Search project. This is useful for quickly setting up a test environment for experimentation or assessment of changes to the code.

Caveat: If you load sample data into the system, these domains will also be included in future re-crawls. It is good practice to segregate test environments with sample data from real environments, to avoid contamination of the domains table.

Migration, 2024-03+
https://docs.marginalia.nu/6_notes/6_1__migrate_2024_03_plus/

After the end of February 2024, the project uses zookeeper for service discovery and migrates to a new docker build system. This is a fairly large change, and requires a few manual migration steps to keep using an existing installation.

Easy way: Do a clean install somewhere and copy the docker-compose.yml and env/service.env to your existing install.

Hard way: Add a zookeeper service to the docker-compose file:

services:
  ...
  zookeeper:
    image: zookeeper
    container_name: "zookeeper"
    restart: always
    ports:
      - "127.
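The feed summary above is cut off in the middle of the port mapping. For orientation only, a complete zookeeper service entry would plausibly look like the following; the port mapping is an assumption based on ZooKeeper's standard client port 2181 bound to localhost, not something the truncated text states:

```yaml
services:
  # ... existing Marginalia services ...
  zookeeper:
    image: zookeeper
    container_name: "zookeeper"
    restart: always
    ports:
      # Assumption: ZooKeeper's default client port, exposed on localhost only.
      - "127.0.0.1:2181:2181"
```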