ConceptSketch

A high-performance corpus-based collocation analysis tool built on BlackLab corpus search software (which relies on Apache Lucene). This project implements word and dependecy sketch functionality (grammatical relations and collocations), semantic field exploration, and conceptual mining for corpus linguistics research and NLP applications.

Features

Fast Collocation Analysis: O(1) instant lookup with precomputed collocations
CQL Support: Full Corpus Query Language with distance modifiers and constraints
logDice Scoring: Association strength metric (0-14 scale)
Multiple Grammatical Relations: ADJ_PREDICATE, ADJ_MODIFIER, SUBJECT_OF, OBJECT_OF
Concordance Examples: View real corpus sentences for any word pair with highlighting
REST API: HTTP server with semantic field exploration endpoints
Web Interface: Interactive Semantic Field Explorer with D3.js visualization
Multi-Seed Exploration: Explore semantic fields using multiple seed words

Quick Start (5 minutes)

Prerequisites

Java 17+ (Java 21+ recommended)
Maven 3.6+
Python 3 (for web server)

1. Build

mvn clean package

Corpus Data for Testing

You can test ConceptSketch by downloading this indexed and tagged corpus:

Frontiers in Psychology Corpus, https://doi.org/10.18150/4LJ9WD

It is sufficiently large to provide interesting insights about the language of contemporary psychology (2010-2021, before the advent of AI-generated papers).

2. Create an Index

Step 1 — Prepare a CoNLL-U corpus

Tag your text with any CoNLL-U-producing tool. The project includes a Stanza GPU script for efficient tagging:

Option A: Use the Stanza script (recommended)

# Download model (one-time)
python tag_with_stanza.py --download --lang en

# Tag corpus (uses GPU automatically if available)
python tag_with_stanza.py \
  --input corpus.txt \
  --output corpus.conllu \
  --lang en

For GPU tuning and more options, see STANZA_GPU.md.

Option B: Use UDPipe 2 directly

udpipe --tokenize --tag --parse --output=conllu english.udpipe corpus.txt > corpus.conllu

Option C: Use another CoNLL-U tagger (Stanza in Python without GPU, spaCy, etc.)

Step 2 — Preprocess: add `<s>` sentence markers

BlackLab's tabular parser requires explicit inline tags for sentence boundaries. The project ships a script that converts CoNLL-U blank-line sentence boundaries into <s> / </s> inline tags:

python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu

Move the output file into a dedicated input directory:

mkdir input_dir
mv corpus_s.conllu input_dir/

Step 3 — Index with BlackLab

The shaded JAR bundles BlackLab's IndexTool. Run it from the project root (so --format-dir . can find conllu-sentences.blf.yaml):

java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences

Argument	Meaning
`--format-dir .`	Directory containing `conllu-sentences.blf.yaml`
`my_index/`	Output index directory (created automatically)
`input_dir/`	Directory with preprocessed `.conllu` files
`conllu-sentences`	Format name (matches the `.blf.yaml` filename)

3. Start API Server

# Terminal 1
java -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080

CORS configuration: By default the API allows requests from http://localhost:3000. To allow a different origin, pass the cors.allow.origin JVM system property:
java -Dcors.allow.origin=https://myapp.example.com \
     -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080

Server startup output:

API server started on http://localhost:8080
Algorithm: PRECOMPUTED (O(1) instant lookup)
Endpoints:
  GET /health
  GET /api/sketch/{lemma}
  GET /api/relations
  GET /api/relations/dep
  GET /api/semantic-field/explore
  GET /api/semantic-field/explore-multi
  GET /api/semantic-field/examples
  GET /api/concordance/examples
  GET /api/visual/radial
  GET /api/bcql (POST)

4. Start Web Interface

# Terminal 2
python -m http.server 3000 --directory webapp

Open browser to: http://localhost:3000

The web interface allows to produce some radial plots for collocates:

And you use some semantic exploration features:

5. Try a Query

# Find adjectives describing "house"
curl "http://localhost:8080/api/sketch/house"

# Get example sentences for "house" + "big"
curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=5"

# Explore semantic field from "theory" (ADJ_PREDICATE relation)
curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate"

# Multi-seed exploration
curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&top=10"

Core Usage

Index a Corpus

Prerequisites

A corpus in CoNLL-U format (columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
The project's conllu-sentences.blf.yaml format file (in the project root)
Java 21+ and the shaded JAR (target/concept-sketch-1.6.0-shaded.jar)

Step 1 — Preprocess CoNLL-U: add sentence markers

BlackLab's tabular parser needs explicit <s> / </s> inline tags to index sentence spans. The bundled script converts CoNLL-U blank-line boundaries:

python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu

What the script does:

Skips comment lines (#) and multi-word token lines (1-2, 1.1, …)
Emits <s> before the first token of each sentence
Emits </s> after the last token
Preserves all 10 CoNLL-U columns as tab-separated values

Step 2 — Create a BlackLab index

mkdir input_dir
cp corpus_s.conllu input_dir/

# Run from the project root (so --format-dir finds conllu-sentences.blf.yaml)
java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences

To add more documents to an existing index later:

java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool add \
  --format-dir . \
  my_index/ more_input_dir/ conllu-sentences

Indexed annotations

Annotation	Source column	Forward index
`word`	FORM (col 2)	✓
`lemma`	LEMMA (col 3)	✓
`pos`	UPOS (col 4)	✓
`xpos`	XPOS (col 5)	✓
`deprel`	DEPREL (col 8)	✓
`wordnum`	ID (col 1)	—
`feats`	FEATS (col 6)	—
`head`	HEAD (col 7)	—

Query via Command Line

# Find all collocations for "theory"
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory

# Find adjectival modifiers of "theory" (deprel=amod)
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory --deprel amod
# Increase result count and filter by logDice
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory \
  --deprel nsubj --limit 50 --min-logdice 4.0

Grammar Configuration

The grammar configuration (relations, copulas, CQL patterns) is externalized in JSON:

Config file: grammars/relations.json

{
  "version": "1.0",
  "copulas": ["be", "appear", "seem", "become", ...],
  "relations": [
    {
      "id": "noun_adj_predicates",
      "name": "Adjectives (predicative)",
      "head_pos": "noun",
      "collocate_pos": "adj",
      "cql_pattern": "[tag=jj.*]",
      "uses_copula": true,
      "default_slop": 8,
      "relation_type": "ADJ_PREDICATE",
      "exploration_enabled": true
    },
    ...
  ]
}

Fields:

id - Unique relation identifier
name - Human-readable name
head_pos - POS group of the head word (noun, verb, adj)
collocate_pos - POS group of the collocate
cql_pattern - CQL pattern to match
uses_copula - Whether this relation uses copula verbs
default_slop - Default distance window
relation_type - Semantic relation type
exploration_enabled - Whether usable for semantic field exploration

API endpoint:

To view active relations, use GET /api/relations.

To modify relations or add new ones, edit grammars/relations.json and restart the server.

REST API Endpoints

Health Check

curl http://localhost:8080/health

Get Word Sketch

curl "http://localhost:8080/api/sketch/house"

To limit full sketches to relations whose head is a specific POS group:

curl "http://localhost:8080/api/sketch/predict?head_pos=verb"

Accepted values: noun, verb, adj, adv.

Response:

{
  "status": "ok",
  "lemma": "house",
  "patterns": {
    "noun_modifiers": {
      "name": "Adjectives modifying (ADJ X)",
      "cql": "[tag=jj.*]~{0,3}",
      "total_matches": 3421,
      "collocations": [
        {
          "lemma": "big",
          "frequency": 287,
          "logDice": 11.24,
          "relativeFrequency": 0.084
        }
      ]
    }
  }
}

Single-Seed Semantic Field Exploration

curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate&top=15&min_logdice=2"

Relations:

adj_predicate: "X is ADJ" (e.g., "theory is correct")
adj_modifier: "ADJ X" (e.g., "correct theory")
subject_of: subject verbs for noun-head sketches, strict local pattern (e.g., "theory suggests")
noun_verbs: nearby verbs for noun-head sketches, looser window than subject_of
verb_subjects: subject nouns for verb-head sketches (e.g., "theory predicts")
object_of: object nouns for verb-head sketches, strict local pattern (e.g., "develop theory")
verb_nouns: nearby nouns for verb-head sketches, looser window than object_of

Response:

{
  "status": "ok",
  "seed": "theory",
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142},
    {"word": "practical", "log_dice": 3.73, "frequency": 98}
  ],
  "core_collocates": [...],
  "discovered_nouns": [
    {
      "word": "development",
      "shared_count": 5,
      "shared_collocates": ["correct", "practical", "quantum"]
    }
  ]
}

Multi-Seed Semantic Field Exploration

curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&relation=adj_predicate&top=10"

Response:

{
  "status": "ok",
  "seeds": ["theory", "model", "hypothesis"],
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142}
  ],
  "seed_collocates_count": 23,
  "core_collocates": [],
  "common_collocates": [],
  "common_collocates_count": 0,
  "discovered_nouns": ["theory", "model", "hypothesis"],
  "edges": [
    {"source": "theory", "target": "correct", "log_dice": 4.21, "type": "SURFACE"}
  ]
}

Note: All seed_collocates items have the same shape {word, log_dice, frequency} across both endpoints.

Concordance Examples for Word Pairs

curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=10"

Get actual example sentences from the corpus containing both words (lemmas). This feature validates collocations by showing real usage contexts.

How It Works:

Uses SpanNearQuery to efficiently find sentences where both lemmas appear within 10 words
Decodes token data (word, lemma, tag, position) from BinaryDocValues (tokens field)
Generates HTML with <mark> tags highlighting both target words
Returns sentence text, highlighted HTML, and position arrays

Technical Details:

The HYBRID index stores tokens as BinaryDocValues, decoded via TokenSequenceCodec
Lemma field is indexed with positions, enabling fast SpanQueries
No need to store lemma/word/tag as separate StoredFields - DocValues provide O(1) lookup
Query complexity: O(log N) for SpanQuery + O(k) for decoding k matching documents

Parameters:

seed (required) - Headword (lemma)
collocate (required) - Collocate word (lemma)
top (optional) - Number of examples to return (default: 10)
relation (optional) - Grammatical relation ID (default: noun_adj_predicates)

Response:

{
  "status": "ok",
  "seed": "house",
  "collocate": "big",
  "relation": "noun_adj_predicates",
  "top": 10,
  "total_results": 3,
  "examples": [
    {
      "sentence": "The big house! - The big house.",
      "raw": "The big house ! - The big house ."
    },
    {
      "sentence": "Houses Big and beautiful house with 4 bedrooms Houses big...",
      "raw": "Houses Big and beautiful house with 4 bedrooms Houses big ..."
    }
  ]
}

Response Fields:

sentence - Raw sentence text from the corpus
raw - Tokenized sentence (space-separated)

Use Cases:

Validate collocations before citing in research
Understand usage contexts and frequency patterns
Discover idiomatic expressions and multi-word units
Quality check corpus tagging and lemmatization

Integration with Web UI:

Word Sketch tab: Click any collocation word to see inline examples
Semantic Field Explorer: Click graph edges to see example sentences
Examples appear in expandable panels below the visualization
Up to 10 examples shown with "Load More" option for additional contexts

Web Interface (Semantic Field Explorer)

The webapp/ directory contains an interactive web interface built with D3.js.

Features

Word Sketch Search
- Browse collocations for any lemma
- Filter by POS tags
- Click any collocation to see example sentences from the corpus
- Examples appear in a panel below with highlighted target words
- Adjust logDice thresholds
Single-Seed Exploration
- Bootstrap from one seed word
- Select grammatical relation
- Discover semantically similar words
- Force-directed graph visualization
Multi-Seed Exploration (NEW)
- Explore from multiple seeds at once
- See all collocates per seed
- Identify common patterns
- Cluster-based semantic field analysis

Start Both Services

# Terminal 1: API Server
java -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080

# Terminal 2: Web Server
python -m http.server 3000 --directory webapp

# Open browser to http://localhost:3000

To configure a non-default CORS origin (e.g., for production), use the cors.allow.origin JVM property:

java -Dcors.allow.origin=https://myapp.example.com \
    -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080

CQL Pattern Syntax

Basic Patterns

Pattern	Meaning
`"house"`	Match lemma "house"
`[tag="NN.*"]`	Match POS tag regex (nouns)
`[tag="JJ"]`	Match exact POS tag
`[word="the"]`	Match word form

Constraints

[tag="JJ.*"]              # Adjectives (any type)
[tag="VB.*"]              # Verbs (any type)
[tag="NN.*"]              # Nouns
[tag!="NN.*"]             # NOT nouns
[tag="JJ"|tag="RB"]       # Adjectives OR adverbs

Distance Modifiers

[tag="JJ"]                # Adjacent (distance = 1)
[tag="JJ"] ~ {0,3}        # Within 0-3 words
[tag="JJ"] ~ {1,5}        # 1-5 words apart

Examples

# Adjectives modifying a noun
[tag="jj.*"]

# Verbs taking noun as object
[tag="vb.*"]

# Adjectives within 3 words
[tag="jj.*"] ~ {0,3}

Architecture

Query Pipeline

User Input
    ↓
CQL Pattern Parser (CQLParser.java)
    ↓
Lucene SpanQuery Compiler (CQLToLuceneCompiler.java)
    ↓
Index Lookup (Lucene)
    ↓
logDice Scorer (LogDiceCalculator.java)
    ↓
Response (JSON/HTML)

Index Structure

Field	Type	Purpose
`doc_id`	Numeric, stored	Sentence ID
`position`	Numeric, stored	Word position
`word`	Stored	Raw word form
`lemma`	Indexed	Lemma for search
`tag`	Keyword, indexed	POS tag (NN, JJ, VB, etc.)
`pos_group`	Keyword	Broad category (noun/verb/adj/adv)
`sentence`	Stored	Full sentence

Collocation Computation

logDice (Default)

logDice = log₂(2 * f(A,B) / (f(A) + f(B))) + 14

Scale: 0-14 (14 = perfect association)
Symmetric measure - same value regardless of direction

MI3 (Mutual Information)

MI3 = log₂((f(A,B) * N) / (f(A) * f(B)))

Higher values indicate stronger association
Good for finding rare but informative collocations

T-Score

T = (f(A,B) - expected) / sqrt(expected)
where expected = (f(A) * f(B)) / N

Measures statistical significance
Higher absolute values indicate more significant associations

Log-Likelihood (G-squared)

G2 = 2 * f(A,B) * log(f(A,B) / expected)

Measures deviance from expected co-occurrence
Higher values indicate greater statistical significance

Parameters:

f(A,B) = co-occurrence frequency (collocate with headword)
f(A) = headword frequency
f(B) = collocate total frequency
N = total tokens in corpus

Query API:

The server uses logDice scoring by default. Simply query the sketch endpoint:

curl "http://localhost:8080/api/sketch/house"

Project Structure

concept-sketch/
├── src/main/java/pl/marcinmilkowski/word_sketch/
│   ├── Main.java                    # CLI entry point
│   ├── api/
│   │   ├── WordSketchApiServer.java          # REST API server (14 endpoints)
│   │   ├── ConcordanceHandlers.java          # Handlers for concordance/examples endpoints
│   │   ├── CorpusQueryHandlers.java          # Handler for BCQL corpus query endpoint
│   │   ├── ExplorationHandlers.java          # Handlers for semantic field exploration endpoints
│   │   ├── ExploreResponseAssembler.java     # Builds JSON response maps for exploration results
│   │   ├── GrammarConfigSerializer.java      # Serializes GrammarConfig/RelationConfig to JSON
│   │   ├── HttpApiUtils.java                 # HTTP utilities: sendJsonResponse, parseQueryParams
│   │   ├── RequestEntityTooLargeException.java  # RuntimeException for HTTP 413 responses
│   │   ├── SketchHandlers.java               # Handlers for word sketch endpoints
│   │   └── VisualizationHandlers.java        # Handler for radial plot endpoint
│   ├── config/
│   │   ├── GrammarConfig.java                # Immutable grammar configuration (relations, version)
│   │   ├── GrammarConfigLoader.java          # Loads grammar config from JSON
│   │   ├── RelationConfig.java               # Single relation: pattern, deprel derivation
│   │   ├── RelationPatternBuilder.java       # Builds CQL patterns for relations
│   │   └── RelationUtils.java               # Utility: relation type checks
│   ├── exploration/
│   │   ├── CollocateProfileComparator.java   # Compares adjective profiles across seed nouns
│   │   ├── ExplorationService.java           # Interface for exploration operations
│   │   ├── MultiSeedExplorer.java            # Multi-seed semantic field exploration
│   │   └── SemanticFieldExplorer.java        # Coordination facade for SEF (single + multi seed)
│   ├── indexer/
│   │   └── blacklab/
│   │       ├── BlackLabConllUIndexer.java    # CoNLL-U corpus indexer for BlackLab
│   │       └── ConlluConverter.java          # Converts CoNLL-U to WPL chunk format
│   ├── model/
│   │   ├── FetchExamplesOptions.java         # Options for fetchExamples
│   │   ├── PosGroup.java                     # POS group enum: NOUN, VERB, ADJ, ADV, OTHER
│   │   ├── QueryResults.java                 # Result DTOs: WordSketchResult, ConcordanceResult
│   │   ├── RelationType.java                 # Enum: SURFACE | DEP
│   │   └── exploration/
│   │       ├── CollocateProfile.java         # Adjective collocate profile for SEF comparison
│   │       ├── ComparisonResult.java         # Result DTO for compareCollocateProfiles()
│   │       ├── CoreCollocate.java            # High-coverage shared collocate
│   │       ├── DiscoveredNoun.java           # Noun discovered via shared adjectives
│   │       ├── Edge.java                     # Graph edge for D3.js visualization
│   │       ├── ExplorationOptions.java       # Base options for SEF exploration
│   │       ├── ExplorationResult.java        # Top-level result DTO for SEF exploration
│   │       ├── RelationEdgeType.java         # Enum for edge types in exploration graphs
│   │       ├── SharingCategory.java          # Enum: FULLY_SHARED, PARTIALLY_SHARED, SPECIFIC
│   │       └── SingleSeedExplorationOptions.java  # Options for single-seed exploration
│   ├── query/
│   │   ├── BlackLabQueryExecutor.java        # BlackLab-backed query executor
│   │   ├── BlackLabSnippetParser.java        # Parses BlackLab XML snippets
│   │   ├── CollocateQueryHelper.java         # Low-level collocate frequency/example lookup
│   │   └── QueryExecutor.java               # Query executor interface
│   ├── utils/
│   │   ├── CqlUtils.java                    # CQL parsing: splitCqlTokens, escapeForRegex
│   │   ├── LogDiceUtils.java                # logDice scoring
│   │   └── MathUtils.java                   # Math utilities: round2dp
│   └── viz/
│       └── RadialPlot.java                  # Radial plot data builder
├── webapp/
│   ├── index.html                   # Web UI (D3.js visualization)
│   └── assets/                      # CSS, D3.js
├── src/test/java/                   # Unit tests
├── pom.xml                          # Maven config
└── README.md                        # This file

Technical Deep Dive

Concordance Examples Implementation

The concordance feature efficiently retrieves example sentences containing word pairs using a two-stage approach:

Stage 1: SpanQuery for Fast Document Retrieval

// Build SpanNearQuery: both lemmas within 10 words
SpanTermQuery span1 = new SpanTermQuery(new Term("lemma", "house"));
SpanTermQuery span2 = new SpanTermQuery(new Term("lemma", "big"));

SpanNearQuery nearQuery = SpanNearQuery.newUnorderedNearQuery("lemma")
    .addClause(span1)
    .addClause(span2)
    .setSlop(10)  // Max distance: 10 tokens
    .build();

TopDocs results = searcher.search(nearQuery, limit);

Stage 2: DocValues Decoding for Token Details

// For each matching document, decode tokens from BinaryDocValues
BinaryDocValues tokensDV = reader.getBinaryDocValues("tokens");
tokensDV.advanceExact(docId);
BytesRef tokensBytes = tokensDV.binaryValue();

// Decode using TokenSequenceCodec
List<Token> tokens = TokenSequenceCodec.decode(tokensBytes);

// Each token contains: position, word, lemma, tag, startOffset, endOffset

Why This Design?

Compact Storage: Tokens stored as binary (varint encoding) instead of separate fields
- Typical sentence (~20 tokens): 400-600 bytes vs 1-2KB for separate fields
- 62M sentence corpus: ~30GB vs ~80GB storage
Fast Retrieval:
- SpanQuery uses inverted index with positions → O(log N) lookup
- DocValues provide O(1) document access (memory-mapped)
- No need to reconstruct from stored text
Position Accuracy:
- Positions preserved from tagging pipeline
- Support for multi-word tokens and contractions
- Exact alignment with original text offsets

Binary Encoding Format (TokenSequenceCodec):

[token_count: varint]
For each token:
  [position: varint]
  [word_length: varint][word: UTF-8]
  [lemma_length: varint][lemma: UTF-8]
  [tag_length: varint][tag: UTF-8]
  [start_offset: varint]
  [end_offset: varint]

Varint encoding saves space for common cases (positions < 128 = 1 byte).

Examples

correct (logDice: 4.21)

Dependency Sketches

What are Dependency Sketches?

Dependency sketches are visual or data-driven representations of how words relate to each other based on syntactic dependencies in the corpus. They help users understand grammatical and semantic relationships beyond simple collocations, leveraging dependency parsing to reveal patterns such as subject, object, modifier, and predicate relations.

Usage

Dependency sketches are generated from parsed corpora (e.g., CoNLL-U format) and can be explored via the API and web UI. They provide insights into grammatical structures and are useful for linguistic analysis, semantic field exploration, and advanced querying.

Example

For the noun "theory", a dependency sketch might show its typical subjects, objects, and modifiers, visualized as a graph or listed in ranked tables.

See also: MULTI_SEED_EXPLORATION.md for advanced semantic field features. practical (logDice: 3.73) wrong (logDice: 3.58) mathematical (logDice: 3.47) quantum (logDice: 2.89)


### Example 2: Find Words "House" Can Be Object Of

```bash
curl "http://localhost:8080/api/semantic-field/explore?seed=house&relation=object_of&top=10"

Result: Find verbs that take "house" as object

locate (logDice: 5.12)
build (logDice: 4.89)
buy (logDice: 4.21)

Discovered nouns (words that share these verbs):

hotel (shared: build, locate)
apartment (shared: build, buy, locate)
property (shared: buy, locate)

Example 3: Multi-Seed Cluster Analysis

curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=dog,cat,horse&relation=subject_of&top=8"

Result: What do dogs, cats, and horses do?

All seeds can: eat, run, live
Dog-specific: bark, beg, fetch
Cat-specific: meow, purr, scratch

Development

Run Tests

mvn test

Build Documentation

See plans/ directory for:

concept-sketch-spec.md - Overall technical specification
precomputed-collocations-spec.md - Precomputed algorithm details
hybrid-index-spec.md - Hybrid index architecture

Code Quality

Tests cover:

CQL parsing (50+ patterns)
Lucene query compilation
logDice calculation
API endpoints
Multi-seed exploration

Name		Name	Last commit message	Last commit date
Latest commit History 392 Commits
.claude		.claude
.github/workflows		.github/workflows
.settings		.settings
.vscode		.vscode
__pycache__		__pycache__
diagnostics		diagnostics
docs		docs
grammars		grammars
index		index
plans		plans
scripts		scripts
src		src
test-data		test-data
webapp		webapp
.classpath		.classpath
.gitignore		.gitignore
.project		.project
BLACKLAB_SETUP.md		BLACKLAB_SETUP.md
CLAUDE.md		CLAUDE.md
MULTI_SEED_EXPLORATION.md		MULTI_SEED_EXPLORATION.md
README.md		README.md
STANZA_GPU.md		STANZA_GPU.md
blacklab-core-pom.xml		blacklab-core-pom.xml
conllu-integrated.blf.yaml		conllu-integrated.blf.yaml
conllu-sentences.blf.yaml		conllu-sentences.blf.yaml
filter_conllu_boilerplate.py		filter_conllu_boilerplate.py
filter_text_corpus.py		filter_text_corpus.py
index-conllu-blacklab.ps1		index-conllu-blacklab.ps1
index_corpus.ps1		index_corpus.ps1
index_corpus.sh		index_corpus.sh
install-blacklab.ps1		install-blacklab.ps1
install-blacklab.sh		install-blacklab.sh
pom.xml		pom.xml
requirements-stanza-gpu.txt		requirements-stanza-gpu.txt
screenshot1.png		screenshot1.png
screenshot2.png		screenshot2.png
semantic-exploration.png		semantic-exploration.png
tag_and_build.ps1		tag_and_build.ps1
tag_with_stanza.py		tag_with_stanza.py
word_sketch.ps1		word_sketch.ps1
word_sketch.sh		word_sketch.sh

Folders and files

Latest commit

History

Repository files navigation

ConceptSketch

Features

Quick Start (5 minutes)

Prerequisites

1. Build

Corpus Data for Testing

2. Create an Index

Step 1 — Prepare a CoNLL-U corpus

Step 2 — Preprocess: add <s> sentence markers

Step 3 — Index with BlackLab

3. Start API Server

4. Start Web Interface

5. Try a Query

Core Usage

Index a Corpus

Prerequisites

Step 1 — Preprocess CoNLL-U: add sentence markers

Step 2 — Create a BlackLab index

Indexed annotations

Query via Command Line

Grammar Configuration

REST API Endpoints

Health Check

Get Word Sketch

Single-Seed Semantic Field Exploration

Multi-Seed Semantic Field Exploration

Concordance Examples for Word Pairs

Web Interface (Semantic Field Explorer)

Features

Start Both Services

CQL Pattern Syntax

Basic Patterns

Constraints

Distance Modifiers

Examples

Architecture

Query Pipeline

Index Structure

Collocation Computation

logDice (Default)

MI3 (Mutual Information)

T-Score

Log-Likelihood (G-squared)

Project Structure

Technical Deep Dive

Concordance Examples Implementation

Examples

Dependency Sketches

What are Dependency Sketches?

Usage

Example

Example 3: Multi-Seed Cluster Analysis

Development

Run Tests

Build Documentation

Code Quality

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Step 2 — Preprocess: add `<s>` sentence markers

Packages