A high-performance corpus-based collocation analysis tool built on the BlackLab corpus search software (which relies on Apache Lucene). This project implements word and dependency sketch functionality (grammatical relations and collocations), semantic field exploration, and conceptual mining for corpus linguistics research and NLP applications.
- Fast Collocation Analysis: O(1) instant lookup with precomputed collocations
- CQL Support: Full Corpus Query Language with distance modifiers and constraints
- logDice Scoring: Association strength metric (0-14 scale)
- Multiple Grammatical Relations: ADJ_PREDICATE, ADJ_MODIFIER, SUBJECT_OF, OBJECT_OF
- Concordance Examples: View real corpus sentences for any word pair with highlighting
- REST API: HTTP server with semantic field exploration endpoints
- Web Interface: Interactive Semantic Field Explorer with D3.js visualization
- Multi-Seed Exploration: Explore semantic fields using multiple seed words
- Java 17+ (Java 21+ recommended)
- Maven 3.6+
- Python 3 (for web server)
```bash
mvn clean package
```

You can test ConceptSketch by downloading this indexed and tagged corpus:
- Frontiers in Psychology Corpus, https://doi.org/10.18150/4LJ9WD
It is sufficiently large to provide interesting insights about the language of contemporary psychology (2010-2021, before the advent of AI-generated papers).
Tag your text with any CoNLL-U-producing tool. The project includes a Stanza GPU script for efficient tagging:
Option A: Use the Stanza script (recommended)

```bash
# Download model (one-time)
python tag_with_stanza.py --download --lang en
# Tag corpus (uses GPU automatically if available)
python tag_with_stanza.py \
  --input corpus.txt \
  --output corpus.conllu \
  --lang en
```

For GPU tuning and more options, see STANZA_GPU.md.

Option B: Use UDPipe 2 directly

```bash
udpipe --tokenize --tag --parse --output=conllu english.udpipe corpus.txt > corpus.conllu
```

Option C: Use another CoNLL-U tagger (Stanza in Python without GPU, spaCy, etc.)
BlackLab's tabular parser requires explicit inline tags for sentence boundaries.
The project ships a script that converts CoNLL-U blank-line sentence boundaries
into <s> / </s> inline tags:
```bash
python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu
```

Move the output file into a dedicated input directory:

```bash
mkdir input_dir
mv corpus_s.conllu input_dir/
```

The shaded JAR bundles BlackLab's IndexTool. Run it from the project root (so `--format-dir .` can find `conllu-sentences.blf.yaml`):

```bash
java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences
```

| Argument | Meaning |
|---|---|
| `--format-dir .` | Directory containing `conllu-sentences.blf.yaml` |
| `my_index/` | Output index directory (created automatically) |
| `input_dir/` | Directory with preprocessed `.conllu` files |
| `conllu-sentences` | Format name (matches the `.blf.yaml` filename) |
```bash
# Terminal 1
java -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080
```

CORS configuration: By default the API allows requests from `http://localhost:3000`. To allow a different origin, pass the `cors.allow.origin` JVM system property:

```bash
java -Dcors.allow.origin=https://myapp.example.com \
  -jar target/concept-sketch-1.6.0-shaded.jar server --index my_index/ --port 8080
```

Server startup output:
API server started on http://localhost:8080
Algorithm: PRECOMPUTED (O(1) instant lookup)
Endpoints:
GET /health
GET /api/sketch/{lemma}
GET /api/relations
GET /api/relations/dep
GET /api/semantic-field/explore
GET /api/semantic-field/explore-multi
GET /api/semantic-field/examples
GET /api/concordance/examples
GET /api/visual/radial
GET /api/bcql (POST)
```bash
# Terminal 2
python -m http.server 3000 --directory webapp
```

Open a browser at http://localhost:3000.

The web interface can produce radial plots of collocates.
It also offers several semantic exploration features.
```bash
# Find adjectives describing "house"
curl "http://localhost:8080/api/sketch/house"
# Get example sentences for "house" + "big"
curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=5"
# Explore semantic field from "theory" (ADJ_PREDICATE relation)
curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate"
# Multi-seed exploration
curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&top=10"
```

- A corpus in CoNLL-U format (columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
- The project's `conllu-sentences.blf.yaml` format file (in the project root)
- Java 21+ and the shaded JAR (`target/concept-sketch-1.6.0-shaded.jar`)

BlackLab's tabular parser needs explicit <s> / </s> inline tags to index
sentence spans. The bundled script converts CoNLL-U blank-line boundaries:
```bash
python scripts/conllu_to_wpl.py corpus.conllu corpus_s.conllu
```

What the script does:
- Skips comment lines (`#`) and multi-word token lines (`1-2`, `1.1`, …)
- Emits `<s>` before the first token of each sentence
- Emits `</s>` after the last token
- Preserves all 10 CoNLL-U columns as tab-separated values

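The conversion steps listed above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the bundled `scripts/conllu_to_wpl.py`; the function name `add_sentence_tags` is hypothetical, and it assumes that sentences are separated by blank lines and that multi-word token and empty-node IDs contain `-` or `.`, as the CoNLL-U format specifies.

```python
from typing import Iterable, Iterator


def add_sentence_tags(lines: Iterable[str]) -> Iterator[str]:
    """Yield CoNLL-U token lines wrapped in <s> / </s> sentence tags."""
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                yield "</s>"
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # comment / metadata line
        columns = line.split("\t")
        token_id = columns[0]
        if "-" in token_id or "." in token_id:
            continue  # multi-word token range (1-2) or empty node (1.1)
        if not in_sentence:
            yield "<s>"
            in_sentence = True
        yield "\t".join(columns)  # all columns kept, tab-separated
    if in_sentence:
        yield "</s>"  # input did not end with a blank line


SAMPLE = [
    "# text = Big house",
    "1\tBig\tbig\tADJ\tJJ\t_\t2\tamod\t_\t_",
    "2\thouse\thouse\tNOUN\tNN\t_\t0\troot\t_\t_",
    "",
]

if __name__ == "__main__":
    print("\n".join(add_sentence_tags(SAMPLE)))
```

Writing it as a generator keeps memory use flat, which matters when the tagged corpus is too large to hold in memory at once.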
```bash
mkdir input_dir
cp corpus_s.conllu input_dir/
# Run from the project root (so --format-dir finds conllu-sentences.blf.yaml)
java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool create \
  --format-dir . \
  my_index/ input_dir/ conllu-sentences
```

To add more documents to an existing index later:

```bash
java -cp target/concept-sketch-1.6.0-shaded.jar \
  nl.inl.blacklab.tools.IndexTool add \
  --format-dir . \
  my_index/ more_input_dir/ conllu-sentences
```

| Annotation | Source column | Forward index |
|---|---|---|
| `word` | FORM (col 2) | ✓ |
| `lemma` | LEMMA (col 3) | ✓ |
| `pos` | UPOS (col 4) | ✓ |
| `xpos` | XPOS (col 5) | ✓ |
| `deprel` | DEPREL (col 8) | ✓ |
| `wordnum` | ID (col 1) | — |
| `feats` | FEATS (col 6) | — |
| `head` | HEAD (col 7) | — |

```bash
# Find all collocations for "theory"
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory
# Find adjectival modifiers of "theory" (deprel=amod)
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory --deprel amod
# Increase result count and filter by logDice
java -jar target/concept-sketch-1.6.0-shaded.jar \
  blacklab-query --index my_index/ --lemma theory \
  --deprel nsubj --limit 50 --min-logdice 4.0
```

The grammar configuration (relations, copulas, CQL patterns) is externalized in JSON:
Config file: grammars/relations.json
```json
{
  "version": "1.0",
  "copulas": ["be", "appear", "seem", "become", ...],
  "relations": [
    {
      "id": "noun_adj_predicates",
      "name": "Adjectives (predicative)",
      "head_pos": "noun",
      "collocate_pos": "adj",
      "cql_pattern": "[tag=jj.*]",
      "uses_copula": true,
      "default_slop": 8,
      "relation_type": "ADJ_PREDICATE",
      "exploration_enabled": true
    },
    ...
  ]
}
```

Fields:
- `id` - Unique relation identifier
- `name` - Human-readable name
- `head_pos` - POS group of the head word (noun, verb, adj)
- `collocate_pos` - POS group of the collocate
- `cql_pattern` - CQL pattern to match
- `uses_copula` - Whether this relation uses copula verbs
- `default_slop` - Default distance window
- `relation_type` - Semantic relation type
- `exploration_enabled` - Whether usable for semantic field exploration

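Before restarting the server after editing `grammars/relations.json`, it can help to check that every relation carries the fields listed above. The following is a minimal sketch of such a check; `find_relation_problems` is a hypothetical helper, not part of the project, and it assumes all nine fields are mandatory, which the project may not actually require.

```python
import json
from typing import List

# Assumption: every field documented above is required for each relation.
REQUIRED_FIELDS = {
    "id", "name", "head_pos", "collocate_pos", "cql_pattern",
    "uses_copula", "default_slop", "relation_type", "exploration_enabled",
}


def find_relation_problems(config: dict) -> List[str]:
    """Return human-readable problems found in a grammar config dict."""
    problems = []
    seen_ids = set()
    for index, relation in enumerate(config.get("relations", [])):
        label = relation.get("id", f"relations[{index}]")
        missing = sorted(REQUIRED_FIELDS - relation.keys())
        if missing:
            problems.append(f"{label}: missing {', '.join(missing)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


# The single relation shown in the README, with the elided entries left out.
SAMPLE = json.loads("""
{
  "version": "1.0",
  "copulas": ["be", "appear", "seem", "become"],
  "relations": [
    {
      "id": "noun_adj_predicates",
      "name": "Adjectives (predicative)",
      "head_pos": "noun",
      "collocate_pos": "adj",
      "cql_pattern": "[tag=jj.*]",
      "uses_copula": true,
      "default_slop": 8,
      "relation_type": "ADJ_PREDICATE",
      "exploration_enabled": true
    }
  ]
}
""")

if __name__ == "__main__":
    print(find_relation_problems(SAMPLE) or "no problems found")
```

To check a real file, load it with `json.load(open("grammars/relations.json"))` and pass the result to the same function.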
API endpoint:
To view active relations, use GET /api/relations.
To modify relations or add new ones, edit grammars/relations.json and restart the server.
```bash
curl http://localhost:8080/health
```

```bash
curl "http://localhost:8080/api/sketch/house"
```

To limit full sketches to relations whose head is a specific POS group:

```bash
curl "http://localhost:8080/api/sketch/predict?head_pos=verb"
```

Accepted values: noun, verb, adj, adv.

Response:
```json
{
  "status": "ok",
  "lemma": "house",
  "patterns": {
    "noun_modifiers": {
      "name": "Adjectives modifying (ADJ X)",
      "cql": "[tag=jj.*]~{0,3}",
      "total_matches": 3421,
      "collocations": [
        {
          "lemma": "big",
          "frequency": 287,
          "logDice": 11.24,
          "relativeFrequency": 0.084
        }
      ]
    }
  }
}
```

```bash
curl "http://localhost:8080/api/semantic-field/explore?seed=theory&relation=adj_predicate&top=15&min_logdice=2"
```

Relations:
- `adj_predicate`: "X is ADJ" (e.g., "theory is correct")
- `adj_modifier`: "ADJ X" (e.g., "correct theory")
- `subject_of`: subject verbs for noun-head sketches, strict local pattern (e.g., "theory suggests")
- `noun_verbs`: nearby verbs for noun-head sketches, looser window than `subject_of`
- `verb_subjects`: subject nouns for verb-head sketches (e.g., "theory predicts")
- `object_of`: object nouns for verb-head sketches, strict local pattern (e.g., "develop theory")
- `verb_nouns`: nearby nouns for verb-head sketches, looser window than `object_of`

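When calling the exploration endpoint from a script, building the query string programmatically avoids quoting mistakes and catches misspelled relation names early. The sketch below is illustrative only: `explore_url` is a hypothetical helper, and its defaults (`top=15`, `min_logdice=2`, port 8080) are copied from the example request above rather than from documented server defaults.

```python
from urllib.parse import urlencode

# Relation names as listed in this README.
KNOWN_RELATIONS = {
    "adj_predicate", "adj_modifier", "subject_of", "noun_verbs",
    "verb_subjects", "object_of", "verb_nouns",
}


def explore_url(seed: str, relation: str, top: int = 15,
                min_logdice: float = 2,
                base: str = "http://localhost:8080") -> str:
    """Build a single-seed exploration URL with properly encoded parameters."""
    if relation not in KNOWN_RELATIONS:
        raise ValueError(f"unknown relation: {relation!r}")
    query = urlencode({
        "seed": seed,
        "relation": relation,
        "top": top,
        "min_logdice": min_logdice,
    })
    return f"{base}/api/semantic-field/explore?{query}"


if __name__ == "__main__":
    print(explore_url("theory", "adj_predicate"))
```

The resulting URL can be passed to `curl` or to `urllib.request.urlopen` once the API server is running.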
Response:
```json
{
  "status": "ok",
  "seed": "theory",
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142},
    {"word": "practical", "log_dice": 3.73, "frequency": 98}
  ],
  "core_collocates": [...],
  "discovered_nouns": [
    {
      "word": "development",
      "shared_count": 5,
      "shared_collocates": ["correct", "practical", "quantum"]
    }
  ]
}
```

```bash
curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=theory,model,hypothesis&relation=adj_predicate&top=10"
```

Response:
```json
{
  "status": "ok",
  "seeds": ["theory", "model", "hypothesis"],
  "seed_collocates": [
    {"word": "correct", "log_dice": 4.21, "frequency": 142}
  ],
  "seed_collocates_count": 23,
  "core_collocates": [],
  "common_collocates": [],
  "common_collocates_count": 0,
  "discovered_nouns": ["theory", "model", "hypothesis"],
  "edges": [
    {"source": "theory", "target": "correct", "log_dice": 4.21, "type": "SURFACE"}
  ]
}
```

Note: All `seed_collocates` items have the same shape `{word, log_dice, frequency}` across both endpoints.

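A client that consumes the multi-seed response will usually want the `edges` array regrouped by seed. The following sketch shows one way to do that; `group_edges_by_seed` is a hypothetical helper, and the sample dictionary contains only the single edge shown in the response above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_edges_by_seed(response: dict) -> Dict[str, List[Tuple[str, float]]]:
    """Map each edge source to its (target, log_dice) pairs, strongest first."""
    grouped = defaultdict(list)
    for edge in response.get("edges", []):
        grouped[edge["source"]].append((edge["target"], edge["log_dice"]))
    return {
        seed: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for seed, pairs in grouped.items()
    }


# Trimmed copy of the example response (only the fields used here).
SAMPLE_RESPONSE = {
    "status": "ok",
    "seeds": ["theory", "model", "hypothesis"],
    "edges": [
        {"source": "theory", "target": "correct",
         "log_dice": 4.21, "type": "SURFACE"},
    ],
}

if __name__ == "__main__":
    print(group_edges_by_seed(SAMPLE_RESPONSE))
```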
```bash
curl "http://localhost:8080/api/concordance/examples?seed=house&collocate=big&top=10"
```

Get actual example sentences from the corpus containing both words (lemmas). This feature validates collocations by showing real usage contexts.
How It Works:
- Uses SpanNearQuery to efficiently find sentences where both lemmas appear within 10 words
- Decodes token data (word, lemma, tag, position) from BinaryDocValues (`tokens` field)
- Generates HTML with `<mark>` tags highlighting both target words
- Returns sentence text, highlighted HTML, and position arrays
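The highlighting step can be illustrated with a short sketch. It is not the server's implementation: the function name `highlight` and the `(word, lemma)` tuple representation are assumptions made for the example, and tokens are simply joined with spaces.

```python
from html import escape
from typing import Iterable, List, Tuple


def highlight(tokens: List[Tuple[str, str]], target_lemmas: Iterable[str]) -> str:
    """Join word forms into HTML, wrapping tokens whose lemma is a target."""
    targets = {lemma.lower() for lemma in target_lemmas}
    parts = []
    for word, lemma in tokens:
        text = escape(word)  # never emit raw corpus text into HTML
        if lemma.lower() in targets:
            text = f"<mark>{text}</mark>"
        parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    sentence = [("The", "the"), ("big", "big"), ("houses", "house")]
    print(highlight(sentence, ["house", "big"]))
```

Matching on the lemma rather than the word form is what lets an inflected form such as "houses" be highlighted for the seed "house".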
Technical Details:
- The HYBRID index stores tokens as BinaryDocValues, decoded via `TokenSequenceCodec`
- Lemma field is indexed with positions, enabling fast SpanQueries
- No need to store lemma/word/tag as separate StoredFields - DocValues provide O(1) lookup
- Query complexity: O(log N) for SpanQuery + O(k) for decoding k matching documents
Parameters:
- `seed` (required) - Headword (lemma)
- `collocate` (required) - Collocate word (lemma)
- `top` (optional) - Number of examples to return (default: 10)
- `relation` (optional) - Grammatical relation ID (default: noun_adj_predicates)

Response:
```json
{
  "status": "ok",
  "seed": "house",
  "collocate": "big",
  "relation": "noun_adj_predicates",
  "top": 10,
  "total_results": 3,
  "examples": [
    {
      "sentence": "The big house! - The big house.",
      "raw": "The big house ! - The big house ."
    },
    {
      "sentence": "Houses Big and beautiful house with 4 bedrooms Houses big...",
      "raw": "Houses Big and beautiful house with 4 bedrooms Houses big ..."
    }
  ]
}
```

Response Fields:
- `sentence` - Raw sentence text from the corpus
- `raw` - Tokenized sentence (space-separated)

Use Cases:
- Validate collocations before citing in research
- Understand usage contexts and frequency patterns
- Discover idiomatic expressions and multi-word units
- Quality check corpus tagging and lemmatization
Integration with Web UI:
- Word Sketch tab: Click any collocation word to see inline examples
- Semantic Field Explorer: Click graph edges to see example sentences
- Examples appear in expandable panels below the visualization
- Up to 10 examples shown with "Load More" option for additional contexts
The webapp/ directory contains an interactive web interface built with D3.js.
- Word Sketch Search
  - Browse collocations for any lemma
  - Filter by POS tags
  - Click any collocation to see example sentences from the corpus
  - Examples appear in a panel below with highlighted target words
  - Adjust logDice thresholds
- Single-Seed Exploration
  - Bootstrap from one seed word
  - Select grammatical relation
  - Discover semantically similar words
  - Force-directed graph visualization
- Multi-Seed Exploration (NEW)
  - Explore from multiple seeds at once
  - See all collocates per seed
  - Identify common patterns
  - Cluster-based semantic field analysis

```bash
# Terminal 1: API Server
java -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080
# Terminal 2: Web Server
python -m http.server 3000 --directory webapp
# Open browser to http://localhost:3000
```

To configure a non-default CORS origin (e.g., for production), use the `cors.allow.origin` JVM property:

```bash
java -Dcors.allow.origin=https://myapp.example.com \
  -jar target/concept-sketch-1.6.0-shaded.jar server --index <corpus_path> --port 8080
```

| Pattern | Meaning |
|---|---|
| `"house"` | Match lemma "house" |
| `[tag="NN.*"]` | Match POS tag regex (nouns) |
| `[tag="JJ"]` | Match exact POS tag |
| `[word="the"]` | Match word form |

```
[tag="JJ.*"]          # Adjectives (any type)
[tag="VB.*"]          # Verbs (any type)
[tag="NN.*"]          # Nouns
[tag!="NN.*"]         # NOT nouns
[tag="JJ"|tag="RB"]   # Adjectives OR adverbs
```

```
[tag="JJ"]            # Adjacent (distance = 1)
[tag="JJ"] ~ {0,3}    # Within 0-3 words
[tag="JJ"] ~ {1,5}    # 1-5 words apart
```

```
# Adjectives modifying a noun
[tag="jj.*"]
# Verbs taking noun as object
[tag="vb.*"]
# Adjectives within 3 words
[tag="jj.*"] ~ {0,3}
```

User Input
↓
CQL Pattern Parser (CQLParser.java)
↓
Lucene SpanQuery Compiler (CQLToLuceneCompiler.java)
↓
Index Lookup (Lucene)
↓
logDice Scorer (LogDiceCalculator.java)
↓
Response (JSON/HTML)
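To make the first stage of this pipeline concrete, here is a miniature parser for a single CQL token pattern with an optional distance modifier. It is a Python sketch written for illustration and is unrelated to the project's `CQLParser.java`; the names `TokenPattern` and `parse_token_pattern` are hypothetical, and alternation such as `[tag="JJ"|tag="RB"]` is deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# One bracketed constraint, optionally followed by "~ {min,max}".
TOKEN_RE = re.compile(
    r'^\[\s*(?P<attr>\w+)\s*(?P<op>!?=)\s*"?(?P<value>[^"\]]+?)"?\s*\]'
    r'(?:\s*~\s*\{\s*(?P<lo>\d+)\s*,\s*(?P<hi>\d+)\s*\})?\s*$'
)


class TokenPattern(NamedTuple):
    attribute: str                # e.g. "tag", "word"
    negated: bool                 # True for the != operator
    value_regex: str              # e.g. "JJ.*"
    min_distance: Optional[int]   # None when no distance modifier is given
    max_distance: Optional[int]


def parse_token_pattern(pattern: str) -> TokenPattern:
    """Parse patterns such as [tag="JJ.*"] ~ {0,3} into a TokenPattern."""
    match = TOKEN_RE.match(pattern.strip())
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    lo, hi = match.group("lo"), match.group("hi")
    return TokenPattern(
        attribute=match.group("attr"),
        negated=match.group("op") == "!=",
        value_regex=match.group("value"),
        min_distance=int(lo) if lo is not None else None,
        max_distance=int(hi) if hi is not None else None,
    )


if __name__ == "__main__":
    print(parse_token_pattern('[tag="JJ.*"] ~ {0,3}'))
```

A structured result like this is what a later stage would translate into a positional query with the corresponding slop.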
| Field | Type | Purpose |
|---|---|---|
| `doc_id` | Numeric, stored | Sentence ID |
| `position` | Numeric, stored | Word position |
| `word` | Stored | Raw word form |
| `lemma` | Indexed | Lemma for search |
| `tag` | Keyword, indexed | POS tag (NN, JJ, VB, etc.) |
| `pos_group` | Keyword | Broad category (noun/verb/adj/adv) |
| `sentence` | Stored | Full sentence |

logDice = log₂(2 * f(A,B) / (f(A) + f(B))) + 14
- Scale: maximum of 14 (perfect association); useful collocations typically fall in the 0-14 range, and very weak pairs can score below 0
- Symmetric measure - same value regardless of direction
MI3 = log₂((f(A,B) * N) / (f(A) * f(B)))
- Higher values indicate stronger association
- Good for finding rare but informative collocations
T = (f(A,B) - expected) / sqrt(expected)
where expected = (f(A) * f(B)) / N
- Measures statistical significance
- Higher absolute values indicate more significant associations
G2 = 2 * f(A,B) * log(f(A,B) / expected)
- Measures deviance from expected co-occurrence
- Higher values indicate greater statistical significance
Parameters:
- `f(A,B)` = co-occurrence frequency (collocate with headword)
- `f(A)` = headword frequency
- `f(B)` = collocate total frequency
- `N` = total tokens in corpus

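The formulas above translate directly into code. The sketch below transcribes them exactly as written in this README, using the same parameters; the function names are chosen for the example, and the use of the natural logarithm for G2 is an assumption, since the formula does not state a base.

```python
import math


def log_dice(f_ab: float, f_a: float, f_b: float) -> float:
    """logDice = log2(2 * f(A,B) / (f(A) + f(B))) + 14."""
    return math.log2(2 * f_ab / (f_a + f_b)) + 14


def expected(f_a: float, f_b: float, n: float) -> float:
    """Expected co-occurrence frequency under independence."""
    return f_a * f_b / n


def mi3(f_ab: float, f_a: float, f_b: float, n: float) -> float:
    """MI3 as given above: log2((f(A,B) * N) / (f(A) * f(B)))."""
    return math.log2(f_ab * n / (f_a * f_b))


def t_score(f_ab: float, f_a: float, f_b: float, n: float) -> float:
    """T as given above: (f(A,B) - expected) / sqrt(expected)."""
    e = expected(f_a, f_b, n)
    return (f_ab - e) / math.sqrt(e)


def g2(f_ab: float, f_a: float, f_b: float, n: float) -> float:
    """G2 as given above, with natural log assumed."""
    return 2 * f_ab * math.log(f_ab / expected(f_a, f_b, n))


if __name__ == "__main__":
    # Two words that only ever occur together reach the logDice maximum.
    print(log_dice(10, 10, 10))
```

Note that logDice needs only the three pair-level frequencies, while the other measures also depend on the corpus size `N`, which is why logDice scores are easier to compare across corpora.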
Query API:
The server uses logDice scoring by default. Simply query the sketch endpoint:
```bash
curl "http://localhost:8080/api/sketch/house"
```

concept-sketch/
├── src/main/java/pl/marcinmilkowski/word_sketch/
│ ├── Main.java # CLI entry point
│ ├── api/
│ │ ├── WordSketchApiServer.java # REST API server (14 endpoints)
│ │ ├── ConcordanceHandlers.java # Handlers for concordance/examples endpoints
│ │ ├── CorpusQueryHandlers.java # Handler for BCQL corpus query endpoint
│ │ ├── ExplorationHandlers.java # Handlers for semantic field exploration endpoints
│ │ ├── ExploreResponseAssembler.java # Builds JSON response maps for exploration results
│ │ ├── GrammarConfigSerializer.java # Serializes GrammarConfig/RelationConfig to JSON
│ │ ├── HttpApiUtils.java # HTTP utilities: sendJsonResponse, parseQueryParams
│ │ ├── RequestEntityTooLargeException.java # RuntimeException for HTTP 413 responses
│ │ ├── SketchHandlers.java # Handlers for word sketch endpoints
│ │ └── VisualizationHandlers.java # Handler for radial plot endpoint
│ ├── config/
│ │ ├── GrammarConfig.java # Immutable grammar configuration (relations, version)
│ │ ├── GrammarConfigLoader.java # Loads grammar config from JSON
│ │ ├── RelationConfig.java # Single relation: pattern, deprel derivation
│ │ ├── RelationPatternBuilder.java # Builds CQL patterns for relations
│ │ └── RelationUtils.java # Utility: relation type checks
│ ├── exploration/
│ │ ├── CollocateProfileComparator.java # Compares adjective profiles across seed nouns
│ │ ├── ExplorationService.java # Interface for exploration operations
│ │ ├── MultiSeedExplorer.java # Multi-seed semantic field exploration
│ │ └── SemanticFieldExplorer.java # Coordination facade for SEF (single + multi seed)
│ ├── indexer/
│ │ └── blacklab/
│ │ ├── BlackLabConllUIndexer.java # CoNLL-U corpus indexer for BlackLab
│ │ └── ConlluConverter.java # Converts CoNLL-U to WPL chunk format
│ ├── model/
│ │ ├── FetchExamplesOptions.java # Options for fetchExamples
│ │ ├── PosGroup.java # POS group enum: NOUN, VERB, ADJ, ADV, OTHER
│ │ ├── QueryResults.java # Result DTOs: WordSketchResult, ConcordanceResult
│ │ ├── RelationType.java # Enum: SURFACE | DEP
│ │ └── exploration/
│ │ ├── CollocateProfile.java # Adjective collocate profile for SEF comparison
│ │ ├── ComparisonResult.java # Result DTO for compareCollocateProfiles()
│ │ ├── CoreCollocate.java # High-coverage shared collocate
│ │ ├── DiscoveredNoun.java # Noun discovered via shared adjectives
│ │ ├── Edge.java # Graph edge for D3.js visualization
│ │ ├── ExplorationOptions.java # Base options for SEF exploration
│ │ ├── ExplorationResult.java # Top-level result DTO for SEF exploration
│ │ ├── RelationEdgeType.java # Enum for edge types in exploration graphs
│ │ ├── SharingCategory.java # Enum: FULLY_SHARED, PARTIALLY_SHARED, SPECIFIC
│ │ └── SingleSeedExplorationOptions.java # Options for single-seed exploration
│ ├── query/
│ │ ├── BlackLabQueryExecutor.java # BlackLab-backed query executor
│ │ ├── BlackLabSnippetParser.java # Parses BlackLab XML snippets
│ │ ├── CollocateQueryHelper.java # Low-level collocate frequency/example lookup
│ │ └── QueryExecutor.java # Query executor interface
│ ├── utils/
│ │ ├── CqlUtils.java # CQL parsing: splitCqlTokens, escapeForRegex
│ │ ├── LogDiceUtils.java # logDice scoring
│ │ └── MathUtils.java # Math utilities: round2dp
│ └── viz/
│ └── RadialPlot.java # Radial plot data builder
├── webapp/
│ ├── index.html # Web UI (D3.js visualization)
│ └── assets/ # CSS, D3.js
├── src/test/java/ # Unit tests
├── pom.xml # Maven config
└── README.md # This file
The concordance feature efficiently retrieves example sentences containing word pairs using a two-stage approach:
Stage 1: SpanQuery for Fast Document Retrieval
```java
// Build SpanNearQuery: both lemmas within 10 words
SpanTermQuery span1 = new SpanTermQuery(new Term("lemma", "house"));
SpanTermQuery span2 = new SpanTermQuery(new Term("lemma", "big"));
SpanNearQuery nearQuery = SpanNearQuery.newUnorderedNearQuery("lemma")
    .addClause(span1)
    .addClause(span2)
    .setSlop(10) // Max distance: 10 tokens
    .build();
TopDocs results = searcher.search(nearQuery, limit);
```

Stage 2: DocValues Decoding for Token Details

```java
// For each matching document, decode tokens from BinaryDocValues
BinaryDocValues tokensDV = reader.getBinaryDocValues("tokens");
// advanceExact returns false when the document has no value for this field
if (tokensDV != null && tokensDV.advanceExact(docId)) {
    BytesRef tokensBytes = tokensDV.binaryValue();
    // Decode using TokenSequenceCodec
    List<Token> tokens = TokenSequenceCodec.decode(tokensBytes);
    // Each token contains: position, word, lemma, tag, startOffset, endOffset
}
```

Why This Design?
- Compact Storage: Tokens stored as binary (varint encoding) instead of separate fields
  - Typical sentence (~20 tokens): 400-600 bytes vs 1-2KB for separate fields
  - 62M sentence corpus: ~30GB vs ~80GB storage
- Fast Retrieval:
  - SpanQuery uses inverted index with positions → O(log N) lookup
  - DocValues provide O(1) document access (memory-mapped)
  - No need to reconstruct from stored text
- Position Accuracy:
  - Positions preserved from tagging pipeline
  - Support for multi-word tokens and contractions
  - Exact alignment with original text offsets

Binary Encoding Format (TokenSequenceCodec):
[token_count: varint]
For each token:
[position: varint]
[word_length: varint][word: UTF-8]
[lemma_length: varint][lemma: UTF-8]
[tag_length: varint][tag: UTF-8]
[start_offset: varint]
[end_offset: varint]
Varint encoding saves space for common cases (positions < 128 = 1 byte).
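The layout above can be exercised with a small round-trip sketch. It assumes the common scheme of 7 payload bits per byte with the high bit as a continuation flag; the actual `TokenSequenceCodec` may encode varints differently, and all function names here are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, str, str, str, int, int]  # position, word, lemma, tag, start, end


def encode_varint(value: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = more bytes follow."""
    if value < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at offset."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_tokens(tokens: List[Token]) -> bytes:
    out = bytearray(encode_varint(len(tokens)))
    for position, word, lemma, tag, start, end in tokens:
        out += encode_varint(position)
        for text in (word, lemma, tag):
            raw = text.encode("utf-8")
            out += encode_varint(len(raw)) + raw  # length-prefixed UTF-8
        out += encode_varint(start) + encode_varint(end)
    return bytes(out)


def decode_tokens(data: bytes) -> List[Token]:
    count, offset = decode_varint(data)
    tokens = []
    for _token in range(count):
        position, offset = decode_varint(data, offset)
        strings = []
        for _field in range(3):  # word, lemma, tag
            length, offset = decode_varint(data, offset)
            strings.append(data[offset:offset + length].decode("utf-8"))
            offset += length
        start, offset = decode_varint(data, offset)
        end, offset = decode_varint(data, offset)
        tokens.append((position, strings[0], strings[1], strings[2], start, end))
    return tokens


if __name__ == "__main__":
    sample = [(0, "The", "the", "DT", 0, 3), (1, "house", "house", "NN", 4, 9)]
    blob = encode_tokens(sample)
    print(len(blob), decode_tokens(blob) == sample)
```

Length prefixes are counted in bytes rather than characters, so non-ASCII word forms decode correctly.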
Dependency sketches are visual or data-driven representations of how words relate to each other based on syntactic dependencies in the corpus. They help users understand grammatical and semantic relationships beyond simple collocations, leveraging dependency parsing to reveal patterns such as subject, object, modifier, and predicate relations.
Dependency sketches are generated from parsed corpora (e.g., CoNLL-U format) and can be explored via the API and web UI. They provide insights into grammatical structures and are useful for linguistic analysis, semantic field exploration, and advanced querying.
For the noun "theory", a dependency sketch might show its typical subjects, objects, and modifiers, visualized as a graph or listed in ranked tables.
See also: MULTI_SEED_EXPLORATION.md for advanced semantic field features.

correct (logDice: 4.21)
practical (logDice: 3.73)
wrong (logDice: 3.58)
mathematical (logDice: 3.47)
quantum (logDice: 2.89)

### Example 2: Find Words "House" Can Be Object Of
```bash
curl "http://localhost:8080/api/semantic-field/explore?seed=house&relation=object_of&top=10"
```

Result: Find verbs that take "house" as object
locate (logDice: 5.12)
build (logDice: 4.89)
buy (logDice: 4.21)
Discovered nouns (words that share these verbs):
hotel (shared: build, locate)
apartment (shared: build, buy, locate)
property (shared: buy, locate)
```bash
curl "http://localhost:8080/api/semantic-field/explore-multi?seeds=dog,cat,horse&relation=subject_of&top=8"
```

Result: What do dogs, cats, and horses do?
All seeds can: eat, run, live
Dog-specific: bark, beg, fetch
Cat-specific: meow, purr, scratch
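The split between shared and seed-specific collocates in the result above comes down to set arithmetic. The sketch below reproduces it on the example data; `split_shared_and_specific` is a hypothetical helper, and because the example lists no horse-specific verbs, the horse entry contains only the shared ones.

```python
from typing import Dict, List, Set, Tuple


def split_shared_and_specific(
    collocates: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (collocates shared by all seeds, collocates unique to each seed)."""
    if not collocates:
        return [], {}
    shared = set.intersection(*collocates.values())
    specific = {
        seed: sorted(words - shared) for seed, words in collocates.items()
    }
    return sorted(shared), specific


# Verbs taken from the example result above.
EXAMPLE = {
    "dog": {"eat", "run", "live", "bark", "beg", "fetch"},
    "cat": {"eat", "run", "live", "meow", "purr", "scratch"},
    "horse": {"eat", "run", "live"},
}

if __name__ == "__main__":
    shared, specific = split_shared_and_specific(EXAMPLE)
    print("All seeds:", shared)
    for seed, words in specific.items():
        print(f"{seed}-specific:", words)
```

This simple version treats a collocate as shared only when every seed has it; a collocate common to some but not all seeds would be reported under each seed that has it.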
```bash
mvn test
```

See the plans/ directory for:
- `concept-sketch-spec.md` - Overall technical specification
- `precomputed-collocations-spec.md` - Precomputed algorithm details
- `hybrid-index-spec.md` - Hybrid index architecture

Tests cover:
- CQL parsing (50+ patterns)
- Lucene query compilation
- logDice calculation
- API endpoints
- Multi-seed exploration


