Fredilly/article6-methodologies

Article6 Methodologies (data-first, audit-ready)

Canonical store of methodologies: META + sections + rules (+ tools, overrides, tests, core). For a working example of the file layout and content, see docs/examples/TEMPLATE_METHOD. See RULESET.md for conventions and CI guardrails.

Repository layout

Sector naming note: The internal slug Forestry in paths such as methodologies/UNFCCC/Forestry refers to UNFCCC sector 14, “Afforestation and reforestation”. We keep the slug stable for hashes and reproducibility; human-facing docs and models should use the full sector label.

Five Things mapping

  1. Data-first methodologies
  2. Audit-ready hashes
  3. Open references
  4. Reproducible scripts
  5. CI guardrails

Hashing policy

  • sections.json -> META.audit_hashes.sections_json_sha256
  • rules.json -> META.audit_hashes.rules_json_sha256
  • tools//**/* -> META.references.tools[]
  • scripts/** and core/** -> scripts_manifest.json
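The first mapping above can be illustrated with a minimal sketch. It assumes META.json parses to an object with an audit_hashes.sections_json_sha256 field holding a hex digest of the raw sections.json bytes; the helper names sha256Hex and sectionsHashMatches are hypothetical and are not taken from the repository's scripts.

```javascript
const crypto = require("node:crypto");

// Hex-encoded SHA-256 of a string or Buffer.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

// Compare the digest of the raw sections.json bytes with the value in META.
// Assumption: the recorded digest is lowercase hex over the exact file bytes.
function sectionsHashMatches(meta, sectionsBytes) {
  const recorded = meta.audit_hashes.sections_json_sha256;
  return recorded === sha256Hex(sectionsBytes);
}

// In-memory demo; in a checkout the bytes would come from fs.readFileSync.
const sectionsBytes = Buffer.from('{"sections": []}\n', "utf8");
const meta = {
  audit_hashes: { sections_json_sha256: sha256Hex(sectionsBytes) },
};

console.log(sectionsHashMatches(meta, sectionsBytes)); // true
console.log(sectionsHashMatches(meta, Buffer.from("edited"))); // false
```

Hashing the raw bytes rather than the parsed JSON means a whitespace-only edit also changes the digest, which is why digests are refreshed after every content change.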

Workflow

  1. Edit methodology content or scripts.
  2. Run ./scripts/hash-all.sh to refresh digests.
  3. Commit the changes.
  4. CI validates JSON, schemas, and registry consistency.

Batch files (entrypoints-first)

  • In batches/*.links.txt, the first N links must be UNFCCC “DB/.../view.html” entrypoints (N = number of codes).
  • Any remaining links are treated as optional extra assets (tools PDFs, glossary, FileStorage attachments).
  • Run npm run validate:batches (and scripts/ingest-full.sh runs it automatically).
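The entrypoints-first rule can be sketched as a small check. This is an illustration only, not the logic behind npm run validate:batches: the function name checkBatchLinks, the regular expression used to recognise a "DB/.../view.html" link, and the example.org URLs are all assumptions.

```javascript
// Assumed shape of an entrypoint: a URL containing "/DB/" and ending in "/view.html".
const ENTRYPOINT_RE = /\/DB\/.+\/view\.html$/;

// Check that the first `codeCount` links of a *.links.txt file are entrypoints.
// Everything after them is returned as optional extra assets.
function checkBatchLinks(text, codeCount) {
  const links = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "");
  const errors = [];
  if (links.length < codeCount) {
    errors.push(`expected at least ${codeCount} links, found ${links.length}`);
  }
  links.slice(0, codeCount).forEach((link, i) => {
    if (!ENTRYPOINT_RE.test(link)) {
      errors.push(`link ${i + 1} is not a view.html entrypoint: ${link}`);
    }
  });
  return { ok: errors.length === 0, errors, extras: links.slice(codeCount) };
}

// Hypothetical batch with two codes and one extra asset.
const batch = [
  "https://example.org/methodologies/DB/AAA111/view.html",
  "https://example.org/methodologies/DB/BBB222/view.html",
  "https://example.org/FileStorage/tool.pdf",
].join("\n");

const result = checkBatchLinks(batch, 2);
console.log(result.ok, result.extras.length); // true 1
```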

Development Setup

  • The recommended environment is the Linux devcontainer defined in .devcontainer/.
  • Use VS Code’s “Reopen in Container” (or GitHub Codespaces) to get Node 20, poppler-utils, qpdf, jq, yq, git-lfs, and build-essential preinstalled.
  • See docs/devcontainer.md for full instructions. Native macOS/Windows setups are optional and not required for ingest work.

Definition of Done

  • ./scripts/hash-all.sh — updates META.audit_hashes, META.automation, and scripts_manifest.json.
  • npm run validate:rich — ensures every rich JSON conforms to the schemas before deriving lean files.
  • npm run validate:lean — validates all lean META.json, sections.json, and rules.json artifacts.
  • npm run validate:guardrails — enforces forestry rule coverage, sequential IDs, and tool provenance consistency.
  • ./scripts/check-registry.sh — confirms registry.json mirrors the methodologies tree.

All five commands must complete without diffs or errors before opening a pull request. Capture any new evidence files under outputs/mvp/ and include screenshots or logs referenced in the change summary.

Baselines & CLI (Offline, Deterministic)

  • Section retrieval (BM25):

    • Build dataset: npm run dataset:sections
    • Evaluate: npm run eval:sections:bm25
    • Example metrics (current corpus): acc@1≈0.6250, mrr@5≈0.7333
  • Parameter/units extraction (TF‑IDF/Linear):

    • Build dataset: npm run dataset:params
    • Evaluate: npm run eval:params:linear
    • Example metrics: variables micro‑F1≈0.6364, units micro‑F1≈0.9091
  • Rule-type labels (manual curation):

    • Primary annotations: datasets/rule_type/rules.csv (method_tag, anchor, text, label)

    • Rule metadata: datasets/rule_type/rules_meta.csv (rule_id, rule_type, notes)

    • Allowed categories listed in datasets/rule_type/labels.yaml

    • Update flow: edit CSVs → ./scripts/hash-all.sh → run validators.

  • CLI retrieval wrapper:

    • Installable bin: mrv-cli
    • Usage: npm run cli:query -- "<query text>" --k 5
    • Prints top‑K rules with sections, summary, and refs (tool kind/path/sha256 lifted from META).
  • HTTP engine adapter:

    • Installable bin: http-engine-adapter
    • Start locally: npm run server:http -- --port 3030
    • POST http://<host>:<port>/query with { "query": "forest leakage" } (optional top_k ≤ 50).
    • Replies deterministically with BM25-ranked rules across AR-AMS0003 and AR-AMS0007 plus audit hashes (rules/sections/tool refs).
    • Metrics logging: every request prints requests=<n> p95_ms=<latency>; set ENGINE_METRICS_LOG=/path/to/file.log to append to disk.
  • Serverless endpoint (Vercel-style /api/query):

    • GET /api/query?text=forest+leakage[&top_k=5] for ad-hoc checks.
    • POST /api/query with { "query": "forest leakage", "top_k": 5 } for structured calls.
    • Delegates to the same deterministic BM25 engine to keep outputs aligned with the CLI/HTTP adapter.
  • Health check:

    • GET /api/healthz returns { "status": "ok", "documents": 26 } (the document count depends on corpus size).
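As a sketch of calling the HTTP engine adapter described above, the helper below builds the POST /query request and enforces the documented top_k ≤ 50 limit on the client side. The name buildQueryRequest, the lower bound of 1 for top_k, and the content-type header are assumptions; the response shape is not shown because it is not specified here.

```javascript
// Build a POST /query request for the HTTP engine adapter.
// Assumptions: top_k must be an integer >= 1, and the server expects a JSON body.
function buildQueryRequest(baseUrl, query, topK) {
  if (typeof query !== "string" || query.trim() === "") {
    throw new TypeError("query must be a non-empty string");
  }
  const body = { query };
  if (topK !== undefined) {
    if (!Number.isInteger(topK) || topK < 1 || topK > 50) {
      throw new RangeError("top_k must be an integer between 1 and 50");
    }
    body.top_k = topK;
  }
  return {
    url: new URL("/query", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

const req = buildQueryRequest("http://localhost:3030", "forest leakage", 5);
console.log(req.url); // http://localhost:3030/query
console.log(req.init.body); // {"query":"forest leakage","top_k":5}

// With the adapter running locally (npm run server:http -- --port 3030),
// the request could be sent using Node 20's built-in fetch:
//   fetch(req.url, req.init).then((res) => res.json()).then(console.log);
```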

Determinism

  • Fixed BM25 params and TF‑IDF/Linear hyperparameters; stable ordering and splits.
  • Dataset files recorded in datasets_manifest.json with SHA‑256.

See also: docs/baselines-cli.md for quickstart commands, expected metrics, and how to verify determinism locally.
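For readers unfamiliar with the retrieval metrics quoted above, here is a minimal sketch of the standard definitions of acc@1 and MRR@k. The example data and the function names accuracyAt1 and mrrAtK are hypothetical; the repository's evaluation scripts may compute these differently.

```javascript
// Each example pairs a gold section ID with the ranked IDs returned for a query.

// acc@1: fraction of queries whose top-ranked result is the gold item.
function accuracyAt1(examples) {
  const hits = examples.filter((e) => e.ranked[0] === e.gold).length;
  return hits / examples.length;
}

// MRR@k: mean of 1/rank of the gold item, counting 0 when it is not in the top k.
function mrrAtK(examples, k) {
  let total = 0;
  for (const e of examples) {
    const idx = e.ranked.slice(0, k).indexOf(e.gold);
    if (idx !== -1) total += 1 / (idx + 1);
  }
  return total / examples.length;
}

// Hypothetical rankings for four queries.
const examples = [
  { gold: "S-1", ranked: ["S-1", "S-2", "S-3"] }, // rank 1
  { gold: "S-2", ranked: ["S-4", "S-2", "S-1"] }, // rank 2
  { gold: "S-3", ranked: ["S-3", "S-1", "S-2"] }, // rank 1
  { gold: "S-9", ranked: ["S-1", "S-2", "S-3"] }, // not retrieved
];

console.log(accuracyAt1(examples)); // 0.5
console.log(mrrAtK(examples, 5)); // 0.625
```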

Meta-driven source hash check

Use node scripts/check-source-hash.js to verify that all META.references.tools[*] entries exist and match their recorded SHA-256. This avoids assumptions about folder layout and treats META as the source of truth for tool paths.
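The idea of treating META as the source of truth can be sketched as follows. This is not the code of scripts/check-source-hash.js; it assumes each META.references.tools entry carries path and sha256 fields, and the name verifyToolRefs and the demo files are hypothetical.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Return a list of problems for tool references that are missing or altered.
// Paths are resolved relative to rootDir; no folder layout is assumed beyond that.
function verifyToolRefs(meta, rootDir) {
  const problems = [];
  for (const tool of meta.references.tools) {
    const abs = path.join(rootDir, tool.path);
    if (!fs.existsSync(abs)) {
      problems.push(`missing: ${tool.path}`);
      continue;
    }
    const actual = crypto
      .createHash("sha256")
      .update(fs.readFileSync(abs))
      .digest("hex");
    if (actual !== tool.sha256) problems.push(`sha256 mismatch: ${tool.path}`);
  }
  return problems;
}

// Demo in a temporary directory with one intact, one missing, one altered file.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "tools-demo-"));
fs.mkdirSync(path.join(root, "tools"));
fs.writeFileSync(path.join(root, "tools", "ok.pdf"), "original");
fs.writeFileSync(path.join(root, "tools", "changed.pdf"), "edited");
const digest = crypto.createHash("sha256").update("original").digest("hex");

const demoMeta = {
  references: {
    tools: [
      { path: "tools/ok.pdf", sha256: digest },
      { path: "tools/absent.pdf", sha256: digest },
      { path: "tools/changed.pdf", sha256: digest },
    ],
  },
};

console.log(verifyToolRefs(demoMeta, root));
// [ 'missing: tools/absent.pdf', 'sha256 mismatch: tools/changed.pdf' ]
fs.rmSync(root, { recursive: true, force: true });
```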

Conventions

  • JSON UTF-8, LF, 2 spaces.
  • Do not delete evidence; supersede only.
  • registry.json mirrors /methodologies.
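The JSON formatting convention (UTF-8, LF, 2 spaces) can be checked with a strict round-trip, sketched below. The trailing newline and the function names toCanonicalJson and isCanonicalJson are assumptions, and the round-trip is deliberately strict: it also rejects files whose numbers or string escapes are written differently from what JSON.stringify emits.

```javascript
// Serialize with 2-space indentation and a single trailing LF (assumed convention).
function toCanonicalJson(value) {
  return JSON.stringify(value, null, 2) + "\n";
}

// True when the text has no CR characters and already matches the canonical form.
function isCanonicalJson(text) {
  if (text.includes("\r")) return false;
  try {
    return toCanonicalJson(JSON.parse(text)) === text;
  } catch {
    return false; // not valid JSON at all
  }
}

console.log(isCanonicalJson('{\n  "id": "R-1"\n}\n')); // true
console.log(isCanonicalJson('{"id":"R-1"}')); // false
console.log(isCanonicalJson('{\r\n  "id": "R-1"\r\n}\r\n')); // false
```

Writing files through a single serializer like this keeps byte-level output stable, which matters because the recorded digests are computed over file bytes.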

Stable Tree v1

This structure is normative. Changes require a "Stable Tree vX" section and CI update.
