Part 5: Data & Lexicon Construction (CSI-HDC)

The foundation of the GDS system is the Semantic Base, a large-scale lexicon of “semantic particles” produced by the CSI-HDC (Conceptual State Injector using Hyperdimensional Computing) pipeline. The pipeline synthesizes information from multiple data sources into concept representations with physics-inspired properties: mass, spin, and charge.
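
To fix intuitions before the details, the sketch below shows one plausible shape for such a particle record. The field names and types are illustrative assumptions, not the actual GDS definitions; only the three physics-inspired properties and the alias/relation lists are taken from this section.

```rust
/// Illustrative sketch of a semantic particle produced by the CSI-HDC
/// pipeline. Field names and types are assumptions, not the real structs.
pub struct SemanticParticle {
    /// Normalized lemma, e.g. "coffee".
    pub concept: String,
    /// Synonyms discovered during OEWM normalization.
    pub aliases: Vec<String>,
    /// Mass m0: derived from graph connectivity and frequency priors.
    pub mass: f32,
    /// Spin s: affective (Valence, Arousal, Dominance) triple.
    pub spin: [f32; 3],
    /// Charge q: 20,000-bit binary hypervector, packed into 313 u64 words.
    pub charge: Vec<u64>,
    /// Explicit relations kept after filtering: (relation, target concept).
    pub relations: Vec<(String, String)>,
}
```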

Primary Data Sources

| Data Source | Location in Project | Role & Contribution |
| --- | --- | --- |
| ConceptNet | data/raw/assertions.csv | Provides the primary structural graph of common-sense relationships (e.g., UsedFor, CapableOf, PartOf). It forms the backbone of explicit knowledge. |
| Numberbatch | data/raw/numberbatch.txt | Pre-trained 300-dimensional word embeddings. The primary source for generating 20,000-bit binary HDC vectors via the Julia HDC server, and a fallback for affective score calculation. |
| NRC-VAD Lexicon | data/raw/nrc_vad/ | Provides affective scores for English words across three dimensions: Valence (pleasure/displeasure), Arousal (intensity), and Dominance (control). This is the source for the spin property of English particles. |
| German Norms | data/raw/german_norms/ | The German equivalent of the NRC-VAD lexicon, providing affective scores for German words. |
| OEWM Lexicons | data/oewm_lexicons/ | Open English, German, and Romanian WordNet data. A crucial source for normalization, synonymy (aliases), and word-frequency priors; it significantly boosts the quality of mass calculation and the coverage of other lookups. |
| BabelNet Cache | data/enrichment/babelnet_cache.db | A local SQLite database caching BabelNet API results. Used in enrichment iterations to add high-quality multilingual relations, expanding the semantic graph over time. |
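
For orientation, the sketch below gathers the locations from the table into a single configuration value. The struct and constructor are hypothetical and simply mirror the paths listed above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical grouping of the raw-data locations listed in the table.
/// Field and type names are illustrative, not the real builder config.
pub struct DataSources {
    pub conceptnet_assertions: PathBuf,  // data/raw/assertions.csv
    pub numberbatch_embeddings: PathBuf, // data/raw/numberbatch.txt
    pub nrc_vad_dir: PathBuf,            // data/raw/nrc_vad/
    pub german_norms_dir: PathBuf,       // data/raw/german_norms/
    pub oewm_lexicons_dir: PathBuf,      // data/oewm_lexicons/
    pub babelnet_cache_db: PathBuf,      // data/enrichment/babelnet_cache.db
}

impl DataSources {
    /// Resolve all sources relative to a project root.
    pub fn from_root(root: &Path) -> Self {
        Self {
            conceptnet_assertions: root.join("data/raw/assertions.csv"),
            numberbatch_embeddings: root.join("data/raw/numberbatch.txt"),
            nrc_vad_dir: root.join("data/raw/nrc_vad"),
            german_norms_dir: root.join("data/raw/german_norms"),
            oewm_lexicons_dir: root.join("data/oewm_lexicons"),
            babelnet_cache_db: root.join("data/enrichment/babelnet_cache.db"),
        }
    }
}
```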

The Generation Pipeline (LexiconBuilder)

The process is orchestrated by the LexiconBuilder in the Rust codebase and follows several key stages:

LexiconBuilder Pipeline Flow
  1. Aggregation: Raw assertions from ConceptNet are streamed and aggregated into a per-concept map, building a preliminary list of relations.
  2. Normalization & Enrichment: Lemmas are normalized using OEWM. This step also discovers aliases (synonyms) that will be used in later stages.
  3. Quality Scoring: Each candidate concept is scored using a set of heuristics: its connectivity in the graph, whether it has a Numberbatch embedding, and its coverage in the affective lexicons (a hedged scoring sketch follows this list).
  4. Filtering: Concepts that fall below a minimum quality threshold (e.g., min_relations) are discarded.
  5. Property Calculation: For each high-quality concept (illustrative sketches of these calculations also follow the list):
    • Mass (m0) is calculated from graph connectivity and boosted by frequency priors from OEWM.
    • Spin (s) is derived from the affective lexicons (NRC-VAD for English, German Norms for German).
    • Charge (q) is generated by the Julia HDC server, which expands the 300-dimensional Numberbatch embedding into a 20,000-bit binary hypervector; this is the CSI-HDC tokenization step that replaces traditional token embeddings.
  6. Export: The final collection of SemanticParticle objects is written to a ZSTD-compressed Parquet file, forming the Semantic Base for the GDS runtime.
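
The following sketch illustrates the scoring and filtering heuristics of steps 3 and 4. The weights, the logarithmic form, and the exact combination are assumptions; the section only states that graph connectivity, embedding availability, and affective coverage contribute to the score.

```rust
/// Illustrative quality heuristics; the real LexiconBuilder weighting is not
/// specified here, so the numeric constants below are assumptions.
struct ConceptStats {
    relation_count: usize,      // edges aggregated from ConceptNet
    has_numberbatch: bool,      // a 300-D embedding exists for this lemma
    has_affective_scores: bool, // covered by NRC-VAD / German Norms
}

fn quality_score(c: &ConceptStats) -> f64 {
    // Connectivity dominates; embedding and affective coverage add fixed boosts.
    let connectivity = (1.0 + c.relation_count as f64).ln();
    let embedding_bonus = if c.has_numberbatch { 1.0 } else { 0.0 };
    let affect_bonus = if c.has_affective_scores { 0.5 } else { 0.0 };
    connectivity + embedding_bonus + affect_bonus
}

/// Step 4: drop concepts that miss minimal structural requirements.
fn passes_filter(c: &ConceptStats, min_relations: usize, min_score: f64) -> bool {
    c.relation_count >= min_relations && quality_score(c) >= min_score
}
```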
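
Next, a minimal sketch of the mass and spin calculations in step 5. The log-connectivity form, the multiplicative frequency boost, and the raw VAD triple as spin are assumptions; the section only states which inputs feed each property.

```rust
use std::collections::HashMap;

/// Mass m0 from graph connectivity, boosted by an OEWM word-frequency prior.
/// The exact functional form is an assumption.
fn mass(relation_count: usize, frequency_prior: Option<f64>) -> f64 {
    let base = (1.0 + relation_count as f64).ln();
    // More frequent words get a multiplicative boost; unknown frequency = no boost.
    base * frequency_prior.map_or(1.0, |f| 1.0 + f)
}

/// Spin s as the (Valence, Arousal, Dominance) triple looked up in the
/// language-appropriate norms (NRC-VAD for English, German Norms for German).
/// Keeping the raw triple as the spin value is an assumption.
fn spin(
    lemma: &str,
    english_vad: &HashMap<String, [f32; 3]>,
    german_vad: &HashMap<String, [f32; 3]>,
    is_german: bool,
) -> Option<[f32; 3]> {
    let table = if is_german { german_vad } else { english_vad };
    // The section notes that Numberbatch serves as a fallback for affective
    // scores when a lemma is missing; that path is omitted from this sketch.
    table.get(lemma).copied()
}
```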
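
Finally, the charge step: the section says the Julia HDC server expands each 300-D Numberbatch embedding into a 20,000-bit binary hypervector, but does not specify how. One common HDC encoding is random projection with sign thresholding; the sketch below shows that technique as an illustration, not the server's actual algorithm, and uses a tiny deterministic PRNG so it needs no external crates.

```rust
const HDC_BITS: usize = 20_000;

/// SplitMix64: small deterministic PRNG used to derive fixed projection weights.
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = x;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Pseudo-random projection weight in [-1, 1) for (output bit, input dimension).
fn weight(bit: usize, dim: usize) -> f32 {
    let h = splitmix64((bit as u64) << 32 | dim as u64);
    (h >> 40) as f32 / (1u64 << 23) as f32 - 1.0
}

/// Charge q: one bit per random hyperplane, packed into u64 words.
fn charge(embedding: &[f32; 300]) -> Vec<u64> {
    let mut packed = vec![0u64; (HDC_BITS + 63) / 64];
    for bit in 0..HDC_BITS {
        // Sign of the dot product with a fixed pseudo-random direction.
        let dot: f32 = embedding
            .iter()
            .enumerate()
            .map(|(dim, &v)| v * weight(bit, dim))
            .sum();
        if dot >= 0.0 {
            packed[bit / 64] |= 1u64 << (bit % 64);
        }
    }
    packed
}
```

Because the projection weights are derived deterministically, the same lemma always maps to the same hypervector, and angular similarity between embeddings is roughly preserved as Hamming similarity between charges, which is the property HDC downstream operations rely on.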
"