Part 2: The Development Journey - From 6TB to 7.5GB
Making the ambitious vision of GDS practically feasible required substantial engineering effort. The development journey from theoretical concept to operational research prototype involved systematic optimization, documented in the project’s OPTIMIZATION_ROADMAP.md.
The Initial Challenge: The 5.95 TB Problem
The initial lexicon builder implementation faced a critical storage challenge. Using standard 64-bit float vectors (`f64`) for 20,000-dimensional HDC representations resulted in a projected storage requirement of 5.95 TB for the full lexicon. This was computationally and financially infeasible on consumer hardware.
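For scale, the per-vector footprint of that representation can be computed directly. The following is a quick back-of-the-envelope sketch, not code from the project:

```rust
// Back-of-the-envelope cost of the original representation: one
// 20,000-dimensional f64 hypervector per lexicon entry (other columns ignored).
const DIMS: usize = 20_000;
const BYTES_PER_F64: usize = std::mem::size_of::<f64>(); // 8

fn main() {
    let bytes_per_vector = DIMS * BYTES_PER_F64;
    println!("f64 hypervector: {} bytes (~160 KB) per entry", bytes_per_vector);
    // Multiplied across the full lexicon, this per-entry cost is what pushed
    // the storage projection into the terabyte range.
}
```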
This constraint necessitated a fundamental re-evaluation of storage and data representation strategy, initiating a multi-stage optimization effort.
Stage 1: Binary HDC Implementation
The first major optimization aligned the implementation with HDC theoretical foundations. Canonical HDC research (Kanerva, Plate, et al.) uses binary or bipolar vectors rather than floating-point representations.
- The Change: The pipeline was refactored to convert 300D Numberbatch embeddings into 20,000-bit binary vectors, packed densely into `u64` words (a sketch of this representation follows the list below).
- The Impact: This change reduced the storage projection from 5.95 TB to 121 GB. Additionally, binary HDC operations (e.g., Hamming distance) proved computationally cheaper than their floating-point counterparts, accelerating the build process by over 50%.
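A minimal sketch of this binary representation, assuming the 300D Numberbatch embedding has already been projected up to 20,000 dimensions upstream (type and method names are illustrative, not the project's actual API):

```rust
/// Number of dimensions in the hypervector space.
const DIMS: usize = 20_000;
/// 20,000 bits packed into u64 words: 313 words, ~2.5 KB per vector
/// (versus ~160 KB for the f64 version).
const WORDS: usize = (DIMS + 63) / 64;

#[derive(Clone)]
struct BinaryHypervector {
    words: [u64; WORDS],
}

impl BinaryHypervector {
    /// Sign-threshold a dense embedding that has already been projected
    /// from 300D Numberbatch space up to DIMS dimensions.
    fn from_projected(projected: &[f64]) -> Self {
        assert_eq!(projected.len(), DIMS);
        let mut words = [0u64; WORDS];
        for (i, &v) in projected.iter().enumerate() {
            if v > 0.0 {
                words[i / 64] |= 1 << (i % 64);
            }
        }
        Self { words }
    }

    /// Hamming distance: XOR word pairs and count the differing bits.
    fn hamming(&self, other: &Self) -> u32 {
        self.words
            .iter()
            .zip(&other.words)
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}
```

Each packed vector occupies roughly 2.5 KB instead of ~160 KB, which accounts for the bulk of the drop from 5.95 TB to 121 GB; the XOR-plus-popcount Hamming distance is the cheap bitwise operation behind the faster build times.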
Stage 2: Columnar Storage with Parquet
While 121 GB was feasible, further optimization was pursued. The second stage transitioned the lexicon output from a binary/JSONL format to Apache Parquet, an industry-standard columnar storage format.
- The Change: The `ParticleWriter` was rewritten using the `arrow-rs` library, implementing a proper columnar schema (a hedged sketch of such a writer follows this list). This provided several advantages:
  - Columnar Storage: Efficient for analytical queries and selective column access.
  - ZSTD Compression: Modern high-performance compression applied to all columns.
  - Native Dictionary Encoding: Automatic dictionary encoding for high-cardinality string columns (`lemma`, `concept_id`), further reducing file size.
- The Impact: This optimization yielded an additional ~3.5x compression ratio. The final storage requirement for a 3-million-particle lexicon reached 7.5 GB.
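A sketch of what such a writer setup can look like with the `arrow` and `parquet` crates. The schema, column names, file path, and example rows are illustrative rather than the actual `ParticleWriter` schema, and ZSTD support assumes the `parquet` crate's `zstd` feature is enabled:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative columnar schema: string columns for lemma/concept_id and a
    // binary column holding the packed 20,000-bit hypervector (~2.5 KB per row).
    let schema = Arc::new(Schema::new(vec![
        Field::new("lemma", DataType::Utf8, false),
        Field::new("concept_id", DataType::Utf8, false),
        Field::new("hypervector", DataType::Binary, false),
    ]));

    // ZSTD compression on all columns plus dictionary encoding, as described above.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .set_dictionary_enabled(true)
        .build();

    let file = File::create("lexicon.parquet")?; // hypothetical output path
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Two toy rows; a real builder would stream millions of particles in batches.
    let hv_a = vec![0x00u8; 2_500];
    let hv_b = vec![0xFFu8; 2_500];
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["run", "walk"])) as ArrayRef,
            Arc::new(StringArray::from(vec!["/c/en/run", "/c/en/walk"])) as ArrayRef,
            Arc::new(BinaryArray::from_vec(vec![hv_a.as_slice(), hv_b.as_slice()])) as ArrayRef,
        ],
    )?;

    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Compression and dictionary encoding are configured once in `WriterProperties` and applied by the writer, so the builder only has to assemble `RecordBatch`es and stream them out.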
Summary
Through this systematic, two-stage optimization process, an ~800x reduction in storage requirements was achieved, transforming an infeasible design into an operational research prototype. This development phase demonstrated that for exploratory research at this scale, aggressive optimization is essential rather than optional.