Data Methodology
How we collect, validate, and maintain the most comprehensive dataset of domestic manufacturers available for federal procurement.
Data Sources
- SAM.gov Entity Management API: UEI, CAGE codes, registration status, NAICS classifications for all registered government contractors
- State manufacturing directories: Official economic development agency databases across all 50 states
- Public company filings and registrations: Secretary of State records, business licenses
- Industry association membership directories: NTMA, PMA, AMT, SME, and sector-specific organizations
- Federal procurement history: Historical contract awards from FPDS-NG linked to supplier profiles
Collection and Update Frequency
- SAM.gov entity sync: daily automated pull via SAM Entity Management API
- Web enrichment pipeline: weekly crawl and NLP extraction across manufacturer websites
- Full dataset re-index: monthly comprehensive refresh with embedding regeneration
- Real-time corrections: user-reported data issues triaged within 24 hours
- Capability taxonomy updates: quarterly review aligned with NAICS revision cycles
Validation Methodology
- Multi-source cross-referencing: capabilities verified across two or more independent sources
- NAICS code validation: automated classification checked against stated capabilities and the SIC crosswalk
- Geographic verification: address data validated against USPS databases and geocoded for spatial queries
- Certification validation: compliance claims cross-referenced with accreditation body databases where available
- Confidence scoring: each data point assigned a confidence level based on source count and recency
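
To illustrate how a confidence level could be derived from source count and recency, here is a minimal sketch. The formula is hypothetical: the cap at three sources, the 180-day half-life, and the name confidence_score are assumptions made for illustration, not the production scoring rule.

```python
from datetime import date


def confidence_score(source_count: int, last_verified: date, today: date,
                     half_life_days: float = 180.0) -> float:
    """Return an illustrative score in [0, 1] from source count and recency."""
    if source_count <= 0:
        return 0.0
    # Source factor: reaches 1.0 once three independent sources agree (assumed cap).
    source_factor = min(source_count, 3) / 3
    # Recency factor: halves every `half_life_days` since last verification (assumed decay).
    age_days = max((today - last_verified).days, 0)
    recency_factor = 0.5 ** (age_days / half_life_days)
    return round(source_factor * recency_factor, 3)
```

Under these assumptions, a data point confirmed by three sources today scores 1.0, and the same data point left unverified for 180 days scores 0.5.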
AI and Search Methodology
- Hybrid search: 70% semantic vector similarity (1536-dimensional embeddings) combined with 30% full-text relevance scoring
- Fallback chain: queries run through vector search, then full-text search, then name match, then capability-based search, so that every query returns results
- NAICS auto-classification: NLP pipeline assigns NAICS codes from unstructured company descriptions with hierarchical matching
- Capability extraction: pattern-based NLP identifies 50+ standardized manufacturing capability categories from free text
- Natural language queries: user intent parsed and translated to structured filters across capabilities, certifications, materials, and geography
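
The 70/30 weighting and the fallback order described above can be sketched as follows. This is a simplified illustration, not the production search code: it assumes both input scores are already normalized to the range 0 to 1, and the function names and the stage callables are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List

VECTOR_WEIGHT = 0.7  # semantic vector similarity share, as stated in the methodology
TEXT_WEIGHT = 0.3    # full-text relevance share, as stated in the methodology


def hybrid_score(vector_similarity: float, text_relevance: float) -> float:
    """Blend two scores, each assumed to be normalized to [0, 1]."""
    return VECTOR_WEIGHT * vector_similarity + TEXT_WEIGHT * text_relevance


def search_with_fallback(query: str,
                         stages: Iterable[Callable[[str], List[str]]]) -> List[str]:
    """Run the stages in order and return the first non-empty result list."""
    for stage in stages:
        results = stage(query)
        if results:
            return results
    # In this sketch only: an empty list if every stage comes up empty.
    return []
```

A caller would pass the four stages in the documented order (vector, full-text, name match, capability-based), each as a function that takes the query string and returns a list of matches.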
Data Governance
- Source attribution: every data point traceable to its original source with timestamp
- Audit trail: all search queries and results logged to support Inspector General oversight
- Evidence bundles: exportable packages containing search parameters, results, source documentation, and methodology for waiver support
- Data provenance: full lineage from raw source through enrichment pipeline to final indexed record
- Version history: changes to manufacturer profiles tracked with before/after snapshots
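
One way to picture source attribution and an exportable evidence bundle is the sketch below. The class names, field names, and JSON layout are assumptions chosen for illustration; the actual export format is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class SourcedValue:
    """A single data point with its source and retrieval timestamp (hypothetical shape)."""
    field_name: str
    value: str
    source: str
    retrieved_at: str  # ISO 8601 timestamp


@dataclass
class EvidenceBundle:
    """Search parameters, results, source documentation, and a methodology note."""
    search_parameters: dict
    results: List[str]
    sources: List[SourcedValue]
    methodology_note: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, so each SourcedValue becomes a dict.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Calling to_json() on a bundle yields a single document that carries the query, the matches, and the source and timestamp behind each supporting data point.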
Scalability and Sustainability
- Current dataset: 318,000+ manufacturers indexed with full capability profiles
- Infrastructure: PostgreSQL with pgvector extension on managed cloud infrastructure, auto-scaling to handle agency-wide query volumes
- API capacity: sub-second response times at a sustained load of 1,000+ queries per minute
- Modular architecture: data ingestion, enrichment, search, and reporting run as independent services that scale separately
- Continuous improvement: ML feedback loop improves search relevance and data quality over time based on usage patterns