Made in America Solutions

Data Methodology

How we collect, validate, and maintain the most comprehensive dataset of domestic manufacturers available for federal procurement.

Data Sources

  • SAM.gov Entity Management API: UEI, CAGE codes, registration status, NAICS classifications for all registered government contractors
  • State manufacturing directories: Official economic development agency databases across all 50 states
  • Public company filings and registrations: Secretary of State records, business licenses
  • Industry association membership directories: NTMA, PMA, AMT, SME, and sector-specific organizations
  • Federal procurement history: Historical contract awards from FPDS-NG linked to supplier profiles

Collection and Update Frequency

  • SAM.gov entity sync: daily automated pull via SAM Entity Management API
  • Web enrichment pipeline: weekly crawl and NLP extraction across manufacturer websites
  • Full dataset re-index: monthly comprehensive refresh with embedding regeneration
  • Real-time corrections: user-reported data issues triaged within 24 hours
  • Capability taxonomy updates: quarterly review aligned with NAICS revision cycles

Validation Methodology

  • Multi-source cross-referencing: capabilities verified across two or more independent sources
  • NAICS code validation: automated classification verified against stated capabilities and SIC crosswalk
  • Geographic verification: address data validated against USPS databases and geocoded for spatial queries
  • Certification validation: compliance claims cross-referenced with accreditation body databases where available
  • Confidence scoring: each data point assigned a confidence level based on source count and recency
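
The confidence scoring bullet above names two inputs, source count and recency, but not a formula. The following is a minimal sketch of one way to combine them, not the production scoring function. The function name, the 180-day half-life, and the three-source saturation point are all assumptions chosen for illustration.

```python
from datetime import date


def confidence_score(
    source_count: int,
    last_verified: date,
    today: date,
    half_life_days: float = 180.0,  # assumed decay rate, not from the source
    saturation: int = 3,            # assumed: 3+ sources earn full credit
) -> float:
    """Return a score in [0, 1] based on source count and recency."""
    if source_count <= 0:
        return 0.0
    # Source term: grows with the number of independent sources, capped at 1.
    source_term = min(source_count, saturation) / saturation
    # Recency term: halves every `half_life_days` since last verification.
    age_days = max((today - last_verified).days, 0)
    recency_term = 0.5 ** (age_days / half_life_days)
    return source_term * recency_term


# A data point confirmed by three sources today gets full confidence;
# the same data point verified 180 days ago gets half.
fresh = confidence_score(3, date(2024, 1, 1), date(2024, 1, 1))
stale = confidence_score(3, date(2024, 1, 1), date(2024, 6, 29))
```

Multiplying the two terms means a data point needs both corroboration and freshness to score highly. A weighted sum would be an equally reasonable choice if one factor should be able to compensate for the other.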

AI and Search Methodology

  • Hybrid search: 70% semantic vector similarity (1536-dimensional embeddings) combined with 30% full-text relevance scoring
  • Fallback chain: vector search, then full-text search, then name match, then capability-based search, so that every query returns results
  • NAICS auto-classification: NLP pipeline assigns NAICS codes from unstructured company descriptions with hierarchical matching
  • Capability extraction: pattern-based NLP identifies 50+ standardized manufacturing capability categories from free text
  • Natural language queries: user intent parsed and translated to structured filters across capabilities, certifications, materials, and geography
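
The hybrid weighting and the fallback chain described above can be sketched as follows. This is an illustration under stated assumptions, not the deployed search code. The 70/30 weights come from the text. The function names, the assumption that both input scores are already normalized to [0, 1], and the stub search stages are hypothetical.

```python
from typing import Callable, Sequence

VECTOR_WEIGHT = 0.7  # from the text: 70% semantic vector similarity
TEXT_WEIGHT = 0.3    # from the text: 30% full-text relevance

# A search stage takes a query and returns a list of matching record names.
SearchStage = Callable[[str], list[str]]


def hybrid_score(vector_similarity: float, text_relevance: float) -> float:
    """Blend two scores, both assumed to be normalized to [0, 1]."""
    return VECTOR_WEIGHT * vector_similarity + TEXT_WEIGHT * text_relevance


def search_with_fallback(
    query: str, stages: Sequence[tuple[str, SearchStage]]
) -> tuple[str, list[str]]:
    """Run stages in order and return the first non-empty result set."""
    for name, stage in stages:
        results = stage(query)
        if results:
            return name, results
    # In this sketch an empty list is still possible if every stage fails.
    return "none", []


# Stub stages standing in for the real vector, full-text, name, and
# capability searches (hypothetical placeholders).
stages = [
    ("vector", lambda q: []),
    ("full_text", lambda q: []),
    ("name_match", lambda q: ["Acme Machining"] if "acme" in q.lower() else []),
    ("capability", lambda q: ["Generic CNC Shop"]),
]

stage_used, hits = search_with_fallback("Acme 5-axis milling", stages)
```

Returning the name of the stage that produced the results makes it possible to record in the audit trail how each result set was obtained.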

Data Governance

  • Source attribution: every data point traceable to its original source with timestamp
  • Audit trail: all search queries and results logged to support Inspector General oversight
  • Evidence bundles: exportable packages containing search parameters, results, source documentation, and methodology for waiver support
  • Data provenance: full lineage from raw source through enrichment pipeline to final indexed record
  • Version history: changes to manufacturer profiles tracked with before/after snapshots
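
To make source attribution and evidence bundles more concrete, here is a minimal sketch of what such a record and export might look like. The class name, field names, and JSON layout are assumptions for illustration. The actual bundle format is not specified above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttributedValue:
    """One data point with its source and retrieval timestamp."""
    field_name: str
    value: str
    source: str
    retrieved_at: str  # ISO 8601 timestamp


def build_evidence_bundle(
    search_parameters: dict,
    results: list[AttributedValue],
    methodology: str,
) -> str:
    """Serialize search parameters, attributed results, and methodology."""
    bundle = {
        "search_parameters": search_parameters,
        "results": [asdict(r) for r in results],
        "methodology": methodology,
    }
    # Sorted keys keep the export stable across runs, which helps when
    # two bundles need to be compared.
    return json.dumps(bundle, indent=2, sort_keys=True)


# Example values below are made up for illustration.
example = build_evidence_bundle(
    {"capability": "CNC machining", "state": "OH"},
    [AttributedValue("naics_code", "332710", "SAM.gov", "2024-01-01T00:00:00Z")],
    "Hybrid search with fallback chain",
)
```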

Scalability and Sustainability

  • Current dataset: 318,000+ manufacturers indexed with full capability profiles
  • Infrastructure: PostgreSQL with pgvector extension on managed cloud infrastructure, auto-scaling to handle agency-wide query volumes
  • API capacity: sub-second response times at sustained 1,000+ queries per minute
  • Modular architecture: data ingestion, enrichment, search, and reporting run as separate services that scale independently
  • Continuous improvement: ML feedback loop improves search relevance and data quality over time based on usage patterns