Build Log · April 1, 2026 · 6 min read

Building a Brand Database: 36,000 Brands, One Solo Developer

How we built a structured database of 36,000+ brands with product lines, images, and category data — as a solo developer, from scratch.

The hardest part of building Diffr isn't the recommendation logic or the UI. It's the data. You cannot curate brands you don't know about. So before anything else, we needed a structured database of them — their names, categories, product lines, and images.

This is the story of building that foundation as a solo developer.

The Scale Problem

We started with a seed list of brand names from public directories, aggregators, and category research. The goal: build a complete, structured dataset with product line information for each brand. The number grew fast. We're now at 36,028 brands across tens of thousands of product categories.

Each brand needs more than just a name. For Diffr's no-repeat curation to work, we need to know what a brand actually makes — its product lines, its categories, its visual identity. Building that knowledge at scale is a data engineering problem before it's anything else.

The Data Architecture

The database is built on PostgreSQL 17, running locally on an external SSD. The schema is deliberately flat for the core tables: brands, product types, and product lines. Relationships live in a separate Neo4j graph database — the brand knowledge graph that powers Diffr's "no-repeat" constraint logic.
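A minimal sketch of what a flat three-table layout could look like. The post only names the three core tables, so every column name here is hypothetical, and SQLite's in-memory engine stands in for PostgreSQL 17 purely so the snippet runs without a server.

```python
import sqlite3

# Hypothetical columns: the post names the tables, not their fields.
SCHEMA = """
CREATE TABLE brands (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    category TEXT
);
CREATE TABLE product_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE product_lines (
    id              INTEGER PRIMARY KEY,
    brand_id        INTEGER NOT NULL REFERENCES brands(id),
    product_type_id INTEGER NOT NULL REFERENCES product_types(id),
    name            TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the sketch schema in memory and load one made-up row per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO brands (id, name, category) VALUES (1, 'Example Brand', 'footwear')"
    )
    conn.execute("INSERT INTO product_types (id, name) VALUES (1, 'running shoe')")
    conn.execute(
        "INSERT INTO product_lines (brand_id, product_type_id, name) "
        "VALUES (1, 1, 'Example Runner')"
    )
    return conn


def lines_per_brand(conn: sqlite3.Connection) -> list:
    """Return (brand name, product line count) pairs, including brands with none."""
    rows = conn.execute(
        "SELECT b.name, COUNT(pl.id) FROM brands b "
        "LEFT JOIN product_lines pl ON pl.brand_id = b.id "
        "GROUP BY b.id, b.name ORDER BY b.name"
    )
    return rows.fetchall()
```

Keeping the core tables this plain leaves richer brand-to-brand relationships to the graph store rather than to join tables.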

Current scale and supporting services:

  • 36,028 brands — with names, categories, and metadata
  • 47,000+ product types — the vocabulary of what brands make
  • 1,079,000 product lines — the actual items that map to scene slots
  • Redis — for deduplication during data ingestion
  • Cloudinary CDN — for serving product and brand images
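Deduplication during ingestion can be sketched as a "have I seen this key before?" check. The normalisation rules below are an assumption, not the pipeline's actual rules, and `SeenSet` is a hypothetical in-memory stand-in for a Redis set: its `add` mirrors the way Redis's `SADD` reports 1 for a new member and 0 for an existing one.

```python
import unicodedata


def dedup_key(raw_name: str) -> str:
    """Normalise a brand name into a comparison key (hypothetical rules)."""
    text = unicodedata.normalize("NFKC", raw_name).casefold()
    return " ".join(text.split())  # collapse runs of whitespace


class SeenSet:
    """In-memory stand-in for a Redis set used as a dedup filter."""

    def __init__(self) -> None:
        self._members = set()

    def add(self, key: str) -> int:
        """Return 1 if the key was new, 0 if it was already present."""
        if key in self._members:
            return 0
        self._members.add(key)
        return 1


def ingest(names, seen: SeenSet) -> list:
    """Keep only the first occurrence of each normalised brand name."""
    fresh = []
    for name in names:
        if seen.add(dedup_key(name)) == 1:
            fresh.append(name.strip())
    return fresh
```

For example, `ingest(["Acme", " acme ", "Acme Co", "ACME  CO"], SeenSet())` keeps only `"Acme"` and `"Acme Co"`.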

The Image Problem

Brand data without images is only half useful. A curation platform needs to be visual. Our current status: 402 product lines have confirmed images, roughly 0.04% of the 1,079,000 total. The image pipeline is the primary bottleneck right now.
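The coverage figure is simple arithmetic on the two counts quoted above. A small sketch (the function and constant names are illustrative):

```python
def coverage_percent(covered: int, total: int) -> float:
    """Share of items that are covered, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * covered / total


TOTAL_PRODUCT_LINES = 1_079_000
LINES_WITH_IMAGES = 402

pct = coverage_percent(LINES_WITH_IMAGES, TOTAL_PRODUCT_LINES)
one_point = TOTAL_PRODUCT_LINES // 100  # lines per percentage point of coverage

print(f"{pct:.2f}%")  # 0.04%
print(one_point)      # 10790
```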

Each product line requires sourcing, validating, and storing a high-quality image. We've built a confidence-scoring system: high, medium, or none per image, so the curation layer can prioritise well-represented brands in early scenes rather than showing blank slots.
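One way such a three-level score could feed prioritisation is sketched below. The numeric weights and the per-brand averaging are assumptions for illustration; the post only states that the levels are high, medium, and none.

```python
from enum import Enum


class ImageConfidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    NONE = "none"


# Hypothetical weights: how much each level counts toward a brand's score.
WEIGHTS = {
    ImageConfidence.HIGH: 1.0,
    ImageConfidence.MEDIUM: 0.5,
    ImageConfidence.NONE: 0.0,
}


def representation_score(confidences) -> float:
    """Average weight across a brand's product lines (0.0 if it has none)."""
    if not confidences:
        return 0.0
    return sum(WEIGHTS[c] for c in confidences) / len(confidences)


def rank_brands(brand_images: dict) -> list:
    """Order brand names from best to worst represented, ties broken by name."""
    return sorted(
        brand_images,
        key=lambda brand: (-representation_score(brand_images[brand]), brand),
    )
```

A curation layer built this way can take brands from the top of the ranking for early scenes and leave sparsely covered brands until their images arrive.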

The Logo System

Separate from product images, every brand needs a logo — the visual anchor for Diffr's brand-first display format. Logo status is tracked independently:

  • ok: Clean logo on appropriate background, ready to display
  • warn_black_logo: Logo exists but needs background treatment
  • warn_bad_bg: Logo on a problematic background
  • no_source: Logo not yet sourced

Of 36,028 brands, 773 (about 2.1%) have confirmed clean logos. Logo quality matters more than quantity: a bad logo display undermines the whole premise of visual brand curation.
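The four status values map naturally onto an enumeration. In the sketch below the status strings come from the list above, while the helper names and the rule that only `ok` logos are displayable are assumptions.

```python
from collections import Counter
from enum import Enum


class LogoStatus(Enum):
    OK = "ok"
    WARN_BLACK_LOGO = "warn_black_logo"
    WARN_BAD_BG = "warn_bad_bg"
    NO_SOURCE = "no_source"


def is_displayable(status: LogoStatus) -> bool:
    """Assumed rule: only clean logos are shown without further treatment."""
    return status is LogoStatus.OK


def summarise(raw_statuses) -> dict:
    """Count brands per status; unknown strings raise ValueError."""
    counts = Counter(LogoStatus(raw) for raw in raw_statuses)
    return {status.value: counts.get(status, 0) for status in LogoStatus}
```

Parsing through the enum means a misspelled status fails loudly at ingestion time instead of silently creating a fifth category.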

The Graph Layer

PostgreSQL handles tabular data well, but brand relationships are better modelled as a network than as rows and joins. Which brands compete? Which share a category niche? Which appear together in scenes?

Neo4j stores the brand relationship graph: category co-occupancy, scene co-occurrence, and brand DNA similarity scores. This is what will eventually power Diffr's scene-building logic — selecting the right brand for each slot not just by category match but by relationship fit within the whole scene.
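Scene co-occurrence, one of the relationships named above, amounts to counting how often two brands share a scene. The sketch below computes those pair counts in plain Python as an illustration of the edge weights a graph could hold; it is not Neo4j code, and the scene format (a list of brand names) is an assumption.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(scenes) -> Counter:
    """Count how often each unordered brand pair appears in the same scene."""
    counts = Counter()
    for scene in scenes:
        # sorted(set(...)) drops repeats and gives each pair one canonical order
        for pair in combinations(sorted(set(scene)), 2):
            counts[pair] += 1
    return counts


scenes = [["A", "B", "C"], ["B", "A"], ["C"]]
weights = co_occurrence(scenes)
print(weights[("A", "B")])  # 2
```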

What Solo Development Looks Like at This Scale

Running a data pipeline this large solo means making peace with progress that's measured in percentages of percentages. A one-percentage-point gain in image coverage is roughly 10,800 product lines. A database this size takes months to populate, not days.

The things that help most:

  1. Decouple every phase. Data ingestion, image processing, logo handling, and graph updates are all separate jobs. Each can fail and restart without corrupting the others.
  2. Log obsessively. At this scale, a silent failure that runs for hours is worse than a fast crash. Every pipeline job writes structured logs. I check status before trusting any summary number.
  3. Design for incomplete data. The curation layer doesn't wait for 100% image coverage. It knows which brands are well-represented and prioritises those for early scenes.
  4. Use the right database for each job. PostgreSQL for structured queries. Neo4j for relationship traversal. Redis for real-time deduplication. Don't force one tool to do everything.
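The first two points can be sketched together: each phase runs in isolation, and each run emits one structured log record. The phase names, record fields, and runner below are hypothetical, not the pipeline's actual code.

```python
import json
import logging
import time

logger = logging.getLogger("pipeline")


def run_phase(name: str, job) -> dict:
    """Run one phase, catch its failure, and log a single JSON record."""
    started = time.monotonic()
    record = {"phase": name, "status": "ok", "items": None, "error": None}
    try:
        record["items"] = job()  # each job returns how many items it handled
    except Exception as exc:  # a failed phase must not take down the others
        record["status"] = "failed"
        record["error"] = repr(exc)
    record["seconds"] = round(time.monotonic() - started, 3)
    level = logging.INFO if record["status"] == "ok" else logging.ERROR
    logger.log(level, json.dumps(record))
    return record


def run_all(phases) -> list:
    """Run every (name, job) pair independently and collect the records."""
    return [run_phase(name, job) for name, job in phases]
```

Because a failure is recorded rather than raised, a crash in the image phase still lets the logo phase run, and the log shows exactly which phase needs a restart.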

What Comes Next

The data foundation is strong enough to start building the curation layer. The first public Diffr experience will work with the brands we have high-confidence data on — a few thousand well-represented brands across core consumer categories.

As image and logo coverage grows, more brands enter the curation pool. The no-repeat principle only gets more powerful with more options to choose from.

If you want to be among the first to see what 36,000 brands look like when structured by scene, join the waitlist.

#build log #indie dev #data pipeline #python #postgres #solo founder

Diffr is building a brand curation platform based on the no-repeat principle. Early access is limited.

© 2026 Truake OPC · Diffr