Under the Hood

How SolvedSeek actually works.

SolvedSeek is an independent search engine built from scratch. No Google API. No Bing reskin. We crawl the web ourselves, build our own index, rank results with our own algorithms, and run our own AI models to understand what pages are actually about. Here's how it all fits together.

The Architecture

Three technologies, one stack. Everything runs on a single server with no cloud functions, no third-party APIs, and no external dependencies at runtime.

PHP 8.2+

Handles every search query, renders pages, and manages the web interface. Custom-built from scratch, no frameworks.

Node.js Workers

Eight background processes that crawl the web, render JavaScript pages, generate AI embeddings, and calculate rankings.

MySQL / MariaDB

Stores every page, link, embedding, and trust score. FULLTEXT indexes power keyword matching at speed.

No frameworks. The search engine, router, template engine, and database layer are all custom-written. No Laravel, no Express, no React. Every line of code is purpose-built for search.

The Crawl Pipeline

Every page in our index goes through a five-stage pipeline. Here's what happens from discovery to ranking.

1. Discover

New URLs are found through link following, XML sitemaps, and manual submissions. Each URL enters a priority queue where homepages and submitted sites get crawled first. Deeper pages wait their turn.
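The discovery queue described above can be sketched roughly as follows. This is a minimal in-memory illustration, not SolvedSeek's actual code: the tier values, the `CrawlQueue` class, and the rule that a homepage is any URL with an empty or `/` path are all assumptions made for the example.

```python
import heapq
import itertools
from urllib.parse import urlsplit

# Hypothetical priority tiers: a lower number is crawled sooner.
PRIORITY_FIRST = 0   # homepages and manually submitted URLs
PRIORITY_DEEP = 1    # everything else waits its turn


def url_priority(url: str, submitted: bool = False) -> int:
    """Homepages and submitted URLs go first; deeper pages go later."""
    path = urlsplit(url).path
    if submitted or path in ("", "/"):
        return PRIORITY_FIRST
    return PRIORITY_DEEP


class CrawlQueue:
    """A small priority queue of URLs with de-duplication."""

    def __init__(self) -> None:
        self._heap = []                    # entries are (priority, sequence, url)
        self._seen = set()                 # URLs already queued
        self._counter = itertools.count()  # keeps first-in, first-out order within a tier

    def add(self, url: str, submitted: bool = False) -> bool:
        """Queue a URL; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        entry = (url_priority(url, submitted), next(self._counter), url)
        heapq.heappush(self._heap, entry)
        return True

    def pop(self) -> str:
        """Return the next URL to crawl."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

A production queue would most likely live in the database rather than in memory, but the ordering idea is the same.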

2. Crawl

Three parallel crawl workers fetch pages over HTTPS, always checking robots.txt first. We pull out the title, description, body text, and all outbound links. Pages behind CAPTCHAs or bot walls get detected and skipped gracefully.
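As an illustration of the robots.txt check, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper names, the `SolvedSeekBot` product token used for matching, and the one-second default delay are assumptions for the example; the real crawler runs as Node.js workers and may differ in every detail.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "SolvedSeekBot"  # assumed product token for rule matching
DEFAULT_DELAY = 1.0           # hypothetical fallback delay in seconds


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_decision(parser: RobotFileParser, url: str) -> tuple:
    """Return (allowed, delay_seconds) for this URL and user agent."""
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay applies
    return allowed, float(delay) if delay is not None else DEFAULT_DELAY
```

For example, with the rules `User-agent: *`, `Disallow: /private/`, and `Crawl-delay: 5`, a URL under `/private/` is refused and every other URL on that site is allowed with a five-second wait between requests.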

3. Render

A lot of modern websites load their content with JavaScript. When our crawler spots a JS-heavy page, the page is sent to a render queue where a headless Chromium browser loads it fully, just like a real visitor would, and extracts the final content.

4. Understand

Each page goes through three layers of understanding. First, language detection uses trigram analysis to figure out what language the page is written in. Second, entity extraction uses NLP to identify the primary topic, whether that's a company, place, or person. Third, a local AI model converts the text into a mathematical "meaning fingerprint" for semantic search.

5. Rank

A dedicated ranking worker continuously analyses content quality and updates domain trust levels. When you search, these pre-computed signals combine with real-time relevance scoring to order your results.

How Ranking Works

When you type a query, every matching page gets a composite score built from multiple signals. We use logarithmically damped ("LOG-damped") scoring, similar in spirit to BM25's term-frequency saturation, so that repeating a keyword 100 times on a page earns very little extra. Relevance levels off naturally, just like it should.
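To see why damping defeats keyword stuffing, consider this small sketch. It assumes the damping is a natural logarithm of one plus the raw relevance score; the exact function SolvedSeek uses is not stated on this page, so treat `log_damped` as a hypothetical stand-in.

```python
import math


def log_damped(raw_score: float) -> float:
    """Compress a raw relevance score so extra matches give diminishing returns."""
    return math.log1p(max(raw_score, 0.0))  # log(1 + x); negative input is treated as 0


# A raw score 100 times larger ends up less than 7 times larger after damping.
honest_page = log_damped(1.0)
stuffed_page = log_damped(100.0)
ratio = stuffed_page / honest_page
```

Because the curve flattens, other signals such as content quality and domain trust can easily outweigh whatever a stuffed page gains from repetition.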

Title Relevance

The strongest signal. We run a dedicated FULLTEXT search on just the title, LOG-damped so keyword stuffing doesn't work. Pages about your query outrank pages that just mention it in passing.

Body Relevance

How well your search terms match across the title, description, and body text combined. Also LOG-damped so spam pages can't game the score by repeating terms over and over.

Content Quality

We score page structure: headings, paragraphs, meta descriptions, body length, and spam patterns. This signal carries serious weight. Pages below a minimum quality floor don't appear in results at all.

Domain Trust

Every domain has a trust score from 0 to 100. Higher trust means more credibility. Domains with zero trust receive a significant ranking penalty.

Entity Match

We use NLP to figure out what each page is actually about (a company, place, or person). If that topic matches what you're searching for, the page gets a relevance boost.

Freshness

Recently crawled pages get a freshness bonus. The web changes fast and current content should rank higher than stale content.

Language Filter

Every page's language is detected using trigram analysis. English queries return only English pages, so results in other languages don't clutter your search.

Hybrid semantic retrieval. Results come from two sources merged together: traditional FULLTEXT keyword matching and AI embedding similarity. This means a search for "how to cook pasta" can surface a page about "Italian noodle recipes" even if the exact words don't match.

Domain diversity. No single domain can dominate your results. We cap results at 2 per root domain, grouping subdomains together (so blog.example.com and www.example.com count as one).
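The diversity cap can be sketched as a single pass over the ranked list. The function names are hypothetical, and the root-domain guess here is deliberately naive (it keeps the last two hostname labels); a real implementation would need the Public Suffix List to handle hosts such as example.co.uk correctly.

```python
from urllib.parse import urlsplit


def root_domain(url: str) -> str:
    """Naive root-domain guess: the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def cap_per_domain(ranked_urls: list, max_per_domain: int = 2) -> list:
    """Keep results in ranked order, allowing at most max_per_domain per root domain."""
    counts = {}
    kept = []
    for url in ranked_urls:
        domain = root_domain(url)
        if counts.get(domain, 0) < max_per_domain:
            counts[domain] = counts.get(domain, 0) + 1
            kept.append(url)
    return kept
```

Given results from www.example.com, blog.example.com, and example.com in that order, only the first two survive, because all three hosts share the root domain example.com.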

AI, NLP & Language Detection

Traditional search engines match keywords. SolvedSeek also understands meaning, using three layers of intelligence.

Hybrid Semantic Retrieval

Every page we crawl gets processed by a local AI model (all-MiniLM-L6-v2) that converts the text into a 384-dimension "meaning fingerprint". When you search, your query gets the same treatment.

Results come from two sources merged together: traditional FULLTEXT keyword matching and AI embedding similarity. We scan thousands of page embeddings to find semantically relevant pages, then merge them with keyword results. The final score blends both signals equally (50/50 weighting). This means a search for "how to cook pasta" can match a page about "Italian noodle recipes" even if the exact words don't match.
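A rough sketch of the 50/50 blend, under stated assumptions: keyword scores are normalised by the best keyword score in the result set, semantic similarity is cosine similarity clipped at zero, and the vectors are toy 2-dimensional stand-ins for the real 384-dimension embeddings. The function names and the normalisation scheme are illustrative, not taken from SolvedSeek's code.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_scores(keyword_scores: dict, query_vec: list, page_vecs: dict,
                  keyword_weight: float = 0.5) -> dict:
    """Blend keyword and embedding scores for every page found by either source."""
    best_keyword = max(keyword_scores.values(), default=0.0)
    blended = {}
    for page in set(keyword_scores) | set(page_vecs):
        # Scale keyword scores into the 0..1 range so both signals are comparable.
        keyword = keyword_scores.get(page, 0.0) / best_keyword if best_keyword > 0 else 0.0
        semantic = 0.0
        if page in page_vecs:
            semantic = max(cosine_similarity(query_vec, page_vecs[page]), 0.0)
        blended[page] = keyword_weight * keyword + (1 - keyword_weight) * semantic
    return blended
```

With this blend, a page that has no keyword match at all can still outrank a weak keyword match if its embedding sits close to the query's embedding, which is the "Italian noodle recipes" effect described above.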

Entity Detection (NLP)

Every page is analysed using NLP to identify what it's actually about. Is it a company? A place? A person? We pull candidates from the title, description, and body text, score them by how often they appear and where, and pick the best match. That becomes the page's topic label, shown as a pill on search results.
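The "how often and where" scoring can be illustrated with a toy example. Everything here is an assumption made for the sketch: candidates are simply runs of capitalised words, the field weights are invented, and the classification into company, place, or person is left out entirely. A real NLP pipeline would be considerably more careful, for instance about capitalised words at the start of a sentence.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical weights: a mention in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0}

# Toy candidate pattern: one or more consecutive capitalised words.
CANDIDATE_RE = re.compile(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b")


def pick_topic(fields: dict) -> Optional[str]:
    """Score each candidate by frequency and field, then return the best one."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 0.0)
        for candidate in CANDIDATE_RE.findall(text):
            scores[candidate] += weight
    if not scores:
        return None
    return scores.most_common(1)[0][0]
```

For a page titled "Acme Widgets: Industrial Fasteners" whose description and body both mention Acme Widgets again, the label comes out as "Acme Widgets" because that candidate collects weight from all three fields.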

Language Detection

Every page's language is identified using trigram-based analysis. The detector examines three-character sequences in the text to determine the language with high accuracy across 80+ languages. English queries return only English results.
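Here is a toy version of trigram-based detection. The three one-sentence language samples, the overlap measure, and the function names are all assumptions for illustration; a real detector covering 80+ languages would use profiles built from much larger text collections.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count three-character sequences after lowercasing and dropping non-letters."""
    letters_only = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " " + " ".join(letters_only.split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


# Tiny hypothetical training samples, one per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what the search engine is looking for",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist was die suchmaschine sucht",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et c'est ce que cherche le moteur",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect_language(text: str) -> str:
    """Return the language whose trigram profile overlaps most with the text."""
    query = trigrams(text)

    def overlap(profile: Counter) -> int:
        # Shared trigram occurrences between the text and the language profile.
        return sum(min(count, profile[gram]) for gram, count in query.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))
```

The idea is that sequences such as "the" or " th" are very common in English and rare elsewhere, so even short texts tend to lean clearly towards one profile.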

No data leaves our server. All AI models and NLP processing run entirely on our own hardware. Your queries are never sent to OpenAI, Google, or any external service.

Trust & Quality Scoring

Not all websites are equal. We score every domain on trust and every page on quality, and both directly influence where they rank.

Domain Trust (0–100)

  • Who links to this domain, and how trusted are they?
  • How consistent is the content quality across pages?
  • How long has it been in our index without issues?
  • Manual review and human oversight

Page Quality (0–1.0)

  • Does it have a meaningful amount of body text?
  • Is there a proper heading structure?
  • Does it have a meta description?
  • What's the content-to-link ratio?
  • Gambling and pharma spam detection penalties

Higher trust allows more of a domain's pages to be indexed. New domains start with a limited allowance that expands as trust is earned over time. Domains that consistently deliver poor content or spammy behaviour get blocked.

Quality floor. Pages must meet a minimum quality threshold to appear in results at all. This filters out thin content, stub pages, and auto-generated garbage before they ever reach you.

Spam detection. Our quality scorer actively detects gambling spam and pharma spam. Pages matching these patterns get hit with heavy quality penalties.

Trust-zero penalty. Domains with zero trust get a significant ranking penalty. If we don't know you, you don't get to rank high.
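The page quality checks, the quality floor, and the spam penalties described above could fit together along these lines. This is only a sketch: every weight, threshold, and field name below is invented for the example, and the real scorer's values are not published on this page.

```python
from dataclasses import dataclass

QUALITY_FLOOR = 0.3  # hypothetical minimum score for appearing in results


@dataclass
class PageFeatures:
    body_words: int             # amount of body text
    heading_count: int          # number of headings found
    has_meta_description: bool
    link_count: int             # outbound and internal links
    spam_hits: int              # matches against gambling or pharma spam patterns


def quality_score(page: PageFeatures) -> float:
    """Combine structural checks into a score between 0.0 and 1.0."""
    score = 0.4 * min(page.body_words / 300, 1.0)         # meaningful body text
    score += 0.2 if page.heading_count > 0 else 0.0       # heading structure
    score += 0.2 if page.has_meta_description else 0.0    # meta description
    words_per_link = page.body_words / max(page.link_count, 1)
    score += 0.2 * min(words_per_link / 50, 1.0)          # content-to-link ratio
    score -= 0.25 * page.spam_hits                        # heavy spam penalty
    return max(0.0, min(score, 1.0))


def passes_floor(page: PageFeatures) -> bool:
    """Pages below the floor are filtered out before ranking."""
    return quality_score(page) >= QUALITY_FLOOR
```

Under these made-up weights, a well-structured 600-word article scores close to 1.0, a 30-word stub with 40 links falls well below the floor, and a page with several spam pattern matches is pushed to zero regardless of its structure.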

Ethical Crawling

Search engines should be good citizens of the web. Our crawler follows the rules.

  • robots.txt: Fully respected. If you say "don't crawl", we don't crawl.
  • Crawl-delay: Honoured. We never hit a site faster than it allows.
  • Meta robots: noindex, nofollow, and none are all supported.
  • Canonicals: We follow canonical tags to avoid duplicate content.
  • Identification: Our bot identifies itself as SolvedSeekBot/1.0 in every request.
  • Sitemaps: We read and follow XML sitemaps declared in robots.txt.
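As one example of these rules in practice, the meta robots directives could be read with a few lines of standard-library Python. The class and function names are hypothetical, and the sketch only handles the generic `robots` meta tag with the three directives named above.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for a page's HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)    # "none" means noindex + nofollow
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow
```

A page with `content="noindex, follow"` would be left out of the index while its outbound links are still added to the crawl queue.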

By The Numbers

Live data from our index. These numbers update every time you load this page.

  • Pages Indexed: 554,691
  • Domains Crawled: 562,377
  • Searches Served: 582
  • AI Embeddings: 377,953
  • Links in Graph: 0
  • AI Coverage: 68.1%

The Journey

Building a search engine from scratch is one of the most ambitious projects in software engineering. It touches everything: networking, distributed systems, natural language processing, machine learning, information retrieval, web standards, and database engineering.

SolvedSeek started with a question: is it actually possible to build a real, independent search engine without being Google? Not a meta-search engine that queries someone else's API. Not a Bing reskin with a privacy label. A genuine, crawl-the-web-yourself, build-your-own-index, rank-with-your-own-algorithms search engine.

The answer is yes, but it takes a lot of work. Every part of this system was built, tested, broken, rebuilt, and refined. The crawl pipeline alone went through dozens of iterations before it could reliably handle thousands of pages per day across thousands of domains.

This project has been one of the best learning experiences of my career. I've learned more about how the web actually works (from robots.txt edge cases to DNS resolution quirks to the surprising complexity of HTML parsing) than years of building websites ever taught me.

What's Next

SolvedSeek is a living project. Here's what we're working towards.

Larger Index

Continuously expanding our crawl to cover more of the web, with smarter prioritisation of high-quality domains.

Smarter AI

Exploring larger embedding models and deeper semantic understanding to make search results even more relevant.

Webmaster Tools

Building tools for site owners to see how their pages appear in our index and engage with the search engine directly.

Performance

Faster search, faster crawling, and more efficient infrastructure to handle growth as the index scales.

Want to be in our index?

Submit your website and we'll crawl it. Simple as that.

Page generated Jun 27, 2026 at 12:55 AM UTC