Under the Hood
How SolvedSeek actually works.
SolvedSeek is an independent search engine built from scratch. No Google API. No Bing reskin. We crawl the web ourselves, build our own index, rank results with our own algorithms, and run our own AI models for understanding what pages are actually about. Here's how it all fits together.
The Architecture
Three technologies, one stack. Everything runs on a single server with no cloud functions, no third-party APIs, and no external dependencies at runtime.
PHP 8.2+
Handles every search query, renders pages, and manages the web interface. Custom-built from scratch, no frameworks.
Node.js Workers
Eight background processes that crawl the web, render JavaScript pages, generate AI embeddings, and calculate rankings.
MySQL / MariaDB
Stores every page, link, embedding, and trust score. FULLTEXT indexes power keyword matching at speed.
No frameworks. The search engine, router, template engine, and database layer are all custom-written. No Laravel, no Express, no React. Every line of code is purpose-built for search.
The Crawl Pipeline
Every page in our index goes through a five-stage pipeline. Here's what happens from discovery to ranking.
Discover
New URLs are found through link following, XML sitemaps, and manual submissions. Each URL enters a priority queue where homepages and submitted sites get crawled first. Deeper pages wait their turn.
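The discovery queue can be sketched as a small priority queue. This is a minimal illustration, not our production code: the class name `CrawlQueue`, the helper `priority_for`, and the two priority levels are assumptions made for the example.

```python
import heapq
import itertools
from urllib.parse import urlparse

# Hypothetical priority levels: a lower number is crawled sooner.
PRIORITY_HIGH = 0   # homepages and manually submitted sites
PRIORITY_DEEP = 1   # everything else waits its turn


def priority_for(url, submitted=False):
    """Homepages (empty or "/" path) and submitted URLs go first."""
    path = urlparse(url).path
    if submitted or path in ("", "/"):
        return PRIORITY_HIGH
    return PRIORITY_DEEP


class CrawlQueue:
    def __init__(self):
        self._heap = []
        self._seen = set()
        # The counter breaks ties so URLs of equal priority stay first-in, first-out.
        self._counter = itertools.count()

    def add(self, url, submitted=False):
        """Queue a URL once; returns False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        entry = (priority_for(url, submitted), next(self._counter), url)
        heapq.heappush(self._heap, entry)
        return True

    def pop(self):
        """Return the next URL to crawl."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

A deep page added before a homepage is still crawled after it, because the heap orders by priority first and by arrival second.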
Crawl
Three parallel crawl workers fetch pages over HTTPS, always checking robots.txt first. We pull out the title, description, body text, and all outbound links. Pages behind CAPTCHAs or bot walls get detected and skipped gracefully.
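The robots.txt check can be illustrated with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the helper name `is_allowed` is invented for the example, and the robots.txt body is passed in as a string so the example does no network access.

```python
from urllib.robotparser import RobotFileParser

# Product token taken from the SolvedSeekBot/1.0 user agent string.
USER_AGENT = "SolvedSeekBot"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt body permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "https://example.com/public/page"))
print(is_allowed(robots, "https://example.com/private/page"))
```

A crawl worker would run a check like this before every fetch and skip any URL the site has disallowed.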
Render
A lot of modern websites load their content with JavaScript. When our crawler spots a JS-heavy page, it gets sent to a render queue where a headless Chromium browser loads it fully, just like a real visitor would, and extracts the final content.
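One plausible way to spot a JS-heavy page is to compare how much visible text the raw HTML contains against whether it loads scripts. The heuristic below is an assumption for illustration only: the function name `needs_render` and the 200-character threshold are hypothetical, and the real detection may use different signals.

```python
from html.parser import HTMLParser


class _TextAndScriptCounter(HTMLParser):
    """Counts <script> tags and visible characters outside script/style."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def needs_render(html, min_visible_chars=200):
    """Hypothetical rule: scripts present but very little visible text."""
    counter = _TextAndScriptCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

A page that is just an empty root element plus a script bundle would be sent to the render queue, while a static article with plenty of text would not.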
Understand
Each page goes through three layers of understanding. First, language detection uses trigram analysis to figure out what language the page is written in. Second, entity extraction uses NLP to identify the primary topic, whether that's a company, place, or person. Third, a local AI model converts the text into a mathematical "meaning fingerprint" for semantic search.
Rank
A dedicated ranking worker continuously analyses content quality and updates domain trust levels. When you search, these pre-computed signals combine with real-time relevance scoring to order your results.
How Ranking Works
When you type a query, every matching page gets a composite score built from multiple signals. We use LOG-damped scoring (similar to BM25) so that repeating a keyword 100 times on a page doesn't help. Relevance saturates naturally, just like it should.
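To see why log damping defeats keyword stuffing, here is a small sketch. It assumes the damping takes the form log(1 + x); the exact function we use is not spelled out on this page, so treat `damped` as illustrative.

```python
import math


def damped(raw_score):
    """Log-damp a raw relevance score so repeated matches have diminishing returns."""
    return math.log(1.0 + raw_score)


one_mention = damped(1)         # about 0.69
hundred_mentions = damped(100)  # about 4.62
print(round(hundred_mentions / one_mention, 2))
```

Under this assumption, a hundred mentions earn less than seven times the score of a single mention, instead of a hundred times.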
Title match is the strongest signal. We run a dedicated FULLTEXT search on just the title, LOG-damped so keyword stuffing doesn't work. Pages about your query outrank pages that just mention it in passing.
Content match measures how well your search terms match across the title, description, and body text combined. It is also LOG-damped so spam pages can't game the score by repeating terms over and over.
We score page structure: headings, paragraphs, meta descriptions, body length, and spam patterns. This signal carries serious weight. Pages below a minimum quality floor don't appear in results at all.
Every domain has a trust score from 0 to 100. Higher trust means more credibility. Domains with zero trust receive a significant ranking penalty.
We use NLP to figure out what each page is actually about (a company, place, or person). If that topic matches what you're searching for, the page gets a relevance boost.
Recently crawled pages get a freshness bonus. The web changes fast and current content should rank higher than stale content.
Every page's language is detected using trigram analysis. English queries only return English pages. No more foreign-language results cluttering your search.
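The signals above can be pictured as one composite score. The sketch below is an illustration under heavy assumptions: every weight, the quality floor of 0.2, the zero-trust multiplier, and the 30-day freshness decay are hypothetical values chosen for the example, not our real tuning.

```python
import math
from dataclasses import dataclass

# All constants below are hypothetical, chosen only to illustrate the shape.
W_TITLE, W_CONTENT, W_QUALITY = 3.0, 1.0, 2.0
W_TRUST, W_ENTITY, W_FRESH = 1.0, 0.5, 0.5
QUALITY_FLOOR = 0.2
ZERO_TRUST_MULTIPLIER = 0.5
FRESHNESS_DAYS = 30.0


@dataclass
class Candidate:
    title_match: float = 0.0      # raw keyword relevance on the title
    content_match: float = 0.0    # raw relevance on title + description + body
    quality: float = 0.5          # page quality, 0.0 to 1.0
    trust: int = 50               # domain trust, 0 to 100
    entity_match: bool = False    # page topic matches the query
    days_since_crawl: float = 0.0
    language: str = "en"


def composite_score(c, query_language="en"):
    """Return a score, or None if the page is filtered out entirely."""
    if c.language != query_language or c.quality < QUALITY_FLOOR:
        return None
    score = (
        W_TITLE * math.log1p(c.title_match)
        + W_CONTENT * math.log1p(c.content_match)
        + W_QUALITY * c.quality
        + W_TRUST * c.trust / 100
        + (W_ENTITY if c.entity_match else 0.0)
        + W_FRESH * math.exp(-c.days_since_crawl / FRESHNESS_DAYS)
    )
    if c.trust == 0:
        score *= ZERO_TRUST_MULTIPLIER
    return score
```

The filters run first, so a page below the quality floor or in the wrong language never receives a score at all.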
Hybrid semantic retrieval. Results come from two sources merged together: traditional FULLTEXT keyword matching and AI embedding similarity. This means a search for "how to cook pasta" can surface a page about "Italian noodle recipes" even if the exact words don't match.
Domain diversity. No single domain can dominate your results. We cap results at 2 per root domain, grouping subdomains together (so blog.example.com and www.example.com count as one).
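The domain cap can be sketched in a few lines. One loud assumption: `root_domain` here naively keeps the last two labels of the hostname, which is wrong for suffixes like co.uk; a real implementation would need a public suffix list. Both function names are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def root_domain(url):
    """Naive root domain: the last two labels of the hostname (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def cap_per_domain(ranked_urls, limit=2):
    """Keep at most `limit` results per root domain, preserving rank order."""
    counts = defaultdict(int)
    kept = []
    for url in ranked_urls:
        domain = root_domain(url)
        if counts[domain] < limit:
            counts[domain] += 1
            kept.append(url)
    return kept


results = [
    "https://www.example.com/a",
    "https://blog.example.com/b",
    "https://example.com/c",
    "https://other.org/x",
]
print(cap_per_domain(results))
```

Here the third example.com result is dropped because blog.example.com and www.example.com already used up that domain's two slots.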
AI, NLP & Language Detection
Traditional search engines match keywords. SolvedSeek also understands meaning, using three layers of intelligence.
Hybrid Semantic Retrieval
Every page we crawl gets processed by a local AI model (all-MiniLM-L6-v2) that converts the text into a 384-dimension "meaning fingerprint". When you search, your query gets the same treatment.
Results come from two sources merged together: traditional FULLTEXT keyword matching and AI embedding similarity. We scan thousands of page embeddings to find semantically relevant pages, then merge them with keyword results. The final score blends both signals equally (50/50 weighting). This means a search for "how to cook pasta" can match a page about "Italian noodle recipes" even if the exact words don't match.
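The 50/50 blend can be illustrated with plain Python. This is a sketch, not the production code: it uses tiny 3-dimension vectors in place of real 384-dimension embeddings, assumes keyword scores are already normalised to the range 0 to 1, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_scores(keyword_scores, query_vec, page_vecs, keyword_weight=0.5):
    """Merge keyword and embedding candidates and blend their scores.

    keyword_scores: page id -> keyword score, assumed normalised to [0, 1]
    page_vecs:      page id -> embedding vector
    """
    semantic = {
        page_id: max(0.0, cosine_similarity(query_vec, vec))
        for page_id, vec in page_vecs.items()
    }
    blended = {}
    for page_id in set(keyword_scores) | set(semantic):
        blended[page_id] = (
            keyword_weight * keyword_scores.get(page_id, 0.0)
            + (1.0 - keyword_weight) * semantic.get(page_id, 0.0)
        )
    return blended
```

A page with no keyword match at all can still earn up to half the maximum score from embedding similarity alone, which is how the "Italian noodle recipes" page reaches the results.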
Entity Detection (NLP)
Every page is analysed using NLP to identify what it's actually about. Is it a company? A place? A person? We pull candidates from the title, description, and body text, score them by how often they appear and where, and pick the best match. That becomes the page's topic label, shown as a pill on search results.
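Scoring candidates "by how often they appear and where" can be sketched as a weighted count. Everything specific here is an assumption: candidates are found with a simple capitalised-phrase regex rather than a real NLP model, the field weights are hypothetical, and the sketch does not classify the topic as a company, place, or person.

```python
import re
from collections import Counter

# Hypothetical weights: a mention in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "description": 2.0, "body": 1.0}

# Runs of capitalised words, e.g. "Acme Robotics" (a crude stand-in for NLP).
CANDIDATE_RE = re.compile(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b")


def primary_topic(title, description, body):
    """Return the highest-scoring candidate phrase, or None if there is none."""
    scores = Counter()
    fields = (("title", title), ("description", description), ("body", body))
    for field, text in fields:
        for candidate in CANDIDATE_RE.findall(text):
            scores[candidate] += FIELD_WEIGHTS[field]
    if not scores:
        return None
    return scores.most_common(1)[0][0]
```

The winning phrase is what would become the page's topic label.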
Language Detection
Every page's language is identified using trigram-based analysis. It examines character patterns to determine the language with high accuracy across 80+ languages. English queries return only English results.
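Trigram-based detection compares the three-character sequences in a text against a profile for each language. The sketch below is a toy under stated assumptions: it builds three profiles from one sample sentence each, where a real detector uses large training corpora, and the sample sentences and function names are invented for the example.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training samples (assumption); real profiles come from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "the people were looking for with their search",
    "de": "der schnelle braune fuchs springt über den faulen hund und das "
          "ist was die leute mit ihrer suche gesucht haben",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "c'est ce que les gens cherchaient avec leur recherche",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Because the comparison works on character patterns rather than a dictionary, it can label text it has never seen, as long as the language's typical letter sequences are in the profile.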
No data leaves our server. All AI models and NLP processing run entirely on our own hardware. Your queries are never sent to OpenAI, Google, or any external service.
Trust & Quality Scoring
Not all websites are equal. We score every domain on trust and every page on quality, and both directly influence where they rank.
Domain Trust (0–100)
- ● Who links to this domain, and how trusted are they?
- ● How consistent is the content quality across pages?
- ● How long has it been in our index without issues?
- ● Manual review and human oversight
Page Quality (0–1.0)
- ● Does it have a meaningful amount of body text?
- ● Is there a proper heading structure?
- ● Does it have a meta description?
- ● What's the content-to-link ratio?
- ● Gambling and pharma spam detection penalties
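The page quality checks listed above could be combined along these lines. This is a sketch with invented numbers: the weights, the 300-word and 20-words-per-link targets, the spam word list, and the penalty multiplier are all hypothetical, and `page_quality` is not the name of a real function in our codebase.

```python
import re

# Illustrative spam vocabulary only; a real detector would be far broader.
SPAM_PATTERNS = re.compile(r"\b(casino|online slots|viagra|cheap pills)\b", re.IGNORECASE)
SPAM_MULTIPLIER = 0.2  # hypothetical heavy penalty


def page_quality(body_text, heading_count, has_meta_description, link_count):
    """Return a quality score between 0.0 and 1.0."""
    words = len(body_text.split())
    score = 0.0
    # Meaningful amount of body text (full credit at 300 words).
    score += 0.4 * min(words / 300, 1.0)
    # Proper heading structure.
    score += 0.2 if heading_count > 0 else 0.0
    # Meta description present.
    score += 0.2 if has_meta_description else 0.0
    # Content-to-link ratio (full credit at 20 words per link).
    words_per_link = words / max(link_count, 1)
    score += 0.2 * min(words_per_link / 20, 1.0)
    # Gambling and pharma spam penalty.
    if SPAM_PATTERNS.search(body_text):
        score *= SPAM_MULTIPLIER
    return round(min(max(score, 0.0), 1.0), 3)
```

A score produced this way can then be compared against a minimum floor to decide whether the page is eligible for results at all.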
Higher trust lets us index more pages from a domain. New domains start with a limited allowance that expands as trust is earned over time. Domains that consistently deliver poor content or spammy behaviour get blocked.
Quality floor. Pages must meet a minimum quality threshold to appear in results at all. This filters out thin content, stub pages, and auto-generated garbage before they ever reach you.
Spam detection. Our quality scorer actively detects gambling spam and pharma spam. Pages matching these patterns get hit with heavy quality penalties.
Trust-zero penalty. Domains with zero trust get a significant ranking penalty. If we don't know you, you don't get to rank high.
Ethical Crawling
Search engines should be good citizens of the web. Our crawler follows the rules.
We identify ourselves as SolvedSeekBot/1.0 in every request.
Sitemaps
We read and follow XML sitemaps declared in robots.txt.
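Reading sitemaps declared in robots.txt takes two steps: find the Sitemap lines, then pull the page URLs out of the XML. A minimal sketch using only the Python standard library follows; the two helper names are assumptions, `site_maps()` requires Python 3.8 or later, and both functions take text as input so the example does no network access.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls_from_robots(robots_txt):
    """Return the sitemap URLs declared with Sitemap: lines in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


def page_urls_from_sitemap(sitemap_xml):
    """Return every <loc> value from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

Each URL found this way would then enter the discovery queue like any other newly found link.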
By The Numbers
Live data from our index. These numbers update every time you load this page.
- ● Pages Indexed: 554,691
- ● Domains Crawled: 562,377
- ● Searches Served: 582
- ● AI Embeddings: 377,953
- ● Links in Graph: 0
- ● AI Coverage: 68.1%
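The AI Coverage figure is consistent with dividing AI Embeddings by Pages Indexed. That relationship is an assumption inferred from the numbers on this page, and `coverage_percent` is a hypothetical helper written for the check.

```python
def coverage_percent(embeddings, pages_indexed):
    """Share of indexed pages that have an embedding, as a percentage."""
    if pages_indexed == 0:
        return 0.0
    return round(100 * embeddings / pages_indexed, 1)


print(coverage_percent(377_953, 554_691))  # prints 68.1
```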
The Journey
Building a search engine from scratch is one of the most ambitious projects in software engineering. It touches everything: networking, distributed systems, natural language processing, machine learning, information retrieval, web standards, and database engineering.
SolvedSeek started with a question: is it actually possible to build a real, independent search engine without being Google? Not a meta-search engine that queries someone else's API. Not a Bing reskin with a privacy label. A genuine, crawl-the-web-yourself, build-your-own-index, rank-with-your-own-algorithms search engine.
The answer is yes, but it takes a lot of work. Every part of this system was built, tested, broken, rebuilt, and refined. The crawl pipeline alone went through dozens of iterations before it could reliably handle thousands of pages per day across thousands of domains.
This project has been one of the best learning experiences of my career. I've learned more about how the web actually works (from robots.txt edge cases to DNS resolution quirks to the surprising complexity of HTML parsing) than years of building websites ever taught me.
What's Next
SolvedSeek is a living project. Here's what we're working towards.
Continuously expanding our crawl to cover more of the web, with smarter prioritisation of high-quality domains.
Exploring larger embedding models and deeper semantic understanding to make search results even more relevant.
Building tools for site owners to see how their pages appear in our index and engage with the search engine directly.
Faster search, faster crawling, and more efficient infrastructure to handle growth as the index scales.
Page generated Jun 27, 2026 at 12:55 AM UTC