SolvedSeek Documentation

Everything you need to know about SolvedSeek, our crawler, and how your site fits into our index.

What is SolvedSeek?

SolvedSeek is an independent web search engine. We operate our own crawler, build our own index, and rank results using our own algorithms. We do not license results from Google, Bing, or any other provider. Every page in our index was discovered and fetched by our crawler directly.

The search engine is designed to be transparent about how it works. This page explains our crawling, indexing, and ranking in full detail so that users and webmasters can understand exactly what happens when a query is submitted or a page is crawled.

Why we built this

The modern web search landscape is dominated by a small number of companies. Most "alternative" search engines are skins on top of Bing or Google APIs. They cannot control what gets indexed, how results are ranked, or what gets filtered. They are, at best, a different interface to someone else's index.

We wanted something different: a search engine that owns its entire stack. One where the ranking algorithm is not a black box optimised for ad revenue. One where privacy is not a marketing claim but a structural guarantee, because there is no tracking infrastructure to begin with.

SolvedSeek is not trying to index the entire web. We focus on building a curated, high-quality index where every domain has earned its place through content quality and trust signals, not through SEO manipulation or advertising spend.

How search works

When you type a query, the following happens:

Cache check: We check if this exact query has been answered recently. Cached results are served instantly.
FULLTEXT search: Your query is matched against page titles, descriptions, and body text using MySQL FULLTEXT indexing in natural language mode. Scores are LOG-damped (BM25-style) to prevent keyword-stuffed pages from dominating.
Embedding candidates: In parallel, your query is converted to a vector embedding and compared against thousands of page embeddings to find semantically similar pages that keyword matching might miss.
Merge & score: FULLTEXT and embedding candidates are merged and deduplicated. Each page receives a composite score from multiple ranking signals (detailed below).
Twiddler adjustments: Admin-defined ranking rules can pin, promote, demote, block, or otherwise adjust specific results.
Semantic re-ranking: Results are re-ranked by blending keyword scores with semantic similarity (50/50 weighting by default).
Diversity & filtering: Results are capped at 2 per root domain (subdomains grouped together), language-filtered to match the query, and paginated.
Results returned: The final ranked list is returned with titles, URLs, description snippets, and entity topic labels.
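
The diversity step above can be sketched in a few lines. This is a minimal illustration, not our production code: the function names are hypothetical, and the root-domain guess (last two host labels) is a simplifying assumption that would need a public-suffix list to handle hosts such as example.co.uk.

```python
from urllib.parse import urlsplit


def root_domain(url: str) -> str:
    """Naive root-domain guess: the last two host labels (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def cap_per_domain(ranked_urls: list[str], limit: int = 2) -> list[str]:
    """Keep at most `limit` results per root domain, preserving rank order."""
    seen: dict[str, int] = {}
    kept = []
    for url in ranked_urls:
        domain = root_domain(url)
        if seen.get(domain, 0) < limit:
            kept.append(url)
            seen[domain] = seen.get(domain, 0) + 1
    return kept
```

Because subdomains are grouped, a third result from any host under the same root domain is dropped even if the first two came from different subdomains.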

Search operators

SolvedSeek supports search operators that give you fine-grained control over your results. You can combine multiple operators in a single query.

Operator	Example	What it does
`site:`	`site:github.com react`	Only show results from a specific domain (includes subdomains).
`"quotes"`	`"search engine"`	Match the exact phrase. Only pages containing those words in that order will appear.
`-word`	`python -snake`	Exclude results containing a specific word.
`intitle:`	`intitle:tutorial javascript`	Only show results where the word appears in the page title.
`inurl:`	`inurl:blog marketing`	Only show results where the word appears in the URL path.
`filetype:`	`filetype:pdf machine learning`	Only show results with a specific file extension in the URL.
`after:`	`after:2026-01-01 news`	Only show pages crawled on or after a specific date (YYYY-MM-DD).
`before:`	`before:2026-01-01 archive`	Only show pages crawled on or before a specific date (YYYY-MM-DD).
`trust:`	`trust:80 banking`	Only show results from domains with a trust score at or above the specified value (0-100).
`lang:`	`lang:de berlin`	Override language detection. Show results in a specific language (ISO 639-1 code).

Combining operators

You can use multiple operators together. For example:

`site:github.com intitle:readme "getting started"` finds README pages on GitHub containing the phrase "getting started".
`python tutorial -django after:2026-01-01` finds recent Python tutorials that are not about Django.
`trust:80 inurl:blog machine learning` finds machine learning content on high-trust blogs.

If you use an operator by itself (e.g., just `site:wikipedia.org`), results are ranked by page quality and domain trust instead of keyword relevance.
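
To make the operator syntax concrete, here is a rough sketch of how a query string could be split into operators, phrases, exclusions, and plain terms. The function name, the returned dictionary layout, and the rule that a repeated operator keeps only its last value are assumptions for illustration; the real parser may differ.

```python
import re

# Operators that take a value after the colon (from the table above).
VALUE_OPERATORS = {
    "site", "intitle", "inurl", "filetype", "after", "before", "trust", "lang",
}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query: str) -> dict:
    """Split a query into plain terms, phrases, exclusions and operators."""
    parsed = {"terms": [], "phrases": [], "excluded": [], "operators": {}}
    for match in TOKEN_RE.finditer(query):
        phrase, word = match.group(1), match.group(2)
        if phrase is not None:
            parsed["phrases"].append(phrase)
            continue
        name, sep, value = word.partition(":")
        if sep and value and name.lower() in VALUE_OPERATORS:
            # Assumption: a repeated operator overwrites the earlier value.
            parsed["operators"][name.lower()] = value
        elif word.startswith("-") and len(word) > 1:
            parsed["excluded"].append(word[1:])
        else:
            parsed["terms"].append(word)
    return parsed
```

A query with an empty "terms" and "phrases" list corresponds to the operator-only case described above.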

Ranking signals

The composite score for each page is built from the signals below. All FULLTEXT scores are LOG-damped (similar to BM25 saturation), so repeating a keyword 100 times on a page gives diminishing returns and keyword-stuffed spam cannot dominate results:

Signal	How it works
Title relevance	The strongest signal. A dedicated FULLTEXT search on the title only, LOG-damped (BM25-style) so keyword stuffing has diminishing returns. Pages about your query outrank pages that merely mention it.
Body relevance	FULLTEXT score across title, description, and body text combined. Also LOG-damped to prevent spam pages from gaming the score through term repetition.
Title keyword match	Pages where the query appears directly in the title (LIKE match) receive a significant bonus on top of the FULLTEXT score.
URL keyword match	Pages where the query appears in the URL receive an additional bonus (e.g., searching "react" boosts pages at /react-tutorial).
Entity topic match	Each page has an NLP-extracted entity (company, place, or person). If the page's entity matches your query, it receives a relevance boost.
Content quality	A heavily-weighted quality score (0.0 to 1.0) based on body text length, heading structure, paragraph count, meta description, link ratios, and spam detection. Pages below 0.15 are excluded entirely.
Domain trust	Higher-trust domains receive a boost proportional to their score (0 to 100). Domains with zero trust receive a heavy penalty (-30 points).
Freshness	Pages crawled within the last 7 days receive a small freshness bonus.
Description penalty	Pages without a meta description (or with one shorter than 20 characters) receive a score penalty.
Language filter	Pages are tagged with their detected language using trigram analysis. English queries return only English pages; pages in other languages are excluded from results entirely.
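
The effect of LOG-damping on the relevance signals can be illustrated with a small sketch. The exact damping formula is not specified here, so log(1 + score) is an assumption chosen only to show the shape of the curve.

```python
import math


def log_damp(raw_score: float) -> float:
    """Compress a raw FULLTEXT score so that growth flattens out.

    Assumption: log(1 + score); negative inputs are clamped to zero.
    """
    return math.log1p(max(raw_score, 0.0))
```

Under this assumption a raw score of 1 damps to roughly 0.69 and a raw score of 100 to roughly 4.62, so a hundredfold increase in raw keyword score yields less than a sevenfold increase after damping.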

On top of these base signals, admin-defined twiddlers can apply additional adjustments. Twiddlers support 10 ranking functions: pin, promote, demote, block, freshness boost, authority cap, domain diversity, description requirement, title match bonus, and URL depth penalty. These can be applied globally or targeted at specific domains, URLs, or query patterns.

Semantic search

SolvedSeek uses hybrid semantic retrieval that combines traditional keyword matching with AI-powered semantic understanding. Every crawled page has a vector embedding generated using a local machine learning model (all-MiniLM-L6-v2, producing 384-dimension vectors).

At search time, two retrieval paths run in parallel:

FULLTEXT candidates: Traditional keyword matching via MySQL FULLTEXT indexes finds pages containing your search terms.
Embedding candidates: Your query is converted to a vector embedding and compared against thousands of page embeddings using cosine similarity. Pages above a 0.3 similarity threshold are included as candidates.

Both candidate sets are merged and deduplicated. The final score blends the keyword-based composite score with the semantic similarity score using a 50/50 weighting. This means SolvedSeek can understand that a search for "how to cook pasta" is related to a page about "Italian noodle recipes" even if the exact words do not match.
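
The retrieval threshold and the 50/50 blend can be sketched as follows. The function names are hypothetical, and the sketch assumes the keyword composite score has already been normalised to the 0 to 1 range so that it is comparable with cosine similarity; how that normalisation is done is not covered here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_candidates(query_vec, page_vecs: dict, threshold: float = 0.3) -> dict:
    """Return {url: similarity} for pages above the similarity threshold."""
    candidates = {}
    for url, vec in page_vecs.items():
        similarity = cosine_similarity(query_vec, vec)
        if similarity > threshold:
            candidates[url] = similarity
    return candidates


def blend(keyword_score: float, similarity: float, weight: float = 0.5) -> float:
    """Blend a normalised keyword score (assumed 0..1) with similarity."""
    return weight * keyword_score + (1.0 - weight) * similarity
```

With the default weight, a page with a keyword score of 0.8 and a similarity of 0.4 ends up at 0.6.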

All embedding generation runs locally on our own hardware. No data is sent to external AI services.

Trust scoring

Every domain in our index has a trust score from 0 to 100. Trust is not self-reported or purchased. It is calculated based on:

Inbound link quality (who links to this domain, and how trusted are they?)
Content consistency across crawled pages
Crawl history (how long the domain has been in the index, error rates)
Human review and manual adjustments by administrators

Trust determines how deeply we index a domain. New domains start with an initial page allowance of 20 pages. As a domain proves its quality over time, this limit can be expanded up to 500 pages. Domains with very low trust may be blocked from the index entirely.

Trust-zero penalty: Domains with a trust score of 0 receive a heavy ranking penalty (-30 points), keeping unknown or untrusted domains from appearing prominently in results. Combined with the quality floor (pages must score at least 0.15 to appear at all), this creates a strong baseline filter against spam and low-quality content.
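
As a rough sketch, the quality floor and trust adjustments combine like this. The 0.15 floor and the 30-point penalty come from the text above; the per-point trust boost is a made-up placeholder, since the actual weighting is not published here.

```python
from typing import Optional

QUALITY_FLOOR = 0.15         # pages below this are excluded (from the text)
TRUST_ZERO_PENALTY = 30.0    # points removed for trust-0 domains (from the text)
TRUST_BOOST_PER_POINT = 0.1  # hypothetical weight, for illustration only


def apply_trust_and_quality(
    base_score: float, quality: float, trust: int
) -> Optional[float]:
    """Return the adjusted score, or None when the page is excluded."""
    if quality < QUALITY_FLOOR:
        return None
    if trust == 0:
        return base_score - TRUST_ZERO_PENALTY
    return base_score + trust * TRUST_BOOST_PER_POINT
```

Checking the quality floor first means a low-quality page is dropped regardless of how trusted its domain is.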

Our crawler

User-Agent	SolvedSeekBot/1.0
Full string	SolvedSeekBot/1.0 (+https://solvedseek.com/docs)
Respects	robots.txt, meta robots, X-Robots-Tag, canonical URLs, crawl-delay
Concurrency	10 URLs per tick (parallel), one page per domain per tick, 3 crawl workers
Timeout	15 seconds per page request

SolvedSeekBot is a Node.js-based crawler that discovers pages through link following, XML sitemaps, and manually seeded URLs. It uses a priority-based queue: seed URLs get the highest priority, followed by homepage children, sitemap URLs, and then deeper links.

The crawler processes pages in a breadth-first pattern, prioritising homepage and top-level pages before diving deeper into a site. New external domains discovered through links are automatically added to the index with an initial 20-page allowance.

Each crawled page goes through the following pipeline:

Check domain page limits and robots.txt permissions
Fetch the page via HTTPS (with HTTP fallback)
Detect CDN challenge/bot protection pages (Cloudflare, Sucuri, Akamai, etc.) and skip them
Check X-Robots-Tag headers and meta robots tags for noindex/nofollow directives
Skip pages with empty titles (low-quality/non-content pages)
Resolve canonical URLs to avoid duplicate content
Compute content hash for deduplication
Calculate quality score (including gambling/pharma spam detection) and store the page in the index
Detect page language using trigram analysis (franc-min) and store the ISO 639-1 code
Extract the primary named entity (company, place, person) using NLP (compromise)
Generate a vector embedding for semantic search (non-blocking)
Discover and queue internal links, external links, and sitemap URLs
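
The content-hash step in the pipeline above might look roughly like this. The hash algorithm and the normalisation rules are not specified in this document, so SHA-256 over lower-cased, whitespace-collapsed text is an assumption for illustration.

```python
import hashlib


def content_hash(body_text: str) -> str:
    """Hash extracted body text so identical pages map to the same value.

    Assumptions: whitespace is collapsed and case is ignored before hashing.
    """
    normalised = " ".join(body_text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

Two URLs whose extracted text differs only in spacing or letter case would then produce the same hash and be treated as duplicates.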

Robots.txt and meta tags

SolvedSeekBot fully respects robots.txt. We support Disallow, Allow, and Crawl-Delay directives. When both Allow and Disallow match a path, the most specific (longest) rule wins. If they are the same length, Allow takes precedence.

We check for our specific user agent first (SolvedSeekBot), then fall back to the wildcard (*) rules.
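
The Allow/Disallow precedence rule can be expressed compactly. This sketch assumes the rules for the chosen user-agent group have already been parsed into (kind, path prefix) pairs, and it handles plain prefix matching only; wildcard patterns are left out. The function name is hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether `path` may be crawled.

    `rules` holds ("allow" | "disallow", path_prefix) pairs. The longest
    matching prefix wins; on a tie, Allow takes precedence.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        is_allow = kind.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed
```

For example, with Disallow: /private/ and Allow: /private/public/, the path /private/public/page is allowed because the Allow rule is longer.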

We also respect:

`<meta name="robots" content="noindex">`: Page will not be stored in the index
`<meta name="robots" content="nofollow">`: Links on the page will not be followed
`<meta name="solvedseekbot" content="noindex">`: Bot-specific directive
`X-Robots-Tag: noindex`: HTTP header directive
`none`: Equivalent to noindex, nofollow
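
A simplified view of how these directives combine is sketched below. The function name and return shape are hypothetical; the sketch assumes the caller has already collected the relevant content values (meta robots, the bot-specific meta tag, and the X-Robots-Tag header) as strings.

```python
def parse_robots_directives(*values: str) -> dict:
    """Merge comma-separated directive strings into index/follow flags."""
    tokens = set()
    for value in values:
        tokens.update(t.strip().lower() for t in value.split(",") if t.strip())
    if "none" in tokens:
        # "none" is shorthand for "noindex, nofollow".
        tokens.update({"noindex", "nofollow"})
    return {
        "index": "noindex" not in tokens,
        "follow": "nofollow" not in tokens,
    }
```

The most restrictive directive wins: if any source says noindex, the page is not stored.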

To block SolvedSeekBot from your site entirely, add this to your robots.txt:

User-agent: SolvedSeekBot
Disallow: /

XML Sitemaps

SolvedSeekBot reads Sitemap: directives from your robots.txt file. When crawling a domain for the first time (via a seed URL), the crawler checks for declared sitemaps and queues URLs found in them.

We support both standard sitemap XML files and sitemap index files. For sitemap indexes, we process the first child sitemap. Each sitemap is limited to 200 URLs to keep the queue manageable.

Sitemap URLs are queued at a medium priority, below homepage children but above deep links discovered through crawling.
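
The sitemap limits described above can be sketched with the standard library XML parser. The function name and return shape are assumptions; fetching the file and following the child sitemap are left to the caller.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS_PER_SITEMAP = 200  # per-sitemap cap from the text above


def extract_sitemap_urls(xml_text: str, limit: int = MAX_URLS_PER_SITEMAP):
    """Return (kind, locations) for a sitemap or sitemap index document.

    For a sitemap index only the first child sitemap is returned; for a
    regular sitemap at most `limit` page URLs are returned.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = [
        el.text.strip()
        for el in root.iter(SITEMAP_NS + "loc")
        if el.text and el.text.strip()
    ]
    if kind == "index":
        return kind, locs[:1]
    return kind, locs[:limit]
```

If your sitemap lists more than 200 URLs, placing the most important pages first gives them the best chance of being queued.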

Canonical URLs

SolvedSeekBot respects the <link rel="canonical"> tag. If a page declares a canonical URL that differs from the fetched URL but points to the same domain, we store the page under the canonical URL instead. This prevents duplicate entries for pages with query parameters, session IDs, or tracking parameters.

If the canonical URL points to a different domain, we skip indexing the page (it is a cross-domain canonical, indicating the content belongs to another site), but we still follow links on the page to discover new content.
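
The canonical handling above boils down to a small decision function. This is a sketch only: the function name is hypothetical, and "same domain" is interpreted here as an exact hostname match, which is an assumption.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_canonical(fetched_url: str, canonical_href: Optional[str]):
    """Return (url_to_store, should_index) for a fetched page."""
    if not canonical_href:
        return fetched_url, True
    # Relative canonical hrefs are resolved against the fetched URL.
    canonical = urljoin(fetched_url, canonical_href)
    same_host = urlsplit(canonical).hostname == urlsplit(fetched_url).hostname
    if same_host:
        return canonical, True
    # Cross-domain canonical: do not index; links are still followed.
    return fetched_url, False
```

A page fetched at /p?utm=1 with a canonical of /p is therefore stored once, under /p.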

What we index

For each page, we extract and store:

Title from the <title> tag (up to 512 characters). Pages without a title are not indexed.
Description from the meta description tag (up to 1024 characters).
Body text extracted after removing scripts, styles, navigation, footers, headers, forms, and other non-content elements.
Internal and external links for link graph analysis and further crawling.
Content hash for detecting duplicate pages across different URLs.
Language detected via trigram-based analysis (franc-min), stored as an ISO 639-1 code (e.g., "en", "de", "fr"). Used for language-filtered search results.
Entity extracted via NLP (compromise): the primary named entity (company, place, or person) the page is about. Used as a ranking signal and displayed on results.
Vector embedding (384-dimension) for semantic search capability.

We only index HTML pages that return HTTP 200. Non-HTML content types (PDFs, images, JSON, etc.) are skipped. Pages behind CDN bot protection (Cloudflare challenges, CAPTCHA walls) are also skipped, as we cannot access their real content.

Privacy

SolvedSeek does not track users. There are no cookies, no analytics scripts, no fingerprinting, and no IP address logging. The search engine has no advertising, so there is no incentive to build user profiles.

Search queries may be cached temporarily (configurable, typically 5 minutes) to improve performance. Cached queries are stored as SHA-256 hashes and are automatically purged when they expire. No query is ever linked to a user, session, or IP address.
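
A minimal in-memory sketch of such a cache is shown below. The class and function names are hypothetical, the query normalisation (lower-casing and collapsing whitespace) is an assumption, and the real cache lives in our own storage rather than a Python dictionary. The point is that only the hash of the query is used as the key, and nothing about the requester is stored.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300  # "typically 5 minutes"


def cache_key(query: str) -> str:
    """SHA-256 of the normalised query text (normalisation is assumed)."""
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class QueryCache:
    def __init__(self, ttl: float = CACHE_TTL_SECONDS):
        self.ttl = ttl
        self._entries = {}  # key -> (expires_at, results)

    def put(self, query: str, results, now: float = None) -> None:
        now = time.time() if now is None else now
        self._entries[cache_key(query)] = (now + self.ttl, results)

    def get(self, query: str, now: float = None):
        now = time.time() if now is None else now
        key = cache_key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, results = entry
        if now >= expires_at:
            del self._entries[key]  # purge expired entries on access
            return None
        return results
```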

Query embeddings (used for semantic search) are cached in the database to avoid regenerating them. These contain only the mathematical vector representation of the query text, with no user-identifying information.

For full details, see our Privacy Policy.

For webmasters

Getting your site indexed

The easiest way to get your site into SolvedSeek is to submit it directly. Enter your homepage URL on the submit page and we will crawl it within the next crawl cycle. You can also wait for natural discovery: SolvedSeekBot finds new sites through links on already-indexed pages. New domains are created with an initial 20-page allowance and can expand to 500 pages as trust is established.

Helping SolvedSeekBot crawl your site effectively

Provide a complete robots.txt with a Sitemap: directive pointing to your XML sitemap
Use descriptive <title> tags on every page (pages without titles are not indexed)
Include meta description tags (pages without them receive a ranking penalty)
Use <link rel="canonical"> tags to indicate preferred URLs
Use clean, semantic HTML with proper heading structure
Ensure your server responds within 15 seconds (our request timeout)

Removing your site from the index

Add a Disallow rule for SolvedSeekBot in your robots.txt (see example above). Alternatively, add <meta name="robots" content="noindex"> to individual pages you want excluded. Changes will take effect the next time the crawler visits your site.

Technical details

Search engine	PHP 8.2+ with MySQL FULLTEXT indexing
Crawler	Node.js with native fetch API
HTML parsing	cheerio (server-side DOM)
Embedding model	Xenova/all-MiniLM-L6-v2 (384 dimensions, runs locally via ONNX)
Language detection	franc-min (trigram-based, 82 languages, ~150KB)
Entity extraction	compromise (NLP, extracts organizations, places, people)
Database	MySQL 8.x / MariaDB with InnoDB
Concurrency	10 simultaneous page fetches, 1 page per domain per crawl tick, 3 crawl workers
Request timeout	15 seconds per page
robots.txt caching	24 hours (1 hour for server errors)
CDN detection	Cloudflare, Sucuri, DDoS-Guard, Akamai, StackPath, Imperva/Incapsula
Frameworks	None. Custom router, template engine, and database abstraction. No Laravel, Symfony, Express, or similar.

Crawl priority system

URLs in the crawl queue are processed by priority. Higher priority URLs are crawled first:

Priority	Source
10	Manually seeded URLs and new domain homepages
8	Links found on seed/homepage pages (top-level children)
5	URLs discovered from XML sitemaps
3	Links found on deeper pages
2	Links to already-known approved domains
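
The priority table maps naturally onto a heap-based queue. The sketch below uses the priorities from the table; the source labels, the class name, and the first-in-first-out tie-break within a priority level are assumptions for illustration.

```python
import heapq
import itertools

# Priorities from the table above; the keys are hypothetical labels.
PRIORITY = {
    "seed": 10,
    "homepage_child": 8,
    "sitemap": 5,
    "deep_link": 3,
    "known_domain_link": 2,
}


class CrawlQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps insertion order on ties

    def push(self, url: str, source: str) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-PRIORITY[source], next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

URLs pushed in any order come back highest priority first, so a manually seeded homepage is always fetched before a deep link discovered earlier.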

Questions?

Reach us at hello@solvedseek.com