SolvedSeek Documentation
Everything you need to know about SolvedSeek, our crawler, and how your site fits into our index.
What is SolvedSeek?
SolvedSeek is an independent web search engine. We operate our own crawler, build our own index, and rank results using our own algorithms. We do not license results from Google, Bing, or any other provider. Every page in our index was discovered and fetched by our crawler directly.
The search engine is designed to be transparent about how it works. This page explains our crawling, indexing, and ranking in full detail so that users and webmasters can understand exactly what happens when a query is submitted or a page is crawled.
Why we built this
The modern web search landscape is dominated by a small number of companies. Most "alternative" search engines are skins on top of Bing or Google APIs. They cannot control what gets indexed, how results are ranked, or what gets filtered. They are, at best, a different interface to someone else's index.
We wanted something different: a search engine that owns its entire stack. One where the ranking algorithm is not a black box optimised for ad revenue. One where privacy is not a marketing claim but a structural guarantee, because there is no tracking infrastructure to begin with.
SolvedSeek is not trying to index the entire web. We focus on building a curated, high-quality index where every domain has earned its place through content quality and trust signals, not through SEO manipulation or advertising spend.
How search works
When you type a query, the following happens:
- Cache check: We check if this exact query has been answered recently. Cached results are served instantly.
- FULLTEXT search: Your query is matched against page titles, descriptions, and body text using MySQL FULLTEXT indexing in natural language mode. Scores are LOG-damped (BM25-style) to prevent keyword-stuffed pages from dominating.
- Embedding candidates: In parallel, your query is converted to a vector embedding and compared against thousands of page embeddings to find semantically similar pages that keyword matching might miss.
- Merge & score: FULLTEXT and embedding candidates are merged and deduplicated. Each page receives a composite score from multiple ranking signals (detailed below).
- Twiddler adjustments: Admin-defined ranking rules can pin, promote, demote, block, or otherwise adjust specific results.
- Semantic re-ranking: Results are re-ranked by blending keyword scores with semantic similarity (50/50 weighting by default).
- Diversity & filtering: Results are capped at 2 per root domain (subdomains grouped together), language-filtered to match the query, and paginated.
- Results returned: The final ranked list is returned with titles, URLs, description snippets, and entity topic labels.
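The diversity step above can be sketched as follows. This is a minimal illustration rather than SolvedSeek's actual code: `root_domain` and `cap_per_domain` are hypothetical names, and the root-domain guess simply keeps the last two host labels, where a production system would more likely consult the Public Suffix List.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def root_domain(url: str) -> str:
    """Naive root-domain guess: keep the last two host labels.

    Assumption: this mis-groups hosts such as example.co.uk; a real
    implementation would use the Public Suffix List instead.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def cap_per_domain(ranked_urls, limit=2):
    """Keep at most `limit` results per root domain, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        domain = root_domain(url)
        if seen[domain] < limit:
            seen[domain] += 1
            kept.append(url)
    return kept


results = [
    "https://docs.example.com/a",
    "https://example.com/b",
    "https://blog.example.com/c",  # third hit for example.com, dropped
    "https://other.org/d",
]
print(cap_per_domain(results))
```

Because subdomains are grouped under their root domain, the third example.com result is dropped even though it comes from a different subdomain.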
Search operators
SolvedSeek supports search operators that give you fine-grained control over your results. You can combine multiple operators in a single query.
| Operator | Example | What it does |
|---|---|---|
| site: | site:github.com react | Only show results from a specific domain (includes subdomains). |
| "quotes" | "search engine" | Match the exact phrase. Only pages containing those words in that order will appear. |
| -word | python -snake | Exclude results containing a specific word. |
| intitle: | intitle:tutorial javascript | Only show results where the word appears in the page title. |
| inurl: | inurl:blog marketing | Only show results where the word appears in the URL path. |
| filetype: | filetype:pdf machine learning | Only show results with a specific file extension in the URL. |
| after: | after:2026-01-01 news | Only show pages crawled on or after a specific date (YYYY-MM-DD). |
| before: | before:2026-01-01 archive | Only show pages crawled on or before a specific date (YYYY-MM-DD). |
| trust: | trust:80 banking | Only show results from domains with a trust score at or above the specified value (0-100). |
| lang: | lang:de berlin | Override language detection. Show results in a specific language (ISO 639-1 code). |
Combining operators
You can use multiple operators together. For example:
- site:github.com intitle:readme "getting started" finds README pages on GitHub containing the phrase "getting started"
- python tutorial -django after:2026-01-01 finds recent Python tutorials that are not about Django
- trust:80 inurl:blog machine learning finds machine learning content on high-trust blogs
If you use an operator by itself (e.g., just site:wikipedia.org), results are ranked by page quality and domain trust instead of keyword relevance.
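To make the operator syntax concrete, here is a rough sketch of how a query string could be split into operators, exact phrases, exclusions, and plain terms. `parse_query` and its return shape are hypothetical; the actual parser is not described in this documentation and may treat edge cases differently.

```python
import re

# Operator names taken from the table above.
OPERATORS = {"site", "intitle", "inurl", "filetype", "after", "before", "trust", "lang"}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query: str) -> dict:
    """Split a raw query into operators, phrases, excluded words, and terms."""
    parsed = {"operators": {}, "phrases": [], "excluded": [], "terms": []}
    for phrase, word in TOKEN.findall(query):
        if phrase:
            parsed["phrases"].append(phrase)
            continue
        if not word:  # an empty pair of quotes
            continue
        name, sep, value = word.partition(":")
        if sep and value and name.lower() in OPERATORS:
            parsed["operators"][name.lower()] = value
        elif word.startswith("-") and len(word) > 1:
            parsed["excluded"].append(word[1:])
        else:
            parsed["terms"].append(word)
    return parsed


print(parse_query('site:github.com intitle:readme "getting started"'))
print(parse_query("python tutorial -django after:2026-01-01"))
```

A query that yields operators but no terms or phrases corresponds to the operator-only case described above, where ranking falls back to page quality and domain trust.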
Ranking signals
The composite score for each page is built from these signals. All FULLTEXT scores are LOG-damped (similar to BM25 saturation) — this means repeating a keyword 100 times on a page gives diminishing returns, preventing keyword-stuffed spam from dominating results:
| Signal | How it works |
|---|---|
| Title relevance | The strongest signal. A dedicated FULLTEXT search on the title only, LOG-damped (BM25-style) so keyword stuffing has diminishing returns. Pages about your query outrank pages that merely mention it. |
| Body relevance | FULLTEXT score across title, description, and body text combined. Also LOG-damped to prevent spam pages from gaming the score through term repetition. |
| Title keyword match | Pages where the query appears directly in the title (LIKE match) receive a significant bonus on top of the FULLTEXT score. |
| URL keyword match | Pages where the query appears in the URL receive an additional bonus (e.g., searching "react" boosts pages at /react-tutorial). |
| Entity topic match | Each page has an NLP-extracted entity (company, place, or person). If the page's entity matches your query, it receives a relevance boost. |
| Content quality | A heavily-weighted quality score (0.0 to 1.0) based on body text length, heading structure, paragraph count, meta description, link ratios, and spam detection. Pages below 0.15 are excluded entirely. |
| Domain trust | Higher-trust domains receive a boost proportional to their score (0 to 100). Domains with zero trust receive a heavy penalty (-30 points). |
| Freshness | Pages crawled within the last 7 days receive a small freshness bonus. |
| Description penalty | Pages without a meta description (or with one shorter than 20 characters) receive a score penalty. |
| Language filter | Pages are tagged with their detected language using trigram analysis. English queries only return English pages — non-matching pages are excluded from results entirely. |
On top of these base signals, admin-defined twiddlers can apply additional adjustments. Twiddlers support 10 ranking functions: pin, promote, demote, block, freshness boost, authority cap, domain diversity, description requirement, title match bonus, and URL depth penalty. These can be applied globally or targeted at specific domains, URLs, or query patterns.
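As a rough illustration of how these signals might combine, the sketch below applies LOG-damping, the 0.15 quality floor, the trust-zero penalty, the 7-day freshness window, and the 20-character description threshold. Only those thresholds and the -30 penalty come from this page; every weight, along with the names `damp` and `composite_score`, is a made-up assumption.

```python
import math


def damp(raw_score: float) -> float:
    """LOG-damp a raw FULLTEXT score so repeated terms give diminishing returns."""
    return math.log1p(max(raw_score, 0.0))


def composite_score(title_ft, body_ft, quality, trust, days_since_crawl, description):
    """Return a composite score, or None if the page is below the quality floor.

    All weights here are hypothetical; only the thresholds (0.15, 7 days,
    20 characters) and the -30 trust-zero penalty come from the documentation.
    """
    if quality < 0.15:
        return None  # excluded entirely
    score = 3.0 * damp(title_ft) + 1.0 * damp(body_ft)  # title weighted highest
    score += 20.0 * quality                              # content quality
    score += -30.0 if trust == 0 else 0.1 * trust        # domain trust
    if days_since_crawl <= 7:
        score += 1.0                                     # small freshness bonus
    if len(description.strip()) < 20:
        score -= 5.0                                     # description penalty
    return score


# Damping: a raw score 100 times larger yields far less than 100 times the credit.
print(damp(1.0), damp(100.0))
print(composite_score(2.0, 5.0, 0.8, 50, 3, "A long enough meta description."))
```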
Semantic search
SolvedSeek uses hybrid semantic retrieval that combines traditional keyword matching with AI-powered semantic understanding. Every crawled page has a vector embedding generated using a local machine learning model (all-MiniLM-L6-v2, producing 384-dimension vectors).
At search time, two retrieval paths run in parallel:
- FULLTEXT candidates: Traditional keyword matching via MySQL FULLTEXT indexes finds pages containing your search terms.
- Embedding candidates: Your query is converted to a vector embedding and compared against thousands of page embeddings using cosine similarity. Pages above a 0.3 similarity threshold are included as candidates.
Both candidate sets are merged and deduplicated. The final score blends the keyword-based composite score with the semantic similarity score using a 50/50 weighting. This means SolvedSeek can understand that a search for "how to cook pasta" is related to a page about "Italian noodle recipes" even if the exact words do not match.
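The candidate selection and blending described above can be sketched in a few lines. This example uses toy 3-dimension vectors instead of 384-dimension embeddings, and it assumes the keyword score has already been normalised to the 0 to 1 range so that a 50/50 blend is meaningful; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_candidates(query_vec, page_vecs, threshold=0.3):
    """Return {url: similarity} for pages above the similarity threshold."""
    scored = ((url, cosine_similarity(query_vec, vec)) for url, vec in page_vecs.items())
    return {url: sim for url, sim in scored if sim > threshold}


def blend(keyword_score, similarity, weight=0.5):
    """Blend a normalised keyword score with semantic similarity."""
    return weight * keyword_score + (1.0 - weight) * similarity


pages = {
    "https://example.com/pasta": [1.0, 0.0, 0.0],
    "https://example.com/noodles": [1.0, 1.0, 0.0],
    "https://example.com/cars": [0.0, 1.0, 0.0],
}
candidates = embedding_candidates([1.0, 0.0, 0.0], pages)
print(sorted(candidates))
print(blend(0.8, 0.4))
```

The "cars" page has a similarity of 0 to the query vector and is filtered out, while the "noodles" page (similarity about 0.71) survives even though it shares no exact keywords in this toy setup.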
All embedding generation runs locally on our own hardware. No data is sent to external AI services.
Trust scoring
Every domain in our index has a trust score from 0 to 100. Trust is not self-reported or purchased. It is calculated based on:
- Inbound link quality (who links to this domain, and how trusted are they?)
- Content consistency across crawled pages
- Crawl history (how long the domain has been in the index, error rates)
- Human review and manual adjustments by administrators
Trust determines how deeply we index a domain. New domains start with an initial page allowance of 20 pages. As a domain proves its quality over time, this limit can be expanded up to 500 pages. Domains with very low trust may be blocked from the index entirely.
Trust-zero penalty: Domains with a trust score of 0 receive a heavy ranking penalty (-30 points), keeping unknown or untrusted domains from appearing prominently in results. Combined with the quality floor (pages must score at least 0.15 to appear at all), this creates a strong baseline filter against spam and low-quality content.
Our crawler
- User agent: SolvedSeekBot/1.0
- Full string: SolvedSeekBot/1.0 (+https://solvedseek.com/docs)
- Respects: robots.txt, meta robots, X-Robots-Tag, canonical URLs, crawl-delay
- Concurrency: 10 URLs per tick (parallel), one page per domain per tick, 3 crawl workers
- Timeout: 15 seconds per page request
SolvedSeekBot is a Node.js-based crawler that discovers pages through link following, XML sitemaps, and manually seeded URLs. It uses a priority-based queue: seed URLs get the highest priority, followed by homepage children, sitemap URLs, and then deeper links.
The crawler processes pages in a breadth-first pattern, prioritising homepage and top-level pages before diving deeper into a site. New external domains discovered through links are automatically added to the index with an initial 20-page allowance.
Each crawled page goes through the following pipeline:
- Check domain page limits and robots.txt permissions
- Fetch the page via HTTPS (with HTTP fallback)
- Detect CDN challenge/bot protection pages (Cloudflare, Sucuri, Akamai, etc.) and skip them
- Check X-Robots-Tag headers and meta robots tags for noindex/nofollow directives
- Skip pages with empty titles (low-quality/non-content pages)
- Resolve canonical URLs to avoid duplicate content
- Compute content hash for deduplication
- Calculate quality score (including gambling/pharma spam detection) and store the page in the index
- Detect page language using trigram analysis (franc-min) and store the ISO 639-1 code
- Extract the primary named entity (company, place, person) using NLP (compromise)
- Generate a vector embedding for semantic search (non-blocking)
- Discover and queue internal links, external links, and sitemap URLs
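The content-hash step in the pipeline above could look something like the following sketch. The choice of SHA-256 and of whitespace normalisation are assumptions made for illustration; this page does not say which hash or normalisation the crawler uses.

```python
import hashlib


def content_hash(body_text: str) -> str:
    """Hash the extracted body text after collapsing runs of whitespace.

    Assumption: SHA-256 and whitespace collapsing are illustrative choices.
    """
    normalised = " ".join(body_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def is_duplicate(body_text: str, seen_hashes: set) -> bool:
    """Return True if this content was seen before; otherwise record it."""
    digest = content_hash(body_text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


seen = set()
print(is_duplicate("Hello   world\n", seen))  # first sighting
print(is_duplicate("Hello world", seen))      # same text, different whitespace
```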
Robots.txt and meta tags
SolvedSeekBot fully respects robots.txt. We support Disallow, Allow, and Crawl-Delay directives. When both Allow and Disallow match a path, the most specific (longest) rule wins. If they are the same length, Allow takes precedence.
We check for our specific user agent first (SolvedSeekBot), then fall back to the wildcard (*) rules.
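The precedence rules above (longest match wins, Allow wins ties, bot-specific group before the wildcard) can be sketched as below. This is a simplified illustration with hypothetical function names: it uses plain prefix matching and does not handle `*` or `$` wildcards inside rule paths.

```python
def select_rules(groups: dict, agent: str = "solvedseekbot") -> list:
    """Pick the bot-specific rule group if present, else the wildcard group.

    `groups` maps a lowercased user-agent token to a list of
    (directive, path_prefix) tuples.
    """
    return groups.get(agent, groups.get("*", []))


def is_allowed(path: str, rules: list) -> bool:
    """Apply longest-match precedence; Allow wins when lengths are equal."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


groups = {
    "*": [("disallow", "/")],
    "solvedseekbot": [("disallow", "/private/"), ("allow", "/private/public/")],
}
rules = select_rules(groups)
print(is_allowed("/private/public/page.html", rules))
print(is_allowed("/private/secret.html", rules))
```

In this example the wildcard group would block everything, but it is never consulted because a SolvedSeekBot-specific group exists.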
We also respect:
- <meta name="robots" content="noindex">: Page will not be stored in the index
- <meta name="robots" content="nofollow">: Links on the page will not be followed
- <meta name="solvedseekbot" content="noindex">: Bot-specific directive
- X-Robots-Tag: noindex: HTTP header directive
- none: Equivalent to noindex, nofollow
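One way to reduce these directives to two flags is sketched below. It assumes the content values of the meta tags and the X-Robots-Tag header have already been extracted as strings; `parse_robots_directives` is a hypothetical helper, not part of the crawler's documented interface.

```python
def parse_robots_directives(*values: str) -> dict:
    """Merge directive strings into noindex/nofollow flags.

    Each value is the content of a robots meta tag or an X-Robots-Tag
    header, for example "noindex, nofollow" or "none".
    """
    tokens = set()
    for value in values:
        tokens.update(t.strip().lower() for t in value.split(",") if t.strip())
    is_none = "none" in tokens  # "none" is shorthand for noindex, nofollow
    return {
        "noindex": is_none or "noindex" in tokens,
        "nofollow": is_none or "nofollow" in tokens,
    }


print(parse_robots_directives("noindex"))
print(parse_robots_directives("index, follow", "NOFOLLOW"))
print(parse_robots_directives("none"))
```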
To block SolvedSeekBot from your site entirely, add this to your robots.txt:
    User-agent: SolvedSeekBot
    Disallow: /
XML Sitemaps
SolvedSeekBot reads Sitemap: directives from your robots.txt file. When crawling a domain for the first time (via a seed URL), the crawler checks for declared sitemaps and queues URLs found in them.
We support both standard sitemap XML files and sitemap index files. For sitemap indexes, we process the first child sitemap. Each sitemap is limited to 200 URLs to keep the queue manageable.
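The limits described above can be illustrated with Python's standard XML parser. This is a sketch under assumptions: `parse_sitemap` is a hypothetical name, and the function only extracts URLs from already-downloaded XML text; fetching, decompression, and error handling are left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 200  # per-sitemap limit stated in the documentation


def parse_sitemap(xml_text: str):
    """Return (kind, locations) for a sitemap or sitemap index document.

    For an index, only the first child sitemap URL is returned; for a
    regular sitemap, at most MAX_URLS page URLs are returned.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, (locs[:1] if kind == "index" else locs[:MAX_URLS])


URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

print(parse_sitemap(URLSET))
print(parse_sitemap(INDEX))
```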
Sitemap URLs are queued at a medium priority, below homepage children but above deep links discovered through crawling.
Canonical URLs
SolvedSeekBot respects the <link rel="canonical"> tag. If a page declares a canonical URL that differs from the fetched URL but points to the same domain, we store the page under the canonical URL instead. This prevents duplicate entries for pages with query parameters, session IDs, or tracking parameters.
If the canonical URL points to a different domain, we skip indexing the page (it is a cross-domain canonical, indicating the content belongs to another site), but we still follow links on the page to discover new content.
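The canonical handling above amounts to a three-way decision, sketched here. The function name is hypothetical, and "same domain" is interpreted as an exact hostname match, which is an assumption; the crawler may instead group subdomains.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_canonical(fetched_url: str, canonical_href: Optional[str]) -> Optional[str]:
    """Return the URL to store the page under, or None to skip indexing.

    Links on the page are still followed in every case, including when
    None is returned for a cross-domain canonical.
    """
    if not canonical_href:
        return fetched_url  # no canonical declared
    canonical = urljoin(fetched_url, canonical_href)  # resolve relative hrefs
    if urlsplit(canonical).hostname != urlsplit(fetched_url).hostname:
        return None  # cross-domain canonical: content belongs to another site
    return canonical


print(resolve_canonical("https://example.com/post?utm_source=x", "/post"))
print(resolve_canonical("https://example.com/post", "https://other.org/post"))
```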
What we index
For each page, we extract and store:
- Title from the <title> tag (up to 512 characters). Pages without a title are not indexed.
- Description from the meta description tag (up to 1024 characters).
- Body text extracted after removing scripts, styles, navigation, footers, headers, forms, and other non-content elements.
- Internal and external links for link graph analysis and further crawling.
- Content hash for detecting duplicate pages across different URLs.
- Language detected via trigram-based analysis (franc-min), stored as ISO 639-1 code (e.g., "en", "de", "fr"). Used for language-filtered search results.
- Entity extracted via NLP (compromise) — the primary named entity (company, place, or person) the page is about. Used as a ranking signal and displayed on results.
- Vector embedding (384-dimension) for semantic search capability.
We only index HTML pages that return HTTP 200. Non-HTML content types (PDFs, images, JSON, etc.) are skipped. Pages behind CDN bot protection (Cloudflare challenges, CAPTCHA walls) are also skipped, as we cannot access their real content.
Privacy
SolvedSeek does not track users. There are no cookies, no analytics scripts, no fingerprinting, and no IP address logging. The search engine has no advertising, so there is no incentive to build user profiles.
Search queries may be cached temporarily (configurable, typically 5 minutes) to improve performance. Cached queries are stored as SHA-256 hashes and are automatically purged when they expire. No query is ever linked to a user, session, or IP address.
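A cache with these properties can be sketched as follows: entries are keyed by the SHA-256 hash of the query text, carry an expiry time, and hold nothing that identifies a user. The `QueryCache` class is a hypothetical in-memory illustration; the real cache's storage and eviction details are not documented here.

```python
import hashlib
import time


class QueryCache:
    """In-memory result cache keyed by the SHA-256 hash of the query text."""

    def __init__(self, ttl_seconds: int = 300):  # 5 minutes, the typical setting
        self.ttl = ttl_seconds
        self._entries = {}  # hash -> (expires_at, results)

    @staticmethod
    def key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def put(self, query, results, now=None):
        now = time.time() if now is None else now
        self._entries[self.key(query)] = (now + self.ttl, results)

    def get(self, query, now=None):
        """Return cached results, or None if missing or expired."""
        now = time.time() if now is None else now
        digest = self.key(query)
        entry = self._entries.get(digest)
        if entry is None:
            return None
        expires_at, results = entry
        if now >= expires_at:
            del self._entries[digest]  # purge on expiry
            return None
        return results


cache = QueryCache()
cache.put("how to cook pasta", ["https://example.com/pasta"], now=0)
print(cache.get("how to cook pasta", now=299))
print(cache.get("how to cook pasta", now=300))
```

The `now` parameter exists only to make the expiry behaviour easy to demonstrate; in normal use it would be omitted.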
Query embeddings (used for semantic search) are cached in the database to avoid regenerating them. These contain only the mathematical vector representation of the query text, with no user-identifying information.
For full details, see our Privacy Policy.
For webmasters
Getting your site indexed
The easiest way to get your site into SolvedSeek is to submit it directly. Enter your homepage URL on the submit page and we will crawl it within the next crawl cycle. You can also wait for natural discovery: SolvedSeekBot finds new sites through links on already-indexed pages. New domains are created with an initial 20-page allowance and can expand to 500 pages as trust is established.
Helping SolvedSeekBot crawl your site effectively
- Provide a complete robots.txt with a Sitemap: directive pointing to your XML sitemap
- Use descriptive <title> tags on every page (pages without titles are not indexed)
- Include meta description tags (pages without them receive a ranking penalty)
- Use <link rel="canonical"> tags to indicate preferred URLs
- Use clean, semantic HTML with proper heading structure
- Ensure your server responds within 15 seconds (our request timeout)
Removing your site from the index
Add a Disallow rule for SolvedSeekBot in your robots.txt (see example above). Alternatively, add <meta name="robots" content="noindex"> to individual pages you want excluded. Changes will take effect the next time the crawler visits your site.
Technical details
| Component | Details |
|---|---|
| Search engine | PHP 8.2+ with MySQL FULLTEXT indexing |
| Crawler | Node.js with native fetch API |
| HTML parsing | cheerio (server-side DOM) |
| Embedding model | Xenova/all-MiniLM-L6-v2 (384 dimensions, runs locally via ONNX) |
| Language detection | franc-min (trigram-based, 82 languages, ~150KB) |
| Entity extraction | compromise (NLP, extracts organizations, places, people) |
| Database | MySQL 8.x / MariaDB with InnoDB |
| Concurrency | 10 simultaneous page fetches, 1 page per domain per crawl tick, 3 crawl workers |
| Request timeout | 15 seconds per page |
| robots.txt caching | 24 hours (1 hour for server errors) |
| CDN detection | Cloudflare, Sucuri, DDoS-Guard, Akamai, StackPath, Imperva/Incapsula |
| Frameworks | None. Custom router, template engine, and database abstraction. No Laravel, Symfony, Express, or similar. |
Crawl priority system
URLs in the crawl queue are processed by priority. Higher priority URLs are crawled first:
| Priority | Source |
|---|---|
| 10 | Manually seeded URLs and new domain homepages |
| 8 | Links found on seed/homepage pages (top-level children) |
| 5 | URLs discovered from XML sitemaps |
| 3 | Links found on deeper pages |
| 2 | Links to already-known approved domains |
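The priority table, combined with the limits of 10 URLs per tick and one page per domain per tick, suggests a queue along these lines. This is a sketch, not the crawler's implementation: `CrawlQueue` is a hypothetical class, ties are broken by insertion order as an assumption, and the real queue presumably lives in the database rather than in memory.

```python
import heapq
from itertools import count
from urllib.parse import urlsplit


class CrawlQueue:
    """Priority queue that yields at most one URL per domain per tick."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: first queued, first crawled

    def push(self, url: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))

    def next_tick(self, batch_size: int = 10) -> list:
        batch, deferred, domains = [], [], set()
        while self._heap and len(batch) < batch_size:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).hostname
            if host in domains:
                deferred.append(item)  # domain already served this tick
            else:
                domains.add(host)
                batch.append(item[2])
        for item in deferred:
            heapq.heappush(self._heap, item)  # retry on a later tick
        return batch


queue = CrawlQueue()
queue.push("https://a.com/", 10)        # seeded homepage
queue.push("https://a.com/about", 8)    # homepage child
queue.push("https://b.org/sitemap-page", 5)
queue.push("https://a.com/deep/page", 3)
print(queue.next_tick())
print(queue.next_tick())
```

The first tick takes the a.com homepage and the b.org page; the remaining a.com URLs wait for later ticks, one per tick, in priority order.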
Questions?
Reach us at hello@solvedseek.com