Crawl Crawler is a JSON HTTP API for data-hungry private and corporate text projects.
Crawl Crawler is also plain ol' non-tracking, keyword-based web search, with results marked up in ad-free, non-dynamic HTML and served without cookies.
Crawl Crawler gives you the ability to search four major sources of data, plus your own: the Common Crawl meta-data, text, and HTML repositories, as well as the WWW itself.
There are three types of collections that you may interact with.
The first collection, "cc_wat", is public and read-only. It is maintained by Crawl Crawler and is a product of analyzing Common Crawl's WAT meta-data repository.
Documents in this collection contain the queryable fields "title" and "description".
The second collection, "cc_wet", is also public and read-only.
Complete text extracts from Common Crawl's WET repository, or from the WWW, are added to it when you click "Enrich" on a search result page, and are refreshed, i.e. updated with more current data, when you click "Refresh".
Documents in the cc_wet collection contain the queryable field "description".
The third type comprises the collections you create when you click "Save as" on a search result page.
Anyone who knows the name of such a collection can query it, append to it, and refresh it.
Hacking the URL
Specify one or more keywords in the "q" query string parameter.
Use one or more "collection" query string parameters to direct your queries towards one or more collections.
Use one or more "field" parameters to direct your queries towards one or more fields.
Use one or more "select" parameters to define which document fields to include in your search result.
Replace "OR=OR" query string entry with "AND=AND" for stricter interpretation of your query.
Page through results by using the "skip" and "take" parameters.
HTTP GET /query/?field=title&field=description&q=embellished+sheath&OR=OR&skip=0&take=100&collection=cc_wat&select=title
Accept: application/json
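Building such a URL by hand gets error-prone once parameters repeat. The steps above can be sketched with Python's standard library; the host name here is an assumption, not the actual Crawl Crawler address:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- substitute the real Crawl Crawler host.
BASE = "https://example.com/query/"

# Repeatable parameters ("field", "collection") are expressed as a
# list of (name, value) pairs rather than a dict.
params = [
    ("field", "title"),
    ("field", "description"),
    ("q", "embellished sheath"),
    ("OR", "OR"),          # swap for ("AND", "AND") for stricter matching
    ("skip", 0),           # paging: offset into the result set
    ("take", 100),         # paging: page size
    ("collection", "cc_wat"),
    ("select", "title"),   # fields to include in the result
]

url = BASE + "?" + urlencode(params)
print(url)
```

The same pair list works for any combination of collections and fields; `urlencode` also takes care of escaping the space in the query keywords.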
JSON Query with HTTP POST
Define "skip", "take" and "select" parameters in the query string.
Include your JSON query in the body of the request.
HTTP POST /query/?skip=0&take=100&select=title
"description":"prom dress wedding"
Each JSON object may contain at most one top-level "and", "or", or "not" field.
If child terms target the same collection as their parent, you need to specify the collection only once.
There is no limit to the nesting depth other than one you set for yourself.
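As a sketch of the rules above, a nested query body can be assembled in Python; the exact body shape, the "and"/"not" nesting, and the field names are assumptions inferred from the description, not documented guarantees:

```python
import json

# Hypothetical nested query: match "prom dress" but not "wedding".
# Each object holds at most one top-level "and"/"or"/"not" field,
# and child terms inherit the parent's collection, so "collection"
# appears only once.
query = {
    "collection": "cc_wet",
    "and": {
        "description": "prom dress",
        "not": {
            "description": "wedding",
        },
    },
}

body = json.dumps(query)
print(body)
```

This body would then be POSTed to /query/ with "skip", "take", and "select" supplied in the query string, as shown above.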
Insert, append, update
You may create new collections and query, append to, and update any public read-write collection you know by name.
HTTP POST Content-Type:application/json
The insert, append and update HTTP APIs are unstable at this time.
As soon as they become reliable they will be documented here.
Crawl Crawler will not show you ads nor place cookies on your device.
Crawl Crawler will not track you by any means.
For security purposes, and for a limited time, Crawl Crawler keeps a record of your IP address.
No data privacy
No document collection hosted by Crawl Crawler is private; all collections are public.
Free as in freedom
Crawl Crawler is free of any affiliation with any dominant, or submissive (for that matter), party.
Not that there would be anything wrong with being either; we're just not.
Crawl Crawler is built exclusively on OSS that you are free to run on your premises.
You are free to use the Crawl Crawler GUI to query for data.
You are equally free to use the Crawl Crawler HTTP API to query for data.
Not free as in beer
This service is mostly free as in beer, but after the BETA period certain tasks,
such as recurringly refreshing a document collection,
or a slice of it, with textual content from the WWW,
will incur a small, quite reasonable fee.
We build Resin BETA,
an open-source and extensible search engine.
This web-based service was created by e-commerce solutions architect
and search handy-man Marcus Lager from Helsingborg, Sweden.