Crawl Crawler BETA

About

Crawl Crawler is a JSON HTTP API for private and corporate data-hungry text projects. It is also plain ol' non-tracking, keyword-based web search, with results marked up in ad-free, non-dynamic HTML and served without cookies.

Crawl Crawler lets you search four grand sources of data, plus your own: the Common Crawl meta-data, text, and HTML repositories, as well as the WWW itself.

Instructions

There are three (types of) collections here that you may interact with.

First collection

The first collection is public read-only and is called "cc_wat". It is maintained by Crawl Crawler and it is a product of analyzing Common Crawl's WAT meta-data repository.

Documents in this collection contain the queryable fields "title", "description", and "url".

Second collection

The second collection is also public read-only and is called "cc_wet". Complete text extracts from Common Crawl's WET repo or from the WWW are added to it when you click "Enrich" on a search result page in your favorite browser. They are refreshed, i.e. updated with more current data, when you click "Refresh".

Documents in the cc_wet collection contain the queryable fields "description" and "url".

Third collection

The third collection type consists of collections you create when you click "Save as" on a search result page. Anyone who knows the name of such a collection can query it, append to it and refresh it.

Querying

Hacking the URL

Specify one or more keywords in the "q" query string parameter.

Use one or more "collection" query string parameters to direct your queries towards one or more collections.

Use one or more "field" parameters to direct your queries towards one or more fields.

Use one or more "select" parameters to define which document fields to include in your search result.

Replace the "OR=OR" query string entry with "AND=AND" for a stricter interpretation of your query.

Page through results by using the "skip" and "take" parameters.
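
Paging can be sketched as a series of "skip"/"take" pairs. The snippet below assumes that "skip" is a document offset and "take" a page size, which the text above suggests but does not state outright. The function name page_windows is hypothetical.

```python
def page_windows(total, take=100):
    """Yield skip/take pairs that cover `total` results, one page at a time.

    Assumption: "skip" counts documents to pass over, not pages.
    """
    if take <= 0:
        raise ValueError("take must be positive")
    for skip in range(0, total, take):
        yield {"skip": skip, "take": take}


pages = list(page_windows(250, take=100))
print(pages)
```

Each dictionary can be merged into the query string of a request to fetch the next page.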

HTTP GET Accept:application/json

HTTPS GET /query/?field=title&field=description&q=embellished+sheath&OR=OR&skip=0&take=100&collection=cc_wat&select=title
Accept: application/json
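
The same query string can be assembled in code rather than by hand. This is a minimal sketch using only Python's standard library. BASE_URL is a placeholder, since this page does not state the service's host name, and build_query_url is a hypothetical helper, not part of Crawl Crawler.

```python
from urllib.parse import urlencode

# Placeholder host: the service's real base URL is not given in this text.
BASE_URL = "https://crawlcrawler.example"


def build_query_url(keywords, collections, fields, select,
                    operator="OR", skip=0, take=100):
    """Build a GET /query/ URL with repeated field, collection and select entries."""
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")
    params = [("field", f) for f in fields]
    params.append(("q", " ".join(keywords)))  # spaces are encoded as "+"
    params.append((operator, operator))       # "OR=OR" or "AND=AND"
    params.append(("skip", skip))
    params.append(("take", take))
    params += [("collection", c) for c in collections]
    params += [("select", s) for s in select]
    return BASE_URL + "/query/?" + urlencode(params)


url = build_query_url(
    keywords=["embellished", "sheath"],
    collections=["cc_wat"],
    fields=["title", "description"],
    select=["title"],
)
print(url)
```

Passing operator="AND" swaps the "OR=OR" entry for "AND=AND". Remember to send the "Accept: application/json" header with the request to get JSON back.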

JSON Query with HTTP POST

Define "skip", "take" and "select" parameters in the query string. Include your JSON query in the body of the request.

HTTP POST /query/?skip=0&take=100&select=title
Content-Type: application/json
Accept: application/json

    {
        "and":{
            "collection":"cc_wat",
            "host":"myfashion.com"
        },
        "or":{
            "collection":"cc_wet",
            "description":"prom dress wedding"
        },
        "not":{
            "collection":"cc_wat",
            "path":"kids teens"
        }
    }

Each JSON object can have at most one top-level "and", one "or" and one "not" field.
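
The POST request above could be prepared as follows. This is a sketch under assumptions: BASE_URL is a placeholder because the host is not named here, and build_post_request is a hypothetical helper. The request is only constructed, not sent. A Python dict cannot hold duplicate keys, which conveniently mirrors the one-per-object rule for "and", "or" and "not".

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder host: the service's real base URL is not given in this text.
BASE_URL = "https://crawlcrawler.example"


def build_post_request(query, select, skip=0, take=100):
    """Prepare a POST /query/ request with the JSON query in the body."""
    params = [("skip", skip), ("take", take)]
    params += [("select", s) for s in select]
    return Request(
        BASE_URL + "/query/?" + urlencode(params),
        data=json.dumps(query).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


query = {
    "and": {"collection": "cc_wat", "host": "myfashion.com"},
    "or": {"collection": "cc_wet", "description": "prom dress wedding"},
    "not": {"collection": "cc_wat", "path": "kids teens"},
}
request = build_post_request(query, select=["title"])
print(request.get_method(), request.full_url)
```

Calling urllib.request.urlopen(request) would send it. That step is left out because the endpoint address here is made up.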

If child terms target the same collection as their parent, then you need to specify collection only once. There is no limit to the nesting depth other than one you set for yourself:

    {
        "or":{
            "collection":"cc_wet",
            "description":"prom",
            "or":{
                "description":"dress",
                "or":{
                    "description":"wedding"
                }
            }
        }
    }
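
A nested query like the one above can be generated from a list of terms. The helper below is a hypothetical sketch, not part of the service. It names the collection once, on the outermost clause, and lets the child clauses inherit it.

```python
def nest_or_terms(collection, field, terms):
    """Build nested "or" clauses, one per term, naming the collection only once."""
    if not terms:
        raise ValueError("at least one term is required")
    node = None
    # Work from the innermost term outwards so each clause wraps the next one.
    for term in reversed(terms):
        clause = {field: term}
        if node is not None:
            clause["or"] = node
        node = clause
    # Only the outermost clause carries the collection name.
    return {"or": {"collection": collection, **node}}


nested = nest_or_terms("cc_wet", "description", ["prom", "dress", "wedding"])
print(nested)
```

The result has the same structure as the hand-written example and can be serialized with json.dumps for the request body.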
            

Insert, append, update

You may create new collections and query, append to and update any public read-write enabled collection you know by name.

HTTP POST Content-Type:application/json

The insert, append and update HTTP APIs are unstable at this time. As soon as they become reliable they will be documented here.

User privacy

Crawl Crawler will not show you ads or place cookies on your device. Crawl Crawler will not track you by any means. For security purposes, Crawl Crawler will keep a record of your IP address for a limited time.

No data privacy

No document collection hosted by Crawl Crawler is private; all collections are public.

Free as in freedom

Crawl Crawler is free of any affiliation with any dominant, or submissive (for that matter), search player. Not that there would be anything wrong with being so. We're just not.

Crawl Crawler is built exclusively on OSS that you are free to run on your premises.

You are free to use the Crawl Crawler GUI to query for data. You are equally free to use the Crawl Crawler HTTP API to query for data.

Not free as in beer

This service is mostly free as in beer. After the BETA period, however, certain tasks, such as periodically refreshing a document collection, or a slice of it, with textual content from the WWW, will involve a small, quite reasonable fee.

Technology

We build Resin BETA, an open-source and extensible search engine.

Created by

This web-based service was created by e-commerce solutions architect and search handyman Marcus Lager from Helsingborg, Sweden.