Search

The classes in this module implement search engines over course datasets such as edX, HarvardX, and Udacity.

Visit the interactive demo to learn how to use these classes.

Usage

First, pip install searchbetter.

Then, in your Python code:

from searchbetter import search
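As a rough sketch of how the pieces documented below might fit together, the helper here builds an edX engine and runs one query. It relies only on the constructor and search() signatures listed on this page. The name run_edx_query is a hypothetical helper, not part of searchbetter, and any paths you pass in are your own.

```python
def run_edx_query(dataset_path, index_path, term, create=False):
    """Build (or load) an edX search engine and run a single query.

    run_edx_query is a hypothetical helper, not part of searchbetter.
    """
    # Imported lazily so that defining this helper does not by itself
    # require searchbetter to be installed.
    from searchbetter import search

    # create=True rebuilds the index from scratch;
    # create=False loads the index already stored at index_path.
    engine = search.EdXSearchEngine(dataset_path, index_path, create=create)

    # search() takes a plain-English query and returns the results.
    return engine.search(term)
```

On a first run you would presumably pass create=True so the index gets built, and leave it at the default afterwards.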

Documentation

class search.EdXSearchEngine(dataset_path, index_path, create=False)

edX

__init__(dataset_path, index_path, create=False)

Creates a new search engine that searches over edX courses.

Parameters:
dataset_path (string) – the path to the edX course listings file.
index_path (string) – the path to a folder where you’d like to store the search engine index. The given folder doesn’t have to exist, but its parent folder does.
create (bool) – If True, recreates the index from scratch. If False, loads the existing index.

count_words(): Returns the number of words in the underlying edX dataset.

create_index()

Creates a new index to search the dataset. You only need to call this once; once the index is created, you can just load it again instead of creating it afresh all the time.

Returns the index object.
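The "create once, then load" advice can be sketched as a small check that decides what to pass as create. This is only an illustration under one assumption: that a non-empty index folder means an index was built earlier. Both index_exists and choose_create_flag are hypothetical helpers, not part of this module.

```python
import os


def index_exists(index_path):
    """Return True if index_path is a folder that already contains files.

    Assumption: a non-empty folder means an index was built there before.
    """
    return os.path.isdir(index_path) and len(os.listdir(index_path)) > 0


def choose_create_flag(index_path):
    """Return the value to pass as `create`: True only if no index is on disk."""
    return not index_exists(index_path)
```

You could then construct an engine with create=choose_create_flag(index_path) so that the index is only rebuilt when nothing is there to load.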

class search.HarvardXSearchEngine(dataset_path, index_path, create=False)

HX

__init__(dataset_path, index_path, create=False)

Creates a new HarvardX search engine. Searches over the HarvardX/DART database of all courses and course materials used in HarvardX. This includes videos, quizzes, etc.

TODO: consider renaming this class to DART.

Parameters:
dataset_path (string) – the path to the HarvardX course catalog CSV file.
index_path (string) – the path to a folder where you’d like to store the search engine index. The given folder doesn’t have to exist, but its parent folder does.
create (bool) – If True, recreates the index from scratch. If False, loads the existing index.

create_index()

Creates a new index to search the dataset. You only need to call this once; once the index is created, you can just load it again instead of creating it afresh all the time.

Returns the index object.

class search.PrebuiltSearchEngine(search_fields, index_path): A search engine designed for when you’re just given a model file and can use that directly without having to build anything.

class search.Result(dict_data, score)

Encodes a search result. Basically a wrapper around a result dict and its relevance score (higher is better).

get_dict(): Get the underlying dict data
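To make the "wrapper around a dict and a score" idea concrete, here is a minimal stand-in. It is not the library's implementation: the attribute names dict_data and score and the rank helper are assumptions made for illustration.

```python
class Result:
    """Stand-in mirroring the documented Result(dict_data, score) wrapper."""

    def __init__(self, dict_data, score):
        self.dict_data = dict_data  # the raw result fields
        self.score = score          # relevance score; higher is better

    def get_dict(self):
        """Return the underlying dict data."""
        return self.dict_data


def rank(results):
    """Return results ordered from most to least relevant (hypothetical helper)."""
    return sorted(results, key=lambda r: r.score, reverse=True)


results = [
    Result({"title": "Intro to Biology"}, 1.2),
    Result({"title": "Machine Learning"}, 3.4),
]
best = rank(results)[0]
```

Since higher scores are better, best.get_dict() here is the "Machine Learning" entry.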

class search.SearchEngine(create, search_fields, index_path)

An abstract class for search engines. A batteries-included search engine that can operate on any given dataset. Uses the Whoosh library to index and run searches on the dataset. Has built-in support for query rewriting.

__init__(create, search_fields, index_path)

Creates a new search engine.

Parameters:
create (bool) – If True, recreates the index from scratch. If False, loads the existing index.
search_fields (str[]) – An array of names of fields in the index that the search engine will search against.
index_path (str) – A relative path to a folder where the Whoosh index should be stored.

create_index(): Creates and returns a brand-new index. This will call get_empty_index() behind the scenes. Subclasses must implement!

get_empty_index(path, schema)
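The "subclasses must implement" contract can be sketched with Python's abc module. This is a toy stand-in rather than the real SearchEngine: BaseEngine and ToyEngine are hypothetical names, plain lists stand in for a Whoosh index, and the way the constructor uses the create flag is an assumption based on the parameter description above.

```python
from abc import ABC, abstractmethod


class BaseEngine(ABC):
    """Toy stand-in for the documented SearchEngine base class."""

    def __init__(self, create):
        # Mirror the documented flag: build from scratch, or load.
        self.index = self.create_index() if create else self.load_index()

    @abstractmethod
    def create_index(self):
        """Subclasses must build and return their index."""

    def load_index(self):
        # Placeholder: a real engine would open the index stored on disk.
        return []


class ToyEngine(BaseEngine):
    def create_index(self):
        # A list of dicts stands in for a real index.
        return [{"title": "Intro to Search"}]
```

Instantiating BaseEngine directly raises TypeError because create_index() is abstract, which is one way to enforce that each dataset-specific engine supplies its own index-building code.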

Makes an empty index file, making the directory where it needs to be stored if necessary. Returns the index.

This is called within create_index(). TODO: this breakdown is still confusing.
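The directory handling described here, together with the rule that the index folder may be missing but its parent may not, behaves much like a single-level mkdir. The sketch below shows only that part and leaves out the Whoosh index creation. ensure_index_dir is a hypothetical helper, and whether the library uses os.mkdir internally is an assumption.

```python
import os


def ensure_index_dir(path):
    """Create the index folder if it is missing and return its path.

    The parent folder must already exist.
    """
    if not os.path.isdir(path):
        # os.mkdir creates one level only, so it raises FileNotFoundError
        # when the parent folder is absent.
        os.mkdir(path)
    return path
```

Calling it twice on the same path is harmless, since the second call finds the folder and skips the mkdir.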

get_num_documents(): Returns the number of documents in this search engine’s corpus; that is, the size of the search engine.

load_index(): Used when the index is already created. Loads the index and returns it.

search(term): Runs a plain-English search and returns results.

Parameters: term (string) – a query like you’d type into Google.

Returns: a list of dicts, each of which encodes a search result.

set_rewriter(rewriter): Sets a new query rewriter (from this_package.rewriter) as the default rewriter for this search engine.
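The rewriter interface is not documented on this page, so the following is only a guess at what a query rewriter might look like: an object that expands a search term into related terms. The class SynonymRewriter and its rewrite(term) method are assumptions for illustration, not the package's actual rewriter API.

```python
class SynonymRewriter:
    """Hypothetical rewriter that expands a term using a fixed synonym table."""

    def __init__(self, synonyms):
        # Maps a lowercase term to a list of alternative phrasings.
        self.synonyms = synonyms

    def rewrite(self, term):
        # Keep the original term first, then append any known synonyms.
        return [term] + list(self.synonyms.get(term.lower(), []))


rewriter = SynonymRewriter({"ml": ["machine learning"]})
expanded = rewriter.rewrite("ML")
```

An engine with such a rewriter set could search for each expanded term and merge the results, which is the general idea behind query rewriting.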

class search.UdacitySearchEngine(dataset_path, index_path, create=False)

Udacity

__init__(dataset_path, index_path, create=False)

Creates a new Udacity search engine.

Parameters:
dataset_path (string) – the path to the Udacity API JSON file.
index_path (string) – the path to a folder where you’d like to store the search engine index. The given folder doesn’t have to exist, but its parent folder does.
create (bool) – If True, recreates the index from scratch. If False, loads the existing index.

count_words(): Returns the number of words in the underlying Udacity dataset.

create_index(): Creates a new index to search the Udacity dataset. You only need to call this once; once the index is created, you can just load it again instead of creating it afresh all the time.

search.pack_byte()

S.pack(v1, v2, ...) -> string

Return a string containing values v1, v2, ... packed according to this Struct’s format. See struct.__doc__ for more on format strings.

search.unpack_byte()

S.unpack(str) -> (v1, v2, ...)

Return tuple containing values unpacked according to this Struct’s format. Requires len(str) == self.size. See struct.__doc__ for more on format strings.
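These two entries are the bound pack and unpack methods of a struct.Struct object from the Python standard library. The sketch below rebuilds an equivalent pair locally, assuming a single unsigned byte format ("B"); the format string the module actually uses is not shown on this page. On Python 3, pack returns bytes rather than a string.

```python
import struct

# Assumption: one unsigned byte ("B"). The module's real format string
# is not shown on this page.
byte_struct = struct.Struct("B")

# Local stand-ins named after the documented functions.
pack_byte = byte_struct.pack
unpack_byte = byte_struct.unpack

packed = pack_byte(65)          # a bytes object of length byte_struct.size
value = unpack_byte(packed)[0]  # unpack always returns a tuple
```

Here packed is one byte long and value round-trips back to 65.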