Rewriter

The Rewriter module contains classes that help you rewrite single words and phrases to a set of semantically related words and phrases. If you search for all of these words and phrases on a search engine, you can get more, and more useful, results than if you just searched for the original word or phrase.

For instance, acceleration could be rewritten to { acceleration, velocity, mass, force, physics }.
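As a rough illustration of why the rewritten set helps, the sketch below runs one search per rewritten term and merges the hits. The `search_all` helper, the `toy_search` function, and the `TOY_INDEX` data are all hypothetical stand-ins invented for this example; they are not part of searchbetter.

```python
from typing import Callable, Iterable, List


def search_all(terms: Iterable[str], search: Callable[[str], List[str]]) -> List[str]:
    """Run `search` once per term and merge the hits in first-seen order."""
    seen = set()
    merged = []
    for term in terms:
        for hit in search(term):
            if hit not in seen:  # skip hits already returned for an earlier term
                seen.add(hit)
                merged.append(hit)
    return merged


# Hypothetical toy index standing in for a real search engine.
TOY_INDEX = {
    "acceleration": ["Intro to Kinematics"],
    "velocity": ["Intro to Kinematics", "Speed and Velocity"],
    "force": ["Newton's Laws"],
}


def toy_search(term: str) -> List[str]:
    """Look a term up in the toy index; unknown terms return no hits."""
    return TOY_INDEX.get(term, [])


rewritten = ["acceleration", "velocity", "mass", "force", "physics"]
results = search_all(rewritten, toy_search)
```

In this toy setup, searching for acceleration alone finds one document, while searching for the whole rewritten set finds three.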

Visit the interactive demo to learn how to use these classes.

Usage

First, pip install searchbetter.

Then, in your Python code:

from searchbetter import rewriter

Documentation

class rewriter.ControlRewriter

A rewriter that is effectively a no-op: it just returns the term you give it. This is mostly useful for testing purposes.

rewrite(term)

Rewrites a term to a list containing just itself. This is the degenerate case of query rewriting: the original term isn’t actually rewritten at all.

rewrite(x) == [x] for all x.

Parameters: term (str) – a string to rewrite
Returns: a list containing just term
Return type: list(str)
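The documented contract, rewrite(x) == [x], can be sketched in a few lines. `ControlRewriterSketch` is a hypothetical stand-in written for this example, not the library's source.

```python
class ControlRewriterSketch:
    """Stand-in that mirrors the documented no-op behaviour."""

    def rewrite(self, term):
        # The only "related" term is the input itself.
        return [term]


control = ControlRewriterSketch()
related = control.rewrite("acceleration")
```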

class rewriter.Rewriter

Abstract class around a query rewriter, which takes a given term and rewrites it to a set of semantically related terms. This, hopefully, helps search engines return more, and more useful, results.

rewrite(term)

Rewrites a term to a list of new terms to search with. This method is abstract, so subclasses must implement it.

Parameters: term (str) – a string to rewrite
Returns: a list of semantically related strings, including term
Return type: list(str)
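A custom rewriter only needs to honour this contract: take a string and return a list of strings that includes the original term. The sketch below shows one way that could look. `RewriterSketch` and `SynonymRewriter` are hypothetical names made up for this example, and putting the original term first in the list is an assumption, not documented behaviour.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class RewriterSketch(ABC):
    """Hypothetical stand-in for the abstract interface described above."""

    @abstractmethod
    def rewrite(self, term: str) -> List[str]:
        """Return a list of related terms that includes `term`."""


class SynonymRewriter(RewriterSketch):
    """Hypothetical subclass backed by a hand-written synonym table."""

    def __init__(self, synonyms: Dict[str, List[str]]):
        self.synonyms = synonyms

    def rewrite(self, term: str) -> List[str]:
        # Keep the original term first, then append synonyms not yet listed.
        related = [term]
        for candidate in self.synonyms.get(term, []):
            if candidate not in related:
                related.append(candidate)
        return related


synonym_rewriter = SynonymRewriter({"acceleration": ["velocity", "force"]})
```

Because `rewrite` is declared abstract, instantiating `RewriterSketch` directly raises a TypeError; only concrete subclasses can be used.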

class rewriter.WikipediaRewriter

A class to rewrite queries using Wikipedia’s Category API.

rewrite(term)

Given a base term, returns a list of related terms based on the Wikipedia category API.

For example, visit your favorite Wikipedia page and look for the list of Categories at the very bottom of the page.

Parameters: term (str) – a string to rewrite
Returns: a list of semantically related strings, including term
Return type: list(str)
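To give a feel for category-based rewriting, the sketch below builds a MediaWiki API request for a page's categories and turns a response into a list of terms. Whether WikipediaRewriter uses exactly this endpoint and these parameters is an assumption. The helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written illustration of the response shape rather than real API output.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_category_url(term: str) -> str:
    """Build a MediaWiki query URL asking for the categories of one page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": term,
        "format": "json",
        "cllimit": "max",
    }
    return API_URL + "?" + urlencode(params)


def parse_categories(term: str, response: dict) -> list:
    """Return the term followed by the category names found in a response."""
    related = [term]
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for category in page.get("categories", []):
            name = category["title"]
            if name.startswith("Category:"):
                name = name[len("Category:"):]  # drop the namespace prefix
            if name not in related:
                related.append(name)
    return related


# Hand-written sample in the shape the API returns; not real data.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "title": "Acceleration",
                "categories": [
                    {"ns": 14, "title": "Category:Physics"},
                    {"ns": 14, "title": "Category:Motion"},
                ],
            }
        }
    }
}

url = build_category_url("Acceleration")
related = parse_categories("Acceleration", SAMPLE_RESPONSE)
```

Fetching `url` with any HTTP client and passing the decoded JSON to `parse_categories` would complete the round trip; the network call is left out so the sketch runs offline.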

class rewriter.Word2VecRewriter(model_path, create=False, corpus=None, bigrams=True)

A class to rewrite queries using Word2Vec, an NLP technique that finds words and phrases semantically related to the input words and phrases. The Word2Vec model must be trained on a user-provided dataset before it is used.

__init__(model_path, create=False, corpus=None, bigrams=True)

Initializes the rewriter, given a particular Word2Vec corpus. A good example corpus is the Wikipedia Text8Corpus. You only need the corpus if you are creating the model from scratch.

If create == True, this generates a new Word2Vec model, which takes a really long time to build. If create == False, this loads an existing model that was already saved at model_path.

Parameters:
  • model_path (str) – path to the model files. If create=True, the model is stored here; the file needn’t exist yet, but its parent folder must. If create=False, the existing model is loaded from here.
  • create (bool) – True to create a new Word2Vec model, False to use the one stored at model_path.
  • corpus (Iterable) – only needed if create=True. Defines a corpus for Word2Vec to learn from.
  • bigrams (bool) – only needed if create=True. If True, the model also supports bigrams (e.g. new_york), which makes it slower to build but more complete. If False, it only supports one-word searches.

decode_term(encoded)

Converts an encoded search term into something more human-readable; for example, hadrians_wall becomes hadrians wall.

Parameters: encoded (str) – a cleaned term from Word2VecRewriter.encode_term.
Returns: a more human-readable version of the encoded term.
Return type: str

encode_term(term)

Converts a search term like Hadrian’s Wall to hadrians_wall, which plays better with Word2Vec. Primarily for internal use.

Parameters: term (str) – a search term you’d normally feed into Word2VecRewriter.rewrite.
Returns: a cleaned-up version of the term, which works better in rewrite.
Return type: str
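The two examples above (Hadrian’s Wall to hadrians_wall, and hadrians_wall to hadrians wall) suggest a cleaning step along the following lines. This is a minimal sketch that reproduces those examples, assuming the cleaning consists of lowercasing, dropping punctuation, and joining words with underscores; the library's actual rules may differ, and the `_sketch` function names are hypothetical.

```python
import re


def encode_term_sketch(term: str) -> str:
    """Lowercase, drop punctuation, and join words with underscores."""
    cleaned = term.lower()
    cleaned = re.sub(r"[^\w\s]", "", cleaned)  # removes straight and curly apostrophes
    return re.sub(r"\s+", "_", cleaned.strip())


def decode_term_sketch(encoded: str) -> str:
    """Turn underscores back into spaces for display."""
    return encoded.replace("_", " ")


encoded = encode_term_sketch("Hadrian’s Wall")
decoded = decode_term_sketch(encoded)
```

In this sketch, decoding does not restore capitalization or punctuation, so a round trip yields hadrians wall rather than the original input.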

rewrite(term)

Rewrites a term to a list of new terms to search with. These are the k words and phrases most similar to the input term, as judged by Word2Vec. Here, k == 10.

Parameters: term (str) – a string to rewrite
Returns: a list of semantically related strings, including term
Return type: list(str)
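The "k most similar" step can be pictured as ranking every known word by the cosine similarity of its vector to the input term's vector. The sketch below does this in plain Python over a tiny hand-made vocabulary. `rewrite_sketch` and `TOY_VECTORS` are hypothetical, the vectors are invented for illustration, and both returning the original term first and returning just the term for unknown words are assumptions rather than documented behaviour.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rewrite_sketch(term: str, vectors: Dict[str, List[float]], k: int = 10) -> List[str]:
    """Return the term followed by its k nearest neighbours by cosine similarity."""
    if term not in vectors:
        return [term]  # assumption: unknown terms are returned unchanged
    target = vectors[term]
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word != term
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [term] + [word for _, word in scored[:k]]


# Invented two-dimensional vectors; a trained model would use far more dimensions.
TOY_VECTORS = {
    "acceleration": [1.0, 0.0],
    "velocity": [0.9, 0.1],
    "force": [0.7, 0.3],
    "banana": [0.0, 1.0],
}

related = rewrite_sketch("acceleration", TOY_VECTORS, k=2)
```

With k=2, the unrelated word banana is left out because its vector points in a different direction from that of acceleration.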