Topic Modelling and Search with Top2Vec

Nov 8, 2022

An entry in a series of blogs written during the Vector Search Hackathon organized by the MLOps Community, Redis, and Saturn Cloud.

The Top2Vec paper explains the concepts behind the Top2Vec library in a more accessible way than I ever could. So instead of diving into the concepts, we’ll look at the practicalities of using Top2Vec and the data generated from it.

We have three major types of objects in Top2Vec: words, documents, and topics. All the unique, cleaned-up words across the documents fed into the model make up the vocabulary. Topics are centroid vectors that represent dense clusters of documents; the clusters are found by reducing the document vectors with UMAP and then clustering them with HDBSCAN.

The following is my simplified understanding of the code in the top2vec library. This is what happens once Top2Vec(documents) gets called and the training begins.
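As a rough illustration of the last steps of that flow, here is a minimal sketch in plain Python. It is not the library's code: it assumes the document vectors and cluster labels already exist (in Top2Vec those come from the embedding model, UMAP, and HDBSCAN), and the helper names `normalize`, `topic_centroids`, and `nearest_words` are made up for this example. It averages the document vectors in each cluster to get a topic vector, then ranks words by cosine similarity to that topic.

```python
import math
from collections import defaultdict


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def topic_centroids(document_vectors, labels):
    """Average the document vectors of each cluster into one topic vector."""
    groups = defaultdict(list)
    for vector, label in zip(document_vectors, labels):
        if label != -1:  # HDBSCAN marks noise points with -1
            groups[label].append(vector)
    return {
        label: normalize([sum(column) / len(vectors) for column in zip(*vectors)])
        for label, vectors in sorted(groups.items())
    }


def nearest_words(topic_vector, vocab, word_vectors, k):
    """Return the k words most similar to the topic, with their scores."""
    scored = [
        (sum(t * w for t, w in zip(topic_vector, normalize(word_vector))), word)
        for word, word_vector in zip(vocab, word_vectors)
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:k]]


# Toy 2-D data standing in for real embeddings and cluster labels.
document_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]
vocab = ["redis", "vector", "topic"]
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

topic_vectors = topic_centroids(document_vectors, labels)
topic_words = {
    topic: nearest_words(vector, vocab, word_vectors, k=2)
    for topic, vector in topic_vectors.items()
}
```

The real model keeps 50 words per topic rather than 2, and works on numpy arrays instead of lists.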

Looking at this code, we find that these objects have the most relevance:

vocab: list of words
word_vectors: array of word vectors
word_indexes: dictionary of words to word indexes. Useful to get the word index given the word without iterating over the list of words.
documents: numpy array of all documents
document_vectors: array of document vectors
topic_vectors: array of topic vectors
topic_words: 2-D array of 50 words to describe every topic
topic_word_scores: 2-D array of 50 scores per topic, one for each word in topic_words, giving the cosine similarity between that word and the topic vector
doc_top: topic for each document
doc_dist: scores per topic for each document
topic_sizes: number of documents per topic
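To make the relationships between a few of these objects concrete, here is a small sketch with plain Python stand-ins. The values are toy data, not output from a trained model, and `vector_for` is a hypothetical helper; the real attributes are arrays on the model object.

```python
from collections import Counter

# Toy stand-ins for vocab and word_vectors (same order, same length).
vocab = ["redis", "search", "topic", "vector"]
word_vectors = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.8, 0.2]]

# word_indexes maps each word to its position, so a lookup is one
# dictionary access instead of a scan through the vocab list.
word_indexes = dict(zip(vocab, range(len(vocab))))


def vector_for(word):
    """Fetch a word's vector through the index dictionary."""
    return word_vectors[word_indexes[word]]


# doc_top holds one topic number per document; counting its values
# gives the number of documents per topic, i.e. topic_sizes.
doc_top = [0, 1, 0, 0, 2, 1]
topic_sizes = Counter(doc_top)
```

Here `vector_for("topic")` reads row 2 of `word_vectors`, and `topic_sizes[0]` is 3 because three documents were assigned to topic 0.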

We then come up with a data model to load this data into Redis:

class Paper

This model is handled by Redis OM, which stores each paper as a Redis hash.

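from redis_om import Field, HashModel


class Paper(HashModel):
    # index=True makes a field queryable; full_text_search=True adds text search.
    paper_id: str = Field(index=True)
    title: str = Field(index=True, full_text_search=True)
    year: int = Field(index=True)
    authors: str
    categories: str
    abstract: str = Field(index=True, full_text_search=True)
    input: str  # cleaned title + abstract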

In addition, each paper will have a topic index and a topic score assigned to it to indicate the topic the document is closest to.

The input field is a cleaned version of title + abstract.
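A minimal sketch of how such an input string could be built is shown here. The exact cleaning rules are not spelled out above, so everything in `build_input` (a hypothetical helper) is an assumption: it lowercases the text, replaces anything that is not a letter, digit, or whitespace with a space, and collapses runs of whitespace.

```python
import re


def build_input(title, abstract):
    """Join title and abstract, then apply some assumed cleaning steps."""
    text = f"{title} {abstract}".lower()
    # Assumption: punctuation and symbols carry no signal for the model.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


cleaned = build_input(
    "Top2Vec: Distributed Representations",
    "We present  top2vec,\nan algorithm.",
)
```

A real pipeline might also drop stop words or keep hyphenated terms together; this only shows the shape of the step.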

paper_vector:

paper_pk
paper_id
doc_idx
categories
year
vector
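One way to assemble such a hash is sketched below. Redis vector fields of type FLOAT32 expect the vector as raw bytes, so the sketch packs the floats with the standard library. The helper names are hypothetical, and building the key from the `paper_vector:` prefix plus `paper_pk` is an assumption about the key layout.

```python
import struct


def pack_vector(vector):
    """Pack floats as little-endian float32 bytes for a FLOAT32 vector field."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_vector(blob):
    """Inverse of pack_vector: 4 bytes per float32 value."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def paper_vector_hash(paper_pk, paper_id, doc_idx, categories, year, vector):
    """Build the key and field mapping for one paper_vector hash."""
    key = f"paper_vector:{paper_pk}"  # assumption: prefix + paper_pk
    mapping = {
        "paper_pk": paper_pk,
        "paper_id": paper_id,
        "doc_idx": doc_idx,
        "categories": categories,
        "year": year,
        "vector": pack_vector(vector),
    }
    return key, mapping


# Toy values; a real vector would have the embedding model's dimension.
key, mapping = paper_vector_hash(
    "01ABC", "2211.00001", 0, "cs.IR", 2022, [0.5, -1.0, 0.25]
)
```

With redis-py, the pair could then be written with `client.hset(key, mapping=mapping)`.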

word_vector:

word_id
word
vector

topic_vector:

topic_id
size
vector
word_0 … word_14
word_score_0 … word_score_14

A total of 33 fields per topic_vector: topic_id, size, and vector, plus 15 words and 15 word scores.
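Flattening a topic into such a hash could look like the sketch below. It assumes 15 word/score pairs per topic, which is what a total of 33 fields implies (3 fixed fields plus 2 × 15). The function name and the key layout (prefix plus `topic_id`) are assumptions made for the example.

```python
def topic_vector_hash(topic_id, size, vector_blob, words, scores, n_words=15):
    """Flatten one topic into a single-level mapping suitable for a Redis hash."""
    mapping = {"topic_id": topic_id, "size": size, "vector": vector_blob}
    for i in range(n_words):
        # Hashes have no nested values, so each word and score gets its own field.
        mapping[f"word_{i}"] = words[i]
        mapping[f"word_score_{i}"] = float(scores[i])
    return f"topic_vector:{topic_id}", mapping


# Toy inputs: 50 words and scores per topic, of which the first 15 are kept.
toy_words = [f"w{i}" for i in range(50)]
toy_scores = [1.0 - i / 100 for i in range(50)]
topic_key, topic_mapping = topic_vector_hash(7, 120, b"\x00" * 8, toy_words, toy_scores)
```

Storing the words as separate fields keeps them readable with a plain `HGETALL`, at the cost of a wider hash.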

Indexes

We create three separate HNSW indexes on the vector field with RediSearch. The key prefix given to each index keeps them apart: an index only covers hashes whose keys begin with its prefix.

papers

prefix: paper_vector:

words

prefix: word_vector:

topics

prefix: topic_vector:
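As a sketch of what creating these indexes involves, the snippet below builds the raw `FT.CREATE` arguments for each index and prefix pair. The index names and prefixes come from the list above; the vector dimension of 300 and the COSINE distance metric are placeholder assumptions, and only the vector field is included in the schema.

```python
# Index name -> key prefix, as listed above.
INDEXES = {
    "papers": "paper_vector:",
    "words": "word_vector:",
    "topics": "topic_vector:",
}

VECTOR_DIM = 300  # assumption: must match the embedding model's dimension


def ft_create_args(index_name, prefix, dim):
    """Build FT.CREATE arguments for an HNSW index over hashes with a given prefix."""
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA", "vector", "VECTOR", "HNSW",
        "6",  # number of attribute arguments that follow
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",  # assumption: cosine similarity
    ]


commands = [
    ft_create_args(name, prefix, VECTOR_DIM) for name, prefix in INDEXES.items()
]
```

Each argument list could be sent with redis-py through `client.execute_command(*args)`. Other fields such as year or categories would need their own schema entries to be filterable.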

Conclusion

We’ve barely scratched the surface here. The Redis DB now holds a ton of information, both metadata and vectors. Correlating them and exploring their relationships should yield very interesting results. And the promise of this not being tied to a single machine is tantalizing: we’re no longer dealing with a Top2Vec model that we have to port everywhere.