Topic Modelling and Search with Top2Vec
An entry in a series of blogs written during the Vector Search Hackathon organized by the MLOps Community, Redis, and Saturn Cloud.
The Top2Vec paper explains the concepts behind the Top2Vec library in a more accessible way than I ever could. So instead of rehashing the theory, we’ll look at the practicalities of using Top2Vec and the data it generates.
We have three major types of objects in Top2Vec: words, documents, and topics. All the unique, cleaned-up words in all the documents fed into the model make up the vocabulary. Topics refer to centroid vectors representing dense clusters of documents, the clusters having been determined using UMAP and HDBSCAN.
What follows is my simplified understanding of what the top2vec library does once Top2Vec(documents) is called and training begins.
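Roughly, the pipeline looks like the sketch below. This is not the library’s actual code; the function name, the parameter values, and the direct use of gensim’s Doc2Vec are my own stand-ins for what Top2Vec does by default.

```python
import numpy as np
import umap
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


def top2vec_sketch(documents):
    # 1. Learn joint word and document embeddings (Top2Vec defaults to doc2vec).
    tagged = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]
    d2v = Doc2Vec(tagged, vector_size=300, min_count=50, epochs=40)
    document_vectors = d2v.dv.vectors
    word_vectors = d2v.wv.vectors

    # 2. Reduce the document vectors to a low-dimensional space with UMAP.
    reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(
        document_vectors
    )

    # 3. Find dense clusters of documents with HDBSCAN (-1 means "noise", no cluster).
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

    # 4. Each topic vector is the centroid of a cluster in the original embedding space.
    topic_vectors = np.array(
        [document_vectors[labels == k].mean(axis=0) for k in sorted(set(labels)) if k != -1]
    )
    return word_vectors, document_vectors, topic_vectors
```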
Looking at the library code, these objects have the most relevance:
- vocab: list of words
- word_vectors: array of word vectors
- word_indexes: dictionary of words to word indexes. Useful to get the word index given the word without iterating over the list of words.
- documents: numpy array of all documents
- document_vectors: array of document vectors
- topic_vectors: array of topic vectors
- topic_words: 2-D array of the top 50 words describing each topic
- topic_word_scores: 2-D array of 50 scores indicating how close each word in topic_words is to its topic
- doc_top: the topic assigned to each document
- doc_dist: the similarity score of each document to its assigned topic
- topic_sizes: the number of documents per topic
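Assuming we have a trained model, these all hang directly off the Top2Vec object (the file path below is a placeholder, and exact attribute availability may vary by top2vec version):

```python
from top2vec import Top2Vec

model = Top2Vec.load("arxiv_top2vec.model")  # placeholder path

print(len(model.vocab))                        # vocabulary size
print(model.word_vectors.shape)                # (num_words, embedding_dim)
print(model.document_vectors.shape)            # (num_documents, embedding_dim)
print(model.topic_vectors.shape)               # (num_topics, embedding_dim)
print(model.topic_words[0][:5])                # top words for topic 0
print(model.topic_word_scores[0][:5])          # their similarity scores
print(model.doc_top[:5], model.doc_dist[:5])   # topic and score for the first documents
print(model.topic_sizes[:5])                   # documents per topic
```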
We then come up with a data model to load this data into Redis:
The Paper class is handled by Redis OM:

from redis_om import Field, HashModel

class Paper(HashModel):
    paper_id: str = Field(index=True)
    title: str = Field(index=True, full_text_search=True)
    year: int = Field(index=True)
    authors: str
    categories: str
    abstract: str = Field(index=True, full_text_search=True)
    input: str
In addition, each paper will have a topic index and a topic score assigned to it, indicating which topic the document is closest to and how close it is.
The input field is a cleaned version of title + abstract.
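As a sketch, persisting one paper with Redis OM looks like this (the field values are made up for illustration):

```python
# Hypothetical record, just to show how Redis OM writes a Paper hash.
paper = Paper(
    paper_id="2008.09470",
    title="Top2Vec: Distributed Representations of Topics",
    year=2020,
    authors="Dimo Angelov",
    categories="cs.CL",
    abstract="Topic modeling is used for discovering latent semantic structure...",
    input="top2vec distributed representations of topics topic modeling ...",
)
paper.save()     # writes a hash keyed by the model prefix and paper.pk
print(paper.pk)  # the auto-generated primary key referenced by the paper_vector hashes
```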
paper_vector:
- paper_pk
- paper_id
- doc_idx
- categories
- year
- vector
word_vector:
- word_id
- word
- vector
topic_vector:
- topic_id
- size
- vector
- word_0 … word_15
- word_score_0 … word_15
A total of 35 fields per topic_vector.
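Writing one of these hashes is plain HSET. The sketch below reuses `model` and `paper` from the earlier snippets, and the key format "paper_vector:{doc_idx}" is an assumption for illustration:

```python
import numpy as np
from redis import Redis

r = Redis()

doc_idx = 0
vec = np.asarray(model.document_vectors[doc_idx], dtype=np.float32)
r.hset(
    f"paper_vector:{doc_idx}",
    mapping={
        "paper_pk": paper.pk,        # primary key of the Redis OM Paper hash
        "paper_id": paper.paper_id,
        "doc_idx": doc_idx,
        "categories": paper.categories,
        "year": paper.year,
        "vector": vec.tobytes(),     # raw float32 bytes, as RediSearch expects
    },
)
```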
Indexes
We create three different HNSW indexes for the vector key with RediSearch. Each index is kept separate by the key prefix we give it; an index only picks up hashes whose keys begin with that prefix. A sketch of creating one of these indexes follows the list below.
- papers — prefix: paper_vector:
- words — prefix: word_vector:
- topics — prefix: topic_vector:
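As a rough sketch with redis-py, creating the papers index and running a KNN query against it might look like this. The 300-dimensional size, the choice of indexed metadata fields, and the query vector are assumptions based on the layout above:

```python
import numpy as np
from redis import Redis
from redis.commands.search.field import NumericField, TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = Redis()
DIM = 300  # Top2Vec's default doc2vec vectors are 300-dimensional

# HNSW index over hashes whose keys start with "paper_vector:"
r.ft("papers").create_index(
    fields=[
        TagField("paper_id"),
        TagField("categories"),
        NumericField("year"),
        VectorField(
            "vector",
            "HNSW",
            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["paper_vector:"], index_type=IndexType.HASH),
)

# 10 nearest papers to an arbitrary query vector (here: the first topic vector).
query_vec = np.asarray(model.topic_vectors[0], dtype=np.float32)
q = (
    Query("*=>[KNN 10 @vector $vec AS score]")
    .sort_by("score")
    .return_fields("paper_id", "year", "score")
    .dialect(2)
)
results = r.ft("papers").search(q, query_params={"vec": query_vec.tobytes()})
for doc in results.docs:
    print(doc.paper_id, doc.year, doc.score)
```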
Conclusion
We’ve barely scratched the surface here. The Redis DB now holds a ton of information, both metadata and vectors, and correlating them and exploring their relationships should yield very interesting results. The promise of this data no longer living on a single computer is also tantalizing: we’re no longer dealing with a Top2Vec model we have to port everywhere.