This tutorial will demonstrate how to process text data and generate word embeddings and visualizations as part of a Flyte workflow. It’s an adaptation of the official Gensim Word2Vec tutorial.
Gensim is a popular open-source natural language processing (NLP) library designed to process large corpora, including ones that do not fit in RAM. It provides efficient multicore implementations of algorithms such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec, which support tasks including analyzing document relationships, topic modeling, and learning word embeddings.
You can read more about Gensim here.
The dataset used for this tutorial is the open-source Lee Background Corpus that comes with the Gensim library.
The following points outline the modelling process:
Returns a preprocessed (tokenized, stop words excluded, lemmatized) corpus from the custom iterator.
Trains the Word2Vec model on the preprocessed corpus.
Generates a bag of words from the corpus and trains the LDA model.
Saves the LDA and Word2Vec models to disk.
Deserializes the Word2Vec model, runs word similarity queries, and computes the Word Mover's Distance between sentences.
Reduces the dimensionality of the word embeddings (using t-SNE) and plots them.
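To make the first and third steps above concrete, here is a minimal, library-free sketch of a custom corpus iterator and a bag-of-words conversion. It is an illustration only, not the tutorial's actual code: the `ToyCorpus` class, the `build_vocab` and `to_bow` helpers, the tiny stop-word set, and the two sample sentences are all hypothetical, and lemmatization is left out to keep the example dependency-free. The `(token_id, count)` pairs it produces have the same shape as the output of Gensim's `Dictionary.doc2bow`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are"}

# Keep only runs of lowercase ASCII letters; punctuation and digits are dropped.
TOKEN_RE = re.compile(r"[a-z]+")


class ToyCorpus:
    """Iterable that yields one preprocessed document (a list of tokens) at a time.

    Because documents are produced lazily in __iter__, the full corpus never
    has to be held in memory, and the object can be iterated more than once.
    """

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            tokens = TOKEN_RE.findall(line.lower())
            # Lemmatization is omitted here; only stop words are removed.
            yield [t for t in tokens if t not in STOP_WORDS]


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Hypothetical sample text standing in for a real corpus.
raw_lines = [
    "The fire spread to the hills and the fire crews responded.",
    "Crews in the hills are on alert.",
]

corpus = ToyCorpus(raw_lines)
docs = list(corpus)
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
```

A Word2Vec model consumes the token lists directly, while an LDA model is trained on the bag-of-words form together with the id-to-token mapping, which is why the process produces both representations from the same preprocessed corpus.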
Let’s dive into the code!