spaCy
- Tokenization: sentences -> words
- Detecting Entities
- Detecting Nouns
- Lemmatization: been -> be
- Phrase matching
- Stop Words
- Parts of Speech (POS) Tagging
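A minimal spaCy sketch covering the points above (assumes the small English model en_core_web_sm is installed; the sample sentence is just an illustration):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple has been buying U.K. startups for years.")

# tokenization: words and sentences
print([token.text for token in doc])
print([sent.text for sent in doc.sents])

# named entities and nouns
print([(ent.text, ent.label_) for ent in doc.ents])
print([token.text for token in doc if token.pos_ == "NOUN"])

# lemma, POS tag and stop-word flag per token
print([(token.text, token.lemma_, token.pos_, token.is_stop) for token in doc])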
NLTK
- Stemming: Snowball Stemmer
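Minimal NLTK stemming sketch with the Snowball (Porter2) stemmer:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
# e.g. 'running' -> 'run', 'studies' -> 'studi' (stems need not be real words)
print([stemmer.stem(w) for w in ["running", "studies", "computation"]])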
Sentiment Analysis (Scikit-Learn)
- Dataset import
- Data Analysis
- Data Cleaning (remove special chars)
- Text -> Number: Bag of Words, TF-IDF, Word2Vec
- TfidfVectorizer, fit_transform
- Data dividing -> train/test
- train_test_split
- Model training
- RandomForestClassifier, fit
- Prediction: predict
- Evaluation: confusion_matrix / F1 measure / accuracy
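A hedged end-to-end sketch of this scikit-learn pipeline; the CSV path and the 'text'/'label' column names are placeholders:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

df = pd.read_csv("reviews.csv")                                        # dataset import (placeholder path)
df["text"] = df["text"].str.replace(r"[^a-zA-Z\s]", " ", regex=True)   # data cleaning: remove special chars

vectorizer = TfidfVectorizer(max_features=2500, stop_words="english")
X = vectorizer.fit_transform(df["text"])                               # text -> TF-IDF numbers
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)                                              # model training
y_pred = clf.predict(X_test)                                           # prediction

print(confusion_matrix(y_test, y_pred))                                # evaluation
print(f1_score(y_test, y_pred, average="macro"))
print(accuracy_score(y_test, y_pred))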
Topic Modeling (Scikit-Learn)
- unsupervised/clustering
- similarity
- Latent Dirichlet Allocation / Non-Negative Matrix factorization
LDA
- documents with similar words usually share the same topic
- words that frequently occur together usually belong to the same topic
- import data (pandas)
- dropna
- text to numbers: 20k docs -> 14,546 unique words = vector dimensionality
- feature_extraction: CountVectorizer, fit_transform
- 20000×14546 sparse matrix
- decomposition: LatentDirichletAllocation, fit
- retrieve words: get_feature_names
- add column (category) to each text
- LDA.transform(doc_term_matrix)
- shape: (20000, 5)
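Sketch of that LDA workflow; the file name and 'text' column are placeholders, and newer scikit-learn versions expose get_feature_names_out (older ones: get_feature_names):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv("reviews.csv").dropna()

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words="english")
doc_term_matrix = count_vect.fit_transform(df["text"])     # e.g. 20000 x 14546 sparse matrix

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

# retrieve the top words per topic
feature_names = count_vect.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}:", [feature_names[j] for j in topic.argsort()[-10:]])

# add a category column: most probable topic per document
topic_values = lda.transform(doc_term_matrix)               # shape: (20000, 5)
df["topic"] = topic_values.argmax(axis=1)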
NMF
- unsupervised learning (no labels required, like LDA)
- combination with TF-IDF
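Minimal NMF sketch on TF-IDF features, reusing the df from the LDA example above:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words="english")
doc_term_matrix = tfidf_vect.fit_transform(df["text"])

nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)

feature_names = tfidf_vect.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    print(f"Topic {i}:", [feature_names[j] for j in topic.argsort()[-10:]])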
TextBlob Library
- Tokenization
- Lemmatization
- Parts of Speech (POS) Tagging
- Convert Text to Singular and Plural
- Noun Phrase Extraction
- Getting Words and Phrase Counts
- Converting to Upper and Lowercase
- Finding N-Grams
- Spelling Corrections
- Language Translation (Google Translate API)
- Text Classification (simple)
- Sentiment Analysis
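A few of the TextBlob operations from the list above in one sketch (translation is omitted, since it now requires the official Google Translate API):
from textblob import TextBlob

blob = TextBlob("The movies were surprisingly good, but the tickets were expensive.")

print(blob.words)                       # tokenization into words
print(blob.sentences)                   # tokenization into sentences
print(blob.tags)                        # POS tagging
print(blob.noun_phrases)                # noun phrase extraction
print(blob.words[1].singularize())      # 'movies' -> singular
print(blob.words[1].pluralize())        # back to plural
print(blob.word_counts["the"])          # word counts (lowercased keys)
print(blob.ngrams(n=2))                 # bigrams
print(TextBlob("I havv goood spelling").correct())   # spelling correction
print(blob.sentiment)                   # (polarity, subjectivity)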
Pattern Library
- NLP
- Tokenizing, POS Tagging, and Chunking
- Pluralizing and Singularizing the Tokens
- Converting Adjective to Comparative and Superlative Degrees
- Finding N-Grams
- Finding Sentiments
- Checking if a Statement is a Fact
- Spelling Corrections
- Working with Numbers
- Data Mining
- Accessing Web Pages
- Finding URLs within Text
- Making Asynchronous Requests for Webpages
- Getting Search Engine Results with APIs
- Converting HTML Data to Plain Text
- Parsing PDF Documents
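A hedged pattern.en sketch for a few of the NLP helpers listed above (Pattern only runs on certain Python versions, and function availability can vary by release; the web/data-mining helpers live in pattern.web and are not shown):
from pattern.en import (pluralize, singularize, comparative, superlative,
                        ngrams, sentiment, suggest)

print(pluralize("leaf"))                        # noun -> plural form
print(singularize("theories"))                  # noun -> singular form
print(comparative("good"))                      # adjective -> comparative degree
print(superlative("good"))                      # adjective -> superlative degree
print(ngrams("He goes to the hospital", n=2))   # list of bigrams
print(sentiment("This is an excellent movie"))  # (polarity, subjectivity)
print(suggest("whitle"))                        # spelling suggestions with confidence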
Gensim library
Creating Dictionaries
- dictionaries that map words to IDs
Creating Bag of Words Corpus
- contains the ID of each word along with its frequency of occurrence in the document
Creating TF-IDF Corpus
- term frequency – Inverse Document Frequency
Built-In Gensim Models and Datasets
Topic Modeling with LDA
- Scraping Wikipedia Articles
- Data Preprocessing
- Modeling Topics
- dictionary + bag of words corpus (doc2bow)
- LdaModel
- Visualizing the LDA
- pyLDAvis
Topic Modeling via Latent Semantic Indexing (LSI)
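Hedged sketch of the gensim pipeline above (dictionary -> bag-of-words corpus -> TF-IDF corpus -> LDA/LSI); tokenized_docs stands in for the preprocessed, tokenized Wikipedia articles:
from gensim import corpora
from gensim.models import TfidfModel, LdaModel, LsiModel

tokenized_docs = [["artificial", "intelligence", "machine", "learning"],
                  ["eiffel", "tower", "paris", "france"]]

dictionary = corpora.Dictionary(tokenized_docs)                    # word -> ID mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]   # (word ID, frequency) pairs

tfidf = TfidfModel(bow_corpus)                                     # TF-IDF corpus
tfidf_corpus = tfidf[bow_corpus]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
lsi = LsiModel(bow_corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
print(lsi.print_topics())

# Visualization with pyLDAvis (module name differs across pyLDAvis versions):
# import pyLDAvis.gensim_models as gensimvis
# vis = gensimvis.prepare(lda, bow_corpus, dictionary)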
Rule-Based Chatbot
Idea
- query -> vectorized form
- sentences -> vectorized forms
- highest cosine similarity = response
Implementation
- Creating the Corpus
- Text Preprocessing and Helper Function (NLTK)
- nltk.word_tokenize
- nltk.stem.WordNetLemmatizer()
- Responding to Greetings
- Responding to User Queries (SKLearn)
- TfidfVectorizer, fit_transform
- cosine_similarity, flatten
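Minimal sketch of the cosine-similarity response step; the toy corpus and user_query are placeholders for the preprocessed sentences and the incoming message:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["a chatbot is a software application",
             "tennis is a racket sport",
             "python is a programming language"]
user_query = "what is a chatbot"

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences + [user_query])     # query vectorized together with the corpus

similarities = cosine_similarity(vectors[-1], vectors[:-1]).flatten()
best = similarities.argmax()
response = sentences[best] if similarities[best] > 0 else "I am sorry, I don't understand."
print(response)                                              # highest cosine similarity = response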
Bag of Words
- Tokenize the Sentences
- Create a Dictionary of Word Frequency
- Creating the Bag of Words Model
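Manual bag-of-words sketch following the three steps above, on a toy corpus:
import re
from collections import Counter

corpus = ["I like to play football",
          "Did you go outside to play tennis",
          "John and I play tennis"]

# 1) tokenize the sentences, 2) build a word-frequency dictionary
tokenized = [re.findall(r"\w+", s.lower()) for s in corpus]
word_freq = Counter(w for sent in tokenized for w in sent)
vocabulary = sorted(word_freq)

# 3) one BoW vector per sentence: counts over the vocabulary
bow_vectors = [[sent.count(w) for w in vocabulary] for sent in tokenized]
print(vocabulary)
print(bow_vectors)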
problems with BoW:
- assigns equal value to the words, irrespective of their importance
- use TF-IDF
problems with TF-IDF and BoW:
- words are treated individually
- context information of the word is not retained
- use N-Grams
Problems with One-Hot Encoded Feature Vector
- BoW, TF-IDF, N-Grams = Text data -> numeric feature vectors
- with half a million unique words, how do you represent a sentence that contains 10 words?
- as a half-million-dimensional one-hot encoded vector in which only 10 indexes hold a 1
- this wastes space and increases algorithm complexity
Word Embeddings
- n-dimensional dense vector
- techniques: GloVe and Word2Vec
- Implementing: Keras Sequential Models
- embedding layer = first layer in the Sequential model
embedding_layer = Embedding(200, 32, input_length=50)
- 200 = size of the vocabulary (total number of unique words)
- 32 = number of dimensions for each word vector
- 50 = length of the input sentences
convert text to numbers / vectors (keras)
- use one_hot or Tokenizer (fit_on_texts)
padded_sentences = pad_sequences(embedded_sentences, length_long_sentence, padding='post')
- list of encoded sentences of unequal sizes
- maxlen = length of the longest sentence (shorter sentences are padded with index 0)
- post = add padding at the end of sentences
model = Sequential()
model.add(Embedding(vocab_length, 20, input_length=length_long_sentence))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Layer (type)              Output Shape    Param #
==================================================
embedding_1 (Embedding)   (None, 7, 20)   1000
flatten_1 (Flatten)       (None, 140)     0
dense_1 (Dense)           (None, 1)       141
==================================================
Total params: 1,141
- vocabulary size = 50 and embedding dimension = 20 -> 50 x 20 = 1000 trainable parameters
- embedding layer output = a sentence of 7 words, each as a 20-dimensional vector -> shape (None, 7, 20)
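Putting these pieces together in one hedged sketch (toy sentences and labels are placeholders; matches the one_hot/pad_sequences/Embedding API used in the notes):
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

corpus = ["great movie", "really bad film", "loved it", "waste of time"]
labels = np.array([1, 0, 1, 0])

vocab_length = 50
embedded_sentences = [one_hot(sent, vocab_length) for sent in corpus]          # text -> integer sequences
length_long_sentence = max(len(s) for s in embedded_sentences)
padded_sentences = pad_sequences(embedded_sentences, length_long_sentence, padding='post')

model = Sequential()
model.add(Embedding(vocab_length, 20, input_length=length_long_sentence))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(padded_sentences, labels, epochs=10, verbose=0)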
Word Embeddings with Keras Functional API
- Sequential API: simple, for beginners
- Functional API: advanced cases such as multiple inputs and outputs
- first layer = explicit Input layer
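The same embedding model rewritten with the Functional API (hedged sketch; vocab_length and length_long_sentence come from the previous sketch). Layers are called on each other's outputs instead of being add()-ed:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

inputs = Input(shape=(length_long_sentence,))        # explicit input layer
embedding = Embedding(vocab_length, 20)(inputs)
flat = Flatten()(embedding)
outputs = Dense(1, activation='sigmoid')(flat)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])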
16 https://stackabuse.com/python-for-nlp-word-embeddings-for-deep-learning-in-keras/
Movie Sentiment Analysis
- Importing and Analyzing the Dataset
- Data Preprocessing
- Removing html tags
- Remove punctuations and numbers
- Remove single characters
- Remove multiple spaces
- Preparing the Embedding Layer
- Tokenizer (keras), fit_on_texts
- texts_to_sequences
- pad_sequences
- GloVe embeddings -> feature matrix
- dictionary: words as keys
- create embedding_matrix
- 92547 rows, 100 columns
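Hedged sketch of preparing the embedding layer: tokenize and pad the reviews, read the GloVe file into a word -> vector dictionary, then build the embedding matrix (X_train and the GloVe file name are placeholders; the 92547 x 100 shape is from the notes above):
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)                                     # X_train: list of cleaned reviews
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=100, padding='post')

# dictionary: words as keys, GloVe vectors as values
embeddings_dictionary = {}
with open('glove.6B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        values = line.split()
        embeddings_dictionary[values[0]] = np.asarray(values[1:], dtype='float32')

vocab_size = len(tokenizer.word_index) + 1                          # e.g. 92547
embedding_matrix = np.zeros((vocab_size, 100))                      # 92547 rows, 100 columns
for word, index in tokenizer.word_index.items():
    vector = embeddings_dictionary.get(word)
    if vector is not None:
        embedding_matrix[index] = vector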
Text Classification with Simple Neural Network
Layer (type)              Output Shape       Param #
=====================================================
embedding_1 (Embedding)   (None, 100, 100)   9254700
flatten_1 (Flatten)       (None, 10000)      0
dense_1 (Dense)           (None, 1)          10001
=====================================================
Total params: 9,264,701
Trainable params: 10,001
Non-trainable params: 9,254,700
Convolutional Neural Network
Layer (type)                                  Output Shape       Param #
=========================================================================
embedding_2 (Embedding)                       (None, 100, 100)   9254700
conv1d_1 (Conv1D)                             (None, 96, 128)    64128
global_max_pooling1d_1 (GlobalMaxPooling1D)   (None, 128)        0
dense_2 (Dense)                               (None, 1)          129
=========================================================================
Total params: 9,318,957
Trainable params: 64,257
Non-trainable params: 9,254,700
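A model matching the CNN summary above (hedged sketch; vocab_size, embedding_matrix and maxlen=100 as in the embedding-layer sketch):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=100, trainable=False))   # frozen GloVe weights -> non-trainable params
model.add(Conv1D(128, 5, activation='relu'))               # 100*5*128 + 128 = 64128 params
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
The LSTM variant below simply replaces the Conv1D and pooling layers with LSTM(128).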
Recurrent Neural Network (LSTM)
Layer (type)              Output Shape       Param #
=====================================================
embedding_3 (Embedding)   (None, 100, 100)   9254700
lstm_1 (LSTM)             (None, 128)        117248
dense_3 (Dense)           (None, 1)          129
=====================================================
Total params: 9,372,077
Trainable params: 117,377
Non-trainable params: 9,254,700
17 https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/
Multi-Data-Type Classification
review_names = ['bad', 'average', 'good']
- text labels into integer labels (y): LabelEncoder, fit_transform (sklearn.preprocessing)
- integer labels into one-hot encoded vectors (y): to_categorical
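A small hedged sketch of that label conversion (the example label list is a placeholder):
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

labels = ['good', 'bad', 'average', 'good']

label_encoder = LabelEncoder()
integer_labels = label_encoder.fit_transform(labels)        # classes sorted alphabetically -> [2 1 0 2]
one_hot_labels = to_categorical(integer_labels, num_classes=3)
print(one_hot_labels)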
Creating a Model with Text Inputs Only
- textual data -> numeric form (x)
- word embeddings: previous mentioned
Layer (type)              Output Shape       Param #
=====================================================
input_1 (InputLayer)      (None, 200)        0
embedding_1 (Embedding)   (None, 200, 100)   5572900
lstm_1 (LSTM)             (None, 128)        117248
dense_1 (Dense)           (None, 3)          387
=====================================================
Total params: 5,690,535
Trainable params: 117,635
Non-trainable params: 5,572,900
Creating a Model with Meta Information Only
Layer (type)           Output Shape   Param #
==============================================
input_1 (InputLayer)   (None, 3)      0
dense_1 (Dense)        (None, 10)     40
dense_2 (Dense)        (None, 10)     110
dense_3 (Dense)        (None, 3)      33
==============================================
Total params: 183
Trainable params: 183
Non-trainable params: 0
Creating a Model with Multiple Inputs
Layer (type)                  Output Shape       Param #   Connected to
========================================================================
input_1 (InputLayer)          (None, 200)        0
input_2 (InputLayer)          (None, 3)          0
embedding_1 (Embedding)       (None, 200, 100)   5572900   input_1[0][0]
dense_1 (Dense)               (None, 10)         40        input_2[0][0]
lstm_1 (LSTM)                 (None, 128)        117248    embedding_1[0][0]
dense_2 (Dense)               (None, 10)         110       dense_1[0][0]
concatenate_1 (Concatenate)   (None, 138)        0         lstm_1[0][0]
                                                           dense_2[0][0]
dense_3 (Dense)               (None, 10)         1390      concatenate_1[0][0]
dense_4 (Dense)               (None, 3)          33        dense_3[0][0]
========================================================================
Total params: 5,691,721
Trainable params: 118,821
Non-trainable params: 5,572,900
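Hedged sketch of the two-input model summarized above: a text branch (Embedding + LSTM) and a meta-information branch (Dense layers), concatenated before the softmax output; vocab_size and embedding_matrix are placeholders for this dataset's vocabulary and GloVe matrix:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate

text_input = Input(shape=(200,))
meta_input = Input(shape=(3,))

text_branch = Embedding(vocab_size, 100, weights=[embedding_matrix],
                        trainable=False)(text_input)
text_branch = LSTM(128)(text_branch)

meta_branch = Dense(10, activation='relu')(meta_input)
meta_branch = Dense(10, activation='relu')(meta_branch)

merged = Concatenate()([text_branch, meta_branch])          # 128 + 10 = 138 features
merged = Dense(10, activation='relu')(merged)
output = Dense(3, activation='softmax')(merged)              # one neuron per review class

model = Model(inputs=[text_input, meta_input], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])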

18 https://stackabuse.com/python-for-nlp-creating-multi-data-type-classification-models-with-keras/
Multi-label Text Classification
- comments.shape: (159571,8)
with Single Output Layer
- a single dense output layer with one sigmoid neuron per label (6 neurons)
Layer (type)              Output Shape       Param #
=====================================================
input_1 (InputLayer)      (None, 200)        0
embedding_1 (Embedding)   (None, 200, 100)   14824300
lstm_1 (LSTM)             (None, 128)        117248
dense_1 (Dense)           (None, 6)          774
=====================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300
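Hedged sketch of this single-output-layer variant: 6 independent sigmoid neurons, one per toxicity label, trained with binary cross-entropy (vocab_size and embedding_matrix are placeholders for this dataset):
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

text_input = Input(shape=(200,))
x = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(text_input)
x = LSTM(128)(x)
output = Dense(6, activation='sigmoid')(x)      # one sigmoid per label -> multi-label output

model = Model(inputs=text_input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])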
with Multiple Output Layers
- each output label will have a dedicated output dense layer
Layer (type)              Output Shape       Param #    Connected to
=====================================================================
input_1 (InputLayer)      (None, 200)        0
embedding_1 (Embedding)   (None, 200, 100)   14824300   input_1[0][0]
lstm_1 (LSTM)             (None, 128)        117248     embedding_1[0][0]
dense_1 (Dense)           (None, 1)          129        lstm_1[0][0]
dense_2 (Dense)           (None, 1)          129        lstm_1[0][0]
dense_3 (Dense)           (None, 1)          129        lstm_1[0][0]
dense_4 (Dense)           (None, 1)          129        lstm_1[0][0]
dense_5 (Dense)           (None, 1)          129        lstm_1[0][0]
dense_6 (Dense)           (None, 1)          129        lstm_1[0][0]
=====================================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300
19 https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/
Facebook FastText Library
- Semantic Similarity
- text classification
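A hedged sketch of semantic similarity with gensim's FastText implementation; tokenized_docs is a placeholder for the preprocessed corpus, and older gensim releases use size= instead of vector_size=:
from gensim.models import FastText

model = FastText(sentences=tokenized_docs, vector_size=60, window=40,
                 min_count=5, sg=1)
print(model.wv.most_similar("artificial", topn=5))       # semantically closest words
print(model.wv.similarity("artificial", "intelligence")) # cosine similarity of two words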
20 https://stackabuse.com/python-for-nlp-working-with-facebook-fasttext-library/
Text Generation
- predict the next word given a sequence of input words
- many-to-one sequence problems
- LSTM accepts data in a 3-dimensional format
- three stacked LSTM layers with 800 neurons each + a dense softmax output layer (one neuron per word in the vocabulary)
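Hedged sketch of such a many-to-one model; input_sequences, output_words, seq_length and vocab_size are placeholders for the preprocessed sliding-window data:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

# LSTM expects 3-D input: (samples, time steps, features)
X = np.reshape(input_sequences, (len(input_sequences), seq_length, 1)) / float(vocab_size)
y = to_categorical(output_words, num_classes=vocab_size)

model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))      # probability for every vocabulary word
model.compile(loss='categorical_crossentropy', optimizer='adam')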
21 https://stackabuse.com/python-for-nlp-deep-learning-text-generation-with-keras/
Translation with Seq2Seq
- many-to-many sequence problems
- both inputs and outputs are divided over multiple time-steps
- encoder-decoder architecture
- input to the encoder LSTM = sentence in the original language
- input to the decoder LSTM = sentence in the translated language with a start-of-sentence token
- output = target sentence with an end-of-sentence token
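Minimal encoder-decoder training sketch under those assumptions; the max_*_len, num_*_words and LSTM_NODES names are placeholders for the preprocessed translation data:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

LSTM_NODES = 256

# encoder: source-language sentence -> final hidden/cell states
encoder_inputs = Input(shape=(max_input_len,))
enc_emb = Embedding(num_input_words, 100)(encoder_inputs)
_, state_h, state_c = LSTM(LSTM_NODES, return_state=True)(enc_emb)

# decoder: target sentence shifted with the start-of-sentence token,
# initialized with the encoder states
decoder_inputs = Input(shape=(max_output_len,))
dec_emb = Embedding(num_output_words, 100)(decoder_inputs)
decoder_outputs, _, _ = LSTM(LSTM_NODES, return_sequences=True,
                             return_state=True)(dec_emb, initial_state=[state_h, state_c])

# output: target sentence ending with the end-of-sentence token,
# one softmax over the target vocabulary per time step
decoder_outputs = Dense(num_output_words, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])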
22 https://stackabuse.com/python-for-nlp-neural-machine-translation-with-seq2seq-in-keras/