NLP Study (Usman Malik)


spaCy

  • Tokenization: sentences -> words
  • Detecting Entities
  • Detecting Nouns
  • Lemmatization: been -> be
  • Phrase matching
  • Stop Words
  • Parts of Speech (POS) Tagging
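
A minimal sketch covering most of these operations, assuming the small English model is installed (python -m spacy download en_core_web_sm); the example sentence is mine:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple has been buying U.K. startups for years.")

print([token.text for token in doc])                  # tokenization
print([(ent.text, ent.label_) for ent in doc.ents])   # entity detection
print([chunk.text for chunk in doc.noun_chunks])      # noun (chunk) detection
print([(t.text, t.lemma_, t.pos_) for t in doc])      # lemmas ("been" -> "be") + POS tags
print([t.text for t in doc if t.is_stop])             # stop words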

NLTK

  • Stemming: Snowball Stemmer
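
For example (the sample words are mine):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
# crude suffix stripping, compare with spaCy's lemmatization above
print([stemmer.stem(w) for w in ['running', 'fairly', 'studies']])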

Sentiment Analysis (Scikit-Learn)

  • Dataset import
  • Data Analysis
  • Data Cleaning (remove special chars)
  • Text -> Number: Bag of Words, TF-IDF, Word2Vec
    • TfidfVectorizer, fit_transform
  • Data dividing -> train/test
    • train_test_split
  • Model training
    • RandomForestClassifier, fit
  • Prediction: predict
  • Evaluation: confusion_matrix / F1 measure / accuracy
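
A sketch of the whole pipeline; the file name, column names, and hyperparameters are placeholders, not the article's exact values:

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

df = pd.read_csv('reviews.csv')                       # dataset import
texts = [re.sub(r'\W+', ' ', t) for t in df['text']]  # cleaning: remove special chars

vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
X = vectorizer.fit_transform(texts)                   # text -> numbers (TF-IDF)

X_train, X_test, y_train, y_test = train_test_split(
    X, df['label'], test_size=0.2, random_state=0)    # train/test split

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)                             # model training
y_pred = clf.predict(X_test)                          # prediction

print(confusion_matrix(y_test, y_pred))               # evaluation
print(f1_score(y_test, y_pred, average='macro'))
print(accuracy_score(y_test, y_pred))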

Topic Modeling (Scikit-Learn)

  • unsupervised/clustering
  • similarity
  • Latent Dirichlet Allocation / Non-Negative Matrix Factorization

LDA

  • similar words = same topic
  • frequently occurring together = same topic
  • import data (pandas)
  • dropna
  • text to numbers: 20,000 docs over a 14,546-word vocabulary -> each doc = one 14,546-dimensional vector
    • feature_extraction: CountVectorizer, fit_transform
    • 20000×14546 sparse matrix
  • decomposition: LatentDirichletAllocation, fit
  • retrieve words: get_feature_names (get_feature_names_out in newer scikit-learn)
  • add column (category) to each text
    • LDA.transform(doc_term_matrix)
    • shape: (20000, 5)
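
An end-to-end sketch of these steps; the file and column names are placeholders, and 5 topics matches the (20000, 5) shape noted above:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv('docs.csv').dropna()                       # import + dropna
cv = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = cv.fit_transform(df['text'])              # sparse doc-term matrix

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

# top 10 words per topic (get_feature_names() in older scikit-learn)
words = cv.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(i, [words[j] for j in topic.argsort()[-10:]])

# add a category column: (20000, 5) topic weights -> dominant topic
df['category'] = lda.transform(doc_term_matrix).argmax(axis=1)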

NMF

  • unsupervised learning, like LDA
  • combination with TF-IDF
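
The same pipeline as the LDA sketch above, only with TF-IDF features and NMF (df carries over from that sketch):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
matrix = tfidf.fit_transform(df['text'])   # NMF pairs with TF-IDF, not raw counts

nmf = NMF(n_components=5, random_state=42)
nmf.fit(matrix)
df['category'] = nmf.transform(matrix).argmax(axis=1)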

TextBlob Library

  • Tokenization
  • Lemmatization
  • Parts of Speech (POS) Tagging
  • Convert Text to Singular and Plural
  • Noun Phrase Extraction
  • Getting Words and Phrase Counts
  • Converting to Upper and Lowercase
  • Finding N-Grams
  • Spelling Corrections
  • Language Translation (Google Translate API)
  • Text Classification (simple)
  • Sentiment Analysis
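
A few of the listed operations in one sketch (the example text, with its deliberate typo, is mine):

from textblob import TextBlob, Word

blob = TextBlob("TextBlob is an amazzing library for natural language processing.")
print(blob.words)                     # tokenization
print(blob.tags)                      # POS tagging
print(blob.noun_phrases)              # noun phrase extraction
print(Word('movies').singularize())   # plural -> singular (pluralize() for the reverse)
print(blob.ngrams(n=2))               # n-grams
print(blob.correct())                 # spelling correction
print(blob.sentiment)                 # (polarity, subjectivity)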

Pattern Library

  • NLP
    • Tokenizing, POS Tagging, and Chunking
    • Pluralizing and Singularizing the Tokens
    • Converting Adjective to Comparative and Superlative Degrees
    • Finding N-Grams
    • Finding Sentiments
    • Checking if a Statement is a Fact
    • Spelling Corrections
    • Working with Numbers
  • Data Mining
    • Accessing Web Pages
    • Finding URLs within Text
    • Making Asynchronous Requests for Webpages
    • Getting Search Engine Results with APIs
    • Converting HTML Data to Plain Text
    • Parsing PDF Documents
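
A sketch of part of the NLP half, using pattern.en calls I believe match this list (pattern3 on Python 3); the example inputs are mine:

from pattern.en import pluralize, singularize, comparative, superlative
from pattern.en import suggest, sentiment, parse, Sentence, modality

print(pluralize('apple'), singularize('apples'))   # 'apples', 'apple'
print(comparative('big'), superlative('big'))      # 'bigger', 'biggest'
print(suggest('whitle'))                           # spelling suggestions with confidences
print(sentiment('An excellent movie!'))            # (polarity, subjectivity)
# modality close to 1.0 suggests the statement reads as a fact
print(modality(Sentence(parse('Paris is the capital of France', lemmata=True))))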

Gensim Library

Creating Dictionaries

  • dictionaries that map words to IDs 

Creating Bag of Words Corpus

  • contain the ID of each word along with the frequency of occurrence

Creating TF-IDF Corpus

  • Term Frequency – Inverse Document Frequency
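
A minimal sketch covering all three steps on a toy corpus (the example documents are mine):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

texts = [['human', 'computer', 'interaction'],
         ['graph', 'of', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = Dictionary(texts)                        # word -> ID mapping
print(dictionary.token2id)

bow_corpus = [dictionary.doc2bow(t) for t in texts]   # (word ID, count) pairs per doc
print(bow_corpus[0])

tfidf = TfidfModel(bow_corpus)                        # reweight counts by TF-IDF
print(tfidf[bow_corpus[0]])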

Built-In Gensim Models and Datasets

Topic Modeling with LDA

  • Scraping Wikipedia Articles
  • Data Preprocessing
  • Modeling Topics
    • dictionary + bag of words corpus (doc2bow)
    • LdaModel
  • Visualizing the LDA
    • pyLDAvis
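
A sketch of the modeling step, where processed_docs stands for the cleaned, tokenized Wikipedia articles from the preprocessing step:

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

dictionary = Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]  # bag of words corpus

lda = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
print(lda.print_topics())

# visualization in a notebook (module path is pyLDAvis.gensim in older releases):
# import pyLDAvis, pyLDAvis.gensim_models
# pyLDAvis.display(pyLDAvis.gensim_models.prepare(lda, corpus, dictionary))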

Topic Modeling via Latent Semantic Indexing (LSI)

Rule-Based Chatbot

Idea

  • query -> vectorized form
  • sentences -> vectorized forms
  • highest cosine similarity = response

Implementation

  • Creating the Corpus
  • Text Preprocessing and Helper Function (NLTK)
    • nltk.word_tokenize
    • nltk.stem.WordNetLemmatizer()
  • Responding to Greetings
  • Responding to User Queries (SKLearn)
    • TfidfVectorizer, fit_transform
    • cosine_similarity, flatten
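
The core retrieval step might look like this sketch, where sentences is the preprocessed corpus and the helper name is mine:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def respond(query, sentences):
    # vectorize the corpus and the query together so they share a vocabulary
    tfidf = TfidfVectorizer().fit_transform(sentences + [query])
    # similarity of the query (last row) to every corpus sentence
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
    best = scores.argmax()
    return sentences[best] if scores[best] > 0 else "Sorry, I don't understand."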

Bag of Words

  1. Tokenize the Sentences
  2. Create a Dictionary of Word Frequency
  3. Creating the Bag of Words Model
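
The three steps by hand on a toy corpus (whitespace tokenization for brevity):

from collections import Counter

corpus = ["I like good movies", "I like movies", "movies are good"]

tokenized = [s.lower().split() for s in corpus]           # 1. tokenize
word_freq = Counter(w for s in tokenized for w in s)      # 2. word frequency dictionary
vocab = sorted(word_freq)                                 # fixed column order
bow = [[s.count(w) for w in vocab] for s in tokenized]    # 3. one count vector per sentence
print(vocab)
print(bow)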

Problems with BoW

  • assigns equal value to the words, irrespective of their importance
  • use TF-IDF

Problems with TF-IDF and BoW

  • each word is treated individually
  • the context of a word is not retained
  • use N-Grams

Problems with One-Hot Encoded Feature Vector

  • BoW, TF-IDF, N-Grams = Text data -> numeric feature vectors
  • with half a million unique words, how do we represent a sentence containing 10 words?
  • as a half-million-dimensional one-hot encoded vector in which only 10 indexes are 1
  • this wastes space and increases algorithm complexity

Word Embeddings

  • n-dimensional dense vector
  • techniques: GloVe and Word2Vec
  • Implementing: Keras Sequential Models
  • embedding layer = first layer in the Sequential model
embedding_layer = Embedding(200, 32, input_length=50)
  1. size of the vocabulary / total number of unique words
  2. number of the dimensions for each word vector
  3. length of each input sequence (words per sentence)

Convert Text to Numbers / Vectors (Keras)

  • use one_hot or Tokenizer: fit_on_texts
padded_sentences = pad_sequences(embedded_sentences, length_long_sentence, padding='post')
  1. list of encoded sentences of unequal sizes
  2. maxlen = size of the longest sentence (shorter ones are padded up to it)
  3. post = add padding at the end of sentences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(vocab_length, 20, input_length=length_long_sentence))
model.add(Flatten())  # (7, 20) -> 140 features
model.add(Dense(1, activation='sigmoid'))

Layer (type)             Output Shape   Param #
===============================================
embedding_1 (Embedding)  (None, 7, 20)  1000
_______________________________________________
flatten_1 (Flatten)      (None, 140)    0
_______________________________________________
dense_1 (Dense)          (None, 1)      141
===============================================
Total params: 1,141

  1. vocabulary size 50 × 20-dimensional vectors -> 50 × 20 = 1,000 trainable parameters
  2. the embedding layer outputs 7 words per sentence, each as a 20-dimensional vector (hence Flatten -> 140)

Word Embeddings with Keras Functional API

  • the Sequential API is the simpler option, suited for beginners
  • the Functional API supports advanced architectures with multiple inputs and outputs
  • its first layer is an explicit Input layer

Reference: https://stackabuse.com/python-for-nlp-word-embeddings-for-deep-learning-in-keras/

Movie Sentiment Analysis

  • Importing and Analyzing the Dataset
  • Data Preprocessing
    • Removing HTML tags
    • Removing punctuation and numbers
    • Removing single characters
    • Removing multiple spaces
  • Preparing the Embedding Layer
    • Tokenizer (keras), fit_on_texts
    • texts_to_sequences
    • pad_sequences
    • GloVe embeddings -> feature matrix
      • dictionary: words as keys
    • create embedding_matrix
      • 92547 rows, 100 columns
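
A sketch of the embedding-matrix construction; the GloVe file path is a placeholder and tokenizer is the fitted Keras Tokenizer from the steps above:

import numpy as np

# build the word -> vector dictionary from the GloVe file
embeddings_dictionary = {}
with open('glove.6B.100d.txt', encoding='utf8') as glove_file:
    for line in glove_file:
        records = line.split()
        embeddings_dictionary[records[0]] = np.asarray(records[1:], dtype='float32')

# one row per word in the tokenizer's vocabulary (92547 here), 100 columns
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    vector = embeddings_dictionary.get(word)
    if vector is not None:
        embedding_matrix[index] = vector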

Text Classification with Simple Neural Network

Layer (type)             Output Shape      Param #
===================================================
embedding_1 (Embedding)  (None, 100, 100)  9254700
___________________________________________________
flatten_1 (Flatten)      (None, 10000)     0
___________________________________________________
dense_1 (Dense)          (None, 1)         10001
===================================================
Total params: 9,264,701
Trainable params: 10,001
Non-trainable params: 9,254,700

Convolutional Neural Network

Layer (type)                  Output Shape      Param #
========================================================
embedding_2 (Embedding)       (None, 100, 100)  9254700
________________________________________________________
conv1d_1 (Conv1D)             (None, 96, 128)   64128
________________________________________________________
global_max_pooling1d_1 (Glob  (None, 128)       0
________________________________________________________
dense_2 (Dense)               (None, 1)         129
========================================================
Total params: 9,318,957
Trainable params: 64,257
Non-trainable params: 9,254,700
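
A model matching this summary might look as follows; vocab_size and embedding_matrix come from the embedding-layer preparation above, and freezing the embedding is what makes its 9,254,700 parameters non-trainable:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=100, trainable=False))
model.add(Conv1D(128, 5, activation='relu'))   # (5*100+1)*128 = 64,128 params
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])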

Recurrent Neural Network (LSTM)

Layer (type)             Output Shape      Param #
===================================================
embedding_3 (Embedding)  (None, 100, 100)  9254700
___________________________________________________
lstm_1 (LSTM)            (None, 128)       117248
___________________________________________________
dense_3 (Dense)          (None, 1)         129
===================================================
Total params: 9,372,077
Trainable params: 117,377
Non-trainable params: 9,254,700
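
The matching model only swaps the middle layer; the frozen GloVe embedding again accounts for the non-trainable parameters:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=100, trainable=False))
model.add(LSTM(128))                       # 4*(100+128+1)*128 = 117,248 params
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])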

Reference: https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/

Multi-Data-Type Classification

review_names = ['bad', 'average', 'good']
  • text labels -> integer labels (y)
    • LabelEncoder, fit_transform (sklearn.preprocessing)
  • integer labels -> one-hot encoded vectors (y)
    • to_categorical (keras)
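
For example (toy labels; LabelEncoder assigns integers in alphabetical class order):

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

labels = ['good', 'bad', 'average', 'good']
y = LabelEncoder().fit_transform(labels)   # [2, 1, 0, 2]
y = to_categorical(y)                      # one-hot rows, shape (4, 3)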

Creating a Model with Text Inputs Only

  • textual data -> numeric form (x)
    • word embeddings, as described earlier

Layer (type)             Output Shape      Param #
===================================================
input_1 (InputLayer)     (None, 200)       0
___________________________________________________
embedding_1 (Embedding)  (None, 200, 100)  5572900
___________________________________________________
lstm_1 (LSTM)            (None, 128)       117248
___________________________________________________
dense_1 (Dense)          (None, 3)         387
===================================================
Total params: 5,690,535
Trainable params: 117,635
Non-trainable params: 5,572,900

Creating a Model with Meta Information Only

Layer (type)          Output Shape  Param #
===========================================
input_1 (InputLayer)  (None, 3)     0
___________________________________________
dense_1 (Dense)       (None, 10)    40
___________________________________________
dense_2 (Dense)       (None, 10)    110
___________________________________________
dense_3 (Dense)       (None, 3)     33
===========================================
Total params: 183
Trainable params: 183
Non-trainable params: 0

Creating a Model with Multiple Inputs

Layer (type)                 Output Shape      Param #  Connected to
=====================================================================
input_1 (InputLayer)         (None, 200)       0
input_2 (InputLayer)         (None, 3)         0
_____________________________________________________________________
embedding_1 (Embedding)      (None, 200, 100)  5572900  input_1[0][0]
dense_1 (Dense)              (None, 10)        40       input_2[0][0]
_____________________________________________________________________
lstm_1 (LSTM)                (None, 128)       117248   embedding_1[0][0]
dense_2 (Dense)              (None, 10)        110      dense_1[0][0]
_____________________________________________________________________
concatenate_1 (Concatenate)  (None, 138)       0        lstm_1[0][0]
                                                        dense_2[0][0]
_____________________________________________________________________
dense_3 (Dense)              (None, 10)        1390     concatenate_1[0][0]
_____________________________________________________________________
dense_4 (Dense)              (None, 3)         33       dense_3[0][0]
=====================================================================
Total params: 5,691,721
Trainable params: 118,821
Non-trainable params: 5,572,900
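
A Functional-API sketch matching this summary: the text branch (embedding + LSTM) and the meta branch (two dense layers) are concatenated into one classifier; vocab_size and embedding_matrix come from this dataset's preprocessing:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate

text_input = Input(shape=(200,))
x1 = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(text_input)
x1 = LSTM(128)(x1)

meta_input = Input(shape=(3,))
x2 = Dense(10, activation='relu')(meta_input)
x2 = Dense(10, activation='relu')(x2)

merged = Concatenate()([x1, x2])                 # (None, 128 + 10) = (None, 138)
merged = Dense(10, activation='relu')(merged)
output = Dense(3, activation='softmax')(merged)

model = Model(inputs=[text_input, meta_input], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])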

Reference: https://stackabuse.com/python-for-nlp-creating-multi-data-type-classification-models-with-keras/

Multi-label Text Classification

  • comments.shape: (159571, 8)

with Single Output Layer

  • a single dense layer with one sigmoid neuron per label (6 here) predicts all labels at once

Layer (type)             Output Shape      Param #
===================================================
input_1 (InputLayer)     (None, 200)       0
___________________________________________________
embedding_1 (Embedding)  (None, 200, 100)  14824300
___________________________________________________
lstm_1 (LSTM)            (None, 128)       117248
___________________________________________________
dense_1 (Dense)          (None, 6)         774
===================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300

with Multiple Output Layers

  • each output label will have a dedicated output dense layer

Layer (type)             Output Shape      Param #   Connected to
==================================================================
input_1 (InputLayer)     (None, 200)       0
__________________________________________________________________
embedding_1 (Embedding)  (None, 200, 100)  14824300  input_1[0][0]
__________________________________________________________________
lstm_1 (LSTM)            (None, 128)       117248    embedding_1[0][0]
__________________________________________________________________
dense_1 (Dense)          (None, 1)         129       lstm_1[0][0]
dense_2 (Dense)          (None, 1)         129       lstm_1[0][0]
dense_3 (Dense)          (None, 1)         129       lstm_1[0][0]
dense_4 (Dense)          (None, 1)         129       lstm_1[0][0]
dense_5 (Dense)          (None, 1)         129       lstm_1[0][0]
dense_6 (Dense)          (None, 1)         129       lstm_1[0][0]
==================================================================
Total params: 14,942,322
Trainable params: 118,022
Non-trainable params: 14,824,300
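
A sketch matching this summary: one shared LSTM feeds six independent one-neuron sigmoid outputs (vocab_size and embedding_matrix come from this dataset's preprocessing):

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

inputs = Input(shape=(200,))
x = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(inputs)
x = LSTM(128)(x)
outputs = [Dense(1, activation='sigmoid')(x) for _ in range(6)]  # one per label

model = Model(inputs=inputs, outputs=outputs)
# y must then be passed to fit() as a list of six arrays, one per output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])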

Reference: https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/

Facebook FastText Library

  • Semantic Similarity
  • text classification

Reference: https://stackabuse.com/python-for-nlp-working-with-facebook-fasttext-library/

Text Generation

  • predict the next word given a sequence of input words
  • many-to-one sequence problems
  • LSTM accepts input in a 3-dimensional format: (samples, time-steps, features)
  • three LSTM layers with 800 neurons each + dense layer with 1 neuron
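
A sketch following the layer sizes in these notes; X is a placeholder for the 3-D input of shape (samples, time-steps, 1), and the single output neuron regresses the (normalized) index of the next word:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))   # last LSTM returns only the final step (many-to-one)
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')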

Reference: https://stackabuse.com/python-for-nlp-deep-learning-text-generation-with-keras/

Translation with Seq2Seq

  • many-to-many sequence problems
  • both inputs and outputs are divided over multiple time-steps
  • encoder-decoder architecture
    • input to the encoder LSTM = sentence in the original language
    • input to the decoder LSTM = sentence in the translated language with a start-of-sentence token
    • output = target sentence with an end-of-sentence token
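
A minimal training-time sketch of that encoder-decoder architecture; vocabulary sizes and dimensions are placeholders, and inference needs separate encoder/decoder models built on top of this:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

src_vocab, tgt_vocab, embed_dim, latent_dim = 10000, 12000, 100, 256

# encoder: embed the source sentence, keep only the final LSTM states
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, embed_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# decoder: starts from the encoder states and consumes the target sentence
# prefixed with the start-of-sentence token (teacher forcing)
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
predictions = Dense(tgt_vocab, activation='softmax')(dec_out)  # target ends with <eos>

model = Model([enc_inputs, dec_inputs], predictions)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')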

Reference: https://stackabuse.com/python-for-nlp-neural-machine-translation-with-seq2seq-in-keras/

Search: https://stackabuse.com/search/?q=Python+for+NLP
