Natural Language Processing (NLP) study


Pre-processing

1) Tokenization (word/sent)

tokens = nltk.word_tokenize(string)

Bigrams / Trigrams / Ngrams

tokens_ngrams = list(nltk.ngrams(tokens, 4))  # here n = 4 (4-grams); use 2 for bigrams, 3 for trigrams
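A minimal sketch putting the tokenization and n-gram calls together (the sample text is made up; the 'punkt' tokenizer models need a one-time download):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models

text = "NLP is fun. Studying NLP with NLTK is even more fun."

sentences = nltk.sent_tokenize(text)     # sentence tokenization
tokens = nltk.word_tokenize(text)        # word tokenization

bigrams = list(nltk.ngrams(tokens, 2))   # n = 2 -> bigrams
trigrams = list(nltk.ngrams(tokens, 3))  # n = 3 -> trigrams

print(sentences)
print(tokens)
print(bigrams[:3])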

2) Stemming

Stemming truncates a word down to its stem: “studies,” “studied,” and “studying” are all formed from, and reduce back to, the stem “studi”.

PorterStemmer / SnowballStemmer / etc
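A minimal PorterStemmer sketch, using the example words above just for illustration:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studied", "studying"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to the stem "studi"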

3) Lemmatization

The difference from stemming is that lemmatization finds the dictionary word instead of truncating the original word, e.g. “studies” becomes “study”.

WordNetLemmatizer
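A WordNetLemmatizer sketch for comparison (the 'wordnet' corpus needs a one-time download; the part of speech defaults to noun, so verbs are passed pos="v"):

import nltk
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))            # noun by default -> "study"
print(lemmatizer.lemmatize("studying", pos="v"))  # as a verb       -> "study"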

4) Removing Stopwords

stopwords.words('english')
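A sketch of filtering stopwords out of a token list (the sample tokens are illustrative; the 'stopwords' corpus needs a one-time download):

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download

stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "study", "of", "natural", "language", "processing"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['study', 'natural', 'language', 'processing']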

Part of Speech Tagging (PoS tagging)

Assigns each token its respective part-of-speech tag (noun, verb, adjective, etc.).

tag = nltk.pos_tag(["Studying","Study"])
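A sketch tagging a whole tokenized sentence (the sentence is illustrative; the tagger model needs a one-time download):

import nltk
# nltk.download('averaged_perceptron_tagger')  # one-time download

tokens = nltk.word_tokenize("I am studying natural language processing")
tagged_words = nltk.pos_tag(tokens)
print(tagged_words)  # e.g. [('I', 'PRP'), ('am', 'VBP'), ('studying', 'VBG'), ...]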

Chunking

grammar = "NP : {<DT>?<JJ>*<NN>} "

This grammar extracts noun phrases: an optional determiner, any number of adjectives, followed by a noun.

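A minimal chunking sketch with nltk.RegexpParser, applying the grammar above to an illustrative sentence (assumes the tokenizer and tagger models from the earlier steps are downloaded):

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner, any adjectives, then a noun
chunk_parser = nltk.RegexpParser(grammar)

tagged_words = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumped over the lazy dog"))
tree = chunk_parser.parse(tagged_words)
print(tree)      # noun phrases appear as NP subtrees
# tree.draw()    # optional: open the parse tree in a window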

Chinking

Chinking excludes a part of the text (a sequence of tokens) from the whole text or chunk.

grammar = r""" NP: {<.*>+} 
               }<JJ>+{"""

Here the adjectives are excluded (chinked) from the NP chunk.

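A chinking sketch: chunk everything into one NP first, then chink the adjectives back out (same illustrative sentence as above):

import nltk

grammar = r"""NP: {<.*>+}      # chunk everything into one NP
                  }<JJ>+{      # then chink (exclude) runs of adjectives"""
chink_parser = nltk.RegexpParser(grammar)

tagged_words = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumped over the lazy dog"))
print(chink_parser.parse(tagged_words))  # the adjectives fall outside the NP chunks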

Named Entity Recognition (NER)

1. binary = True: all named entities are labelled simply as NE

N_E_R = nltk.ne_chunk(tagged_words,binary=True)

2. binary = False: named entities are labelled with their type (PERSON, ORGANIZATION, GPE, ...)

N_E_R = nltk.ne_chunk(tagged_words,binary=False)
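A sketch showing both settings on an illustrative sentence (the 'maxent_ne_chunker' and 'words' resources need a one-time download):

import nltk
# nltk.download('maxent_ne_chunker'); nltk.download('words')  # one-time downloads

tagged_words = nltk.pos_tag(nltk.word_tokenize("Mark works at Google in London"))

print(nltk.ne_chunk(tagged_words, binary=True))   # every entity is labelled NE
print(nltk.ne_chunk(tagged_words, binary=False))  # entities labelled PERSON, ORGANIZATION, GPE, ...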

WordNet

lexical database for the English language

different definitions / different meanings / hypernyms (more abstract) and hyponyms (more specific) / synonyms / antonyms / compare semantic similarity

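A sketch of the main WordNet lookups (definitions, hypernyms/hyponyms, synonyms/antonyms, similarity); the chosen words are just examples:

from nltk.corpus import wordnet
# nltk.download('wordnet')  # one-time download

syn = wordnet.synsets("study")[0]   # first sense of "study"
print(syn.definition())             # one of its dictionary definitions
print(syn.hypernyms())              # more abstract concepts
print(syn.hyponyms())               # more specific concepts

# synonyms and antonyms come from the lemmas of each synset
for lemma in wordnet.synsets("good")[0].lemmas():
    print(lemma.name(), [a.name() for a in lemma.antonyms()])

# semantic similarity between two senses
lion = wordnet.synsets("lion")[0]
tiger = wordnet.synsets("tiger")[0]
print(lion.path_similarity(tiger))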

Bag of Words

  • discards the order of occurrence of the words
  • counts the frequency of each word

sklearn CountVectorizer
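A minimal bag-of-words sketch with sklearn's CountVectorizer (the tiny corpus is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love studying NLP",
          "NLP is fun",
          "I love NLP"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # document-term count matrix

print(vectorizer.get_feature_names_out())   # the vocabulary (word order is discarded)
print(bow.toarray())                        # word counts per document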

TF-IDF (Term Frequency — Inverse Document Frequency)

The TF-IDF score shows how important or relevant a word is to a document within a collection of documents.

  • Term Frequency (TF): a score for how frequently the word appears in the current document.
  • Inverse Document Frequency (IDF): a score for how rare the word is across documents.

TF – higher importance: the word appears many times in a document
IDF – lower importance: the word also appears many times in other documents

The higher the TF*IDF score, the rarer, more unique, and more valuable the word is for that document.


sklearn TfidfVectorizer
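The same made-up corpus with TfidfVectorizer: a word that appears in every document ("nlp") gets a lower weight than a rarer one ("studying"):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love studying NLP",
          "NLP is fun",
          "I love NLP"]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())  # (use get_feature_names() on older scikit-learn)
print(scores.toarray())               # per-document TF-IDF weights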

Word Embeddings / Word Vectors

Words that appear in similar contexts will have similar vectors

vectors for “leopard”, “lion”, and “tiger” will be close together, while they’ll be far away from “planet” and “castle”
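One way to inspect word vectors is spaCy (this sketch assumes the en_core_web_lg model, which ships with vectors, has been downloaded; gensim word2vec is another common route):

import spacy

# assumes: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

tokens = nlp("leopard lion tiger planet castle")
lion, tiger, planet = tokens[1], tokens[2], tokens[3]

print(lion.vector.shape)        # a dense vector per word, e.g. (300,)
print(lion.similarity(tiger))   # high: similar contexts -> similar vectors
print(lion.similarity(planet))  # much lower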

Text classification

predicting the sentiment of tweets and movie reviews, as well as classifying email as spam or not

  • Word Embedding Model: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
  • Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.
  • Fully-Connected Model: The interpretation of extracted features in terms of a predictive output.
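A hedged Keras sketch of that embedding -> convolution -> fully-connected pipeline; the vocabulary size, sequence length, and layer sizes are illustrative, not tuned values:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len = 10000, 200   # illustrative values

model = Sequential([
    Input(shape=(max_len,)),                               # padded sequences of word ids
    Embedding(vocab_size, 100),                            # word embedding model
    Conv1D(filters=64, kernel_size=5, activation="relu"),  # convolutional feature extraction
    GlobalMaxPooling1D(),
    Dense(16, activation="relu"),                          # fully-connected interpretation
    Dense(1, activation="sigmoid"),                        # e.g. spam vs not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()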

References

https://pub.towardsai.net/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0

https://towardsdatascience.com/the-beginning-of-natural-language-processing-74cce2545676

https://www.kaggle.com/matleonard/word-vectors

https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing
