Pre-processing
1) Tokenization (word/sent)
tokens = nltk.word_tokenize(string)
Bigrams / Trigrams / N-grams
tokens_ngrams = list(nltk.ngrams(tokens, 4))  # 4-grams here; n=2 gives bigrams, n=3 trigrams
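a minimal runnable sketch of tokenization and n-grams (assumes the NLTK punkt tokenizer data has been downloaded):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models

string = "NLP is fun. Tokenization splits text into units."
print(nltk.sent_tokenize(string))    # ['NLP is fun.', 'Tokenization splits text into units.']
tokens = nltk.word_tokenize(string)  # ['NLP', 'is', 'fun', '.', 'Tokenization', ...]
print(list(nltk.ngrams(tokens, 2)))  # bigrams: [('NLP', 'is'), ('is', 'fun'), ...]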
2) Stemming
reduces inflected forms like “studies,” “studied,” and “studying” to a crude word stem: “studi” (not necessarily a dictionary word)
PorterStemmer / SnowballStemmer / etc
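a quick sketch with PorterStemmer (SnowballStemmer works the same way but takes a language argument):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studied", "studying"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to "studi"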
3) Lemmatization
the difference is that it finds the dictionary form (lemma) instead of truncating the original word, e.g. “studies” → “study”
WordNetLemmatizer
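a sketch with WordNetLemmatizer (assumes the wordnet corpus has been downloaded; the pos argument matters):

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))            # "study" (default pos is noun)
print(lemmatizer.lemmatize("studying", pos="v"))  # "study" (treated as a verb)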
4) Removing Stopwords
stopwords.words('english')
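a minimal filtering sketch (assumes the stopwords corpus has been downloaded):

from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download

stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
print([t for t in tokens if t.lower() not in stop_words])  # ['sample', 'sentence']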
Part of Speech Tagging (PoS tagging)
assigns each token its respective grammatical tag (noun, verb, adjective, …)
tag = nltk.pos_tag(["Studying","Study"])
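for example (assumes the averaged_perceptron_tagger data has been downloaded):

import nltk
# nltk.download('averaged_perceptron_tagger')  # one-time download

tagged = nltk.pos_tag(nltk.word_tokenize("The dog barked loudly"))
print(tagged)  # e.g. [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]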
Chunking

grammar = "NP : {<DT>?<JJ>*<NN>} "
extracts noun phrases: an optional determiner, any number of adjectives, then a noun
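a sketch applying this grammar (reuses the grammar string defined above):

import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("The little yellow dog barked"))
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)  # "The little yellow dog" is grouped into one NP subtree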

Chinking
excludes (chinks) a sequence of tokens from a chunk or from the whole text
grammar = r""" NP: {<.*>+} }<JJ>+{"""
adjectives are excluded
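a sketch applying the chink grammar above (reuses the grammar variable):

import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("The little yellow dog barked"))
print(nltk.RegexpParser(grammar).parse(tagged))  # "little yellow" (JJ JJ) falls outside the NP chunks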

Named Entity Recognition (NER)
1. binary = True – every entity is tagged simply as NE (named entity / not)
N_E_R = nltk.ne_chunk(tagged_words, binary=True)

2. binary = False – entities are tagged by type: PERSON, ORGANIZATION, GPE, …
N_E_R = nltk.ne_chunk(tagged_words, binary=False)
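a fuller sketch (assumes the maxent_ne_chunker and words corpora have been downloaded):

import nltk
# nltk.download('maxent_ne_chunker'); nltk.download('words')  # one-time downloads

tagged_words = nltk.pos_tag(nltk.word_tokenize("Mark works at Google in London"))
print(nltk.ne_chunk(tagged_words, binary=True))   # entities tagged only as NE
print(nltk.ne_chunk(tagged_words, binary=False))  # entities tagged PERSON, ORGANIZATION, GPE, ...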

WordNet
lexical database for the English language
different definitions for different senses / Hypernyms (more abstract) and Hyponyms (more specific) / Synonyms / Antonyms / compare similarity between words
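a sketch of these lookups (assumes the wordnet corpus has been downloaded):

from nltk.corpus import wordnet
# nltk.download('wordnet')  # one-time download

syns = wordnet.synsets("study")               # all senses of "study"
print(syns[0].definition())                   # definition of the first sense
print(syns[0].hypernyms())                    # more abstract concepts
print([l.name() for l in syns[0].lemmas()])   # synonyms within this sense
print(wordnet.synset("good.a.01").lemmas()[0].antonyms())  # [Lemma('bad.a.01.bad')]
lion, tiger = wordnet.synset("lion.n.01"), wordnet.synset("tiger.n.02")
print(lion.path_similarity(tiger))            # similarity score between 0 and 1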

Bag of Words
- discard the order of occurrences of words
- counts the frequency for the words

sklearn CountVectorizer
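a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat sat on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # learned vocabulary (order of occurrence is discarded)
print(X.toarray())                         # per-document word counts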
TF-IDF (Term Frequency — Inverse Document Frequency)
the score shows how important or relevant a word is to a document within a collection of documents
- Term Frequency: is a scoring of the frequency of the word in the current document.
- Inverse Document Frequency: is a scoring of how rare the word is across documents
TF – higher importance: the word appears many times in the document
IDF – lower importance: the word also appears in many other documents
the higher the TF*IDF score, the rarer and more distinctive (valuable) the word is for that document

sklearn TfidfVectorizer
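a minimal sketch; a word like "the" that appears in every document gets the lowest weight:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on a mat",
          "the dog chased a cat",
          "the bird flew away"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # TF-IDF weighted document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))                # in each row, "the" scores lowest; rare words score high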
Word Embeddings / Word Vectors
Words that appear in similar contexts will have similar vectors
vectors for “leopard”, “lion”, and “tiger” will be close together, while they’ll be far away from “planet” and “castle”
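a toy gensim Word2Vec sketch (the corpus here is hypothetical and far too small for meaningful vectors; real embeddings are trained on large corpora or loaded pre-trained):

from gensim.models import Word2Vec

sentences = [["the", "lion", "hunts", "prey"],
             ["the", "tiger", "hunts", "prey"],
             ["the", "planet", "orbits", "the", "sun"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

vec = model.wv["lion"]                       # 50-dimensional word vector
print(model.wv.similarity("lion", "tiger"))  # cosine similarity of the two vectors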
Text classification
e.g. predicting the sentiment of tweets and movie reviews, or classifying email as spam or not; a common deep-learning pipeline stacks the three parts below (sketched after the list)
- Word Embedding Model: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
- Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.
- Fully-Connected Model: The interpretation of extracted features in terms of a predictive output.
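a hedged Keras sketch of that three-part pipeline; vocab_size, max_len, and the layer sizes are illustrative assumptions, not values from the source:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len = 10000, 200  # hypothetical vocabulary size and padded sequence length

model = Sequential([
    Input(shape=(max_len,)),           # integer-encoded, padded documents
    Embedding(vocab_size, 100),        # word embedding model
    Conv1D(32, 8, activation="relu"),  # convolutional feature extraction
    GlobalMaxPooling1D(),
    Dense(10, activation="relu"),      # fully-connected interpretation
    Dense(1, activation="sigmoid"),    # binary output, e.g. spam / not spam
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])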
References
https://towardsdatascience.com/the-beginning-of-natural-language-processing-74cce2545676
https://www.kaggle.com/matleonard/word-vectors
https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing