How Do I Start with NLP Using Python?
The Natural Language Toolkit (NLTK) is one of the most popular libraries for natural language processing (NLP). It is written in Python and has a large community behind it.
NLTK is also easy to learn; it is arguably one of the easiest NLP libraries you will use.
In this NLP tutorial, we will use the Python NLTK library.
Before we install NLTK, note that this tutorial assumes you know some Python basics.
On Windows, Linux, or macOS, you can install NLTK using pip:
$ pip install nltk
You can use NLTK on Python 2.7, 3.4, and 3.5 at the time of writing this post.
To check that NLTK has installed correctly, open a Python terminal and type the following:
import nltk
If the import finishes without an error, you have successfully installed the NLTK library.
Once you’ve installed NLTK, you should install the NLTK packages by running the following code:
import nltk
nltk.download()
This will open the NLTK downloader, where you can choose which packages to install.
You can install all of the packages, since they are small in size. Now let’s start the show.
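If you would rather not open the interactive downloader, a single package can be fetched by name. The sketch below assumes that only the stopwords corpus is needed, since that is the one the word-frequency step in this tutorial relies on; ensure_stopwords is a hypothetical helper name, not part of NLTK.

```python
import nltk


def ensure_stopwords():
    """Download the 'stopwords' corpus only if it is not already available.

    ensure_stopwords is a hypothetical helper name used for illustration.
    """
    try:
        # Raises LookupError when the corpus cannot be found locally
        nltk.data.find('corpora/stopwords')
        return True
    except LookupError:
        # quiet=True suppresses the downloader's progress output
        return nltk.download('stopwords', quiet=True)
```

Calling ensure_stopwords() once before importing the stop words should be enough; it needs network access only the first time.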
Here we will learn how to identify what a web page is about using NLTK in Python.
First, we will grab a web page and analyze its text to see what the page is about.
The urllib module will help us fetch the web page:
import urllib.request

response = urllib.request.urlopen('https://en.wikipedia.org/wiki/SpaceX')
html = response.read()
print(html)
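Note that response.read() returns bytes rather than a string, so the printed output starts with b'...'. Beautiful Soup accepts bytes, but if you want a regular string you can decode it yourself. The following is a minimal sketch of that step; fetch_html, decode_body, the 10-second timeout, and the UTF-8 fallback are all assumptions made for illustration.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode raw response bytes, falling back to UTF-8 if no charset is given."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Fetch a page and return its body as a string (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # The charset declared in the Content-Type header, or None if absent
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Example call (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/SpaceX')
```

The with block closes the connection once the body has been read, which the short version above leaves to the garbage collector.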
It’s pretty clear from the link that the page is about SpaceX. Now let us see whether our code can correctly identify the page’s topic.
We will use Beautiful Soup, a Python library for pulling data out of HTML and XML files, to strip the HTML tags from our web page text. The code below needs both the beautifulsoup4 and html5lib packages to be installed.
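from bs4 import BeautifulSoup

# Parse the downloaded HTML with the html5lib parser
soup = BeautifulSoup(html, 'html5lib')
# separator=' ' keeps words from adjacent tags from running together
text = soup.get_text(separator=' ', strip=True)
print(text)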
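Depending on the Beautiful Soup version and parser, the extracted text may also include the contents of script and style elements, which would pollute the word counts. One way to guard against that is to delete those elements before extracting the text. This is a sketch under a few assumptions: visible_text is a hypothetical helper, and it uses Python's built-in 'html.parser' so that it runs without html5lib.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return page text with script and style contents removed (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Delete every <script> and <style> element together with its contents
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Join the remaining text fragments with single spaces
    return soup.get_text(separator=' ', strip=True)


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello</p><p>SpaceX</p></body></html>"
)
print(visible_text(sample))
```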
Now that we have clean text from the crawled web page, let’s convert the text into tokens.
tokens = text.split()
print(tokens)
Your text is now converted into tokens.
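Splitting on whitespace leaves punctuation attached to words and keeps case differences, so "SpaceX," and "spacex" would be counted as different tokens. NLTK's English stop-word list is also lowercase, so a capitalized "The" would not be filtered out. A possible refinement, sketched here with a hypothetical tokenize helper and a regular expression chosen only for illustration, is to lowercase the text and keep runs of letters and digits:

```python
import re

# Letters/digits, optionally followed by an apostrophe part such as 's
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")


def tokenize(text):
    """Lowercase the text and return its word-like tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("SpaceX launched Falcon 9. SpaceX's rocket landed!"))
```

This pattern only handles ASCII letters and straight apostrophes, so it is a starting point rather than a general-purpose tokenizer.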
Count Word Frequency
NLTK offers a FreqDist class that will do the job for us. We will also remove stop words (a, at, the, for, etc.) from our tokens so that they do not skew the word frequency count. We will then plot the most frequently occurring words to get a clear picture of what the web page is about.
import nltk
from nltk.corpus import stopwords

# Load the English stop-word list once instead of on every loop iteration
sr = stopwords.words('english')
clean_tokens = tokens[:]
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)

# Count how often each remaining token occurs
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

# Plot the 20 most frequent tokens
freq.plot(20, cumulative=False)
Great! The code has correctly identified that the web page is about SpaceX.
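To see the counting step in isolation, here is a small sketch that uses only the standard library. In NLTK 3, FreqDist builds on collections.Counter, so the idea is the same. The top_words helper, the sample sentence, and the tiny STOP_WORDS set are all made up for illustration; the tutorial's real list comes from stopwords.words('english').

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only
STOP_WORDS = {"a", "at", "the", "for", "is", "of", "and"}


def top_words(tokens, stop_words, n=3):
    """Return the n most common non-stop-word tokens with their counts."""
    # Lowercase each token so that "The" and "the" are treated alike
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return Counter(kept).most_common(n)


sample = ("SpaceX is the maker of the Falcon rocket "
          "and SpaceX launched the rocket").split()
print(top_words(sample, STOP_WORDS))
```

Filtering with a set also avoids the repeated list scans that remove() performs, which matters on longer pages.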
It was simple and interesting, right? You can identify the topic of news articles, blogs, and so on in the same way.
I have done my best to make the article simple and interesting for you, and I hope you found it useful too.
You have successfully taken your first step towards NLP; there is an ocean left for you to explore.
Source and courtesy: https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3