
Gentle start to Natural Language Processing

How do I Start with NLP using Python?

Natural Language Toolkit (NLTK) is one of the most popular Python libraries for natural language processing (NLP), and it has a big community behind it.

NLTK is also very easy to learn; it is arguably the easiest NLP library that you’ll use.

In this NLP tutorial, we will use the Python NLTK library.

Before installing NLTK, I assume that you know some Python basics.

Install NLTK

On Windows, Linux, or macOS, you can install NLTK using pip:

$ pip install nltk

You can use NLTK on Python 2.7, 3.4, and 3.5 at the time of writing this post.

Alternatively, you can install it from the source tarball.

To check that NLTK has installed correctly, open a Python terminal and type the following:

import nltk

If the import runs without errors, you’ve successfully installed the NLTK library.

Once you’ve installed NLTK, you should install the NLTK packages by running the following code:

import nltk
nltk.download()

This will show the NLTK downloader, where you can choose which packages to install.

You can install all packages, or only the one this tutorial needs: nltk.download('stopwords') fetches just the stop-word lists. Now let’s start the show.

Here we will learn how to identify what a web page is about using NLTK in Python.

First, we will grab a web page and analyze the text to see what the page is about.

The urllib module will help us fetch the web page:

import urllib.request
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/SpaceX')
html = response.read()
print(html)
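Note that response.read() returns raw bytes, and the call has no timeout. As a minimal sketch, the fetch can be wrapped in a small helper that decodes the bytes into a string. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions of this example, not part of the original code:

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as a decoded string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header; assume UTF-8 if absent
        charset = response.headers.get_content_charset() or 'utf-8'
    # Replace undecodable bytes instead of raising an error
    return raw.decode(charset, errors='replace')


# Example call (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/SpaceX')
```

Beautiful Soup accepts both bytes and strings, so the rest of the tutorial works with either form.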

It’s pretty clear from the link that the page is about SpaceX. Now let us see whether our code can correctly identify the page’s topic.

We will use Beautiful Soup, a Python library for pulling data out of HTML and XML files, to strip the HTML tags from our web page text.

from bs4 import BeautifulSoup
# The 'html5lib' parser must be installed separately (pip install html5lib)
soup = BeautifulSoup(html, 'html5lib')
# separator=' ' keeps words from adjacent tags from running together
text = soup.get_text(separator=' ', strip=True)
print(text)

You will get output somewhat like this:

Now that we have clean text from the crawled web page, let’s convert the text into tokens.

# Split the text on whitespace
tokens = text.split()
print(tokens)

Your text is now converted into tokens.
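Splitting on whitespace leaves punctuation attached to words and keeps "The" and "the" as separate tokens, which can skew a frequency count. A rough alternative, shown here only as a sketch, lowercases the text and extracts word-like runs with a regular expression. The function name simple_tokenize and the ASCII-only pattern are assumptions of this example:

```python
import re


def simple_tokenize(text):
    """Lowercase the text and return runs of ASCII letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(simple_tokenize("SpaceX's Falcon 9 landed. It landed again!"))
# → ["spacex's", 'falcon', '9', 'landed', 'it', 'landed', 'again']
```

For non-English text the character class would need to be widened.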

Count Word Frequency

NLTK offers the FreqDist class, which will do the job for us. We will also remove stop words (a, at, the, for, etc.) from our tokens, since they would otherwise dominate the word frequency count. Finally, we will plot the most frequently occurring words to get a clear picture of what the web page is about.

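from nltk.corpus import stopwords

# Build the stop-word set once; a set lookup is far faster than rescanning a list
sr = set(stopwords.words('english'))
# Keep every token that is not a stop word (the NLTK list is lowercase)
clean_tokens = [token for token in tokens if token.lower() not in sr]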
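The counting step described above can be sketched as follows. FreqDist is a subclass of collections.Counter, so most_common() is available on it. The helper name top_words, the sample tokens, and the fallback to Counter when NLTK is not importable are assumptions of this sketch:

```python
from collections import Counter

try:
    # FreqDist subclasses collections.Counter, so both expose most_common()
    from nltk import FreqDist
except ImportError:
    # Fallback (an assumption of this sketch) so it also runs without NLTK
    FreqDist = Counter


def top_words(clean_tokens, n=20):
    """Return the n most frequent tokens as (token, count) pairs."""
    freq = FreqDist(clean_tokens)
    return freq.most_common(n)


# Hypothetical sample tokens standing in for the real clean_tokens list
sample = ['spacex', 'falcon', 'spacex', 'launch', 'spacex', 'falcon']
for word, count in top_words(sample, n=3):
    print(f'{word}: {count}')
```

With NLTK and matplotlib installed, FreqDist(clean_tokens).plot(20, cumulative=False) draws the graph of the 20 most frequent words.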

Figure: word frequency count output.

Figure: graph of the 20 most frequent words.

Great! The code has correctly identified that the web page is about SpaceX.

It was simple and interesting, right? You can similarly identify the topic of news articles, blogs, etc.

I have done my best to make the article simple and interesting for you, and I hope you found it useful too.

You have successfully taken your first step towards NLP. There is an ocean for you to explore…

Source and courtesy: https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3
