Under the umbrella of data science fields, natural language processing (NLP) is one of the most famous and important subfields. Natural language processing is a computer science field that gives computers the ability to understand human — natural — languages.
Although the field has gained a lot of traction recently, it is — in fact — a field as old as computers themselves. However, the advancement of technology and computing power has led to incredible advancements in NLP.
Now, speech technologies are becoming as prominent as written text technologies. The development of virtual assistants such as Siri, Alexa, and Cortana is evidence of how far the technology has come.
What do you need to know to get into NLP? Do you need a degree in computer science?
The good thing is, you don’t need any degrees to become an NLP specialist. All you need to do is learn and practice some skills and build some projects to prove your knowledge.
Starting in a popular field like NLP can be very overwhelming. The amount of information you can find on the internet can be confusing and sometimes distracting. I decided to write this article because I wanted to write a short, direct guide about how you can get started with NLP after going through the experience myself.
NLP is, at its core, the study of languages. The developer is trying to explain to the computer how to make sense of the complexity of human, spoken languages.
But just because you’re a human who speaks a language doesn’t mean you fully understand its logic. I started with NLP because I was always fascinated with languages, how they were formed, and how they evolved and developed through time.
If you want to have a strong foundation to start with NLP, you will need to be fully aware of the basic logic of the language you’re trying to “teach” the computer. That language doesn’t need to be your native language; in fact, you could even learn a new language while designing an NLP project to analyze it.
Now, I am not saying to get a degree in literature or something. Rather, I am trying to say that understanding how languages solve different problems can be quite useful in designing and analyzing NLP applications. Furthermore, knowledge of cross-linguistic variation can be used to build multilingual NLP applications.
If you are looking for a place to learn linguistics basics for NLP, I recommend starting with this article and this book.
The first step you need to master before you get knee-deep into actual NLP tricks and techniques is string manipulation using your programming languages of choice.
For newcomers — with absolutely no programming background — I recommend starting with Python. It has a lot of support for the various fields of data science, including NLP. However, if you’re coming into NLP with previous knowledge of other programming languages, mastering string manipulation will probably be an easy task.
Often, the “language” you’ll try to analyze and build applications for will be in the form of strings. Even when you build speech recognition applications, the speech is transcribed to text before it is analyzed.
So, understanding how to manipulate strings is an essential first step in your NLP journey.
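To give you a feel for what that looks like, here’s a minimal sketch of everyday string operations in Python (the sentence is just a made-up example):

```python
# Everyday string operations you'll use constantly in NLP preprocessing.
sentence = "  Natural Language Processing is FUN!  "

cleaned = sentence.strip()   # drop surrounding whitespace
lowered = cleaned.lower()    # normalize the case
words = lowered.split()      # naive whitespace tokenization
joined = " ".join(words)     # rebuild the sentence from its words

print(words)   # ['natural', 'language', 'processing', 'is', 'fun!']
print(joined)  # natural language processing is fun!
```

These few methods — `strip`, `lower`, `split`, `join`, plus `replace` and slicing — cover a surprising amount of day-to-day text wrangling.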
Regular expressions are one of the most powerful and efficient text processing techniques. They have their own terminology, conditions, and syntax. Some developers consider regular expressions a mini programming language.
Once you have mastered string manipulation using your programming language’s built-in functions, regular expressions are one step above that. They can help you generalize rules and build more efficient text processing applications.
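As a small taste, here’s a sketch using Python’s built-in `re` module; the email pattern below is deliberately simplified (real-world email patterns are much stricter), and the strings are made-up examples:

```python
import re

text = "Contact us at support@example.com or sales@example.com by Friday."

# Extract every email address (a deliberately simple pattern).
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['support@example.com', 'sales@example.com']

# Collapse any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too   many \t spaces")
print(tidy)    # too many spaces
```

A single pattern replaces what would otherwise be a loop full of conditionals — that generalizing power is the whole appeal.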
This skill doesn’t apply only to NLP projects; it applies to all subfields of data science. In general, your results will only be as good as your input data, so to get accurate results, the data you’re working with needs to be in its best form.
So, although data cleaning is essential for all data science applications, the approach by which you clean the data differs based on the application and the target results.
Often when preparing text to be processed and analyzed, we remove all punctuation, which leaves cleaner word tokens to work with. There are also certain types of words that can be removed from the text for a better analysis, such as stop words.
Cleaning text for NLP consists of mainly three steps:
- Make sure everything is lowercase.
- Remove stopwords (English stopwords).
- Return every word to its original root.
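To make these three steps concrete, here’s a minimal pure-Python sketch; the stop word list is a tiny hand-picked sample (libraries like NLTK ship complete lists per language), and the crude suffix-stripping function is only a rough stand-in for proper stemming or lemmatization:

```python
import string

# Tiny hand-picked stop word list; NLTK ships full lists per language.
STOPWORDS = {"the", "is", "are", "a", "an", "and", "of", "to"}

def crude_root(word):
    """Very rough stand-in for stemming: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    # Step 1: lowercase everything and drop punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Step 2: remove stop words. Step 3: reduce each word to a rough root.
    return [crude_root(w) for w in text.split() if w not in STOPWORDS]

print(clean_text("The cats are cleaning the mats!"))  # ['cat', 'clean', 'mat']
```

In a real project you’d swap `crude_root` for a proper stemmer or lemmatizer, but the shape of the pipeline stays the same.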
We are finally at the NLP portion of the skill set. Once you have a clean dataset, you’re ready to build models and start analyzing your text. But to do that, you need to be familiar with some NLP vocabulary.
There are many terms you will come across during your journey, but let’s start with these five basic ones:
- n-grams: A type of probabilistic language model used to predict the next item in a sequence of words.
- Tokenization: The act of breaking a sentence down into tokens, typically individual words and punctuation marks.
- Stemming: Removing the end of a word to reach its root, for example, cleaning => clean. This doesn’t produce a valid word every time.
- POS tagging: The process of converting a sentence into a list of (word, tag) tuples, where the tag represents the part of speech of that word (is it a verb (v), a noun (n), and so on).
- Lemmatization: Getting the origin of a word using proper grammatical rules.
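Two of these terms, tokenization and n-grams, are easy to sketch in plain Python; the tokenizer below is a naive whitespace splitter, just for illustration (real tokenizers handle punctuation and edge cases):

```python
# Naive whitespace tokenization plus n-gram extraction.
def tokenize(sentence):
    return sentence.lower().split()

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The cat sat on the mat")
print(tokens)             # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngrams(tokens, 2))  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
                          #  ('on', 'the'), ('the', 'mat')]
```

Counting how often each n-gram appears in a corpus is the first step toward the probabilistic language models mentioned above.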
You can learn more about the basics of NLP from Stanford’s NLP group.
Once you have used the basics of NLP to get the corpus dataset (the dataset after cleaning and performing some basic NLP tasks), you will now need to analyze this dataset and extract some useful information.
To analyze them, you will need to use a machine-learning algorithm. While there are many algorithms that you can use, let’s just talk about the two most commonly used ones in NLP applications:
- Clustering algorithms: This type of algorithm is used to find patterns in data, such as the sentiment of the words used, their theme, and their frequency. You can use this type of algorithm to detect false news or inaccurate information.
- Classification algorithms: This type of algorithm is used to sort texts into predefined tags. The best-known application of these algorithms is classifying incoming mail as legitimate (goes to the inbox) or spam.
This is essential and often overlooked knowledge. Whenever you apply a machine learning model to your data, you’ll need to evaluate that model’s results; there is no one model that fits all cases. Some of the metrics used to evaluate NLP models:
- Confusion Matrix.
- F Score.
- ROC Curves (receiver operating characteristic curve).
You can learn more about NLP model evaluation from these lecture materials taught at the University of Massachusetts Amherst.
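To see where these numbers come from, here’s a small pure-Python sketch that computes a binary confusion matrix and the F1 score by hand; the label lists are arbitrary example data:

```python
# Confusion matrix and F1 score computed by hand for a binary task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels (made-up example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # a model's predictions

pairs = list(zip(y_true, y_pred))
tp = pairs.count((1, 1))  # true positives
fp = pairs.count((0, 1))  # false positives
fn = pairs.count((1, 0))  # false negatives
tn = pairs.count((0, 0))  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print((tp, fp, fn, tn), round(f1, 2))  # (3, 1, 1, 3) 0.75
```

The F score balances precision and recall in a single number, which is why it’s so common in NLP benchmarks where plain accuracy can be misleading.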
So, you’ve cleaned your data, applied some models to it, and evaluated these models’ results.
You need better models, higher accuracy, and better results. You can achieve all of that through the use of deep learning. Deep learning is especially useful for tasks that require some non-linearity in their feature space.
Perhaps the most commonly used deep learning technique in NLP is the recurrent neural network (RNN). Luckily, you don’t need to know how to implement this algorithm (or most others) from scratch, because Python deep learning libraries such as Keras provide pre-defined versions of it.
So, what you really need to do is learn how to efficiently use the algorithm by learning about its background, applying it, and understanding its results.
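To get a feel for what makes a network “recurrent,” here’s a toy single-unit RNN cell in plain Python; the weights are arbitrary illustrative numbers, not trained values, and real models use a library like Keras rather than anything hand-rolled like this:

```python
import math

# A single-unit RNN cell: the new hidden state depends on the current input
# AND the previous hidden state, which is what "recurrent" means.
# These weights are arbitrary illustrative numbers, not trained values.
w_x, w_h, bias = 0.5, 0.8, 0.1

def rnn_step(x, h_prev):
    return math.tanh(w_x * x + w_h * h_prev + bias)

h = 0.0  # initial hidden state
for x in [1.0, 0.5, -1.0]:  # a toy input sequence (e.g. embedded tokens)
    h = rnn_step(x, h)

print(h)  # final hidden state: a summary of the whole sequence
```

Because each step feeds the previous hidden state back in, the final value carries information about the whole sequence — exactly the property that makes RNNs a natural fit for text.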
I know I have this step as number 9, but it should go in parallel with all the previous steps. You should always put your knowledge into action; that’s the only way to actually test your knowledge level.
That being said, the more you know, the cooler the applications you can build. Let me give you some ideas you can try out when you get this far!
- Topic modeling application.
- Language identifier.
- Haiku generator.
- Social media monitor.
And so much more. At this point, you have all the knowledge you need to create something amazing, so the sky is the limit.
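As a taste of one of those ideas, here’s a minimal language identifier sketch based on stop word overlap; the word lists are tiny illustrative samples, not complete, and a serious identifier would use character n-gram statistics instead:

```python
# Minimal language identifier: pick the language whose stop words
# overlap most with the input. Word lists are tiny illustrative samples.
STOPWORDS = {
    "english": {"the", "is", "and", "of", "to", "in"},
    "spanish": {"el", "la", "y", "de", "que", "en"},
    "german":  {"der", "die", "und", "ist", "von", "zu"},
}

def identify_language(text):
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(identify_language("la casa de la playa"))       # spanish
print(identify_language("the house is in the city"))  # english
```

Even a crude heuristic like this works surprisingly often, because stop words are the most frequent words in any language.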
All subfields of data science are active research fields. As a data scientist in general, and an NLP specialist in particular, you will need to keep up with the field’s latest developments. The only way to do that is to keep an eye on recently published NLP-related research papers.
Personally, I like to create Google Scholar alerts for new publications on specific topics I am interested in. That way, I get an email whenever something new and exciting pops up.
“We can do anything we want to if we stick to it long enough.” — Helen Keller
NLP is one of the most popular and important subfields of data science. The human desire to teach computers natural languages has been there since the invention of computers. This desire has started to become a reality with the advancement of computing technology in the past decade.
Mastering NLP can be difficult, just because of the overwhelming amount of information available on the web. I hope this article helps you navigate your learning journey and makes it a little bit easier.
Learning a new skill or obtaining new knowledge can be quite challenging, but if you stick to it, keep practicing, and always expand your knowledge base, you will reach your goal.