Python Text Processing With Nltk 2 Cookbook

Python Text Processing With Nltk 2 Cookbook: Over 80 practical recipes for using Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

Download

Introduction

Natural Language Processing is used everywhere—in search engines, spell checkers, mobile phones, computer games, and even in your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing—and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts short the preamble and lets you dive right into the science of text processing with a practical hands-on approach.

Get started off with learning tokenization of text. Receive an overview of WordNet and how to use it. Learn the basics as well as advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

What this book covers
Chapter 1, Tokenizing Text and WordNet Basics, covers the basics of tokenizing text and using WordNet.

Chapter 2, Replacing and Correcting Words, discusses various word replacement and correction techniques. The recipes cover the gamut of linguistic compression, spelling correction, and text normalization.

Chapter 3, Creating Custom Corpora, covers how to use corpus readers and create custom corpora. At the same time, it explains how to use the existing corpus data that comes with NLTK.

Chapter 4, Part-of-Speech Tagging, explains the process of converting a sentence, in the form of a list of words, into a list of tuples. It also explains taggers, which are trainable.

Chapter 5, Extracting Chunks, explains the process of extracting short phrases from a part-of-speech tagged sentence. It uses Penn Treebank corpus for basic training and testing chunk extraction, and the CoNLL 2000 corpus as it has a simpler and more flexible format that supports multiple chunk types.

Chapter 6, Transforming Chunks and Trees, shows you how to do various transforms on both chunks and trees. The functions detailed in these recipes modify data, as opposed to learning from it.

Chapter 7, Text Classification, describes a way to categorize documents or pieces of text and, by examining the word usage in a piece of text, classifiers decide what class label should be assigned to it.

Chapter 8, Distributed Processing and Handling Large Datasets, discusses how to use execnet to do parallel and distributed processing with NLTK. It also explains how to use the Redis data structure server/database to store frequency distributions.

Chapter 9, Parsing Specific Data, covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Appendix, Penn Treebank Part-of-Speech Tags, lists a table of all the part-of-speech tags that occur in the treebank corpus distributed with NLTK.

What you need for this book 
In the course of this book, you will need the following software utilities to try out various code examples listed:
• NLTK
• MongoDB
• PyMongo
• Redis
• redis-py
• execnet
• Enchant
• PyEnchant
• PyYAML
• dateutil
• chardet
• BeautifulSoup
• lxml
• SimpleParse
• mxBase
• lockfile

Who this book is for
This book is for Python programmers who want to quickly get to grips with using the NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced in the NLTK will find it useful. Students of linguistics will find it invaluable.
Share This