We recently worked on a project with Zeit Online that analyzes the frequency of words in speeches held in the Bundestag (the German parliament). For research purposes we built a tool that counts words using NLP techniques. The tool removes stop words and reduces each word to a common base form (lemmatization) before actually counting. It is open source and you can try it here:
How does it work?
Python was the language of choice because it is one of the most widely used languages for NLP, largely thanks to its ecosystem of stable and mature libraries, such as:
- NLTK: a widely adopted toolkit for natural language processing
- spaCy: a complete, deep-learning-powered library
- TextBlob: a very simple API for NLP operations
We decided to start from the ground up, which is why we chose to try out what is possible using the NLTK library.
Our word-counting tool performs the following operations (sketched in the example below):
- remove stop words
- lemmatize the remaining words, i.e. reduce them to a common base form
- count the resulting words
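To give an idea of what such a pipeline looks like with NLTK, here is a minimal sketch. It is illustrative only, not the tool's actual code: the `count_words` function name, the English defaults, and the use of the WordNet lemmatizer are assumptions for the example (the real tool processes German speeches and may handle lemmatization differently).

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK data packages
# ("punkt_tab" is used by newer NLTK releases, "punkt" by older ones).
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)


def count_words(text: str, language: str = "english") -> Counter:
    """Tokenize, drop stop words, lemmatize, then count the remaining words."""
    stop_words = set(stopwords.words(language))
    lemmatizer = WordNetLemmatizer()

    tokens = word_tokenize(text.lower(), language=language)
    words = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token.isalpha() and token not in stop_words
    ]
    return Counter(words)


if __name__ == "__main__":
    sample = "The speakers debated the budgets and the budget was debated again."
    print(count_words(sample).most_common(5))
```

Lemmatization is what maps "budgets" and "budget" to the same base form here, so both occurrences end up in a single count.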
The tool can be used over HTTP thanks to a small Flask API server.
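A minimal sketch of such an endpoint could look like the following. The `/count` route, the JSON request format, and the plain `split()` stand-in for the counting pipeline are assumptions for the example, not the tool's actual API.

```python
from collections import Counter

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/count", methods=["POST"])  # hypothetical route; the real API may differ
def count():
    # Expect a JSON body like {"text": "..."} (assumed request format).
    payload = request.get_json(force=True)
    text = payload.get("text", "")

    # The NLTK pipeline sketched above would plug in here;
    # a plain whitespace split keeps this example self-contained.
    counts = Counter(text.lower().split())
    return jsonify(counts.most_common(50))


if __name__ == "__main__":
    app.run(port=5000)
```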
View the full source for the frontend here and for the backend here.