Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the informational content of textual sources for various practical purposes. Detailed analysis of text data requires understanding natural language, which is known to be a difficult task for computers. However, a number of statistical approaches have been shown to work well for "shallow" but robust analysis of text data for pattern finding and knowledge discovery.
So What Have I Done?
In the summer of 2015 I finally finished reading the last part of the Harry Potter series again. It occurred to me that I could do some kind of text analysis on it, this time with the NLTK package in Python, to see how much easier it becomes. If you haven't read the books already, please do. I have used the book numbers in all the graphs and results to keep things clean; to check which book is which number (duh), click on the link. For the graphs I have used plotly, which I find is a good way of making interactive graphs for web browsers and is easily integrable with Python.
Feel free to send me your comments and suggestions via the email at the bottom of the page.
Getting and Cleaning Data
The data, i.e. the text obtained, comes fairly furnished with fonts and other details like punctuation and upper/lower case differences. In order to do a complete statistical analysis we will have to convert it to lovely lowercase, easy-to-analyse text.
My data was in Word format. To extract the text and convert it into txt format, an easy way on GNU/Linux is to run this in a shell:
for f in *.doc; do antiword "$f" > "${f%.doc}.txt"; done
If your file is in pdf format, I would suggest:
for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk "$f"; done
After doing this, you can simply access the text using the open command in Python.
As I said earlier, it is important to free the text of whitespace, punctuation marks, and stopwords. Some of these stopwords are:
a, about, above, after, again, against, all, am, an... (for more, click here). Words like these don't give much information and only mess with the analytics algorithms, as they are very frequent. To strip and split the text, convert it to lower case, and write out the cleaned result ::
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
punctuation = ".,!?;:'\"()-"

fh = open("text.txt")
fh2 = open("text_mod_alp.txt", 'w')
for line in fh:
    for item in line.rstrip().split():
        # lowercase, then strip punctuation clinging to either end of the word
        item = item.lower().strip(punctuation)
        if item and item.isalpha() and item not in stop_words:
            # word without any problem: write it out, space-separated
            fh2.write(item + ' ')
fh.close()
fh2.close()
Some Types of Analysis
This is kind of like an index of the number of "new" words used by the author.
count = {}
total = 0
for item in word:
    if item not in count:
        count[item] = 1
    else:
        count[item] = count[item] + 1
for key in count:
    total += count[key]

An example could be ::
Book 1::
Total number of words are :: 61874
Total number of unique words are :: 7411
Index of Uniqueness is 0.12
The Index of Uniqueness is simply the ratio of the number of unique words to the overall number of words. Similarly for Book 2::
Total number of words are :: 68920
Total number of unique words are :: 9055
Index of Uniqueness is 0.1314
And so on...
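Computing the index itself is a one-liner once the text is cleaned; here is a minimal sketch (the word list below is a made-up stand-in, not the actual book text):

```python
def uniqueness_index(words):
    # ratio of unique words to the overall number of words
    return len(set(words)) / len(words)

# toy word list, not real book data: 5 unique words out of 8
sample = "the boy who lived the boy the cupboard".split()
print(uniqueness_index(sample))
```

Fed the cleaned word list from the snippet above, this reproduces the numbers in the examples.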
In fact, it's better if I show you the graph depicting all these indices, with the x axis being the book number. (Go ahead and click on it for a more interactive graph.)
One thing to observe here is the decrease in the uniqueness of words from book 2 to book 5, which may basically be a result of few new characters being introduced.
Characters Talked About
We can also figure out the number of times each character is talked about, so as to get a basic idea of their importance relative to one another as well as over the series as a whole. The snippet for this will look something like ::
count = {}
for item in word:
    if item not in count:
        count[item] = 1
    else:
        count[item] = count[item] + 1
An example would be
The above figure is the histogram for book 1, so it gives you a basic understanding of which characters are of more importance in book 1.
But why plot graphs for individual books only, when a better and clearer picture is available from all the books arranged in order? Something like this ::
Here, the index is basically the relative number of times a character is talked about. Well, I have read the books, so I am not going to give you much of a spoiler if you haven't read them already (which you should), but if you magnify certain parts of the graph (click on it or here; it will open a plotly page, then magnify using the icon at the top right of the page), you will see very clear trends around some major plot shifts.
Another thing I can do is figure out the top 30 most frequent words in one book. This sort of gives an idea of all the important things talked about in the book. An example of this might be
The following is a wordcloud that I made in Python for book 1 using this link. Similar wordclouds can be made for all the other books as well.
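For the frequency ranking itself, Python's collections.Counter does the heavy lifting; a sketch with a toy word list (for a real book you would pass the cleaned word list and ask for 30):

```python
from collections import Counter

# toy word list standing in for a cleaned book text
words = "wand harry wand broom harry wand snitch".split()
top = Counter(words).most_common(3)
print(top)  # most frequent words first
```

Counter.most_common(n) returns (word, count) pairs sorted by descending count, which is exactly the shape needed for the histogram and the wordcloud input.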
Correlation Between Characters
Another thing that I later tried was to split the original text into individual sentences. From these I tried to find the probability of two characters appearing in the same sentence. I get that heavy use of pronouns will disturb this algorithm, but if all the characters have roughly equal relative pronoun usage (relative, because Harry, from the earlier observations, will obviously have the most), then it can be used as an effective way to find how close-knit the relationship between two characters is. For example, for book 1 ::
cr of Harry and Hermione is 0.89
cr of Harry and Ron is 2.615
cr of Ron and Hermione is 3.06
Here cr is a correlation index that I made up. So we find that Ron and Hermione are more closely knit in the conversational part of the book. I am still working on finding the best way to plot this to give a better sense of what is happening; I will keep you updated.
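For anyone curious, here is a minimal sketch of sentence-level co-occurrence counting. It uses a crude split on punctuation and plain substring matching, and it is not necessarily the exact formula behind my cr index above:

```python
import re

def cooccurrence(text, name_a, name_b):
    # crude sentence split on ., ! and ? (a real sentence tokenizer would do better)
    sentences = [s for s in re.split(r'[.!?]+', text.lower()) if s.strip()]
    # substring matching: counts pronoun-free mentions only, as discussed above
    together = sum(1 for s in sentences if name_a in s and name_b in s)
    return together / len(sentences)

# toy text, not the actual book: 1 of 3 sentences has both names
text = "Harry met Ron. Hermione read. Harry and Hermione left!"
print(cooccurrence(text, "harry", "hermione"))
```

This gives the fraction of sentences in which both names appear, which is one simple way to score how close-knit two characters are.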