Saturday, 24 August 2013

Can't use NLTK on text pulled from the Silmarillion

I'm trying to use Tolkien's Silmarillion as a practice text for learning
some NLP with NLTK.
I am having trouble getting started because I'm running into text encoding
issues.
The text is available at
http://www.ae-lib.org.ua/texts-c/tolkien_the_silmarillion__en.htm
I'm using the TextBlob wrapper around NLTK because it's a lot easier.
TextBlob is available at: https://github.com/sloria/TextBlob
The sentence that I can't parse is "But Húrin did not answer, and they sat
beside the stone, and did not speak again". I believe the special
character (ú) in "Húrin" is causing the issue.
My code:
from text.blob import TextBlob
b = TextBlob('But Húrin did not answer, and they sat beside the stone, and did not speak again')
b.noun_phrases
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
As this is just a for-fun project, I just want to be able to use this text,
extract some attributes, and do some basic processing.
How can I convert this text to ascii when I don't know what the initial
encoding is? I tried to decode from UTF8, then re-encode into ascii:
>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
But even that doesn't work. Any suggestions are appreciated -- I'm fine
with losing the special characters, as long as it's done consistently
across the document.
I'm using Python 2.6.8, with the required modules installed correctly.
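
For reference, here is a minimal sketch of the decode-then-encode idea, assuming the source bytes are UTF-8 (the 0xc3 byte in the traceback is consistent with that, since "ú" is 0xc3 0xba in UTF-8). Note that the attempt above calls encode() on the original unicode_text rather than on the decoded result, so the decode step is never actually used. The helper name to_ascii is hypothetical, and dropping accents via NFKD normalization is just one possible approach, not necessarily the best one for this text.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(raw_bytes, encoding='utf-8'):
    """Decode bytes, then drop accents and any other non-ASCII characters.

    `encoding` is an assumption: UTF-8 is guessed from the 0xc3 byte
    reported in the traceback, not confirmed from the page itself.
    """
    text = raw_bytes.decode(encoding)
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops every character that ASCII cannot represent,
    # which removes the combining marks and keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


raw = b'But H\xc3\xbarin did not answer'
print(to_ascii(raw))
```

Applied to the whole document once, right after reading it in, this would strip the accents consistently before the text is handed to TextBlob. Characters with no ASCII base letter are dropped entirely rather than replaced.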
