Busse: Can't use NLTK on text pulled from the Silmarillion

I'm trying to use Tolkien's Silmarillion as a practice text for learning
some NLP with NLTK.
I am having trouble getting started because I'm running into text encoding
issues.
The text is available at
http://www.ae-lib.org.ua/texts-c/tolkien_the_silmarillion__en.htm
I'm using the TextBlob wrapper around NLTK because it's a lot easier.
TextBlob is available at: https://github.com/sloria/TextBlob
The sentence that I can't parse is "But Húrin did not answer, and they sat
beside the stone, and did not speak again". I believe the special
character in "Húrin" is causing the issue.
My code:
from text.blob import TextBlob
b = TextBlob('But Húrin did not answer, and they sat beside the stone, and did not speak again')
b.noun_phrases
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
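The error can probably be reproduced without TextBlob at all. The sketch below assumes the text reaches Python as UTF-8 encoded bytes (an assumption, although 0xc3 is the first byte of the UTF-8 encoding of ú); the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Assumption: the text arrives as UTF-8 encoded bytes.
word = b'H\xc3\xbarin'  # "Húrin" as UTF-8; the byte at index 1 is 0xc3

try:
    # An implicit conversion with the 'ascii' codec attempts this decode.
    word.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte ASCII cannot decode

print(failed_at)  # 1, matching "position 1" in the traceback

# Decoding with the matching codec succeeds and yields five characters.
decoded = word.decode('utf-8')
print(len(decoded))  # 5
```

If this matches what happens inside the library, the fix is to hand it text that has already been decoded, or text that contains only ASCII.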
As this is just a for-fun project, I just want to be able to use this text,
extract some attributes, and do some basic processing.
How can I convert this text to ascii when I don't know what the initial
encoding is? I tried to decode from UTF8, then re-encode into ascii:
>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
But even that doesn't work. Any suggestions are appreciated -- I'm fine
with losing the special characters, as long as it's done consistently
across the document.
I'm using Python 2.6.8, and the required modules are correctly installed.
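One approach that may fit the "fine with losing the special characters" requirement is to decode the bytes to Unicode first, split accented letters into a base letter plus combining marks with the standard `unicodedata` module, and then drop everything that has no ASCII equivalent. This is a minimal sketch, assuming the source bytes are UTF-8 (not confirmed for this page); `to_ascii` is a hypothetical helper name, not part of NLTK or TextBlob.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(raw_bytes, encoding='utf-8'):
    """Return an ASCII-only approximation of encoded text.

    Assumption: raw_bytes really is encoded with `encoding`.
    """
    # Step 1: bytes -> Unicode text, using the (assumed) source encoding.
    text = raw_bytes.decode(encoding)
    # Step 2: NFKD splits "ú" into "u" plus a combining acute accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 3: 'ignore' silently drops the combining marks and any other
    # character that ASCII cannot represent.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


raw = b'But H\xc3\xbarin did not answer'
print(to_ascii(raw))  # But Hurin did not answer
```

Characters without a decomposition (for example ð or þ) are removed entirely rather than transliterated, so the result is consistent across the document but lossy.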

Busse

Saturday, 24 August 2013
