So, I was futzing around with a little Python blog reader the other day and I realized that it was perfect, but... with the “but” in this case being that its output was littered with undecoded HTML entities such as &ldquo; and &rdquo;. Not a problem, right? Just toss the thing through htmlentitiesdecode() or something and be on my way? I should have known better. Going the other way was a simple enough matter, one line of code that looks like this:
"my unicode string".encode('ascii', 'xmlcharrefreplace')
However, finding that line was a pain in the neck, with a Google search filled with people implementing their own encoding. I was not completely surprised, then, to find people doing the same thing the other way. Checking the docs at python.org, I found there was a String.decode() function, but it didn’t reverse the above. In 2003 someone wrote one that did work in reverse, but the bug was closed as “won’t merge”. I think that may have slipped through the cracks a bit, but still, people are doing this, right? There was nothing in the cookbook, and a look at Pears was no help: they just use a wxHtml control and let it deal with it. In despair I started to code up my own codec.
The answer, it turns out, is Beautiful Soup. Stephen Laniel should get “I told you so” rights on this, except his server was down when I clicked through. The final code, then:
decodedString=unicode(BeautifulStoneSoup(encodedString,convertEntities=BeautifulStoneSoup.HTML_ENTITIES ))
As simple as Python, but why so hard to find? I guess for a language that is usually so easy to work with I should show some tolerance when it’s as hard as other languages, not that I’m planning to.
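For anyone reading this on a newer interpreter: Python 3.4 and later ship html.unescape in the standard library, which covers the same ground without a third-party parser. A minimal sketch, assuming Python 3.4+ and an invented sample string:

```python
import html

# Sample input is invented; it mixes named, decimal and hex references.
encoded_string = "&ldquo;caf&#233;&rdquo; &#xf2;"

# html.unescape handles named entities and numeric character references.
decoded_string = html.unescape(encoded_string)

print(decoded_string)
```

This is not a drop-in replacement for the Beautiful Soup line above, which targets Python 2, but it does the same job for plain strings.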
- CF
November 13, 2007 at 3:11 pm
Thanks, just what I needed.
December 7, 2007 at 9:02 am
Exactly what I needed! Thanks.
March 2, 2008 at 7:43 pm
Excellent, and I was just looking through the BeautifulSoup docs in vain before I found your page!
March 6, 2008 at 4:11 pm
Same comment as the previous three guys:
Excellent! This was really really helpful.
May 2, 2008 at 3:51 pm
Exactly my thoughts. Rolling your own is prone to bugginess…
Thanks for the Beautiful Soup tip. The patch shouldn’t have been killed.
May 20, 2008 at 2:05 am
This should be in the python universal FAQ.
May 27, 2008 at 5:49 pm
Wicked! Thanks so much.
July 15, 2008 at 9:45 pm
Crikey! This should be a lot easier to do… but thanks for pointing out BeautifulSoup. I was already using it elsewhere but didn’t think to pull it in for this problem. Awesome!
July 30, 2008 at 9:56 pm
Another great solution that does not even require any external modules can
be found at: http://effbot.org/zone/re-sub.htm#unescape-html
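A rough sketch of what a regex-based, standard-library-only approach can look like. This is written for Python 3 and is not the code from the linked page; the function name unescape_entities is made up for illustration:

```python
import re
from html.entities import name2codepoint


def unescape_entities(text):
    """Replace named entities and numeric character references in text."""

    def fixup(match):
        ref = match.group(0)
        try:
            if ref[:3].lower() == "&#x":
                # Hexadecimal character reference, e.g. &#xf2;
                return chr(int(ref[3:-1], 16))
            if ref.startswith("&#"):
                # Decimal character reference, e.g. &#242;
                return chr(int(ref[2:-1]))
            # Named entity, e.g. &ldquo;
            return chr(name2codepoint[ref[1:-1]])
        except (ValueError, KeyError, OverflowError):
            # Unknown or malformed reference: leave it untouched.
            return ref

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape_entities("&ldquo;hi&rdquo; &#242; &#xf2; &bogus;"))
```

Leaving unrecognised references alone, rather than raising, keeps the function safe to run over text that merely contains an ampersand and a semicolon.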
October 7, 2008 at 2:57 am
Looks like Beautiful Soup is a solid library, but it seems like overkill for my usage, unfortunately. In my sub-300 line script, I’m trying to decode 160 character strings, 20 at a time, so using this whole library seems wrong… maybe I’ll do something like rabio suggests?
October 14, 2008 at 1:15 pm
I just wanna join in on the praise of the article, it really helped me out too.
BeautifulSoup is a hell of a library, and it has saved me more times than I want to remember.
On the other hand, why is encoding and decoding HTML / XML entities so “hard” in Python?
November 4, 2008 at 10:35 pm
I did what rabio suggested and it works perfectly! Thanks a lot!
November 22, 2008 at 8:26 am
“Beautiful Soup uses a class called UnicodeDammit to detect the encodings of documents you give it and convert them to Unicode, no matter what. If you need to do this for other documents (without using Beautiful Soup to parse them), you can use UnicodeDammit by itself.”
http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
I haven’t tried using the class by itself, but it sounds like that could be the solution to the not-needing-the-whole-library woes.
March 31, 2009 at 2:46 pm
Thanks a million! Yeah, amazing they slipped on this one.
September 15, 2009 at 2:46 am
Thanks ciemaar for this blog post!
October 2, 2009 at 3:27 pm
Though it’s a helpful piece of advice, I’m in a trickier situation.
I have to convert HTML entities IN HEX to Unicode, and BeautifulSoup complains with an ‘invalid literal for int() with base 10’. The entities I face are of the type {&#x~hex number here~;} (e.g. ‘&#xf2;’ for ‘&#242;’/‘ò’). Does anyone know something about this?
October 2, 2009 at 3:51 pm
Solved:
hex entity to decimal entity:
## IF you have encodedStringHex
## DO(in Python):
encodedStringDecimal=str('&#%i;'%int('0%s'%encodedStringHex[2:-1],16))
##THEN the solution posted above will work
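The same hex-to-decimal idea can be applied to every hex reference inside a longer string, not just to a single entity. A sketch, assuming Python 3; the helper name hex_refs_to_decimal is hypothetical:

```python
import re


def hex_refs_to_decimal(text):
    """Rewrite hex character references (&#xf2;) as decimal ones (&#242;)."""

    def to_decimal(match):
        # group(1) holds only the hex digits, without the "&#x" and ";".
        return "&#%d;" % int(match.group(1), 16)

    return re.sub(r"&#[xX]([0-9a-fA-F]+);", to_decimal, text)


print(hex_refs_to_decimal("a &#xf2; b &#XF2; &#242;"))  # a &#242; b &#242; &#242;
```

Decimal references and named entities pass through unchanged, so the result can still be handed to whatever decoder you were already using.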
October 2, 2009 at 4:03 pm
import re
from BeautifulSoup import BeautifulStoneSoup  # Python 2, Beautiful Soup 3

def HTMLtoUni(entity):
    # Only plain str input is handled; anything else returns None.
    if not isinstance(entity, str):
        return
    # Input that is not exactly one HTML entity also returns None.
    if not re.match(r'&[#a-z0-9]+;$', entity):
        return
    if re.match(r'&#x[a-f0-9]+;$', entity):
        # Convert a hex entity (&#xf2;) to its decimal form (&#242;).
        entity = '&#%i;' % int('0%s' % entity[2:-1], 16)
    return unicode(BeautifulStoneSoup(entity, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))