How to convert html entities to “real” unicode in Python

July 4, 2007

So, I was futzing around with a little Python blog reader the other day and realized that it was perfect, but… with the “but” in this case being that its output was littered with &#8220; and &#8221; entities. Not a problem, right? Just toss the thing through htmlentitiesdecode() or something and be on my way? I should have known better. Going the other way was a simple enough matter, one line of code that looks like this:

"my unicode string".encode('ascii', 'xmlcharrefreplace')

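For reference, here is that one-liner as a small runnable sketch. It assumes Python 3, where every str is already Unicode and encode() returns bytes; the sample string is made up for illustration.

```python
# Assumes Python 3: str is Unicode and encode() returns bytes.
text = "\u201cquoted\u201d caf\u00e9"  # curly-quoted word plus an accented letter

# Characters outside ASCII are replaced by numeric character references.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'&#8220;quoted&#8221; caf&#233;'
```
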
However, finding that line was a pain in the neck, with a Google search filled with people implementing their own encoding. I was not completely surprised, then, to find people doing the same thing the other way. Checking the docs at python.org, I found there was a String.decode() function, but it didn’t reverse the above. In 2003 someone wrote one that did work in reverse, but the bug was killed with a won’t merge. I think that may have slipped through the cracks a bit, but still, people are doing this, right? Nothing in the cookbook, and a look at Pears was no help; they just use a wxHtml control and let it deal with it. In despair I started to code up my own codec.

The answer, it turns out, is Beautiful Soup. Stephen Laniel should get “I told you so” rights on this, except his server was down when I clicked it up. The final code, then:

 from BeautifulSoup import BeautifulStoneSoup
 decodedString = unicode(BeautifulStoneSoup(encodedString, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))

as simple as Python, but why so hard to find? I guess for a language that usually is so easy to work with, I should show some tolerance when it’s as hard as other languages, not that I’m planning to.
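
For readers on a newer interpreter, the same conversion can be sketched with the standard library alone. This assumes Python 3.4 or later, where html.unescape is available; it is not what the post used in 2007, and the sample string is invented for illustration.

```python
import html

# Decimal, named, and hex entities mixed in one string.
encoded_string = "&#8220;Hello&#8221; &amp; caf&eacute; &#xf2;"

# html.unescape converts all three kinds to the characters they stand for.
decoded_string = html.unescape(encoded_string)
print(decoded_string)  # “Hello” & café ò
```
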

– CF

23 Responses to “How to convert html entities to “real” unicode in Python”

  1. Mike Says:

    Thanks, just what I needed.

  2. Graham King Says:

    Exactly what I needed! Thanks.


  3. Excellent, and I was just looking through the BeautifulSoup docs in vain before I found your page!

  4. Roman Says:

    Same comment as the previous three guys:
    Excellent! This was really really helpful.

  5. In from Google Says:

    Exactly my thoughts. Rolling your own is prone to bugginess…

    Thanks for the Beautiful Soup tip. The patch shouldn’t have been killed.

  6. Bill Says:

    This should be in the python universal FAQ.


  7. Wicked! Thanks so much.

  8. kpw Says:

    Crikey! This should be a lot easier to do… but thanks for pointing out BeautifulSoup. I was already using it elsewhere but didn’t think to pull it in for this problem. Awesome!

  9. rabio Says:

    Another great solution that does not even require any external modules can
    be found at: http://effbot.org/zone/re-sub.htm#unescape-html

  10. akahn Says:

    Looks like Beautiful Soup is a solid library, but it seems like overkill for my usage, unfortunately. In my sub-300 line script, I’m trying to decode 160 character strings, 20 at a time, so using this whole library seems wrong… maybe I’ll do something like rabio suggests?

  11. Tørbjorn Says:

    I just wanna join in on the praise of the article, it really helped me out too.
    BeautifulSoup is a hell of a library, and it has saved me more times than I want to remember.

    On the other hand, why is encoding and decoding HTML / XML entities so “hard” in Python?

  12. M Says:

    I did what rabio suggested and it works perfectly! Thanks a lot!

  13. Ian Young Says:

    “Beautiful Soup uses a class called UnicodeDammit to detect the encodings of documents you give it and convert them to Unicode, no matter what. If you need to do this for other documents (without using Beautiful Soup to parse them), you can use UnicodeDammit by itself.”

    http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit

    I haven’t tried using the class by itself, but it sounds like that could be the solution to the not-needing-the-whole-library woes.

  14. Jonas Byström Says:

    Thanks a million! Yeah, amazing they slipped on this one.

  15. javaJake Says:

    Thanks ciemaar for this blog post!

  16. LlucPot Says:

    Though it’s a helpful piece of advice, I’m in a trickier situation.

    I have to convert HTML entities IN HEX to uni, and BeautifulSoup complains that I gave it an ‘invalid literal for int() with base 10’. The entities I face are of the type: {&#x~hex number here~;} (i.e.: ‘&#xf2;’ for ‘&#242;’/’ò’). Does anyone know something about it?

  17. LlucPot Says:

    Solved:
    hex entity to decimal entity:

    ## IF you have encodedStringHex
    ## DO (in Python):
    encodedStringDecimal = str('&#%i;' % int('0%s' % encodedStringHex[2:-1], 16))
    ## THEN the solution posted above will work

  18. LlucPot Says:

    import re
    from BeautifulSoup import BeautifulStoneSoup

    def HTMLtoUni(entity):
        ## Python 2 / Beautiful Soup 3 code, like the post above
        if type(entity) != type(str()):
            ## entity is not a string
            return
        if not re.match(u'&[#a-z0-9]+;', entity):
            ## entity is not an HTML entity
            return
        if re.match(u'&#x[a-f0-9]+;', entity):
            ## convert hex HTML entity to HTML decimal entity
            entity = str('&#%i;' % int('0%s' % entity[2:-1], 16))
        return unicode(BeautifulStoneSoup(entity, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))

  19. Greg Brown Says:

    Worked a treat, thanks!

  20. Danilo Says:

    BeautifulSoup is very slow. HTMLParser is included in the Python library and works faster.

    from HTMLParser import HTMLParser
    row_clean = HTMLParser().unescape(row)

    See: http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/

  21. Johng97 Says:

    Thank you for some other informative website. Where else may I get that type of info written in such an ideal way? I’ve a mission that I’m just now operating on, and I’ve been on the look out for such information. edfkaekefkae


  22. […] This blog entry seems to have had some success with it. […]


  23. […] This blog entry seems to have had some success with it. […]
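
The thread above raises two recurring wishes: handling hex entities such as &#xf2; (comments 16 to 18) and avoiding an external library (comments 9, 10 and 20). A minimal sketch that covers both with a single regular expression follows. It assumes Python 3, it is not the code from any of the linked pages, and the names ENTITY_RE and unescape_entities are hypothetical, chosen only for this illustration.

```python
import re
from html.entities import name2codepoint

# Matches &#xHEX; or &#DECIMAL; or &name; and captures the part between & and ;
ENTITY_RE = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace hex, decimal, and named entities; leave unknown ones untouched."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith("#x"):
                return chr(int(body[2:], 16))   # hex reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the original text.
            return match.group(0)

    return ENTITY_RE.sub(replace, text)


print(unescape_entities("&#xf2; &#242; &ograve; &bogus;"))  # ò ò ò &bogus;
```

Leaving unrecognised entities in place, rather than raising, keeps the function safe to run over text that only happens to contain an ampersand and a semicolon.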

