So, I was futzing around with a little Python blog reader the other day and I realized that it was perfect, but... the "but" in this case being that its output was littered with &#8220;s and &#8221;s. Not a problem, right? Just toss the thing through htmlentitiesdecode() or something and be on my way? I should have known better, I think. Going the other way was a simple enough matter, one line of code that looks like this:

"my unicode string".encode('ascii', 'xmlcharrefreplace')

However, finding that line was a pain in the neck, with a Google search filled with people implementing their own encoding. I was not completely surprised, then, to find people doing the same thing the other way. Checking the docs at python.org, I found there was a String.decode() function, but it didn’t reverse the above. In 2003 someone wrote a version that did work in reverse, but the bug was killed with a won’t merge. I think that may have slipped through the cracks a bit, but still, people are doing this, right? There was nothing in the cookbook, and a look at Pears was no help; they just use a wxHtml control and let it deal with it. In despair, I started to code up my own codec.
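To make that asymmetry concrete, here is a minimal sketch. It assumes a Python 3 interpreter rather than the Python 2 that the snippets in this post are written against, and the variable names are made up for illustration. The 'xmlcharrefreplace' handler only kicks in while encoding, so decoding with the same codec hands the character references back untouched:

```python
# Sketch assuming Python 3; variable names are illustrative only.
text = "\u201cquoted\u201d"  # curly double quotes, U+201C and U+201D

# Encoding: anything outside ASCII becomes a numeric character reference.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'&#8220;quoted&#8221;'

# Decoding: the ascii codec only maps bytes back to characters,
# so the references come through as literal text.
decoded = encoded.decode("ascii")
print(decoded)  # &#8220;quoted&#8221;
```

So decode() is not the inverse of that one-liner, which is why something that actually understands entities is needed for the return trip.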

The answer, it turns out, is Beautiful Soup. Stephen Laniel should get “I told you so” rights on this, except his server was down when I clicked through. The final code, then:

from BeautifulSoup import BeautifulStoneSoup  # Beautiful Soup 3, Python 2
decodedString = unicode(BeautifulStoneSoup(encodedString, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))

As simple as Python ought to be, but why so hard to find? I guess for a language that is usually so easy to work with, I should show some tolerance when it’s as hard as other languages. Not that I’m planning to.
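For anyone on Python 3, a rough equivalent needs no third-party library at all. This is a different technique from the Beautiful Soup one-liner above, not a port of it: a sketch that assumes the standard library's html.unescape() (available since Python 3.4), with a made-up sample string and variable names:

```python
import html

# Hypothetical sample input; any string containing HTML entities will do.
encoded_string = "&#8220;Hello&#8221; &amp; goodbye"

# html.unescape() converts numeric and named character references to text.
decoded_string = html.unescape(encoded_string)
print(decoded_string)
```

Here the two numeric references come back as curly quotes and &amp; comes back as a plain ampersand.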

– CF
