ticket id
000043
status
closed
priority
???
assigned to
Waylan
Reported by: eichin@gmail.com
Component:

HTML (with stripTopLevelTags turned off) and RSS (extensions/rss.py) both output an XML declaration with encoding='utf8' in the header. However, http://www.iana.org/assignments/character-sets lists only "UTF-8", as does http://www.rfc-editor.org/rfc/rfc3629.txt; likewise http://www.w3.org/International/O-charset mentions only utf-8.

More locally, rss.py has an explicit attempt to set utf-8:

md.docType = '<?xml version="1.0" encoding="utf-8"?>\n'

(I bring it up because emacs XML-mode complained and I dug into it a little...)

Changing convert to say

    # Serialize _properly_.  Strip top-level tags.
    output, length = codecs.utf_8_decode(self.serializer(root, encoding="utf-8"))

(i.e. replace encoding="utf8" with encoding="utf-8") seems sensible, especially because xml.etree.ElementTree.ElementTree.write actually recognizes utf-8, whereas it treats utf8 as an unknown encoding that must be declared explicitly.

Comments

By Waylan on August 22, 2009:

What is the problem you are having? Specifically, what do I need to do to replicate it, what are you getting, and what do you expect to get?

According to the Python docs, utf8 is a valid alias for utf-8, so I'm a little confused as to the problem here.

By Mark Eichin:

The Python docs aren't relevant here: those are talking about Python-specific aliases, and the code as written ends up actually emitting encoding='utf8' in the XML declaration (and utf8 isn't meaningful in that outside context).

#!/usr/bin/python
import markdown

if __name__ == "__main__":
   mdd = markdown.Markdown()
   mdd.stripTopLevelTags = False
   print mdd.convert("# test")

emits

<?xml version='1.0' encoding='utf8'?>
<div>
<h1>test</h1>
</div>

(with the suggested change, it leaves out the encoding header altogether, because ElementTree.write special-cases the encodings that are assumed by default:

   elif encoding != "utf-8" and encoding != "us-ascii":
       file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)

and "utf8" isn't one of those.) I realize that stripTopLevelTags is undocumented, but the rss extension uses it, and it looks like it's necessary when generating entire HTML (or RSS) documents out of markdown text...
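The same behaviour can be reproduced with ElementTree alone, without Markdown in the picture. The following is a minimal sketch, assuming Python 3's xml.etree.ElementTree (where tostring returns bytes); the div/h1 tree simply mirrors the output shown above and is not taken from the Markdown source.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree that mirrors the <div><h1>test</h1></div> output above.
root = ET.Element("div")
ET.SubElement(root, "h1").text = "test"

# "utf8" is a valid Python codec alias, but ElementTree does not treat it as
# a default encoding, so it writes an explicit XML declaration using that name.
with_alias = ET.tostring(root, encoding="utf8")

# "utf-8" is one of the names ElementTree special-cases, so no declaration
# is written at all.
with_canonical = ET.tostring(root, encoding="utf-8")

print(with_alias.decode("utf-8"))
print(with_canonical.decode("utf-8"))
```

Only the first result should start with an XML declaration, and that declaration carries the alias exactly as it was passed in.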

By Waylan:

Ah, okay. I was assuming that whenever the encoding was passed to an encode method that it would just 'do the right thing'. However, if I understand you correctly, ElementTree apparently doesn't take those aliases into account. I wonder if this should be considered a bug in ElementTree which should be filed with them. Either way, I'll patch Python-Markdown to only use the string "utf-8" internally. Thanks for the report.
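As an alternative to hard-coding the string, an encoding name could be normalised through Python's codec registry before it reaches the serializer. This is only a sketch of that idea under assumptions, not the patch that was applied to Python-Markdown: canonical_encoding is a hypothetical helper name, and the example assumes Python 3's xml.etree.ElementTree.

```python
import codecs
import xml.etree.ElementTree as ET


def canonical_encoding(name):
    """Return Python's canonical codec name for an alias (hypothetical helper)."""
    return codecs.lookup(name).name


# Same small tree as in the report's output.
root = ET.Element("div")
ET.SubElement(root, "h1").text = "test"

# The alias "utf8" resolves to the canonical name "utf-8", which ElementTree
# recognises, so no XML declaration is emitted.
encoding = canonical_encoding("utf8")
output = ET.tostring(root, encoding=encoding).decode(encoding)
print(output)
```

Note that Python's canonical codec names are not guaranteed to match the IANA names for every encoding, so this helps with utf8 but is not a general fix for what ends up in an XML declaration.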

Resolution

fixed
