ticket id 000043
status closed
priority ???
assigned to Waylan
Reported by: eichin@gmail.com Component: |
HTML (with stripTopLevelTags turned off) and RSS (extensions/rss.py) both output XML declaration headers that give the encoding as "utf8". http://www.iana.org/assignments/character-sets mentions "UTF-8", as does http://www.rfc-editor.org/rfc/rfc3629.txt; likewise, http://www.w3.org/International/O-charset only mentions utf-8.
More locally, rss.py has an explicit attempt to set utf-8:
md.docType = '<?xml version="1.0" encoding="utf-8"?>\n'
(I bring it up because emacs XML-mode complained and I dug into it a little...)
Changing convert to say
# Serialize _properly_. Strip top-level tags.
output, length = codecs.utf_8_decode(self.serializer(root, encoding="utf-8"))
(i.e. replace encoding="utf8" with encoding="utf-8") seems sensible, especially because, looking at xml.etree.ElementTree.ElementTree.write, it actually recognizes utf-8 and treats utf8 as an unknown encoding that must be declared explicitly.
Comments
By Waylan on August 22, 2009:
What is the problem you are having? Specifically, what do I need to do to replicate it, what are you getting, and what do you expect to be getting?
According to the python docs, utf8 is a valid alias for utf-8, so I'm a little confused as to the problem here.
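For reference, the alias claim can be checked against Python's codec registry. This is a minimal sketch written for Python 3 (the ticket itself predates it) using only the standard codecs module; the variable names are illustrative.

```python
import codecs

# Both spellings resolve to the same entry in Python's codec registry.
alias = codecs.lookup("utf8")
canonical = codecs.lookup("utf-8")
same_codec = alias.name == canonical.name

# The strings themselves still differ, which matters to any code
# that compares encoding names literally instead of looking them up.
same_string = "utf8" == "utf-8"

print(alias.name, same_codec, same_string)
```

So the alias holds for encoding and decoding inside Python, but not for code that tests the name with a plain string comparison.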
By Mark Eichin:
The Python docs aren't relevant here - those are talking about Python-specific aliases, and the code as written ends up actually emitting "utf8" in the XML declaration (and utf8 isn't meaningful in that outside context).
#!/usr/bin/python
import markdown

if __name__ == "__main__":
    mdd = markdown.Markdown()
    mdd.stripTopLevelTags = False
    print mdd.convert("# test")
emits
<?xml version='1.0' encoding='utf8'?>
<div>
<h1>test</h1>
</div>
(with the suggested change, it leaves out the encoding header altogether, because ElementTree.write special-cases the ones that are assumed:
elif encoding != "utf-8" and encoding != "us-ascii":
    file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
and "utf8" isn't one of those.) I realize that stripTopLevelTags is undocumented, but the rss extension uses it, and it looks like it's necessary when generating entire HTML (or RSS) documents out of markdown text...
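The difference can be reproduced with ElementTree alone, without Markdown. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree (not the Python 2 version quoted above), which as far as I know still compares the encoding name literally; the element names simply mirror the output shown earlier.

```python
import xml.etree.ElementTree as ET

# Build the same small tree as in the report: <div><h1>test</h1></div>
root = ET.Element("div")
ET.SubElement(root, "h1").text = "test"

# "utf8" is not one of the names ElementTree treats as implied,
# so it writes an XML declaration carrying the name verbatim.
with_alias = ET.tostring(root, encoding="utf8")

# "utf-8" is treated as implied, so no declaration is written.
with_canonical = ET.tostring(root, encoding="utf-8")

print(with_alias.decode("utf-8"))
print(with_canonical.decode("utf-8"))
```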
By Waylan:
Ah, okay. I was assuming that whenever the encoding was passed to an encode method that it would just 'do the right thing'. However, if I understand you correctly, ElementTree apparently doesn't take those aliases into account. I wonder if this should be considered a bug in ElementTree which should be filed with them. Either way, I'll patch Python-Markdown to only use the string "utf-8" internally. Thanks for the report.
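One way to guard against the alias reaching the serializer is to map any Python spelling of UTF-8 to the literal "utf-8" first. This is only a sketch of the idea for Python 3, not the patch that was actually applied to Python-Markdown; xml_encoding_name is a hypothetical helper name.

```python
import codecs
import xml.etree.ElementTree as ET


def xml_encoding_name(name):
    """Hypothetical helper: return "utf-8" for any Python alias of UTF-8."""
    try:
        if codecs.lookup(name).name == "utf-8":
            return "utf-8"
    except LookupError:
        pass  # unknown codec: hand it on unchanged and let the serializer reject it
    return name  # other encodings are passed through untouched


root = ET.Element("div")
ET.SubElement(root, "h1").text = "test"

# With the name mapped to "utf-8", no XML declaration is emitted.
output = ET.tostring(root, encoding=xml_encoding_name("utf8")).decode("utf-8")
print(output)
```

Only UTF-8 is mapped here because Python's own names for other codecs are not necessarily the names an XML consumer expects.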
Resolution
fixed