Ticket 000043: utf-8, not utf8
HTML (with `stripTopLevelTags` turned off) and RSS (`extensions/rss.py`) both output `<?xml version='1.0' encoding='utf8'?>` headers. http://www.iana.org/assignments/character-sets mentions "UTF-8", as does http://www.rfc-editor.org/rfc/rfc3629.txt; likewise http://www.w3.org/International/O-charset only mentions utf-8. More locally, `rss.py` has an explicit attempt to set utf-8:

    md.docType = '<?xml version="1.0" encoding="utf-8"?>\n'

(I bring it up because emacs XML-mode complained and I dug into it a little...)

Changing `convert` to say

    # Serialize _properly_.  Strip top-level tags.
    output, length = codecs.utf_8_decode(self.serializer(root, encoding="utf-8"))

(i.e. replacing `encoding="utf8"` with `encoding="utf-8"`) seems sensible, especially because `xml.etree.ElementTree.ElementTree.write` actually recognizes `utf-8` and treats `utf8` as an unknown encoding that must be declared explicitly.

### Comments

**By Waylan on August 22, 2009:**

What is the problem you are having? Specifically, what do I need to do to replicate it, and what are you getting versus what do you expect to get?

According to the Python [docs](http://docs.python.org/library/codecs.html#standard-encodings), `utf8` is a valid alias for `utf-8`, so I'm a little confused as to the problem here.

**By Mark Eichin:**

The Python docs aren't relevant here. They describe Python-specific aliases, and the code as written ends up actually *emitting* `<?xml version="1.0" encoding="utf8"?>` (and `utf8` isn't meaningful in that outside context).

    #!/usr/bin/python
    import markdown

    if __name__ == "__main__":
        mdd = markdown.Markdown()
        mdd.stripTopLevelTags = False
        print mdd.convert("# test")

emits

    <?xml version='1.0' encoding='utf8'?>
    <div>
    <h1>test</h1>
    </div>

(With the suggested change, it leaves out the encoding header altogether, because `ElementTree.write` special-cases the encodings that are assumed by default:

    elif encoding != "utf-8" and encoding != "us-ascii":
        file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)

and "utf8" isn't one of those.)

I realize that `stripTopLevelTags` is undocumented, but the rss extension uses it, and it looks like it's necessary when generating entire HTML (or RSS) documents out of markdown text...

**By Waylan:**

Ah, okay. I was assuming that whenever the encoding was passed to an encode method, it would just "do the right thing". However, if I understand you correctly, ElementTree apparently doesn't take those aliases into account. I wonder whether this should be considered a bug in ElementTree and filed with them. Either way, I'll patch Python-Markdown to only use the string "utf-8" internally. Thanks for the report.
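The difference discussed in this thread can be reproduced without Python-Markdown. The following is a minimal sketch, assuming a current Python 3 standard library rather than the 2009 code base; `canonical_encoding` is a hypothetical helper introduced only for illustration and is not part of Python-Markdown or ElementTree. It shows that the codec registry treats `utf8` and `utf-8` as the same codec, while ElementTree compares the literal string and therefore writes the alias into the XML declaration.

```python
import codecs
import xml.etree.ElementTree as ET


def canonical_encoding(name):
    """Hypothetical helper: map a codec alias to Python's canonical codec name."""
    return codecs.lookup(name).name


# Build the same small tree that the "# test" example in the ticket produces.
root = ET.Element("div")
ET.SubElement(root, "h1").text = "test"

# The codec registry resolves both spellings to the same codec.
same_codec = canonical_encoding("utf8") == canonical_encoding("utf-8")

# ElementTree checks the string itself, so the alias lands in the declaration.
with_alias = ET.tostring(root, encoding="utf8")

# Passing the canonical spelling hits the special case and omits the declaration.
with_canonical = ET.tostring(root, encoding=canonical_encoding("utf8"))

print(same_codec)
print(with_alias)
print(with_canonical)
```

Normalizing the name before handing it to the serializer is one possible approach; hard-coding the string "utf-8", as proposed in the last comment, achieves the same result with less machinery.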