Update from Sam Ruby.

Commit cfce4dde58 by Jacques Distler, 2007-05-01 10:07:44 -05:00
58 changed files with 2108 additions and 234 deletions

View File

@ -1,3 +1,4 @@
*.tmplc
.DS_Store
cache
*.pluginc

THANKS
View File

@ -13,6 +13,10 @@ Morten Frederiksen - Support WordPress LinkManager OPML
Harry Fuecks - default item date to feed date
Antonio Cavedoni - Django templates
Morten Frederiksen - expungeCache
Lenny Domnitser - Coral CDN support for URLs with non-standard ports
Amit Chakradeo - Allow read-only files to be overwritten
Matt Brubeck - fix new_channel
Aristotle Pagaltzis - ensure byline_author filter doesn't drop foreign markup
This codebase represents a radical refactoring of Planet 2.0, which lists
the following contributors:

View File

@ -68,6 +68,9 @@ can be found</dd>
<dt><ins>filters</ins></dt>
<dd>Space-separated list of <a href="filters.html">filters</a> to apply to
each entry</dd>
<dt><ins>filter_directories</ins></dt>
<dd>Space-separated list of directories in which <code>filters</code>
can be found</dd>
</dl>
<dl class="compact code">
@ -148,6 +151,7 @@ processed as <a href="templates.html">templates</a>. With Planet 2.0,
it is possible to override parameters like <code>items_per_page</code>
on a per template basis, but at the current time Planet Venus doesn't
implement this.</p>
<p><ins><a href="filters.html">Filters</a> can be defined on a per-template basis, and will be used to post-process the output of the template.</ins></p>
<h3 id="filter"><code>[</code><em>filter</em><code>]</code></h3> <h3 id="filter"><code>[</code><em>filter</em><code>]</code></h3>
<p>Sections which are listed in <code>[planet] filters</code> are <p>Sections which are listed in <code>[planet] filters</code> are

docs/etiquette.html Normal file
View File

@ -0,0 +1,48 @@
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<script type="text/javascript" src="docs.js"></script>
<link rel="stylesheet" type="text/css" href="docs.css"/>
<title>Etiquette</title>
</head>
<body>
<h2>Etiquette</h2>
<p>You would think that people who publish syndication feeds do it with the
intent to be syndicated. But the truth is that we live in a world where
<a href="http://en.wikipedia.org/wiki/Deep_linking">deep linking</a> can
cause people to complain. Nothing is safe. But that doesn&#8217;t
stop us from linking.</p>
<p>These concerns tend to increase when you profit, either directly via ads or
indirectly via search engine rankings, from the content of others.</p>
<p>While there are no hard and fast rules that apply here, here are a
few things you can do to mitigate the concern:</p>
<ul>
<li><p>Aggressively use robots.txt, meta tags, and the google/livejournal
atom namespace to mark your pages as not to be indexed by search
engines.</p>
<blockquote><dl>
<dt><a href="http://www.robotstxt.org/">robots.txt</a>:</dt>
<dd><p><code>User-agent: *<br/>
Disallow: /</code></p></dd>
<dt>index.html:</dt>
<dd><p><code>&lt;<a href="http://www.robotstxt.org/wc/meta-user.html">meta name="robots"</a> content="noindex,nofollow"/&gt;</code></p></dd>
<dt>atom.xml:</dt>
<dd><p><code>&lt;feed xmlns:indexing="<a href="http://community.livejournal.com/lj_dev/696793.html">urn:atom-extension:indexing</a>" indexing:index="no"&gt;</code></p>
<p><code>&lt;access:restriction xmlns:access="<a href="http://www.bloglines.com/about/specs/fac-1.0">http://www.bloglines.com/about/specs/fac-1.0</a>" relationship="deny"/&gt;</code></p></dd>
</dl></blockquote></li>
<li><p>Ensure that all <a href="http://nightly.feedparser.org/docs/reference-entry-source.html#reference.entry.source.rights">copyright</a> and <a href="http://nightly.feedparser.org/docs/reference-entry-license.html">licensing</a> information is propagated to the
combined feed(s) that you produce.</p></li>
<li><p>Add no advertising. Consider filtering out ads, lest you
be accused of using someone&#8217;s content to help your friends profit.</p></li>
<li><p>Most importantly, if anyone does object to their content being included,
quickly and without any complaint, remove them.</p></li>
</ul>
</body>
</html>

View File

@ -8,18 +8,21 @@
<title>Venus Filters</title>
</head>
<body>
<h2>Filters and Plugins</h2>
<p>Filters and plugins are simple Unix pipes. Input comes in
<code>stdin</code>, parameters come from the config file, and output goes to
<code>stdout</code>. Anything written to <code>stderr</code> is logged as an
ERROR message. If no <code>stdout</code> is produced, the entry is not written
to the cache or processed further; in fact, if the entry had previously been
written to the cache, it will be removed.</p>
<p>There are two types of filters supported by Venus: input and template.</p>
<p>Input to an input filter is an aggressively
<a href="normalization.html">normalized</a> entry. For
example, if a feed is RSS 1.0 with 10 items, the filter will be called ten
times, each with a single Atom 1.0 entry, with all textConstructs
expressed as XHTML, and everything encoded as UTF-8.</p>
<p>Input to a template filter will be the output produced by the template.</p>
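<p>As an illustration (a hypothetical filter, not part of this commit), a
minimal input filter in Python copies the normalized entry from
<code>stdin</code> to <code>stdout</code> unchanged; producing no output at
all would instead drop the entry from the cache:</p>
<pre><code>import sys
entry = sys.stdin.read()   # one normalized Atom 1.0 entry, encoded as UTF-8
sys.stdout.write(entry)    # write it back out; omit this line to drop it</code></pre>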
<p>You will find a small set of example filters in the <a
href="../filters">filters</a> directory. The <a
@ -54,8 +57,14 @@ instead of XPath expressions.</p>
<h3>Notes</h3>
<ul>
<li>Any filters listed in the <code>[planet]</code> section of your config.ini
will be invoked on all feeds. Filters listed in individual
<code>[feed]</code> sections will only be invoked on those feeds.
Filters listed in <code>[template]</code> sections will be invoked on the
output of that template.</li>
<li>Input filters are executed when a feed is fetched, and the results are
placed into the cache. Changing a configuration file alone is not sufficient to
change the contents of the cache &mdash; typically that only occurs after
a feed is modified.</li>
@ -63,18 +72,34 @@ a feed is modified.</li>
configuration file (think unix pipes). Planet wide filters are executed before
feed specific filters.</li>
<li>The file extension of the filter is significant. <code>.py</code> invokes
python. <code>.xslt</code> invokes XSLT. <code>.sed</code> and
<code>.tmpl</code> (a.k.a. htmltmpl) are also options. Other languages, like
perl or ruby or class/jar (java), aren't supported at the moment, but these
would be easy to add.</li>
<li>If the filter name contains a redirection character (<code>&gt;</code>),
then the output stream is
<a href="http://en.wikipedia.org/wiki/Tee_(Unix)">tee</a>d; one branch flows
through the specified filter and the output is placed into the named file; the
other unmodified branch continues on to the next filter, if any.
One use case for this function is to use
<a href="../filters/xhtml2html.py">xhtml2html</a> to produce both an XHTML and
an HTML output stream from one source (see the example following this
list).</li>
<li>Templates written using htmltmpl or django currently only have access to a
fixed set of fields, whereas XSLT and genshi templates have access to
everything.</li>
<li>Plugins differ from filters in that while filters are forked, plugins are
<a href="http://docs.python.org/lib/module-imp.html">imported</a>. This
means that plugins are limited to Python and are run in-process. Plugins
therefore have direct access to planet internals like configuration and
logging facilities, as well as access to the bundled libraries like the
<a href="http://feedparser.org/docs/">Universal Feed Parser</a> and
<a href="http://code.google.com/p/html5lib/">html5lib</a>; but it also
means that functions like <code>os.abort()</code> can't be recovered
from.</li>
</ul>
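<p>For example, a hypothetical template section (not part of this commit,
and the exact syntax is an assumption based on the note above) that tees the
template's XHTML output through
<a href="../filters/xhtml2html.py">xhtml2html</a> into a plain-HTML copy
might read:</p>
<pre><code>[index.html.tmpl]
filters = xhtml2html.py&gt;index.html</code></pre>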
</body>
</html>

View File

@ -21,13 +21,14 @@
<ul>
<li><a href="venus.svg">Architecture</a></li>
<li><a href="normalization.html">Normalization</a></li>
<li><a href="filters.html">Filters and Plugins</a></li>
</ul>
</li>
<li>Other
<ul>
<li><a href="migration.html">Migration from Planet 2.0</a></li>
<li><a href="contributing.html">Contributing</a></li>
<li><a href="etiquette.html">Etiquette</a></li>
</ul>
</li>
<li>Reference

View File

@ -167,5 +167,18 @@ a <code>planet:format</code> attribute containing the referenced date
formatted according to the <code>[planet] date_format</code> specified
in the configuration</li>
</ul>
<h3>genshi</h3>
<p>Genshi approaches the power of XSLT, but with a syntax that many Python
programmers find more natural, succinct and expressive. Genshi templates
have access to the full range of <a href="http://feedparser.org/docs/reference.html">feedparser</a> values, with the following additions:</p>
<ul>
<li>In addition to a <code>feed</code> element which describes the feed
for your planet, there is also a <code>feeds</code> element which contains
the description for each subscription.</li>
<li>All <code>feed</code>, <code>feeds</code>, and <code>source</code> elements have a child <code>config</code> element which contains the config.ini entries associated with that feed.</li>
<li>All text construct detail elements (<code>subtitle</code>, <code>rights</code>, <code>title</code>, <code>summary</code>, <code>content</code>) also contain a <code>stream</code> element which contains the value as a Genshi stream.</li>
<li>Each of the <code>entries</code> has a <code>new_date</code> and <code>new_feed</code> value which indicates if this entry's date or feed differs from the preceding entry.</li>
</ul>
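<p>As a quick sketch (a hypothetical template, not part of this commit), a
genshi template can walk the <code>feeds</code> element described above and
read each subscription's config.ini entries via its <code>config</code>
child:</p>
<pre><code>&lt;ul xmlns:py="http://genshi.edgewall.org/"&gt;
&lt;li py:for="feed in feeds"&gt;${feed.config.name}&lt;/li&gt;
&lt;/ul&gt;</code></pre>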
</body>
</html>

View File

@ -36,6 +36,13 @@ filters = excerpt.py
omit = img p br
width = 500
# add memes to output
[index.html.tmpl]
filters = mememe.plugin
[mememe.plugin]
sidebar = //*[@id="footer"]
# subscription list
[http://share.opml.org/opml/top100.opml]
content_type = opml

filters/addsearch.genshi Normal file
View File

@ -0,0 +1,30 @@
<html xmlns:py="http://genshi.edgewall.org/" py:strip="">
<!--! insert search form -->
<div py:match="div[@id='sidebar']" py:attrs="select('@*')">
${select('*')}
<h2>Search</h2>
<form><input name="q"/></form>
</div>
<?python from urlparse import urljoin ?>
<!--! insert opensearch autodiscovery link -->
<head py:match="head" py:attrs="select('@*')">
${select('*')}
<link rel="search" type="application/opensearchdescription+xml"
href="${urljoin(str(select('link[@rel=\'alternate\']/@href')),
'opensearchdescription.xml')}"
title="${select('link[@rel=\'alternate\']/@title')} search"/>
</head>
<!--! ensure that scripts don't use empty tag syntax -->
<script py:match="script" py:attrs="select('@*')">
${select('*')}
</script>
<!--! Include the original stream, which will be processed by the rules
defined above -->
${input}
</html>

filters/addsearch.xslt Normal file
View File

@ -0,0 +1,70 @@
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns="http://www.w3.org/1999/xhtml">
<!-- insert search form -->
<xsl:template match="xhtml:div[@id='sidebar']">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<h2>Search</h2>
<form><input name="q"/></form>
</xsl:copy>
</xsl:template>
<!-- function to return baseuri of a given string -->
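<!-- e.g. 'http://example.com/planet/index.html' yields 'http://example.com/planet/' -->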
<xsl:template name="baseuri">
<xsl:param name="string" />
<xsl:if test="contains($string, '/')">
<xsl:value-of select="substring-before($string, '/')"/>
<xsl:text>/</xsl:text>
<xsl:call-template name="baseuri">
<xsl:with-param name="string">
<xsl:value-of select="substring-after($string, '/')"/>
</xsl:with-param>
</xsl:call-template>
</xsl:if>
</xsl:template>
<!-- insert opensearch autodiscovery link -->
<xsl:template match="xhtml:head">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<link rel="search" type="application/opensearchdescription+xml" title="{xhtml:link[@rel='alternate']/@title} search">
<xsl:attribute name="href">
<xsl:call-template name="baseuri">
<xsl:with-param name="string">
<xsl:value-of select="xhtml:link[@rel='alternate']/@href"/>
</xsl:with-param>
</xsl:call-template>
<xsl:text>opensearchdescription.xml</xsl:text>
</xsl:attribute>
</link>
</xsl:copy>
</xsl:template>
<!-- ensure that scripts don't use empty tag syntax -->
<xsl:template match="xhtml:script">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<xsl:if test="not(node())">
<xsl:comment><!--HTML Compatibility--></xsl:comment>
</xsl:if>
</xsl:copy>
</xsl:template>
<!-- add HTML5 doctype -->
<xsl:template match="/xhtml:html">
<xsl:text disable-output-escaping="yes">&lt;!DOCTYPE html&gt;</xsl:text>
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- pass through everything else -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

View File

@ -3,14 +3,15 @@ Remap all images to take advantage of the Coral Content Distribution
Network <http://www.coralcdn.org/>.
"""
import re, sys, urlparse, xml.dom.minidom
entry = xml.dom.minidom.parse(sys.stdin).documentElement
for node in entry.getElementsByTagName('img'):
if node.hasAttribute('src'):
component = list(urlparse.urlparse(node.getAttribute('src')))
if component[0] == 'http':
component[1] = re.sub(r':(\d+)$', r'.\1', component[1])
component[1] += '.nyud.net:8080'
node.setAttribute('src', urlparse.urlunparse(component))
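# For example (hypothetical URL), http://example.com:1234/logo.png is now
# rewritten to http://example.com.1234.nyud.net:8080/logo.png; previously,
# image URLs with an explicit port were skipped entirely.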

View File

@ -0,0 +1,29 @@
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<!-- Replace atom:author/atom:name with the byline author -->
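<!-- e.g. (hypothetical input) a content div holding
<span class="byline-author">Posted by Anne</span> yields <atom:name>Anne</atom:name> -->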
<xsl:template match="atom:entry/atom:author[../atom:content/xhtml:div/xhtml:span[@class='byline-author' and substring(.,1,10)='Posted by ']]">
<xsl:copy>
<atom:name>
<xsl:value-of select="substring(../atom:content/xhtml:div/xhtml:span[@class='byline-author'],11)"/>
</atom:name>
<xsl:apply-templates select="*[not(self::atom:name)]"/>
</xsl:copy>
</xsl:template>
<!-- Remove byline author -->
<xsl:template match="xhtml:div/xhtml:span[@class='byline-author' and substring(.,1,10)='Posted by ']"/>
<!-- Remove two line breaks following byline author -->
<xsl:template match="xhtml:br[preceding-sibling::*[1][@class='byline-author' and substring(.,1,10)='Posted by ']]"/>
<xsl:template match="xhtml:br[preceding-sibling::*[2][@class='byline-author' and substring(.,1,10)='Posted by ']]"/>
<!-- pass through everything else -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

View File

@ -0,0 +1,17 @@
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<!-- If the first paragraph consists exclusively of "By author-name",
delete it -->
<xsl:template match="atom:content/xhtml:div/xhtml:p[1][. =
concat('By ', ../../../atom:author/atom:name)]"/>
<!-- pass through everything else -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

View File

@ -0,0 +1,15 @@
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<!-- If the first paragraph of a comment has @class="from", delete it -->
<xsl:template match="atom:content/xhtml:div/xhtml:div[@class='comment']/xhtml:p[1][@class='from']"/>
<!-- pass through everything else -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

View File

@ -0,0 +1,6 @@
import sys
from planet import html5lib
tree=html5lib.treebuilders.dom.TreeBuilder
parser = html5lib.html5parser.HTMLParser(tree=tree)
document = parser.parse(sys.stdin)
sys.stdout.write(document.toxml("utf-8"))

filters/mememe.plugin Normal file
View File

@ -0,0 +1,480 @@
#
# This Venus output filter will annotate an XHTML page with a list of
# "memes" (or most popular linked destinations, based on the last week
# of entries from the cache) and will update the subscription list with
# links to recent entries from each subscription.
#
# Templates that don't produce XHTML natively will need their output passed
# through html2xhtml.plugin first.
#
# Typical configuration (based on classic_fancy):
#
# [index.html.tmpl]
# filters:
# html2xhtml.plugin
# mememe.plugin
#
# [mememe.plugin]
# sidebar = @class='sidebar'
#
import glob, libxml2, os, time, sys, sgmllib, urllib2, urlparse, re, md5
from xml.sax.saxutils import escape
from htmlentitydefs import entitydefs
import planet
from planet import config, feedparser
from planet.spider import filename
log = planet.getLogger(config.log_level(),config.log_format())
options = config.filter_options(sys.argv[0])
MEMES_ATOM = os.path.join(config.output_dir(),'memes.atom')
now = time.time()
week = 7 * 86400
week_ago = now - week
cache = config.cache_directory()
meme_cache = os.path.join(cache, 'memes')
if not os.path.exists(meme_cache): os.makedirs(meme_cache)
bom = config.bill_of_materials()
if not 'images/tcosm11.gif' in bom:
bom.append('images/tcosm11.gif')
config.parser.set('Planet', 'bill_of_materials', ' '.join(bom))
all_links = {}
feed_links = {}
def check_cache(url):
try:
file = open(filename(meme_cache, url))
headers = eval(file.read())
file.close()
return headers or {}
except:
return {}
def cache_meme(url, headers):
json = []
for key,value in headers.items():
json.append(' %s: %s' % (toj(key), toj(value)))
file = open(filename(meme_cache, url),'w')
file.write('{\n' + ',\n'.join(json) + '\n}\n')
file.close()
urlmap = {}
def canonicalize(url):
url = urlmap.get(url,url)
parts = list(urlparse.urlparse(url))
parts[0] = parts[0].lower()
parts[1] = parts[1].lower()
if parts[1].startswith('www.'): parts[1]=parts[1][4:]
if not parts[2]: parts[2] = '/'
parts[-1] = ''
return urlparse.urlunparse(parts)
log.debug("Loading cached data")
for name in glob.glob(os.path.join(cache, '*')):
# ensure that this is within the past week
if os.path.isdir(name): continue
mtime = os.stat(name).st_mtime
if mtime < week_ago: continue
# parse the file
try:
doc = libxml2.parseFile(name)
except:
continue
xp = doc.xpathNewContext()
xp.xpathRegisterNs("atom", "http://www.w3.org/2005/Atom")
xp.xpathRegisterNs("planet", "http://planet.intertwingly.net/")
# determine the entry
entry = xp.xpathEval("/atom:entry/atom:link[@rel='alternate']")
if not entry: continue
entry = canonicalize(entry[0].prop("href"))
# determine the title
title = xp.xpathEval("/atom:entry/atom:title")
if title:
if title[0].prop('type') == 'html':
title = re.sub('<.*?>','',title[0].content)
else:
title = title[0].content
title = str(title or '')
# determine the feed id
feed = xp.xpathEval("/atom:entry/atom:source/planet:memegroup")
if not feed: feed = xp.xpathEval("/atom:entry/atom:source/atom:id")
if not feed: continue
feed = feed[0].content
# determine the author
author = xp.xpathEval("/atom:entry/atom:source/planet:name")
if author:
author = author[0].content
else:
author = ''
# track the feed_links
if author:
if not feed_links.has_key(author): feed_links[author] = list()
feed_links[author].append([mtime, entry, title])
# identify the unique links
entry_links = []
for node in doc.xpathEval("//*[@href and not(@rel='source')]"):
parent = node.parent
while parent:
if parent.name == 'source': break
parent = parent.parent
else:
link = canonicalize(node.prop('href'))
if not link in entry_links:
entry_links.append(link)
if node.hasProp('title') and node.prop('title').startswith('http'):
link = canonicalize(node.prop('title'))
if not link in entry_links:
entry_links.append(link)
# add the votes
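# (the weight decays quadratically with age: 1.0 for a brand-new entry,
# falling to 0.0 for an entry exactly one week old)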
weight = 1.0 - (now - mtime)**2 / week**2
vote = [(weight, str(entry), str(feed), title, author, mtime)]
for link in entry_links:
all_links[link] = all_links.get(link,list()) + vote
# free the entry
doc.freeDoc()
# tally the votes
weighted_links = []
for link, votes in all_links.items():
site = {}
updated = 0
for weight, entry, feed, title, author, mtime in votes:
site[feed] = max(site.get(feed,0), weight)
if mtime > updated: updated=mtime
weighted_links.append((sum(site.values()), link, updated))
weighted_links.sort()
weighted_links.reverse()
cp1252 = {
128: 8364, # euro sign
130: 8218, # single low-9 quotation mark
131: 402, # latin small letter f with hook
132: 8222, # double low-9 quotation mark
133: 8230, # horizontal ellipsis
134: 8224, # dagger
135: 8225, # double dagger
136: 710, # modifier letter circumflex accent
137: 8240, # per mille sign
138: 352, # latin capital letter s with caron
139: 8249, # single left-pointing angle quotation mark
140: 338, # latin capital ligature oe
142: 381, # latin capital letter z with caron
145: 8216, # left single quotation mark
146: 8217, # right single quotation mark
147: 8220, # left double quotation mark
148: 8221, # right double quotation mark
149: 8226, # bullet
150: 8211, # en dash
151: 8212, # em dash
152: 732, # small tilde
153: 8482, # trade mark sign
154: 353, # latin small letter s with caron
155: 8250, # single right-pointing angle quotation mark
156: 339, # latin small ligature oe
158: 382, # latin small letter z with caron
159: 376} # latin capital letter y with diaeresis
# determine the title for a given url
class html(sgmllib.SGMLParser):
def __init__(self, url):
sgmllib.SGMLParser.__init__(self)
self.title = ""
self.feedurl = ""
self.intitle = False
headers = check_cache(url)
try:
# fetch the page
request = urllib2.Request(url)
request.add_header('User-Agent', 'Venus/MeMeme')
if headers.has_key('etag'):
request.add_header('If-None-Match', headers['etag'])
if headers.has_key('last-modified'):
request.add_header('If-Modified-Since', headers['last-modified'])
response = urllib2.urlopen(request)
self.feed(response.read())
# ensure the data is in utf-8
try:
self.title = self.title.decode('utf-8')
except:
self.title = ''.join([unichr(cp1252.get(ord(c),ord(c)))
for c in self.title.decode('iso-8859-1')])
# cache the results
headers = {}
if self.feedurl: headers['feedurl'] = self.feedurl
if self.title: headers['title'] = self.title
headers.update(response.headers)
cache_meme(url, headers)
except:
self.feedurl = headers.get('feedurl')
if headers.has_key('title'):
if isinstance(headers['title'],str):
self.title=eval('u'+repr(headers['title']).replace('\\\\','\\'))
else:
self.title=headers['title']
# if there is a feed, look for an entry that matches, and take that title
if self.feedurl and not self.title:
headers = check_cache(self.feedurl)
data = feedparser.parse(self.feedurl, etag=headers.get('etag'),
modified=headers.get('last-modified'))
if data.has_key('headers') and data.has_key('status') and \
data.status in [200, 301, 302]:
titles = {}
for entry in data.entries:
if entry.has_key('title_detail') and entry.has_key('link'):
titles[entry.link] = entry.title_detail.value
if entry.title_detail.type == 'text/plain':
titles[entry.link] = escape(titles[entry.link])
if titles.has_key(url): self.title = titles[url]
data.headers.update(titles)
cache_meme(self.feedurl, data.headers)
else:
if headers.has_key(url):
if isinstance(headers[url],str):
self.title=eval('u'+repr(headers[url]).replace('\\\\','\\'))
else:
self.title=headers[url]
# fallback is the basename of the URI
if not self.title:
self.title = escape(url.rstrip('/').split('/')[-1].split('?')[0])
# parse out the first autodiscovery link
def start_link(self, attrs):
if self.feedurl: return
attrs = dict(map(lambda (k,v): (k.lower(),v), attrs))
if not 'rel' in attrs: return
rels = attrs['rel'].split(' ')
if 'alternate' not in rels: return
if not 'type' in attrs or not attrs['type'].endswith('xml'): return
if 'href' in attrs:
self.feedurl = attrs['href']
# parse the page title
def start_title(self, attributes):
if not self.title: self.intitle = True
def end_title(self):
self.intitle = False
def handle_data(self, text):
if self.intitle: self.title += escape(text)
# convert unicode string to a json string
def toj(value):
result = repr(value).replace(r'\x',r'\u00')
if result[:1] == 'u': result=result[1:]
if result.startswith("'"):
result = '"%s"' % result.replace('"',r'\"').replace(r"\'","'")[1:-1]
return result
seenit = []
count = 0
# construct an empty feed
feed_doc = libxml2.newDoc("1.0")
meme_feed = feed_doc.newChild(None, "feed", None)
meme_feed.newNs('http://www.w3.org/2005/Atom', None)
meme_feed.newTextChild(None, 'title', config.name() + ': Memes')
author = meme_feed.newChild(None, 'author', None)
author.newTextChild(None, 'name', config.owner_name())
if config.owner_email(): author.newTextChild(None, 'email', config.owner_email())
meme_feed.newTextChild(None, 'id', os.path.join(config.link(), 'memes.atom'))
link = meme_feed.newChild(None, 'link', None)
link.setProp('href', os.path.join(config.link(), 'memes.atom'))
link.setProp('rel', 'self')
meme_feed.newTextChild(None, 'updated',
time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()))
# parse the input
log.debug("Parse input")
doc=libxml2.parseDoc(sys.stdin.read())
# find the sidebar/footer
sidebar = options.get('sidebar','//*[@class="sidebar"]')
footer = doc.xpathEval(sidebar)
if not hasattr(footer,'__len__') or len(footer) == 0:
raise Exception(sidebar + ' not found')
if len(footer) > 1:
log.info("%d occurrences of %s found, taking first" % (len(footer),sidebar))
footer = footer[0]
# add up to 10 entry links to each subscription
subs_ul = footer.children
while subs_ul.isText() or subs_ul.name != 'ul': subs_ul = subs_ul.next
child = subs_ul.children
while child:
if child.name == 'li':
if child.lastChild().name == 'ul': child.lastChild().unlinkNode()
link = child.lastChild()
while link.isText(): link=link.prev
author = link.getContent()
state = 'inactive'
if feed_links.has_key(author):
ul2 = child.newChild(None, 'ul', None)
feed_links[author].sort()
feed_links[author].reverse()
link_count = 0
for mtime, entry, title in feed_links[author]:
if not title: continue
li2 = ul2.newChild(None, 'li', None)
a = li2.newTextChild(None, 'a', title)
a.setProp('href', entry)
link_count = link_count + 1
if link_count >= 10: break
if link_count > 0: state = None
if state:
link.setProp('class',((link.prop('class') or '') + ' ' + state).strip())
child=child.next
# create a h2 and ul for the memes list
footer_top = footer.children
memes = footer_top.addPrevSibling(footer.newTextChild(None, 'h2', 'Memes '))
memes_ul = footer_top.addPrevSibling(footer.newChild(None, 'ul', None))
# create a header for the memes list
a = memes.newChild(None, 'a', None)
a.setProp('href', 'memes.atom')
img = a.newChild(None, 'img', None)
img.setProp('src', 'images/feed-icon-10x10.png')
# collect the results
log.debug("Fetch titles and collect the results")
from urllib import quote_plus
for i in range(0,len(weighted_links)):
weight, link, updated = weighted_links[i]
# ensure that somebody new points to this entry. This guards against
# groups of related links which several posts all point to.
novel = False
for weight, entry, feed, title, author, mtime in all_links[link]:
if entry not in seenit:
seenit.append(entry)
novel = True
if not novel: continue
all_links[link].sort()
all_links[link].reverse()
cache_file = filename(cache, link)
title = None
# when possible, take the title from the cache
if os.path.exists(cache_file):
entry = feedparser.parse(cache_file).entries[0]
if entry.has_key('title_detail'):
title = entry.title_detail.value
if entry.title_detail.type == 'text/plain': title = escape(title)
# otherwise, parse the html
if not title:
title = html(link).title
# dehtmlize
title = re.sub('&(\w+);',
lambda n: entitydefs.get(n.group(1), '&'+n.group(1)+';'), title)
title = re.sub('&#(\d+);',lambda n: unichr(int(n.group(1))), title)
title = re.sub('&#x(\w+);',lambda n: unichr(int(n.group(1),16)), title)
# title too long? Insert zero width spaces where appropriate
if max(map(len,title.split())) > 30:
title=re.sub('(\W+)',u'\\1\u200b',title)
# save the entry title (it is used later)
entry_title = title.strip()
# add to the memes list
memes_ul.addContent('\n')
li = memes_ul.newChild(None, 'li', None)
memes_ul.addContent('\n')
# technorati link
a = li.newChild(None, 'a', None)
tlink = 'http://technorati.com/cosmos/search.html?url='
if link.startswith('http://'):
a.setProp('href',tlink + quote_plus(link[7:]))
else:
a.setProp('href',tlink + quote_plus(link))
a.setProp('title','cosmos')
img = a.newChild(None, 'img', None)
img.setProp('src','images/tcosm11.gif')
# main link
a = li.newTextChild(None, 'a', title.strip().encode('utf-8'))
a.setProp('href',link)
if (((i==0) or (updated>=weighted_links[i-1][2])) and
(i+1==len(weighted_links) or (updated>=weighted_links[i+1][2]))):
rank = 0
for j in range(0,len(weighted_links)):
if updated < weighted_links[j][2]: rank = rank + 1
if rank < len(weighted_links)/2:
a.setProp('class','rising')
# voters
ul2 = li.newChild(None, 'ul', None)
voters = []
for weight, entry, feed, title, author, mtime in all_links[link]:
if entry in voters: continue
li2 = ul2.newChild(None, 'li', None)
a = li2.newTextChild(None, 'a' , author)
a.setProp('href',entry)
if title: a.setProp('title',title)
voters.append(entry)
# add to the meme feed
if len(all_links[link]) > 2:
meme_feed.addContent('\n')
entry = meme_feed.newChild(None, 'entry', None)
meme_feed.addContent('\n')
# entry
tagbase = config.link().split('/')
if not tagbase[-1]: tagbase = tagbase[:-1]
tagbase = 'tag:%s,2007:%smeme/%%s' % (tagbase[2],'/'.join(tagbase[3:]))
entry.newTextChild(None, 'id', tagbase % md5.new(link).hexdigest())
entry.newTextChild(None, 'title', entry_title.encode('utf-8'))
meme_link = entry.newTextChild(None, 'link', None)
meme_link.setProp('href', link)
entry.newTextChild(None, 'updated',
time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(updated)))
# voters
content = entry.newChild(None, 'content', None)
content.setProp('type', 'xhtml')
div = content.newTextChild(None, 'div', 'Spotted by:')
div.newNs('http://www.w3.org/1999/xhtml', None)
content_ul = div.newChild(None, 'ul', None)
for weight, entry, feed, title, author, mtime in all_links[link]:
li2 = content_ul.newTextChild(None, 'li', author + ": ")
a = li2.newTextChild(None, 'a' , title or 'untitled')
a.setProp('href',entry)
count = count + 1
if count >= 10: break
log.info("Writing " + MEMES_ATOM)
output=open(MEMES_ATOM,'w')
output.write(feed_doc.serialize('utf-8'))
output.close()
sys.stdout.write(doc.serialize('utf-8'))

filters/xhtml2html.py Normal file
View File

@ -0,0 +1,5 @@
import sys
from genshi.input import XMLParser
from genshi.output import HTMLSerializer
print ''.join(HTMLSerializer()(XMLParser(sys.stdin))).encode('utf-8')

View File

@ -352,14 +352,15 @@ def filters(section=None):
filters = []
if parser.has_option('Planet', 'filters'):
filters += parser.get('Planet', 'filters').split()
if filter(section):
filters.append('regexp_sifter.py?require=' +
urllib.quote(filter(section)))
if exclude(section):
filters.append('regexp_sifter.py?exclude=' +
urllib.quote(exclude(section)))
for section in section and [section] or template_files():
if parser.has_option(section, 'filters'):
filters += parser.get(section, 'filters').split()
return filters
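# A sketch of how filters now accumulate (hypothetical config): given
#
#   [Planet]
#   filters = excerpt.py
#
#   [index.html.tmpl]
#   filters = mememe.plugin
#
# filters('index.html.tmpl') returns ['excerpt.py', 'mememe.plugin']:
# planet-wide filters first, then section-specific ones.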
def planet_options():
@ -382,6 +383,10 @@ def template_options(section):
""" dictionary of template specific options""" """ dictionary of template specific options"""
return feed_options(section) return feed_options(section)
def filter_options(section):
""" dictionary of filter specific options"""
return feed_options(section)
def write(file=sys.stdout):
""" write out an updated template """
print parser.write(file)

View File

@ -11,8 +11,8 @@ Recommended: Python 2.3 or later
Recommended: CJKCodecs and iconv_codec <http://cjkpython.i18n.org/>
"""
__version__ = "4.2-pre-" + "$Revision: 1.149 $"[11:16] + "-cvs" __version__ = "4.2-pre-" + "$Revision: 262 $"[11:14] + "-svn"
__license__ = """Copyright (c) 2002-2006, Mark Pilgrim, All rights reserved. __license__ = """Copyright (c) 2002-2007, Mark Pilgrim, All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
@ -39,7 +39,8 @@ __contributors__ = ["Jason Diamond <http://injektilo.org/>",
"John Beimler <http://john.beimler.org/>", "John Beimler <http://john.beimler.org/>",
"Fazal Majid <http://www.majid.info/mylos/weblog/>", "Fazal Majid <http://www.majid.info/mylos/weblog/>",
"Aaron Swartz <http://aaronsw.com/>", "Aaron Swartz <http://aaronsw.com/>",
"Kevin Marks <http://epeus.blogspot.com/>"] "Kevin Marks <http://epeus.blogspot.com/>",
"Sam Ruby <http://intertwingly.net/>"]
_debug = 0
# HTTP "User-Agent" header to send to servers when downloading feeds.
@ -229,6 +230,10 @@ class FeedParserDict(UserDict):
if key == 'enclosures':
norel = lambda link: FeedParserDict([(name,value) for (name,value) in link.items() if name!='rel'])
return [norel(link) for link in UserDict.__getitem__(self, 'links') if link['rel']=='enclosure']
if key == 'license':
for link in UserDict.__getitem__(self, 'links'):
if link['rel']=='license' and link.has_key('href'):
return link['href']
if key == 'categories':
return [(tag['scheme'], tag['term']) for tag in UserDict.__getitem__(self, 'tags')]
realkey = self.keymap.get(key, key)
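# Usage sketch (hypothetical feed URL, not part of this diff): the new
# 'license' key resolves to the href of the first rel="license" link, e.g.
#   d = feedparser.parse('http://example.com/atom.xml')
#   d.entries[0].license   # href of that entry's license link, if any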
@ -424,7 +429,7 @@ class _FeedParserMixin:
}
_matchnamespaces = {}
can_be_relative_uri = ['link', 'id', 'wfw_comment', 'wfw_commentrss', 'docs', 'url', 'href', 'comments', 'icon', 'logo']
can_contain_relative_uris = ['content', 'title', 'summary', 'info', 'tagline', 'subtitle', 'copyright', 'rights', 'description']
can_contain_dangerous_markup = ['content', 'title', 'summary', 'info', 'tagline', 'subtitle', 'copyright', 'rights', 'description']
html_types = ['text/html', 'application/xhtml+xml']
@ -460,6 +465,7 @@ class _FeedParserMixin:
self.langstack = []
self.baseuri = baseuri or ''
self.lang = baselang or None
self.svgOK = 0
if baselang:
self.feeddata['language'] = baselang.replace('_','-')
@ -514,6 +520,7 @@ class _FeedParserMixin:
attrs.append(('xmlns',namespace))
if tag=='svg' and namespace=='http://www.w3.org/2000/svg':
attrs.append(('xmlns',namespace))
if tag == 'svg': self.svgOK = 1
return self.handle_data('<%s%s>' % (tag, self.strattrs(attrs)), escape=0)
# match namespaces # match namespaces
@ -549,6 +556,7 @@ class _FeedParserMixin:
prefix = self.namespacemap.get(prefix, prefix)
if prefix:
prefix = prefix + '_'
if suffix == 'svg': self.svgOK = 0
# call special handler (if defined) or default handler
methodname = '_end_' + prefix + suffix
@ -1247,17 +1255,26 @@ class _FeedParserMixin:
self._save('expired_parsed', _parse_date(self.pop('expired')))
def _start_cc_license(self, attrsD):
context = self._getContext()
value = self._getAttribute(attrsD, 'rdf:resource')
attrsD = FeedParserDict()
attrsD['rel']='license'
if value: attrsD['href']=value
context.setdefault('links', []).append(attrsD)
def _start_creativecommons_license(self, attrsD):
self.push('license', 1)
_start_creativeCommons_license = _start_creativecommons_license
def _end_creativecommons_license(self):
value = self.pop('license')
context = self._getContext()
attrsD = FeedParserDict()
attrsD['rel']='license'
if value: attrsD['href']=value
context.setdefault('links', []).append(attrsD)
del context['license']
_end_creativeCommons_license = _end_creativecommons_license
def _addXFN(self, relationships, href, name):
context = self._getContext()
@ -1349,12 +1366,13 @@ class _FeedParserMixin:
self._save('link', value)
def _start_title(self, attrsD):
if self.svgOK: return self.unknown_starttag('title', attrsD.items())
self.pushContent('title', attrsD, 'text/plain', self.infeed or self.inentry or self.insource)
_start_dc_title = _start_title
_start_media_title = _start_title
def _end_title(self):
if self.svgOK: return
value = self.popContent('title')
if not value: return
context = self._getContext()
@ -2233,27 +2251,41 @@ def _resolveRelativeURIs(htmlSource, baseURI, encoding, type):
return p.output()
class _HTMLSanitizer(_BaseHTMLProcessor):
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'article',
'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button', 'canvas',
'caption', 'center', 'cite', 'code', 'col', 'colgroup', 'command',
'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn', 'dialog', 'dir',
'div', 'dl', 'dt', 'em', 'event-source', 'fieldset', 'figure', 'footer',
'font', 'form', 'header', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i',
'img', 'input', 'ins', 'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map',
'menu', 'meter', 'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup',
'option', 'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong', 'sub',
'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot', 'th', 'thead',
'tr', 'tt', 'u', 'ul', 'var', 'video', 'noscript']
acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
'action', 'align', 'alt', 'autoplay', 'autocomplete', 'autofocus', 'axis',
'background', 'balance', 'bgcolor', 'bgproperties', 'border',
'bordercolor', 'bordercolordark', 'bordercolorlight', 'bottompadding',
'cellpadding', 'cellspacing', 'ch', 'challenge', 'char', 'charoff',
'choff', 'charset', 'checked', 'cite', 'class', 'clear', 'color', 'cols',
'colspan', 'compact', 'contenteditable', 'coords', 'data', 'datafld',
'datapagesize', 'datasrc', 'datetime', 'default', 'delay', 'dir',
'disabled', 'draggable', 'dynsrc', 'enctype', 'end', 'face', 'for',
'form', 'frame', 'galleryimg', 'gutter', 'headers', 'height', 'hidefocus',
'hidden', 'high', 'href', 'hreflang', 'hspace', 'icon', 'id', 'inputmode',
'ismap', 'keytype', 'label', 'leftspacing', 'lang', 'list', 'longdesc',
'loop', 'loopcount', 'loopend', 'loopstart', 'low', 'lowsrc', 'max',
'maxlength', 'media', 'method', 'min', 'multiple', 'name', 'nohref',
'noshade', 'nowrap', 'open', 'optimum', 'pattern', 'ping', 'point-size',
'prompt', 'pqg', 'radiogroup', 'readonly', 'rel', 'repeat-max',
'repeat-min', 'replace', 'required', 'rev', 'rightspacing', 'rows',
'rowspan', 'rules', 'scope', 'selected', 'shape', 'size', 'span', 'src',
'start', 'step', 'summary', 'suppress', 'tabindex', 'target', 'template',
'title', 'toppadding', 'type', 'unselectable', 'usemap', 'urn', 'valign',
'value', 'variable', 'volume', 'vspace', 'vrml', 'width', 'wrap',
'xml:lang']
unacceptable_elements_with_end_tag = ['script', 'applet']
@ -2300,36 +2332,38 @@ class _HTMLSanitizer(_BaseHTMLProcessor):
svg_elements = ['a', 'animate', 'animateColor', 'animateMotion',
'animateTransform', 'circle', 'defs', 'desc', 'ellipse', 'font-face',
'font-face-name', 'font-face-src', 'g', 'glyph', 'hkern', 'image',
'linearGradient', 'line', 'marker', 'metadata', 'missing-glyph', 'mpath',
'path', 'polygon', 'polyline', 'radialGradient', 'rect', 'set', 'stop',
'svg', 'switch', 'text', 'title', 'tspan', 'use']
# svgtiny + class + opacity + offset + xmlns + xmlns:xlink
svg_attributes = ['accent-height', 'accumulate', 'additive', 'alphabetic',
'arabic-form', 'ascent', 'attributeName', 'attributeType',
'baseProfile', 'bbox', 'begin', 'by', 'calcMode', 'cap-height',
'class', 'color', 'color-rendering', 'content', 'cx', 'cy', 'd', 'dx',
'dy', 'descent', 'display', 'dur', 'end', 'fill', 'fill-rule',
'font-family', 'font-size', 'font-stretch', 'font-style', 'font-variant',
'font-weight', 'from', 'fx', 'fy', 'g1', 'g2', 'glyph-name',
'gradientUnits', 'hanging', 'height', 'horiz-adv-x', 'horiz-origin-x',
'id', 'ideographic', 'k', 'keyPoints', 'keySplines', 'keyTimes',
'lang', 'mathematical', 'marker-end', 'marker-mid', 'marker-start',
'markerHeight', 'markerUnits', 'markerWidth', 'max', 'min', 'name',
'offset', 'opacity', 'orient', 'origin', 'overline-position',
'overline-thickness', 'panose-1', 'path', 'pathLength', 'points',
'preserveAspectRatio', 'r', 'refX', 'refY', 'repeatCount', 'repeatDur',
'requiredExtensions', 'requiredFeatures', 'restart', 'rotate', 'rx',
'ry', 'slope', 'stemh', 'stemv', 'stop-color', 'stop-opacity',
'strikethrough-position', 'strikethrough-thickness', 'stroke',
'stroke-dasharray', 'stroke-dashoffset', 'stroke-linecap',
'stroke-linejoin', 'stroke-miterlimit', 'stroke-opacity',
'stroke-width', 'systemLanguage', 'target', 'text-anchor', 'to',
'transform', 'type', 'u1', 'u2', 'underline-position',
'underline-thickness', 'unicode', 'unicode-range', 'units-per-em',
'values', 'version', 'viewBox', 'visibility', 'width', 'widths', 'x',
'x-height', 'x1', 'x2', 'xlink:actuate', 'xlink:arcrole', 'xlink:href',
'xlink:role', 'xlink:show', 'xlink:title', 'xlink:type', 'xml:base',
'xml:lang', 'xml:space', 'xmlns', 'xmlns:xlink', 'y', 'y1', 'y2',
'zoomAndPan']
svg_attr_map = None
svg_elem_map = None
@ -3506,7 +3540,8 @@ class TextSerializer(Serializer):
class PprintSerializer(Serializer):
def write(self, stream=sys.stdout):
if self.results.has_key('href'):
stream.write(self.results['href'] + '\n\n')
from pprint import pprint
pprint(self.results, stream)
stream.write('\n')
@ -3767,4 +3802,3 @@ if __name__ == '__main__':
# currently supports rel-tag (maps to 'tags'), rel-enclosure (maps to
# 'enclosures'), XFN links within content elements (maps to 'xfn'),
# and hCard (parses as vCard); bug [ 1481975 ] Misencoded utf-8/win-1252

View File

@ -71,35 +71,40 @@ class HTMLParser(object):
"trailingEnd": TrailingEndPhase(self, self.tree) "trailingEnd": TrailingEndPhase(self, self.tree)
} }
def _parse(self, stream, innerHTML=False, container="div",
encoding=None):
self.tree.reset()
self.firstStartTag = False
self.errors = []
self.tokenizer = tokenizer.HTMLTokenizer(stream, encoding,
parseMeta=innerHTML)
if innerHTML:
self.innerHTML = container.lower()
if self.innerHTML in ('title', 'textarea'):
self.tokenizer.contentModelFlag = tokenizer.contentModelFlags["RCDATA"]
elif self.innerHTML in ('style', 'script', 'xmp', 'iframe', 'noembed', 'noframes', 'noscript'):
self.tokenizer.contentModelFlag = tokenizer.contentModelFlags["CDATA"]
elif self.innerHTML == 'plaintext':
self.tokenizer.contentModelFlag = tokenizer.contentModelFlags["PLAINTEXT"]
else:
# contentModelFlag already is PCDATA
#self.tokenizer.contentModelFlag = tokenizer.contentModelFlags["PCDATA"]
pass
self.phase = self.phases["rootElement"]
self.phase.insertHtmlElement()
self.resetInsertionMode()
else:
self.innerHTML = False
self.phase = self.phases["initial"]
# We only seem to have InBodyPhase testcases where the following is
# relevant ... need others too
self.lastPhase = None
# XXX This is temporary for the moment so there isn't any other
# changes needed for the parser to work with the iterable tokenizer
for token in self.tokenizer:
@ -118,7 +123,34 @@ class HTMLParser(object):
# When the loop finishes it's EOF
self.phase.processEOF()
def parse(self, stream, encoding=None):
"""Parse a HTML document into a well-formed tree
stream - a filelike object or string containing the HTML to be parsed
The optional encoding parameter must be a string that indicates
the encoding. If specified, that encoding will be used,
regardless of any BOM or later declaration (such as in a meta
element)
"""
self._parse(stream, innerHTML=False, encoding=encoding)
return self.tree.getDocument()
def parseFragment(self, stream, container="div", encoding=None):
"""Parse a HTML fragment into a well-formed tree fragment
container - name of the element we're setting the innerHTML property
if set to None, default to 'div'
stream - a filelike object or string containing the HTML to be parsed
The optional encoding parameter must be a string that indicates
the encoding. If specified, that encoding will be used,
regardless of any BOM or later declaration (such as in a meta
element)
"""
self._parse(stream, True, container=container, encoding=encoding)
return self.tree.getFragment()
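# Usage sketch, mirroring the html2xhtml filter added in this commit (the
# fragment call is an illustration of the new API, not existing code):
#   parser = html5lib.html5parser.HTMLParser(
#       tree=html5lib.treebuilders.dom.TreeBuilder)
#   doc = parser.parse('<p>a complete document')
#   frag = parser.parseFragment('<b>a fragment</b>', container='div')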
def parseError(self, data="XXX ERROR MESSAGE NEEDED"):
# XXX The idea is to make data mandatory.
@ -187,28 +219,29 @@ class HTMLParser(object):
"frameset":"inFrameset" "frameset":"inFrameset"
} }
for node in self.tree.openElements[::-1]: for node in self.tree.openElements[::-1]:
nodeName = node.name
if node == self.tree.openElements[0]: if node == self.tree.openElements[0]:
last = True last = True
if node.name not in ['td', 'th']: if nodeName not in ['td', 'th']:
# XXX # XXX
assert self.innerHTML assert self.innerHTML
raise NotImplementedError nodeName = self.innerHTML
# Check for conditions that should only happen in the innerHTML # Check for conditions that should only happen in the innerHTML
# case # case
if node.name in ("select", "colgroup", "head", "frameset"): if nodeName in ("select", "colgroup", "head", "frameset"):
# XXX # XXX
assert self.innerHTML assert self.innerHTML
if node.name in newModes: if nodeName in newModes:
self.phase = self.phases[newModes[node.name]] self.phase = self.phases[newModes[nodeName]]
break break
elif node.name == "html": elif nodeName == "html":
if self.tree.headPointer is None: if self.tree.headPointer is None:
self.phase = self.phases["beforeHead"] self.phase = self.phases["beforeHead"]
else: else:
self.phase = self.phases["afterHead"] self.phase = self.phases["afterHead"]
break break
elif last: elif last:
self.phase = self.phases["body"] self.phase = self.phases["inBody"]
break break
class Phase(object):
@ -434,9 +467,7 @@ class InHeadPhase(Phase):
self.parser.phase.processCharacters(data)
def startTagHead(self, name, attributes):
self.parser.parseError(_(u"Unexpected start tag head in existing head. Ignored"))
def startTagTitle(self, name, attributes):
element = self.tree.createElement(name, attributes)

@@ -455,10 +486,11 @@ class InHeadPhase(Phase):
self.parser.tokenizer.contentModelFlag = contentModelFlags["CDATA"]

def startTagScript(self, name, attributes):
#XXX Inner HTML case may be wrong
element = self.tree.createElement(name, attributes)
element._flags.append("parser-inserted")
- if self.tree.headPointer is not None and\
- self.parser.phase == self.parser.phases["inHead"]:
+ if (self.tree.headPointer is not None and
+ self.parser.phase == self.parser.phases["inHead"]):
self.appendToHead(element)
else:
self.tree.openElements[-1].appendChild(element)
@@ -653,8 +685,8 @@ class InBodyPhase(Phase):
def startTagBody(self, name, attributes):
self.parser.parseError(_(u"Unexpected start tag (body)."))
- if len(self.tree.openElements) == 1 \
- or self.tree.openElements[1].name != "body":
+ if (len(self.tree.openElements) == 1
+ or self.tree.openElements[1].name != "body"):
assert self.parser.innerHTML
else:
for attr, value in attributes.iteritems():

@@ -1179,6 +1211,7 @@ class InTablePhase(Phase):
self.parser.resetInsertionMode()
else:
# innerHTML case
assert self.parser.innerHTML
self.parser.parseError()

def endTagIgnore(self, name):
@@ -1215,23 +1248,25 @@ class InCaptionPhase(Phase):
])
self.endTagHandler.default = self.endTagOther

def ignoreEndTagCaption(self):
return not self.tree.elementInScope("caption", True)

def processCharacters(self, data):
self.parser.phases["inBody"].processCharacters(data)

def startTagTableElement(self, name, attributes):
self.parser.parseError()
#XXX Have to duplicate logic here to find out if the tag is ignored
ignoreEndTag = self.ignoreEndTagCaption()
self.parser.phase.processEndTag("caption")
- # XXX how do we know the tag is _always_ ignored in the innerHTML
- # case and therefore shouldn't be processed again? I'm not sure this
- # strategy makes sense...
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processStartTag(name, attributes)

def startTagOther(self, name, attributes):
self.parser.phases["inBody"].processStartTag(name, attributes)

def endTagCaption(self, name):
- if self.tree.elementInScope(name, True):
+ if not self.ignoreEndTagCaption():
# AT this code is quite similar to endTagTable in "InTable"
self.tree.generateImpliedEndTags()
if self.tree.openElements[-1].name != "caption":

@@ -1244,14 +1279,15 @@ class InCaptionPhase(Phase):
self.parser.phase = self.parser.phases["inTable"]
else:
# innerHTML case
assert self.parser.innerHTML
self.parser.parseError()

def endTagTable(self, name):
self.parser.parseError()
ignoreEndTag = self.ignoreEndTagCaption()
self.parser.phase.processEndTag("caption")
- # XXX ...
- if not self.parser.innerHTML:
- self.parser.phase.processStartTag(name, attributes)
+ if not ignoreEndTag:
+ self.parser.phase.processEndTag(name)

def endTagIgnore(self, name):
self.parser.parseError(_("Unexpected end tag (" + name +\
@@ -1279,10 +1315,13 @@ class InColumnGroupPhase(Phase):
])
self.endTagHandler.default = self.endTagOther

def ignoreEndTagColgroup(self):
return self.tree.openElements[-1].name == "html"

def processCharacters(self, data):
ignoreEndTag = self.ignoreEndTagColgroup()
self.endTagColgroup("colgroup")
- # XXX
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processCharacters(data)

def startTagCol(self, name ,attributes):

@@ -1290,14 +1329,15 @@ class InColumnGroupPhase(Phase):
self.tree.openElements.pop()

def startTagOther(self, name, attributes):
ignoreEndTag = self.ignoreEndTagColgroup()
self.endTagColgroup("colgroup")
- # XXX how can be sure it's always ignored?
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processStartTag(name, attributes)

def endTagColgroup(self, name):
- if self.tree.openElements[-1].name == "html":
+ if self.ignoreEndTagColgroup():
# innerHTML case
assert self.parser.innerHTML
self.parser.parseError()
else:
self.tree.openElements.pop()

@@ -1308,9 +1348,9 @@ class InColumnGroupPhase(Phase):
u"col has no end tag."))

def endTagOther(self, name):
ignoreEndTag = self.ignoreEndTagColgroup()
self.endTagColgroup("colgroup")
- # XXX how can be sure it's always ignored?
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processEndTag(name)
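The same refactoring recurs in each of these phases: decide up front whether the end tag about to be synthesized will be ignored, emit it, and only reprocess the pending token if something actually closed. Distilled (method names vary per phase):

    def startTagTableOther(self, name, attributes):
        # decide *before* processEndTag mutates the open-element stack
        ignoreEndTag = self.ignoreEndTagCaption()
        self.parser.phase.processEndTag("caption")
        if not ignoreEndTag:
            # the caption really closed, so replay the tag in the new phase
            self.parser.phase.processStartTag(name, attributes)

This replaces the old "if not self.parser.innerHTML" guard, which merely guessed at whether the end tag had been ignored.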
@@ -1359,9 +1399,9 @@ class InTableBodyPhase(Phase):
def startTagTableOther(self, name, attributes):
# XXX AT Any ideas on how to share this with endTagTable?
- if self.tree.elementInScope("tbody", True) or \
- self.tree.elementInScope("thead", True) or \
- self.tree.elementInScope("tfoot", True):
+ if (self.tree.elementInScope("tbody", True) or
+ self.tree.elementInScope("thead", True) or
+ self.tree.elementInScope("tfoot", True)):
self.clearStackToTableBodyContext()
self.endTagTableRowGroup(self.tree.openElements[-1].name)
self.parser.phase.processStartTag(name, attributes)

@@ -1382,9 +1422,9 @@ class InTableBodyPhase(Phase):
") in the table body phase. Ignored."))

def endTagTable(self, name):
- if self.tree.elementInScope("tbody", True) or \
- self.tree.elementInScope("thead", True) or \
- self.tree.elementInScope("tfoot", True):
+ if (self.tree.elementInScope("tbody", True) or
+ self.tree.elementInScope("thead", True) or
+ self.tree.elementInScope("tfoot", True)):
self.clearStackToTableBodyContext()
self.endTagTableRowGroup(self.tree.openElements[-1].name)
self.parser.phase.processEndTag(name)
@@ -1428,6 +1468,9 @@ class InRowPhase(Phase):
self.tree.openElements[-1].name + u") in the row phase."))
self.tree.openElements.pop()

def ignoreEndTagTr(self):
return not self.tree.elementInScope("tr", tableVariant=True)

# the rest
def processCharacters(self, data):
self.parser.phases["inTable"].processCharacters(data)

@@ -1439,28 +1482,31 @@ class InRowPhase(Phase):
self.tree.activeFormattingElements.append(Marker)

def startTagTableOther(self, name, attributes):
ignoreEndTag = self.ignoreEndTagTr()
self.endTagTr("tr")
# XXX how are we sure it's always ignored in the innerHTML case?
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processStartTag(name, attributes)

def startTagOther(self, name, attributes):
self.parser.phases["inTable"].processStartTag(name, attributes)

def endTagTr(self, name):
- if self.tree.elementInScope("tr", True):
+ if not self.ignoreEndTagTr():
self.clearStackToTableRowContext()
self.tree.openElements.pop()
self.parser.phase = self.parser.phases["inTableBody"]
else:
# innerHTML case
assert self.parser.innerHTML
self.parser.parseError()

def endTagTable(self, name):
ignoreEndTag = self.ignoreEndTagTr()
self.endTagTr("tr")
# Reprocess the current tag if the tr end tag was not ignored
# XXX how are we sure it's always ignored in the innerHTML case?
- if not self.parser.innerHTML:
+ if not ignoreEndTag:
self.parser.phase.processEndTag(name)

def endTagTableRowGroup(self, name):
@@ -1628,7 +1674,7 @@ class InSelectPhase(Phase):
u"select phase. Ignored."))

def endTagSelect(self, name):
- if self.tree.elementInScope(name, True):
+ if self.tree.elementInScope("select", True):
node = self.tree.openElements.pop()
while node.name != "select":
node = self.tree.openElements.pop()

@@ -1641,7 +1687,7 @@ class InSelectPhase(Phase):
self.parser.parseError(_(u"Unexpected table end tag (" + name +\
") in the select phase."))
if self.tree.elementInScope(name, True):
- self.endTagSelect()
+ self.endTagSelect("select")
self.parser.phase.processEndTag(name)

def endTagOther(self, name):
@@ -1736,8 +1782,8 @@ class InFramesetPhase(Phase):
u"in the frameset phase (innerHTML)."))
else:
self.tree.openElements.pop()
- if not self.parser.innerHTML and\
- self.tree.openElements[-1].name != "frameset":
+ if (not self.parser.innerHTML and
+ self.tree.openElements[-1].name != "frameset"):
# If we're not in innerHTML mode and the current node is not a
# "frameset" element (anymore) then switch.
self.parser.phase = self.parser.phases["afterFrameset"]
View File
@@ -14,7 +14,7 @@ class HTMLInputStream(object):
"""
- def __init__(self, source, encoding=None, chardet=True):
+ def __init__(self, source, encoding=None, parseMeta=True, chardet=True):
"""Initialises the HTMLInputStream.

HTMLInputStream(source, [encoding]) -> Normalized stream from source

@@ -26,6 +26,8 @@ class HTMLInputStream(object):
the encoding. If specified, that encoding will be used,
regardless of any BOM or later declaration (such as in a meta
element)
parseMeta - Look for a <meta> element containing encoding information
"""

# List of where new lines occur

@@ -41,12 +43,9 @@ class HTMLInputStream(object):
#Encoding to use if no other information can be found
self.defaultEncoding = "windows-1252"
- #Autodetect encoding if no other information can be found?
- self.chardet = chardet
#Detect encoding iff no explicit "transport level" encoding is supplied
if encoding is None or not isValidEncoding(encoding):
- encoding = self.detectEncoding()
+ encoding = self.detectEncoding(parseMeta, chardet)
self.charEncoding = encoding

# Read bytes from stream decoding them into Unicode

@@ -79,17 +78,17 @@ class HTMLInputStream(object):
stream = cStringIO.StringIO(str(source))
return stream

- def detectEncoding(self):
+ def detectEncoding(self, parseMeta=True, chardet=True):
#First look for a BOM
#This will also read past the BOM if present
encoding = self.detectBOM()
#If there is no BOM need to look for meta elements with encoding
#information
- if encoding is None:
+ if encoding is None and parseMeta:
encoding = self.detectEncodingMeta()
#Guess with chardet, if available
- if encoding is None and self.chardet:
+ if encoding is None and chardet:
try:
import chardet
buffer = self.rawStream.read()
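The net effect is that the fallbacks are now chosen per call rather than held as instance state. A sketch of the resulting detection order (call shape assumed):

    stream = HTMLInputStream(open('page.html'), parseMeta=True, chardet=True)
    # 1. a valid transport-level encoding argument is used as-is
    # 2. failing that, a BOM wins
    # 3. failing that, the <meta> prescan runs, but only if parseMeta is set
    # 4. failing that, chardet guesses, but only if the flag is set and the module imports
    # 5. failing all of the above, defaultEncoding ("windows-1252") applies
    print stream.charEncoding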
View File
@@ -32,8 +32,8 @@ class HTMLTokenizer(object):
# XXX need to fix documentation

- def __init__(self, stream, encoding=None):
- self.stream = HTMLInputStream(stream, encoding)
+ def __init__(self, stream, encoding=None, parseMeta=True):
+ self.stream = HTMLInputStream(stream, encoding, parseMeta)
self.states = {
"data":self.dataState,

@@ -338,31 +338,33 @@ class HTMLTokenizer(object):
self.state = self.states["closeTagOpen"]
else:
self.tokenQueue.append({"type": "Characters", "data": u"<"})
- self.stream.queue.append(data)
+ self.stream.queue.insert(0, data)
self.state = self.states["data"]
return True

def closeTagOpenState(self):
- if self.contentModelFlag in (contentModelFlags["RCDATA"],\
- contentModelFlags["CDATA"]):
- charStack = []
+ if (self.contentModelFlag in (contentModelFlags["RCDATA"],
+ contentModelFlags["CDATA"])):
+ if self.currentToken:
+ charStack = []

# So far we know that "</" has been consumed. We now need to know
# whether the next few characters match the name of last emitted
# start tag which also happens to be the currentToken. We also need
# to have the character directly after the characters that could
# match the start tag name.
for x in xrange(len(self.currentToken["name"]) + 1):
charStack.append(self.stream.char())
# Make sure we don't get hit by EOF
if charStack[-1] == EOF:
break

# Since this is just for checking. We put the characters back on
# the stack.
self.stream.queue.extend(charStack)

- if self.currentToken["name"].lower() == "".join(charStack[:-1]).lower() \
+ if self.currentToken \
+ and self.currentToken["name"].lower() == "".join(charStack[:-1]).lower() \
and charStack[-1] in (spaceCharacters |
frozenset((u">", u"/", u"<", EOF))):
# Because the characters are correct we can safely switch to
View File
@@ -108,6 +108,9 @@ class TreeBuilder(object):
#The class to use for creating doctypes
doctypeClass = None

#Fragment class
fragmentClass = None

def __init__(self):
self.reset()

@@ -294,7 +297,6 @@ class TreeBuilder(object):
fosterParent = self.openElements[
self.openElements.index(lastTable) - 1]
else:
- assert self.innerHTML
fosterParent = self.openElements[0]
return fosterParent, insertBefore

@@ -310,6 +312,13 @@ class TreeBuilder(object):
def getDocument(self):
"Return the final tree"
return self.document

def getFragment(self):
"Return the final fragment"
#assert self.innerHTML
fragment = self.fragmentClass()
self.openElements[0].reparentChildren(fragment)
return fragment
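A concrete builder therefore only needs to supply fragmentClass and, if it wraps its nodes, unwrap the result; the reparenting stays in the base class. The expected subclass shape, sketched with hypothetical names:

    class MyTreeBuilder(TreeBuilder):
        fragmentClass = MyDocumentFragment        # implementation-specific node
        def getFragment(self):
            # unwrap only if this builder wraps its nodes
            return TreeBuilder.getFragment(self).element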
def testSerializer(self, node):
"""Serialize the subtree of node in the format required by unit tests
View File
@@ -1,6 +1,8 @@
import _base
from xml.dom import minidom, Node, XML_NAMESPACE, XMLNS_NAMESPACE
import new
from xml.sax.saxutils import escape
from constants import voidElements
import re

illegal_xml_chars = re.compile("[\x01-\x08\x0B\x0C\x0E-\x1F]")

@@ -87,6 +89,9 @@ class TreeBuilder(_base.TreeBuilder):
def commentClass(self, data):
return NodeBuilder(self.dom.createComment(data))

def fragmentClass(self):
return NodeBuilder(self.dom.createDocumentFragment())

def appendChild(self, node):
self.dom.appendChild(node.element)

@@ -96,6 +101,9 @@ class TreeBuilder(_base.TreeBuilder):
def getDocument(self):
return self.dom

def getFragment(self):
return _base.TreeBuilder.getFragment(self).element

def insertText(self, data, parent=None):
data=illegal_xml_chars.sub(u'\uFFFD',data)

@@ -118,7 +126,9 @@ def testSerializer(element):
if element.nodeType == Node.DOCUMENT_TYPE_NODE:
rv.append("|%s<!DOCTYPE %s>"%(' '*indent, element.name))
elif element.nodeType == Node.DOCUMENT_NODE:
rv.append("#document")
elif element.nodeType == Node.DOCUMENT_FRAGMENT_NODE:
rv.append("#document-fragment")
elif element.nodeType == Node.COMMENT_NODE:
rv.append("|%s<!-- %s -->"%(' '*indent, element.nodeValue))
elif element.nodeType == Node.TEXT_NODE:

@@ -135,6 +145,32 @@ def testSerializer(element):
return "\n".join(rv)
class HTMLSerializer(object):
def serialize(self, node):
rv = self.serializeNode(node)
for child in node.childNodes:
rv += self.serialize(child)
if node.nodeType == Node.ELEMENT_NODE and node.nodeName not in voidElements:
rv += "</%s>\n"%node.nodeName
return rv
def serializeNode(self, node):
if node.nodeType == Node.TEXT_NODE:
rv = node.nodeValue
elif node.nodeType == Node.ELEMENT_NODE:
rv = "<%s"%node.nodeName
if node.hasAttributes():
rv = rv+"".join([" %s='%s'"%(key, escape(value)) for key,value in
node.attributes.items()])
rv += ">"
elif node.nodeType == Node.COMMENT_NODE:
rv = "<!-- %s -->" % escape(node.nodeValue)
elif node.nodeType == Node.DOCUMENT_TYPE_NODE:
rv = "<!DOCTYPE %s>" % node.name
else:
rv = ""
return rv
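A quick sanity check of what this serializer emits, assuming a tree parsed with minidom (the single-quoted attributes and the missing end tags on void elements follow directly from the code above):

    from xml.dom import minidom
    doc = minidom.parseString("<p class='x'>hi<br/></p>")
    print HTMLSerializer().serialize(doc.documentElement)
    # -> <p class='x'>hi<br></p>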
def dom2sax(node, handler, nsmap={'xml':XML_NAMESPACE}):
if node.nodeType == Node.ELEMENT_NODE:
if not nsmap:

@@ -179,7 +215,10 @@ def dom2sax(node, handler, nsmap={'xml':XML_NAMESPACE}):
elif node.nodeType == Node.DOCUMENT_NODE:
handler.startDocument()
for child in node.childNodes: dom2sax(child, handler, nsmap)
handler.endDocument()
elif node.nodeType == Node.DOCUMENT_FRAGMENT_NODE:
for child in node.childNodes: dom2sax(child, handler, nsmap)
else:
# ATTRIBUTE_NODE
View File
@@ -129,6 +129,10 @@ class Document(Element):
def __init__(self):
Element.__init__(self, Document)

class DocumentFragment(Element):
def __init__(self):
Element.__init__(self, DocumentFragment)

def testSerializer(element):
rv = []
finalText = None

@@ -211,9 +215,13 @@ class TreeBuilder(_base.TreeBuilder):
doctypeClass = DocumentType
elementClass = Element
commentClass = Comment
fragmentClass = DocumentFragment

def testSerializer(self, element):
return testSerializer(element)

def getDocument(self):
return self.document._element

def getFragment(self):
return _base.TreeBuilder.getFragment(self)._element
View File
@@ -4,6 +4,7 @@ from xml.sax.saxutils import escape
# Really crappy basic implementation of a DOM-core like thing
class Node(_base.Node):
type = -1
def __init__(self, name):
self.name = name
self.parent = None

@@ -11,15 +12,18 @@ class Node(_base.Node):
self.childNodes = []
self._flags = []

def __iter__(self):
for node in self.childNodes:
yield node
for item in node:
yield item

def __unicode__(self):
return self.name

def toxml(self):
raise NotImplementedError

def __repr__(self):
return "<%s %s>" % (self.__class__, self.name)

def printTree(self, indent=0):
tree = '\n|%s%s' % (' '* indent, unicode(self))
for child in self.childNodes:

@@ -69,6 +73,7 @@ class Node(_base.Node):
return bool(self.childNodes)

class Document(Node):
type = 1
def __init__(self):
Node.__init__(self, None)

@@ -93,7 +98,13 @@ class Document(Node):
tree += child.printTree(2)
return tree

class DocumentFragment(Document):
type = 2
def __unicode__(self):
return "#document-fragment"

class DocumentType(Node):
type = 3
def __init__(self, name):
Node.__init__(self, name)

@@ -106,6 +117,7 @@ class DocumentType(Node):
return '<code class="markup doctype">&lt;!DOCTYPE %s></code>' % self.name

class TextNode(Node):
type = 4
def __init__(self, value):
Node.__init__(self, None)
self.value = value

@@ -119,6 +131,7 @@ class TextNode(Node):
hilite = toxml

class Element(Node):
type = 5
def __init__(self, name):
Node.__init__(self, name)
self.attributes = {}

@@ -164,6 +177,7 @@ class Element(Node):
return tree

class CommentNode(Node):
type = 6
def __init__(self, data):
Node.__init__(self, None)
self.data = data

@@ -177,11 +191,38 @@ class CommentNode(Node):
def hilite(self):
return '<code class="markup comment">&lt;!--%s--></code>' % escape(self.data)
class HTMLSerializer(object):
def serialize(self, node):
rv = self.serializeNode(node)
for child in node.childNodes:
rv += self.serialize(child)
if node.type == Element.type and node.name not in voidElements:
rv += "</%s>\n"%node.name
return rv
def serializeNode(self, node):
if node.type == TextNode.type:
rv = node.value
elif node.type == Element.type:
rv = "<%s"%node.name
if node.attributes:
rv = rv+"".join([" %s='%s'"%(key, escape(value)) for key,value in
node.attributes.iteritems()])
rv += ">"
elif node.type == CommentNode.type:
rv = "<!-- %s -->" % escape(node.data)
elif node.type == DocumentType.type:
rv = "<!DOCTYPE %s>" % node.name
else:
rv = ""
return rv
class TreeBuilder(_base.TreeBuilder):
documentClass = Document
doctypeClass = DocumentType
elementClass = Element
commentClass = CommentNode
fragmentClass = DocumentFragment

def testSerializer(self, node):
return node.printTree()
View File
@@ -44,13 +44,17 @@ def run(template_file, doc, mode='template'):
base,ext = os.path.splitext(os.path.basename(template_resolved))
module_name = ext[1:]
try:
- module = __import__(module_name)
+ try:
+ module = __import__("_" + module_name)
+ except:
+ module = __import__(module_name)
except Exception, inst:
return log.error("Skipping %s '%s' after failing to load '%s': %s",
mode, template_resolved, module_name, inst)

# Execute the shell module
options = planet.config.template_options(template_file)
if module_name == 'plugin': options['__file__'] = template_file
options.update(extra_options)
log.debug("Processing %s %s using %s", mode,
os.path.realpath(template_resolved), module_name)

@@ -60,3 +64,4 @@ def run(template_file, doc, mode='template'):
output_dir = planet.config.output_dir()
output_file = os.path.join(output_dir, base)
module.run(template_resolved, doc, output_file, options)
return output_file
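The extension on the template or filter name is what picks the shell module, with the new underscore convention preferring the implementation bundled with Venus. Worked through the code above:

    template_file = "index.html.genshi"
    base, ext = os.path.splitext(os.path.basename(template_file))
    # base == "index.html", ext == ".genshi"
    # __import__ tries "_genshi" (bundled) first, then plain "genshi"
    # the rendered result is written to <output_dir>/index.html and returned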
143
planet/shell/_genshi.py Normal file
View File
@ -0,0 +1,143 @@
from StringIO import StringIO
from xml.sax.saxutils import escape
from genshi.input import HTMLParser, XMLParser
from genshi.template import Context, MarkupTemplate
subscriptions = []
feed_types = [
'application/atom+xml',
'application/rss+xml',
'application/rdf+xml'
]
def norm(value):
""" Convert to Unicode """
if hasattr(value,'items'):
return dict([(norm(n),norm(v)) for n,v in value.items()])
try:
return value.decode('utf-8')
except:
return value.decode('iso-8859-1')
def find_config(config, feed):
# match based on self link
for link in feed.links:
if link.has_key('rel') and link.rel=='self':
if link.has_key('type') and link.type in feed_types:
if link.has_key('href') and link.href in subscriptions:
return norm(dict(config.parser.items(link.href)))
# match based on name
for sub in subscriptions:
if config.parser.has_option(sub, 'name') and \
norm(config.parser.get(sub, 'name')) == feed.planet_name:
return norm(dict(config.parser.items(sub)))
return {}
class XHTMLParser(object):
""" parse an XHTML fragment """
def __init__(self, text):
self.parser = XMLParser(StringIO("<div>%s</div>" % text))
self.depth = 0
def __iter__(self):
self.iter = self.parser.__iter__()
return self
def next(self):
object = self.iter.next()
if object[0] == 'END': self.depth = self.depth - 1
predepth = self.depth
if object[0] == 'START': self.depth = self.depth + 1
if predepth: return object
return self.next()
def streamify(text,bozo):
""" add a .stream to a _detail textConstruct """
if text.type == 'text/plain':
text.stream = HTMLParser(StringIO(escape(text.value)))
elif text.type == 'text/html' or bozo != 'false':
text.stream = HTMLParser(StringIO(text.value))
else:
text.stream = XHTMLParser(text.value)
def run(script, doc, output_file=None, options={}):
""" process an Genshi template """
context = Context(**options)
tmpl_fileobj = open(script)
tmpl = MarkupTemplate(tmpl_fileobj, script)
tmpl_fileobj.close()
if not output_file:
# filter
context.push({'input':XMLParser(StringIO(doc))})
else:
# template
import time
from planet import config,feedparser
from planet.spider import filename
# gather a list of subscriptions, feeds
global subscriptions
feeds = []
sources = config.cache_sources_directory()
for sub in config.subscriptions():
data=feedparser.parse(filename(sources,sub))
data.feed.config = norm(dict(config.parser.items(sub)))
if data.feed.has_key('link'):
feeds.append((data.feed.config.get('name',''),data.feed))
subscriptions.append(norm(sub))
feeds.sort()
# annotate each entry
new_date_format = config.new_date_format()
vars = feedparser.parse(StringIO(doc))
vars.feeds = [value for name,value in feeds]
last_feed = None
last_date = None
for entry in vars.entries:
entry.source.config = find_config(config, entry.source)
# add new_feed and new_date fields
entry.new_feed = entry.source.id
entry.new_date = date = None
if entry.has_key('published_parsed'): date=entry.published_parsed
if entry.has_key('updated_parsed'): date=entry.updated_parsed
if date: entry.new_date = time.strftime(new_date_format, date)
# remove new_feed and new_date fields if not "new"
if entry.new_date == last_date:
entry.new_date = None
if entry.new_feed == last_feed:
entry.new_feed = None
else:
last_feed = entry.new_feed
elif entry.new_date:
last_date = entry.new_date
last_feed = None
# add streams for all text constructs
for key in entry.keys():
if key.endswith("_detail") and entry[key].has_key('type') and \
entry[key].has_key('value'):
streamify(entry[key],entry.source.planet_bozo)
if entry.has_key('content'):
for content in entry.content:
streamify(content,entry.source.planet_bozo)
# add cumulative feed information to the Genshi context
vars.feed.config = dict(config.parser.items('Planet',True))
context.push(vars)
# apply template
output=tmpl.generate(context).render('xml')
if output_file:
out_file = open(output_file,'w')
out_file.write(output)
out_file.close()
else:
return output
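Like the other shell modules, this one can be exercised directly through planet.shell; the filter form below mirrors how the test suite drives it:

    from planet import shell
    out = shell.run('addsearch.genshi', open('index.html').read(), mode="filter")
    # template mode instead renders into the configured output directory:
    #   shell.run('index.html.genshi', doc)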
64
planet/shell/plugin.py Normal file
View File
@ -0,0 +1,64 @@
import os, sys, imp
from StringIO import StringIO
def run(script, doc, output_file=None, options={}):
""" process an Python script using imp """
save_sys = (sys.stdin, sys.stdout, sys.stderr, sys.argv)
plugin_stdout = StringIO()
plugin_stderr = StringIO()
try:
# redirect stdin
sys.stdin = StringIO(doc)
# redirect stdout
if output_file:
sys.stdout = open(output_file, 'w')
else:
sys.stdout = plugin_stdout
# redirect stderr
sys.stderr = plugin_stderr
# determine __file__ value
if options.has_key("__file__"):
plugin_file = options["__file__"]
del options["__file__"]
else:
plugin_file = script
# set sys.argv
options = sum([['--'+key, value] for key,value in options.items()], [])
sys.argv = [plugin_file] + options
# import script
handle = open(script, 'r')
cwd = os.getcwd()
try:
try:
try:
description=('.plugin', 'rb', imp.PY_SOURCE)
imp.load_module('__main__',handle,plugin_file,description)
except SystemExit,e:
if e.code: log.error('%s exit rc=%d',(plugin_file,e.code))
except Exception, e:
import traceback
type, value, tb = sys.exc_info()
plugin_stderr.write(''.join(
traceback.format_exception_only(type,value) +
traceback.format_tb(tb)))
finally:
handle.close()
if cwd != os.getcwd(): os.chdir(cwd)
finally:
# restore system state
sys.stdin, sys.stdout, sys.stderr, sys.argv = save_sys
# log anything sent to stderr
if plugin_stderr.getvalue():
import planet
planet.logger.error(plugin_stderr.getvalue())
# return stdout
return plugin_stdout.getvalue()
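A plugin is thus just a Python script that reads the document from stdin, writes its result to stdout, and finds its config options on argv. A minimal hypothetical example, saved as uppercase.plugin and listed in a filters option:

    import sys
    args = sys.argv                    # ['uppercase.plugin', '--key', 'value', ...]
    prefix = '--prefix' in args and args[args.index('--prefix')+1] or ''
    sys.stdout.write(prefix + sys.stdin.read().upper())

Options from a matching [uppercase.plugin] config section arrive as the --key value pairs seen above.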
View File
@@ -102,7 +102,7 @@ Items = [
['enclosure_type', String, 'links', {'rel': 'enclosure'}, 'type'],
['id', String, 'id'],
['link', String, 'links', {'rel': 'alternate'}, 'href'],
- ['new_channel', String, 'id'],
+ ['new_channel', String, 'source', 'id'],
['new_date', NewDate, 'published_parsed'],
['new_date', NewDate, 'updated_parsed'],
['rights', String, 'rights_detail', 'value'],

@@ -226,7 +226,7 @@ def template_info(source):
date = item['new_date']

if item.has_key('new_channel'):
- if item['new_channel'] == channel:
+ if item['new_channel'] == channel and not item.has_key('new_date'):
del item['new_channel']
else:
channel = item['new_channel']

@@ -241,12 +241,15 @@ def run(script, doc, output_file=None, options={}):
for key,value in template_info(doc).items():
tp.set(key, value)

- reluri = os.path.splitext(os.path.basename(output_file))[0]
- tp.set('url', urlparse.urljoin(config.link(),reluri))
+ if output_file:
+ reluri = os.path.splitext(os.path.basename(output_file))[0]
+ tp.set('url', urlparse.urljoin(config.link(),reluri))
output = open(output_file, "w")
output.write(tp.process(template))
output.close()
+ else:
+ return tp.process(template)

if __name__ == '__main__':
sys.path.insert(0, os.path.split(sys.path[0])[0])
View File
@@ -323,14 +323,12 @@ def httpThread(thread_index, input_queue, output_queue, log):
for line in (traceback.format_exception_only(type, value) +
traceback.format_tb(tb)):
log.error(line.rstrip())
- continue

output_queue.put(block=True, item=(uri, feed_info, feed))
uri, feed_info = input_queue.get(block=True)

def spiderPlanet(only_if_new = False):
""" Spider (fetch) an entire planet """
- # log = planet.getLogger(config.log_level(),config.log_format())
log = planet.getLogger(config.log_level(),config.log_format())
global index
View File
@@ -111,9 +111,37 @@ def apply(doc):
if not os.path.exists(output_dir): os.makedirs(output_dir)
log = planet.getLogger(config.log_level(),config.log_format())
planet_filters = config.filters('Planet')

# Go-go-gadget-template
for template_file in config.template_files():
- shell.run(template_file, doc)
+ output_file = shell.run(template_file, doc)
# run any template specific filters
if config.filters(template_file) != planet_filters:
output = open(output_file).read()
for filter in config.filters(template_file):
if filter in planet_filters: continue
if filter.find('>')>0:
# tee'd output
filter,dest = filter.split('>',1)
tee = shell.run(filter.strip(), output, mode="filter")
if tee:
output_dir = planet.config.output_dir()
dest_file = os.path.join(output_dir, dest.strip())
dest_file = open(dest_file,'w')
dest_file.write(tee)
dest_file.close()
else:
# pipe'd output
output = shell.run(filter, output, mode="filter")
if not output:
os.unlink(output_file)
break
else:
handle = open(output_file,'w')
handle.write(output)
handle.close()
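In config terms (mirroring the test configs later in this commit), a per-template filters list drives this loop; a '>' in an entry tees the filtered copy into a second output file instead of replacing the template's own output:

    [index.html.genshi]
    filters:
        xhtml2html.py>index.html4

Here index.html is still produced by the template, while the HTML 4 rendition is written alongside it as index.html4.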
# Process bill of materials
for copy_file in config.bill_of_materials():

@@ -123,6 +151,9 @@ def apply(doc):
if os.path.exists(source): break
else:
log.error('Unable to locate %s', copy_file)
log.info("Template search path:")
for template_dir in config.template_directories():
log.info(" %s", os.path.realpath(template_dir))
continue

mtime = os.stat(source).st_mtime

@@ -131,5 +162,6 @@ def apply(doc):
if not os.path.exists(dest_dir): os.makedirs(dest_dir)
log.info("Copying %s to %s", source, dest)
if os.path.exists(dest): os.chmod(dest, 0644)
shutil.copyfile(source, dest)
shutil.copystat(source, dest)
View File
@@ -18,12 +18,23 @@ if not hasattr(unittest.TestCase, 'assertFalse'):
if sys.path[0]: os.chdir(sys.path[0])
sys.path[0] = os.getcwd()

- # find all of the planet test modules
- modules = map(fullmodname, glob.glob(os.path.join('tests', 'test_*.py')))
+ # determine verbosity
+ verbosity = 1
+ for arg,value in (('-q',0),('--quiet',0),('-v',2),('--verbose',2)):
+ if arg in sys.argv:
+ verbosity = value
+ sys.argv.remove(arg)

- # enable warnings
+ # find all of the planet test modules
+ modules = []
+ for pattern in sys.argv[1:] or ['test_*.py']:
+ modules += map(fullmodname, glob.glob(os.path.join('tests', pattern)))

+ # enable logging
import planet
- planet.getLogger("WARNING",None)
+ if verbosity == 0: planet.getLogger("FATAL",None)
+ if verbosity == 1: planet.getLogger("WARNING",None)
+ if verbosity == 2: planet.getLogger("DEBUG",None)

# load all of the tests into a suite
try:

@@ -33,11 +44,5 @@ except Exception, exception:
for module in modules: __import__(module)
raise

- verbosity = 1
- if "-q" in sys.argv or '--quiet' in sys.argv:
- verbosity = 0
- if "-v" in sys.argv or '--verbose' in sys.argv:
- verbosity = 2

# run test suite
unittest.TextTestRunner(verbosity=verbosity).run(suite)
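The runner now accepts glob patterns alongside the verbosity switches, e.g. (script name assumed):

    python runtests.py                     # all tests, WARNING-level logging
    python runtests.py -q                  # all tests, FATAL only
    python runtests.py -v test_apply*.py   # just the apply tests, DEBUG logging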
View File
@ -0,0 +1,21 @@
[Planet]
output_theme = asf
output_dir = tests/work/apply
name = test planet
cache_directory = tests/work/spider/cache
filter_directories = tests/data/apply
[index.html.xslt]
filters = rebase.py?base=http://example.com/
[tests/data/spider/testfeed0.atom]
name = not found
[tests/data/spider/testfeed1b.atom]
name = one
[tests/data/spider/testfeed2.atom]
name = two
[tests/data/spider/testfeed3.rss]
name = three
View File
@ -0,0 +1,21 @@
[Planet]
output_theme = genshi_fancy
output_dir = tests/work/apply
name = test planet
cache_directory = tests/work/spider/cache
bill_of_materials:
images/#{face}
[tests/data/spider/testfeed0.atom]
name = not found
[tests/data/spider/testfeed1b.atom]
name = one
face = jdub.png
[tests/data/spider/testfeed2.atom]
name = two
[tests/data/spider/testfeed3.rss]
name = three
View File
@ -0,0 +1,25 @@
[Planet]
output_theme = genshi_fancy
output_dir = tests/work/apply
name = test planet
cache_directory = tests/work/spider/cache
bill_of_materials:
images/#{face}
[index.html.genshi]
filters:
xhtml2html.py>index.html4
[tests/data/spider/testfeed0.atom]
name = not found
[tests/data/spider/testfeed1b.atom]
name = one
face = jdub.png
[tests/data/spider/testfeed2.atom]
name = two
[tests/data/spider/testfeed3.rss]
name = three
View File
@ -0,0 +1,29 @@
[Planet]
output_theme = classic_fancy
output_dir = tests/work/apply
name = test planet
cache_directory = tests/work/spider/cache
bill_of_materials:
images/#{face}
[index.html.tmpl]
filters:
html2xhtml.plugin
mememe.plugin
[mememe.plugin]
sidebar = //*[@class='sidebar']
[tests/data/spider/testfeed0.atom]
name = not found
[tests/data/spider/testfeed1b.atom]
name = one
face = jdub.png
[tests/data/spider/testfeed2.atom]
name = two
[tests/data/spider/testfeed3.rss]
name = three
View File
@ -0,0 +1,24 @@
# make href attributes absolute, using base argument passed in
import sys
try:
base = sys.argv[sys.argv.index('--base')+1]
except:
sys.stderr.write('Missing required argument: base\n')
sys.exit()
from xml.dom import minidom, Node
from urlparse import urljoin
def rebase(node, newbase):
if node.hasAttribute('href'):
href=node.getAttribute('href')
if href != urljoin(base,href):
node.setAttribute('href', urljoin(base,href))
for child in node.childNodes:
if child.nodeType == Node.ELEMENT_NODE:
rebase(child, newbase)
doc = minidom.parse(sys.stdin)
rebase(doc.documentElement, base)
print doc.toxml('utf-8')
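Driven through the config above (filters = rebase.py?base=http://example.com/), the query-string argument reaches this script as --base on argv; the effect, roughly:

    # echo '<a href="/about">about</a>' | python rebase.py --base http://example.com/
    # -> <a href="http://example.com/about">about</a>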
View File
@@ -1,7 +1,10 @@
<entry xmlns="http://www.w3.org/2005/Atom">
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
- <img src="http://example.com/foo.png"/>
+ Plain old image: <img src="http://example.com/foo.png"/>
+ Host has a non-standard port: <img src="http://example.com:1234/foo.png"/>
+ A non-port colon: <img src="http://u:p@example.com/foo.png"/>
+ Several colons: <img src="http://u:p@example.com:1234/foo.png"/>
</div>
</content>
</entry>
View File
@ -0,0 +1,18 @@
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><link rel="stylesheet" href="default.css" type="text/css"/><title>Planet Intertwingly</title><meta name="robots" content="noindex,nofollow"/><meta name="generator" content="Venus"/><link rel="alternate" href="http://planet.intertwingly.net/atom.xml" title="Planet Intertwingly" type="application/atom+xml"/><link rel="shortcut icon" href="/favicon.ico"/><script type="text/javascript" src="personalize.js"/></head>
<body>
<h1>Planet Intertwingly</h1>
<div id="body">
<h2 class="date">April 14, 2007</h2>
</div><h1>Footnotes</h1>
<div id="sidebar"><h2>Info</h2><dl><dt>Last updated:</dt><dd><span class="date" title="GMT">April 14, 2007 02:01 PM</span></dd><dt>Powered by:</dt><dd><a href="http://intertwingly.net/code/venus/"><img src="images/venus.png" width="80" height="15" alt="Venus" border="0"/></a></dd><dt>Export:</dt><dd><ul><li><a href="opml.xml"><img src="images/opml.png" alt="OPML"/></a></li><li><a href="foafroll.xml"><img src="images/foaf.png" alt="FOAF"/></a></li></ul></dd></dl></div>
</body></html>
View File
@ -0,0 +1,2 @@
[Planet]
exclude=two
View File
@ -0,0 +1,34 @@
<!--
Description: source id
Expect: Items[0]['new_channel'] == 'http://example.com/' and not Items[1].has_key('new_channel') and Items[2]['new_channel'] == 'http://example.org/' and Items[3]['new_channel'] == 'http://example.com/'
-->
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<source>
<id>http://example.com/</id>
</source>
</entry>
<entry>
<source>
<id>http://example.com/</id>
</source>
</entry>
<entry>
<source>
<id>http://example.org/</id>
</source>
</entry>
<entry>
<source>
<id>http://example.com/</id>
</source>
</entry>
<planet:source xmlns:planet='http://planet.intertwingly.net/'>
<id>http://example.com/</id>
</planet:source>
<planet:source xmlns:planet='http://planet.intertwingly.net/'>
<id>http://example.org/</id>
</planet:source>
</feed>
View File
@ -0,0 +1,35 @@
<!--
Description: source id
Expect: Items[0]['new_channel'] == 'http://example.com/' and not Items[1].has_key('new_channel') and Items[2]['new_channel'] == 'http://example.org/'
-->
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<updated>2004-02-28T18:14:55Z</updated>
<source>
<id>http://example.com/</id>
</source>
</entry>
<entry>
<updated>2004-02-28T14:14:55Z</updated>
<source>
<id>http://example.com/</id>
</source>
</entry>
<entry>
<updated>2004-02-27T14:14:55Z</updated>
<source>
<id>http://example.org/</id>
</source>
</entry>
<entry>
<updated>2004-02-26T14:14:55Z</updated>
<source>
<id>http://example.org/</id>
</source>
</entry>
<planet:source xmlns:planet='http://planet.intertwingly.net/'>
<id>http://example.com/</id>
</planet:source>
</feed>
View File
@ -0,0 +1,23 @@
<!--
Description: new date
Expect: Items[0]['new_date'] == 'February 28, 2004' and not Items[1].has_key('new_date') and Items[2]['new_date'] == 'February 27, 2004' and Items[3]['new_date'] == 'February 26, 2004'
-->
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<updated>2004-02-28T18:14:55Z</updated>
</entry>
<entry>
<updated>2004-02-28T14:14:55Z</updated>
</entry>
<entry>
<updated>2004-02-27T14:14:55Z</updated>
</entry>
<entry>
<updated>2004-02-26T14:14:55Z</updated>
</entry>
<planet:source xmlns:planet='http://planet.intertwingly.net/'>
<id>http://example.com/</id>
</planet:source>
</feed>
View File
@ -0,0 +1,13 @@
<!--
Description: creative commons license
Expect: links[0].rel == 'license' and links[0].href == 'http://www.creativecommons.org/licenses/by-nc/1.0'
-->
<rss version="2.0" xmlns:cc="http://web.resource.org/cc/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<item>
<cc:license rdf:resource="http://www.creativecommons.org/licenses/by-nc/1.0"/>
</item>
</channel>
</rss>
View File
@ -0,0 +1,13 @@
<!--
Description: creative commons license
Expect: links[0].rel == 'license' and links[0].href == 'http://www.creativecommons.org/licenses/by-nc/1.0'
-->
<rss version="2.0" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule">
<channel>
<item>
<creativeCommons:license>http://www.creativecommons.org/licenses/by-nc/1.0</creativeCommons:license>
</item>
</channel>
</rss>
View File
@@ -21,8 +21,7 @@ class ApplyTest(unittest.TestCase):
os.makedirs(workdir)

def tearDown(self):
- shutil.rmtree(workdir)
- os.removedirs(os.path.split(workdir)[0])
+ shutil.rmtree(os.path.split(workdir)[0])

def test_apply_asf(self):
config.load(configfile % 'asf')

@@ -47,8 +46,38 @@ class ApplyTest(unittest.TestCase):
self.assertEqual(12, content)
self.assertEqual(3, lang)

- def test_apply_fancy(self):
+ def test_apply_classic_fancy(self):
config.load(configfile % 'fancy')
self.apply_fancy()
def test_apply_genshi_fancy(self):
config.load(configfile % 'genshi')
self.apply_fancy()
def test_apply_filter_html(self):
config.load(configfile % 'html')
self.apply_fancy()
output = open(os.path.join(workdir, 'index.html')).read()
self.assertTrue(output.find('/>')>=0)
output = open(os.path.join(workdir, 'index.html4')).read()
self.assertTrue(output.find('/>')<0)
def test_apply_filter_mememe(self):
config.load(configfile % 'mememe')
self.apply_fancy()
output = open(os.path.join(workdir, 'index.html')).read()
self.assertTrue(output.find('<div class="sidebar"><h2>Memes <a href="memes.atom">')>=0)
def apply_fancy(self):
# drop slow templates unrelated to test at hand
templates = config.parser.get('Planet','template_files').split()
templates.remove('rss10.xml.tmpl')
templates.remove('rss20.xml.tmpl')
config.parser.set('Planet','template_files',' '.join(templates))
splice.apply(self.feeddata)

# verify that selected files are there

@@ -63,6 +92,14 @@ class ApplyTest(unittest.TestCase):
self.assertTrue(html.find(
'<h4><a href="http://example.com/2">Venus</a></h4>')>=0)
def test_apply_filter(self):
config.load(configfile % 'filter')
splice.apply(self.feeddata)
# verify that index.html is well formed, has content, and xml:lang
html = open(os.path.join(workdir, 'index.html')).read()
self.assertTrue(html.find(' href="http://example.com/default.css"')>=0)
try:
import libxml2
except ImportError:

@@ -85,3 +122,10 @@ except ImportError:
logger.warn("xsltproc is not available => can't test XSLT templates")
for method in dir(ApplyTest):
if method.startswith('test_'): delattr(ApplyTest,method)
import test_filter_genshi
for method in dir(test_filter_genshi.GenshiFilterTests):
if method.startswith('test_'): break
else:
delattr(ApplyTest,'test_apply_genshi_fancy')
delattr(ApplyTest,'test_apply_filter_html')
View File
@ -0,0 +1,29 @@
#!/usr/bin/env python
import unittest, xml.dom.minidom
from planet import shell, config, logger
class GenshiFilterTests(unittest.TestCase):
def test_addsearch_filter(self):
testfile = 'tests/data/filter/index.html'
filter = 'addsearch.genshi'
output = shell.run(filter, open(testfile).read(), mode="filter")
self.assertTrue(output.find('<h2>Search</h2>')>=0)
self.assertTrue(output.find('<form><input name="q"/></form>')>=0)
self.assertTrue(output.find(' href="http://planet.intertwingly.net/opensearchdescription.xml"')>=0)
self.assertTrue(output.find('</script>')>=0)
def test_xhtml2html_filter(self):
testfile = 'tests/data/filter/index.html'
filter = 'xhtml2html.py'
output = shell.run(filter, open(testfile).read(), mode="filter")
self.assertTrue(output.find('/>')<0)
self.assertTrue(output.find('</script>')>=0)
try:
import genshi
except:
logger.warn("Genshi is not available => can't test genshi filters")
for method in dir(GenshiFilterTests):
if method.startswith('test_'): delattr(GenshiFilterTests,method)
View File
@@ -15,14 +15,30 @@ class XsltFilterTests(unittest.TestCase):
catterm = dom.getElementsByTagName('category')[0].getAttribute('term')
self.assertEqual('OnE', catterm)

def test_addsearch_filter(self):
testfile = 'tests/data/filter/index.html'
filter = 'addsearch.xslt'
output = shell.run(filter, open(testfile).read(), mode="filter")
self.assertTrue(output.find('<h2>Search</h2>')>=0)
self.assertTrue(output.find('<form><input name="q"/></form>')>=0)
self.assertTrue(output.find(' href="http://planet.intertwingly.net/opensearchdescription.xml"')>=0)
self.assertTrue(output.find('</script>')>=0)

try:
import libxslt
except:
try:
- from subprocess import Popen, PIPE
- xsltproc=Popen(['xsltproc','--version'],stdout=PIPE,stderr=PIPE)
- xsltproc.communicate()
- if xsltproc.returncode != 0: raise ImportError
+ try:
+ # Python 2.5 bug 1704790 workaround (alas, Unix only)
+ import commands
+ if commands.getstatusoutput('xsltproc --version')[0] != 0:
+ raise ImportError
+ except:
+ from subprocess import Popen, PIPE
+ xsltproc=Popen(['xsltproc','--version'],stdout=PIPE,stderr=PIPE)
+ xsltproc.communicate()
+ if xsltproc.returncode != 0: raise ImportError
except:
logger.warn("libxslt is not available => can't test xslt filters")
del XsltFilterTests.test_xslt_filter
del XsltFilterTests.test_addsearch_filter
View File
@@ -11,8 +11,11 @@ class FilterTests(unittest.TestCase):
output = shell.run(filter, open(testfile).read(), mode="filter")
dom = xml.dom.minidom.parseString(output)
- imgsrc = dom.getElementsByTagName('img')[0].getAttribute('src')
- self.assertEqual('http://example.com.nyud.net:8080/foo.png', imgsrc)
+ imgsrcs = [img.getAttribute('src') for img in dom.getElementsByTagName('img')]
+ self.assertEqual('http://example.com.nyud.net:8080/foo.png', imgsrcs[0])
+ self.assertEqual('http://example.com.1234.nyud.net:8080/foo.png', imgsrcs[1])
+ self.assertEqual('http://u:p@example.com.nyud.net:8080/foo.png', imgsrcs[2])
+ self.assertEqual('http://u:p@example.com.1234.nyud.net:8080/foo.png', imgsrcs[3])

def test_excerpt_images1(self):
config.load('tests/data/filter/excerpt-images.ini')

@@ -108,17 +111,44 @@ class FilterTests(unittest.TestCase):
self.assertNotEqual('', output)
def test_regexp_filter2(self):
config.load('tests/data/filter/regexp-sifter2.ini')
testfile = 'tests/data/filter/category-one.xml'
output = open(testfile).read()
for filter in config.filters():
output = shell.run(filter, output, mode="filter")
self.assertNotEqual('', output)
testfile = 'tests/data/filter/category-two.xml'
output = open(testfile).read()
for filter in config.filters():
output = shell.run(filter, output, mode="filter")
self.assertEqual('', output)
try:
from subprocess import Popen, PIPE

- _no_sed = False
- try:
- sed = Popen(['sed','--version'],stdout=PIPE,stderr=PIPE)
- sed.communicate()
- if sed.returncode != 0:
- _no_sed = True
- except WindowsError:
- _no_sed = True
+ _no_sed = True
+ if _no_sed:
+ try:
+ # Python 2.5 bug 1704790 workaround (alas, Unix only)
+ import commands
+ if commands.getstatusoutput('sed --version')[0]==0: _no_sed = False
+ except:
+ pass
+ if _no_sed:
+ try:
+ sed = Popen(['sed','--version'],stdout=PIPE,stderr=PIPE)
+ sed.communicate()
+ if sed.returncode == 0: _no_sed = False
+ except WindowsError:
+ pass

if _no_sed:
logger.warn("sed is not available => can't test stripAd_yahoo")
View File
@@ -208,7 +208,7 @@ body > h1 {
text-align: right;
}

- #body h2.date {
+ #body > h2 {
text-transform: none;
font-size: medium;
color: #333;

@@ -466,11 +466,28 @@ ul.tags a:link, ul.tags a:visited {
color:green
}

a[rel='tag'] img {
border: 0;
}

/* DiveIntoMark */
.framed {
float: none;
}
/* BurningBird */
.update:before {
content: 'Update';
font-weight: bold;
}
.update {
margin: 2em;
padding: 0 1em 0 1em;
background: #eee;
border: 1px solid #aaa;
}
/* ----------------------------- Footer ---------------------------- */
#footer {
View File
@ -49,9 +49,9 @@
<dl> <dl>
<dt>Last updated:</dt> <dt>Last updated:</dt>
<dd> <dd>
<span class="date" title="GMT"> <time datetime="{atom:updated}" title="GMT">
<xsl:value-of select="atom:updated/@planet:format"/> <xsl:value-of select="atom:updated/@planet:format"/>
</span> </time>
</dd> </dd>
<dt>Powered by:</dt> <dt>Powered by:</dt>
<dd> <dd>
@ -131,7 +131,7 @@
<xsl:value-of select="planet:name"/> <xsl:value-of select="planet:name"/>
</a> </a>
<xsl:if test="$posts"> <xsl:if test="$posts[string-length(atom:title) &gt; 0]">
<ul> <ul>
<xsl:for-each select="$posts"> <xsl:for-each select="$posts">
<xsl:if test="string-length(atom:title) &gt; 0"> <xsl:if test="string-length(atom:title) &gt; 0">
@@ -165,10 +165,12 @@
 <xsl:if test="not(preceding-sibling::atom:entry
 [substring(atom:updated,1,10) = $date])">
 <xsl:text>&#10;&#10;</xsl:text>
-<h2 class="date">
-<xsl:value-of select="substring-before(atom:updated/@planet:format,', ')"/>
-<xsl:text>, </xsl:text>
-<xsl:value-of select="substring-before(substring-after(atom:updated/@planet:format,', '), ' ')"/>
+<h2>
+<time datetime="{$date}">
+<xsl:value-of select="substring-before(atom:updated/@planet:format,', ')"/>
+<xsl:text>, </xsl:text>
+<xsl:value-of select="substring-before(substring-after(atom:updated/@planet:format,', '), ' ')"/>
+</time>
 </h2>
 </xsl:if>
@@ -231,9 +233,9 @@
 <xsl:text> at </xsl:text>
 </xsl:when>
 </xsl:choose>
-<span class="date" title="GMT">
+<time datetime="{atom:updated}" title="GMT">
 <xsl:value-of select="atom:updated/@planet:format"/>
-</span>
+</time>
 </a>
 </div>
 </div>


@@ -71,6 +71,7 @@ function createCookie(name,value,days) {
 // read a cookie
 function readCookie(name) {
   var nameEQ = name + "=";
+  if (!document.cookie) return;
   var ca = document.cookie.split(';');
   for(var i=0;i < ca.length;i++) {
     var c = ca[i];
@@ -134,11 +135,27 @@ function addOption(event) {
   }
 }

-// convert date to local time
+// Parse an HTML5-liberalized version of RFC 3339 datetime values
+Date.parseRFC3339 = function (string) {
+  var date=new Date();
+  date.setTime(0);
+  var match = string.match(/(\d{4})-(\d\d)-(\d\d)\s*(?:[\sT]\s*(\d\d):(\d\d)(?::(\d\d))?(\.\d*)?\s*(Z|([-+])(\d\d):(\d\d))?)?/);
+  if (!match) return;
+  if (match[2]) match[2]--;
+  if (match[7]) match[7] = (match[7]+'000').substring(1,4);
+  var field = [null,'FullYear','Month','Date','Hours','Minutes','Seconds','Milliseconds'];
+  for (var i=1; i<=7; i++) if (match[i]) date['setUTC'+field[i]](match[i]);
+  if (match[9]) date.setTime(date.getTime()+
+    (match[9]=='-'?1:-1)*(match[10]*3600000+match[11]*60000) );
+  return date.getTime();
+}
+
+// convert datetime to local date
 var localere = /^(\w+) (\d+) (\w+) \d+ 0?(\d\d?:\d\d):\d\d ([AP]M) (EST|EDT|CST|CDT|MST|MDT|PST|PDT)/;
 function localizeDate(element) {
   var date = new Date();
-  date.setTime(Date.parse(element.innerHTML + " GMT"));
+  date.setTime(Date.parseRFC3339(element.getAttribute('datetime')));
+  if (!date.getTime()) return;
   var local = date.toLocaleString();
   var match = local.match(localere);
@@ -160,13 +177,13 @@ function localizeDate(element) {
 // find entries (and localizeDates)
 function findEntries() {
-  var span = document.getElementsByTagName('span');
-  for (var i=0; i<span.length; i++) {
-    if (span[i].className == "date" && span[i].title == "GMT") {
-      var date = localizeDate(span[i]);
-      var parent = span[i];
+  var times = document.getElementsByTagName('time');
+  for (var i=0; i<times.length; i++) {
+    if (times[i].title == "GMT") {
+      var date = localizeDate(times[i]);
+      var parent = times[i];
       while (parent &&
         (!parent.className || parent.className.split(' ')[0] != 'news')) {
         parent = parent.parentNode;
@@ -174,8 +191,9 @@ function findEntries() {
       if (parent) {
         var info = entries[entries.length] = new Object();
         info.parent = parent;
         info.date = date;
+        info.datetime = times[i].getAttribute('datetime').substring(0,10);
       }
     }
   }
@@ -184,7 +202,7 @@ function findEntries() {
 // insert/remove date headers to indicate change of date in local time zone
 function moveDateHeaders() {
-  lastdate = ''
+  var lastdate = ''
   for (var i=0; i<entries.length; i++) {
     var parent = entries[i].parent;
     var date = entries[i].date;
@@ -198,13 +216,16 @@ function moveDateHeaders() {
     if (lastdate == date) {
       sibling.parentNode.removeChild(sibling);
     } else {
-      sibling.innerHTML = date;
+      sibling.childNodes[0].innerHTML = date;
+      sibling.childNodes[0].setAttribute('datetime',entries[i].datetime);
       lastdate = date;
     }
   } else if (lastdate != date) {
     var h2 = document.createElement('h2');
-    h2.className = 'date'
-    h2.appendChild(document.createTextNode(date));
+    var time = document.createElement('time');
+    time.setAttribute('datetime',entries[i].datetime);
+    time.appendChild(document.createTextNode(date));
+    h2.appendChild(time);
     parent.parentNode.insertBefore(h2, parent);
     lastdate = date;
   }
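For readers more at home in Python, here is the same liberal RFC 3339 parse as the new Date.parseRFC3339 above, transcribed as a reference sketch that returns milliseconds since the epoch (Venus itself does not ship this):

    import re, calendar

    def parse_rfc3339(s):
        m = re.match(r'(\d{4})-(\d\d)-(\d\d)\s*(?:[\sT]\s*(\d\d):(\d\d)'
                     r'(?::(\d\d))?(\.\d*)?\s*(Z|([-+])(\d\d):(\d\d))?)?', s)
        if not m: return None
        y, mo, d = int(m.group(1)), int(m.group(2)), int(m.group(3))
        h, mi, sec = [int(g or 0) for g in m.group(4, 5, 6)]
        # pad or truncate the fraction to exactly three digits, as the
        # JavaScript does with (match[7]+'000').substring(1,4)
        ms = int((m.group(7) or '.')[1:4].ljust(3, '0'))
        t = calendar.timegm((y, mo, d, h, mi, sec, 0, 0, 0)) * 1000 + ms
        if m.group(9):  # numeric offset; a bare 'Z' needs no adjustment
            offset = (int(m.group(10)) * 3600 + int(m.group(11)) * 60) * 1000
            t += offset if m.group(9) == '-' else -offset
        return t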


@@ -26,7 +26,7 @@
 <xsl:copy>
 <xsl:attribute name="indexing:index">no</xsl:attribute>
 <xsl:apply-templates select="@*"/>
-<access:restriction relationship="allow"/>
+<access:restriction relationship="deny"/>
 <xsl:apply-templates select="node()"/>
 <xsl:text>&#10;</xsl:text>
 </xsl:copy>

Binary file not shown (203 B).

@@ -0,0 +1,20 @@
# This theme reimplements the classic "fancy" htmltmpl using genshi
[Planet]
template_files:
atom.xml.xslt
foafroll.xml.xslt
index.html.genshi
opml.xml.xslt
rss10.xml.tmpl
rss20.xml.tmpl
template_directories:
../common
../classic_fancy
bill_of_materials:
planet.css
images/feed-icon-10x10.png
images/logo.png
images/venus.png
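Worth noting: this config supplies only the Genshi index itself and borrows every other template from sibling theme directories. How the relative template_directories entries resolve is sketched below against a hypothetical theme path (the lookup order is a guess, not Venus's actual resolution code):

    import os

    def find_template(name, theme_dir,
                      directories=('../common', '../classic_fancy')):
        # try the theme's own directory first, then each listed
        # directory, interpreting relative entries against the theme
        for d in ('.',) + tuple(directories):
            candidate = os.path.normpath(os.path.join(theme_dir, d, name))
            if os.path.exists(candidate):
                return candidate
        return None

    # e.g. find_template('atom.xml.xslt', 'themes/my_genshi_theme')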


@@ -0,0 +1,95 @@
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:py="http://genshi.edgewall.org/">
<!--!
### Fancy Planet HTML template, converted to Genshi.
###
### When combined with the stylesheet and images in the output/ directory
### of the Planet source, this gives you a much prettier result than the
### default examples template and demonstrates how to use the config file
### to support things like faces
###
### For documentation on the more boring template elements, see
### http://www.intertwingly.net/code/venus/docs/templates.html
-->
<head>
<title>$feed.config.name</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta name="generator" content="$feed.generator"/>
<link rel="stylesheet" href="planet.css" type="text/css"/>
<link py:for="link in feed.links"
py:if="link.type in ['application/atom+xml','application/rss+xml']"
href="$link.href" rel="alternate" title="$link.title" type="$link.type"/>
</head>
<body>
<h1>$feed.config.name</h1>
<py:for each="entry in entries">
<div class="channelgroup" py:strip="not entry.new_date">
<h2 py:if="entry.new_date">$entry.new_date</h2>
<div class="entrygroup" py:strip="not entry.new_feed">
<h3 py:if="entry.new_feed"><a href="$entry.link" title="$entry.source.title">$entry.source.config.name</a></h3>
<img py:if="entry.new_feed and entry.source.config.face" class="face" src="images/$entry.source.config.face" width="$entry.source.config.facewidth" height="$entry.source.config.faceheight" alt=""/>
<h4 py:if="entry.title" lang="$entry.title_detail.language"><a href="$entry.link">$entry.title_detail.stream</a></h4>
<div class="entry">
<div class="content" py:choose="">
<py:when test="entry.content">${entry.content[0].stream}</py:when>
<py:when test="entry.summary_detail">${entry.summary_detail.stream}</py:when>
</div>
<p class="date"><py:if test="entry.author_detail and entry.author_detail.name">by $entry.author_detail.name at </py:if>$entry.updated</p>
</div>
</div>
</div>
</py:for>
<div class="sidebar">
<img src="images/logo.png" width="136" height="136" alt=""/>
<h2>Subscriptions</h2>
<ul>
<li py:for="feed in feeds">
<a py:for="link in feed.links" py:if="link.rel == 'self' and
link.type in ['application/atom+xml','application/rss+xml']"
href="$link.href" title="subscribe"><img src="images/feed-icon-10x10.png" alt="(feed)"/></a>
<py:choose>
<a py:when="feed.planet_message" href="$feed.link" class="message" title="$feed.planet_message">$feed.config.name</a>
<a py:otherwise="1" href="$feed.link" title="$feed.title">$feed.config.name</a>
</py:choose>
</li>
</ul>
<p>
<strong>Last updated:</strong><br/>
$feed.updated<br/>
<em>All times are UTC.</em><br/>
<br/>
Powered by:<br/>
<a href="http://intertwingly.net/code/venus/"><img src="images/venus.png" width="80" height="15" alt="Planet Venus" border="0"/></a>
</p>
<p>
<h2>Planetarium:</h2>
<ul>
<li><a href="http://www.planetapache.org/">Planet Apache</a></li>
<li><a href="http://planet.debian.net/">Planet Debian</a></li>
<li><a href="http://planet.freedesktop.org/">Planet freedesktop.org</a></li>
<li><a href="http://planet.gnome.org/">Planet GNOME</a></li>
<li><a href="http://planetsun.org/">Planet Sun</a></li>
<li><a href="http://fedora.linux.duke.edu/fedorapeople/">Fedora People</a></li>
<li><a href="http://www.planetplanet.org/">more...</a></li>
</ul>
</p>
</div>
</body>
</html>
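For a sense of how a template like this is consumed, a bare-bones Genshi invocation looks roughly like the sketch below; the context names (feed, feeds, entries) are the ones the markup above references, while the real wiring lives in Venus's genshi shell module:

    from genshi.template import MarkupTemplate

    def render(path, **context):
        # parse the XHTML template and expand its py: directives
        tmpl = MarkupTemplate(open(path), filepath=path)
        return tmpl.generate(**context).render('xhtml')

    # e.g. render('index.html.genshi', feed=feed, feeds=feeds, entries=entries)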


@@ -0,0 +1,150 @@
body {
border-right: 1px solid black;
margin-right: 200px;
padding-left: 20px;
padding-right: 20px;
}
h1 {
margin-top: 0px;
padding-top: 20px;
font-family: "Bitstream Vera Sans", sans-serif;
font-weight: normal;
letter-spacing: -2px;
text-transform: lowercase;
text-align: right;
color: grey;
}
.admin {
text-align: right;
}
h2 {
font-family: "Bitstream Vera Sans", sans-serif;
font-weight: normal;
color: #200080;
margin-left: -20px;
}
h3 {
font-family: "Bitstream Vera Sans", sans-serif;
font-weight: normal;
background-color: #a0c0ff;
border: 1px solid #5080b0;
padding: 4px;
}
h3 a {
text-decoration: none;
color: inherit;
}
h4 {
font-family: "Bitstream Vera Sans", sans-serif;
font-weight: bold;
}
h4 a {
text-decoration: none;
color: inherit;
}
img.face {
float: right;
margin-top: -3em;
}
.entry {
margin-bottom: 2em;
}
.entry .date {
font-family: "Bitstream Vera Sans", sans-serif;
color: grey;
}
.entry .date a {
text-decoration: none;
color: inherit;
}
.sidebar {
position: absolute;
top: 0px;
right: 0px;
width: 200px;
margin-left: 0px;
margin-right: 0px;
padding-right: 0px;
padding-top: 20px;
padding-left: 0px;
font-family: "Bitstream Vera Sans", sans-serif;
font-size: 85%;
}
.sidebar h2 {
font-size: 110%;
font-weight: bold;
color: black;
padding-left: 5px;
margin-left: 0px;
}
.sidebar ul {
padding-left: 1em;
margin-left: 0px;
list-style-type: none;
}
.sidebar ul li:hover {
color: grey;
}
.sidebar ul li a {
text-decoration: none;
}
.sidebar ul li a:hover {
text-decoration: underline;
}
.sidebar ul li a img {
border: 0;
}
.sidebar p {
border-top: 1px solid grey;
margin-top: 30px;
padding-top: 10px;
padding-left: 5px;
}
.sidebar .message {
cursor: help;
border-bottom: 1px dashed red;
}
.sidebar a.message:hover {
cursor: help;
background-color: #ff0000;
color: #ffffff !important;
text-decoration: none !important;
}
a:hover {
text-decoration: underline !important;
color: blue !important;
}