Retroactive filtering,

and make it clearer in the docs that filters are performed at spider time
Sam Ruby 2007-03-14 08:16:04 -04:00
parent 3aef94c214
commit bd1019e9fb
3 changed files with 24 additions and 10 deletions


@@ -13,7 +13,7 @@
 parameters come from the config file, and output goes to <code>stdout</code>.
 Anything written to <code>stderr</code> is logged as an ERROR message. If no
 <code>stdout</code> is produced, the entry is not written to the cache or
-processed further.</p>
+processed further; in fact, if the entry had previously been written to the cache, it will be removed.</p>
 <p>Input to a filter is a aggressively
 <a href="normalization.html">normalized</a> entry. For
@@ -54,6 +54,18 @@ instead of XPath expressions.</p>
 <h3>Notes</h3>
 <ul>
+<li>Filters are executed when a feed is fetched, and the results are placed
+into the cache. Changing a configuration file alone is not sufficient to
+change the contents of the cache &mdash; typically that only occurs after
+a feed is modified.</li>
+<li>Filters are simply invoked in the order they are listed in the
+configuration file (think unix pipes). Planet wide filters are executed before
+feed specific filters.</li>
+<li>Any filters listed in the <code>[planet]</code> section of your config.ini
+will be invoked on all feeds. Filters listed in individual
+<code>[feed]</code> sections will only be invoked on those feeds.</li>
 <li>The file extension of the filter is significant. <code>.py</code> invokes
 python. <code>.xslt</code> involkes XSLT. <code>.sed</code> and
@@ -61,14 +73,6 @@ python. <code>.xslt</code> involkes XSLT. <code>.sed</code> and
 perl or ruby or class/jar (java), aren't supported at the moment, but these
 would be easy to add.</li>
-<li>Any filters listed in the <code>[planet]</code> section of your config.ini
-will be invoked on all feeds. Filters listed in individual
-<code>[feed]</code> sections will only be invoked on those feeds.</li>
-<li>Filters are simply invoked in the order they are listed in the
-configuration file (think unix pipes). Planet wide filters are executed before
-feed specific filters.</li>
 <li>Templates written using htmltmpl currently only have access to a fixed set
 of fields, whereas XSLT templates have access to everything.</li>
 </ul>
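
The ordering rule in the notes above (planet-wide filters first, then feed-specific ones, chained like unix pipes) can be sketched as follows. This is only an illustration, not the project's code: `ordered_filters` and `run_pipeline` are hypothetical names, and filters are modelled as plain Python callables rather than the external scripts Planet actually invokes.

```python
def ordered_filters(planet_filters, feed_filters):
    """Hypothetical helper: planet-wide filters run before feed-specific ones."""
    return list(planet_filters) + list(feed_filters)


def run_pipeline(filters, entry):
    """Feed each filter's output into the next one, like a unix pipe.

    Stops as soon as a filter produces no output, mirroring the
    `if not output: break` check in the spider.
    """
    output = entry
    for apply_filter in filters:
        output = apply_filter(output)
        if not output:
            break
    return output
```

An empty result from `run_pipeline` corresponds to the case where the entry is dropped rather than cached.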


@@ -194,7 +194,9 @@ def writeCache(feed_uri, feed_info, data):
 for filter in config.filters(feed_uri):
     output = shell.run(filter, output, mode="filter")
     if not output: break
-if not output: continue
+if not output:
+    if os.path.exists(cache_file): os.remove(cache_file)
+    continue
 # write out and timestamp the results
 write(output, cache_file)
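
A minimal, self-contained sketch of the retroactive behaviour this hunk introduces, under stated assumptions: `write_cache_entry` is a hypothetical stand-in for the per-entry part of `writeCache`, filters are plain callables instead of `shell.run` invocations, and the cache entry is an ordinary text file.

```python
import os


def write_cache_entry(cache_file, entry, filters):
    """Run `entry` through `filters`; cache the result or purge a stale copy.

    Returns True if the entry was written, False if it was filtered out.
    """
    output = entry
    for apply_filter in filters:
        output = apply_filter(output)
        if not output:
            break

    if not output:
        # Retroactive step: an entry cached on an earlier run is removed
        # once a filter rejects it.
        if os.path.exists(cache_file):
            os.remove(cache_file)
        return False

    with open(cache_file, "w") as handle:
        handle.write(output)
    return True
```

Calling it once with no filters and then again with a filter that returns an empty string leaves no file behind, which is the situation the new `test_spiderFeed_retroactive_filter` test exercises against the real spider.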


@@ -73,6 +73,14 @@ class SpiderTest(unittest.TestCase):
         self.spiderFeed(testfeed % '1b')
         self.verify_spiderFeed()
+    def test_spiderFeed_retroactive_filter(self):
+        config.load(configfile)
+        self.spiderFeed(testfeed % '1b')
+        self.assertEqual(5, len(glob.glob(workdir+"/*")))
+        config.parser.set('Planet', 'filter', 'two')
+        self.spiderFeed(testfeed % '1b')
+        self.assertEqual(1, len(glob.glob(workdir+"/*")))
     def test_spiderUpdate(self):
         config.load(configfile)
         self.spiderFeed(testfeed % '1a')