[Yanel-dev] crawler

Josias Thöny josias.thoeny at wyona.com
Mon Feb 26 23:35:33 CET 2007


Hi,

I've had a look at the crawler in Lenya 1.2, and it seems that a few 
features are missing:

basic missing features:
- download of images
- download of CSS
- download of scripts
- link rewriting
- limits for max level / max documents
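
To make the last item concrete, here is a rough sketch of a breadth-first crawl that stops at a maximum level and a maximum number of documents, and that also collects the image, CSS and script URLs a page refers to. It is only an illustration in Python, not code from the Lenya crawler; the names LinkCollector, crawl, fetch, max_level and max_documents are made up for this example, and fetch is assumed to be any function that returns the HTML of a URL or None.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects page links and page resources (images, CSS, scripts)."""

    # Hypothetical mapping: tag name -> attribute that carries the URL.
    URL_ATTRS = {"a": "href", "img": "src", "link": "href", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []      # targets of <a href="...">
        self.resources = []  # images, stylesheets, scripts

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Make the URL absolute and drop any #fragment.
                url = urldefrag(urljoin(self.base_url, value)).url
                if tag == "a":
                    self.pages.append(url)
                else:
                    self.resources.append(url)


def crawl(start_url, fetch, max_level=2, max_documents=100):
    """Breadth-first crawl bounded by link depth and document count.

    fetch(url) is an assumed callback returning HTML text, or None if
    the URL cannot be retrieved. Returns (documents, resources).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    documents, resources = [], set()
    while queue and len(documents) < max_documents:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        documents.append(url)
        collector = LinkCollector(url)
        collector.feed(html)
        resources.update(collector.resources)
        # Only follow links while we are above the level limit.
        if level < max_level:
            for link in collector.pages:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return documents, sorted(resources)
```

Passing fetch in as a parameter keeps the limit logic separate from the HTTP handling, so the limits can be tried out against an in-memory dictionary of pages before touching a real site.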

advanced missing features:
- handling of frames / iframes
- tidy HTML -> XHTML
- extraction of body content
- resolving of links in CSS (background images etc.)
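
As an illustration of the last point combined with link rewriting, the sketch below finds url(...) references in a stylesheet, resolves them against the stylesheet's own URL, and rewrites them to local paths. Again this is only a Python sketch under assumptions, not an existing crawler API: rewrite_css_urls and the local_path_for callback are hypothetical names, and the regular expression covers only plain url(...) values, not @import "file.css" without url().

```python
import re
from urllib.parse import urljoin

# Matches url(x), url('x') and url("x"); group 2 is the referenced target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def rewrite_css_urls(css_text, css_url, local_path_for):
    """Resolve and rewrite url(...) references in a stylesheet.

    css_url is the URL the stylesheet was downloaded from; relative
    references are resolved against it, not against the HTML page.
    local_path_for(absolute_url) is an assumed callback that returns
    the path under which the resource will be stored locally.
    Returns (rewritten_css, list_of_absolute_urls_to_download).
    """
    found = []

    def replace(match):
        target = match.group(2)
        if target.startswith("data:"):
            # Inline data needs no download and no rewriting.
            return match.group(0)
        absolute = urljoin(css_url, target)
        found.append(absolute)
        return 'url("%s")' % local_path_for(absolute)

    return CSS_URL.sub(replace, css_text), found
```

The important detail is that background images are resolved relative to the CSS file, so a crawler has to remember where each stylesheet came from before it can download or rewrite what the stylesheet points to.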

Or am I misunderstanding something...?

IMHO some of these features are quite essential, because we want to use 
the crawler in Yanel to import complete pages with images and 
everything, not only the text content.

The question now is: does it make sense to implement the missing 
features in that crawler, or should we look for an alternative?

Josias


