[Yanel-dev] crawler

Josias Thöny josias.thoeny at wyona.com
Fri Mar 2 00:37:53 CET 2007


Michael Wechner wrote:
> Josias Thöny wrote:
> 
>> Hi,
>>
>> I've had a look at the crawler of lenya 1.2, and it seems that a few 
>> features are missing:
>>
>> basic missing features:
>> - download of images
>> - download of css
>> - download of scripts
>> - link rewriting
>> - limits for max level / max documents
>>
>> advanced missing features:
>> - handling of frames / iframes
>> - tidy html -> xhtml
>> - extraction of body content
>> - resolving of links in css (background images etc.)
>>
>> Or am I misunderstanding something...?
> 
> 
> no ;-)
> 
>>
>> IMHO some of these features are quite essential, because we want to 
>> use the crawler in yanel to import the complete pages with images and 
>> everything, not only text content.
>>
>> The question is now, does it make sense to implement the missing 
>> features into that crawler, or should we look for an alternative?
> 
> 
> sure, if there is an alternative :-) Is there?

The lenya crawler uses websphinx for robot exclusion. Websphinx is 
actually a complete crawler framework, and I think we could use it 
instead of the lenya crawler. It supports the basic features that I 
mentioned above.
I wrote a class DumpingCrawler which is based on the websphinx crawler. 
Basically, it should be able to create a complete dump of a website, 
including images, CSS, etc. It also rewrites links in the HTML code.
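To illustrate the link rewriting idea: an in-scope URL has to be mapped to a 
path inside the dump directory so that the rewritten links resolve locally. 
This is only a hypothetical sketch of that mapping (the class name 
LinkMapper and its method are my own invention, not the actual 
DumpingCrawler code):

```java
// Hypothetical sketch: maps an in-scope URL to a file path inside the
// dump directory, so rewritten links can point at the local copy.
public class LinkMapper {
    private final String scopeURL;
    private final String dumpDir;

    public LinkMapper(String scopeURL, String dumpDir) {
        this.scopeURL = scopeURL;
        this.dumpDir = dumpDir;
    }

    // e.g. http://wyona.org/foo/bar.html -> /tmp/dump/foo/bar.html
    // The bare scope URL itself falls back to index.html.
    public String toLocalPath(String url) {
        String relative = url.substring(scopeURL.length());
        if (relative.length() == 0 || relative.equals("/")) {
            relative = "/index.html";
        }
        return dumpDir + relative;
    }
}
```

A real implementation would also have to deal with query strings and 
fragment identifiers, which this sketch ignores.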

The source code is at:
https://svn.wyona.com/repos/public/crawler

I also added the websphinx source code to our svn because I had to patch 
a few things. The license is Apache-like, so it should be ok.

The usage is shown in the following example:

--------------------------------------------------
String crawlStartURL = "http://wyona.org";
String crawlScopeURL = "http://wyona.org";
String dumpDir = "/tmp/dump";

DumpingCrawler crawler = new DumpingCrawler(crawlStartURL, 
crawlScopeURL, dumpDir);

EventLog eventLog = new EventLog(System.out);
crawler.addCrawlListener(eventLog);
crawler.addLinkListener(eventLog);

crawler.run();
crawler.close();
--------------------------------------------------

Remarks:
- the EventLog is optional (it produces some log output)
- the crawlScopeURL limits the scope of the retrieved pages, i.e. only 
URLs starting with the scope URL are downloaded.
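The scope rule above is just a prefix match on the URL. A minimal sketch of 
that check (the class ScopeCheck and its method name are my own, for 
illustration only):

```java
// Hypothetical sketch of the scope rule: a URL is in scope only if it
// starts with the configured scope URL.
public class ScopeCheck {
    private final String scopeURL;

    public ScopeCheck(String scopeURL) {
        this.scopeURL = scopeURL;
    }

    // Returns true if the given URL lies within the crawl scope.
    public boolean inScope(String url) {
        return url.startsWith(scopeURL);
    }

    public static void main(String[] args) {
        ScopeCheck scope = new ScopeCheck("http://wyona.org");
        System.out.println(scope.inScope("http://wyona.org/about.html")); // true
        System.out.println(scope.inScope("http://example.com/page"));    // false
    }
}
```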

For more information, see
http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html

Josias


> 
> Thanks
> 
> Michi
> 
>>
>> Josias
>>
>> _______________________________________________
>> Yanel-development mailing list
>> Yanel-development at wyona.com
>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>
> 
> 
