[Yanel-dev] crawler
Josias Thöny
josias.thoeny at wyona.com
Fri Mar 2 00:37:53 CET 2007
Michael Wechner wrote:
> Josias Thöny wrote:
>
>> Hi,
>>
>> I've had a look at the crawler of lenya 1.2, and it seems that a few
>> features are missing:
>>
>> basic missing features:
>> - download of images
>> - download of css
>> - download of scripts
>> - link rewriting
>> - limits for max level / max documents
>>
>> advanced missing features:
>> - handling of frames / iframes
>> - tidy html -> xhtml
>> - extraction of body content
>> - resolving of links in css (background images etc.)
>>
>> Or am I misunderstanding something...?
>
>
> no ;-)
>
>>
>> IMHO some of these features are quite essential, because we want to
>> use the crawler in yanel to import the complete pages with images and
>> everything, not only text content.
>>
>> The question is now, does it make sense to implement the missing
>> features into that crawler, or should we look for an alternative?
>
>
> sure, if there is an alternative :-) Is there?
The Lenya crawler uses WebSPHINX for robot exclusion. WebSPHINX is
actually a complete crawler framework, and I think we could use it
instead of the Lenya crawler. It supports the basic features that I
mentioned above.
I wrote a class DumpingCrawler which is based on the WebSPHINX crawler.
Basically, it should be able to create a complete dump of a website,
including images, CSS, etc. It also rewrites links in the HTML code.
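To illustrate the link rewriting idea, here is a hypothetical sketch (not
the actual DumpingCrawler code): an absolute URL inside the crawl scope is
mapped to a path below the dump directory, while out-of-scope links are
left untouched. The class and method names are made up for illustration.

```java
public class LinkRewrite {

    // Rewrite an absolute URL to a local path below dumpDir,
    // but only if the URL is inside the crawl scope.
    static String rewrite(String url, String scopeURL, String dumpDir) {
        if (!url.startsWith(scopeURL)) {
            return url; // out of scope: leave the link as-is
        }
        String path = url.substring(scopeURL.length());
        if (path.isEmpty() || path.equals("/")) {
            path = "/index.html"; // site root maps to an index file
        }
        return dumpDir + path;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://wyona.org/css/style.css",
                                   "http://wyona.org", "/tmp/dump"));
    }
}
```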
The source code is at:
https://svn.wyona.com/repos/public/crawler
I also added the WebSPHINX source code to our svn because I had to patch
a few things. The license is Apache-like, so it should be ok.
The usage is shown in the following example:
--------------------------------------------------
import websphinx.EventLog;

String crawlStartURL = "http://wyona.org";  // where the crawl begins
String crawlScopeURL = "http://wyona.org";  // only URLs with this prefix are downloaded
String dumpDir = "/tmp/dump";               // local directory for the dump

DumpingCrawler crawler = new DumpingCrawler(crawlStartURL,
    crawlScopeURL, dumpDir);

// optional: log crawl and link events to stdout
EventLog eventLog = new EventLog(System.out);
crawler.addCrawlListener(eventLog);
crawler.addLinkListener(eventLog);

crawler.run();
crawler.close();
--------------------------------------------------
Remarks:
- the EventLog is optional (it produces some log output)
- the crawlScopeURL limits the scope of the retrieved pages, i.e. only
URLs starting with the scope URL are downloaded.
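The scope check described above amounts to a string prefix test. A minimal
sketch of that idea (the real implementation may normalize URLs first;
the class and method names here are made up for illustration):

```java
public class ScopeCheck {

    // A URL is in scope iff it starts with the crawl scope URL.
    static boolean inScope(String url, String crawlScopeURL) {
        return url.startsWith(crawlScopeURL);
    }

    public static void main(String[] args) {
        System.out.println(inScope("http://wyona.org/about.html",
                                   "http://wyona.org")); // in scope
        System.out.println(inScope("http://example.com/",
                                   "http://wyona.org")); // out of scope
    }
}
```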
For more information, see
http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
Josias
>
> Thanks
>
> Michi
>
>>
>> Josias
>>
>> _______________________________________________
>> Yanel-development mailing list
>> Yanel-development at wyona.com
>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>
>
>