[Yanel-dev] crawler

Josias Thöny josias.thoeny at wyona.com
Fri Mar 2 15:29:22 CET 2007


Michael Wechner wrote:
> Josias Thöny wrote:
> 
>> Michael Wechner wrote:
>>
>>> Josias Thöny wrote:
>>>
>>>> Hi,
>>>>
>>>> I've had a look at the crawler of lenya 1.2, and it seems that a few 
>>>> features are missing:
>>>>
>>>> basic missing features:
>>>> - download of images
>>>> - download of css
>>>> - download of scripts
>>>> - link rewriting
>>>> - limits for max level / max documents
>>>>
>>>> advanced missing features:
>>>> - handling of frames / iframes
>>>> - tidy html -> xhtml
>>>> - extraction of body content
>>>> - resolving of links in css (background images etc.)
>>>>
>>>> Or am I misunderstanding something...?
>>>
>>>
>>>
>>> no ;-)
>>>
>>>>
>>>> IMHO some of these features are quite essential, because we want to 
>>>> use the crawler in yanel to import the complete pages with images 
>>>> and everything, not only text content.
>>>>
>>>> The question is now, does it make sense to implement the missing 
>>>> features into that crawler, or should we look for an alternative?
>>>
>>>
>>>
>>> sure, if there is an alternative :-) Is there?
>>
>>
>> The lenya crawler uses websphinx for the robot exclusion, which is 
>> actually a complete crawler framework, and I think we could use it 
>> instead of the lenya crawler. It supports the basic features that I 
>> mentioned above.
>> I wrote a class DumpingCrawler which is based on the websphinx 
>> crawler. Basically it should be able to create a complete dump of a 
>> website including images, css, etc. It also rewrites links in the html 
>> code.
>>
>> The source code is at:
>> https://svn.wyona.com/repos/public/crawler
>>
>> I also added the websphinx source code to our svn because I had to 
>> patch a few things.
> 
> 
> I think it's important that we also add the patches separately in order 
> to know what has been patched.

Well, I checked in an unpatched version first and then committed the 
patches, so it's all in the svn history. Websphinx is not an active 
project anymore, so I guess there will be no updates to the original source.

> 
>> The license is apache-like, so it should be ok.
>>
>> The usage is shown in the following example:
>>
>> --------------------------------------------------
>> String crawlStartURL = "http://wyona.org";
>> String crawlScopeURL = "http://wyona.org";
>> String dumpDir = "/tmp/dump";
>>
>> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL, 
>> crawlScopeURL, dumpDir);
>>
>> EventLog eventLog = new EventLog(System.out);
>> crawler.addCrawlListener(eventLog);
>> crawler.addLinkListener(eventLog);
>>
>> crawler.run();
>> crawler.close();
>> --------------------------------------------------
>>
>> Remarks:
>> - the EventLog is optional (it creates some log output)
> 
> 
> what is the EventLog good for?

The purpose of the EventLog is to log events ;) and to give the user 
some kind of feedback, like which URLs have been downloaded etc.
You pass it an OutputStream and it will write log entries into that stream.
The output looks like this:

[java] Fri Mar 02 09:25:48 EST 2007: *** started org.apache.lenya.search.crawler.DumpingCrawler
[java] Fri Mar 02 09:25:48 EST 2007: retrieving [http://localhost:8888/default/live/index.html]
[java] Fri Mar 02 09:25:53 EST 2007: downloaded [http://localhost:8888/default/live/index.html]
[java] Fri Mar 02 09:25:53 EST 2007: retrieving [http://localhost:8888/default/live/css/page.css]
....
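
To illustrate the idea of "pass it an OutputStream and it writes log 
entries", here is a minimal sketch. Note that SimpleEventLog and its 
methods are hypothetical names made up for this example; this is not the 
websphinx EventLog source, it only imitates the line format shown above.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.Date;

/** Hypothetical stand-in for illustration; not the websphinx EventLog. */
final class SimpleEventLog {
    private final PrintStream out;

    SimpleEventLog(OutputStream stream) {
        // autoflush so entries show up while the crawl is still running
        this.out = new PrintStream(stream, true);
    }

    /** Formats one entry as "<timestamp>: <event> [<url>]". */
    static String format(Date when, String event, String url) {
        return when + ": " + event + " [" + url + "]";
    }

    /** Writes one entry with the current time to the wrapped stream. */
    void log(String event, String url) {
        out.println(format(new Date(), event, url));
    }
}
```

With something like this, new SimpleEventLog(System.out).log("retrieving", 
url) would print one line per event to the console, and any other 
OutputStream (e.g. a FileOutputStream) could be used to write to a file 
instead.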

> 
>> - the crawlScopeURL limits the scope of the retrieved pages, i.e. only 
>> urls starting with the scope url are being downloaded.
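
The scope rule described here is essentially a prefix test. A minimal 
sketch of such a check, assuming a plain string comparison after 
normalizing the URL (ScopeFilter is a hypothetical helper for this 
example, not a class of websphinx or the DumpingCrawler):

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of websphinx. */
final class ScopeFilter {
    private final String scopePrefix;

    ScopeFilter(String crawlScopeURL) {
        this.scopePrefix = normalize(crawlScopeURL);
    }

    /** Returns true if the given URL starts with the scope URL. */
    boolean inScope(String url) {
        return normalize(url).startsWith(scopePrefix);
    }

    private static String normalize(String url) {
        // URI.normalize() removes "." and ".." path segments,
        // so "/a/../b" is compared as "/b"
        return URI.create(url).normalize().toString();
    }
}
```

One thing to keep in mind with a pure prefix test: a scope of 
"http://wyona.org" would also match a host like "http://wyona.org.example.com", 
so a scope URL with a trailing slash is the safer choice.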
>>
>> For more information, see
>> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
> 
> 
> sounds very good :-) Have you already uploaded the library to our Maven 
> repo?

No, not yet.

Josias


> 
> Thanks
> 
> Michi
> 
>>
>> Josias
>>
>>
>>>
>>> Thanks
>>>
>>> Michi
>>>
>>>>
>>>> Josias
>>>>
>>>> _______________________________________________
>>>> Yanel-development mailing list
>>>> Yanel-development at wyona.com
>>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>>
>>>
>>>
>>
>>
> 
> 



