[Yanel-dev] crawler

Josias Thöny josias.thoeny at wyona.com
Fri Mar 2 16:02:45 CET 2007


Michael Wechner wrote:
> Josias Thöny wrote:
> 
>> Michael Wechner wrote:
>>
>>> Josias Thöny wrote:
>>>
>>>> Michael Wechner wrote:
>>>>
>>>>> Josias Thöny wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've had a look at the crawler of lenya 1.2, and it seems that a 
>>>>>> few features are missing:
>>>>>>
>>>>>> basic missing features:
>>>>>> - download of images
>>>>>> - download of css
>>>>>> - download of scripts
>>>>>> - link rewriting
>>>>>> - limits for max level / max documents
>>>>>>
>>>>>> advanced missing features:
>>>>>> - handling of frames / iframes
>>>>>> - tidy html -> xhtml
>>>>>> - extraction of body content
>>>>>> - resolving of links in css (background images etc.)
>>>>>>
>>>>>> Or am I misunderstanding something...?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> no ;-)
>>>>>
>>>>>>
>>>>>> IMHO some of these features are quite essential, because we want 
>>>>>> to use the crawler in yanel to import the complete pages with 
>>>>>> images and everything, not only text content.
>>>>>>
>>>>>> The question is now, does it make sense to implement the missing 
>>>>>> features into that crawler, or should we look for an alternative?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> sure, if there is an alternative :-) Is there?
>>>>
>>>>
>>>>
>>>> The lenya crawler uses websphinx for robot exclusion; websphinx is
>>>> actually a complete crawler framework, and I think we could use it
>>>> instead of the lenya crawler. It supports the basic features that I
>>>> mentioned above.
>>>> I wrote a class DumpingCrawler which is based on the websphinx 
>>>> crawler. Basically it should be able to create a complete dump of a 
>>>> website including images, css, etc. It also rewrites links in the 
>>>> html code.
>>>>
>>>> The source code is at:
>>>> https://svn.wyona.com/repos/public/crawler
>>>>
>>>> I also added the websphinx source code to our svn because I had to 
>>>> patch a few things.
>>>
>>>
>>>
>>> I think it's important that we also add the patches separately in 
>>> order to know what has been patched.
>>
>>
>> Well, I checked in an unpatched version first and then committed the 
>> patches, so it's all in the svn history. Websphinx is not an active 
>> project anymore, so I guess there will be no updates to the original
>> source.
> 
> 
> well, is there a simple way to get all the patches after the
> first checkin, i.e. the checkin of the original source?

First, use svn log to find the revision of the original (unpatched) checkin:
svn log https://svn.wyona.com/repos/public/crawler/src/java/websphinx

Then diff from that revision to HEAD to get all the patches:
svn diff -r 23040:HEAD https://svn.wyona.com/repos/public/crawler/src/java/websphinx

> 
>>
>>>
>>>> The license is apache-like, so it should be ok.
>>>>
>>>> The usage is shown in the following example:
>>>>
>>>> --------------------------------------------------
>>>> String crawlStartURL = "http://wyona.org";
>>>> String crawlScopeURL = "http://wyona.org";
>>>> String dumpDir = "/tmp/dump";
>>>>
>>>> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL, 
>>>> crawlScopeURL, dumpDir);
>>>>
>>>> EventLog eventLog = new EventLog(System.out);
>>>> crawler.addCrawlListener(eventLog);
>>>> crawler.addLinkListener(eventLog);
>>>>
>>>> crawler.run();
>>>> crawler.close();
>>>> --------------------------------------------------
>>>>
>>>> Remarks:
>>>> - the EventLog is optional (it creates some log output)
>>>
>>>
>>>
>>> what is the EventLog good for
>>
>>
>> The purpose of the EventLog is to log events ;) and to give the user 
>> some kind of feedback, like which URLs have been downloaded etc.
>> You pass it an OutputStream and it will write log entries into that 
>> stream.
>> The output looks like this:
>>
>> [java] Fri Mar 02 09:25:48 EST 2007: *** started 
>> org.apache.lenya.search.crawler.DumpingCrawler
>> [java] Fri Mar 02 09:25:48 EST 2007: retrieving 
>> [http://localhost:8888/default/live/index.html]
>> [java] Fri Mar 02 09:25:53 EST 2007: downloaded 
>> [http://localhost:8888/default/live/index.html]
>> [java] Fri Mar 02 09:25:53 EST 2007: retrieving 
>> [http://localhost:8888/default/live/css/page.css]
>> ....
> 
> 
> ok, can we log the number of dumped pages so that we can present a
> status report during the crawl, e.g. "45 of 1000 pages dumped"? This
> status should be reloaded/refreshed automatically, e.g. every 15 seconds.

I guess it could be used to do that. We may have to extend the default 
EventLog, but it's quite a simple class.
However, it won't be possible to know the total number of pages in 
advance, because the crawler finds new links as it crawls new pages.
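
A rough sketch of such a counting listener (in addition to, or instead of,
the EventLog) could look like the following. This is just an illustration
and not tested; the class name ProgressListener is made up, and it assumes
websphinx's LinkListener interface and the LinkEvent.DOWNLOADED constant,
which should be double-checked against our patched version:

--------------------------------------------------
import websphinx.LinkEvent;
import websphinx.LinkListener;

// Counts downloaded pages and prints a status line every 15 seconds.
// Since the total number of pages is unknown in advance, it can only
// report "N pages dumped so far".
public class ProgressListener implements LinkListener {

    private int downloaded = 0;
    private long lastReport = System.currentTimeMillis();
    private static final long REPORT_INTERVAL = 15 * 1000;

    public synchronized void crawled(LinkEvent event) {
        if (event.getID() == LinkEvent.DOWNLOADED) {
            downloaded++;
        }
        long now = System.currentTimeMillis();
        if (now - lastReport >= REPORT_INTERVAL) {
            System.out.println(downloaded + " pages dumped so far");
            lastReport = now;
        }
    }

    public synchronized int getDownloadedCount() {
        return downloaded;
    }
}
--------------------------------------------------

It would be registered just like the EventLog:
crawler.addLinkListener(new ProgressListener());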

> 
>>
>>>
>>>> - the crawlScopeURL limits the scope of the retrieved pages, i.e. 
>>>> only urls starting with the scope url are being downloaded.
>>>>
>>>> For more information, see
>>>> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
>>>
>>>
>>>
>>> sounds very good :-) Have you already uploaded the library to our 
>>> Maven repo?
>>
>>
>> No, not yet.
> 
> 
> please do ;-)

ok, will do that.
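
Something like the following should work for uploading the websphinx jar
(the groupId/artifactId/version and the repository URL below are just
placeholders, not the actual values for our repo):

mvn deploy:deploy-file -Dfile=websphinx.jar -DgroupId=websphinx \
  -DartifactId=websphinx -Dversion=0.5-patched -Dpackaging=jar \
  -Durl=scp://host/path/to/maven/repo -DrepositoryId=wyona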

Josias

> 
> Michi
> 
>>
>> Josias
>>
>>
>>>
>>> Thanks
>>>
>>> Michi
>>>
>>>>
>>>> Josias
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michi
>>>>>
>>>>>>
>>>>>> Josias
>>>>>>
>>>>>> _______________________________________________
>>>>>> Yanel-development mailing list
>>>>>> Yanel-development at wyona.com
>>>>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Yanel-development mailing list
>>>> Yanel-development at wyona.com
>>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> Yanel-development mailing list
>> Yanel-development at wyona.com
>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>
> 
> 



