[Yanel-dev] crawler

Michael Wechner michael.wechner at wyona.com
Fri Mar 2 15:42:35 CET 2007


Josias Thöny wrote:

> Michael Wechner wrote:
>
>> Josias Thöny wrote:
>>
>>> Michael Wechner wrote:
>>>
>>>> Josias Thöny wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've had a look at the crawler of lenya 1.2, and it seems that a 
>>>>> few features are missing:
>>>>>
>>>>> basic missing features:
>>>>> - download of images
>>>>> - download of css
>>>>> - download of scripts
>>>>> - link rewriting
>>>>> - limits for max level / max documents
>>>>>
>>>>> advanced missing features:
>>>>> - handling of frames / iframes
>>>>> - tidy html -> xhtml
>>>>> - extraction of body content
>>>>> - resolving of links in css (background images etc.)
>>>>>
>>>>> Or am I misunderstanding something...?
>>>>
>>>>
>>>>
>>>>
>>>> no ;-)
>>>>
>>>>>
>>>>> IMHO some of these features are quite essential, because we want 
>>>>> to use the crawler in Yanel to import complete pages, including 
>>>>> images and everything, not only the text content.
>>>>>
>>>>> The question is now, does it make sense to implement the missing 
>>>>> features into that crawler, or should we look for an alternative?
>>>>
>>>>
>>>>
>>>>
>>>> sure, if there is an alternative :-) Is there?
>>>
>>>
>>>
>>> The lenya crawler uses websphinx for the robot exclusion. websphinx 
>>> is actually a complete crawler framework, and I think we could use 
>>> it instead of the lenya crawler. It supports the basic features 
>>> that I mentioned above.
>>> I wrote a class DumpingCrawler which is based on the websphinx 
>>> crawler. Basically it should be able to create a complete dump of a 
>>> website including images, CSS, etc. It also rewrites the links in 
>>> the HTML code.
>>>
>>> The source code is at:
>>> https://svn.wyona.com/repos/public/crawler
>>>
>>> I also added the websphinx source code to our svn because I had to 
>>> patch a few things.
>>
>>
>>
>> I think it's important that we also add the patches separately in 
>> order to know what has been patched.
>
>
> Well, I checked in an unpatched version first and then committed the 
> patches, so it's all in the svn history. Websphinx is not an active 
> project anymore, so I guess there will be no updates to the original
> source.


well, is there a simple way to get all the patches after the first 
check-in, i.e. everything committed after the check-in of the original 
source?
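I mean something like this (the revision number <N> of the original 
import is just a placeholder, of course):

--------------------------------------------------
# show the complete history, including the patch commits
svn log https://svn.wyona.com/repos/public/crawler

# diff everything committed after the original import (revision <N>)
svn diff -r <N>:HEAD https://svn.wyona.com/repos/public/crawler
--------------------------------------------------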

>
>>
>>> The license is Apache-like, so it should be ok.
>>>
>>> The usage is shown in the following example:
>>>
>>> --------------------------------------------------
>>> import org.apache.lenya.search.crawler.DumpingCrawler;
>>> import websphinx.EventLog;
>>>
>>> String crawlStartURL = "http://wyona.org";
>>> String crawlScopeURL = "http://wyona.org";
>>> String dumpDir = "/tmp/dump";
>>>
>>> // crawl everything within crawlScopeURL, starting at crawlStartURL,
>>> // and dump the downloaded pages to dumpDir
>>> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL, 
>>> crawlScopeURL, dumpDir);
>>>
>>> // optional: log crawl and link events to the given OutputStream
>>> EventLog eventLog = new EventLog(System.out);
>>> crawler.addCrawlListener(eventLog);
>>> crawler.addLinkListener(eventLog);
>>>
>>> crawler.run();
>>> crawler.close();
>>> --------------------------------------------------
>>>
>>> Remarks:
>>> - the EventLog is optional (it creates some log output)
>>
>>
>>
>> what is the EventLog good for?
>
>
> The purpose of the EventLog is to log events ;) and to give the user 
> some kind of feedback, like which URLs have been downloaded etc.
> You pass it an OutputStream and it will write log entries into that 
> stream.
> The output looks like this:
>
> [java] Fri Mar 02 09:25:48 EST 2007: *** started 
> org.apache.lenya.search.crawler.DumpingCrawler
> [java] Fri Mar 02 09:25:48 EST 2007: retrieving 
> [http://localhost:8888/default/live/index.html]
> [java] Fri Mar 02 09:25:53 EST 2007: downloaded 
> [http://localhost:8888/default/live/index.html]
> [java] Fri Mar 02 09:25:53 EST 2007: retrieving 
> [http://localhost:8888/default/live/css/page.css]
> ....


ok, can we log the number of dumped pages, such that we can present a 
status report during the crawl, e.g. "45 of 1000 pages dumped"? This 
status should be reloaded/refreshed automatically, e.g. every 15 secs.
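Just to sketch what I mean (untested; it assumes the websphinx 
LinkListener/LinkEvent API which the EventLog above uses, and the class 
name and the total estimate are hypothetical):

--------------------------------------------------
import websphinx.LinkEvent;
import websphinx.LinkListener;

// counts downloaded pages; a status page could poll getStatus()
// e.g. every 15 seconds and display the result
public class ProgressListener implements LinkListener {

    private int downloaded = 0;
    private final int estimatedTotal; // e.g. a configured max-documents limit

    public ProgressListener(int estimatedTotal) {
        this.estimatedTotal = estimatedTotal;
    }

    // called by the crawler for every link event
    public synchronized void crawled(LinkEvent event) {
        if (event.getID() == LinkEvent.DOWNLOADED) {
            downloaded++;
        }
    }

    public synchronized String getStatus() {
        return downloaded + " of " + estimatedTotal + " pages dumped";
    }
}
--------------------------------------------------

and then something like: crawler.addLinkListener(new ProgressListener(1000));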

>
>>
>>> - the crawlScopeURL limits the scope of the retrieved pages, i.e. 
>>> only URLs starting with the scope URL are downloaded.
>>>
>>> For more information, see
>>> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
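so if I understand the scope rule correctly, it's just a simple string 
prefix match, e.g. (my understanding of the description above, not the 
actual DumpingCrawler code):

--------------------------------------------------
String crawlScopeURL = "http://wyona.org";

// in scope, because the URL starts with the scope URL:
boolean a = "http://wyona.org/en/about.html".startsWith(crawlScopeURL); // true

// out of scope:
boolean b = "http://lenya.apache.org/".startsWith(crawlScopeURL); // false
--------------------------------------------------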
>>
>>
>>
>> sounds very good :-) Have you already uploaded the library to our 
>> Maven repo?
>
>
> No, not yet.


please do ;-)
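Once it's there, we can then reference it like this (the 
groupId/artifactId/version are just guesses):

--------------------------------------------------
<dependency>
  <groupId>websphinx</groupId>
  <artifactId>websphinx</artifactId>
  <version>0.5-wyona-patched</version>
</dependency>
--------------------------------------------------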

Michi

>
> Josias
>
>
>>
>> Thanks
>>
>> Michi
>>
>>>
>>> Josias
>>>
>>>
>>>>
>>>> Thanks
>>>>
>>>> Michi
>>>>
>>>>>
>>>>> Josias
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner at wyona.com                        michi at apache.org
+41 44 272 91 61



