[Yanel-dev] crawler
Josias Thöny
josias.thoeny at wyona.com
Fri Mar 2 16:02:45 CET 2007
Michael Wechner wrote:
> Josias Thöny wrote:
>
>> Michael Wechner wrote:
>>
>>> Josias Thöny wrote:
>>>
>>>> Michael Wechner wrote:
>>>>
>>>>> Josias Thöny wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've had a look at the crawler of lenya 1.2, and it seems that a
>>>>>> few features are missing:
>>>>>>
>>>>>> basic missing features:
>>>>>> - download of images
>>>>>> - download of css
>>>>>> - download of scripts
>>>>>> - link rewriting
>>>>>> - limits for max level / max documents
>>>>>>
>>>>>> advanced missing features:
>>>>>> - handling of frames / iframes
>>>>>> - tidy html -> xhtml
>>>>>> - extraction of body content
>>>>>> - resolving of links in css (background images etc.)
>>>>>>
>>>>>> Or am I misunderstanding something...?
>>>>>
>>>>> no ;-)
>>>>>
>>>>>>
>>>>>> IMHO some of these features are quite essential, because we want
>>>>>> to use the crawler in yanel to import the complete pages with
>>>>>> images and everything, not only text content.
>>>>>>
>>>>>> The question is now, does it make sense to implement the missing
>>>>>> features into that crawler, or should we look for an alternative?
>>>>>
>>>>> sure, if there is an alternative :-) Is there?
>>>>
>>>> The lenya crawler uses websphinx for the robot exclusion, which is
>>>> actually a complete crawler framework, and I think we could use it
>>>> instead of the lenya crawler. It supports the basic features that I
>>>> mentioned above.
>>>> I wrote a class DumpingCrawler which is based on the websphinx
>>>> crawler. Basically it should be able to create a complete dump of a
>>>> website including images, css, etc. It also rewrites links in the
>>>> html code.
>>>>
>>>> The source code is at:
>>>> https://svn.wyona.com/repos/public/crawler
>>>>
>>>> I also added the websphinx source code to our svn because I had to
>>>> patch a few things.
>>>
>>> I think it's important that we also add the patches separately in
>>> order to know what has been patched.
>>
>>
>> Well, I checked in the unpatched version first and then committed the
>> patches, so everything is in the svn history. Websphinx is no longer an
>> active project, so I don't expect any updates to the original
>> source.
>
>
> well, is there a simple way to say give me all the patches after the
> first checkin resp. the checkin of the original source?
First use svn log to find the revision of the original checkin:

svn log https://svn.wyona.com/repos/public/crawler/src/java/websphinx

Then get the diff containing all patches since then:

svn diff -r 23040:HEAD \
    https://svn.wyona.com/repos/public/crawler/src/java/websphinx
>
>>
>>>
>>>> The license is apache-like, so it should be ok.
>>>>
>>>> The usage is shown in the following example:
>>>>
>>>> --------------------------------------------------
>>>> String crawlStartURL = "http://wyona.org";
>>>> String crawlScopeURL = "http://wyona.org";
>>>> String dumpDir = "/tmp/dump";
>>>>
>>>> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL,
>>>> crawlScopeURL, dumpDir);
>>>>
>>>> EventLog eventLog = new EventLog(System.out);
>>>> crawler.addCrawlListener(eventLog);
>>>> crawler.addLinkListener(eventLog);
>>>>
>>>> crawler.run();
>>>> crawler.close();
>>>> --------------------------------------------------
>>>>
>>>> Remarks:
>>>> - the EventLog is optional (it creates some log output)
>>>
>>> what is the EventLog good for
>>
>>
>> The purpose of the EventLog is to log events ;) and to give the user
>> some kind of feedback, such as which URLs have been downloaded.
>> You pass it an OutputStream and it will write log entries into that
>> stream.
>> The output looks like this:
>>
>> [java] Fri Mar 02 09:25:48 EST 2007: *** started
>> org.apache.lenya.search.crawler.DumpingCrawler
>> [java] Fri Mar 02 09:25:48 EST 2007: retrieving
>> [http://localhost:8888/default/live/index.html]
>> [java] Fri Mar 02 09:25:53 EST 2007: downloaded
>> [http://localhost:8888/default/live/index.html]
>> [java] Fri Mar 02 09:25:53 EST 2007: retrieving
>> [http://localhost:8888/default/live/css/page.css]
>> ....
>
>
> ok, can we log the number of dumped pages such that we can present a
> status report during the crawl, e.g. 45 pages of 1000 pages dumped. This
> status should be reloaded/refreshed automatically, e.g. every 15 secs
I guess it could be used for that. We may have to extend the default
EventLog, but it's quite a simple class.
However, it won't be possible to know the total number of pages in
advance, because the crawler only discovers new links as it crawls.
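For example, counting could look roughly like this. This is an untested
sketch against the websphinx API as I read it; the EventLog(OutputStream)
constructor, the crawled() callback, getID() and LinkEvent.DOWNLOADED are
assumptions from the websphinx source and may need adjusting:

```java
import java.io.OutputStream;

import websphinx.EventLog;
import websphinx.LinkEvent;

// Sketch: extend websphinx's EventLog so that it counts downloaded
// pages, which would allow printing a status line like
// "45 pages dumped" while the crawl is running.
public class CountingEventLog extends EventLog {

    private int downloaded = 0;

    public CountingEventLog(OutputStream out) {
        super(out);
    }

    // crawled() is invoked for every link event; count only the
    // DOWNLOADED ones and delegate to the default logging afterwards.
    public synchronized void crawled(LinkEvent event) {
        if (event.getID() == LinkEvent.DOWNLOADED) {
            downloaded++;
        }
        super.crawled(event);
    }

    public synchronized int getDownloadedCount() {
        return downloaded;
    }
}
```

A separate thread could then read getDownloadedCount() every 15 seconds
and write the status report. The total is unknown up front, so the report
would have to show "45 pages dumped so far" rather than "45 of 1000".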
>
>>
>>>
>>>> - the crawlScopeURL limits the scope of the retrieved pages, i.e.
>>>> only URLs starting with the scope URL are downloaded.
>>>>
>>>> For more information, see
>>>> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
>>>
>>> sounds very good :-) Have you already uploaded the library to our
>>> Maven repo?
>>
>>
>> No, not yet.
>
>
> please do ;-)
ok, will do that.
Josias
>
> Michi
>
>>
>> Josias
>>
>>
>>>
>>> Thanks
>>>
>>> Michi
>>>
>>>>
>>>> Josias
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michi
>>>>>
>>>>>>
>>>>>> Josias
>>>>>>
>>>>>> _______________________________________________
>>>>>> Yanel-development mailing list
>>>>>> Yanel-development at wyona.com
>>>>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>