[Yanel-dev] crawler
Michael Wechner
michael.wechner at wyona.com
Fri Mar 2 15:42:35 CET 2007
Josias Thöny wrote:
> Michael Wechner wrote:
>
>> Josias Thöny wrote:
>>
>>> Michael Wechner wrote:
>>>
>>>> Josias Thöny wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've had a look at the crawler of lenya 1.2, and it seems that a
>>>>> few features are missing:
>>>>>
>>>>> basic missing features:
>>>>> - download of images
>>>>> - download of css
>>>>> - download of scripts
>>>>> - link rewriting
>>>>> - limits for max level / max documents
>>>>>
>>>>> advanced missing features:
>>>>> - handling of frames / iframes
>>>>> - tidy html -> xhtml
>>>>> - extraction of body content
>>>>> - resolving of links in css (background images etc.)
>>>>>
>>>>> Or am I misunderstanding something...?
>>>>
>>>>
>>>>
>>>>
>>>> no ;-)
>>>>
>>>>>
>>>>> IMHO some of these features are quite essential, because we want
>>>>> to use the crawler in yanel to import the complete pages with
>>>>> images and everything, not only text content.
>>>>>
>>>>> The question is now, does it make sense to implement the missing
>>>>> features into that crawler, or should we look for an alternative?
>>>>
>>>>
>>>>
>>>>
>>>> sure, if there is an alternative :-) Is there?
>>>
>>>
>>>
>>> The lenya crawler uses websphinx for robot exclusion; websphinx is
>>> actually a complete crawler framework, and I think we could use it
>>> instead of the lenya crawler. It supports the basic features that I
>>> mentioned above.
>>> I wrote a class DumpingCrawler which is based on the websphinx
>>> crawler. Basically it should be able to create a complete dump of a
>>> website including images, css, etc. It also rewrites links in the
>>> html code.
>>>
>>> The source code is at:
>>> https://svn.wyona.com/repos/public/crawler
>>>
>>> I also added the websphinx source code to our svn because I had to
>>> patch a few things.
>>
>>
>>
>> I think it's important that we also add the patches separately in
>> order to know what has been patched.
>
>
> Well, I checked in an unpatched version first and then committed the
> patches, so it's all in the svn history. Websphinx is no longer an
> active project, so I guess there will be no updates to the original
> source.
well, is there a simple way to get all the patches made after the
first checkin, i.e. the checkin of the original source?
>
>>
>>> The license is apache-like, so it should be ok.
>>>
>>> The usage is shown in the following example:
>>>
>>> --------------------------------------------------
>>> String crawlStartURL = "http://wyona.org";
>>> String crawlScopeURL = "http://wyona.org";
>>> String dumpDir = "/tmp/dump";
>>>
>>> DumpingCrawler crawler = new DumpingCrawler(crawlStartURL,
>>> crawlScopeURL, dumpDir);
>>>
>>> EventLog eventLog = new EventLog(System.out);
>>> crawler.addCrawlListener(eventLog);
>>> crawler.addLinkListener(eventLog);
>>>
>>> crawler.run();
>>> crawler.close();
>>> --------------------------------------------------
>>>
>>> Remarks:
>>> - the EventLog is optional (it creates some log output)
>>
>>
>>
>> what is the EventLog good for?
>
>
> The purpose of the EventLog is to log events ;) and to give the user
> some kind of feedback, like which URLs have been downloaded etc.
> You pass it an OutputStream and it will write log entries into that
> stream.
> The output looks like this:
>
> [java] Fri Mar 02 09:25:48 EST 2007: *** started
> org.apache.lenya.search.crawler.DumpingCrawler
> [java] Fri Mar 02 09:25:48 EST 2007: retrieving
> [http://localhost:8888/default/live/index.html]
> [java] Fri Mar 02 09:25:53 EST 2007: downloaded
> [http://localhost:8888/default/live/index.html]
> [java] Fri Mar 02 09:25:53 EST 2007: retrieving
> [http://localhost:8888/default/live/css/page.css]
> ....
ok, can we log the number of dumped pages so that we can present a
status report during the crawl, e.g. "45 of 1000 pages dumped"? This
status should be reloaded/refreshed automatically, e.g. every 15 seconds.
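A hypothetical sketch of such a counter (the class name CrawlProgress and
its methods are made up, not part of websphinx; presumably it would be
incremented from a link listener callback):

```java
// Hypothetical helper, not part of websphinx: counts dumped pages and
// formats a status line like "45 of 1000 pages dumped".
public class CrawlProgress {
    private final int expectedTotal; // rough estimate of pages to dump
    private int dumped = 0;

    public CrawlProgress(int expectedTotal) {
        this.expectedTotal = expectedTotal;
    }

    // Call once per successfully dumped page,
    // e.g. from a link listener callback.
    public synchronized void pageDumped() {
        dumped++;
    }

    public synchronized String status() {
        return dumped + " of " + expectedTotal + " pages dumped";
    }

    public static void main(String[] args) {
        CrawlProgress progress = new CrawlProgress(1000);
        for (int i = 0; i < 45; i++) {
            progress.pageDumped();
        }
        System.out.println(progress.status()); // prints "45 of 1000 pages dumped"
    }
}
```

A status page could then poll status() every 15 seconds or so to show
the figures during the crawl.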
>
>>
>>> - the crawlScopeURL limits the scope of the retrieved pages, i.e.
>>> only URLs starting with the scope URL are downloaded.
>>>
>>> For more information, see
>>> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
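If I read that right, the scope rule is just a string-prefix check; a
tiny sketch to illustrate it (my own illustration, not the actual
websphinx code):

```java
// Illustration only: a URL is in scope iff it starts with the
// configured scope URL.
public class ScopeCheck {
    static boolean inScope(String url, String scopeURL) {
        return url.startsWith(scopeURL);
    }

    public static void main(String[] args) {
        System.out.println(inScope("http://wyona.org/about.html", "http://wyona.org")); // true
        System.out.println(inScope("http://example.com/", "http://wyona.org"));         // false
    }
}
```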
>>
>>
>>
>> sounds very good :-) Have you already uploaded the library to our
>> Maven repo?
>
>
> No, not yet.
please do ;-)
Michi
>
> Josias
>
>
>>
>> Thanks
>>
>> Michi
>>
>>>
>>> Josias
>>>
>>>
>>>>
>>>> Thanks
>>>>
>>>> Michi
>>>>
>>>>>
>>>>> Josias
>>>>>
>>>>> _______________________________________________
>>>>> Yanel-development mailing list
>>>>> Yanel-development at wyona.com
>>>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
michael.wechner at wyona.com michi at apache.org
+41 44 272 91 61