[Yanel-dev] API for accessing large data sets

Tue Mar 29 22:35:28 CEST 2011

Hi Balz

Thanks very much for your feedback. Please find some comments inline below.

On 3/29/11 11:06 AM, Balz Schreier wrote:
> Hi Michael,
>
> my observation with large data sets (zwischengas.com) is the following:
> - usually you only want to retrieve a subset of all matching documents
> - therefore it is ok to include the "max documents" parameter in the 
> API too
> - additionally you could also provide a method search(from, max), 
> which internally uses the Lucene method search(from+max) and then just 
> skips the documents before "from".
>
> I don't know where you want to provide this method

For example, for retrieving the revisions of a node. We have some real-world 
cases with more than 30K revisions per node.
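
To make the search(from, max) idea concrete, here is a rough sketch of 
how it could look for revisions. All class and method names are made up 
for illustration and are not part of Yarep or Lucene; the in-memory list 
only stands in for whatever index actually delivers the "top n" hits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result page: the requested slice plus the total number of matches. */
final class RevisionPage {
    final List<String> revisionNames;
    final int totalHits;

    RevisionPage(List<String> revisionNames, int totalHits) {
        this.revisionNames = revisionNames;
        this.totalHits = totalHits;
    }
}

/** Hypothetical helper, only meant to illustrate the paging logic. */
final class RevisionPaging {

    /** Stand-in for a "top n" query: returns at most n names, in index order. */
    static List<String> topN(List<String> allMatches, int n) {
        return new ArrayList<>(allMatches.subList(0, Math.min(n, allMatches.size())));
    }

    /** search(from, max): fetch the top from+max hits, then skip everything before 'from'. */
    static RevisionPage search(List<String> allMatches, int from, int max) {
        if (from < 0 || max < 0) {
            throw new IllegalArgumentException("from and max must not be negative");
        }
        // Use long arithmetic so that from + max cannot overflow.
        int n = (int) Math.min((long) from + max, Integer.MAX_VALUE);
        List<String> top = topN(allMatches, n);
        List<String> page = from >= top.size()
                ? new ArrayList<>()
                : new ArrayList<>(top.subList(from, top.size()));
        return new RevisionPage(page, allMatches.size());
    }
}
```

The caller gets only the requested slice, plus the total number of 
matches so that it can compute how many pages there are. Note that the 
cost still grows with "from", because the skipped hits are collected 
first.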

> but be careful with creating YarepNodes for the results, I would deal 
> with just the Yarep Paths as long as you can, otherwise performance 
> goes down dramatically.

I think it depends on the implementation. For example, some 
implementations read the properties during node init, which I consider 
bad and think we should change.
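
As a minimal sketch of the alternative, the node could hold only its 
path and read the properties on first access. LazyNode and the loader 
function below are hypothetical names, not the existing Yarep classes; 
the loader stands in for whatever the repository implementation does to 
read the properties.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical node which defers reading its properties until they are needed. */
final class LazyNode {
    private final String path;
    private final Function<String, Map<String, String>> propertyLoader;
    private Map<String, String> properties; // stays null until first requested

    LazyNode(String path, Function<String, Map<String, String>> propertyLoader) {
        this.path = path;
        this.propertyLoader = propertyLoader;
    }

    /** Cheap: does not touch the properties at all. */
    String getPath() {
        return path;
    }

    /** Loads the properties on first call only, then serves them from memory. */
    synchronized String getProperty(String name) {
        if (properties == null) {
            properties = propertyLoader.apply(path);
        }
        return properties.get(name);
    }
}
```

With something like this, creating a node for each result costs about as 
much as creating a path, and the properties are read only for the nodes 
whose properties are actually requested.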

Thanks

Michael
>
> Cheers
> Balz
>
> On Tue, Mar 29, 2011 at 10:50 AM, Michael Wechner 
> <michael.wechner at wyona.com> wrote:
>
>     Hi
>
>     I am currently thinking about introducing a new VersionableV3
>     interface to access large sets of revisions
>     (e.g. 50K) and make it scale better. Also it would be nice to
>     search revisions for particular tags.
>     Hence I was looking at the search API of lucene, because it has
>     similar scalability issues:
>
>     http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/Searcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int%29
>
>     public TopDocs search(Query query,
>                           Filter filter,
>                           int n)
>                    throws IOException
>
>
>     http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/TopDocs.html
>
>
>     ScoreDoc[] scoreDocs
>               The top hits for the query.
>
>     int totalHits
>               The total number of hits for the query.
>
>     but also see for example
>
>     http://docs.codehaus.org/display/GEOTOOLS/Random+Data+Access
>
>     I am currently playing with the various APIs, but any suggestions
>     are very welcome.
>
>     Cheers
>
>     Michael
>
>     --
>     Yanel-development mailing list Yanel-development at wyona.com
>     http://lists.wyona.org/cgi-bin/mailman/listinfo/yanel-development
>
>
