[Yanel-dev] New XMLDB repository

Mon Feb 12 17:42:13 CET 2007

Hi Josias

On 12.2.2007 16:55, Josias Thöny wrote:

> Andreas Wuest wrote:
>> Hi Josias
>>
>> On 12.2.2007 15:55, Josias Thöny wrote:
>>
>>> Andreas Wuest wrote:
>>>> Hi
>>>>
>>>> I've finished and checked in a basic implementation of the XMLDB 
>>>> repository, based on the XML:DB API.
>>>
>>> Cool :)
>>>
>>>>
>>>> Unfortunately, Yarep is documented really badly, so I couldn't find 
>>>> out what the exact contracts for the various methods are. For 
>>>> example, should getSize() or delete() throw a repository exception 
>>>> if the resource does not exist, or return 0 or false, etc.
>>>>
>>>> I've extensively documented the XMLDBStorage class, so you can see 
>>>> what it does at first glance.
>>>>
>>>> The Reader/Writer and InputStream/OutputStream are implemented using 
>>>> aggregation. Don't know if it would be more desirable to e.g. 
>>>> subclass StringReader and override the close() method instead.
>>>>
>>>> Also, there are some other API related problems: Yanel always seems 
>>>> to call getInputStream to directly read from the repo. Now, this is 
>>>> all fine and dandy on a file based repo, but the XML database stores 
>>>> XML documents as character data, and returns them as strings. In 
>>>> other words, in order for the InputStream to work, we have to 
>>>> convert the string to bytes, which, of course, involves character 
>>>> encoding. I just use UTF-8 to encode and decode, but if you really 
>>>> want to read an XML resource, the getReader method should be used.
>>>>
>>>> The same goes for writing, but with some additional complication. 
>>>> You should NEVER use getOutputStream to write an XML document. 
>>>> getOutputStream creates a binary resource in the database. Use 
>>>> getWriter instead to write character data, which creates an XML 
>>>> resource.
>>>
>>> Well, I didn't realize that some repository implementations might 
>>> handle binary data differently than text data. But I guess it makes 
>>> sense.
>>> So probably we should change yanel to use the reader/writer methods 
>>> for text data, and add reader/writer methods to the node-based api, too.
>>> Would that help?
>>
>> That would help for sure. Although I don't know how Yanel can find out 
>> which method to call for reading, because it does not know in the 
>> first place if a requested resource is character-based or binary.
> 
> Yeah, I had some doubts about that also.
> Maybe we could simply say that a FileResource is always treated as 
> binary, and an XMLResource is always text. Would that be too simple?

This would only cover the cases where the control flow actually goes 
through a resource. If Yanel receives a GET for which it does not have 
rti information, it never goes through a resource, and accesses the repo 
directly for reading.

>> One possible way would be for the repository implementation to guide 
>> Yanel, because the repository should generally know what type of 
>> resource is being requested (at least, XMLDB knows, we may see other 
>> back-ends in the future which do not even know this one though). If 
>> Yanel uses getInputStream(), and the repo decides that this is not a 
>> binary resource, it could throw an exception, and Yanel would then try 
>> getReader(), or vice versa. We could also introduce a flag on those 
>> two methods, e.g. forceRead, which would prevent the repo impl from 
>> throwing if the resource to be read is of the wrong type, but read 
>> anyway.
>>
> 
> If we say that the repo "knows" about the type of a resource, it could 
> provide a method isBinary() or something like that, so yanel could know 
> which method to call (getReader/getInputStream). I normally prefer to 
> "ask first" instead of handling an error.

Yes, that would be even nicer. Of course, a file based repository only 
knows about bytes.
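
A minimal sketch of the "ask first" idea, just to make it concrete. The 
TypedRepo interface and the AskFirst class are made up for illustration 
and are not part of Yarep; isBinary() is the method proposed above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

/** Hypothetical repository view; isBinary() is the proposed addition. */
interface TypedRepo {
    boolean isBinary(String path);
    Reader getReader(String path);
    InputStream getInputStream(String path);
}

class AskFirst {
    /** Asks the repo for the resource type, then reads via the matching method. */
    static String describe(TypedRepo repo, String path) throws IOException {
        if (repo.isBinary(path)) {
            // Binary resource: read raw bytes, no character encoding involved.
            try (InputStream in = repo.getInputStream(path)) {
                int count = 0;
                while (in.read() != -1) {
                    count++;
                }
                return "binary, " + count + " bytes";
            }
        }
        // Character resource: read characters, no byte conversion involved.
        try (Reader reader = repo.getReader(path)) {
            StringBuilder text = new StringBuilder();
            for (int c = reader.read(); c != -1; c = reader.read()) {
                text.append((char) c);
            }
            return "text, " + text.length() + " chars";
        }
    }
}
```

A file based implementation would have to return a fixed answer from 
isBinary(), since it has nothing to base the decision on.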

This is where all those problems actually arise: XMLDBs differentiate 
between BinaryResource and XMLResource.

Reading an XMLResource as a stream results in a character->byte 
conversion, using an encoding. Reading a BinaryResource as a Reader on 
the other hand results in a byte->character conversion, which makes even 
less sense (after all, binary resources may and do contain byte values 
which can't be mapped to a character).
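
To illustrate the point about unmappable bytes, here is a small 
self-contained sketch. It assumes UTF-8 as the encoding (which is what I 
use in the repo); the class name and the sample data are invented for 
the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingRoundTrip {
    /** Decodes the bytes as UTF-8 and encodes the resulting string again. */
    static byte[] throughString(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Character data survives the byte -> character -> byte round trip.
        byte[] xml = "<p>Th\u00f6ny</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(xml, throughString(xml)));

        // The first bytes of a JPEG file are not valid UTF-8, so the decoder
        // substitutes replacement characters and the original bytes are lost.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(Arrays.equals(binary, throughString(binary)));
    }
}
```

The first comparison prints true and the second prints false, which is 
why handing out a Reader for a BinaryResource silently corrupts the data.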

File based repositories just read a stream of bytes, and don't care 
about any of this.

Writing is even worse, because the XMLDB repo has to decide if it has to 
create an XMLResource or a BinaryResource. That's why I let Writers 
create an XMLResource, and Streams create a BinaryResource.

> When someone calls a reading method which does not match the type, a 
> best-effort conversion could be applied.

Yes, this would work for the reading case, but not for writing, because, 
as I've outlined above, I have to decide what kind of resource I have to 
create in the database upon writing.

> I'm not entirely sure though how the repo would know the type 
> (text/binary). Should it assume that it's binary when it was written by 
> getOutputStream, and text otherwise?

Now, this depends on the capabilities of the repo. For example, a file 
based repo can't know what kind of data a specific file holds, as it 
only knows about files.

The XMLDB on the other hand knows if it is dealing with an XMLResource 
or a BinaryResource, because the resources are typed.

I've also briefly discussed these issues with Michi, and he is strongly 
opposed to introducing any differentiation between character and binary 
data. Unfortunately, this would cripple the XMLDB repository in such a 
way that we could only allow the creation of XMLResources; in other 
words, an XMLDB based Yanel *cannot* store binary resources like images 
etc.

-- 
Kind regards,
Andi