Discussion:
RFC: REST example for DAS 2.0
Andrew Dalke
2003-01-15 07:46:19 UTC
REST example for DAS 2.0

In my previous RFC I suggested ignoring SOAP+UDDI+WSDL and building
DAS 2.0 on top of straight HTTP+XML using a REST architecture.

To show you how that might work, here's one way to implement the
functionality from the DAS 1.5 spec. For now I ignore the question
of how to handle versioning when the sequence changes. (I think it's
best done by adding an extra path level containing the version
identifier.)

If you want me to say "URI" instead of "URL", you can make the
replacement in your head.

============================
<dsn>/
Returns a list of data sources

This replaces the 'dsns' method call. It returns an XML document of
doctype "http://www.biodas.org/dtd/dasdsn.dtd". Doing this also gets
rid of the annoying "cannot have a dsn named 'dsn'" problem.


<dsn>/stylesheet
Returns the stylesheet for the DSN


<dsn>/entry_point/
Returns a list of entry points

This returns an XML document (the doctype doesn't yet exist). It is
basically a list of URLs.

<dsn>/entry_point/<id>
This returns XML describing a segment, i.e., id, start, stop, and
orientation. The doctype doesn't yet exist.


<dsn>/feature/
Returns a list of all features. (You might not want to do this,
and the server could simply say "not implemented.")

<dsn>/feature/<id>
Returns the GFF for the feature named 'id'

Each feature in 1.5 already has a unique identifier. This makes the
feature a full-fledged citizen of the web by making it directly
accessible. (Under DAS 1.5 it is accessible as a side effect of a
'features' command, but I don't want to confuse a feature's name with
a search command, especially since many searches can return the same
feature, and because the results of a search should be a list, not a
single result.)


<dsn>/features?segment=RANGE;type=TYPE;category=....
Returns a list of features matching the given search criteria.

The input is identical to the existing 'features' command. The result
is a list of feature URLs. This is a POST interface.


<dsn>/sequence?segment=RANGE[;segment=RANGE]*
Returns the sequence in the given segment(s), as XML of
doctype "http://www.biodas.org/dtd/dassequence.dtd".

This is identical to the existing 'sequence' command and is a POST
interface.


<dsn>/type/
Returns a list of all types. (You might not want to do this,
and the server could simply say "not implemented.")

<dsn>/type/<id>
Returns an XML document of doctype "DASTYPE", which is like
the existing "http://www.biodas.org/dtd/dastypes.dtd" except
there's only one type.

<dsn>/types?segment=RANGE;type=TYPE
Return a list of URIs for types matching the search criteria.

The input is identical to the existing 'types' command. The result is
a list of URLs. This is a POST interface.

============================
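
To make the shape of this concrete, here is a minimal client sketch
in Python. Every concrete name in it (the host, the DSN path, the
query values, and the FEATURE_URL element) is invented for
illustration, and the search is shown as a GET even though the spec
above calls it a POST interface.

    import urllib.request
    from xml.dom.minidom import parseString

    # Hypothetical <dsn> URL; all names below are illustrative only.
    DSN = "http://das.example.org/das/human_build30"

    def fetch(url):
        # A resource is just an HTTP GET away.
        with urllib.request.urlopen(url) as f:
            return f.read()

    # Search for features; the reply is a list of feature URLs.
    query = DSN + "/features?segment=chr1:1000,2000;type=exon"
    doc = parseString(fetch(query))
    urls = [n.firstChild.data
            for n in doc.getElementsByTagName("FEATURE_URL")]

    # Each feature is a first-class resource, fetched directly by name.
    for url in urls:
        print(fetch(url).decode())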

Unlike the existing spec, and unlike the proposed RFC 13, features
and types are objects in their own right. This has several effects.

Linkability

Since each feature has a URL, features are directly addressable.
This helps address RFC 3 "InterService links in DAS/2" (see
http://www.biodas.org/RFCs/rfc003.txt ) because each object is
accessible through a URL and can be addressed by anything else which
understands URLs.

One such relevant technology is the Resource Description Framework
(RDF) (see http://www.w3.org/TR/REC-rdf-syntax/ ). This lets third
parties add their own associations between URLs. For example, I could
publish my own RDF database which comments on the quality of features
in someone else's database.

I do not know enough about RDF to be sure, but I conjecture that I
could suggest an alternative stylesheet (RFC 8, "DAS Visualization
Server", http://www.biodas.org/RFCs/rfc008.txt) through an
appropriate link to the <dsn>/stylesheet/ .

I further conjecture that RDF appropriately handles group
normalization from RFC 10 (http://www.biodas.org/RFCs/rfc010.txt).

Ontologies

Web ontologies, like DAML+OIL, are built on top of RDF. Because
types are also directly accessible, this lets us (or others!) build
ontologies on top of the feature types. This addresses RFC 4
"Annotation ontologies for DAS/2" at
http://www.biodas.org/RFCs/rfc004.txt .


Independent requests

Perhaps the biggest disadvantage of this scheme is that any search
(like 'features') requires an additional GET to retrieve information
about every feature that matched. If there are 1,000 matches, then
there are 1,000 additional requests. Compare that to the current
scheme, where all the data about the matches is returned in one shot.

I do not believe this should be a problem. The HTTP/1.1 spec
supports "keep-alive", so the connection to the server does not need
to be re-established. A client can feed requests to the server while
receiving responses to earlier queries, so there shouldn't be a
pause in bandwidth usage between requests. In addition, the overhead
of making a request, plus the extra headers on each independent
response, shouldn't require much extra data to be sent.
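
To sketch the mechanics (Python's http.client; the host and paths
are invented), all of the follow-up GETs can ride over one
persistent connection. Note that this simple client drains each
response before sending the next request; truly overlapping requests
with responses would need a pipelining client.

    import http.client

    # One TCP connection, reused for every request (HTTP/1.1 keep-alive).
    conn = http.client.HTTPConnection("das.example.org")

    # Hypothetical paths, e.g. the 1,000 matches from a search.
    paths = ["/das/human_build30/feature/F%d" % i
             for i in range(1, 1001)]

    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()
        data = resp.read()  # drain the body so the connection is reusable
        # ... hand 'data' to the parser here ...

    conn.close()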

The performance cost should pay for itself quickly once someone does
multiple queries. Suppose a second query also has 1,000 matches, 500
of which overlap the first query's. Under the existing DAS 1.5 spec,
all the data must be sent again. Under this proposal, only the 500
new features need be fetched.
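
A client-side cache keyed on the feature URL is all it takes to
exploit that overlap. A minimal sketch, reusing the hypothetical
fetch() helper from the earlier example:

    # Feature URL -> raw record; only never-seen URLs cost a round trip.
    feature_cache = {}

    def get_feature(url):
        if url not in feature_cache:
            feature_cache[url] = fetch(url)
        return feature_cache[url]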

One other issue mentioned in the SOAP proposals and in my REST
advocacy was the ability to stream through a feature table. Suppose
the feature table is large. People would like to see partial results
rather than wait until all the data is received. E.g., this would
allow them to cancel a download if they can see it contains the
wrong information.

If the results are sent in one block, this requires that the parsing
toolkit support a streaming interface. It is unlikely that most SOAP
toolkits will support this mode. It's also trickier to develop
software using a streaming API (like SAX) than a bulk API (like
DOM). This new spec gets around that problem by sending a list of
URLs instead of the full data. The individual records are small and
can be fetched one at a time and parsed with whatever means are
appropriate. This makes it easier to develop software which can
multitask between reading/parsing input and handling the user
interface.
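
For instance, a client might interleave fetching, parsing, and
display like this (a sketch continuing the earlier one; fetch() and
urls come from it, and draw_feature and user_pressed_cancel are
hypothetical stand-ins for real UI code):

    from xml.dom.minidom import parseString

    def draw_feature(record):
        print(record.documentElement.tagName)  # stand-in for drawing

    def user_pressed_cancel():
        return False                            # stand-in for a UI check

    for url in urls:
        record = parseString(fetch(url))  # small document, DOM is fine
        draw_feature(record)              # partial results appear at once
        if user_pressed_cancel():
            break                         # abandon the rest mid-download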

Caching

RFC 5 "DAS Caching" (http://www.biodas.org/RFCs/rfc005.txt) wants a
way to cache data. I believe most of the data requests will be for
feature data. Because these are independentially named and accessed
through that name using an HTTP GET, this means that normal HTTP
caching systems like the Squid proxy can be used along with standard
and well-defined mechanisms to control cache behaviour.
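
For example, if the server labels each feature response with a
validator such as an ETag, a Squid in the middle (or the client
itself) can revalidate cheaply with a conditional GET. A sketch with
an invented URL and ETag value:

    import urllib.request, urllib.error

    req = urllib.request.Request(
        "http://das.example.org/das/human_build30/feature/F42",
        headers={"If-None-Match": '"v1-abc123"'})  # ETag from a prior fetch
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()   # 200: the feature changed, fresh copy
    except urllib.error.HTTPError as err:
        if err.code == 304:
            body = None          # 304: the cached copy is still valid
        else:
            raise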

The caching proposal also considers P2P systems like Gnutella as a
way to distribute data. One possible scheme for this is to define a
mapping from URLs to a Gnutella resource. In this case, replace
'URL' above with 'URI'.



Andrew Dalke
***@dalkescientific.com
--
Need usable, robust software for bioinformatics or chemical
informatics? Want to integrate your different tools so you can
do more science in less time? Contact us!
http://www.dalkescientific.com/
Brian Gilman
2003-01-15 13:41:33 UTC
On 1/15/03 2:46 AM, "Andrew Dalke" <***@mindspring.com> wrote:

Hey Andrew,

Long time no talk. SOAP, WSDL, and UDDI are NEVER going to help you
send 50 MB of data across the wire! I've also thought about REST as a
means to build a distributed system. But the industry is just not
going that way.
There are MANY toolkits to program up a web service. Programming a REST
service means doing things that are non-standard and my engineering brain
says not to touch those things. SOAP has been able to solve a lot of
interoperability problems and will only get better over time. We use the
DIME protocol and compression to shove data over the wire. No need to parse
the document this way.

SOAP has two methods of asking for data:

1) RPC
2) Document centric

My question to you is: why reinvent the wheel?? Why program up yet
another wire protocol when you have something to work with already??
And DAS is a REST protocol!! Right now DAS just works. Why change it
to use anything else?? Is there a problem with the semantics of the
protocol that impedes any of the research we are doing?? Murphy's law
should be called the engineer's prayer.

Best,

-B
--
Brian Gilman <***@genome.wi.mit.edu>
Group Leader Medical & Population Genetics Dept.
MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617 252 1069 / fax +1 617 252 1902
Lincoln Stein
2003-01-15 17:26:25 UTC
I just want to keep things simple. As long as there are good Perl/Java/Python
APIs to DAS and performance is usable, none of the target audience
(application developers) is going to care in the least whether it's SOAP or
not. My concern with SOAP encapsulation is that it makes it harder to stream
DAS, at least with my favorite language, Perl. But I've got my fingers
crossed that eventually there will be a good streaming SOAP for Perl, and at
that point all my misgivings go away.

My understanding of REST is that it's defined by the negative -- it isn't
SOAP. That's not going to provide much in the way of reusability.

Lincoln
--
Lincoln Stein
***@cshl.org
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
Andrew Dalke
2003-01-15 18:48:40 UTC
[Blech! I'm subscribed to this list as "***@dalkescientific.com" since
that is my primary email address. But my 'From' is "***@mindspring.com"
because my ISP won't allow me to do otherwise. So every message I send
gets held for moderation. Sorry about that, moderators.]
Post by Lincoln Stein
As long as there are good Perl/Java/Python
APIs to DAS and performance is usable, none of the target audience
(applications developers) are going to care in the least whether it's
SOAP or not.
I agree.
Post by Lincoln Stein
My concern with SOAP encapsulation is that it makes it harder to
stream DAS, at least with my favorite language, Perl. But I've got my
fingers crossed that eventually there will be a good streaming SOAP
for Perl, and at that point all my misgivings go away.
Given my readings, I do not think this will happen:
http://www.xml.com/pub/a/2002/07/17/salz.html?page=last
} Note that even though the individual processing is fairly simple,
} the overall process is fairly complex and requires multiple passes
} over the header elements. In a streaming environment -- think SAX,
} not DOM -- that won't work. In fact, it's my bet that headers will
} spell the end of SAX-style SOAP processors. For example, a digital
} signature of a SOAP message naturally belongs in the header. In
} order to generate the signature, you need to generate a hash of the
} message content. How can you do that without buffering?

Brian mentioned DIME, which may help there. I do not think a
DIME-based solution affects my comments on caching and on fetching
only new features.
Post by Lincoln Stein
My understanding of REST is that it's defined by the negative -- it
isn't SOAP. That's not going to provide much in the way of
reusability.
I would rather say that most of the time SOAP isn't REST. The papers
I've read offer plenty of examples of what a REST-style architecture
is (as opposed to defining it by the negative).

Quoting from http://www.xfront.com/REST-Web-Services.html
* Client-Server: a pull-based interaction style: consuming
components pull representations.
* Stateless: each request from client to server must contain all the
information necessary to understand the request, and cannot take
advantage of any stored context on the server.
* Cache: to improve network efficiency responses must be capable of
being labeled as cacheable or non-cacheable.
* Uniform interface: all resources are accessed with a generic
interface (e.g., HTTP GET, POST, PUT, DELETE).
* Named resources - the system is comprised of resources which are
named using a URL.
* Interconnected resource representations - the representations of
the resources are interconnected using URLs, thereby enabling a client
to progress from one state to another.
* Layered components - intermediaries, such as proxy servers, cache
servers, gateways, etc, can be inserted between clients and resources to
support performance, security, etc.
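
To give a feel for the "uniform interface" point in DAS terms, the
same generic verbs work against every resource. A sketch with
invented names; the PUT half is pure speculation, since nothing in
DAS 1.5 supports writing:

    import http.client

    conn = http.client.HTTPConnection("das.example.org")

    # GET: read a named resource.
    conn.request("GET", "/das/human_build30/feature/F42")
    resp = conn.getresponse()
    print(resp.status, len(resp.read()))

    # PUT: replace the same resource, using the same naming scheme.
    conn.request("PUT", "/das/human_build30/feature/F42",
                 body=b"<FEATURE>...</FEATURE>",
                 headers={"Content-Type": "text/xml"})
    resp = conn.getresponse()
    print(resp.status)
    conn.close()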

Here's the PhD dissertation describing REST
http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
in full glory.


Andrew Dalke
***@dalkescientific.com
--
Need usable, robust software for bioinformatics or chemical
informatics? Want to integrate your different tools so you can
do more science in less time? Contact us!
http://www.dalkescientific.com/
Andrew Dalke
2003-01-15 18:33:58 UTC
Post by Brian Gilman
I've also thought about REST as a means to
make a distributed system. But, the industry is just not going that way.
Says the Java guy talking to Perl and Python coders. :)

The direction of industry should be an influence, but not a deciding
factor. ("If everyone else were to jump off the Empire State
Building....") In my posts I said that so far SOAP has been tricky
to use (interoperating between Python, Perl, and Java), and that the
benefits of SOAP given in the RFCs are not unique to a SOAP approach.
I also said, as you verified, that SOAP isn't useful for streaming
large responses, and pointed out an alternate approach that neither
the current 1.5 spec nor RFC 13 supports. I also pointed out that a
SOAP approach makes it hard to do caching.

[Actually, there's an itemized list of reasons at the bottom of this
post.]
Post by Brian Gilman
There are MANY toolkits to program up a web service.
Yes, and I tried four of them for Python and the SOAP::Lite one for
Perl. I also mentioned there are many toolkits for programming up a
REST service, since they are the same toolkits that already exist
for standard web programming.
Post by Brian Gilman
Programming a REST
service means doing things that are non-standard and my engineering brain
says not to touch those things.
How is it non-standard? A search can still be a SOAP request (or
XML-RPC, which is just as useful as SOAP and much less complicated).
Returning XML with a DTD is standardized, and the DTD can be used to
generate a native data structure. True, the DTD doesn't handle type
schemas, but that can be verified with an external schema. (Based on
the work of people I know, I'm now leaning towards RELAX NG, but
that's a different topic.)
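
As an illustration of that hybrid, the search itself could be an
XML-RPC call that returns a list of URLs, each then fetched (and
cached) as an ordinary resource. The endpoint and method name here
are invented:

    import xmlrpc.client  # the modern spelling of xmlrpclib

    server = xmlrpc.client.ServerProxy("http://das.example.org/RPC2")

    # Hypothetical method: search by segment and type, get URLs back.
    urls = server.das.features({"segment": "chr1:1000,2000",
                                "type": "exon"})
    for url in urls:
        print(url)  # each one is an ordinary, cacheable GET
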
Post by Brian Gilman
SOAP has been able to solve a lot of
interoperability problems and will only get better over time. We use the
DIME protocol and compression to shove data over the wire. No need to parse
the document this way.
And XML-RPC has been able to solve a lot of interoperability problems
and is mature and stable. What advantages does SOAP bring?

DIME? Here's what I know about it (dated 2002/09/18)
http://www.xml.com/pub/a/2002/09/18/ends.html

According to that, DIME is more of a de facto standard than a de jure
one, so how does that affect your engineering brain? ;)

More seriously, it's layers upon layers. As I read it, the DIME message
holds the SOAP message, and is decoded to get the SOAP portion out.
(This is because DIME is a binary format and may contain XML metacharacters.)

Therefore, why not return the DIME message without the SOAP part?
Just include the data sets directly.
Post by Brian Gilman
1) RPC
2) Document centric
My question to you is: Why reinvent the wheel?? Why program up yet
another wire protocol when you have something to work with already??
RFC 13, which suggests WSDL and UDDI for a DAS 2, is RPC, not
document centric. I agree that everything I mentioned for my REST
example could be done over SOAP, but in that case SOAP is being used
purely for serialization. Even then you limit caching (RFC 5),
because SOAP requests are all done via POST and a cache doesn't know
whether a POST request has side effects.

How am I reinventing the wheel? In my example recasting of DAS I
created no new wire protocols. Everything was returned as XML with a
DTD, just as in RFC 13 there's a new WSDL for every query. So the
two approaches require equal numbers of new definitions.
Post by Brian Gilman
And, DAS, is a REST protocol!! Right now DAS just works.
I disagree. Two data types, features and types, are not directly
addressable. They are only retrievable as part of a search, i.e.,
the 'types' and 'features' commands. (Semantically,
'features?feature_id=ABC' returns a list of matches, of length 1 or
0, as compared to a name, which returns the object or says "404 Not
Found".)

This means that DAS as it stands doesn't allow "InterService links"
as requested for RFC 3 nor allows RDF-style commentary and metadata.

And so I believe DAS 1.5 is not a REST protocol.
Post by Brian Gilman
Why change it to use
anything else?? Is there a problem with the semantics of the protocol that
impede any of the research that we are doing?? Murphy's law should be called
the engineer's prayer.
Yes, as listed:
- improved performance because previously fetched features do not
need to be re-retrieved for every search
- better integration with existing http caching proxies
- protocols are easier to understand
- toolkits for doing this are more widely available (than SOAP,
they are the same toolkits for the existing DAS spec)
- able to make links to a feature, e.g., with RDF (which can also
address RFC 10 on "normalizing groups")
- easy support for streaming
- easy extension to DAV for making a *writable* system using
standard and widely available authoring tools

Do they "impede research"? The performance ones make it easier
to work with distant data sources and easier to develop more
interactive tools. The ability to make direct links is, I
believe, a big but untapped advantage. The support for writing
makes it easier for people to maintain a DAS system.

Andrew Dalke
***@dalkescientific.com
--
Need usable, robust software for bioinformatics or chemical
informatics? Want to integrate your different tools so you can
do more science in less time? Contact us!
http://www.dalkescientific.com/
David Block
2003-01-15 16:41:42 UTC
Brian,

What libraries are you using for DIME? Is there good Java, Perl
support? I know you're a J2EE shop - what toolkit do you use?

Thanks,
Dave
--

----------------------------------------------
David Block -- Genome Informatics Developer
***@gnf.org
http://radio.weblogs.com/0104507
(858)812-1513