Lightweight Repository CRUD Service (LRCRUDS)

This is a REST-based service that will generally reside on a server along with some sort of digital repository. The function of the service is to allow authorized clients to create, retrieve, update, and possibly delete digital packages from the repository.

The format for creation, update, and retrieval is a Zip archive containing a standardized header file, a METS file conformant to the ECHODEP Hub and Spoke profile, plus all of the files and directories that comprise the object in the repository's native import or export format.

In order to identify specific packages in a repository the service must map URIs into the local identifiers used by the repository. The format of URIs used by the service must follow this format:

BaseURL "/" EscapedLocalRepositoryIdentifier [ "?" query ]

The BaseURL conforms to normal syntax for HTTP or HTTPS URLs, namely:

"http[s]:" "//" host [ ":" port ] [ abs_path ]

The [ "?" query ] portion may occur after the EscapedLocalRepositoryIdentifier under special circumstances to be described below, but it is not considered part of the local identifier.

The EscapedLocalRepositoryIdentifier must make up the final segment of the abs_path with any characters not otherwise allowed in a path segment escaped according to the rules in RFC 2396. When the EscapedLocalRepositoryIdentifier is unescaped it must be a local identifier for an object in the repository. In the case of a Create action, the local identifier can be the location in the repository into which the new record should be created. In terms of the Common Gateway Interface (CGI) the EscapedLocalRepositoryIdentifier can be considered the PATH_INFO variable. Following are some examples:
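As a minimal sketch of the URI mapping, the Python functions below join a base URL with an escaped local identifier and recover the identifier again. The function names and the example.edu host are hypothetical. Note also that urllib.parse.quote follows the newer RFC 3986 escaping rules, which is assumed here to be an acceptable stand-in for RFC 2396.

```python
from urllib.parse import quote, unquote, urlsplit


def build_package_uri(base_url: str, local_id: str, query: str = "") -> str:
    """Join BaseURL and the escaped local identifier into a service URI."""
    # safe="" also escapes "/", so the identifier stays a single path segment.
    escaped = quote(local_id, safe="")
    uri = base_url.rstrip("/") + "/" + escaped
    return uri + "?" + query if query else uri


def local_id_from_uri(uri: str) -> str:
    """Recover the repository's local identifier from a service URI."""
    # urlsplit separates the query, which is not part of the identifier.
    path = urlsplit(uri).path
    return unquote(path.rsplit("/", 1)[-1])
```

For instance, an identifier containing a slash, such as "2135/89342", would be sent as the single segment "2135%2F89342" and unescaped back to the original on the server side.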

In general, the LRCRUD service will reside on the same host as the repository which it serves, but it could be on any accessible port, and the path component of the base URL is arbitrary.

For this service the database CRUD actions (Create/Retrieve/Update/Delete) have been mapped to corresponding HTTP methods:

CRUD HTTP
Create POST The Create action uses the HTTP POST method with the local identifier part of the URI representing the location in the repository in which to create the new record. For example, with DSpace the local identifier would identify a specific collection in DSpace in which to create the new record. The local identifier is optional and will depend on the particular repository. It could be the handle of a collection as in DSpace or it could be a hierarchical path to a location on disk to store the files. If it is omitted the assumption is that the new package will be created in some default location dependent on the repository. The URI of the newly created record is returned as part of the HTTP response to the POST. For example, to create a new DSpace package in the DSpace collection with an identifier of 2135.12346:
  POST /dspace/lrcruds/2135.12346 HTTP/1.1
  Date: Sun, 06 Nov 1994 08:49:37 GMT
  From: thabing@uiuc.edu
  User-Agent: ECHODEP_Hub_and_Spoke/1.0
  Content-Length: 3495
  Content-MD5: <Base64-encoded MD5 Hash>
  Content-Type: application/x-www-form-urlencoded
  Last-Modified: Tue, 01 Nov 1994 12:45:26 GMT

  <X-WWW-FORM-URLENCODED OCTETS>
          
A successful response would look like the following:
  201 Created
  Date: Sun, 06 Nov 1994 08:51:09 GMT
  Server: ECHODEP_LRCRUDS/1.0 DSpace/1.4
  Allow: POST, HEAD
  Location: http://some.place.edu/dspace/lrcruds/2135.89342
          
It is critical to note that the package itself is not uploaded as part of the POST request, but that the POST request creates only a stub or placeholder record. However, if there are additional parameters required by the repository to create the stub record those parameters can be passed in the body of the request as URL encoded form values. These will be dependent on the requirements of a specific repository. Since there will be a lag between the creation of the stub record and the update of the actual package, and in some cases this lag may be significant, the stub record should be clearly labeled as a placeholder and not a 'real' record. If the repository has a means to suppress this record it should do so until it is updated by the real record.

The reason that the actual package is not uploaded as part of the POST is that the identifier assigned to the package by the repository needs to be embedded in the METS file which is part of the package. The typical sequence of operations to ingest a new package would be to use POST to create a new placeholder record and get the identifier for that record. That identifier is then used to update provenance and other metadata which is part of the package, and then the placeholder record is updated or overwritten with the actual package using the PUT action.
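The two-step ingest sequence described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference client: the ingest function and the build_package callback are hypothetical names, and conn is assumed to be any object offering the request() and getresponse() methods of http.client.HTTPConnection. Authentication and the other common headers would be passed in through extra_headers.

```python
import base64
import hashlib
from urllib.parse import urlsplit


def ingest(conn, location_path, build_package, extra_headers=None):
    """POST to create a stub record, then PUT the real package to its URI.

    build_package(uri) is a hypothetical callback that returns the zip bytes
    after the new identifier has been embedded in the METS file.
    """
    headers = dict(extra_headers or {})

    # Step 1: create the placeholder record and learn its URI.
    conn.request("POST", location_path, body=None, headers=headers)
    resp = conn.getresponse()
    resp.read()
    if resp.status != 201:
        raise RuntimeError(f"Create failed: {resp.status} {resp.reason}")
    package_uri = resp.getheader("Location")

    # Step 2: build the package now that its identifier is known.
    body = build_package(package_uri)
    put_headers = dict(headers)
    put_headers["Content-Type"] = "application/zip"
    put_headers["Content-Length"] = str(len(body))
    put_headers["Content-MD5"] = base64.b64encode(
        hashlib.md5(body).digest()).decode("ascii")

    # Step 3: overwrite the placeholder with the real package.
    conn.request("PUT", urlsplit(package_uri).path, body=body,
                 headers=put_headers)
    resp = conn.getresponse()
    resp.read()
    if resp.status != 204:
        raise RuntimeError(f"Update failed: {resp.status} {resp.reason}")
    return package_uri
```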
Update PUT The Update action uses the HTTP PUT method. If the identifier does not already exist in the repository an HTTP 404 Not Found error is returned. If the identifier does exist, the package is replaced or updated with the new package.

For example, to update a DSpace package (or overwrite its placeholder record) with an identifier of 2135.89342:
  PUT /dspace/lrcruds/2135.89342 HTTP/1.1
  Date: Sun, 06 Nov 1994 08:49:37 GMT
  From: thabing@uiuc.edu
  User-Agent: ECHODEP_Hub_and_Spoke/1.0
  Content-Length: 3495
  Content-MD5: <Base64-encoded MD5 Hash>
  Content-Type: application/zip
  Last-Modified: Tue, 01 Nov 1994 12:45:26 GMT

  <ZIP FILE OCTETS>
          
A successful response would look like the following:
  204 No Content
  Date: Sun, 06 Nov 1994 08:51:02 GMT
  Server: ECHODEP_LRCRUDS/1.0 DSpace/1.4
  Allow: GET, PUT, DELETE, HEAD
          
The HTTP PUT method must be idempotent, meaning that the same request with the same data must produce the same result no matter how many times it is performed.
Retrieve GET Retrieval of a record is mapped to the HTTP GET method. The URL is the same one that would be used to update the record, for example:
  GET /dspace/lrcruds/2135.89342 HTTP/1.1
  Date: Sun, 06 Nov 1994 08:49:37 GMT
  From: thabing@uiuc.edu
  User-Agent: ECHODEP_Hub_and_Spoke/1.0
          
A successful response would look like the following:
  200 OK
  Date: Sun, 06 Nov 1994 08:51:02 GMT
  Server: ECHODEP_LRCRUDS/1.0 DSpace/1.4
  Allow: GET, PUT, DELETE, HEAD
  Content-Length: 3495
  Content-MD5: <Base64-encoded MD5 Hash>
  Content-Type: application/zip
  Last-Modified: Tue, 01 Nov 1994 12:45:26 GMT
  
  <ZIP FILE OCTETS>
          
Delete DELETE Deletion of a record is mapped to the HTTP DELETE method. The URL is the same one that would be used to retrieve or update the record, for example:
  DELETE /dspace/lrcruds/2135.89342 HTTP/1.1
  Date: Sun, 06 Nov 1994 08:49:37 GMT
  From: thabing@uiuc.edu
  User-Agent: ECHODEP_Hub_and_Spoke/1.0
          
A successful response would look like the following:
  204 No Content
  Date: Sun, 06 Nov 1994 08:51:02 GMT
  Server: ECHODEP_LRCRUDS/1.0 DSpace/1.4
  Allow: GET, PUT, DELETE, HEAD
        
Other Administrative Functions POST HEAD In addition to Create, the POST method may be used for miscellaneous administrative functions which are not CRUD actions. All POST requests must be encoded as application/x-www-form-urlencoded, and the responses must be XML conforming to the schema outlined in this document. Unlike the PUT, GET, and DELETE methods, POST is not required to be idempotent. In fact, the Create action is not idempotent, because it generates a new stub record and returns a new identifier each time it is used.

The HEAD method may be issued to retrieve meta-information about a resource. In general, a HEAD request will be treated the same as a GET request except that the entity body is not returned, only the HTTP header is returned.

For information on programming the PUT and DELETE methods from various client and server platforms see http://www.intertwingly.net/wiki/pie/PutDeleteSupport.

The following sections will describe the various actions in more detail. In general, the rules defined in the HTTP specification must be followed for all HTTP methods and headers unless stated otherwise below.

Common HTTP Header Fields

Common

The following HTTP header fields are common to all methods:

Date
The Date header must be present and it must contain the date/time at which the request or response originated. The date should conform to RFC 1123 as outlined in the HTTP specification, but may conform to other formats as allowed by the HTTP standard, for example:

Date: Sun, 06 Nov 1994 08:49:37 GMT
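As an illustration, the Python standard library can produce and parse dates in this format. The helper names below are hypothetical, and the sketch assumes timezone-aware datetime values (a naive value would be interpreted as local time by astimezone).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 1123 date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def parse_http_date(value: str) -> datetime:
    """Parse a Date header value into a timezone-aware datetime."""
    return parsedate_to_datetime(value)


print(http_date(datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc)))
# prints "Sun, 06 Nov 1994 08:49:37 GMT"
```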

Entity Body Processing

Because the files being transported may be large, chunked transfer encoding must be supported for both the retrieve (GET) response and the update (PUT) request. The following headers pertain to entity body processing with both requests and responses:

Content-Length
If chunked transfer encoding is used the Content-Length header must be omitted. However, if chunked transfer encoding is not used the Content-Length header is required. If the Content-Length header is present, receiving applications must verify the actual length of the entity body against the length value in the Content-Length header and must record or report any discrepancies in a fashion appropriate to the underlying system.

Content-Length: 12345

Content-MD5
The Content-MD5 header is required for all requests and responses which carry an entity body.

Content-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==

Receiving systems must validate that the Content-MD5 value from the header matches the MD5 generated from the entity body after it is received, and must record or report any discrepancies in a fashion appropriate to the underlying system.
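A Content-MD5 value is the Base64 encoding of the 16-byte MD5 digest of the entity body; the sample value shown above happens to be the digest of an empty body. The following Python sketch computes and checks the value. The function names are hypothetical, and the streaming variant is one possible way to avoid holding a large package in memory.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the Content-MD5 header value for an in-memory entity body."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def content_md5_stream(fileobj, block_size: int = 65536) -> str:
    """Compute the same value incrementally from a binary file object."""
    digest = hashlib.md5()
    for block in iter(lambda: fileobj.read(block_size), b""):
        digest.update(block)
    return base64.b64encode(digest.digest()).decode("ascii")


def md5_matches(header_value: str, body: bytes) -> bool:
    """True when the received header agrees with the received body."""
    return header_value.strip() == content_md5(body)


print(content_md5(b""))  # prints "1B2M2Y8AsgTpgAmY7PhCfg=="
```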

Content-Type
The Content-Type header is required for all requests and responses which carry an entity body.

Content-Type: application/zip
Content-Type: application/x-www-form-urlencoded

Currently the only expected content types are application/zip used for updating (PUT) or retrieving (GET) packages and application/x-www-form-urlencoded for sending optional, repository-specific parameters to a repository during a create (POST). Receiving systems must validate that the Content-Type value from the header matches the actual content type of the entity body after it is received, and must record or report any discrepancies in a fashion appropriate to the underlying system.
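How a receiver decides that a body "matches" its declared type is not defined here. One possible interpretation, sketched below in Python, is to check that a zip body has a readable zip directory and that a form body parses as name=value pairs. The function name and the choice of checks are assumptions.

```python
import io
import zipfile
from urllib.parse import parse_qsl


def body_matches_content_type(content_type: str, body: bytes) -> bool:
    """Roughly check an entity body against its declared Content-Type."""
    # Ignore any parameters such as "; charset=...".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/zip":
        # is_zipfile looks for the zip end-of-central-directory record.
        return zipfile.is_zipfile(io.BytesIO(body))
    if media_type == "application/x-www-form-urlencoded":
        try:
            parse_qsl(body.decode("ascii"), keep_blank_values=True,
                      strict_parsing=True)
        except (UnicodeDecodeError, ValueError):
            return False
        return True
    # No other content types are expected by the protocol.
    return False
```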

Transfer-Encoding
The only transfer encodings supported by this protocol are none or chunked. If no transfer encoding is used this header must not be used, but if chunked is used the header must be present, such as:

Transfer-Encoding: chunked
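The relationship between Transfer-Encoding and Content-Length, and the chunked wire format itself, can be sketched in Python as below. The function names are hypothetical; the framing (hexadecimal size, CRLF, data, CRLF, then a zero-size final chunk) follows the HTTP/1.1 chunked transfer coding.

```python
def framing_headers(body_length: int, chunked: bool) -> dict:
    """Choose exactly one of Transfer-Encoding or Content-Length."""
    if chunked:
        return {"Transfer-Encoding": "chunked"}
    return {"Content-Length": str(body_length)}


def encode_chunked(chunks):
    """Yield the chunked encoding of an iterable of byte strings."""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would end the body prematurely
            yield b"%X\r\n" % len(chunk) + chunk + b"\r\n"
    yield b"0\r\n\r\n"  # last-chunk marker with no trailer fields


print(b"".join(encode_chunked([b"hel", b"lo"])))
# prints b'3\r\nhel\r\n2\r\nlo\r\n0\r\n\r\n'
```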

Future Considerations

Request

The following HTTP header fields are common to all request methods:

User-Agent
This value must identify the name and version of the client application that is making the request, for example:

User-Agent: ECHODEP_Hub_and_Spoke/1.0

From
This value must contain the email address of the person who is responsible for the user agent making the request. This may not be the same person who is actually sitting at the console operating the user agent, but will typically be the system administrator or other person responsible for the user agent overall, for example:

From: xyz123@uiuc.edu

Authorization
In general, every request to the LRCRUD Service will require authorization. Currently, this is accomplished using HTTP Basic Authentication, for example:

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

Because Basic Authentication is not secure (passwords are sent over the network in the clear), the secure HTTPS protocol should be used. Future versions of this protocol may allow additional authentication schemes.
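For illustration, a Basic Authentication header value is the word "Basic" followed by the Base64 encoding of "user:password". The Python sketch below uses a hypothetical function name and assumes UTF-8 for the credentials; the sample value shown above decodes to the user "Aladdin" with the password "open sesame", which demonstrates why the scheme offers no confidentiality without HTTPS.

```python
import base64


def basic_authorization(user: str, password: str) -> str:
    """Build the value of an Authorization header for Basic Authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


print(basic_authorization("Aladdin", "open sesame"))
# prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```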

Response

The following HTTP header fields are common to all responses:

Server
This value must identify the names and versions of the services that are responding to the request. There must be two products represented in this header. The first product is the name and version of the LRCRUDS service itself, and the second product must be the name and version of the repository with which the LRCRUDS is communicating, for example:

Server: ECHODEP_LRCRUDS/1.0 DSpace/1.4

If the native repository already uses a specific product identifier in its own HTTP responses that same product identifier must be used here.

Allow
This value is a comma-separated list of HTTP methods that are allowed on the given resource as identified by the EscapedLocalRepositoryIdentifier. For example, a request for a package might return this Allow in the response header:

Allow: GET, PUT, DELETE, HEAD

But a request for a location where packages may be posted would return this Allow in the response header:

Allow: POST, HEAD

WWW-Authenticate
All HTTP 401 Unauthorized responses must have a WWW-Authenticate header, such as:

WWW-Authenticate: Basic realm="LRCRUDS"

Currently this protocol only supports Basic Authentication. See the Authorization request header for more details.

Create (POST)

The Create action is used to create a placeholder or stub package in the repository and return the local identifier of that package back to the requester.

Request

The EscapedLocalRepositoryIdentifier portion of the URL must contain the local identifier for the location in which the stub package is to be created. If no EscapedLocalRepositoryIdentifier is given the service may create the stub package in a default location if there is one or return an error message. If an EscapedLocalRepositoryIdentifier is given the service must verify that it does identify a location where new packages may be created.

The entity body of the request may optionally contain X-WWW-FORM-URLENCODED values. These are dependent on the requirements of the serviced repository, but may include minimal metadata required to create the stub package, such as title or alternate identifiers. If the optional X-WWW-FORM-URLENCODED entity body is present, the request must include Content-Length, Content-Type, and Content-MD5 headers which correspond to the entity body.

Response

After successful execution of the Create request the service must respond with an HTTP 201 Created message. The response must also include a Location header which contains the absolute URI of the newly created package. This is the URI which can be used in later Retrieve, Update, or Delete interactions with that package.

If the location identified by the EscapedLocalRepositoryIdentifier does not exist an HTTP 404 status must be returned. The reason phrase must be "Location not found".

If the EscapedLocalRepositoryIdentifier is required but not included with the request an HTTP 400 status must be returned with a reason phrase of "Location is required".

If the location identified by the EscapedLocalRepositoryIdentifier exists but does not represent a location in which new packages may be created then an HTTP 400 must be returned with a reason phrase "Packages may not be created in this location". For example, the identifier may represent a package instead of a location that contains packages.

If the Create request requires authorization and valid credentials are not supplied, the server must respond with an HTTP 401 Unauthorized status.

If the service has retrieved data from the underlying repository but the data is corrupt or the service is unable to process the data for any reason, the service must respond with an HTTP 502 Bad Gateway response. If a more definitive status message is available, then the service should use it in place of the generic "Bad Gateway." This message will be dependent on the underlying repository.

If the service is not able to communicate with the underlying repository for any reason, it must respond with an HTTP 503 Service Unavailable response. If a more definitive status message is available, then the service should use it in place of the generic "Service Unavailable." This message will be dependent on the underlying repository.

If the service has established a connection to the underlying repository and is attempting to retrieve data, but the request has timed out, the service must respond with an HTTP 504 Gateway Timeout response. If a more definitive status message is available, then the service should use it in place of the generic "Gateway Timeout." This message will be dependent on the underlying repository.

If there are problems processing the entity body, such as missing required parameters, an HTTP 400 status must be returned. These messages will be dependent on the requirements of the specific repository, but the reason phrase should contain enough information for an administrator to try and correct the problem before retrying the request.

If there is an optional entity body, the Content-Length header is required. If this is missing the service must return an HTTP 411 Length Required status.

If there is an optional entity body, the Content-MD5 header is also required, and the LRCRUDS service must verify that the MD5 checksum contained in the header matches the checksum of the actual entity body. If it does not match the service must respond with an HTTP 400 status with a reason phrase "MD5 checksum does not match".
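The client-side error conditions for Create can be summarized as a small decision function. The Python sketch below is hypothetical in its name and parameters, and the order in which the conditions are tested is an assumption, since no precedence among them is specified. Gateway errors (502, 503, 504) and repository-specific body validation are left out.

```python
def create_status(*, authorized, location_given, location_required=True,
                  location_exists=True, accepts_packages=True,
                  has_body=False, has_content_length=True, md5_ok=True):
    """Return (status, reason phrase) for a Create (POST) request."""
    if not authorized:
        return 401, "Unauthorized"
    if not location_given:
        if location_required:
            return 400, "Location is required"
        # Otherwise the repository's default location is used.
    elif not location_exists:
        return 404, "Location not found"
    elif not accepts_packages:
        return 400, "Packages may not be created in this location"
    if has_body:
        if not has_content_length:
            return 411, "Length Required"
        if not md5_ok:
            return 400, "MD5 checksum does not match"
    return 201, "Created"
```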

Update (PUT)

The Update action replaces the package identified by EscapedLocalRepositoryIdentifier with the new package which must be contained in the entity body of the PUT request.

Request

The EscapedLocalRepositoryIdentifier portion of the URL must contain the local identifier for the package which is being put into the repository. Generally this will be an identifier which was returned as part of a previous Create (POST) operation. For Update the EscapedLocalRepositoryIdentifier is always mandatory. An Update always assumes that the package or at least a placeholder for the package already exists in the repository, and the Update operation is replacing the old contents of that package with the new contents.

The entity body of the request must contain a zip file which contains the files making up the package to be ingested. The Content-Type header value must be "application/zip".

Response

After successful execution of the Update request the service must respond with an HTTP 204 No Content message. This signifies that the request was successful, but no entity body is returned.

If the package identified by the EscapedLocalRepositoryIdentifier does not exist an HTTP 404 status must be returned. The reason phrase must be "Package not found".

If the Content-Type header of the request is not "application/zip", the service must respond with an HTTP 415 status. The reason phrase must be "application/zip is the only supported media type".

If the Update request requires authorization and valid credentials are not supplied, the server must respond with an HTTP 401 Unauthorized status.

Our current implementation must respond with an HTTP 501 status if the request contains a Content-Range header. The reason phrase must be "Content-Range is not implemented". This might change in a future version of the protocol.

If the service has retrieved data from the underlying repository but the data is corrupt or the service is unable to process the data for any reason, the service must respond with an HTTP 502 Bad Gateway response. If a more definitive status message is available, then the service should use it in place of the generic "Bad Gateway." This message will be dependent on the underlying repository.

If the service is not able to communicate with the underlying repository for any reason, it must respond with an HTTP 503 Service Unavailable response. If a more definitive status message is available, then the service should use it in place of the generic "Service Unavailable." This message will be dependent on the underlying repository.

If the service has established a connection to the underlying repository and is attempting to retrieve data, but the request has timed out, the service must respond with an HTTP 504 Gateway Timeout response. If a more definitive status message is available, then the service should use it in place of the generic "Gateway Timeout." This message will be dependent on the underlying repository.

If the client making the request to the LRCRUDS service does not complete the request in a timely fashion the service may respond with an HTTP 408 Request Timeout status. This might happen if there is a long delay between the client sending the request headers and the entity body.

If there are problems processing the entity body, such as the zip file is corrupt or the zip file does not contain the files expected for ingest into the repository, an HTTP 400 status must be returned. These messages will be dependent on the requirements of the specific repository, but the reason phrase should contain enough information for an administrator to try and correct the problem before retrying the request.

The Content-Length header is not allowed if chunked transfer encoding is used. If chunked transfer encoding is not being used this header is required, and if it is missing the service must return an HTTP 411 Length Required status.

The Content-MD5 header is also required, and the LRCRUDS service must verify that the MD5 checksum contained in the header matches the checksum of the actual entity body. If it does not match the service must respond with an HTTP 400 status with a reason phrase "MD5 checksum does not match".
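The header-level checks for an Update request might be sketched in Python as follows. The function name is hypothetical, the order of the checks is an assumption, and a missing Content-MD5 header is treated here the same as a mismatch, which is also an assumption. The function returns None when these checks pass; existence of the package, authorization, and validation of the zip contents would be handled elsewhere.

```python
import base64
import hashlib


def check_put_request(headers: dict, body: bytes):
    """Return (status, reason) for a failed check, or None if all pass."""
    h = {name.lower(): value for name, value in headers.items()}

    media_type = h.get("content-type", "").split(";", 1)[0].strip().lower()
    if media_type != "application/zip":
        return 415, "application/zip is the only supported media type"

    if "content-range" in h:
        return 501, "Content-Range is not implemented"

    chunked = h.get("transfer-encoding", "").strip().lower() == "chunked"
    if not chunked and "content-length" not in h:
        return 411, "Length Required"

    expected = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    if h.get("content-md5", "").strip() != expected:
        return 400, "MD5 checksum does not match"

    return None
```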

Retrieve (GET)

The Retrieve action gets a package identified by the EscapedLocalRepositoryIdentifier from a repository and returns all the associated files as a zip archive.

Request

The EscapedLocalRepositoryIdentifier portion of the URL must contain the local identifier for the package which is being retrieved. Generally this will be an identifier which was returned as part of a previous Create (POST) operation. For Retrieve the EscapedLocalRepositoryIdentifier is always mandatory.

Response

After successful execution of the Retrieve request the service must respond with an HTTP 200 OK message. This signifies that the request was successful, and the entity body contains the requested package.

If the package identified by the EscapedLocalRepositoryIdentifier does not exist an HTTP 404 status must be returned. The reason phrase must be "Package not found".

If the Retrieve request requires authorization and valid credentials are not supplied, the server must respond with an HTTP 401 Unauthorized status.

If the service has retrieved data from the underlying repository but the data is corrupt or the service is unable to process the data for any reason, the service must respond with an HTTP 502 Bad Gateway response. If a more definitive status message is available, then the service should use it in place of the generic "Bad Gateway." This message will be dependent on the underlying repository.

If the service is not able to communicate with the underlying repository for any reason, it must respond with an HTTP 503 Service Unavailable response. If a more definitive status message is available, then the service should use it in place of the generic "Service Unavailable." This message will be dependent on the underlying repository.

If the service has established a connection to the underlying repository and is attempting to retrieve data, but the request has timed out, the service must respond with an HTTP 504 Gateway Timeout response. If a more definitive status message is available, then the service should use it in place of the generic "Gateway Timeout." This message will be dependent on the underlying repository.

The Content-Type header of the response must be "application/zip". Unless the Transfer-Encoding is chunked, the Content-Length header is required. The Content-MD5 header is also required. A Last-Modified header should be present with the date on which the package was last modified by the repository, if available.
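A server-side sketch of these response headers in Python might look like the following. The function name is hypothetical, the package is assumed to fit in memory for the sake of brevity, and last_modified is assumed to be a timezone-aware datetime when the repository can supply one.

```python
import base64
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime


def retrieve_headers(package: bytes, last_modified=None, chunked=False):
    """Build the entity headers for a successful Retrieve (GET) response."""
    headers = {
        "Content-Type": "application/zip",
        "Content-MD5": base64.b64encode(
            hashlib.md5(package).digest()).decode("ascii"),
    }
    if chunked:
        # Content-Length must be omitted when the body is chunked.
        headers["Transfer-Encoding"] = "chunked"
    else:
        headers["Content-Length"] = str(len(package))
    if last_modified is not None:
        headers["Last-Modified"] = format_datetime(
            last_modified.astimezone(timezone.utc), usegmt=True)
    return headers
```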

Delete (DELETE)

Other Administrative Functions (POST)

Zip File Layout

Header File

<LRCRUDS date='' version=''>
  <packageIdentifier>...</packageIdentifier>
  <repositoryInformation>
    <premis:agent>
      ...
    </premis:agent>
  </repositoryInformation>
  <metsFilename>...</metsFilename>
</LRCRUDS>
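As a rough illustration of how a client or service might read such a header file, the Python sketch below extracts the top-level fields with the standard library XML parser. The function name is hypothetical, the header is assumed to be well-formed XML with the element names shown above, and the premis:agent content is ignored because its namespace and structure are not given here.

```python
import xml.etree.ElementTree as ET


def read_header(xml_text: str) -> dict:
    """Extract the top-level fields from an LRCRUDS header document."""
    root = ET.fromstring(xml_text)
    if root.tag != "LRCRUDS":
        raise ValueError("not an LRCRUDS header file")
    return {
        "date": root.get("date"),
        "version": root.get("version"),
        "packageIdentifier": root.findtext("packageIdentifier"),
        # Name of the METS file inside the zip archive.
        "metsFilename": root.findtext("metsFilename"),
    }
```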

TODO: Investigate using the Java JAR file manifest instead of creating our own formats. This would allow functionality such as digitally signing the file. See JAR File Specification for details.

Security Considerations

References

http://wiki.dspace.org/LightweightNetworkInterface


Contact Tom Habing with any questions or comments.