
The TextGrid TG-publish Service Documentation has moved HERE!




Overview

The TG-publish service shall be used to publish documents and/or TextGrid editions/collections created within the TextGridLab to the public TextGrid Repository. Documents that are published in the TextGrid Repository...

  • ...can be publicly searched from within the TextGridLab TG-search GUI.
  • ...are publicly available in the TextGrid Repository Browser.
  • ...receive a Handle persistent identifier (PID), so that they stay accessible.
  • ...are stored in a secure and robust repository environment that provides retrieval for the long term.

There are three main ways to publish data:

  1. Valid TextGrid Editions or Collections that have already been created in the TextGridLab can be published. To learn more about the mandatory TextGrid metadata, please have a look at the TextGrid Metadata Schema documentation.
  2. Some technical data from the TextGridLab can be published as WorldReadable objects. These objects and their metadata are tested differently for validity, they are not added to the public search index, and they get no PIDs.
  3. Objects that shall be checked for validity after importing can first be published into the TextGrid Sandbox. After importing, the objects can be checked in the TextGrid Repository Sandbox Browser and then either finally published using a special TG-publish policy, or deleted again if they are invalid, incomplete or erroneous. So far TG-publish only covers the final publishing, hence you need write access to the RDF database. Publishing and deletion are so far only possible via the Import Tool External (koLibRI), and the Sandbox is not yet used from within the TextGridLab.

Furthermore, the TG-publish service can be used to copy TextGrid objects from (a) the TextGridLab and (b) the TextGridRep into your own TextGrid projects in the TG-lab.

TG-publish uses the workflow library koLibRI (kopal Library of Retrieval and Ingest).

GIT Repository

The GIT repository can be found here:

https://projects.gwdg.de/projects/kolibri/repository

The tags related to this TG-publish documentation can be found here:

https://projects.gwdg.de/projects/kolibri/repository/show/kolibri-tgpublish-service?rev=2.4.1
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-client
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-api

Version

This page is valid for TG-publish Service Version as of

3.4.0-2014-07-25T13:27-'TextGrid 3.0'

To check the currently used TG-publish version simply try:

http://textgridlab.org/1.0/tgpublish/version

More Documentation

For a more detailed understanding of the koLibRI workflow tool and its modules, see the koLibRI documentation:

koLibRI Documentation Version 2.0 (German – work in progress)

Installation

In general, a complete TG-publish service installation consists of (or needs):

  1. The TG-publish service itself.
  2. A second TG-crud instance for public data (TG-crud public).
  3. A second TG-search instance for public data including a second eXist and Sesame instance (TG-search public).
  4. The public TextGrid Repository Browser (including the TextGrid Sandbox).

Downloading TG-publish

TG-publish can be downloaded from the DARIAH Nexus Repository as WAR file from

http://dev.dariah.eu/nexus/service/local/artifact/maven/redirect?r=releases&g=de.langzeitarchivierung.kolibri&a=kolibri-tgpublish-service&v=3.4.0&e=war

It can then be renamed to tgpublish.war (just for simplicity) and deployed directly into Tomcat's webapps folder. You can of course also try a newer version, if one exists.

If something needs to be added to TG-publish (e.g. more koLibRI modules which extend its functionality), you can find the current koLibRI sources here:

https://projects.gwdg.de/projects/kolibri/repository

Just do a

mvn package

in /kolibri/kolibri-tgpublish-service/. You may have to check out all the modules needed for building TG-publish; their names all begin with kolibri-tgpublish. You should NOT change the API and the client, because the TG-lab depends on them! To test the TG-publish service (locally, but with non-local services) you can also run

mvn tomcat:run

Configuration

Two main configuration files come with the tgpublish koLibRI module and need attention: config.xml and policies.xml. The latter defines the TG-publish workflow and leads the publishing process through the different ActionModules, which are normally processed one by one. These modules can share information using a custom data object. For more detailed information please see the work-in-progress koLibRI documentation version 2.0 referred to above.

TG-publish is pre-configured to keep its configuration files in /etc/textgrid/tgpublish/ and to log to /var/log/textgrid/tgpublish/.

So please create the appropriate folders and then copy the config files from

https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-service/config

into the config folder. Do not forget to set the permissions and owner settings so that Tomcat can write to it!

More config files may be needed from

https://projects.gwdg.de/projects/kolibri/repository/revisions/master/show/config

Please copy just the following files to begin with, and add more if file-not-found errors occur:

dias_formatregistry.xml
jhove.conf
…

policies.xml

There are four policies to be used with the TextGridLab at the moment:

  1. TGPublish
  2. TGPublishWorldReadable
  3. TGPublishSandboxData, and
  4. TGCopy

All these workflows (or policies) are described in the policies.xml file and define the order in which the koLibRI ActionModules are processed. Each workflow is started as a ProcessStarter with the current configuration (see below).

  • TGPublish is used from within the TG-lab via the Publish Perspective.
  • TGPublishWorldReadable is also used from within the TG-lab, but only applies to single technical files such as XML Schema documents, XSLT stylesheets, TextGrid workflow documents, etc. Which kinds of files can be published as WorldReadable can be checked by requesting the worldReadable list (see listWorldReadables() below).
  • TGPublishSandboxData is used to finally publish objects that were imported into the TextGrid Sandbox. It is used e.g. by the Import Tool External (koLibRI).
  • TGCopy is used by the TG-copy workflow to copy TextGrid objects from either the public or the non-public repository into your own projects for further processing. This includes rewriting URIs and similar adjustments.

These four policies are explained in detail below.

TGPublish

<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublish">
  <step class="textgrid.PublishStart">
    <step class="textgrid.PublishCheckEdition">
      <step class="textgrid.CheckIsPublic">
        <step class="textgrid.CheckReferences">
          <step class="textgrid.GetPids">
            <step class="textgrid.ModifyAndUpdate">
              <step class="textgrid.CopyElasticSearchIndex">
                <step class="textgrid.CopyRelationData">
                  <step class="textgrid.MoveToStaticGridStorage">
                    <step class="textgrid.UpdateTgauth">
                      <step class="textgrid.PublishComplete" />
                    </step>
                  </step>
                </step>
              </step>
            </step>
          </step>
        </step>
      </step>
    </step>
  </step> 
</policy>

PublishStart

Just marks the publish process started. 

PublishCheckEdition

Checks for correct Edition/Collection Metadata.

CheckIsPublic

Checks for already published objects.

CheckReferences

Checks if some objects that are referred to, are NOT contained in the current Edition/Collection to publish.

GetPids

Fetches PIDs for every object's TextGrid URI using the GWDG Handle Service.

ModifyAndUpdate

Rewrites several URIs to PIDs, modifies all necessary object metadata and/or data, and finally updates everything by calling TG-crud#UPDATEMETADATA or TG-crud#UPDATE.

CopyElasticSearchIndex

Copies the search index to the public index database.

CopyRelationData

Copies the RDF relation data to the public RDF database.

MoveToStaticGridStorage

Moves all metadata and data to the public storage location.

UpdateTgauth

Just updates TG-auth by calling the method TG-auth#PUBLISH.

PublishComplete

PublishComplete is called just to ensure the operation has finished successfully, and to report to logfiles, etc.

TGPublishWorldReadable

<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublishWorldReadable">
  <step class="textgrid.PublishStart">
    <step class="textgrid.PublishCheckWorldReadable">
      <step class="textgrid.ModifyAndUpdate">
        <step class="textgrid.MoveToStaticGridStorage">
          <step class="textgrid.UpdateTgauth">
            <step class="textgrid.PublishComplete" />
          </step>
        </step>
      </step>
    </step>
  </step> 
</policy>

PublishStart

See above.

PublishCheckWorldReadable

Checks for correct WorldReadable Metadata.

ModifyAndUpdate

Rewrites several URIs to PIDs, modifies all necessary object metadata and/or data, and finally updates everything by calling TG-crud#UPDATEMETADATA or TG-crud#UPDATE.

MoveToStaticGridStorage

Moves all metadata and data to the public storage location.

UpdateTgauth

See above.

PublishComplete

See above.

TGPublishSandboxData

<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublishSandboxData">
  <step class="textgrid.PublishStart">
    <step class="textgrid.UpdateTgauth">
      <step class="textgrid.ReleaseNearlyPublishedRelation">
        <step class="textgrid.PublishComplete" />
      </step>
    </step>
  </step>
</policy> 

PublishStart

See above.

UpdateTgauth

See above.

ReleaseNearlyPublishedRelation

Releases the nearlyPublished relation in the TG-rep's Sesame triple store and the ElasticSearch database – so that the object is viewable and searchable in the public TextGrid repository browser and TextGridLab search GUI.

PublishComplete

See above.
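The nesting of the step elements can be read as the processing order of the ActionModules. The following is a minimal sketch, assuming that each step has at most one nested step (as in the policies shown on this page) and that steps are processed from the outermost to the innermost element. The function name step_order is purely illustrative and not part of koLibRI.

```python
import xml.etree.ElementTree as ET

# The TGPublishSandboxData policy as listed above.
POLICY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublishSandboxData">
  <step class="textgrid.PublishStart">
    <step class="textgrid.UpdateTgauth">
      <step class="textgrid.ReleaseNearlyPublishedRelation">
        <step class="textgrid.PublishComplete" />
      </step>
    </step>
  </step>
</policy>
"""


def step_order(policy_xml: str) -> list[str]:
    """Return the step classes of a policy from outermost to innermost.

    Assumption: every step contains at most one nested step, so only the
    first child is followed. Branching policies are not handled here.
    """
    root = ET.fromstring(policy_xml.encode("utf-8"))
    order = []
    step = root.find("step")
    while step is not None:
        order.append(step.get("class"))
        step = step.find("step")
    return order


if __name__ == "__main__":
    for number, name in enumerate(step_order(POLICY_XML), start=1):
        print(number, name)
```

Such a listing can be handy for checking a modified policies.xml before deploying it.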

TGCopy 

<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGCopy">
  <step class="textgrid.StartCopy">
    <step class="textgrid.GatherObjectUris">
      <step class="textgrid.ModifyAndCreate">
        <step class="textgrid.CopyComplete" />
      </step>
    </step>
  </step>
</policy>

CopyStart

Just marks the copy process started. 

GatherObjectUris

Gets all referenced URIs from the objects in the given URI list (recursively, from all aggregations/editions/collections), and adds every URI to the PublishResponse object list.

ModifyAndCreate

Retrieves the object for every URI in the PublishResponse object list from TG-crud, rewrites aggregation lists and other included URIs, and creates a new TextGrid object in the given project.

CopyComplete

CopyComplete is called just to ensure the operation has finished successfully, and to report to logfiles, etc. 
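The recursive gathering done by GatherObjectUris can be illustrated with a small sketch. It does not reproduce the actual koLibRI module: instead of reading aggregations from TG-crud it works on a hypothetical in-memory mapping from aggregation URI to aggregated URIs, and all URIs and names are made up for the example.

```python
def gather_object_uris(start_uris, aggregates):
    """Collect the start URIs and everything they aggregate, recursively.

    start_uris: list of URIs handed to the copy request.
    aggregates: hypothetical dict mapping an aggregation/edition/collection
                URI to the list of URIs it contains.
    Every URI is returned once, parents before their children.
    """
    gathered = []
    visited = set()
    stack = list(reversed(start_uris))
    while stack:
        uri = stack.pop()
        if uri in visited:
            continue  # already gathered, also guards against cycles
        visited.add(uri)
        gathered.append(uri)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(aggregates.get(uri, [])))
    return gathered


if __name__ == "__main__":
    example = {
        "textgrid:ed1": ["textgrid:agg1", "textgrid:doc3"],
        "textgrid:agg1": ["textgrid:doc1", "textgrid:doc2"],
    }
    print(gather_object_uris(["textgrid:ed1"], example))
```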

config.xml

config.xml, or in this case config__tgpublish.xml, is the main koLibRI configuration file. The ProcessStarters and all the ActionModules are configured here, along with some global settings. The TextGrid-specific ProcessStarters and ActionModules are all described inside the config file itself (see the description tags), so we just refer to the file:

https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/changes/kolibri-tgpublish-service/config/config__tgpublish.xml

XML config file tags that are not documented there are not used (and not needed) by TG-publish. You can look up their meaning in the koLibRI documentation or in the main koLibRI configuration file:

https://projects.gwdg.de/projects/kolibri/repository/changes/config/config.xml?rev=2.4.1

Logging

At the moment the koLibRI logs to stdout (see e.g. the Tomcat's catalina.out log) and also to a logfile located at the configured logfile location (see config.xml). The logfile's name has got a timestamp in it, and a new file is created every time the koLibRI Workflow Tool is started.

TextGridLab GUI

How TG-publish and TG-copy are used from within the TextGridLab is described here:

TG-publish GUI and TG-copy GUI (Context Menu of the Navigator View)

TG-publish HTTP API

API Overview

For each method the request parameters, the response and examples are listed.

getStatus
  • Request parameters: TextGrid URI or TG-copy UUID
  • Response: publishResponse XML
  • Examples:
    • /tgpublish/textgrid:1234/status
    • /tgpublish/tgcopy:fc1023a3-739e-4727-b4b2-64711650dfc1/status

getVersion
  • Request parameters: none
  • Response: Version String
  • Example: /tgpublish/version

listWorldReadables
  • Request parameters: none
  • Response: worldReadableMimetypes XML
  • Example: /tgpublish/listWorldReadables

publish
  • Request parameters: sid, log, ignoreWarnings, dryRun
  • Response: HTTP 201 OK
  • Example: /tgpublish/textgrid:1234/publish?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE

publishSandboxData
  • Request parameters: sid, log, ignoreWarnings, dryRun
  • Response: HTTP 201 OK
  • Example: /tgpublish/textgrid:1234/publishSandboxData?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE

publishWorldReadable
  • Request parameters: sid, log, ignoreWarnings, dryRun
  • Response: HTTP 201 OK
  • Example: /tgpublish/textgrid:1234/publishWorldReadable?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE

copy
  • Request parameters: sid, log, uri (repeatable), projectId, newRevision
  • Response: TG-copy UUID
  • Example: /tgpublish/copy?sid=SID&log=&uri=textgrid:1234&uri=textgrid:2345&projectId=TGPR&newRevision=FALSE

getStatus()

Gets the status of a certain publish process using the specific TextGrid URI or the TG-copy UUID, e.g.

http://textgridlab.org/1.0/tgpublish/textgrid:1234/status

Optional parameters are

  • the RBAC Session ID, and
  • the log service parameter.

The TextGrid URI or the TG-copy UUID is given in the RESTful URL path, see example above. You get back the publish response as an XML file, which is described below.

getVersion()

You can get the current version of the TG-publish installation using

http://textgridlab.org/1.0/tgpublish/version

listWorldReadables()

You can get the list of object types (given as worldReadableMimetypes XML) that can be published as single WorldReadable objects using

http://textgridlab.org/1.0/tgpublish/listWorldReadables
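According to the XML schema further below, the worldReadableMimetypes response is a list of regexp elements. The following sketch shows how a client could use such a list, under the assumptions that each regexp has to match the complete MIME type of an object and that the response carries no XML namespace. The sample response and its MIME types are hypothetical and only serve the illustration; the function names are not part of TG-publish.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of a listWorldReadables response (not real service output).
SAMPLE_RESPONSE = rb"""<worldReadableMimetypes>
  <regexp>text/xsd\+xml</regexp>
  <regexp>text/xml\+xslt</regexp>
</worldReadableMimetypes>"""


def world_readable_patterns(xml_bytes):
    """Compile every non-empty regexp element of the response."""
    root = ET.fromstring(xml_bytes)
    return [re.compile(item.text) for item in root.findall("regexp") if item.text]


def is_world_readable(mimetype, patterns):
    """True if any pattern matches the whole MIME type (an assumption)."""
    return any(pattern.fullmatch(mimetype) for pattern in patterns)


if __name__ == "__main__":
    patterns = world_readable_patterns(SAMPLE_RESPONSE)
    print(is_world_readable("text/xsd+xml", patterns))
    print(is_world_readable("image/jpeg", patterns))
```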

publish()

Publishes data from the TG-lab. It needs a TextGrid URI as resource and takes the following input parameters:

  • sid: The RBAC session ID (mandatory), as used in the TG-lab.
  • log: The log parameter (optional, default is ""), as used in the TG-lab, too.
  • ignoreWarnings: The ignoreWarnings trigger (default is FALSE). If set to TRUE, any warnings are ignored and objects containing warnings are published anyway.
  • dryRun: The dryRun trigger (default is TRUE). If set to TRUE, nothing is really published, the objects are just checked.

The object to publish is given as the URI path parameter, as follows:

http://textgridlab.org/1.0/tgpublish/textgrid:1234/publish?sid=MB896JHG&log=&ignoreWarnings=TRUE&dryRun=FALSE

The publish call only returns an HTTP 201 OK, which indicates that the service has queued the request and will process it as soon as possible. If an HTTP error is returned, the publishing process was NOT started and will not be! You can check the outcome of your request by calling getStatus().
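A request URL like the one above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name publish_url and its keyword arguments are hypothetical, and the sketch only builds the URL because the HTTP method to use is not stated on this page. The defaults mirror the documented ones (dryRun TRUE, ignoreWarnings FALSE).

```python
from urllib.parse import quote, urlencode

# Base URL as used in the examples on this page.
BASE = "http://textgridlab.org/1.0/tgpublish"


def publish_url(uri, sid, log="", ignore_warnings=False, dry_run=True, base=BASE):
    """Build the URL of a publish() call for one TextGrid URI."""
    query = urlencode({
        "sid": sid,
        "log": log,
        "ignoreWarnings": str(ignore_warnings).upper(),  # TRUE / FALSE
        "dryRun": str(dry_run).upper(),
    })
    # Keep the colon of "textgrid:1234" readable in the path.
    return f"{base}/{quote(uri, safe=':')}/publish?{query}"


if __name__ == "__main__":
    # A dry run first, then the real request with warnings ignored.
    print(publish_url("textgrid:1234", "MB896JHG"))
    print(publish_url("textgrid:1234", "MB896JHG", ignore_warnings=True, dry_run=False))
```

Running a dry run first and inspecting the status response before sending the real request is one cautious way to use these two parameters.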

publishSandboxData()

Finally publishes sandbox data to the TextGridRep. At the moment it is only used by the Import Tool External (koLibRI). Usage of the sandbox within the TG-lab is planned.

publishWorldReadable()

Publishes an object as WorldReadable. The same parameters as for publish() apply.

copy()

Copies the objects belonging to the given URIs and all their aggregated objects (from aggregations/editions/collections) to the project belonging to the given project ID. You can copy objects from the TG-lab (non-public repository) and/or the public repository to your own projects, e.g. for further processing.

The following parameters apply:

  • sid: The RBAC session ID (mandatory), as used in the TG-lab.
  • log: The log parameter (optional, default is ""), as used in the TG-lab, too.
  • uri: The URIs of the objects to copy. These can be URIs of aggregations/collections/editions (in that case all subelements are gathered automatically) or URIs of single objects. This parameter is repeatable, so you can build a list of objects and copy them all into one of your projects.
  • projectId: The ID of the project to copy the objects to.
  • newRevision: A boolean parameter that decides whether new revisions are created from all copied objects (TRUE) or not (FALSE).

Here is an HTTP example of the copy() method:

http://textgridlab.org/1.0/tgpublish/copy?sid=MB896JHG&log=&uri=textgrid:1234&uri=textgrid:2345&projectId=TGPR&newRevision=FALSE

The copy call returns a session UUID as a string. This UUID (a) indicates that the service has queued the request and will process it as soon as possible, and (b) can be used to query the status of your copy process via getStatus(). If an HTTP error is returned, the copying process was NOT started and will not be!

To track the provenance of the copied objects, an isDerivedFrom relation is set in the TextGrid metadata of each copied object.
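Because the uri parameter is repeatable, the query string of a copy() request is best built from a list of key/value pairs rather than from a dictionary. The sketch below shows this with the Python standard library; copy_url and its arguments are illustrative names, and only the URL is built since the HTTP method is not stated on this page.

```python
from urllib.parse import urlencode

# Base URL as used in the examples on this page.
BASE = "http://textgridlab.org/1.0/tgpublish"


def copy_url(uris, sid, project_id, new_revision=False, log="", base=BASE):
    """Build the URL of a copy() call for one or more TextGrid URIs."""
    params = [("sid", sid), ("log", log)]
    # One "uri" pair per object, so the parameter repeats in the query.
    params += [("uri", uri) for uri in uris]
    params += [("projectId", project_id),
               ("newRevision", str(new_revision).upper())]
    # safe=":" keeps "textgrid:1234" unescaped, as in the example above.
    return f"{base}/copy?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(copy_url(["textgrid:1234", "textgrid:2345"], "MB896JHG", "TGPR"))
```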

The TG-publish WADL Service Description

The WADL description of the TG-publish service can be found here: Service WADL.

TG-publish response

XML Schema

The embedded PublishResponse XML schema can be found here:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="unqualified">
    <xs:element name="publishResponse" type="publishResponse" />
    <xs:element name="worldReadableMimetypes" type="worldReadableMimetypes" />
    <xs:complexType name="publishResponse">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="PublishObject"
                type="publishObject" />
            <xs:element minOccurs="0" name="PublishStatus" type="publishStatus" />
        </xs:sequence>
        <xs:attribute name="dryRun" type="xs:boolean" />
    </xs:complexType>
    <xs:complexType name="publishObject">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="error" type="publishError" />
            <xs:element minOccurs="0" name="referencedUris" type="referencedUris" />
            <xs:element maxOccurs="unbounded" minOccurs="0" name="warning" type="publishWarning" />
        </xs:sequence>
        <xs:attribute name="uri" type="xs:string" />
        <xs:attribute name="pid" type="xs:string" />
        <xs:attribute name="status" type="statusType" />
    </xs:complexType>
    <xs:complexType name="publishError">
        <xs:sequence>
            <xs:element minOccurs="0" name="message" type="xs:string" />
            <xs:element minOccurs="0" name="type" type="errorType" />
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="referencedUris">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="uri" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="publishWarning">
        <xs:sequence>
            <xs:element minOccurs="0" name="message" type="xs:string" />
            <xs:element minOccurs="0" name="type" type="warningType" />
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="publishStatus">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="module" type="module" />
        </xs:sequence>
        <xs:attribute name="progress" type="xs:int" use="required" />
        <xs:attribute name="processStatus" type="processStatusType" />
        <xs:attribute name="activeModule" type="xs:string" />
    </xs:complexType>
    <xs:complexType name="module">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="message" type="xs:string" />
        </xs:sequence>
        <xs:attribute name="name" type="xs:string" />
        <xs:attribute name="status" type="statusType" />
    </xs:complexType>
    <xs:complexType name="worldReadableMimetypes">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="regexp" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
    <xs:simpleType name="statusType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="OK" />
            <xs:enumeration value="WARNING" />
            <xs:enumeration value="ERROR" />
            <xs:enumeration value="NOT_YET_PUBLISHED" />
            <xs:enumeration value="ALREADY_PUBLISHED" />
        </xs:restriction>
    </xs:simpleType>
    <xs:simpleType name="errorType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="NOT_SPECIFIED" />
            <xs:enumeration value="AUTH" />
            <xs:enumeration value="WRONG_CONTENT_TYPE" />
            <xs:enumeration value="NO_PUBLISH_RIGHT" />
            <xs:enumeration value="PID_GENERATION_FAILED" />
            <xs:enumeration value="MISSING_METADATA" />
            <xs:enumeration value="ALREADY_PUBLISHED" />
            <xs:enumeration value="METADATA_WARNINGS_EXIST" />
            <xs:enumeration value="SERVER_ERROR" />
        </xs:restriction>
    </xs:simpleType>
    <xs:simpleType name="warningType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="NOT_SPECIFIED" />
            <xs:enumeration value="CHECK_REFERENCES" />
        </xs:restriction>
    </xs:simpleType>
    <xs:simpleType name="processStatusType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="FINISHED" />
            <xs:enumeration value="RUNNING" />
            <xs:enumeration value="FAILED" />
            <xs:enumeration value="NOT_QUEUED" />
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

publishResponse Root Element

The publishResponse element has the attribute dryRun, the element PublishStatus (of type publishStatus) and a list of PublishObject elements (of type publishObject).

dryRun

The dryRun attribute tells whether a real publish process is or was running, or just a publish test without any consequences.

publishStatus

This element tells about the current status of the overall publishing process.

  • module: Possibly a list of processed modules (not yet used by TG-publish).
  • progress: The overall progress in percent. Each module can implement updating this progress value. Use it for progress bars and the like.
  • activeModule: Holds the currently active module, that is, the qualified name of the corresponding koLibRI Java ActionModule.
  • processStatus: Describes the current overall process status and can contain the following values:
    • FINISHED: This TG-publish process is finished, either with failure or success
    • RUNNING: The current process is still running, just try again later...
    • FAILED: The TG-publish process has failed. See object list for reasons.
    • NOT_QUEUED: The given TextGrid URI is not yet queued in the TG-publish queue. Either you have not started a publish process with this URI, or the process has not yet begun.
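The processStatus values above lend themselves to a simple polling loop. The following sketch assumes that the status response follows the schema shown earlier (a PublishStatus element without XML namespace) and that polling should continue while the status is RUNNING or NOT_QUEUED. The fetch argument is a hypothetical callable that returns the bytes of one getStatus() response, so the loop stays independent of any particular HTTP library.

```python
import time
import xml.etree.ElementTree as ET


def read_status(xml_bytes):
    """Return (processStatus, progress) of a publishResponse document."""
    status = ET.fromstring(xml_bytes).find("PublishStatus")
    if status is None:
        return None, None
    return status.get("processStatus"), int(status.get("progress"))


def wait_until_done(fetch, interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll until the process is FINISHED or FAILED.

    fetch: callable returning the raw XML of one status request (assumption).
    Raises TimeoutError if neither final state is seen after max_polls polls.
    """
    for _ in range(max_polls):
        process_status, progress = read_status(fetch())
        if process_status in ("FINISHED", "FAILED"):
            return process_status, progress
        sleep(interval)  # RUNNING or NOT_QUEUED: try again later
    raise TimeoutError("publish process did not reach a final state")


if __name__ == "__main__":
    # Hypothetical responses standing in for two successive status requests.
    fake = iter([
        b'<publishResponse><PublishStatus progress="40" processStatus="RUNNING"/></publishResponse>',
        b'<publishResponse><PublishStatus progress="100" processStatus="FINISHED"/></publishResponse>',
    ])
    print(wait_until_done(lambda: next(fake), interval=0))
```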

publishObject

Each object in the list holds the following data. The root Collection or Edition is also contained, normally as the first element of the list.

  • uri: The URI of the object to publish.
  • pid: The generated PID of the object.
  • status: This status describes the status of the single object, and can contain:
    • OK: Object was processed successfully.
    • WARNING: Object was processed successfully, but there were warnings (it can still be published using the ignoreWarnings flag).
    • ERROR: An error occurred while publishing.
    • NOT_YET_PUBLISHED: The object was not yet published (according to TG-auth*).
    • ALREADY_PUBLISHED: The object already is published (according to TG-auth*).
  • publishError: Errors that occurred while processing this object.
    • AUTH
    • WRONG_CONTENT_TYPE
    • NO_PUBLISH_RIGHT
    • PID_GENERATION_FAILED
    • MISSING_METADATA
    • ALREADY_PUBLISHED
    • METADATA_WARNINGS_EXIST
    • SERVER_ERROR
    • NOT_SPECIFIED
  • referencedUris: List of URIs that are referenced from within this object and NOT contained in the current Edition or Collection.
  • publishWarning: Warnings that occurred while processing this object.
    • CHECK_REFERENCES: Check the referenced URIs.
    • NOT_SPECIFIED: Other warning.
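To show how the object list can be evaluated, the following sketch collects all errors and warnings from a publishResponse document. Element and attribute names are taken from the XML schema above (PublishObject with error and warning children, each with message and type). The sample document, its URIs and its messages are hypothetical, and the function name problems is only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical publishResponse, shaped after the XML schema above.
SAMPLE_RESPONSE = b"""<publishResponse dryRun="true">
  <PublishObject uri="textgrid:1234" status="OK"/>
  <PublishObject uri="textgrid:2345" status="WARNING">
    <referencedUris><uri>textgrid:9999</uri></referencedUris>
    <warning>
      <message>Referenced object is not part of the edition</message>
      <type>CHECK_REFERENCES</type>
    </warning>
  </PublishObject>
  <PublishObject uri="textgrid:3456" status="ERROR">
    <error><message>Title is missing</message><type>MISSING_METADATA</type></error>
  </PublishObject>
  <PublishStatus progress="100" processStatus="FAILED"/>
</publishResponse>"""


def problems(xml_bytes):
    """Return (uri, kind, type, message) for every error and warning."""
    found = []
    for obj in ET.fromstring(xml_bytes).findall("PublishObject"):
        for kind in ("error", "warning"):
            for item in obj.findall(kind):
                found.append((
                    obj.get("uri"),
                    kind,
                    item.findtext("type", "NOT_SPECIFIED"),
                    item.findtext("message", ""),
                ))
    return found


if __name__ == "__main__":
    for uri, kind, problem_type, message in problems(SAMPLE_RESPONSE):
        print(f"{uri}: {kind} {problem_type}: {message}")
```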