The TextGrid TG-publish Service Documentation has moved HERE!
Overview
The TG-publish service shall be used to publish documents and/or TextGrid editions/collections created within the TextGridLab to the public TextGrid Repository. Documents that are published in the TextGrid Repository...
- ...can be publicly searched from within the TextGridLab TG-search GUI.
- ...are publicly available in the TextGrid Repository Browser.
- ...have a Handle persistent identifier (PID), so that they stay accessible.
- ...are stored in a secure and robust repository environment that provides retrieval for the long term.
There are three main ways to publish data:
- Valid TextGrid Editions or Collections that have already been created in the TextGridLab can be published. To learn more about the mandatory TextGrid metadata, please have a look at the TextGrid Metadata Schema documentation.
- Some technical data from the TextGridLab can be published as WorldReadable objects. These objects and their metadata are tested differently for validity, they are not added to the public search index, and they get no PIDs.
- Objects that should be checked for validity after importing can first be published into the TextGrid Sandbox. After importing, the objects can be checked in the TextGrid Repository Sandbox Browser, and then either finally published using a special policy of TG-publish, or deleted again if they are invalid, incomplete, or erroneous. So far only the final publishing step is covered by TG-publish, hence you need write access to the RDF database. Publishing and deletion are currently only possible via the Import Tool External (koLibRI). The Sandbox is also not yet used from within the TextGridLab.
Furthermore, the TG-publish service can be used to copy TextGrid objects from (a) the TextGridLab and (b) the TextGridRep to your own TextGrid projects in the TG-lab.
TG-publish uses the workflow library koLibRI (kopal Library of Retrieval and Ingest).
GIT Repository
The GIT repository can be found here:
https://projects.gwdg.de/projects/kolibri/repository
The tags related to this TG-publish documentation can be found here:
https://projects.gwdg.de/projects/kolibri/repository/show/kolibri-tgpublish-service?rev=2.4.1
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-client
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-api
Version
This page applies to the TG-publish service as of version
3.4.0-2014-07-25T13:27-'TextGrid 3.0'
To check the currently used TG-publish version simply try:
http://textgridlab.org/1.0/tgpublish/version
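The version string shown above can be split into its parts for logging or compatibility checks. The following is only a sketch: the layout (version number, build timestamp, quoted release label, separated by hyphens) is inferred from the single example on this page and is an assumption, and `parse_version` and `ServiceVersion` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class ServiceVersion(NamedTuple):
    number: str   # e.g. "3.4.0"
    built: str    # e.g. "2014-07-25T13:27"
    release: str  # e.g. "TextGrid 3.0"


# Assumed layout: <number>-<build timestamp>-'<release label>'
_VERSION_RE = re.compile(
    r"^(\d+(?:\.\d+)*)-(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})-'(.*)'$"
)


def parse_version(text: str) -> Optional[ServiceVersion]:
    """Split a TG-publish version string; return None if the layout differs."""
    match = _VERSION_RE.match(text.strip())
    return ServiceVersion(*match.groups()) if match else None
```

Returning `None` for an unexpected layout keeps callers from relying on a format that the service does not guarantee.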
More Documentation
For a more detailed understanding of the koLibRI workflow tool and its modules, see the koLibRI documentation:
koLibRI Documentation Version 2.0 (German – work in progress)
Installation
In general, a complete TG-publish service installation consists of (or needs):
- The TG-publish service itself.
- A second TG-crud instance for public data (TG-crud public).
- A second TG-search instance for public data including a second eXist and Sesame instance (TG-search public).
- The public TextGrid Repository Browser (including the TextGrid Sandbox).
Downloading TG-publish
TG-publish can be downloaded as a WAR file from the DARIAH Nexus Repository:
http://dev.dariah.eu/nexus/service/local/artifact/maven/redirect?r=releases&g=de.langzeitarchivierung.kolibri&a=kolibri-tgpublish-service&v=3.4.0&e=war
After downloading, rename the file to tgpublish.war (just for simplicity) and deploy it directly into Tomcat's webapps folder. You can of course also try a newer version, if one exists.
If something needs to be added to TG-publish (e.g. more koLibRI modules which extend its functionality), you can find the current koLibRI sources here:
https://projects.gwdg.de/projects/kolibri/repository
Just do a
mvn package
in /kolibri/kolibri-tgpublish-service/. You may have to check out all modules needed to build TG-publish; their names all begin with kolibri-tgpublish. Do NOT change the API and client, because the TG-lab depends on them! To test the TG-publish service (locally, but against non-local services) you can also run
mvn tomcat:run
Configuration
Two configuration files that come with the tgpublish koLibRI module need to be taken care of: config.xml and policies.xml. The latter defines the TG-publish workflow and leads the publishing process through the different ActionModules, which are normally processed one by one. These modules can share information using a custom data object. For more detailed information, please see the work-in-progress koLibRI documentation version 2.0 referred to above.
TG-publish is pre-configured to keep its configuration files in /etc/textgrid/tgpublish/ and to log to /var/log/textgrid/tgpublish/.
So please create the appropriate folders and then copy the config files from
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/show/kolibri-tgpublish-service/config
into the config folder. Do not forget to set the permissions and owner settings so that Tomcat can write to it!
More config files may be needed from
https://projects.gwdg.de/projects/kolibri/repository/revisions/master/show/config
Copy at least the following files, and add more if file-not-found errors occur:
dias_formatregistry.xml jhove.conf …
policies.xml
There are four policies to be used with the TextGridLab at the moment:
- TGPublish
- TGPublishWorldReadable
- TGPublishSandboxData, and
- TGCopy
All these workflows (or policies) are described in the policies.xml file, which defines the order in which the koLibRI ActionModules are processed. Each of the four workflows is started as a ProcessStarter with the current configuration (see below).
- TGPublish is used from within the TG-lab via the Publish Perspective.
- TGPublishWorldReadable is also used from within the TG-lab, but only applies to single technical files such as XML Schema documents, XSLT stylesheets, TextGrid workflow documents, etc. The file types that can be published as WorldReadable can be checked by requesting the worldReadable list.
- TGPublishSandboxData is used to finally publish objects that were imported to the TextGrid Sandbox, e.g. from the Import Tool External (koLibRI).
- TGCopy is used from within the TG-copy workflow to copy TextGrid objects from either the public or the non-public repository to your own projects for further processing. This includes rewriting URIs and similar adjustments.
These four policies are explained in detail below.
TGPublish
<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublish">
  <step class="textgrid.PublishStart">
    <step class="textgrid.PublishCheckEdition">
      <step class="textgrid.CheckIsPublic">
        <step class="textgrid.CheckReferences">
          <step class="textgrid.GetPids">
            <step class="textgrid.ModifyAndUpdate">
              <step class="textgrid.CopyElasticSearchIndex">
                <step class="textgrid.CopyRelationData">
                  <step class="textgrid.MoveToStaticGridStorage">
                    <step class="textgrid.UpdateTgauth">
                      <step class="textgrid.PublishComplete" />
                    </step>
                  </step>
                </step>
              </step>
            </step>
          </step>
        </step>
      </step>
    </step>
  </step>
</policy>
PublishStart
Just marks the publish process started.
PublishCheckEdition
Checks for correct Edition/Collection Metadata.
CheckIsPublic
Checks for already published objects.
CheckReferences
Checks if some objects that are referred to, are NOT contained in the current Edition/Collection to publish.
GetPids
Fetches PIDs for every object's TextGrid URI using the GWDG Handle Service.
ModifyAndUpdate
Rewrites several URIs to PIDs, modifies all necessary object metadata and/or data, and finally updates everything by calling TG-crud#UPDATEMETADATA or TG-crud#UPDATE.
CopyElasticSearchIndex
Copies the search index to the public index database.
CopyRelationData
Copies the RDF relation data to the public RDF database.
MoveToStaticGridStorage
Moves all metadata and data to the public storage location.
UpdateTgauth
Just updates TG-auth by calling the method TG-auth#PUBLISH.
PublishComplete
PublishComplete is called just to ensure the operation has finished successfully, and to report to logfiles, etc.
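Because each step element is nested inside the previous one, the processing order of a policy can be read off by following the nesting downwards. The sketch below flattens the TGPublish policy shown above into an ordered list of module class names. It uses only the Python standard library; `policy_steps` is a hypothetical helper name, and the sketch assumes that every step has at most one nested step, as in the policies on this page.

```python
import xml.etree.ElementTree as ET

# The TGPublish policy from this page (XML declaration omitted for brevity).
POLICY_XML = """<policy name="TGPublish">
  <step class="textgrid.PublishStart">
    <step class="textgrid.PublishCheckEdition">
      <step class="textgrid.CheckIsPublic">
        <step class="textgrid.CheckReferences">
          <step class="textgrid.GetPids">
            <step class="textgrid.ModifyAndUpdate">
              <step class="textgrid.CopyElasticSearchIndex">
                <step class="textgrid.CopyRelationData">
                  <step class="textgrid.MoveToStaticGridStorage">
                    <step class="textgrid.UpdateTgauth">
                      <step class="textgrid.PublishComplete" />
                    </step>
                  </step>
                </step>
              </step>
            </step>
          </step>
        </step>
      </step>
    </step>
  </step>
</policy>"""


def policy_steps(xml_text):
    """Return (policy name, ordered list of step class names)."""
    root = ET.fromstring(xml_text)
    steps = []
    node = root.find("step")
    # Follow the single chain of nested <step> elements.
    while node is not None:
        steps.append(node.get("class"))
        node = node.find("step")
    return root.get("name"), steps


name, steps = policy_steps(POLICY_XML)
for position, step in enumerate(steps, start=1):
    print(f"{name} step {position}: {step}")
```

The same function can be applied to the other policies in policies.xml to compare their step sequences.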
TGPublishWorldReadable
<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublishWorldReadable">
  <step class="textgrid.PublishStart">
    <step class="textgrid.PublishCheckWorldReadable">
      <step class="textgrid.ModifyAndUpdate">
        <step class="textgrid.MoveToStaticGridStorage">
          <step class="textgrid.UpdateTgauth">
            <step class="textgrid.PublishComplete" />
          </step>
        </step>
      </step>
    </step>
  </step>
</policy>
PublishStart
See above.
PublishCheckWorldReadable
Checks for correct WorldReadable Metadata.
ModifyAndUpdate
Rewrites several URIs to PIDs, modifies all necessary object metadata and/or data, and finally updates everything by calling TG-crud#UPDATEMETADATA or TG-crud#UPDATE.
MoveToStaticGridStorage
Moves all metadata and data to the public storage location.
UpdateTgauth
See above.
PublishComplete
See above.
TGPublishSandboxData
<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGPublishSandboxData">
  <step class="textgrid.PublishStart">
    <step class="textgrid.UpdateTgauth">
      <step class="textgrid.ReleaseNearlyPublishedRelation">
        <step class="textgrid.PublishComplete" />
      </step>
    </step>
  </step>
</policy>
PublishStart
See above.
UpdateTgauth
See above.
ReleaseNearlyPublishedRelation
Releases the nearlyPublished relation in the TG-rep's Sesame triple store and the ElasticSearch database – so that the object is viewable and searchable in the public TextGrid repository browser and TextGridLab search GUI.
PublishComplete
See above.
TGCopy
<?xml version="1.0" encoding="UTF-8"?>
<policy name="TGCopy">
  <step class="textgrid.StartCopy">
    <step class="textgrid.GatherObjectUris">
      <step class="textgrid.ModifyAndCreate">
        <step class="textgrid.CopyComplete" />
      </step>
    </step>
  </step>
</policy>
CopyStart
Just marks the copy process started.
GatherObjectUris
Gets all referenced URIs from the objects in the given URI list (recursively from all aggregations/editions/collections), and adds every URI to the PublishResponse object list.
ModifyAndCreate
Retrieves every URI in the PublishResponse object list from TG-crud, rewrites aggregation lists and other included URIs, and creates a new TextGrid object in the given project.
CopyComplete
CopyComplete is called just to ensure the operation has finished successfully, and to report to logfiles, etc.
config.xml
config.xml, or in this case config__tgpublish.xml, is the main koLibRI configuration file. The ProcessStarters and all ActionModules are configured here, along with some global settings. The TextGrid-specific ProcessStarters and ActionModules are all described inside the config file (see the description tags), so we just refer to the file itself:
https://projects.gwdg.de/projects/kolibri/repository/revisions/2.4.1/changes/kolibri-tgpublish-service/config/config__tgpublish.xml
XML config file tags that are not documented there are neither used nor needed by TG-publish. You can look up their meaning in the koLibRI documentation or in the main koLibRI configuration file:
https://projects.gwdg.de/projects/kolibri/repository/changes/config/config.xml?rev=2.4.1
Logging
At the moment koLibRI logs to stdout (see e.g. Tomcat's catalina.out log) and also to a logfile at the configured logfile location (see config.xml). The logfile's name contains a timestamp, and a new file is created every time the koLibRI Workflow Tool is started.
TextGridLab GUI
How TG-publish and TG-copy are used from within the TextGridLab is described here:
TG-publish GUI and TG-copy GUI (Context Menu of the Navigator View)
TG-publish HTTP API
API Overview
Method | Request Parameters | Response | Examples |
---|---|---|---|
getStatus | TextGrid URI or TG-copy UUID | publishResponse XML | /tgpublish/textgrid:1234/status /tgpublish/tgcopy:fc1023a3-739e-4727-b4b2-64711650dfc1/status |
getVersion | none | Version String | /tgpublish/version |
listWorldReadables | none | worldReadableMimetypes XML | /tgpublish/listWorldReadables |
publish | sid, log, ignoreWarnings, dryRun | HTTP 201 OK | /tgpublish/textgrid:1234/publish?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE |
publishSandboxData | sid, log, ignoreWarnings, dryRun | HTTP 201 OK | /tgpublish/textgrid:1234/publishSandboxData?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE |
publishWorldReadable | sid, log, ignoreWarnings, dryRun | HTTP 201 OK | /tgpublish/textgrid:1234/publishWorldReadable?sid=SID&log=&ignoreWarnings=TRUE&dryRun=FALSE |
copy | sid, log, uri (repeatable), projectId, newRevision | TG-copy UUID | /tgpublish/copy?sid=SID&log=&uri=textgrid:1234&uri=textgrid:2345&projectId=TGPR&newRevision=FALSE |
getStatus()
Gets the status of a certain publish process using the specific TextGrid URI or the TG-copy UUID, e.g.
http://textgridlab.org/1.0/tgpublish/textgrid:1234/status
Optional parameters are
- the RBAC Session ID, and
- the log service parameter.
The TextGrid URI or the TG-copy UUID is given in the RESTful URL path, see example above. You get back the publish response as an XML file, which is described below.
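As a small illustration, the status URL for either kind of identifier can be assembled as follows. This is a sketch only: `status_url` is a hypothetical helper name, the base URL is taken from the examples on this page, and the `tgcopy:` prefix for TG-copy UUIDs follows the example in the API overview table.

```python
from urllib.parse import quote

# Base URL as used in the examples on this page.
BASE_URL = "http://textgridlab.org/1.0/tgpublish"


def status_url(identifier, base_url=BASE_URL):
    """Build the getStatus URL for a TextGrid URI or a tgcopy:<UUID> identifier."""
    # Keep the colon readable in the path, as in the documented examples.
    return f"{base_url}/{quote(identifier, safe=':')}/status"


print(status_url("textgrid:1234"))
print(status_url("tgcopy:fc1023a3-739e-4727-b4b2-64711650dfc1"))
```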
getVersion()
You can get the current version of the TG-publish installation using
http://textgridlab.org/1.0/tgpublish/version
listWorldReadables()
You can get the list of TextGrid object types that can be published as single WorldReadable objects using
http://textgridlab.org/1.0/tgpublish/listWorldReadables
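According to the response schema further down this page, the result is a worldReadableMimetypes element containing a list of regexp elements. The sketch below checks a MIME type against such a list. The sample XML and its two patterns are invented for illustration, and matching each pattern against the whole MIME type is an assumption, since this page does not say how the service applies the expressions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical response body; real patterns come from the service.
SAMPLE_RESPONSE = r"""<worldReadableMimetypes>
  <regexp>text/xsd\+xml</regexp>
  <regexp>text/xml\+xslt</regexp>
</worldReadableMimetypes>"""


def world_readable_patterns(xml_text):
    """Extract the regular expressions from a worldReadableMimetypes document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall("regexp") if element.text]


def is_world_readable(mimetype, patterns):
    """Assumption: a pattern must match the complete MIME type."""
    return any(re.fullmatch(pattern, mimetype) for pattern in patterns)


patterns = world_readable_patterns(SAMPLE_RESPONSE)
print(is_world_readable("text/xsd+xml", patterns))
print(is_world_readable("image/jpeg", patterns))
```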
publish()
Publishes data from the TG-lab. It needs a TextGrid URI as resource and takes the following input parameters:
- sid: The RBAC session ID (mandatory), as used in the TG-lab.
- log: The log parameter (optional, default is ""), also as used in the TG-lab.
- ignoreWarnings: The ignoreWarnings trigger (default is FALSE). If set to TRUE, any warnings are ignored and objects with warnings are published anyway.
- dryRun: The dryRun trigger (default is TRUE). If set to TRUE, nothing is actually published; the objects are only checked.
The object to publish is given as the URI path parameter, as follows:
http://textgridlab.org/1.0/tgpublish/textgrid:1234/publish?sid=MB896JHG&log=&ignoreWarnings=TRUE&dryRun=FALSE
The publish call only returns an HTTP 201 OK, which indicates that the service has queued the request and will process it as soon as possible. If an HTTP error is returned, the publishing process was NOT started and will not be! You can check the outcome of your request by calling getStatus().
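The request URL for publish() can be assembled from the parameters described above. The following sketch stops at URL construction, because this page does not state which HTTP verb the service expects. `build_publish_url` is a hypothetical helper name, and the defaults mirror the documented ones (ignoreWarnings FALSE, dryRun TRUE), so a call without explicit flags only performs a check.

```python
from urllib.parse import quote, urlencode

# Base URL as used in the examples on this page.
BASE_URL = "http://textgridlab.org/1.0/tgpublish"


def _flag(value):
    """Render a Python bool the way the examples on this page write it."""
    return "TRUE" if value else "FALSE"


def build_publish_url(textgrid_uri, sid, log="", ignore_warnings=False,
                      dry_run=True, base_url=BASE_URL):
    """Build the publish URL; defaults follow the documented parameter defaults."""
    query = urlencode({
        "sid": sid,
        "log": log,
        "ignoreWarnings": _flag(ignore_warnings),
        "dryRun": _flag(dry_run),
    })
    # Keep the colon of the TextGrid URI readable in the path.
    return f"{base_url}/{quote(textgrid_uri, safe=':')}/publish?{query}"


# Dry run first (nothing is published), then the real request URL.
print(build_publish_url("textgrid:1234", "MB896JHG"))
print(build_publish_url("textgrid:1234", "MB896JHG",
                        ignore_warnings=True, dry_run=False))
```

Keeping dry_run=True as the default in the helper makes an accidental real publication less likely.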
publishSandboxData()
Performs the final publication of sandbox data to the TextGridRep. At the moment it is only used by the Import Tool External (koLibRI). Usage of the sandbox within the TG-lab is planned.
publishWorldReadable()
Publishes an object as WorldReadable. The same parameters as for publish() apply.
copy()
Copies the objects belonging to the given URIs and all their aggregated objects (from aggregations/editions/collections) to the project belonging to the given project ID. You can copy objects from the TG-lab (non-public repository) and/or the public repository to your own projects, e.g. for further processing.
The following parameters apply:
- sid: The RBAC session ID (mandatory), as used in the TG-lab.
- log: The log parameter (optional, default is ""), also as used in the TG-lab.
- uri: The URIs of the objects to copy. These can be URIs of aggregations/collections/editions (in that case all subelements are gathered automatically) or URIs of single objects. This parameter is repeatable, so you can build a list of objects and copy them all into one of your projects.
- projectId: The ID of the project to copy the objects to.
- newRevision: A boolean parameter that controls whether new revisions are created from all copied objects (TRUE) or not (FALSE).
Here is an HTTP example of the copy() method:
http://textgridlab.org/1.0/tgpublish/copy?sid=MB896JHG&log=&uri=textgrid:1234&uri=textgrid:2345&projectId=TGPR&newRevision=FALSE
The copy call returns a session UUID as a string. This (a) indicates that the service has queued the request and will process it as soon as possible, and (b) lets you query your copy status via the getStatus() call. If an HTTP error is returned, the copying process was NOT started and will not be!
To track the provenance of the copied objects, an isDerivedFrom relation is set in the TextGrid metadata of each copied object.
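Since the uri parameter is repeatable, the query string for copy() is best built from a list of key/value pairs instead of a dictionary. The sketch below reproduces the example URL above. `build_copy_url` is a hypothetical helper name, and, as with publish(), the HTTP verb is not stated on this page, so only the URL is constructed.

```python
from urllib.parse import urlencode

# Base URL as used in the examples on this page.
BASE_URL = "http://textgridlab.org/1.0/tgpublish"


def build_copy_url(uris, project_id, sid, log="", new_revision=False,
                   base_url=BASE_URL):
    """Build the copy URL with one uri parameter per object to copy."""
    pairs = [("sid", sid), ("log", log)]
    pairs += [("uri", uri) for uri in uris]  # repeatable parameter
    pairs += [
        ("projectId", project_id),
        ("newRevision", "TRUE" if new_revision else "FALSE"),
    ]
    # safe=":" keeps TextGrid URIs readable, as in the documented example.
    return f"{base_url}/copy?{urlencode(pairs, safe=':')}"


print(build_copy_url(["textgrid:1234", "textgrid:2345"], "TGPR", "MB896JHG"))
```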
The TG-publish WADL Service Description
The WADL description of the TG-publish service can be found here as Service WADL.
TG-publish response
XML Schema
The embedded PublishResponse XML schema can be found here:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="unqualified">
  <xs:element name="publishResponse" type="publishResponse" />
  <xs:element name="worldReadableMimetypes" type="worldReadableMimetypes" />
  <xs:complexType name="publishResponse">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="PublishObject" type="publishObject" />
      <xs:element minOccurs="0" name="PublishStatus" type="publishStatus" />
    </xs:sequence>
    <xs:attribute name="dryRun" type="xs:boolean" />
  </xs:complexType>
  <xs:complexType name="publishObject">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="error" type="publishError" />
      <xs:element minOccurs="0" name="referencedUris" type="referencedUris" />
      <xs:element maxOccurs="unbounded" minOccurs="0" name="warning" type="publishWarning" />
    </xs:sequence>
    <xs:attribute name="uri" type="xs:string" />
    <xs:attribute name="pid" type="xs:string" />
    <xs:attribute name="status" type="statusType" />
  </xs:complexType>
  <xs:complexType name="publishError">
    <xs:sequence>
      <xs:element minOccurs="0" name="message" type="xs:string" />
      <xs:element minOccurs="0" name="type" type="errorType" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="referencedUris">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="uri" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="publishWarning">
    <xs:sequence>
      <xs:element minOccurs="0" name="message" type="xs:string" />
      <xs:element minOccurs="0" name="type" type="warningType" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="publishStatus">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="module" type="module" />
    </xs:sequence>
    <xs:attribute name="progress" type="xs:int" use="required" />
    <xs:attribute name="processStatus" type="processStatusType" />
    <xs:attribute name="activeModule" type="xs:string" />
  </xs:complexType>
  <xs:complexType name="module">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="message" type="xs:string" />
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" />
    <xs:attribute name="status" type="statusType" />
  </xs:complexType>
  <xs:complexType name="worldReadableMimetypes">
    <xs:sequence>
      <xs:element maxOccurs="unbounded" minOccurs="0" name="regexp" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="statusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="OK" />
      <xs:enumeration value="WARNING" />
      <xs:enumeration value="ERROR" />
      <xs:enumeration value="NOT_YET_PUBLISHED" />
      <xs:enumeration value="ALREADY_PUBLISHED" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="errorType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="NOT_SPECIFIED" />
      <xs:enumeration value="AUTH" />
      <xs:enumeration value="WRONG_CONTENT_TYPE" />
      <xs:enumeration value="NO_PUBLISH_RIGHT" />
      <xs:enumeration value="PID_GENERATION_FAILED" />
      <xs:enumeration value="MISSING_METADATA" />
      <xs:enumeration value="ALREADY_PUBLISHED" />
      <xs:enumeration value="METADATA_WARNINGS_EXIST" />
      <xs:enumeration value="SERVER_ERROR" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="warningType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="NOT_SPECIFIED" />
      <xs:enumeration value="CHECK_REFERENCES" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="processStatusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="FINISHED" />
      <xs:enumeration value="RUNNING" />
      <xs:enumeration value="FAILED" />
      <xs:enumeration value="NOT_QUEUED" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
publishResponse Root Element
The publishResponse element has the attribute dryRun, the element publishStatus, and a list of publishObject elements.
dryRun
The attribute dryRun indicates whether a real publish process is or was running, or only a publish test without any consequences.
publishStatus
This element tells about the current status of the overall publishing process.
- module: Possibly a list of processed modules (not yet used by TG-publish).
- progress: The overall progress in percent. Each module can implement updating this progress value. Please use it for progress bars or other nice things.
- activeModule: Holds the currently active module, that is, the qualified name of the corresponding koLibRI Java ActionModule.
- processStatus: Describes the current overall process status, and can contain the following values:
  - FINISHED: This TG-publish process is finished, either with failure or success.
  - RUNNING: The current process is still running, just try again later...
  - FAILED: The TG-publish process has failed. See the object list for reasons.
  - NOT_QUEUED: The given TextGrid URI is not yet queued in the TG-publish queue. Either you have not started a publish process with this URI, or the process has not yet begun.
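A client typically polls the status until the process reaches a final state. The sketch below shows one way to do this under stated assumptions: FINISHED and FAILED are treated as final, RUNNING and NOT_QUEUED as reasons to try again, and the actual HTTP request is left to a caller-supplied function, so the loop itself does not depend on any particular HTTP library. `wait_for_result` and its parameters are hypothetical names.

```python
import time

# Assumption: these two values mean the process will not change any more.
FINAL_STATES = {"FINISHED", "FAILED"}


def wait_for_result(fetch_process_status, interval_seconds=5.0,
                    max_attempts=60, sleep=time.sleep):
    """Poll until a final processStatus is seen or max_attempts is reached.

    fetch_process_status is a caller-supplied function that performs the
    status request and returns the processStatus value as a string.
    Returns the last status seen.
    """
    status = "NOT_QUEUED"
    for attempt in range(max_attempts):
        status = fetch_process_status()
        if status in FINAL_STATES:
            return status
        # RUNNING or NOT_QUEUED: wait and try again later.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return status


# Usage with a stand-in for the real status request:
fake_statuses = iter(["NOT_QUEUED", "RUNNING", "FINISHED"])
print(wait_for_result(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

Limiting the number of attempts matters because NOT_QUEUED is also returned when no process was ever started for the given URI.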
publishObject
Each of the objects in the list holds the following data. The root Collection or Edition is also contained, normally as the first element of the list.
- uri: The URI of the object to publish.
- pid: The generated PID of the object.
- status: Describes the status of the single object, and can contain:
  - OK: Object was processed successfully.
  - WARNING: Object was processed successfully, but there were warnings (it can still be published using the ignoreWarnings flag).
  - ERROR: An error occurred while publishing.
  - NOT_YET_PUBLISHED: The object was not yet published (according to TG-auth*).
  - ALREADY_PUBLISHED: The object is already published (according to TG-auth*).
- publishError: Errors that occurred while processing this object. The error type is one of:
  - AUTH
  - WRONG_CONTENT_TYPE
  - NO_PUBLISH_RIGHT
  - PID_GENERATION_FAILED
  - MISSING_METADATA
  - ALREADY_PUBLISHED
  - METADATA_WARNINGS_EXIST
  - SERVER_ERROR
  - NOT_SPECIFIED
- referencedUris: List of URIs that are referenced from within this object and are NOT contained in the current Edition or Collection.
- publishWarning: Warnings that occurred while processing this object. The warning type is one of:
  - CHECK_REFERENCES: Check the referenced URIs.
  - NOT_SPECIFIED: Other warning.
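To make the structure concrete, the following sketch reads a publish response and collects the overall status plus all errors and warnings per object. The sample document is invented for illustration and only follows the element and attribute names of the schema above (PublishObject, PublishStatus, error, warning, referencedUris). Real responses may differ, for example by carrying an XML namespace, which this sketch does not handle. `summarize_response` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, built from the schema on this page.
SAMPLE_RESPONSE = """<publishResponse dryRun="true">
  <PublishObject uri="textgrid:1234" status="OK" />
  <PublishObject uri="textgrid:2345" status="WARNING">
    <referencedUris><uri>textgrid:9999</uri></referencedUris>
    <warning>
      <message>Referenced object is not part of this edition</message>
      <type>CHECK_REFERENCES</type>
    </warning>
  </PublishObject>
  <PublishObject uri="textgrid:3456" status="ERROR">
    <error>
      <message>Mandatory metadata is missing</message>
      <type>MISSING_METADATA</type>
    </error>
  </PublishObject>
  <PublishStatus progress="100" processStatus="FAILED" />
</publishResponse>"""


def summarize_response(xml_text):
    """Return overall status information and a list of per-object problems."""
    root = ET.fromstring(xml_text)
    status = root.find("PublishStatus")
    problems = []
    for obj in root.findall("PublishObject"):
        for kind in ("error", "warning"):
            for entry in obj.findall(kind):
                problems.append({
                    "uri": obj.get("uri"),
                    "kind": kind,
                    "type": entry.findtext("type"),
                    "message": entry.findtext("message"),
                })
    return {
        # xs:boolean allows "true"/"false" as well as "1"/"0".
        "dry_run": root.get("dryRun") in ("true", "1"),
        "process_status": status.get("processStatus") if status is not None else None,
        "progress": int(status.get("progress")) if status is not None else None,
        "problems": problems,
    }


summary = summarize_response(SAMPLE_RESPONSE)
print(summary["process_status"], summary["progress"])
for problem in summary["problems"]:
    print(problem["uri"], problem["kind"], problem["type"], problem["message"])
```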