java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.corpora.DocumentImpl
Represents the commonalities between all sorts of documents.
The DocumentImpl class implements the Document interface. The DocumentContentImpl class models the textual or audio-visual materials which are the source and content of Documents. The AnnotationSetImpl class supplies annotations on Documents.
Abbreviations: D = Document, DC = DocumentContent, AS = AnnotationSet.
We add an edit method to each of these classes; for DC and AS the methods are package private; D has the public method.
void edit(Long start, Long end, DocumentContent replacement) throws InvalidOffsetException;
D receives edit requests and forwards them to DC and AS. On DC, this method makes a change to the content - e.g. replacing a String range from start to end with replacement. (Deletions are catered for by having replacement = null.) D then calls AS.edit on each of its annotation sets.
On AS, edit calls replacement.size() (i.e. DC.size()) to determine the length of the replacement (0 for null). Annotations that terminate (start or end) inside the altered or deleted range are treated as invalid; annotations that terminate after the range have their offsets adjusted.
A note re. AS and annotations: annotations no longer have offsets as in the old model, they now have nodes, and nodes have offsets.
To implement AS.edit, we have several indices:
HashMap annotsByStartNode, annotsByEndNode;
which map node ids to annotations;
RBTreeMap nodesByOffset;
which maps offset to Nodes.
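As a rough illustration of the indices described above, the following sketch keeps two hash maps keyed by node id and a sorted map keyed by offset. The types IndexSketch, Node and Annotation and the methods nodeAt and add are hypothetical stand-ins, not GATE's actual classes. java.util.TreeMap (a red-black tree) is assumed in place of RBTreeMap, and each node id is assumed to map to a list of annotations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are GATE's real classes.
final class IndexSketch {

    // A node has an id and an offset into the document content.
    static final class Node {
        final int id;
        long offset;

        Node(int id, long offset) {
            this.id = id;
            this.offset = offset;
        }
    }

    // An annotation points at nodes instead of storing offsets itself.
    static final class Annotation {
        final int id;
        final String type;
        final Node start;
        final Node end;

        Annotation(int id, String type, Node start, Node end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Assumption: a node id maps to a list, since several annotations may share a node.
    final Map<Integer, List<Annotation>> annotsByStartNode = new HashMap<>();
    final Map<Integer, List<Annotation>> annotsByEndNode = new HashMap<>();
    // Assumption: TreeMap stands in for the RBTreeMap named in the text.
    final TreeMap<Long, Node> nodesByOffset = new TreeMap<>();

    private int nextNodeId = 0;
    private int nextAnnotationId = 0;

    // Returns the node at the given offset, creating it if none exists yet.
    Node nodeAt(long offset) {
        Node node = nodesByOffset.get(offset);
        if (node == null) {
            node = new Node(nextNodeId++, offset);
            nodesByOffset.put(offset, node);
        }
        return node;
    }

    // Creates an annotation and registers it in both node-id indices.
    Annotation add(String type, long start, long end) {
        Annotation a = new Annotation(nextAnnotationId++, type, nodeAt(start), nodeAt(end));
        annotsByStartNode.computeIfAbsent(a.start.id, k -> new ArrayList<>()).add(a);
        annotsByEndNode.computeIfAbsent(a.end.id, k -> new ArrayList<>()).add(a);
        return a;
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.add("Token", 0, 5);
        index.add("Sentence", 0, 12);
        System.out.println(index.nodesByOffset.keySet()); // prints [0, 5, 12]
    }
}
```

In this sketch two annotations starting at the same offset share one node, so moving that node's offset moves both annotations at once.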
When we get an edit request, we traverse that part of the nodesByOffset tree representing the altered or deleted range of the DC. For each node found, we delete any annotations that terminate on the node, and then delete the node itself. We then traverse the rest of the tree, changing the offset on all remaining nodes by:
    newOffset =
      oldOffset -
      (
        (end - start) -                                     // size of mod
        ( (replacement == null) ? 0 : replacement.size() )  // size of repl
      );

Note that we use the same convention as e.g. java.lang.String: start offsets are inclusive; end offsets are exclusive. I.e. for string "abcd" range 1-3 = "bc". Examples, for a node with offset 4:

    edit(1, 3, "BC");   newOffset = 4 - ( (3 - 1) - 2 ) = 4
    edit(1, 3, null);   newOffset = 4 - ( (3 - 1) - 0 ) = 2
    edit(1, 3, "BBCC"); newOffset = 4 - ( (3 - 1) - 4 ) = 6
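The formula and the traversal described above can be sketched in a few lines. This is a minimal illustration, not GATE's implementation: the class EditSketch and its methods are hypothetical, nodes are reduced to string labels, java.util.TreeMap stands in for RBTreeMap, and the replacement is represented only by its size (0 for null). It also assumes that a node counts as inside the edited range when its offset lies in the half-open interval from start (inclusive) to end (exclusive).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the offset arithmetic; not GATE's implementation.
final class EditSketch {

    // The shift formula from the text; replacementSize is 0 for a null replacement.
    static long newOffset(long oldOffset, long start, long end, long replacementSize) {
        return oldOffset - ((end - start) - replacementSize);
    }

    // Returns a new offset index after an edit of the range [start, end).
    // Assumption: nodes with start <= offset < end are dropped,
    // nodes before start are kept, nodes at or after end are shifted.
    static TreeMap<Long, String> edit(TreeMap<Long, String> nodesByOffset,
                                      long start, long end, long replacementSize) {
        TreeMap<Long, String> result = new TreeMap<>();
        // Nodes strictly before the edited range keep their offsets.
        result.putAll(nodesByOffset.headMap(start, false));
        // Nodes at or after the end of the range are shifted by the formula.
        for (Map.Entry<Long, String> e : nodesByOffset.tailMap(end, true).entrySet()) {
            result.put(newOffset(e.getKey(), start, end, replacementSize), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(newOffset(4, 1, 3, 2)); // prints 4, as for edit(1, 3, "BC")
        System.out.println(newOffset(4, 1, 3, 0)); // prints 2, as for edit(1, 3, null)
        System.out.println(newOffset(4, 1, 3, 4)); // prints 6, as for edit(1, 3, "BBCC")

        TreeMap<Long, String> nodes = new TreeMap<>();
        nodes.put(0L, "n0");
        nodes.put(1L, "n1");
        nodes.put(2L, "n2");
        nodes.put(4L, "n4");
        System.out.println(edit(nodes, 1, 3, 0)); // prints {0=n0, 2=n4}
    }
}
```

Building a fresh map avoids changing keys while iterating. A real index would also need to delete the annotations attached to each dropped node, which this sketch leaves out.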
Inner Class Summary | |
(package private) class |
DocumentImpl.AnnotationComparator
Inner class needed to compare annotations |
Field Summary | |
private int |
ASC
Constant used in the inner class AnnotationComparator to order annotations ascending |
private Boolean |
collectRepositioningInfo
If you set this flag to true the repositioning information for the document will be kept in the document feature. |
protected DocumentContent |
content
The content of the document |
private Annotation |
crossedOverAnnotation
Holds the most recent crossed-over annotation found during a format-preserving export, i.e. by the toXml(annotations) method. |
private static boolean |
DEBUG
Debug flag |
protected AnnotationSet |
defaultAnnots
The default annotation set |
private int |
DESC
Constant used in the inner class AnnotationComparator to order annotations descending |
private int |
DOC_SIZE_MULTIPLICATION_FACTOR
This field is used when creating StringBuffers for toXml() methods. |
private Vector |
documentListeners
|
protected String |
encoding
The encoding of the source of the document content |
private static Map |
entitiesMap
A map initialized in init() containing entities that need to be replaced in strings |
private Vector |
gateListeners
|
private boolean |
isRootTag
This field indicates if an annotation is the doc's root tag. |
protected Boolean |
markupAware
Is the document markup-aware? |
protected Map |
namedAnnotSets
Named sets of annotations |
protected int |
nextAnnotationId
The id of the next new annotation |
protected int |
nextNodeId
The id of the next new node |
private int |
ORDER_ON_ANNOT_ID
Constant used in the inner class AnnotationComparator to order annotations on their ID |
private int |
ORDER_ON_END_OFFSET
Constant used in the inner class AnnotationComparator to order annotations on their end offset |
private int |
ORDER_ON_START_OFFSET
Constant used in the inner class AnnotationComparator to order annotations on their start offset |
private Boolean |
preserveOriginalContent
If you set this flag to true the original content of the document will be kept in the document feature. |
(package private) static long |
serialVersionUID
Freeze the serialization UID. |
protected URL |
sourceUrl
The source URL |
protected Long |
sourceUrlEndOffset
The end of the range that the content comes from at the source URL (or null if none). |
protected Long |
sourceUrlStartOffset
The start of the range that the content comes from at the source URL (or null if none). |
private String |
stringContent
A property of the document that will be set when the user wants to create the document from a string, as opposed to from a URL. |
Fields inherited from class gate.creole.AbstractLanguageResource |
dataStore, lrPersistentId |
Fields inherited from class gate.creole.AbstractResource |
name |
Fields inherited from class gate.util.AbstractFeatureBearer |
features |
Constructor Summary | |
DocumentImpl()
Default construction. |
Method Summary | |
(package private) static void |
|
void |
addDocumentListener(DocumentListener l)
Adds a DocumentListener to this document. |
private int |
analyseAmpCodding(String content)
Computes the size of an ampersand-coded sequence when the semicolon is not present. |
private String |
annotationSetToXml(AnnotationSet anAnnotationSet)
This method saves an AnnotationSet as XML. |
private void |
buildEntityMapFromString(String aScanString,
TreeMap aMapToFill)
This method takes aScanString and searches for those chars from entitiesMap that appear in the string. |
protected boolean |
check(Object a,
Object b)
Check: test 2 objects for equality |
void |
cleanup()
Clear all the data members of the object. |
private void |
collectInformationForAmpCodding(String content,
RepositioningInfo info,
boolean shouldCorrectCR)
Collects information for the substitution of "&xxx;" with "y". Position information for some Unicode and &-coded symbols could not be collected during parsing. |
private void |
collectInformationForWS(String content,
RepositioningInfo info)
The HTML parser replaces runs of multiple whitespace characters (WS) with a single WS. |
int |
compareTo(Object o)
Ordering based on URL.toString() and the URL offsets (if any) |
private void |
correctRepositioningForCRLFInXML(String content,
RepositioningInfo info)
Correct repositioning information for substitution of "\r\n" with "\n" |
void |
datastoreClosed(CreoleEvent e)
Called when a DataStore has been closed |
void |
datastoreCreated(CreoleEvent e)
Called when a DataStore has been created |
void |
datastoreOpened(CreoleEvent e)
Called when a DataStore has been opened |
void |
edit(Long start,
Long end,
DocumentContent replacement)
Propagate edit changes to the document content and annotations. |
boolean |
equals(Object other)
Equals |
private String |
featuresToXml(FeatureMap aFeatureMap)
This method saves a FeatureMap as XML elements. |
private StringBuffer |
filterNonXmlChars(StringBuffer aStrBuffer)
Filters out any non-XML characters (see http://www.w3c.org/TR/2000/REC-xml-20001006#charsets). All non-XML chars are replaced with 0x20 (the space char), which ensures that there won't be any problems the next time the document is loaded. |
protected void |
fireAnnotationSetAdded(DocumentEvent e)
|
protected void |
fireAnnotationSetRemoved(DocumentEvent e)
|
AnnotationSet |
getAnnotations()
Get the default set of annotations. |
AnnotationSet |
getAnnotations(String name)
Get a named set of annotations. |
private List |
getAnnotationsForOffset(Set aDumpAnnotSet,
Long offset)
Returns the annotations at the given offset, ordered so that they can be serialized from left to right. |
Boolean |
getCollectRepositioningInfo()
Get whether repositioning information is collected and preserved for the Document. |
DocumentContent |
getContent()
The content of the document: a String for text; MPEG for video; etc. |
String |
getEncoding()
Get the encoding of the document content source |
Boolean |
getMarkupAware()
Get the markup awareness status of the Document. |
Map |
getNamedAnnotationSets()
Returns a map with the named annotation sets. |
Integer |
getNextAnnotationId()
Generate and return the next annotation ID |
Integer |
getNextNodeId()
Generate and return the next node ID |
protected String |
getOrderingString()
Utility method to produce a string for comparison in ordering. |
Boolean |
getPreserveOriginalContent()
Get the preserve-original-content status of the Document. |
URL |
getSourceUrl()
Documents are identified by URLs |
Long |
getSourceUrlEndOffset()
Documents may be packed within files; in this case an optional pair of offsets refers to the location of the document. |
Long[] |
getSourceUrlOffsets()
Documents may be packed within files; in this case an optional pair of offsets refers to the location of the document. |
Long |
getSourceUrlStartOffset()
Documents may be packed within files; in this case an optional pair of offsets refers to the location of the document. |
String |
getStringContent()
The stringContent of a document is a property of the document that will be set when the user wants to create the document from a string, as opposed to from a URL. |
int |
hashCode()
Hash code |
private boolean |
hasOriginalContentFeatures()
Return true only if the document has features for original content and repositioning information. |
Resource |
init()
Initialise this resource, and return it. |
private boolean |
insertsSafety(AnnotationSet aTargetAnnotSet,
Annotation aSourceAnnotation)
This method verifies whether aSourceAnnotation can be safely inserted into aTargetAnnotSet. |
private boolean |
isNumber(char ch,
boolean hex)
Check for numeric range. |
boolean |
isValidOffset(Long offset)
Check that an offset is valid, i.e. |
boolean |
isValidOffsetRange(Long start,
Long end)
Check that both start and end are valid offsets and that they constitute a valid offset range, i.e. |
static boolean |
isXmlChar(char ch)
Decides whether a char is a valid XML character or not |
void |
removeAnnotationSet(String name)
Removes one of the named annotation sets. |
void |
removeDocumentListener(DocumentListener l)
Removes one of the previously registered document listeners. |
private StringBuffer |
replaceCharsWithEntities(String anInputString)
Replaces every char of anInputString that also appears in entitiesMap with its corresponding entity |
void |
resourceAdopted(DatastoreEvent evt)
Called by a datastore when a new resource has been adopted |
void |
resourceDeleted(DatastoreEvent evt)
Called by a datastore when a resource has been deleted |
void |
resourceLoaded(CreoleEvent e)
Called when a new Resource has been loaded into the system |
void |
resourceRenamed(Resource resource,
String oldName,
String newName)
Called when the creole register has renamed a resource. |
void |
resourceUnloaded(CreoleEvent e)
Called when a Resource has been removed from the system |
void |
resourceWritten(DatastoreEvent evt)
Called by a datastore when a resource has been written into the datastore |
private String |
saveAnnotationSetAsXml(AnnotationSet aDumpAnnotSet,
boolean includeFeatures)
This method saves all the annotations from aDumpAnnotSet and combines them with the document content. |
private String |
saveAnnotationSetAsXmlInOrig(Set aSourceAnnotationSet,
boolean includeFeatures)
This method saves all the annotations from aDumpAnnotSet and combines them with the original document content, if preserved as feature. |
void |
setCollectRepositioningInfo(Boolean b)
Allow/disallow collecting of repositioning information. |
void |
setContent(DocumentContent content)
Set method for the document content |
void |
setDataStore(DataStore dataStore)
Set the data store that this LR lives in. |
void |
setEncoding(String encoding)
Set the encoding of the document content source |
void |
setLRPersistenceId(Object lrID)
Sets the persistence id of this LR. |
void |
setMarkupAware(Boolean newMarkupAware)
Make the document markup-aware. |
void |
setNextAnnotationId(int aNextAnnotationId)
Sets the nextAnnotationId |
void |
setPreserveOriginalContent(Boolean b)
Allow/disallow preserving of the original document content. |
void |
setSourceUrl(URL sourceUrl)
Set method for the document's URL |
void |
setSourceUrlEndOffset(Long sourceUrlEndOffset)
Documents may be packed within files; in this case an optional pair of offsets refers to the location of the document. |
void |
setSourceUrlStartOffset(Long sourceUrlStartOffset)
Documents may be packed within files; in this case an optional pair of offsets refers to the location of the document. |
void |
setStringContent(String stringContent)
The stringContent of a document is a property of the document that will be set when the user wants to create the document from a string, as opposed to from a URL. |
private String |
textWithNodes(String aText)
This method creates Node XML elements and inserts them at the corresponding offset inside the text. |
String |
toString()
String representation |
String |
toXml()
Returns a GateXml document, a custom XML format for which there is a reader inside GATE called gate.xml.GateFormatXmlHandler. |
String |
toXml(Set aSourceAnnotationSet)
Returns an XML document aiming to preserve the original markup (the original markup will be in the same place and format as it was before processing the document) and include (if possible) the annotations specified in aSourceAnnotationSet. |
String |
toXml(Set aSourceAnnotationSet,
boolean includeFeatures)
Returns an XML document aiming to preserve the original markup (the original markup will be in the same place and format as it was before processing the document) and include (if possible) the annotations specified in aSourceAnnotationSet. |
private String |
writeEmptyTag(Annotation annot)
|
private String |
writeEmptyTag(Annotation annot,
boolean includeNamespace)
Returns a string representing an empty tag based on the input annot |
private String |
writeEndTag(Annotation annot)
Returns a string representing an end tag based on the input annot |
private String |
writeFeatures(FeatureMap feat,
boolean includeNamespace)
Returns a string representing a FeatureMap serialized as XML attributes |
private String |
writeStartTag(Annotation annot,
boolean includeFeatures)
|
private String |
writeStartTag(Annotation annot,
boolean includeFeatures,
boolean includeNamespace)
Returns a string representing a start tag based on the input annot |
Methods inherited from class gate.creole.AbstractLanguageResource |
getDataStore, getLRPersistenceId, getParent, isModified, setParent, sync |
Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class gate.util.AbstractFeatureBearer |
getFeatures, setFeatures |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, registerNatives, wait, wait, wait |
Methods inherited from interface gate.LanguageResource |
getDataStore, getLRPersistenceId, getParent, isModified, setParent, sync |
Methods inherited from interface gate.Resource |
getParameterValue, setParameterValue, setParameterValues |
Methods inherited from interface gate.util.FeatureBearer |
getFeatures, setFeatures |
Methods inherited from interface gate.util.NameBearer |
getName, setName |
Field Detail |
private static final boolean DEBUG
private Boolean preserveOriginalContent
private Boolean collectRepositioningInfo
private Annotation crossedOverAnnotation
protected int nextAnnotationId
protected int nextNodeId
protected URL sourceUrl
protected DocumentContent content
protected String encoding
private boolean isRootTag
private final int DOC_SIZE_MULTIPLICATION_FACTOR
private final int ORDER_ON_START_OFFSET
private final int ORDER_ON_END_OFFSET
private final int ORDER_ON_ANNOT_ID
private final int ASC
private final int DESC
private static Map entitiesMap
protected Long sourceUrlStartOffset
protected Long sourceUrlEndOffset
protected AnnotationSet defaultAnnots
protected Map namedAnnotSets
private String stringContent
protected Boolean markupAware
static final long serialVersionUID
private transient Vector documentListeners
private transient Vector gateListeners
Constructor Detail |
public DocumentImpl()
Method Detail |
public Resource init() throws ResourceInstantiationException
init
in interface Resource
init
in class AbstractResource
private void correctRepositioningForCRLFInXML(String content, RepositioningInfo info)
private void collectInformationForAmpCodding(String content, RepositioningInfo info, boolean shouldCorrectCR)
If the shouldCorrectCR flag is true, the correction for CRLF substitution is performed.
private int analyseAmpCodding(String content)
private boolean isNumber(char ch, boolean hex)
private void collectInformationForWS(String content, RepositioningInfo info)
(ch <= ' ').
public void cleanup()
cleanup
in interface Resource
cleanup
in class AbstractLanguageResource
public URL getSourceUrl()
getSourceUrl
in interface Document
public void setSourceUrl(URL sourceUrl)
setSourceUrl
in interface Document
public Long[] getSourceUrlOffsets()
getSourceUrlOffsets
in interface Document
public void setPreserveOriginalContent(Boolean b)
setPreserveOriginalContent
in interface Document
public Boolean getPreserveOriginalContent()
getPreserveOriginalContent
in interface Document
public void setCollectRepositioningInfo(Boolean b)
setCollectRepositioningInfo
in interface Document
public Boolean getCollectRepositioningInfo()
getCollectRepositioningInfo
in interface Document
public Long getSourceUrlStartOffset()
getSourceUrlStartOffset
in interface Document
public void setSourceUrlStartOffset(Long sourceUrlStartOffset)
setSourceUrlStartOffset
in interface Document
public Long getSourceUrlEndOffset()
getSourceUrlEndOffset
in interface Document
public void setSourceUrlEndOffset(Long sourceUrlEndOffset)
setSourceUrlEndOffset
in interface Document
public DocumentContent getContent()
getContent
in interface Document
public void setContent(DocumentContent content)
setContent
in interface Document
public String getEncoding()
public void setEncoding(String encoding)
public AnnotationSet getAnnotations()
getAnnotations
in interface Document
public AnnotationSet getAnnotations(String name)
getAnnotations
in interface Document
public void setMarkupAware(Boolean newMarkupAware)
setMarkupAware
in interface Document
b - markup awareness status.
public Boolean getMarkupAware()
getMarkupAware
in interface Document
public String toXml(Set aSourceAnnotationSet)
toXml
in interface Document
public String toXml(Set aSourceAnnotationSet, boolean includeFeatures)
toXml
in interface Document
aSourceAnnotationSet - is an annotation set containing all the annotations that will be combined with the original markup set. If the param is null it will only dump the original markups.
includeFeatures - is a boolean that controls whether the annotation features should be included or not. If false, only the annotation type is included in the tag.
private boolean insertsSafety(AnnotationSet aTargetAnnotSet, Annotation aSourceAnnotation)
aTargetAnnotSet - the annotation set to include the aSourceAnnotation
aSourceAnnotation - the annotation to be inserted into the aTargetAnnotSet
private String saveAnnotationSetAsXml(AnnotationSet aDumpAnnotSet, boolean includeFeatures)
aDumpAnnotSet - is a GATE annotation set prepared to be used on the raw text from document content. If aDumpAnnotSet is null then an empty string will be returned.
includeFeatures - is a boolean, which controls whether the annotation features and gate ID are included or not.
private boolean hasOriginalContentFeatures()
private String saveAnnotationSetAsXmlInOrig(Set aSourceAnnotationSet, boolean includeFeatures)
aDumpAnnotationSet - is a GATE annotation set prepared to be used on the raw text from document content. If aDumpAnnotSet is null then an empty string will be returned.
includeFeatures - is a boolean, which controls whether the annotation features and gate ID are included or not.
private List getAnnotationsForOffset(Set aDumpAnnotSet, Long offset)
aDumpAnnotSet - is a set containing all annotations that will be dumped.
offset - represents the offset at which the annotation must start AND/OR end.
private String writeStartTag(Annotation annot, boolean includeFeatures)
private String writeStartTag(Annotation annot, boolean includeFeatures, boolean includeNamespace)
private void buildEntityMapFromString(String aScanString, TreeMap aMapToFill)
private String writeEmptyTag(Annotation annot)
private String writeEmptyTag(Annotation annot, boolean includeNamespace)
private String writeEndTag(Annotation annot)
private String writeFeatures(FeatureMap feat, boolean includeNamespace)
public String toXml()
toXml
in interface Document
private StringBuffer filterNonXmlChars(StringBuffer aStrBuffer)
aStrBuffer - represents the input String that is filtered. If the aStrBuffer is null then an empty string will be returned.
public static boolean isXmlChar(char ch)
ch - the char to be tested
private String featuresToXml(FeatureMap aFeatureMap)
private StringBuffer replaceCharsWithEntities(String anInputString)
anInputString - the string analyzed. If it is null then returns the empty string
private String textWithNodes(String aText)
aText - The text representing the document's plain text.
private String annotationSetToXml(AnnotationSet anAnnotationSet)
anAnnotationSet - The annotation set that has to be saved as XML.
public Map getNamedAnnotationSets()
null if no named annotation set exists.
getNamedAnnotationSets
in interface Document
public void removeAnnotationSet(String name)
removeAnnotationSet
in interface Document
name - the name of the annotation set to be removed
public void edit(Long start, Long end, DocumentContent replacement) throws InvalidOffsetException
edit
in interface Document
public boolean isValidOffset(Long offset)
public boolean isValidOffsetRange(Long start, Long end)
public void setNextAnnotationId(int aNextAnnotationId)
public Integer getNextAnnotationId()
public Integer getNextNodeId()
public int compareTo(Object o) throws ClassCastException
compareTo
in interface Comparable
protected String getOrderingString()
static void()
public String getStringContent()
public void setStringContent(String stringContent)
protected boolean check(Object a, Object b)
public boolean equals(Object other)
equals
in class Object
public int hashCode()
hashCode
in class Object
public String toString()
toString
in class Object
public void removeDocumentListener(DocumentListener l)
Document
removeDocumentListener
in interface Document
public void addDocumentListener(DocumentListener l)
Document
Adds a DocumentListener to this document. All the registered listeners will be notified of changes that occur to the document.
addDocumentListener
in interface Document
protected void fireAnnotationSetAdded(DocumentEvent e)
protected void fireAnnotationSetRemoved(DocumentEvent e)
public void resourceLoaded(CreoleEvent e)
CreoleListener
Called when a new Resource has been loaded into the system
resourceLoaded
in interface CreoleListener
public void resourceUnloaded(CreoleEvent e)
CreoleListener
Called when a Resource has been removed from the system
resourceUnloaded
in interface CreoleListener
public void datastoreOpened(CreoleEvent e)
CreoleListener
Called when a DataStore has been opened
datastoreOpened
in interface CreoleListener
public void datastoreCreated(CreoleEvent e)
CreoleListener
Called when a DataStore has been created
datastoreCreated
in interface CreoleListener
public void resourceRenamed(Resource resource, String oldName, String newName)
CreoleListener
resourceRenamed
in interface CreoleListener
public void datastoreClosed(CreoleEvent e)
CreoleListener
Called when a DataStore has been closed
datastoreClosed
in interface CreoleListener
public void setLRPersistenceId(Object lrID)
LanguageResource
setLRPersistenceId
in interface LanguageResource
setLRPersistenceId
in class AbstractLanguageResource
public void resourceAdopted(DatastoreEvent evt)
DatastoreListener
resourceAdopted
in interface DatastoreListener
public void resourceDeleted(DatastoreEvent evt)
DatastoreListener
resourceDeleted
in interface DatastoreListener
public void resourceWritten(DatastoreEvent evt)
DatastoreListener
resourceWritten
in interface DatastoreListener
public void setDataStore(DataStore dataStore) throws PersistenceException
LanguageResource
setDataStore
in interface LanguageResource
setDataStore
in class AbstractLanguageResource