Monolithic documents and Open Content
21-March-2006
I have been working for the last three days on an evaluation of the European Social Fund Objective 3 programme. Doesn't sound too exciting but I am quite enjoying it. I am doing a case study based on East Wales and am looking at two projects run by the Welsh Development Agency dealing with Equal Opportunities in SMEs. I will write a future post on these projects.
Most of the work involves desk research - ploughing through endless reports and documents and academic studies to pick out relevant information appertaining to the specific focus for the evaluation I am undertaking (Support to Structures and Systems since you ask!).
Seven years ago I recall working on an early XML editor which was supposed to help in this by allowing 'deep markup' of the contents of reports. The editor was not bad but it was too time-consuming and difficult to use. Seven years on and we have still not cracked it. Most research documents and reports are available on the web but in Word or PDF format. Google sort of lets you search inside them but the results are pretty poor.
At the end of the day we still have to pay people like me to find the reports, download and print them, and skim-read them to find the relevant information, which is then re-entered on my computer in the form of a new (and unsearchable) document.
There has to be a better way - this should be part of the developing Open Content agenda.
Technorati Tags: Open content
1 Replies (comments)
1 Too true! Which is why large documents are being overtaken by user-generated "microcontent"
A very good point, Graham. It's too easy to get ground down by the intransigence of this problem, especially as exhibited within large organisations. Something should be done.
Without open structural internal markup, these documents deny users the ability to deep-link into the documents, to deep-tag and deep-discuss them. Thus they fail to become part of the discourse and discoverability that makes the web so powerful.
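As a rough illustration of what such internal markup buys, here is a minimal Python sketch. It is not taken from any KnowNet or Plone product; the names `slugify`, `HeadingAnchors` and `deep_links` and the example URL are made up for the purpose. It assumes a report is available as HTML with ordinary heading elements, and it derives a stable fragment identifier for each heading, so that a reader could link to, tag or discuss one section rather than the whole file. The links only resolve if the publishing tool also writes the same ids into the document.

```python
import re
from html.parser import HTMLParser


def slugify(text):
    """Turn heading text into a URL-safe fragment identifier."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


class HeadingAnchors(HTMLParser):
    """Collect (fragment id, heading text) pairs from h1-h6 elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_heading = False
        self._buffer = []
        self._seen = {}  # slug -> how many times it has occurred so far

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._in_heading and re.fullmatch(r"h[1-6]", tag):
            text = " ".join("".join(self._buffer).split())
            base = slugify(text)
            count = self._seen.get(base, 0) + 1
            self._seen[base] = count
            # Repeated headings get a numeric suffix so every id is unique.
            frag = base if count == 1 else f"{base}-{count}"
            self.anchors.append((frag, text))
            self._in_heading = False


def deep_links(base_url, html):
    """Return one deep link per heading found in the HTML."""
    parser = HeadingAnchors()
    parser.feed(html)
    parser.close()
    return [f"{base_url}#{frag}" for frag, _ in parser.anchors]


if __name__ == "__main__":
    sample = "<h1>Evaluation Report</h1><p>...</p><h2>Key Findings</h2>"
    for link in deep_links("http://example.org/report", sample):
        print(link)
```

A Word or PDF file offers nothing comparable for a third party to point at, which is the gap being described here.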
We should not underestimate the technological and work-cultural problems, though. You and I have both worked on the development of XML-based solutions, as well as on XHTML-based web content versions of largish documents. We know that it is very hard to create software that really encourages and assists ordinary authors to make their large documents well-structured. I had always hoped that editing technology would eventually come along which leaned very heavily on re-using and linking but forced users to apply formal structure instead of ad-hoc presentation formatting. If authors got a lot of value from the re-use and linkage assistants, they might be willing to trade that for the work required to make documents with machine-readable structure.
In fact, as you'll recall, this was the founding ambition of KnowNet :o) We here at KnowNet are always trying to improve matters a bit, as are other like-minded developers (for instance, large HTML documents in our Plone sites are internally perma-linkable, our indexFolder product goes a long way towards encouraging authors to edit in well-formed, semantically-structured chunks, and we are currently working hard at structured-blogging/microformats support in knotes).
But progress is slow and authors' habits are hard to break. Yet overall, the web is exhibiting marked improvement.
In the (temporary) absence of really usable structuring editors for large documents, the web has found another way to create very large corpora of deep-linked, deeply-discussed, well-structured content. Make the content in tiny chunks, and later link it together according to the interest people find in it. That content lives in millions of weblogs, billions of weblog entries, billions of wiki pages and hundreds of millions of tagging gestures in the social bookmarking systems. I'm serious about this - user-generated content attains its value from its connectedness and its rich collaboratively-generated structuring, and this also makes it possible for Google's PageRank algorithm to make helpful content stand out from the dross.
It is becoming easier to find what you need to know about the big document on the desk in front of you by googling your query (and finding some weblog or mail-list archive entries about it, which provide usefully linked starting points to explore from) than it would be to turn the pages of the document or try to find the text in a PDF viewer.
My hunch is that there will be several avenues of gradual improvement in deep-structured content and monolithic documents:
- new web-based editing tools such as Writely. These may eventually get to the tipping point in functionality I mentioned above: improving their assistance for re-using chunks and assisting authors in rich-linking so much that it makes it worth authors' time to structure their content.
- the possibility of combining basic browser platforms with structural-description 3rd-party services - Annotea-like tagging of internal markers in otherwise unstructured documents, with 3rd-party services available to proxy the standardisation of these markers
- open-access legal initiatives, such as the requirement that UK-funded research deposits its papers in an open-access repository. It might be possible to make the case that they are not open-access unless their insides are sensibly accessible as well - to legally require that the effort is spent converting them into open structural formats. Perhaps this could be part of e-government requirements as well.
- shaming bad practice with good examples and actively tagging/blogging/linking users. Increasing numbers of users are becoming citizens of the two-way web, and more of the content they are interested in is becoming deep-linkable, deep-taggable and deep-discussable. We can hope that these trends will interact positively to create grassroots demand for well-structured content.
By the way, I maintain a couple of del.icio.us tags on technology issues related to this: del.icio.us/Mike_Malloch/structured-editing and del.icio.us/Mike_Malloch/webtech/structured-content, and also regularly tag items relevant to the policy issues around open access: del.icio.us/Mike_Malloch/policy/open-content. Also of some relevance are services/file-editing, social-content and standards/file-formats.
Linking and trackbacks
When linking to this weblog entry, please use the 'permalink', which is http://www.knownet.com/writing/weblogs/Graham_Attwell/entries/4180707254
