English Language Modelling
This Document is a work in progress!
NOTE: UnderstandingOntologies
The objective of this note is to dig into the details of what should be considered when seeking to form a top-level ontology using natural language, which is applied in this case to the English language (although not exclusively).
Language Related Fields of Research
Etymology
Etymology (/ˌɛtɪˈmɒlədʒi/ ET-im-OL-ə-jee) is the study of the history of the form of words and, by extension, the origin and evolution of their semantic meaning across time. It is a subfield of historical linguistics, and draws upon comparative semantics, morphology, semiotics, and phonetics.
Source: WikiPedia
Semiotics
Semiotics (also called semiotic studies) is the systematic study of sign processes (semiosis) and meaning making. Semiosis is any activity, conduct, or process that involves signs, where a sign is defined as anything that communicates something, usually called a meaning, to the sign's interpreter. The meaning can be intentional, such as a word uttered with a specific meaning, or unintentional, such as a symptom being a sign of a particular medical condition. Signs can also communicate feelings (which are usually not considered meanings) and may communicate internally (through thought itself) or through any of the senses: visual, auditory, tactile, olfactory, or gustatory (taste). Contemporary semiotics is a branch of science that studies meaning-making and various types of knowledge.
Source: WikiPedia
Epistemology
Epistemology (/ɪˌpɪstəˈmɒlədʒi/; from Ancient Greek ἐπιστήμη (epistḗmē) 'knowledge', and -logy), or the theory of knowledge, is the branch of philosophy concerned with knowledge. Epistemology is considered a major subfield of philosophy, along with other major subfields such as ethics, logic, and metaphysics.
Source: WikiPedia
Considerations
What would a "Gold Plated Solution" Look Like?
In seeking to define a solution on a BestEfforts basis, with an outcome that is FitForPurpose, it is likely that sacrifices will need to be made (particularly with respect to the first implementation, which is the useful purpose of this documentation). I think it's probably best to figure out the broader-scale ecosystem environment prior to deciding what's not going to be implemented at this stage. This is considered important in order to figure out what's required and what is considered to be a premium (meaning 'nice to have' / unnecessary) feature.
As such, the first line of enquiry is focused on gaining a comprehension of what could be done if resources were not limited, technically and/or otherwise.
Looking into the history of the English language evokes an array of considerations that I think are related to Etymology, Semiotics, Epistemology etc.
SpaceTime Considerations (GeoSpatial & Temporal)
Languages evolve over time and, in turn, relate to places and to the peoples from those places. Some of the implications include pictorial languages, such as is demonstrated by heraldry, from periods when consistent spelling and the general ability to read and write a language were not always common.
Developing a language model that takes into consideration where different words are (often) thought to have originated, along with the notations of time relating to those known events, involves geospatial and temporal relationships that AI models can process in ways that human minds alone cannot, even where a person has studied a particular subject or topic over many, many years. It also takes humans many years to gain even a basic command of a spoken language, and further years to gain knowledge about the use of that language for writing and for reading other people's works.
Languages are also constantly evolving. As such, the intended meaning of a word as it was used hundreds or thousands of years ago may differ from the modern meaning of that word. Similarly, words are continually being redefined and new words made.
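One way to picture the space and time attributes described above is as a small lexicon in which each sense of a word carries a period and a place. The following is only a minimal sketch under stated assumptions: the `Sense` record, the `LEXICON` table and the `senses_at` function are hypothetical names invented for illustration, and the year boundaries for 'nice' are rough placeholders rather than lexicographic data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sense:
    """One meaning of a word, bounded in time and tagged with a place."""
    gloss: str
    start_year: int            # first year the sense is treated as current
    end_year: Optional[int]    # None means the sense is still in use
    region: str                # where the sense is attested (free text here)


# Illustrative entries only: the year boundaries are rough placeholders.
LEXICON = {
    "nice": [
        Sense("foolish, ignorant", 1300, 1700, "England"),
        Sense("pleasant, agreeable", 1770, None, "England"),
    ],
}


def senses_at(word: str, year: int) -> list[Sense]:
    """Return the senses of `word` that were current in `year`."""
    return [
        s for s in LEXICON.get(word.lower(), [])
        if s.start_year <= year and (s.end_year is None or year <= s.end_year)
    ]


print([s.gloss for s in senses_at("nice", 1400)])
print([s.gloss for s in senses_at("nice", 2020)])
```

A fuller model would presumably need overlapping senses, uncertain dates and proper geospatial identifiers in place of a free-text region; none of that is attempted here.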
The Confluence of Languages
The English language is not simply 'English'; rather, it is made up of words that come from many other languages and many different peoples, from various places around the world, over time. The meaning of English words cannot be understood as well as it otherwise could be unless the history of those words is considered, or is made able to be considered.
Specialised Vocabularies & Field Specific Meanings
Various industries make use of language in various forms and in various ways.
Sometimes the meaning that is applied within a professional field is distinct from the use of the term in other settings. As such, the concept of a topic-field becomes an important attribute when seeking to comprehend the 'meaning' of a statement that may be employed in relation to a specified 'liberal arts' field, profession, skillset or domain.
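The topic-field attribute can be sketched as part of the lookup key for a term, with a fall-back to a general sense when no field-specific meaning is recorded. This is a minimal illustration only: the `MEANINGS` table, the `GENERAL` label and the `meaning_of` function are hypothetical, and the entries are everyday examples rather than a curated vocabulary.

```python
from typing import Optional

GENERAL = "general"

# (term, topic_field) -> meaning. Entries are illustrative only.
MEANINGS = {
    ("bug", GENERAL): "a small insect",
    ("bug", "software"): "a defect in a program",
    ("bug", "surveillance"): "a concealed listening device",
}


def meaning_of(term: str, topic_field: str = GENERAL) -> Optional[str]:
    """Prefer the field-specific meaning; otherwise fall back to the general one."""
    key = term.lower()
    return MEANINGS.get((key, topic_field), MEANINGS.get((key, GENERAL)))


print(meaning_of("bug", "software"))
print(meaning_of("bug", "heraldry"))
```

The fall-back is a design choice made for this sketch; a stricter system might prefer to report that no meaning is known for the requested field.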
Functional (Software) Language
Language sets for the useful production of software may include definitions about protocols and, in turn, API definitions. Further definitions could be provided about the meaning of the various functions provided by various software languages.
Translations
A 'gold plated' solution would also be able to support translations between languages with a high level of accuracy, in real time.
Complex Document "Graph" support for AnyURI
A gold plated solution might be able to process the text (or indeed the audio) of any webpage or electronic resource, and support a means to better manage the history of a person's time spent on a computer, supporting improved recall. It might also archive versions of documents and thereafter distinguish between versions of the same resource, whether because the resource itself changes or because the context relating to the content artifact changes, resulting in a different sort of meaning / categorisation of the content artifact and, in turn, perhaps also of any other content artifacts that refer to it.
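The archiving part of this idea, telling apart versions of the same resource, can be sketched with a content hash per URI. This is only a rough sketch under stated assumptions: `Archive`, `Snapshot` and `record` are hypothetical names, the example URI is a placeholder, and the sketch only detects changes to the content itself, not changes to the surrounding context.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Snapshot:
    digest: str            # SHA-256 of the captured text
    captured_at: datetime  # when this version was recorded (UTC)
    text: str


@dataclass
class Archive:
    # URI -> ordered list of distinct versions seen for that resource
    history: dict[str, list[Snapshot]] = field(default_factory=dict)

    def record(self, uri: str, text: str) -> bool:
        """Store `text` for `uri`; return True only if it differs from the latest version."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        versions = self.history.setdefault(uri, [])
        if versions and versions[-1].digest == digest:
            return False  # unchanged since the last capture
        versions.append(Snapshot(digest, datetime.now(timezone.utc), text))
        return True


archive = Archive()
uri = "https://example.org/page"  # placeholder URI
print(archive.record(uri, "first draft"))
print(archive.record(uri, "first draft"))
print(archive.record(uri, "second draft"))
```

Because only the latest digest is compared, a resource that reverts to an earlier state is stored as a further version, which keeps the timeline of changes intact.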
SocioEconomic Considerations
There is an incredibly high-skilled series of 'jobs' (in other words, work) that continually needs to be undertaken, and the useful benefit of that work is instrumental.
As is noted below, there are some solutions that are available as 'open source libraries', yet there are other solutions that are made available on a paid basis. There is an enormous scope of works that could be done towards supporting the commons / common-sense of computer-humanity language systems. Whilst the means to make use of these systems should be free of 'surveillance', particularly in relation to private use, the idea that all this work can be done 'for free' isn't considered reasonable or consistent with various practical factors that have a firm footing in reality. As such, some way of ensuring support for those whose job/skillset/life it is to do language related work is considered both reasonable and important.
Defining AI Related Input
As is noted in the [[AgentLabelling]] note, it is important that the technical delivery is designed in such a way that, by default, the solution identifies which agent was responsible for which words being added to a document, communication artifact or event. This is likely able to be done via markup that is embedded in the content assets; however, the exact solution is presently unknown and yet to be more formally defined. Whilst part of the consideration relates more specifically to AI (inc. 'autocorrect'), the same requirement is usefully important for collaborative documents involving many human actors.
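Since the exact solution is not yet defined, the following is only one possible shape for the idea, not the intended design: a document held as a sequence of segments, each tagged with the agent that contributed it. The `Segment` record, the agent identifiers (such as `human:alice` and `ai:autocorrect`) and the helper functions are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    agent: str  # hypothetical identifier for the human or AI contributor
    text: str   # the words that agent added


def render(segments: list[Segment]) -> str:
    """Join the segments back into the plain document text."""
    return "".join(s.text for s in segments)


def words_by_agent(segments: list[Segment]) -> Counter:
    """Count whitespace-separated words contributed by each agent."""
    counts: Counter = Counter()
    for s in segments:
        counts[s.agent] += len(s.text.split())
    return counts


doc = [
    Segment("human:alice", "The meeting is "),
    Segment("ai:autocorrect", "scheduled "),
    Segment("human:bob", "for Tuesday."),
]
print(render(doc))
print(dict(words_by_agent(doc)))
```

The same structure covers both AI and human contributors, which matches the point that the requirement also applies to collaborative documents; how such labels would be embedded as markup in real content assets is left open.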
Summary: "Gold Plated Solution"
Whilst these considerations are not yet exhaustive (ie: there's more), and I have not yet got into the database (software) methods and implications, I note again:
It is not presently expected that all of these sorts of qualities are going to be achievable, nor is it considered that these qualities, characteristics and related considerations are all required in order to make a significant improvement above and beyond the manner through which systems are made to operate today.