The General Mission.
NOTE: This is an early draft document; it is largely a copy/paste from a .md file that was located in a vscode project (well, not any longer).
The goal of this project is to create a trained natural language model that can be used locally, on average laptops as well as more powerful devices.
Sense needs to be some form of generative system that can be updated in a manner that declaratively supports version control and version-based use, but also in a way that may restructure portions (or large portions) of the model based upon the information provided via new inputs.
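As a purely illustrative sketch rather than a specification, the declarative, version-based storage described above might look something like the following; the names (ModelStore, save, load), the JSON layout and the version strings are all assumptions made for the example.

```python
# Hypothetical sketch of declarative, version-based model storage: each
# revision of the model is written under its own version identifier, so that
# applications can pin a version while newer inputs restructure later ones.
import json
from pathlib import Path

class ModelStore:
    def __init__(self, root="sense-models"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def save(self, version: str, model: dict) -> Path:
        """Write a model snapshot under an explicit version identifier."""
        path = self.root / f"{version}.json"
        path.write_text(json.dumps(model, indent=2))
        return path

    def load(self, version: str) -> dict:
        """Load the exact version an application declares it depends on."""
        return json.loads((self.root / f"{version}.json").read_text())

store = ModelStore()
store.save("0.1.0", {"language": "en", "terms": {"sense": "initial definition"}})
store.save("0.2.0", {"language": "en", "terms": {"sense": "restructured definition"}})
print(store.load("0.1.0")["terms"]["sense"])  # older version remains usable
```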
The language model is not intended to be a general knowledge model; rather, its express purpose is to generate a comprehensive 'understanding' of one or more human languages, starting with English.
Producing a comprehensive model, even simply for the English vocabulary, will require inputting various other languages into the modelling environment.
The system produced is intended to be usefully employed by persons whose first language is something other than English. The ecosystem is intended to support social development via interlinked nodes and various instructive components, which are in turn intended to support various necessary protections, features and functionality.
Whilst systems keep prompting me to employ the term 'model' as though it were just one model, I have named it 'sense'. The implication relates to the formation of tools to support the development of personal ontology, and an API that relates to how people 'sense' one another, both at the time and as a consequence of reviewing historical records.
As such, it is not one model; but it is indeed software. So, the term used in relation to the development of software that in turn produces language-related software components (which may be called models) will simply be 'sense'. 'Sense' is thought likely to evolve into something that becomes best implemented as a form of neuromorphic software application.
The end goal is to form a comprehensive model of the English language and its various dialects and industry-specific vocabularies.
This will be achieved by forming a generative model that can be improved upon over time through human-assisted training. The model will be able to be used in a variety of contexts, including but not limited to the following (a hypothetical interface sketch follows the list):
* Dictionary and Spellcheck
* Ontology Production
* Ontology Reasoning
* Search Indexing
* Recommendation Engine
* Search
* Database Queries
* Database Structures
* Content Metadata Generation
* AI Assistants
* API Generation
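To make the list above a little more concrete, the following is a purely hypothetical sketch of what a locally loaded 'sense' library surface might offer for a couple of those contexts (dictionary-style lookup and content metadata generation); the class and method names (Sense, lookup, suggest_metadata) are invented for illustration and do not describe an existing API.

```python
# Purely hypothetical sketch of a local 'sense' library surface.
from dataclasses import dataclass, field

@dataclass
class Sense:
    """Illustrative stand-in for a locally loaded 'sense' language model."""
    lexicon: dict = field(default_factory=dict)  # term -> definition/metadata

    def lookup(self, term: str) -> dict:
        # Dictionary / spellcheck style use: return what is known about a term.
        return self.lexicon.get(term.lower(), {"known": False})

    def suggest_metadata(self, text: str) -> list:
        # Content-metadata-generation style use: naive keyword extraction.
        return sorted({w.lower() for w in text.split() if w.lower() in self.lexicon})

# Example usage with a toy lexicon.
model = Sense(lexicon={"ontology": {"known": True, "pos": "noun"}})
print(model.lookup("Ontology"))
print(model.suggest_metadata("An ontology describes relationships between terms"))
```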
Program Construct
The program produced should operate in a decentralised manner across IPv6-connected devices. The systems will require authentication, and a variety of other considerations will need to be properly met. The early versions are likely to be less secure than future generations.
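As a minimal sketch of the connectivity aspect only, one node listening for peers over IPv6 could be expressed with the Python standard library as below; the port number is an arbitrary placeholder, and authentication and the wider decentralised protocol are deliberately left out, as they still need to be designed.

```python
# Minimal sketch: a single node accepting one peer connection over IPv6.
# Authentication and the decentralised protocol itself are not addressed here.
import socket

def serve(host="::", port=8471):  # port number is an arbitrary placeholder
    with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen()
        conn, addr = srv.accept()
        with conn:
            print("peer connected from", addr)
            conn.sendall(b"hello from a sense node\n")

if __name__ == "__main__":
    serve()
```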
It is intended that other applications will make use of the sense system as a required library.
Notes about Existing Language Models
This note provides some resources that are either duplicates of, or an extension of, the [PCTWebizenUseOfOntology] notes that have also been made.
In the webizen ecosystem, one of the objectives is to create a top-level ontology that is fit for purpose for applications relating to human beings (self), and then for sociosphere/sociology and biosphere ontologies. In combination, this may be considered an approach that seeks to distinguish our consciousness and our natural world as something other than another person's 'thing'. Then, whilst there are many ontologies to describe things, new ontologies may be created using natural language, which would in turn also be defined in a way that creates meaningful relationships between existing ontologies and PCTOntologyModelling outputs.
There are some existing Vocab Models that are defined using RDF/OWL.
Any that are not listed below will be added to this NLP google-drive repo, which also stores info about NLP-related docs and notes (although there may be more elsewhere).
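As one way of working with those existing RDF/OWL vocab models, the sketch below assumes the Python rdflib library and network access to the publicly published FOAF vocabulary; it simply loads the vocabulary and lists the classes it defines.

```python
# Sketch: load an existing RDF/OWL vocabulary (FOAF) and inspect its classes.
# Assumes the rdflib library is installed and the FOAF document is reachable.
from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("http://xmlns.com/foaf/0.1/")  # fetched over the network as RDF

# List the classes the vocabulary defines, with their human-readable labels.
for cls in g.subjects(RDF.type, OWL.Class):
    for label in g.objects(cls, RDFS.label):
        print(cls, label)
```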
Pre-Existing Large Language Models.
Whilst investigating solutions, an array of existing language models has been identified that provides a great deal of the underlying data considered to be required, although how best to employ them is presently unclear.
Whilst noting the work done previously making enquiries with ChatGPT, as illustrated by [ChatGPTDynamicOntology], the VocabularyModelling folder has in turn been created to 'create space' for more thorough investigation. In order to illustrate the considerations, I'll start by illustrating the resources that I've found so far.
NLP Notes
As a broader generalisation, the ecosystems are designed to support Natural Language Processing as a foundational AI capability.
This note is undeveloped and is therefore now on the 'todo list', although the location of this document may change, as the NLP functions are more related to the webizen agent than to the PCT systems that need to be designed to support it.
TBD -
Natural Language Processing vs. ChatGPT
As a consequence of the public release of ChatGPT, large-scale 'natural language' models are perhaps the most significant area of discussion in relation to the use of AI at the moment. Consequently, it seems to me that there is also some level of confusion about the nature of natural language processing generally, and about what it is that I'm working to achieve. The implicit implication is that these large language models are far too big and hardware-intensive to run on any one machine.
Also, they are not simply a model of 'English' as a language; rather, services like ChatGPT have a comprehensive knowledge base (database) harvested from the internet, embodying the contributions of many billions of persons over time. Perhaps a better term for such a service would be 'knowledge platform' rather than simply 'natural language model'.
At the opposite end of the spectrum are the dictionary files that have been part of word-processing software for decades, and indeed there are various examples of electronic dictionary devices that are very much like calculators; as such, the amount of hardware/software resources required to make basic dictionaries work must be minor.
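To illustrate how small that footprint can be, the sketch below assumes nothing more than a plain-text word list (such as the /usr/share/dict/words file found on many Unix-like systems) and performs a basic dictionary check with a set lookup.

```python
# Minimal sketch of the "basic dictionary" end of the spectrum: a plain word
# list loaded into a set, much as word processors have done for decades.
def load_wordlist(path="/usr/share/dict/words"):
    # Assumes a local word list file exists at this path.
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def check(text, words):
    """Return the tokens in `text` that are not found in the word list."""
    return [t for t in text.split() if t.strip(".,;:!?").lower() not in words]

if __name__ == "__main__":
    words = load_wordlist()
    print(check("Thiss sentence contains a mispelled word", words))
```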
When asked to provide some information (given that public sources could not otherwise be easily identified), ChatGPT provided the following:
Electronic software dictionaries have a long history dating back to the 1970s. One of the earliest examples of an electronic dictionary was the FED-2, created by the Soviet company "FED" in 1971. It was a translation device that used punch cards and a built-in CRT screen.
In the 1980s and 1990s, electronic dictionaries started to become more advanced and portable, with the introduction of devices such as the Franklin Electronic Publishers' "Bookman" and the Casio "Data Plus" series, which could fit into a pocket and could be powered by batteries. These early electronic dictionaries were limited in their storage capacity and typically held a small fraction of the words found in print dictionaries.
In the late 1990s, the first electronic dictionaries with built-in TFT screens and more advanced search functions were developed, such as the Seiko ER6700, which had a 2-megabyte memory and could store around 30,000 words.
As such, there is a spectrum between what is easily achieved, what may be feasibly achieved, and in turn what cannot be achieved at this time without the use of public APIs.