Very large ANN models produce good predictions when sufficient datasets are available. However, the whole ANN model becomes a black box. AI can play a big role in healthcare, but there is a debate about whether we can rely on fully black-boxed AI models for critical decision-making. Symbolic AI is an alternative, but it is mostly seen as a rule-based engine, which limits the scale at which it can be used.
Merging statistical AI such as ANNs with Symbolic AI is promising, and it is an area of active research. We believe that this is the decade of NeuroSymbolic AI. Our active research is around breaking down the mammoth AI models into multiple stages whose boundaries are valid symbolic representations.
Our research is helping us not just to work towards explainable AI goals with NeuroSymbolic AI but also to train models with minimal data. This lets us work with higher-order inputs, at the level of paragraphs and stories, without having to build large datasets.
Our NeuroSymbolic AI architecture is based on GIPCA (General Intelligence Predictive and Corrective Architecture). BISLU (Brain Inspired Spoken Language Understanding) is built on GIPCA.
Conventional NLU today is an AI model trained for intent classification. Usually there are only a handful of intents, which restricts how well a computer can understand a human the way another human would.
Universal NLU is an approach to understanding humans the way a human does: it takes a stream of spoken utterances as input and generates Human Thought Representations as output. If an utterance falls within the domain knowledge of Universal NLU, it generates high-resolution thoughts; if it is out of domain, it generates low-resolution thoughts. Universal NLU is always aware and keeps extracting information for further processing.
Universal NLU keeps language-specific syntactic structures and semantic meaning fully separate so that it can be adapted to any spoken language.
NLP is mostly done with intent classification models on sentences. Sentence segmentation is quite easy on written text, but spoken language usually arrives as a continuous stream of words from speech-to-text engines. Extracting intents from streaming audio is quite challenging.
Existing implementations ask users to adopt new behaviour, such as taking pauses or using a wake word, which is a good solution. But these restrictions may not be natural in many use cases, including ours, where patients and doctors are talking naturally. Existing approaches use punctuation-rich text coming from Speech-to-Text, which probably relies mostly on pauses and language structure.
Our research is currently focussed on a hybrid approach that uses pauses, meaning accumulation and Centom theory. The meaning accumulation engine accumulates sub-intents and forms actionable intents, built on top of a solid clinical NER.
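As a rough illustration of meaning accumulation, the sketch below collects sub-intents until an intent has everything it needs to be acted on. It is a minimal sketch under stated assumptions, not our engine: the slot schema `REQUIRED_SLOTS`, the `MeaningAccumulator` class and the sample stream are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical slot schema: a "prescribe" intent is actionable once these are known.
REQUIRED_SLOTS = {"prescribe": ("medication", "dose", "frequency")}


@dataclass
class MeaningAccumulator:
    """Collects sub-intents from a word stream until an intent becomes actionable."""
    intent: str
    slots: dict = field(default_factory=dict)

    def add(self, slot, value):
        """Store one sub-intent; return the completed intent, or None if still incomplete."""
        self.slots[slot] = value
        required = REQUIRED_SLOTS[self.intent]
        if all(name in self.slots for name in required):
            done = {"intent": self.intent, **self.slots}
            self.slots = {}  # start accumulating the next intent
            return done
        return None


acc = MeaningAccumulator("prescribe")
# Sub-intents as a hypothetical clinical NER stage might emit them across pauses.
stream = [("medication", "paracetamol"), ("dose", "500 mg"), ("frequency", "twice a day")]
results = [acc.add(slot, value) for slot, value in stream]
```

Only the last call returns an intent; the first two return `None` because the meaning is still incomplete, which is why a pause alone is not treated as a sentence boundary here.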
Understanding spoken language has many challenges. One is that utterances arrive as a stream, sometimes as well-finished sentences and sometimes half-baked. These utterances also often do not follow the grammar rules of written text. This makes it hard to segment the text and run NLU on top of it.
Centom is a new way of segmenting spoken language streams. The basic hypothesis is that language is built on small syntactical chunks containing connected entities which are also semantically correlated. We call these atomically connected entities Centom (Atomic Connected Entities).
Centom makes it easier to break a continuous stream of words coming from speech into segments at the syntactic level while preserving local semantic information in each segment. Centom segments can form a strong symbolic base for Neuro-Symbolic AI systems.
Speech-to-Text (STT) systems have evolved a lot in the last decade with the increased applicability of Deep Learning. While there are many ANN-based STT models available today, almost none of them are both lightweight and highly accurate. This makes STT systems inefficient to use, as one has to compromise on accuracy if the system is to be deployed in a resource-constrained environment, and vice versa. Moreover, even with a lightweight, state-of-the-art STT system, we would still not be able to use it in the medical domain, due to the lack of relevant datasets needed to teach the system medical jargon and context.
We tackle the former problem by creating our own custom ANN-based, lightweight, state-of-the-art STT system, which performs well in resource-constrained environments. Parts of it are inspired by Facebook’s Wav2Letter and Google’s Inception V3 network. Wav2Letter’s architecture allows us to build a lightweight model, while Inception V3’s parallel branches increase the system’s performance by making it look at different receptive fields in parallel.
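The idea of parallel branches with different receptive fields can be shown on a 1-D signal. The following is a minimal sketch in plain Python, assuming simple averaging kernels of odd width; it is not our STT network, and `conv1d_same` and `parallel_branches` are hypothetical helper names.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding so the output keeps the input length.

    Assumes an odd kernel width so the window is centred on each time step.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(signal))
    ]


def parallel_branches(signal, kernel_widths=(1, 3, 5)):
    """Run averaging kernels of several widths side by side and stack outputs per time step."""
    branches = [conv1d_same(signal, [1.0 / w] * w) for w in kernel_widths]
    # Each time step now carries one feature per receptive field.
    return [list(step) for step in zip(*branches)]


features = parallel_branches([0.0, 0.0, 3.0, 0.0, 0.0])
```

Each time step ends up with one value per kernel width, so a short event is visible both sharply (width 1) and smeared over its neighbourhood (widths 3 and 5). In a trained network the kernels would be learned rather than fixed averages.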
The dataset issue is fixed by training the STT system on a combination of our proprietary medical recordings dataset. Real-time conversations in a hospital or a clinic involve a lot of background noise and other disturbances so we tuned our dataset in such a way that our STT system was able to filter out these noises and disturbances and predict only relevant information.
Our research also includes detecting noise and other languages, so that system resources are not wasted on unwanted audio streams.
The automatic speech recognition task involves two different models, the Acoustic Model and the Language Model, which together transcribe audio efficiently. The Language Model takes a probabilistic approach, adjusting the probabilities obtained from the Acoustic Model to remove anomalies that could arise.
Generally, language models are built from a text corpus, and the probability of each n-gram is derived from its frequency. In a medical context, many different classes are involved, such as labscan and medication, along with the general sentences used in routine examination talk. Since the names of medicines, lab scans and other medical terms can be several words long, a higher order of n-gram is required, which takes a lot of computation. Also, the number of possible sentences that could be spoken in a medical conversation is on the order of sextillions, and storing that much data would need more disk space than is feasible. Creating a language model directly on this data is costly and inefficient. We therefore devised a Hierarchical Language Model, which works in two layers of comparison and is logically trained over sextillions of sentences.
The hierarchical structure of the language model reduces complexity and provides the expected results, while reducing memory, disk and CPU usage.
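One common way to get a two-layer model is a class-based design: a top layer scores sentence templates containing class placeholders, and a second layer scores the entity that fills each placeholder. The sketch below assumes that design for illustration only; the templates, class names and probabilities are toy values, not taken from our model.

```python
from math import prod

# Layer 1: sentence templates with class placeholders (hypothetical toy probabilities).
TEMPLATES = {
    ("take", "<medication>", "daily"): 0.6,
    ("order", "a", "<labscan>"): 0.4,
}
# Layer 2: one small model per class; multi-word names are single entries.
CLASSES = {
    "<medication>": {"paracetamol": 0.5, "amoxicillin clavulanate": 0.5},
    "<labscan>": {"complete blood count": 0.75, "chest x ray": 0.25},
}


def sentence_probability(template, fillers):
    """P(sentence) = P(template) * product of P(filler | class) for each placeholder."""
    slots = [tok for tok in template if tok in CLASSES]
    if len(slots) != len(fillers):
        raise ValueError("one filler is needed per class placeholder")
    return TEMPLATES.get(template, 0.0) * prod(
        CLASSES[slot].get(filler, 0.0) for slot, filler in zip(slots, fillers)
    )


def covered_sentences():
    """Number of distinct sentences represented without storing any of them expanded."""
    return sum(
        prod(len(CLASSES[tok]) for tok in template if tok in CLASSES)
        for template in TEMPLATES
    )


p = sentence_probability(("order", "a", "<labscan>"), ["chest x ray"])
```

The number of sentences covered grows as the product of the class sizes, while storage grows only as their sum. That is how a model of this shape can be "logically" trained over a very large sentence space without materialising it.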
ETML stands for Extended Thought-Representation Mark-up Language. ETML serves the purpose of understanding the concept of thought in a thought cloud (Graphical thought representation of a conversation). Human Thought is an interesting concept correlating multiple entities in the physical universe together with certain properties. Human thought can be generated from any perceptual input. Thoughts can be imaginary but are inspired by the physical universe we live in.
The Thought-Representation data structure is a rich graphical structure which can be understood by a computer. Thought Cloud refers to the collection of thoughts which together represent a chunk of conversation, a story, or context of any form. For BISLU, the Thought Cloud encompasses a complete conversation context.
ETML is a convenient textual representation of these graphical structures, giving software engineers and other humans an interface to debug and modify thoughts easily instead of working on the graphical structures directly. ETML also makes it easy to create datasets for converting text into thoughts and vice versa.
During conversations, people tend to convey some of their intent via gestures rather than words. This is a problem for us, because we cannot afford to miss any detail about the patient during a consultation.
To tackle this problem, we are trying to develop a deep learning based algorithm that will understand visual cues in real time in a patient-doctor conversation. The algorithm first tries to detect the moment when the patient is trying to convey something via a visual cue, for example, by pointing to a body part. After detecting the moment, it extracts the particular frame and detects the name of the body part to which the patient is pointing. This is provided to the Universal NLU, where the cues either help de-refer or co-refer some entities or enrich entities with more information. Entity enrichment will add, for example, the severity of discomfort or pain to some spoken words.
This entire model runs in real-time so that it can enhance the capability of the existing BISLU system to give accurate predictions.
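To make the de-referencing step concrete, the sketch below replaces a deictic word such as "here" with the body part detected closest in time. It is a minimal sketch under assumptions: the word list `DEICTIC_WORDS`, the `max_gap` threshold and the timestamped input format are all hypothetical, and the vision model itself is not shown.

```python
DEICTIC_WORDS = {"here", "there", "this", "that"}  # hypothetical trigger words


def resolve_with_cue(tokens, cues, max_gap=1.0):
    """Replace deictic words with the body part pointed at around the same time.

    tokens: list of (word, time_in_seconds) from speech-to-text.
    cues:   list of (body_part, time_in_seconds) from the vision model.
    """
    resolved = []
    for word, t in tokens:
        if word in DEICTIC_WORDS and cues:
            # Pick the cue nearest in time to the spoken word.
            part, cue_time = min(cues, key=lambda cue: abs(cue[1] - t))
            if abs(cue_time - t) <= max_gap:
                resolved.append(part)
                continue
        resolved.append(word)
    return resolved


tokens = [("it", 0.2), ("hurts", 0.5), ("here", 0.9)]
cues = [("left knee", 1.1)]
resolved = resolve_with_cue(tokens, cues)
```

With these inputs "here" becomes "left knee". A cue that is too far away in time is ignored, so the spoken word passes through unchanged.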
NLP co-referencing mostly works in a limited context, specifically within a sentence. Entities in a conversation, however, are referred to throughout. The same entity is sometimes referred to by the same word, sometimes by a noun, and sometimes by semantic correlation.
Our research revolves around dissecting co-referencing in a wider context. It separates the syntactical structure and semantic domain aspects of co-referencing so that language-specific modules can be kept separate.
Our co-referencing research involves the following 3 types of de-referencing:
A thought is a basic representation of an utterance, which can be written in many ways. Inter-thought meaning anomaly, as the name suggests, is a system which works across thoughts and is capable of catching non-meaningfulness among them.
There are rule engines which try to solve this problem, but they are highly inefficient due to their architecture or the non-availability of a knowledge base.
To give a glimpse of this area of research, we would like to identify an anomaly when an utterance like the following occurs: "Patient is father of two daughters. He has knee pain. Patient was pregnant 3 years ago". Being a father and having been pregnant clearly do not hold together, and we would like to flag an anomaly on these thoughts, as our systems might have misunderstood the conversation. The opposite of an anomaly is reinforcement, which points to highly coherent thoughts in the conversation.
Our research focuses on solving this as this will help our prior NLU stages to re-look at other possible solutions if needed. Our research focuses on a hybrid approach of using knowledge-bases along with reasoning based symbolic AI.
Since the system works in the thought domain, our research also focuses on creating thoughts surrounding the thoughts in a conversation, which we call ‘environment of a thought’. The spoken thoughts and the environmental thoughts could bring out the non-meaningfulness of a thought or the meaning anomaly.
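A toy version of the knowledge-base side of this check is sketched below. It assumes thoughts have already been reduced to (subject, predicate) pairs and that a small table, here called `IMPLIES`, records what each predicate implies about its subject. Both the table and the predicate names are hypothetical simplifications of a real thought representation.

```python
# Hypothetical knowledge base: what a predicate implies about its subject.
IMPLIES = {
    "is_father": ("sex", "male"),
    "was_pregnant": ("sex", "female"),
    "has_knee_pain": None,  # implies nothing we track
}


def find_anomalies(thoughts):
    """Return pairs of thoughts that imply conflicting attribute values for one subject."""
    seen = {}       # (subject, attribute) -> (value, thought that implied it)
    anomalies = []
    for thought in thoughts:
        subject, predicate = thought
        implied = IMPLIES.get(predicate)
        if implied is None:
            continue
        attribute, value = implied
        key = (subject, attribute)
        if key in seen and seen[key][0] != value:
            anomalies.append((seen[key][1], thought))
        else:
            seen.setdefault(key, (value, thought))
    return anomalies


conversation = [
    ("patient", "is_father"),
    ("patient", "has_knee_pain"),
    ("patient", "was_pregnant"),
]
anomalies = find_anomalies(conversation)
```

The implied facts play the role of a very small "environment" around each spoken thought: the conflict appears between what the thoughts imply, not between the spoken words themselves.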
Real, meaningful human conversation is highly coherent with respect to the context in which it is happening. Imagine a scenario where a person talks about many subjects in one conversation and changes context quite often. Such a discussion would be called a context anomaly.
Our research builds on a basic human concept: changing context too often demands more energy from the human brain. Brain energy is limited, however, so it can handle only a few context changes. Our research uses this concept of limited context-switching energy to model contextual coherency and anomaly.
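The energy idea can be sketched as a simple budget that is spent on every topic change and slowly recovered while the topic stays the same. This is a minimal sketch: the function `context_anomalies` and its three parameters are illustrative assumptions, not measured values or our actual model.

```python
def context_anomalies(topics, budget=2.0, switch_cost=1.0, recovery=0.25):
    """Flag utterance indices where topic switches exhaust a limited energy budget.

    Each change of topic costs `switch_cost`; staying on topic recovers `recovery`
    up to the original `budget`. All three numbers are illustrative assumptions.
    """
    energy = budget
    flagged = []
    for i in range(1, len(topics)):
        if topics[i] != topics[i - 1]:
            energy -= switch_cost
            if energy < 0:
                flagged.append(i)  # this switch could not be afforded
                energy = 0.0
        else:
            energy = min(budget, energy + recovery)
    return flagged


coherent = ["pain", "pain", "pain", "medication", "medication", "medication"]
erratic = ["pain", "travel", "diet", "cricket", "pain", "weather"]
```

Calling `context_anomalies(coherent)` flags nothing, because a single switch fits within the budget. For `erratic`, the first two switches are affordable and every later switch is flagged.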
Information processing in a Neuro-Symbolic AI system needs to translate input and knowledge-bases into appropriate symbols for transformation, computation and manipulation. In Simbo's context this information is (mainly) the spoken words and the vast medical and worldly knowledge stores.
ELUPS is a fast and efficient in-memory look-up for medical concepts, worldly knowledge and the standard English lexicon. The knowledge base for ELUPS is drawn from the curated UMLS Metathesaurus plus in-house curated and tagged data.
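An in-memory look-up of this kind can be approximated with a dictionary keyed by token tuples and a greedy longest-match scan. The sketch below is only an illustration of that general technique; `LEXICON`, its tags and the `lookup` function are hypothetical and do not describe the ELUPS internals.

```python
# Hypothetical toy knowledge base: phrase -> concept tag.
LEXICON = {
    ("blood", "pressure"): "vital_sign",
    ("high", "blood", "pressure"): "condition",
    ("aspirin",): "medication",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in LEXICON)


def lookup(tokens):
    """Greedy longest-match scan: return (phrase, tag) for every known concept found."""
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in LEXICON:
                found.append((" ".join(phrase), LEXICON[phrase]))
                i += n
                break
        else:
            i += 1  # no concept starts here; move on one token
    return found


found_concepts = lookup("patient has high blood pressure and takes aspirin".split())
```

Trying the longest phrase first means "high blood pressure" is returned as one condition rather than as the shorter vital sign "blood pressure".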
Consider a dataset with respective targets (classes), where each “data point” in the dataset is a sequence of “metadata”, e.g. data point = [[meta1, meta2, meta3…], [meta1, meta2, meta3…], ….]. Here, metadata refers to the different features of each element in a data point. Symbolic optimization is the process of “compressing” a large dataset with target classes into an optimal set of patterns, where each pattern uniquely identifies a certain class. Each of these optimal patterns is the overlap of metadata across a large number of data points belonging to a certain class in the input data.
Since we are generalising the dataset using patterns, the amount of memory and time required to store and access these patterns reduces exponentially. Symbolic optimization can also be shifted from time-intensive to memory-intensive, and vice versa, just by tweaking certain parameters. This provides an additional advantage in adjusting the implementation to system capabilities and available resources.
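The "overlap of metadata" can be illustrated with set intersections. The sketch below is a minimal sketch assuming one pattern per class and data points of equal length; the metadata tags and the helper names `build_pattern` and `pattern_matches` are hypothetical, and a real optimizer would also have to search for the best set of patterns.

```python
def build_pattern(data_points):
    """Per-position intersection of metadata over equally long data points of one class."""
    pattern = [set(element) for element in data_points[0]]
    for point in data_points[1:]:
        pattern = [common & set(element) for common, element in zip(pattern, point)]
    return pattern


def pattern_matches(pattern, point):
    """A point matches when every position carries all metadata the pattern requires."""
    return len(pattern) == len(point) and all(
        required <= set(element) for required, element in zip(pattern, point)
    )


# Hypothetical metadata: each element is tagged with part of speech and entity type.
dosage = [
    [["NUM", "quantity"], ["NOUN", "unit"], ["NOUN", "drug"]],
    [["NUM", "quantity"], ["NOUN", "unit", "abbrev"], ["PROPN", "drug"]],
]
symptom = [
    [["ADJ", "severity"], ["NOUN", "body_part"], ["NOUN", "symptom"]],
    [["ADJ", "severity"], ["NOUN", "body_part", "plural"], ["NOUN", "symptom"]],
]
patterns = {"dosage": build_pattern(dosage), "symptom": build_pattern(symptom)}

new_point = [["NUM", "quantity"], ["NOUN", "unit"], ["NOUN", "drug", "brand"]]
labels = [name for name, pattern in patterns.items() if pattern_matches(pattern, new_point)]
```

Two stored data points per class collapse into one pattern each, and an unseen point with extra metadata ("brand") still matches only the dosage pattern.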
We aspire to keep improving the intelligence of our software systems and to keep moving towards human-like General Intelligence. Studying Spiking Neural Networks and simulating their behavior using neurons actually extracted from the human neocortex is an active research area. We have simulated up to 50 cubic millimeters of the human neocortex with audio and video as perceptual inputs, to understand correlations among various perceptual features.
We have used NeuroMorpho.Org for SWC models of actual human neurons. We created 3D models to simulate synapses among these and then used the Brian2 simulator to simulate the models.
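SWC is a plain-text morphology format in which each line holds a sample id, a structure type, x, y, z coordinates, a radius and the parent sample id (-1 for the root). As a small illustration of working with such files, the sketch below parses SWC text and sums the cable length. The three-sample morphology is hand-written for this example and is not a NeuroMorpho.Org neuron; `parse_swc` and `cable_length` are hypothetical helper names, and the Brian2 simulation step is not shown.

```python
import math

# A tiny hand-written morphology in SWC layout (not a real NeuroMorpho.Org neuron).
SWC_TEXT = """\
# id type x y z radius parent
1 1 0.0 0.0 0.0 5.0 -1
2 3 3.0 4.0 0.0 1.0 1
3 3 3.0 4.0 12.0 0.5 2
"""


def parse_swc(text):
    """Return {sample_id: (type_id, (x, y, z), radius, parent_id)} from SWC text."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        sid, type_id, x, y, z, radius, parent = line.split()[:7]
        nodes[int(sid)] = (int(type_id), (float(x), float(y), float(z)), float(radius), int(parent))
    return nodes


def cable_length(nodes):
    """Sum of distances between each sample and its parent (the root has parent -1)."""
    total = 0.0
    for _type, xyz, _radius, parent in nodes.values():
        if parent != -1:
            total += math.dist(xyz, nodes[parent][1])
    return total


nodes = parse_swc(SWC_TEXT)
length = cable_length(nodes)
```

The parent links are what turn a list of points into a tree, which is the information needed to decide where compartments and candidate synapse sites lie.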