TYPES OF AI: A Comprehensive Account of Modern Closed & Open Source Models
Not long ago, everyone was raving about the performance of GPT-4; now Claude has arrived and is reported to surpass GPT-4 on many benchmarks. We have also seen smaller, niche-specific open-source models that perform comparably to these closed-source models on the tasks they target. This gives a glimpse of how swiftly modern AI is changing. However, most people and companies focus solely on developments in LLMs, which have advanced rapidly but are only one part of artificial intelligence rather than the whole field.
This article discusses different sub-fields of artificial intelligence and how they progressed with time. Additionally, we will list some state-of-the-art closed-source as well as open-source models for each subcategory. You can find the open-source models on Hugging Face and run them on Codesphere.
Suggested Reading: Running an Open Source Text2Img model
So, without further ado let's start with our first category of artificial intelligence.
Computer Vision Models
Computer vision is a field of artificial intelligence (AI) that enables machines to interpret and understand visual information. This multidisciplinary field is an intersection of computer science, mathematics, and engineering.
The history of computer vision dates back to the 1960s, when early efforts focused on basic tasks like character recognition. The 1970s and 1980s saw the development of algorithms for edge detection, object recognition, and feature extraction. In the 1990s, researchers worked on applying neural networks to visual tasks. In the 2000s, computer vision applications such as facial recognition systems and augmented reality became popular. However, the field was transformed most in the 2010s, with the advent of deep learning.
Presently, computer vision is used in several industries including healthcare, autonomous vehicles, and robotics. Apple's new release of the Vision Pro showed yet another example of what computer vision can help to achieve. Some of the notable computer vision models we use presently include ResNet, Inception, MobileNet, etc. One of the most notable models is the Vision Transformer (ViT). It applies the transformer architecture, initially designed for natural language processing, to image classification tasks. ViT has demonstrated competitive performance compared to traditional convolutional neural networks (CNNs) on various image classification benchmarks.
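To make the ViT idea a little more concrete, the sketch below shows only its first step: cutting an image into fixed-size patches and flattening each patch into a vector, so that an image becomes a sequence of "tokens" that a transformer can consume. This is a minimal illustration in plain Python and not the ViT implementation. The helper name `image_to_patches` and the toy 4x4 image are assumptions made for this example, and a real ViT additionally applies a learned linear projection, position embeddings, and the transformer layers themselves.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def image_to_patches(image: Image, patch_size: int) -> List[List[int]]:
    """Split an image into non-overlapping square patches, each flattened to a vector.

    Hypothetical helper for illustration; assumes height and width are
    divisible by patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    # Walk over the image patch by patch, left to right, top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are simply 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(toy_image, patch_size=2)
```

With a patch size of 2, the toy image turns into four patches of four pixels each, which plays the same role for the transformer as a four-word sentence would in NLP.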
Discriminative Models
This category of computer vision focuses on learning the decision boundary between different classes directly. In discriminative computer vision, the emphasis is on training models to recognize and classify specific visual patterns or objects based on input data.
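As a rough illustration of what "learning the decision boundary directly" means, here is a minimal sketch using a perceptron, a classic linear classifier, on six made-up 2D points. It is a deliberately tiny stand-in for the deep networks discussed in this article: the function names, the data, and the hyperparameters are all assumptions chosen for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def train_perceptron(
    points: List[Point], labels: List[int], epochs: int = 20, lr: float = 1.0
) -> Tuple[List[float], float]:
    """Learn a line w1*x1 + w2*x2 + b = 0 that separates label 0 from label 1."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), label in zip(points, labels):
            activation = weights[0] * x1 + weights[1] * x2 + bias
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # 0 when correct, +1 or -1 when wrong
            # Nudge the boundary only when the point was misclassified.
            weights[0] += lr * error * x1
            weights[1] += lr * error * x2
            bias += lr * error
    return weights, bias


def predict(weights: List[float], bias: float, point: Point) -> int:
    """Return 1 if the point lies on the positive side of the boundary, else 0."""
    return 1 if weights[0] * point[0] + weights[1] * point[1] + bias > 0 else 0


# Made-up, linearly separable toy data: class 0 near the origin, class 1 further out.
train_points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(train_points, train_labels)
predictions = [predict(weights, bias, p) for p in train_points]
```

The model never tries to describe what each class looks like; it only adjusts a line until the two classes end up on opposite sides of it. Image classifiers do the same thing in a much higher-dimensional space.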
Based on their application, these models can be grouped into the following types:
Image Classification: Models in this category are trained to assign a specific label or class to an entire image, indicating the primary object or scene present.
Examples: EfficientNet, ResNet (Residual Networks), VGG, AlexNet, ViT
Object Detection: These models are trained to identify and locate objects within an image, providing bounding boxes around the detected objects.
Some examples of object detection models are RetinaNet, Mask R-CNN, R-CNN, Fast R-CNN, Faster R-CNN, and YOLO (v1 through v4).
Semantic Image Segmentation: These models classify each pixel in an image, labeling them with the corresponding object or class they belong to.
Some notable examples include SegNet, Mask R-CNN, U-Net, and autoencoder-based models.
Facial Recognition: Discriminative models for facial recognition focus on identifying and verifying individuals based on facial features.
Some examples of Facial Recognition models include VGGFace, ArcFace, FaceNet, DeepFace, and OpenFace.
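Object detection models output bounding boxes, so a natural question is how to tell whether a predicted box is any good. One widely used measure, not mentioned above, is Intersection over Union (IoU): the overlap area of two boxes divided by the area of their union. The sketch below is a minimal version that assumes boxes are given as `(x_min, y_min, x_max, y_max)` corner coordinates; the function name and the box format are choices made for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersection_over_union(box_a: Box, box_b: Box) -> float:
    """Return the IoU of two axis-aligned boxes, a value between 0.0 and 1.0."""
    # Corners of the overlapping rectangle (if there is one).
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    # Clamp at zero so that non-overlapping boxes get an empty intersection.
    inter_width = max(0.0, inter_x_max - inter_x_min)
    inter_height = max(0.0, inter_y_max - inter_y_min)
    inter_area = inter_width * inter_height

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area

    if union_area == 0:
        return 0.0  # both boxes are degenerate (zero area)
    return inter_area / union_area


predicted_box = (1.0, 1.0, 3.0, 3.0)
ground_truth_box = (0.0, 0.0, 2.0, 2.0)
iou = intersection_over_union(predicted_box, ground_truth_box)
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap at all. Evaluation protocols typically count a detection as correct when its IoU with the ground truth exceeds a chosen threshold.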
Generative Models
Generative computer vision involves the development of models that aim to generate new, realistic visual data. These models can learn underlying patterns and structures within a dataset and create unique instances based on them.
Generative computer vision has applications in various creative and practical domains, including:
Image Synthesis: These models are capable of creating realistic images that do not exist in the training dataset. This is often used for creative purposes, such as art generation or content creation.
Examples: StyleGAN, BigGAN, Pix2Pix
Image-to-Image Translation: Generative models can translate images from one domain to another. This includes tasks like converting satellite images to maps, turning black and white photos into color, or transforming a sketch into a realistic image.
Examples: Pix2Pix, CycleGAN, Stable Diffusion, DALL-E
Depth Estimation: Generative models can be trained to estimate the depth information of a scene from 2D images. This is valuable for applications such as autonomous navigation, augmented reality, and robotics.
Examples: MiDaS, Monodepth2
Audio Processing Models
Audio processing models are designed to analyze, understand, and generate audio data using artificial intelligence techniques. These models use machine learning algorithms, particularly deep learning, to perform various tasks related to audio processing.
Audio processing with AI has its roots in the mid-20th century, when basic signal processing was applied to tasks like speech recognition. Later in the 20th century, Hidden Markov Models (HMMs) made probabilistic modeling of temporal patterns in audio practical. Deep learning then brought breakthroughs in Automatic Speech Recognition (ASR) through models like Listen, Attend and Spell (LAS). Recent advances include the integration of AI into voice assistants, smart speakers, and other multimodal systems.
Discriminative
Discriminative AI in audio models is used to discern and classify distinct patterns within audio data. These models leverage training data to learn the boundaries between different classes or categories, which helps them accurately categorize and discriminate between various sounds, voices, or environmental noises.
The following are some of the major applications of discriminative AI audio models:
Automatic Speech Recognition (ASR): ASR models convert spoken language into written text. Discriminative models in ASR play a pivotal role in enabling voice-controlled devices, transcription services, and voice interfaces by accurately transcribing spoken words into a textual format.
Example: DeepSpeech and Google Speech-to-Text API
Audio Classification: These models categorize audio data into predefined classes or labels. This is used in tasks ranging from environmental sound classification (identifying sirens and street noises) to content tagging in multimedia. It can also be used in security domains, for example on monitored prison phone lines, to alert guards in real time when illegal activity is being discussed.
Example: VGGish and YAMNet
Voice Activity Detection (VAD): VAD models distinguish audio segments containing speech from those containing only background noise or silence. Such models are employed in applications such as speech processing systems, audio surveillance, and telecommunications.
Example: OpenSMILE, VocatNet
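To give a feel for what voice activity detection does, the sketch below uses a much simpler technique than the learned models above: it splits the signal into frames and flags a frame as active when its average energy exceeds a threshold. This is an energy-based baseline, not how any of the listed tools work internally. The function names, the frame size of 160 samples, the threshold of 0.01, and the synthetic test signal (a sine tone standing in for speech) are all assumptions made for the example.

```python
import math
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(sample * sample for sample in frame) / len(frame)


def detect_voice_activity(
    samples: List[float], frame_size: int = 160, threshold: float = 0.01
) -> List[bool]:
    """Return one flag per full frame: True if its energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_energy(frame) > threshold)
    return flags


# Synthetic signal at 8 kHz: two silent frames, two frames of a 440 Hz tone
# (standing in for speech), then one more silent frame.
sample_rate = 8000
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(320)]
signal = silence + tone + [0.0] * 160

activity = detect_voice_activity(signal)
```

A fixed energy threshold breaks down quickly in noisy recordings, which is one reason learned discriminative models are preferred in practice.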
Generative
Generative audio AI models represent a fascinating side of AI. They offer the ability to create, synthesize, and reproduce audio content based on learned patterns. Their capacity to understand and replicate diverse auditory experiences makes them a powerful tool for advancing audio synthesis and manipulation.
These models typically employ architectures like recurrent neural networks (RNNs), generative adversarial networks (GANs), or transformers. Generative audio models learn to capture the intricate details of various soundscapes, opening doors to applications like text-to-speech synthesis, music composition, and creative audio generation. Some of the key examples of generative AI audio models are:
Text-to-Speech (TTS): TTS applications utilize generative models to convert written text into natural-sounding speech. This is widely used in voice assistants, accessibility tools for visually impaired individuals, audiobook production, and automated customer service systems.
Example: WaveNet, FastSpeech, Deep Voice
Text to Audio: There are two types of models here. Text-to-audio synthesis models generate audio representations directly from textual input, which is valuable in scenarios requiring expressive and contextually relevant audio content. Creative text-to-audio generators, by contrast, produce diverse audio outputs based on textual prompts.
Example: Tacotron 2, GPT3
Audio to Audio: These AI models take existing audio inputs and produce new audio outputs based on them. This can lead to the creation of innovative and experimental auditory experiences.
Example: DeepDream
NLP Models
Natural Language Processing (NLP) focuses on enabling computers to understand, interpret, and generate human language. It uses a mix of rule-based linguistic modeling, statistics, machine learning, and deep learning.
The goal of NLP is to enable computers to grasp the full meaning, intent, and sentiment behind human language. NLP powers various applications like language translation, voice-activated systems, and quick text summarization.
The history of NLP can be divided into two periods: before and during the deep learning era. In the pre-deep learning phase, from the late 1940s to the 1980s, early efforts in machine translation and symbolic approaches dominated, and prototype systems like ELIZA and SHRDLU emerged. The 1980s brought increasingly complex rule-based systems. The late 1980s and early 1990s then witnessed a revolution, with statistical models replacing rule-based systems and marking a shift toward machine learning algorithms and probabilistic decision-making.
In the deep learning era, neural networks took over. A turning point came in 2008 with a multi-task learning approach that applied CNNs to several NLP tasks at once. Word2Vec, presented by Mikolov et al. in 2013, transformed word embeddings. That year also marked the wider adoption of neural architectures such as RNNs, LSTMs, CNNs, and recursive neural networks. Key developments included sequence-to-sequence learning in 2014 and attention mechanisms in 2015. The latest breakthrough involves large pre-trained language models, which show remarkable performance across various tasks.
The more recent history of NLP is marked by the use and rising popularity of models like BERT, GPT, and Llama.
Discriminative
Discriminative NLP is all about making distinctions between different types of data, such as classifying emails as spam or not spam, or figuring out the sentiment behind tweets. Discriminative models learn the boundaries that separate one category from another: they look at patterns in the data and draw the lines or planes that decide which category something falls into. Some use cases of discriminative AI in NLP are:
Text Classification: Discriminative models excel in categorizing text into predefined classes or labels. Applications include sentiment analysis, spam detection, and topic categorization.
Example: BERT, GPT-3, RNN, LSTM, GRU
Token Classification: Discriminative models are employed for tagging individual tokens in a sequence, useful in named entity recognition (NER), part-of-speech tagging, or other token-level annotations.
Example: Flair, BERT, RNN, LSTM, GRU, Attention-based Networks
Sentence Similarity: Discriminative models can determine the similarity between sentences, aiding in tasks like duplicate detection, paraphrase identification, or document matching.
Example: CLIP, Embeddings
Named Entity Recognition (NER): This involves identifying and classifying entities (such as names of people, organizations, and locations) within the text, crucial in information extraction and search.
Example: spaCy's NER component, Stanford NER, BERT, Flair
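Of the tasks above, sentence similarity is the easiest to sketch in a few lines. The example below represents each sentence as a bag of word counts and compares the two with cosine similarity. This is a simple count-based stand-in: the models listed above use learned embeddings instead of raw word counts, although the final comparison step is often the same cosine measure. The function names and example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())


def cosine_similarity(vec_a: Counter, vec_b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    shared_words = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[word] * vec_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


def sentence_similarity(sentence_a: str, sentence_b: str) -> float:
    return cosine_similarity(bag_of_words(sentence_a), bag_of_words(sentence_b))


close_pair = sentence_similarity("the cat sat", "the cat ran")
unrelated_pair = sentence_similarity("the cat sat on the mat", "a dog barked loudly")
```

Word counts cannot tell that "car" and "automobile" mean the same thing, which is exactly the gap that learned embeddings are meant to close.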
Generative
Instead of just sorting things out, generative NLP tries to understand the structure of the text it is given and then generates new text. Think of translating between languages or a chatbot that talks like a real person.
Generative models use techniques like statistical modeling and neural networks, especially attention-based transformers. They learn the probabilities of words and phrases occurring together, building up patterns that let them produce new text that is fluent and plausible.
Well-known transformer-based language models include the GPT series (GPT-1 through GPT-4), Llama, BERT, and Grok. Previously, RNNs, LSTMs, and GRUs were the go-to architectures for generative AI, but then attention-based transformers came along and changed everything.
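The idea of learning which words tend to occur together can be shown with the simplest possible statistical language model: a bigram model that records which word follows which, and then samples from those observations to produce new text. This sketch is nowhere near a transformer and is only meant to make the "learn probabilities, then generate" loop tangible. The corpus, the function names, and the fixed random seed are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List


def train_bigram_model(corpus: str) -> Dict[str, List[str]]:
    """Map each word to the list of words observed directly after it."""
    words = corpus.split()
    followers = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)
    return dict(followers)


def generate_text(
    model: Dict[str, List[str]], start: str, max_words: int, seed: int = 0
) -> str:
    """Generate up to max_words words by repeatedly sampling a follower word."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    output = [start]
    # Stop early if the last word was never followed by anything in the corpus.
    while len(output) < max_words and output[-1] in model:
        # Repeated followers appear multiple times in the list, so choice()
        # samples each next word in proportion to how often it was observed.
        output.append(rng.choice(model[output[-1]]))
    return " ".join(output)


corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigram_model(corpus)
sample = generate_text(model, start="the", max_words=8)
```

Every pair of neighbouring words in the generated sample was seen in the corpus, yet the sentence as a whole may be new. Modern generative models follow the same principle, but condition each word on a far longer context than just the previous one.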
Here are key examples of generative AI applications:
Text Generation: Generative models are used to create human-like text for purposes such as creative writing, blog posts and articles, and conversational agents like chatbots.
Example: Llama, GPT, Grok, BERT
Translation: Generative models like the ones based on transformers have been instrumental in improving the accuracy and fluency of machine translation systems.
Example: T5 by Google, BERT, GPT, Llama
Summarization: Generative models can generate abstractive summaries of documents, capturing the essential information while maintaining coherence.
Example: BART, Pegasus, GPT
Code Generation: Generative models can generate code snippets based on natural language descriptions, facilitating programming for non-experts.
Example: Codium, OpenAI Codex (GitHub Copilot), TabNine, DeepCode, PROSE
Future of AI models
Given the recent run of AI-related events and advancements, it seems safe to say that AI models will get better at understanding and solving problems across different areas. Contrary to common belief, we think AI is likely to work more and more closely with humans to enhance work processes rather than entirely replace them. Models will become more specialized for specific tasks and industries, improving accuracy and efficiency. In particular, smaller models will keep getting better at a broader range of tasks, making AI less compute-heavy. We have seen how open-source models have made AI more accessible to everyone; with time, it will become as easy to deploy and use AI models as any other application. Last but not least, data security and privacy concerns are rising, which means more and more companies will prefer to self-host their AI models.
Wrapping Up
The artificial intelligence revolution is already here, and it is not restricted to LLMs. There are many sophisticated, well-performing AI models in the computer vision and audio processing domains too. We tried to cover the history of how the different fields of artificial intelligence developed, and we listed discriminative and generative AI models for NLP, computer vision, and audio processing. If you happen to try any of these models, we would be thrilled to hear about your experience with them.