An Epistemic Evaluation on the Transformer Architecture for Machine Learning

Morgan McCarty
06 December 2023

Introduction

In the last year the impact of the transformer has been felt throughout the world. Models based on its architecture have become prevalent in the everyday lives of almost all technologically connected people. ChatGPT (Generative Pre-trained Transformer) took the world by surprise when a machine learning model suddenly appeared able to communicate effectively about almost anything, very quickly and with high accuracy (to the extent that some wanted to claim it was the first Artificial General Intelligence).

This sudden boom has its origins in the 2017 paper Attention is All You Need by Vaswani et al. (2017), which introduced a model (the transformer) that could be trained much faster than any previous model (and therefore take in much more information) while still performing at the same capacity or better.

The Annotated Transformer provides a very good walkthrough of the paper, with code examples at each logical point.

Scope

Who Uses the Transformer

Even before ChatGPT, the transformer was the dominant model architecture for natural language processing. Vaswani et al. showed in their original publication that the model reached state-of-the-art performance on translation tasks (which are commonly used to measure performance in natural language processing (Blagec et al. 2022)) with training times a small fraction of those of the existing best models (e.g. Long Short-Term Memory or Convolutional Sequence-to-Sequence). This led to the transformer becoming the most commonly used model for most tasks in natural language processing (as well as in some other fields, though with significantly less prevalence - e.g. computer vision, in contrast to Convolutional Neural Networks, which are the dominant model there). For the most part, the primary users of transformer models were machine learning engineers, as very few commercial products implemented them (GPT was not yet a household name).

In late 2022 this changed significantly. ChatGPT was released and built a user base of approximately 100 million people within two months (Hu 2023). Suddenly millions of people were interacting daily with a chatbot whose underlying architecture is the transformer. N.B. GPT-3.5 and GPT-4 use reinforcement learning from human feedback to help alleviate some of the drawbacks of the transformer (namely hallucination), which is beyond the scope of an analysis of just the transformer (GPT-4 can additionally communicate with Bing).

What Knowledge Can Be Produced by a Transformer?

Transformers are capable of taking input and training on any data that can be constructed as a sequence, though they were originally used with text (Vaswani, et al. 2017). As such, the modality of the training data (i.e. image -> image, text -> text, etc.) heavily influences the type of knowledge that can be produced by the model. However, emergent properties extending beyond the original modality of the training data (e.g. text -> text models working with image -> text) have been discovered in many Large Language Models (LLMs), leading to the creation of more advanced Multimodal LLMs (Yin, et al. 2023). As such, the potential knowledge generated by a transformer is virtually infinite, limited only by the data provided to the model (though data synthesis is also possible) and the amount of time spent training it.
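As a loose illustration of what "constructed as a sequence" means, the sketch below (in Python, with a hypothetical toy vocabulary; to_sequence is an invented helper, not any real tokenizer) shows how text becomes the ordered list of integer ids a transformer actually consumes; other modalities, such as image patches, are serialized analogously.

```python
# A minimal, hypothetical sketch of sequence construction: whatever the
# original modality, the model only ever sees an ordered list of integer ids.

def to_sequence(text: str, vocab: dict) -> list:
    """Map whitespace-separated tokens to integer ids; unknown words get id 0."""
    return [vocab.get(token, 0) for token in text.lower().split()]

vocab = {"the": 1, "cat": 2, "sat": 3}    # toy vocabulary for illustration only
print(to_sequence("The cat sat", vocab))  # [1, 2, 3]
```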

As an example: ChatGPT generates encyclopedic knowledge based on an image or text input prompt. Alternatively, GPT can be fine-tuned to produce knowledge (or other content) specific to a certain domain with higher accuracy and/or specificity than the base model, but with a loss of generality: e.g. superhero descriptions (Caelen 2023). Translation is another common task for fine-tuning.

Epistemic Evaluation Using Goldman’s Objectives

This epistemic evaluation will use the values described by Alvin Goldman in his Epistemology and Cognition (1986).

These are:

  1. Power: the capability (and extent of that capability) of a technology to produce many true beliefs.
  2. Reliability: the ratio of true beliefs acquired to false beliefs acquired.
  3. Speed: the rate at which true beliefs can be acquired.
  4. Fecundity: the accessibility for people to acquire true beliefs.
  5. Efficiency: the cost associated with acquiring true beliefs.

A true belief is a piece of verifiable propositional knowledge (i.e. a fact) which someone believes. Additionally, the acquisition of true beliefs relates to knowledge being provided by some source to someone who then believes in that piece of knowledge.

A false belief is a piece of verifiably incorrect propositional knowledge which someone believes.

This evaluation will use examples of different pre-trained and fine-tuned transformers and then connect back to how they compare to the underlying model (i.e. does one model performing a specific way imply that another will also perform that way [and have the same epistemic consequences]). This is because the underlying model has no epistemic capabilities without training: it cannot provide knowledge without having first been provided knowledge.

Power

When it comes to a transformer, the number of true beliefs produced is virtually infinite due to several factors. Transformers, once trained, produce outputs dependent upon some encoded input (for instance taking a sentence, turning it into a continuous representation rather than a discrete sequence, and then predicting the next sentence in a paragraph) (Vaswani, et al. 2017). This is a stochastic process, as there are infinite possible outputs (assuming a continuous modality) for any given input. The model predicts the best output given some random variation of an initial state (e.g. for factual information there are infinite possible answers but only one correct answer; if the model has been trained on a good dataset then the correct answer is the most likely output).
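To make the stochastic nature of this process concrete, here is a minimal sketch (the logits are invented, standing in for scores a trained model might assign to candidate next tokens) of sampling from a softmax distribution: the correct answer is only the most probable output, never a guaranteed one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Draw one token id from the softmax distribution over the logits."""
    scaled = logits / temperature          # lower temperature sharpens the distribution
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical scores for three candidate answers, e.g. ["1769", "1770", "2013"]
logits = np.array([4.0, 1.0, 0.5])
print(sample_next_token(logits))  # usually 0 (the best answer), occasionally not
```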

However, there is one heavy drawback to the naked architecture. The architecture, by itself, has no knowledge of its own and cannot synthesize any information (as it has nothing to synthesize). This means that any knowledge produced by the model must have been seeded by the training data. Advantages in efficiency over previous models (like the Long Short-Term Memory model) allow the transformer to ingest so much data in a reasonable amount of time that this drawback is heavily reduced (see Efficiency for more).

Reliability

"Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important." (OpenAI 2023)

Transformers are problematic in terms of reliability - not because there is an intent to deceive, but because of a lack of data quality. Internet data is not entirely accurate, and misinformation and false beliefs are prevalent throughout the web. However, through the use of web scraping techniques, vast amounts of data can be acquired very quickly (for instance, the entire corpus of Wikipedia for factual information, or forum posts on Reddit for conversational information). GPT-4 was trained, to some extent, using internet data (OpenAI, in contrast to its name, does not wish to publish the exact training methods) (OpenAI 2023). When the model synthesizes information incorrectly, it will confidently output incorrect information. This issue is so pronounced that it has been given a name: "hallucination" (OpenAI 2023).

Just as the capacity for a transformer to generate true beliefs is infinite, so is its capacity to generate false beliefs. The training data is the most important predictor of the model's output and, therefore, the more flawed it is (or really if it is flawed at all), the more flawed the output will be. Moreover, even a flawless training dataset will likely still lead to hallucination. Transformers are built around self-attention and fully connected feed-forward layers, so every token can draw on every other token in the input (Vaswani, et al. 2017). This speeds up training and allows the network to consider all positions at once (order information is supplied separately, via positional encodings), but it also enables and improves information synthesis. Synthesis can be used (and is often essential) for information that is not given directly but must be inferred from context.

If a transformer learns that the question "When was Napoleon born?" predicts a numerical year answer, but does not have the exact year, it may output something nonsensical like "2013", as that output was predicted by the weights (and is likely statistically similar to all other years if context for the time Napoleon lived is not available in the dataset). Napoleon is not the best example, as large amounts of data about him exist, but a model which always forces an output will hallucinate data for people about whom information is not readily accessible. ChatGPT (Plus) has methods to mitigate this, namely searching Bing when it does not have a high-confidence answer, but in the absence of external tools built to improve these models, hallucination remains a real risk (and GPT-4 still hallucinates; it is just better at noticing when it does than other models) (OpenAI 2023).

A real-world example of ChatGPT hallucinating information comes from a team at NBC New York who asked the model to produce a news article about Michael Bloomberg's activities after his mayoral terms. The model created an article containing a quote from Bloomberg of which no record exists (Glorioso 2023).

Speed

Speed, in many ways, is the greatest advantage of the transformer. Encoding and decoding, once the model is trained, take virtually no time (relative to training time) for most models. Training is the largest bottleneck, and it is where the transformer was able to outshine all of its predecessors.

Before Attention is All You Need, the state-of-the-art model was the Convolutional Sequence-to-Sequence (ConvS2S) model (Gehring, et al. 2017). This model's overall results on translation benchmarks were relatively similar to what the transformer achieved, but its training time was an order of magnitude slower. The transformer was trained on 8 GPUs over 3.5 days (28 GPU-days), while the ensemble (the best-performing) ConvS2S model was trained on 8 GPUs over 37 days (296 GPU-days, roughly ten times the compute). Again, as encoding and decoding take at most a few minutes each once the model has been trained, training time is thousands of times more impactful than use time: beliefs of any kind cannot be acquired until the model is usable.

Speed (as well as power) is also impacted by a model's generalizability to a task. The two-step approach of the transformer's training process, pre-training and fine-tuning, allows one pre-trained model to be used for countless tasks after fine-tuning. An example of this is BERT (Devlin, et al. 2019). BERT is a pre-trained encoder designed for several distinct language tasks.

Some of the tasks are:

  1. Multi-Genre Natural Language Inference: a classification task where the model must determine whether, for two sentences, there is a contradiction, an entailment, or whether they are neutral with respect to each other.
  2. Quora Question Pairs: a task where two questions are given and the model must predict whether the questions are equivalent.
  3. Microsoft Research Paraphrase Corpus: sentence pairs extracted from news sources, where the goal is to determine whether the pairs are semantically equivalent.

The exact details of these tasks are unimportant except that they are noticeably different from each other. Through fine-tuning, BERT is able to perform at a state-of-the-art level on each after only an hour of training (after pre-training for 4 days).

Pre-trained transformers like GPT or BERT are extremely beneficial with respect to speed because they are very general and can be applied to any task for which data can be found.
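As a hedged sketch of this pre-train/fine-tune pattern (assuming the Hugging Face transformers library; the two-example "dataset" and label scheme are illustrative only, and a real run would use proper batching and evaluation), fine-tuning BERT for a paraphrase task might look like:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT checkpoint and attach a fresh two-class head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Toy stand-in data: 1 = paraphrase, 0 = not (as in tasks like QQP or MRPC).
texts = ["Is this question a duplicate?", "These two sentences are unrelated."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps stand in for the hour of fine-tuning
    outputs = model(**inputs, labels=labels)  # the model computes its own loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The point is that the expensive pre-training is reused as-is; fine-tuning only nudges the existing weights and trains a small new head.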

Fecundity

The transformer, as an architecture, is accessible to anyone who wishes to train a model. The first bottleneck is finding data that can support the desired task; acquiring that data may be illegal in some places, considered unethical (Krotov, Silva 2018), or simply too voluminous to reasonably store and/or train on in a realistic amount of time.

Additionally, even though transformers can be trained extremely fast (relative to the models that existed before), pre-training models on large corpora to state-of-the-art levels costs many millions of dollars in compute. The CEO of OpenAI stated that training GPT-4 cost over $100 million (Knight 2023). This means that the constraints OpenAI has applied to GPT-4 (ethical or otherwise) apply to everyone who is unwilling to spend hundreds of millions of dollars (in the absence of a cheaper method being created).

This cost to train means that there is an accessibility gap for the best models (which are the most epistemically powerful and reliable). OpenAI charges $20 per month for Plus access to ChatGPT using GPT-4, meaning anyone who cannot afford that payment is forced to use a less reliable version (GPT-3.5). Additionally, even for those who do pay for premium access, OpenAI caps usage of the model beyond an unspecified limit.

Notwithstanding the cost of training, there is also a specialization requirement for those who wish to create their own model. A great deal of knowledge is required to understand how a transformer works, how to implement one, how to acquire the data required for training, and how to train it.

Overall, transformers are not very accessible to a wide swath of people, as there is a massive price and knowledge requirement built into their creation. Ironically, OpenAI has closed off accessibility to transformers through its unmatched capability to train them.

Efficiency

To avoid repeating what was covered in previous sections regarding cost of access and cost of speed, this section will primarily focus on improvements over prior architectures.

The Long Short-Term Memory (LSTM) architecture (Hochreiter, Schmidhuber 1997) was created in response to the vanishing gradient problem (Bengio, et al. 1994). This problem occurs in models which pass error information backwards through the network in order to update the weights, a process known as backpropagation (Linnainmaa 1976). The layers closest to the output receive most of the gradient signal, but that signal quickly shrinks toward zero as it propagates backwards, preventing the earlier layers from updating in response to new training data. This is epistemically harmful, as it means the model cannot learn information effectively and will therefore generate fewer true beliefs (reducing its power) while requiring significantly more training data and time (and therefore money) to produce strong results.
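A minimal numerical sketch of this (assuming a toy chain of single sigmoid units with made-up weights, not a real network) shows why: backpropagation multiplies one local derivative per layer, and since a sigmoid's derivative is at most 0.25, the update signal reaching the earliest layers decays geometrically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x, grad = 0.5, 1.0
for layer in range(1, 51):
    w = rng.normal(scale=0.5)   # hypothetical weight for this layer
    x = sigmoid(w * x)
    grad *= w * x * (1 - x)     # chain rule through one sigmoid unit
    if layer % 10 == 0:
        print(f"after {layer} layers, gradient magnitude ~ {abs(grad):.2e}")
```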

LSTMs (along with GRUs, a similar architecture) then became the leading model architecture (alongside the Convolutional Neural Network), and subsequent models were similar forms of recurrent neural networks. They are very expensive to train, however, and remained impractical for large-scale models for years, until compute power progressed.

Recurrent neural networks, however, are all significantly slower to train than feed-forward neural networks, as they can only process one token of input at a time rather than the entire sequence in parallel (like a transformer).

The change to a fully connected feed-forward network with an attention mechanism (allowing every token direct access to every other token in the sequence) (Vaswani, et al. 2017) massively increased the speed at which models could be trained, thereby reducing the number of GPUs required and significantly reducing the time needed to train a model of the same size as prior models (OpenAI's latest models used more compute time, but are significantly larger). As such, the transformer is very efficient with respect to the prior state of the art.
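The mechanism itself is compact. A minimal numpy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, with toy dimensions, shows how every token is compared with every other token in a single matrix product rather than one step at a time:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # all pairwise token comparisons at once
    return softmax(scores) @ V       # weighted sum over every position in parallel

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, model dimension 8 (toy sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (4, 8): one updated vector per token
```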

Conclusion

Transformers, as a model, are very powerful. They can produce many true beliefs, provided they have been trained on a good dataset. However, the limitations of being (at its core) a statistical model mean that many false beliefs are also created, making them very unreliable.

They are fast at encoding and decoding once trained, and they made massive improvements in training time over previous architectures. Transformers have made great strides in being more efficient than their predecessors in both cost and time.

Corporate control of the largest models has massively reduced the overall accessibility of transformers. Data is also becoming increasingly hard to acquire due to regulation.

Is the Transformer Good?

Quantifying the impact of an architecture is very difficult, as it is really just a skeleton without any supporting systems. However, it is possible to look at the state-of-the-art model implementations and see how they have affected the world.

ChatGPT is the clearest real-world example of the impact of transformers. Freelancers have experienced decreases in work since the introduction of the app (Zinkula 2023). The Writers Guild of America made generative AI a major point of contention during their strike (Coyle 2023). Many people are starting to see impacts of large language models that could not have been predicted before the end of last year.

On the other hand, ChatGPT has been a valuable tool for many people. It can be a powerful aid for critical thinking about already-familiar topics and can provide a useful point of comparison for human writing (Abramson 2023).

All things considered, the impact of ChatGPT (and therefore the transformer) has largely been negative due to the actions of OpenAI. OpenAI is not a very open company, having abandoned its not-for-profit roots in favor of a structure in which the for-profit arm's controlling shareholder is its not-for-profit originator (AP 2023). OpenAI also does not publish significant information about its models, citing the competitive landscape and safety implications (OpenAI 2023). Not having this information means that OpenAI can control its models with whatever restrictions it pleases, even though the technology may one day be essential.

Overall the transformer is a neutral tool which is being used for harm by its most powerful operator.

Strengths and Weaknesses of Goldman's Objectives

In writing this evaluation of the transformer, several things became clear about Goldman's Objectives.

First, they do cover the vast majority of what is needed to assess the epistemic potential of a technology. Power, reliability, and fecundity are arguably the most essential when looking at consumer-facing applications. Speed and efficiency are most important when looking at the development of applications.

The largest weakness of Goldman's Objectives is that they are discrete. The five categories quickly become intermingled when certain points of interest are viewed. For example, in the case of the transformer, higher speed means higher efficiency, which in turn means higher power. While these three are not necessarily equivalent to each other, they have underlying factors which definitely are.

Overall, Goldman's Objectives provide a very strong starting point for an evaluation but, when applied too rigidly, they quickly become a way of measuring the same criteria again and again.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Blagec, K., Dorffner, G., Moradi, M., Ott, S., & Samwald, M. (2022). A global analysis of metrics used for measuring performance in natural language processing. arXiv preprint arXiv:2204.11574.

Hu, K. (2023). ChatGPT sets record for fastest-growing user base - analyst note. Reuters Technology.

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). A survey on multimodal large language models. arXiv preprint arXiv:2306.13549.

Caelen, O. (2023). Unleashing the Power of GPT-3: Fine-Tuning for Superhero Descriptions. Towards Data Science.

Goldman, A. I. (1986). Epistemology and cognition. Harvard University Press.

OpenAI (2023). GPT-4 Technical report. arXiv preprint arXiv:2303.08774.

Glorioso, C. (2023). Fake News? ChatGPT Has a Knack for Making Up Phony Anonymous Sources. NBC New York I-Team.

Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017, July). Convolutional sequence to sequence learning. In International conference on machine learning (pp. 1243-1252). PMLR.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019 (pp. 4171-4186).

Krotov, V., & Silva, L. (2018). Legality and ethics of web scraping.

Knight, W. (2023). OpenAI’s CEO Says the Age of Giant AI Models Is Already Over. Wired.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.

Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146-160.

Zinkula, J. (2023). ChatGPT is already stealing work from freelancers. Business Insider.

Coyle, J., & The Associated Press (2023). ChatGPT is the ‘terrifying’ subtext of the writers’ strike that is reshaping Hollywood. Fortune.

Abramson, A. (2023). How to use ChatGPT as a learning tool. American Psychological Association, 54(4).

The Associated Press (2023). OpenAI’s Unusual Nonprofit Structure Led to Dramatic Ouster of Sought-After CEO. U.S. News.