BERT: A Review of Transformer Architecture

Why BERT? A review of Transformer Architecture

Mar 2, 2020 Best Practices Digital Transformation

In 2017, natural language processing took a significant step forward with the introduction of the Transformer deep learning architecture. Since then, a steady stream of new models based on the Transformer has continued to push the state of the art in complex natural language tasks. Perhaps the best known of these is Google’s BERT model, released in late 2018. Google recently announced that it would integrate BERT into its search platform, with the promise that it would help the search engine discern the intent behind a query from its context.

Here at ThoughtTrace, we’ve spent the last six months researching whether or not a Transformer architecture like BERT would improve our product.  And in another month or so, we’ll be back to tell you what our decision was!   But for now, we’d like to share some of the thought processes that we, as a Data Science team, go through in evaluating the relevance and appropriateness of state-of-the-art research in our products.  Below you’ll see some of the factors that will ultimately contribute to our final decision. 

The Good: Increasing performance through contextual nuance 

Machine learning models struggle to learn directly from raw text. Instead, the text is first encoded as numbers, a format much more conducive to machine learning algorithms. The traditional encoding is simple: we count how many times each word in our vocabulary appears. So a phrase like “I left money at the bank but he stole it” would look like:


I  he  she  it  but  and  the  or  at  money  bank  stole  left
1  1   0    1   1    0    1    0   1   1      1     1      1
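The counting scheme described above can be sketched in a few lines of Python. This is only an illustration, not our production encoder: the function name `bag_of_words` is hypothetical, the vocabulary is the small one from the table, and the column for “I” is an assumption added so that every word of the example phrase has a place.

```python
from collections import Counter

# Vocabulary in the same order as the table above. The "i" column is an
# assumption added so that every word of the example phrase is counted.
VOCAB = ["i", "he", "she", "it", "but", "and", "the", "or",
         "at", "money", "bank", "stole", "left"]


def bag_of_words(text, vocab=VOCAB):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appeared in the text.
    return [counts[word] for word in vocab]


phrase = "I left money at the bank but he stole it"
vector = bag_of_words(phrase)
for word, count in zip(VOCAB, vector):
    print(f"{word:>6}  {count}")
```

Each position in `vector` records only how often a word occurred; nothing about where in the phrase it occurred is kept.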


But unfortunately, the same words in a different order would be encoded in exactly the same way! Take “he left money at the bank but I stole it”, for example. Clearly this approach has its shortcomings, as it completely ignores the order in which the words appear.

The deep learning movement has resurfaced some 30-year-old models that are much better at taking sequence into account, but one place where these deep learning approaches still fall short is in the ability to distinguish multiple meanings of an individual word. In our original example, the word “bank” could have two meanings: (1) the institution where we keep our money or (2) the side of a river. Admittedly, even a human reader would struggle to know which meaning to use in our original example. But as soon as we add a second sentence, it becomes clear:

“Malik and I spent the afternoon fishing in a beautiful river by the mountain.  I left money at the bank but he stole it.” 

While a typical deep learning model would be able to use the context of both sentences to formulate a representation that distinguishes them from sentences about a financial institution, it does so only at a high level: the final representation of the entire two-sentence sequence would capture some information about water, for example. But a Transformer model is able to make these distinctions at a much more granular level: the level of the word. It is able to store a different representation of the word “bank” itself each time it is used in a different context.
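The mechanism behind these per-word representations is self-attention: each word’s output vector is a weighted mix of the vectors of all the words around it. The toy sketch below is not BERT. It assumes random, untrained 4-dimensional embeddings, a single attention pass with no learned projections and no position information, and two made-up sentences. It only illustrates that the same input vector for “bank” comes out different in different contexts.

```python
import math
import random

random.seed(0)
DIM = 4  # toy embedding size, chosen arbitrarily


def random_vector():
    return [random.uniform(-1.0, 1.0) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens, embeddings):
    """One pass of scaled dot-product self-attention, using the raw
    embeddings as queries, keys, and values (no learned projections)."""
    vectors = [embeddings[t] for t in tokens]
    outputs = []
    for query in vectors:
        weights = softmax([dot(query, key) / math.sqrt(DIM) for key in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(DIM)]
        outputs.append(mixed)
    return outputs


# Two hypothetical contexts that both contain the word "bank".
river = "fishing in a river i left money at the bank".split()
finance = "the teller at the bank counted my money".split()

# One static (context-free) vector per word, shared by both sentences.
embeddings = {token: random_vector() for token in sorted(set(river + finance))}

bank_river = contextualize(river, embeddings)[river.index("bank")]
bank_finance = contextualize(finance, embeddings)[finance.index("bank")]
print("bank (river context):  ", bank_river)
print("bank (finance context):", bank_finance)
```

Both sentences start from the identical `embeddings["bank"]`, yet the two printed vectors differ because each one blends in its own neighbors. A real Transformer stacks many such layers and learns the weights from data.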

Given that our software primarily processes legal documents, we in the Data Science team are no strangers to textual complexity. And the traditional models (which take the absence or presence of particular words as a strong indicator of meaning) do extremely well. But it would be incredibly powerful if our models could better capture the small nuances of meaning that arise when words are used in different contexts.

Pretraining on an enormous corpus of data 

Because these Transformer-based models are often built by large tech companies like Google, Facebook, and Baidu, they can be trained on a volume of data that is prohibitive for the rest of us: training can cost thousands of dollars and take weeks. The resulting models contain information from a tremendous collection of text spanning many different domains and genres. In theory, this should make them more representative of general discourse than models trained on only a small subset of language.

This “corpus” of language usage becomes the backbone of the Transformer model, which can then be customized to fit any number of specific needs.

Ease of customizing the models for our domains 

One of the biggest advantages of deep learning architectures in general, including the Transformer, is that it is often easy to modify an existing model to better handle specific subject matter. In the case of Transformer models, we can start with the publicly released model (which is arguably generalized across all domains and subjects) and feed it data specifically from our domains of interest. The model will adjust to become more adept at the specific content that we care most about. So, for instance, we can build a model that has the benefits of being trained on all different forms of language while at the same time having a specific “niche” for handling oil and gas leases.

The Bad: Unifying our different levels of prediction into one model 

The Transformer architecture is flexible enough to be used in multiple settings to address many different problems.  However, we at ThoughtTrace have a unique challenge.  We offer information to our users about their documents on a number of different levels: sometimes about a whole paragraph, sometimes about just a sentence, and sometimes even about individual words or phrases.  And while this is tremendously beneficial to our users, it requires us to maintain multiple models that need to be reconciled in the final output. (Learn more about our model development process and how we can quickly build upon existing models so they will work out-of-the-box for novel use cases.)

This presents numerous complexities in architecting the model and data in a way that efficiently and accurately captures the relevant information. The level of back-end sophistication needed to implement any deep learning framework far exceeds that of a traditional machine learning approach. In addition, the precision and volume of labeled data required to do this effectively also pose unique challenges to our SME team, which is responsible for curating the data that feeds our models. Their ability to do this well is critical to our success and is a true differentiator for us as a company. Kudos to them!

Difficulty in inspecting model decisions

Traditional models (as described earlier in this post) are actually quite easy to interpret.  The model assigns the highest values to words that it should pay the most attention to when making its decisions.  This interpretability can oftentimes give humans more confidence in the model because they can understand why the model makes the decisions it does. 
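To make this concrete, here is a minimal sketch of how a linear word-count model can be inspected. The weights, the bias, and the function name `explain` are all invented for illustration and are not taken from any of our models. The point is only that each word’s contribution to the final score can be read off directly.

```python
from collections import Counter

# Hypothetical weights for a linear word-count classifier that flags
# lease-related text. These numbers are made up for illustration.
WEIGHTS = {"lease": 2.1, "royalty": 1.7, "lessee": 1.4, "the": 0.0, "river": -0.8}
BIAS = -1.0


def explain(text):
    """Return the model score and each known word's contribution,
    ranked from most to least influential."""
    counts = Counter(text.lower().split())
    contributions = {w: counts[w] * WEIGHTS[w] for w in counts if w in WEIGHTS}
    score = BIAS + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


score, ranked = explain("the lessee shall pay a royalty under the lease")
print(f"score = {score:.1f}")
for word, contribution in ranked:
    print(f"{word:>8}  {contribution:+.1f}")
```

A reviewer can see at a glance which words pushed the score up or down, which is exactly the kind of inspection that becomes harder once every decision depends on context.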

And while Transformer models give us significantly more context with which to make decisions, they are also significantly harder to interpret. We can no longer easily see which word(s) the model pays the most attention to, mainly because the model doesn’t look at words in isolation anymore: every decision is now made with a significant amount of context, and every small change to that context will affect the model’s decisions.

There is plenty of research from Microsoft, Google, and others right now on ways to make these models more interpretable.  One of the most popular approaches is to build a “proxy model” that (1) makes similar decisions to the real model but (2) is more easily interpreted.  And while some of these approaches have shown promise in some cases, there really is no substitute for an inherently interpretable model. 

The Ugly: Increased training and inference costs 

As discussed above, these Transformer models are huge, and training them is expensive and time-consuming. They require (1) more time to train, (2) more space to store, and (3) more powerful hardware to run properly.

Where we can train our models today on a single machine with moderate CPU and memory in minutes, a Transformer model will require multiple GPUs and hours or days to train. And when it comes time to use the model for prediction, we can use a moderate amount of CPU and memory to serve predictions across all our customers, while a Transformer model would require much more substantial resources. Cloud computing and the ubiquity of affordable hardware make this easier to handle than even five years ago, but it’s still a not-inconsequential increase in complexity and cost.

Our current model infrastructure also allows us to iterate quickly, training new models as often as we’d like and as soon as new data is available. But a move to the more complex Transformer architecture would inevitably slow down this process, as model retraining would no longer take hours but days. We’d have to make some hard decisions about how often is “often enough” for us to update our models.

It’s easy for Data Scientists like us to get excited about the state-of-the-art trends in our field because the gains being made sometimes really are astonishing. But at the same time, it’s important to evaluate those gains in the context of the service our software provides to our customers. We are obviously tempted by every incremental improvement that our field makes, but we also have to ask ourselves whether it is necessary for us to take advantage of each specific gain. Picking our moments in a rapidly developing field like ours can be tricky, but at least it’s never boring!
