Benchmarking

Data

All scores are computed using data collected by Fleksy. We start from a list of clean sentences, corrupt them by introducing artificially generated typos, and then measure how well these typos are corrected.
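As a rough illustration, the whole evaluation loop can be sketched as follows, where `introduce_typos` and `autocorrect` are hypothetical placeholders for the corruption step and the engine under test, not actual Fleksy APIs:

```python
# Minimal sketch of the benchmark loop, assuming hypothetical
# `introduce_typos` and `autocorrect` callables.
def run_benchmark(clean_sentences, introduce_typos, autocorrect):
    correct = 0
    for clean in clean_sentences:
        corrupted = introduce_typos(clean)  # add artificial typos
        corrected = autocorrect(corrupted)  # run the engine under test
        if corrected == clean:              # did we recover the original?
            correct += 1
    return correct / len(clean_sentences)
```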

Check out the blog post Ensuring high-quality auto-correction for 80+ languages for a thorough explanation of how we test the performance of our engine.

Datasets

For every language supported by Fleksy, we use three different datasets:

Formal: Data that comes from official, public datasets, such as Wikipedia dumps.

Conversational: Data scraped from curated online content, such as movie subtitles.

Chat: Data gathered directly from our beta-testers using the keyboard.

Because the sources used for each dataset are different, the resulting sentences have different properties (length, vocabulary, format, formality, etc.). Here are a few examples from each of these datasets:

Formal

With just a sliver of the vote return coming in after the polls closed at 8 p.m., Boxer already had a lead of nearly 230,000 votes.
Fellow White Sox reliever Scott Linebrink served as a sounding board for Jenks during the past season, when the two discussed Linebrink’s religious faith and what it did for his life.
And those increased reserves are incredibly profitable for ATP as they are going to be acquired very inexpensively and the only additional cost for ATP is drilling.
The New York campaign underscored the divided nature of the Republican Party as it tries to rebound from election losses in 2006 and 2008 that ousted the party from control of Congress and the White House.

Conversational

I never know how the day will turn out.
It does not think that it would have to him to be very been thankful?
The girl was beautiful.
Did you bring callie’s cards?

Chat

two foot spin with entrance
my second day aswell
hey. ur mom told ur tia that u guys are. coming this weekend.
Yılmaz just trying to hide money

For each dataset, we split the data into a train set and a test set, and we run the benchmark on the test set.

We use 20k sentences for each dataset, so across all datasets we have a test set of 60k sentences for each language. (For some languages the test set is smaller because the whole dataset is too small; in that case we still keep at least 90% of the data for the train set.)
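As an illustration, such a split can be sketched like this (the function name and the shuffling strategy are assumptions for the example, not Fleksy's actual pipeline):

```python
import random

def train_test_split(sentences, test_size=20_000, min_train_ratio=0.9):
    """Cap the test set at `test_size` sentences while always keeping
    at least `min_train_ratio` of the data for the train set."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    max_test = int(len(shuffled) * (1 - min_train_ratio))
    n_test = min(test_size, max_test)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```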

Artificial typos

To introduce typos into the clean text, we simulate the kinds of typos that a human typing on a mobile keyboard could make. This includes:

  • Character additions / deletions
  • Character transpositions
  • Accent simplifications
  • Case simplifications
  • Fat-finger syndrome (fuzzy typing)
  • Common typos (sampled from a dataset of most common typos for a given language)

We use the following typo rates:

  • Character transpositions: 1% of all characters
  • Character additions: 0.5% of all characters
  • Character deletions: 0.5% of all characters
  • Space deletions: 1% of all space characters
  • Symbol deletions: 10% of symbol characters
  • Accent simplification: 8% of accented characters
  • Case simplification: 8% of uppercased characters
  • Common typos: 5% of words

With these rates, we obtain an overall typo rate of 12%.

These rates come from studies of real human typing habits: Reference #1, Reference #2.

In particular, Reference #1 (which focuses on mobile device typing) shows that typing on mobile devices leads to 2.3% uncorrected errors (see the introduction) and 8% of words being autocorrected (see Intelligent text entry, page 8), for an overall typo rate of 10.3%.
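To make these rates concrete, here is a minimal Python sketch of two of the generators listed above (character transpositions at 1% and character deletions at 0.5%). The actual engine applies all of the generators, so this is an illustration rather than our implementation:

```python
import random

def transpose_chars(text, rate=0.01):
    """Swap adjacent characters with probability `rate` per character."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def delete_chars(text, rate=0.005):
    """Drop each character with probability `rate`."""
    return "".join(c for c in text if random.random() >= rate)

def corrupt(sentence):
    return delete_chars(transpose_chars(sentence))
```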


Here are a few examples of sentences before and after introducing typos:

| Clean sentence | Corrupted sentence | Typos introduced |
| --- | --- | --- |
| He went hiking and said he’d think about it; never came back. | He went hikimg and said hed think about it; never came back. | Fuzzy typing, symbol deletion |
| Like, what you’re doing here and what all this stuff is. | Like, what you’re doinghere and waht all this stuff is. | Space deletion, character transposition |
| You must do something about yourself. | You must do something about yourself. | (none) |
| That’s the way to get rid of pests like that. | That’s the waj to get rid of pedts like thhat. | Common typo, fuzzy typing, character addition |
| He obviously wanted an ally. | he obviously wanted an ally. | Case simplification |
| This is all we got between us and the Almighty! | This is lal we got beween us and the Almgihty! | 2 × character transposition, character deletion |

Swipe gesture generation

For the swipe gesture resolution task, the input is not plain text: we need to generate a swipe gesture.

When generating fuzzy-typing typos, we sample key-tap positions on the keyboard using Gaussian distributions, and use these positions to determine whether the correct character was typed or a neighboring key was hit instead.
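Conceptually, this can be sketched as follows (the key coordinates and the noise level `sigma` are made-up values for illustration):

```python
import math
import random

# Hypothetical key layout: key -> center coordinates, keys ~1.0 unit apart.
KEY_CENTERS = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0)}  # etc.

def sample_tap(intended_key, sigma=0.3):
    """Sample a tap position around the intended key's center."""
    cx, cy = KEY_CENTERS[intended_key]
    return random.gauss(cx, sigma), random.gauss(cy, sigma)

def registered_key(tap):
    """The key actually typed is the one whose center is nearest the tap."""
    return min(KEY_CENTERS, key=lambda k: math.dist(tap, KEY_CENTERS[k]))
```

If the sampled tap lands closer to a neighboring key than to the intended one, the simulation registers a fat-finger typo.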

To generate the swipe gesture, we sample key-tap positions as for fuzzy typing, then link the successive keystrokes of the word using Bézier curves. Some randomness in speed and acceleration between points is added in order to generate more natural swipe gestures.
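Here is a simplified sketch of the gesture construction, linking consecutive taps with quadratic Bézier segments; the randomly nudged control point only hints at the speed and acceleration randomness the real generator adds:

```python
import random

def quadratic_bezier(p0, p1, p2, steps=20):
    """Points along a quadratic Bézier curve from p0 to p2 via control p1."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
        y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
        points.append((x, y))
    return points

def swipe_path(taps):
    """Link consecutive key taps into one gesture, bending each segment
    through a randomly nudged midpoint for a more natural-looking curve."""
    path = []
    for a, b in zip(taps, taps[1:]):
        mid = ((a[0] + b[0]) / 2 + random.gauss(0, 0.2),
               (a[1] + b[1]) / 2 + random.gauss(0, 0.2))
        path += quadratic_bezier(a, mid, b)
    return path
```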

Here are some examples of generated swipe gestures (in red, the keystrokes generated by fuzzy typing; in blue, the points of the corresponding swipe gesture).

For the word gives:

[Figure: three generated swipe gestures for the word “gives”]

For the word they:

[Figure: three generated swipe gestures for the word “they”]

Metrics

Formulas

In this section we explain how each metric is computed. Read on if you want to understand the details of each metric, or skip to the next section, Understanding the metrics, if you just want to interpret the numbers.

Next-word prediction, swipe resolution, auto-completion

For these three tasks, the metric used is Accuracy.

The formula is: accuracy = correct / total

Where correct is the number of correct predictions, and total the total number of predictions.


For the next-word prediction and auto-completion tasks, we use top-3 accuracy. It’s the same as accuracy, but instead of considering a single candidate (which is either correct or not), we consider the 3 most probable candidates (a prediction counts as correct if any of these 3 candidates matches the expected word).

The reason is that next-word and auto-completion predictions are not “forced” upon the user: 3 predictions are displayed at the top of the keyboard, and the user can choose any of them. So the correct prediction only needs to appear among these 3 displayed predictions.

For swipe resolution, however, only the best prediction is selected and applied, so we use plain accuracy (not top-3 accuracy).
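Both accuracy and top-3 accuracy reduce to the same computation; here is a minimal sketch (illustrative, not our production code):

```python
def top_k_accuracy(predictions, targets, k=3):
    """`predictions` holds one candidate list per example (most probable
    first); a prediction is correct if the target is among its top k."""
    correct = sum(target in candidates[:k]
                  for candidates, target in zip(predictions, targets))
    return correct / len(targets)

# Swipe resolution uses plain accuracy, i.e. the k=1 special case:
# top_k_accuracy(predictions, targets, k=1)
```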


Auto-correction

For auto-correction, it’s different: we have a notion of true/false positives and negatives. Let’s first define these notions:

  • True Negative: no typo introduced, and the model doesn’t correct anything
  • False Positive: no typo introduced, but the model (wrongly) corrects the word
  • True Positive: a typo is introduced, and the model corrects the word into the expected word
  • False Negative: a typo is introduced, but the model doesn’t correct anything

With an example it’s easier to visualize:

| | Word typed by the user | Word after correction by the model | Expected word |
| --- | --- | --- | --- |
| True Negative | love | love | love |
| False Positive | love | loev | love |
| True Positive | loev | love | love |
| False Negative | loev | loev | love |

From these notions, we can compute the following metrics: accuracy, precision, recall, F-score, with the following formulas:

accuracy = (tp + tn) / (tp + tn + fp + fn)

precision = tp / (tp + fp)

recall = tp / (tp + fn)

f_score = 2 * (precision * recall) / (precision + recall)

Note : F-score is the harmonic mean of precision and recall. It’s a way to gather both precision and recall in a single metric.

Actually, we use the Fβ-score, a variant of the F-score where a constant β weights the precision/recall trade-off (see the Wikipedia page about F-score):

f_beta_score = (1 + β²) * (precision * recall) / (β² * precision + recall)

This is useful because we value precision more: an unwanted correction is more disruptive to the user than a missed typo.

We currently use β = 0.75, which gives precision more weight than recall.
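For reference, here is a minimal sketch of the Fβ computation from raw counts (the counts in the example are illustrative):

```python
def f_beta_score(tp, fp, fn, beta=0.75):
    """Fβ-score from raw counts; beta < 1 shifts weight toward precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))

# Example: 640 typos corrected, 170 false corrections, 360 missed typos
# -> precision ≈ 0.79, recall = 0.64, Fβ ≈ 0.73 with β = 0.75.
print(f_beta_score(tp=640, fp=170, fn=360))
```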

Understanding the metrics

Swipe resolution

Accuracy - [0 - 1] - higher is better

Accuracy is straightforward: this is the ratio of correct predictions.

So an accuracy of 0.8 means the model correctly predicted the word being swiped 80% of the time.

Next-word prediction & auto-completion

Top-3 accuracy - [0 - 1] - higher is better

Same as accuracy, but 3 candidates are considered.

So a top-3 accuracy of 0.6 means that the next word (or the word completion) appears among the 3 candidates predicted by the model 60% of the time.

Auto-correction

Precision - [0 - 1] - higher is better

Precision is the proportion of the model’s corrections that were actual typos.

So a precision of 0.7 means that among all corrections made by the model, 70% were actually typos (and 30% were correct words that didn’t need to be corrected).

A low precision means many words are corrected when they should not be; a high precision means only actual typos are corrected.


Recall - [0 - 1] - higher is better

Recall is the proportion of typos that the model detects and corrects.

So a recall of 0.65 means that the model correctly corrected 65% of typos (and 35% of typos were left uncorrected).

A low recall is a sign that most typos are not detected; a high recall means most typos are caught.


F-score - [0 - 1] - higher is better

F-score is the harmonic mean of precision and recall; it’s a way to combine precision and recall into a single metric.

Note that we weight precision more than recall (we use the Fβ-score with β = 0.75, as explained above).

Competitors

To validate our performance against direct competitors, we use consistent metrics and the standardized evaluation system explained in the previous sections. Below are some of the results obtained in comparison with our direct competitors.

Gboard Metrics

Language compared: en-US
Date: Jun 14, 2023

| Task | Metric | Score |
| --- | --- | --- |
| Auto-correction | Precision | 0.79 |
| | Recall | 0.64 |
| | F-score | 0.73 |
| Auto-completion | Top-3 accuracy | 0.69 |
| Next-word prediction | Top-3 accuracy | 0.28 |

Test set: 300 sentences


Fleksy Metrics

Date: Nov 9, 2023

Dictionary: en-US 6000.15
Virtual Keyboard SDK version: iOS v4.17.1 / Android v4.5.1

| Task | Metric | Score |
| --- | --- | --- |
| Auto-correction | Precision | 0.71 |
| | Recall | 0.60 |
| | F-score | 0.67 |
| Auto-completion | Top-3 accuracy | 0.64 |
| Next-word prediction | Top-3 accuracy | 0.21 |
| Swipe resolution | Accuracy | 0.95 |

See KeyboardSDK for more information on English (en-US) accuracy and performance.



If something needs to be added or if you find an error in our documentation, please let us know either on our GitHub or Discord.

Last updated on March 14, 2024