All scores are computed using data collected by Fleksy. We start from a list of clean sentences, which we corrupt by introducing artificially generated typos. Then we measure how well these typos are corrected.
For every language supported by Fleksy, we use three different datasets:
Formal: Data that comes from official, public datasets, such as Wikipedia dumps.
Conversational: Data scraped from curated online content, such as movie subtitles.
Chat: Data gathered directly from our beta-testers testing the keyboard.
Because the sources used for each dataset are different, the resulting sentences have different properties (length, vocabulary, format, formality, etc.). Here are a few examples from each of these datasets:
Formal | Conversational | Chat |
---|---|---|
With just a sliver of the vote return coming in after the polls closed at 8 p.m., Boxer already had a lead of nearly 230,000 votes. | I never know how the day will turn out. | two foot spin with entrance |
Fellow White Sox reliever Scott Linebrink served as a sounding board for Jenks during the past season, when the two discussed Linebrink’s religious faith and what it did for his life. | It does not think that it would have to him to be very been thankful? | my second day aswell |
And those increased reserves are incredibly profitable for ATP as they are going to be acquired very inexpensively and the only additional cost for ATP is drilling. | The girl was beautiful. | hey. ur mom told ur tia that u guys are. coming this weekend. |
The New York campaign underscored the divided nature of the Republican Party as it tries to rebound from election losses in 2006 and 2008 that ousted the party from control of Congress and the White House. | Did you bring callie’s cards? | Yılmaz just trying to hide money |
For each dataset, we split the data into a train set and a test set, and we run the benchmark on the test set.
We use 20k sentences for each dataset, so across all datasets we have a test set of 60k sentences for each language (note: for some languages the test set contains fewer sentences because the whole dataset is too small; we keep at least 90% of the data for the train set).
To introduce typos in the clean text, we simulate all possible typos that a human typing on a mobile keyboard could make. This includes:
We use the following typo rates:
With these rates, we obtain an overall typo rate of 12%.
These rates come from studies on real human typing habits: Reference #1, Reference #2.
In particular, Reference #1 (which focuses on mobile device typing) shows that typing on mobile devices leads to 2.3% of uncorrected errors (see the introduction) and 8% of words being autocorrected (see Intelligent text entry, page 8), for an overall typo rate of 10.3%.
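As an illustration, here is a minimal sketch of how such a corruption step could look. The function names, the typo operations covered, and the per-typo probabilities are assumptions made for illustration only; they are not the exact implementation or rates used in the benchmark.

```python
import random

# Hypothetical per-typo probabilities, chosen only for illustration
# (the benchmark's actual rates are the ones documented above).
TYPO_RATES = {
    "char_deletion": 0.01,
    "char_transposition": 0.01,
    "space_deletion": 0.01,
}

def corrupt_word(word: str, rng: random.Random) -> tuple[str, list[str]]:
    """Apply each typo type to a word with its configured probability."""
    applied = []
    if len(word) > 1 and rng.random() < TYPO_RATES["char_deletion"]:
        i = rng.randrange(len(word))
        word = word[:i] + word[i + 1:]
        applied.append("Character deletion")
    if len(word) > 1 and rng.random() < TYPO_RATES["char_transposition"]:
        i = rng.randrange(len(word) - 1)
        word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        applied.append("Character transposition")
    return word, applied

def corrupt_sentence(sentence: str, seed: int = 0) -> tuple[str, list[str]]:
    """Corrupt a clean sentence word by word, occasionally deleting spaces."""
    rng = random.Random(seed)
    words, typos = [], []
    for word in sentence.split(" "):
        corrupted, applied = corrupt_word(word, rng)
        typos.extend(applied)
        if words and rng.random() < TYPO_RATES["space_deletion"]:
            words[-1] += corrupted          # merge with the previous word
            typos.append("Space deletion")
        else:
            words.append(corrupted)
    return " ".join(words), typos

corrupted, typos = corrupt_sentence("He went hiking and said he'd think about it.")
```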
Here are a few examples of sentences before and after introducing typos:
Clean sentence | Corrupted sentence | Typos introduced |
---|---|---|
He went hiking and said he’d think about it; never came back. | He went hikimg and said hed think about it; never came back. | Fuzzy typing, symbol deletion |
Like, what you’re doing here and what all this stuff is. | Like, what you’re doinghere and waht all this stuff is. | Space deletion, character transposition |
You must do something about yourself. | You must do something about yourself. | |
That’s the way to get rid of pests like that. | That’s the waj to get rid of pedts like thhat. | Common typo, fuzzy typing, character addition |
He obviously wanted an ally. | he obviously wanted an ally. | Case simplification |
This is all we got between us and the Almighty! | This is lal we got beween us and the Almgihty! | 2 × character transposition, character deletion |
For the task of swipe gesture resolution, the input is not simple text: we need to generate a swipe gesture.
When generating a fuzzy typing typo, we sample key tap positions on the keyboard using Gaussian distributions, and use these tap positions to determine whether the correct character was typed or a neighboring key was hit.
To generate the swipe gesture, we sample key tap positions as for fuzzy typing, and then link the different keystrokes of the word using Bézier curves. Some randomness on the speed and acceleration between points is added, in order to generate more natural swipe gestures.
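Below is a minimal sketch of this generation process, assuming a simplified keyboard layout. The key coordinates, the Gaussian standard deviation, and the quadratic Bézier interpolation with a fixed number of steps are illustrative assumptions, not the exact parameters or curve model used by the benchmark (in particular, the speed/acceleration randomness is not reproduced here).

```python
import math
import random

# Hypothetical key centres on a normalized QWERTY layout (x, y) -- an
# assumption for illustration, not the real keyboard geometry.
KEY_CENTERS = {
    "g": (4.5, 1.0), "i": (7.0, 0.0), "v": (3.5, 2.0), "e": (2.0, 0.0), "s": (1.5, 1.0),
}
KEY_STD = 0.3  # std-dev of the Gaussian used to sample a tap around a key centre

def sample_tap(char: str, rng: random.Random) -> tuple[float, float]:
    """Sample a noisy tap position around the intended key (fuzzy typing)."""
    cx, cy = KEY_CENTERS[char]
    return rng.gauss(cx, KEY_STD), rng.gauss(cy, KEY_STD)

def nearest_key(x: float, y: float) -> str:
    """The key actually 'hit' is the one whose centre is closest to the tap."""
    return min(KEY_CENTERS, key=lambda k: math.dist(KEY_CENTERS[k], (x, y)))

def swipe_points(word: str, rng: random.Random, steps: int = 8) -> list[tuple[float, float]]:
    """Link successive taps with quadratic Bézier curves to form a swipe gesture."""
    taps = [sample_tap(c, rng) for c in word]
    points = []
    for (x0, y0), (x1, y1) in zip(taps, taps[1:]):
        # Random control point to bend the segment, so the path is not a straight line.
        ctrl = ((x0 + x1) / 2 + rng.gauss(0, 0.2), (y0 + y1) / 2 + rng.gauss(0, 0.2))
        for i in range(steps):
            t = i / steps
            x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * ctrl[0] + t ** 2 * x1
            y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * ctrl[1] + t ** 2 * y1
            points.append((x, y))
    points.append(taps[-1])
    return points

rng = random.Random(0)
gesture = swipe_points("gives", rng)                                 # swipe for "gives"
typed = "".join(nearest_key(*sample_tap(c, rng)) for c in "gives")   # fuzzy-typed word
```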
Here are some examples of the generated swipe gestures (in red, the keystrokes generated by fuzzy typing; in blue, the points of the corresponding swipe gesture).
For the word *gives*: [figures of the generated swipe gesture]
For the word *they*: [figures of the generated swipe gesture]
In this section we explain how each metric is computed. Have a look if you want to understand the details of each metric, but you can skip to the next section, Understanding the metrics, if you just want to be able to interpret these metrics.
For the next-word prediction, auto-completion, and swipe gesture resolution tasks, the metric used is accuracy.
The formula is: `accuracy = correct / total`, where `correct` is the number of correct predictions and `total` the total number of predictions.
For the next-word prediction and auto-completion tasks, we use top-3 accuracy. It's the same as accuracy, but instead of considering only one candidate (which is either correct or not), we check whether any of the 3 most probable candidates is correct.
The reason is that next-word and auto-completion predictions are not “forced” upon the user: 3 predictions are displayed at the top of the keyboard, and the user can choose any of them. So the correct prediction only needs to appear among these 3 displayed predictions.
For swipe resolution, however, only the best prediction is selected and applied. So we use plain accuracy (and not top-3 accuracy).
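As a concrete illustration, here is a small sketch of how both metrics could be computed from model outputs. The data structures (a single prediction per example for accuracy, a ranked list of 3 candidates per example for top-3 accuracy) are assumptions made for illustration.

```python
def accuracy(predictions: list[str], targets: list[str]) -> float:
    """Plain accuracy: the single best prediction must match the expected word."""
    correct = sum(pred == target for pred, target in zip(predictions, targets))
    return correct / len(targets)

def top3_accuracy(candidates: list[list[str]], targets: list[str]) -> float:
    """Top-3 accuracy: the expected word must appear among the 3 displayed candidates."""
    correct = sum(target in cands[:3] for cands, target in zip(candidates, targets))
    return correct / len(targets)

# Toy usage with made-up predictions:
print(accuracy(["hello", "world"], ["hello", "word"]))              # 0.5
print(top3_accuracy([["hi", "hello", "hey"], ["word", "world", "works"]],
                    ["hello", "world"]))                            # 1.0
```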
For auto-correction, it's different. We have a notion of true/false positives and negatives. Let's first define these notions:
- True Negative: a correct word that the model leaves untouched.
- False Positive: a correct word that the model wrongly "corrects".
- True Positive: a typo that the model corrects into the expected word.
- False Negative: a typo that the model fails to correct.
With an example it's easier to visualize:
|  | Word typed by the user | Word after being corrected by the model | Expected word |
|---|---|---|---|
| True Negative | love | love | love |
| False Positive | love | loev | love |
| True Positive | loev | love | love |
| False Negative | loev | loev | love |
From these notions, we can compute the following metrics: accuracy, precision, recall, and F-score, with the following formulas:
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * (precision * recall) / (precision + recall)
Note: F-score is the harmonic mean of precision and recall. It's a way to gather both precision and recall in a single metric.
Actually we use the Fβ-score, which is a variant of the F-score where a constant β weights the precision/recall trade-off (see the Wikipedia page about F-score).
This is useful because we value precision more.
We currently use β = 0.75, which means precision has 50% more weight than recall.
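To make the computation concrete, here is a minimal sketch that classifies each (typed, corrected, expected) triple and derives the metrics above. The Fβ formula used is the standard weighted harmonic mean; treating the β = 0.75 weighting this way is an assumption about how the benchmark applies it, not a confirmed detail.

```python
def autocorrection_metrics(examples: list[tuple[str, str, str]], beta: float = 0.75) -> dict:
    """examples: (word typed, word after correction, expected word) triples."""
    tp = fp = tn = fn = 0
    for typed, corrected, expected in examples:
        had_typo = typed != expected
        was_changed = corrected != typed
        if had_typo and corrected == expected:
            tp += 1            # typo fixed into the expected word
        elif not had_typo and was_changed:
            fp += 1            # correct word wrongly "corrected"
        elif not had_typo and not was_changed:
            tn += 1            # correct word left untouched
        else:
            fn += 1            # typo left uncorrected (or mis-corrected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Standard F-beta (weighted harmonic mean); with beta < 1, precision weighs more.
    f_beta = ((1 + beta**2) * precision * recall / (beta**2 * precision + recall)
              if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(examples),
        "precision": precision,
        "recall": recall,
        "f_score": f_beta,
    }

# The four rows from the example table above:
metrics = autocorrection_metrics([
    ("love", "love", "love"),   # true negative
    ("love", "loev", "love"),   # false positive
    ("loev", "love", "love"),   # true positive
    ("loev", "loev", "love"),   # false negative
])
```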
Accuracy
- [0 - 1]
- higher is better
Accuracy is straightforward: this is the ratio of correct predictions.
So an accuracy of `0.8` means the model correctly predicted the word being swiped 80% of the time.
Top-3 accuracy
- [0 - 1]
- higher is better
Same as accuracy, but 3 candidates are considered.
So a top-3 accuracy of `0.6` means that the next word (or the word completion) appears among the 3 candidates predicted by the model 60% of the time.
Precision
- [0 - 1]
- higher is better
Precision is the ratio of actual typos among the words corrected by the model.
So a precision of `0.7` means that among all corrections made by the model, 70% were actually typos (and 30% were correct words that didn't need to be corrected).
A low precision means many words are corrected when they should not be, and a high precision means only actual typos are corrected.
Recall
- [0 - 1]
- higher is better
Recall is the ratio of typos detected by the model.
So a recall of `0.65` means that the model correctly detected 65% of typos (and 35% of typos were not corrected by the model).
A low recall is a symptom that most typos are not detected, and a high recall means most typos are detected.
F-score
- [0 - 1]
- higher is better
F-score is the harmonic mean of precision and recall; it's a way to gather both precision and recall in a single metric.
Note that we weight precision 50% more than recall.
To validate our performance against direct competitors, we use consistent metrics and a standardized evaluation system, as explained in the previous section. Presented below are some of the results obtained in comparison with these competitors.
Language compared: en-US
Date: Jun 14, 2023
Auto-correction | |
---|---|
Precision | 0.79 |
Recall | 0.64 |
F-score | 0.73 |

Auto-completion | |
---|---|
Top-3 accuracy | 0.69 |

Next-word prediction | |
---|---|
Top-3 accuracy | 0.28 |
Test set: 300 sentences
Compare these metrics with our KeyboardSDK for English en-US