Can You Be Identified By the Way You Write?

The Fingerprints are in the Data.

Can we mirror forensic handwriting analysis in today’s digital world?

According to forensic science textbooks, at least twenty-one distinguishing characteristics of somebody’s handwriting are necessary to complete an accurate identification. But what if we don’t have handwriting at all, but rather text messages or an encrypted, untraceable email? With the rise of lyrical analysis, document similarity analysis, and neural networks that are beginning to interpret the nuances of human language, it is natural that text and speech analysis will help fuel future innovation, and I see it playing a large role in modern forensics going forward.

The Experiment:

Searching the web and using web-scraping tools and APIs, I compiled the full texts of the following ten novels, nine of which were written by one or more of the same four authors:

  • “Maze Of Bones” ~ Rick Riordan
  • “One False Note” ~ Gordon Korman
  • “Medusa Plot” ~ G. Korman
  • “Sword Thief” ~ Peter Lerangis
  • “Vipers Nest” ~ P. Lerangis
  • “Dead Of Night” ~ P. Lerangis
  • “Beyond The Grave” ~ Jude Watson
  • “In Too Deep” ~ J. Watson
  • “Vespers’ Rising” ~ P. Lerangis, G. Korman, R. Riordan, & J. Watson
  • “The Black Circle” ~ Patrick Carman

Standard Document Similarity Analysis

For those unfamiliar with the mechanics, I will give a brief synopsis of document analysis.

Traditionally, for document analysis, texts are broken down into their component words and often stored in a Document Term Matrix (DTM) or in an inverted index. An inverted index, which is closely related to a Term Document Matrix, is a collection of all words in the corpus of texts, each serving as an index for a list of the documents containing that word. This allows for efficient computation of the statistical measures tf, idf, and tf-idf.

  • tf(w,d): number of occurrences of word w in document d (Term Frequency)
  • idf(w): log(number of docs in corpus / number of docs containing w) (Inverse Document Frequency)
  • tf-idf(w,d): tf(w,d) * idf(w)

This places a higher weight on words that rarely occur within a corpus, since a rare word is most likely related to the specific subject of the document, unlike a filler word that is common to all texts in that language. Multiplying by tf reinforces this, since words that occur frequently within a document are likely more integral to its subject. The logarithm ensures that a word appearing in every document of the corpus gets an idf of log(1) = 0, and therefore plays no role in tf-idf.
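To make these definitions concrete, here is a minimal Python sketch (not taken from the original R notebook). The three-document toy corpus and the helper names `idf` and `tf_idf` are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the document names and texts are made up for illustration.
docs = {
    "d1": "the ship sailed across the sea",
    "d2": "the cake was on the table",
    "d3": "the ship sank",
}

# Term frequencies per document: tf(w, d) = raw count of w in d.
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Inverted index: each word maps to the set of documents containing it.
inverted_index = defaultdict(set)
for name, counts in tf.items():
    for word in counts:
        inverted_index[word].add(name)

def idf(word):
    # log(number of docs in corpus / number of docs containing the word);
    # only meaningful for words that occur somewhere in the corpus.
    return math.log(len(docs) / len(inverted_index[word]))

def tf_idf(word, name):
    return tf[name][word] * idf(word)

print(tf_idf("the", "d1"))  # "the" is in every document, so idf = log(1) = 0
print(tf_idf("sea", "d1"))  # "sea" is in one of three documents: 1 * log(3)
```

Note how the filler word "the" drops out entirely, which is exactly the behavior that becomes a problem for author identification later on.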

For each document, we then form a vector over all of the words in the document, where the weight of each word’s component is its tf-idf for that document. Let V1 and V2 be the weight vectors for two documents D1 and D2, and for any vector V, let |V| refer to its magnitude, and let • denote the dot product. We find the Cosine Similarity between two documents as follows:

  • CosSim(D1, D2) = (V1 • V2) / (|V1| * |V2|)

A Cosine Similarity of 1 indicates that two documents use the same words in the same proportions, whereas 0 indicates that they share no meaningfully weighted words.
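As a rough illustration of the formula, the sketch below computes cosine similarity over sparse weight vectors stored as Python dictionaries. The function name `cosine_similarity`, the zero-vector convention, and the sample weights are all assumptions for the example, not values from the experiment.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse vectors stored as {word: weight} dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm1 * norm2)

# Made-up tf-idf weights for two short documents.
v1 = {"ship": 0.8, "sea": 1.1}
v2 = {"ship": 0.4, "cake": 1.1}

print(cosine_similarity(v1, v1))                       # same direction
print(cosine_similarity(v1, v2))                       # partial overlap
print(cosine_similarity({"sea": 1.0}, {"cake": 1.0}))  # no shared words
```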

What if the same person writes about two different things?

Note that every book in our corpus has different subject matter. When I ran the traditional cosine similarity algorithm on the corpus, the results were meaningless: none of the predictions was correct.

The algorithm without modification places high value on the subject of books, rather than the writing itself. It focuses on the exact words used, and how unique the words used are. While common stop words may not tell us much about a book’s subject matter, the author’s usage of them may tell us a lot about the author, so I needed a more moderate definition of idf.

Standard tf-idf cosine similarity would be useless here: if I were to write a happy birthday email to my cousin and a novel about sailing across the Atlantic, it is unlikely that any operative words would overlap. Rather than looking at what is rare, it is more useful for identification to look at what is common between documents, and at the relative frequencies with which those common words are used. We define our own variables:

  • tf(w,d) = (number of instances of w in d) / (number of words in d)
  • idf(w) = (number of docs containing w) / (number of docs in the corpus)
  • tf-idf(w,d) = tf(w,d) * idf(w)

tf is divided by the number of words in the document to focus on the frequency of the word relative to the size of the document. With idf, I took the reciprocal of the expression inside the logarithm so that common filler words received more weight and rare words received less. Then, I dropped the logarithm so that a word’s weight is directly proportional to the number of documents it appears in. This is still not ideal. For example, the books written by P. Lerangis contained many variations of the interjection “agh”, such as “Aaaaaagggghhh”, “Agghh”, or “Aggggghhhhh”. These words are extremely characteristic of his books and contribute nothing to the subject matter, but they have relatively low frequency and therefore would have a low weight in the original format. Their weighting is still far undervalued here, but the results were fantastic nonetheless.
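The modified weighting can be sketched in a few lines of Python. This is an illustrative reimplementation under my own assumptions (whitespace tokenization, the function name `modified_weights`, and a two-document toy corpus), not the code from the original notebook.

```python
from collections import Counter

def modified_weights(docs):
    """docs: {name: list of tokens}. Returns {name: {word: weight}}."""
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Number of documents containing each word.
    doc_freq = Counter(word for c in counts.values() for word in c)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            # tf = count / document length;
            # idf = share of documents in the corpus containing the word
            word: (count / total) * (doc_freq[word] / n_docs)
            for word, count in c.items()
        }
    return weights

# Toy corpus echoing the birthday email vs. sailing novel example.
docs = {
    "email": "happy birthday to you and to your family".split(),
    "novel": "we sailed to the coast and then to the open sea".split(),
}
w = modified_weights(docs)
print(w["email"]["to"])        # shared word: (2/8) * (2/2) = 0.25
print(w["email"]["birthday"])  # word unique to one doc: (1/8) * (1/2) = 0.0625
```

The shared filler word "to" now carries four times the weight of the subject-specific word "birthday", which is the reverse of what standard tf-idf would produce.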

Results

Since most of the differences between similarities were extremely small, all values were scaled by 1000, leaving only values between 935 and 1000 (this makes sense, as we expect documents to be similar at a high level when we are looking at common words). To get cleaner numbers, I subtracted 930 and multiplied by (10/7), so that all of the values were neatly scaled between 0 and 100.
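The rescaling described above is simple arithmetic; the short sketch below (with a hypothetical helper named `rescale`) shows that a raw similarity of 0.930 maps to roughly 0 and a raw similarity of 1.000 maps to roughly 100.

```python
def rescale(similarity):
    """Map a raw cosine similarity onto the 0-100 scale described above."""
    return (similarity * 1000 - 930) * (10 / 7)

print(rescale(0.930))  # bottom of the scale, approximately 0
print(rescale(0.965))  # midpoint, approximately 50
print(rescale(1.000))  # top of the scale, approximately 100
```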

As we can see above, the results were astoundingly good. For every instance in which an author wrote other books in the corpus, those books were all scored as the most similar to the author’s other books. For example, P. Lerangis wrote “Sword Thief”, “Vipers Nest”, and “Dead of Night”, all of which had each other as their pairwise top 2 most similar books out of the 10 in the corpus. The same applies for all of the authors and their books.

In the figure above, a contour map with color-coded grid lines allows us to see that points of high similarity clearly fall on the intersections of like-colored grid lines; in other words, books written by the same author are more likely to have a high similarity. On top of this, we can also see that 8 (Vespers’ Rising) and 5 (The Black Circle) are overwhelmingly more blue than the other regions of the graphic.

Vespers’ Rising is a book co-authored by Lerangis, Korman, Riordan, and Watson. While it could have been interesting to see ties from this to the other books, it certainly makes sense that a mix of text patterns would net out to something unrecognizable for analytics.

The Black Circle is written by Patrick Carman, an author absent from the rest of the corpus. Accordingly, it makes sense that it has a very low similarity to the rest of the corpus.

This kind of consistency is promising for the use of these technologies: not only do we see a high rate of positive identifications, but we also see a very low rate of false identifications, since the similarity calculation is too involved for a high similarity rating to arise by chance with any regularity. As we can see above, there is a very consistent and significant difference between similarity scores for books by the same author vs. different authors.

Above is a directed prediction graph, in which each book’s node has as many outgoing edges as there are other books in the corpus by the same author. An arrow represents a predicted identification of authors of other books. Put simply, an arrow represents a prediction. If a node has two outgoing edges, and both are green, that indicates that it correctly predicted both of the other two books written by that author.
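One way to build such a prediction graph is sketched below. The function name `predict_edges`, the four-book toy corpus, and its similarity scores are hypothetical, and the sketch assumes each book has a single author, so co-authored books like “Vespers’ Rising” would need extra handling.

```python
def predict_edges(similarity, authors):
    """Return a list of (source, target, is_correct) prediction edges.

    similarity: {book: {other_book: score}}
    authors:    {book: author}
    Each book gets one outgoing edge per other book by the same author,
    pointing at its highest-scoring neighbours.
    """
    edges = []
    for book, scores in similarity.items():
        # Out-degree = number of other books in the corpus by the same author.
        k = sum(1 for other, author in authors.items()
                if other != book and author == authors[book])
        ranked = sorted(scores, key=scores.get, reverse=True)
        for target in ranked[:k]:
            edges.append((book, target, authors[target] == authors[book]))
    return edges

# Made-up scores: a1/a2 share author X, b1/b2 share author Y.
authors = {"a1": "X", "a2": "X", "b1": "Y", "b2": "Y"}
similarity = {
    "a1": {"a2": 90, "b1": 40, "b2": 30},
    "a2": {"a1": 90, "b1": 35, "b2": 20},
    "b1": {"a1": 40, "a2": 35, "b2": 38},
    "b2": {"a1": 30, "a2": 20, "b1": 38},
}
for source, target, correct in predict_edges(similarity, authors):
    print(source, "->", target, "correct" if correct else "incorrect")
```

In this toy data, b1 points at a1 instead of b2, which is the kind of edge that would not be drawn in green.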

Since 8 is “Vespers’ Rising”, which has 4 authors, the 8 → 10 edge is somewhat less significant: it was based on low similarities and involves an edge case rather than a straightforward single-author match. In general, a forensic setting would be looking for text as an identifying factor, so texts authored by groups are somewhat irrelevant to these settings anyway.

Book 1 is Rick Riordan’s “Maze of Bones”. The only other book partially authored by Riordan is “Vespers’ Rising”, so the single prediction for node 1 is also somewhat meaningless, in that he was not the sole author of any other book.

When we look at this graph in the absence of 8 → 10 and 1 → 4, we see a graph whose edges only represent predictions on books with a single author. Its strongly connected components (SCCs) are {3, 10, 7}, {6, 4}, {9, 2}, {5}, {1}, and {8}, meaning that the graph is partitioned by author. Traversing edges from any given node, you can only reach and be reached by nodes representing books by the same author.
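The components can be checked with a small Python sketch. The edge list is reconstructed from the out-degrees and components described in the text (it is not exported from the original graph), and the mutual-reachability approach is a simple choice that is fine for ten nodes; Tarjan's or Kosaraju's algorithm would be the usual choice for larger graphs.

```python
def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strongly_connected_components(graph, nodes):
    """Group nodes that can each reach the other (mutual reachability)."""
    reach = {n: reachable(graph, n) | {n} for n in nodes}
    components, assigned = [], set()
    for n in nodes:
        if n in assigned:
            continue
        comp = {m for m in nodes if m in reach[n] and n in reach[m]}
        components.append(comp)
        assigned |= comp
    return components

# Single-author prediction edges, with 8 -> 10 and 1 -> 4 left out.
graph = {
    3: [7, 10], 7: [3, 10], 10: [3, 7],
    6: [4], 4: [6],
    9: [2], 2: [9],
}
components = strongly_connected_components(graph, list(range(1, 11)))
print(sorted(sorted(c) for c in components))
# [[1], [2, 9], [3, 7, 10], [4, 6], [5], [8]]
```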

This is extremely significant.

Takeaways

What do these strongly connected components tell us? They tell us that our algorithm is more than capable of “categorizing” texts by author.

There is one out-edge from 2, 6, 4, and 9, and two out-edges from 3, 7, and 10. Accordingly, if the algorithm were to randomly assign edges along those frequencies, the rough probability of correctly generating the predictions the way it did in our model is:

(1/9)^4 * (1/(9 choose 2))^3, that is, a factor of 1/9 for each of the 4 nodes with out-degree 1, and 1/(9 choose 2) = 1/36 for each of the 3 nodes with out-degree 2
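Multiplying these factors out is easy to verify in Python. The sketch below assumes, as the rough estimate does, that each node's edges are chosen independently and uniformly at random; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p_single = Fraction(1, 9)         # one correct target out of 9 other books
p_pair = Fraction(1, comb(9, 2))  # one correct pair out of 36 possible pairs

# 4 nodes with out-degree 1 and 3 nodes with out-degree 2.
p_all = p_single ** 4 * p_pair ** 3

print(p_all)         # 1/306110016
print(float(p_all))  # roughly 3.3e-09
```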

For reference, you are 20,007 times more likely to be attacked by a shark than you are to randomly produce a result as accurate as ours.

This is a clear indication that there is massive potential for author identification in the future.

The value of this technology is not limited to forensic identification either. Since the success of a novelist largely stems from their voice, vocabulary, and word choice, these kinds of analyses could potentially give an indication of how similar a novelist’s jargon is to that of other successful authors, and help publishers inform their decisions before putting anything to contract.

This could also be used in the preservation and study of history. In situations where the author of historical documents is unknown, similarity analysis could likely provide a powerful tool in shaping how we look at these questions.

The implications for cybersecurity are also notable. Just as a credit card company takes note and freezes an account when there are uncharacteristic charges made, it is feasible that in the future social media companies, blogs, email servers, and more could use text analysis, with your previous use as a corpus, to flag posts and sent messages that don’t seem to have been written by the account owner. This could be a failsafe to minimize the implications of password hacks.

Finally, the potential use that I find the most intriguing is the crossover with voice recognition. In this algorithm, we are focusing on the way in which someone most regularly forms their sentences, and the relative frequencies, tendencies, and quirks of somebody’s word choice. It is important to realize that none of these qualities are unique to written text, as vocal tendencies rely on the same pillars. In addition to common voice recognition, analysis of speech transcripts with this kind of text analysis could help identify somebody based on the similarity to their typical speech profile.

To summarize, yes, you can identify authors from their writing patterns, and the implications for the future are sizable.

Code

  1. Contact me for a copy of my original R-Notebook

Studying Computer/Data Science at the University of Pennsylvania
