A Wordle with Clustering of "Similar" words

Enter the text you want to cluster, and then press submit to process it (see below) and send the result to the wordle.net website:


Threshold for Relative Matching: (must be between 0.0 and 1.0, inclusive)

                                 

Explanation :

Wordles are often used to convey the "essence" of a document by printing each word at a size that reflects how often it occurs.  However, this is sometimes frustrated when the same word occurs in several different forms -- singular versus plural, for instance.  Thus, this website applies a simple text-mining pre-processor (written in Python) to the text before it is sent to wordle.net.  You can see the result of this pre-processing by clicking on the Preview button.

For example, the default text in the text area is a collection of statements about being a good teacher (click Reload Example to recover the default).  A wordle can be used to capture the essence of what these statements say about good teaching, but the default wordle approach treats "student" and "students" as two separate words (and likewise "know" and "knows").  The pre-processing removes common words and then uses a matching algorithm (more details below) to collect related words into groups (called "tags"), after which the most frequent word in each group becomes the value of the tag.  The Threshold parameter controls the sensitivity of the match: 1.0 is strict (all words are independent, as in the usual approach), while 0.0 means all words match each other and there is only one "tag" group.  In this way, "student" and "students" fall into the same word group (i.e., have the same "tag").  The difference is illustrated by the two wordles below:

[Figure: Original Teaching Characteristics Wordle (left) and Teaching Characteristics Wordle with Clustering (right)]

We use the same font, layout, and color scheme in both so that the two can be compared.  The wordle with clustering combines the various forms of "know" -- among others -- to illustrate that (at least in this demonstration) a good teacher emphasizes her students, is understandable, and is always "in the know". 

 

Similarly, a wordle of the US Constitution suffers from such small variations in word form.  The original and "clustered" wordles are shown below:

[Figure: Original US Constitution Wordle (left) and US Constitution Wordle with Clustering (right)]

Finally, it is important to note that wordles cannot completely capture the essence of a document.  Words are not individual patterns.  Their meaning and interpretation vary -- most words have multiple definitions, in fact!  Consider, for example, the following sentence:

The law of the land says that an airplane can land on land
owned by landed gentry as long as their own flying craft
land on their land and they don't own too many planes.

The sentence is clearly about airplanes and where they can land, but any wordle -- no matter how "scientific" or how much clustering is done -- would imply that the sentence is mostly about owning land.  The problem is that the word "land" has many meanings, whereas "airplane," "flying craft," and "planes" are separate words referring to (mostly) the same "tag."  This is typical in text-mining applications: they often need to reflect the conceptual context and the variety of word usages and definitions, in a way that differentiates even between a word and that same word used differently in another part of the document.
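The point can be checked with a plain word count. The following is only a minimal sketch using the Python standard library; the tokenizing rule (lowercase letters and apostrophes) is an assumption made for illustration and is not taken from the site's pre-processor.

```python
import re
from collections import Counter

sentence = (
    "The law of the land says that an airplane can land on land "
    "owned by landed gentry as long as their own flying craft "
    "land on their land and they don't own too many planes."
)

# Lowercase the text and split it into simple word tokens.
words = re.findall(r"[a-z']+", sentence.lower())
counts = Counter(words)

# "land" dominates even though the sentence is about airplanes.
print(counts["land"])                                         # 5
print(counts["airplane"], counts["planes"], counts["craft"])  # 1 1 1
```

A wordle sized by these counts would print "land" largest, while the three words that actually carry the topic each appear only once.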

 

(Some) Technical Details

The pre-processing is based on the Levenshtein distance between two strings (see http://en.wikipedia.org/wiki/Levenshtein_distance and http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance for details).  Specifically, a fuzzy matching algorithm called LevenMatch is defined which returns a relative comparison between two strings a and b.  The return value is in [0,1] and

Here are some examples:

LevenMatch( 'cat', 'bat' ) = 0
LevenMatch( 'cat', 'cats' ) = 0.5625
LevenMatch( 'student', 'students' ) = 0.765625
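For reference, the underlying Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below shows only that raw edit distance (the plain version, without the Damerau transposition step); it is not the LevenMatch function itself, whose normalization is not reproduced here, and the name levenshtein is chosen just for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed
    # so far (minus its last character) and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("cat", "bat"))           # 1
print(levenshtein("cat", "cats"))          # 1
print(levenshtein("student", "students"))  # 1
```

All three example pairs are a single edit apart, so the different LevenMatch values listed above must come from how the relative comparison is scaled rather than from the raw distance alone.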

 

Any two words with a LevenMatch above the Threshold are put into the same "word group" (i.e., tag), and each tag is then weighted by the number of words in its word group.
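The grouping step might be sketched as follows. This is an illustration under stated assumptions, not the downloadable source: difflib's SequenceMatcher ratio stands in for LevenMatch (it also returns a value in [0, 1] but gives different numbers), groups are formed transitively with a small union-find, and the names similarity and cluster_counts are hypothetical. Removal of common words is omitted for brevity.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for LevenMatch (assumption): a relative match in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def cluster_counts(words, threshold):
    """Group similar words and return {tag: number of words in the group}."""
    counts = Counter(words)
    vocab = list(counts)
    parent = {w: w for w in vocab}  # union-find forest over distinct words

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    # Merge every pair whose similarity is above the threshold.
    for i, a in enumerate(vocab):
        for b in vocab[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in vocab:
        groups.setdefault(find(w), []).append(w)

    # The most frequent member names the tag; the tag's weight is the
    # total number of occurrences in its group.
    tags = {}
    for members in groups.values():
        label = max(members, key=lambda w: counts[w])
        tags[label] = sum(counts[w] for w in members)
    return tags


words = "student students student knows know know teacher".split()
print(cluster_counts(words, threshold=0.8))
# {'student': 3, 'know': 3, 'teacher': 1}
```

With the threshold set to 1.0 no pair can score above it, so every word stays in its own group, which matches the strict setting described earlier.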

__________

click here to download the python source file (and see source for this page).

Based on the excellent Wordle website located at http://www.wordle.net and developed by Jonathan Feinberg.

Email: knisleyj who is at etsu.edu.