Assesses the vocabulary of our beloved, greatest president.
Théophile Bastian 4c9540afc5 Detail a bit methodology 2019-07-04 12:00:06 +02:00
trump_tweet_data_archive@4398599156 Initial commit 2019-07-04 11:57:16 +02:00
.gitignore Initial commit 2019-07-04 11:57:16 +02:00
.gitmodules Initial commit 2019-07-04 11:57:16 +02:00
LICENSE Initial commit 2019-07-04 11:31:32 +02:00
README.md Detail a bit methodology 2019-07-04 12:00:06 +02:00
__init__.py Initial commit 2019-07-04 11:57:16 +02:00
bootstrap.py Initial commit 2019-07-04 11:57:16 +02:00
count_words.py Initial commit 2019-07-04 11:57:16 +02:00
generate_list.py Initial commit 2019-07-04 11:57:16 +02:00


Trump vocabulary

NOTE: this was written in a few minutes without bothering with clean and robust code.

This code goes through the tweets of Donald Trump and produces a ranked list of words used.

The result (rarely updated) can be found here.

Methodology

A word is defined as a contiguous sequence of letters and apostrophes (') only. Words with fewer than four occurrences are removed (they are considered irrelevant, probably random names).
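The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in count_words.py: the names `WORD_RE`, `MIN_OCCURRENCES`, `count_words` and `rank_words` are hypothetical, and both the restriction to ASCII letters and the case-insensitive comparison are assumptions that the README does not state.

```python
import re
from collections import Counter

# Assumption: "letters" means ASCII letters; the real script may accept more.
WORD_RE = re.compile(r"[A-Za-z']+")
# Words seen fewer than four times are dropped, as described above.
MIN_OCCURRENCES = 4


def count_words(tweets):
    """Count words across tweets and drop the rare ones."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: words are lowercased so "Great" and "great" are merged.
        counts.update(word.lower() for word in WORD_RE.findall(tweet))
    return {word: n for word, n in counts.items() if n >= MIN_OCCURRENCES}


def rank_words(occurrences):
    """Sort (word, count) pairs by decreasing count, ties broken alphabetically."""
    return sorted(occurrences.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample input, only to show the shape of the result.
sample = ["Great great GREAT, great wall!", "Don't don't don't don't stop"]
sample_occur = count_words(sample)
sample_ranked = rank_words(sample_occur)
```

In this sample, "wall" and "stop" each appear once and are filtered out, while "great" and "don't" reach the threshold of four.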

Install

Clone this repository with submodules: git clone --recurse-submodules

Alternatively, if you already cloned the repo, you can run

git submodule update --init --depth 1

Get a shell

You can explore the data interactively by using count_words.py as an init script for your favorite Python shell, e.g.

ipython -i count_words.py

The following will be available to you as variables:

  • tweets: the list of all tweets in the archive
  • occur: Python dictionary of occurrences of words in Trump's tweets
  • ranked: ranked list of occurrences of words in Trump's tweets
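Once in the shell, these variables can be queried with plain Python. The sketch below assumes that `occur` maps each word to an integer count and that `ranked` is a list of (word, count) pairs sorted by decreasing count; the README does not specify their exact shape, so check them in the shell first. The helpers `top_words` and `rank_of` are hypothetical, and the stand-in data is made up so that the snippet runs on its own.

```python
# Made-up stand-in data; in the real shell, occur and ranked come from count_words.py.
occur = {"alpha": 120, "beta": 45, "gamma": 80}
# Assumed shape: (word, count) pairs, most frequent first.
ranked = sorted(occur.items(), key=lambda pair: pair[1], reverse=True)


def top_words(ranked, n):
    """Return the n most frequent (word, count) pairs."""
    return ranked[:n]


def rank_of(word, ranked):
    """Return the 1-based rank of a word, or None if it is not listed."""
    for position, (candidate, _count) in enumerate(ranked, start=1):
        if candidate == word:
            return position
    return None


print(top_words(ranked, 2))
print(rank_of("beta", ranked))
```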

Generating the list

Simply run

python ./generate_list.py [OUTPUT_FILE]

If you omit OUTPUT_FILE, the list is written to trumprank.txt.
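An optional positional argument with a default, as described above, could be handled along these lines. This is only a sketch using the standard argparse module; it is not taken from generate_list.py, and the names `parse_args` and `output_file` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; OUTPUT_FILE is optional."""
    parser = argparse.ArgumentParser(description="Generate the ranked word list.")
    parser.add_argument(
        "output_file",
        nargs="?",  # zero or one value, so the argument may be omitted
        default="trumprank.txt",  # default stated in the README
        help="file the ranked list is written to",
    )
    return parser.parse_args(argv)


# No argument falls back to the default; an explicit path overrides it.
default_args = parse_args([])
custom_args = parse_args(["my_list.txt"])
```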