# Trump vocabulary
**NOTE:** this was written in a few minutes, with no effort put into clean or
robust code.

This code goes through Donald Trump's tweets and produces a list of the words
used, ranked by number of occurrences.

The result (not updated often, though) can be found
[here](https://tobast.fr/files/trumprank.txt).
## Methodology
A word is considered to be a contiguous sequence of letters and apostrophes
(`'`) only. Words that have fewer than four occurrences are removed (considered
irrelevant, probably some random name).
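
The rule above can be sketched roughly as follows. This is a minimal illustration, not the actual content of `count_words.py`: the names `WORD_RE`, `count_words` and `rank_words` are hypothetical, and both the ASCII-only letter class and the lowercasing of words are assumptions.

```python
import re
from collections import Counter

# A word is a contiguous run of letters and apostrophes.
# Assumption: only ASCII letters are matched here.
WORD_RE = re.compile(r"[A-Za-z']+")
MIN_OCCURRENCES = 4  # words seen fewer than four times are dropped


def count_words(tweets):
    """Count words across an iterable of tweet texts.

    Assumption: words are lowercased so that "Great" and "great" are merged.
    """
    counts = Counter()
    for text in tweets:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def rank_words(counts, min_occurrences=MIN_OCCURRENCES):
    """Return (word, count) pairs by decreasing count, rare words removed."""
    kept = [(w, c) for w, c in counts.items() if c >= min_occurrences]
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    sample = ["Great great GREAT great wall", "don't don't don't don't stop"]
    for word, count in rank_words(count_words(sample)):
        print(count, word)
```

In this sketch, `wall` and `stop` each appear only once in the sample, so they are filtered out of the ranked list.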
## Install
Clone this repository with submodules: `git clone --recurse-submodules`.

Alternatively, if you already cloned the repo, you can run:
```bash
git submodule update --init --depth 1
```
## Get a shell
You can explore the data interactively by using `count_words.py` as an init
script for your favorite Python shell, e.g.:
```bash
ipython -i count_words.py
```
The following variables will then be available:
* `tweets`: the list of all tweets,
* `occur`: a Python dictionary of occurrences of words in Trump's tweets,
* `ranked`: a ranked list of occurrences of words in Trump's tweets.
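
As an illustration of what such a session might look like, here is a small sketch. It assumes that `occur` maps each word to its number of occurrences; the exact structure of `ranked` is not documented here, so it is not used. The helper `top_words` is hypothetical, and the counts below are made-up stand-in values so that the snippet runs on its own.

```python
# Stand-in data with invented counts, so the snippet is self-contained.
# In the interactive session, `occur` is already defined by count_words.py.
occur = {"great": 120, "fake": 45, "wall": 30}


def top_words(occurrences, n=10):
    """Return the n most frequent (word, count) pairs."""
    return sorted(occurrences.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Look up a single word, defaulting to 0 if it never appears.
print(occur.get("great", 0))

# Show the two most frequent words of the stand-in data.
print(top_words(occur, 2))
```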
## Generating the list
Simply run:
```bash
python ./generate_list.py [OUTPUT_FILE]
```
If you omit `OUTPUT_FILE`, the list will be generated to `trumprank.txt`.