Markov Chaining Trump
“What if,” I thought, “I put the transcript of Donald Trump’s 17th of Feb 2017 press conference into a Markov chain? I wonder what the result will be.”
Which later had me tweeting out this…
Fed the transcript from Trump press conference into the simplest Markov Chain, the output is roughly indistinguishable from the input.
…you can see the results for yourself at the brand spanking new…
* * *
A Quick Markov Chain Refresher
I have a more detailed post about Markov Chains here, but here’s a quick refresher.
A Markov Chain looks at a word, a word pair or word triple and picks a likely next word based on a corpus of source text. For example, if I’d supplied it with a whole bunch of fairy stories and started it with the word triple of “Once upon a”, then there’s a good chance it’ll pick the word “time” next.
Then you repeat with “upon a time”, which may give you “a” or “there”, carry on with “a time there” and we may get “was” or “were”. You keep going as it strings together probabilistic next words.
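As a rough illustration of the steps above, here is a minimal Python sketch of a word-triple chain. It is not the code behind the remixer; the function names `build_chain` and `generate` and the tiny fairy-story corpus are made up for this example.

```python
import random
from collections import defaultdict

def build_chain(text, order=3):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)

def generate(chain, seed, length=30, rng=None):
    """Start from `seed` (a tuple of words) and keep appending a randomly chosen follower."""
    rng = rng or random.Random()
    order = len(seed)
    out = list(seed)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            # Dead end: these words were never followed by anything in the corpus.
            break
        # A follower seen several times is listed several times, so it is picked more often.
        out.append(rng.choice(followers))
    return " ".join(out)

# Hypothetical toy corpus, standing in for "a whole bunch of fairy stories".
corpus = (
    "Once upon a time there was a king . "
    "Once upon a time there were three bears . "
    "Once upon a time a girl lived in a wood ."
)
chain = build_chain(corpus, order=3)
print(chain[("Once", "upon", "a")])   # ['time', 'time', 'time']
print(chain[("upon", "a", "time")])   # ['there', 'there', 'a']
print(generate(chain, ("Once", "upon", "a"), length=12))
```

Because `generate` only ever appends words taken from the follower lists, nothing can come out that wasn't in the text that went in.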
What you get out depends very much on what you put in. It never puts in words that could make sense but weren’t in the original source text. If I put in the stories of Sherlock Holmes you’re only ever going to get back words used in those stories, albeit in a different order. It’s not suddenly going to start dropping spaceships or robots in there.
The output will reflect the input, which it’s supposed to; you get familiar, sometimes plausible, gibberish.
* * *
Garbage In, Garbage Out
I’m going to be a little self-aggrandizing here: I’ve been messing with text processing and analysis for a good while now, and I’m pretty awesome at it. Recently that has been very much with a focus on news over at Kaleida. I’m going to say that overall I have a pretty good gut feeling for the information density, sentiment and cohesion of bodies of text, both before and after processing.
I’m not about to head off into a “let’s remove all the low value information words from the Trump transcript and see what’s left” or “what’s the low to high value information density ratio” territory, as tempting as it is. For this I’m sticking with the Markov Chain.
Stick with me for this bit…
There are two things that can affect the quality of the results coming out of a Markov chain. The first is the amount of data you put in.
If you put in too much, like books and books and books’ worth from various sources, then the result you get back out is unsatisfactory gibberish. There are too many opportunities for it to select next words, so the results tend to jump around all over the place, making hardly any sense. It skips around in an incoherent fashion to the point of being unbelievable. As we read it we can see words are used, but not in ways we’d normally use them.
The above is when you can tell the input is bad, FAKE INPUT, sad.
If you throw a novel or two in, then you get pretty cohesive output; it makes enough sense to be interesting.
If you use too small a source then the output too closely matches the input. There are not enough opportunities for the text to split off, or jump around. You basically get the source text regurgitated back out pretty much word for word.
- Too much input: incoherent rambling out.
- Too little input: normal-ish text back out.
- Just right input: curiously interesting, and potentially believable text out.
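One rough way to put a number on those "opportunities to split off" is to count how many word groups in the source have more than one possible next word. The sketch below does that. It is my own illustration rather than anything the remixer does, and `branching_stats` and the sample sentence are hypothetical.

```python
from collections import defaultdict

def branching_stats(text, order=2):
    """Return (number of states, share of states with more than one distinct next word)."""
    words = text.split()
    followers = defaultdict(set)
    for i in range(len(words) - order):
        followers[tuple(words[i:i + order])].add(words[i + order])
    if not followers:
        return 0, 0.0
    branching = sum(1 for nxt in followers.values() if len(nxt) > 1)
    return len(followers), branching / len(followers)

# Tiny made-up source: only "on the" can be followed by two different words.
small = "the cat sat on the mat and the dog sat on the rug"
states, share = branching_stats(small, order=2)
print(states, round(share, 2))  # 9 0.11
```

A share close to zero means the chain has almost nowhere to branch, so it mostly replays the source. A high share means it can jump somewhere new at nearly every step. Where "just right" sits between the two is a judgment call, not something this number decides for you.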
Here’s a sample of output from the Trump remixer…
"And it’s not, its not, not a bad decision. That’s the real problem. And, you know, Paul was very much involved with this stuff started coming out, it came out that way and in the United States actually got together and got along and don’t forget, that’s the way that other people, including yourselves in this country’s going to be able to make a deal."
Now, I don’t think I ever met him. And he was right.
The problem is that the amount of text put into the system, the press transcript, is too little. At just 12,187 words, what we should be getting back is pretty close to the original source.
And, I think it is.
And, this is just based on my vast and awesome amount of experience.
It feels like the incoherent, jumping, random, branching, gibberish that you get back out when you put too much in.
And, and, I’ve never seen that before.
Anyway, try for yourself: revdancatt.github.io/CAT784-remixing-trump
* * *