Data Analysis

Using the same techniques shown here, let’s see what sentences look like in an award-winning science-fiction horror short story. I loaded “The Bucket” into RStudio:


“The Bucket” is a 2,460-word short story about an alcoholic chemist who pukes eyeballs. The eyeballs belong to an extra-dimensional being who needs him to keep drinking so he can be vomited into existence. Its average sentence is only about eight words long, which is quite short. Short sentences are more easily understood and can deliver a devastating punch. In the R readout, we can see that many sentences are a friendly, momentum-building 1-5 words.
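If you want to try the same counting without R, here is a minimal Python sketch. It is not the R code behind the readout above. The function names (`split_sentences`, `word_count`, `summarize`) and the sample text are made up for illustration, and the sentence splitter is a rough heuristic that assumes a sentence ends at `.`, `!`, `?` or `…` followed by whitespace.

```python
import re
from statistics import mean

def split_sentences(text):
    """Split after ., !, ? or an ellipsis character when whitespace follows.
    A rough heuristic: dialogue and abbreviations will trip it up."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]

def word_count(sentence):
    """Count whitespace-separated tokens."""
    return len(sentence.split())

def summarize(text):
    """Sentence lengths in story order, plus two summary numbers."""
    lengths = [word_count(s) for s in split_sentences(text)]
    short = sum(1 for n in lengths if n <= 5)
    return {
        "lengths": lengths,
        "mean_words": mean(lengths),
        "share_1_to_5": short / len(lengths),
    }

# Placeholder text, NOT taken from the story.
sample = "It rained. The bucket sat in the closet all night! Why? Nobody could say for certain."
print(summarize(sample))
```

Swap `sample` for the full text of a story to get its list of sentence lengths, which is all the later steps need.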

But sheets of numbers aren’t too readable. Let’s plot them by order in the story:


This data has awesome trends which show you how the story was structured. Let’s take it one aspect at a time, starting with an examination of the outliers:


Sentence 1 is a 48-word behemoth right at the front of the piece. I’m wary of this, as it seems like a gatekeeper, and a long, clumsy, burdensome sentence can halt a reader’s interest. But this sentence is actually load-bearing: it perfectly demonstrates what the story is about, and should keep the reader going for at least a page or so:

I generally vomited into a toilet and flushed the eyeballs before the painful horror set in, but after a midnight joust with a bottle of gin, I found myself retching into an orange, plastic bucket in my closet, where the eyeballs struggled like fish flopping for the water.

Also note the repeated sounds which guide the reader through the sentence. “Midnight joust with a bottle of gin.” “Fish flopping.” These provide continuity for the reader to follow so they don’t get lost along the way.

I believe this is Sentence #2:

I won’t get into details, it involves a lot of trans-dimensional mathematics, and you Stage One guys aren’t hot on that.

Another vital pillar of the story. It introduces the Stages of Reality, which are important in the plot. Well worth spending 22 words on. Similarly, Sentence #3 is the barely lengthy:

There’s an equation for it, but you can only really solve it at Stage Two or Three…

This story is all about power hierarchies. Arnie is powerless before alcohol, and Trip is so far above him that Arnie cannot comprehend him. Long sentences should reinforce that theme, as does this broken thought which makes up Sentence #4:

When Trip said you were ‘Stage Three technology,’ he meant—You’re saying Trip enslaved eighty-six billion sentient realities, and you’re one of them?”

That’s the turning point of the story, where the reader knows there’s no going back for our hero Arnie. Trip is a mass-slaver of universal proportions, and Arnie can’t let him hold reality hostage!

Sentences 5 and 6 showcase a bit of body-horror near the climax:

“Well, ordinarily you’d hafta swallow that thing, but our universes are close enough I can toss you a Synapse Cable. You feel like hurling, Arnie?”

“I’m pretty sober right now.”

“Well, either you’re gonna hafta swallow that Pod, or you’re gonna hafta start drinking so I can get you this Cable.”

And Sentence 7 sets up the final bit of body-horror:

I’m close enough, Arnie, if you vomit right now, I can survive in your universe, but it has to be right now.

(You might notice that some of these sentences have slightly different word counts here than they do on the plot; I’m having some trouble with the regex. It’s accurate enough.)
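I don’t know exactly which pattern causes the mismatch, but here is a Python sketch of how easily word counts drift depending on the counting rule. It runs three rules over the Stage One sentence quoted above. The function names and the `WORD` pattern are assumptions for illustration, not the regex used for the plot.

```python
import re

# Assumed pattern: letters/digits, optionally joined by an apostrophe or hyphen,
# so "won't" and "trans-dimensional" each count as one word.
WORD = re.compile(r"[A-Za-z0-9]+(?:[\u2019'-][A-Za-z0-9]+)*")

def count_whitespace(sentence):
    """Rule 1: anything between spaces is a word."""
    return len(sentence.split())

def count_naive_regex(sentence):
    """Rule 2: runs of word characters; splits contractions and hyphenated words."""
    return len(re.findall(r"\w+", sentence))

def count_tolerant_regex(sentence):
    """Rule 3: the assumed WORD pattern above."""
    return len(WORD.findall(sentence))

# The sentence quoted earlier; \u2019 is the curly apostrophe.
stage_one = ("I won\u2019t get into details, it involves a lot of "
             "trans-dimensional mathematics, and you Stage One guys "
             "aren\u2019t hot on that.")

for rule in (count_whitespace, count_naive_regex, count_tolerant_regex):
    print(rule.__name__, rule(stage_one))
```

The naive `\w+` rule counts three extra words here, because it breaks “won’t”, “aren’t” and “trans-dimensional” in two. A difference of a word or two per sentence is about what you would expect from quirks like these.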

These longest sentences form the tent-poles of the story. The other sentences are draped on them, leaning on them for context. Once a long sentence sets up a situation, the sentences around it can focus on action. Here’s a way to visualize that:


Using the conventional wisdom that longer sentences are more exposition-heavy, while shorter sentences are more action-oriented, these pink regions show the ‘berth of words’ created by exposition. Sentences reduce their word counts by allowing a few important lines to hold the boring bits!
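One way to put a number on that ‘berth’ is sketched below in Python. This is an assumption about how such regions could be computed, not how the pink regions were actually drawn: it treats the `k` longest sentences as tent-poles, stretches a straight line between neighboring poles, and measures the gap between that line and each sentence. The names `tent_poles`, `ceiling` and `berth`, the default `k=3`, and the sample lengths are all hypothetical.

```python
def tent_poles(lengths, k=3):
    """Indices of the k longest sentences, returned in story order (k >= 1)."""
    ranked = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return sorted(ranked[:k])

def ceiling(lengths, poles):
    """Straight lines between consecutive poles; flat before the first
    pole and after the last one."""
    top = []
    for i in range(len(lengths)):
        left = max((p for p in poles if p <= i), default=None)
        right = min((p for p in poles if p >= i), default=None)
        if left is None:                 # before the first pole
            top.append(float(lengths[right]))
        elif right is None:              # after the last pole
            top.append(float(lengths[left]))
        elif left == right:              # sentence i is itself a pole
            top.append(float(lengths[i]))
        else:                            # interpolate between two poles
            t = (i - left) / (right - left)
            top.append(lengths[left] + t * (lengths[right] - lengths[left]))
    return top

def berth(lengths, k=3):
    """Gap between the tent ceiling and each sentence's actual length."""
    top = ceiling(lengths, tent_poles(lengths, k))
    return [c - n for c, n in zip(top, lengths)]

# Made-up sentence lengths, NOT the story's data.
lengths = [48, 3, 5, 22, 4, 2, 6, 18, 1, 3]
print(tent_poles(lengths))
print(berth(lengths))
```

The berth is zero at each pole and largest for the short sentences sheltering between two long ones.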


This isn’t very scientific of me, but I drew some random lines. To me, it appears that once a long sentence gets a reader to the top of a roller-coaster, so to speak, it drops them into an action-valley which provides momentum to heft the reader’s attention up the next hill.


Meanwhile, looking at the shortest sentences, we see the interjections, the cut-off gasps and pants, etc. I notice concavities beneath the exposition sentences we observed earlier. Near exposition, there are fewer sudden actions warranting a single-word sentence! I wonder if this is just a product of my writing style, or if these features form the heartbeat of a book. I should apply this analysis to more short stories!
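A quick way to check for that floor in other stories is a rolling minimum: for each sentence, take the shortest sentence in a small window around it. The Python sketch below is one possible approach, with a hypothetical function name, window size, and made-up lengths.

```python
def rolling_min(lengths, half_window=2):
    """Shortest sentence length within half_window sentences on either side.
    The window is clipped at the start and end of the story."""
    floor = []
    for i in range(len(lengths)):
        lo = max(0, i - half_window)
        hi = min(len(lengths), i + half_window + 1)
        floor.append(min(lengths[lo:hi]))
    return floor

# Made-up sentence lengths with one long sentence in the middle.
lengths = [1, 2, 9, 30, 8, 7, 1, 1, 3]
print(rolling_min(lengths, half_window=1))
```

If the floor lifts off 1 around the long sentences and drops back afterward, the story has the same concavities.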

Finally, I’ll leave you with some diagnostic plots. The first four relate to the sentence-lengths as they are. The second four relate to the square roots of the sentence-lengths—this makes the data a little more linear. We can see that the sentence labeled 6 here, and labeled 1 above, is always an outlier. It’s like the thesis of the fiction. It demands to be noticed.
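For readers without R, here is a rough Python stand-in for the outlier check. It assumes the diagnostics come from a straight-line fit of sentence length against position in the story, which is my guess and not something stated above. It fits the line by least squares, scales the residuals by their spread, and reports the most extreme sentence, once on the raw lengths and once on their square roots. The scaling is cruder than the standardized residuals in R’s diagnostic plots, since it ignores leverage, and the lengths are made up.

```python
from math import sqrt
from statistics import mean, pstdev

def fit_line(ys):
    """Least-squares intercept and slope for ys against positions 0, 1, 2, ..."""
    xs = range(len(ys))
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope

def scaled_residuals(ys):
    """Residuals from the fitted line, divided by their standard deviation."""
    intercept, slope = fit_line(ys)
    residuals = [y - (intercept + slope * x) for x, y in enumerate(ys)]
    spread = pstdev(residuals)
    return [r / spread for r in residuals]

def most_extreme(ys):
    """Index and scaled residual of the point furthest from the line."""
    z = scaled_residuals(ys)
    i = max(range(len(z)), key=lambda j: abs(z[j]))
    return i, z[i]

# Made-up sentence lengths, NOT the story's data.
lengths = [48, 3, 5, 22, 4, 2, 6, 18, 1, 3]
print("raw:", most_extreme(lengths))
print("sqrt:", most_extreme([sqrt(n) for n in lengths]))
```

With these made-up numbers, the long opening sentence is the most extreme point both before and after the square-root transform.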





