What Does Natural Language Processing Mean for Writers, Content, and Digital Marketing?

Cate McGehee Apr 19 2018

Terrible Movie Pitch:

The battle for the voice of the internet has begun. In one corner, we have computer programs fortified by algorithms, Artificial Intelligence, Natural Language Processing, and other sexy STEM buzzwords. In the other corner, we have millions of copywriters armed with the only marketable skill a liberal arts education can provide: communication. Who will lol the last lol?

Spoiler:

Writers, your jobs are probably safe for a long time. And content teams stand to gain more than they stand to lose.

I remember the day someone told me a computer had written a best-selling novel in Russia. My first thought? “I need to get the hell out of content marketing.”

The book was called True Love—an ambitious topic for an algorithm. It was published in 2008 and “authored” by Alexander Prokopovich, chief editor of the Russian publishing house Astrel-SPb. It combines the story of Leo Tolstoy’s Anna Karenina and the style of Japanese author Haruki Murakami, and draws influence from 17 other major works.

Frankly, that sounds like it’d make for a pretty good book. It also sounds a lot like how brands create their digital marketing strategies.

Today, every brand is a publisher. Whether you’re a multi-billion-dollar technology company or a family-run hot sauce manufacturer, content rules your digital presence. Maybe this means web guides, blog posts, or help centers. Maybe it means a robust social media presence or personalized chatbot dialogue. Maybe you feel the need to “publish or perish,” and provide value and engagement in a scalable way.

Brands require a constant influx of written language to engage with customers and maintain search authority. And in a way, all the content they require is based on 26 letters and a few rules of syntax. Why couldn’t a machine do it?

In the time since I first heard about True Love, I’ve moved from content writing to content strategy and UX, trying to stay one step ahead of the algorithms. But AI in general and Natural Language Processing in particular are only gaining momentum, and I find myself wondering more and more often what they’ll mean for digital marketing.

This essay will endeavor to answer that question through conversations with experts and my own composite research.

Portent’s Matthew Henry Talks Common Sense

“The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform.”

-Lady Ada Lovelace, 1842, as quoted by Alan Turing (her italics)

Lady Lovelace might have been the first person to contend that computers will only ever know as much as they’re told. But today’s white-hot field of machine learning and Artificial Intelligence (AI) hinges on computers making inferences and synthesizing data in combinations they were never “ordered to perform.”

One application of machine learning and AI is Natural Language Processing (NLP), which involves the machine parsing of spoken or written human language. A subfield of NLP is Natural Language Generation (NLG), which involves producing human language. NLP is kind of like teaching computers to read; NLG is like teaching them to write.

I asked Portent’s Development Architect Matthew Henry what he thinks about the possibilities for NLP and content marketing. Matthew has spent over a decade developing Portent’s library of proprietary software and tools, including a crawler that mimics Google’s own. Google is one of the leading research laboratories for NLP and AI, so it makes sense that our resident search engine genius might know what the industry’s in for.

I half expected to hear that he’s already cooking up an NLP tool for us. Instead, I learned he’s pretty dubious that NLP will be replacing content writers any time soon.

“No computer can truly understand natural language like a human being can,” says Matthew. “Even a ten year old child can do better than a computer.”

“A computer can add a million numbers in a few seconds,” he continues, “which is a really hard job for a human being. But if a cash register computer sees that a packet of gum costs $13,000, it won’t even blink. A human being will instantly say Oh, that’s obviously wrong. And that’s the part that’s really hard to program.”

“Knowing that something is obviously wrong is something we do all the time without thinking about it, but it’s an extremely hard thing for a computer to do. Not impossible: to extend my analogy, you could program a computer to recognize when prices are implausible, but it would be a giant project, whereas for a human being, it’s trivial.”
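
To make Matthew’s analogy concrete, here is a minimal sketch of what “programming a computer to recognize implausible prices” might look like. The item names, price ranges, and the `is_plausible` helper are all hypothetical, invented for illustration. The sketch only knows about items someone has typed into its table, which is exactly why the real thing would be a giant project.

```python
# Hypothetical typical price ranges in dollars; every entry is made up for illustration.
TYPICAL_PRICE_RANGE = {
    "gum": (0.50, 5.00),
    "candy bar": (0.50, 4.00),
    "laptop": (200.00, 5000.00),
}


def is_plausible(item: str, price: float, slack: float = 10.0) -> bool:
    """Return True if `price` falls within `slack` times the typical range for `item`.

    Items missing from the table are reported as plausible, because the
    program has no rule for them and therefore no opinion.
    """
    if item not in TYPICAL_PRICE_RANGE:
        return True
    low, high = TYPICAL_PRICE_RANGE[item]
    return low / slack <= price <= high * slack


print(is_plausible("gum", 1.30))      # True
print(is_plausible("gum", 13000.00))  # False
print(is_plausible("yacht", 13.00))   # True, only because no rule covers yachts
```

The last line is the interesting one: a human knows a $13 yacht is absurd without being told, while the program shrugs at anything outside its table.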

It’s not news that there are things computers are really good at that humans are bad at, and some things humans are really good at that computers can’t seem to manage. That’s why Amazon’s Mechanical Turk exists. As they say,

“Amazon Mechanical Turk is based on the idea that there are still many things that human beings can do much more effectively than computers, such as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, or researching data details.”

Amazon calls the work humans do through Mechanical Turk “Human Intelligence Tasks,” or HITs. Companies pay humans small sums of money to perform these HITs. (A made-up example might be identifying pictures where someone looks “sad” for 10 cents a pop.)

Matthew might instead call these HITs “Common Sense Tasks,” like knowing a pack of gum shouldn’t cost $13,000.

“People underestimate the power of common sense,” Matthew says. “No one has ever made a computer program that truly has common sense, and I don’t think we’re even close to that.”

And here’s the real quantum leap for not only NLP but Artificial Intelligence: right now, computers only know what they’ve been told. Common sense is knowing something without being told.

It sounds cheesy to say that our imaginations are what separate us from the machines, but imagination isn’t just about being creative. Today, computers can write poetry and paint like Rembrandt. Google made a splash in 2015 when the neural networks they’d trained on millions of images were able to generate pictures from images of random noise, something they called neural net “dreams.” And in 2016, they announced Project Magenta, which uses Google Brain to “create compelling art and music.”

So it’s not “imagination” in any artistic terms. It’s imagination in the simplest, truest form: knowing something you haven’t been told. Whether it’s Shakespeare inventing 1,700 words for the English language, or realizing that kimchi would be really good in your quesadilla, that’s the basis of invention. That’s also the basis of common sense and of original thought, and it’s how we achieve understanding.

To explain what computers can’t do, let’s dig a little deeper into one of the original Common Sense Tasks: understanding language.

Defining “Understanding” for Natural Language

NLP wasn’t always called NLP. The field was originally known as “Natural Language Understanding” (NLU) during the 1960s and ’70s. Folks moved away from that term when they realized that what they were really trying to do was get a computer to process language, not understand it. Understanding is more than just turning input into output.

Semblances of NLU do exist today, perhaps most notably in Google search and the Hummingbird algorithm that enables semantic inferences. Google understands that when you ask, “How’s the weather?” you probably mean, “How is the weather in my current location today?” It can also correct your syntax intuitively:

Natural Language Processing example from Google Search - natural language understanding

And it can also anticipate searches based on previous searches. If you search “Seattle” and follow it with a search for “what is the population,” the suggested searches are relevant to your previous search:

Natural Language Processing example from Google Search inferred from previous searches

This is semantic indexing, and it’s one of the closest things out there to true Natural Language Understanding because it knows things without being told. But you still need to tell it a lot.

“[Google’s algorithm] Hummingbird can find some patterns that can give it important clues as to what a text is about,” says Matthew, “but it can’t understand it the way a human can understand it. It can’t do that, because no one’s done that, because that would be huge news. That would basically be Skynet.”

What is Skynet Google search knowledge graph result

In case you don’t know what Skynet is and you’re too embarrassed to ask Matthew, here’s the Knowledge Graph.

Expert Opinion: NLP Scholar Dr. Yannis Constas on Why Language is So Freaking Hard to Synthesize

To find out what makes natural language so difficult to synthesize, I spoke with NLP expert Dr. Yannis Constas, a postdoctoral research fellow at the University of Washington, about the possibilities and limitations for the field. [1]

There are a lot. Of both. But especially limitations.

[Note: If you don’t want a deep dive into the difficulties of an NLP researcher, you might want to skip this section.]

“There are errors at every level,” says Yannis.

“It can be ungrammatical, you can have syntactic mistakes, you can get the semantics wrong, you can have referent problems, and you might even miss the pragmatics. What’s the discourse? How does one sentence entail from the previous sentence? How does one paragraph entail from the previous paragraph?”

One of the first difficulties Yannis tells me about is how much data it takes to train an effective NLP model. This “training” involves taking strings of natural language that have been labeled (by a human) according to their parts of speech and feeding those sentences into an algorithm, which learns to identify those parts of speech and their patterns.
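
As a rough illustration of that training step, the sketch below builds about the simplest possible tagger: it memorizes the most frequent human-assigned tag for each word. This is a toy baseline under my own assumptions, not the kind of model Yannis works on, and the three labeled sentences, the tag names, and the `train` and `tag` functions are hypothetical.

```python
from collections import Counter, defaultdict

# Toy hand-labeled training data (hypothetical). Real systems need vastly more.
LABELED_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train(sentences):
    """Count how often each word carries each tag, then keep the most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, unknown="UNK"):
    """Label each word with its memorized tag, or `unknown` if it was never seen."""
    return [(word, model.get(word.lower(), unknown)) for word in words]


model = train(LABELED_SENTENCES)
print(tag(model, ["A", "cat", "runs"]))
# [('A', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB')]
print(tag(model, ["The", "hot", "sauce", "sells"]))
# [('The', 'DET'), ('hot', 'UNK'), ('sauce', 'UNK'), ('sells', 'UNK')]
```

The second call shows the data problem in miniature: every word the model never saw in its labeled input comes back as unknown.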

Unfortunately, it takes an almost inconceivable amount of data to “train” a good algorithm, and sometimes there just isn’t enough input material in the world to make an accurate model.

“When we’re talking about a generic language model to train on, we’re talking about hundreds of millions of sentences,” he says. “That’s how many you might need to make a system speak good English with a wide vocabulary. However, you cannot go and get hundreds of millions of branded content sentences because they don’t exist out there.”

Yannis says he once tried to make an NLP model that could write technical troubleshooting guides, which might be a popular application for something like corporate support chatbots. He only had 120 documents to train it on. It didn’t work very well.

Right now, his research team is trying to figure out a way to combine corpora of language to overcome the twin pitfalls of meager input:

  • Output that doesn’t make much linguistic sense
  • Output that all sounds pretty much the same

“We tried to take existing math book problems targeted at 4th graders and make them sound more interesting by using language from a comic book or Star Wars movie,” says Yannis. “That was specific to that domain, but you can imagine taking this to a marketing company and saying, ‘Look, we can generate your product descriptions using language from your own domain.’”

That’s the grail of NLP: language that is accurate to the domain yet diverse and engaging. Well, one of the grails. Another would be moving past the level of the sentence.

“80 to 90 percent of the focus of NLP has been on sentence processing,” says Yannis. “The state-of-the-art systems for doing semantic processes or syntactic processing are on a sentence level. If you go to the document level—for example, summarizing a document—there are just experimental little systems that haven’t been used very widely yet…The biggest challenge is figuring out how to put these phrases next to one another.”

It’s not that hard for an algorithm to compose a sentence that passes the Turing Test, or even hundreds of them. But language is greater than the sum of its parts, and that’s where NLP fails.

“When you break out of the sentence level, there is so much ambiguity,” says Yannis. “The models we have implemented now are still very rule-based, so they only cover a very small domain of what we think constitute referring expressions.”

“Referring expressions,” Yannis tells me, are those words that stand in for or reference another noun, like he, she, it, or these. He uses the example, “Cate is holding a book. She is holding it and it is black.” An NLP model would probably struggle to work out that “she” is “Cate” and “it” is “the book.”
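
To get a feel for why rule-based approaches cover so little ground, here is a deliberately naive sketch that resolves pronouns in Yannis’s example sentence. It is my own illustration, not his team’s system. The two-word lexicon and the “most recent compatible noun” rule are assumptions, and the `resolve_pronouns` function is hypothetical.

```python
import re

# Hypothetical mini-lexicon: the pronoun each known noun can be referred to by.
NOUN_PRONOUN = {"cate": "she", "book": "it"}
PRONOUNS = {"she", "it"}


def resolve_pronouns(text):
    """Pair each pronoun with the most recent noun that the lexicon says it can refer to."""
    last_seen = {}  # pronoun -> most recent compatible noun
    resolved = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in NOUN_PRONOUN:
            last_seen[NOUN_PRONOUN[word]] = token
        elif word in PRONOUNS:
            # None means the rules have no candidate at all.
            resolved.append((token, last_seen.get(word)))
    return resolved


print(resolve_pronouns("Cate is holding a book. She is holding it and it is black."))
# [('She', 'Cate'), ('it', 'book'), ('it', 'book')]
print(resolve_pronouns("It is raining."))
# [('It', None)]
```

It handles the example only because both nouns were hand-entered. Any noun outside the lexicon is invisible to it, and a sentence like “It is raining,” where “it” refers to nothing, leaves it with no answer.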

“It’s something that sounds very simple to us,” says Yannis, “because we know how these things work because we’ve been exposed to these kinds of phenomena all our lives. But for a computer system in 2017, it’s still a significant problem.”

Models are also inherently biased by their input sources, Yannis tells me. For example, we’re discussing an AI researcher friend of his who combines neural networks and NLP to generate image descriptions. This seems like it would be an amazing way to generate alt tags for images, which is good for SEO but a very manual pain in the ass.

Yannis says that even this seemingly-generic image captioning model betrays bias. “Most images that show people cooking are of women,” he says. “People that use a saw to cut down a tree are mostly men. These kinds of biases occur even in the data sets that we think are unbiased. There’s 100,000 images—it should be unbiased. But somebody has taken these photos, so you’re actually annotating and collecting the biases.

“Similarly, if you were to generate something based on prior experience, the prior experience comes from text. Where do we get this text from? The text comes from things that humans have written…If you wanted to write an unbiased summary of the previous election cycle, if you were to use only one particular news domain, it would definitely be biased.”

Oh yeah, and using neural networks for creating image captions isn’t just biased, it’s not always accurate. Here are a few examples from Stanford’s “Deep Visual-Semantic Alignments for Generating Image Descriptions”:

Stanford Deep Visual-Semantic Alignments for Generating Image Descriptions that are funny or inaccurate

Sometimes it’s right. Sometimes hilariously wrong.

Finally, perhaps one of the biggest hurdles for NLP is particular to machine learning. Interestingly, it sounds a lot like something Matthew said.

“One common source of error is lack of common sense knowledge,” says Yannis. “For example, ‘The earth rotates around the sun.’ Or even facts like, ‘a mug is a container for liquid.’ You’ve never seen that written anywhere, so if a model were to generate that it wouldn’t know how to do it. If it had knowledge of that kind of thing, it could make the inference that coffee is a liquid and so this mug could be a container of coffee. We are not there. Machines cannot do that unless you give them that specification.”

[1] Note: This interview was conducted in May of 2017. Quotes from Yannis only reflect his work, experience, and understanding at the point of this interview.

NLP is Hard. So is Programming. English is Harder.

English is hard - examples of homonyms that are hard for NLP

It’s kind of funny that we need common sense to navigate our language, because so much of it makes so little sense. Perhaps especially English.

There are synonyms, homonyms, homophones, homographs, exceptions to every rule, and loan words from just about every other language. There are phrases like, “If time flies like an arrow then fruit flies like a banana.” If you need this point really driven home, read the poem “The Chaos,” by Gerard Nolst Trenité, which contains over 800 irregularities of English pronunciation. Irregularities are systemic: they’re in pronunciation, spelling, syntax, grammar, and meaning.

Code is actually simpler and less challenging than natural language, if you think about this deeply. People have this impression it’s a heavy, mathematical thing to do, and it’s a job skill, so maybe it’s harder. But I can spend six months at Javascript and I’m fairly good at Javascript; if I’ve spent six months with Spanish, I’m barely a beginner.

-Internet linguist Gretchen McCulloch to Vox

Code is the “language” of computers because it’s perfectly regular, and computers aren’t good at synthesizing information or filling in the blanks on their own. That requires imagination and common sense. A computer can only “read” a programming language that’s perfectly written. Ask any programmer who’s spent hours poring over her broken code looking for that one semicolon that’s out of place. If our minds processed language the way a computer does, you couldn’t understand this sentence:

Example of common misspellings

Sure, computers can autocorrect those four misspelled words, and there’s a red line under them on my word processor. But that’s because there are rules for that, like how you can train a computer to recognize that a pack of gum shouldn’t cost $13,000 because that’s roughly 10,000 times the going rate.
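
For a sense of how mechanical those rules can be, here is a minimal autocorrect sketch built on Python’s standard `difflib` module, which scores how similar two strings are. The five-word vocabulary, the 0.8 cutoff, and the `autocorrect` helper are assumptions for illustration. Real spell checkers use much larger word lists and smarter ranking.

```python
import difflib

# Hypothetical tiny dictionary; a real spell checker ships with a far larger word list.
VOCABULARY = ["definitely", "separate", "necessary", "receive", "language"]


def autocorrect(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input unchanged if nothing is close enough."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(autocorrect("definately"))  # definitely
print(autocorrect("recieve"))     # receive
print(autocorrect("xyzzy"))       # xyzzy, because nothing in the list is close
```

The program never knows what “definitely” means. It only knows that one string of letters is a near match for another string it was given.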

Humans, however, are great at making inferences from spotty data. Our bodies do it all the time. Our eyes and brain are constantly inventing stuff to fill in the blind spot in our field of vision, and we can raed setcennes no mtaetr waht oredr the ltteers in a wrod are in, as lnog as the frist and lsat ltteer are in the rghit pclae.
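
If you want to try this on your own sentences, the sketch below shuffles the inner letters of each word while pinning the first and last letter in place. The `scramble_word` and `scramble_text` functions are my own hypothetical helpers, and the fixed seed is only there so the output is repeatable.

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the inner letters of a word, keeping the first and last letter fixed."""
    if len(word) <= 3:
        return word  # nothing to shuffle in words this short
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]


def scramble_text(text, seed=0):
    """Scramble every alphabetic word in `text`, leaving spaces and punctuation alone."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda match: scramble_word(match.group(0), rng), text)


print(scramble_text("Humans are great at making inferences from spotty data"))
```

Producing the jumble takes a dozen lines. Reading it back is the part that comes easily to people and not to programs.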

Inference is something we were built (or rather, evolved) to do, and we’re great at it. In fact, humans actually learn languages better when they try less hard. Language takes root best in our “procedural memory,” which is the unconscious memory bank of culturally learned behaviors, rather than in our “declarative memory,” which is where you keep the things you’ve deliberately worked to “memorize.” Children can pick up other languages more easily than adults because they’re tapping into their procedural memory.

Computers, however, were designed to excel where humans are deficient, not to just duplicate our greatest strengths.

Trying to teach a computer to process and generate natural language is kind of like trying to build a car that can dance.

It’s fallacious to assume that because cars are much better than humans at going in one direction really fast, they would also make much better dancers if we could only get the formulas right. Instead, it seems, we should focus on the ways machines’ strengths help us compensate for our deficiencies.

The Future of AI and NLP Means Helping Us, Not Replacing Us

The power of the unaided mind is highly overrated. Without external aids, memory, thought, and reasoning are all constrained. But human intelligence is highly flexible and adaptive, superb at inventing procedures and objects that overcome its own limits. The real powers come from devising external aids that enhance cognitive abilities.

-Donald Norman, from Things that Make Us Smart (1993)

“The way that I think about AI, it has to be assistive,” says Yannis. “For an application to be successful, it has to be assistive to human society. We can go into the idea that robots are going to take over the world and they just need to learn to speak first, and that’s kind of cool for a movie. But we have problems with each other we haven’t solved yet, so let’s use this technology to help us in our everyday life.”

Once again, Matthew and Yannis are in uncanny agreement.

“A lot of people worry that technology will take over all the jobs and Skynet will exterminate humanity, but I don’t think any of those worries are that plausible,” Matthew tells me. “If you look at the history of technologies taking over people’s jobs, there’s usually a major disruption and then new jobs are created. People were screaming bloody murder when they invented machines to weave fabrics, but there are plenty of people in the textile industry today. Fabric weaving was a tedious, dangerous job, and they had factories full of child laborers because no one wanted to do it…

“I’m not that scared about the future of technology,” continues Matthew. “I think that the world is much better off than it was centuries ago, and I think it’s most likely going to be even better still in the future.”

If Yannis and Matthew are right, NLP won’t be replacing content writers anytime soon. In fact, we might stand to gain even more from the technologies.

Comic about reassuring ourselves of human value in the face of artificial intelligence

2 Comments

  1. Very nice piece of work, Cate! Sums up the problems of NLP beautifully, while letting writers know they needn’t be looking for a new career just yet!

    • The Portent Team

      Thanks, Doc! Of course, I’m still crossing my fingers for a near future of Universal Basic Income made possible by machine labor and a public safety net befitting a developed nation, so hopefully the writers will be safe for a long time.
