These are things I found interesting while reading Speech & Language Processing: An Introduction to Natural Language Processing, Computational Linguistics & Speech Recognition with Language Models

The notes are still in progress as I have not yet finished reading the book

I. Fundamental Algorithms for NLP

2. Regular Expressions, Tokenization, Edit Distance

  • ELIZA was found convincing by many users because the system emulated a person who was not expected to refer to any outside context
  • A collection of texts is called a corpus
  • Regular expressions in the book are placed between slashes / /, but I can’t find the historical context for that, as I don’t remember using slashes in regular expressions in languages and tools that I’ve used
  • One issue with backslashes \ is that they are used to escape characters both by string literals in the programming language and by the regular expression library. So to match a literal backslash \ in a string, the regular expression \\ is used. But those backslashes must also be escaped when written as a string in the programming language, so you end up passing the string "\\\\" to the regular expression library. Languages like Python sometimes use tricks like raw string notation r"\\" to make the backslashes less annoying
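    For my own reference, a quick Python sketch of that double escaping (the path string is just a made-up example):
      import re

      # To match one literal backslash, the regex engine needs \\ , and each of
      # those backslashes must itself be escaped in an ordinary string literal.
      text = r"C:\Users\me"
      print(re.findall("\\\\", text))   # two matches, one per backslash in text
      print(re.findall(r"\\", text))    # same pattern written as a raw string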
  • I’m not as familiar with the non-greedy (match as little as possible) Kleene operators *? and +?, and the lookahead assertions (?= and (?!, so it was interesting to learn about them. I wonder how often they come up in practice
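    A small sketch of how they behave, using Python's re module (the strings are made-up examples):
      import re

      html = "<b>bold</b> and <i>italic</i>"
      # Greedy .* runs to the last '>', swallowing everything in between
      print(re.findall(r"<.*>", html))    # ['<b>bold</b> and <i>italic</i>']
      # Non-greedy .*? stops at the first '>' it can
      print(re.findall(r"<.*?>", html))   # ['<b>', '</b>', '<i>', '</i>']

      # Lookaheads check what follows without consuming it
      print(re.findall(r"foo(?=bar)", "foobar foobaz"))   # ['foo'] (the one before 'bar')
      print(re.findall(r"foo(?!bar)", "foobar foobaz"))   # ['foo'] (the one before 'baz')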
  • It’s fun to have a detailed explanation of regular expressions. In practice, complicated regular expressions can be difficult to get right. Developers often looked up common complicated ones on sites like Stack Overflow and then modified them to their taste. Now they often use LLMs for a first draft.
  • In Fei-Fei Li’s The Worlds I See, one of the themes was the importance of good training data. This book also focuses on explorations of training data, like the Brown corpus, the Switchboard corpus, and the massive Google n-grams corpus
  • There’s a link to Using uh and um in spontaneous speaking about how the fillers “uh” and “um” are used differently: “uh” for a short filled pause and “um” for a long filled pause
  • In biology, the line between what is in the same species or not is malleable and sometimes even political. In the discussion of what is a language and what is a dialect, I wonder if the same issues appear
  • My grandfathers’ primary languages were Portuguese and Spanish, so I heard code-switching all the time growing up
  • I’ve read many older books and have noted the differences in style over time, so it makes sense that researchers have explicitly created corpora from particular time periods
  • One of the elements in the datasheet is “Are there copyright or other intellectual property restrictions?” I think a lot of the law pertaining to training on copyrighted data is still up in the air, so I’d be interested in learning more about this topic
  • I enjoy reading historical notes, so it was great to see the example of Unix command pipelines like tr → sort → uniq, which have been supplanted by scripting languages with mature Unicode support
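    As a side note to myself, a rough Python equivalent of that classic word-counting pipeline (the filename is made up):
      import re
      from collections import Counter

      # Roughly what a `tr ... | sort | uniq -c | sort -rn` pipeline does for word
      # counts, with lowercasing added: split on non-letters, count, sort by frequency
      with open("shakes.txt", encoding="utf-8") as f:   # hypothetical input file
          words = re.findall(r"[A-Za-z]+", f.read().lower())

      for word, count in Counter(words).most_common(10):
          print(count, word)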
  • In compilers, the tokenization stage is also called lexical analysis or lexing
  • The Natural Language Toolkit (NLTK) for Python is recommended
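    A minimal usage sketch (the sentence is made up; word_tokenize needs the punkt models downloaded once, and newer NLTK versions may want punkt_tab instead):
      import nltk

      nltk.download("punkt", quiet=True)
      print(nltk.word_tokenize("Mr. O'Neill doesn't like the city's new tokenizer."))
      # Splits off clitics and punctuation, e.g. "does", "n't", ..., "."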
  • It’s not super explicit in the book, so I want to spell it out for myself here: tokenization seems to balance a tradeoff between vocabulary size and the resulting complexity of the model. With too many symbols, the result is a model that has to deal with “a huge vocabulary with large numbers of very rare words”
    With too few symbols, the vocabulary stays manageable, but many words will have to be broken into several smaller symbols
    Bottom-up tokenization makes the best of the tradeoff: “most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts”
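    To make that concrete for myself, a minimal sketch of byte-pair-encoding-style bottom-up token learning (simplified relative to the book’s algorithm: no end-of-word symbol, and the corpus is a toy example):
      from collections import Counter

      def learn_bpe(corpus_words, num_merges):
          # Start with each word as a tuple of single-character symbols
          vocab = Counter(tuple(word) for word in corpus_words)
          merges = []
          for _ in range(num_merges):
              # Count adjacent symbol pairs across the corpus, weighted by word frequency
              pairs = Counter()
              for symbols, freq in vocab.items():
                  for a, b in zip(symbols, symbols[1:]):
                      pairs[(a, b)] += freq
              if not pairs:
                  break
              best = max(pairs, key=pairs.get)
              merges.append(best)
              # Merge every occurrence of the most frequent pair into one new symbol
              new_vocab = Counter()
              for symbols, freq in vocab.items():
                  merged, i = [], 0
                  while i < len(symbols):
                      if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                          merged.append(symbols[i] + symbols[i + 1])
                          i += 2
                      else:
                          merged.append(symbols[i])
                          i += 1
                  new_vocab[tuple(merged)] += freq
              vocab = new_vocab
          return merges

      # Frequent words end up as single symbols; rare words stay split into pieces
      corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
      print(learn_bpe(corpus, 8))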
  • The statement “lemmatization algorithms can be complex” seems like an understatement. I wonder how much of it can be learned directly from text, the way bottom-up tokenization is
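    For comparison, a small sketch contrasting NLTK’s rule-based Porter stemmer with its dictionary-backed WordNet lemmatizer (needs the wordnet data downloaded once; exact outputs may vary by NLTK version):
      import nltk
      from nltk.stem import PorterStemmer, WordNetLemmatizer

      nltk.download("wordnet", quiet=True)

      stemmer = PorterStemmer()
      lemmatizer = WordNetLemmatizer()

      print(stemmer.stem("running"))                   # crude suffix stripping
      print(lemmatizer.lemmatize("running", pos="v"))  # should give the lemma "run"
      print(lemmatizer.lemmatize("geese"))             # irregular plural -> "goose"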