Chapter 5: Strings

A string is a sequence of one or more characters, and one of the most frequently used types in programming. It is therefore fitting that we acquaint ourselves with the idea of operating on strings.

String and character literals

You might be familiar by now with string and character literals from the introductory chapter, which introduced some literals, or from other programming languages. A string literal is surrounded by double quotes: " string ". Within the string, you can escape a double-quote using a backslash:

    "This string contains a \" double quote \" "

Strings are immutable and indexable – indices return the characters at the index position, starting from 1.

The difference between string and character literals

String and character literals are differentiated by two indicia:

  • strings may have a length other than one while a Char type object necessarily has the length one (or potentially zero),
  • strings are introduced and terminated by double quotation marks "", Char type objects are introduced by single apostrophes ''.

The second of these tends to be somewhat vexing for many programmers who are used to the equivalence of '' and "" in languages that do not necessarily have an implemented type or class for characters mirroring Char. So while for instance in Python, 'a' == "a" holds, this is not the case in Julia:

    julia> typeof("a")
    ASCIIString (constructor with 2 methods)

    julia> typeof('a')
    Char

    julia> "a" == 'a'
    false

Heredocs and multiline literals

Multiline literals allow you to keep longer spans of text within a single string, with line breaks. They are introduced, similarly to Python, by triple double quotation marks """:

    multiline_declaration = """
        We hold these truths to be self-evident,
        that all men are created equal,
        that they are endowed by their Creator with certain unalienable Rights,
        that among these are Life, Liberty and the pursuit of Happiness.

        That to secure these rights, Governments are instituted among Men,
        deriving their just powers from the consent of the governed...
    """

    julia> println(multiline_declaration)
        We hold these truths to be self-evident,
        that all men are created equal,
        that they are endowed by their Creator with certain unalienable Rights,
        that among these are Life, Liberty and the pursuit of Happiness.

        That to secure these rights, Governments are instituted among Men,
        deriving their just powers from the consent of the governed...

As you can see, the use of the """ or 'heredoc' format has preserved the line breaks and structure of the text, a rather helpful feature where longer texts are concerned.

Regex literals

Regular expressions (regexes) are special strings that represent particular patterns. They are useful in matching and searching text, and a good knowledge of regex should be essential knowledge for any good functional programmer.

To construct a regex literal, preface the string with r:

    julia> regex_literal = r"a|e|i|o|u"
    r"a|e|i|o|u"

This is a regex literal that matches (English) vowels. Julia recognises regex literals as the type regex:

    julia> typeof(regex_literal)
    Regex

String operations

Substrings

Because strings are indexable, we can use ranges to select a part of a string, something we generally refer to as a substring or string subsetting:

    julia> declaration = "When in the Course of human events"
    "When in the Course of human events"

    julia> declaration[1:4]
    "When"

You might recall that a range might actually have a step attribute, which we can use to obtain every _n_th letter within a text. Let's see every odd-numbered letter within the first few words of the Declaration of Independence:

    julia> declaration[1:2:end]
    "We nteCus fhmneet"

You might remember that end, which we used above to extend the range across the entire length of the string, behaves like a number. Therefore, you can use it to create a substring that excludes the last, say, five letters:

     julia> declaration[1:end-5]
     "When in the Course of human e"

Concatenation, splitting and interpolation

Concatenating and repeating

In most programming languages, maths and string operations correspond, so you can use + to concatenate and * to repeat a string. This is not the case in Julia. + has no method for ASCIIStrings. What you would expect + to do is accomplished by *:

    julia> "I" * " <3 " * "Julia"
    "I <3 Julia"

So how do you multiply a sequence of text? Easy – use the ^ operator. This is useful if you happen to have been set the old school punishment of 'lines' (writing the same sentence all over again).

    julia> "I will not say bad things about functional languages again. " ^ 10
    "I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. I will not say bad things about functional languages again. "

split()

The split() function separates a piece of text at a particular character, which it also removes. The result is an array of the chunks. By default, split() will separate at spaces, but you can provide any other string – not even necessarily a single character, as the third example shows:

    julia> split(declaration)
    7-element Array{SubString{ASCIIString},1}:
     "When"
     "in"
     "the"
     "Course"
     "of"
     "human"
     "events"

    julia> split(declaration, "e")
    6-element Array{SubString{ASCIIString},1}:
     "Wh"
     "n in th"
     " Cours"
     " of human "
     "v"
     "nts"

    julia> split(declaration, "the")
    2-element Array{SubString{ASCIIString},1}:
     "When in "
     " Course of human events"

If you provide "" as the string to split at, Julia will split the text into individual letters.

You may also use a regex to split your text at:

    julia> regex_literal = r"a|e|i|o|u"
    julia> split(declaration, regex_literal)
    12-element Array{SubString{ASCIIString},1}:
     "Wh"
     "n "
     "n th"
     " C"
     ""
     "rs"
     " "
     "f h"
     "m"
     "n "
     "v"
     "nts"

Needless to say, since strings are immutable, the original string is not affected by the application of split().

Interpolation

String interpolation refers to the incredibly useful capability of including variable values within a string. As you might remember, we have used * above to concatenate strings:

    julia> love = "<3"
    "<3"

    julia> "I " * love * " Julia"
    "I <3 Julia"

While this is technically correct, it is much faster by using string interpolation, in which case we would refer back to the variable love as $(love) within the string. Julia knows this means it is to replace $(love) with the contents of the variable love:

    julia> "I $(love) Julia"
    "I <3 Julia"

You can put anything within the parentheses in string interpolation – anything Julia knows how to handle. For instance, including an expression in a string, you get

    julia> "Three plus four is $(3+4)."
    "Three plus four is 7."

If, and only if, you are referring to a variable, you can omit the parentheses (but not if you are referring to an expression):

    julia> "I $love Julia"
    "I <3 Julia"

Regular expressions and finding text within strings

As it has been mentioned, the main utility of regular expressions (Regexes) is to find things within long pieces of text. In the following, we will introduce the three main regex search functions of Julia - match(), matchall() and eachmatch(), with reference to a bit of the Declaration of Independence:

    declaration = "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.--Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world."

If you are familiar with regular expressions, plod ahead! However, if

    (GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})

looks like gobbledygook to you or you feel your regex fu is a little rusty, put down this book and consult the Regex cheatsheet or, even better, Jeffrey Friedl's amazing book on mastering regexes.

Finding and replacing using the search() function

If you are only concerned with finding a single instance of a search term within a string, the search() function returns the range index of where the search expression appears:

    julia> search(declaration, "Government")
    241:250

search() also accepts regular expressions:

    julia> search(declaration, r"th.{2,3}")
    9:13

To retrieve the result, rather than its index, you can pass the resulting index off to the string as the subsetting range, using the square bracket [] syntax:

    julia> declaration[search(declaration, r"th.{2,3}")]
    "these"

Ah, so that's the word it found!

Where a search string is not found, search() will yield 0:-1. That is an odd result, until you realise the reason: for any string s, s[0:-1] will necessarily yield "" (that is, an empty string).

Finding using the match() family of functions

The problem with search() is that it retrieves one, and only one, result – the first within the string passed to it. The match() family of functions can help us with finding more results:

  • match() retrieves either the first match or nothing within the text,
  • matchall() returns an array of all matching substrings, and
  • eachmatch() returns an iterator over all matches.

The match() family of functions needs a regular expression literal as a search argument. This is so even if the regular expression does not make use of any pattern matching beyond a simple string. Thus,

    julia> match(r"truths", declaration)

is valid, while

    julia> match("truths", declaration)

yields an error:

    ERROR: `match` has no method matching match(::ASCIIString, ::ASCIIString)

Understanding RegexMatch objects

Most regex search functions return an object of type RegexMatch. As the name reveals, a RegexMatch is a composite type representing a match. As such, it encapsulates (to use a little more OOP terminology than one would normally be allowed to in a book on functional programming) four values, the first three of which will be of immediate interest to us:

  • RegexMatch.match is the matched substring,
  • RegexMatch.captures is an array of types that represent the type of what the regex would capture,
  • RegexMatch.offset is generally an Int64 that represents the index of the first character of the matched string where there is a single match (e.g. when using match()).

To illustrate, let's consider the result of a match() call, which will be introduced in the next subsection:

    m = match(r"That .*?,", declaration)

    julia> m.match
    "That to secure these rights,"

    julia> m.captures
    0-element Array{Union(SubString{UTF8String},Nothing),1}

    julia> m.offset
    212

match()

match() retrieves the first match or nothing - in this sense, it is rather similar to search():

    julia> match(r"That .*?,", declaration)
    RegexMatch("That to secure these rights,")

The result is a RegexMatch object. The object can be inspected using .match (e.g. match(r"truths", declaration).match).

matchall()

matchall() returns an array of matching substrings, which is sometimes a little easier to use:

    julia> matchall(r"That .*?,", declaration)
    2-element Array{SubString{UTF8String},1}:
     "That to secure these rights,"
     "That whenever any Form of Government becomes destructive of these ends,"

eachmatch()

eachmatch() returns an object known as an iterator, specifically of the type RegexMatchIterator. We have on and off encountered iterators, but we will not really deal with them in depth until chapter [X], which deals with control flow. Suffice it to say an iterator is an object that contains a list of items that can be iterated through. The iterator will iterate over a list of RegexMatch objects, so if we want the results themselves, we will need to call the .match method on each of them:

    for i in eachmatch(r"That .*?,", declaration)
        println("A matching search result is: $(i.match)")
    end

The result is quite similar to that returned by matchall():

    A matching search result is: That to secure these rights,
    A matching search result is: That whenever any Form of Government becomes destructive of these ends,

ismatch()

ismatch() returns a boolean value depending on whether the search text contains a match for the regex provided.

    julia> ismatch(r"truth(s)?", declaration)
    true

    julia> ismatch(r"sausage(s)?", declaration)
    false

Replacing substrings

Julia can replace substrings using the replace() syntax... let's try putting some sausages into the Declaration of Independence!

    julia> replace(declaration, "truth", "sausage")
    "We hold these sausages to be self-evident, that all men are created equal,..."

An interesting feature of replace() is that the replacement does not need to be a string. In fact, it is possible to pass a function as the third argument (as always, without the parentheses () that signify a function call). Julia will interpret this as 'replace the substring with the result of passing the substring to this function':

    julia> replace(declaration, "truth", uppercase)
    "We hold these TRUTHs to be self-evident, that all men are created equal,..."

Much more dignified than self-evident sausages, I'd say! At risk of repeating myself, it is important to note that since strings are immutable, replace() merely returns a copy of the string with the search string replaced by the replacement string or the result of the replacement function, and the original string itself will remain unaffected.

Where the substring is not found, the result will be, unsurprisingly, an unaltered string.

Regex flags

A little-known feature of Julia regexes is the ability for a regex to be appended one or more flags. These, like most of Julia's regex capability, derive from Perl's regex module perlre.

Flag Function
i Case-insensitive pattern matching
m Treats string as a multiline string, so that ^ and $ will refer to the start or end of any line within the string.
s Treats line as a single line. This will result in . accepting a newline as well. When used together with m, it will result in . matching every possible character while still allowing ^ and $ to match, just after and just before newlines within the string.
x Ignore non-backslashed, non-classed whitespace.

Flags are appended to the end of each regex, which might strike users more familiar with e.g. the Pythonic way of modifying the regex search object itself, as somewhat unusual:

    multiline = r"^We"m

In this case, the regex r"^We" was augmented by the multiline flag, appended at its end.

String transformation and testing

Case transformations

Case transformations are functions that act on Strings and transform character case. Let's examine the effect of these transformations in turn.

Function Effect Result
uppercase() Converts the entire string to upper-case characters WE HOLD THESE TRUTHS TO BE SELF-EVIDENT
lowercase() Converts the entire string to lower-case characters we hold these truths to be self-evident
ucfirst() Converts the first character of the string to upper-case We hold these truths to be self-evident
lcfirst() Converts the first character of the string ot lower-case we hold these truths to be self-evident

Testing and attributes

results matching ""

    No results matching ""