Don't Quote Me:
The Single and Double Quotes Problem

8 June 2001

APL Font Required
If the next two rows
have different symbols
Œ„¼…½’‘œ¯–•—
APL Font Exemplar
then
and install this font.

As Gary Bergquist mentioned in the Limbering Up! column of the 2001 Q2 issue of his quarterly newsletter, now that some APL implementations allow both single and double quotes to mark a string, the natural question to ask by anyone who wants to parse such a line is which quote marks delimit the strings and which ones are inside a string.

While I am unable to find a simple solution, there is one which involves a very powerful (albeit less than practical) tool called Transitive Closure.

Cut To The Chase

Give an arbitrary sequence of single and double quotes in (say) V, define the following variables

I„(¼½V)°.=¼½V Ye old identity matrix
J„0,0 ¯1‡I Shift I one to the right (J is now a shift matrix)
Q„<\((¼½V)°.<¼½V)^V°.=V For each quote mark, what's the next matching one?
R„QŽI-QŸ.^J The 1st row marks the trailing quote in each pair

then

Z„IŸRŸRŸ.^J

is the answer (or more accurately, the first row of that matrix is the answer).

It's easy to find simpler (and more efficient) ways of expressing this last line. However after we've used a matrix-divide on the previous line, the efficiency of the last line doesn't seem all that important.

Nonetheless, for the efficiency experts in the crowd, as you may already know, left-multiplying by J (that is, JŸ.^Q is equivalent to the expression (1 0‡Q)®0 which shifts Q up one row (zero filling). Similarly, right-multiplying by J (QŸ.^J) is equivalent to 0,0 ¯1‡Q, which shifts Q right one column (zero filling). Though the shifts are more efficient than matrix multiplication, it can be easier to understand the explanation below by leaving in J. Finally, as we're interested in only the first row of the resulting matrix, there's no need to perform those operations on the whole matrix R. Eventually, we're left with the last two lines of

R„(½V)½QŽI-0,0 ¯1‡Q
Z„RŸ1,¯1‡R
.

Doing It By Hand

As the matrix Q demonstrates, assuming that a given (single or double) quote mark is a string marker, it's easy to determine the matching trailing string marker. If we were solving this problem by hand, we would note that the first quote mark must be a string marker, and that the first row of Q gives us the identity of the matching trailing string marker. Having found one pair of string markers, we would move our finger over one spot to the next adjacent string marker (here's where we need that shift matrix), and start over again, chaining through the sequence of single and double quotes. Note that at the start of each iteration, we would use the row in Q corresponding to the index of the last trailing string marker plus one (the shift).

Finally, we need a mechanism to chain those assumptions together through the whole string, which is where Transitive Closure comes in.

Transitive Closure

The term Transitive Closure comes from Graph Theory where it is used in the context of directed graphs. Assume we have an N by N boolean incidence matrix called M on the N nodes of a graph. These fancy terms mean simply that M[i;j] is 1 iff there is a directed edge from node i to node j. Transitive closure translates the concept "has a directed edge from i to j" to the concept "has a directed path of zero or more edges from i to j".

That is, if T is the transitive closure of M, then if M[i;j] and M[j;k], then T[i;k], and if M[i;j] and M[j;k] and M[k;l], then T[i;l], etc. In other words, transitive closure takes the concept of transitivity between edges (i is connected to j which is connected to k, therefore i is connected to k) to the limit (closure). For certain incidence matrices (in particular, they must be upper triangular), there is a closed form representation of transitive closure of the incidence matrix using power series.

Applying this to the problem at hand, Q gives us each matching quote mark, which we use to chain through the sequence of matches. By definition, the first quote mark delimits a string, and the 1st row of Q gives us the matching quote mark (say in the 5th position), the next (6th) quote mark begins the next pair, and the 6th row of Q gives us its matching quote mark, etc.

An Infinite Set of Matrices

Concentrating on the sequence of trailing string markers, everything we need is in Q. Looking at each row of Q, it answers the question "If the quote mark corresponding to this row number starts a string, where is the matching trailing string marker?".

Ordinarily, when chaining from one element to another with transitive closure the expression QŸ.^Q identifies the next marker, however in this case we must shift to the next string after finding a match so we need to shift the right hand matrix as in QŸ.^JŸ.^Q. This expression answers the question "If the quote mark corresponding to this row number starts a string, where is the matching trailing string marker of the string next to this one?".

Putting this all together, the trailing string markers are identified as follows:

Q The 1st row has the 1st trailing marker
QŸ.^JŸ.^Q The 1st row has the 2nd trailing marker
QŸ.^JŸ.^QŸ.^JŸ.^Q The 1st row has the 3rd trailing marker
QŸ.^JŸ.^QŸ.^JŸ.^QŸ.^JŸ.^Q The 1st row has the 4th trailing marker
etc.

To get the whole result, we Ÿ together these separate matrices as in

QŸ(QŸ.^JŸ.^Q)Ÿ(QŸ.^JŸ.^QŸ.^JŸ.^Q)Ÿ(QŸ.^JŸ.^QŸ.^JŸ.^QŸ.^JŸ.^Q)Ÿ...

Note that throughout this discussion, we're taking advantage of the fact that Ÿ.^ is associative.

The above expression can be simplified by factoring out Q, which we can do because Ÿ.^ is also distributive. Note we can arbitrarily factor out Q on either the left or right; for reasons which will become clear below, I chose to factor it out on the right.

This leaves us with:

I
QŸ.^J
QŸ.^JŸ.^QŸ.^J
QŸ.^JŸ.^QŸ.^JŸ.^QŸ.^J
etc.

Joining these together gives

IŸ(QŸ.^J)Ÿ(QŸ.^JŸ.^QŸ.^J)Ÿ(QŸ.^JŸ.^QŸ.^JŸ.^QŸ.^J)Ÿ...

Closed Forms

The sequence above is a power series, which is amenable to being rewritten in a closed form. Recall that a power series of the form

1 + X + (X×X) + (X×X×X) + (X×X×X×X) +...

can be expressed in closed form as

÷(1 - X)

for a suitable range of X. In the context of a power series, this closed form can be used more broadly on other objects including matrices with a suitable substitution of the functions and variables. In our case, we'll use the following substitutions:

1 is I,
+ is Ÿ,
X is QŸ.^J,
× is Ÿ.^, and
÷ is Ž.

Thus, the closed form of the power series can be expressed as

ŽI-QŸ.^J

We factored out Q earlier, so we need to multiply it back in again, which yields

R„(ŽI-QŸ.^J)Ÿ.^Q

or more simply (and here's why I factored out Q on the right)

R„QŽI-QŸ.^J

Had we factored out Q on the other side, we'd be left instead with the equivalent form

R„QŸ.^ŽI-JŸ.^Q

As a side note, because we're interested in only the first row of R, this equivalent form can be written more simply as

R„Q[1;]Ÿ.^ŽI-JŸ.^Q

In either case, the 1st row of R identifies all of the trailing string markers. As a side effect, if the last element in the 1st row of R is 1, then the string markers are all properly matched.

Finally, because another valid string marker (beginning the next string) follows every trailing string marker, a shift to the right of R (that is, RŸ.^J) gives the starting string markers (except for the 1st one which is where I comes in). Putting this all together, the final answer is the 1st row of

Z„IŸRŸRŸ.^J

Bergquist actually requested something more, that is a function called NONQUOTE which extracts from a line of text just the characters which are outside strings.

Such a function can be written as follows:

    ’ R„NONQUOTE CV;N;V;Q
[1]   © Returns all of CV except for quotes
[2]   ©  or their contents.
[3]   N„CV¹'''"' © Mask for quotes
[4]   V„N/CV	 © Select the quotes so we process less
[5]   Q„<\((¼½V)°.<¼½V)^V°.=V © Matching trailing string marker
[6]   R„(½V)½QŽ((¼½V)°.=¼½V)-0, 0 ¯1‡Q © All trailing markers
[7]   R„RŸ1,¯1‡R © Both leading and trailing string markers
[8]   R„(N‹¬\N\R)/CV © Just the non-quoted material
    ’

Fonts

I had some trouble getting a single APL font for this page to display correctly in both Netscape Navigator and Internet Explorer. Eventually, I gave up trying to get one font to fit both browsers and split the cases via a piece of JavaScript to distinguish the two major browsers.

The two fonts I used are Dyalog Std TT (which works better with Mozilla 1.0+/Netscape Navigator 4.76+) and APL2000 (which works better with Internet Explorer 5.5+). The APLHELP font from APL2000, Inc. also works well with Netscape Navigator, but, to my knowledge, that copyrighted font isn't downloadable from the Internet.

BTW, it would not surprise me if these same fonts display differently (including not at all) with different versions of the two browsers I tried, not to mention the browsers I didn't try (Opera, Safari, etc.). Good luck.

Author

Bob Smith

http://www.Qualitas.com
http://www.SudleyPlace.com

This article can be found on the web at http://www.sudleyplace.com/APL/index.htm.

Acknowledgments

This code was developed and tested on APL+Win 3.6, courtesy of APL2000, Inc.

Gary Bergquist's quarterly Zark APL Tutor Newsletter comes as part of his Zark APL Tutor service; it can be ordered from him for $29/year at

Zark Inc.
23 Ketchbrook Lane
Ellington, CT 06029
860-872-7806


© Sudley Place Software 1997-2007. Get Firefox!   Get Thunderbird!   Get OpenOffice!
Comments or suggestions? Send them to .