JLC's experiments on concordance making using awk, sed and the shell

  1. General principle
  2. File preparation (Beginning of line tagging)
  3. Labeling of file (or chunk)
  4. Concatenation of constituent files
  5. Multi-rotation
  6. Indexing
  7. Rotating back
  8. Extracting the head word
  9. Further processing the concordance
  10. Examples of concordances made so far (CLICK)

General principle

The general guiding principle in these experiments is that the concordance is obtained as the final result of many small steps, each step being effected either by applying a script to one DATA file or by concatenating two or more data files.

For instance, starting from the file "EnSortant", which contains the text of a poem (En Sortant de l'École), the application of a sequence of 7 scripts (combined in a pipeline) produces a concordance of the words of that poem. The concordance is sent to the STANDARD OUTPUT and can be captured into a file such as the one reachable by CLICKING HERE.

The script sequence is:

D1s_ EnSortant | D2_Label ES | D3 | D4 | D5 | D6 | D7n

Similarly, starting from the texts of the three antāti of the 4000 tivviya pirapantam, contained in three files, "A1" (First Antāti), "A2" (Second Antāti) and "A3" (Third Antāti), one can obtain a joint concordance of the three texts by applying the series of commands:

D1svarbrackets A1 | D2_Label ANTA-1 > TAG-A1
D1svarbrackets A2 | D2_Label ANTA-2 > TAG-A2
D1svarbrackets A3 | D2_Label ANTA-3 > TAG-A3
cat TAG-A1 TAG-A2 TAG-A3 > TAG-A1-A2-A3
cat TAG-A1-A2-A3 | D3 | D4 | D5 | D6 | D7n > A1-A2-A3-cnc

The resultant concordance, if further processed in order to take care of the punctuation, will look like the file obtained by CLICKING HERE.
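To make the flow of data through such a pipeline concrete, here is a minimal sketch that runs the whole chain on an invented two-line text. The shell functions d1 to d7a are hypothetical stand-ins for D1s_, D2_Label, D3, D4, D5, D6 and D7a (their bodies follow the scripts listed in the sections below), and LC_ALL=C is an assumption added only so that the sort order is the same on every machine.

```shell
# Hypothetical stand-ins for D1s_, D2_Label, D3, D4, D5, D6 and D7a,
# written as shell functions so that the sketch is self-contained.
tab=$(printf '\t')   # a literal TAB for the sed replacement in d6

d1()  { awk '{ print "@@@_" NR "\t" $0 }'; }        # number the lines
d2()  { sed 's/@@@/'"$1"'/'; }                       # replace the dummy tag
d3()  { awk 'BEGIN { FS = "\t" }
        { print $2 "\t" $1
          for (i = length($2); i > 0; i--)
              if (substr($2, i, 1) == " ")
                  print substr($2, i+1) "\t" $1 " " substr($2, 1, i-1)
        }'; }                                        # one rotation per word
d4()  { LC_ALL=C sort -f; }                          # sort (locale pinned for the demo)
d5()  { awk 'BEGIN { FS = "\t" } { print $2 " *** " $1 }'; }    # rotate back
d6()  { sed 's/\(.*\*\*\* \([^ ][^ ]*\).*\)/\2'"$tab"'\1/'; }   # copy headword to the front
d7a() { awk '{ print $1 "\t" $2 }'; }                # headword and location only

# Invented two-line text, labelled "ES".
index=$(printf 'the cat\nthe dog\n' | d1 | d2 ES | d3 | d4 | d5 | d6 | d7a)
printf '%s\n' "$index"
```

With this sample, each of the two occurrences of "the" gets its own entry, so the result has four lines: cat (ES_1), dog (ES_2), the (ES_1) and the (ES_2).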

File preparation (Beginning of line tagging)

This part of the task is handled by several distinct scripts, depending on the format in which the data is provided. In the simplest format, a collection of stanzas separated by lines containing stanza numbers (or sutra numbers), the script D1svar can be used. If the numbers appear between "(" and ")", the script D1svarpar can be used. If the numbering scheme is more complex and the references are enclosed between "{" and "}", the script D1svarbrackets can be used.

The case of a poem not divided into stanzas (or other types of line groups) is taken care of by the script D1s_, and additional scripts will probably have to be written.

At an earlier stage, the scripts D1s4 and D1s8 were also used, but they are being phased out because they presuppose that the DATA file has been preprocessed in order to remove the stanza numbers.

Script D1svar reads:

cat $1 | awk '
$1 ~ /[0-9]+/ {LINE_GROUP_NBR = $1; LINE = 0}
$0 !~ /^$/ && $1 !~ /[0-9]+/ {print "@@@_" LINE_GROUP_NBR "-" ++LINE "\t" $0} '
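As a quick check of what D1svar produces, the following sketch feeds the same awk program a small invented sample. The sample text and the variable names "sample" and "tagged" are made up for illustration.

```shell
# Hypothetical sample: stanza numbers on their own lines, a blank line between stanzas.
sample='1
first line
second line

2
third line'

# Same awk program as D1svar, reading the sample from standard input.
tagged=$(printf '%s\n' "$sample" | awk '
$1 ~ /[0-9]+/ {LINE_GROUP_NBR = $1; LINE = 0}
$0 !~ /^$/ && $1 !~ /[0-9]+/ {print "@@@_" LINE_GROUP_NBR "-" ++LINE "\t" $0} ')

printf '%s\n' "$tagged"
```

The three text lines come out tagged @@@_1-1, @@@_1-2 and @@@_2-1, each tag being followed by a TAB and the original line. Note that the test on the first field is not anchored, so a text line whose first word contains a digit would also be taken for a stanza number.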

Script D1svarpar reads:

cat $1 | awk '
# \( and \) match literal parentheses; unescaped, they would only group.
$1 ~ /\([0-9]+\)/ {LINE_GROUP_NBR = substr($1, 2, length($1)-2); LINE = 0}
$0 !~ /^$/ && $1 !~ /\([0-9]+\)/ {print "@@@_" LINE_GROUP_NBR "-" ++LINE "\t" $0} '

Script D1svarbrackets reads:

cat $1 | awk '
$1 ~ /{.*}/ {LINE_GROUP_NBR = substr($1, 2, length($1)-2); LINE = 0}
$0 !~ /^$/ && $1 !~ /{.*}/ {print "@@@_" LINE_GROUP_NBR "-" ++LINE "\t" $0} '

Script D1s_ reads:

cat $1 | awk '
# Preparation step for the concordance: tag each line with its reference
# This version numbers the lines of poems not divided in stanzas.
# And it inserts a dummy prefix (PREF = @@@_), for the poem name,
# to be replaced later by a suitable prefix.

{	LL = NR
	PREF = "@@@_"
	print PREF LL "\t" $0
} '

Script D1s4 reads:

cat $1 | awk '
# Preparation step for the concordance: tag each line with its reference
# This version numbers 4-line stanzas (without separation lines between them)
# And it inserts a dummy prefix (PREF = @@@_), for the poem name,
# to be replaced later by a suitable prefix.
{	WW = (NR-1) % 4
	SS = int((NR+3)/4)
	PREF = "@@@_"
	XX=sprintf("%c", WW+97)
	print PREF SS XX "\t" $0
} '

Script D1s8 reads:

cat $1 | awk '
# Preparation step for the concordance: tag each line with its reference
# This version numbers 8-line stanzas (without separation lines between them)
# And it inserts a dummy prefix (PREF = @@@_), for the poem name,
# to be replaced later by a suitable prefix.
{	WW = (NR-1) % 8
	SS = int((NR+7)/8)
	PREF = "@@@_"
	XX=sprintf("%c", WW+97)
	print PREF SS XX "\t" $0
} '
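The arithmetic used by D1s4 and D1s8 can be checked on a few dummy lines. The sketch below applies the D1s4 formulas to six invented lines (l1 to l6) and prints only the resulting references. The variable name "refs" is an assumption of the sketch.

```shell
# Six invented lines, numbered as 4-line stanzas with the same arithmetic as D1s4.
refs=$(printf 'l1\nl2\nl3\nl4\nl5\nl6\n' | awk '
{   WW = (NR-1) % 4             # position within the stanza: 0..3
    SS = int((NR+3)/4)          # stanza number, starting at 1
    XX = sprintf("%c", WW+97)   # 0..3 becomes a..d (97 is the code of "a")
    print "@@@_" SS XX
} ')

printf '%s\n' "$refs"
```

The first four lines receive the references 1a to 1d and the next two receive 2a and 2b. Replacing 4 by 8 (and 3 by 7) gives the D1s8 behaviour, with letters running from a to h.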

Labeling of file (or chunk)

Labeling of the lines (in order to be able to identify the source) is done by the script "D2_Label", which replaces the default tag ("@@@") by the tag given as its argument. The script reads:

XXX=$1
shift
sed '
s/@@@/'"$XXX"'/'
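A minimal usage sketch, assuming a line already tagged by one of the D1 scripts. The function d2_label is a hypothetical stand-in for the script, so that the example runs on its own.

```shell
# Hypothetical stand-in for D2_Label, written as a function.
d2_label() {
    label=$1
    sed 's/@@@/'"$label"'/'
}

# One invented tagged line, relabelled with the tag "ES".
labelled=$(printf '@@@_1-1\tfirst line\n' | d2_label ES)
printf '%s\n' "$labelled"
```

The line now starts with ES_1-1 instead of @@@_1-1. Because the label is pasted directly into the sed command, a label containing "/" or "&" would be interpreted by sed rather than copied literally, so plain alphanumeric labels (with "-" or "_") are the safe choice.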

Concatenation of constituent files

For an example of this, see the section "General principle" (at the beginning) and the making of the joint concordance for the three Antāti-s.

Multi-rotation

The multi-rotation script (D3) creates as many copies of the line as there are words to be indexed. It reads:

cat $1 | awk '
# First step in making concordance: ONE rotation PER word
# Version 1a

BEGIN { FS = "\t" }
{
	print $2 "\t" $1
	for (i = length($2); i > 0; i--)
		if (substr($2,i,1) == " ")
			print substr($2,i+1) "\t" $1 " " substr($2,1,i-1)
} '
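To see the rotation at work, the following sketch runs the same awk program on a single invented line (reference ES_1, text "one two three"). The variable name "rotations" is an assumption of the sketch.

```shell
# One invented labelled line: reference, TAB, text.
rotations=$(printf 'ES_1\tone two three\n' | awk '
BEGIN { FS = "\t" }
{
    # The unrotated line first: text, TAB, reference.
    print $2 "\t" $1
    # Then one rotation per space, scanning the text from right to left.
    for (i = length($2); i > 0; i--)
        if (substr($2, i, 1) == " ")
            print substr($2, i+1) "\t" $1 " " substr($2, 1, i-1)
} ')

printf '%s\n' "$rotations"
```

Three lines come out, one per word: "one two three" followed by the reference, then "three" followed by the reference and the left context "one two", then "two three" followed by the reference and "one". The loop treats every single space as a word boundary, so a text with two consecutive spaces would yield a rotation beginning with a space.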

Indexing

Indexing is done in a very simple way: it relies entirely on the capabilities of the sort program. As a consequence, texts in transliteration are indexed in the Latin alphabetical order whereas texts in Tamil are indexed in the Tamil order. The script (D4) reads:

# Second step in making the concordance: sort the multiple rotated file made in the first step
sort -f
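The order produced by sort depends both on the -f option and on the locale in force (LC_ALL, LC_COLLATE). The sketch below, on an invented word list, shows the effect of -f when the locale is pinned to C. Pinning the locale is an assumption made here for reproducibility and is not part of D4. How a given locale orders Tamil script is not tested by this sketch.

```shell
# Invented word list. LC_ALL=C is added so that the result does not
# depend on the locale of the machine running the sketch.
words='cherry
Banana
apple'

folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f)   # case folded, as in D4
plain=$(printf '%s\n' "$words" | LC_ALL=C sort)       # plain byte order

printf 'with -f:\n%s\n\nwithout -f:\n%s\n' "$folded" "$plain"
```

With -f the capitalised word is sorted among the others (apple, Banana, cherry), whereas without it the C locale puts all capitals first (Banana, apple, cherry).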

Rotating back

Rotating back the elements of the line into their normal order is accomplished by script "D5", which reads:

awk '
BEGIN { FS = "\t" }
  { print $2 " *** " $1 } '

Extracting the head word

The extraction of the headword (which is the word following "***") is done by means of script "D6", which puts a copy of the headword at the beginning of the line:

sed 's/\(.*\*\*\* \([^ ][^ ]*\).*\)/\2	\1/'
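The two steps (rotating back, then extracting the headword) can be followed on two invented lines, taken as they would leave D3 and D4 for the text "one two, three" with reference ES_1. The variable names are assumptions of the sketch, and the TAB of the sed replacement is supplied through a shell variable so that it stays visible.

```shell
tab=$(printf '\t')   # a literal TAB, for the replacement part of the sed command

# Two invented lines as they would leave D3 and D4.
rotated=$(printf 'three\tES_1 one two,\ntwo, three\tES_1 one\n')

# D5: put the reference and the left context back in front of the "***" marker.
restored=$(printf '%s\n' "$rotated" | awk 'BEGIN { FS = "\t" } { print $2 " *** " $1 }')

# D6: copy the word that follows "*** " to the beginning of the line.
headed=$(printf '%s\n' "$restored" | sed 's/\(.*\*\*\* \([^ ][^ ]*\).*\)/\2'"$tab"'\1/')

printf '%s\n' "$headed"
```

The first line gets the headword "three", the second gets "two," with its comma: the headword is simply everything up to the next space, which is why punctuation has to be dealt with in a later step.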

Further processing the concordance

There is of course no limit to the further processing one can apply, and all sorts of improvements are possible. At the moment, there is one simple script (D7a), which keeps only the headword and the place of occurrence, and a more complex script (D7n), which counts the number of occurrences of the headword and gives all the lines (with location) containing it. The script D7a reads:

awk ' { print $1 "\t" $2 } '

The script D7n reads:

awk '
# Attempt based on the (grey) AWK book (1988), p.92

BEGIN { FS = "\t" 
	VIRGIN = 1
	}
	{
	  if ($1 != prev) {
		if ( VIRGIN != 1) print OCCUR " Occ."
		print "\n" "**********" "\n" $1 "\n" "----------"
		prev = $1
		VIRGIN = 0
		OCCUR = 0
	  }
	  print "\t" $2
	  OCCUR++
	}
END { print OCCUR " Occ." } '
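The grouping done by D7n can be observed on three invented lines, taken as they would leave D6 (headword, TAB, line with location). The awk program is the one above, lightly condensed. The variable names "entries" and "report" are assumptions of the sketch.

```shell
# Three invented lines as they would leave D6: headword, TAB, line with location.
entries=$(printf 'one\tES_1 *** one two\ntwo\tES_1 one *** two\ntwo\tES_2 *** two\n')

report=$(printf '%s\n' "$entries" | awk '
BEGIN { FS = "\t"; VIRGIN = 1 }
{
    if ($1 != prev) {
        # A new headword: close the previous group, then print a header.
        if (VIRGIN != 1) print OCCUR " Occ."
        print "\n" "**********" "\n" $1 "\n" "----------"
        prev = $1; VIRGIN = 0; OCCUR = 0
    }
    print "\t" $2
    OCCUR++
}
END { print OCCUR " Occ." } ')

printf '%s\n' "$report"
```

The report contains one block per headword: the block for "one" ends with "1 Occ." and the block for "two" lists both lines and ends with "2 Occ.". The input must already be sorted by headword, since a group is closed as soon as the headword changes. On empty input the END rule still prints " Occ." with no count.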

(*) Examples of concordances made so far

The following examples are traces of the first attempts. They may contain imperfections (to be examined at leisure).

((INPUT)) TV01T (alias CiPu): Civapurāṇam
((COMMAND)) D1s_ TV01T | D2_Label CiPu | D3 | D4 | D5 | D6 | D7n > TV01Tcnc
((OUTPUT)) TV01Tcnc: Civapurāṇam concordance
((INPUT)) TV07T (alias TiVePa): Tiruvempāvai
((COMMAND)) D1svarpar TV07T | D2_Label TVP | D3 | D4 | D5 | D6 | D7n > TV07Tcnc
((OUTPUT)) TV07Tcnc: Concordance