Ongoing report on the conversion of the TIPA files received from Mr Arockiasamy

Background and previous steps

In recent months, Eva has negotiated with Mr Ilavazhagan, the man behind தமிழ் மண் பதிப்பகம், for NETamil to acquire the editable files of the 17 volumes (plus index) of the TIPA (தமிழ் இலக்கணப் பேரகராதி).

As part of the negotiation, Professor VV, Babu, Thilak and JLC went to Chennai on 2nd of September 2016 and met Mrs Chitra (from தமிழ் மண் பதிப்பகம்) and Mr Arockiasamy Mohanraj, a software consultant who had come from Bangalore and who was in charge of converting the files from their original format to a Unicode-compatible format.

After the negotiation between Eva and Mr Ilavazhagan had been concluded, Mr Arockiasamy Mohanraj started to work on the conversion, and on 7th of October he sent us a collection of files.

However, those files were not useable, because they were affected by two bugs: the ழி BUG and the ழூ BUG.

The first of these two bugs was the TOTAL ABSENCE in the files received of the ழி UYIRMEY, which had been eliminated by the conversion method, as can be seen in the SAMPLE ACCESSIBLE THROUGH CLICKING HERE!

The second of these bugs consisted in the fact that the ழூ UYIRMEY did not always stand for itself but was sometimes an INVOLUNTARY substitute for the ">" sign.

After a series of exchanges, Mr Arockiasamy managed to find the cause of the bugs, and sent us a new series of files for the TIPA, in two installments.

On 5th November 2016, we received 8 TIPA files and on 27th November 2016 (Sunday), we received the remaining 9 TIPA files.

A preliminary (quick) conversion was made by JLC, and the result placed at THIS URL on the NETamil Pondy server.

The current starting point

The task at hand is now to make a more precise conversion, and to document the method used.

The starting point is a collection of HTML files, such as the following:

(0a)

Opened with a text editor (Notepad++), the inside of the first HTML file looks like this:

(0b)

Importing the files into Oxygen

The first step in making those files more directly usable consists in using the import facilities of Oxygen (an XML editor), as illustrated by the following series of screenshots, where the last 2 steps illustrate an adjustment made in the <meta> tag.

(1a)

(1b)

(1c)

(1d)

(1e)

(1f)

The file is then saved under a convenient name, as in the following example:

(1g)

Getting rid of the HTML namespace and of the CSS dependencies

At this stage, the file which has been saved under a new name and which is open inside the Oxygen XML editor is a file following the "XHTML 1.0 Strict" standard. This implies, among other things, that handling this file constantly means having to be HTML "namespace" aware, which fact greatly complicates the writing of XSLT scripts.
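
The cost of namespace awareness can be illustrated outside XSLT as well. The following minimal sketch uses Python's standard xml.etree.ElementTree module, not the XSLT processor used in this report; the two miniature documents and the helper count_paragraphs are made up for illustration. It shows that a plain path such as body/p finds nothing as long as the XHTML namespace is declared, and works as soon as the namespace is gone.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Two made-up miniature documents: one keeps the XHTML namespace, one does not.
WITH_NS = f'<html xmlns="{XHTML_NS}"><body><p>a</p><p>b</p></body></html>'
WITHOUT_NS = "<html><body><p>a</p><p>b</p></body></html>"

# Fully qualified path, needed while the namespace declaration is present.
QUALIFIED = f"{{{XHTML_NS}}}body/{{{XHTML_NS}}}p"


def count_paragraphs(xml_text: str, path: str) -> int:
    """Count the elements reached by `path`, evaluated from the root element."""
    return len(ET.fromstring(xml_text).findall(path))


if __name__ == "__main__":
    print(count_paragraphs(WITH_NS, "body/p"))     # plain path, namespaced file
    print(count_paragraphs(WITH_NS, QUALIFIED))    # qualified path, same file
    print(count_paragraphs(WITHOUT_NS, "body/p"))  # plain path, simplified file
```

In XSLT the equivalent burden is that every match pattern has to carry a prefix bound to the XHTML namespace, which is what the simplification described here avoids.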

Therefore, my first step in handling the file consists in getting rid of the namespace dependency, by replacing the current headers (visibly selected in the image below) with new headers.

(2a)

However, the removal of the (implicit) dependency on the HTML DTD forces us to MAKE EXPLICIT the interpretation of all the (HTML) ENTITIES present in our current file. This is why the replacement header is as follows:

(2b)
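
As a hypothetical illustration of the principle (this is not the actual header shown in the screenshot), the sketch below declares two entities in an internal DTD subset and checks, with Python's standard ElementTree parser, that the references are resolved without any external HTML DTD. The sample document and the helper first_paragraph_text are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file: the DOCTYPE carries its own entity declarations
# (an "internal subset"), so no external HTML DTD is needed to resolve them.
SAMPLE = """<!DOCTYPE html [
  <!ENTITY ldquo "&#8220;">
  <!ENTITY rdquo "&#8221;">
]>
<html><body><p>&ldquo;quoted&rdquo;</p></body></html>"""


def first_paragraph_text(xml_text: str) -> str:
    """Parse the document and return the text of the first <p> inside <body>."""
    return ET.fromstring(xml_text).find("body/p").text


if __name__ == "__main__":
    print(first_paragraph_text(SAMPLE))
```

An entity reference that is not declared anywhere makes the parser reject the file, which is why every entity occurring in the 17 books has to be listed.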

After the change of header, another component which I also decide to eliminate, as a simplification, is the link to the explicit CSS, as demonstrated by the following two SCREEN SHOTS:

(2c)

(2d)

Applying XSLT scripts on the simplified file

At this stage, we are in a position to apply a series of small XSLT scripts, in order to extract information from the simplified file, or to progressively transform it. Several of those scripts will include, as an "imported" component, a simple script called "copy.xslt", referred to by some writers as the IDENTITY TRANSFORMATION, which is reproduced below.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="node() | @*">
 <xsl:copy>
  <xsl:apply-templates select="@* | node()"/>
 </xsl:copy>
</xsl:template>

</xsl:stylesheet>

<!--    ((Script Name: copy.xslt))
        ((copied from /XSLT Cookbook (2nd ed.)/, Sal Mangano, 2006, pp.274-275))
        -->

A first example is the following script, called List_elements_in_body.xslt, which can be applied (using Oxygen) to the simplified file _A01_eluthu1_2b.html, whose preparation has been described in the preceding section:

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
<xsl:import href="copy.xslt"/>
    
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    

    <xsl:template match="body">
    <body>
        <table>
            <tr><th>RANK</th><th>CLASS</th></tr>
            <xsl:apply-templates></xsl:apply-templates>
        </table>
    </body>
    </xsl:template>

    <xsl:template match="body//*">
        <tr>
             <td><xsl:value-of select="position()"/></td>
            <td><xsl:value-of select="local-name()"/></td>
        </tr>
        <xsl:apply-templates></xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="body//text()">
        <tr>
            <td><xsl:value-of select="position()"/></td>
            <td>TEXT</td>
        </tr>
    </xsl:template>
    
 <!-- (Script Name: List_elements_in_body.xslt)
     -->
    
</xsl:stylesheet>

When the xslt script is run on the html file, it generates a table containing 36120 rows, starting with:

RANK	CLASS
1	div
1	TEXT
2	div
1	TEXT
2	p
1	TEXT
3	TEXT
3	TEXT
4	div
1	TEXT
2	p
1	TEXT
3	TEXT
4	p
1	span
1	TEXT
2	span
1	TEXT
3	span
1	TEXT
4	span
[.....]	[.....]

If we extract the content of the right column into a file (called A01_elements.txt), we can run on it (under Linux) the following command:

sort A01_elements.txt | uniq -c > A01_elements_uniq.txt

The content of the resulting file (A01_elements_uniq.txt) will be:

     41 br
    302 div
   3798 p
  13450 span
  18529 TEXT
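
The two manual steps above (copying the right-hand column, then running sort and uniq -c) can also be combined. The sketch below is a possible Python equivalent, assuming the table has been saved as tab-separated text with its RANK/CLASS header line; the function name tally_classes is illustrative, and the sample consists of the first seven rows of the table shown earlier.

```python
from collections import Counter


def tally_classes(table_text: str) -> Counter:
    """Count the values of the second (CLASS) column of a tab-separated table."""
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split("\t")
        # Skip blank lines, the header row and anything that is not RANK<TAB>CLASS.
        if len(fields) != 2 or fields[0] == "RANK":
            continue
        counts[fields[1].strip()] += 1
    return counts


# First seven rows of the table above, as tab-separated text.
SAMPLE = "RANK\tCLASS\n1\tdiv\n1\tTEXT\n2\tdiv\n1\tTEXT\n2\tp\n1\tTEXT\n3\tTEXT\n"

if __name__ == "__main__":
    for name, n in sorted(tally_classes(SAMPLE).items()):
        print(f"{n:7d} {name}")
```

As a consistency check on the figures above, the five counts (41, 302, 3798, 13450 and 18529) add up to 36120, the number of rows in the table.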

Typically, if we extrapolate what is seen in the beginning of the file to the totality, the <body> element seems to contain <div> elements, which contain <p> elements.

Among the <p> elements, some directly contain TEXT, whereas others contain <span> elements which themselves contain TEXT.

Additionally, all the <p> elements seem to contain a class attribute and an xml:lang attribute, whereas in the case of <span> elements we seem to have, with a few exceptions (to be examined), only an xml:lang attribute.

As a preliminary exploration, I extract the list of all the attested values for the class attribute of the <p> elements, using the following script:

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
<xsl:import href="copy.xslt"/>
    
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    
    <xsl:template match="body">
        <body>
        <table>
            <tr><th>RANK</th><th>CLASS</th></tr>
            <xsl:apply-templates></xsl:apply-templates>
        </table>
        </body>
    </xsl:template>

    <xsl:template match="div">
            <xsl:apply-templates></xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="p">
        <tr>
            <td><xsl:value-of select="position()"/>
            </td><td><xsl:value-of select="@class"/></td>
        </tr>
    </xsl:template>

<!--  ((SCRIPT: list-class-attribute-values-for-p-elements.xslt))
    -->
    
</xsl:stylesheet>

When this second XSLT script is run on the HTML file, it generates a table containing 3798 rows, starting with:

RANK	CLASS
2	No-Paragraph-Style para-style-override-1
2	Head para-style-override-2
4	Body-text para-style-override-2
6	Body-text para-style-override-2
8	subhead
10	Body-text para-style-override-2
12	Body-text para-style-override-2
14	subhead
16	Body-text para-style-override-2
18	Body-text para-style-override-2
20	Body-text para-style-override-2
22	Body-text para-style-override-2
24	Body-text para-style-override-2
26	Body-text para-style-override-2
28	Body-text para-style-override-2
30	Body-text para-style-override-2
32	Body-text para-style-override-2
34	Body-text para-style-override-2
36	Body-text para-style-override-2
38	Body-text para-style-override-2
40	Body-text para-style-override-2
42	Body-text para-style-override-2
44	Body-text para-style-override-2
46	Body-text para-style-override-2
48	subhead
50	Body-text para-style-override-2
52	subhead
54	Body-text para-style-override-2
[.....]	[.....]

As previously, I copy the right column of this chart to a file (called A01_p_styles.txt) and run a small script:

sort A01_p_styles.txt | uniq -c > A01_p_styles_uniq.txt

The result is a list of 97 distinct items, which is:

      1 bodyIN1 para-style-override-11
      1 bodyIN1 para-style-override-14
      2 bodyIN1 para-style-override-15
     97 bodyIN1 para-style-override-2
      3 bodyIN1 para-style-override-23
      1 bodyIN1 para-style-override-42
      1 bodyIN1 para-style-override-44
      3 bodyIN1 para-style-override-45
      2 bodyIN1 para-style-override-46
      8 bodyIN1 para-style-override-47
      1 bodyIN1 para-style-override-6
      1 bodyIN1 para-style-override-8
      2 bodyIN2 para-style-override-12
      1 bodyIN2 para-style-override-15
      8 bodyIN2 para-style-override-17
     58 bodyIN2 para-style-override-2
      8 bodyIN2 para-style-override-20
      4 bodyIN2 para-style-override-21
     15 bodyIN2 para-style-override-22
      2 bodyIN2 para-style-override-23
      1 bodyIN2 para-style-override-27
      4 bodyIN2 para-style-override-28
      5 bodyIN2 para-style-override-31
      1 bodyIN2 para-style-override-32
      3 bodyIN2 para-style-override-33
      3 bodyIN2 para-style-override-34
     10 bodyIN2 para-style-override-9
      2 bodylN3 para-style-override-12
      1 bodylN3 para-style-override-16
      5 bodylN3 para-style-override-2
      3 bodylN3 para-style-override-21
      2 bodylN3 para-style-override-28
      1 bodylN3 para-style-override-29
      2 bodylN3 para-style-override-8
      2 bodylN3 para-style-override-9
     14 Body-text para-style-override-12
     72 Body-text para-style-override-15
   1738 Body-text para-style-override-2
      5 Body-text para-style-override-21
      7 Body-text para-style-override-23
     16 Body-text para-style-override-24
      1 Body-text para-style-override-25
     16 Body-text para-style-override-31
      2 Body-text para-style-override-32
      1 Body-text para-style-override-39
      2 Body-text para-style-override-40
      1 Body-text para-style-override-41
     17 Body-text para-style-override-49
      2 Body-text para-style-override-50
      1 Body-text para-style-override-51
     10 Body-text para-style-override-52
     11 Body-text para-style-override-8
     17 Body-text para-style-override-9
      8 example1 para-style-override-15
    176 example1 para-style-override-2
      2 example1 para-style-override-21
      1 example1 para-style-override-23
      7 example1 para-style-override-31
      4 example para-style-override-15
      1 example para-style-override-19
    149 example para-style-override-2
      2 example para-style-override-36
      9 example para-style-override-37
      1 example para-style-override-38
      3 example para-style-override-53
      1 example para-style-override-54
      6 example para-style-override-55
      2 example para-style-override-56
     19 example para-style-override-8
      2 example para-style-override-9
     14 Head para-style-override-2
      1 Head para-style-override-43
      1 Head para-style-override-48
      3 nobody para-style-override-2
    296 No-Paragraph-Style para-style-override-1
      2 No-Paragraph-Style para-style-override-57
      1 No-Paragraph-Style para-style-override-58
      1 padal1 para-style-override-18
     11 padal1 para-style-override-2
     17 padal2 para-style-override-15
     99 padal2 para-style-override-2
      3 padal2 para-style-override-23
     13 padal para-style-override-2
     25 right para-style-override-2
      1 right para-style-override-31
      2 sub
      4 sub1
    645 subhead
      1 subhead para-style-override-10
      1 subhead para-style-override-13
     20 subhead para-style-override-26
      2 subhead para-style-override-3
      6 subhead para-style-override-30
      2 subhead para-style-override-35
     28 subhead para-style-override-4
      5 subhead para-style-override-5
      4 subhead para-style-override-7

However, 94 among those items are further analysable as the combination of a semantic prefix and a format specification, whereas 3 items (namely, "sub", "sub1" and "subhead") contain only the semantic part.

The chart for the distribution of the semantic component (number of distinct items per prefix) is as follows:

     12 bodyIN1
     15 bodyIN2
      8 bodylN3
     18 Body-text
     12 example
      5 example1
      3 Head
      1 nobody
      3 No-Paragraph-Style
      1 padal
      2 padal1
      3 padal2
      2 right
      1 sub
      1 sub1
     10 subhead

The chart for the distribution of the format specification (number of distinct items per specification) is as follows:

      1 para-style-override-1
     12 para-style-override-2
      1 para-style-override-3
      1 para-style-override-4
      1 para-style-override-5
      1 para-style-override-6
      1 para-style-override-7
      4 para-style-override-8
      4 para-style-override-9
      1 para-style-override-10
      1 para-style-override-11
      3 para-style-override-12
      1 para-style-override-13
      1 para-style-override-14
      6 para-style-override-15
      1 para-style-override-16
      1 para-style-override-17
      1 para-style-override-18
      1 para-style-override-19
      1 para-style-override-20
      4 para-style-override-21
      1 para-style-override-22
      5 para-style-override-23
      1 para-style-override-24
      1 para-style-override-25
      1 para-style-override-26
      1 para-style-override-27
      2 para-style-override-28
      1 para-style-override-29
      1 para-style-override-30
      4 para-style-override-31
      2 para-style-override-32
      1 para-style-override-33
      1 para-style-override-34
      1 para-style-override-35
      1 para-style-override-36
      1 para-style-override-37
      1 para-style-override-38
      1 para-style-override-39
      1 para-style-override-40
      1 para-style-override-41
      1 para-style-override-42
      1 para-style-override-43
      1 para-style-override-44
      1 para-style-override-45
      1 para-style-override-46
      1 para-style-override-47
      1 para-style-override-48
      1 para-style-override-49
      1 para-style-override-50
      1 para-style-override-51
      1 para-style-override-52
      1 para-style-override-53
      1 para-style-override-54
      1 para-style-override-55
      1 para-style-override-56
      1 para-style-override-57
      1 para-style-override-58
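
The two charts above can be derived mechanically from A01_p_styles_uniq.txt. The following Python sketch assumes that each line of that file has the form "COUNT PREFIX [OVERRIDE]", as in the list of 97 items; the function name split_class_inventory is illustrative and the sample is a handful of lines taken from that list. It counts, for each semantic prefix and for each format specification, the number of distinct class values in which it occurs.

```python
from collections import Counter


def split_class_inventory(uniq_text: str):
    """Return (prefix_counts, override_counts) over the distinct class values."""
    prefixes, overrides = Counter(), Counter()
    for line in uniq_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        # fields[0] is the occurrence count written by `uniq -c`; unused here.
        prefixes[fields[1]] += 1
        if len(fields) > 2:
            overrides[fields[2]] += 1
    return prefixes, overrides


# A few lines copied from the list of 97 items above.
SAMPLE = """\
      3 nobody para-style-override-2
    296 No-Paragraph-Style para-style-override-1
      2 sub
      4 sub1
    645 subhead
      1 subhead para-style-override-10
     28 subhead para-style-override-4
     25 right para-style-override-2
"""

if __name__ == "__main__":
    prefix_counts, override_counts = split_class_inventory(SAMPLE)
    print(dict(prefix_counts))
    print(dict(override_counts))
```

Items without a format specification, such as "sub", are counted in the first chart only, which is why the second chart accounts for 94 items rather than 97.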

We now examine the "class" attribute values which are sometimes taken by the <span> elements, by means of another script, which is as follows:

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
<xsl:import href="copy.xslt"/>
    
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    
    <xsl:template match="body">
        <body>
        <table>
            <tr><th>RANK</th><th>CLASS</th></tr>
            <xsl:apply-templates></xsl:apply-templates>
        </table>
        </body>
    </xsl:template>

    <xsl:template match="div">
            <xsl:apply-templates></xsl:apply-templates>
    </xsl:template>

    <xsl:template match="text()">
    </xsl:template>
 
    <xsl:template match="p">
        <xsl:apply-templates></xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="span">
        <tr>
            <td><xsl:value-of select="position()"/>
            </td><td><xsl:value-of select="if (exists(@class)) then @class else 0"/></td>
        </tr>
    </xsl:template>

<!--  ((SCRIPT: list-class-attribute-values-for-span-elements.xslt))
    -->
    
</xsl:stylesheet>

This again generates a chart, containing 13450 rows, among which 12606 contain the value 0 in the right column, which indicates the absence of a class attribute.

The remaining 844 rows are distributed in the following manner:

    389 char-style-override-1
    133 char-style-override-2
     57 char-style-override-3
    126 char-style-override-4
     22 char-style-override-5
     37 char-style-override-6
     10 char-style-override-7
      2 char-style-override-8
      1 char-style-override-9
     10 char-style-override-10
      3 char-style-override-11
      1 char-style-override-12
     14 char-style-override-13
      1 char-style-override-14
      2 char-style-override-15
      1 char-style-override-16
     34 char-style-override-17
      1 char-style-override-18

Examining the entities (and giving some of them a more suitable interpretation)

At this stage, before we start to simplify the structure and to get rid of the unnecessary elements of formatting, a preliminary task consists in examining the entities enumerated inside the current header, in order to provide a proper set of equivalences. The list (compiled on the basis of the 17 books of the TIPA) currently runs as follows:

  <!ENTITY lsquo "&#8216;"> (i.e. "‘")
  <!ENTITY rsquo "&#8217;"> (i.e. "’")
  <!ENTITY ldquo "&#8220;"> (i.e. "“")
  <!ENTITY rdquo "&#8221;"> (i.e. "”")
  <!ENTITY trade "&#8482;"> (i.e. "™")
  <!ENTITY ndash "&#8211;"> (i.e. "–")
  <!ENTITY bull "&#8226;"> (i.e. "•")
  <!ENTITY deg "&#176;"> (i.e. "°")
  <!ENTITY para "&#182;"> (i.e. "¶")
  <!ENTITY frac14 "&#188;"> (i.e. "¼")
  <!ENTITY frac12 "&#189;"> (i.e. "½")
  <!ENTITY thorn "&#254;"> (i.e. "þ")
  <!ENTITY shy "&#173;"> (i.e. "SOFT HYPHEN")
  <!ENTITY Agrave "&#192;"> (i.e. "À")
  <!ENTITY sup3 "&#179;"> (i.e. "³")

However, in the case of the 1st file, not every entity is found. There are no attestations of "þ", of the soft hyphen, of "À" or of "³". What we find is:

2 occurrences of the trade ENTITY ("&trade;" = "™"), which have to be replaced by "ண"
1 occurrence of the ndash ENTITY ("&ndash;" = "–"), which is suppressed in normalizing
1 occurrence of the bull ENTITY ("&bull;" = "•"), which is replaced by ஶ் (palatal s = grantha ś)
4 occurrences of the deg ENTITY ("&deg;" = "°"), which have to be replaced by ஸ்
7 occurrences of the para ENTITY ("&para;" = "¶"), which are (temporarily) replaced by the string "{{special_puLLi}}"
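
The substitutions listed above can be sketched as a small Python function working on the raw text of the file, before any XML parsing. The names REPLACEMENTS and replace_misused_entities are illustrative, and reading "suppressed" as "replaced by the empty string" is an assumption.

```python
# Replacement table following the list above. Reading "suppressed" as
# "replaced by the empty string" is an assumption.
REPLACEMENTS = {
    "&trade;": "ண",
    "&ndash;": "",
    "&bull;": "ஶ்",
    "&deg;": "ஸ்",
    "&para;": "{{special_puLLi}}",
}


def replace_misused_entities(raw_html: str) -> str:
    """Substitute the misused entity references in the raw (unparsed) text."""
    for entity, replacement in REPLACEMENTS.items():
        raw_html = raw_html.replace(entity, replacement)
    return raw_html


if __name__ == "__main__":
    # Entities that are used properly (here &ldquo;) are left untouched.
    print(replace_misused_entities("a&trade;b&ndash;c&ldquo;d&para;"))
```

Working on the raw text keeps the correctly used entities intact, so that they can still be resolved by the entity declarations kept in the header.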

As for the remaining 6 entities ("&lsquo;" = "‘"; "&rsquo;" = "’"; "&ldquo;" = "“"; "&rdquo;" = "”"; "&frac14;" = "¼"; "&frac12;" = "½"), they are used in a proper way, and must simply be resolved automatically. Therefore, after applying the corrections, the header of the corrected HTML file (called _A01_eluthu1_2c.html) contains, in the end, only:

  <!ENTITY lsquo "&#8216;"> (i.e. "‘")
  <!ENTITY rsquo "&#8217;"> (i.e. "’")
  <!ENTITY ldquo "&#8220;"> (i.e. "“")
  <!ENTITY rdquo "&#8221;"> (i.e. "”")
  <!ENTITY frac14 "&#188;"> (i.e. "¼")
  <!ENTITY frac12 "&#189;"> (i.e. "½")

Getting rid of the "xml:lang" attribute

At this stage, we are ready to get rid of the xml:lang attribute, by using the following script, which keeps only the class attribute of the <p> and <span> elements:

<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
<xsl:import href="copy.xslt"/>
    
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    
    <xsl:template match="body">
        <body>
            <xsl:apply-templates></xsl:apply-templates>
        </body>
    </xsl:template>

    <xsl:template match="p">
        <p><xsl:attribute name="class"><xsl:value-of select="@class"/></xsl:attribute><xsl:apply-templates></xsl:apply-templates></p>
    </xsl:template>

    <xsl:template match="span[@class]">
        <span><xsl:attribute name="class"><xsl:value-of select="@class"/></xsl:attribute><xsl:apply-templates></xsl:apply-templates></span>
    </xsl:template>

    <xsl:template match="span">
        <SPAN><xsl:apply-templates></xsl:apply-templates></SPAN>
    </xsl:template>
    

    <!-- (Script Name: filtering_out_xml_lang_attributes_in_p_and_span.xslt)
     -->
    
</xsl:stylesheet>

It must be noted, however, that we are playing a trick, by using two different spellings for the <span> element, which we write as <SPAN> (in upper case) when it does not have a "class" attribute. This causes Oxygen to protest, but allows us to concatenate the strings of consecutive <SPAN> elements, using for our comfort the Notepad++ editor, searching for the following regular expression (to be replaced by the EMPTY string):

</SPAN>\s*<SPAN>

After that, in order to produce a file acceptable to the Oxygen editor, we still have to:

get rid of the empty <SPAN> elements
replace all the remaining UPPERCASE "SPAN>" strings by (lowercase) "span>" strings

The final file will be saved under a new name (in this case: _A01_eluthu1_2d.html).
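
The Notepad++ steps described above can be sketched as a single Python function, applied to the text produced by the XSLT script. The function name merge_plain_spans is illustrative, and treating "empty" as "no content at all, in either serialised form" is an assumption.

```python
import re


def merge_plain_spans(text: str) -> str:
    """Apply the three clean-up steps to text holding <SPAN> placeholders."""
    # 1. Join consecutive class-less spans (same regular expression as above).
    text = re.sub(r"</SPAN>\s*<SPAN>", "", text)
    # 2. Drop empty class-less spans, written either as <SPAN/> or as
    #    <SPAN></SPAN> (assumption: "empty" means no content at all).
    text = re.sub(r"<SPAN\s*/>|<SPAN></SPAN>", "", text)
    # 3. Restore the lowercase spelling in opening and closing tags.
    return text.replace("SPAN>", "span>")


SAMPLE = (
    '<p class="subhead"><SPAN>ab</SPAN>\n'
    '   <SPAN>cd</SPAN><span class="char-style-override-1">ef</span><SPAN/></p>'
)

if __name__ == "__main__":
    print(merge_plain_spans(SAMPLE))
```

Spans carrying a class attribute keep their lowercase spelling throughout, so the case-sensitive regular expression never merges them with their neighbours.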

Seen with the Notepad++ editor, its appearance is now as follows:

(3a)