CSplain

Czech and Slovak in plain TeX

Petr Olšák, July 2013

CSplain is a conservative extension of Knuth's plain TeX. The csplain format differs from the plain format in that the CS-fonts are used by default instead of the CM fonts, which provides:

  • direct processing of Czech and Slovak letters (without using macros),
  • hyphenation patterns for the Czech and Slovak languages.

Since December 2012, CSplain also supports:

  • UTF-8 as the default input encoding,
  • powerful font handling (including resizing),
  • usage of TeX, eTeX, pdfTeX, XeTeX or LuaTeX,
  • internal encodings: CSfont, T1 or Unicode (the last one only with XeTeX or LuaTeX),
  • hyphenation patterns for 50+ languages in various internal encodings,
  • the powerful OPmac macro package, which is a part of the csplain package.

csplain.tar.gz
csplain - TUGboat article
the lecture in Brno

The csplain.tar.gz package contains not only the files needed to generate the csplain format but also additional macro support. It is a part of CSTeX. You probably don't need to download and extract the csplain package from this site, because it is included in the usual TeX distributions (TeX Live, MiKTeX).

Contents

  1. Formats csplain and pdfcsplain
  2. Making the formats
  3. Input and internal encoding
  4. Using Czech and Slovak
  5. The macro file opmac.tex
  6. UTF-8 encoded csplain
  7. Fonts in csplain
  8. Hyphenation patterns of various languages differently encoded
  9. Recommended reading

1. Formats csplain and pdfcsplain

The csplain format produces DVI output by default, while pdfcsplain produces PDF output by default. To process a document named document.tex with the csplain format, run:

   csplain document

The file document.dvi is created. To process the same document.tex with the pdfcsplain format, run:

   pdfcsplain document

The file document.pdf is created. How the csplain and pdfcsplain commands are implemented depends on the TeX distribution and the operating system.

2. Making the formats

The following formats should be installed automatically by your TeX distribution:

1. csplain.fmt ...... input: UTF-8, output: DVI, engine: pdfTeX+encTeX,
                      commandline: csplain document 
2. pdfcsplain.fmt ... input: UTF-8, output: PDF, engine: pdfTeX+encTeX,
                      commandline: pdfcsplain document
3. pdfcsplain.fmt ... input: UTF-8, output: PDF, engine: LuaTeX,
                      commandline: luatex -fmt pdfcsplain document
4. pdfcsplain.fmt ... input: UTF-8, output: PDF, engine: XeTeX,
                      commandline: xetex -fmt pdfcsplain document

Formats 1 and 2 are intended for common use. If you are an expert, you can try formats 3 and 4.

If you need to generate the formats manually, here are the command lines:

1. pdftex -jobname csplain -ini -enc csplain-utf8.ini
   ... the csplain.fmt file is created, save it to .../web2c/pdftex/
2. pdftex -jobname pdfcsplain -ini -enc csplain-utf8.ini   
   ... the pdfcsplain.fmt file is created, save it to .../web2c/pdftex/
3. luatex -jobname pdfcsplain -ini csplain.ini
   ... the pdfcsplain.fmt file is created, save it to .../web2c/luatex/
4. xetex -jobname pdfcsplain -ini -etex csplain.ini
   ... the pdfcsplain.fmt file is created, save it to .../web2c/xetex/

You have to run the texhash command (or something similar in your distribution) after the files are installed.

Note: You can generate your own formats for XeTeX or LuaTeX based on CSplain. See the files xeplain.ini and luaplain.ini.

3. Input and internal encoding

Input encoding. In older versions of csplain the input encoding depended on the operating system used. The new version (since December 2012) accepts only UTF-8 encoding.

Internal encoding. The default internal encoding of csplain is the CSfont encoding (derived from ISO-8859-2). The CSfonts are loaded in csplain by default. You can use other fonts, but they must have the same encoding.

If you need to use T1 encoded fonts, you have to write the following line at the beginning of your document:

\input t1code

If XeTeX or LuaTeX is used, neither the default encoding nor the T1 encoding is usable for Czech and Slovak texts, because these engines work internally in Unicode, which differs from the mentioned encodings in the Czech and Slovak letters. So you have to write the following at the beginning of your document:

\input ucode

and all fonts you use have to be Unicode encoded (OpenType format). For example, you can use \input lmfonts.

4. Using Czech and Slovak

csplain starts with the same default behavior as plainTeX: English hyphenation is set, the control sequences \v, \' etc. expand to the \accent primitive, and \nonfrenchspacing is active. The only difference is the default size of the typesetting area: csplain creates one-inch margins on A4 paper, while plainTeX is set for one-inch margins on letter paper.

The following commands are reserved for initializing the hyphenation patterns and for making the sequences \', \v, \^, \`, \" and \r expand to the real characters:

   \chyph     % initializes Czech hyphenation and \frenchspacing
   \shyph     % initializes Slovak hyphenation and \frenchspacing
   \csaccents % causes different behavior of \', \v, \^, \`, \" and \r,
              % which now expand to characters of the CSfont

Recommendation: The first line of the document should look like this:

   \chyph % use format csplain

When a user processes such a document with another format, \chyph isn't defined and the above line appears in the error message including the comment, so the user can see which format the document has to be processed with.

To return to the original settings:

   \ehyph      % the default U.S. hyphenation and \nonfrenchspacing
   \cmaccents  % \', \v etc. expand to the \accent primitive

Other commands are just shortcuts to some of the characters in the CS-fonts:

   \clqq     % Czech left double quotation mark
   \crqq     % Czech right double quotation mark
   \flqq     % French left double quotation mark
   \frqq     % French right double quotation mark
   \promile  % permille character
   \uv       % the text quoted by Czech quotes: \uv{text}
   \ogonek a % Polish letter a with ogonek (assembled from components)
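
Putting the commands of this section together, a minimal document might look as follows. This is only a sketch: the sample sentences are invented for illustration, and it assumes that the file is processed by csplain or pdfcsplain.

   \chyph % use format csplain
   \csaccents   % \', \v etc. now expand to CSfont characters
   Toto je \uv{kr\'atk\'y} \v{c}esk\'y text.

   \ehyph       % back to U.S. hyphenation and \nonfrenchspacing
   This paragraph uses the English hyphenation patterns again.
   \bye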

cstexman.pdf

For more information about the defaults in csplain and its differences from plainTeX, see the Manual for CSTeX, paragraphs 4.3 and 9.4.

5. The macro file opmac.tex

The csplain format is designed as a minimal extension of plainTeX, so the format itself offers no features beyond the plain commands and the commands described in the previous chapter. It is the basis for low-level processing of Czech and Slovak texts of all kinds. However, the user would have to program the more commonly needed features: automatic creation of the table of contents, numbering, cross-references, verbatim environment, hyperlinks, font size switching etc. The user doesn't need to do this work when using the opmac.tex macro file. This file has been a part of the csplain package since the end of 2012.

For more information about these macros, see the OPmac web page.
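
For illustration, a short document using a few OPmac macros might look like this. It is a sketch only: the title, the section texts and the label name intro are made up, and the OPmac documentation remains the authoritative reference for \typosize, \tit, \maketoc, \sec, \label and \ref.

   \input opmac
   \typosize[11/13]   % 11pt type on a 13pt baseline

   \tit Sample document

   \maketoc           % table of contents, complete after the second run

   \label[intro]
   \sec Introduction

   Text of the first section.

   \sec Details

   See section~\ref[intro] for the introduction.

   \bye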

6. UTF-8 encoded csplain

This chapter describes the behavior of csplain generated for UTF-8 input encoding using encTeX, i.e. in TeX and pdfTeX. A short note about XeTeX and LuaTeX is at the end of this chapter. Since 2012, the UTF-8 encoded csplain is the recommended default.

CSplain format with UTF-8 input implicitly recognizes the following characters in input files:

  1. All ASCII characters (128 characters, the so-called ``seven-bit chars'').
  2. Characters of the Czech and Slovak alphabets.
  3. Characters that are defined in plainTeX or csplain as control sequences: \ss, \l, \L, \ae, \oe, \AE, \OE, \o, \O, \i, \j, \aa, \AA, \S, \P, \copyright, \dots, \dag, \ddag, \clqq, \crqq, \elqq, \erqq, \elq, \erq, \flqq, \frqq, \promile. UTF-8 codes for these characters are transformed into these control sequences by the TeX input processor and they are transformed back to the UTF-8 codes during \write.

If any other character appears in the input file (em dash, non-breaking space, etc.), the UTF-8 encoded csplain prints a message similar to this on the terminal:

  WARNING: unknown UTF-8 code: `€ = ^^e2^^82^^ac' (line: 42)

and it inserts a black box into the DVI or PDF output. The user can map an undefined code to a control sequence and define that sequence, like this:

  \mubyte\eurochar ^^e2^^82^^ac\endmubyte % the code of the character € is mapped to \eurochar
  \def\eurochar{{\eurofont e}}            % the definition of \eurochar
  \font\eurofont=feymr10                  % the font used
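
The same pattern can be applied to the em dash and the non-breaking space mentioned above. The following is a sketch only: the names \emdash and \nbspace are hypothetical, and the definitions simply reuse the plainTeX ligature --- and the tie ~.

  \mubyte\emdash ^^e2^^80^^94\endmubyte   % the UTF-8 code of U+2014 is mapped to \emdash
  \def\emdash{---}                        % hypothetical name; em dash via the font ligature
  \mubyte\nbspace ^^c2^^a0\endmubyte      % the UTF-8 code of U+00A0 is mapped to \nbspace
  \def\nbspace{~}                         % hypothetical name; the plainTeX tie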

The following files are prepared to extend the set of UTF-8 codes which are understood (mapped to a control sequence and defined):

  utf8lat1.tex ... mapping of UTF-8 codes from Latin-1 Supplement U+0080--U+00FF
  utf8lata.tex ... mapping of UTF-8 codes from Latin Extended-A U+0100--U+017F
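
As a usage sketch, both files can be read at the beginning of a document. The sample text is an assumption, chosen only because it contains characters from both blocks; how well each character prints depends on the definitions in these files and on the font in use.

  \input utf8lat1   % UTF-8 codes of U+0080--U+00FF
  \input utf8lata   % UTF-8 codes of U+0100--U+017F
  Señor Gómez, Frau Müller, Łódź.
  \bye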

I expect that someone will extend the mapping of UTF-8 codes to other important blocks of the Unicode table in the same manner as in the files utf8lat1.tex and utf8lata.tex.

Another example of supporting new UTF-8 codes is the file cyrchars.tex, which supports Cyrillic characters natively (without explicit font switching). More documentation about this is at the end of the mentioned file.

If you give csplain an input file which isn't encoded in UTF-8, the following error message appears:

  ! UTF-8 INPUT IS CORRUPTED! May be you are using another input encoding.

In such a case, you can add one of the following two \input commands to your document:

  \input utf8off ...  switches off the UTF-8 encoding, input / output is in ISO-8859-2
  \input mixcodes ... a mix of the following encodings may be used:
                      UTF-8, ISO-8859-2 and CP1250. Everything is processed
                      correctly without having to use a switch.
                      Output by \write is stored in UTF-8.

XeTeX and LuaTeX support UTF-8 input encoding natively without encTeX, thus the warnings about missing UTF-8 characters don't occur. Because the internal encoding is Unicode in XeTeX and LuaTeX, you have to set this internal encoding and load a Unicode font when you use non-ASCII characters in your document:

\input ucode    % internal encoding is set to Unicode
\input lmfonts  % Unicoded font is loaded
Tady je český textík. % Non-ASCII text is supported now
\end

More detailed information on the usage of UTF-8 encoded input can be found in the Manual for CSTeX in paragraph 4.6.

7. Fonts in csplain

The default font family in csplain is CSfont, which is a mild extension of Knuth's Computer Modern fonts. It is possible to switch to another font family from the 35 fonts guaranteed in every Adobe PostScript implementation by \inputting one of the following font declaration files in your document:

  \input ctimes   % Times font family
  \input chelvet  % Helvetica font family
  \input cavantga % AvantGarde font family
  \input cbookman % Bookman font family
  \input cncent   % NewCenturySchlbk font family
  \input cpalatin % Palatino font family

The metrics of these fonts are prepared in an encoding that slightly extends the CSfont encoding. These metrics are included in the cspsfonts.tar.gz package and they should be a part of every TeX distribution. After \input of any of these files, you can print the following new characters: \ellipsis, \textbullet, \sterling, \euro, \trademark, \registered, \currency, \section, \clq, \crq, \flq, \frq. The following macros from plainTeX are redefined: \dag, \ddag, \copyright, \Lslash, \lslash, \P and \S, so that they refer directly to slots in the font (the original macros typically build these characters from components). The UTF-8 encoded csplain extends the set of mapped UTF-8 codes by these characters.
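
A short sketch of how the additional characters might be used after loading one of these files. The sample text is invented; the empty braces after each control sequence only keep the following space from being swallowed.

  \input ctimes   % Times font family
  The price is 10\,\euro{} or 9\,\sterling{}.
  Acme\trademark{} and Acme\registered{} are sample names.
  \bye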

The metrics for these fonts include the Euro and a few other characters. I regenerated all these metrics. These fonts no longer go through virtual fonts that build letters from floating accents. This means you can search in the PDF file and copy text from a PDF viewer via the clipboard. For more details see chapter 3 in the Manual for CSTeX.

The font declaration files mentioned above also load the macro file tx-math.tex. It prepares math typesetting using the TX fonts. A superset of the mathematical symbols known from AMSTeX is available. The math alphabets \frak (Fraktur), \script (a script more rounded than \cal), \bbchar (double-struck letters), \bf and \bi (bold sans-serif alphabet, upright and slanted) are ready. In 2012, the obsolete and barely functional macro \setsimplemath was removed. For more details see section 4.5 in the Manual for CSTeX.

When using the default font family (CSfonts), it is possible to \input the macro file ams-math.tex, which offers similar features to tx-math.tex but uses AMS fonts.

Fonts in text and math can easily be scaled up and down. For more details see the Manual for CSTeX.

Many fonts in TeX distributions are available only in T1 encoding, which is incompatible with the CSfont encoding used in csplain. This does not matter; just type at the beginning of your document:

   \input t1code

and you can work with T1 encoded fonts. CSplain internally switches to T1 encoding including hyphenation patterns. If you are using UTF-8 input, you need not worry about anything else; the macro file t1code changes the encoding tables for the input processor itself. When encTeX isn't used, you have to take care of the transcoding by other means.

If you write \input t1code before \input ctimes (or \input cavantga etc.), the corresponding T1 encoded fonts are loaded. These fonts have two problems: a) they do not include the Euro, b) the d and t characters with caron are badly implemented. Problem b) can be solved only if you find the error in the corresponding configuration of the fontinst program and ask the administrator to regenerate all T1 encoded fonts in the TeX distribution.

I plan to include in CSTeX further font declaration files for fonts commonly available in current TeX distributions. Using ``tex cs-all'' you can print all font families supported by font declaration files. Apart from the above ctimes, chelvet etc., the following files are ready:

  \input lmfonts     % Latin Modern fonts
  \input cs-bera     % Bera
  \input cs-arev     % ArevSans
  \input cs-charter  % Charter
  \input cs-antt     % Antykwa Torunska
  \input cs-polta    % Antykwa Poltawskiego
  \input cs-termes   % TeX Gyre Termes
  \input cs-adventor % TeX Gyre Adventor
  \input cs-bonum    % TeX Gyre Bonum
  \input cs-heros    % TeX Gyre Heros
  \input cs-pagella  % TeX Gyre Pagella
  \input cs-schola   % TeX Gyre Schola
  \input cs-cursor   % TeX Gyre Cursor

All these font declaration files load fonts in CSfont encoding by default, but if \input t1code is used beforehand, these files load fonts in T1 encoding.

If you work with large sets of fonts, I suggest using the OFS macro package.
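
For example, a document set in TeX Gyre Termes with T1 encoding could begin as follows (a sketch; the text line is only a placeholder):

  \input t1code      % switch the internal encoding to T1 first
  \input cs-termes   % TeX Gyre Termes is now loaded in T1 encoding
  \chyph             % Czech hyphenation patterns in T1 encoding
  Text of the document.
  \bye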

8. Hyphenation patterns of various languages differently encoded

Hyphenation patterns are loaded when the format is generated. CSplain is ready to load hyphenation patterns of 54 languages in three possible encodings. By default, it reads only the English patterns (the default patterns of plainTeX) and the Czech and Slovak patterns encoded in CSfont and T1 (Cork). If generation by a 16-bit TeX engine is detected, the Czech and Slovak hyphenation patterns are loaded in Unicode too.

Czech patterns are switched on by \chyph (or \czlang, which does the same thing), Slovak by \shyph (or \sklang) and English by \ehyph (or \uslang). These switches operate in the encoding selected by the command \input t1code (T1 encoding) or \input ucode (Unicode). If no such \input command is used, hyphenation patterns are initialized in the CSfont encoding.

Other hyphenation patterns can be loaded during format generation if you uncomment the corresponding line in the file hyphen.lan. Alternatively, it is possible to add the request for loading hyphenation patterns to the command line which generates the format, like this:

pdftex -ini -enc "\let\plCork=y \let\enc=u \input csplain.ini"

This example generates csplain with UTF-8 encoding and loads the default hyphenation patterns plus the hyphenation patterns of our Polish friends (patterns encoded in Cork). You can switch these hyphenation patterns on by the command \pllang in your document. (Something like \phyph is no longer supported, because there are 54 possible languages and their hyphenation patterns but only 26 letters in the alphabet.) The \pllang command will not work until \input t1code is used, because the Polish hyphenation patterns are loaded in Cork encoding only (aka T1).

If a 16-bit TeX engine (LuaTeX, XeTeX) is detected, it is possible to load hyphenation patterns marked \..Unicode, e.g. \deUnicode, \ruUnicode, \plUnicode. You can switch to these hyphenation patterns by \delang, \rulang, \pllang, \czlang, ..., if preceded by \input ucode. It is also necessary to set up typesetting in Unicode with some Unicode font, otherwise the output will be garbled. At the moment, you can use \input lmfonts to load the Latin Modern fonts in Unicode. The other possibility is to use the TeX Gyre fonts by \input cs-termes, \input cs-adventor, ..., \input cs-schola, which are able to load the Unicode variants of these fonts too.

After \input ucode, csplain sets the \lccode values only for the Czech and Slovak alphabets, i.e. if you use a different language, you have to set the needed \lccode values for it yourself. Otherwise the Unicode-encoded hyphenation patterns of foreign languages will not work.
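
As an illustration, the \lccode values for the German letters could be set as follows. This is a sketch under the assumption that the format was generated with the \deUnicode patterns loaded and that the document is processed by XeTeX or LuaTeX; the sample text is invented.

  \input ucode     % internal encoding is set to Unicode
  \input lmfonts   % Unicode font is loaded
  \lccode"E4="E4 \lccode"C4="E4   % ä (U+00E4), Ä (U+00C4)
  \lccode"F6="F6 \lccode"D6="F6   % ö (U+00F6), Ö (U+00D6)
  \lccode"FC="FC \lccode"DC="FC   % ü (U+00FC), Ü (U+00DC)
  \lccode"DF="DF                  % ß (U+00DF)
  \delang          % German hyphenation (assumes the patterns are in the format)
  Überraschungsgäste grüßen.
  \bye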

9. Recommended reading

Items are listed in the suggested order.

  1. Petr Olšák: First meeting with TeX.
  2. Petr Olšák: TeX for Pragmatists.
  3. Petr Olšák: TeXbook inside out.
  4. Petr Olšák: TeX typesetting system.
  5. Donald Knuth: The TeXbook.