% preamble stuff \author{Simon Cozens} \def\WP{\hbox{{\sl WEB\/}Perl}} \title{Weave for {\it WEB\/}Perl} \def\pt{{\tt tangle.pl}} \def\pw{{\tt weave.pl}} \def\at#1{{\tt @#1}} \def\vis{{\tt\char` }} @*Introduction. `Literate Programming' was a concept created by D.~E. Knuth to describe his `WEB' meta-language which allows one to both program and document one's code at the same time. According to Knuth: \begin{quote} The philosophy behind WEB is that an experienced system programmer, who wants to provide the best possible documentation of his or her software products, {}needs two things simultaneously: a language like \TeX{} for formatting, and a language like C for programming. Neither type of language can provide the best documentation by itself; but when both are appropriately combined, we obtain a system that is much more useful than either language separately. \end{quote} WEB has two parts: a {\bf tangler}, which creates compilable program code in the target language, and a {\bf weaver}, which writes documentation in some formatting language. The original WEB language was specifically designed to convert a WEB program into a \TeX{} document and a PASCAL program. Many other implementations of WEB have arisen which purport to format any language and any document formatter. However, I have chosen to write an implementation specifically for Perl, due to the syntactic complexity of the language. It is my belief that many of the WEB implementations around at present neither cater particularly well for the Perl programming language, nor implement the full set of WEB commands that Knuth intended. @ Revision History. I started version 1 on the 29th of October 1998. I had got up to version 0.7 in the last month and it was working reasonably well, but I was a little upset about the untidiness and unstructuredness of the code, and the lack of compatibility with WEB. I liked a few of the algorithms I had used, (I still think the indexing section is rather good, but we'll come to that later) so I've kept them, but rather a lot of things like the way modules were hashed and labelled was just too damned vague, and vagueness is not a good thing in computer programming. I've also attempted to throw in a few more features---formatter independence will be a good thing if it works properly. This is version 1.1, on the 10th of November, caused by a total re-think of the perl formatting system. A note on my revision scheme - the major part reflects the number of large-scale rewrites of the program, the minor part, the number of times large chunks have been rewritten, and the sub-revision part, the number of small changes which have been made to it. Scratch that. This is the 13th of November, and it's version 2. This is a complete rethink of everything. This was caused by me thinking first and programming second. People were bemused at the sight of me sitting in a pub, drinking Guinness and hacking on an A4 notepad. No, not a computer, a notepad. Another rethink. This is 3.0.0, and I've abstracted a lot of the common code between tangle and weave, and I {\bf am} going to use proper tokenisation this time. It may be slightly slower (a fault of Perl, which may be overcome if I can get Mr Beattie's nice compiler to work) but it's more {\bf correct}, damn it. Formatter abstraction will work properly, and I will implement all that I say I'm going to implement in the {\it WEB\/}Perl Book. It's the 6th of March. I've been a busy, busy boy. @= print "This is weave, version 3.0.-1\n"; @* Implementation Notes. This is the bit where I ramble on about how I'm planning to implement various things. Think of it as my coding notes, where I examine the specifications and get absolutely clear about what I'm trying to do in various areas and how I'm going to do it. {\tt WEB} coding has made me see that I need to think first and code second. This is the part I'm writing first. @ Change files - what do we need to do about them? This will, of course, become part of the common code, but the current version of \pt{} (1.1.0) doesn't fully support changing, so we've got to think about it now. OK, what are we looking for in a change file? The easiest one to spot is going to be alterations to the plain format definition (\at{w} commands) which need to be fiddled into the format hash. More on that later. The hard part, of course, is going to be wandering through and spotting \at{x}--\at{y}--\at{z} constructions. Now, there are two options here---% we can either incorporate changes on the fly, reading both files simultaneously, which is how Knuth's WEB does it, or we can read the change file into a hash, (or similar structure) and implement changes after all is read. In the event, I did it similar to Knuth's method. You can see more details in the common code web, but the basic principle is that when a line matching the first line of a change is found, all lines in the change are matched and, if successful, the code is taken from the replacement section until an \at{z} is hit, and then we start again\dots Here, though, I need to mention the fact that \pw{} deals with the change file in a different way to Knuth's weaver, which, when presented with a change file, weaves only those parts which are changed. \pw{} incorporates the changes and then weaves the whole program. @ The next one I want to think about is the whole business of formatter commands and format specifications. Basically it looks as though things are converted to an internal metalanguage which is interpreted according to the target formatter. This isn't as messy as it sounds. The idea is, you have a set of commands which are translated into the target formatter's language, like this: $$\hbox{\tt boldface} == \hbox{\tt \{$\backslash$bf \#\}} \leqno{\hbox{\at{w}}} $$ and this: $$ \backslash == \hbox{\tt \$$\backslash${}backslash\$} \leqno{\hbox{\at{f}}}$$ % Don't try this at home, folks. and you also have your ordinary {\tt WEB} format definitions, which are formatter independent and reference these either directly, like this: $$\hbox{\tt for} == \hbox{\tt boldface}\leqno{\hbox{\at{f}}}$$ (which, when \pw{} first sees it, defines the token {\tt for} to be a synonym for {\tt \{$\backslash$bf \#\}}. When \pw{} runs through the web, each {\tt for} will be replaced by {\tt \{$\backslash$bf for\}})\footnote{Notice that \pw{} will substitute the ``command'' name, when it doesn't find a parameter that it's expecting.} or indirectly, as {\tt WEB} programmers should be used to: $$\hbox{\tt foreach} == \hbox{\tt for}\leqno{\hbox{\at{f}}}$$ {However, these two statements have to be specified in the order given above - if \pw{} has no other definition of how to format {\tt for}, it will format {\tt foreach} by replacing it with the letters {\tt for}!)} Note at this point, there are {\bf two distinct types} of formatter array; one of these deals with internal commands, which we don't want ordinary code to be able to access---it would not be pleasant if every instance of the word ``begin'' in our code to be replaced by the formatter code to start a new document---which we will call the {\bf internal command array} and will code as |%command|, which are specified by \at{w} commands, and also those which should replace specific tokens---escaping a backslash, formatting up reserved words, and so on---which we will call the {\bf token format array}, manipulate using \at{f} commands and code as |%tokenformat|. Basically, if you want a perl token to be formatted in a given way, it goes in |%tokenformat|. If it's a format command sequence, it goes in |%command|. As you can probably tell, \pw{} can be expanded in a number of very powerful ways, and format in any way imaginable. @ \label{crush}Token crushing. From time to time, we'll need to amalgamate certain tokens together. Here we present a list of things that need to be crushed into one token so that |%tokenformat| entries can be put into operation for them. If we're using a version of Perl that doesn't have the zero-length lookahead operators, we also need to crush together individual letters into tokens. \begin{verbatim} .= += -= *= /= => -> \end{verbatim} Rethink. I'm going to take quite a momentous step here. I'm going to say that {\bf all command sequences} should be crushed into one token. That is, \at{\vis}, \at{*}, \at{$<$} and so on should be counted as one token. This means I've got to rewrite substantial bits of the tokeniser. Ho well. @* Notes on Parsing. Recognising various elements of Perl is extremely difficult. To quote the Perl FAQ: \begin{quote} There is no BNF, but you can paw your way through the yacc grammar in {\tt perly.y} in the source distribution if you're particularly brave. The grammar relies on very smart tokenizing code, so be prepared to venture into {\tt toke.c} as well. In the words of Chaim Frenkel: ``Perl's grammar can not be reduced to BNF. The work of parsing perl is distributed between yacc, the lexer, smoke and mirrors.'' \end{quote} To implement some of the nattier features that have been suggested to me, I'm going to need to look into recognising things like regular expressions, identifiers, and reserved words, and I have {\bf committed} myself to doing this, as much as possible, token-based, rather than by using regexps. This may well be shooting myself in the foot, because using a lexer is a Good Thing. However, Knuth didn't need regexps to parse Pascal, so I don't need them either. How are we going to do these elements, then? @ Identifiers are the easiest place to start, since we know that an identifier is going to start with appropriate punctutation. But how much punctuation? |$#array| is very common. I shall work from the hypothesis that Perl variables can be described by the regular expression $$ (\$\mid@@{}\mid\%)(\$\mid@@{}\mid\#)*[A-Za-z0-9\_]+$$ % I wonder if you've spotted the problem with using a vertical bar to denote % code interjections in a documentation phase? @ Reserved words I've done before, and I'm happy with the criteria I used to recognise them then: \begin{quote} The reserved words problem is tricky. Let us define some circumstances in which we can definitely state that a given string of characters is a reserved word. \begin{itemize} \item The string of characters must exactly equal the reserved word. Sounds obvious, I know, but that's the way to start. \item It must be preceded and followed by non-word characters. \item Things that can precede a reserved word: \begin{enumerate} \item Blank space \item Nothing at all \item A brace of either type. \item A bracket of either type. \item A mathematical operator \end{enumerate} \item Things that can follow a reserved word: \begin{enumerate} \item Blank space \item Nothing at all \item A brace of either type. \item A bracket of either type. \item A mathematical operator \end{enumerate} \item We also impose the restriction that words are not reserved inside quotation marks. \end{itemize} So, let us assume that the character we're examining is a valid preceding-type. If we can then find an exact match for a reserved word, then a valid following-type character, we have something we should embolden. \end{quote} @ {\it I'll come back to recognising regexps, but I'm eager to get hacking right now.} @* What happens when. Here is a detailed plan of action for the weaving process. Firstly, we bring in our data file(s), and we make a first pass over that data. The purpose of this pass is to \begin{itemize} \item Deal with \at{w}s that the changing mechanism presents us with. \item Collect up and number code section names, and collecting cross-reference information. \item Collect up starred module names, preparing the table of contents. \item Collect up and index variables and other index information. \item Put the merged source into another array so that we can spin through it faster on the next pass. \end{itemize} Note that we don't actually disambiguate abbreviated module names at this stage---we can afford to leave this until they are needed in the output stage. The second pass is the main output phase: this one pass through the merged source will produce the finished document. We can spit out the preamble provided by the web, which is easy enough. After this, we can add our start-of-document material, and print out the table of contents, which we should have constructed in the first pass. Then the fun really starts. Basically, every token is examined, after which it either causes a state shift, or (in Perl mode) a format lookup. But we'll look in more detail at the mechanism for that later. For the moment, here's the whole program: @p @ @ @ @ @ And so here we begin: by way of setting up, we load in all the common functions\footnote{In this particular build, I've placed the common code in a separate file which gets concatenated to the end of the webs for \pt{} and \pw{} just prior to tangling. When I release the code publically, I'll probably add the common code to the end of the webs so that only two web files need be shipped and that the compilation process is nice and simple.}, deal with any variables we may need to set up, and finally bring in the web (and possibly change) files. @= $top=time; @ @ @ local @@tokenlist; $|=1; $changing=$finished=0; @ @ @* The Default Format. This is a good candidate for changing, if you want to change \pw{} itself. This sets up the inital format arrays for the \LaTeX{} format. It's quite a long section, and should be relatively self-explanatory. All of the commands will be described in the `Formatter Independence' chapter of the \WP{} Book. Firstly, here are all the commands internal to \pw{}: @= %command = ( "tab" => "{\\ }\\hspace*{0.5cm} ", "newline" => "\\\\\n", "preamble" => "\\documentclass[twoside]{article}\n\\pagestyle{myheadings}\n", "begin" => "\\begin{document}\n\\maketitle\n", "end" => "\\end{document}\n", "begincode" => "\\noindent\{\\sl\n", "endcode" => " \} %end slanted\n", "starred" => "\\newpage {\\bf #modno #partname} \\markboth{\\sc #partname}{\\sc #partname} ", "unstarred" => "\\vskip 3pt {\\bf #modno} ", "label" => "\\label{#}", "ref" => "\\pageref{#}", "dotfill" => "\\dotfill{}", "smalldotfill" => "\\dots{}", "indexname" => "Index", "beginindex" => "\\begin{theindex}", "indexitem" => "\\item ", "indexspace" => "\\indexspace ", "endindex" => "\\end{theindex}", "tocheader" => "\$\$\\hbox{\\Large Table Of Contents}\$\$", "ctname" => "Table of Code Sections", "codetable" => "\\newpage\$\$\\hbox{\\Large Table Of Code Sections}\$\$ \\markboth{\\sc Table of Code Sections}{\\sc Table of Code Sections}", "modequals" => " \$\\equiv\$ ", "moredef" => " \$\\buildrel+\\over=\$ ", "at-p" => "\\par {\\bf Program Source } \\", "macro" => "\\verb+\@@d+ ", "hbox" => "\\hbox{#}", "break" => "\\discretionary{}{}{}", "visiblespace" => "{\\char`\ }\\discretionary{}{}{}", "comment" => "\\hfill{}{ \\rm \\{ # \\} }", "opensq" => "`", "closesq" => "'", "opendq" => "``", "closedq" => "''", "dummy" => "{}", "escape" => "\\#{}", "rw" => "{\\bf #}", "begincommand" => "{\\tt ", "endcommand" => "}", "beginstring" => "{\\tt ", "endstring" => "}", "csref" => "{\\rm \$\\langle\$ #modno #csname \$\\rangle\$}", "modref" => "{\\bf This code is used in sections \$\\mit # \$}\\\\", "saref" => "{\\bf See also sections \$\\mit # \$}\\\\", "isamacro" => "{\\tt # }" ); @ And a few things which are used by our \LaTeX{} format definition but are not native to \pw{}. @= $command{"begintt"}="{\\tt "; $command{"endtt"}="}"; $command{"bold"}="{\\bf # }"; $command{"teletype"} = "{\\tt # }"; @ Here's a quickie to allow parameters to be passed quickly to commands, where there's a single-parameter macro. This won't be used on commands which take named parameters. @= sub fillin { ($commandname, $param) = (shift, shift); $answer=$command{$commandname}; $answer=~s/\#/$param/g; return $answer; } @ Now things we need to escape so that \LaTeX{} processes them tidily: @= %tokenformat=( "<" => "\$<\$", ">" => "\$>\$", "_" => fillin("escape","_"), "\$" => fillin("escape","\$"), "\\" => $command{"break"}."\$\\backslash\$", "~" => fillin("escape","~"), "\&" => fillin("escape","\&"), "\%" => fillin("escape","\%"), "^" => fillin("escape","^"), "#" => $command{"escape"}, @{ Look carefully at that one @} "{" => $command{"break"}.fillin("escape","{"), "}" => $command{"break"}.fillin("escape","}"), "=>" => "\$\\rightarrow\$", "@@&" => fillin("teletype","@@&"), "|" => "{ \$ \\mid \$} " ); @ Another swift formatting function to allow us to quickly insert reserved words into the token format hash. The formatter command will be set up here but executed at output time, in case someone changes either the token format hash for a given reserved word (using an \at{f}) or the overall format specification for reserved words, |%command{"rw"}|, using an \at{w}. Quite how that will work I have yet to discover. @= sub rw { while ($rw=shift) { $tokenformat{$rw}=$command{"rw"}; } } @ Similarly, when a macro is defined, we want to add it to the token format. @= sub isamacro { $_=shift; $tokenformat{$_}=$command{"isamacro"}; } @ Here we load up all the reserved words with the default reserved-word command sequence, which remains unexpanded until output time, as noted above. Change this as perl's reserved word list changes. @= rw(qw (abs accept alarm atan2 bind binmode bless caller chdir chmod chomp chop chown chr chroot close closedir connect continue cos crypt dbmclose dbmopen defined delete die do dump each else elsif endgrent endhostent endnetent endprotoent endpwent endservent eof eval exec exists exit exp fcntl fileno flock for foreach fork format formline getc getgrent getgrgid getgrnam gethostbyaddr gethostbyname gethostent getlogin getnetbyaddr getnetbyname getnetent getpeername getpgrp getppid getpriority getprotobyname getprotobynumber getprotoent getpwent getpwnam getpwuid getservbyname getservbyport getservent getsockname getsockopt glob gmtime goto grep hex if import index int ioctl join keys kill last lc lcfirst length link listen local localtime log lstat map mkdir msgctl msgget msgrcv msgsnd my next no oct open opendir ord pack package pipe pop pos print printf prototype push qq quotemeta qw qx rand read readdir readlink recv redo ref rename require reset return reverse rewinddir rindex rmdir scalar seek seekdir select semctl semget semop send setgrent sethostent setnetent setpgrp setpriority setprotoent setpwent setservent setsockopt shift shmctl shmget shmread shmwrite shutdown sin sleep socket socketpair sort splice split sprintf sqrt srand stat study sub substr symlink syscall sysopen sysread system syswrite tell telldir tie tied time times tr truncate uc ucfirst umask undef unless unlink unpack unshift untie until use utime values vec wait waitpid wantarray warn while write )); @ And that, it seems wraps up the default format. Now we have to implement the bloody thing. @* Phase One Processing. To recap, we've stated our aims in this phase are: \begin{itemize} \item Deal with \at{w}s that the changing mechanism presents us with. \item Collect up and number code section names, and collecting cross-reference information. \item Collect up starred module names, preparing the table of contents. \item Collect up and index variables and other index information. \item Put the merged source into another array so that we can spin through it faster on the next pass. \end{itemize} To check for code section names, I first thought that we could use regular expressions, which actually seems vaguely sensible. However, what can we do if a section name extends over several lines? We have to use the tokeniser in this case. This has a number of repercussions on the rest of the process. We first need to check when a new module starts so we can increment the module counter and thus give the correct number in the code section array. (Since all code sections have modules, but not all modules have code sections.) Now, at this stage we {\bf can} disambiguate dotted references, if we keep a full list somewhere of all the code section names that we come across both in the definition of modules (\at{$<$}\dots\at{$>=$}) and also in references to other modules. \footnote{I previously thought you couldn't, but I've realised that you can only use dots in {tt WEB} if you've previously used the name in full---I thought you could use the dotted name first, but you can't. This makes things much easier.} (\at{$<$}\dots\at{>}) Accordingly, here's the main phase one loop: @= @ local $newsourceline=0; @@merged=(); $merging=1; $phase=1; $mergingtime=$tokentime=$moduletime=$cstime=$uctime=$indextime=0; $initialising=time-$top; while ($finished==0) { $mark=time; @ @@merged[$newsourceline++]=$_; $mergingtime+=time-$mark; $mark=time; @ $tokentime+=time-$mark; while(moretokens()) { $mark=time; @ $moduletime+=time-$mark; $mark=time; @ $cstime+=time-$mark; $mark=time; @ $uctime+=time-$mark; $mark=time; @ $indextime+=time-$mark; skiptoken() @{ Since nobody's interested in it. @} } @ } $bot=time; if ($stats) { print "\n\nPhase 1 statistics:\nInitialisation time\t: $initialising\n"; print "Merging time\t: $mergingtime\nToken time\t: $tokentime\n"; print "Module time\t: $moduletime\nCS Time\t: $cstime\nUC time: $uctime\n"; print "Indexing time\t: $indextime\nOther time\t:"; $other=($bot-$top)-($initialising+$tokentime+$mergingtime+$moduletime+$cstime+$uctime+$indextime); print $other."\n"; } @ $merging=0; $phase=2; @ Loading the tokens into the tokeniser isn't part of the common code, for reasons explained there. This is where we implement that token crushing stuff we stipulated earlier. @= @@tokenlist=split /(?<=[^a-zA-Z0-9])|(?=[^a-zA-Z0-9])/o; $tokencount=0; @@tokenlist=crushtokens(@@tokenlist); #if ($debug>1) { # print "\n $sourcelineno:\n"; # foreach (@@tokenlist) { print $_ } ; # print "\n"; #} @ This code was suggested by Malcolm Beattie. Nothing like a bit of name-% dropping, is there? @p sub crushtokens { my @@res; my $el1, $el2; while (@@_) { ($first, $second) = (shift, shift); if (@) { push(@@res, $first.$second) } else { push(@@res, $first); if ($second eq "." or $second eq "@@" or $second eq "=" or $second eq "+" or $second eq "*" or $second eq "-" or $second eq "/" or $second eq "+") { unshift @@_, $second; } else { push (@@res, $second); } } } return @@res; } @ Here we supply the tests as to whether or not to crush two tokens into one, based on the rules given in section 6 above. @= ($first eq "@@") or ($first eq "=" and $second eq "=") or ($second eq ">" and ( $first eq "=" or $first eq "-" )) or ($second eq "=" and ( $first eq "." or $first eq "-" or $first eq "+" or $first eq "*" or $first eq "/")) @{ Gee, it's like BNF, isn't it? @} @ The common code provides us with a hook to process the \at{w} statements that change the |%command| array. When we hit this hook, |$_| contains a line which has an \at{w} in it somewhere. This has {\bf not} been tokenised, (nor should it be) and has come from the change file. We will never, ever have to consider there possibility of an \at{w} statement being more than one line long, because it's declared not to be. Therefore, we can match the definition by, dare I say it, regular expressions. Urgh. Yuk. We don't actually need to do any processing of the format definitions at this stage, other than parsing the line and putting the command into the hash. @= ($predicate, $object) = /@@\w\s*(.*?)\s*==\s*(.*)\s*/; $command{$predicate}=$object; @ At the beginning of a module, we need to do a number of things. Firstly and most importantly, we have to increment the module counter. If it's starred, we need to put it in the table of contents array. @= if (thistoken() eq "@@\ ") { $modulenumber++ ; $context="doc" } if (thistoken() eq "@@\*") { print "*".++$modulenumber."*" ; $context="doc" ; skiptoken(); ($status,$title)=gimme("."); @{ Or at least, I think it should do. @} if ($status eq "EOF") { $errortoken=-1; @ print "! Part name unterminated.\n\n"; print "You should use a full stop to indicate the end\n"; print "of the name of a part. I'm taking everything to the end of the line\n"; print "to be the title.\n\n"; } push @@startable, [$modulenumber, $title]; @{ Ooh, complex data structures\dots @} } @ Once we've found a section name, what the hell do we do with it? First, if it's dotted, we should be able to disambiguate it. When we do this we will actually edit the merged source array. It also needs to have a number, if it's the definition of a section. But each module has its own number, assigned by the code above which checks for module beginnings, so we connect the code section number to a name. In both cases, we'll have a list |@@cs_names| which just contains all the names we've come across so far, for the purposes of disambiguation. We're also attempting to collect cross-reference and see-also information in this phase, so we want to construct two arrays. For each module number, we need to know what module, if any, the current section is a continuation of. (We can reverse that process later.) We also need to know where a section is referenced. This means when we come across a reference to a section number |X| in section |Y|, we need to add |Y| as an entry to the list stored in |@@references[X]|. @= if (thistoken() eq "@@"."<") { $name=""; skiptoken(); $name = substr(careful_gimme("@@".">"),0,-2); $name =tidyname($name); if ($name=~ /\.\.\.\s*$/) { @ } else { push(@@cs_names, $name) unless (scalar grep(/\Q$name\E/, @@cs_names) > 0); } careful_skip() if (thistoken() eq "+"); if(thistoken() eq "=") { @ } else { @ } } @ In order to disambiguate the name, we need to refer to an array of previously seen names, which we've called |@@cs_names|. (This is where getting your data structures planned first is a good idea.) The merged code hack to fix the code for when we go through to the output phase is fun, though. By the way, the reason I'm not using dotted names in this code is because the current version of \pt{} doesn't properly support them and I'm too busy with \pw{} to rebootstrap it at the moment. @= $newname=cname($name); @@merged[$newsourceline-1]=~s/\Q$name\E/$newname/g; @{ Should do the trick @} $name=$newname; @ When we get to unnamed code sections, joy of joys, we don't need to do that much. @= if (thistoken() eq "@@"."p") { $context="perl" } @ When we hit a module definition, what must we do? Well, first, we must decide whether or not we're a continuation of a previous module, and the way to do that is to see if the name's been defined before. If so, we put reciprocal entries into the |@@seealso| array. If not, we set |@@modules[$modulenumber]=$name|. We also have a lookup hash |%whereis| for ease of reference. @= if ($context eq "perl") { @ print "! Huh? I thought we were already talking Perl?\n\n"; } $context="perl"; $j=0; for $i (0..$modulenumber) { if (tidyname($name) eq tidyname(@@modules[$i])) { $j=$i; last; } } if ($j>0) { push (@@{$seealso[$modulenumber]},$j); push (@@{$seealso[$j]}, $modulenumber); @@displayedmodulenumber[$modulenumber]=$j; } else { @@modules[$modulenumber]=$name; $whereis{tidyname($name)}=$modulenumber; @@displayedmodulenumber[$modulenumber]=$modulenumber; } @ With a module reference, we need to update the `is used in' array, which we've called |@@references|. But---and this is where things get really interesting---the reference maybe to a module that's not defined yet, so we don't know what module number it is. So, for the moment the best we can do is to add the {\bf name} of the module (which has been fully disambiguated) to a list, and sort out the references array later. So, in module {\tt foo}, we come across the module {\tt bar}. We don't know module bar's module number yet, but we do know foo's. The aim is to get a list of module numbers that we put at the end of the beautification of {\tt bar} to say where {\tt bar} is used, and hence obviously {\tt foo}'s module number should be in this list. (Am I making this too simple?) At the moment, we need a list keyed in a hash to the string |"bar"|. I'm not going to explain it any more, because it turns into pseudocode very shortly\dots @= push (@@{$usedin{$name}}, $modulenumber); @ Now, of course, we need to sort this into the references array we wanted first time around. Wow. This is where are data structures really begin to smell bad. @= while (($key, $arrayref) = each %usedin) { if (! defined $whereis{tidyname($key)}) { @ print "You asked me to find $key, but I couldn't."; print "I'm going to carry on as if nothing has happened.\n"; } @@references[$whereis{tidyname($key)}]=$arrayref; } @* Indexing. Next, we've got to look at what goes in the index. There are the trivial cases, where the user specifies explicitly that something should appear in the index, that is, when we have control codes like \at{\^{}}, \at{.}, \at{:}. Similarly, \at{!} and \at{?} tell us implicitly that the next token(s) will be placed in the index. We also, however, have to look up and recognise variables, as explained in the Implementation Notes. Of course, now we've spent all this time thinking about it, it'll be easy to code. Right? To make matters easy, variables are exactly one token long. Now, the text after explicit index control codes will not contain any more web \at{whatever} sequences, so we can throw it verbatim into the index. @= if (thistoken() eq "@@"."^" or thistoken() eq "@@"."." or thistoken() eq "@@".":") { @{ Control string up until \at{>} goes into index @} @ } elsif (thistoken() eq "@@!" or thistoken() eq "@@?") { @ } elsif (thistoken() eq "sub" and $context eq "perl") { @ } elsif ($context eq "perl" and (thistoken() eq "\$" or thistoken() eq "\@@" or thistoken() eq "\#" or thistoken() eq "&")) { @ } @ Not much to be done here---just get the whole of the control text and index it. @= $type=thistoken(); careful_skip(); $indexstring=careful_gimme("@@>"); @ @ Find a token that looks like we want to index it, and do so. @= $type=thistoken(); careful_skip(); while (thistoken() eq "\$" or thistoken() eq "\@@" or thistoken() eq "\%" or thistoken() eq "\&" or thistoken() eq "\#") { careful_skip() } $indexstring=thistoken(); while (lookahead(1) eq "_") { $indexstring.= gettoken().gettoken(); } @ @ The names of functions are one token long, too, and don't contain any non-word characters. @= careful_skip(); while (thistoken() =~ /\s/) { careful_skip(); } $type="@@"."!"; $indexstring=thistoken(); while (lookahead(1) eq "_") { $indexstring.= gettoken().gettoken(); } @ @ A function is preceded by an ampersand and consists only of word characters starting with an alphabetic character. (You can't, for instance, say |&32| unless you mean something to do with bitwise and. How Perl disambiguates |&foo()| and |32&foo()| is beyond me, however.) If it's not preceded by an ampersand, we're not going to pick it up. Sorry. Indexing a variable is essentially the same as indexing a function, since we've got a character that could possibly be a variable start symbol, and if it's followed by an alphanumeric token, we index it. @= careful_skip(); if (thistoken() =~ /^[a-zA-Z]/) { $type="@@"."^"; $indexstring=thistoken(); while (lookahead(1) eq "_") { $indexstring.= gettoken().gettoken(); } @ } else { if ($tokencount>1) { $tokencount--; } else { @{ Sorry, folks. Can't implement this yet. @} } } @ Final part of the indexing section is actually to add the indexing function. Well, the index formatting work will be done later, but right now we want to put the type and name information into the index hash. @= push @@{$index{$indexstring}}, [$modulenumber, $type]; @{ Aaaaargh! @} @ Note that the section for formatting the index will silently drop one-character variables and functions. @* Phase two. Now we've got all the data we need in order to format the file effectively. Now we start the clever stuff. It's a vaguely state-driven switch, again, and a cool little translation engine when in Perl mode. @= $state="preamble"; $module_count=0; print "\nWriting... ("; $lineno=0; @ foreach (@@source) { $bal=0; @ print $_; while (moretokens()) { if ($state eq "preamble") { @ } elsif ($state eq "doc") { @ flush(); } elsif ($state eq "perl") { @ } else { die "Can't happen."; } } flush(); } flush(); @ At the beginning of each line we read in the next line and tokenise it. @= $_=$merged[$lineno++]; $sourceline=$lineno; print "\n*D* $_" if $debug>1; @ @ $open_this_line=0; @ This is how we squash the spaces, by the way. @= @@tokenlist=squish(@@tokenlist); sub squish { my @@res; my $last; while (defined ($_=shift)) { push @@res, $_ unless ($last =~ /\s/ and /\s/); $last = $_; } return @@res; } @ Before anything else happens, we open up the file for writing and write our preamble on it. @= @ print OUTPUT $command{"preamble"}; @ A hook for common code: @= $output=$input; $output=~ s/(.*)\..*/\1/; $output.=".tex"; @* Auxilliary functions. General stuff to help things work smoothly. @p sub clear_buffer{ $out_buff=""; } sub output{ $out_buff.=shift; } sub flush { indent() if $state eq "perl"; print OUTPUT $out_buff; print OUTPUT " ".$command{"newline"} if $state eq "perl" and $out_buff=~/\S/; clear_buffer(); } @ The indentation stuff gives me a real headache, so I'm going to have to move it to a seperate section @p sub indent { print "*D*". " "x(8*$this_indent).@@tokenlist if $debug>1; print OUTPUT $command{"tab"}x$this_indent; $this_indent+=$open_this_line; } sub outward { $open_this_line++ unless $open_this_line<0; } sub inward { $dirty=0; if ($open_this_line <1) { $this_indent--; $open_this_line=1; $dirty=1; } $open_this_line-- unless $open_this_line<0; die "Brain failure." if $dirty and $open_this_line!=0; } @* Preamble state. We should be able to wrap this one up pretty quickly; All we do is copy text verbatim (apart from control codes) until we hit an \at{*}. @= if (lookahead(1) =~ /@@\*/) { print OUTPUT $command{"begin"}; @ @ } else { output(gettoken());} @* Module definition state. When we're here, we're in the start of a module, and the current token is the control sequence that started us. If it's an unstarred module, we just write a module header and set the state to documentation. If it's starred, we need to read in the module name and translate it to readable format---that is, translate anything between vertical bars. For this, we're going to do something clever, and turn the Perl-into-documentation section into a function, taking a token and returning a translated version of that token. This is something we'll deal with later. @= $type = substr(gettoken(),1,1); $module_count++; if ($type eq ' ') { $newmodulecommand="unstarred"; } elsif ($type eq '*') { print "$module_count*"; @ $newmodulecommand="starred"; } else { print "! This can't happen!\n I'm looking for a module type, but it's"; print "neither starred nor unstarred.\n Can't cope. Sorry.\n"; exit; } $labelcommand=fillin("label",$module_count); output($labelcommand); die "! Mandatory command $newmodulecommand undefined\n This probably means the format is broken.\n" unless defined($command{$newmodulecommand}); output("\n".reformat($command{$newmodulecommand},$newmodulecommand)); $state='doc'; @ This is a specialised form of |gimme()|---we don't use the state stack because we want this returned as a string, not directly output to the buffers. @= ($foo,$starredname)=specialgimme("."); print "\nFound a starred module $starredname\n" if $debug>0; sub specialgimme { $return=''; $whywefinished="EOF"; $perlmode=0; $delimiter=shift; while (moretokens()) { $thisone=gettoken(); if ($thisone eq $delimiter) { $whywefinished="FOUND"; last } elsif ($thisone eq "|") { #skiptoken(); $olddelim=$delimiter; $return.= $command{"begincode"}; $return.= barparse(); $return.= $command{"endcode"}; $delimiter=$olddelim; } else { $return.=$thisone; } } return ($whywefinished,$return); } @ This one might come in useful, I don't know. Basically, it returns everything up until the next parallel bar, but translated into formatted Perl-documentation. @p sub barparse { $toparse=careful_gimme("|"); chop($toparse); $returnee=""; foreach(split //, $toparse) { $returnee.=trans($_); } return $returnee; } @* Documentation state. This is (I think) easy. All we need do is watch out for things in parallel bars, and push Perl mode onto the stack if they're seen, and deal with \at{}-commands. @= if (thistoken() eq "|") { skiptoken(); output($command{"begincode"}.barparse().$command{"endcode"}); } elsif (substr(thistoken(),0,1) eq "@@") { @ } else { output(gettoken()); } @ Here we've hit an \at{}-command; this should be a generic routine, and the individual parsing sections will check if they're in an appropriate mode to do the Right Thing. %' @= while (substr(thistoken(),0,1) eq "@@") { $toktype=substr(thistoken(),1,1); print "\n Command $toktype\n" if $debug>1; if ($toktype eq "@@") { output("@@"); skiptoken(); } elsif ($toktype =~ /\s/ or $toktype eq "*") { @ @ } elsif ($toktype eq "d") { @ } elsif ($toktype eq "p") { skiptoken(); @ } elsif ($toktype eq "f") { @ } elsif ($toktype eq "<") { @
} elsif ($toktype eq "'") { @ } elsif ($toktype eq "\"") { @ } elsif ($toktype eq "{") { @ } elsif ($toktype eq "t") { @ } elsif ($toktype eq "&") { output(tokenformat(thistoken())); skiptoken(); } elsif ($toktype eq "1") { $debug=1; skiptoken(); } elsif ($toktype eq "2") { $debug=2; skiptoken(); } else { @ print "! I don't understand this command sequence: "; print thistoken()."\n"; skiptoken(); } } @* Perl State. OK. Here goes. Actually dealing with being in the Perl state is trivial; we examine a token---if it's a command, we pass it to the command handler above. Otherwise, we throw it at the generic translation routine. @= if (substr(thistoken(),0,1) eq "@@") { @ } elsif (thistoken() eq "\\") { output(trans(gettoken()).trans(gettoken())) } else { @ @ @ output(trans(gettoken())); } @ Likewise, the token translator's not too bad, because most of the work takes place within the format and command array, which we've previously defined. @p sub trans { $_=shift; if (defined($tokenformat{$_})) { return reformat($tokenformat{$_},$_) } else { return $_ } } @ Now for something that fills in the parameters. Let the data structures do the work! @p sub reformat { ($template, $parameter) = (shift, shift); $template =~ s/#modno/$module_count/g; $template =~ s/#partname/$starredname/g; $template =~ s/#/$parameter/g; return $template; } @ Wow. That was it? No, we have to deal with indentation and string parsing. Nothing interesting happens in strings, but I'm going to lump in back-quote stuff as well. @= if (thistoken() eq "\"" or thistoken() eq "\`") { print "Parsing a Perl String\n" if $debug>1; $type=thistoken(); output($command{"opendq"}); output($command{"beginstring"}) if $type eq "\""; output($command{"begincommand"}) if $type eq "\`"; careful_skip(1); until (thistoken() eq $type) { print thistoken() if $debug>1; if (thistoken() eq "\\") { output(trans(gettoken()).trans(gettoken())) } else { output($command{"visiblespace"}) if thistoken()=~/\s/; output(trans(thistoken())) unless thistoken()=~/\s/; careful_skip(1); } } output($command{"endstring"}) if $type eq "\""; output($command{"endcommand"}) if $type eq "\`"; output($command{"closedq"}); careful_skip(1); print "Done\n" if $debug>1; } @ And also indentation. This should be easy, since it's the same as it was in Weave 2. @= if (thistoken() eq "{") { outward() } if (thistoken() eq "}") { inward() } # FALL THROUGH @* Format definitions. Format definitions appear to be one line long, starting with \at{f} and consisting of a token, an equivalence sign (==) and one token. This gets parsed and thrown into the token formatter thingy. After that we drop back into the previous state, which I hope was documentation mode. If it wasn't, we scream. @= if ($state ne "doc") { @ print "\n! Format command shouldn't be in ".state()." state\n"; } else { if (gettoken() ne "@@"."f") { @ die "! Can't happen\n"; } $toformat = gettoken(); if (gettoken() ne "==") { @ print "\n! Incorrectly formatted format definition.\n"; print "Needs to be @@"."f, token, ==, token\n\n"; } else { @ $tokenformat{$toformat}=$like; @ } } @ @= $like=gettoken(); if (defined($tokenformat{$like})) { @{ WEB-style formats @} $like = $tokenformat{$like}; } elsif (defined($command{$like})) { @{ ``Direct'' formatting @} $like = reformat($command{$like},$like); } else { print "\nWarning: reformatting $toformat as $like.\n"; } @ @= output(reformat($command{"rw"},"format")." "); output(trans($toformat)." "); output(trans("==")." "); output(trans($like)); @* Macro definitions. A macro definition starts with an \at{d} and ends with, (and here I dig out my Knuth, only to find he doesn't specify\dots) I suppose, with any control code. It's formatted, more or less, as Perl. @= if ($state ne "doc") { @ print "\n! Macro definition shouldn't be in ".state()." state\n"; } else { if (gettoken() ne "@@"."d") { @ die "\n! Can't happen\n"; } while (($lhs = gettoken())=~/\s/) {} while (($type=gettoken())=~/\s/) {} isamacro($lhs); if ($type eq "(") { @ } elsif ($type eq "=" or "==") { @ @{ Yes, I know. @} output(trans($type)); } else { @ print "! That's no macro definition!\n"; print "You said $lhs $type, which makes no sense. Not to me, anyway.\n"; } @ } @ Hey, we could use these routines in the tangler\dots @= output(reformat($command{"macro"})." "); output(trans($lhs)." ".trans("(")); until ($mytoken=gettoken() eq ")") { @ @ output(trans("#")); $label=gettoken(); @ skiptoken(); output(trans(thistoken())); } @ @= unless (moretokens()) { @ print "! Macro definition not finished?\n"; print "End of line before close brackets - why?\n"; last; @{ And that doesn't count as a |goto|, all right? @} } @ @= unless ($mytoken eq "#") { @ print "! Something's wrong with your macro\n"; print "I'm expecting to see an #, but I see $mytoken,"; print " and I don't know why.\n"; last; } @ @= unless (nexttoken() eq "," or nexttoken() eq ")") { @ print "! This is not TeX!\n"; print "Parameters are one token long separated by commas.\n"; print "I'm going to be confused now. You might want to stop.\n"; last; } @ An ``ordinary'', that is, non-parametric macro should just consist of a one-token label. This is easy enough to format. @= output($command{"macro"}." "); output(trans($lhs)." "); @ The RHS of the macro is ordinary Perl, which stops with pretty much all control sequences. @= until (substr(thistoken(),0,1) eq "@@") { @ @ output(trans(gettoken())); careful_skip(0) unless moretokens(); @{ End of line hack @} } @* Inane stuff. Right, I'm fed up of the serious coding, and I want to get on to the bells and whistles. I'll come back to the sensible bit later, but first, this: @= skiptoken(); output(fillin("hex",gettoken())); @ And, of course, everyone who's anyone uses octal\dots @= skiptoken(); output(fillin("octal",gettoken())); @ Another simple one. @= $texbox=careful_gimme("@@>"); chop($texbox); chop($texbox); output(fillin("hbox",$texbox)); careful_skip(1); @* Section State. Back to work. We have hit this state if the current token is an \at{<}; This could be either a section reference or a section definition; we don't particularly care. The important thing, as far as we're concerned, is to get it formatted as documentation---or rather, as a section heading. Then depending on what state we were in before, we either send it back whence it came, or we write out beginning-of-section stuff, and pass it off to Perl state. @
= skiptoken(); $tempname=specialcareful_gimme("@@".">"); $modname=$tempname; if ($state eq "doc") { @ } elsif ($state eq "perl") { @ } else { @ print "! I'm extremely confused.\n"; print "You appear to have made a reference to a module\n"; print "($modname) where I didn't expect one. I'm going\n"; print "to pretend I didn't see it.\n"; } @ This is a special version of |careful_gimme()| because section names may well be more than one line long, if you're Don Knuth. @p sub specialcareful_gimme { $delimiter=shift; ($status, $rv) = specialgimme($delimiter); if ($status eq "EOF") { while ($status eq "EOF" and $finished==0) { careful_skip(0); @{ End-of-line hack @} ($status, $newrv)=specialgimme($delimiter); $rv.=$newrv; } } return($rv); } @ I find the |unless| \dots |else| construction extremely amusing, for some strange reason. Maybe I should stop coding now. @= unless (gettoken() eq "=") { @ @ print "! You can't say that there.\n"; print "It seems you tried to reference a module in documentation state.\n"; print "I'm going to pass over this and pretend I didn't see it\n"; } else { print "*D* Defining.\n" if $debug >1; $cscommand = $command{"csref"}; $cscommand =~ s/#csname/$modname/g; $real_no=$displayedmodulenumber[$module_count]; $cscommand =~ s/#modno/$real_no/g; output($cscommand); output( ($seealso[$module_count]->[0] and $seealso[$module_count]->[0] < $module_count) ? $command{"moredef"} : $command {"modequals"}); output(fillin("label","cs-".$module_count)); @ } @ It doesn't actually matter too much if the |$whereis| element is undefined; this will have been picked on before, and show we just blank it out. %' @= $thisno= $whereis{tidyname($modname)}; $thisno= "" unless $thisno >0; $cscommand = $command{"csref"}; $cscommand =~ s/#csname/$modname/g; $cscommand =~ s/#modno/$thisno/g; output($cscommand); @* Top and tail. Two sections which introduce and leave perl state, which print out all the see-also and referenced-by information. No surprises in the first. @= $open_this_line=0; $this_indent=0; $state="perl"; output($command{"begincode"}); $began=1; @ The second is tricky, though. @= output($command{"endcode"}) if $began; $began=0; @ @ @ Lastly, a few things before we quit. @= if ($state eq "perl") { @ } @ @ output($command{"end"}); flush(); close(OUTPUT); print ")\n\n"; @* Format a comment. This is relatively trivial. @= skiptoken(); $comment=specialcareful_gimme("@@".">"); chop($comment); chop($comment); chop($comment); print "*D* Comment - $comment\n" if $debug>1; output(fillin("comment",$comment)); @* Build in debugger. Slot this in wherever you need it. @= if ($debug>2) { $action=""; until ($action =~/C/i) { print "\n--- DEBUG ---\n"; foreach (@@tokenlist) { print $_} print "Last token: <".lasttoken()."> "; print "This token: <".thistoken()."> "; print "Next token: <".nexttoken().">\n"; print "Moretokens? ".moretokens()."\n"; print "Action: (F)orward, (B)ackward, (C)ontinue, (Q)uit\n"; $action=<>; die "Quitting at user's request\n" if $action =~/Q/; $debug=1 if $action=~/1/; $tokencount++ if $action =~/F/; $tokencount-- if $action =~/B/; } } @* Tables, References and Index. Here we generate the table of contents, the table of code sections and so on. @ We write the TOC by looking at the |@@startable| array we set up earlier, which I've now forgotten the structure of. @
= print OUTPUT $command{"tocheader"}; foreach (@@startable) { ($moduleno, $title) = ($_->[0],$_->[1]); print OUTPUT $title.$command{"dotfill"}.fillin("ref",$moduleno).$command{"newline"}; } print OUTPUT $command{"ctname"}.$command{"dotfill"}; print OUTPUT fillin("ref","Cod").$command{"newline"}; print OUTPUT $command{"indexname"}.$command{"dotfill"}; print OUTPUT fillin("ref","Ind").$command{"newline"}; @ Next we've got to look at the table of code section names. This will be stored in some nice neat array somewhere. @= output($command{"codetable"}); output(fillin("label","Cod")); foreach (sort @@cs_names) { $number = $whereis{tidyname($_)}; next if @@displayedmodulenumber[$number] != $number; next unless $number; @{ This is positively evil @} @ ($foo,$_)=specialgimme("@@>"); output($number." ".$_); output($command{"dotfill"}.fillin("ref","cs-".$number)); output($command{"newline"}); } @ Next the hyperlink information. Firstly, modules which reference us. This is stored in an array format, in an array keyed to module number. It's groovy data-structure time again. @= if (@@displayedmodulenumber[$module_count] == $module_count) { @@myrefs=@@{$references[$module_count]}; if ($#myrefs>-1) { print "*D* Outputting refs ".join(",",@@myrefs)." for $module_count\n" if $debug>0; output(fillin("modref",join(", ",@@myrefs))); } } @ @= @@myseealso=@@{$seealso[$module_count]}; if ($seealso[$module_count]->[0]) { output(fillin("saref",join(", ",@@myseealso))); } @ Index. Yes. I've been trying to avoid this, haven't I? @= @