use strict;
made me a perl5-porteruse strict;
We do have a process that extracts all data from a database related to one single person repeatedly and spits it out in an intermediate format easy to parse by secondary processes like formatters.
This format looks like:
cc|eeee|data
where cc
stands for the category
data
belongs to identified by element type (number)
eeee
. Element codes can occur in multiple categories
(like a surname or a date of birth) and can occur more than once in a
single category (like free notes). It might look
like:
00|0010|Duckstad 01|0100|14319 01|0110|4896319807 01|0120|43645148 01|0210|Donald 01|0240|Duck 01|0310|19890201 01|0320|12 01|0321|Gansdorp 01|0410|M 01|0420|O 01|0510|1 01|0511|Nederlandse 01|1020|392 01|1021|Duckstad 01|1111|Cornelis Prulpark 01|1120|3 01|1160|2015EH 01|1310|Donald@Duck.nl 01|2010|51 : 99|9999|0
A persons collection of data is ended with element 9999 in category 99.
We do have a formatter that came with the database
(RPT
)
We want a human readable list of the relevant data, formatted something like:
Gemeente Duckstad Blad: 1 Leerlingenlijst met NAW Datum: 3 jun 2001 , 12:25 =============================================================================== 12816 Duck, Kwik M 01-02-1989 Cornelis Prulpark 3 4775.08.118 2015 EH Duckstad 3001 Duck, Kwek M 01-02-1989 Cornelis Prulpark 3 0436.45.148 2015 EH Duckstad 30456 Duck, Kwak M 01-02-1989 Cornelis Prulpark 3 8256.01.987 2015 EH Duckstad 816 Lizzy V 14-09-1993 Gansdorpboulevard 45 A 2938.91.886 2148 AW Gansdorp
Since our RPT scripts outdated our knowledge of perl, we got a rather huge base of those scripts and, as you use what you know, it looked like like:
print e180011 col 1, e180025 col 37 + incr_n, e180041 col 64 + incr_n + incr_v, e180050 col 75 + incr_n + incr_v print e180152 col 5, e180212 col 46 if e180261 ^= '' then begin if e180311 ^= '' then set e180263 to e180263 /+ ', ' + e180311 print e180142 col 5, e180263 col 12, e180270 col 64 end end header print 'GEMEENTE' col 1, e100010 col 11, 'Blad: ' col 63, pageno -1 col 69 using '#,##&' print 'Leerlingen' col 1, 'Datum:' col 49, D_sys col 56, hour col 69 print 75[=] col 1 print 'Achternaam' col 1, 'Voornaam + initalen' col 37, 'Geb dat' col 64, 'Gesl' col 72 print 'Adres' col 5, 'Postcode + Plaats' col 46 print 'SrtOnd' col 5, 'School' col 12, 'Klas' col 53 print 75[-] col 1 end
But it needs some kind of initialization:
sorted input cat [numeric 2], type [numeric 4], data [string 255] before report set d_sys to today set d_sys_d to mdy (d_sys, 1) set d_sys_m to mdy (d_sys, 0) set d_sys_y to mdy (d_sys, 2) set d_sys_ms to index (d_sys_m, ' ', 'januari', 'februari', 'maart', 'april', 'mei', 'juni', 'juli', 'augustus', 'september', 'oktober', 'november', 'december') set D_sys to Sprintf ('%02d-%02d-%04d', d_sys_d, d_sys_m, d_sys_y) set D_sys_s to Sprintf ('%02d %s %04d', d_sys_d, d_sys_ms, d_sys_y) set lpseqno to 1 set lp_new to 1 if cat = 10 and type = 0010 then set e100010 to data /+ '' else set e100010 to '' before cat if lp_new = 1 then begin set lp_new to 0 set c18 to 0 set c15 to 0 : set c35 to 0 set i to ar_close ('a18') set e180000 to ' ' set e180010 to ' ' set e180011 to ' ' set e180020 to ' ' set e180021 to ' ' set e180030 to ' ' set e180031 to ' ' set e180040 to ' ' set e180041 to ' ' : set e350092 to ' ' set e350093 to ' ' set e350094 to ' ' set e359999 to ' ' end if cat = 99 then set lp_new to 1 after data set Data to data /+ '' if cat = 18 then begin if c18 = 0 then begin if type = 0000 then set e180000 to Data else if type = 0010 then set e180010 to flip_str (Data, flip) else if type = 0011 then set e180011 to flip_str (Data, flip) else if type = 0020 then set e180020 to Data else if type = 0021 then set e180021 to Data else if type = 0030 then set e180030 to Data : [[ many many lines later ]] : else if type = 0070 then set i to ar_write ('a22', c22, n220070, Data) else if type = 9999 then set i to ar_write ('a22', c22, n229999, Data) end if type = 9999 then set c22 to c22 + 1 end if cat = 32 then begin if c32 = 0 then begin if type = 0010 then set e320010 to Data else if type = 7010 then set e327010 to flip_str (Data, flip) : [[ many lines later ]] : else if type = 0091 then set i to ar_write ('a35', c35, n350091, Data) if type = 0092 then set i to ar_write ('a35', c35, n350092, Data) else if type = 0093 then set i to ar_write ('a35', c35, n350093, Data) else if type = 0094 then set i to ar_write ('a35', c35, n350094, Data) else if type = 9999 then set i to ar_write ('a35', c35, n359999, Data) end if type = 9999 then set c35 to c35 + 1 end after cat if cat = 99 then begin
Hmm, I hope you'll see the maintainability factor of such code is extremely low :-(
But since this code is the same for all reports, it's stored in a seperate file, which is prepended to the format part on a "need-now" basis, which enables easy to maintain the different format scripts as such.
It does what we want without flaws, so what's the problem?
Maintainability! It's gotten out of hand. And we want more! More features, more speed, and - maybe most important - more maintainability.
By that time, perl5 just hit the street, and I played with perl4 for some time, also building a perl interface to our (so far unsupported) database into perl4.
Reading the advantages of perl5 over perl4, the symbolic references hit me as extremely useful, because conversion would be easy as pie. The format part would become something like:
format STDOUT = @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<< @<<< $e180010, $e180020, $e180030,$e180040 @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $e180152, $e180212 ~ @<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<< $e180142,$e180263, $e180270 . format STDOUT_TOP = Gemeente @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Blad: @>>>>>> $s_gem, $% Leerlingen Datum: @<<<<<<<<< @<<<< $D_sys, $T_sys =============================================================================== Achternaam Voornaam + initialen Geb dat Gesl Adres Postcode + Plaats SrtOnd School Klas ------------------------------------------------------------------------------- .
which makes it already quite a lot easier to see what the user expects as
output, and all the variables are the same, except for that funny leading
dollarsign in front.
The initialization part would become real easy
(shortened):
while (<>) { chomp; push @elp, $_; m/^99/ || next; # Convert the expanded LP-list to elements known in the format(s) foreach $lp (@elp) { ($cat, $type, $data) = split m/\|/, $lp, 3; $data =~ s/\s+$//; $ect = "$cat$type"; $e = "e$ect"; $$e = $data; } # format this entity write; reset "e"; }
In fact, the first version of the perl equivalent of the 1350 line RPT script, were reduced to a mere 150 line of code!
use strict;
As the symbolic reference solution looked like a good solution when I
started, given the time saved by not having to go through all those report
scripts doing more then just prefix all variables with a dollar sign (and
keep the variables readable), I learned over time that use
strict;
was not a bad idea.
So I started to use lexicals, and convert the data structure from
$e010110
to $e[1]{110}
, something some of you,
if not most of you, would have thought of when faced with the problem in
the beginning.
From the programmers eye, this is no problem, but more simple users have less problems recognizing $e010240 as being the surname of the person is much easier than recognizing $e[1]{240}, but - since most of my user base didn't change to formats anyway, I decided that to be a minor problem.
Now the initialization becomes something like:
while (<>) { s/\s*$//; push @elp, $_; m/^99/ or next; # Convert the expanded LP-list to elements known in the format(s) foreach my $lp (@elp) { my ($cat, $type, $data) = split m/\|/, $lp, 3; $e[$cat]{$type} = $data; } # format this entity write; @e = (); }
... much cleaner and loosing the reset
, also less error-prone.
The format script would become something like:
format TOP = Gemeente @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Blad: @>>>>>> $s_gem, $% Leerlingen Datum: @<<<<<<<<< @<<<< $D_sys, $T_sys =============================================================================== Achternaam Voornaam + initialen Geb dat Gesl Adres Postcode + Plaats SrtOnd School Klas ------------------------------------------------------------------------------- . format A01 = @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<< @>>>>>>>>>>>>>> $e[1]{241}, $e[1]{214}, "$e[1]{311} $e[1]{411}" $line1 @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $e[1]{1100}, $e[1]{1025} ~ @<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<< $e[1]{2021},$e[1]{2012}, $e[1]{2040} .
But, but, but .... the code just presented doesn't compile! It does not even spit out an appropriate error message or some useful diagnostics. WHY? What is happening here?
Faced with this problem, I had to reduce the problem to a minimum, something I learned in the past. So I came up with a test case that I could post to perlbug@perl.org
#!/usr/bin/perl -w # use strict; # use diagnostics; my $e010101 = "Value 1"; my @e = (0, { 101 => $e010101 }); format OLD0 = @<<<<<<<<<< $e010101 . format NEW0 = @<<<<<<<<<< $e[1]{101} . format NEW0 = @<<<<<<<<<< "$e[1]{101}" .
No answers, no patches, no nothing. I was left on my own. What next?
Back to symbolic references | |||
Not use strict; Very error prone |
- | + |
The easiest solution (it still works) End-user friendly |
Use simple scalars | |||
Too much work Too vulnarable to maintainance |
- | + | End-user friendly |
Use LoL (list of lists) instead of LoH (list of hashes) | |||
Second level lists are filled too sparse Unable to add elements with letters |
- | + | Easy to implement |
Solve the bug | |||
Learning curve Time shift: will I regain the time spent? |
- | + | Geared for the future |
After digging through the available resources:
I still found no suitable place to start, because there's not much reference material on formats
There's a lot of discussions on perl5-porters varying from thread implementations to unicode handling and from string interpolation to regular expression irregularities, but format's are not discussed, so the mailing list archives are not a good resource. So I had to do some research myself.
Several mails to the mailing list asking for a place to start the research led to 2 pointers, both proved to be very wrong, because they assumed the flaw was in the formatting code.
Digging. Learning how to use gdb, which - on a HP-UX platform - proved to be less than easy. And though perlguts turned out to be a nice manual for the parts that did not blow up in my face, I had to track down what was happining, and try to follow the flow that is used when parsing formats.
After turning the formatting inside out and upside dow, not finding
anything unusual happening, I just added printf
statements
haphazardly throughout the perl source code on points where I did expect
the flow to pass by, and printed out some global variables. At this point
intuition turns out to be more valuable than knowledge, cause I almost hit
the right globals from the start.
Realizing a not very well known fact that the variable declaration within a form definition can span multiple lines if the starting line of those variable start with a open brace, I was intuitively looking for places that dealt with that code.
It turned out to be the closing brace }
that caused a
reset of a global brace counter when not aware of being inside a format
parsing. I then did a decrement, instead of a reset, to check if I was
right, and the format part now compiled, but other code parts did not.
So the solution was at hand: the trouble spot was detected, but I had to find a way to detect (or rather know) that at that point I was inside a format and, if so, decrement, otherwise reset.
OK, the problem is in toke.c on line 3175 where the detected closeing brace '}' also ends the current scope instead of just decrementing the in-brace count, ...
The patch to correct this looked like:
--- toke.c.org Tue May 29 16:04:29 2001 +++ toke.c Tue May 29 16:04:29 2001 @@ -3172,7 +3172,7 @@ yyerror("Unmatched right curly bracket"); else PL_expect = (expectation)PL_lex_brackstack[--PL_lex_brackets]; - if (PL_lex_brackets < PL_lex_formbrack) + if (PL_lex_brackets < PL_lex_formbrack && PL_lex_state != LEX_INTERPNORMAL) PL_lex_formbrack = 0; if (PL_lex_state == LEX_INTERPNORMAL) { if (PL_lex_brackets == 0) { End of Patch.
But I still have to get this into the perl core, so I don't have to do this time after time again. I learned from the archives that a patch that's accompanied by a test case has a better chance to survive:
--- perl/t/op/write.t.org Tue May 29 16:04:29 2001 +++ perl/t/op/write.t Tue May 29 16:04:29 2001 @@ -1,6 +1,6 @@ #!./perl -print "1..8\n"; +print "1..9\n"; my $CAT = ($^O eq 'MSWin32') ? 'type' : 'cat'; @@ -201,3 +201,19 @@ $that = 8; write LEX; } +# LEX_INTERPNORMAL test +my %e = ( a => 1 ); +format OUT4 = +@<<<<<< +"$e{a}" +. +open OUT4, ">Op_write.tmp" or die "Can't create Op_write.tmp"; +write (OUT4); +close OUT4; +if (`$CAT Op_write.tmp` eq "1\n") { + print "ok 9\n"; + unlink "Op_write.tmp"; + } +else { + print "not ok 9\n"; + } End of Patch.
Now that the bug is fixed, and the patch is accepted (as patch 6340), I wanted to extend the knowledge I picked up along the way. Realizing that at the moment it is not possible to detect if a format is defined at all, because globs cannot deal with it's FORMAT entry. Making glob's do what they should do for formats too, i.e. make *TOP{FORMAT} be as valid as *TOP{SCALAR}, *TOP{HASH} and all known other glob entries.
I was talking about this with Jan-Pieter Cornet on our way to TPC4, and we decided to give it a shot, and hacked towards it in the plane. We did the right patch, but made a wrong test for it, so we didn't trust ourselves in submitting it. Graham Barr helped us with the test
--- perl/pp.c.org Wed Jul 26 11:47:23 2000 +++ perl/pp.c Wed Jul 26 11:47:23 2000 @@ -599,6 +599,9 @@ case 'F': if (strEQ(elem, "FILEHANDLE")) /* XXX deprecate in 5.005 */ tmpRef = (SV*)GvIOp(gv); + else + if (strEQ(elem, "FORMAT")) + tmpRef = (SV*)GvFORM(gv); break; case 'G': if (strEQ(elem, "GLOB")) --- perl/sv.c.org Wed Jul 26 11:47:23 2000 +++ perl/sv.c Wed Jul 26 11:47:23 2000 @@ -2831,6 +2831,13 @@ dref = (SV*)GvIOp(dstr); GvIOp(dstr) = (IO*)sref; break; + case SVt_PVFM: + if (intro) + SAVESPTR(GvFORM(dstr)); + else + dref = (SV*)GvFORM(dstr); + GvFORM(dstr) = (CV*)sref; + break; default: if (intro) SAVESPTR(GvSV(dstr)); End of Patch.
Ahh, and let not forget that test, so the patch might be accepted again:
--- perl/t/op/gv.t.org Wed Jul 26 11:47:23 2000 +++ perl/t/op/gv.t Wed Jul 26 11:47:23 2000 @@ -11,7 +11,7 @@ use warnings; -print "1..40\n"; +print "1..41\n"; # type coersion on assignment $foo = 'foo'; @@ -97,15 +97,19 @@ %x = ("ok 19" => "\n"); sub x { "ok 20\n" } print ${*x{SCALAR}}, @{*x{ARRAY}}, %{*x{HASH}}, &{*x{CODE}}; +format x = +ok 21 +. +print ref *x{FORMAT} eq "FORMAT" ? "ok 21\n" : "not ok 21\n"; *x = *STDOUT; -print *{*x{GLOB}} eq "*main::STDOUT" ? "ok 21\n" : "not ok 21\n"; -print {*x{IO}} "ok 22\n"; -print {*x{FILEHANDLE}} "ok 23\n"; +print *{*x{GLOB}} eq "*main::STDOUT" ? "ok 22\n" : "not ok 22\n"; +print {*x{IO}} "ok 23\n"; +print {*x{FILEHANDLE}} "ok 24\n"; # test if defined() doesn't create any new symbols { - my $test = 23; + my $test = 24; my $a = "SYM000"; print "not " if defined *{$a}; @@ -131,7 +135,7 @@ # although it *should* if you're talking about magicals { - my $test = 29; + my $test = 30; my $a = "]"; print "not " unless defined ${$a}; @@ -173,4 +177,4 @@ } __END__ -ok 40 +ok 41 End of Patch.
This patch also made it through to the core, but was rejected by Gurusamy Sarathy for the 5.6.0 and 5.6.1 track. His motivations were not convincing for me, and some discussions with Gurusamy and Jarkko led to acceptation in the 5.7 development track (where it has always been) and eventually the patch is accepted also for the 5.8.0 release of Perl.
By extending the test suite for the write
functionality, in
order to get the patches also accepted for 5.8.0, I hit a new
problem in a situation where I tried to nest
(yes I know that's sick) formats:
#!/usr/bin/perl -w use strict; format HEADER = HEADER . sub HEADER () { local $~ = "HEADER"; my $HEADER; open HEADER, ">", \$HEADER; select HEADER; write; close HEADER; select STDOUT; $HEADER; } # HEADER format STDOUT_TOP = TOP @<<<<< @{[HEADER]} . format STDOUT = STDOUT @<<<< $% . write;
which should produce:
TOP HEADER STDOUT 1
but instead produces:
page overflow at format2_pl line 19. STDOUT 0 TOP HEADER
I did not have enough time to find out what exactly is going on, but I did find out that it needs some serious hacking
Now that I got the hack of it, I seriously considered to put some more effort in the perl5-porters movement. I was not scared away by rumours of the list being very unfriendly and flame wars being more common than found on other lists.
First thing to do is be a good listener. Don't join every discussion you see flying by, and only respond if you know what you're talking about. If you are sure, don't be afraid to be scared away, but always use funded arguments to explain what you mean. Don't be surprised if it turns out that the old members realy listen to you and see you as a serious participant in the development of Perl.
I wanted to join perl5-porters to take over Ilya's position. Hmmmm. So much for aiming too high.
There are several reasons why this did not happen.
What happened is that the list somehow noticed my knowledge of shell and awk programming (the ol' unix style commands) and that I wasn't scared of the complexity of Perl's configuration scheme (the Configure script).
Eventually you'll be asked to help out in the (sub)field(s) where you've proven to have knowledge of or insight in.
Now that Perl6 has taken a serious start, don't think perl5 doesn't need any more attention. We do still need enthousiastic people to make perl the language we all want, not scaring away people, but chasing away the bugs. Start being a list-lurker tomorrow and join perl5-porters next month!
Look throught the TODO list to check if there is an unexplored field left alone so far that yearns for your attention.