(March 2013)
To see it in action, just watch the video below - in fullscreen 720p quality (click on the video window, then select the 720p version from the settings icon near the bottom-right, then click on the rightmost icon to make it fullscreen).
Over the last couple of months, I've been building a set of code generators. They work from an XML file - and after reading it, they generate... stuff.
Lots of stuff.
The reason I went with .xml/.xsd files this time - and didn't design my own domain-specific language - is a simple one: in this case, the resulting "language" and tools will be used by non-programmers. These people must therefore be able to work in something resembling an IDE - with auto-completion a mandatory requirement.
In combination with editors like Eclipse / Visual Studio, .xsd files cover this need quite well. As the analysts create the .xmls that are fed into my code generators, these monster IDEs guide them - showing what they are allowed to enter at each point in the .xml file, highlighting errors, etc.
If you write your own DSL, getting up to this point is a lot more difficult (you basically have to write your own IDE).
So all went well. I created my code generators, people started creating .xmls, and marvelous, working things came out of them.
Mostly.
You see, you can never trust your input. Ever.
I therefore had to bulk-validate the .xml files - and found the best, strictest checks to be performed by SAXCount, a part of the Xerces XML parser:
$ SAXCount -n -s -f *xml
Error at file /var/tmp/a.xml, line 4, char 23
Message: empty content is not valid for content model '(transferBatch|notification)'
Error at file /var/tmp/b.xml, line 8, char 33
...
I tried other validators, too - and SAXCount seemed to be the most robust one. It caught things that others didn't, so long as the file began with a reference to the .xsd:
<?xml version="1.0" encoding="utf-8" ?>
<Genesis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="Genesis.xsd">
    <Item ...>
    ...
Being a VIM guy, I wondered...
If only there was a way to easily navigate inside the errors of each .xml file, jumping immediately with the F4 function key from each error to the next... with the error info displayed at the bottom line of my editor.
Just as VIM does for C and C++, that is. And for Python (with Syntastic installed).
Moreover, while debugging, I had to quickly identify parts of the .xml files. I found the... misaligned aspect of element attributes to be anything but helpful:
<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged in:" pw="62:10" />
<Item param="I_MINID" label="Serial:" pw="2:10" />
<Item param="I_MAX_SID" label="Up to serial ID:" pw="62:10" ... />
<Item param="BD_MINPRICE" label="Price:" pw="2:30" />
Imagine debugging hundreds of such lines - rearranging the attributes would help immensely in visually locating what is where:
<Item param="STR_NAME_GTE"     label="Name from:"          pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged in:" pw="62:10" />
<Item param="I_MINID"          label="Serial:"             pw="2:10" />
<Item param="I_MAX_SID"        label="Up to serial ID:"    pw="62:10" ... />
<Item param="BD_MINPRICE"      label="Price:"              pw="2:30" />
So how does one go about implementing this functionality in VIM?
Spawning an external tool from within VIM is easy. However, I wanted much more than just that; I wanted the same functionality I have for :make (which I've mapped to the function key F7) - that is, errors shown in the error list window, and me navigating from one to the next with F4 (which I've mapped to :cnext).
So I created a saxcount folder under my .vim/bundle, and wrote the following two lines in my saxcount/ftplugin/xml.vim:
se errorformat=%E,%C%.%#Error\ at\ file\ %f%.\ line\ %l%.\ char\ %c,%C\ \ Message:\ %m,%Z,%-G%f:\ %*[0-9]\ ms\ %.%#
se makeprg=SAXCount\ -n\ -s\ -f\ %
How did I get there?
Well, the second line is easy: se makeprg=SAXCount\ -n\ -s\ -f\ % makes my F7 (mapped to :make) invoke SAXCount instead of make.
The magic errorformat line is another story :‑)
It is supposed to catch error messages like these:
$ SAXCount -n -s -f a.xml
Error at file /var/tmp/a.xml, line 4, char 23
Message: empty content is not valid for content model '(transferBatch|notification)'
... or Fatal errors, that similarly begin with "Fatal Error" instead of "Error":
Fatal Error at file ...
Breaking down the two rules of my errorformat, this is the first one ...
se errorformat=
    // Error report spans multiple lines (begins with %E, ends with %Z)
    %E,%C%.%#Error\ at\ file\ %f%.\ line\ %l%.\ char\ %c,%C\ \ Message:\ %m,%Z,
... which works as follows:
%E              // begin multiline match of an error report
,               // end of first line from SAXCount, which is always empty
%C              // continuation - next line
%.%#Error...    // which matches '.*Error...' - so it also catches "Fatal Error..."
%f%.            // filename, followed by any char - in this case, the comma,
                // I could not use '\,' so I just used a '%.'
%l and %c       // similarly, line and column number
%C              // continuation - next line
Message: %m     // matches the actual message for the copen list
%Z              // end multiline match
The second errorformat rule ignores (hence the minus in %-G) the informational lines emitted by SAXCount:
a.xml: 11 ms (64 elems, 207 attrs, 1133 spaces, 0 chars)
...via this:
%-G%f:\ %*[0-9]\ ms\ %.%# // basically: filename, colon, space, numbers, space, "ms", and ".*"
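To make the two rules concrete, here is a minimal Python sketch of the same matching logic. It is not how VIM implements errorformat: the function name parse_saxcount and the three regular expressions are illustrative assumptions, and the sample text is simply the SAXCount output quoted above.

```python
import re

# Rough Python equivalents of the errorformat pieces (assumptions, not VIM internals)
ERROR_LINE = re.compile(r".*Error at file (?P<file>.+), line (?P<line>\d+), char (?P<col>\d+)$")
MESSAGE_LINE = re.compile(r"\s*Message: (?P<message>.*)$")
TIMING_LINE = re.compile(r".+: \d+ ms .*$")   # the lines that %-G throws away


def parse_saxcount(output):
    """Return one dict per error report found in SAXCount's output."""
    entries = []
    pending = None                      # error header still waiting for its Message line
    for text in output.splitlines():
        if TIMING_LINE.match(text):
            continue                    # informational line, ignored
        header = ERROR_LINE.match(text)
        if header:
            pending = {
                "file": header.group("file"),
                "line": int(header.group("line")),
                "col": int(header.group("col")),
            }
            continue
        message = MESSAGE_LINE.match(text)
        if message and pending is not None:
            pending["message"] = message.group("message")
            entries.append(pending)
            pending = None
    return entries


sample = """
Error at file /var/tmp/a.xml, line 4, char 23
  Message: empty content is not valid for content model '(transferBatch|notification)'
a.xml: 11 ms (64 elems, 207 attrs, 1133 spaces, 0 chars)
"""

for entry in parse_saxcount(sample):
    print(entry)
```

Because the header pattern starts with `.*`, a line beginning with "Fatal Error" is picked up the same way, which is the job `%.%#` does in the errorformat.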
And now, all I have to do to validate .xml files is :make (or just hit F7), and navigate from each error to the next with F4 (:cnext) - just as I do for my Python and C++ work.
One down, one to go.
The end result: after visually selecting an area, I use the Leader key ( \ ) followed by '=', and attributes will line up - because of this line I added in my .vimrc:
vmap <buffer> <Leader>= :Tabularize/\v\zs\w+\ze\=["']<CR>gv:!eatPeskySpacesOfTabularizedXML.pl<CR>
...with eatPeskySpacesOfTabularizedXML.pl containing this:
#!/usr/bin/perl
while(<>) {
    s,(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3,$1$2=$3$4$3,g;
    print;
}
There's a lot of interesting backstory in this, though. Keep reading.
Tabular
As is almost always the case, the necessary VIM plugin is just a Google search away. In my case, searching for 'vim alignment' pointed to Tabular.
So assuming you set markers a and b to the beginning and end of the section below...
<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />
<Item param="D_APPLOGGED_DATE" label="Date you logged:" pw="62:10" />
<Item param="I_MINID" label="Serial:" pw="2:10" />
<Item param="I_MAX_SID" label="Up to serial:" pw="62:10" nl="true" />
<Item param="BD_MINPRICE" label="Price:" pw="2:30" />
...this:
:'a,'bTabularize /=
...gets you this:
<Item param = "STR_NAME_GTE" label     = "Name from:" pw       = "2:10" />
<Item param = "D_APPLOGGED_DATE" label = "Date you logged:" pw = "62:10" />
<Item param = "I_MINID" label          = "Serial:" pw          = "2:10" />
<Item param = "I_MAX_SID" label        = "Up to serial:" pw    = "62:10" nl = "true" />
<Item param = "BD_MINPRICE" label      = "Price:" pw           = "2:30" />
Which is nice, but not what I wanted. Skimming over the Tabular manual, 5 min later:
:'a,'bTabularize/\v\zs\w+\ze\=["']
...gave me this:
<Item param ="STR_NAME_GTE"     label ="Name from:"       pw ="2:10" />
<Item param ="D_APPLOGGED_DATE" label ="Date you logged:" pw ="62:10" />
<Item param ="I_MINID"          label ="Serial:"          pw ="2:10" />
<Item param ="I_MAX_SID"        label ="Up to serial:"    pw ="62:10"    nl ="true" />
<Item param ="BD_MINPRICE"      label ="Price:"           pw ="2:30" />
...which is almost perfect.
Breaking down the regexp to see how this works:
\v\zs\w+\ze\=["']
\v : enter very magic mode (mostly Perl-ish regular expressions)
\zs : set start of match here
\w+ : match a word (the attribute name, e.g. param or label)
\ze : set end of match here
\=["'] : require an equal sign and an opening quote right after the word, without making them part of the match
Tabular will then place a single space before and after every match, making sure the matches line up across lines.
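For readers who don't have Tabular at hand, the effect can be approximated in a few lines of Python. This is only a rough emulation under stated assumptions: the function name align_on_attr_names is made up, the Vim pattern is rewritten with a lookahead because Python has no \zs / \ze, and the padding rule (left-aligned fields, one space apart) is a guess at Tabular's default behaviour rather than its actual code. It also drops leading indentation, which is fine for a demo.

```python
import re

# Python spelling of the Vim pattern \v\zs\w+\ze\=["'] :
# a word, but only when =" or =' follows immediately (lookahead, not consumed)
ATTR_NAME = re.compile(r"""(\w+)(?==["'])""")


def align_on_attr_names(lines):
    """Roughly emulate the layout: left-aligned columns, one space apart."""
    # The capturing group makes re.split keep the attribute names as their own fields
    rows = [[field.strip() for field in ATTR_NAME.split(line)] for line in lines]
    ncols = max(len(row) for row in rows)
    widths = [
        max(len(row[i]) for row in rows if i < len(row))
        for i in range(ncols)
    ]
    return [
        " ".join(field.ljust(widths[i]) for i, field in enumerate(row)).rstrip()
        for row in rows
    ]


lines = [
    '<Item param="STR_NAME_GTE" label="Name from:" pw="2:10" />',
    '<Item param="D_APPLOGGED_DATE" label="Date you logged:" pw="62:10" />',
    '<Item param="I_MAX_SID" label="Up to serial:" pw="62:10" nl="true" />',
]
print(ATTR_NAME.findall(lines[0]))      # only the attribute names are "delimiters"
print("\n".join(align_on_attr_names(lines)))
```

Since the attribute name is the delimiter and the `="value"` part is the field that follows it, the padding lands after each value, and the single separating space ends up between the name and the equal sign.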
So, are we done?
No, there's that pesky space before the equal sign. I am weird, I know :‑)
How would I go about removing it?
A simple regexp search and replace (s/ ="/="/g) would do the trick - but what if the strings end up containing equal signs in them? e.g.
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
We would then break them up. No, we should search for the string beginning more cleverly - taking into account that XML strings can in fact use single quoting, too.
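A quick way to see the damage, sketched in Python (the helper name naive_fix is made up for this illustration, and it already handles both quote styles):

```python
import re


def naive_fix(line):
    # Drop the space in front of every =" or =' ... wherever it appears,
    # including inside attribute values
    return re.sub(r""" =(["'])""", r"=\1", line)


samples = [
    'posAndWidth ="40:5 =" height ="1"',
    'posAndWidth ="-1:8 =\'" textAlignment ="Right"',
]
for line in samples:
    print(naive_fix(line))
```

In both samples the value itself loses a space (`40:5 =` turns into `40:5=`, and `-1:8 ='` into `-1:8='`), so the data has silently changed.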
Let's hunt them down:
/\w+\s* =\s*(["'])[^\1]*\1
In detail:
\w+ : match the attribute name
\s* : followed by optional whitespace
 = : followed by a single space and the equal sign
\s* : followed by optional whitespace
(["']) : followed by either kind of quote, which we mark...
[^\1]* : ...so that we can search for any character except it as many times as possible
\1 : followed by the quote that we began with in the first place.
Should work, no?
Well... it doesn't.
Why?
I couldn't figure it out. So I asked the all-knowing Oracle for help.
A kind soul there explained that the negation I am using ([^\1]) doesn't work. Apparently, you can't use back references in character classes - they simply don't work there.
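The failure is easy to reproduce. The sketch below uses Python's re module rather than Perl, so treat it as an illustration of the general problem: Python documents that numeric escapes inside a character class are read as plain characters, so `[^\1]` means "anything except character 0x01", and other engines may differ in the details.

```python
import re

line = 'posAndWidth ="40:5 =" height ="1"'

# Inside [...] Python reads \1 as the octal escape for character 0x01,
# not as "whatever group 1 matched" - so [^\1]* happily runs across quotes.
broken = re.compile(r"""\w+\s* =\s*(["'])[^\1]*\1""")

match = broken.search(line)
print(match.group(0))
```

Instead of stopping at the end of the first attribute value, the match runs on to the last double quote of the line and swallows the second attribute with it.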
But you can use ... look-ahead. To make sure the character that follows is NOT part of a back reference.
So what I want can be expressed like this, in regular expression engines that support look-ahead (like Perl's):
/\w+\s* =\s*(["'])((?!\1).)*\1
The new parts:
?!\1 : look ahead, and make sure we don't match the back reference (the quote we've seen before)
. : now that we know we don't, match any character
* : do this as many times as possible
\1 : followed by the quote that we began with in the first place.
In fact, since we don't need to capture every single character that the repeated group matches (the capturing would happen for each character of each string, so it is wasted work), we can use the ?: syntax to make that group non-capturing - and wrap the whole repetition in one group that captures the complete string instead.
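Here is a small Python sketch of the two variants. It shows the lookahead stopping at the right quote, and what each grouping leaves behind in the capture; the variable names are made up, and whether the non-capturing form is measurably faster is not something this snippet tests.

```python
import re

line = 'posAndWidth ="40:5 =" height ="1"'

# Repeated capturing group: group 2 is overwritten on every iteration
repeated = re.compile(r"""\w+\s* =\s*(["'])((?!\1).)*\1""")
# Non-capturing inner group, with one outer group around the whole run
wrapped = re.compile(r"""\w+\s* =\s*(["'])((?:(?!\1).)*)\1""")

print(repeated.search(line).group(0))        # stops at the closing quote
print(repr(repeated.search(line).group(2)))  # only the last character survives
print(repr(wrapped.search(line).group(2)))   # the whole attribute value
```

The second form is the one that is useful in a substitution, because the complete value can be put back with a single back reference.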
And this is how my journey ended:
s,(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3,$1$2=$3$4$3,g;
I placed a Perl script doing this in my utilities and invoke it right after Tabularize.
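For anyone who would rather not depend on Perl, the same substitution can be sketched in Python. The pattern is copied from the one-liner above; only the replacement syntax changes (\1 instead of $1). The function name eat_pesky_spaces is made up for this example.

```python
import re

# Same pattern as the Perl one-liner; group 3 is the quote, group 4 the value
PESKY_SPACE = re.compile(r"""(\w+)(\s*) =\s*(["'])((?:(?!\3).)*)\3""")


def eat_pesky_spaces(line):
    """Remove the single space Tabular leaves in front of each equal sign."""
    return PESKY_SPACE.sub(r"\1\2=\3\4\3", line)


samples = [
    '<Item param ="I_MINID"          label ="Serial:"          pw ="2:10" />',
    'posAndWidth ="40:5 =" height ="1"',
    'posAndWidth ="-1:8 =\'" textAlignment ="Right"',
]
for line in samples:
    print(eat_pesky_spaces(line))
```

Each match consumes the attribute name together with its complete quoted value, so an equal sign inside a value is never looked at on its own, and the padding after the value is left alone.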
You can fork my VIM configuration on GitHub to automatically use these two tricks, if you think they are useful.
One thing is certain: I learned a lot while making them work.
Updated: Sat Oct 8 11:41:25 2022