view doc/HtmlParser.txt @ 1173:717058f63429

Removed the experimental font size adjuster patch (it slipped in by mistake)
author Jorge Arellano Cid <>
date Mon, 15 Jun 2009 17:50:10 -0400
parents 6ee11bf9e3ea
children 20ffd8b339cc
line wrap: on
line source
 October 2001, --Jcid
 Last update: Dec 2004

                        THE HTML PARSER

   Dillo's  parser is more than just a HTML parser, it does XHTML
and  plain  text  also.  It  has  parsing 'modes' that define its
behaviour while working:

   typedef enum {
   } DilloHtmlParseMode;

   The  parser  works  upon a token-grained basis, i.e., the data
stream is parsed into tokens and the parser is fed with them. The
process  is  simple:  whenever  the  cache  has new data, it gets
passed to Html_write, which groups data into tokens and calls the
appropriate functions for the token type (TAG, SPACE or WORD).

   Note:   when  in  DILLO_HTML_PARSE_MODE_VERBATIM,  the  parser
doesn't  try  to  split  the  data stream into tokens anymore, it
simply collects until the closing tag.


  * A chunk of WHITE SPACE     --> Html_process_space

  * TAG                        --> Html_process_tag

    The tag-start is defined by two adjacent characters:

      first : '<'
      second: ALPHA | '/' | '!' | '?'

      Note: comments are discarded ( <!-- ... --> )

    The tag's end is not as easy to find, nor to deal with!:

    1)  The  HTML  4.01  sec.  3.2.2 states that "Attribute/value
    pairs appear before the final '>' of an element's start tag",
    but it doesn't define how to discriminate the "final" '>'.

    2) '<' and '>' should be escaped as '&lt;' and '&gt;' inside
    attribute values.

    3)  The XML SPEC for XHTML states:
      AttrValue ::== '"' ([^<&"] | Reference)* '"'   |
                     "'" ([^<&'] | Reference)* "'"

    Current  parser  honors  the XML SPEC.

    As  it's  a  common  mistake  for human authors to mistype or
    forget  one  of  the  quote  marks of an attribute value; the
    parser   solves  the  problem  with  a  look-ahead  technique
    (otherwise  the  parser  could  skip significative amounts of
    well written HTML).

  * WORD                       --> Html_process_word

    A  word is anything that doesn't start with SPACE, and that's
    outside  of  a  tag, up to the first SPACE or tag start.

    SPACE = ' ' | \n | \r | \t | \f | \v


  The parsing state of the document is kept in a stack:

  struct _DilloHtml {
     DilloHtmlState *stack;
     gint stack_top; /* Index to the top of the stack [0 based] */
     gint stack_max;

  struct _DilloHtmlState {
    char *tag;
    DwStyle *style, *table_cell_style;
    DilloHtmlParseMode parse_mode;
    DilloHtmlTableMode table_mode;
    gint list_level;
    gint list_number;
    DwWidget *page, *table;
    gint32  current_bg_color;

  Basically,  when a TAG is processed, a new state is pushed into
the  'stack'  and  its  'style'  is  set  to  reflect the desired
appearance (details in DwStyle.txt).

  That way, when a word is processed later (added to the Dw), all
the information is within the top state.

  Closing TAGs just pop the stack.