annotate doc/HtmlParser.txt @ 0:6ee11bf9e3ea

Initial revision
author jcid
date Sun, 07 Oct 2007 00:36:34 +0200
parents
children 20ffd8b339cc
October 2001, --Jcid
Last update: Dec 2004

---------------
THE HTML PARSER
---------------


Dillo's parser is more than just an HTML parser: it also handles
XHTML and plain text. It has parsing 'modes' that define its
behaviour while working:

   typedef enum {
      DILLO_HTML_PARSE_MODE_INIT,
      DILLO_HTML_PARSE_MODE_STASH,
      DILLO_HTML_PARSE_MODE_STASH_AND_BODY,
      DILLO_HTML_PARSE_MODE_BODY,
      DILLO_HTML_PARSE_MODE_VERBATIM,
      DILLO_HTML_PARSE_MODE_PRE
   } DilloHtmlParseMode;


The parser works on a token-grained basis, i.e., the data
stream is split into tokens and the parser is fed with them. The
process is simple: whenever the cache has new data, it is
passed to Html_write, which groups the data into tokens and calls
the appropriate function for each token type (TAG, SPACE or WORD).

Note: when in DILLO_HTML_PARSE_MODE_VERBATIM, the parser no
longer tries to split the data stream into tokens; it simply
collects data until the closing tag.
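
The VERBATIM behaviour described above can be sketched as a search
for the closing tag. This is only an illustration, not Dillo's code:
the function name verbatim_end and its signature are assumptions, and
it matches only the "</name" prefix without checking what follows.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Dillo): returns the offset of the
 * first "</name" in buf, compared case-insensitively, or -1 if the
 * closing tag has not arrived yet. Everything before that offset is
 * the verbatim data to collect. */
static long verbatim_end(const char *buf, size_t len, const char *name)
{
   size_t nlen = strlen(name);

   for (size_t i = 0; i + 2 + nlen <= len; i++) {
      if (buf[i] != '<' || buf[i + 1] != '/')
         continue;
      size_t k = 0;
      while (k < nlen &&
             tolower((unsigned char)buf[i + 2 + k]) ==
             tolower((unsigned char)name[k]))
         k++;
      if (k == nlen)
         return (long)i;
   }
   return -1;
}
```

A caller would keep appending incoming data to its buffer while the
function returns -1, and resume normal tokenizing at the returned
offset once it is found.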

------
TOKENS
------

* A chunk of WHITE SPACE --> Html_process_space


* TAG --> Html_process_tag

  The tag-start is defined by two adjacent characters:

    first : '<'
    second: ALPHA | '/' | '!' | '?'

  Note: comments are discarded ( <!-- ... --> )


  The tag's end is not as easy to find, nor to deal with:

  1) HTML 4.01 sec. 3.2.2 states that "Attribute/value
     pairs appear before the final '>' of an element's start tag",
     but it doesn't define how to discriminate the "final" '>'.

  2) '<' and '>' should be escaped as '&lt;' and '&gt;' inside
     attribute values.

  3) The XML SPEC for XHTML states:
       AttValue ::= '"' ([^<&"] | Reference)* '"' |
                    "'" ([^<&'] | Reference)* "'"

  The current parser honors the XML SPEC.

  As it's a common mistake for human authors to mistype or
  forget one of the quote marks of an attribute value, the
  parser solves the problem with a look-ahead technique
  (otherwise the parser could skip significant amounts of
  well-written HTML).
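
To make the tag-end problem concrete, here is a minimal sketch of a
quote-aware search for the final '>'. It is not Dillo's look-ahead
implementation, which this document does not detail. The function
name find_tag_end is hypothetical, and the recovery rule (a raw '<'
inside a quoted value means a quote mark was lost, so fall back to
the first '>' seen) is an assumption derived from the XML rule that
'<' may not appear inside an attribute value.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Dillo): buf[0] is the '<' that
 * opens the tag. Returns the index of the '>' taken as the tag's end,
 * or -1 if no end can be determined from the data seen so far. */
static long find_tag_end(const char *buf, size_t len)
{
   char quote = 0;       /* active quote character, 0 when outside a value */
   long first_gt = -1;   /* first '>' seen anywhere, kept for recovery */

   for (size_t i = 1; i < len; i++) {
      char c = buf[i];

      if (c == '>' && first_gt < 0)
         first_gt = (long)i;

      if (quote) {
         if (c == quote)
            quote = 0;           /* end of the attribute value */
         else if (c == '<')
            return first_gt;     /* assumed recovery: unbalanced quote */
      } else if (c == '"' || c == '\'') {
         quote = c;              /* start of a quoted attribute value */
      } else if (c == '>') {
         return (long)i;         /* '>' outside any value ends the tag */
      }
   }
   return -1;
}
```

With this rule, a '>' inside a properly quoted value is skipped, while
a tag with a missing closing quote ends at its first '>' instead of
swallowing the markup that follows it.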



* WORD --> Html_process_word

  A word is anything that doesn't start with SPACE and that's
  outside of a tag, up to the first SPACE or tag start.

  SPACE = ' ' | \n | \r | \t | \f | \v
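
The three token definitions above can be put together in a small
classifier. This is a sketch under stated assumptions, not the code of
Html_write: the names TokenType, is_space, starts_tag and next_token
are hypothetical, and a TAG is ended naively at the first '>' rather
than with the quote handling described earlier.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOKEN_SPACE, TOKEN_TAG, TOKEN_WORD } TokenType;

/* SPACE = ' ' | \n | \r | \t | \f | \v */
static int is_space(char c)
{
   return c != '\0' && strchr(" \n\r\t\f\v", c) != NULL;
}

/* Tag start: '<' followed by ALPHA, '/', '!' or '?'. */
static int starts_tag(const char *s, size_t len)
{
   if (len < 2 || s[0] != '<')
      return 0;
   unsigned char c = (unsigned char)s[1];
   return isalpha(c) != 0 || c == '/' || c == '!' || c == '?';
}

/* Classifies the token that begins at buf[0] (len must be > 0) and
 * stores its length in *tok_len. */
static TokenType next_token(const char *buf, size_t len, size_t *tok_len)
{
   size_t i = 0;

   if (is_space(buf[0])) {
      while (i < len && is_space(buf[i]))
         i++;
      *tok_len = i;
      return TOKEN_SPACE;
   }
   if (starts_tag(buf, len)) {
      /* Naive end of tag: the first '>' (or all data if none yet). */
      const char *gt = memchr(buf, '>', len);
      *tok_len = gt ? (size_t)(gt - buf) + 1 : len;
      return TOKEN_TAG;
   }
   /* WORD: runs up to the first SPACE or tag start. */
   i = 1;
   while (i < len && !is_space(buf[i]) && !starts_tag(buf + i, len - i))
      i++;
   *tok_len = i;
   return TOKEN_WORD;
}
```

Note that a '<' not followed by one of the listed characters (as in
"a<3") does not start a tag, so it stays part of the word.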


-----------------
THE PARSING STACK
-----------------

The parsing state of the document is kept in a stack:

   struct _DilloHtml {
      [...]
      DilloHtmlState *stack;
      gint stack_top;   /* Index to the top of the stack [0 based] */
      gint stack_max;
      [...]
   };

   struct _DilloHtmlState {
      char *tag;
      DwStyle *style, *table_cell_style;
      DilloHtmlParseMode parse_mode;
      DilloHtmlTableMode table_mode;
      gint list_level;
      gint list_number;
      DwWidget *page, *table;
      gint32 current_bg_color;
   };


Basically, when a TAG is processed, a new state is pushed onto
the 'stack' and its 'style' is set to reflect the desired
appearance (details in DwStyle.txt).

That way, when a word is processed later (added to the Dw), all
the information it needs is within the top state.

Closing TAGs just pop the stack.
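
The push and pop behaviour can be sketched with a cut-down version of
the structures. This is not Dillo's code: the State and Parser types
and the functions parser_init, push_state and pop_state are
hypothetical, the GLib and Dw types are replaced with plain C ones,
and copying the enclosing state into the new one on push is an
assumption about how settings carry over.

```c
#include <stdlib.h>

/* Simplified stand-in for DilloHtmlState. */
typedef struct {
   const char *tag;
   int parse_mode;
   int list_level;
} State;

/* Simplified stand-in for the stack fields of DilloHtml. */
typedef struct {
   State *stack;
   int stack_top;   /* index of the top of the stack, 0 based */
   int stack_max;   /* number of allocated entries */
} Parser;

/* Allocates the stack and installs a base state at index 0. */
static int parser_init(Parser *p, int initial_max)
{
   if (initial_max < 1)
      initial_max = 1;
   p->stack = malloc((size_t)initial_max * sizeof(State));
   if (p->stack == NULL)
      return -1;
   p->stack_max = initial_max;
   p->stack_top = 0;
   p->stack[0].tag = "none";
   p->stack[0].parse_mode = 0;
   p->stack[0].list_level = 0;
   return 0;
}

/* Opening TAG: push a new state, grown on demand, that starts as a
 * copy of the enclosing one (assumption) and records the tag name. */
static int push_state(Parser *p, const char *tag)
{
   if (p->stack_top + 1 >= p->stack_max) {
      int new_max = p->stack_max * 2;
      State *s = realloc(p->stack, (size_t)new_max * sizeof(State));
      if (s == NULL)
         return -1;
      p->stack = s;
      p->stack_max = new_max;
   }
   p->stack[p->stack_top + 1] = p->stack[p->stack_top];
   p->stack_top++;
   p->stack[p->stack_top].tag = tag;
   return 0;
}

/* Closing TAG: pop, but never remove the base state. */
static void pop_state(Parser *p)
{
   if (p->stack_top > 0)
      p->stack_top--;
}
```

Whatever processes a word then only needs to read
p->stack[p->stack_top], which is the point of keeping the whole state
on the stack.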