annotate doc/HtmlParser.txt @ 2048:5060d415a85a

October 2001, --Jcid
Last update: Jul 2009

---------------
THE HTML PARSER
---------------

Dillo's parser is more than just an HTML parser: it also handles XHTML
and plain text. It has parsing 'modes' that define its behaviour while
working:

   typedef enum {
      DILLO_HTML_PARSE_MODE_INIT = 0,
      DILLO_HTML_PARSE_MODE_STASH,
      DILLO_HTML_PARSE_MODE_STASH_AND_BODY,
      DILLO_HTML_PARSE_MODE_BODY,
      DILLO_HTML_PARSE_MODE_VERBATIM,
      DILLO_HTML_PARSE_MODE_PRE
   } DilloHtmlParseMode;

The parser works on a token-by-token basis: the data stream is split
into tokens, and the parser is fed with them. The process is simple:
whenever the cache has new data, it is passed to Html_write, which
groups the data into tokens and calls the appropriate function for each
token type (tag, space, or word).

Note: when in DILLO_HTML_PARSE_MODE_VERBATIM, the parser no longer
tries to split the data stream into tokens; it simply collects data
until the closing tag.

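The verbatim behaviour in the note above can be sketched as a search
for the closing tag, with everything before it kept as raw data. This
is only an illustration: find_closing_tag is a hypothetical helper (not
a Dillo function), and the case-insensitive match on "</name" is an
assumption about how the closing tag is recognized.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of "</name" (compared without
// regard to case) at or after `from`, or std::string::npos if the closing
// tag has not arrived yet. Everything in [from, result) is verbatim data.
std::size_t find_closing_tag(const std::string &buf, std::size_t from,
                             const std::string &name)
{
   const std::string needle = "</" + name;

   for (std::size_t i = from; i + needle.size() <= buf.size(); ++i) {
      std::size_t k = 0;
      while (k < needle.size() &&
             std::tolower(static_cast<unsigned char>(buf[i + k])) ==
             std::tolower(static_cast<unsigned char>(needle[k])))
         ++k;
      if (k == needle.size())
         return i;   // found the closing tag
   }
   return std::string::npos;   // keep collecting: more data is needed
}
```

With this sketch, a '<' inside the collected data (as in "a<b" within a
script) does not start a new token, because no tokenizing happens until
the closing tag is found.
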
------
TOKENS
------

* A chunk of WHITE SPACE --> Html_process_space

* TAG --> Html_process_tag

  The tag-start is defined by two adjacent characters:

    first : '<'
    second: ALPHA | '/' | '!' | '?'

  Note: comments are discarded ( <!-- ... --> )

  The tag's end is not as easy to find, nor to deal with:

  1) HTML 4.01 sec. 3.2.2 states that "Attribute/value pairs
     appear before the final '>' of an element's start tag",
     but it doesn't define how to discriminate the "final" '>'.

  2) '<' and '>' should be escaped as '&lt;' and '&gt;' inside
     attribute values.

  3) The XML SPEC for XHTML states:
       AttValue ::= '"' ([^<&"] | Reference)* '"' |
                    "'" ([^<&'] | Reference)* "'"

  The current parser honors the XML SPEC.

  Because it's a common mistake for human authors to mistype or
  forget one of the quote marks of an attribute value, the parser
  solves the problem with a look-ahead technique (otherwise it
  could skip significant amounts of properly-written HTML).

* WORD --> Html_process_word

  A word is anything outside of a tag that doesn't start with
  SPACE, up to the first SPACE or tag start.

  SPACE = ' ' | \n | \r | \t | \f | \v

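The three token rules above can be put together in a small tokenizer.
This is a sketch under stated assumptions, not Dillo's code: Token,
TokenType, is_space, is_tag_start, find_tag_end and tokenize are all
hypothetical names. In particular, the text does not spell out the
look-ahead rule, so the one used here is a guess: a '>' inside a quoted
value is treated as part of the value only if the closing quote appears
before the next '<'. Comment discarding is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenType { Space, Tag, Word };

struct Token {
   TokenType type;
   std::string text;
};

// SPACE = ' ' | \n | \r | \t | \f | \v
static bool is_space(char c)
{
   return c == ' ' || c == '\n' || c == '\r' ||
          c == '\t' || c == '\f' || c == '\v';
}

// Tag start: '<' followed by ALPHA | '/' | '!' | '?'
static bool is_tag_start(const std::string &s, std::size_t i)
{
   if (i + 1 >= s.size() || s[i] != '<')
      return false;
   const unsigned char c = static_cast<unsigned char>(s[i + 1]);
   return std::isalpha(c) || c == '/' || c == '!' || c == '?';
}

// Index of the '>' that closes the tag starting at `start`, or npos.
// Assumed look-ahead rule (see the lead-in): after an opening quote, a '>'
// ends the tag unless the matching quote shows up before the next '<'.
static std::size_t find_tag_end(const std::string &s, std::size_t start)
{
   std::size_t i = s.find_first_of(">\"'", start + 1);
   while (i != std::string::npos && s[i] != '>') {
      const char quote = s[i];
      std::size_t j = s.find_first_of(std::string(1, quote) + ">", i + 1);
      if (j == std::string::npos)
         return j;         // tag is incomplete
      if (s[j] == '>') {
         const std::size_t k =
            s.find_first_of(std::string(1, quote) + "<", j + 1);
         if (k == std::string::npos || s[k] == '<')
            return j;      // quote was never closed: end the tag here
         j = k;            // closing quote found: '>' belonged to the value
      }
      i = s.find_first_of(">\"'", j + 1);
   }
   return i;
}

// Split `s` into SPACE, TAG and WORD tokens.
std::vector<Token> tokenize(const std::string &s)
{
   std::vector<Token> out;
   std::size_t i = 0;
   while (i < s.size()) {
      std::size_t j = i;
      TokenType type;
      if (is_space(s[i])) {
         type = TokenType::Space;
         while (j < s.size() && is_space(s[j]))
            ++j;
      } else if (is_tag_start(s, i)) {
         type = TokenType::Tag;
         const std::size_t end = find_tag_end(s, i);
         j = (end == std::string::npos) ? s.size() : end + 1;
      } else {
         type = TokenType::Word;
         ++j;   // always consume one character, so a stray '<' makes progress
         while (j < s.size() && !is_space(s[j]) && !is_tag_start(s, j))
            ++j;
      }
      out.push_back(Token{type, s.substr(i, j - i)});
      i = j;
   }
   return out;
}
```

For example, tokenize("<p>Hi  there</p>") yields a tag, a word, a chunk
of white space, a word and a tag, while "a<3" stays a single word
because '<' followed by a digit is not a tag start.
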
-----------------
THE PARSING STACK
-----------------

The parsing state of the document is kept in a stack:

   class DilloHtml {
      [...]
      lout::misc::SimpleVector<DilloHtmlState> *stack;
      [...]
   };

   struct _DilloHtmlState {
      CssPropertyList *table_cell_props;
      DilloHtmlParseMode parse_mode;
      DilloHtmlTableMode table_mode;
      bool cell_text_align_set;
      DilloHtmlListMode list_type;
      int list_number;

      /* TagInfo index for the tag that's being processed */
      int tag_idx;

      dw::core::Widget *textblock, *table;

      /* This is used to align list items (especially in enumerated lists) */
      dw::core::Widget *ref_list_item;

      /* This is used for list items etc; if it is set to TRUE, breaks
         have to be "handed over" (see Html_add_indented and
         Html_eventually_pop_dw). */
      bool hand_over_break;
   };

Basically, when a TAG is processed, a new state is pushed onto the
'stack' and its 'style' is set to reflect the desired appearance
(details in DwStyle.txt).

That way, when a word is processed later (added to the Dw), all the
information is within the top state.

Closing TAGs just pop the stack.

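The push/pop discipline described above can be sketched with a plain
std::vector. Everything here is illustrative: ParseMode, State and
ParserStack are hypothetical, reduced stand-ins for DilloHtmlParseMode,
DilloHtmlState and the SimpleVector-based stack. Starting each new
state as a copy of the current top is an assumption made so that nested
tags inherit the enclosing state.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reduced, illustrative set of modes (not the real enum).
enum ParseMode { MODE_BODY, MODE_PRE, MODE_VERBATIM };

// Reduced stand-in for the per-tag state.
struct State {
   ParseMode parse_mode;
   int list_number;
   std::string tag;   // stands in for tag_idx
};

class ParserStack {
public:
   // The stack starts with one root state that is never popped.
   ParserStack() { stack_.push_back(State{MODE_BODY, 0, "(root)"}); }

   // Opening tag: push a new state, initialized from the current top
   // (assumption), then let the caller adjust it.
   void open_tag(const std::string &tag)
   {
      State s = stack_.back();
      s.tag = tag;
      stack_.push_back(s);
   }

   // Closing tag: just pop, which restores the enclosing state.
   void close_tag()
   {
      if (stack_.size() > 1)
         stack_.pop_back();
   }

   // Words are processed using whatever is in the top state.
   State &top() { return stack_.back(); }
   std::size_t depth() const { return stack_.size(); }

private:
   std::vector<State> stack_;
};
```

A caller would invoke open_tag("pre"), set top().parse_mode to MODE_PRE,
and process the words that follow against top(); after close_tag() the
top state is the enclosing one again, with its parse_mode unchanged.
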