// FIXME: xml namespace support???
// FIXME: https://developer.mozilla.org/en-US/docs/Web/API/Element/insertAdjacentHTML
// FIXME: parentElement is parentNode that skips DocumentFragment etc but will be hard to work in with my compatibility...

// FIXME: the scriptable list is quite arbitrary

// xml entity references?!

/++
	This is an html DOM implementation, started with cloning
	what the browser offers in Javascript, but going well beyond
	it in convenience.

	If you can do it in Javascript, you can probably do it with
	this module, and much more.

	---
	import arsd.dom;

	void main() {
		auto document = new Document(`<html>
			<p>paragraph</p>
		</html>`);

		writeln(document.querySelector("p"));
		document.root.innerHTML = "<p>hey</p>";
		writeln(document);
	}
	---

	BTW: this file optionally depends on `arsd.characterencodings`, to
	help it correctly read files from the internet. You should be able to
	get characterencodings.d from the same place you got this file.

	If you want it to stand alone, just always use the `Document.parseUtf8`
	function or the constructor that takes a string.

	Symbol_groups:

	core_functionality =
		These members provide core functionality. The members on these classes
		will provide most of your direct interaction.

	bonus_functionality =
		These provide additional functionality for special use cases.

	implementations =
		These provide implementations of other functionality.
+/
module arsd.dom;

static import arsd.core;
import arsd.core : encodeUriComponent, decodeUriComponent;

// FIXME: support the css standard namespace thing in the selectors too

version(with_arsd_jsvar)
	import arsd.jsvar;
else {
	enum scriptable = "arsd_jsvar_compatible";
}

// this is only meant to be used at compile time, as a filter for opDispatch
// lists the attributes we want to allow without the use of .attr
bool isConvenientAttribute(string name) {
	static immutable list = [
		"name", "id", "href", "value",
		"checked", "selected", "type",
		"src", "content", "pattern",
		"placeholder", "required", "alt",
		"rel",
		"method", "action", "enctype"
	];
	foreach(l; list)
		if(name == l) return true;
	return false;
}

// FIXME: something like <ol> spam <ol> with no closing </ol> should read the
// second tag as the closer in garbage mode

// FIXME: failing to close a paragraph sometimes messes things up too

// FIXME: it would be kinda cool to have some support for internal DTDs
// and maybe XPath as well, to some extent
/*
	we could do
		meh this sux

	auto xpath = XPath(element);

	// get the first p
	xpath.p[0].a["href"]
*/

/++
	The main document interface, including a html or xml parser.

	There are three main ways to create a Document:

	If you want to parse something and inspect the tags, you can use the [this|constructor]:
	---
	// create and parse some HTML in one call
	auto document = new Document("<html><p>hello</p></html>");

	// or some XML
	auto document = new Document("<xml><item>data</item></xml>", true, true); // strict mode enabled

	// or better yet:
	auto document = new XmlDocument("<xml><item>data</item></xml>"); // specialized subclass
	---

	If you want to download something and parse it in one call, the [fromUrl] static function can help:
	---
	auto document = Document.fromUrl("http://dlang.org/");
	---
	(note that this requires my [arsd.characterencodings] and [arsd.http2] libraries)

	And, if you need to inspect things like `<%= foo %>` tags and comments, you can add them to the dom like this, with the [enableAddingSpecialTagsToDom]
	and [parseUtf8] or [parseGarbage] functions:
	---
	auto document = new Document();
	document.enableAddingSpecialTagsToDom();
	document.parseUtf8("<example></example>", true, true); // change the trues to false to switch from xml to html mode
	---

	You can also modify things like [selfClosedElements] and [rawSourceElements] before calling the `parse` family of functions to do further advanced tasks.

	However you parse it, it will put a few things into special variables.

	[root] contains the root document.
	[prolog] contains the instructions before the root (like `<!DOCTYPE html>`). To keep the original things, you will need to [enableAddingSpecialTagsToDom]
	first, otherwise the library will return generic strings in there. [piecesBeforeRoot] will have other parsed instructions, if [enableAddingSpecialTagsToDom] is called.
	[piecesAfterRoot] will contain any xml-looking data after the root tag is closed.

	Most often though, you will not need to look at any of that data, since `Document` itself
	has methods like [querySelector], [appendChild], and more which will forward to the root
	[Element] for you.
+/
/// Group: core_functionality
class Document : FileResource, DomParent {
	inout(Document) asDocument() inout { return this; }
	inout(Element) asElement() inout { return null; }

	/++
		These three functions, `processTagOpen`, `processTagClose`, and `processNodeWhileParsing`, allow you to process elements as they are parsed and choose to not append them to the dom tree.

		`processTagOpen` is called as soon as it reads the tag name and attributes into the passed `Element` structure, in order of appearance in the file. `processTagClose` is called similarly, when that tag has been closed. In between, all descendant nodes - including tags as well as text and other nodes - are passed to `processNodeWhileParsing`. Finally, after `processTagClose`, the node itself is passed to `processNodeWhileParsing` only after its children.

		So, given:

		```xml
		<thing>
			<child>
				<grandchild></grandchild>
			</child>
		</thing>
		```

		It would call:

		$(NUMBERED_LIST
			* processTagOpen(thing)
			* processNodeWhileParsing(thing, whitespace text) // the newlines, spaces, and tabs between the thing tag and child tag
			* processTagOpen(child)
			* processNodeWhileParsing(child, whitespace text)
			* processTagOpen(grandchild)
			* processTagClose(grandchild)
			* processNodeWhileParsing(child, grandchild)
			* processNodeWhileParsing(child, whitespace text) // whitespace after the grandchild
			* processTagClose(child)
			* processNodeWhileParsing(thing, child)
			* processNodeWhileParsing(thing, whitespace text)
			* processTagClose(thing)
		)

		The Element objects passed to those functions are the same ones you'd see; the tag open and tag close calls receive the same object, so you can compare them with the `is` operator if you want.

		The default behavior of each function is that `processTagOpen` and `processTagClose` do nothing.
		`processNodeWhileParsing`'s default behavior is to call `parent.appendChild(child)`, in order to build the dom tree. If you do not want the dom tree, you can override this function to do nothing.

		If you do not choose to append child to parent in `processNodeWhileParsing`, the garbage collector is free to clean up the node even while the document is still being parsed, allowing memory use to stay lower. Memory use will tend to scale approximately with the maximum depth of the element tree rather than the entire document size.

		To cancel processing before the end of a document, you'll have to throw an exception and catch it at your call to parse. There is no other way to stop early and there are no concrete plans to add one.

		There are several approaches to using this: you might use `processTagOpen` and `processTagClose` to keep a stack or other state variables to process nodes as they come and never add them to the actual tree. You might also build partial subtrees to use all the convenient methods in `processTagClose`, but then not add that particular node to the rest of the tree to keep memory usage down.

		Examples:

		Suppose you have a large array of items under the root element you'd like to process individually, without taking all the items into memory at once.
		You can do that with code like this:

		---
		import arsd.dom;

		class MyStream : XmlDocument {
			this(string s) { super(s); } // need to forward the constructor we use

			override void processNodeWhileParsing(Element parent, Element child) {
				// don't append anything to the root node, since we don't need them
				// all in the tree - that'd take too much memory -
				// but still build any subtree for each individual item for ease of processing
				if(parent is root)
					return;
				else
					super.processNodeWhileParsing(parent, child);
			}

			int count;

			override void processTagClose(Element element) {
				if(element.tagName == "item") {
					// process the element here with all the regular dom functions on `element`
					count++;
					// can still use dom functions on the subtree we built
					assert(element.requireSelector("name").textContent == "sample");
				}
			}
		}

		void main() {
			// generate an example file with a million items
			string xml = "<list>";
			foreach(i; 0 .. 1_000_000) {
				xml ~= "<item><name>sample</name><type>example</type></item>";
			}
			xml ~= "</list>";

			auto document = new MyStream(xml);
			assert(document.count == 1_000_000);
		}
		---

		This example runs in about 1/10th of the memory and 2/3 of the time on my computer relative to a default [XmlDocument] full tree dom.

		By overriding these three functions to fit the specific document and processing requirements you have, you might realize even bigger gains over the normal full document tree, while still getting most of the benefits of the convenient dom functions.

		Tip: if you use a [Utf8Stream] instead of a string, you might be able to bring the memory use further down.
		The easiest way to do that is something like this when loading from a file:

		---
		import std.stdio;
		auto file = File("filename.xml", "rb");

		auto textStream = new Utf8Stream(() {
			// get more
			auto buffer = new char[](32 * 1024);
			return cast(string) file.rawRead(buffer);
		}, () {
			// has more
			return !file.eof;
		});

		auto document = new XmlDocument(textStream);
		---

		You'll need to forward a constructor in your subclasses that takes `Utf8Stream` too if you want to subclass to override the streaming parsing functions.

		Note that if you do save parts of the document strings or objects, it might prevent the GC from freeing that string block anyway, since dom.d will often slice into its buffer while parsing instead of copying strings. It will depend on your specific case to know if this actually saves memory or not for you.

		Bugs:
			Even if you use a [Utf8Stream] to feed data and decline to append to the tree, the entire xml text is likely to end up in memory anyway.

		See_Also:
			[Document#examples]'s high level streaming example.

		History:
			`processNodeWhileParsing` was added January 6, 2023.

			`processTagOpen` and `processTagClose` were added February 21, 2025.
	+/
	void processTagOpen(Element what) {
	}

	/// ditto
	void processTagClose(Element what) {
	}

	/// ditto
	void processNodeWhileParsing(Element parent, Element child) {
		parent.appendChild(child);
	}

	/++
		Convenience method for web scraping. Requires [arsd.http2] to be
		included in the build as well as [arsd.characterencodings].

		This will download the file from the given url and create a document
		off it, using a strict constructor or [parseGarbage], depending on
		the value of `strictMode`.
	+/
	static Document fromUrl()(string url, bool strictMode = false) {
		import arsd.http2;
		auto client = new HttpClient();

		auto req = client.navigateTo(Uri(url), HttpVerb.GET);
		auto res = req.waitForCompletion();

		auto document = new Document();
		if(strictMode) {
			document.parse(cast(string) res.content, true, true, res.contentTypeCharset);
		} else {
			document.parseGarbage(cast(string) res.content);
		}

		return document;
	}

	/++
		Creates a document with the given source data. If you want HTML behavior, use `caseSensitive` and `strict` set to `false`. For XML mode, set them to `true`.

		Please note that anything after the root element will be found in [piecesAfterRoot]. Comments, processing instructions, and other special tags will be stripped out by default. You can customize this by using the zero-argument constructor and setting callbacks on the [parseSawComment], [parseSawBangInstruction], [parseSawAspCode], [parseSawPhpCode], and [parseSawQuestionInstruction] members, then calling one of the [parseUtf8], [parseGarbage], or [parse] functions. Calling the convenience method, [enableAddingSpecialTagsToDom], will enable all those things at once.

		See_Also:
			[parseGarbage]
			[parseUtf8]
			[fromUrl]
	+/
	this(string data, bool caseSensitive = false, bool strict = false) {
		parseUtf8(data, caseSensitive, strict);
	}

	/**
		Creates an empty document. It has *nothing* in it at all.
	*/
	this() {
	}

	/++
		This is just something I'm toying with. Right now, you use opIndex to put in css selectors.
		It returns a struct that forwards calls to all elements it holds, and returns itself so you
		can chain it.

		Example: document["p"].innerText("hello").addClass("modified");

		Equivalent to: foreach(e; document.getElementsBySelector("p")) { e.innerText("hello"); e.addClass("modified"); }

		Note: always use function calls (not property syntax) and don't use toString in there for best results.

		You can also do things like: document["p"]["b"] though tbh I'm not sure why since the selector string can do all that anyway.
		Maybe you could put in some kind of custom filter function tho.
	+/
	ElementCollection opIndex(string selector) {
		auto e = ElementCollection(this.root);
		return e[selector];
	}

	string _contentType = "text/html; charset=utf-8";

	/// If you're using this for some other kind of XML, you can
	/// set the content type here.
	///
	/// Note: this has no impact on the function of this class.
	/// It is only used if the document is sent via a protocol like HTTP.
	///
	/// This may be called by parse() if it recognizes the data. Otherwise,
	/// if you don't set it, it assumes text/html; charset=utf-8.
	@property string contentType(string mimeType) {
		_contentType = mimeType;
		return _contentType;
	}

	/// implementing the FileResource interface, useful for sending via
	/// http automatically.
	@property string filename() const { return null; }

	/// implementing the FileResource interface, useful for sending via
	/// http automatically.
	override @property string contentType() const {
		return _contentType;
	}

	/// implementing the FileResource interface; it calls toString.
	override immutable(ubyte)[] getData() const {
		return cast(immutable(ubyte)[]) this.toString();
	}

	/*
	/// Concatenates any consecutive text nodes
	void normalize() {
	}
	*/

	/// This will set delegates for parseSaw* (note: this overwrites anything else you set, and you setting subsequently will overwrite this) that add those things to the dom tree when it sees them.
	/// Call this before calling parse().

	/++
		Adds objects to the dom representing things normally stripped out during the default parse, like comments (`<!-- ... -->`), `<% code %>`, and `<? code ?>` all at once.

		Note this will also preserve the prolog and doctype from the original file, if there was one.
		See_Also:
			[parseSawComment]
			[parseSawAspCode]
			[parseSawPhpCode]
			[parseSawQuestionInstruction]
			[parseSawBangInstruction]
	+/
	void enableAddingSpecialTagsToDom() {
		parseSawComment = (string) => true;
		parseSawAspCode = (string) => true;
		parseSawPhpCode = (string) => true;
		parseSawQuestionInstruction = (string) => true;
		parseSawBangInstruction = (string) => true;
	}

	/// If the parser sees a html comment, it will call this callback.
	/// `<!-- comment -->` will call parseSawComment(" comment ")
	/// Return true if you want the node appended to the document. It will be in a [HtmlComment] object.
	bool delegate(string) parseSawComment;

	/// If the parser sees <% asp code... %>, it will call this callback.
	/// It will be passed "% asp code... %" or "%= asp code .. %"
	/// Return true if you want the node appended to the document. It will be in an [AspCode] object.
	bool delegate(string) parseSawAspCode;

	/// If the parser sees <?php php code... ?>, it will call this callback.
	/// It will be passed "?php php code... ?" or "?= asp code .. ?"
	/// Note: dom.d cannot identify the other php short format.
	/// Return true if you want the node appended to the document. It will be in a [PhpCode] object.
	bool delegate(string) parseSawPhpCode;

	/// if it sees a <?instruction> that is not php or asp
	/// it calls this function with the contents.
	/// `<?SOMETHING foo>` calls parseSawQuestionInstruction("?SOMETHING foo")
	/// Unlike the php/asp ones, this ends on the first > it sees, without requiring ?>.
	/// Return true if you want the node appended to the document. It will be in a [QuestionInstruction] object.
	bool delegate(string) parseSawQuestionInstruction;

	/// if it sees a `<!SOMETHING foo>`
	/// it calls parseSawBangInstruction("SOMETHING foo")
	/// Return true if you want the node appended to the document. It will be in a [BangInstruction] object.
	bool delegate(string) parseSawBangInstruction;

	/// Given the kind of garbage you find on the Internet, try to make sense of it.
	/// Equivalent to document.parse(data, false, false, null);
	/// (Case-insensitive, non-strict, determine character encoding from the data.)
	///
	/// NOTE: this makes no attempt at added security, but it will try to recover from anything instead of throwing.
	///
	/// It is a template so it lazily imports characterencodings.
	void parseGarbage()(string data) {
		parse(data, false, false, null);
	}

	/// Parses well-formed UTF-8, case-sensitive, XML or XHTML
	/// Will throw exceptions on things like unclosed tags.
	void parseStrict(string data, bool pureXmlMode = false) {
		parseStream(toUtf8Stream(data), true, true, pureXmlMode);
	}

	/// Parses well-formed UTF-8 in loose mode (by default). Tries to correct
	/// tag soup, but does NOT try to correct bad character encodings.
	///
	/// They will still throw an exception.
	void parseUtf8(string data, bool caseSensitive = false, bool strict = false) {
		parseStream(toUtf8Stream(data), caseSensitive, strict);
	}

	// this is a template so we get lazy import behavior
	Utf8Stream handleDataEncoding()(in string rawdata, string dataEncoding, bool strict) {
		import arsd.characterencodings;
		// gotta determine the data encoding. If you know it, pass it in above to skip all this.
		if(dataEncoding is null) {
			dataEncoding = tryToDetermineEncoding(cast(const(ubyte[])) rawdata);
			// it can't tell... probably a random 8 bit encoding. Let's check the document itself.

			// Now, XML and HTML can both list encoding in the document, but we can't really parse
			// it here without changing a lot of code until we know the encoding. So I'm going to
			// do some hackish string checking.
			if(dataEncoding is null) {
				auto dataAsBytes = cast(immutable(ubyte)[]) rawdata;
				// first, look for an XML prolog
				auto idx = indexOfBytes(dataAsBytes, cast(immutable ubyte[]) "encoding=\"");
				if(idx != -1) {
					idx += "encoding=\"".length;
					// we're probably past the prolog if it's this far in; we might be looking at
					// content. Forget about it.
					if(idx > 100)
						idx = -1;
				}
				// if that fails, we're looking for Content-Type http-equiv or a meta charset (see html5)..
				if(idx == -1) {
					idx = indexOfBytes(dataAsBytes, cast(immutable ubyte[]) "charset=");
					if(idx != -1) {
						idx += "charset=".length;
						if(dataAsBytes[idx] == '"')
							idx++;
					}
				}

				// found something in either branch...
				if(idx != -1) {
					// read till a quote or about 12 chars, whichever comes first...
					auto end = idx;
					while(end < dataAsBytes.length && dataAsBytes[end] != '"' && end - idx < 12)
						end++;

					dataEncoding = cast(string) dataAsBytes[idx .. end];
				}
				// otherwise, we just don't know.
			}
		}

		if(dataEncoding is null) {
			if(strict)
				throw new MarkupException("I couldn't figure out the encoding of this document.");
			else
				// if we really don't know by here, it means we already tried UTF-8,
				// looked for utf 16 and 32 byte order marks, and looked for xml or meta
				// tags... let's assume it's Windows-1252, since that's probably the most
				// common aside from utf that wouldn't be labeled.
				dataEncoding = "Windows 1252";
		}

		// and now, go ahead and convert it.

		string data;

		if(!strict) {
			// if we're in non-strict mode, we need to check
			// the document for mislabeling too; sometimes
			// web documents will say they are utf-8, but aren't
			// actually properly encoded. If it fails to validate,
			// we'll assume it's actually Windows encoding - the most
			// likely candidate for mislabeled garbage.
			dataEncoding = dataEncoding.toLower();
			dataEncoding = dataEncoding.replace(" ", "");
			dataEncoding = dataEncoding.replace("-", "");
			dataEncoding = dataEncoding.replace("_", "");
			if(dataEncoding == "utf8") {
				try {
					validate(rawdata);
				} catch(UTFException e) {
					dataEncoding = "Windows 1252";
				}
			}
		}

		if(dataEncoding != "UTF-8") {
			if(strict)
				data = convertToUtf8(cast(immutable(ubyte)[]) rawdata, dataEncoding);
			else {
				try {
					data = convertToUtf8(cast(immutable(ubyte)[]) rawdata, dataEncoding);
				} catch(Exception e) {
					data = convertToUtf8(cast(immutable(ubyte)[]) rawdata, "Windows 1252");
				}
			}
		} else
			data = rawdata;

		return toUtf8Stream(data);
	}

	private Utf8Stream toUtf8Stream(in string rawdata) {
		string data = rawdata;
		static if(is(Utf8Stream == string))
			return data;
		else
			return new Utf8Stream(data);
	}

	/++
		List of elements that can be assumed to be self-closed in this document.
		The default for a Document is a hard-coded list of ones appropriate for
		HTML. For [XmlDocument], it defaults to empty. You can modify this after
		construction but before parsing.

		History:
			Added February 8, 2021 (included in dub release 9.2)

			Changed from `string[]` to `immutable(string)[]` on
			February 4, 2024 (dub v11.5) to plug a hole discovered
			by the OpenD compiler's diagnostics.
	+/
	immutable(string)[] selfClosedElements = htmlSelfClosedElements;

	/++
		List of elements that contain raw CDATA content for this document,
		e.g. `<script>` and `<style>` for HTML. You can modify this after
		construction but before parsing.
	+/
	immutable(string)[] rawSourceElements = htmlRawSourceElements;

	///
	unittest {
		auto document = new Document();
		document.parseSawAspCode = (string) => true; // enable adding asp code nodes to the dom
		document.rawSourceElements ~= "embedded-plaintext";
		document.parseStrict(`<html>
			<% some asp code %>
			<script>embedded && javascript</script>
			<embedded-plaintext>my plaintext & stuff</embedded-plaintext>
		</html>`);

		// please note that if we did `document.toString()` right now, the original source - almost your same
		// string you passed to parseStrict - would be spit back out. Meaning the embedded-plaintext still has its
		// special text inside it. Another parser won't understand how to use this! So if you want to pass this
		// document somewhere else, you need to do some transformations.
		//
		// This differs from cases like CDATA sections, which dom.d will automatically convert into plain html entities
		// on the output that can be read by anyone.
		assert(document.root.tagName == "html"); // the root element is normal

		int foundCount;

		// now let's loop through the whole tree
		foreach(element; document.root.tree) {
			// the asp thing will be in an AspCode object
			if(auto asp = cast(AspCode) element) {
				// you use the `asp.source` member to get the code for these
				assert(asp.source == "% some asp code %");
				foundCount++;
			} else if(element.tagName == "script") {
				// and for raw source elements - script, style, or the ones you add,
				// you use the innerHTML method to get the code inside
				assert(element.innerHTML == "embedded && javascript");
				foundCount++;
			} else if(element.tagName == "embedded-plaintext") {
				// and innerHTML again
				assert(element.innerHTML == "my plaintext & stuff");
				foundCount++;
			}
		}

		assert(foundCount == 3);

		// writeln(document.toString());
	}

	// FIXME:
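// A small hedged sketch, not from the original source: demonstrates the
// [enableAddingSpecialTagsToDom] convenience method described above, showing
// that an html comment survives parsing as an [HtmlComment] node in the tree
// instead of being stripped. The markup content here is illustrative only.
unittest {
	auto document = new Document();
	document.enableAddingSpecialTagsToDom();
	document.parseUtf8("<html><!-- note to self --><p>hi</p></html>");

	// walk the tree and confirm the comment node was kept
	bool sawComment;
	foreach(element; document.root.tree)
		if(cast(HtmlComment) element !is null)
			sawComment = true;
	assert(sawComment);
}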