#instaparse

generated UTC: 2023-11-22 19:43
latest data: https://clojurians-log.clojureverse.org/instaparse/2023-11-09
messages: 1778
#2015-06-1404:45aengelberg@mhuebert: Re: recent instaparse-live changes: The memoization worked like a charm, the parsing is a lot faster now when only changing the sample. Great idea!#2015-06-1602:00lucasbradstreetHowdy. instaparse-cljs 1.4.0.0-SNAPSHOT works with Clojure 1.7 now. Recent fixes from v1.3.5-1.4.0 haven't been ported to cljs yet. #2015-06-1602:02aengelbergcc @canweriotnow#2015-06-1602:02aengelberg@lucasbradstreet sweet!!! any plans to include tracing features? or is that not as applicable to cljs?#2015-06-1602:04aengelbergA lot of the hacky namespace reloading would probably be not doable on the cljs side. So maybe it would hit performance a little bit to include conditional tracing everywhere#2015-06-1602:04lucasbradstreetDepends on how hard it is to support. I can't throw too much more time into this for now. I think it usually makes sense to dev the parser in CLJ first, though instaparse-live may have changed my mind ;)#2015-06-1602:05lucasbradstreetAh, yeah, if that's required I'll probably pass on it for now. #2015-06-1623:34canweriotnow@lucasbradstreet: awesome. We're digging to find issues with processing certain chars or ranges in the cljs port... “%x41-57” works, but “%x41” doesn’t. “%x79-7A” doesn’t... possibly one of the char fns in instaparse.abnf (cljs) - clj version works fine.#2015-06-1623:35canweriotnowWe'll submit an issue (or hopefully PR) when we track it down.#2015-06-1623:39aengelbergYeah, playing around in my REPL, the %x41 does not work for some reason.#2015-06-1623:45aengelberghttps://github.com/lbradstreet/instaparse-cljs/blob/master/src/cljs/instaparse/abnf.cljs#L118#2015-06-1623:46aengelbergShould be (string (apply str (coerce-char (char-codes num1)))))#2015-06-1623:47lucasbradstreetAh cool. I won't have much time to push these changes atm but if you want to send me a PR I'll merge it. 
#2015-06-1623:48canweriotnowWow, awesome#2015-06-1623:49aengelberg@lucasbradstreet Actually I can't really tell what the purpose of char-codes is.#2015-06-1623:49aengelbergIt seems to take a character (possibly unicode) and split it into two characters. but at the REPL it doesn't seem to do that.#2015-06-1623:52lucasbradstreetJust off the top of my head it is to deal with multibyte chars by getting each byte (charCodeAt)#2015-06-1623:57canweriotnowhmm.. maybe %x41-57 didn't actually work, but just didn't throw like %x41... need to dig deeper.#2015-06-1623:59canweriotnowThe thing we're doing is generating parsers for various URI/IRI schemes from the ABNF in their respective RFCs, so we're problably hitting edge cases like crazy simple_smile#2015-06-1700:06aengelbergThat part of the abnf namespace seems not properly designed for supplementary characters altogether...#2015-06-1700:07aengelbergIt isn't using any utility (unlike the clj version) to turn an oversized integer into a series of two characters.#2015-06-1702:09aengelbergThe get-char-combinator function needs a rework, so ABNF terminals like %x5D-10FFFF can work.#2015-06-1702:10aengelbergJavaScript, unlike Java, does not seem to support regular expressions with \x{10FFFF}.#2015-06-1702:13aengelbergIn instaparse for Clojure, single characters are represented as a string combinator with the surrogate pair (two 16-bit chars side by side), and a character range uses the regex \x{10FFF} syntax. ClojureScript or JavaScript appear to not have much support for either of these things. It may be impossible to support Unicode character ranges in ABNF without introducing third-party js libraries.#2015-06-1702:40aengelbergOK, the former is doable via goog.i18n.uChar/fromCharCode.#2015-06-1704:03lucasbradstreetNice! Yeah, this character support code is probably the weakest part of the port. #2015-06-1704:04lucasbradstreetI'm glad that you're finding these issues. 
I had a feeling there were some lurking issues there.#2015-06-1704:56aengelberggoog has some utils to work with surrogate strings, but the regex (char range) seems impossible without pulling in an external dependency like Regenerate. https://github.com/mathiasbynens/regenerate#2015-06-1705:45lucasbradstreetAh, yeah, I think I’d rather recreate the functionality internally than pull in extra deps. Definitely a bit of a pain though.#2015-06-1705:47lucasbradstreetInteresting https://mathiasbynens.be/notes/javascript-unicode#2015-06-1705:47lucasbradstreetI wonder if this issue is true for all browsers#2015-06-1705:47lucasbradstreethttps://mathiasbynens.be/notes/es6-unicode-regex#2015-06-1705:48lucasbradstreetActually, if you could create a PR with a failing cljs test case that would be a good place to start#2015-06-1717:00aengelberghttps://github.com/lbradstreet/instaparse-cljs/pull/9#2015-06-1717:41aengelbergHmm, now I'm mildly concerned because circleci is passing... ;)#2015-06-1717:52aengelbergHmm, I think that's because there isn't really a notion of the cljs tests "passing" or "failing" (no exit codes)#2015-06-1806:47lucasbradstreet@aengelberg: I gave you commit access to instaparse-cljs, feel free to push anything you want, or create PRs and I’ll review them#2015-06-1815:03aengelbergThanks @lucasbradstreet! I'm working on some changes to clj instaparse that add special combinators for Unicode char ranges; it improves performance and makes it more portable to cljs.#2015-06-1902:20aengelberg@lucasbradstreet: have you considered using cljx for the instaparse source code as well, so it's easier to merge upstream changes?#2015-06-1902:25lucasbradstreetYes, that one is a bit of a trade off. There will definitely be more merge issues and it does kinda uglify the code a lot. 
It mostly depends on whether Mark would prefer separate cljs and clj, or cljx in an eventual upstream merge#2015-06-1902:32aengelbergI was thinking it might actually make it easier to merge upstream changes, because currently if a feature is changed in the clj version, it is quietly merged into the clj source without changing the cljs side at all.#2015-06-2316:34aengelberg[ANN] Instaparse 1.4.1 https://github.com/Engelberg/instaparse/blob/master/CHANGES.md#141#2015-06-2320:49aengelberg@lucasbradstreet: I've started a cljs-1.4.1 branch, that has the upstream commits merged in as well as a cljs port of the Unicode support (the unicode test case now passes!)#2015-06-2402:02lucasbradstreetNicely done!#2015-06-2405:18lucasbradstreet@aengelberg: you're a machine!#2015-06-2405:19lucasbradstreetI'll review the PR. Assuming all is well, I'll push a new snapshot and we can get ready for a prod artefact #2015-06-2405:24aengelberg@lucasbradstreet: Thanks!#2015-06-2405:43lucasbradstreetNew 1.4.1.0-SNAPSHOT on clojars simple_smile#2015-06-2405:44lucasbradstreetI’ll push 1.4.1.0 to clojars once mhuebert gives the OK#2015-06-2405:44lucasbradstreet(or once we’ve got instaparse-live#8 figured out)#2015-06-2405:44lucasbradstreetIt’s great that we’re in sync again#2015-06-2405:47aengelbergWould recommend giving instaparse live "the poop test" a.k.a. insert :shit: at various places in the UI and see what breaks#2015-06-2405:48aengelbergU+1F4A9#2015-06-2405:50aengelberg(or any Unicode supplementary character, but poop is the most fun :D)#2015-06-2505:50aengelbergYay, it works! http://instaparse-live.matt.is/#/-JsdFiDFdOmHFG9kb6YZ#2015-06-2507:49lucasbradstreetsweet!#2015-06-2507:49lucasbradstreetI’ll put out a release artifact then simple_smile#2015-06-2509:16lucasbradstreet1.4.1.0 artifact is up simple_smile#2015-06-2509:16lucasbradstreetGood work#2015-06-2516:04aengelbergThanks for being responsive / receptive to my changes! 
Looking forward to getting feedback from users on the new version.#2015-06-2516:04aengelbergcc @canweriotnow, unicode char ranges have now been restored to sanity 😄#2015-06-2516:04aengelbergABNF char ranges, that is. (unicode support is also there if needed)#2015-06-2523:13canweriotnow@aengelberg thx, awesome!#2015-06-2605:44lucasbradstreet@aengelberg: not a problem. I’m really happy the port is getting some use simple_smile#2015-06-2605:45aengelbergI know, I'm excited as well ;)#2015-07-2121:14marcofisetWhat are some approaches you guys use to interpret your parse tree?#2015-07-2121:22socksyyou can use multimethods dispatching on node type#2015-07-2121:25socksyuse a zipper to navigate the tree#2015-07-2121:33socksyor heck, just use https://clojure.github.io/clojure/clojure.walk-api.html#2015-07-2121:35socksyrecommend using postwalk, and then writing a function to do whatever you’re thinking#2015-07-2121:36socksydepends on how you want to eval, it’s a wee bit harder for compilation, but not much. You just need to make sure you emit a string. One solution would be to treat it as a side effect of whatever function you use#2015-07-2121:36socksyadmittedly, when I last did this I wasn’t using an instaparse’d parse tree, so maybe the technique varies a bit#2015-07-2121:37socksythink I re-implemented postwalk also#2015-07-2121:39socksyseems instaparse you could maybe just use insta/transform#2015-07-2121:40socksywhich must do a walk of the tree, and takes a map of node type to function#2015-07-2122:16aengelberg@marcofiset insta/transform has always been sufficient for my use cases. Is there more sophisticated functionality you're looking for? If so, clojure.walk/[pre|post]walk might be the next step up.#2015-07-2122:21marcofisetI am not looking for anything particular, just wanted to start a discussion on the subject. I'm using multi methods for the moment and I was curious about what other people might be using. 
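A minimal sketch of the insta/transform approach suggested above, assuming a toy grammar for sums of integers. The names `sum-parser`, `sum-transform` and `evaluate` are made up for illustration and do not come from the conversation; only `insta/parser`, `insta/transform` and `insta/failure?` are instaparse functions.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical toy grammar: integers joined by '+'. The '+' tokens are
;; hidden with angle brackets so they never reach the transform functions.
(def sum-parser
  (insta/parser
    "expr   = number (<'+'> number)*
     number = #'[0-9]+'"))

;; One function per node type. insta/transform works bottom-up, so each
;; function receives the already-transformed children of its node.
(def sum-transform
  {:number #(Long/parseLong %)
   :expr   +})

(defn evaluate
  "Parses s and folds the tree into a number. A parse failure is returned
  untouched so the caller can inspect or print it."
  [s]
  (let [tree (sum-parser s)]
    (if (insta/failure? tree)
      tree
      (insta/transform sum-transform tree))))

(evaluate "1+20+3")
;; => 24
```

Compared with multimethods dispatching on node type, the transform map keeps the whole interpretation in one place, at the cost of being a closed set of cases.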
#2015-07-2122:23marcofisetI didn't know about insta/transform, I'll definitely take a look.#2015-07-2122:25aengelbergYeah, instaparse.core/transform is a simple function that does the trick for simple parse tree consumption.#2015-07-2122:25aengelbergHey, that rhymes#2015-07-2122:25aengelberg@marcofiset Take a look at this section of the readme: https://github.com/engelberg/instaparse#transforming-the-tree#2015-07-2122:36marcofisetWow, I'm really impressed with the arithmetic example! Very straightforward and simple. My multi methods solution is going to the trash and will be replaced by something similar 😃#2015-07-2211:20marcofisetI’ve been noticing something with a new project using instaparse. It seems that everything I get out of the parser is wrapped in a list. I don’t recall having seen this behaviour before. Can someone enlighten me? simple_smile#2015-07-2212:02marcofisetAfter a couple of quick tests, it seems to be my particular grammar that causes this, but I'm not sure why.#2015-07-2216:35aengelberg@marcofiset: are you hiding the root tag (`<S> = ...`)? If so, this is by design as explained in this section of the readme: https://github.com/engelberg/instaparse#hiding-tags#2015-07-2220:31marcofisetYou're right, that was it.#2015-07-2220:32marcofisetI decided not to ignore it instead, and introduced a hidden sub-expression which handles the recursivity.#2015-08-0700:59michahello everyone, is there a grammar for clojure source that i can use in instaparse?#2015-08-0701:00michaor EDN would probably work in a pinch too#2015-08-0701:04aengelbergNot that I know of. Maybe this would help, you could port the ANTLR grammar into EBNF somehow https://stackoverflow.com/questions/3902813/is-there-a-language-spec-for-clojure#2015-08-0701:06aengelbergAlthough an EDN parser wouldn't be too hard to make from scratch. It's mostly balanced parens with various types of leafs or delimiters. 
The hard part is strings, handling all the \" stuff#2015-08-0701:07aengelbergAnd comments#2015-08-0701:07michayeah i was thinking that also, really i just need curly braces, because i only want to parse maps#2015-08-0701:07michabut curlies can appear in strings#2015-08-0701:07aengelbergMaps with no lists / sets as keys or values?#2015-08-0701:08michayeah the map can contain anything, but i can presumably slurp that in and just look for the matching curly#2015-08-0701:08aengelbergIf you get the string terminal right, Instaparse will be smart about closing parens even if parens appear within the string#2015-08-0701:09michahow about indentation-based languages?#2015-08-0701:09aengelbergehhhhhh#2015-08-0701:10michahaha yeah#2015-08-0701:11aengelberg"You're thinking in the wrong mindset" https://imgur.com/gallery/M5wl14r#2015-08-0701:12michahahahaha excellent reference there#2015-08-0701:12michathe project i'm working on is a generalized, abstract sort of markdown#2015-08-0701:13michait's designed to mix well with prose, so indentation based structure is a big win#2015-08-0701:14aengelbergSo like, prose at the top level, important stuff indented?#2015-08-0701:17micha
# This is line 1 of a certain type of block.
  The block continues here because of the indentation.

* This could be a list item

  p With a paragraph in it
    that continues on multiple lines...and
    has a strange #(inline something or other
    delimited by hash-parens)#...

  ~~~{:foo "bar", :baz 123} tags can also have
    attributes parsed as EDN...

* here is the next list item
#2015-08-0701:18michathere we go#2015-08-0701:18michathe parser will be a macro really#2015-08-0701:18michait will emit s-expressions#2015-08-0701:18michacalling multimethods#2015-08-0701:18michaso you can implement dispatches for any tags you like#2015-08-0701:19michaso what # foo means is up to you#2015-08-0701:19michathe indentation is crucial for making the thing general without special cases and hardcoded things#2015-08-0701:20aengelbergThe reason "ehhhh" is the visceral response to indentation based langs in instaparse is because in CFGs it's difficult if not impossible to remember how many spaces / tabs you're looking for on each line.#2015-08-0701:20aengelbergSo it's really only a problem if you have chunks within chunks that are indented even more.#2015-08-0701:21michayeah, and i want to support even more tricky things, like indentation plus extra whitespace at the front of the line#2015-08-0701:21michai have a naive handmade parser now to parse the blocks#2015-08-0701:21michait looks for tags that can start a block#2015-08-0701:22michathen it looks for the "outdent"#2015-08-0701:22michaso it doesn't look for a specific amount of indentation#2015-08-0701:22aengelbergExample?#2015-08-0701:22michait looks for a minimum amount of indentation, but you can use more#2015-08-0701:24micha
# This is all
  part of the
    same block
  and the next
  block
    doesnt
      start until the
  an "outdent" is seen

This is an outdent, so
the above block will 
have been ended.
#2015-08-0701:24michahowever,#2015-08-0701:25micha
# This is not
  all part of the same
  # block because this
    tag creates a nested
    block
#2015-08-0701:26aengelbergHmm, what if you parse each block and then run the parser AGAIN on the text in the block to find subblocks?#2015-08-0701:27michahm#2015-08-0701:27michai think i could give up the leading extra spaces thing, too#2015-08-0701:28michaand set indentation to some configurable fixed size#2015-08-0701:28aengelberg
(my-parser text) => ([:block "This is not" "all part of the same" "# block because this" "   tag ..."])
(insta/transform {:block (fn [& strs] (my-parser (str/join "\n" strs)))} *1)
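The re-parsing idea can be sketched as follows, under simplifying assumptions that are not from the conversation: every line ends in a newline, each nesting level is indented by exactly two spaces, and there are no blank lines. The grammar and the names `block-parser` and `parse-nested` are hypothetical. The grammar only understands one level of indentation; nesting comes from calling the parser again on each block's body.

```clojure
(require '[instaparse.core :as insta]
         '[clojure.string :as str])

;; One level only: a block is an unindented header line followed by lines
;; that carry two extra leading spaces (stripped by the hidden <'  '>).
(def block-parser
  (insta/parser
    "doc    = block*
     block  = header body
     header = #'[^ \\n][^\\n]*' <'\\n'>
     body   = line*
     line   = <'  '> #'[^\\n]+' <'\\n'>"))

(defn parse-nested
  "Parses text into a vector of {:header ... :children ...} maps,
  re-running block-parser on each block's de-indented body.
  Throws ex-info when any level fails to parse, so a failure deep in
  the recursion is not silently embedded in the result."
  [text]
  (let [tree (block-parser text)]
    (if (insta/failure? tree)
      (throw (ex-info "malformed block" {:failure tree :text text}))
      (insta/transform
        {:line   identity
         :header identity
         ;; The body lines have lost one level of indentation, so they
         ;; can be fed straight back into the same parser.
         :body   (fn [& lines]
                   (if (seq lines)
                     (parse-nested (str (str/join "\n" lines) "\n"))
                     []))
         :block  (fn [header children]
                   {:header header :children children})
         :doc    (fn [& blocks] (vec blocks))}
        tree))))

(parse-nested "a\n  b\n    c\n  d\ne\n")
```

Throwing from inside the transform is only one way to surface a failed sub-parse; returning the failure object and checking for it at each level would work as well.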
#2015-08-0701:28aengelbergExcept recursively smarter#2015-08-0701:28michainteresting#2015-08-0701:29aengelbergI just thought of this, it might end up being impractical.#2015-08-0701:29aengelbergBut that might be the way to do it.#2015-08-0701:30michai will play around with it and let you know how it works out#2015-08-0701:30michai can at least instaparse the inline stuff, if not the blocks#2015-08-0701:32aengelbergTrue#2015-08-0701:32aengelbergAnyway, now that I think about it, this trick may be applicable to any indentation-based language#2015-08-0701:34michait's also an interesting case because i need to parse "any character that isn't a tag"#2015-08-0701:34michalike the text in between tags#2015-08-0701:34michai think i can use negative lookahead with regex like #"."#2015-08-0701:35michaanyway thanks for the help! i'll let you know how it all works out#2015-08-0701:39aengelbergNo problem, I'd love to hear how it goes#2015-08-1714:30micha@aengelberg: made some progress this weekend with my instaparse project! 
https://github.com/adzerk-oss/zerkdown#2015-08-1714:31michait's a work in progress of course#2015-08-1714:31michait does a braindead parsing of clojure maps/vectors#2015-08-1714:32michahttps://github.com/adzerk-oss/zerkdown/blob/master/src/adzerk/zerkdown/grammar.ebnf#2015-08-1714:32michafeedback appreciated simple_smile#2015-08-1720:18aengelberg@micha this project is very cool#2015-08-1720:18aengelbergDoes your ebnf allow [{]?#2015-08-1720:19michanot in a :CLJ or :VEC block#2015-08-1720:19micha["{"] would be ok though#2015-08-1720:20aengelberg<VEC-CHAR> = !(LSB | RSB | DQ) ANY-CHAR looks like it would allow mismatched map delimiters inside it#2015-08-1720:20michaoh interesting#2015-08-1720:20michayeah it's ambiguous#2015-08-1720:21micha!(LSB | RSB | STRING | MAP) ANY-CHAR would be nice there#2015-08-1720:22aengelbergthen it would still allow mismatched quotes and delimiters because strings and maps don't successfully parse simple_smile#2015-08-1720:22michahmm#2015-08-1720:23aengelbergmaybe (!(LSB | RSB | LCB | RCB | DQ) ANY-CHAR) | STRING | CLJ#2015-08-1720:23michayeah#2015-08-1720:23michai will try that#2015-08-1720:23michai am planning to do the recursion from clojure btw#2015-08-1720:24michai will parse one level of indentation, then for each :BLOCK call insta again on the body#2015-08-1720:25michait seems like it will be straightforward, i hope#2015-08-1720:25aengelbergcool#2015-08-1720:26aengelbergJust make sure instaparse Failures are returned / shortcircuited properly simple_smile#2015-08-1720:26michahow do you mean?#2015-08-1720:26aengelbergif a "sub-parse" returns a failure, then what?#2015-08-1720:29aengelbergI imagine it will be most idiomatic to call insta/parse again within the transformer. 
But my point is, if a parse failure arises (malformed zerkdown) within that, you will need to propagate that error properly simple_smile#2015-08-1720:32michaah right#2015-08-1720:32michawhat did you mean before about strings and maps not successfully parsing?#2015-08-1720:33aengelbergnegative lookahead = make sure this thing does not successfully parse#2015-08-1720:37aengelberg!STRING x means no "complete well-formed strings" allowed, but you probably wanted "no double-quotes of any kind really"#2015-08-1720:41aengelbergand then you can add in | STRING to allow well-formed strings#2015-08-1720:50michaoh i see#2015-08-1720:50michai actually don't care about double quotes#2015-08-1720:51michai just don't want well formed strings, because those can legitimately contain {}[] etc#2015-08-1720:51michai'm not trying to fully parse the clojure data, i just need to know where it ends#2015-08-1720:51michai send it as a string and use clojure.core/read-string on it later#2015-08-1720:52aengelbergthat's fair, but if [{]}] is allowed it's not exactly obvious where it ends simple_smile#2015-08-1720:52michahaha yes#2015-08-1720:53michavery interesting#2015-08-1720:53aengelberganyway I don't think it's super hard to make the delimiters correct. (!(LSB | RSB | LCB | RCB | DQ) ANY-CHAR) | STRING | CLJ#2015-08-1720:53aengelbergThat basically says "no double-quotes, UNLESS there is a well formed string"#2015-08-1720:54michayes that's awesome#2015-08-1720:54michatesting was pretty easy to do, by configuring with different start rules#2015-08-1720:54aengelbergyeah. just don't forget negative testing simple_smile#2015-08-1720:55michaah right#2015-08-1720:55michayeah i didn't think of that#2015-08-1720:57aengelbergreally cool idea. 
what is the intended use case for zerkdown?#2015-08-1721:16michawell i want to use it for just about everything!#2015-08-1721:17michamostly for websites#2015-08-1721:17michabut i can imagine using it for literate programming and things like that#2015-08-1721:17michabut for making webapps it's really nice to have a "prose" syntax you can customize for your use case#2015-08-1721:18michalike normally you have like
# My Title
that compiles down to
<h1>My Title</h1>
#2015-08-1721:20michabut what if you need something like
<h1>My Title <small>The Best Thing Ever</small></h1>
i want to be able to just define a new inline tag for that, like
# My Title <<The Best Thing Ever>>
#2015-08-1721:20michaor even more complex things with behavior and everything, like forms and buttons#2015-08-1812:33michai got the recursive parser construction working! one weird trick instaparse doesn't want you to know about... https://github.com/adzerk-oss/zerkdown/blob/master/src/adzerk/zerkdown/parser.clj#L48-L60#2015-08-1815:09aengelbergA programmer used this one weird trick to handle indented languages... click here to see what! (Lexer-based libraries HATE this!)#2015-08-2623:00nodenameI’m having a little trouble with a simple parser:#2015-08-2623:00nodename(def exp-parser (insta/parser "S = Sexp Sexp = Term | '(' Term* ')' Term = Char+ Char = #'[a-z]'"))#2015-08-2623:01nodenameuser=> (exp-parser "(hi)") [:S [:Sexp "(" [:Term [:Char "h"]] [:Term [:Char "i"]] ")"]] I thought the Char+ would make “hi” a single Term...#2015-08-2707:30ska@nodename: One Char is a character from the range given in the regexp. You allow many Chars with the +. Maybe you want Term = Char#2015-08-2707:31skaChar = #'[a-z]+'#2015-08-2707:32skaSo that the regexp eats the characters. Of course then "Char" is not a really good name anymore and might become sth like Symbol or Name#2015-08-2707:32nodenameYes, that works, but I expected what I had to work too. Don’t see why not.#2015-08-2707:49skaIt actually does that, it's just not the first result. Your grammar is ambiguous. See the result of insta/parses#2015-08-2707:50skaThe reason is, that + in instaparse is not greedy but allows all possible paths whereas + in the regexp would be greedy#2015-08-2707:51nodenameAh, thanks. Do you know how I could modify Term to make it greedy?#2015-08-2707:53skaMy approach would be to use the regexp mentioned above. To me that is kinda the tokenization step in instaparse. 
I don't know if other solutions exists from the top off my head#2015-08-2707:54skaLooks like I tried that already and came to the conclusion that there is no other way: https://github.com/ska2342/sourcetalk14/blob/master/de.skamphausen.stt14/src/de/skamphausen/stt14.clj#L460#2015-08-2707:55nodenameHa, OK, will read! Thanks!#2015-08-2822:52aengelberg@nodename you could use negative lookahead, i.e. Term = Char+ !Char#2015-08-2905:41nodename@aengelberg thanks!#2015-08-3107:34skainteresting, wasn't aware of that#2015-10-1422:23aaelonyhas anyone tried parsing between sql flavors? For example make a Redshift legal sql query a Hive legal sql query, assuming identical table and column names ? Would instaparse be a good fit for this kind of translation?#2015-10-1422:38aengelbergI'm not familiar with the differences of Redshift and Hive formats; how many gotchas are there when going between the two styles?#2015-10-1422:44aengelberg@aaelony: ^#2015-10-1423:23aaelonythere are differences in data types in create table statements, there are also differences in syntax and function calls#2015-10-1423:24aaelonyjust trying to gauge if such translations are a good fit to try in instaparse#2015-10-1423:47aengelbergInstaparse's job is to fully parse a string and return its meaning as a tree of data. If you need to fully parse a sql query and examine the data before you know what to do with it, Instaparse is a good choice. But if the problem could be solved by a regular expression searching the string for patterns, Instaparse might be overkill.#2015-10-1500:52aaelonyThat is useful, thank you. I think there are things idiomatic to each but maybe regexes are sufficient after all #2015-10-1509:14skaFunny enough, I spent the morning writing a grammar for some SQL inspired tiny query language simple_smile#2015-10-1516:37aaelonyyes, my initial thoughts were that sql syntax docs could be parsed in some way, e.g. 
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html compared to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#2016-01-1208:01meowwhat is performance like for instaparse - someone I'm working with had a problem in the past with it being slow but never tracked down the source of the problem#2016-01-1210:12skaWhen parsing becomes slow, it might be due to the grammar being ambiguous. In general, I'd say that Instparse is good enough for many use cases, even if I've read papers complaining about the perf. The one that I still have was written in German, though.#2016-01-1210:27meowI'll tell the #C0J20813K devs to come here to find out how to parse fast#2016-01-1210:28meow@rafd: ^#2016-01-1210:29meow@ska: thanks for the info! simple_smile#2016-01-1213:06rafd@jamesnvc ^#2016-01-1215:47jamesnvchello#2016-01-1215:47jamesnvcYeah, I really like using instaparse server-side, but I tried writing one in cljs (pretty simple, just extracting links) and it was noticeably slow...#2016-01-1215:48jamesnvcit was being called on a lot of text, but is there something I should be doing?#2016-01-1215:55ska@jamesnvc: sometimes it is possible to chunk the text before parsing it. For example, when parsing logfiles use instaparse just for the lines not the whole file.#2016-01-1215:55jamesnvcThis is a bunch of little chunks of text; being called on the text of each message in a chat client#2016-01-1215:56jamesnvcI’ll try writing the grammar again and make sure it isn’t ambiguous#2016-01-1216:15ghadiI have a new parser that should be much faster than instaparse if you have no ambiguity#2016-01-1216:15ghadihttps://github.com/ghadishayban/pex#2016-01-1216:15ghadineeds better docs 😃#2016-01-1216:16ghadihttps://groups.google.com/d/msg/clojure/2ph-6o_Zydc/0O2DRDXBAwAJ#2016-01-1216:35jamesnvcoh cool, I’ll give that a shot, thanks!#2016-01-1216:37jamesnvc@ghadi oh, does this work in clojurescript though? 
That is my issue with instaparse — my clj perf is fine, but cljs leaves something to be desired#2016-01-1216:37ghadioh, no it doesn't#2016-01-1216:38ghadishould perform well there too, want to port a virtual machine ? 😉#2016-01-1216:38ghadihttps://github.com/ghadishayban/pex/blob/master/src-java/com/champbacon/pex/impl/PEGByteCodeVM.java#2016-01-1216:50jamesnvcOh, interesting...#2016-01-1216:50jamesnvcI may consider that, if just as a fun project!#2016-01-1217:07ghadiPeg.js is pretty nice, IMHO#2016-01-1301:11lucasbradstreetHi @jamesnvc, cljs perf is definitely a bit slow. You need to be using advanced mode compile otherwise it’s incredibly slow.#2016-01-1301:11lucasbradstreetI’m the guy who did the port#2016-01-1301:12jamesnvc@lucasbradstreet: Cool, thanks simple_smile Good to know I’m (maybe) not just doing something crazy#2016-01-1301:12lucasbradstreetDepending on what you’re doing, you can also serialise the parser definition and load it in directly#2016-01-1301:12lucasbradstreetCreating the initial parser in cljs can take quite a while, but the parsing itself can be pretty acceptable#2016-01-1301:19meowand my money was on crazy @jamesnvc#2016-01-1301:20meowcan any of that "job" be split between client and server?#2016-01-1301:20meowsay for a chat app#2016-01-1301:21meowjust send the user keystrokes to the server - do it in clj there#2016-01-1301:21meowjust brainstorming#2016-01-1301:21meowoutloud#2016-01-1301:23meowdoesn't each keystroke go to the server already - that's how you can display the fact that the user is typing#2016-01-1301:24meowso don't do any processing on the client - do the instaparse on the server#2016-01-1301:24meowand use yada or onyx or something to scale it#2016-01-1301:24meowwe can segregrate services on the server and compose them#2016-01-1301:25meowcompose microservices on the server and keep the client relatively stupid whenever the data is already on the server#2016-01-1301:36lucasbradstreetI was parsing excel formulas on the client 
and it was good enough#2016-01-1301:36lucasbradstreetCertainly faster than a round-trip to the server#2016-01-1301:41meowok#2016-01-1301:42meowI'll defer to @jamesnvc since I'm just blowing smoke#2016-01-1301:45lucasbradstreetMy overall experience was that creating the initial parser was very expensive, but overall parsing was OK, but that it had to be in advanced mode. All with a big chunk of “your mileage may vary”. Unfortunately I don’t have any time to work on performance any further#2016-01-1311:08jamesnvcCool, I was thinking of splitting it between client and server, but I will give it a shot with advanced compliation too#2016-01-1311:15lucasbradstreetAlso works. I’d measure how long it takes to do the individual parses, not just page load time - because that will be affected by creating the initial parser#2016-01-1313:20meow@jamesnvc: we should test both and not make assumptions either way, imnsho#2016-01-1313:21meow@lucasbradstreet: thanks for the help and suggestions - much appreciated#2016-01-1313:23lucasbradstreetAgreed. Though you have to assume some variability in request latency when testing the other method, which is why I ultimately went with the client side approach. That said, you can have slow CPU clients too. #2016-01-1313:26meowthen we should simulate issues with both environments and various combinations/permutations#2016-01-1313:26meowask the #C0J20813K team how good I am at doing that#2016-01-1313:27meowissues, oh yeah, I got issues#2016-01-1314:48lucasbradstreetha 👍#2016-01-1315:06meowsimple_smile#2016-02-0820:42wongisengHi, very basic question probably not specific to instaparse. From this basic example : "S = N | (N ('+' N)+); N = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';" if I want to enforce that not all N are 0s, should this be done in the grammar definition or by adding some logic on processing the parsed result ? 
I suspect the latter, but just in case anyone knows other ways to enforce this restriction directly in the grammar, I'd like to know. TIA#2016-02-0821:32aengelbergHi @wongiseng, I saw your question on gitter as well. Instaparse's job is to turn strings into meaningful data; any validation you want to do on that data probably should happen after the parse.#2016-02-0821:33aengelbergThe only real way to have more sophisticated validation on an input is to use lookahead and negative lookahead.#2016-02-0821:33aengelbergWell, those are the only ways to do sophisticated validation within instaparse.#2016-02-0821:34aengelbergIn this particular example you could use negative lookahead, e.g. S = !('0'*) (N | (N ('+' N)+));#2016-02-0821:34socksythis works, but it's ambiguous:
(def minimum-one-not-zero
  (insta/parser
    "EXP = N | S;
    S = (ZN '+')* N ('+' ZN)*;
    N =  '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
    ZN = '0' | N;"))
#2016-02-0821:34aengelberg^ That would work as well#2016-02-0821:35socksy(tested)#2016-02-0821:36aengelbergThe advantage to writing your own validation after the parse is that when the input is wrong, you can write your own error message to say whatever you want instead of instaparse's failure message which might not be as readable.#2016-02-0821:36socksy^definitely
Parse error at line 1, column 4:
0+0
   ^
Expected:
"+"
#2016-02-0821:36aengelbergoops, my negative lookahead approach definitely wouldn't work because I totally didn't see the pluses in the input#2016-02-0821:38aengelbergMaybe
S = &(#".*[1-9]") (N | (N ('+' N)+));
#2016-02-0821:38aengelberge.g. "make sure there's some nonzero number somewhere, then parse as usual"#2016-02-0821:39aengelbergthat's lookahead not negative lookahead#2016-02-0821:40socksyif errors aren't important, and the fact you might get the "wrong" evaluation (e.g. "1+0+1" could be [:EXP [:S [:N "1"] "+" [:ZN "0"] "+" [:ZN [:N "1"]]]] or [:EXP [:S [:ZN [:N"1"]] "+" [:ZN 0] "+" [:N 1]]]) is also unimportant (e.g. you eval N and ZN the same), then you should be fine with the ambiguous grammar#2016-02-0821:41socksy(instaparse gives you the former)#2016-02-0821:45aengelberg@socksy how about
S = N ('+' N)* | (N '+')* '0' ('+' ZN)*;
#2016-02-0821:45aengelbergI'm just writing these off the top of my head, not evaluating them to be sure. I think that would be unambiguous though#2016-02-0821:46aengelberghmm, that's definitely wrong 🙂#2016-02-0821:46aengelbergnot sure where that came from#2016-02-0821:47aengelbergUsing lookahead would likely be the easiest path, since the grammar would be unambiguous and easy to understand
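A sketch of the lookahead approach suggested above: first check (without consuming input) that a non-zero digit occurs somewhere ahead, then parse the sum as usual. The rule names and the single-digit `N` are assumptions for the sake of a small example, not the asker's real grammar.

```clojure
(require '[instaparse.core :as insta])

(def at-least-one-nonzero
  (insta/parser
    ;; &(...) is instaparse's positive lookahead: it must match at this
    ;; position but consumes nothing and adds nothing to the tree.
    "S = &(#'.*[1-9]') N (<'+'> N)*
     N = #'[0-9]'"))

;; A sum containing a non-zero digit parses normally.
(at-least-one-nonzero "0+0+1")

;; An all-zero sum is rejected by the lookahead.
(insta/failure? (at-least-one-nonzero "0+0"))
```

The grammar stays unambiguous because the lookahead only gates the parse; the shape of the tree is decided entirely by `N (<'+'> N)*`.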
#2016-02-0821:57wongisengCool, thanks for the explanations, I'll play a bit with look ahead, but eventually I guess i'll validate after the parse#2016-02-0821:58wongisengThe negative lookaheads makes the grammar hard to digest for me#2016-02-0823:21wongisengFor now I use @socksy's approach simple_smile https://github.com/wibisono/gnip-rule-validator-clj/blob/master/gnip-rule.bnf thanks a lot!#2016-02-0823:26wongisengMy actual problem was OR to have at least one positive term#2016-04-1105:51conawhey, anyone know how to match the \ character?#2016-04-1105:52conawI’m trying to match strings within a parsed file, something like this string = '\"' #'[^(?<!\\\)\"]* '\"'#2016-04-1105:53conawI keep getting either errors of unmatched parens, or instaparse errors when I’m doing#2016-04-1105:53conaw\ or \\#2016-04-1106:17conawfigured it out#2016-04-1106:17conawIf anyone’s interested#2016-04-1106:17conawstring = '\"' (#'[^\"]' | '\\\\\"') '\"'#2016-04-1106:19conawI’m know there should be a way to do it with lookbehind inside the regex, but at least now I only have one problem#2016-04-1106:19conawWould be great to have an instaparse wiki for common grammars, if that doesn’t already exist somewhere#2016-04-1106:20conawalso, would be great to know if anyone is using a combination of instaparse and any of the nlp libraries#2016-04-1209:06skaWhat kind of combination are you thinking about, @conaw ?#2016-04-1209:10skaOh, and regarding your string question, I did something similar with finding regexps in a query language which would be enclosed by slashes and allowed backslash-escaped slashed inside. The regexp for this was so weird, I completely forgot, how it worked, but here it is:
REGEXP = <'/'> #'(?:.(?!(?<![\\\\])/))+.?' <'/'>
(the grammar is defined in a Clojure string, thus the massive escaping)
#2016-04-1209:10conawnot sure yet to be honest — I’d like to be doing POS tagging, and tokenizing, but really enjoying instaparse and curious if anyone has used it in conjunction with something like opennlp#2016-04-1209:13skaI once did a workshop on Clojure with very basic NLP examples (it was at a faculty for computational linguistics), but I did not combine it with any existing NLP libraries. Here at work, the NLP stuff is mostly self-written as much of it predates the open source libs. And we do not (yet?) use Clojure in that area.#2016-04-1209:15skaHm, looks like I never polished that workshop to put it online somewhere. Sorry.#2016-04-1209:16skaBut you may be interested in the instaparse talk here: https://github.com/ska2342/clojure-talks/blob/master/instaparse/de.skamphausen.instaparse/src/de/skamphausen/instaparse.clj#2016-04-1209:16ska(enough boasting now; please excuse the self-plugging)#2016-04-1209:25conawNot boasting at all, I appreciate the link.#2016-04-1209:39conawAnother thing — Is there an idiomatic way to get the matched portion of a string for a given portion of a parse into the final transformed clojure data#2016-04-1209:39conawI’m trying to parse the same text multiple times iteratively — passing the result to a different more granular parser based on the first#2016-04-1209:40conawbasically I’m trying to split the text up using a parse#2016-04-1209:44conawspans looks like#2016-04-1211:14skaThere is a :partial option but it only returns the parse tree as far as it could be parsed. Maybe the total mode would help? Can't say. Sorry.#2016-04-1214:34ska@conaw, I just found the span function which takes a parse tree (result of parsing) and returns start and end index into the string. So, you could first parse partially and then as your input string for the covered substring.#2016-04-1214:37skaLike this:
(let [s "abcd"
      g "Q='a' 'b'"
      p (i/parser g)
      t (p s :partial true)]
  (apply subs (into [s] (i/span t))))
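Building on the span example above, here is a rough sketch of pulling out the matched text for every node with a particular tag, by walking the hiccup tree with `tree-seq`. The grammar and the helper name `text-for-tag` are invented for illustration; only `insta/parser` and `insta/span` come from instaparse.

```clojure
(require '[instaparse.core :as insta])

;; Toy grammar: runs of a's and b's, each run tagged :A or :B.
(def ab-parser
  (insta/parser
    "S = (A | B)*
     A = #'a+'
     B = #'b+'"))

(defn text-for-tag
  "Hypothetical helper: substrings of input covered by nodes tagged with tag."
  [tree tag input]
  (for [node (tree-seq vector? rest tree)   ; children = everything after the tag
        :when (and (vector? node) (= tag (first node)))]
    (let [[start end] (insta/span node)]    ; span reads the node's metadata
      (subs input start end))))

(text-for-tag (ab-parser "aabba") :A "aabba")
```

Each substring returned this way could then be handed to a second, more granular parser.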

#2016-04-1214:39ska(sorry for the broken indentation)#2016-04-1401:00conaw@ska yup, that was my last remark, should have used the code marker to make it more clear. I did have a little trouble figuring out how to get only the span details for particular tags though — my guess is that I should use tree-seq for that#2016-04-1401:01conawand now I know how not to use the code marker#2016-04-1409:27ska@conaw, ah now I understand your last comment. I misread it for an unfinished sentence and later forgot about it. Then span was a surprise to me. 😄#2016-04-2103:00bwstearnsDoes anyone have any quick guidance on this question at SO: https://stackoverflow.com/questions/36706854/instaparse-series-of-numbers-or-letters-as-one-leaf#2016-04-2103:01bwstearnsI think it might be an instance of lacking the right words to google for the answer effectively.#2016-04-2103:21aengelberg@bwstearns: You could concatenate all the strings as a transform step.#2016-04-2103:23aengelbergi.e. unhide the letter and number tags, but add :letter str, :number str into your transformer map.#2016-04-2103:24bwstearns@aengelberg: that's what I'm doing now. Because I'm doing it for a bunch of tags I was wondering if there was something built in for handling that as a common case or not.#2016-04-2103:24aengelbergOther than regexes, there's no way to concatenate strings in a way specified entirely by the instaparse grammar.#2016-04-2103:25aengelbergThe transform approach is the easiest way I could think of out of all the "do something to the tree, fresh out of the parser" possible approaches.#2016-04-2103:26bwstearnsThat makes sense. I think what I'll do is put the preprocessor transforms into another hash to keep the more meaninful transform actions less cluttered and then merge them right before usage.#2016-04-2103:27aengelbergThat can work. Or just call insta/transform twice, if you don't mind the performance impact of traversing the tree twice.#2016-04-2103:28bwstearnsthat works too. 
I don't think I have any performance issues on the horizon with this project.#2016-04-2103:32bwstearns@aengelberg: thanks a ton for taking a look. The question got some foot traffic but no feedback. If you're looking for internet points feel free to drop what you said in there and I'll accept it. Otherwise I'll copy it in as an own-answer for the next person.#2016-04-2103:50aengelberg@bwstearns: Any time! I've added an answer to your post#2016-04-2103:51bwstearnsawesome. thanks. Didn't think about the performant part, is that primarily due to the extra step of having to transform it or is it because regexes are inherently faster than using parser rules?#2016-04-2103:58aengelberg#'a+' is faster than 'a'+, as letting regexes do the work of searching for all possible "a"s is faster than having instaparse do that work#2016-04-2104:03aengelberg@bwstearns: ^#2016-04-2104:16bwstearnsRight, that makes sense because of the greediness. Thanks a ton for taking the time on this.#2016-04-2104:20aengelbergIt's not exactly *because* of the greediness, it's just speedier when a Java program is doing this task than Clojure simple_smile#2016-04-2321:41bwstearnsIn the wikipedia conventions section on EBNF it suggests that you can have bounded repetitions (https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form#Conventions), but the documentation doesn't say anything about it and I poked around the source and couldn't find anything in cfg that suggested it was there. Is the best way to get bounded reps to just merge in those rules from combinators?#2016-04-2706:39aengelberg@bwstearns sorry, didn't see your question until now. The instaparse ABNF notation (which is explained in ABNF.md) supports repetition with any lower and/or upper bound. e.g. (insta/parser "S = 3*5('a')" :input-format :abnf) matches 3 to 5 as. 
The default EBNF notation does not support it, so unless you want to switch to ABNF (a few conventions would have to change) then merging in combinators is the best approach.#2016-04-2714:05bwstearns@aengelberg: Sorry, I should have followed up with what I found, which is pretty much what you just said lol. I ended up implementing it by making a rep combinator and merging it into the main grammar, and then after looking at how unreadable it was getting I just threw a #'\d\d' regex at it (or something similar, I forget which part of the thing I was trying to parse). Thanks for following up though. Also, I knew I recognized you from something. Really enjoyed the automata presentation (I am of the youtube audience, haven’t had the good fortune to get to a clojure event in person yet).#2016-04-2907:35aengelberg@bwstearns: Glad you liked the talk! simple_smile If you have any more questions just @-mention me or say the word "instaparse" which also notifies me#2016-06-0108:32skaHi. I am aware that Instaparse collects the offsets to the strings it parses in the metadata of the elements of the parse tree. I would like to turn the parse tree into XML (using :output-format :enlive and clojure.xml/emit) and get the offset information somehow into the XML tree. Any pointers or ideas how to achieve this?#2016-06-0212:21skaDoes instaparse support special symbols for EOF and/or BOF?#2016-06-0216:06skaHm, first experiments give hint that having <EOF> = <#'\\Z'> in my grammar works.#2016-06-0216:37aengelbergYou could use the regex symbol for end of string, #"$" I believe#2016-06-0321:10davehey everyone! this is a shot in the dark, but is anyone interested in helping me speed up alda's parser? 
i have some notes here: https://github.com/alda-lang/alda/pull/238 i have to run at the moment, but thought i'd leave this here in case someone is interested in helping me tackle this#2016-06-1406:18aengelberg@ghaz FYI https://github.com/aengelberg/cljsee#2016-06-1406:18aengelberg@lucasbradstreet ^#2016-06-1406:19aengelbergI wrote this plugin with Instaparse in mind as a use case. I'm hoping this can help us get instaparse-cljs to (non cljx) portable code and maybe even merge this upstream.#2016-06-1406:46lucasbradstreetThanks @aengelberg. I retweeted your tweet about it because it's awesome! Nice work. Unfortunately I have absolutely no time to help with a port right now :(, but the approach sounds like a great way to go #2016-07-0401:53turbopapeWhat's the status of cljs support ? #2016-07-0401:54turbopapeIs it still living as a fork?#2016-07-0401:57aengelbergThe only cljs support still lives in lbradstreet/instaparse-cljs#2016-07-0401:59aengelbergBut I'm currently in the process of rewriting instaparse-cljs into a form that we'd be willing to accept back into upstream, now that cljsee exists#2016-07-0407:46aengelberg@seylerius: Here's a grammar that parses exponents like you were you asking:
boot.user=> (def p (insta/parser "
<S> = ows (exponent ows)+
<exponent> = token <'^'> super
super = token | <'{'> token <'}'>
<token> = #'[^\\s\\^{}]+'
<ows> = <#'\\s*'>
"))
#'boot.user/p
boot.user=> (p "foo^2 x^{x+1}")
("foo" [:super "2"] "x" [:super "x+1"])
This parser is pretty naive about the range of possible inputs, since I'm not totally sure myself what that range of inputs is in your use case.
#2016-07-0416:43seyleriusThanks!#2016-07-0416:47seyleriusAnother question: * / + = & ~ can appear in singles without being tokens. How would you represent that? Current parser: http://sprunge.us/GNDe#2016-07-0416:54seylerius@aengelberg: What I have will do for the moment, but it's a part of the spec I'd like to meet eventually.#2016-07-0417:03AndyHi, We switched recently for parsing user input using plain regex to instaparse. Code looks way better. However there are two corner cases where I am not sure what would be idiomatic way: 1) parsing of certain domain of inputs should result on noop. Our current solution is:
"sentence = define / explain / help / catchall
<<skipped definitions>>
 catchall = #'(.|[\n\r])*'"
with the intention of just ignoring the last part during transformation: :catchall (fn [_] nil). Now I wonder if there is another way to catch this case and ignore it without using exceptions. 2) `'(.|[\n\r])*'` comes with |, which on the JVM leads to recursion and might result in a stack overflow. In fact, it happened once to us. Is there a better way to write catchall which would account for anything, including \n and \r?
#2016-07-0417:10aengelberg@happy.lisper for catchall you could do #'[\s\S]*'#2016-07-0417:10Andyty#2016-07-0417:11aengelbergSo your use case is: "Parse the entire string as a define, an explain, or a help, but if that doesn't work then return nil"?#2016-07-0417:11aengelbergBecause you could just run the parse and a transform, then check (insta/failure? result)#2016-07-0417:11Andyyes, where nil is just a signal to ignore the input.#2016-07-0417:13aengelberg
(def p (insta/parser ...))
(let [result (p input-string)
      transformed (insta/transform {...} result)]
  (when-not (insta/failure? transformed)
    transformed))
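A self-contained sketch of the same pattern, for anyone who wants to try it at the REPL. The two-rule command grammar and the name `parse-command` are stand-ins invented here; the real define / explain / help rules were not shown.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical stand-in for the define / explain / help grammar.
(def command-parser
  (insta/parser
    "sentence = define | help
     define = <'define '> #'\\w+'
     help = <'help'>"))

(defn parse-command
  "Returns transformed data, or nil when the input does not parse."
  [input]
  (let [result (insta/transform
                 {:define   (fn [word] {:op :define :word word})
                  :help     (fn [] {:op :help})
                  :sentence identity}
                 ;; A failure object flows through transform unchanged.
                 (command-parser input))]
    (when-not (insta/failure? result)
      result)))
```

Here `(parse-command "define foo")` returns `{:op :define :word "foo"}` and anything unrecognised returns `nil`, so no catchall rule is needed.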
#2016-07-0417:14aengelbergNote that insta/transform is specifically designed to pass through failures#2016-07-0417:15AndyLet me consider that 🙂.#2016-07-0417:19aengelberg@seylerius: Given an input ~a ~b, how do you know the a and b are to be parsed as individual ~'s, as opposed to a code string of "a " followed by "b"?#2016-07-0417:24seylerius@aengelberg: If I'm reading this correctly, the characters touching the inside of the tokens need to be alphanumeric, or at least non-whitespace.#2016-07-0417:27aengelbergso *a b c* shouldn't be allowed?#2016-07-0417:28aengelbergthe current grammar that I suggested would allow that. Just trying to get a sense of the range of inputs so I can help design a parser accordingly#2016-07-0417:29seylerius*foo* *bar* [:b "foo" "bar"] foo* bar* "foo* bar*" #2016-07-0417:33seylerius@aengelberg: that make sense?#2016-07-0417:34aengelbergfor the first example do you mean [:b "foo"] [:b "bar"]?#2016-07-0417:37aengelbergis there a guarantee that *a**b* won't happen?#2016-07-0417:38seylerius@aengelberg: Yes. And guarantee? No. Ambiguity in the spec we can lock to an interpretation? Yes.#2016-07-0417:45seyleriusWe basically get to decide if that's a pair of bold characters or a flat string we'll leave be.#2016-07-0417:45seyleriusIt would only likely happen as a typo.#2016-07-0417:45seylerius(Or a stupid user)#2016-07-0417:48seylerius@aengelberg: I'm basically upgrading organum. Sample org file: http://sprunge.us/KBbL#2016-07-0417:51aengelberghmm, thinking through how to enforce alphanumeric chars on the insides of tokens.#2016-07-0417:52aengelbergdoing a "lookbehind" on the last * is nontrivial.#2016-07-0418:01seyleriusWhat if I stripped leading and trailing whitespace before parsing, and modified the base string rule to start and end alphanumeric? Would that be easier?#2016-07-0418:05seyleriusBut, no, that wouldn't quite work.#2016-07-0418:11seylerius@aengelberg: Will the parser ignore escaped tokens, like \*?#2016-07-0418:12seyleriusAch. 
Clojure doesn't like \* in a string#2016-07-0418:30seylerius@aengelberg: Is here any way to mark tokens to not be parsed?#2016-07-0418:33Andywould angle brackets <> to hide parsed elements work?#2016-07-0418:35aengelberg@seylerius you'd have to do \\* if inside a Clojure string#2016-07-0418:36aengelbergthe goal is to avoid parsing *a * as [:b "a "]#2016-07-0418:37seylerius@aengelberg: Anything special I have to do to mark that? I just tried parsing \\*foo\\* and got ("\\" [:b "foo\\"]) #2016-07-0418:38aengelberginstaparse doesn't automatically handle backslashes in any special way besides what has been defined in your grammar.#2016-07-0418:41seyleriusOkay. How do you define a simple backslash replacement in this type of grammar, then?#2016-07-0418:45aengelbergMaybe replace <string> with:
<string> = '\\\\*' | #'[^*/_+=~^_\\\\]+'
user> (inline-markup "a\\* b")
("a" "\\*" " b")
#2016-07-0418:46aengelbergPretty messy, I know. (four backslashes 🙄)#2016-07-0418:48aengelbergI don't know if this solves your problem though; you don't want to escape *'s in every ** My Subsection text, do you?#2016-07-0418:49aengelbergsorry if I'm a bit unhelpful; phasing in and out of AFK#2016-07-0418:50seyleriusI'm thinking I'm just going to tell users that if they want a plain * they have to escape it.#2016-07-0418:51seyleriusHeadlines are already handled by the time this stage of parsing is invoked, so those won't be an issue.#2016-07-0418:53seyleriusAnd your special case of *a**b* is apparently already readily converted to ([:b "a"] [:b "b"])#2016-07-0420:11seylerius@aengelberg: Separate (earlier stage) parser: Is it possible (other than by having respective rules for #'^* ', #'^** ', #'^*** ', etc) to easily produce h1, h2, h3, etc?#2016-07-0420:20seyleriusActually, yeah. Just don't hide the token, and I can put that through a counter after the fact.#2016-07-0500:35seylerius@aengelberg: I'm trying to make blank lines in one parser flag as :blank, but they're staying as empty seqs. Parser: http://sprunge.us/RcOf Tester: http://sprunge.us/GGdK#2016-07-0500:39aengelbergcurrently that parser doesn't account for any newlines (`\n`) between the lines / blank lines, is that intentional?#2016-07-0500:54seyleriusThe library I'm modifying reads the file into a line-seq initially, so I'm mostly just going with that.#2016-07-0519:26seylerius@aengelberg: Think it would be easier if it was parsing the original, and not a line-seq?#2016-07-0519:30aengelbergit may be useful to, instead of line-seq, use a different parser on the original input that identifies the sections / subsections but not the inline syntax.#2016-07-0519:32seyleriusYeah. I'm already splitting it into multiple instaparsers. I take it insta can handle multi-line input?#2016-07-0519:34aengelbergyeah, just make sure all your strings / regexes handle them. 
All characters are equal citizens in instaparse input, it's up to the grammar to handle what it wants to handle. And make sure the grammar handles CRLFs (`\r\n`) which may appear.#2016-07-0522:22seylerius@aengelberg: So make sure my regexps are multi-line and whatnot?#2016-07-0522:27aengelbergyep#2016-07-0522:27aengelberge.g. . inside a regex matches any non-newline character#2016-07-0522:35aengelbergalso, \s inside a regex handles any kind of whitespace (including newlines)#2016-07-0522:40seyleriusGood caveats to know, thanks#2016-07-0706:05seylerius@aengelberg: Any particular instaparse way to go multi-line, or just specify it in the regexps?#2016-07-0717:10aengelberg@seylerius not sure what exactly you're confused about, but here are some examples: imagine you're parsing the following input:
aaaa
bbb
cccccc
the grammar could look like
S = A '\n' B '\n' C
A = 'a'+
B = 'b'+
C = 'c'+
#2016-07-0717:10aengelbergor
S = #'a+\n' #'b+\n' #'c+'
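A runnable sketch of the first grammar above, written as a Clojure string and loosened to accept CRLF line endings as well, as recommended earlier. The `nl` rule name is an assumption added for this example.

```clojure
(require '[instaparse.core :as insta])

(def lines-parser
  (insta/parser
    ;; Inside a Clojure string the regex backslashes are doubled.
    "S = A nl B nl C
     A = 'a'+
     B = 'b'+
     C = 'c'+
     (* a hidden line break: LF or CRLF *)
     <nl> = <#'\\r?\\n'>"))

;; Unix line endings.
(lines-parser "aaaa\nbbb\ncccccc")

;; Windows line endings parse to the same tree.
(lines-parser "aaaa\r\nbbb\r\ncccccc")
```

Hiding `nl` keeps the line breaks out of the tree, so both inputs produce identical results.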
#2016-07-0717:11aengelbergeither \n or \\n would work if you are inside a Clojure string.#2016-07-0717:13aengelberg
S = A ows B ows C
A = 'a'+
B = 'b'+
C = 'c'+
(* optional whitespace *)
<ows> = <#'\s*'>
#2016-07-0717:13aengelbergFor that example, inside a Clojure string you would need to change \s to \\s.#2016-07-0717:42seylerius@aengelberg: Also, how would you modify what would normally be a .* to not eat an optional :[ that follows it? Or would you just post-process that out after?#2016-07-0717:44aengelbergso you're trying to parse #"shown-part hidden-part" but only return "shown-part" in the parse result?#2016-07-0717:46aengelbergyou could use the regex lookahead to omit it from the result, but then actually parse it (with instaparse's <> hiding feature) in order to properly advance the parser.
S = #'shown-part(?=hidden-part)' <#'hidden-part'>
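A concrete, hedged instance of that trick: a title that stops before a trailing `:tag:` block. The regexes are guesses at a simplified headline format, not the actual org-mode rules being discussed.

```clojure
(require '[instaparse.core :as insta])

(def title-only
  (insta/parser
    ;; The first regex matches as little as possible, and only when
    ;; whitespace plus a colon follows; the lookahead consumes nothing.
    ;; The second regex then consumes the tags, hidden by the <>.
    "S = #'[^:]+?(?=\\s+:)' <#'\\s+:\\S+:'>"))

(title-only "Intro :a:b:")

;; Without a tag block the lookahead cannot succeed.
(insta/failure? (title-only "Intro"))
```

Dropping the angle brackets around the second regex would keep the tag text in the tree as a separate string.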
#2016-07-0717:50aengelbergor unhide the second #'hidden-part' if you actually do want it in the parse tree, but separate from #'shown-part'.#2016-07-0718:39seylerius@aengelberg: More like org-mode headlines allow tags at the end in that style. Not hidden so much as separate.#2016-07-0804:28seylerius@aengelberg: Basically, how would you parse an optional non-hidden token that follows a token that may contain things somewhat similar to the optional non-hidden token?#2016-07-1118:21seyleriusAh, I see what you're talking about#2016-08-1800:35uwoIs there anyway to prevent intsta/parse from printing to the repl on error?#2016-08-1902:47seyleriusHrm. Is there a way to make some tokens higher priority than others? This parser (http://sprunge.us/hFCU) eats the entire input file, failing to break out the initial metadata.#2016-08-1902:49seyleriusWhen I try to make the content token reluctant (adding a ? to the *), it fails to match when the content section begins.#2016-08-1914:11dave@seylerius you could use ordered choice https://github.com/Engelberg/instaparse#ordered-choice#2016-08-1914:12davedefine a rule that could be one or the other, using / instead of |, and put the one you prefer first#2016-08-1914:13davealthough, it looks like your title rule is probably consuming everything#2016-08-1914:13dave#'.*'#2016-08-1914:13davethat will consume everything#2016-08-1914:14seyleriusNope, it's not eating newlines.#2016-08-1914:14seyleriusThis looks like it's going to do it.#2016-08-1914:14daveoh, you're right!#2016-08-1916:45aengelberg@uwo: instaparse doesn't print anything when a failure occurs. It returns a instaparse.Failure object which happens to print in a special way at the REPL#2016-08-2715:55seyleriusOkay, I'm producing hiccup-style structures from inataparse. 
I need help figuring out how to re-parse specific items within the structure.#2016-08-2715:56seyleriusSolo strings (unmatched with a tag) are one of the types I need to re-parse in place#2016-08-2715:57seyleriusWait, this would go better in #C03S1KBA2#2016-08-2715:59seyleriusDidn't realize I was still in instaparse#2016-08-2800:21aengelberg@seylerius this is a good place for that.#2016-08-2800:21aengelbergYou could put further "insta/parse"s in the functions inside the "insta/transform" map#2016-08-2800:21seyleriusWat#2016-08-2800:21seyleriusThis is awesome.#2016-08-2800:23aengelberg(insta/transform {:x (fn [s] (insta/parse otherparser s))} (insta/parse firstparser s))#2016-08-2800:23aengelbergHard to bang out a good example on mobile#2016-08-2800:23seyleriusLolyep.#2016-08-2800:23seyleriusThat looks fascinating.#2016-08-2800:24aengelbergIt would get weird if the nested parser had an error though.#2016-08-2800:24seyleriusYeah.#2016-08-2800:25seyleriusSo how deep does it go looking for :x?#2016-08-2800:26seyleriusAnd how do you make it check for loose strings?#2016-08-2800:26aengelbergIt does a full traversal of the hiccup / enlive, as long as all structures around the :x are valid hiccup / enlive#2016-08-2800:27seyleriusNice#2016-08-2800:59seylerius@aengelberg: How do you get solo strings?#2016-08-2821:01seyleriusGah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum#2016-08-2821:02seylerius@aengelberg: Got any clues?#2016-08-2821:03seyleriusSimple reproduction: (headlines (last (doc-metadata (slurp "")))) #2016-08-2821:05seyleriusIt's something in the h token, because that's the last thing I changed before it started failing.#2016-08-2821:10skaAt a first glance, the #'.+' looks suspicious to me. Is greediness biting you here? 
(Did not try it out, though)#2016-08-2821:25aengelberg@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, seems to match everything including newlines, as long as the first character is not a *.#2016-08-2821:25aengelbergI'm not sure what your desired behavior is though.#2016-08-2821:26aengelbergBTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.#2016-08-2821:26seyleriusThe content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.#2016-08-2821:26seyleriusBasically, a headline starts with some number of stars. Everything else isn't a headline.#2016-08-2821:26aengelbergI cloned your project and am looking at that parser. Is there a different version / branch I missed?#2016-08-2821:27seyleriusNope, I pushed the latest version just before I spoke up today.#2016-08-2821:28aengelbergSorry I may have been unclear. When I said :content I meant the content inside the headlines parser.#2016-08-2821:28aengelbergNot the doc-metadata parser#2016-08-2821:29aengelbergAs an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.#2016-08-2821:30aengelberg
organum.core> (headlines content)
[:S [:token [:content "This is an attempt...
#2016-08-2821:30seyleriusYep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.#2016-08-2821:31seyleriusAlthough, Hmmm. You've got a point about the mode there.#2016-08-2821:31aengelbergI think this is what happened: - The section rule failed at the start of the string - It then fell back to the content rule due to ordered choice - The content rule mistakenly parses the whole string (for the reason I mentioned above) - Parse is done#2016-08-2821:34seyleriusYeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.#2016-08-2821:36seyleriusHow does inataparse play with non-capturing groups?#2016-08-2821:38aengelbergNot familiar with that term; are you referring to the groups returned by a Java regex match?#2016-08-2821:40seyleriusNon-capturing groups are for saying, "this should be here, but don't return it in a group"#2016-08-2821:40seyleriusOkay, new push. Can't manage to get tags out separate.#2016-08-2821:41aengelbergoh, you mean things like regex lookahead and lookbehind?#2016-08-2821:42seyleriusThey work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?#2016-08-2821:45aengelbergThis is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670#2016-08-2821:47aengelbergI would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. 
EDIT: I misunderstood the term "non-matching"#2016-08-2821:49aengelbergI see you're using (?:) now. I don't think "non capturing" is what you want#2016-08-2821:49seyleriusI think you're right.#2016-08-2821:49aengelberg
organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"
#2016-08-2821:49seyleriusWhat's weird is non-greedy options fail entirely.#2016-08-2821:50aengelberg(?:) basically means, if there are any other groups () inside that block, DON'T return them as an additional output.#2016-08-2821:51seyleriusAh, it looks like negative lookahead is the trick.#2016-08-2821:51aengelberg(?!=)?#2016-08-2821:51aengelbergthe ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups#2016-08-2821:51seylerius(?!\\s+:)#2016-08-2821:52aengelbergseems legit#2016-08-2821:52seyleriusNope. Pushing. Still eats the tags.#2016-08-2821:53aengelberghmm#2016-08-2821:53seyleriusPushed#2016-08-2821:53aengelbergneed to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.#2016-08-2821:54aengelbergand try gradually taking characters away from the regex to see what the problem is#2016-08-2821:54seyleriusOkay, thanks for the help. Talk with ya when you've got time.#2016-08-2821:54aengelbergfeel free to dump any further findings here#2016-08-2821:54seyleriusWill do. Slack has persistence, which is pretty handy#2016-08-2823:34seyleriusOkay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.#2016-08-2823:34seyleriusTrying lookahead seems to just fail.#2016-08-2823:42seyleriusOkay, tags are mostly fixed, but it's only grabbing the first one.#2016-08-2823:42seyleriusPushed.#2016-08-2823:42seyleriusWould appreciate a look when you have time, @aengelberg#2016-08-2823:43seyleriusAch. It's also not getting second headlines. They're turning into content lines due to newline weirdness.#2016-08-2823:46seyleriusPushed again. Fixed newline weirdness#2016-08-2823:50seyleriusHah, fixed it. Required post-tag newline/whitespace.#2016-08-2823:50seyleriusGah. Org is a beautiful format, but it's a bitch to parse.#2016-08-2823:56aengelbergThe parser breaks if I put into the file
* The First : Section :foo:bar:
#2016-08-2823:56aengelbergNot sure if that's valid org-mode.#2016-08-2900:05aengelberg@seylerius This approach handles a variety of potential characters before the tags, at the expense of speed, since it parses every single character on the header line to get around regex greediness.
(def headlines
  (insta/parser
   "<S> = token (ows token)*
    <token> = section / content
    section = h (ows content)*
    h = ows stars <#'\\s+'> (todo <#'\\s+'>)? title
    <title> = (#'.'+ ws-line? tags) / #'.+'
    stars = #'^\\*+'
    todo = #'TODO|DONE'
    tags = <':'> (tag <':'>)+ ws
    <tag> = #'[
#2016-08-2900:08seyleriusProbably be an uncommon usage, but technically legal, @aengelberg. Probably ought to do something like this. Hmmm.#2016-08-2900:14seyleriusYeah, that works. Definitely going to need the follow-up concatenation I was planning on.#2016-08-2901:19seyleriusOkay, another puzzle for ya, @aengelberg. In this latest push, why isn't priority getting picked up? I've cleaned up some of the names and added an overall parse function that takes a string, to simplify testing.#2016-08-2901:20seylerius(parse (slurp "")) should work for testing.#2016-08-2901:20seylerius(Thanks a ton for the help, BTW)#2016-08-2901:25seyleriusA priority is defined as a letter preceded by a pound sign, in square brackets. [#A] or [#z], for example.#2016-08-2901:39seyleriusLol, whoops. Reversed my pound sign and bracket#2016-08-2903:52seylerius@aengelberg: Can you help me figure out why the parsed sample is not registering as compliant hiccup to insta/transform?#2016-08-2904:29seyleriusOkay, I see that it's looking for a root node. Tried to unhide the document node, but it's not showing up.#2016-08-2904:29aengelberg@seylerius lemme take a look#2016-08-2904:30seyleriusThanks.#2016-08-2904:32aengelberg@seylerius what's the repro case?#2016-08-2904:33seylerius(parse (slurp "")) returns a seq, not a vector with a root node.#2016-08-2904:34aengelberglooks like function reducing doesn't exist#2016-08-2904:34aengelbergCompilerException java.lang.RuntimeException: Unable to resolve symbol: reducing in this context#2016-08-2904:36seyleriusCleared that up.#2016-08-2904:36seyleriusThanks#2016-08-2904:37seyleriusStill isn't giving me a root node, though.#2016-08-2904:39seylerius([:author..., rather than [:document [:author...#2016-08-2904:41aengelberginteresting. transforming on a sequence should work. 
I think you may have found a bug in instaparse.#2016-08-2904:41aengelbergThe fact that there's a string in the uppermost level is what's throwing it off.#2016-08-2904:43seyleriusFascinating#2016-08-2904:46seylerius@aengelberg: Maybe make strings pass straight through transform?#2016-08-2904:46aengelbergI can confirm that's a bug#2016-08-2904:46aengelbergYeah, that's what I'm about to do.#2016-08-2904:46seyleriusAwesome#2016-08-2905:11aengelberg@seylerius https://github.com/Engelberg/instaparse/pull/145#2016-08-2905:19aengelbergoops, didn't link properly... edited#2016-08-2905:29seyleriusDownloaded, testing.#2016-08-2905:29seyleriusIt works! Thanks, @aengelberg!#2016-08-2918:30seylerius@aengelberg: Yep, that worked perfectly. What's the release schedule on that, out of curiosity?#2016-08-2918:33seyleriusIn the meantime, I'm moving on to adding additional parsers (drawers, blocks, footnotes, lists, tables, to name a few).#2016-08-2918:56aengelberg@seylerius For instaparse we prefer to merge fixes, bump version numbers / changelogs, and deploy to clojars all at once. And only my dad has the power to do the last part. So whenever he gets around to doing that is when I'd expect to see the latest version. Should be sometime today.#2016-08-2919:08seyleriusShiny. Looking forward to it. Also, shiny that y'all are a father-son team. Be cool if one of my kids coded with me (after I have some).#2016-08-2919:31aengelbergThanks. It's fun! You may enjoy the Clojure/west 2014 instaparse talk, which provides some backstory on the collaboration#2016-08-2919:42seyleriusAwesome. I'll check it out#2016-08-3020:25andreiI am trying to write a simple grammar that parses comments: /* some text */, is there a way in instaparse to say any character? e.g.
"comment = '/*' .* '*/'"
#2016-08-3020:27aengelberg@andrei Instaparse doesn't have a special character for that, but you can use regular expressions to cover any character#2016-08-3020:28aengelberge.g. comment = '/*' #'[\\s\\S]'* '*/'#2016-08-3020:29aengelberg(`#"[\s\S]"` is my personal favorite way to match any character in a regex)#2016-08-3020:30seylerius@andrei: Yeah, you'll want something like this:
"comment = <'/*'> #'.*' <'*/'>"
My version hides the comment tokens, though @aengelberg's regexp might be more appropriate.
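A quick REPL check of the two "any character" regexes mentioned above, as a sketch that uses only clojure.core and no instaparse: on the JVM, `.` skips line terminators unless the `(?s)` flag is set, while `[\s\S]` matches any single character.

```clojure
;; `.` does not match a newline by default on the JVM...
(re-matches #"." "\n")              ;=> nil
;; ...unless DOTALL is switched on with (?s)
(re-matches #"(?s)." "\n")          ;=> "\n"
;; [\s\S] matches any single character, newline included
(re-matches #"[\s\S]" "\n")         ;=> "\n"

;; so only the second style covers a multi-line comment body
(re-matches #".*" "foo\nbar")       ;=> nil
(re-matches #"[\s\S]*" "foo\nbar")  ;=> "foo\nbar"
```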
#2016-08-3020:30andrei@aengelberg @seylerius thank you for the suggestions. I think I got a bit mislead by the source code, https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L19-L40 I thought there are some defaults in instaparse#2016-08-3020:31andreibut now reading through the doc strings, these are only to parse the grammar itself https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L2#2016-08-3020:31aengelberga couple things I see in @seylerius's solution: 1) . in a regex doesn't include newlines 2) .* will greedily match past the */ and won't be able to parse the end of a comment#2016-08-3020:32aengelberg@andrei Sorry for the misleading code. Those constants are available but only to the ABNF format.#2016-08-3020:32aengelbergEBNF is the default#2016-08-3020:33andreiare there constants for ebnf? looking at the code I think not#2016-08-3020:33seylerius@andrei A point to keep in mind with @aengelberg's solution is that you'll need to condense the individual characters of the output.#2016-08-3020:34andrei@seylerius @aengelberg is there a way for specifying in instaparse to group matches together, s.t. one doesn’t need to condense the matches?#2016-08-3020:34aengelbergyeah, thanks for clarifying that @seylerius#2016-08-3020:34seyleriusYou'll get output like [:comment "f" "o" "o" " " "b" "a" "r"] from input like /*foo bar*/#2016-08-3020:35andreiexactly#2016-08-3020:35andreithere are ways to use transform and apply str on it#2016-08-3020:35seyleriusYep.#2016-08-3020:35aengelberg@andrei The official specification for ABNF is more strict and specific than EBNF, and it dictates that those constants are available. 
EBNF is more of an ambiguous mashup of a variety of standards we were able to find on the internet#2016-08-3020:35andreiit just feels that there should be a grammar direct way#2016-08-3020:36aengelbergSo there are no constants in EBNF, since none of the EBNF resources we found seemed to indicate such#2016-08-3020:36seyleriusAnd remember to wrap your comment tokens in <> like I did, so you don't save the markup itself.#2016-08-3020:36aengelbergSadly there is no grammar direct way to concat the strings#2016-08-3020:36seyleriusTransform works pretty well, though.#2016-08-3020:37andreihmm, or a more elaborated reg exp#2016-08-3020:37andreiI am using smth like this for strings
<string> = dqoute #'([^"\\]|\\.)*' dqoute
   <dqoute> = <'\"'>
#2016-08-3020:38seylerius(insta/transform {:comment (partial apply str)} (comment-parser input-data)) #2016-08-3020:39andreiand probably the performance impact is small if one applies transforms#2016-08-3020:39seyleriusLolyep. Far as I can tell, instaparse does a good job with efficient transforms.#2016-08-3020:40aengelbergit depends on the size of the file. Probably actually creating all those individual strings is going to be the bottleneck rather than concatenating them later#2016-08-3020:40andreiI must admit I was led astray by regexps vs transforms which is more efficient - although I think it's a very premature optimisation#2016-08-3020:40aengelbergA regex is a sensible solution if you can get it right 🙂#2016-08-3020:41aengelbergMy first thought is to do a negative lookahead for */ as part of the regex#2016-08-3020:42seyleriusTrouble is, from what I've found, that the */ will get eaten in the .*#2016-08-3020:43seyleriusAnd the negative lookahead will pass because the end token was already eaten
#2016-08-3020:44andreiso more reg exp magic for me to look into. to give a bit more context I am playing around with parsing localizable strings.
/* This is a comment */

"hello" = "Hello!";

/* This is another comment */
"click_button" = "Click";

/* Title bar, prints the number of selected products (The translation should be short due to the limit of 100 characters for the title of the mobile app) */
"bar_print_$_selected_products" = "You Selected %@ Products";
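One regex-only way to pull the comments out of a sample like the one above, sketched with plain clojure.core: a reluctant quantifier (`*?`) stops at the first `*/` instead of running on to the last one. This is a swapped-in technique rather than something proposed in the thread, and `sample` and `comment-re` are made-up names for a shortened input.

```clojure
(def sample
  (str "/* This is a comment */\n\n"
       "\"hello\" = \"Hello!\";\n\n"
       "/* This is another comment */\n"
       "\"click_button\" = \"Click\";\n"))

;; reluctant: [\s\S]*? grows one character at a time until */ can match
(def comment-re #"/\*[\s\S]*?\*/")

(re-seq comment-re sample)
;=> ("/* This is a comment */" "/* This is another comment */")

;; greedy, for comparison: one match running from the first /* to the last */
(count (re-seq #"/\*[\s\S]*\*/" sample))
;=> 1
```

The same pattern could presumably be used as a single regex terminal in a grammar, which would also avoid the one-string-per-character output, though that is untested here.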
#2016-08-3020:44andreijust an experiment, nothing production related.#2016-08-3020:47andrei@aengelberg @seylerius thank you for your help, so far I enjoyed using instaparse. is cool that I can use some things that I learned in college to do some useful things#2016-08-3020:47andreialthough I must say that I need to re-learn things about parsers and defining grammars#2016-08-3020:48aengelberg@seylerius I meant a regex negative lookahead, i.e. #".*(?!=/\*)" or something#2016-08-3020:49aengelberg@andrei glad you're having fun! feel free to ask here if you have any more questions#2016-08-3020:49seylerius@aengelberg: That's what I thought. It winds up eating the end-token in the .* and passes the negative lookahead anyway. I was fighting that with the headline parser in organum over the weekend.#2016-08-3020:50seyleriusWhen I was trying to get it to parse tags.#2016-08-3020:50aengelbergoh, I guess the regex would pass, saying "here's a sequence of characters (including /*), and look, there is not a /* *after* these characters!"#2016-08-3020:51seyleriusBingo#2016-08-3020:51aengelbergso maybe #"((?!/\*).)*"#2016-08-3020:51aengelbergthat would generate a bunch of match groups though due to the ()#2016-08-3020:52seyleriusGah, lemme see what I did for that in the tags in organum.#2016-08-3020:53seyleriushttps://github.com/seylerius/organum/blob/master/src/organum/core.clj#2016-08-3020:54seyleriusYeah, ordered choice wound up featuring heavily.#2016-08-3020:55seyleriusMaybe (<'*/'> / #'.')+?#2016-08-3020:56seyleriusAlways prefer to end a comment if possible, otherwise continue eating characters?#2016-08-3020:56seyleriusWait, not quite#2016-08-3020:56seyleriusThat'll continue past the end.#2016-08-3020:57seyleriusAch. I need to drive back to the store; I'm done with this client. 
Check in with y'all in about ten.#2016-08-3021:03andreiI will also catch up with you guys a bit later too or early tomorrow, its getting a bit late here in Berlin.#2016-08-3021:35seyleriusHave a good one.#2016-08-3115:48aengelberg@seylerius the bug you encountered a couple days ago is now pushed to Clojars as Instaparse 1.4.3.#2016-08-3115:48aengelberghttps://github.com/Engelberg/instaparse/blob/master/CHANGES.md#143#2016-08-3115:52seylerius@aengelberg: Shiny! Thanks for the heads up.#2016-08-3115:53aengelbergOr rather, a fix for said bug.#2016-08-3115:53aengelberg:)#2016-08-3115:53seylerius😁#2016-10-2709:06seyleriusSo, (map (partial reparse-string is-table) (nth (parse-file "") 6)) fails, complaining about a lack of either #"\r\n" or #"[\r\n]". Essentially, it seems to think a :br should start the text line in that section.#2016-10-2716:23aengelberg@seylerius where is the grammar you're using to parse? apologies if you posted earlier and I missed it.#2016-10-2716:23aengelbergand sample file?#2016-10-2717:20seyleriusAch, forgot to link to the repo. https://github.com/seylerius/organum#2016-11-0415:58be9Hi, I need to parse strings like some text with spaces XXX 12345678 98765 43 222 11. Here are 3 parts: “some text with spaces”, “XXX 12345678”, and "98765 43 222 11”. While the last part is required, the “XXX 12345678” part is optional and will be considered as text by a naive greedy regex. How could I prevent this with Instaparse?#2016-11-0415:59seylerius@be9 Can you describe the requirements your text needs to meet?#2016-11-0416:00seyleriusOr give a few more specific examples?#2016-11-0416:08be9@seylerius Ok, let’s simplify even more. Two examples: John Doe AGE 50, Dohn Joe. An input string contains a name and might contain this age thing. I want to parse those eventually to {:name “John Doe” :age 50} and {:name “Dohn Joe”}.#2016-11-0416:08be9First one should not be {:name “John Doe AGE 50”} 🙂#2016-11-0416:09seyleriusOkay. 
This is a problem I've run into before.#2016-11-0416:09be9Names can be long and contain digits too#2016-11-0416:11be9John Doe AGE 50 AGE 50 would be preferrably parsed as {:name “John Doe AGE 50” :age 50}#2016-11-0416:12seyleriusBasically what you need is to have a name token, token, and then a name+age token. You then parse for this: "name-age / name"#2016-11-0416:12seyleriusThe slash allows you to express a preference for one over the other.#2016-11-0416:13seyleriusBasically, you're saying "if this string can match an age too, do that, otherwise it's just a name"#2016-11-0416:13seyleriusI do this a lot in my rebuild of organum, if you want to take a look at the repo.#2016-11-0416:13be9oh, the slash. I see, thanks!#2016-11-0416:14seyleriusYep. The slash is for preferential parsing.#2016-11-0416:14be9:+1: @seylerius, I guess that’s it 🙂#2016-11-0609:06alpiMay I ask questions on clojurescript port here?#2016-11-0718:29aengelberg@alpi sure#2016-12-2915:25gfredericksI don't imagine there's a way to unparse something#2016-12-2919:26seyleriusIn what sense? Reconstruct the input that was parsed?#2016-12-2919:28seylerius@gfredericks What are you trying to accomplish by "unparsing", and how much control do you have over the parsing?#2016-12-2920:26aengelberg@gfredericks: some discussion has happened about this in https://github.com/Engelberg/instaparse/issues/82#2016-12-2920:27aengelbergwhich I just noticed you saw and commented on#2016-12-2920:29aengelbergThe fact that "hide tag" (`<>`) is a thing makes it a complex problem to provide unparsing as a general solution#2016-12-2920:30aengelbergalso lookahead / negative lookahead#2016-12-2922:23gfredericksI want a canonical printer. 
Potentially pretty printing...which sounds hard#2016-12-2922:23gfredericksSo no, not reconstructing the original input exactly#2016-12-2923:47seyleriusHmmm.#2016-12-3000:25gfrederickspretty printing is complex enough that I'm convinced it would be crazy to try to mix it into a grammar somehow#2016-12-3103:00gfredericksinstaparse requires keywords for the names of the whatchamacallits?#2016-12-3103:00gfredericksI think I might be using instaparse in a weird enough way for that to be a very mild problem#2016-12-3103:01gfredericksbecause I have to gensym the names and so it's a memory leak#2016-12-3104:47seylerius@gfredericks It outputs either hiccup or enlive notation, so yes it probably would want keywords in reverse.#2016-12-3109:52aengelberg@gfredericks:
(def all-keywords-ever (map (comp keyword str) (range)))

;; each time you dynamically create a parser
(let [my-syms ...
kws (zipmap my-syms all-keywords-ever)]
...)
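A slightly fuller sketch of the keyword-pool idea, using only clojure.core. The names `reusable-keywords` and `name-table` are hypothetical, and how the table gets wired into the combinators is left out because the thread does not show it.

```clojure
;; an endless, lazily realized pool of keywords: :nt0, :nt1, :nt2, ...
(def reusable-keywords
  (map #(keyword (str "nt" %)) (range)))

(defn name-table
  "Maps each symbol in syms to a keyword taken from the shared pool."
  [syms]
  (zipmap syms reusable-keywords))

;; two independently built parsers end up sharing the same keywords,
;; so the number of interned keywords is bounded by the largest grammar
(def table-a (name-table [(gensym "rule") (gensym "rule")]))
(def table-b (name-table [(gensym "rule") (gensym "rule") (gensym "rule")]))

(sort (vals table-a)) ;=> (:nt0 :nt1)
(sort (vals table-b)) ;=> (:nt0 :nt1 :nt2)
```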
#2016-12-3109:52aengelbergThat might be a way to conserve on keywords#2016-12-3109:55aengelbergOr do a string replace in the grammar to substitute non terminals with reusable symbols, then postwalk the resulting tree to convert back#2016-12-3114:24gfredericksI'm using the combinators, so it shouldn't be too hard to do something like that if I decide this matters#2016-12-3119:41zmaril@gfredericks @aengelberg if we can actually get generating from grammars going I'd still be really stoked#2016-12-3119:42zmarilI've been working on https://github.com/zmaril/instaparse-c the past few weeks and am getting within spitting distance of doing some fun stuff.#2016-12-3119:43zmarilIt can basically parse C at this point and I'm working on finishing the macro preprocessor now.#2016-12-3119:45zmarilThe goal is to get the output into datascript and queryable. But a side product of this is that if you have something that can generate strings from grammars then we already have something that can produce c programs (sans macros).#2016-12-3119:58gfredericks@zmaril do you or anybody know if all instaparse grammars are implemented using the combinators?#2016-12-3119:58gfrederickss/grammars/parser/#2016-12-3119:58zmarilYes they should be#2016-12-3119:59zmarilMy understanding is that the ebnf notation that everybody uses is actually parsed by a parser expressed in the combinators that transforms the output into combinators#2016-12-3120:00gfredericksI just glanced at the combinator list -- I think only the lookaheads are problematic, but that's probably a big deal for sophisticated parsers#2016-12-3120:00zmarilyep#2016-12-3120:00gfredericksso...oh well.#2016-12-3120:00zmarilhow does one express negation in generators now?#2016-12-3120:01gfredericksyou could implement them with gen/such-that but the generator would fail if the lookahead condition is unlikely to pass by chance#2016-12-3120:01gfredericksI have no how that would play out IRL#2016-12-3120:01zmarilThat should be fine then. 
For the parsers I write lookahead is typically used to implement reserved keywords.#2016-12-3120:02zmarilI've never used positive lookahead actually now that I think about it#2016-12-3120:02gfrederickswhen I made the regex→string generator I just decided not to support look[ahead|behind] for the same reason#2016-12-3120:02zmarilIt's one of those things that is academic to me at this point#2016-12-3120:03zmarilI'm pretty sure that 99% gen/such-that of the time would be fine#2016-12-3120:03gfredericksit might not be too hard to throw together a PoC#2016-12-3120:03gfredericksin fact that would potentially be useful for what I'm working on right now#2016-12-3120:04zmarilyeah, I think that would fit really well and mirror what spec is doing#2016-12-3120:04zmarilI've been using spec/conform the same way I use instaparse and it works really well#2016-12-3120:05zmarilSo I imagine we could use generators the same way spec does and it would work well (fingers crossed)#2016-12-3120:07gfredericks😂 I just realized that it would require using string-from-regex from test.chuck to support regexes in the grammars, and string-from-regex uses instaparse to parse the regex.#2016-12-3120:07zmarilturtles#2016-12-3120:08gfredericksindeed#2016-12-3120:08zmarilthat was the thing that was holding me up actually#2016-12-3120:08zmarilwas that I didn't want to mess with regexs#2016-12-3120:09aengelbergjust catching up#2016-12-3120:10aengelbergAfter I wrote "instagenerate" I realized going the generator route (as opposed to core.logic) would probably be easier, despite the lookahead such-that problem#2016-12-3120:10aengelbergBut what do you want to do about hide-tags?#2016-12-3120:11zmarilI think I have an idea, h/o#2016-12-3120:11zmarilwell, hmmm what is the problem you see with hide-tags?#2016-12-3120:12aengelbergIt depends on what you expect the "input" to the generator to be#2016-12-3120:12aengelberga parse tree still?#2016-12-3120:12gfredericksit'd be the 
combinator#2016-12-3120:12gfredericksit would generate totally random parsable things#2016-12-3120:12gfredericksnot based on same partial input#2016-12-3120:13aengelbergok, in that case I don't really have a problem with hide tags despite just waking up#2016-12-3120:13zmarilI think if we got something going that just took a grammar and gave back random strings, that would be a good first step#2016-12-3120:14aengelbergpart of why I did core.logic in instagenerate is @zmaril's initial request to go from partial input -> parseable strings, so I felt the need to put in the sophistication of logic programming as a general solver for all cases#2016-12-3120:15zmariloh, if we want to do partial input, we can provide skeletons with places to start generating from#2016-12-3120:15zmarilthen we just walk the skeleton and generate random strings at the indicated places#2016-12-3120:16zmarilstill not fully general but better#2016-12-3120:17zmariland then we could restrict the grammar inside the combinator somehow#2016-12-3120:21aengelberg
(def p (insta/parser "
S = A B A | B A B
<A> = ('a' <'c'> 'b')+
<B> = ('b' 'a')+
"))

(generate p [:S "a" "b" "b" "a" "a" "b"])
=> ("acbbaacb")
#2016-12-3120:23aengelbergseems hard to performantly solve generally#2016-12-3120:24zmarilwho said anything about performance#2016-12-3120:24aengelberg🙂 fair enough#2016-12-3120:25aengelbergbut a generator approach using such-that may never complete on a large enough grammar#2016-12-3120:25zmarilcross that bridge when we get there#2016-12-3120:25zmarilcomputers are like really fast#2016-12-3120:26zmarilthis is more of a what's possible idea than a production thing#2016-12-3120:27aengelbergcool#2016-12-3120:28aengelberglet me know if I can help out in whichever path you decide to try out#2016-12-3120:28zmarilfor sure!#2016-12-3120:38gfredericksyeah generators aren't generally for production stuff#2016-12-3120:43gfredericksI want a combinator that doesn't match anything#2016-12-3120:43gfredericksI thought maybe (combo/alt) but that returns ε#2016-12-3120:44zmaril(gen/such-that (constantly false)) or something?#2016-12-3120:44gfredericksa combinator, not a generator#2016-12-3120:44zmariloh right sorry#2016-12-3120:44gfredericksI guess I can do negative lookahead with epsilon?#2016-12-3120:44zmarilor a really unlikely string?#2016-12-3120:45zmarillike (string "THISWILLNEVERBEMATCHEDHOPEFULLY")#2016-12-3120:46gfredericks🙂#2016-12-3120:46zmarilwe're not fancy here#2016-12-3120:46gfredericks(string (str (java.util.UUID/randomUUID)))#2016-12-3120:46zmarilthat works!#2016-12-3120:48gfredericksI have an alternate thing in my codebase that could be called a parser, but instaparse also has something by that name so I called it a parsifier instead#2016-12-3120:48gfredericksand it's hard to remember that word because it could also have been parsinator#2016-12-3120:49zmarilhahaha#2016-12-3120:50zmaril
(defn enlive-output->datascript-datums [m]
 (if-not (map? m)
    {:type :value :value m}
    (as-> m $
        (assoc $ :meta (meta m))
        (assoc $ :db/id (d/tempid :mcc))
        (transform [:content ALL] enlive-output->datascript-datums $))))
This will take enlive output and make it so you can query it from datascript
#2016-12-3120:53gfredericksdoes instaparse use its own regex engine?#2016-12-3120:53zmarilno#2016-12-3120:53gfredericksI just got a misparse where the thing matches the regex but instaparse disagrees#2016-12-3120:53zmarildepends on java if I recall#2016-12-3120:53gfredericksand reordering a disjunction in the regex fixes it#2016-12-3120:54zmarilhmm#2016-12-3120:54gfredericksthis is the instparse-cljs thing in particular, but still on the jvm#2016-12-3120:54zmarilcheck if instaparse passes any flags in#2016-12-3120:55gfrederickshere's the failing version: https://www.refheap.com/124435#2016-12-3120:58zmarilhmm#2016-12-3120:58zmaril"0/2" parses#2016-12-3120:58zmarilcan you add in some parens to the second part to clarify your intent#2016-12-3120:59gfredericks"0/2" is not supposed to parse o_O#2016-12-3121:00gfredericksI see that's my fault though#2016-12-3121:03zmarilha#2016-12-3121:59aengelbergI second !epsilon as the "don't parse"#2016-12-3122:00aengelbergalso instaparse fails on infinite loop grammars, so this might work
never-succeed = never-succeed
(then use never-succeed wherever)
#2016-12-3122:01gfredericks@aengelberg do you think the current behavior of (combo/alt) is bad/weird?#2016-12-3122:02gfredericksmy hunch is that According To Math it should either throw or not match anything#2016-12-3122:03aengelbergyeah I agree with your instinct. Not really sure what the thinking was in that design.#2016-12-3122:03gfredericksmy argument is that because (combo/alt p) probably does not match ε, neither should (combo/alt)#2016-12-3122:03aengelbergMaybe since "don't parse anything" isn't really a common use case#2016-12-3122:03gfredericksyou shouldn't parse more things by removing an arg from combo/alt#2016-12-3122:03aengelbergagreed#2016-12-3122:04gfredericksyeah I always end up finding the uncommon use cases#2016-12-3122:04gfredericksfor a while every time I tried to use CLJS I ended up creating a jira ticket#2016-12-3122:04aengelberg#gobigorgohome#2016-12-3122:06aengelbergI think I know why your parser is failing#2016-12-3122:06aengelbergThe regex for the denominator, when given "25" as input, may arbitrarily decide to match either "2" or "25"#2016-12-3122:07aengelbergIn instaparse, whatever the regex decides is the one and only possible parse#2016-12-3122:07aengelberg
user=> (re-matches #"[2-9]|[1-9][0-9]+" "25")
"25"
user=> (re-seq #"[2-9]|[1-9][0-9]+" "25")
("2" "5")
user=> (re-find #"[2-9]|[1-9][0-9]+" "25")
"2"
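The REPL session above can be narrowed to the case that matters for a regex terminal: what the regex matches when anchored at the current position without having to consume the whole input. The sketch below assumes that is how the terminal is applied. `match-at-front` is a hypothetical helper built on `java.util.regex.Matcher/lookingAt`, not instaparse's own code.

```clojure
(defn match-at-front
  "Returns the prefix of s matched by re, or nil. Hypothetical helper."
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

;; alternation is tried left to right, so the one-digit branch wins
(match-at-front #"[2-9]|[1-9][0-9]+" "25") ;=> "2"

;; with the branches swapped, the multi-digit branch wins instead
(match-at-front #"[1-9][0-9]+|[2-9]" "25") ;=> "25"
```

In the first case the regex has committed to "2", so nothing is left to offer "25" as an alternative once the trailing "5" fails to parse.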
#2016-12-3122:08gfredericksoh it's about re-matches vs re-find?#2016-12-3122:08aengelberghttps://github.com/engelberg/instaparse#regular-expressions-a-word-of-warning#2016-12-3122:08gfredericksoh I think I see#2016-12-3122:09aengelbergyou could instead do #"[2-9]" | #"[1-9][0-9]+"#2016-12-3122:09aengelbergIf you move logic from regexes into instaparse, you get flexibility at the cost of speed#2016-12-3122:11gfredericksso the fact that I fixed it by rearranging the regex is sort of an implementation detail I guess?#2016-12-3122:12aengelbergYes, so I would call rearranging the regex an improper solution#2016-12-3122:12aengelbergbut #"[2-9]" | #"[1-9][0-9]+" is proper#2016-12-3122:15gfredericksokay fine I'll switch it 😛#2017-02-1308:58doddeninoHi! Is there a way to do a step by step debug of a parser?#2017-02-1317:50dave@doddenino: https://github.com/Engelberg/instaparse#total-parse-mode <-- this is maybe not exactly what you want, but it can be helpful for debugging a parse failure#2017-02-1317:51aengelberg@doddenino you're probably looking for tracing mode. https://github.com/Engelberg/instaparse/blob/master/docs/Tracing.md#2017-02-1317:52doddeninoOh that's great! 🙂 Thanks! I'm having a lot of problems trying to make my parser work correctly 😞#2017-02-1317:53aengelbergAlso, calling insta/parse with :start overridden can also help debug certain small pieces of your parser.#2017-02-1317:56doddenino@aengelberg I think trace is perfect#2017-02-1318:10doddeninoI don't know how and why, but it's working correctly now 😅#2017-02-1318:12doddeninoCelebrating too soon 😞#2017-02-1409:13doddeninoI'm trying to write a lambda calculus expressions parser, but I'm having a hard time dealing applications being left associative and having higher precedence than abstractions. I can either parse "a b c" or "fn x . 
x a" correctly, but not both with the same parser 😞#2017-02-1418:59aengelberg@doddenino: I'm not familiar with the syntax you're trying to parse, but have you looked at the ordered choice / operator?#2017-02-1510:12doddeninoYes, I was using that already. I managed to solve it by enforcing a stricter syntax, dealing better with a specific edge case and randomly moving stuff around until it worked fine 😄#2017-02-1717:54frankI'm having trouble creating a parser using the grammar specified here: https://developers.google.com/protocol-buffers/docs/reference/proto3-spec#2017-02-1717:54frankI'm getting the feeling that there are syntax differences#2017-02-1717:57frankI'm slurping the grammar out of a separate file, but I feel like escaped quotes still aren't being handled as I intend (e.g. quote = "'" | '"')#2017-02-1718:03frankdoes anyone know how quotes ought to be escaped in instaparse ebnf strings?#2017-02-1718:13gfredericksthe way you have it looks likely to work to me#2017-02-1718:17frankmaybe there's unmatched quotes somewhere in the grammar that I copied and pasted 😕#2017-02-1718:19gfrederickstry making a trivial grammar that only matches a quote to make sure it works the way you expect#2017-02-1718:20seylerius^ This. So much this. When I'm making grammars, I often make little phrases to match a character I haven't tested before.#2017-02-1718:21frankI'll try that, thanks#2017-02-1718:29aengelberg"'" | '"' looks right, but there are sometimes additional layers of escaping you have to deal with.#2017-02-1718:29aengelberge.g. if you wrote your grammar as a string in a Clojure file, it would probably have to look like
(def parser (insta/parser "quote = \"'\" | '\"'"))
#2017-02-1718:31aengelbergI see this in the protobuf spec
hexEscape = '\'
that will probably throw off instaparse, since it thinks you are escaping the second '
#2017-02-1718:31aengelbergso it should really be
hexEscape = '\\'
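If the corrected rule is pasted into a Clojure source file instead of being slurped, the extra layer of escaping mentioned earlier applies again. A small sketch of how the two layers stack, using only clojure.core (`grammar-text` is a made-up name):

```clojure
;; grammar text as it should look in a slurped file:
;;   hexEscape = '\\'
;; the same text as a Clojure string literal needs every backslash doubled
(def grammar-text "hexEscape = '\\\\'")

(println grammar-text)
;; prints: hexEscape = '\\'

;; the quoted terminal is four characters long: ' \ \ '
(count (subs grammar-text (.indexOf grammar-text "'")))
;=> 4
```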
#2017-02-1718:32aengelberg@frank ^#2017-02-1718:32aengelbergalso, /[^\0\n\\]/ is not valid EBNF in instaparse (should be #"[^\0\n\\]")#2017-02-1718:42frankah, that's probably it!#2017-02-1718:43frankstrangely enough, #"[^\0\n\\]" isn't valid clojure regex syntax, so I stole the same regex syntax from https://github.com/arpagaus/clj-protobuf/blob/master/resources/proto.ebnf#2017-02-1718:44frankthey've got a few extra backslashes: #"[^\\0\\n]"#2017-02-1718:46frank@aengelberg what's the equivalent of the that they've got littered all over their grammar?#2017-02-1718:49aengelbergI think they meant that as a shorthand for alternating between all the digits. Sadly instaparse can't infer the intermediate values, so you would have to "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"#2017-02-1718:50frankah, gotcha#2017-02-1718:55frankalternatively, #"[0-9]" should work too, right?#2017-02-1718:56aengelbergcorrect#2017-02-1904:01bherrmannhas anyone used instaparse to mottle through oracle sql?#2017-02-1909:39aengelbergI vaguely recall hearing about people using instaparse for SQL, but I unfortunately can't point to something specific.#2017-02-1909:40aengelbergMight be something in the instaparse Google group archives#2017-02-2321:26anthony.naddeo@aengelberg I have a parser that I'm not sure how else to improve: https://github.com/naddeoa/elm-clojurescript-hello-parser/blob/master/src/elm_toolkit/parser.cljs#2017-02-2321:27anthony.naddeoI've eliminated a ton of ambiguity but I'm still seeing that its performance scales poorly with the size of the input, even when there are only 1 or two possible parses#2017-02-2321:27anthony.naddeobut that same input when fed chunk by chunk adds up to a much more reasonable parse time#2017-02-2321:28anthony.naddeoI was reading about the :optimize :memory option, but that seems to make it worse#2017-02-2321:28anthony.naddeoI even tried it with Java instead of Node to see if it was just a JS thing and the results were still 
similar#2017-02-2321:29anthony.naddeoMy next step was about to be to write a function that breaks the input up into chunks, but to do that I'll need to know where a chunk starts/stops or impose some sort of convention around the input#2017-02-2322:37hiredmanhave you seen https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md? I feel like I have seen splitting input in to chunks recommended for larger inputs to instaparse#2017-02-2322:40anthony.naddeoYeah that's where I got it from#2017-02-2322:40anthony.naddeoIt just seems like something I shouldn't have to do#2017-02-2322:41anthony.naddeoIf only because I need my parser to determine when a readable chunk starts/stops#2017-02-2322:41anthony.naddeoit isn't a line by line thing#2017-02-2322:42hiredmanI might try and eliminate regexes in the grammar definition, if I recall correctly those can be (or maybe they were?) inefficient#2017-02-2322:45anthony.naddeoAre there any examples you know of large grammars that scale linearly with the size of the input?#2017-02-2322:46anthony.naddeoBefore I go down a rabbit hole I want to make sure I know what to expect#2017-02-2322:48hiredmanno, I don't#2017-02-2322:49hiredmanI don't know of any large grammars, and I have only used smallish grammars on small inputs#2017-02-2322:50anthony.naddeoI'm just afraid that the performance actually won't get better as hard as I may try and I want to cut my losses if I can.#2017-02-2322:50anthony.naddeoOr just use this for things that only require parsing snippets#2017-02-2322:51anthony.naddeoThanks for the advice though @hiredman#2017-02-2322:53hiredmanit would be neat if instaparse could emit warnings and suggest fixes if your grammar is not LL(1)#2017-02-2322:57anthony.naddeoAt this point, it seems like what I really want is a mode where it parses in chunks itself. My grammar is really just a repetition of 4 possible blocks. It seems like it could just parse the first block independent of the second. 
That is to say, as soon as it matches just consider it a block and move on.#2017-02-2322:57anthony.naddeoGiven the right level of ambiguity I would be ok with massaging the results#2017-02-2323:00aengelbergHow large are your problematic inputs?#2017-02-2323:02anthony.naddeoWell, it scales poorly. There isn't a size in specific. It parses the Elm programming langauge. When testing, I start with a single function, then I just keep duplicating that function and observe the performance#2017-02-2323:02anthony.naddeoeach additional function adds more than its fair share of time#2017-02-2323:02anthony.naddeoI'll paste a snippet of something that takes too long#2017-02-2323:06anthony.naddeo
(def input "nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next


nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
    Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

nextSource : Source a -> Source a
nextSource (Source a next) =
  Source (next a) next

")
#2017-02-2323:08anthony.naddeoThat one I actually just had to kill#2017-02-2323:08anthony.naddeobut if you cut it in half it only takes about a second or so to parse#2017-02-2323:08anthony.naddeoAnd its a pretty reasonable size for an Elm source file#2017-02-2323:10aengelbergOuch#2017-02-2323:11aengelbergI'll try to take a look soon#2017-02-2323:18anthony.naddeoCool, thanks a lot#2017-02-2323:19anthony.naddeoAlso, definitely pretty new to clojure still. If I can make the code more approachable/replable feel free to point that out#2017-02-2323:19anthony.naddeoI kind of just hacked everything together#2017-02-2323:36aengelberg@anthony.naddeo I fixed it before:
start = (<ws>* block <ws>*)+
after:
start = (<ws> block <ws>)+
#2017-02-2323:37anthony.naddeoThat? Let me try now#2017-02-2323:37aengelbergthe large input now takes 100ms on my machine#2017-02-2323:39anthony.naddeo100ms? I just tried it and it is WAY better, but I wish I had those numbers#2017-02-2323:39anthony.naddeoare you just in a node repl?#2017-02-2323:39anthony.naddeoI'm down to 900ms now#2017-02-2323:39anthony.naddeowhich is great#2017-02-2323:40anthony.naddeocompared to never stopping#2017-02-2323:40anthony.naddeoI wouldn't have thought to do that#2017-02-2323:40aengelbergsorry I'm in JVM, not JS#2017-02-2323:40aengelberg900ms makes sense for node, instaparse in cljs is known to be ~ 10x slower#2017-02-2323:41anthony.naddeooh interesting. Did you use my parser.clj file?#2017-02-2323:42anthony.naddeoyeah this is way better#2017-02-2323:42anthony.naddeoThanks a ton#2017-02-2323:43aengelbergI just copied the grammar string from parser.cljs into my clojure file since the EBNF syntax is the same on both platforms#2017-02-2323:43aengelbergnp#2017-02-2323:45anthony.naddeoOne odd thing though#2017-02-2323:45anthony.naddeoI would expect that the * impacts performance because it adds ambiguity right?#2017-02-2323:46anthony.naddeoIf so, shouldn't there be more than a single parse for it? 
(time (count (insta/parses parser/parser input :unhide :all))) returned 1 unless I did something wrong#2017-02-2323:50hiredmanambiguity can mean multiple results, but it can also mean a single result that required checking lots of different cases to arrive at#2017-02-2323:50anthony.naddeoWhat metrics could people use to determine ambiguity if not parse counts?#2017-02-2323:52hiredman#6 in that peformance doc suggests looking at rules individually to see if you have rules that in isolation can result in multiple parses#2017-02-2323:52aengelbergInstaparse de-duplicates results, so some internal ambiguity can lurk around but still impact the perf#2017-02-2323:53anthony.naddeointeresting#2017-02-2323:53anthony.naddeowell, I'm glad that was an easy fix, thanks guys#2017-02-2323:53anthony.naddeoI don't think I can go back to other parsers. I would have been heart broken#2017-02-2323:55aengelbergbtw @anthony.naddeo, I made the following optimization to your Name and name which halved the time for me:
Name = #'(?!\\b(if|then|else|in|let|case|of)\\b)[A-Z][a-zA-Z0-9]*'
    name = #'(?!\\b(if|then|else|in|let|case|of|type)\\b)[a-z][a-zA-Z0-9]*'
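The keyword-excluding lookahead can be sanity-checked outside of instaparse by trying it as a plain Clojure regex. This is only a sketch: `name-re` and `name-token?` are made-up names, the group is written as non-capturing so that `re-matches` returns a plain string, and the backslashes are single because this is a regex literal rather than a grammar string.

```clojure
;; Illustrative only: the `name` token regex as a Clojure regex literal.
(def name-re
  #"(?!\b(?:if|then|else|in|let|case|of|type)\b)[a-z][a-zA-Z0-9]*")

(defn name-token?
  "True when the whole string s is accepted as a lowercase name."
  [s]
  (some? (re-matches name-re s)))

(name-token? "if")     ; a reserved word, so the lookahead rejects it
(name-token? "iffy")   ; only starts with a reserved word, so it is accepted
(name-token? "typeOf") ; likewise accepted
```

Doing the keyword check inside the token regex means the grammar needs no separate negative-lookahead rule for reserved words.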
#2017-02-2323:56aengelbergActually the first \b doesn't really do anything, because the first character of this particular token is the first char in the string from the regex's perspective#2017-02-2323:57aengelbergBut my optimization was to shift more responsibility into the regex; that is always ideal#2017-02-2400:09aengelbergrelated:
symbol = #'(?!(->|=)\\b)[+/*<>:&|=^?%#~!-]+'
#2017-02-2400:10anthony.naddeooh awesome, let me try that now#2017-02-2400:10anthony.naddeoI had meant to circle back and fix all that stuff on the bottom...#2017-02-2400:32anthony.naddeo@aengelberg yeah, that's pretty awesome. Thanks a ton. I guess I have some work to do#2017-02-2400:32anthony.naddeoIts cut nearly in half on JS#2017-02-2400:34aengelbergjust curious, does Elm allow newlines pretty much anywhere there can be whitespace? if so, I would recommend making a universal ws and ows (optional whitespace) that allows whitespace, newlines, and comments, which you can then spam in your grammar anytime you would use <break>.#2017-02-2400:34aengelbergwhitespace and comments are sometimes the trickiest parts of grammar perf#2017-02-2400:38anthony.naddeoIts pretty flexible with newlines yeah. It does care about indentation (which I haven't attempted to model in anyway). Are you saying that break performs poorly or that it would be easier to maintain if I just settled on fewer whitespace tokens#2017-02-2400:39aengelbergI don't think anything's wrong with break, but you might run into false negatives i.e. valid Elm code that doesn't parse properly#2017-02-2400:40aengelbergand making ws easier to reason about makes you less likely to fall into ambiguity traps like two adjacent ws parsers#2017-02-2400:40anthony.naddeoyeah that's a good point#2017-02-2400:40anthony.naddeobreak probably can go. It was just the result of trial and error when I first picked up the parser, I shouldn't feel particularly attached to it if I can just use ws everywhere#2017-02-2400:42anthony.naddeothanks again. This is far more support than I expected#2017-02-2400:49aengelbergNo problem#2017-02-2400:49anthony.naddeoI think it might also make sense for me to roll comments into a rule similar to that#2017-02-2400:49anthony.naddeothey too can appear anywhere#2017-02-2400:50aengelbergAgreed. Your usage of single line comments seemed kind of arbitrary though admittedly I'm not familiar with Elm
#2017-02-2400:50anthony.naddeoyeah it's totally arbitrary. I've punted on all inline comments atm#2017-02-2400:51anthony.naddeothe only challenge was modeling them in a meaningful way. I wanted to be able to convert the tree back into code and preserve the comments, or be able to include data about comments in queries against the parse tree#2017-03-0823:58bherrmannhow does instaparse compare to ANTLR#2017-03-0900:00aengelberg@bherrmann: in comparison, instaparse is slow and memory-inefficient, but far easier to use and accepts more types of grammars#2017-03-0900:00aengelbergI've never actually used ANTLR so I'm just guessing on both points#2017-03-0900:00bherrmannHa!#2017-03-0900:01bherrmannWe have a large Oracle grammar… in ANTLR and I’m curious about using instaparse instead#2017-03-0900:02aengelbergSo it's already working in ANTLR?#2017-03-0900:02aengelbergWhy would you want to switch?#2017-03-0900:05aengelberg(genuinely curious)#2017-03-0900:08bherrmannwell. This might be the wrong reason#2017-03-0900:08bherrmannbut when we make changes to the ANTLR grammar, it generates a Java file which is too big to be compiled#2017-03-0900:09bherrmannso at the moment, we have to jiggle the rules to keep the output small enough to be compiled.#2017-03-0900:09bherrmannIt is ANTLR V3#2017-03-0900:10seyleriusThat sounds... clumsy.#2017-03-0900:11bherrmannwell, it's that old song of someone understanding how ANTLR v3 works and them leaving the company....#2017-03-0900:16bherrmannI’m working with a modified version of this http://www.antlr3.org/grammar/1209225566284/PLSQL3.g#2017-03-0900:17bherrmannI should really read this https://tomassetti.me/antlr-mega-tutorial/#2017-03-0900:18bherrmannalthough that page is ANTLR V4#2017-03-0900:18bherrmannI’m curious if the PLSQL3.g could easily be consumed by Instaparse#2017-03-0900:20aengelbergThe is_sql thing actually looks like something unique to ANTLR (not standard BNF)#2017-03-0900:20aengelbergi.e.
setting local variables#2017-03-0900:23bherrmannyea, that is weird#2017-03-0900:24bherrmannwe dont use the is_sql in our copy.#2017-03-0900:28bherrmannso the grammar has about 1600 lines (ours has around 2k)… They appear otherwise very similar#2017-03-1422:38nathansmutzI'm sure this is asked a lot; but I'm not figuring out how to search it. Is there a standard way to get instaparse to pick grammar-fitting things out of a mess of other text? In text-mining, I've used regular expressions to carve out the bits I want to parse; but that's pretty redundant. Starting and ending your grammar with an <anything> pattern is slow; and, I'm sure, makes instaparser grind away on unnecessary work.#2017-03-1517:32aengelbergInstaparse only works on "full parses", so adding <anything> is the only way to go.#2017-03-1517:34aengelberg@nathansmutz if adding <#'[\\s\\S]'+> (anything) is too slow, you could maybe (str/replace #"things you definitely don't want to parse" "") beforehand#2017-03-1517:34aengelbergso instaparse isn't churning through too much garbage#2017-03-1716:26nathansmutzThanks @aengelberg. I wonder if my <anything> was more complicated than that when I tried it. Hmm, there may be a useful project in "parsing" instaparse code into regex suitable for grabbing parsable chunks. I'm pretty sure instaparse is generating regex on the backend; but the capture-groups would need some editing.#2017-03-1716:29hiredmanthat is not correct, instaparse parses context free langauges, which regexes cannot do#2017-03-1716:30hiredman(regular languages are a subset of context free languages)#2017-03-1717:11nathansmutz@hiredman I think I see what you mean. The problems I've been solving are probably simple enough that regex could define the same patterns. 
I suppose, getting into some real recursive stuff, it'd be less easy to go from instaparse-code to regex that says "grab a block of text that looks like this".#2017-04-0320:49wistb@aengelberg , I am trying instaparse against an IETF abnf (https://tools.ietf.org/html/rfc7950 ). I am getting the error Parse error at line 92, column 44: < URI in RFC 3986 >#2017-04-0320:49wistband this is what I see at that location#2017-04-0320:49wistburi-str = < a string that matches the rule > < URI in RFC 3986 >#2017-04-0320:50aengelberg< a string that matches the rule > isn't valid BNF. In that spec, it is used for prose that can only be understood by humans.#2017-04-0320:51aengelbergIn Instaparse we use <abc> completely differently, used to refer to the "hidden" version of the non-terminal abc.#2017-04-0321:00wistbthank you @aengelberg . I made a change and the parsing moved forward. Now, I am getting an error (in the last few pages of the grammar file, So, I am hoping the grammar is holding up well so far)#2017-04-0321:00wistbCompilerException java.lang.RuntimeException: Error parsing grammar specification: Parse error at line 915, column 32: action-keyword = %s"action"#2017-04-0321:00wistbI think the ietf doc wants to use 'non case sensitive' form.#2017-04-0321:24aengelbergI don't think %s is valid ABNF per the spec, though I see what it's getting at, and it wouldn't be too hard to implement.#2017-04-0321:26aengelbergIf you're ok with case insensitive, you could just do "action"#2017-04-0321:26wistb@aengelberg for now, I removed %s and moved on. 
It is getting close to the end of the file, but I have this error:#2017-04-0321:26aengelbergIf you really want case sensitive, you could translate it to decimal or hex and use %d / %x#2017-04-0321:26aengelbergYou're going to want :input-format :abnf if you haven't set that already btw#2017-04-0321:28wistbCompilerException java.lang.RuntimeException: a occurs on the right-hand side of your grammar, but not on the left,#2017-04-0321:29wistbI am invoking it like so:#2017-04-0321:29wistb(def my-parser (insta/parser (clojure.java.io/resource "my.abnf") :input-format :abnf :trace true))#2017-04-0321:30wistbBut, this time, I don't know which particular text is the culprit.#2017-04-0321:35aengelberghmm.#2017-04-0321:35aengelbergOnly thing I can think of is there is a loose a or A somewhere...#2017-04-0321:36aengelbergOr it's one of the < a string that matches the rule > #2017-04-0321:36aengelbergand the a is the first of many invalid things in that expression.#2017-04-0321:37wistbLet me see. I removed those kinds of usages. Maybe there is still something lurking ..#2017-04-0321:41wistb@aengelberg .. great. That is the issue. One such usage got left out. I changed it, and now the parsing is completing without error. Thank you.#2017-04-0321:41aengelbergsweet#2017-04-0321:41aengelbergnp#2017-04-0321:44wistbI am not sure about the change I made (I don't know much about grammars), though .. I changed the text from yang-version-arg-str = < a string that matches the rule > < yang-version-arg > to yang-version-arg-str = yang-version-arg yang-version-arg = "1.1"#2017-04-0321:45wistbif that is correct, I wonder why the IETF folks did not do the same.
As it is , the ietf abnf is not parseable, right.#2017-04-0321:49aengelbergI think some grammars are written with the intention of helping humans to write programs, rather than to be fed to parser generators like Instaparse.#2017-04-0321:49aengelbergSo they don't feel the need to exactly follow the ABNF spec.#2017-04-1423:34gmercerHi - I was trying to use instaparse (clone from github) with lumo (or planck) I hit some issues (they are in the lumo channel) I am happy to cross-post but I thought it may be polite to not do so initially#2017-04-1500:14aengelbergInstaparse is known to be incompatible with bootstrapped cljs so I'm not surprised :(#2017-04-1500:15aengelbergSpecifically, the cljs version currently uses clj to do macro-time compile steps#2017-04-1500:16aengelbergIn theory, it doesn't have to do that logic on clj, but I don't know of an easy way to use reader conditionals, etc to write cross-compatible macros as opposed to functions#2017-04-1500:22gmercercross compatible macros - thanks, now I have a focus .. soon instaparse will be bootstrapped cljs compatible 😃#2017-04-1500:27gmerceralthough the earlier issue regarding the reader eagerly compiling the regex would add a little bit of hairiness 😞#2017-04-1500:28aengelbergLet me know if you need any assistance / explanations for some of the instaparse code#2017-04-1500:28gmercercheers#2017-06-0612:46wilkerluciohello people, I'm stuck with a instaparse rule, maybe someone here can help me out 🙂#2017-06-0612:46wilkerlucioI'm writing a parser for Javascript regexes#2017-06-0612:46wilkerluciothis is the current grammar:#2017-06-0612:46wilkerlucio
Regex = <'/'> Alternation <'/'> MatchFlag*

Alternation = Concatenation (<'|'> Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix?
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar

BackReference = <'\\'> #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot | LiteralChar | BCC
Dot = '.'

BCC = <'['> BCCUnionLeft? <']'>

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass | NormalSlashedCharacters | ControlChar | HexChar | BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = <'\\c'> #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>

MatchFlag = #"[gimuy]"
#2017-06-0612:48wilkerlucioI would like it to parse {, as a plain char, currently on the PlainChar definition this char is excluded to allow for the CurlyRepetition#2017-06-0612:48wilkerlucioI'm probably missing something#2017-06-0612:49wilkerluciobut if I allow the { at PlainChar#2017-06-0612:49wilkerluciothen when I try to do: a{2}#2017-06-0612:49wilkerlucioI was expect it to go into SuffixedExpr and match a PlainChar followed by a Suffix that would be a CurlyRepetition#2017-06-0612:50wilkerluciobut instead seems like it's not matching the suffix, and instead matches a series of plain chars#2017-06-0612:55wilkerluciosorry the noise, I got a much simplified version now, by this grammar:#2017-06-0612:55wilkerlucio
Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition?

CurlyRepetition = <'{'> #"\d+" <'}'>

LiteralChar = #"."
#2017-06-0612:56wilkerlucioI expected it to match "a{2}" as a LiteralChar followed by a CurlyRepetition, but instead it matches as [:Concatenation [:SuffixedExpr [:LiteralChar "a"]] [:SuffixedExpr [:LiteralChar "{"]] [:SuffixedExpr [:LiteralChar "2"]] [:SuffixedExpr [:LiteralChar "}"]]]#2017-06-0612:57wilkerluciohow can make it try force match the CurlyRepetition before stepping a level up and matching more literal chars?#2017-06-0616:00aengelberg@wilkerlucio maybe ordered choice / in instaparse might help here?#2017-06-0616:02aengelbergso you want a{2 to parse as 3 PlainChars, but a{2} to parse as a PlainChar followed by a CurlyRepetition?#2017-06-0617:34wilkerlucio@aengelberg yes, with some help I was able to figure it out, here is a way to handle it:#2017-06-0617:34wilkerlucio
Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition? / AnyLiteralChar

CurlyRepetition = <'{'> #"\d+" <'}'>

LiteralChar = #"[^{]"
AnyLiteralChar = #"."
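To check that a grammar like this really admits the intended derivation, one option is to look for the expected tree among everything `insta/parses` returns. The sketch below assumes instaparse is on the classpath and that the regexes are rewritten with single quotes so the grammar fits in a Clojure string; `curly` and `wanted` are names invented here.

```clojure
(require '[instaparse.core :as insta])

;; The grammar from the message above, with single-quoted regexes so it
;; can live inside a Clojure string (hence the doubled backslash).
(def curly
  (insta/parser
    "Concatenation = SuffixedExpr*
     SuffixedExpr = LiteralChar CurlyRepetition? / AnyLiteralChar
     CurlyRepetition = <'{'> #'\\d+' <'}'>
     LiteralChar = #'[^{]'
     AnyLiteralChar = #'.'"))

;; The tree we hope to get for "a{2}": one literal with a repetition suffix.
(def wanted
  [:Concatenation
   [:SuffixedExpr [:LiteralChar "a"] [:CurlyRepetition "2"]]])

;; Is the intended tree among the parses instaparse can find?
(boolean (some #{wanted} (insta/parses curly "a{2}")))

;; An unterminated brace should still parse, with `{` read as a literal.
(insta/failure? (curly "a{2"))
```

Counting the results of `insta/parses` on small inputs is also a cheap way to spot leftover ambiguity in a rule.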
#2017-06-0617:34wilkerlucioI had to restrict the first one a little and have a second more permissive rule, now it parses the way I was expecting 🙂#2017-06-0617:35aengelbergI suggest you take away the ? after CurlyRepetition, to make the grammar unambiguous (improving performance).#2017-06-0617:36wilkerlucio@aengelberg but it is needed there, because the curlyrepetition is optional#2017-06-0617:37wilkerlucioah, I think I got you said, it would match anyway#2017-06-0617:37wilkerlucioin my real case its a bit more complicated#2017-06-0617:37wilkerlucioand the latest AnyLiteral actually just matches the {, otherwise other complications arise#2017-06-0617:37wilkerlucioparsing regex is pretty annoying to be honest -.-#2017-06-0617:38aengelberglol yeah#2017-06-0617:38aengelbergalso, something to watch out for: . does not include the newline character#2017-06-0617:38aengelbergbut [^{] does#2017-06-0617:38wilkerlucioyeah, when I need everything I like to use something like [\s\S]#2017-06-0617:38wilkerlucioso it matches everything#2017-06-0617:39aengelbergthat’s exactly what I was going to suggest#2017-06-0617:39wilkerlucioin case you wonder, this is what my current grammar looks like (for parsing JS RegExp):#2017-06-0617:39wilkerlucio
Regex = <'/'> Alternation <'/'> MatchFlag*

Alternation = Concatenation (<'|'> Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix? / CurlyRepetition / LiteralSpecialChar
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar
LiteralSpecialChar = '{'

BackReference = <'\\'> #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot / LiteralChar / BCCEmpty / BCC
Dot = '.'

BCC = <'['> BCCUnionLeft? <']'>
BCCEmpty = '[]'

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass / NormalSlashedCharacters / ControlChar / HexChar / BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = <'\\c'> #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>

MatchFlag = #"[gimuy]"
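The grammar above uses `[\s\S]` in BasicEscapedChar rather than `.`, which fits the earlier remark that `.` does not match a newline. A few plain-regex checks illustrate the difference; this is a sketch for Clojure on the JVM, with no instaparse involved, and the inline `(?s)` flag is a java.util.regex feature that may not carry over to JavaScript engines.

```clojure
;; Plain Clojure (JVM) regex checks; each form is evaluated on its own.
(re-matches #"." "\n")       ; nil: `.` skips line terminators by default
(re-matches #"[^{]" "\n")    ; a negated class does match the newline
(re-matches #"[\s\S]" "\n")  ; so does the `[\s\S]` idiom
(re-matches #"(?s)." "\n")   ; and `.` itself under the DOTALL flag
```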
#2017-06-0617:40wilkerlucioit doesn't need to be perfect, the usage of this is to port the test.chuck string-from-regex to CLJS#2017-06-0617:40aengelberggfredericks has already done work in test.chuck to parse Java regexes in instaparse, which may be useful: https://github.com/gfredericks/test.chuck/blob/master/resources/com/gfredericks/test/chuck/regex.bnf#2017-06-0617:40wilkerlucioyeah, this is actually based of that#2017-06-0617:41wilkerlucioI'm trying to port it to CLJS#2017-06-0617:41aengelbergoh lol, you’re right, I just needed to look closer#2017-06-0617:41aengelbergyou’re porting test.chuck to cljs?#2017-06-0617:41aengelbergor just the regex generator?#2017-06-0617:41wilkerluciojust the regex generator, the rest is already all cljc actually#2017-06-0617:41aengelbergok this explains a lot 🙂#2017-06-0617:42aengelbergso then how did you get the grammar into a weird state that behaved improperly with curly repetitions, just by removing certain things not part of the EcmaScript regex spec?#2017-06-0617:42aengelbergor was it already like that?#2017-06-0617:43wilkerluciothe Java Regex have many different features compared to JS one#2017-06-0617:43aengelberghmm I was under the impression that Java had a super-set of features to JS#2017-06-0617:43aengelbergclearly I was mistaken#2017-06-0617:43wilkerluciono, JS is more tolerant in some cases#2017-06-0617:43wilkerluciofor example, those are invalid on JVM, but ok on JS: /a{/ /a{}/ /[]/#2017-06-0617:44wilkerluciowhen JS sees an incomplete curly braces, it threats as literals#2017-06-0617:44wilkerluciowhere JVM throws an exception#2017-06-0617:45wilkerluciobut in general the JS is simpler, since it doens't support character class unions (that feature adds a lot of complexity on the JVM Regex grammar, see: https://www.regular-expressions.info/charclassintersect.html)#2017-06-0617:45wilkerlucioso, this is the kind of feature that has a strong dependency on the platform#2017-06-0617:47wilkerlucioGary did a great job 
making generative testing to check if the custom parser conforms with the platform regex parser, check this test: https://github.com/gfredericks/test.chuck/blob/master/test/com/gfredericks/test/chuck/regexes_test.clj#L117-L119#2017-06-0617:47wilkerlucioit generates random regexs and try to parse it with custom and native regex parsers, and they have to conform (all fail or all pass)#2017-06-0617:48wilkerlucioit's just very hard to get the grammar to work exactly like the native one, with all quirks dealt with#2017-06-2104:02fabraoHello all, how can I make something like
(def parser
  (insta/parser "regra = <'filtro'> <ws> elemento+
                elemento = operador <ws> operando
                operador = ('origem')
                operando = #'[a-zA-Z0-9\\-]([0-9a-zA-Z\\-]*)'
                ws = #'\\s+'"))
(parser "filtro 
        origem 001-ARTICO 
        origem 011-BALDACCI")

What's wrong?
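One likely culprit is that `elemento+` leaves no room for whitespace between consecutive elements. A possible fix, sketched here under the assumption that every element is preceded by whitespace, is to move `<ws>` inside the repetition. The rule names are the original ones; `fixed-parser` and the sample input are made up for illustration.

```clojure
(require '[instaparse.core :as insta])

;; Same rules as above, except that the whitespace now repeats together
;; with each elemento instead of appearing once after 'filtro'.
(def fixed-parser
  (insta/parser
    "regra = <'filtro'> (<ws> elemento)+
     elemento = operador <ws> operando
     operador = ('origem')
     operando = #'[a-zA-Z0-9\\-]([0-9a-zA-Z\\-]*)'
     ws = #'\\s+'"))

;; Each `origem ...` line should become one :elemento node.
(fixed-parser "filtro
        origem 001-ARTICO
        origem 011-BALDACCI")
```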
#2017-06-2111:11gfredericks@fabrao looks like the grammar doesn't allow whitespace in between elementos#2017-06-2120:45aengelberg^#2017-06-2400:41wistbtest#2017-06-2400:42wistbhi, beginner question.#2017-06-2400:42wistb(def abc-parser (insta/parser (clojure.java.io/resource "grammars/abc.abnf") :input-format :abnf :trace true :output-format :enlive)) ;;:auto-whitespace whitespace))#2017-06-2400:42wistbif I enable the whitespace option, I am getting a compile error when I run 'lein run'.#2017-06-2400:43wistblike this:#2017-06-2400:43wistb(def abc-parser (insta/parser (clojure.java.io/resource "grammars/abc.abnf") :input-format :abnf :trace true :output-format :enlive :auto-whitespace whitespace))#2017-06-2400:43wistbsame problem if I used :auto-whitespace :standard#2017-06-2400:46wistbCaused by: java.lang.IllegalArgumentException: No matching clause: :char at instaparse.combinators_source$auto_whitespace_parser.invokeStatic(combinators_source.clj:163) at instaparse.combinators_source$auto_whitespace_parser.invoke(combinators_source.clj:162) at instaparse.combinators_source$auto_whitespace$iter__426__430$fn__431.invoke(combinators_source.clj:184)#2017-07-0509:44matanWith instaparse, how do you elegantly match any sequence of characters up until a specific sequence of characters?#2017-07-0509:54matanCurrently I do that in a cumbersome way, given the impedance mismatch between grammars and regular expressions:
WrappedLabel = UnderscorePair Word UnderscorePair
UnderscorePair = "__"
Word = #".+(?=__)" (* a valid word is hereby constrained to anything that does not include an UnderscorePair *)
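How the lookahead in `Word` behaves depends on whether the quantifier is greedy, which is easy to see with plain Clojure regexes. The inputs below are invented for illustration; the last form sketches the "put the underscore pairs in the regex" alternative with a capture group.

```clojure
;; Greedy `.+` runs to the LAST "__" it can still see ahead of it.
(re-find #".+(?=__)" "foo__ and bar__")        ;=> "foo__ and bar"

;; A reluctant `.+?` stops at the first one instead.
(re-find #".+?(?=__)" "foo__ and bar__")       ;=> "foo"

;; Whole wrapped label as one regex, capturing the inner word.
(second (re-find #"__(.+?)__" "x __foo__ y"))  ;=> "foo"
```

If several wrapped labels can share a line, the reluctant form is probably the safer choice for `Word`.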
#2017-07-0509:55matan__ appears both as a grammar definition (`UnderscorePair`) as well as serving as a stop expression in the regex#2017-07-0509:56matan🤔 I am curious in case there's a solution I've not thought of#2017-07-0509:57matanThe above is supposed to catch and parse anything of the form __foo__, so that foo can be extracted (as part of a larger parse the details of which are quite plain and uninteresting)#2017-07-0516:13aengelberg@matan that's how I would do it. I'm not aware of a more elegant or efficient solution, besides making the underscore pairs part of the regex.#2017-07-0617:01matan@aengelberg many thanks for the confirmation! of course, it stems from the difference between what regular expression language is and what grammars are, not from how instaparse works doesn't it 🙂 Would you agree to this?#2017-07-0617:05aengelbergYou mean regexes’ greediness? I would say that’s an artifact of regexes’ supported use case not matching what Instaparse needs it to do. If regexes had a way to lazily emit multiple parse results of the same string, we could maybe use that to make regex non-terminals behave more intuitively.#2017-07-0617:08aengelberge.g. if I match #"^a+" on "aaa" I’d like to see "a", "aa", "aaa"#2017-07-0617:10matanYes, well, at a higher level, regular expressions and grammars generate disparate automatons, whereas what you mention is one difference in the detail thereof (the most relevant one). This relates to two area I'm lately looking at ― fuzzy parsing and scannerless parsing#2017-07-0617:10matanAs to regex, have you ever used something like this? 
https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches#2017-07-0617:11matanOh, not really, ignore that last one#2017-07-0617:12aengelbergThat’s returning one match for each of a variety of starting indexes, which is not quite what I want#2017-07-0617:12matanYes, again, ignore that#2017-07-0617:13aengelbergYou could theoretically do (for [i (range (count s))] (re-match re (subs s 0 i)))#2017-07-0617:13aengelbergbut that is super slow and might not actually work depending on the regex#2017-07-0617:15aengelbergthe automaton structure of regexes is simply not designed to (efficiently) reason about the set of all possible matches for one string, using backtracking and laziness#2017-07-0617:15aengelbergactually maybe if you compile a NDFA#2017-07-0617:16aengelbergthen run it through one character at a time#2017-07-0617:16aengelbergand mark every time you are in a success state#2017-07-0617:25matanThe interaction of a grammar with the need to account for syntactic categories (a.k.a variables, and/or non-terminals) that translate to non-finite sets of productions is very interesting. Right now, we typically "escape" to regular expressions for that. Of course, this goes beyond the scope of my original question with its particular silly use case.#2017-07-0617:27aengelbergIn this case, the set of productions is not infinite; for the current index i, all I really need to know is, the set of indexes j \in [0,N) such that s[i:j] is a complete parse according to the regex.#2017-07-0617:29matanRight, obviously, when looking at a specific input text (or string).#2017-07-0617:32matanBut in the general sense, I also meant to say that we use regular expressions for where we want to stipulate a non-finite set, whereas a plain CFG doesn't provide (IIRC) that kind of support, which is why we use regex along with it. 
I have to refresh and brush up more on formal languages though, maybe what I just said is totally incorrect, or just sufficiently inaccurate to be incorrect.#2017-07-1015:52mrchanceHi! In Instaparse, how do I specify operator precedence... I tried ordered choice, but it doesn't do what I want. Simple example:
(def tp (insta/parser "
s = expression
<expression> = binop / integer
integer = #'[0-9]+'
<binop> = times / plus
times = expression <'*'> expression
plus = expression <'+'> expression"
                      :auto-whitespace whitespace))

parser> (tp "5 + 3 * 7")
[:s [:times [:plus [:integer "5"] [:integer "3"]] [:integer "7"]]]
I have seen solutions that distinguish between add-expression and mul-expression, but that doesn't scale very well for more operators
#2017-07-1015:54aengelbergOrdered choice only works when considering parses available at a given position, not when ranking this parse here over another parse over there.#2017-07-1015:55aengelbergSo in your example, your binop / integer was making it try 5 + 3 before 5#2017-07-1016:26mrchanceah, so I can fix it by making expression unordered? Or what's the best way?#2017-07-1016:30aengelbergI don't think ordered choice is the best tool for "order of operations"... I'll send you another example in a sec#2017-07-1016:32mrchanceThanks!#2017-07-1016:36aengelberg@mrchance check out the arithmetic expr parser in https://github.com/engelberg/instaparse#transforming-the-tree#2017-07-1016:37mrchanceThanks, I'll check it out#2017-07-1016:37aengelbergnotice how no ordered choice is necessary, because the grammar is structured so that the order of operations follows naturally#2017-07-1016:54mrchancehm, ok, but my impression is that would get unwieldy when there is more operators. I'll give it a try though, I don't have that many 😉#2017-07-1016:55aengelbergperhaps. not sure.#2017-07-1016:57aengelbergHere’s a potentially more elegant way of expressing it (not sure if this actually works):
expr = add-sub
add-sub = mul-div (('+' | '-') mul-div)*
mul-div = term (('*' | '/') term)*
term = number | <'('> add-sub <')'>
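A runnable version of this layered grammar might look like the sketch below. Several things are assumptions rather than part of the original message: the `number` rule, the use of `:auto-whitespace :standard`, and the helper names `ops`, `fold-ops`, and `calc`. `Long/parseLong` ties it to the JVM.

```clojure
(require '[instaparse.core :as insta])

;; One grammar level per precedence tier; lower tiers bind tighter.
(def arith
  (insta/parser
    "expr = add-sub
     add-sub = mul-div (('+' | '-') mul-div)*
     mul-div = term (('*' | '/') term)*
     term = number | <'('> add-sub <')'>
     number = #'[0-9]+'"
    :auto-whitespace :standard))

(def ops {"+" + "-" - "*" * "/" /})

(defn fold-ops
  "Left-folds children of the shape `v op v op v ...` into one number."
  [v & more]
  (reduce (fn [acc [op x]] ((ops op) acc x)) v (partition 2 more)))

(defn calc
  "Parses s and evaluates the tree bottom-up with insta/transform."
  [s]
  (insta/transform
    {:number #(Long/parseLong %)
     :term identity
     :mul-div fold-ops
     :add-sub fold-ops
     :expr identity}
    (arith s)))

(calc "5 + 3 * 7") ;=> 26
```

Adding another precedence tier means adding one more rule of the same shape, so the grammar grows linearly with the number of tiers rather than with the number of operators.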
#2017-07-1017:32mrchancehmm, wouldn't this disallow top level multiplication terms?#2017-07-1017:33mrchanceI am already wishing I had given my own language a lisp syntax 😉#2017-07-1019:22aengelberg@mrchance no, it would end up as [:expr [:add-sub [:mul-div "1" "*" "2"]]]#2017-07-1023:09mrchanceright 🙂 It worked, btw. For the Moment it's still quite manageable too. Thanks for the fast replies!#2017-07-1217:26fabraoHello all, how to I parse any kind of char in this: Groups: CN=THIAGO VITOR COSTA,OU=REPRESENTANTES,DC=DOM,DC=COM,DC=BR+OU=REPRESENTANTES,DC=DOM,DC=COM,DC=BR+CN=USUáRIOS DO DOMíNIO,CN=USERS,DC=DOM,DC=COM,DC=BR+CN=GPO_REPRESENTANTES_CN,OU=GRUPOS,DC=DOM,DC=COM,DC=BR+CN=USUáRIOS,CN=BUILTIN,DC=DOM,DC=COM,DC=BR+CN=GPO_REPRESENTANTES,OU=GRUPOS,DC=DOM,DC=COM,DC=BR? The Groups: is reserved and the rest is the capture information#2017-08-1804:47bbssI'm trying to parse Google's protobuf with Instaparse, but I can't find a recent file with it's EBNF spec.#2017-08-1804:48bbsshttps://developers.google.com/protocol-buffers/docs/reference/proto3-spec I've copied all the bits from this site, but some lines are giving errors when I try to parse them.#2017-08-1804:49bbssIs instaparse compatible with what they put in there? For example I see they use = where other files often seem to have :: ==.#2017-08-1804:51bbssAh, I just noticed the escape sequence part of the readme, let me look into that.#2017-08-1806:36bbssah I notice it's been discussed here before: https://clojurians-log.clojureverse.org/instaparse/2017-02-17.html#2017-08-1806:39bbss@frank did you perhaps end up succeeding parsing that spec? Can't find any more discussion in further dates on the clojureverse..#2017-08-1807:21bbssOkay, great I got it to work with the comments there. 
For any future visitors: https://gist.github.com/bbss/153e050f44db294cf7af3afc9a2f9a10#2017-08-1815:45franksorry just seeing this now - I had to make changes to it#2017-08-1815:46frankone thing that might need to be done is removing comments before feeding it into the parser (or adjusting the grammar to know about them)#2017-08-1904:54bbss@frank no worries, figured it out. I hadn't really used context free grammar before, but after meddling with those files for a bit I might actually use them more often. It seems like an indispensable tool for a Lisp programmer really 🙂#2017-08-1904:55bbssand you're right, I actually wrote a function to remove the comments.#2017-08-3020:13hlolliIn clojurescript with defparser, Im guessing the regexes are read wrongly, with insta/parser I get in my willingly generated token error
{:tag :regexp,
 :expecting #"^[0-9]+\.?[0-9]*"}
and in the same error via defparser
{:tag :regexp,
 :expecting #"^\/^[0-9]+\.?[0-9]*\/"}
both originating from <digit> = #'[0-9]+\\.?[0-9]*'
#2017-08-3020:15hlollimy first question should be, does some other clojurescript user experience this, as Im running my forked version of Instaparse 1.4.7 running on lumo.#2017-08-3020:16aengelbergInstaparse is known to have some bugs on Lumo#2017-08-3020:16aengelbergbecause Lumo behaves weirdly with cljc files#2017-08-3020:17hlolliyes, I know, I've fixed those on my fork and it's effectively working fine, just this one error with regexes, so Im only guessing this is unrelated.#2017-08-3020:17aengelbergsince defparse is evaluated at macro-time, not runtime, my guess is that Lumo trying to execute code on ClojureScript that was meant to be run on Clojure#2017-08-3020:18hlollior related, then in the way you described 🙂#2017-08-3020:18aengelbergThe regexp combinator in particular has special logic when run on ClojureScript https://github.com/Engelberg/instaparse/blob/master/src/instaparse/combinators_source.cljc#L93#2017-08-3020:18aengelbergwhich might be why you're only running into this now#2017-08-3020:19hlollihmm, ok, I try removing this constraint to see what happens...#2017-08-3020:19aengelbergalso maybe try not using defparser and see if that fixes it#2017-08-3020:20hlolliyes, that fixes it, I want to use defparser for the performance it brings, Im using the parser to evaluate musical expressions in realtime music application#2017-08-3020:28aengelbergmakes sense.#2017-08-3020:28aengelbergalso that sounds cool, is your application similar to https://github.com/alda-lang/alda ?#2017-08-3020:38gfredericksThe Adlaphone: a musical instrument for programmers#2017-08-3020:39hlollionly very partially similar to Alda, the use of parser is very limited, and only an extra feature Im implementing atm.#2017-08-3020:41hlollimore about native datatypes Im sending to Csound, audio processing language, and make simple repetitive patterns.#2017-08-3020:54hlolliloud thinking, I notice that the function regex gets called twice by defparser but once on parse, the second time the 
input is "/^[0-9]+\.?[0-9]*/" and returns #"^\/^[0-9]+\.?[0-9]*\/"#2017-08-3020:58hlolliah fixed it 🙂#2017-08-3021:00hlolliin the defparser macro I commented out
;; Regexp terminals are handled differently in cljs
 ;; (= :regexp (:tag form))
 ;; `(merge (c/regexp ~(str (:regexp form)))
 ;;         ~(dissoc form :tag :regexp))
maybe I should test the standard cljs and see if this is also a problem there.
#2017-11-0113:27Empperidoes someone happen to know if there is a ready tool to convert W3C EBNF declarations to a format which instaparse understands?#2017-11-0113:27Empperispecifically I’m looking for SPARQL ebnf declarations which I could feed into instaparse#2017-11-0113:32EmpperiI would also accept an ANTLR -> instaparse converter#2017-11-0116:07novelinstaparse understands the SPARQL grammar from w3c. you just have to adapt the production rules for the terminals so that instaparse will detect them as regexes. See for instance https://github.com/mladvladimir/sparqlom/blob/master/resources/sparql.ebnf#2017-11-0211:48Empperiyeah, I know it is almost the same. What you just linked actually looks like a ready made file which instaparse should be able to digest#2017-11-0211:48Empperilooks promising, need to test that out, thanks! 👍#2017-11-0211:53Empperiindeed, instaparse happily consumed that file. My first SPARQL query did not get through the parser though but it is much easier to proceed from here#2017-11-0813:57tbrookeI am looking at a project that uses https://nearley.js.org/ in javascript, which uses Earley parsing. I did find an old Clojure repo that mentioned Earley parsing. I am somewhat new to parsing and wondered if anyone was up on Earley parsing or whether I could do the same thing in Clojure with instaparse or another library. I don’t want to use javascript and I would like to rewrite the parser with Clojure/ClojureScript#2017-11-2316:55jeremysHey guys, I am playing with instaparse and I have a problem constructing a grammar.#2017-11-2316:55jeremysHere is what I am going for
(insta/defparser ex7
  "
  doc = (text | tag)*
  text = #'[^@]*'
  tag = '@' #'[a-z]*' inner-text*
  inner-text = '{' #'[^}]*' '}'
  ")

(ex7 "some text @toto{inner text}")
#2017-11-2316:58jeremysThe problem is the parser when parsing a tag rule won’t consider the inner-text rule giving me the parse
[:doc [:text "some text "] [:tag "@" "toto"] [:text "{inner text}"]]
#2017-11-2316:59jeremysinstead of the desired
[:doc [:text "some text "] [:tag "@" "toto" [:inner-text "{" "inner text" "}"]]]
#2017-11-2317:01jeremysAny Idea how I can modify the grammar to consider the inner-text rule before going back to the text one ?#2017-11-2322:07aengelberg@jeremys Maybe change inner-text* to inner-text* !inner-text, to ensure that it parses as many inner-texts as it can.#2017-11-2323:19jeremys@aengelberg Thx Alex I’ll try to use the lookahead, I haven’t played with that yet.#2017-11-2412:58jeremys@aengelberg Thx for the help, the negative lookahead work perfectly. I also started a thread on clojureverse with that question. You can find it here https://clojureverse.org/t/need-a-bit-of-help-with-an-instaparse-grammar/965/4#2017-11-3018:23mrchancehi, is there a possibility in instaparse to bind parse results? For example, when looking for a matching closing tag?#2017-11-3018:34aengelberg@mrchance sadly no. I'd recommend writing your parser so that it accepts any pair of tags (matching or not), and then separately do some validation to make sure the tag pairs are well formed.#2017-11-3018:42mrchanceok, thanks. 
Should be easy enough to do in the transform step#2017-12-1309:48Empperihmm, I have a rather large EBNF that I’m using to initialize an instaparse parser in ClojureScript#2017-12-1309:48EmpperiIt works just fine but the problem is that the parser creation takes a long time and since it’s ClojureScript that causes the UI to freeze while it is being created#2017-12-1309:49EmpperiI was wondering if there is any way to do this in a webworker and if anyone has tried to do that before#2017-12-1309:49Empperimy initial feeling is that “no you cannot use webworkers” since they work with message passing and as such only Strings can be passed#2017-12-1309:50Empperiideas?#2017-12-1310:05hlolli@niklas.collin are you using the defparser macro?#2017-12-1310:05Empperino, I’m using instaparse.core/parser#2017-12-1310:06hlolliI wonder if you'd be faster if you'd be using the macro, then it precompiles a bit.#2017-12-1310:06Empperidunno, maybe#2017-12-1310:06EmpperiI could try#2017-12-1310:07Empperiwell, that’s what the defparser documentation in instaparse github page says that it should work better#2017-12-1310:07Empperithanks, will try that#2017-12-1310:28Empperiyeah, now performance is pretty much instant#2017-12-1310:28Emppericheers 🙂#2017-12-1317:04aengelberg@hlolli yep, that's exactly what it's meant for. @niklas.collin I'm glad it served its purpose!#2017-12-1416:31mbjarlandI'm playing around with instaparse and for kicks and giggles I wrote a parser to parse some log files I have laying around#2017-12-1416:31mbjarlandis there a way to define a fixed width "anything goes" string in instaparse#2017-12-1416:32mbjarlandi.e. if I just want to gobble up a few characters into a tree node and don't care about the content there, is that possible?#2017-12-1416:32aengelbergFixed width? 
Maybe #'.{N}'?#2017-12-1416:33mbjarlandright, yes regex does the job but is probably not very performant for just "take substring of 10 from where you are"#2017-12-1416:34mbjarlandok, so regex is the way to go for this in instaparse?#2017-12-1416:35aengelbergI think regex is the most performant way to grab a not-static set of characters#2017-12-1416:37mbjarland: ) well I should probably mention that I think instaparse is excellent and by far the best parser lib I've run across....so my intent was not to come here and critique it#2017-12-1416:38aengelbergThanks! And no worries, I was just answering your question from the perspective of what instaparse actually supports#2017-12-1416:38mbjarlandthat being said...if I parse 2G of log files (without instaparse) and compare the simplest regex match with (subs line 10 20), regex performace doesn't exactly shine#2017-12-1416:38aengelbergBut I see your point that if it theoretically supported a dedicated "substring" combinator, that would be faster#2017-12-1416:39mbjarlandanyway, figured I would ask, but regex does indeed do the job and perhaps what I'm doing with this parser is a bit of an edge case#2017-12-1416:40aengelbergMaybe we should support "custom combinators" so people like you with special use cases can write their own more performant specialized versions#2017-12-1416:40mbjarlandthat would be awesome#2017-12-1416:42mbjarlandyou would have to add some kind of extension point to the instaparse bnf syntax I guess#2017-12-1416:47aengelbergMaybe, or we don't allow extensions to the EBNF syntax and just let people make custom combinators for the combinator syntax#2017-12-1416:50mbjarlandah, ok, hadn't grokked the combinators syntax until now#2017-12-1416:56mbjarlandright now I'm considering writing my own mini language for this log parsing, I could use instaparse to parse that language and then do custom, optimized parsing based on the format specification tree coming out from instaparse...so still 
useful#2017-12-1417:21mbjarlandhmm, how come I need to double escape the not-inclusive rule in the following grammar:
(def my-p 
  (instaparse.core/parser 
    "spec = (field-spec <' '?>)+
     field-spec = <'['>name ' '* <':'> ' '* (width | not-inclusive | not-exclusive | rest)<']'>
     name = #'[^:]+'
     width = <'{'> #'\\d+' <'}'>
     not-inclusive = <'\\\\'> #'.'
     not-exclusive = <'/'> #'.'
     rest = '*'    
    "))
#2017-12-1417:22aengelbergyou mean the '\\\\'?#2017-12-1417:22mbjarlandyeah#2017-12-1417:22mbjarlandshouldn't two have been enough?#2017-12-1417:23aengelbergbecause 1) you need to tell Clojure that you aren't escaping a character within a string 2) you need to tell Instaparse that you aren't escaping a character within a string combinator#2017-12-1417:23mbjarlandok, missed point 2 there#2017-12-1509:10Empperiis there some way in instaparse to ask for all possible grammar elements at certain point of the parse tree?#2017-12-1509:10Empperimeaning, I have parse result, I take a specific point in that parse tree and the would get a list of parse elements that could go there#2017-12-1509:10EmpperiI could theoretically write an EBNF analyzer to do just this but don’t feel like it unless I have to#2017-12-1509:11Empperisince I think instaparse already does this somewhere under the hood and has the necessary information#2017-12-1509:19Empperihmm, there is instaparse.cfg/ebnf, need look at it and if it would provide the necessary information#2017-12-1509:21Empperiactually it looks like it just might#2017-12-1516:59aengelberg@niklas.collin not sure what you're asking. what would be an example of using this functionality?#2017-12-1517:01EmpperiAutocomplete suggestions for code editor#2017-12-1517:01EmpperiAnd the result from cfg/ebnf looks ok usable#2017-12-1521:06aengelbergThere have been a few discussions about generating data for a parser, or listing possible inputs to a parser#2017-12-1521:06aengelbergNot sure the best path to exposing that#2017-12-1521:07aengelbergare you sure ebnf is what you want? That just creates a combinator based on an EBNF spec, it doesn't generate a list of things that could go there#2017-12-1617:42borkdudeHello. Why is my InstaParser so slow? 
850ms vs 8ms hand-written Clojure: https://github.com/borkdude/aoc2017/blob/master/src/day16.clj#L119#2017-12-1617:43borkdudeMaybe it’s my grammar, but I don’t see it.#2017-12-1617:43aengelbergHow big is the input?#2017-12-1617:43borkdudeThe input is this: https://github.com/borkdude/aoc2017/blob/master/resources/day16.txt#2017-12-1617:44borkdudeHere’s the grammar: https://github.com/borkdude/aoc2017/blob/master/src/day16.clj#L17#2017-12-1617:48aengelbergI mean, you're basically generating one string per character in a 40kb file. I'm not surprised it's slow.#2017-12-1617:48aengelbergAnd wrapping with vectors, etc#2017-12-1617:50borkdudeThat’s fair, but even without wrapping the arguments I get a similar time#2017-12-1617:50borkdudeI had that before, but I wanted to transform the arguments to ints, that’s why I wrapped them later#2017-12-1617:51aengelbergHmm yeah#2017-12-1617:53borkdudeI mean, it’s not really a problem, but just curious why or if I made a mistake in my grammar#2017-12-1617:53aengelbergI don't think you made a mistake like an ambiguity issue or anything#2017-12-1617:54aengelbergIt's just really exercising the parser's dataflow overhead#2017-12-1617:55aengelbergI saw similar issues when trying to perf-tune @dave's Alda parser a while ago#2017-12-1617:56aengelbergBecause his rules were like "if you see this single character, parse this other single character"#2017-12-1617:57aengelbergInstaparse does a lot of bookkeeping during a parse to make sure it magically works with weirdly recursive grammars, so for super low level parsers like this it doesn't exactly shine#2017-12-1617:57borkdude
(def parse2
  (insta/parser
   "<INPUT>       = (INSTRUCTION <','>)+ INSTRUCTION 
    <INSTRUCTION> = SPIN | EXCHANGE | PARTNER
    SPIN          = <'s'> POSITION
    EXCHANGE      = <'x'> POSITION <'/'> POSITION
    PARTNER       = <'p'> PROGRAM  <'/'> PROGRAM
    <POSITION>    = #'\\d\\d?'
    <PROGRAM>     = #'[a-p]'"))
No nesting of position and program, 800ms
#2017-12-1617:58borkdudeok#2017-12-1617:59aengelbergJust curious, does anything improve if you change the first rule to INSTRUCTION (<','> INSTRUCTION)*?#2017-12-1618:04borkdudeI wondered about that rule as well. Quickbenching…#2017-12-1618:05borkdudeYup, 582ms!
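For reference, a sketch of the variant that was just benchmarked, assuming the only change to parse2 is the first rule (the name parse3 is made up here, and no timing is claimed beyond what borkdude reported):

```clojure
(require '[instaparse.core :as insta])

;; Same rules as parse2 above; only the first rule is restructured as
;; "one INSTRUCTION, then zero or more ',' INSTRUCTION pairs".
(def parse3
  (insta/parser
   "<INPUT>       = INSTRUCTION (<','> INSTRUCTION)*
    <INSTRUCTION> = SPIN | EXCHANGE | PARTNER
    SPIN          = <'s'> POSITION
    EXCHANGE      = <'x'> POSITION <'/'> POSITION
    PARTNER       = <'p'> PROGRAM  <'/'> PROGRAM
    <POSITION>    = #'\\d\\d?'
    <PROGRAM>     = #'[a-p]'"))

;; A tiny input in the same format as the puzzle file.
(parse3 "s1,x3/4,pe/b")
```

Both forms accept the same inputs, so swapping one for the other should not change the resulting tree, only how much work the parser does to get there.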
#2017-12-1618:05borkdudeDoes the order of rules matter for performance?#2017-12-1618:06borkdudeI mean, when it’s more likely to encounter EXCHANGE, does it help putting that first?#2017-12-1704:09aengelbergI'm not actually sure why that is faster, just a hunch since I haven't seen the left-recursive usage a whole lot.#2017-12-1812:24Empperi@aengelberg definitely not sure if cfg/ebnf is the correct thing but it looks like it provides the necessary data. Not in the format that would be optimal but I think I can kinda do some kind of functionality based on it. For now, I’m mostly happy if I can find the string literals which can go in and I can ignore regexp literals and other more complex stuff. So, basically I do this recursive algorithm which retrieves elements from the combinatorial tree until it finds :string#2017-12-1812:24Empperibest I can come up with as it is now#2017-12-2009:52mbjarlandsay I have the following grammar:
(defn make-layout-parser-internal []
  (insta/parser
    "layout-string = col-delim (col-align col-delim)+
     col-delim    = (col-fill | col-padding)*
     col-fill     = ('F' | 'f')
     col-padding  = #'[^\\[\\]fF]*'
     col-align    = <'['> ('L' | 'l' | 'C' | 'c' | 'R' | 'r') <']'>"))
but I want to make a slight modification to it in a certain context. In essence I have two grammars with just a slight difference between them, and depending on the surrounding (non-instaparse-related) programming context I would like to parse using either grammar a or grammar b. Would I need to define two distinct grammars as per the above, or is there some good way to share most of the grammar and have just a slight modification? In my specific case I would have a difference in the col-align value only
#2017-12-2706:39aengelberg@mbjarland you could use combinators to create separate submaps of common components that get merged together in each use case.#2018-01-0808:51mbjarlandI have a question about greedy parsing and ambiguous grammars, I have the following grammar:
(insta/parser
    "layout   = (align | delim | repeat)+
     repeat   = <'{'> (align | delim)+ <'}'>
     delim    = (fill | padding)+
     fill     = 'F'
     padding  = #'[^\\[\\]{}fF]*'
     align    = <'['> ('L' | 'C' | 'R' | 'V') <']'>"
    :string-ci true))
where the only relevant pieces are the layout = (align | delim | repeat)+ and delim = (fill | padding)+ pieces. Assume we get a ‘fill’ followed by a ‘padding’, the parser can here choose between [:delim [:fill "F"]] [:delim [:padding "xxx"]] and [:delim [:fill "F"] [:padding "xxx"]]. This is not really instaparse specific, but rather to do with bnf’s and disambiguating repeating elements. I would like to force the second interpretation where the delim = (fill | padding)+ is greedy and collects all contiguous delim elements into a list before proceeding. Any ideas much appreciated
#2018-01-0809:01aengelberg@mbjarland you could change (align | delim)+ to delim? (align delim?)*, that would force the parser to alternate between delim and align, effectively making the delim rule greedy#2018-01-0809:03mbjarlandand this is why I love the clojure community, thank you for the fast reply! : ) are we talking about the first rule layout = ...? In that case I would have to cook up a repeating pattern with 3 elements#2018-01-0809:04mbjarlandso what I’m doing here is defining a column layout#2018-01-0809:04aengelbergActually I was talking about the repeat rule. But you made a good point that we'd also have to apply a similar solution to the layout rule to get a similar effect#2018-01-0809:04mbjarlandwhere the repeat says “if the user comes in with more columns than we have defined, use the repeating group to fill out the layout”#2018-01-0809:05aengelbergColumn layout? Not sure what you mean#2018-01-0809:05mbjarlandnever mind, that is really application related and not related to the grammar#2018-01-0809:06mbjarlandthought actually explaining what the thing does might clarify, but I think it just muddles the waters even more : )#2018-01-0809:06mbjarlandI also thought about the / ordered parsing syntax, but I can not really see how to apply that to these repeating patterns#2018-01-0809:07aengelberg
layout = delim? ((align | repeat) delim?)*
repeat = <'{'> delim? (align delim?)* <'}'>
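A sketch of the whole grammar with those two rules swapped in, for trying at a REPL. One deliberate deviation, flagged as an assumption: padding uses + instead of * so that it can never match the empty string, which keeps empty delim nodes out of the picture. The name layout-parser and the sample input are made up for illustration.

```clojure
(require '[instaparse.core :as insta])

(def layout-parser
  (insta/parser
   "layout   = delim? ((align | repeat) delim?)*
    repeat   = <'{'> delim? (align delim?)* <'}'>
    delim    = (fill | padding)+
    fill     = 'F'
    padding  = #'[^\\[\\]{}fF]+'
    align    = <'['> ('L' | 'C' | 'R' | 'V') <']'>"
   :string-ci true))

;; insta/parses returns every possible parse, so counting them shows
;; whether the grammar is still ambiguous for a given input.
(count (insta/parses layout-parser "F  [L]{[C] }"))
```

Because a delim can now only appear in the slots between align and repeat elements, a run of fill and padding has nowhere to go except into a single delim node.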
#2018-01-0809:08mbjarland: )#2018-01-0809:09aengelbergSince align and repeat both require nonzero characters to be present in order for the rule to succeed, wedging them in the repeat rule enforces some regularity in the repetition and thus unambiguates the parsing#2018-01-0809:09mbjarlandthat works, I tried it with insta/parses and my standard examples and it comes out with a single interpretation#2018-01-0809:10mbjarlandI will have to meditate on the exact mechanics#2018-01-0809:10aengelbergYeah, ordered choice doesn't really help here unless delim was competing with some other rule and you wanted to establish the priority. But in this case delim is just competing with itself, in a way.#2018-01-0809:10mbjarlandexactly, just a question on what level in the bnf the repetition happens#2018-01-0809:11aengelbergYeah#2018-01-0809:11aengelbergFortunately that's easily solved by restructuring the combinators, no fancy PEG combinators necessary#2018-01-0809:11mbjarlandthis is actually a higher level pattern, I will make sure to grok this properly for the next time I run into an ambiguous repetition#2018-01-0809:11mbjarlandthanks a ton!#2018-01-0809:12aengelbergNo problem#2018-01-0809:18aengelberg@mbjarland I just thought of another way you could have solved it: ignore the above changes I proposed and instead change delim rule to:
delim = (fill | padding)+ !delim
#2018-01-0809:18aengelbergPretty sure my first solution would be more performant, albeit more complex#2018-01-0811:33mbjarlandwhat does the !delim do?#2018-01-2203:26xiongtxDoes the order of rules for Instaparse matter? I seem to recall that rule order didn't for tools like Lex/JLex...but I could be wrong.#2018-01-2203:27aengelbergThe order of rules in an alternation does not matter, if that's what you're asking.#2018-01-2203:44xiongtxI meant the statements themselves; but funny you should bring that up, b/c I was just wondering about the order of alternations as well.#2018-01-2203:46xiongtxFor
type = 'int' | 'boolean' | className

className = identifier

identifier = #"[A-Za-z_]+[A-Za-z0-9_]*"
Is there a way to make the order of alternations matter, i.e. have (insta/parse parser "int") return [:type "int"] instead of (from my observation) [:type [:className [:identifier "int"]]]?
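The log has no reply to this one. One option that might fit is instaparse's PEG-style ordered choice, written / instead of |, which tells the parser to prefer earlier alternatives. A minimal sketch, assuming the grammar above is otherwise unchanged (type-parser is a made-up name, and the regex is single-quoted so it can sit inside a Clojure string):

```clojure
(require '[instaparse.core :as insta])

(def type-parser
  (insta/parser
   "type       = 'int' / 'boolean' / className
    className  = identifier
    identifier = #'[A-Za-z_]+[A-Za-z0-9_]*'"))

;; The keyword alternatives are tried first; className is the fallback.
(type-parser "int")
(type-parser "Foo")
```

Ordered choice only expresses a preference. insta/parses would still list the className interpretation of "int" as a later alternative.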
#2018-04-0812:04mishagreetings! I am having trouble matching 2 consecutive backslashes (clj):
(insta/parse
  (insta/parser "s = #'\\\\'")
  "\\\\")
=> Parse error at line 1, column 1:
\\
^
Expected:
#"\\" (followed by end-of-string)

(insta/parse
  (insta/parser "s = #\"\\\\\"")
  "\\\\")
=> Parse error at line 1, column 1:
\\
^
Expected:
#"\\" (followed by end-of-string)
at this point I am just brute-forcing with no luck
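The two failing attempts above can be reproduced at a plain REPL without instaparse. This sketch assumes the regex text inside the grammar reaches the regex engine unchanged, which is what the Expected: #"\\" in the error message suggests. It only counts how many backslashes survive each layer:

```clojure
;; Layer 1: the Clojure reader turns each \\ in source into one backslash.
(count "\\\\")       ;; => 2
(count "\\\\\\\\")   ;; => 4

;; Layer 2: the regex engine reads each \\ in a pattern as one literal backslash.
;; Four source backslashes -> pattern \\ -> matches ONE backslash, so a
;; two-backslash input is not fully consumed.
(re-matches (re-pattern "\\\\") "\\\\")       ;; => nil

;; Eight source backslashes -> pattern \\\\ -> matches TWO backslashes.
(some? (re-matches (re-pattern "\\\\\\\\") "\\\\"))   ;; => true
```

So each literal backslash in the input costs four backslashes in a regex written inside a Clojure string.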
#2018-04-0902:04aengelberg@misha I think you are just under-escaping. Try
(insta/parse
  (insta/parser "s = #'\\\\\\\\'")
  "\\\\")
#2018-04-0902:05aengelbergSince instaparse and Clojure strings both have their own notation of backslash escaping, you unfortunately have to use an obscene amount of backslashes to convey a legitimate backslash character#2018-04-0902:06aengelberghttps://github.com/engelberg/instaparse#escape-characters#2018-04-0902:07aengelbergAs that section in the readme explains, one way to get more predictable escaping behavior is to store your grammar in a separate resource file, and that removes one of the layers of escaping (Clojure strings)#2018-04-0906:43misha@aengelberg thank you, that works. I already put grammar and source string into their own files, before asking for help, might have missed something (like ns reload).#2018-04-0906:59mishahow can I distinguish between \n within a text, and a line end in multiline text? Can I do it without relying on next line's grammar/content? For example, here I need to extract "Simple__ communication example\non several lines" as a single value, w/o including "Alice". Is there a landmark I can use to stop at "visual" line's end?:
title __Simple__ communication example\non several lines
Alice -> Bob: Authentication Request
I tried #'(?m)...$' regex flag, but I doubt it will receive "visual" line as an input.
#2018-04-0915:29aengelberg@misha yeah, from the regex's perspective it's matching against the entire rest of the string, so $ means "end of file", not "end of line". You could use regex's lookahead feature to detect an end of line, like (?=\r?\n)#2018-04-1823:36gfrederickscan instaparse be used to parse significant-whitespace-indentation-things, in the python sense?#2018-04-1823:37aengelbergnot really, primarily because of how the "levels" work in said indentation-heavy languages#2018-04-1823:38gfredericks@aengelberg can you change your username to ængelberg#2018-04-1823:39gfredericksthanks#2018-04-1823:39aengelbergdone#2018-04-1823:40gfredericksokay well that's disappointing because I'm parsing such a thing and I guess I'll have to do it tediously by hand. have you seen any general tools that can handle it?#2018-04-1823:41aengelbergif you could somehow pre-tokenize all of the indentation before passing it to instaparse, that might work#2018-04-1823:43aengelbergwhat I mean is that you'd have to convert
def f(x):
  if x == 1:
    return 2;
  else:
    return 3;
into
def f(x):
→if x == 1:
→return 2;
←else:
→return 3;
←←
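A rough sketch of such a pre-tokenizing pass in plain Clojure, to show that the conversion above is mechanical. mark-indentation is a hypothetical helper, not part of instaparse, and it assumes space-only, consistent indentation with no blank lines:

```clojure
(require '[clojure.string :as str])

(defn mark-indentation
  "Replaces leading spaces with → (block opens) and ← (block closes) markers."
  [text]
  (loop [lines (str/split-lines text)
         stack [0]      ; indentation widths of the currently open blocks
         out   []]
    (if-let [line (first lines)]
      (let [width   (count (re-find #"^ *" line))
            content (subs line width)]
        (cond
          ;; Deeper than the current block: open a new one.
          (> width (peek stack))
          (recur (rest lines) (conj stack width) (conj out (str "→" content)))

          ;; Shallower: close blocks until the widths line up again.
          (< width (peek stack))
          (let [kept   (vec (take-while #(<= % width) stack))
                closed (- (count stack) (count kept))]
            (recur (rest lines) kept
                   (conj out (str (apply str (repeat closed "←")) content))))

          ;; Same depth: pass the line through.
          :else
          (recur (rest lines) stack (conj out content))))
      ;; End of input: close whatever is still open on a final line.
      (str/join "\n" (conj out (apply str (repeat (dec (count stack)) "←")))))))

(println (mark-indentation
          "def f(x):\n  if x == 1:\n    return 2;\n  else:\n    return 3;"))
```

The markers can then be treated as ordinary bracket tokens in the grammar, since the state (the stack of widths) has been spent before the parser ever sees the text.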
#2018-04-1823:43aengelbergif that makes sense#2018-04-1823:43gfredericksand the arrows act like brackets?#2018-04-1823:43aengelbergbasically#2018-04-1823:44aengelbergbecause instaparse can't keep state of how far to the right you should be at any given point#2018-04-1823:44aengelbergbased on higher-level indentations#2018-04-1823:44gfredericksI think the combination of writing that code and having to write a grammar is probably more tedious than parsing it manually#2018-04-1823:44aengelbergperhaps#2018-04-1823:46aengelbergoh also you could maybe get sneaky with nested parsing#2018-04-1823:47aengelbergwhere, say, the top level parser gives you back
[:def "f(x)"
 [:nested-block
  "if x == 1:"
  "  return 2;"
  "else:"
  "  return 3;"]]
#2018-04-1823:48aengelbergand then you have some code that then takes said nested blocks and re-runs the parser on it#2018-04-1823:48aengelbergand so on#2018-04-1823:49gfredericksoh interesting#2018-04-1823:49gfredericksif you're not trampolining your parser, why bother getting up in the morning?#2018-04-1823:53aengelbergyou would have to re-append those nested block lines (with newlines in between)#2018-04-1823:54gfrederickssure#2018-04-1823:56aengelbergthe grammar would look like
S = def | if | ...
def = <'def '> thing-you're-deffing <':\n'> nested-block
nested-block = (' ' #".*\n?")*
#2018-04-1900:00gfrederickswe're relying on the greediness of .* to definitely eat the \n even though it`s got a ? on it?#2018-04-1900:01aengelbergwell, . by default means non-newline chars in regex-land#2018-04-1900:01gfredericksthe \n? had me worried that it might parse [" foo" " bar"] from " foo bar"#2018-04-1900:01aengelbergbut yes we are relying on the greediness#2018-04-1900:02gfredericks:+1:#2018-04-1900:02aengelberg#".*(\n|$)" might be safer#2018-04-1900:03aengelbergbtw I believe @dave has some prior art on nested parsers for Alda#2018-04-1900:03aengelbergactually just kidding I think he migrated away from instaparse in more recent versions of Alda#2018-04-1901:05daveindeed, i ended up rolling my own parser, mostly in the hope that someday we can start to asynchronously process parsed expressions/statements as they are parsed -- meaning that a score could start playing before it's even done being parsed#2018-04-1901:05dave...although in doing so, performance got significantly better, to the extent that we might never need to do that 😄#2018-04-1901:09davebefore moving away from instaparse, i was doing some pretty complicated stuff with multiple grammars. it was starting to make my head hurt a little, so that may have had something to do with the decision too#2018-04-1901:09davethe grammars needed to share some rules with each other, so i was defining grammars using bits and pieces strung together#2018-04-1901:11davethe reason for that was basically to try and get better perf (without being super knowledgeable about how to improve parser performance otherwise), instead of a big master grammar like i had before, i started parsing out large chunks with more specialized grammars, and do the parsing in multiple passes#2018-05-2214:49aengelberg#2018-05-2214:50aengelberg@sova Once your query has been parsed from a string into data (that's usually the hard part), you can use whatever strategy you want to actually evaluate it. 
insta/transform is meant to be just one tool in your toolbelt to serve the most common case.#2018-05-2214:52aengelbergI think this is the transform example you were referring to:
=> (->> (arithmetic "1-2/(3-4)+5*6")
     (insta/transform
       {:add +, :sub -, :mul *, :div /,
        :number clojure.edn/read-string :expr identity}))
33
#2018-05-2214:54aengelbergIn this case it isn't replacing :add with +, it's actually calling + on all of the elements that come after :add. So as the transformer works its way up the tree, it ends up evaluating the whole expression.#2018-05-2214:56aengelbergDoes that explanation help?#2018-05-2214:59sova-soars-the-sorawow! that is very cool, it's calling + ... this explanation is very helpful, it gives me a really strong starting point, but i'm still not sure how i can end up with a thing i can test logicals on. since i could replace different nodes with whatever function i see fit, there's a lot of power there, but i gotta end up with something that i could pour "lemon juice" through and if the internal query expression was ("lemon" | "momo") & "juice" ... it would work. do you have any ideas on transforming nodes into logic gates?#2018-05-2215:01sova-soars-the-soraokay, you mention that it works its way up the tree so presumably it starts at the leaves... so if on matching leaves we write down "true" and non matching leaves we write "false" and then eventually do an eval on the whole thing, it'll be like getting a truth statement back.#2018-05-2215:02sova-soars-the-soraThanks for taking the time to explain that, by the way.#2018-05-2221:07akirozFound this channel in #beginners and I just wanna say: @aengelberg thank you for the great library, it saved me multiple times from DSL hell 😊#2018-05-2221:09aengelbergthanks @akiroz, glad you are finding it useful#2018-05-2221:12aengelberg@sova yeah your true/false strategy should work. insta/transform is a recursive function, and whenever it's processing a node it transforms all of the children first, hence the "leaves first" approach.#2018-05-2221:13aengelbergyou can take a look at the source code of insta/transform. 
It has some boilerplate to make sure it works on instaparse's various output formats, but at its core it's a fairly simple depth-first iterator.#2018-05-2221:14aengelberghttps://github.com/Engelberg/instaparse/blob/master/src/instaparse/transform.cljc#L33-L46#2018-05-2221:18aengelbergthe super-simplified version of the logic looks like this:
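(defn- hiccup-transform
  [transform-map parse-tree]
  ;; Only non-empty vectors are nodes; string leaves pass through unchanged.
  (if (and (sequential? parse-tree) (seq parse-tree))
    (let [children (map (partial hiccup-transform transform-map)
                        (next parse-tree))]
      (if-let [transform (transform-map (first parse-tree))]
        (apply transform children)
        ;; No transform registered for this tag: keep the node as-is.
        (into [(first parse-tree)] children)))
    parse-tree))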
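To tie this back to the true/false idea for the ("lemon" | "momo") & "juice" query, here is a self-contained sketch in plain Clojure. The tree shape and the :and, :or and :term tags are assumptions about what a query grammar might produce, and simple-transform is a stand-in written for this example rather than instaparse's own function:

```clojure
(require '[clojure.string :as str])

(defn simple-transform
  "Leaves-first walk: transform the children, then apply the function
   registered for the node's tag (or keep the node if there is none)."
  [transform-map tree]
  (if (vector? tree)
    (let [children (map (partial simple-transform transform-map) (rest tree))]
      (if-let [f (transform-map (first tree))]
        (apply f children)
        (into [(first tree)] children)))
    tree))

;; Hypothetical parse tree for the query ("lemon" | "momo") & "juice".
(def query-tree
  [:and [:or [:term "lemon"] [:term "momo"]] [:term "juice"]])

(defn matches?
  "True when `text` satisfies the query tree."
  [tree text]
  (simple-transform
   {:term (fn [word] (str/includes? text word))
    ;; `and` and `or` are macros, so wrap them in functions for apply.
    :and  (fn [& xs] (every? true? xs))
    :or   (fn [& xs] (boolean (some true? xs)))}
   tree))

(matches? query-tree "lemon juice") ;; => true
(matches? query-tree "momo tea")    ;; => false
```

With a real parser, the same transform map could be handed to insta/transform directly.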
#2018-05-2221:32sova-soars-the-soraCool! Thanks very much for providing source and also zooming in on the vital part.#2018-05-2221:34sova-soars-the-soraSo the (partial ..) function ... could you tell me more about how that comes into play?#2018-05-2221:45aengelbergpartial is a function built-in to Clojure that helps curry arguments in anonymous functions http://clojuredocs.org/clojure.core/partial#2018-05-2419:35sova-soars-the-soraHello again. I stumbled onto http://instaparse-live.matt.is/ and it's really awesome, I've been tinkering with it and with a repl-like feedback loop it's been pretty painless finding something that can parse logic like i need#2018-05-2419:36sova-soars-the-soraBut one thing I have noticed is that if there are parse variations, there may be multiple result parses#2018-05-2419:41sova-soars-the-soraThat pretty much encapsulates the nodes as I want to store them... i wonder if there is a cleaner or clearer representation for just & and | logic w/ parens#2018-05-2419:43sova-soars-the-soraSometimes, there are multiple valid interpretations, for example: (pardon the zoom level)...#2018-05-2419:46sova-soars-the-soraIt seems as though they are both perfectly valid so, maybe I can just use the first one of the result set#2018-05-2419:57sova-soars-the-sorabetter example..#2018-05-2420:08davewow, i haven't seen this before. so cool!#2018-05-2516:32sova-soars-the-sorayeah, LISP is insanely cool, instaparse is amazing, and context free grammars absolutely rock. seeing how easy it actually is to get tokens out of arbitrary syntax i'm really inspired to work on some NLP stuff!#2018-05-2718:50Logan Powell👋 Hi everyone!#2018-05-2719:01Logan Powellbtw, http://instaparse-live.matt.is/ is awesome#2018-05-2719:08akiroz@loganpowell Why not just read the clojure code as data? 
I mean lisp code IS data, there's no need to deal with string parsing#2018-05-2719:09Logan Powellhow would I pull out the pieces of the function definitions as strings?#2018-05-2719:09Logan PowellI'm converting it to markdown#2018-05-2719:10akirozjust print it to string#2018-05-2719:10Logan Powellhaha, I'm very stupid. That's a great idea#2018-05-2719:10akirozyou might want pprint actually, since it's for docs#2018-05-2719:11Logan Powellok, so I get the string that way, then how do I pull out the specific parts of that string that I need?#2018-05-2719:11akirozwell first parse the whole thing into data with the built-in read function#2018-05-2719:12akirozmanipulate the data as much as you want then print it#2018-05-2719:12Logan Powellhmm... let me give that a shot!#2018-05-2719:13akiroz@loganpowell if you need more advanced code analysis, check out tools.analyzer#2018-05-2719:13akirozhttps://github.com/clojure/tools.analyzer#2018-05-2719:14Logan PowellI'm using cljs, works the same?#2018-05-2719:14akirozyou mean the read part or analyser?#2018-05-2719:14Logan Powellboth#2018-05-2719:14akirozformer is called cljs.reader/read-string in cljs#2018-05-2719:15akirozlatter I have no idea if it works in cljs (I'm gonna guess no)#2018-05-2719:16Logan Powellis reader a part of core or do I need to add it as a :dependency?#2018-05-2719:16akirozit's built-in#2018-05-2719:16Logan Powellcool#2018-05-2719:18Logan Powellit's working 🙂 I was getting all excited about instaparse... now I have to calm down my curiosity and get to work 😄#2018-05-2719:19Logan Powelldo I use core.match with this?#2018-05-2719:19akirozHaha, I suppose building a parser yourself would be a great learning exercise too.... 
but code grammar is a bit complex.#2018-05-2719:20Logan Powellit looks as so, you're right#2018-05-2719:20akirozYou can use whatever tools you want to process the data, it's just a list#2018-05-2719:20Logan Powellok, let me give it a go#2018-05-2720:44aengelbergyeah, Instaparse only aims to help turn strings into data, so if you already have a way to do that (`read-string`) then instaparse won't be much help#2018-05-2720:45aengelbergthe "analysis" of your resulting data is always left as an exercise to the reader anyway 🙂#2018-05-2804:44sova-soars-the-soradoes it make sense to use instaparse on the input to (read-line) (e.g. getting input from std in?)#2018-05-2807:09sova-soars-the-sorabecause I want to parse numbers and strings. integers#2018-05-2822:54aengelbergnot sure what you mean @sova; instaparse parsers can be run on any string#2018-05-2900:03sova-soars-the-soraYeah, I had a hard time forming my question, i'm getting input from stdin, i have resorted to using edn/read-string which gives a vector of strings,#2018-05-3004:06fabraoHello all, I´m doing parser for fixed width with instaparse and the code is
(->>
   ((insta/parser
     "VALOR = CODIGO BARRAS
CODIGO = 2DIGIT
BARRAS = 3DIGIT
" :input-format :abnf) "12334")
   (insta/transform {:DIGIT (comp str)}))
How do I concat the "DIGITs" and keep CODIGO and BARRAS?
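One way to join the digits while keeping the CODIGO and BARRAS tags is a transform that re-wraps the joined string in its tag. This is a minimal sketch, not the code from the log: it assumes plain EBNF with an explicit DIGIT rule instead of the ABNF core rule, and `join-under` is a hypothetical helper name.

```clojure
(require '[instaparse.core :as insta])

;; Assumption: plain EBNF stand-in for the ABNF grammar in the question.
(def fixed-width
  (insta/parser
    "VALOR = CODIGO BARRAS
     CODIGO = DIGIT DIGIT
     BARRAS = DIGIT DIGIT DIGIT
     DIGIT = #'[0-9]'"))

;; Hypothetical helper: joins the children into one string, keeps the tag.
(defn join-under [tag]
  (fn [& children] [tag (apply str children)]))

(def result
  (->> (fixed-width "12334")
       (insta/transform
         {:DIGIT  str                     ; [:DIGIT "1"] becomes "1"
          :CODIGO (join-under :CODIGO)
          :BARRAS (join-under :BARRAS)})))

result
;; => [:VALOR [:CODIGO "12"] [:BARRAS "334"]]
```

Transform functions receive the node's children as arguments, so returning a fresh `[tag joined-string]` vector is what preserves the tag.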
#2018-05-3006:21aengelberg@fabrao try adding transform entries for :CODIGO (partial apply str) :BARRAS (partial apply str)#2018-05-3012:29fabrao@aengelberg it did not work, it keeps showing [:VALOR [:DIGIT "1"] [:DIGIT "2"] [:DIGIT "3"] [:DIGIT "3"] [:DIGIT "4"]]#2018-05-3012:31fabraoI had to change to
((insta/parser
     "VALOR = CODIGO BARRAS
CODIGO = #'.{3}'
BARRAS = #'.{2}'
") "12334")
#2018-05-3115:37sova-soars-the-sorahey @aengelberg, how does instaparse work? i was looking at the source and it looks like it makes multiple passes until it's successfully consumed the whole input string, correct? and if a parse doesn't work, an "error node is embedded in the tree" so it knows not to try that parse again?#2018-05-3116:27aengelberg@sova this talk explains the internals pretty well https://www.youtube.com/watch?v=b2AUW6psVcE#2018-05-3116:28sova-soars-the-soraoh nice. thank you! i was very curious because it's very powerful and very fast and i don't remember things being so fast in compilers class 😄#2018-05-3116:29aengelbergglad to hear it's fast! Although what makes Instaparse really unique is that it works well with left-recursive and ambiguous grammars#2018-05-3116:29aengelbergsomething like
S = 'a' | S 'a'
usually doesn't work in normal parsers
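The left-recursive rule above can be tried directly at a REPL. A small sketch, assuming only `instaparse.core` is on the classpath:

```clojure
(require '[instaparse.core :as insta])

;; The rule refers to itself on the left, which many parser designs cannot handle.
(def left-recursive
  (insta/parser "S = 'a' | S 'a'"))

(left-recursive "aaa")
;; => [:S [:S [:S "a"] "a"] "a"]
```

The tree nests to the left, mirroring how the rule refers to itself.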
#2018-05-3116:30sova-soars-the-soraleftwards-building strings... I see that is cool#2018-05-3116:34sova-soars-the-soraI want to do some language processing stuff -- eventually reduce articles people write to synopses and relevant tags. i feel like that's very possible but i gotta think a bit more on the approach. like checking words against a dictionary to try and focus on nouns and verbs#2018-05-3116:55sova-soars-the-soraaha, now i know why the file is called .gll 😄#2018-05-3117:21sova-soars-the-sora"send me all your old magazine subscriptions before any new ones" is a good way to explain how it works briefly.#2018-08-0218:10mlimotteCan I get some help with a (hopefully) simple grammar? I haven’t done much with CFGs so I could be totally off base. I want to find variable expressions in a string. For example: hello, {{name}}. This is similar to the Mustache variety of interpolation, but I need to pre-parse it to do something slightly different. I can recognize the pattern above pretty easily. My problem is having it ignore single brackets. For example: hello, {{name}}. Please choose {Yes, No}. The last part is not a double bracket expression and should just be treated like the other uninteresting text. So, my grammar looks like this (I’ve tried a bunch of other variations, this is the closest I’ve come):
(def p 
  (insta/parser
    "<S> = (block | TXT)*
     block = <'{{'> TXT <'}}'>
     <TXT> = (OPEN | CLOSE | A | block)*
     <OPEN> = !'{' '{'
     <CLOSE> = !'}' '}'
     <A> = #'[^{}]*'"))
#2018-08-0218:12mlimotteA call (p "x{a}") yields:
=> Parse error at line 1, column 2:
x{a}
 ^
Expected one of:
"{{"
"}"
NOT "{"
#2018-08-0218:12mlimotteSeems like the x got picked up by <A>. I would have liked it to match !‘{’, so that the next char could match in <OPEN>#2018-08-0218:13aengelbergtry changing the OPEN and CLOSE rules to
<OPEN> = '{' !'{'
<CLOSE> = '}' !'}'
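The effect of putting the lookahead after the bracket can be seen in a trimmed-down grammar. This is a sketch under assumptions, not mlimotte's grammar: the rule names `text` and `name` are made up for illustration, and a block is only allowed to contain bracket-free text.

```clojure
(require '[instaparse.core :as insta])

(def template
  (insta/parser
    "S = (block | text)*
     block = <'{{'> name <'}}'>
     name = #'[^{}]+'
     (* a lone bracket counts as text only if the same bracket does not follow it *)
     <text> = #'[^{}]+' | '{' !'{' | '}' !'}'"))

(template "x{a}")
;; => [:S "x" "{" "a" "}"]

(template "hello, {{name}}. Please choose {Yes, No}.")
;; => [:S "hello, " [:block [:name "name"]] ". Please choose " "{" "Yes, No" "}" "."]

;; An unbalanced "{{y}" is still rejected by this sketch.
(insta/failure? (template "{{y}"))
;; => true
```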
#2018-08-0218:14mlimotte😄#2018-08-0218:14mlimotteThat seems to work.#2018-08-0218:15aengelbergThe problem in the original grammar was that the negative lookahead was conflicting with the token itself. It was basically saying "If there isn't an open bracket, please parse an open bracket"#2018-08-0218:15aengelbergWhereas what you really want is "Please parse an open bracket but only if there isn't another open bracket right after"#2018-08-0218:15mlimottehmm.. ok, i think that makes sense to me.#2018-08-0218:16mlimotteVery cool. Thanks for the quick help!#2018-08-0218:16aengelbergno problem#2018-08-0218:18mlimotteHere’s an edge case that still fails. But it’s a bit contrived, so if it’s not a trivial fix, I don’t need to worry about it. (p "{{y}")#2018-08-0218:19aengelbergdo you want that to parse as normal text?#2018-08-0218:19mlimotteyep#2018-08-0218:19mlimottenot a block#2018-08-0218:21aengelbergmaybe something like
<S> = TXT*
block = <'{{'> TXT <'}}'>
<TXT> = (block / A)*
<A> = #'[^{}]*' | '{' | '}'
#2018-08-0218:22aengelberghere I'm changing the A rule to match any text (including brackets and double brackets) but then using the ordered choice (`/`) to prefer parsing complete blocks when possible.#2018-08-0218:24mlimotteoh.. that’s great. I had tried an approach like that previously, but didn’t know how to prefer one parse over another … that / operator is new to me.#2018-08-0218:24aengelberg:+1:#2018-08-0218:25mlimottethanks for your help, again#2018-08-0218:25aengelbergnp#2018-08-2415:58aengelberg#2018-08-2415:58aengelberg@jeroenvandijk that grammar doesn't appear to be valid BNF; many of the string tokens are not properly quoted#2018-08-2416:00aengelbergAlso, instaparse has adopted the angle brackets <> to mean "hiding tags" (not an EBNF standard) but this AWS grammar uses them in all of the rule names, which might result in weird behavior#2018-08-2416:00aengelbergfor example
<condition_block> = "Condition" : { <condition_map> }
should be
condition_block = "Condition" ":" "{" condition_map "}"
#2018-08-2416:22hiredmanmy experience with instaparse, and other parsers for that matter, and external grammars, is pretty much no one provides complete grammars that are machine parseable.#2018-08-2416:24hiredmanit is incredibly frustrating to find out that, for example, the only grammar for the version 3 of the protobuf type description language available is incomplete and only published as fragments in <pre> blocks on the protobuf website#2018-08-2707:51jeroenvandijk@aengelberg @hiredman interesting. I was hoping it to have a parser one copy paste away 🙂#2018-08-2707:52jeroenvandijkWhen I remove the '<>', instaparse is complaining over the use of :, ... and it is missing condition_map. I think I have to do a proper investigation where this parser is being used to understand how it should work#2018-08-2707:52jeroenvandijkThanks for your feedback#2018-08-2715:07aengelbergno problem#2018-09-1111:55souenzzoThere is some repo with a colaborative collection of useful/example grammars?#2018-09-1917:05aengelberghmm not that I know of#2018-09-2019:22droneanyone know of an instaparse grammar for C? I’m checking out mcc (https://github.com/zmaril/mcc), but looks like it may be incomplete and bit-rotted#2018-09-2019:22aengelbergnot that I'm aware of#2018-09-2019:23dronethanks#2018-10-2521:03schmeehey folks! I’m trying to write my first parser for a very simple file format. I’ve made it work, but my solution uses negative lookahead. Is there any way to write a grammar that produces the same output without negative lookahead? here a REPL example:#2018-10-2521:04schmee
user=> (def s
   #_=>   "NAME=Thing 1
   #_=> ACTIVE=120201-171231
   #_=>
   #_=> NAME=Thing 2
   #_=> ACTIVE=120201-171231")
   #_=>
#'user/s

user=> (def grammar
   #_=>   (insta/parser
   #_=>     "top = (block <newline?>)+
   #_=>      block = line+ !line
   #_=>      line = key <'='> value <newline?>
   #_=>      key = #'[A-Z]+'
   #_=>      value = #'[^\n]*'
   #_=>      newline = '\n'"))
   #_=>
#'user/grammar

user=> (clojure.pprint/pprint (insta/parse grammar s))
[:top
 [:block
  [:line [:key "NAME"] [:value "Thing 1"]]
  [:line [:key "ACTIVE"] [:value "120201-171231"]]]
 [:block
  [:line [:key "NAME"] [:value "Thing 2"]]
  [:line [:key "ACTIVE"] [:value "120201-171231"]]]]
nil
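One possible way to get the same tree without negative lookahead is to treat newlines purely as separators: one newline between lines, two between blocks. A sketch, assuming blocks are always separated by exactly one blank line and the input has no trailing newline:

```clojure
(require '[instaparse.core :as insta])

(def s "NAME=Thing 1\nACTIVE=120201-171231\n\nNAME=Thing 2\nACTIVE=120201-171231")

(def blocks
  (insta/parser
    "top = block (<newline> <newline> block)*
     block = line (<newline> line)*
     line = key <'='> value
     key = #'[A-Z]+'
     value = #'[^\\n]*'
     newline = '\\n'"))

(blocks s)
;; => [:top
;;     [:block
;;      [:line [:key "NAME"] [:value "Thing 1"]]
;;      [:line [:key "ACTIVE"] [:value "120201-171231"]]]
;;     [:block
;;      [:line [:key "NAME"] [:value "Thing 2"]]
;;      [:line [:key "ACTIVE"] [:value "120201-171231"]]]]
```

Because a key must start with a capital letter, a newline followed by another newline can never continue the current block, so no lookahead is needed to end it.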
#2018-10-2521:08schmeeif there’s any other way to simplify it I’d love to hear about it 🙂#2018-11-0112:13socksyI don't have a repl to hand, but couldn't you use a newline as a character separator rather than a negative look ahead for another line? #2018-11-1914:20Vincent CantinHello. I am starting to use instaparse, and I am matching the end of a multi-line string using #'\\Z'. While it seems to work, is it the correct way to match it?#2018-11-1916:53aengelberg@vincent.cantin that seems like a legit approach to me; I'd have to know more about your broader use case to know whether there's a more elegant overall approach.#2018-11-1917:00aengelbergfor example, you could use the instaparse negative lookahead feature (`!`) to determine whether there are no more tokens to match#2018-11-2009:30Vincent Cantin@aengelberg The context is: I am parsing a markdown document and I need to detect a line separator as either something based on \n and \r, or either the end of the document.#2018-11-2009:31Vincent CantinThank you for the negative lookahead hint, I will try it.#2018-11-2017:11aengelberg@vincent.cantin you could maybe structure the parser as
S = line (separator line)*
separator = '\n' | '\r'
then you don't have to explicitly check whether it reaches the end of the file.
#2018-11-2206:41Vincent CantinThat would be doable, but it requires to adapt the full grammar for that.#2018-11-2111:22Vincent CantinI am currently reading the CommonMark specification for parsing markdown format. Do you know if anybody already wrote such a parser with instaparse?#2018-11-2111:29Vincent CantinI found this project but it is a simple version, not the full spec. https://github.com/chameco/Hitman/blob/master/src/hitman/core.clj#2018-11-2114:13Vincent CantinThe reason I ask is that I am implementing one using instaparse. I just started recently, it’s called hiccdown.#2018-11-2206:35Vincent CantinI found a strange behavior with #'\\Z', I wonder if it is a bug or if it is normal.
((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = #'\\Z'")
 "abc\ndef\n")

=> 
[:Paragraph
 [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
 [:NonBlankLine "d" "e" "f" [:EOL [:EOF ""]]]
 [:BlankLine [:EOL "\n"]]]
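A likely explanation for the result above, assuming the JVM regex engine: in `java.util.regex`, `\Z` matches at the end of input but also just before a final line terminator, while lowercase `\z` matches only at the very end. The difference can be checked without instaparse:

```clojure
(def input "abc\ndef\n")

;; \Z is satisfied right after "def", before the trailing newline.
(re-find #"def\Z" input)
;; => "def"

;; \z is not, because a newline still follows "def".
(re-find #"def\z" input)
;; => nil

;; \z only matches once the final newline has been consumed.
(re-find #"def\n\z" input)
;; => "def\n"
```

Whether swapping `\Z` for `\z` gives the desired tree in the grammar above has not been verified here.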
#2018-11-2206:37Vincent CantinEOF appears before "\n" in the parsed result.#2018-11-2206:46Vincent CantinThis other approach which uses the negative lookahead does put the "\n" in the right place in the result, but there is another problem: The BlankLine is missing in the result. That may be a bug of instaparse. I am using the version 1.4.9.
((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = !#'.'")
 "abc\ndef\n")
=>
[:Paragraph [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
            [:NonBlankLine "d" "e" "f" [:EOL "\n"]]]
#2018-11-2207:08Vincent CantinI am going to use this workaround for now: append “EOF” at the end of the input and parse it. It works very well 🙂
((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = 'EOF' #'\\Z'") ; works as well with !#'.'
 "abc\ndef\nEOF")
=>
[:Paragraph
 [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
 [:NonBlankLine "d" "e" "f" [:EOL "\n"]]
 [:BlankLine [:EOL [:EOF "EOF" ""]]]]
#2018-11-2215:19sova-soars-the-soralove instaparse#2019-02-0419:52mattlyis there a way to get line/column numbers associated with the tokens generated from instaparse?#2019-02-0419:53aengelbergYeah, check out instaparse.line-col#2019-02-0419:56mattlyyay!#2019-02-0419:56mattlythanks#2019-02-0422:17souenzzoit's just DATA#2019-02-0422:18aengelbergit's not code, it's DATA#2019-05-2201:43souenzzois it possible to write a grammar for clojure? including things like ^metadata and ;; comments#2019-05-2720:10aengelbergIt should be possible, yes#2019-05-2720:10aengelbergI'm not aware of any prior art#2019-08-1621:18aengelbergnot to my knowledge, unless you count ClojureScript#2019-08-1621:18aengelbergthe underlying engine is actually based on an existing algorithm called GLL, which was originally implemented in Racket https://github.com/epsil/gll#2019-08-1621:20aengelbergbut being able to create parsers based on arbitrary EBNF specifications (rather than a language-specific DSL of combinators) I think is unique to instaparse#2019-08-1621:20pepasyeah, I was so surprised when I started getting into parsers and no one seemed to take BNF as input#2019-08-1621:21pepasHuge kudos to you and your old man!#2019-08-1621:21aengelbergthanks!#2019-08-2303:43hiredmanis there a recommended way for dealing with C style comments like /* ... */ in instaparse parsers? I vaguely recall maybe trying to handle them with a custom whitespace rule, but it has been a long time#2019-10-1316:41skelterI have a combination of a parser and a particular data file that causes the parser to consume CPU and memory until out of memory exception.#2019-10-1421:47aengelbergthat's not good#2019-10-1602:17skelterI think I now have it isolated to its own project, if you’d like a captured specimen.#2019-10-1602:18aengelbergthat would be great#2019-10-1602:18aengelberggithub tickets also welcome#2019-10-1701:31skelterNot sure I want this particular specimen in the public pipeline. 
Maybe if I can reduce it to something more comfortable.#2019-10-1701:41skelterregarding the channel description line, what is a good example of trampolining a parser?#2019-10-1705:36hiredmanI suspect it is tongue in cheek, the internals of instaparse(and many gll parsers) use what you could call a trampoline to drive parsing#2019-10-1706:52aengelbergyeah, I think that came from a conversation with @gfredericks in which we discussed parsers generating parsers or some other unusual use case#2019-10-1822:39gfredericksSlander. I don't recall any such thing.#2019-10-2913:07Daniel HinesDoes there happen to exist a tool for taking an instaparse grammar and generating random strings with it?#2019-10-3005:41aengelberg@d4hines the idea has been tossed around a couple times, but the only implementation I'm aware of is my experiment https://github.com/aengelberg/instagenerate which isn't super useable in practice.#2019-10-3005:46aengelbergMy usage of core.logic was primarily motivated by a challenge to reverse-engineer output parse trees or fill in partial outputs. But if the main goal is to simply generate random inputs to a grammar (which most people really want), I think the implementation could be a lot cleaner, and leverage test.check.#2019-10-3005:47aengelbergThe hardest part I think would be coming up with a good solution to lookahead and negative lookahead, while guaranteeing terminable generation...#2019-10-3013:45Daniel HinesI'm a noob when it comes to this stuff. Do you need lookahead for an EBNF grammar?#2019-11-0115:29aengelbergI don’t think so, it’s just a nice feature that sometimes people take advantage of#2019-10-3012:49Daniel HinesThat's really cool!#2019-10-3013:03jeroenvandijkMaybe this project is relevant https://github.com/cs-au-dk/dk.brics.automaton ? 
We used it to generate data, also from regexps#2019-10-3014:27gfredericksWhen I wrote a generator for regexes I intentionally didn't try to support lookaheadbehinds#2019-10-3014:28gfredericksBut it seems like something you could support with the same caveats as such-that#2019-10-3014:28gfredericks(And have it fail the same way, rather than infinite loop)#2019-11-2611:52Ahmad Nazir RajaHi, This is related to greedy behavior of instaparse. I have the following grammar:
X := Y* Z*

Y := CHAR
Z := CHAR

CHAR := ('a' | 'b' | 'c')
With input aaa I get the output shown below. I expect greedy behavior and that Y should be matched instead of Z. Does anyone have an idea to why this happens or how to enforce greedy behavior?
#2019-12-0318:35aengelberg@U82LVQ5NX sorry for the late reply, but instaparse doesn’t guarantee greediness or non-greediness; in fact, if you call insta/parses you will get every version of the parse including ones where Y is parsing some or all of the chars.#2019-12-0318:38Ahmad Nazir RajaYes, I tried parses and I could see all versions. For some reason I thought there would be a way to prefer one version over the other. Anyway, thanks for the response.#2019-12-0318:38aengelbergYou can achieve greediness by using negative lookahead:
X := Y* !Y Z*
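The difference between the two versions of X shows up when counting parses with `insta/parses`. A sketch using the grammar from the question, with the parser names `ambiguous` and `y-first` made up here:

```clojure
(require '[instaparse.core :as insta])

(def ambiguous
  (insta/parser
    "X := Y* Z*
     Y := CHAR
     Z := CHAR
     CHAR := ('a' | 'b' | 'c')"))

(def y-first
  (insta/parser
    "X := Y* !Y Z*
     Y := CHAR
     Z := CHAR
     CHAR := ('a' | 'b' | 'c')"))

;; Without the lookahead, every split between Y's and Z's is a valid parse.
(> (count (insta/parses ambiguous "aaa")) 1)
;; => true

;; With !Y, Z may only start where Y can no longer match.
(insta/parses y-first "aaa")
;; => ([:X [:Y [:CHAR "a"]] [:Y [:CHAR "a"]] [:Y [:CHAR "a"]]])
```

Since Y and Z match exactly the same characters in this toy grammar, the lookahead leaves nothing for Z to consume; in a real grammar the two rules would differ.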
#2019-12-0318:38aengelbergthis ensures that it won’t start parsing Z until it can’t parse Y anymore.#2019-12-0318:33Daniel HinesHow do I read this?#2019-12-0318:33Daniel Hines#2019-12-0318:40aengelbergI think the instaparse equivalent of this would be:
remainder_sort_names := '' | ',' sort_name remainder_sorts
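The same "empty or comma-then-more" shape can be sketched as a small runnable grammar. The rule names `names`, `more` and `name` are hypothetical, chosen only for this illustration; `epsilon` is the keyword spelling of the empty match.

```clojure
(require '[instaparse.core :as insta])

(def name-list
  (insta/parser
    "names = name more
     (* either nothing is left, or a comma, one more name, and the remainder *)
     <more> = epsilon | <','> name more
     name = #'[a-z]+'"))

(name-list "a,b,c")
;; => [:names [:name "a"] [:name "b"] [:name "c"]]
```

Hiding `more` with angle brackets flattens the recursion, so the output reads as a plain list of names.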
#2019-12-0318:40Daniel HinesOh! Duh! Epsilon!#2019-12-0318:40aengelbergalthough instaparse also supports ε as an alias for ''#2019-12-0318:41Daniel HinesNoob fail. Thanks a bunch!#2020-01-0420:01sova-soars-the-soraHi! Why is it called a "Context-Free" Grammar?#2020-01-0420:01sova-soars-the-soraIsn't it nothing BUT context?#2020-01-0420:02sova-soars-the-sora😃#2020-01-0420:16gfrederickswikipedia says rules in a context-sensitive grammar take the form α_A_β → αγβ#2020-01-0502:37sova-soars-the-soraHmmm I seeeee. Thank you @gfredericks#2020-01-0620:55hiredmanI believe the terminology(and concepts) comes out of linguistics first (not cs) and was first used as a way to construct (or produce) all the strings in a language (a constructive definition of all the strings in a language). So context free comes from the fact that when you are using the grammar to construct strings you can apply the production rules anywhere they match without other restrictions#2020-01-0620:58hiredmanhttps://en.m.wikipedia.org/wiki/Generative_grammar#2020-01-2515:27Adrian Smithhttps://i.imgur.com/G1xdp1F.png In this example how do I make it so the identifiers rule matches anything until the next rule along is valid?#2020-01-2515:31Adrian Smithah I see it kind of explains this in https://github.com/Engelberg/instaparse#regular-expressions-a-word-of-warning#2020-01-2613:02Adrian Smithhow would you go about writing a grammer for SQL columns? 
where commas must appear between elements#2020-02-2109:19mmeixNew to Instaparse and wrapping my head around grammars: how would I write a grammar that can do nested tag pairs like in xml: "<p><span>text</span></p>" => [:p [:span text]] ?#2020-02-2700:37aengelbergit’s possible to parse XML hierarchies into Clojure data, however I don’t think you can enforce that the tags must be matching.#2020-02-2700:38aengelbergYou can enforce that manually with your own custom logic after the fact, just not as part of the parser.#2020-02-2717:03mmeixSo I would just trust, that tags are properly matched/paired/nested.#2020-02-2717:03mmeixand take each closing tag as the next needed#2020-02-2112:30manutter51caveat: I haven’t had my coffee yet, but the basic idea is that you say something like “a BLOCK element is a P or a DIV or a TABLE (etc), an INLINE element is a SPAN or a B or TEXT (etc),” and then say “a P element is the literal string ‘<P>’ or ‘<p>’ followed by zero or more INLINE elements, followed by the literal string ‘</P>’ or ‘</p>’.” And similarly with SPAN.#2020-02-2113:11mmeixah! thanks ... that should start it#2020-02-2113:13manutter51The other caveat is that Instaparse is incredibly fun to work with and may be addictive. 😉#2020-02-2114:32mmeixConfirmed! 😁#2020-02-2716:22sova-soars-the-sora@mmeix it looks to me at first glance that you could make some rules span = <span> val </span> paragraph = <p> val </p> val = paragraph* | span* | val [editS: added stars to p and span in val]#2020-02-2716:22sova-soars-the-sorathe last rule allows recursion, which can infinitely nest spans or p's#2020-02-2716:22sova-soars-the-soranotice how i defined the rigid components of the grammar on the right-hand side, with my variable, and how i use variable names only on the left-hand side.#2020-02-2716:27mmeixThat looks like a good recipe. 
Thanks!#2020-02-2716:29sova-soars-the-sora@mmeix http://instaparse-live.matt.is/#/-M16TrdGzPQ0FFLRyCyd/v1#2020-02-2716:30sova-soars-the-sorai had to make sure it works before setting you out#2020-02-2716:30sova-soars-the-soranotice how there can be multiple valid parses with the recursion now.#2020-02-2716:32sova-soars-the-sorayou'll probably need to be creative with the output to get rid of tags you don't need 😃 i forget exactly how we would delete unnecessary strings in the grammar itself, there's a way with rules i think, it might just be mathematical though lol#2020-02-2716:49mmeixGetting rid of tags is done by enclosing them with <…>#2020-02-2716:49mmeixThanks for the gist!#2020-02-2716:51mmeixNow I’m thinking, if it would be possible to get a general solution without enumerating all possible tags (span, p , …). It would need to somehow remember the tag name until its closing cousin arrives#2020-02-2716:53mmeixDidn’t know http://instaparse-live.matt.is ! Great tool!#2020-02-2817:00sova-soars-the-sora@mmeix i think you probably want to parse this into a tree and go over it tree-style if you need opening/closing tag harmony. using a CFG to do that might be possible but it's more for orientable sequences. in reality if you wanted to you could just run over that thing wit a regex and make opening tags [: and closing tags ], throwing away </span> and </p> in favor of ] i would definitely write some sample input, some sample output, and then see what tool is best for the job. instaparse is indeed powerful, i used it to create an EBNF grammar representation of Japanese. https://learn-japanese.org/2020/01/04/japanese-grammar-in-ebnf-notation/#2020-03-0503:53bmaddyI have a question about how to do something in Instaparse. Imagine a string like this:
3 1 John 0 2 Jane 2 1 3 3 Bob 0
The first number is the number of people. Then, for each person, an id, name, number of people they supervise, and list of ids for the people they supervise. So in that example string there are 3 people. Jane (id = 2) supervises John (id = 1) and Bob (id = 3). Can I use Instaparse to parse stuff like that? Specifically, I'm wondering how to read a number n and parse exactly n items after that.
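A count-prefixed format like this can also be read by hand, since each count simply says how many tokens to take next. A minimal sketch in plain Clojure on the JVM (no parsing library); `read-people` and the output map keys are names invented for this example, and well-formed input is assumed:

```clojure
(require '[clojure.string :as str])

(defn read-people
  "Reads '<n> (<id> <name> <k> <k supervised ids>)*' into a vector of maps."
  [s]
  (let [[n & tokens] (str/split (str/trim s) #"\s+")]
    (loop [remaining (Long/parseLong n)
           tokens    tokens
           people    []]
      (if (zero? remaining)
        people
        (let [[id person-name k & more] tokens
              ;; k tells us how many of the following tokens belong to this person
              [reports rest-tokens] (split-at (Long/parseLong k) more)]
          (recur (dec remaining)
                 rest-tokens
                 (conj people {:id         (Long/parseLong id)
                               :name       person-name
                               :supervises (mapv #(Long/parseLong %) reports)})))))))

(read-people "3 1 John 0 2 Jane 2 1 3 3 Bob 0")
;; => [{:id 1, :name "John", :supervises []}
;;     {:id 2, :name "Jane", :supervises [1 3]}
;;     {:id 3, :name "Bob", :supervises []}]
```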
#2020-03-0504:29aengelbergsadly Instaparse isn't well suited for situations where you parse a thing and then use that as an input to some later part of the same parser.#2020-03-0504:30bmaddySounds good. Thank you!#2020-03-0517:28zaneOut of curiosity, what would be better suited for situations like that?#2020-03-0521:43thomhttps://github.com/youngnh/parsatron might be a better fit#2020-03-0521:44thomif you have a look at the way let->> can be used in https://github.com/youngnh/parsatron/blob/master/doc/guide.markdown#2020-04-1112:13sova-soars-the-soradoes one have to do anything special to use instaparse on the clientside (cljs) ?#2020-04-1115:01gfredericksI used it in https://gfredericks.com/things/bespoke-primes I don't remember doing anything special but that doesn't mean very much#2020-04-1122:31sova-soars-the-sorasweetness. looks like it should work outta da box!#2020-04-1401:08zaneCan confirm, does work outta da box. :+1::skin-tone-2:#2020-04-1418:19zaneI'm curious how often folks wind up using records (vs other alternatives) when writing polymorphic transformation functions for their instaparse parse trees.#2020-04-1619:56zaneJudging from the README there doesn't appear to be a (public) function to get a data representation of an Instaparse grammar, but could someone confirm?
#2020-04-1623:24gfredericksI was wanting that recently but I can't remember if it was for a good reason#2020-04-1718:18zaneI'm wanting to generate documentation, but perhaps there's a better way to go about that.#2020-05-2718:05sova-soars-the-soraHi everyone, I'm interesting in dynamically defining grammar components. I have a big list of nouns, I'd like to incorporate them into my grammar without too much crazy. Is there some way I can add in a dynamic rule with something like nouns = coll?#2020-05-2718:05aengelbergYou probably want instaparse.combinators#2020-05-2718:06sova-soars-the-sorathanks!#2020-05-2718:06sova-soars-the-sorai shall take a gander#2020-05-2718:07aengelberghttps://github.com/engelberg/instaparse#combinators#2020-05-2718:07sova-soars-the-soraWould it be ok to do something like (str/join "|" nouns-seq) in the grammar def#2020-05-2718:08aengelbergthat might be more error-prone and hard to read, but it wouldn’t be more or less efficient than the combinator version#2020-05-2718:09sova-soars-the-soraOkay cool#2020-05-2718:09aengelbergHow many nouns are we talking? I’d be wary of putting too many into a parser rule#2020-05-2718:10aengelbergBecause to parse text that satisfies the rule, it ultimately has to iterate through every option at every point in your text that could potentially be a noun#2020-05-2718:11aengelbergAnd if you pass a string to the parser that doesn’t successfully parse, the “failure” message will be really long#2020-05-2718:12sova-soars-the-soraI'm using about 5-20 nouns#2020-05-2718:12aengelbergah ok, that’s definitely manageable#2020-05-2718:13sova-soars-the-soraI'm trying to just do string join on the nouns list I have, but they lose their quotes#2020-05-2718:13aengelbergmaybe (apply str/join "|" (map #(str "'" % "'") nouns-seq))#2020-05-2718:14sova-soars-the-soraah beautiful#2020-05-2718:14sova-soars-the-sorathank you very much#2020-05-2718:19sova-soars-the-soraalmost perfect, it gives me a ('list' 'of' 'nouns') .. 
not sure how to make it digestible for the parser#2020-05-2718:19aengelbergI think that’s where the apply str/join comes in#2020-05-2718:20aengelbergoops, I was wrong, you don’t need apply#2020-05-2718:20aengelbergjust (str/join "|" …)#2020-05-2718:22sova-soars-the-soraExcellent! Thank you.#2020-05-2718:22sova-soars-the-soraI'm looking forward to showing you guys what I've come up with, once it's done! 😃#2020-05-2718:22aengelbergcan’t wait to see!#2020-05-2718:23sova-soars-the-sorait's a great wonder to have the power of magic at your fingertips! 😄#2020-05-2719:02sova-soars-the-soraI keep ending up with a vector although I would like to use a map... I don't know if it's clojurescript#2020-05-2719:02aengelbergyou mean for the input or the output?#2020-05-2719:03sova-soars-the-sorathe output of insta/parse#2020-05-2719:03aengelbergyou can set :output-format :enlive to get a map out of the parser#2020-05-2719:04sova-soars-the-sorawizardry#2020-05-2719:08sova-soars-the-soraHmm, not bad, but it has :tag and :content, I could do with simply swapping [] with {}#2020-05-2719:11sova-soars-the-soraThanks for all the help, I'll scratch around#2020-05-2723:15sova-soars-the-soraCan instaparse flatten a recursive structure with a rule?#2020-05-2816:57sova-soars-the-soraHi. I finished what I was working on. Phase one, anyway.#2020-05-2816:59sova-soars-the-soraA few months ago we created a tool that lets people input Japanese by clicking on boxes. Nouns, verbs, particles (grammar words in Japanese that mark grammatical role of a term). It's a useful teaching and learning tool. Students can create a native-Japanese composition with a tool that respects the basic grammar rules. I thought it might be possible to generate an English translation in real-time of the Japanese a student was generating. Turns out, it is! Turns out, Instaparse is super powerful and I was able to use some grammar and EBNF insights from January to work it into a real-time translation tool. 
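Going back to the dynamic noun rule discussed earlier in this thread, both routes mentioned there can be sketched side by side. The noun list and the surrounding `S` rule are made up for illustration, and the string version assumes no noun contains a single quote.

```clojure
(require '[instaparse.core :as insta]
         '[instaparse.combinators :as c]
         '[clojure.string :as str])

(def nouns ["cat" "dog" "fish"])

;; Option 1: splice quoted alternatives into the grammar string.
(def sentence-from-string
  (insta/parser
    (str "S = noun (<' '> noun)*\n"
         "noun = " (str/join " | " (map #(str "'" % "'") nouns)))))

;; Option 2: build the noun rule with combinators and merge it in.
(def sentence-from-combinators
  (insta/parser
    (merge (c/ebnf "S = noun (<' '> noun)*")
           {:noun (apply c/alt (map c/string nouns))})
    :start :S))

(sentence-from-string "cat dog")
;; => [:S [:noun "cat"] [:noun "dog"]]

(sentence-from-combinators "cat dog")
;; => [:S [:noun "cat"] [:noun "dog"]]
```

The combinator route avoids quoting issues entirely, at the cost of needing an explicit `:start` rule.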
Here's a brief gif, you can see it in Action.#2020-05-2817:01sova-soars-the-soraAnd here are some still shots to help show what's happening: A generated Japanese phrase is parsed, smushed, and sequenced into Plain 'Ainglish. This composition was made by a student! =)#2020-05-2817:03sova-soars-the-soraThanks a lot for Instaparse! 😄 You can do magical things indistinguishable from magic.#2020-07-1522:29mrchanceHi! If instaparse parses a string in Clojure but fails to do the same in cljs with an identical grammar, what's a good place to start debugging?#2020-07-1522:30aengelberginstaparse defers to the host language when it comes to regex syntax, so if you have a #"…" in your grammar that might be a good place to start#2020-07-1522:31mrchanceAha! I do, will check that, thanks#2020-07-1522:42mrchanceGot it, it was Javas [^]+] vs. JS' [^\]+] , thanks again 🙂 Writing the parser was a breeze btw!#2020-07-1522:42aengelbergnice! cool to hear that, thx#2020-08-1402:43scarrucciuHello all, is there a way to match on a value, but still use that value for a downstream rule? For example have a string that looks like "ABC*123~EFG*456~HIJ*789" and I would want to nest down a level when I see ABC and HIJ, but I also want to parse the ABC and HIJ are part of the regular segment structure that is using the ~ as the delim#2020-08-1402:44aengelbergWould the lookahead feature help?#2020-08-1402:45scarrucciuwas thinking that, would that allow me to essentially parse prospectively without actually pulling the value? Will try it now#2020-08-1402:46scarrucciuthat worked! and was way to simple. Thanks for the quick reply#2020-10-0113:13jeremysHi, I was wondering, is there a way to force instaparse to throw an error instead of backtracking and exploring another parse solution ?#2020-10-0116:23hiredmanIf your grammar can be expressed without alternatives then you don't need a parsing library#2020-10-0213:37jeremys@hiredman Hi, I don't know what to make of your answer. 
I could also say that even with a grammar that offer alternatives I could still make the choice not to use a parsing library and write a reader by hand. I am using instaparse because it makes it much easier to build a parser with it than without it. It also makes it easier to evolve the grammar. Maybe the answer to my question is that my grammar can be expressed another way instead of wanting to tell the parser not to backtrack. Still if the functionality existed I'd like to know about it! Cheers.#2020-10-0213:41manutter51What was your reason for wanting to error out instead of backtracking? Are you trying to raise an alert about invalid expressions in what you’re parsing, or are you trying to debug your grammar?#2020-10-0216:16jeremys@manutter51 hi! I want to raise an alert about invalid code. I am working on something like https://docs.racket-lang.org/pollen/ that allows to write code in the middle of text. My problem arises in particular situations. For instance, the entry rule of my grammar looks like this:
doc = (plain-text | embedded)*
If I write a pollen expression like this:
plain-text ◊str["some string"] plain-text
it is parsed as:
[:doc 
 "plain-text " 
 [:tag 
  [:tag-name "str"] 
  [:tag-clj-arg "[" " " "\"aaa\"" "]"]] 
 " plain-text"]
Now if i make a mistake balancing the quotes:
plain-text ◊str["some string""] plain-text
the way my grammar works I get:
[:doc 
 "plain-text " 
 [:tag 
  [:tag-name "str"]] 
 "[ \"some string\"\"]  plain-text"]
From the point of view of the parser there is no error here. The ["some string""] expression, which serves as arguments to the str function, couldn't be parsed as correct clojure code. However the parser can fall back to the plain-text grammatical rule and did just that. In this case I'd rather it didn't.
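One way to get a failure instead of a silent fallback is to say, with negative lookahead, that a tag without arguments must not be followed by an opening bracket. The grammar below is a heavily simplified, hypothetical stand-in for the real one (no Clojure forms, no quotes), meant only to show the shape of the idea:

```clojure
(require '[instaparse.core :as insta])

;; Lenient: arguments are optional, so malformed ones fall through to text.
(def lenient
  (insta/parser
    "doc = (text | tag)*
     tag = <'◊'> name args?
     name = #'[a-z]+'
     args = <'['> #'[^\\[\\]]*' <']'>
     text = #'[^◊]+'"))

;; Strict: either well-formed args, or no '[' at all right after the name.
(def strict
  (insta/parser
    "doc = (text | tag)*
     tag = <'◊'> name (args | !'[')
     name = #'[a-z]+'
     args = <'['> #'[^\\[\\]]*' <']'>
     text = #'[^◊]+'"))

(lenient "a ◊str[x b")
;; => [:doc [:text "a "] [:tag [:name "str"]] [:text "[x b"]]

(insta/failure? (strict "a ◊str[x b"))
;; => true

(strict "a ◊str[x] b")
;; => [:doc [:text "a "] [:tag [:name "str"] [:args "x"]] [:text " b"]]
```

The trade-off is that plain text can then never begin with `[` directly after a tag name, which may or may not be acceptable for the real language.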
#2020-10-0216:26manutter51Perhaps you could define plain-text so that it’s not allowed to contain an unescaped character?#2020-10-0216:41jeremysIt is actually 🙂 That's how the grammar recognizes that there is a "tag-fn" there (in pollen's jargon) or embedded code in general. And so we rightly get the [:tag [:tag-name "str"]] part. What happens is that the arguments to the function are optional. Thus if the text that follows the function's name is malformed args, the parser can fall back to plain text. It may be that the parser can't be made to throw in that case or that I can't gerrymander my grammar into doing what I want. It would be cool if I could though.#2020-11-1613:31mishagreetings! is there a way to specify "greedy" matches in the grammar (instead of the tree transforming) other than using inline regex-es? (insta/parse (insta/parser "S = 'a'+") "aaaa") to get => [:S "aaaa"] instead of: => [:S "a" "a" "a" "a"] my actual use case is: I have a bunch of rules wrapped in <> (for "documentation"), so the tags will not show up in output tree, so I'd like to have a single match string in the output (like "aaaaa"). I'd like to avoid tree transforming, because grammar is "up for extension" for someone else, and making sure they update transformers as well when they add a line to the grammar is an extra point of potential failure#2020-11-1613:39mishaand inline regexes do not compose well at all:
<phrase> =    #'\w+(\s+\w+)*'
text     = #'\s*\w+(\s+\w+)*\s*'
instead of
<space>  = #'\s+'
<word>   = #'\w+'
<phrase> = word (space word)*
text     = space? phrase space?
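One possible way to get a single string without writing a transformer per rule is a generic post-processing pass that merges adjacent string children in every node. This is only a sketch, not an instaparse feature: `merge-adjacent-strings` and `as-parser` are hypothetical names, and it assumes the default hiccup output format.

```clojure
(require '[instaparse.core :as insta]
         '[clojure.walk :as walk])

(def as-parser (insta/parser "S = 'a'+"))

(defn merge-adjacent-strings
  "Hypothetical helper: joins each run of neighbouring strings inside
  every vector node of a hiccup-style parse tree."
  [tree]
  (walk/postwalk
    (fn [node]
      (if (vector? node)
        (into []
              (mapcat (fn [run]
                        (if (string? (first run))
                          [(apply str run)] ; collapse a run of strings
                          run)))            ; keep tags and subtrees as-is
              (partition-by string? node))
        node))
    tree))

(merge-adjacent-strings (as-parser "aaaa"))
;; => [:S "aaaa"]
```

Because the pass does not mention any tag by name, extending the grammar should not require touching it.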
#2020-12-0804:31ZaymonHello all. I’m starting to learn parsing and EBNF and I am struggling to remove the ambiguity from my parser. I have constructed a simple example to demonstrate the problem I am having. The following parser tags text marked as emphasis, like **emphasis**:
(def remove-ambiguity
  (insta/parser
   "S = (em / char)+ | epsilon
    em = <'*' '*'> char* <'*' '*'>
    <char> = #'.'"))
Although with an input such as **em** **em** there are many possible parse results:
([:S [:em "e" "m" "*" "*" " " "*" "*" "e" "m"]]
 [:S "*" "*" "e" "m" "*" "*" " " "*" "*" "e" "m" "*" "*"]
 [:S [:em "e" "m" "*" "*" " "] "e" "m" "*" "*"]
 [:S "*" "*" "e" "m" "*" "*" " " [:em "e" "m"]]
 [:S "*" "*" "e" "m" [:em " " "*" "*" "e" "m"]]
 [:S "*" "*" "e" "m" [:em " "] "e" "m" "*" "*"]
 [:S [:em "e" "m"] " " "*" "*" "e" "m" "*" "*"]
 [:S [:em "e" "m"] " " [:em "e" "m"]]) ; <-- This is the one I want

;; This makes sense since there are a few ways you can match up the asterisks to match the rule. However I only ever want to allow results like this: `[:S [:em "e" "m"] " " [:em "e" "m"]]`
It’s almost like I want it to greedily take the first match possible and then ignore all others. But I have no idea how to express this. Any help would be greatly appreciated 😄.
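One way to remove this ambiguity is negative lookahead (`!` in instaparse), so that `char` can never begin at a `**` delimiter. The following is a minimal sketch under that assumption; `em-parser` is a hypothetical name and the rest of the grammar is taken from the example above.

```clojure
(require '[instaparse.core :as insta])

(def em-parser
  (insta/parser
    "S = (em / char)+ | epsilon
     em = <'**'> char* <'**'>
     <char> = !'**' #'.'"))  ; a char may not start at a '**' delimiter

(insta/parses em-parser "**em** **em**")
;; => ([:S [:em "e" "m"] " " [:em "e" "m"]])
```

With this change a `**` can only open or close an `em`, so an unbalanced `**` makes the parse fail rather than fall back to plain characters.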
#2020-12-0817:48hiredmanyour grammar says '**' is both the start of an em sequence, and two chars, and that is the ambiguity#2020-12-0823:16ZaymonIs there a way I can force the correct behavior? I always want it to be the first found pair #2020-12-0823:53ZaymonHow do I specify that a char is any character or sequence of characters except **?#2020-12-0900:27ZaymonLooks like I can use negative lookahead in the definition of char#2021-02-0205:57Vincent CantinHello#2021-02-0205:58Vincent CantinWas there any attempt to use Instaparse to propose auto-completion at the end of a string which is only matching the beginning side of a grammar? Does Instaparse have any support for this kind of use case?#2021-02-0206:05Vincent CantinFor example, assuming that we have the grammar:
my-grammar = 'he' | 'helsinky' | 'hello'
and we have the string "he", that matches the grammar already. It would be nice if Instaparse could say that the next characters for a grammar match could also be "lsinky" or "llo" .
#2021-02-0206:07Vincent CantinI am using Instaparse heavily in the project https://github.com/green-coder/girouette#2021-02-0206:25Vincent CantinMaybe a new "suggestion" mode could be added in Instaparse, where the resulting parse tree could contain some special nodes where we could query some suggestions of letter insertion.#2021-02-1022:14mathpunkI'm writing my first grammar. It's going pretty well but, I've captured all the stuff I care about and now I have some extra junk I don't care about. How can I express something like, "S = word data junk", where I'm interested in part of the pattern and then afterward there is maybe some "whatever"?#2021-02-1022:17aengelberg@mathpunk you might want the “hide” syntax (`<>`)?#2021-02-1022:17aengelberghttps://github.com/engelberg/instaparse#hiding-content#2021-02-1022:19aengelbergyou could capture all the other junk with <#'[\s\S]*'>#2021-02-1022:20mathpunkexcellent, thank you!#2021-02-1220:32mathpunkI love working with this library#2021-02-1220:34mathpunkI got pretty far with my goal, but my grammar doesn't handle all my cases. I'd like to see how many so I'm trying to do a (try... (catch to see the % of fails on my data. Are exceptions thrown by the parser different than typical java exceptions?#2021-02-1220:35mathpunkI might just be holding try/catch wrong#2021-02-1620:22aengelbergThe parser doesn’t actually throw exceptions, it instead returns a custom Failure object that prints out a special way#2021-02-1700:38mathpunkAh, thank you.#2021-02-1701:20Vincent Cantin@U0E9KE222 As I recall, there is a function in the API to test if a result is a failure.#2021-03-2519:32mathpunkDo you folks have tips on getting started with interpreting parsed output? The things that come to mind are, clojure.walk, spectre, core.match.... 
I haven't worked with hiccup so I don't know what patterns people to use to work with that#2021-03-2605:35Vincent CantinIt highly depends on what you want to do with the data, and if your data is contextual or not (i.e. if a label has a special meaning when under some special parents)#2021-03-2605:36Vincent CantinA recursive DIY parsing function may be the simplest to implement and maintain.#2021-04-1018:56Vincent CantinI needed to have an expression which would always fail to match anything in Instaparse. I used "S = &'nop' 'no-way'" . Was there another simpler way to do it?#2021-04-1020:55aengelbergmaybe ! eps?#2021-04-1020:57aengelberg#'$^' might be more performant#2021-04-1820:41sova-soars-the-soraHi. I was wondering, is there a way to do fuzzy matching or fuzzy parsing? And how I mean is that I want to draw rectangles around Japanese text, parsing it effectively, but I would like to be able to parse even if some terms are unknown or undefined. Is there a way to do that in instaparse, a way to parse with fuzziness so not every single term in the parse input is defined by the rules?#2021-04-1820:44aengelbergSadly no. Instaparse was designed to turn strings into data using a well-defined language, so partial matching and fuzzy matching aren't well-supported.#2021-04-1820:46sova-soars-the-soraNo problem. I'm wondering how I can do this ^.^ Maybe I can pre-process everything and do a sort of mini dictionary prep step.#2021-04-1820:51sova-soars-the-soraSo if I do a dictionary scan of the input text, I think it is smartest to start with longest strings first#2021-04-1820:52sova-soars-the-soramatch all the 7-letter words, 6-letter words, 5-letter words, and so on.#2021-04-1820:54sova-soars-the-soramaybe just cut it into slices? "Shewenttothemuseum" -> "Shewent" (no results) "emuseum" no results... but then "Shewent" (also no results) .... "museum" result found. mark it. keep it moving. 
Kinda like a sieve of erasthenes but on text#2021-04-1820:56sova-soars-the-soraMaking m stringlets of size n from a string sounds like linear in data, so we could probably do pretty large datasets but maybe not a whole novel conveniently this way. Hmm, I suppose it is easy if we split on sentence ends (periods 。) and then do the sieve approach on each sentence#2021-04-1820:57sova-soars-the-soraThis might actually work pretty darn well!#2021-04-1820:58sova-soars-the-soraPreprocess the input with a sieve + dictionary lookup, figure out the nouns and verbs throw them into the rules then try and run the parse on it. i'll still need some core rules for grammar but the idea is to have a lot of them hard-coded#2021-04-1820:59aengelbergA regex could be a good fit to quickly scan for valid dictionary words. Some regex libraries let you compile a large union of words ( #"word1|word2|word3|... ) into a finite state machine that can do a linear-time scan of text.#2021-04-1821:02sova-soars-the-soraohhh cool. that's a really neat idea. i think i might need to use web lookups but if i keep tabs on those results they could go into such a regex.#2021-04-2306:30SigveIf i understand correctly, :auto-whitespace :standard inserts <whitespace>? rules. So that for the parser
(def words-and-numbers-auto-whitespace
  (insta/parser
    "sentence = token+
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'"

    :auto-whitespace :standard))
(words-and-numbers-auto-whitespace "abc 123 45 de") and (words-and-numbers-auto-whitespace "abc123 45de") produce the same result. Is there any method of instead inserting non-optional whitespace rules, which in this case would disallow "abc123 45de"?
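If mandatory separators are needed, one option is to drop `:auto-whitespace` and spell the whitespace out in the grammar. This is a sketch of that idea, not a built-in instaparse mode; `words-and-numbers-strict` and the `ws` rule are hypothetical names, and leading or trailing whitespace is deliberately not handled.

```clojure
(require '[instaparse.core :as insta])

(def words-and-numbers-strict
  (insta/parser
    "sentence = token (<ws> token)*
     <token> = word | number
     word = #'[a-zA-Z]+'
     number = #'[0-9]+'
     ws = #'\\s+'"))  ; at least one whitespace character between tokens

(words-and-numbers-strict "abc 123 45 de")
;; => [:sentence [:word "abc"] [:number "123"] [:number "45"] [:word "de"]]

(insta/failure? (words-and-numbers-strict "abc123 45de"))
;; => true
```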
#2021-04-2319:23sova-soars-the-sora@sigve.nordgaard the desired result is only letters and only numbers together in sequence? You could have tokens-numbers and tokens-letters and a sentence can be tokens-numbers+ | tokens-letters+#2021-04-2319:24sova-soars-the-soraIf I have understood the question. That would only allow contiguous digits or contiguous characters, not a mix#2021-04-2606:44Sigve@sova thanks for the answer, but i only used the grammar above as an example (taken from https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace). My problem is that i need the tokens of the grammar to be whitespace separated, so that keywords of the grammar cannot be "merged" with the following tokens. For example: replace word in the grammar i pasted with some keyword like 'power' , which then should be followed by some number. Then i need the string 'power 100' to be valid, but not the string power100. The problem is that the :auto-whitespace :standard feature allows both.#2021-06-0817:48EdDoes anyone know if there's an easy way to "unparse" something that parsed with intstaparse? I've written a grammar to parse something so I can transform it, and now need to spit it back out again as a string. I can write something that will recursively do walk the tree and do that, but I wondered if there was something I was missing that would do it for me 😉#2021-06-0819:47sova-soars-the-soraHmm, and you don't have access to the original string?#2021-06-0820:27EdI do, but I've changed the content. That was why I parsed it in the first place ;) ... I was just hoping I was missing an instaprint that went the other way. If nobody knows of anything like that, then I'll write something custom. It's not too complicated a grammar.#2021-06-0911:40EdSo I fiddled with my grammar a bit, so it captured some more strings than I needed to actually do the transformations I wanted to do using regexes, and the recursive printer ended up being
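(import '(java.io StringWriter Writer))

;; Recursively write every string leaf of a hiccup-style parse tree,
;; skipping the tag keyword at the head of each vector.
(defn write-tag [^Writer writer template]
  (if (vector? template)
    (doseq [s (next template)]
      (write-tag writer s))
    (.write writer ^String template))
  writer)

;; Render the whole tree back into a single string.
(defn write-template [template]
  (.toString (write-tag (StringWriter.) template)))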
Simples ... should have just tried to write it in the first place 😉
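The same round trip can be sketched without a writer when the tree still contains every piece of the input. The example below assumes the `:unhide :all` parse option is used so that hidden content is kept; `greeting`, `unparse` and the toy grammar are hypothetical, for illustration only.

```clojure
(require '[instaparse.core :as insta])

(def greeting
  (insta/parser
    "S = word (<ws> word)*
     word = #'[a-z]+'
     ws = #'\\s+'"))

(defn unparse
  "Concatenates all string leaves of a hiccup-style tree, in order."
  [tree]
  (apply str (filter string? (flatten tree))))

;; With hidden content revealed, the original text is recovered.
(unparse (insta/parse greeting "hello   world" :unhide :all))
;; => "hello   world"

;; Without :unhide the hidden whitespace is lost.
(unparse (greeting "hello   world"))
;; => "helloworld"
```

Once the tree has been edited, the same `unparse` call produces the modified text, as long as the edits keep strings as leaves.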
#2021-06-1000:05sova-soars-the-sora@l0st3d well done 😃#2021-06-1417:24markaddlemanDoes instaparse support round tripping? I have a string-based language that I want to manipulate. I'd like to parse it, manipulate the parse tree using meander and then generate a new string from the parse tree#2021-06-1418:12sova-soars-the-soraI think it's possible. I don't know what meander is...#2021-06-1418:13aengelbergthere isn’t a way to “unparse” though that’s been requested a few times#2021-06-1418:13aengelbergyou could write your own “unparser” that leverages insta/transform and implements a different string-reconstruction logic for each tag in your grammar#2021-06-1418:14aengelbergassuming you don’t use the “hide” rule (`<>`), those implementations would basically just be str#2021-06-1418:15markaddlemanthanks. I may be signing myself up for a world of hurt but my current approach is to use clojure spec to generate the parse tree and then unform to "unparse" it#2021-06-1418:17aengelberginstaparse will almost certainly be a better fit than clojure spec to do the initial parse, though I see why you’d want to use a library that gives you an “undo” function#2021-06-1418:18markaddlemanyeah, i feel like this is a no-win situation#2021-06-1418:18aengelbergI don’t think writing your own un-parser would be too challenging#2021-06-1501:22markaddlemanYou were right. Using parse options :unhide :all , I can easily use meander to unparse the parse tree#2021-06-1513:53markaddlemanThank you!#2021-06-1418:19aengelbergsince it’s mostly putting strings back together from a recursive tree#2021-06-1418:20markaddlemanhm. thanks. I'll give it a try#2021-06-2908:56borkdudeHey, someone here? :)#2021-06-2908:57borkdudeI was trying to make this ebnf grammar work with instaparse: https://github.com/cbeust/kash/blob/master/src/main/resources/bash.ebnf But so far it didn't work out#2021-06-2908:58borkdudeHere's what I got: https://gist.github.com/borkdude/98c5d9e2bf598b227e8e643e4271e61e
user=> (def parser (insta/parser "/Users/borkdude/Downloads/bash.ebnf"))
#'user/parser
user=> (parser "foo")
Parse error at line 1, column 1:
foo
^
Expected:
#"[0-9]"
#2021-06-2909:27SigveHi, when you do not specify a starting rule for the grammar, instaparse selects the top rule as the starting point. In your case that is the number rule. https://github.com/engelberg/instaparse#parsing-from-another-start-rule This should work:
(parser "foo" :start :command)
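A small self-contained illustration of the default start rule and of `:start`, using a hypothetical two-rule grammar rather than the bash one. It also shows that a failed parse is returned as a failure value (checked with `insta/failure?`) rather than thrown as an exception.

```clojure
(require '[instaparse.core :as insta])

(def two-rules
  (insta/parser
    "number = #'[0-9]+'
     word = #'[a-z]+'"))

;; The first rule (number) is the default start rule, so this fails.
(insta/failure? (two-rules "foo"))
;; => true

;; Starting from another rule at parse time:
(two-rules "foo" :start :word)
;; => [:word "foo"]
```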
#2021-06-2909:30Sigve(NB: surrounding a rule with angle brackets makes it hidden; since all commands are hidden you will probably only get an empty list on a successful parse)#2021-06-2910:03borkdudeaaah#2021-06-2910:05borkdude
user=> (parser "foo" :start :word)
("f" "o" "o")
#2021-06-2910:06borkdudebtw, it wasn't my choice to use angle brackets, I just copied that from the original ebnf#2021-06-2910:09borkdudeoh I see, hidden means you don't get it back in the structure, but directly?#2021-06-2910:13borkdudewhy does this succeed if I have set :partial to false:
user=> (parser "foo" :start :word :partial false)
[:word]
#2021-06-2910:13SigveThat was my hunch, which is why I thought a heads-up was in order :) Yes, there is a good example of hiding here: https://github.com/engelberg/instaparse#hiding-content As mentioned, it is usually used for hiding whitespace and other tokens you do not care about in the final output, but if you hide the top rule, everything disappears#2021-06-2910:14borkdudeah I see, it was because of the hiding again:
user=> (parser "foo" :start :word :partial false)
[:word [:word [:word [:letter "f"]] [:letter "o"]] [:letter "o"]]
#2021-06-2910:16Sigve:partial allows a parse to succeed without consuming the entire input; embedding the failure node in the tree at the point where parsing failed is what :total does#2021-06-2910:16Sigveah:)#2021-06-2910:19borkdudeIt seems the original ebnf works a bit differently than instaparse. e.g.:
<for_command> ::=  'for' <word> <newline_list> 'do' <compound_list> 'done'
            |  'for' <word> <newline_list> '{' <compound_list> '}'
            |  'for' <word> ';' <newline_list> 'do' <compound_list> 'done'
            |  'for' <word> ';' <newline_list> '{' <compound_list> '}'
            |  'for' <word> <newline_list> 'in' <word_list> <list_terminator>
                   <newline_list> 'do' <compound_list> 'done'
            |  'for' <word> <newline_list> 'in' <word_list> <list_terminator>
                   <newline_list> '{' <compound_list> '}'
#2021-06-2910:19borkdudeseems to assume that the tokens are automatically separated by whitespace#2021-06-2910:20borkdudeif I have to rewrite the grammar anyway I'm more inclined to hand-roll my own parser#2021-06-3004:34sova-soars-the-sora😮#2021-06-3004:40sova-soars-the-sora#2021-06-2910:23SigveI think that for most yacc/bison parsers rules are separated by whitespace by default yes, instaparse supports adding this by using the auto-whitespace feature which has worked well for me https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace#2021-06-2910:24SigveI dont know what you are using this parser for, but in my experience using a proper grammar-based parser is more maintainable and flexible in the long run. Of course for small use cases it can be a lot to get into and learn#2021-06-2910:26borkdudethis parser should parse bash syntax#2021-06-2910:26borkdudebut bash is not such a big language#2021-06-2910:27borkdudeThis is the original: https://github.com/cbeust/kash/blob/master/src/main/resources/bash.ebnf#2021-06-2910:27borkdudeI just have some problems getting this to work with instaparse so far#2021-06-2910:27borkdudeit's not very important, just a fun project#2021-06-2910:32SigveThen i guess comes down to which approach you find most fun:) I think instaparse is quite amazing once you grok it, but again i understand i can be a hassle go get into. On the other side, hand written parsers can also be painful to get correct#2021-06-2919:50aengelbergThere isn’t really a single EBNF syntax specification or RFC, so every “EBNF grammar” you’ll find in the wild will have a slightly varied flavor of the syntax. Sometimes because a certain parser library chose a unique metasyntax, or sometimes because the grammar is meant to serve as documentation rather than compiled and executed.#2021-06-2919:54aengelbergInstaparse attempts to support most of the different flavors, which is why you can use either x? 
or [x] syntax for example#2021-06-2919:55aengelbergBut sometimes a grammar or a different parser library will make a particularly unusual syntax choice, like using angle brackets in rule names#2021-06-2919:56aengelbergOr a grammar will make an implicit logical assumption that Instaparse has no way to act upon, like whitespace being parsed between tokens#2021-06-2919:58aengelbergThe angle brackets are particularly unfortunate since Instaparse chose to use angle brackets for an instaparse-specific feature (hiding data from the output parse tree)#2021-06-2920:03aengelbergABNF, on the other hand, seems to be a much more regulated metasyntax, so copy and pasting ABNF grammars into instaparse (using :input-format :abnf) tends to be safer#2021-07-1716:00Rob HaisfieldHow does Instaparse compare to Megaparsack? https://twitter.com/lexi_lambda/status/1411768876753358851?s=21#2021-07-1716:14sova-soars-the-soraThat looks neat and it looks like it supports things that are not exactly CFG at least not the ones I am aware of… Instaparse is implemented under the hood as an LRR parser (correct me if I’m wrong) making it very super duper fast and powaful. Interesting find tho#2021-07-1909:15aengelbergInstaparse's engine is based on the GLL algorithm, if that helps#2021-07-1917:24sova-soars-the-soraGLL woo! Thanks for the keyword#2021-08-1020:03sova-soars-the-soraInstaparse live is so sweet. http://instaparse-live.matt.is/#2022-02-0218:39ghaskinsHi All, I’m trying to understand a failure related to trying to exclude “[” via regex negation#2022-02-0218:39ghaskinsthis grammar snippet
<unquoted-literal> ::= #"[^()\[\s]+"
#2022-02-0218:40ghaskinstriggers insta/failure? to return true but there is no info provided#2022-02-0218:40ghaskinsthis works#2022-02-0218:40ghaskins
<unquoted-literal> ::= #"[^()\s]+"
#2022-02-0218:40ghaskinsand it seems to be fine from a clojure/jvm regex perspective#2022-02-0218:40ghaskins
(re-find #"[^\[]+" "[foo]")
=> "foo]"
(re-find #"[^()\[]+" "[foo]")
=> "foo]"
(re-find #"[^()\[\s]+" "[ foo]")
=> "foo]"
#2022-02-0218:41ghaskins(I'm totally open to other/better ways to parse this outside of regex, too)#2022-02-0218:42ghaskinsany help appreciated#2022-02-0218:47ghaskinsnm, i figured it out#2022-05-2421:54winsomeI'm trying to parse a rule like this: (parser "EOL ::= [#xD#xA]+"), but it blows up with a parse error:
EOL ::= [#xD#xA]+
         ^
Expected one of:
!
&
ε
eps
EPSILON
epsilon
Epsilon
<
(
{
[
#"#\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted regexp"
#"#'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted regexp"
#"\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted string"
#"'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted string"
(*
#"[^, \r\t\n<>(){}\[\]+*?:=|'"#&!;./]+(?x) #Non-terminal"
#2022-05-2421:54winsomeI'm going off of this EBNF syntax: https://www.w3.org/TR/REC-xml/#sec-notation#2022-05-2421:55winsome"#xN - where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant."#2022-05-2421:57winsomeDo I need to translate that syntax into some other representation? Is there one in particular that I should choose?#2022-05-2422:03hiredmaninstaparse uses clojure's syntax for regexes, so it expects # to be the start of a regex, maybe \ to escape it (would have to be \\ in a string literal)#2022-05-2422:04winsomeoh, it didn't occur to me that it would look for those inside a string.#2022-05-2422:04winsomeEscaping with \ and \\ produce the same problem, though.#2022-05-2422:05winsomeThese are the code points for cr lf, I believe, maybe I need to translate those into the the clojure versions#2022-05-2422:07hiredmanah, yes, well even if # didn't throw the above error, the syntax they use for matching octets is not a thing#2022-05-2422:08hiredman(codepoints, not octets)#2022-05-2422:08aengelbergyeah, the problem is that #xN is a pseudo-syntax that the XML specification may have invented for its own grammar, to help clarify the nuances of the character code points. But Instaparse doesn’t know how to interpret that as an actual parser.#2022-05-2422:09hiredmanthe way to embed a character by code point in a clojure string is \uN#2022-05-2422:10winsomeIs N a hex number?#2022-05-2422:10hiredman
user=> "\u0029"
")"
user=>
#2022-05-2422:10hiredman(yes)#2022-05-2422:11winsome
"\u000D\u000A"
"\r\n"
#2022-05-2422:11aengelbergI think this should work in instaparse:
EOL ::= "\u000D" | "\u000A"
#2022-05-2422:15aengelbergactually, this might not work if you’re slurping the grammar from a file and passing that into instaparse. the \u000A thing is a Clojure reader feature, not an instaparse feature#2022-05-2422:14aengelbergJava regexes also support referring to chars as code points, which means you can use the Instaparse regex feature as well:
EOL ::= #"[\\x0D\\x0A]"
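Put together as a runnable sketch, with the `+` from the original XML-style rule added back. `eol-parser` is a hypothetical name; the doubled backslashes are needed only because the grammar is written here as a Clojure string literal.

```clojure
(require '[instaparse.core :as insta])

;; The regex seen by instaparse is [\x0D\x0A]+ (CR or LF, one or more).
(def eol-parser
  (insta/parser "EOL ::= #'[\\x0D\\x0A]+'"))

(eol-parser "\r\n")
;; => [:EOL "\r\n"]
```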
#2022-05-2422:17winsome(grammar/parser "EOL ::= #\"[\\x0D\\x0A]\"") seems to work.#2022-05-2422:18winsomeAnd changing the double quote to a single quote makes it a little less messy: (grammar/parser "EOL ::= #'[\\x0D\\x0A]'")#2022-05-2422:18winsomeThanks!#2022-05-2422:19aengelbergno problem#2022-05-2912:23niclasnilssonHi everyone. I have a grammar problem I don’t know how to get around. I’m parsing guitar chords, and there is an “ambiguity” in the grammar (at least the way I implemented it). The problem is that a chord can have a quality (major/minor for instance). If it’s minor, it’s always written out, but if it’s major it’s often omitted (default). Then after the chord quality, there are intervals. Each interval also have an optional quality and always a number. My problem is that in some cases this leads to two different possible answers. In those cases, the one with a chord quality is always the right one. I could of course always do insta/parses and analyse the results and pick the right one, but I guess/hope there is a better way to express the grammar to avoid this and just get one (correct) result? I’d like the chord-quality to always take precedence / be “greedy”. Is there a way to write the grammar to solve this? Edit: One thing to add is that if there is no chord-quality, it means major (so major is implicitly default), if that helps in any way.
(ns chord-parser
  (:require
    [instaparse.core :as insta]))

(def chord-ebnf-small
  "chord = root chord-quality interval*
   root = #'[A-G]'
   chord-quality = quality?
   interval = quality? number
   number = '7' | '9' | '11' | '13'
   quality = major | minor
   major = 'M'
   minor = 'm'")

(def chord-parser (insta/parser chord-ebnf-small))


(insta/parses chord-parser "C9")
; => ([:chord 
;      [:root "C"] 
;      [:chord-quality] 
;      [:interval [:number "9"]])
; As expected.

(insta/parses chord-parser "Cm")
; => ([:chord 
;      [:root "C"] 
;      [:chord-quality [:quality [:minor "m"]]])
; As expected.

(insta/parses chord-parser "CmM9")
; => ([:chord
;      [:root "C"]
;      [:chord-quality [:quality [:minor "m"]]]
;      [:interval [:quality [:major "M"]] [:number "9"]])
; As expected.

(insta/parses chord-parser "Cm9")
; => ([:chord 
;      [:root "C"] 
;      [:chord-quality] 
;      [:interval [:quality [:minor "m"]] [:number "9"]]
;     [:chord 
;      [:root "C"] 
;      [:chord-quality [:quality [:minor "m"]]] 
;      [:interval [:number "9"]]])
; Ambiguous. The one with an actual chord-quality is the correct one.

(insta/parses chord-parser "Cm9m11")
; => ([:chord
;      [:root "C"]
;      [:chord-quality]
;      [:interval [:quality [:minor "m"]] [:number "9"]]
;      [:interval [:quality [:minor "m"]] [:number "11"]]
;     [:chord
;      [:root "C"]
;      [:chord-quality [:quality [:minor "m"]]]
;      [:interval [:number "9"]]
;      [:interval [:quality [:minor "m"]] [:number "11"]]])
; Ambiguous. The one with an actual chord-quality is the correct one.
#2022-05-3007:09Linus EricssonI would solve this by a separate function that processes the parsed chord entities to default major etc#2022-05-3007:38niclasnilssonYep, but for me that’s the step after whatever can be done in the parsing step. I changed the EBNF to the following, using the PEG extension “ordered choice” (the / ), and now insta/parse always returns the right one. I will still have to fill in major as default afterwards of course, since that information is implicit.
(def chord-ebnf-small
  "chord = root / root chord-quality / root chord-quality interval* / root interval*
   root = #'[A-G]'
   chord-quality = quality
   interval = quality? number
   number = '7' | '9' | '11' | '13'
   quality = major | minor
   major = 'M'
   minor = 'm'")
#2022-05-3007:45Linus EricssonYes I understood what you wanted to do. I just cannot see why you want instaparse to do it - it may be possible but it is not really a part of parsing the tabs. The implicit major chords are not really the same as explicit major chords.#2022-05-3007:48niclasnilssonAh, no, maybe I misunderstand you, but I didn’t want instaparse to fill in major as default. I wanted instaparse to pick the right choice and interpret Cm9 as C minor 9 and not as C (major) with a minor 9. With the PEG ordered choice I managed to get what I wanted.#2022-05-3007:50Linus EricssonAh, ok! Now I see. Great that it could be solved with ordered choice.#2022-05-3007:51niclasnilssonYes, this was the first time I looked into those extensions, but they seem pretty useful.#2022-05-3007:53Linus EricssonI would extend the root to be C C# D D# E F F# G G# A A# B (and maybe all the b versions as well)#2022-05-3007:53Linus EricssonAnd all the colorings, but that might be out of the scope for the example 🙂#2022-05-3007:54niclasnilssonAbsolutely. And there are 6, sus, dim and stuff missing as well 🙂#2022-05-3007:54niclasnilssoncolorings?#2022-05-3007:54niclasnilssonThat’s outside of my current music theory knowledge, but that sounds interesting!#2022-05-3007:55Linus EricssonNo, i meant chord types.#2022-05-3007:55niclasnilssonAh, as in playing the chord in different ways / places on the neck?#2022-05-3007:56Linus EricssonThat would be something - idk if tabulatures has notation for that, but that was not what I meant either https://www.guitarworld.com/lessons/10-gorgeous-color-chords-can-inspire-your-playing#2022-05-3007:59niclasnilssonAh, got it. 
I don’t think I’ve seen tab notation on that, apart from “slash chords” like Am7/G to note the bass note.#2022-05-3008:00niclasnilsson(which may or may not be part of the actual chord)#2022-05-3008:01niclasnilssonand when it’s part of the chord, I guess it actually becomes coloring?#2022-05-3008:01niclasnilssoninteresting#2022-05-3008:02Linus EricssonI'm in the deep end of the pond here but yes, if you was to play Cmaj7 it would be noted as that, and not C/B#2022-05-3008:04Linus EricssonI guess it would be a nice thing to be able to convert between (midi) note values and tabulatures, possibly with bass notes...#2022-05-3008:04niclasnilssonYes, I’m more thinking in the way of what’s “common” in chord progressions, like C, C/E, F#2022-05-3008:06niclasnilssonAnd I suppose since E is part of C major, it’s coloring, vs if the bass note was something outside of C major, it’s probably something else?#2022-05-3008:07Linus EricssonHmm, "as a" bass player i would parse C C/E F as C - E - F#2022-05-3008:07niclasnilssonFun stuff to think about and learn about!#2022-05-3008:07niclasnilssonExactly, but the guitarist can also play E in the bass of the C chord.#2022-05-3008:08Linus EricssonOne day I will learn musical notation by implementing it.#2022-05-3008:09niclasnilssonLearning (non-computer) stuff through coding is my favourite way of understanding stuff, by far.#2022-09-2614:33r0manHello, I have defined a grammar [1] to parse Java stacktrace with Instaparse. The grammar seems to work if I pass well formed input to it. What I would like to do next, is use the grammar to also parse input that has "garbage" at the beginning or the end of the input string. So I would go from something like this:
S = exception causes
exception = ...
causes = ...
to something like this:
S = <garbage?> exception causes <garbage?>
exception = ...
causes = ...
Now, the issue I am facing is how to define <garbage>. I tried to define it as #[\s\S]* but I believe it is too greedy and it messes up my grammar. For example, sometimes parsing succeeds with garbage, but most of the input is eaten by <garbage> and not by my actual stacktrace grammar. I’m starting to wonder if I actually should include the <garbage> into my grammar at all, or use some other functionality of Instaparse. I saw I can use insta/parses to get access to all parses tried so far, but they are quite a lot, and I am not sure which one to pick (I guess it depends on my application). How do you deal with garbage, or rules that are too greedy in Instaparse? Thanks for your help. [1] https://github.com/r0man/orchard/blob/stacktrace-at-point/resources/orchard/stacktrace/parser/java.bnf
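One approach that may help with an over-greedy garbage rule is to consume garbage one character at a time, guarded by a negative lookahead on the rule that should win. The sketch below uses a toy one-line `exception` rule, not the real grammar from [1]; all rule names and the sample input are hypothetical, and checking a lookahead at every character may be slow on large inputs.

```clojure
(require '[instaparse.core :as insta])

(def stacktrace-with-garbage
  (insta/parser
    "S = <garbage> exception <garbage>
     exception = #'[A-Za-z.]+Exception' <': '> #'[^\\n]+'
     garbage = (!exception #'[\\s\\S]')*"))
;; garbage stops as soon as an exception could start at the current position

(stacktrace-with-garbage
  "INFO noise\njava.lang.RuntimeException: boom\nmore noise")
;; => [:S [:exception "java.lang.RuntimeException" "boom"]]
```

An alternative under the same assumptions is to trim the input with an ordinary regex before handing it to the parser, which keeps the grammar itself free of garbage rules.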
#2022-11-0407:53sova-soars-the-soraI have a set of vowel patterns I want to match against, flexibly. Any idears?#2022-11-0619:34zaneMight want to provide some more details. simple_smile#2023-01-1418:36borkdudeIf anyone wants to try instaparse with #babashka check this out: https://github.com/babashka/instaparse.bb#2023-01-1608:42SigveIt would be interesting to read about the challenges in porting this, and the subsequent limitations#2023-01-1608:47borkdudeThe limitations are around not being able to serialize certain things like functions, for which there are solutions (e.g. send around quoted things and evaluate them later)#2023-01-1608:48borkdudeSince pod function calls are basically RPC calls#2023-01-1707:41SigveI see, thanks#2023-01-1716:50robert-stuttafordwe managed to take our servers down with an instaparse implementation that powers a css documentation system we built for ourselves. we literally lived the one problem, two problems regex today 😂#2023-01-3012:28licht1steinHi, I'm totally new to parsing, working on a linguistic side project. I'm trying to parse something that looks like this: "<pc>1,1<k1>a<k2>a<h>1<e>1\n<hom>1.</hom> <s>a</s>", where <pc>1,1 , <k1>a is a first type of tag, and <s>a</s> is another type of tag. I would like to get something like {:pc "1,1" :k1 "a" :k2 "a" :h "1"} for starters, because the second part should be simple xml. I've got this, which works as long as the string doesn't contain anything else, but breaks on the entire sample:
"
S = {tag}
tag = <tag-open> + key + <tag-close> + value
key = #'[a-zA-Z0-9]+'
value = #'[a-zA-Z0-9]+'
tag-open = #'<'
tag-close = '>'
"
I feel like I'm missing an understanding of some basic piece. I also don't know how to separate xml tags from these first kind of tags. Please help.
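A possible starting point for the first kind of tag, limited to the header part before the newline. This is a sketch under assumptions: a value is taken to be anything up to the next `<` or a newline, and `header-parser` and `header->map` are hypothetical names. Note that in instaparse `+` means "one or more", and concatenation is written by simply placing items next to each other.

```clojure
(require '[instaparse.core :as insta])

(def header-parser
  (insta/parser
    "S = tag+
     tag = <'<'> key <'>'> value
     key = #'[a-zA-Z0-9]+'
     value = #'[^<\\n]+'"))

(defn header->map
  "Parses a header such as \"<pc>1,1<k1>a\" into a map of keyword to string."
  [s]
  (insta/transform
    {:key   keyword
     :value identity
     :tag   vector                               ; [key value] pair
     :S     (fn [& pairs] (into {} pairs))}
    (header-parser s)))

(header->map "<pc>1,1<k1>a<k2>a<h>1<e>1")
;; => {:pc "1,1", :k1 "a", :k2 "a", :h "1", :e "1"}
```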
#2023-01-3016:35thomThis is presumably some form of SGML. You can probably find a Java library that’ll parse it already, but also most HTML libraries will do little fixups to unclosed tags (lots based on JSoup etc). If you want to parse it yourself you just need to introduce opening and closing tags to your grammar and make the closing ones optional.#2023-02-0908:59SigveIf anyone here is in the small subset of people using both Instaparse and (n)vim, i made a syntax file with some instaparse-specific things that makes it a bit nicer than using ebnf.vim or similar. Also includes a very basic indent script https://github.com/sigvesn/instaparse.vim#2023-03-0220:01MarkusHello y’all! I’m working on generating regular grammars using genetic programming. Now, I’m not completely sure how to use instaparse or, more specifically, I’m not quite sure how to write my rules, so that instaparse produces the desired output. I’m testing instaparse using tomita grammars and for that I’m converting some FSMs to rulesets. I got really confused using tomita-3 and instaparse, since the rules would be
A -> 'a'A
A -> 'b'B
A -> eps
B -> 'b'A
B -> 'a'C
B -> eps
C -> 'a'D
D -> 'a'C
D -> 'b'E
D -> eps
but when I typed the ruleset into instaparse using = instead of the -> of my specification, it wasn’t parsing anything. What would be the correct way to rewrite these rules for instaparse?
#2023-03-0220:04MarkusWhen I say it wasn’t parsing anything, I actually mean that it was parsing valid words of tomita-3 as empty arrays or returning a failure.#2023-03-0315:20MarkusFor anyone else looking for a solution: It seems that instaparse strongly adheres to EBNF, which means that each non-terminal can only be defined once. This means that all possible production rules for one non-terminal have to be combined into a single rule with | like this:
A = 'a'A | 'b'B | eps
B = 'b'A | 'a'C | eps
C = 'a'D
D = 'a'C | 'b'D | eps
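As a rough check, the combined rules above can be loaded as-is. This sketch assumes the first rule (`A`) is the start rule, which is instaparse's default, and the name `tomita-3` is only illustrative.

```clojure
(require '[instaparse.core :as insta])

(def tomita-3
  (insta/parser
    "A = 'a' A | 'b' B | eps
     B = 'b' A | 'a' C | eps
     C = 'a' D
     D = 'a' C | 'b' D | eps"))

;; "abba" can be derived via A -> aA -> abB -> abbA -> abbaA -> abba
(insta/failure? (tomita-3 "abba"))
;; => false

;; "aba" gets stuck in C, which needs another 'a'
(insta/failure? (tomita-3 "aba"))
;; => true
```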
#2023-04-0721:09port19I'm working on a Lexer as part of following along with my college course on compilers. I tried to refactor my handwritten scanner for AS ::= { peter | petra | anna }, but didn't have much success using instaparse. Am I hitting a hammer with a chainsaw or am I missing something obvious? My small lexer is best viewed through this Clerk Notebook: https://port19.xyz/compnotes/#2023-04-0800:58hiredmanInstaparse is, I believe, a scannerless parser, meaning it is more powerful than a scanner (it can construct parse trees, not just return a stream of tokens), and also skips the intermediate step of scanning input into a stream of tokens and then build parse trees out of the stream of tokens#2023-04-0800:58hiredmanSo I am not sure why you would use it to construct a scanner#2023-04-0806:24port19Fair enough#2023-04-0809:07port19Are you aware of some scanner generator library? Or should I take this as an opportunity to write a scanner generator myself, perhaps even with macros if possible?#2023-04-2216:36diego.videcoNeed some help with my grammar. I have this (simplified):
pattern = (<whitespace>? token)*
<token> = word | cat
cat = pattern <whitespace> <'.'> <whitespace> pattern
whitespace = #"\s+"
word = #"[a-zA-Z]+"
And with this test string "a b . c" I get this result:
[:pattern
 [:word "a"]
 [:cat
  [:pattern [:word "b"]]
  [:pattern]]
 [:word "c"]]
But I am expecting to get this:
[:pattern
 [:cat
  [:pattern [:word "a"] [:word "b"]]
  [:pattern [:word "c"]]]]
However this string does provide the correct results: " a b . c" (notice the whitespace at the beginning). Any ideas how I can get the expected result?
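One way to inspect this kind of problem is `insta/parses`, which returns every parse rather than just the first. The sketch below is a layered variant of the grammar, not the original: it assumes a pattern is either a plain run of words or exactly one `.`-separated pair, and it uses a made-up `words` rule, so the tree shape differs slightly from the expected one.

```clojure
(require '[instaparse.core :as insta])

(def pattern-parser
  (insta/parser
    "pattern = cat | words
     cat     = words <ws> <'.'> <ws> words
     words   = word (<ws> word)*
     ws      = #'\\s+'
     word    = #'[a-zA-Z]+'"))

(pattern-parser "a b . c")
;; => [:pattern [:cat [:words [:word "a"] [:word "b"]] [:words [:word "c"]]]]

;; Only one tree exists, so the result no longer depends on search order.
(count (insta/parses pattern-parser "a b . c"))
;; => 1
```

Because `cat` no longer refers back to `pattern`, the two overlapping loops are gone and each input has a single derivation.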
#2023-04-2217:20hiredmanMake your grammar unambiguous. Given that input, both what instaparse gives and what you expect are valid parses according to your grammar. I would expect that if you asked instaparse for all the parses, instead of just taking the first one it comes up with, you would get both trees#2023-04-2217:23hiredmanYou have essentially two loops in your grammar, one is implicit in using * at the end of pattern, and the other goes pattern -> token -> cat -> pattern#2023-04-2217:23hiredmanAnd those overlap in what they recognize#2023-04-2219:48diego.videcoThanks, that worked#2023-04-3004:59diego.videcoHow can I disambiguate a grammar? I have this input, a?0.1 0.3, that returns these two parses:
{:parses
    ([:pattern
      [:cat [:degrade [:word "a"] [:op-degrade]] [:float "0.1"] [:float "0.3"]]]
     [:pattern
      [:cat
       [:degrade [:word "a"] [:op-degrade [:degrade-amount "0.1"]]]
       [:float "0.3"]]])}
I would just like to have the latter. Relevant parts of my grammar look like this:
pattern = cat (<ws> <'.'> <ws> cat)*

ws = #"\s+"
word = #"[a-zA-Z0-9]([a-zA-Z0-9]*)"
int = #"[0-9]+"
float = #"(?<!\?)\d+(\.\d+)"
<token> = word | int | float | degrade

cat = token (<ws>? token)*
degrade-amount = #"([0-9]*[.])?[0-9]+"
op-degrade = <'?'> degrade-amount?
degrade = token op-degrade
I am trying to have different regexes for float and degrade-amount , but it’s not actually working. I also tried this: float = (!'?' #"\d+(\.\d+)") but doesn’t seem to work.
#2023-05-0118:28ghaskins@diego.vid.eco it strikes me that you can probably accomplish disambiguation here using the ordered-choice feature (https://github.com/Engelberg/instaparse#ordered-choice)#2023-05-0123:10aengelbergYes, try something like
op-degrade = (<'?'> degrade-amount) / <'?'>
#2023-05-0304:15diego.videcoThanks @aengelberg and @ghaskins, actually my insta/parse was working as expected (prioritizing the correct version) but insta/parses is still showing more than one possibility, though I guess that this might prevent cases where the first parse is not the one I want right?#2023-05-0304:20aengelbergAh, yes, ordered choice still gives you all the possible parses. If you want it to be unambiguous, try a negative look ahead.
op-degrade = (<'?'> degrade-amount) | <'?'> !degrade-amount
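A small sketch of the difference, using a cut-down toy grammar rather than the full one from the question. The helper `degrade-parser` and the simplified `word` and `float` rules are assumptions made for illustration only.

```clojure
(require '[instaparse.core :as insta])

(defn degrade-parser
  "Builds a toy parser around the given op-degrade rule body."
  [op-degrade-rule]
  (insta/parser
    (str "pattern = token (<ws>? token)*
          <token> = degrade | word | float
          degrade = word op-degrade
          op-degrade = " op-degrade-rule "
          degrade-amount = #'([0-9]*[.])?[0-9]+'
          float = #'[0-9]+[.][0-9]+'
          word = #'[a-zA-Z]+'
          ws = #'\\s+'")))

;; Optional amount: '?' may or may not claim the following number.
(def ambiguous (degrade-parser "<'?'> degrade-amount?"))

;; Negative lookahead: a bare '?' is only allowed when no amount follows.
(def strict (degrade-parser "<'?'> degrade-amount | <'?'> !degrade-amount"))

(count (insta/parses ambiguous "a?0.1 0.3"))
;; => 2

(insta/parses strict "a?0.1 0.3")
;; => ([:pattern [:degrade [:word "a"] [:op-degrade [:degrade-amount "0.1"]]] [:float "0.3"]])

(strict "a? 0.3")
;; => [:pattern [:degrade [:word "a"] [:op-degrade]] [:float "0.3"]]
```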
#2023-05-0400:25diego.videcoThanks, that works great!#2023-05-0400:26diego.videcoI believe I was misusing the negative look ahead syntax. I appreciate the example!#2023-05-1000:02jacob.maineWe’re parsing large documents. We’ve found we get better performance (memory and CPU) if we split the documents up into chunks, and then parse each chunk separately. This keeps the grammar smaller and aligns with https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips. However, we also need to keep track of line and column numbers of the original document, which doesn’t work very well with the chunked approach. Say we have two chunks, lines 1-10 and lines 11-20. When we use insta/add-line-and-column-info-to-metadata on the second chunk, the line metadata starts at line 1, not line 11. At the moment we have a collection of helpers to walk the metadata after it’s generated, and offset it. But I was wondering if anyone has a better approach. I’ve submitted a PR, https://github.com/Engelberg/instaparse/pull/226, which pushes the complexity into instaparse itself. But if anyone has other tricks, I’d love to hear.#2023-05-1419:53thomclj-antlr is much faster than Instaparse if performance remains an issue. Not quite as nice an API though.#2023-05-1515:30jacob.maineThanks, I’ll check that out!#2023-05-2717:26tbrookeI am trying to port a grammar from pegjs and it says it is an abnf grammar. It does not work. The grammar starts like this:
/* ----- A.1 Lexical Grammar ----- */
SourceCharacter = .
WhiteSpace "whitespace" = "\t" / "\v" / "\f" / " " / "\u00A0" / "\uFEFF" / Zs
LineTerminator = [\n\r\u2028\u2029]
What is this and can I translate it to something Instaparse understands?#2023-05-2722:47hiredmanLooks nothing like the example abnf here https://en.m.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_form#2023-05-2722:49hiredmanThe "whitespace" (quoted like a terminal?)
on the left of the = is like no grammar I've seen before#2023-05-2722:56hiredmanAh, pegjs has its own grammar syntax, what you are likely looking at is the grammar for abnf grammars, written in the pegjs grammar syntax#2023-05-2723:28tbrookeThank You - So it looks like I need to figure out pegjs grammar and manually translate to abnf#2023-06-0515:18tbrookeI am trying to parse a grammar that I would think would not be too difficult for Instaparse. I have looked at the tutorials and googled but I can actually find very few examples of instaparse grammars. The format I am parsing is as follows. Does anyone have an example of something similar that would help me get a handle on this: namespace concept address { o String street } concept person { o String name o Integer age o Address address optional }#2023-06-0515:59respatializedIn my experience the best thing to do is to try and write the grammar in EBNF by hand on paper before specifying it in code. It always forces me to clarify the structure in my mind.#2023-08-1523:40tbrookeMy deps.edn is this:
{:paths ["src"] :deps {org.clojure/clojure {:mvn/version "1.10.1"} instaparse/instaparse {:mvn/version "1.4.12"}}}
In my file I have:
(ns scratch
  (:require [instaparse.core :as insta ]
            [clojure.java.io :as io]))))
(def xx (insta/parser "S = AB* AB = A B A = 'a'+ B = 'b'+"))
I get this error (it used to work; I can’t figure out why it stopped):
; Syntax error compiling at (src/scratch.clj:6:3).
; No such namespace: insta
Any suggestions?#2023-08-1620:46Samuel Ludwigwhen you first started your REPL, did you see the library getting downloaded? also, if you surround code in three backticks like '`' <put> <code> <here> '`' it'll come out a little nicer
<like>
  <so>
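For reference, a minimal sketch of the same file with a balanced ns form, assuming instaparse is on the classpath via the deps.edn quoted in the question. The grammar and sample input follow the instaparse README; the clojure.java.io require is left out only because this sketch does not use it.

```clojure
(ns scratch
  (:require [instaparse.core :as insta]))

;; Each rule sits on its own line inside the grammar string.
(def xx
  (insta/parser
    "S = AB*
     AB = A B
     A = 'a'+
     B = 'b'+"))

(xx "aaaaabbbaaaabb")
;; => [:S
;;     [:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]]
;;     [:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
```

With stray closing parentheses the reader ends the ns form early, so the `insta` alias is never set up by the time the `def` is compiled, which matches the "No such namespace" error.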
#2023-08-1700:02tbrookeI’m an idiot: extra parenthesis in the ns#2023-09-0616:07roklenarcicHi. I am trying to create a very simple parser but I don’t understand why it fails. If I have the following parser: S = E; E = 'N' | 'W' | 'S' | 'E' Then
(parser "N")
=> [:S [:E "N"]]
but if I add a rule to expand to multiple it stops parsing:
S = E; E = 'N' | 'W' | 'S' | 'E'; E = E E;
(parser "N")
=> Parse error at line 1, column 1:
N
^
Seems odd that adding more production rules would make it parse less….
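A sketch of the merged form, on the assumption (noted earlier in this channel) that each non-terminal should be defined only once, so the extra production becomes one more alternative of `E`. The name `dir-parser` is illustrative.

```clojure
(require '[instaparse.core :as insta])

(def dir-parser
  (insta/parser
    "S = E
     E = 'N' | 'W' | 'S' | 'E' | E E"))

(dir-parser "N")
;; => [:S [:E "N"]]

;; Longer inputs are accepted too, although E E makes them ambiguous
;; once there are three or more letters.
(insta/failure? (dir-parser "NWS"))
;; => false
```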
#2023-09-0616:12hiredmanam not sure, but I would try adding the second E production as an alternative to the first E production#2023-09-0616:35roklenarcicHm I’ll just rewrite it all I think#2023-09-0616:36roklenarcicAnother puzzle:
S = DIR+; DIR = ('N' | 'W' | 'S' | 'E')+;
This produces:
(parser2 "WWWW")
=> [:S [:DIR "W"] [:DIR "W"] [:DIR "W"] [:DIR "W"]]
I would prefer that DIR be greedy, in that it would only parse as a single element if there are multiple letters in a row… how would I do that?
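One possible approach, sketched here as an assumption rather than the only answer: let a regex do the repetition, since a regex match is greedy and yields a single token. Note that the run then comes back as one string instead of separate letters.

```clojure
(require '[instaparse.core :as insta])

(def parser2
  (insta/parser
    "S = DIR+
     DIR = #'[NWSE]+'"))

(parser2 "WWWW")
;; => [:S [:DIR "WWWW"]]
```

The original grammar is ambiguous because both `DIR+` and the inner `+` can absorb a repeated letter; moving the repetition into the regex leaves only one way to split the input.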
#2023-09-0616:39hiredmanhttps://github.com/engelberg/instaparse#ambiguous-grammars#2023-11-0918:25Giles AlexanderHi, I’m having a weird issue with instaparse. I’ve written a grammar using EBNF. When parsing moderately long (~200 lines) documents using that grammar, if the document has an error then Instaparse will go into an infinite loop, but only if the document uses DOS line endings. Huh? If the document uses UNIX line endings, then Instaparse reports the error. EOL is part of the syntax of the grammar, and I have defined a terminal ('\r\n' | '\n' | '\r' | #"$"). There’s got to be something wrong with my grammar, and I want to fix that. But, it seems odd that I’m able to drop Instaparse into an infinite loop with only changing the line endings. Anyone have any ideas where I should start to look to produce a simpler test case? Thanks 🙏#2023-11-0918:27aengelbergSo it infinite loops if the input has \r\n ?#2023-11-0918:28Giles AlexanderNot exactly. Infinite loops if the document has \r\n and the document otherwise has a parse error.#2023-11-0918:30aengelbergI think #"$" might be dicey because it will detect the end of a line, not consume it (because it parses zero characters), meaning you could parse infinite empty lines in a row#2023-11-0918:30Giles AlexanderAhhh… I’m trying to match end of input as the same as an end of line#2023-11-0918:31aengelbergAlso, $ detects the end of a line, not the end of the file#2023-11-0918:31Giles Alexander\z instead?#2023-11-0918:32aengelbergYeah, that seems closer to what you want#2023-11-0918:32Giles AlexanderThanks! I’ll give it a try. And see if I can produce something to repro the infinite loop with that and without that change#2023-11-0918:33aengelbergI'd still be a little concerned at the potential for matching infinite EOF's#2023-11-0918:34Giles AlexanderOK. I see what you mean. 
I’ll have a think about a different way of expressing this#2023-11-0918:35aengelbergIt's possible you don't need to explicitly match on EOF, because instaparse will only consider a parse valid if it consumes the whole string
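A minimal sketch of that idea, using a toy line-oriented grammar that is assumed here and is not the grammar from the question: the EOL terminal always consumes at least one character, and end of input is never matched explicitly.

```clojure
(require '[instaparse.core :as insta])

(def line-parser
  (insta/parser
    "doc  = line (<eol> line)* <eol>?
     line = #'[^\\r\\n]+'
     eol  = #'\\r\\n|\\n|\\r'"))

;; DOS line endings, with a trailing EOL
(line-parser "a\r\nb\r\n")
;; => [:doc [:line "a"] [:line "b"]]

;; A blank line is invalid in this toy grammar; the parse fails
;; and returns a failure object.
(insta/failure? (line-parser "a\r\n\r\nb"))
;; => true
```

Since no rule can succeed without consuming input, there is nothing that can repeat forever at one position, and the parser still rejects any input it cannot consume in full.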