kevin's weblog

haskell web spider, part 3: more hxt
2007-11-01 12:20 in /tech/haskell

in my last post about hxt i had gotten stuck at a performance problem in hxt that rendered it unusable. since then, i’ve exchanged a number of emails with uwe schmidt, the maintainer. he found where the exponential blowup was happening in the regex engine and fixed that problem.

with that fix, my spider ran for a bit longer, but eventually failed after hitting the per-process limit on open file descriptors. i tried adding strictA in a couple of places in the code, but it did not resolve the resource leak. uwe claims this is a bug in Network.HTTP, and suggested the a_use_curl option to spawn an external curl program to do the fetching. while it sucks to be spawning hundreds of processes for this task, it did fix the resource leak.

with those problems out of the way, i was able to focus on some issues in my own program, like trying to validate jpg images as xml, or trying to fetch mailto: links. i’m now reasonably happy with the program, which you can see in the hxt/practical section of the haskell wiki.

the major area where this could still be improved is parallelization. verifying the roughly 700 pages and links on my site takes 45 minutes, during which the program is only doing work for about 8; the rest is spent waiting on the network. it would definitely be a good exercise to learn more about the concurrency capabilities of haskell, although the hidden system state in hxt makes me nervous about whether it’ll work at all. i’d probably want to do a couple of simpler exercises in concurrent programming first, before attempting to parallelize this one.

i have a few remaining complaints / suggestions for hxt. one, which i believe uwe is already thinking about adding, is an option to adapt the parsing based on the content type of the response. currently, you have to specify html or xml parsing in the call to readDocument.
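to make that complaint concrete, here’s the kind of content-type dispatch i’d like to see. this is a sketch in plain haskell with invented names (ParseMode, chooseParser); it is not part of the hxt api:

```haskell
import Data.List (isSuffixOf)

data ParseMode = ParseXML | ParseHTML
  deriving (Eq, Show)

-- pick a parser from the response's Content-Type header,
-- and refuse outright anything that isn't a document type
chooseParser :: String -> Either String ParseMode
chooseParser ct
    | base == "text/html"                         = Right ParseHTML
    | base `elem` ["text/xml", "application/xml"] = Right ParseXML
    | "+xml" `isSuffixOf` base                    = Right ParseXML  -- e.g. application/xhtml+xml
    | otherwise = Left ("refusing to parse content type " ++ base)
  where
    base = takeWhile (/= ';') ct   -- drop parameters like "; charset=utf-8"
```

with something like this, the library could choose html or xml parsing itself, and an image/jpeg response would produce an error instead of a pile of bogus validation messages.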
this is not terribly useful in an application like this one. it would be much nicer if hxt used xml parsing when the content type is xml, html parsing when it’s text/html, and complained about anything else (like image/jpeg).

another frustration was that tracking parsing and validation errors back to their source was very difficult in some cases. a missing end tag frequently doesn’t produce a parse error until much later in the document. the validator would catch this much earlier, but hxt does parsing and validation in two separate passes. one can insert missing end tags at the point of the parse error and then look at the resulting validation error, but the tree the validator operates on doesn’t carry any line or column information, so you can’t easily track a validation error down to a specific location in the source file. presumably the nodes in the tree could be augmented with this data fairly simply, but the other shortcomings of the two-pass approach are undoubtedly much harder to fix.

permalink · 0 comments

visiting nyc
2007-10-29 11:01 in /life/travel

i spent half of last week in manhattan, the first time i’ve been there in a bit more than 6 years. a series of observations:

i am always amazed at the infrastructure of the city, and that it is possible to sustain a city on this scale at all. walking around, you can’t avoid noticing the volume of trash being generated, and removed, each and every day. one is constantly walking under scaffolding. less visible is the constant stream of food into the city’s restaurants; there is also surprisingly little road work. among other things, it makes me thankful for modern health and safety practices.

walking around midtown, i find myself having that out-of-place, but trying-to-fake-it, feeling. staying in a hotel that’s way more expensive than i could afford, surrounded by luxury stores and swanky clubs, i wonder how many of the people i see really belong to this lifestyle and how many are just pretending for a bit.
i’ve lived in the part of the universe where “coffee” means “espresso” for so long that the dominance of drip coffee in new york seems like some sort of stubborn affectation.

they recently opened a whole foods on the bowery.

there seems to be some sort of economic ideal achieved here, wherein people out-source virtually everything: cooking, laundry, driving, etc. i don’t claim to really understand the local economy, but i wonder if in part it’s just so damn expensive that you can’t afford not to fully realize your comparative advantage, whatever it might be.

this was my first time flying jetblue. not too shabby, but there’s just no way around the fact that red-eyes suck. at least i got a couple of mindless movies to entertain me on the way home, since i forgot to bring a book to read on the flight.

permalink · 0 comments

you know you’ve been doing too much haskell...
2007-10-24 09:10 in /tech/haskell

...when you’re walking down the street and see this: [photo] ...and your reaction is “hey, there’s no ‘gt’ operator in haskell!”

permalink · 8 comments

on reddit
2007-10-01 10:01 in /tech

prompted to some degree by michael’s experience, i decided to submit my last two posts, about my attempt to use haskell’s xml libraries, to reddit. haskell is still quite a niche language, but nonetheless lots of articles about it make their way to the front page of the programming subreddit. unfortunately, my first article got only about 6 visits, 2 downvotes, and no comments. the second didn’t generate even a single visit to my site. perhaps i misunderstand how the “new” page on reddit is supposed to work, but i couldn’t even find the article listed there, although many older submissions were. my fear for a while has been that reddit is basically an echo chamber: popular sites and blogs get lots of upvotes from their regular readers, but obscure sites never get seen and never even have the opportunity to collect votes. this is unfortunate.
i’m often disappointed at how little new and truly interesting content makes the front page of reddit, and how long many articles linger after i’ve finished reading them, or passed them over as uninteresting. in theory, the collaborative filtering algorithm is supposed to help with this, customizing my front page to my learned interests, but it just doesn’t seem to work in practice. i’ve recently cut back my time on reddit substantially, because i realized i wasn’t getting much value out of it. i don’t imagine i’ll drop it out of my rotation of regular web sites entirely, but i’m definitely not missing the wasted time, which i’m now using to actually write code and articles of my own.

permalink · 2 comments

haskell web spider, part 2: hxt, or i was promised there would be no side effects
2007-09-30 11:30 in /tech/haskell

once haxml proved unsuitable for validating xhtml, i turned my attention to hxt, the haskell xml toolkit. while the api for haxml looked pretty similar to what i might have designed myself, hxt has more of a learning curve. in particular, it is based on the arrow computational structure. like monads, arrows require learning new syntax and a new conceptual model. unlike monads, where tutorials are a dime a dozen, there’s little out there to help you learn to use arrows effectively. this is complicated by the fact that hxt extensively extends the base arrow definition, with little additional documentation. (my one-sentence explanation of arrows is that they model computation as pipeline elements which can be composed in sequence or in parallel.)

despite the paucity of documentation, i got much further along with hxt. in fact, i have a complete working program, except for a “but” that would satisfy sir mix-a-lot. i’m going to show most of this program, with annotations, then explain how things go wrong.
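(for readers new to arrows: the pipeline flavor is already visible in the plain function instance of the arrow classes, which ships with ghc in Control.Arrow. the example below is mine, not from hxt:)

```haskell
import Control.Arrow ((>>>), (&&&), first)

-- (>>>) sequences two stages, (&&&) runs two stages on the same
-- input "in parallel", and `first` applies a stage to half of a pair
pipeline :: Int -> (Int, Bool)
pipeline = (+ 1) >>> (show >>> length) &&& even

halves :: (Int, String) -> (Int, String)
halves = first (* 2)
```

hxt uses these same combinators, just over its own arrow types instead of plain functions.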
    type MyArrow b c = IOStateArrow (Set.Set String) b c

    runUrl url = runX ( constA url
                        >>> setTraceLevel 1
                        >>> withOtherUserState Set.empty
                        >>> split
                        >>> checkUrl
                      )

hxt throws attempts at purity and separation of concerns to the wind, and pushes everything it does into an IOStateArrow (underneath which are the IO and State monads). the state is separated into a system state and a user state, which is () by default. because i’m going to want to track the urls i’ve crawled, i specify a set of strings as my state. this code shoves the seed url into an arrow using constA, enables a low level of tracing, and sets up my initial state. with that setup done, we can start doing some real work. (the split will become clear in a second.)

    checkUrl :: MyArrow (String, String) String    -- (url, base)
    checkUrl = clearErrStatus
               >>> ifA (first seenUrl)
                     (fst ^>> traceString 1 ("skipping (seen already) " ++) >>> none)
                     (first markSeen >>> ifP isLocal checkLocalUrl checkRemoteUrl)

this function checks whether the url has been seen before, by checking it against the set in the user state; if it has, we emit a debugging message and then stop. (none is a function on list arrows that essentially produces an empty list, signifying no more work to be done.) if this is a new url, we mark it as seen, then branch based on whether it is a local url or a remote one. this is where the split above comes in: we keep track of the previous url that this one was linked from, in order to figure out when we are leaving the original website. the most mysterious part of checkUrl is the first line. originally i did not have it, and i observed that the spider would run for a while, but terminate before the whole site was crawled. after adding some additional debugging statements, i discovered something i am inclined to consider a bug in hxt.
after a document with validation errors is encountered, something gets set in the global error state which causes all further attempts to read in a document to fail silently. so, after the spider found its first errors, it would terminate shortly thereafter, since it wasn’t managing to pick up any new urls to crawl. the addition of clearErrStatus before each new fetch prevents this failure.

    checkLocalUrl :: MyArrow (String, String) String
    checkLocalUrl = constructLocalUrl
                    >>> split
                    >>> first ( traceString 0 ("checking " ++)
                                >>> readFromDocument []
                                >>> selectLinks
                                >>> traceString 1 ("found link: " ++)
                              )
                    >>> checkUrl

    selectLinks :: ArrowXml a => a XmlTree String
    selectLinks = deep ( isElem >>> hasName "a"
                         >>> getAttrValue "href"
                         >>> mkText
                       )
                  >>> getText

checkLocalUrl expands any relative urls, then reads in the resulting url and selects out any hrefs from the document. the result is a list of new urls to crawl, each paired with the url of this document, which we pass recursively back into checkUrl. what’s implicit in this code is that readFromDocument validates the document by default, and in addition to fetching the document itself also fetches the dtd, including any subparts, thus avoiding the difficulties i had with haxml. somewhat oddly, the library simply prints the validation errors, rather than returning them from the function, but that’s something i can live with in this application. (i think it would be possible to specify an alternate error handler if you wanted to store the errors for later processing.) checkRemoteUrl is not terribly interesting, and for the purposes of this exposition, you can just consider it to be a synonym for none.

    checkRemoteUrl = none

this code seems to be correct, but..... i set it running on my website and it chugs along for a while. then it hits a particular page (of perfectly valid xhtml, by the way), starts validation, and just never stops. i let it run for about 40 minutes with the cpu pegged before killing the process.
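for completeness: the small helpers referenced above but not shown (seenUrl, markSeen, isLocal) don’t do anything deep. stripped of the arrow plumbing, they amount to something like the following; the isLocal heuristic here is my guess at the logic, not the actual code:

```haskell
import qualified Data.Set as Set
import Data.List (isInfixOf, isPrefixOf)

-- the user state is just the set of urls already crawled
seenUrl :: String -> Set.Set String -> Bool
seenUrl = Set.member

markSeen :: String -> Set.Set String -> Set.Set String
markSeen = Set.insert

-- a url is "local" if it stays on the site we started from;
-- relative urls (no scheme) count as local
isLocal :: String -> String -> Bool
isLocal siteRoot url =
    siteRoot `isPrefixOf` url || not ("://" `isInfixOf` url)
```

in the real program these run inside the IOStateArrow user state rather than as pure functions.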
some further investigation with a pared-down version of the document showed that it’s not in an infinite loop, but in some nasty, presumably exponential, blowup in the regular expression engine. the source of this blowup is somehow non-local: removing either half of a list of paragraphs eliminates the problem, removing an enclosing div eliminates the problem, etc. the author of hxt chose to implement his own regex engine based on the concept of “derivatives of regular expressions”, which i take it is interesting academically but, it would seem, perhaps not ideal practically speaking.

at the moment, this is where i’m stuck. i’m pretty comfortable with the program as it stands, but the library is letting me down. fortunately, this problem seems more tractable than the haxml problem. the decision is then whether to a) wait for the maintainer to fix it, b) try to fix it myself, c) try to replace the custom regex engine in hxt with one of the standard implementations, or d) rip out the regular expressions from the validation entirely and replace them with standard string functions which would be guaranteed to perform linearly. the biggest obstacle to working on this myself is not actually the code, but the fact that it takes nearly an hour to recompile the library on my computer.

verdict: hxt has a somewhat steep learning curve, and the api is a little rough around the edges in places, particularly the state handling parts. there is a desperate need for a better, more comprehensive tutorial. this library was written as someone’s masters thesis, and the code has the look of something which has never seen serious code review (e.g., lots of typos, even in function names). i can see no good reason to reimplement regular expressions for this task. actually, i can see no good reason to use regular expressions at all for this task. this portion of the code should be completely overhauled.
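as an aside, the “derivatives” technique itself is easy to state: a regex matches a string iff taking the derivative with respect to each character in turn leaves a regex that accepts the empty string. here’s a minimal sketch (mine, not hxt’s engine); note that without aggressive simplification the intermediate regexes keep growing, which is presumably where blowups like the one above come from:

```haskell
data Re = Empty | Eps | Chr Char | Alt Re Re | Seq Re Re | Star Re

-- does this regex accept the empty string?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable (Star _)  = True

-- Brzozowski derivative: `deriv c r` matches s iff r matches (c:s)
deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr x)   = if x == c then Eps else Empty
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b)
    | nullable a  = Alt (Seq (deriv c a) b) (deriv c b)
    | otherwise   = Seq (deriv c a) b
deriv c (Star r)  = Seq (deriv c r) (Star r)

matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r
```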
on the other hand, it is easy to add debugging statements and to thread state through the computation, and the arrow api has a certain elegance to it.

to come: a resolution? parallelization?

permalink · 2 comments

haskell web spider, part 1: haxml
2007-09-29 23:40 in /tech/haskell

i’ve been working on writing a validating web spider in haskell, mostly because it’s something that would be useful to me. i want to be able to validate the markup on my entire web site, and the external services for (x)html validation usually stop crawling after about 100 pages. i also want to learn haskell better. of course, i recognize that this type of task is not haskell’s forte, and that’s exactly why i chose it. i want to push the edges of what the language is good at, to understand its weaknesses as a general-purpose language as well as its strengths. along the way, i’ve been learning a lot about monads, arrows, error handling, and managing state, as well as about the usefulness of the community and the general quality of the libraries.

there are three xml libraries for haskell: hxml, haxml, and hxt. hxml does not provide any validation functions, so it was immediately out of the running. haxml seems relatively simple and was the next library i located, so i first attempted to use it for this task. i’ll talk about hxt in the next article in this series. this presentation is not strictly chronological, as i’ve bounced back and forth between haxml and hxt when i’ve hit problems with one or the other.

haxml is fairly minimalist. it doesn’t deal with fetching content off the net, and it doesn’t force you into the IO or any other monad. on the one hand, this seems like good library design, but it also leads to some shortcomings.
a big one is that haxml is hard to debug relative to hxt, because you can’t just toss in trace messages at will; and, for added frustration, the data types don’t derive Show, so even once you get them to the outer layers of your program and have access to the IO monad, you still can’t easily display them. despite this, the naive approach to fetching, parsing, and validating a document seems simple: use the simpleHTTP function from Network.HTTP to fetch a url, then call xmlParse on the content, and then validate the resulting dom. (i’m skipping over the handling of all the error cases which can arise.)

    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
          Left msg         -> return [msg]
          Right (dtd, doc) -> return $ validate dtd doc

    parse :: String -> String -> Either String (DocTypeDecl, Element)
    parse file content =
        case dtd of
          Left msg   -> Left msg
          Right dtd' -> Right (dtd', root)
      where
        doc  = xmlParse file content
        root = getRoot doc
        dtd  = getDtd doc

unfortunately, this fails miserably, reporting that every single element and attribute in the document is invalid. i puzzled over this for a while before realizing that while my xml document contained a DocTypeDecl, that object was useless for validating, because it contained only the identifier information for the dtd and none of its contents. notice that we never actually fetched the dtd from the network in the above sequence. once i realized this, i added a little more code to get the system identifier out of the dtd declaration, fetch that document, run it through dtdParse, and then validate using the resulting DocTypeDecl.
    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
          Left msg         -> return [msg]
          Right (dtd, doc) -> getDtdAndValidate dtd doc

    getDtdAndValidate :: DocTypeDecl -> Element -> IO [String]
    getDtdAndValidate (DTD name mid _) doc =
        case mid of
          Nothing -> return ["no external id for " ++ name]
          Just id -> fetchExternal id
      where
        fetchExternal :: ExternalID -> IO [String]
        fetchExternal id = do
            dtd <- fetchDtd (systemLiteral id)
            return $ validate dtd doc
        systemLiteral (SYSTEM (SystemLiteral s))   = s
        systemLiteral (PUBLIC _ (SystemLiteral s)) = s

    fetchDtd :: String -> IO DocTypeDecl
    fetchDtd dtd = do
        contents <- getUrl dtd
        case dtdParse dtd contents of
          Nothing   -> error "no dtd"
          Just dtd' -> return dtd'

this would probably work in many cases, but it fails for xhtml, because the xhtml dtd is actually split across multiple files. as a result, when the validator encounters an entity declaration referencing an external file, it fails with:

    *** Exception: xhtml-lat1.ent: openFile: does not exist (No such file or directory)

this appears to be a dead end. short of parsing the dtd myself, finding the external references, fetching them, and doing the substitution, there seems to be no way around this problem.

verdict: the api of haxml looks more or less like you’d expect, although it would be really nice if at least the major data types derived Show, and the addition of named fields would make things a little easier for users. unfortunately, it seems that what i would think is a common use case was not considered, leaving the library of limited utility. i also can’t speak to the quality of the filter portion of the library, since if i couldn’t validate my seed document, there was no point in extracting links from it to crawl further.

to come: adventures with hxt...
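for the record, the substitution workaround i decided not to attempt would look roughly like the sketch below: given a table mapping parameter-entity names to urls (which you’d first have to scrape out of the dtd’s own ENTITY declarations), splice each %name; reference in the dtd text with the fetched content. everything here is invented for illustration; `fetch` is a stand-in for an http get, and none of this is haxml api:

```haskell
-- replace each %name; parameter-entity reference with fetched content
inlineEntities :: [(String, String)]      -- entity name -> url
               -> (String -> IO String)   -- fetch action (stand-in)
               -> String -> IO String
inlineEntities refs fetch = go
  where
    go ('%' : rest) =
        let (name, rest') = break (== ';') rest
        in case (lookup name refs, rest') of
             (Just url, ';' : after) -> do
                 body   <- fetch url
                 after' <- go after
                 return (body ++ after')
             _ -> ('%' :) <$> go rest   -- unknown or unterminated: leave as-is
    go (c : cs) = (c :) <$> go cs
    go []       = return []
```

recursive expansion (entities inside fetched entities) and quoting rules are exactly the sort of detail that makes this unappealing to do by hand.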
permalink · 1 comments

what we believe but cannot prove
2007-09-15 15:39 in /books/completed

i’ve flipped through this book in stores a few times, and a month or two ago i finally picked up a copy. it’s a series of 1-3 page essays by various scientists, writers, and philosophers answering the question, “what do you believe is true even though you can’t prove it?”. overall, i think the book is worthwhile, although it seems you could read most or all of the essays online at the world question center (for some reason, in a completely different order from the book, though). occasionally it gets a bit repetitive, as the editor has chosen to group similar answers together, and occasionally an answer seems overly specific to the author’s personal research or just seems to miss the point of the exercise entirely. for example, one has to wonder whether freeman dyson falls into this category or was just feeling ornery when he wrote his disappointingly uninteresting response. personally, the most interesting essay for me came from susan blackmore, perhaps because i might have been inclined to give the opposite answer. that is, although i can’t prove it, and all my understanding of science argues to the contrary, i believe in free will. however, after reading her essay, i had to contemplate whether i really believe in free will, or merely want to believe in it. at some point, i’ll have to think and write more on this subject. i’ll probably hang onto this book for a while so i can re-read parts of it. although you can read it fairly quickly, you could also spend years dissecting each of the essays in turn. for that, i think this is a book worth having around.

permalink · 1 comments

the subtle knife
2007-09-15 15:39 in /books/completed

this book is volume 2 of the “his dark materials” trilogy, which began with “the golden compass”. while i read it pretty voraciously, i didn’t find it as engaging as the first book.
it suffers from the typical shortcoming of the middle book of a trilogy: most of what happens is setup for the climax to come in book 3, and while there’s the beginning of an explanation of some of the mysteries of the first book, it doesn’t achieve the same sense of wonderment.

permalink · 0 comments

the golden compass
2007-09-15 15:39 in /books/completed

this book has been highly recommended by a number of friends, and i finally got a chance to pick it up and start reading. i definitely recommend it myself as a fun, engaging piece of fantasy. although intended as children’s literature, it has enough substance to satisfy an adult. and although you’ll probably be drawn to read it quickly, it’s worth paying attention as you go, as there’s a fair bit early on which is referenced again later in the book.

permalink · 1 comments

first day of kindergarten
2007-09-11 09:40 in /life

in yet another indication of how slack i’ve been about blogging, yesterday was the first day of kindergarten. there’s an 8-month-long saga leading up to this, which i’ve been meaning to write about but failing to. and there’s a decent chance it’s not over yet. but for now i’m not going to subject you to that story. as for yesterday, the little one seems to have had no problems, other than being disappointed at how little time they get outside compared to what she’s used to from preschool. (this is made worse because she’s currently in half-day, which means she’s only there for one of the three daily recess periods.) after the starting bell, they had coffee and pastries set up for the kindergarten parents, so i stuck around for 15 minutes or so to make sure she settled in okay. when i poked my head in before leaving, everything seemed good. i looked in just as they were doing the pledge of allegiance, which actually caught me a bit off-guard. not that i should have been surprised; we did the same every day i went to public school. but, nonetheless, it bugged me a bit.
i’m not one of those “america-hating liberals”. i get up and sing the national anthem before sporting events. but, let’s face it, there’s something a little creepily nationalistic about the pledge of allegiance.

permalink · 0 comments