Overview
The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.
Spidering Hacks takes you to the next level in Internet data retrieval, beyond search engines, by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented: you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:
- Aggregate and associate data from disparate locations, then store and manipulate the data as you like
- Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
- Integrate third-party data into your own applications or web sites
- Make your own site easier to scrape and more usable to others
- Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Synopsis
With this crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when one has gone too far: what's acceptable and unacceptable), readers learn how to collect media files and data from databases; how to interpret and understand the data and repurpose it for use in other applications; and even build authorized interfaces to integrate the data into their own content.
Editorials
From Barnes & Noble
The Barnes & Noble Review
The Web has more extraordinarily useful content than anyone can conceivably get their arms around. But you can organize and use a whole lot more of it than you’re using now. The secret is in writing two forms of programs: spiders and scrapers.
Both types of programs fetch goodies from the Internet. Kinda like your dog bringing in the morning paper, except you can choose exactly what you want fetched. Spiders typically follow links to grab entire pages, files, or even sites. Scrapers generally grab specific pieces of information from within individual pages or files. Often, they’re used together: you send a spider to find the right pages, then send a scraper to pull the right excerpts.
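As a rough illustration of that division of labor, here is a minimal Python sketch (the book itself works in Perl with LWP; Python's standard library is swapped in here). Everything in it is hypothetical: the `PAGES` dictionary stands in for a real site so the example runs without network access, and `spider` and `scrape_title` are illustrative names, not anything taken from the book.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (an assumption, so no network is needed).
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<title>Page A</title><a href="/b.html">B</a>',
    "http://example.test/b.html": '<title>Page B</title><a href="/">Home</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Collects every <a href> value and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Stand-in for an HTTP GET; returns an empty page for unknown URLs."""
    return PAGES.get(url, "")


def spider(start_url):
    """Spider: follow links breadth-first and return the pages found, in visit order."""
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkAndTitleParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the current page
            if absolute in PAGES and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


def scrape_title(url):
    """Scraper: pull one specific piece of information (the title) out of a page."""
    parser = LinkAndTitleParser()
    parser.feed(fetch(url))
    return parser.title.strip()


# Used together: the spider finds the pages, the scraper pulls the excerpt from each.
titles = {url: scrape_title(url) for url in spider("http://example.test/")}
```

A real spider would replace `fetch` with an actual HTTP request and would need error handling and politeness controls, which this sketch leaves out.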
Spiders and scrapers let you automatically pull together data from dozens of sites and present it any way you like. You could automatically keep up with anything that’s regularly posted on the Web (your favorite comic, your competitor’s new product introductions, the latest news from Iraq, new postings at your favorite blog).
There are millions of people who could benefit from writing these programs, but few of them know how. This stuff’s eminently learnable -- especially if you can already find your way around Perl. That’s where Spidering Hacks comes in.
This is the latest in O’Reilly’s Hacks series -- intended, in their words, to “reclaim the term ‘hacking’ for the good guys: innovators who explore and experiment, unearth shortcuts, create useful tools, and come up with fun things to try on your own.” We’ve raved about Google Hacks. Spidering Hacks is just as cool.
Kevin Hemenway and Tara Calishain first outline the basic concepts, techniques, and tools -- especially the Perl LWP modules for accessing web data, and automated tools like WWW::Mechanize. (They also cover the etiquette of spidering: how to “walk softly” on the sites you’re spidering, and respect requests not to spider.)
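One common way sites express a request not to be spidered is a robots.txt file; whether the book covers it in these terms is an assumption here. The sketch below uses Python's standard `urllib.robotparser` rather than the Perl modules the authors use, and the rules, user-agent name, and URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-spider"  # illustrative user-agent name

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if the site's rules permit this agent to fetch the URL."""
    return rules.can_fetch(AGENT, url)


ok = allowed("http://example.test/comics/today.html")
blocked = allowed("http://example.test/private/notes.html")
delay = rules.crawl_delay(AGENT)  # seconds the site asks agents to wait between requests
```

Checking `allowed` before each fetch, and sleeping for `delay` seconds between requests, is one simple way to "walk softly."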
You’ll learn how to deal with secured site access and redesigns and improve your programs with feedback and progress bars. Next, the authors show how to retrieve media files (for example, downloading movies from the Library of Congress, retrieving daily comics, and finding MP3 files associated with an M3U playlist).
There’s extensive coverage of spidering database applications: for example, aggregating multiple search engine results, archiving Yahoo! Groups messages, scraping e-commerce site product reviews, collecting specific TV listings, automatically finding blogs you’re interested in, tracking overnight express packages, and bargain hunting by automating the price comparison process.
As a bonus, the authors take you into some immensely useful hidden corners of the Web. For example, they show how to spider Lexical Freenet, which displays word relationships like puns, rhymes, concepts, relevant people, antonyms, and so forth. Add a simple spider, and you’ve built a truly amazing tool for any writer, librarian, or researcher.
There’s also more than a little fun here: for example, a spider that captures song lyrics and forwards them to a text-to-speech site that creates a .WAV file. Think of it as Robot Karaoke. So, use your imagination. What do you want to spider today?
Bill Camarda
Bill Camarda is a consultant, writer, and web/multimedia content developer. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks for Dummies, Second Edition.