web crawler - help choosing language

 

web crawler - help choosing language, language needs good unicode support

chu
post 21 Sep, 2008 - 11:35 AM
Post #1




Hi
I'm trying to choose a language to program a focused web crawler in. The purpose of this project is, more than anything, to serve as a learning experience, so things like memory usage and speed are not priorities for this crawler. I also realize that there are some open source crawlers out there, but again, I'm doing this for the learning experience. I'd like help choosing a language based on the following criteria:

- Good built-in functions or libraries for parsing HTML and XML. I'm still a relatively novice programmer, so anything that saves me time would be helpful.
- Good support for the following character encodings: UTF-8, Shift-JIS/x-sjis, EUC-JP
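
As a rough illustration of the encoding criterion, here is a minimal Java sketch that round-trips a short Japanese string through the three encodings listed above. The class name `EncodingDemo` and the helper `decode` are made up for this example. It assumes a full JDK, where the `Shift_JIS` and `EUC-JP` charsets are available; a stripped-down runtime may not include them.

```java
import java.nio.charset.Charset;

class EncodingDemo {
    // Hypothetical helper: decode raw bytes using the charset with the given name.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Japanese language" in kanji, written with Unicode escapes so the
        // source file's own encoding does not matter.
        String text = "\u65E5\u672C\u8A9E";
        for (String name : new String[] {"UTF-8", "Shift_JIS", "EUC-JP"}) {
            byte[] raw = text.getBytes(Charset.forName(name));
            String back = decode(raw, name);
            System.out.println(name + ": " + raw.length
                    + " bytes, round trip ok = " + text.equals(back));
        }
    }
}
```

In a crawler, the charset name would normally come from the HTTP `Content-Type` header or a `meta` tag in the page rather than being hard-coded.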

Also, I already have some experience using Java, Python, and PHP, so if any of these languages fits into the criteria mentioned above, that language would be preferred.

From experience, PHP doesn't have very good Unicode support (yet), and it doesn't seem well suited to writing a web crawler either.

While I like Python a lot, and I've heard good things about lxml and BeautifulSoup, I'm not too sure about its Unicode support after reading some of the comments here:
http://lowkster.blogspot.com/2008/06/pytho...code-sucks.html

From what I can tell, a lot of web crawlers are written in Java or C/C++. Any recommendations?

This post has been edited by chu: 21 Sep, 2008 - 11:37 AM

abgorn
post 21 Sep, 2008 - 11:38 AM
Post #2




I think Java would fit well. It seems to meet your criteria, and Java's a fairly simple and straightforward language.

Does anyone else think another language would be a better choice?

xCraftyx
post 22 Sep, 2008 - 03:59 PM
Post #3




Here's an article about writing a web crawler in Java if you'd like to try it out: http://java.sun.com/developer/technicalArt...rty/WebCrawler/

chu
post 23 Sep, 2008 - 09:26 AM
Post #4




Hey, thanks for the replies. I guess I'll try programming the crawler in Java.

abgorn
post 27 Sep, 2008 - 01:55 AM
Post #5




If you do write it in Java, you could do it like this:
http://www.java-tips.org/java-se-tips/java...-in-java-2.html
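
Independently of the linked page, the core loop of a small focused crawler can be sketched roughly as below. This is only an illustration under stated assumptions: `CrawlSketch`, `extractLinks`, and `crawl` are invented names, the "web" is an in-memory map from URL to HTML so that no network access is needed, "focused" is reduced to a plain keyword check, and links are pulled out with a simple regular expression that only handles double-quoted `href` attributes. A real crawler would fetch pages over HTTP and use a proper HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlSketch {
    // Simplified pattern: matches href="..." with double quotes only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect every href value found in the HTML text, in order of appearance.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl over the in-memory pages, returning the URLs
    // whose HTML contains the keyword.
    static List<String> crawl(String start, Map<String, String> pages, String keyword) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued
        List<String> matches = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = pages.get(url);            // stand-in for an HTTP fetch
            if (html == null) {
                continue;                            // unknown or unreachable page
            }
            if (html.contains(keyword)) {
                matches.add(url);
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {                // add() is false if already seen
                    frontier.add(link);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("a", "<a href=\"b\">x</a> <a href=\"c\">y</a> java");
        pages.put("b", "<a href=\"a\">back</a> python");
        pages.put("c", "java <a href=\"d\">missing</a>");
        System.out.println(crawl("a", pages, "java"));
    }
}
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, and swapping the queue for a priority queue ordered by relevance is one common way to make the crawl more focused.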