#########################################################################
############## Computational Linguistics Toolset v1.1.2 #################
#########################################################################
######## Copyright (C) 2005 Wybo Wiersma <s.wiersma01@chello.nl> ########
#########################################################################
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#########################################################################
# It is kindly requested that you acknowledge the use of these tools in
# the publications reporting results produced with the help of them.

### What it is ###

The Computational Linguistics Toolset is a collection of Perl tools for corpus-based computational linguistics. It contains re-usable code for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, and for doing permutation statistics on N-grams (useful for finding statistically significant syntactic differences between any two sets of tagged texts). It also includes various examination-tools.

The individual tools are documented and versioned separately. Each time I make a significant new release of the entire package (this includes bugfixes), I will increase the version-number above.

### What is required ###

Perl (tested here on 5.8.4)

The following modules are needed but come standard with Perl:

FileHandle
FindBin
List::Util
File::Basename
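
A quick way to confirm that the local Perl can load these modules is sketched below. The function name check_clt_modules is hypothetical and not part of the toolset; it relies only on perl's standard -M switch.

```shell
#!/bin/sh
# Hypothetical helper (not part of the toolset): report whether each
# module listed above can be loaded by the perl found on PATH.
check_clt_modules() {
    for module in FileHandle FindBin List::Util File::Basename; do
        # "perl -MName -e 1" loads the module and runs an empty program;
        # a non-zero exit status means the module could not be loaded.
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
        fi
    done
}

check_clt_modules
```

Each module gets one line of output, so a missing module (or a missing perl) shows up as MISSING rather than as an error halfway through a run.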

At a minimum, you need to set the $configbasedir variable in the central config file (named config).

### What to find where ###

The dir-structure of the package is as follows:

tools/corpus/    the tools for preparing corpora
tools/examine/   tools for examining
tools/sensing/   tools for doing WordNet-related research
tools/tagging/   tools for tagging
tools/permstat/  tools for doing permutation-statistics

Two special dirs are:

tools/mess/      quick-hack scripts that have little general usage
tools/export/    tools for exporting (tarring & publishing these tools)

This structure is not guaranteed to remain the same forever...

### How to use it (in the easiest way) ###

To get more info on what a script does, run it with the -? option.
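
Because ? is a filename wildcard in most shells, it is safest to quote the option when asking a script for its help text. The wrapper below is a minimal sketch; the name clt_help is hypothetical, and examine/rowstatter.pl is used only as an example of a script from this package.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the toolset): run a script with the
# -? option, quoted so the shell passes it on literally instead of
# trying to expand it as a filename pattern.
clt_help() {
    perl "$1" '-?'
}

# Example call, run from the base-dir that contains tools/:
#   clt_help tools/examine/rowstatter.pl
```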

The *goall scripts are used to do runs in which the tools are chained together, for example to add corpora or to do many tasks in sequence.

To use the tools in sequence without changing anything in the configuration-files, follow these instructions:

1 The tools-dir should be unpacked inside another dir, for example: research/

2 The raw corpus should be stored in the following dir within this base-dir: (research/)corpusData/<corpusname>/raw

3 Some tools (like corpuslexiconreducer.pl) need lists of some sort. Those should be stored in (research/)taskData/lists

4 Other dirs like corpusData, and dirs within corpusData/<corpusname> (for example the cleaned/ dir) are created automatically by the tools when needed.
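
The layout described in the steps above can be sketched as follows. The function name setup_clt_layout, the base-dir name research, and the corpus name mycorpus are placeholders chosen for this example. Only the raw/ and lists/ dirs are created here, since the tools create the remaining dirs when needed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the toolset): prepare the dirs that
# have to exist before the tools are run in sequence.
setup_clt_layout() {
    base=$1      # base-dir into which the tools-dir is unpacked
    corpus=$2    # corpus name, used as corpusData/<corpusname>

    mkdir -p "$base"                          # step 1: unpack tools/ in here
    mkdir -p "$base/corpusData/$corpus/raw"   # step 2: the raw corpus goes here
    mkdir -p "$base/taskData/lists"           # step 3: lists needed by some tools
    # step 4: cleaned/ and similar dirs are left to the tools themselves
}

setup_clt_layout research mycorpus
```

After this, unpack the package so that research/tools/ exists, and copy the raw corpus files into research/corpusData/mycorpus/raw.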

You can always modify the *goall scripts to suit the needs of your particular research. However, things might change between versions of this tool-package, so have a look at the changelog before overwriting your current install.

Better still, drop me a note if you are using this toolset, so I can keep an eye on possible update-problems (although of course I cannot accept any formal liability)...

### Changelog ###

1.1.1 -> 1.1.2

Added PermStatResultSelector as a proper tool

Added multiple ipnorm normalization rounds for extra precision

Fixed a few minor bugs

1.1.0 -> 1.1.1

Fixed and updated (sentence-length counting):

examine/rowstatter.pl

Also added some library functions.

1.0.5 -> 1.1.0

Added the following tools:

corpus/corpus2tagrow.pl
corpus/corpusrewritetagrow.pl
sensing/sensinggoall.pl
sensing/sentencesenser.pl
sensing/semanticgravitor.pl

Updated

sensing/wordcombinationfinder.pl
- bug fixed that caused some word-combinations not to be found
- changed the default window-size to 5
sensing/listsenser.pl
- implemented the option for using an existing database
- changed the database-format to cdb

