Some cleanup and pythonic the code


Alberto Paro

May 2, 2011, 3:45:24 AM
to databrewery
Hi,

I've started doing some cleanup and error checking, and making the code
more Pythonic.

The branch is at https://github.com/aparo/brewery


My goals, in order, are:

- clean up the source (unused imports, unused variables) (1)
- fix errors (there are some errors in the code, such as calls to
missing or unimported functions) (1)
- improve the use of Python data structures to speed up the code (1)
- add an ElasticSearch mode via pyes for storing/processing data
- improve the code with design patterns for managing big datasets
- use systems such as eventlet/greenlet to improve parallelism


(1) These changes can be made to the current source code without
modifying its behaviour.

Ciao,
Alberto

Stefan Urbanek

Jun 22, 2011, 8:58:25 PM
to databrewery


On May 2, 9:45 am, Alberto Paro <alberto.p...@gmail.com> wrote:
> Hi,
>
> I've started doing some cleanup and error checking, and making the
> code more Pythonic.
>
> The branch is at https://github.com/aparo/brewery
>

Thanks, I've merged most of your changes. However, I have not included
the forced package requirements:

+install_requires = ['pymongo', "SQLAlchemy", "gdata", "xlrd", "PyYAML"]

The main reason is that the mentioned packages are not really necessary
to use Brewery; they are needed only if you need a specific data
source/target/feature. I should list them in the README, though.

If a package is missing, exceptions like these are raised:

Exception: Optional package 'gdata' is not installed. Please install
the package to be able to use: Google data (spreadsheet) source/target

Exception: Optional package 'sqlalchemy' is not installed. Please
install the package from http://www.sqlalchemy.org/ to be able to use:
SQL streams. Recommended version is > 0.7

This is done by handling the import exception like this:

try:
    import sqlalchemy
except ImportError:
    from brewery.utils import MissingPackage
    sqlalchemy = MissingPackage("sqlalchemy", "SQL streams",
                                "http://www.sqlalchemy.org/",
                                comment="Recommended version is > 0.7")
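For reference, the behaviour described above can be implemented as a placeholder object that raises a helpful exception on first attribute access. This is a hypothetical sketch; the actual implementation in brewery.utils may differ:

```python
class MissingPackage(object):
    """Stand-in for an optional package that is not installed.

    Raises a descriptive exception the first time any attribute
    of the 'module' is actually used."""

    def __init__(self, package, feature=None, source_url=None, comment=None):
        self.package = package
        self.feature = feature
        self.source_url = source_url
        self.comment = comment

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. on
        # first real use of the missing package.
        msg = "Optional package '%s' is not installed." % self.package
        if self.source_url:
            msg += " Please install the package from %s" % self.source_url
        else:
            msg += " Please install the package"
        msg += " to be able to use: %s" % self.feature
        if self.comment:
            msg += ". %s" % self.comment
        raise Exception(msg)


# The import site stays untouched until the package is actually used:
sqlalchemy = MissingPackage("sqlalchemy", "SQL streams",
                            "http://www.sqlalchemy.org/",
                            comment="Recommended version is > 0.7")
```

The nice property of this pattern is that merely importing a stream module succeeds; the error surfaces only when someone tries to use the feature that needs the missing package.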

Maybe some of the packages will become required in the future,
depending on how the framework is used. However, I do not want to
include unnecessary dependencies. If someone wants to use it only with
Mongo or only with Google spreadsheets, they should not have to install
SQLAlchemy.

> My scope are in order:
>
> - clean up the source (unused imports, unused variables) (1)
> - fix errors (there are some errors in the code, such as calls to
> missing or unimported functions) (1)

That is possible; they are mostly remnants of refactoring done without
proper test cases.

> - improve the use of Python data structures to speed up the code (1)

Optimisation will definitely be needed. The current state of the code
is "it just works"; it is more of a working prototype.

> - add an ElasticSearch mode via pyes for storing/processing data

Can you give us some examples?

> - improve the code with design patterns for managing big datasets

Handling bigger datasets would be nice to have; the current
implementation has no special treatment for big data. From my point of
view, this is not a high priority at the moment, but it depends on how
the framework ends up being used...

> - use systems such as eventlet/greenlet to improve parallelism

As I am quite new to Python (6 months), I had never heard of these; I
just looked them up, and they look useful. Anyway, there are many
options for implementing and improving parallelism. I also considered
using Stackless Python. I decided on standard Python threading because
I didn't want to introduce unnecessary dependencies and wanted to
provide a sensible default. I know that the current implementation of
threads in v2.7 has serious issues because of the GIL (threaded code is
slower than single-threaded, and multiple threads on a single core are
faster than on multiple cores), but it is less of an issue in v3.2 and
a non-issue on other Python implementations. There is one rule I want
to follow with any potential threading/parallelisation module: it
should not make writing a node's 'run' method complicated.
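To illustrate the rule above with standard-library threading: the node and queue names here are invented for the example, not Brewery's actual API, but they show how a node's run() method can stay a plain loop regardless of which concurrency backend drives it.

```python
import threading
import queue

SENTINEL = object()  # end-of-stream marker passed down the pipe

class UppercaseNode:
    """Hypothetical stream node: run() is just a simple loop over
    input records, with no threading logic inside it."""

    def __init__(self, inputs, outputs):
        self.inputs = inputs
        self.outputs = outputs

    def run(self):
        while True:
            record = self.inputs.get()
            if record is SENTINEL:
                self.outputs.put(SENTINEL)  # propagate end-of-stream
                return
            self.outputs.put(record.upper())

def run_network(records):
    """Drive the node in a worker thread; the backend could later be
    swapped for eventlet/greenlet without touching run() itself."""
    inq, outq = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=UppercaseNode(inq, outq).run)
    worker.start()
    for record in records:
        inq.put(record)
    inq.put(SENTINEL)
    worker.join()
    results = []
    while True:
        item = outq.get()
        if item is SENTINEL:
            return results
        results.append(item)
```

Because all coordination lives in the driver, swapping threading for a greenlet-based scheduler would only change run_network, not the node implementations.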

Btw. to increase efficiency, I would also prefer a solution that
optimises the network before running it: for example, finding groups of
nodes that can be replaced by a single node of a different type, or
nodes that can be removed from the network completely.
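A toy version of such a pre-run optimisation pass might look like the following. The (kind, function) tuple representation of a network is invented for this sketch; Brewery's real network structure is different.

```python
def optimise(network):
    """Toy optimisation pass: collapse consecutive 'map' nodes into a
    single 'map' node that applies the composed function.

    'network' is a list of (kind, fn) tuples, a stand-in for a real
    node graph."""
    optimised = []
    for kind, fn in network:
        if optimised and kind == "map" and optimised[-1][0] == "map":
            # Merge with the previous map node: apply its function
            # first, then this one, all in a single node.
            prev_fn = optimised[-1][1]
            optimised[-1] = ("map", lambda r, f=fn, g=prev_fn: f(g(r)))
        else:
            optimised.append((kind, fn))
    return optimised

# Two adjacent transformations collapse into one node before the run:
network = [("map", str.strip), ("map", str.upper), ("csv_target", None)]
pipeline = optimise(network)
```

A real pass would work on the node graph rather than a list, but the principle is the same: fewer nodes means fewer inter-node queues and thread hand-offs at runtime.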

Regards,

Stefan