By default Python uses a dict to store an object's instance attributes. This is really helpful as it allows setting arbitrary new attributes at runtime. A hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. If it smells like a Python dict, feels like a dict, and looks like one, well, it must be a dict.
Using __slots__: but what if we want to restrict the attributes of a class? For example, only allow the name and age attributes to be added to Student instances. To achieve this, Python lets you define a special __slots__ variable when defining a class, which restricts the attributes that instances of that class can have: class Student(object).
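The restriction described above can be sketched as follows. This is a minimal illustration, not the original article's code: the constructor and the sample values are assumptions made for the example.

```python
class Student:
    # Only these two instance attributes are allowed.
    __slots__ = ("name", "age")

    def __init__(self, name, age):
        self.name = name
        self.age = age


s = Student("Ada", 20)      # sample values, chosen for illustration
s.age = 21                  # fine: 'age' is listed in __slots__

slot_error = None
try:
    s.score = 99            # not listed, so the assignment is rejected
except AttributeError as exc:
    slot_error = exc

print(type(slot_error).__name__)
print(hasattr(s, "__dict__"))   # instances of this class carry no per-instance dict
```

Because the class defines `__slots__` and inherits only from `object`, its instances have no `__dict__` at all, which is where the memory saving comes from.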
| PEP: | 412 |
|---|---|
| Title: | Key-Sharing Dictionary |
| Author: | Mark Shannon <mark at hotpy.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 08-Feb-2012 |
| Python-Version: | 3.3 or 3.4 |
| Post-History: | 08-Feb-2012 |
The CPython interpreter simulates namespace semantics for locals using boxed pre-allocated slots. (Globals are stored in a dict.) Since memory addresses cannot be literally empty, the slots have to use a special sentinel value that represents "this name is unbound", equivalent to the key being missing in a dict (possibly a nil pointer?). Experienced Python users have probably read the very compelling article "Saving 9 GB of RAM with Python's __slots__", whose author used __slots__ to bring memory usage down from 25.5 GB to 16.2 GB. At the time, that meant cutting memory use by more than 30% with a very simple change.
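The "unbound" state of a local slot is visible from Python code as `UnboundLocalError`. The sketch below shows two ways to reach it; the function names are hypothetical and exist only for this illustration.

```python
def read_before_bind():
    # 'value' is a local name because it is assigned later in this function,
    # so reading it now finds a slot that has not been filled yet.
    error = None
    try:
        print(value)
    except UnboundLocalError as exc:
        error = exc
    value = 1
    return error


def read_after_del():
    value = 1
    del value               # the slot goes back to the "unbound" state
    try:
        return value
    except UnboundLocalError as exc:
        return exc


print(type(read_before_bind()).__name__)
print(type(read_after_del()).__name__)
```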
This PEP proposes a change in the implementation of the builtin dictionary type dict. The new implementation allows dictionaries which are used as attribute dictionaries (the __dict__ attribute of an object) to share keys with other attribute dictionaries of instances of the same class.
The current dictionary implementation uses more memory than is necessary when used as a container for object attributes as the keys are replicated for each instance rather than being shared across many instances of the same class. Despite this, the current dictionary implementation is finely tuned and performs very well as a general-purpose mapping object.
By separating the keys (and hashes) from the values it is possible to share the keys between multiple dictionaries and improve memory use. By ensuring that keys are separated from the values only when beneficial, it is possible to retain the high-performance of the current dictionary implementation when used as a general-purpose mapping object.
The new dictionary behaves in the same way as the old implementation. It fully conforms to the Python API, the C API and the ABI.
Reduction in memory use is directly related to the number of dictionaries with shared keys in existence at any time. These dictionaries are typically half the size of the current dictionary implementation.
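One rough way to look at the size difference on your own interpreter is to compare an instance's attribute dictionary with an explicitly built dict holding the same items. This is only a sketch: the `Point` class is a made-up example, `sys.getsizeof` reports a shallow size, and the numbers depend on the Python version and platform, so no particular output is claimed here.

```python
import sys


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
attribute_dict = p.__dict__           # created for the instance by the interpreter
plain_dict = {"x": 1, "y": 2}         # same contents, built explicitly

attr_size = sys.getsizeof(attribute_dict)
plain_size = sys.getsizeof(plain_dict)
print("attribute dict:", attr_size, "bytes")
print("plain dict:    ", plain_size, "bytes")
```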
Benchmarking shows that memory use is reduced by 10% to 20% for object-oriented programs with no significant change in memory use for other programs.
The performance of the new implementation is dominated by memory locality effects. When keys are not shared (for example in module dictionaries and dictionaries explicitly created by dict() or {}) then performance is unchanged (within a percent or two) from the current implementation.
For the shared keys case, the new implementation tends to separate keys from values, but reduces total memory usage. This will improve performance in many cases as the effects of reduced memory usage outweigh the loss of locality, but some programs may show a small slowdown.
Benchmarking shows no significant change of speed for most benchmarks. Object-oriented benchmarks show small speed ups when they create large numbers of objects of the same class (the gcbench benchmark shows a 10% speed up; this is likely to be an upper limit).
Both the old and new dictionaries consist of a fixed-sized dict struct and a re-sizeable table. In the new dictionary the table can be further split into a keys table and values array. The keys table holds the keys and hashes and (for non-split tables) the values as well. It differs only from the original implementation in that it contains a number of fields that were previously in the dict struct. If a table is split the values in the keys table are ignored, instead the values are held in a separate array.
When dictionaries are created to fill the __dict__ slot of an object, they are created in split form. The keys table is cached in the type, potentially allowing all attribute dictionaries of instances of one class to share keys. In the event of the keys of these dictionaries starting to diverge, individual dictionaries will lazily convert to the combined-table form. This ensures good memory use in the common case, and correctness in all cases.
When resizing a split dictionary it is converted to a combined table. If resizing is as a result of storing an instance attribute, and there is only one instance of a class, then the dictionary will be re-split immediately. Since most OO code will set attributes in the __init__ method, all attributes will be set before a second instance is created and no more resizing will be necessary as all further instance dictionaries will have the correct size. For more complex use patterns, it is impossible to know what is the best approach, so the implementation allows extra insertions up to the point of a resize when it reverts to the combined table (non-shared keys).
A deletion from a split dictionary does not change the keys table, it simply removes the value from the values array.
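The split-table idea can be modelled in a few lines of pure Python. This is a toy model for intuition only, not CPython's C implementation: the names `SharedKeys`, `SplitDict` and the `may_extend_shared` flag are all hypothetical, hashing and resizing are left out, and the flag simply stands in for the real rules about when a new key may join the shared table.

```python
_MISSING = object()                    # sentinel: "no value stored in this slot"


class SharedKeys:
    """Keys table cached on the 'type': maps each key to a position."""

    def __init__(self):
        self.index = {}                # key -> position in every values array

    def position(self, key, create=False):
        if key not in self.index and create:
            self.index[key] = len(self.index)
        return self.index.get(key)


class SplitDict:
    """Toy attribute dictionary: shared keys, private values array."""

    def __init__(self, shared):
        self.shared = shared
        self.values = []
        self.combined = None           # becomes a plain dict after diverging

    def set(self, key, value, may_extend_shared=True):
        if self.combined is not None:
            self.combined[key] = value
            return
        pos = self.shared.position(key, create=may_extend_shared)
        if pos is None:                # key set diverges: convert lazily
            self.combined = dict(self.items())
            self.combined[key] = value
            return
        while len(self.values) <= pos:
            self.values.append(_MISSING)
        self.values[pos] = value

    def delete(self, key):
        if self.combined is not None:
            del self.combined[key]
            return
        pos = self.shared.position(key)
        if pos is None or pos >= len(self.values) or self.values[pos] is _MISSING:
            raise KeyError(key)
        self.values[pos] = _MISSING    # the keys table is left untouched

    def items(self):
        if self.combined is not None:
            return list(self.combined.items())
        return [(k, self.values[p]) for k, p in self.shared.index.items()
                if p < len(self.values) and self.values[p] is not _MISSING]


shared = SharedKeys()
a, b = SplitDict(shared), SplitDict(shared)
a.set("x", 1)
a.set("y", 2)
b.set("x", 10)
b.set("y", 20)
a.delete("y")                               # only a's values array changes
b.set("z", 30, may_extend_shared=False)     # b diverges and becomes combined
print(a.items(), shared.index, b.combined)
```

In this model `a` keeps sharing keys after the deletion, while `b` pays for its extra key by carrying its own combined table.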
Explicit dictionaries (dict() or {}), module dictionaries and most other dictionaries are created as combined-table dictionaries. A combined-table dictionary never becomes a split-table dictionary. Combined tables are laid out in much the same way as the tables in the old dictionary, resulting in very similar performance.
The new dictionary implementation is available at [1].
Pros: significant memory savings for object-oriented applications, and a small improvement to speed for programs which create lots of similar objects.
Cons: (1) Change to data structures: third-party modules which meddle with the internals of the dictionary implementation will break.
(2) Changes to repr() output and iteration order: for most cases, this will be unchanged. However, for some split-table dictionaries the iteration order will change.
Neither of these cons should be a problem. Modules which meddle with the internals of the dictionary implementation are already broken and should be fixed to use the API. The iteration order of dictionaries was never defined and has always been arbitrary; it is different for Jython and PyPy.
An alternative implementation for split tables, which could save even more memory, is to store an index in the value field of the keys table (instead of ignoring the value field). This index would explicitly state where in the value array to look. The value array would then only require 1 field for each usable slot in the key table, rather than each slot in the key table.
This 'indexed' version would reduce the size of value array by about one third. The keys table would need an extra 'values_size' field, increasing the size of combined dicts by one word. The extra indirection adds more complexity to the code, potentially reducing performance a little.
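The "about one third" figure can be checked with a little arithmetic. The sketch below assumes that two thirds of the slots in a keys table are usable before a resize; that ratio is an assumption made for this illustration, and the helper name and the table size of 64 are likewise hypothetical.

```python
def values_array_sizes(table_slots, usable_num=2, usable_den=3):
    """Return (plain, indexed) values-array lengths for one keys table."""
    plain = table_slots                                # one field per slot
    indexed = table_slots * usable_num // usable_den   # one field per usable slot
    return plain, indexed


plain, indexed = values_array_sizes(64)
saving = 1 - indexed / plain
print(plain, indexed, round(saving, 3))
```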
The 'indexed' version will not be included in this implementation, but should be considered deferred rather than rejected, pending further experimentation.
[1] Reference Implementation: https://bitbucket.org/markshannon/cpython_new_dict
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/master/pep-0412.txt
Python's efficient key/value hash table structure is called a 'dict'. The contents of a dict can be written as a series of key:value pairs within braces { }, e.g. dict = {key1:value1, key2:value2, ... }. The 'empty dict' is just an empty pair of curly braces {}.
Looking up or setting a value in a dict uses square brackets, e.g. dict['foo'] looks up the value under the key 'foo'. Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Looking up a value which is not in the dict throws a KeyError -- use 'in' to check if the key is in the dict, or use dict.get(key) which returns the value or None if the key is not present (or get(key, not-found) allows you to specify what value to return in the not-found case).
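The lookups described above look roughly like this. The keys and values are sample data, and the variable is named `d` rather than `dict` to avoid shadowing the built-in type.

```python
d = {}
d["a"] = "alpha"
d["g"] = "gamma"
d["o"] = "omega"

print(d["a"])                     # plain lookup by key
print("a" in d)                   # membership test, no exception
print(d.get("z"))                 # missing key: get() returns None
print(d.get("z", "not-found"))    # missing key with an explicit fallback

missing = None
try:
    d["z"]                        # missing key with [] raises KeyError
except KeyError as exc:
    missing = exc
```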
A for loop on a dictionary iterates over its keys by default. The keys will appear in an arbitrary order. The methods dict.keys() and dict.values() return lists of the keys or values explicitly. There's also an items() which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted() function.
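A short sketch of iterating over a dict and sorting its contents, using sample data. It assumes Python 3, where keys(), values() and items() return view objects rather than lists; sorted() accepts either.

```python
d = {"o": "omega", "a": "alpha", "g": "gamma"}

# Looping over the dict itself yields its keys.
for key in sorted(d):
    print(key, d[key])

keys_in_order = sorted(d.keys())
values_in_order = sorted(d.values())
pairs = sorted(d.items())         # (key, value) tuples, ordered by key
print(pairs)
```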
Python 2 has 'iter' variants of these methods called iterkeys(), itervalues() and iteritems() which avoid the cost of constructing the whole list -- a performance win if the data is huge. However, I generally prefer the plain keys() and values() methods with their sensible names. In Python 3 the need for the iter variants has gone away: keys(), values() and items() return lightweight views rather than lists.
Strategy note: from a performance point of view, the dictionary is one of your greatest tools, and you should use it where you can as an easy way to organize data. For example, you might read a log file where each line begins with an IP address, and store the data into a dict using the IP address as the key, and the list of lines where it appears as the value. Once you've read in the whole file, you can look up any IP address and instantly see its list of lines. The dictionary takes in scattered data and makes it into something coherent.
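The log-file idea can be sketched as below. The log lines are invented sample data held in a list rather than read from a real file, and `collections.defaultdict` is one convenient way to build the lists; a plain dict with an `in` check would work as well.

```python
from collections import defaultdict

# Made-up log lines: each begins with an IP address.
log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 POST /login",
]

lines_by_ip = defaultdict(list)
for line in log_lines:
    ip = line.split()[0]              # first whitespace-separated field
    lines_by_ip[ip].append(line)

print(lines_by_ip["10.0.0.1"])        # every line seen for that address
```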
The % operator works conveniently to substitute values from a dict into a string by name:
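A minimal example of substitution by name, with sample keys and values chosen for illustration:

```python
h = {"word": "garfield", "count": 42}

# %(name)d and %(name)s pull the values out of the dict by key.
s = "I want %(count)d copies of %(word)s" % h
print(s)    # I want 42 copies of garfield
```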
The 'del' operator does deletions. In the simplest case, it can remove the definition of a variable, as if that variable had not been defined. Del can also be used on list elements or slices to delete that part of the list and to delete entries from a dictionary.
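The three uses of del mentioned above, sketched with sample data:

```python
var = 6
del var                      # 'var' is no longer defined

var_is_gone = False
try:
    var
except NameError:
    var_is_gone = True

lst = ["a", "b", "c", "d"]
del lst[0]                   # delete the first element -> ['b', 'c', 'd']
del lst[-2:]                 # delete the last two elements -> ['b']

d = {"a": 1, "b": 2, "c": 3}
del d["b"]                   # delete the entry stored under 'b'

print(lst, d)
```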
The open() function opens and returns a file handle that can be used to read or write a file in the usual way. The code f = open('name', 'r') opens the file into the variable f, ready for reading operations; call f.close() when finished. Instead of 'r', use 'w' for writing, and 'a' for append. The special mode 'rU' is the 'Universal' option for text files where it's smart about converting different line-endings so they always come through as a simple '\n'. The standard for-loop works for text files, iterating through the lines of the file (this works only for text files, not binary files). The for-loop technique is a simple and efficient way to look at all the lines in a text file:
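A self-contained sketch of the for-loop technique. It writes a small throwaway file first so that there is something to read; the file contents are sample data. It assumes Python 3, where plain 'r' text mode already translates line endings to '\n'.

```python
import os
import tempfile

# Create a small temporary file so the example can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

seen = []
f = open(path, "r")
for line in f:                        # one line at a time, newline included
    seen.append(line.rstrip("\n"))
f.close()
os.remove(path)                       # clean up the temporary file

print(seen)
```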
Reading one line at a time has the nice quality that not all the file needs to fit in memory at one time -- handy if you want to look at every line in a 10 gigabyte file without using 10 gigabytes of memory. The f.readlines() method reads the whole file into memory and returns its contents as a list of its lines. The f.read() method reads the whole file into a single string, which can be a handy way to deal with the text all at once, such as with regular expressions we'll see later.
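For comparison, here is a sketch of the two whole-file methods, again using a throwaway file with sample contents:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha beta\ngamma\n")
    path = tmp.name

with open(path, "r") as f:
    lines = f.readlines()             # list of lines, each ending in '\n'

with open(path, "r") as f:
    text = f.read()                   # the whole file as one string

os.remove(path)

words = text.split()                  # easy once everything is in one string
print(lines, words)
```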
For writing, the f.write(string) method is the easiest way to write data to an open output file. Or, in Python 2, you can use 'print' with an open file, but the syntax is nasty: 'print >> f, string'. In Python 3, print is a regular function call with an optional file= argument: 'print(string, file=f)'.
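Both ways of writing, sketched for Python 3 with a temporary file and sample strings:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)                                  # reopen it by name below

with open(path, "w") as f:
    f.write("written with write()\n")         # write() adds no newline itself
    print("written with print()", file=f)     # print() appends '\n'

with open(path, "r") as f:
    contents = f.read()
os.remove(path)

print(contents)
```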
The 'codecs' module provides support for reading a unicode file.
For writing, use f.write() since print does not fully support unicode.
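A sketch of reading and writing a UTF-8 file through the codecs module, using a temporary file and a sample string. In Python 3 the built-in open() also accepts an encoding argument and can be used the same way.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Write a line containing non-ASCII characters, encoded as UTF-8.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve caf\u00e9\n")

# Read it back; each line arrives as a decoded unicode string.
with codecs.open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

os.remove(path)
print(lines)
```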
When building a Python program, don't write the whole thing in one step. Instead identify just a first milestone, e.g. 'well the first step is to extract the list of words.' Write the code to get to that milestone, and just print your data structures at that point, and then you can do a sys.exit(0) so the program does not run ahead into its not-done parts. Once the milestone code is working, you can work on code for the next milestone. Being able to look at the printout of your variables at one state can help you think about how you need to transform those variables to get to the next state. Python is very quick with this pattern, allowing you to make a little change and run the program to see how it works. Take advantage of that quick turnaround to build your program in little steps.
Combining all the basic Python material -- strings, lists, dicts, tuples, files -- try the summary wordcount.py exercise in the Basic Exercises.