The present publication is directed to the field of conference call systems and Telephony over IP (ToIP), also known as “IP Telephony”, and more particularly to a system and method for automatically identifying who is speaking at a given instant in a teleconference. As soon as someone speaks, that person is identified and his/her name is displayed on the phone screens of the participants.

Speaker recognition method and system in IP telephony based teleconferencing

Usually the stakeholders of these phone conferences are invited to call a specific number at a determined time, a number often associated with a "meeting number". The moderator of the conf call meeting, usually the "owner" of the phone number and the initiator of the teleconference, has to associate a "pin code" with the dialled number. The present system and method, as shown in Figure 7, handle both traditional PSTN communication 700 and VoIP communication 710. For a VoIP connection, silence suppression is forced by the conf call session manager during session establishment (by means of the appropriate signalling messages). In the case of a traditional PSTN communication, a specific encoder 730 encodes the signal into VoIP packets, and silent packets are suppressed before speaker activity detection is performed. The conf call session manager 720 is modified to generate a JOIN message 760 each time a new session is established and a LEAVE message 770 each time a session is broken. These messages are also sent to the Speaker Activity Detector 740 so that it can allocate and release the resources needed to handle the session. The Speaker Activity Detector (SAD) 740 receives voice packets directly from the VoIP sessions and, for the PSTN sessions, from the encoder 730. The SAD generates SPEECH and SILENT messages to the conf call session manager 720.
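The JOIN and LEAVE exchange described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the class names, the `on_message` and `on_voice_packet` methods, and the per-session dictionary are hypothetical and only show how a detector might allocate state on JOIN and release it on LEAVE.

```python
from dataclasses import dataclass, field
from enum import Enum


class Message(Enum):
    JOIN = "JOIN"
    LEAVE = "LEAVE"


@dataclass
class SpeakerActivityDetector:
    """Keeps one state record per active session (hypothetical layout)."""
    sessions: dict = field(default_factory=dict)

    def on_message(self, kind, session_id):
        if kind is Message.JOIN:
            # Allocate per-session resources when a session is established.
            self.sessions[session_id] = {"packets": 0}
        elif kind is Message.LEAVE:
            # Release the resources when the session is broken.
            self.sessions.pop(session_id, None)

    def on_voice_packet(self, session_id):
        # Voice packets come from a VoIP session or from the PSTN encoder.
        state = self.sessions.get(session_id)
        if state is not None:
            state["packets"] += 1


@dataclass
class ConfCallSessionManager:
    detector: SpeakerActivityDetector
    log: list = field(default_factory=list)

    def session_established(self, session_id):
        self._emit(Message.JOIN, session_id)

    def session_broken(self, session_id):
        self._emit(Message.LEAVE, session_id)

    def _emit(self, kind, session_id):
        # Record the message and forward it to the detector.
        self.log.append((kind, session_id))
        self.detector.on_message(kind, session_id)
```

In this sketch, packets for a session that has not joined (or has already left) are simply ignored; a real server might instead log or reject them.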



The system and method disclosed in the present publication comprise two parts: "speaker activity detection" and "speaker name display". In a preferred embodiment, both parts are implemented in the voice conferencing server.

Speaker Activity Detection (SAD): the principle is to measure the number of voice packets received in a determined interval of time and to decide, by means of thresholds, whether a speaker is in silent mode or talking mode. A specific method counts the number of voice packets received - or not - in a predetermined time slot, and makes it possible to overcome the effect of jitter. A shift register makes it possible to count the received packets precisely, and a packet counter gives the number of packets received in an interval.
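One way to picture the shift register and packet counter is the following Python sketch. It rests on assumptions not stated in the publication: one register bit per time slot (1 if a packet arrived in that slot, 0 otherwise), a register width of 8 slots, and illustrative threshold values. The class and method names are hypothetical. The transition rule is the one given in this publication: the count rising above the Talking threshold in silent mode yields SPEECH, and the count falling below the Silent threshold in talking mode yields SILENT.

```python
class ActivityWindow:
    """Sliding window over the last `width` time slots (assumed layout)."""

    def __init__(self, width=8, talking_threshold=5, silent_threshold=2):
        self.mask = (1 << width) - 1      # keeps only the last `width` bits
        self.register = 0                 # shift register, newest slot in bit 0
        self.talking_threshold = talking_threshold
        self.silent_threshold = silent_threshold
        self.mode = "silent_mode"

    def packet_count(self):
        # Packet counter: number of slots in the window that held a packet.
        return bin(self.register).count("1")

    def tick(self, packet_received):
        """Close one time slot; return "SPEECH", "SILENT" or None."""
        bit = 1 if packet_received else 0
        # Shift the oldest slot out and the newest slot in.
        self.register = ((self.register << 1) | bit) & self.mask
        count = self.packet_count()
        if self.mode == "silent_mode" and count > self.talking_threshold:
            self.mode = "talking_mode"
            return "SPEECH"
        if self.mode == "talking_mode" and count < self.silent_threshold:
            self.mode = "silent_mode"
            return "SILENT"
        return None
```

Because the two thresholds differ, a single empty slot inside a talk spurt only lowers the count by one and does not by itself trigger a SILENT event, which is how a windowed count can absorb moderate jitter.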

A user in communication may be in status "talking_mode" or "silent_mode". If the number of packets received becomes greater than the Talking threshold when the user is in status "silent_mode", a "SPEECH" event is generated and the user is set to status "talking_mode". Conversely, if the number of packets received becomes lower than the Silent threshold when the user is in status "talking_mode", a "SILENT" event is generated and the user is set to status "silent_mode". Packets are not received with the same timing as they are transmitted by the source, due to queuing in the network. As packet networks are typically asynchronous, each packet can be expected to arrive with a different delay, resulting in jitter. There is also no guarant...